tReservoirSampling Standard properties - 7.3

Sampling

Version
7.3
Language
English
Product
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
Module
Talend Studio
Last publication date
2024-02-21

These properties are used to configure tReservoirSampling running in the Standard Job framework.

The Standard tReservoirSampling component belongs to the Data Quality family.

The component in this framework is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real-Time Big Data Platform, Talend Data Services Platform, and in Talend Data Fabric.

Basic settings

Schema and Edit schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word "line" when naming the fields.

Click Sync columns to retrieve the schema from the previous component in the Job.

Built-In: You create and store the schema locally for this component only.

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.

Sample Size

Set how many rows to sample from the input flow.
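
To illustrate what a sample size means for a reservoir sampler, the following is a minimal sketch of the classic reservoir sampling technique (Algorithm R), which reads the input flow once and keeps at most the requested number of rows in memory. It is not the component's actual source code; the class name ReservoirSketch, the method sample, and the parameter sampleSize are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.stream.IntStream;

public class ReservoirSketch {

    // Classic reservoir sampling (Algorithm R): a single pass over the rows,
    // with memory bounded by sampleSize. Illustrative only, not Talend code.
    public static <T> List<T> sample(Iterator<T> rows, int sampleSize, Random rng) {
        List<T> reservoir = new ArrayList<>(sampleSize);
        long seen = 0;
        while (rows.hasNext()) {
            T row = rows.next();
            seen++;
            if (reservoir.size() < sampleSize) {
                // The first sampleSize rows fill the reservoir directly.
                reservoir.add(row);
            } else {
                // Draw a slot in [0, seen); the row replaces an existing one
                // only when the slot falls inside the reservoir.
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < sampleSize) {
                    reservoir.set((int) slot, row);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // Hypothetical input flow of 1000 rows, sampled down to 10.
        Iterator<Integer> rows = IntStream.rangeClosed(1, 1000).boxed().iterator();
        List<Integer> picked = sample(rows, 10, new Random());
        System.out.println(picked.size() + " rows sampled: " + picked);
    }
}
```

In this sketch, an input flow with fewer rows than the sample size is returned unchanged, because the reservoir never fills up.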

Advanced settings

Seed for random generator

Enter a number to use as the seed if you want to extract the same sample in different executions of the Job.

Repeating the execution with a different seed value results in a different data sample being extracted.

Keep this field empty if you want to extract a different data sample each time you execute the Job.
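
The effect of the seed can be sketched with java.util.Random, which produces the same sequence of numbers whenever it is created with the same seed. The example below is an assumption-based illustration rather than the component's implementation: the class SeededSamplingSketch and the method sampleIds are hypothetical, and a null seed is used here to model an empty field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SeededSamplingSketch {

    // Samples row indexes 0..rowCount-1 with reservoir sampling.
    // A null seed models an empty seed field (an assumption of this sketch).
    public static List<Integer> sampleIds(int rowCount, int sampleSize, Long seed) {
        Random rng = (seed == null) ? new Random() : new Random(seed);
        List<Integer> reservoir = new ArrayList<>();
        for (int i = 0; i < rowCount; i++) {
            if (i < sampleSize) {
                reservoir.add(i);
            } else {
                int slot = rng.nextInt(i + 1); // uniform in [0, i]
                if (slot < sampleSize) {
                    reservoir.set(slot, i);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> first = sampleIds(1000, 5, 42L);
        List<Integer> second = sampleIds(1000, 5, 42L);
        // Same seed: both runs draw the same random sequence, so the samples match.
        System.out.println("Same seed, same sample: " + first.equals(second));
        // No seed: the sample generally changes from one run to the next.
        System.out.println("Unseeded sample: " + sampleIds(1000, 5, null));
    }
}
```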

tStatCatcher Statistics

Select this check box to collect log data at the component level.

Usage

Usage rule

This component helps you test profiling analyses on a data sample and obtain results similar to those on the full data set.

tReservoirSampling cannot currently be used in Map/Reduce Jobs.