tDataMasking - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

Warning

This component will be available in the Palette of Talend Studio on the condition that you have subscribed to one of the Talend Platform products.

Function

tDataMasking reads a data set row by row and creates a structurally similar but inauthentic version of the data after having applied specific functions on data fields. It generates one row for each input row.

Purpose

tDataMasking enables you to hide original data with random characters or figures to protect the actual data while having a functional substitute for occasions when it is not advisable to show sensitive real data.

Data will keep looking real and consistent and will remain usable for purposes such as testing and training.

Masking is most commonly needed for data that contains Personally Identifiable Information (PII) or Sensitive Personal Data (SPD). For further information, see Function behavior in common PII.

If you have subscribed to one of the Talend solutions with Big Data, this component is available in the following types of Jobs:

Function behavior in common PII

What is sensitive data

The definition of sensitive data is broad and may differ from one country to the other or from one organization to the other. Basically, sensitive data can be personal information or business information which includes anything that poses a risk to the person or company in question.

Globally, credit and debit card data, for example, is considered sensitive. An employee's salary details, and any information that can be used to identify or locate a person, can also be considered sensitive data. A non-exhaustive list of personal sensitive data may include: first and last names, email addresses, addresses, Social Security Number (SSN), credit card numbers, bank account numbers, race, gender, date of birth, salary and geolocation combined with time.

For further information about personal sensitive data, check Personally Identifiable Information.

Also, business sensitive data may include trade secrets, acquisition plans, financial data and customer information, among other possibilities.

Functions and common PII

There are several functions in the tDataMasking component which vary according to the type of the data column.

It is advisable to use the functions predefined in the component with columns that hold personal information, such as first and last names, email addresses, addresses, SSN, credit card numbers, bank account numbers, race, gender, date of birth and salary.

Functions that are not self-explanatory are explained in the below table:

Function

Description

Date Variance

This function only applies to Date values. It takes a parameter that must be a number, which represents a number of days. It then modifies the input date by adding or subtracting a random number of days within the range set by the parameter.

For example, if the input date is 15-02-1992 and the parameter is 10, the generated date is randomly selected between 05-02-1992 (15 - 10) and 25-02-1992 (15 + 10). If the input date is null, the function returns the current date. If the given parameter is 0, null or not a number, it is replaced by 31.
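The Date Variance behavior described above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the bounds are treated as inclusive (as the example suggests), and a negative parameter is treated as its absolute value, which the description does not specify.

```java
import java.time.LocalDate;
import java.util.Random;

class DateVarianceSketch {
    // Shifts the input date by a random number of days in [-maxDays, +maxDays].
    // Hypothetical helper, not the component's actual code.
    static LocalDate vary(LocalDate input, int maxDays, Random rnd) {
        if (input == null) {
            return LocalDate.now();           // null input: current date
        }
        // A parameter of 0 is replaced by 31; taking the absolute value is an assumption.
        int n = (maxDays == 0) ? 31 : Math.abs(maxDays);
        int offset = rnd.nextInt(2 * n + 1) - n;   // uniform in [-n, n]
        return input.plusDays(offset);
    }

    public static void main(String[] args) {
        LocalDate input = LocalDate.of(1992, 2, 15);
        // Prints a date between 1992-02-05 and 1992-02-25.
        System.out.println(vary(input, 10, new Random(12345678L)));
    }
}
```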

Generate Account Number

This function generates a valid French bank account number. It requires no parameter and only applies to String values.

A French IBAN is a 27-character code. The digits are generated randomly but must satisfy checksum algorithms: the last two digits of the IBAN, known as the "clef RIB", are computed with one algorithm, and the third and fourth characters of the IBAN (its check digits) are computed with another.
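The text does not name the algorithm behind the third and fourth characters of the IBAN. Assuming it is the standard IBAN mod-97 check, the computation could look like the sketch below. The class and method names are hypothetical, and the "clef RIB" computation is not covered here.

```java
import java.math.BigInteger;

class IbanCheckSketch {
    private static final BigInteger NINETY_SEVEN = BigInteger.valueOf(97);

    // Converts letters to numbers (A=10 ... Z=35) and keeps digits unchanged.
    static String toDigits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toUpperCase().toCharArray()) {
            if (c >= '0' && c <= '9') {
                sb.append(c);
            } else if (c >= 'A' && c <= 'Z') {
                sb.append(c - 'A' + 10);
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        return sb.toString();
    }

    // Computes the two check digits for a country code and a BBAN
    // (the country-specific part that follows the check digits).
    static String checkDigits(String country, String bban) {
        BigInteger n = new BigInteger(toDigits(bban + country + "00"));
        int check = 98 - n.mod(NINETY_SEVEN).intValue();
        return String.format("%02d", check);
    }

    // An IBAN without spaces is valid when the rearranged number mod 97 equals 1.
    static boolean isValid(String iban) {
        String rearranged = iban.substring(4) + iban.substring(0, 4);
        return new BigInteger(toDigits(rearranged)).mod(NINETY_SEVEN).intValue() == 1;
    }

    public static void main(String[] args) {
        String bban = "20041010050500013M02606";   // illustrative 23-character French BBAN
        String iban = "FR" + checkDigits("FR", bban) + bban;
        System.out.println(iban + " valid: " + isValid(iban));
    }
}
```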

Generate Account Number and keep original country

This function works like Generate Account Number, but it generates a valid bank account number for the country of the original value.

If the input is a correct IBAN, the function generates an IBAN from the same country as the input, taking into account the IBAN format, which differs from one country to another. If the input is a correct American account number, the function keeps the first nine digits and randomly replaces the others.

This function requires a parameter that can be true or false. If the parameter is true, the function keeps the input format: if there are spaces in the input, the output has the same spaces. If the parameter is neither true nor false, it is considered to be false. If the input is null or an incorrect number, the function behaves like Generate Account Number.

Generate Credit Card

This function generates a valid credit card number. It requires no parameter and can be applied to String or Long values.

Three types of credit card can be generated: Visa, MasterCard or American Express. One of these types is chosen at random, and the generated number passes the algorithms that detect false credit card numbers.
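The text does not say which validation algorithm the generated numbers satisfy. Assuming it is the Luhn checksum commonly used for card numbers, a minimal Java sketch of generating a number that passes it might look like this. The class and method names, and the 16-digit length with a leading 4, are illustrative assumptions.

```java
import java.util.Random;

class LuhnSketch {
    // Returns true if the digit string passes the Luhn checksum.
    static boolean isValid(String number) {
        int sum = 0;
        boolean doubleIt = false;              // the rightmost digit is not doubled
        for (int i = number.length() - 1; i >= 0; i--) {
            int d = number.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) d -= 9;
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    // Appends the single check digit that makes the number pass the checksum.
    static String withCheckDigit(String partial) {
        for (char c = '0'; c <= '9'; c++) {
            if (isValid(partial + c)) return partial + c;
        }
        throw new IllegalStateException("Input must contain digits only");
    }

    // Generates 15 digits (leading 4 is an assumption) plus a check digit.
    static String generate(Random rnd) {
        StringBuilder sb = new StringBuilder("4");
        while (sb.length() < 15) sb.append(rnd.nextInt(10));
        return withCheckDigit(sb.toString());
    }

    public static void main(String[] args) {
        System.out.println(isValid("4111111111111111"));   // prints true
        System.out.println(generate(new Random(12345678L)));
    }
}
```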

Generate from Pattern

This function is applied only on Strings and it requires a parameter.

It generates a value that matches the pattern given as parameter. The pattern must follow the below rules:

- the A character is replaced by a random upper case letter.

- the a character is replaced by a random lower case letter.

- the 9 figure is replaced by a random digit.

- all other characters are kept as they are.

You can generate several strings with the same argument (value) by using \\1 in the pattern.

For example, if the given pattern is Aaaaa.Aaaaa99\\1,@gmail.com, the function here will generate something like Dsdf.Ksknt12@gmail.com. The @gmail.com value will be kept unchanged.

Please note that the function does not work correctly if a comma ',' is used in the pattern.
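The three replacement rules for Generate from Pattern can be sketched in Java as follows. This is a hypothetical helper that only covers the A, a and 9 rules; it does not handle the \\1 syntax, so any literal A, a or 9 in the pattern would also be replaced.

```java
import java.util.Random;

class PatternSketch {
    // Replaces A with a random upper case letter, a with a random lower case
    // letter and 9 with a random digit; all other characters are kept.
    static String generate(String pattern, Random rnd) {
        StringBuilder out = new StringBuilder(pattern.length());
        for (char c : pattern.toCharArray()) {
            switch (c) {
                case 'A': out.append((char) ('A' + rnd.nextInt(26))); break;
                case 'a': out.append((char) ('a' + rnd.nextInt(26))); break;
                case '9': out.append((char) ('0' + rnd.nextInt(10))); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints something shaped like "Xyz-123".
        System.out.println(generate("Aaa-999", new Random(12345678L)));
    }
}
```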

Generate Phone Number

This function is applied only on Strings and requires no parameter.

It generates a random phone number from different countries (France, Germany, Japan, UK and US).

Generate Social Security Number (SSN)

This function is only used on Strings and requires no parameter. It generates a correct random SSN for different countries according to your choice (France, Germany, Japan, UK and US).

Generate Sequence

Note

This function is not supported in the Spark version of the component.

This function can be applied on everything that is not a date (Integer, Long, Strings and so on). It requires a parameter that must be a number. This function returns the parameter, and, for each row, will increase this number by 1. If the parameter is not a number, it is set to 0.

Generate Uuid

This function is only applied on Strings and requires no parameter. It will replace the input value by a randomly generated UUID.

This function uses the UUID.randomUUID() method provided by Java. No seed is used, so if you run the Job twice, the generated UUIDs will be different.

Generate value between two values

This function generates a value randomly chosen between two values you give as argument. The argument must be a string holding the two bounds, min and max, separated by a comma.

This function can be applied to any type of field. However, if the field is a date, the bounds must also be dates in the same format as in the schema, dd-MM-yyyy for example. Otherwise, the bounds must be integers.

If the input is of Date type, the function returns the current date if the parameter is not in the right format. Otherwise, it returns an empty string for string values and 0 for numeric values.

Keep characters between two positions

This function can be used on Strings and requires two parameters separated by a comma.

The two parameters represent the positions of two characters in the input. The function returns a new String that contains only those characters and everything in between.

If the input is null or if the parameter is in a wrong format, the function will return an empty String. If the lower bound is lower than 1, it will be set to 1 and if the higher bound is greater than the length of the string, it will be set to this length. The two parameters can be given in any order. If the argument is 4, 2, it will be replaced by 2, 4. For example, if the input is Steven and the argument is 4, 2, the result will be tev.
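The rules for Keep characters between two positions (1-based positions, bounds clamped to the string, order of the two values ignored) can be sketched in Java as below. The class and method names are hypothetical, and returning an empty String when the clamped range is empty is an assumption.

```java
class KeepBetweenSketch {
    // Keeps the characters between two 1-based positions, both included.
    static String keepBetween(String input, String argument) {
        if (input == null || argument == null) return "";
        String[] parts = argument.split(",");
        if (parts.length != 2) return "";
        int a;
        int b;
        try {
            a = Integer.parseInt(parts[0].trim());
            b = Integer.parseInt(parts[1].trim());
        } catch (NumberFormatException e) {
            return "";                                   // wrong parameter format
        }
        int low = Math.max(1, Math.min(a, b));           // lower bound is at least 1
        int high = Math.min(input.length(), Math.max(a, b));
        if (low > high) return "";                       // assumption: empty range
        return input.substring(low - 1, high);
    }

    public static void main(String[] args) {
        System.out.println(keepBetween("Steven", "4, 2"));   // prints tev
    }
}
```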

Remove Characters between two positions and Replace characters between two positions

These functions have the same behavior as Keep characters between two positions but with a remove or replace statement.

Keep n first digits and replace following ones

This function is used on Strings, Integers and Long values and requires a number as a parameter.

If the parameter is n, the function keeps the first n digits of the input and then replaces all the digits that follow by other digits. Anything that is not a digit will not be changed. A null input will make the function return an empty string or 0.

If the parameter is bigger than the input length, no modifications are applied.
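A minimal Java sketch of this behavior for String inputs follows. It assumes that n counts digits rather than characters, which the description leaves open; the class and method names are hypothetical.

```java
import java.util.Random;

class KeepFirstDigitsSketch {
    // Keeps the first n digits and replaces every later digit with a random one.
    // Characters that are not digits are copied unchanged.
    static String keepFirst(String input, int n, Random rnd) {
        if (input == null) return "";
        StringBuilder out = new StringBuilder(input.length());
        int digitsSeen = 0;
        for (char c : input.toCharArray()) {
            if (c >= '0' && c <= '9') {
                digitsSeen++;
                out.append(digitsSeen <= n ? c : (char) ('0' + rnd.nextInt(10)));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints a value that starts with "271-" and keeps the same layout.
        System.out.println(keepFirst("271-555-9715", 3, new Random(12345678L)));
    }
}
```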

Keep n last digits and replace previous ones

This function is the counterpart of Keep n first digits and replace following ones.

Mask Address

This function can only be used on Strings. It replaces digits by other digits and everything else by X.

Moreover, there is a list of key words that will not be transformed: Rue, rue, r., strasse, Strasse, Street, street, St., St, Straße, Strada, Rua, Calle, Ave., avenue, Av., Allée, allée, alle, Avenue, Avenida, Bvd., Bd., Boulevard, boulevard, Blv., Viale, Avenida, Bulevar, Route, route, road, Road, Rd., Chemin, Way, Cour, Court, Ct., Place, place, Pl., Square, Impasse, Alle, Driveway, Auffahrt, Viale, Esplanade, Esplanade, Promenade, Lungomare, Esplanada, Esplanada, Faubourg, faubourg, Suburb, Vorort, Periferia, Subúrbio, Suburbio, Via, Via, industrial, area, zone, industrielle, Périphérique, Peripheral, Voie, voie, Track, Gleis, Carreggiata, Caminho, Pista, Forum, STREET, RUE, ST., AVENUE, BOULEVARD, BLV., BD, ROAD, ROUTE, RD., RTE, WAY, CHEMIN, COURT, CT., SQUARE, DRIVEWAY, ALLEE, DR., ESPLANADE, SUBURB, BANLIEUE, VIA, PERIPHERAL, PERIPHERIQUE, TRACK, VOIE, FORUM, INDUSTRIAL, AREA, ZONE, INDUSTRIELLE.

You can give a parameter: either a list of key words to add to the above list (separated by commas) or a path to a file containing the words.

Mask Email

This function can only be used on Strings. It looks for the @ character and replaces everything before it with either one element of the list given as parameter (which can also be a path to a file containing those words), or a series of X characters if there is no parameter.
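The no-parameter case of Mask Email can be sketched in Java as follows. The class and method names are hypothetical. Using the first @ character and returning the input unchanged when there is no @ are both assumptions, since the description does not cover those cases.

```java
class MaskEmailSketch {
    // Replaces every character before the first '@' with 'X'.
    static String mask(String email) {
        if (email == null) return "";
        int at = email.indexOf('@');
        if (at < 0) return email;          // assumption: no '@', nothing to mask
        StringBuilder sb = new StringBuilder(email.length());
        for (int i = 0; i < at; i++) sb.append('X');
        return sb.append(email.substring(at)).toString();
    }

    public static void main(String[] args) {
        System.out.println(mask("ab@Sooke.org"));   // prints XX@Sooke.org
    }
}
```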

Numeric Variance

This function applies only to numerical types (Integer, Long, Float and Double).

It takes a parameter that must be a number, which represents a percentage of modification. The function increases or decreases the input by a random percentage between minus the parameter and plus the parameter. For example, if the input is 100 and the parameter is 10, then the generated value will be a randomly selected value between 90 (100 - 10%) and 110 (100 + 10%). If the input is null, then the function will return 0. If the given parameter is 0, it will be replaced by 10.
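The Numeric Variance arithmetic can be sketched in Java as below. The class and method names are hypothetical, a uniform distribution is assumed, and a negative parameter is treated as its absolute value, which the description does not specify.

```java
import java.util.Random;

class NumericVarianceSketch {
    // Scales the input by a random rate between -percent% and +percent%.
    static double vary(double input, int percent, Random rnd) {
        // A parameter of 0 is replaced by 10; the absolute value is an assumption.
        int p = (percent == 0) ? 10 : Math.abs(percent);
        double rate = (rnd.nextDouble() * 2 - 1) * p / 100.0;
        return input * (1 + rate);
    }

    public static void main(String[] args) {
        // Prints a value between 90 and 110.
        System.out.println(vary(100, 10, new Random(12345678L)));
    }
}
```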

Replace by consistent items from input list (or file)

This function replaces the input value with one of the values given as parameter. The values must be stored in a String and separated by commas, for example "item1, item2, item3". It uses the hashCode() function provided by Java to choose an element from the list.

It can be applied to Strings or numerical types and it ensures that two identical inputs will have the same output. It will return an empty String or 0 if there is no parameter given or a wrong one.
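One way hashCode() could give consistent replacements is sketched below in Java. How the component actually maps the hash to a list index is not documented here, so the modulo step and all names are assumptions.

```java
class ConsistentReplaceSketch {
    // Picks a list item from the hash of the input, so equal inputs
    // always map to the same item.
    static String replace(String input, String itemList) {
        if (input == null || itemList == null || itemList.trim().isEmpty()) {
            return "";                                    // no usable parameter
        }
        String[] items = itemList.split(",");
        // floorMod keeps the index non-negative even for negative hash codes.
        int index = Math.floorMod(input.hashCode(), items.length);
        return items[index].trim();
    }

    public static void main(String[] args) {
        String list = "item1, item2, item3";
        // Both calls print the same item.
        System.out.println(replace("Nowmer", list));
        System.out.println(replace("Nowmer", list));
    }
}
```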

Replace by item from input list (or file)

This function has the same behavior as Replace by consistent items from input list (or file), but it selects the value from the list (or file) at random, so identical inputs can produce different outputs.

tDataMasking properties

Component family

Data Quality

 

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are available in any of the Talend solutions.

Click Sync columns to retrieve the schema from the previous component connected in the Job.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

The output schema of this component contains one read-only column, ORIGINAL_MARK. This column identifies by true or false if the record is an original record or a substitute record respectively.

 

 

Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

 

Modification

Define in the table what fields to change and how to change them:

Input Column: Select the column from the input flow for which you want to generate similar data by modifying its values.

These modifications are based on the function you select in the Function column and the number of modifications you set in the Max Modification Count column.

Function: Select the function that will decide what modification to do in order to generate similar substitutional data. For example, you can decide to have similar values through replacing or adding letters or numbers, replacing values with synonyms from an index file or deleting values by setting the function to null.

The Function list will vary according to the column type. For further information about function behavior, see Function behavior in common PII.

For example, a column of the Long type has a Numeric variance option in the list, while a column of the String type does not have this function. Also, the Function list for a Date column is date-specific: it allows you to decide the type of modification you want to apply to date values.

Parameter: This field is used by some of the functions and is disabled when not applicable. When applicable, enter a number or a letter to decide the behavior of the function you have selected.

Advanced settings

Seed for random generator

Set a random number if you want to generate the same sample of substitute data in each execution of the Job. This field is set to 12345678 by default.

Repeating the execution with a different value for this field will result in a different sample being generated. Keep this field empty if you want to generate a different sample each time you execute the Job.
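The effect of a fixed seed can be illustrated with java.util.Random. This is only an analogy: whether the component relies on this class is an assumption, and the names below are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

class SeedSketch {
    // Draws `count` digits from a generator initialized with the given seed.
    static int[] sample(long seed, int count) {
        Random rnd = new Random(seed);
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = rnd.nextInt(10);
        }
        return out;
    }

    public static void main(String[] args) {
        // The same seed reproduces the same sequence on every run.
        System.out.println(Arrays.equals(sample(12345678L, 5), sample(12345678L, 5)));   // prints true
    }
}
```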

 

Output the original row

Select this check box to output original data rows in addition to the substitute data. Having both data rows can be useful in debug or test processes.

 

Should null input returns null

This check box is selected by default. When selected, the component outputs null when input values are null. Otherwise, it returns the default value when the input is null, that is an empty string for string values, 0 for numeric values and the current date for date values.

This parameter does not have an effect on the Generate Sequence function. If the input is null, this function will not return null, even if the box is checked.

 

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level as well as at each component level.

Usage

This component is an intermediary step. It requires an input flow and an output flow.

Limitation

n/a

Scenario: Altering data values to restrict the use of actual sensitive data

With the tDataMasking component, you can replace sensitive information such as credit card or social security numbers with realistic values, allowing production data to be safely used for purposes such as testing and training.

This scenario describes a Job which uses:

  • the tFixedFlowInput component to generate personal data including credit card numbers,

  • the tDataMasking component to hide specific original data with random characters or figures,

  • the tFileOutputExcel component to output the substitute data set.

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tDataMasking and tFileOutputExcel.

  2. Connect the three components together using the Main links.

Configuring the input component

  1. Double-click tFixedFlowInput to open its Basic settings view in the Component tab.

  2. Create the schema through the Edit Schema button.

    In the open dialog box, click the [+] button and add the columns that will hold the initial input data.

  3. Click OK.

  4. In the Number of rows field, enter 1.

  5. In the Mode area, select the Use Inline Content option.

  6. In the Content table, enter the customer data you want to replace with realistic values, for example:

    0|4244487462024688|Nowmer|Sheri|A.|2433 Bailey Road|Tlaxiaco|Oaxaca|15057|Mexico|271-555-9715|SheriNowmer@@Tlaxiaco.org
    1|3458687462024688||Sheri|A.|2433 Bailey Road|Tlaxiaco|Oaxaca|15057|Mexico|271-555-9715|SheriNowmer@Tlaxiaco.org.org
    2|4639587470586299|Whelply|Derrick|I.|2219 Dewing Avenue|Sooke|BC|17172|Canada|211-555-7669|DerrickWhelply@Sooke.org
    3|2541387475757600|Derry|Jeanne||7640 First Ave.|Issaquah|WA|73980|USA|656-555-2272|JeanneDerry@Issaquah.org
    4|7845987500482201|Spence|Michael|J.|337 Tosca Way|Burnaby|BC|74674|Canada|929-555-7279|MichaelSpence@Burnaby.org
    5|1547887514054179|Gutierrez|Maya||8668 Via Neruda|Novato|CA|57355|$$#|387-555-7172|MayaGutierrez@Novato.org
    6|5469887517782449|Damstra|Robert|F.|1619 Stillman Court|Lynnwood|WA|90792|$$#|922-555-5465|RobertDamstra@Lynnwood.org
    7|54896387521172800|Kanagaki|Rebecca||2860 D Mt. Hood Circle|||13343|Mexico|515-555-6247|RebeccaKanagaki@Tlaxiaco.org
    8|47859687539744377||Kim|H.|6064 Brodia Court|San Andres|DF|12942|Mexico|411-555-6825|Kim@Brunner@San Andresorg
    9|35698487544797658||Brenda|C.|7560 Trees Drive||BC|$$|Canada|815-555-3975|BrendaBlumberg@Richmond.org
    10|36521487568712234|Stanz|Darren|M.|1019 Kenwal Rd.|$$#|OR|82017|USA|847-555-5443|DarrenStanz@Lake Oswego.org
    ...

Replacing actual data with realistic values

  1. Double-click tDataMasking to display the Basic settings view and define the component properties.

  2. If required, click Sync columns to retrieve the schema defined in the input component.

  3. Click the Edit schema button to open the schema dialog box.

    tDataMasking proposes one predefined read-only column, ORIGINAL_MARK.

    This column identifies by true or false if the output record is an original or a substitute record respectively.

  4. Move any of the input columns to the output schema if you want to show them in the results, click OK and accept to propagate the changes.

  5. In the Modifications table, click the [+] button to add four rows, and then:

    • in the Input Column, select the columns whose content you want to substitute,

    • in the Function column, select from the predefined list the function you want to use to generate the substitute data,

    • in the Parameter column, enter a value, a pattern or a path to be used by the function to substitute data.

    The Job will generate inauthentic credit card numbers, replace the first three letters of first names, replace last names with names from a local file and finally replace the part before the @ sign in email addresses by a series of X.

  6. Click the Advanced settings tab and select the Output the original row check box.

    The Job will add the original data rows to the substitute data.

Configuring the output component and executing the Job

  1. Double-click the tFileOutputExcel component to display the Basic settings view and define the component properties.

  2. Set the destination file name as well as the sheet name and then select the Define all columns auto size check box.

  3. Save your Job and press F6 to execute it.

    The tDataMasking component substitutes data in the selected columns and writes the result to an output file.

  4. Right-click the output component and select Data Viewer to display the original and substituted data.

    tDataMasking outputs original and substitute rows marked respectively with true and false in the ORIGINAL_MARK column. It generates inauthentic credit card numbers, replaces the first three letters of first names, replaces last names with names from a local file and finally replaces the part before the @ sign in email addresses by the names defined in the component basic settings.

    Sensitive personal information in the input data has been "hidden" but data keeps looking real and consistent. The substitute data is still usable for purposes other than production.

tDataMasking properties in Spark Batch Jobs

Component family

Data Quality

 

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are available in any of the Talend solutions.

Click Sync columns to retrieve the schema from the previous component connected in the Job.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

The output schema of this component contains one read-only column, ORIGINAL_MARK. This column identifies by true or false if the record is an original record or a substitute record respectively.

 

 

Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

 

Modification

Define in the table what fields to change and how to change them:

Input Column: Select the column from the input flow for which you want to generate similar data by modifying its values.

These modifications are based on the function you select in the Function column and the number of modifications you set in the Max Modification Count column.

Function: Select the function that will decide what modification to do in order to generate similar substitutional data. For example, you can decide to have similar values through replacing or adding letters or numbers, replacing values with synonyms from an index file or deleting values by setting the function to null.

The Function list will vary according to the column type. For further information about function behavior, see Function behavior in common PII.

For example, a column of the Long type has a Numeric variance option in the list, while a column of the String type does not have this function. Also, the Function list for a Date column is date-specific: it allows you to decide the type of modification you want to apply to date values.

Parameter: This field is used by some of the functions and is disabled when not applicable. When applicable, enter a number or a letter to decide the behavior of the function you have selected.

Advanced settings

Seed for random generator

Set a random number if you want to generate the same sample of substitute data in each execution of the Job. This field is set to 12345678 by default.

Repeating the execution with a different value for this field will result in a different sample being generated. Keep this field empty if you want to generate a different sample each time you execute the Job.

 

Output the original row

Select this check box to output original data rows in addition to the substitute data. Having both data rows can be useful in debug or test processes.

 

Should null input returns null

This check box is selected by default. When selected, the component outputs null when input values are null. Otherwise, it returns the default value when the input is null, that is an empty string for string values, 0 for numeric values and the current date for date values.

This parameter does not have an effect on the Generate Sequence function. If the input is null, this function will not return null, even if the box is checked.

 

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level as well as at each component level.

Usage in Spark Batch Jobs

In a Talend Spark Batch Job, this component is used as an intermediate step and other components used along with it must be Spark Batch components, too. They generate native Spark Batch code that can be executed directly in a Spark cluster.

This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Spark Connection

You need to use the Spark Configuration tab in the Run view to define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, one and only one file system related component from the Storage family is required in the same Job so that Spark can use this component to connect to the file system to which the jar files dependent on the Job are transferred:

This connection is effective on a per-Job basis.

Related scenarios

No scenario is available for the Spark Batch version of this component yet.

tDataMasking properties in Spark Streaming Jobs

Component family

Data Quality

 

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are available in any of the Talend solutions.

Click Sync columns to retrieve the schema from the previous component connected in the Job.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

The output schema of this component contains one read-only column, ORIGINAL_MARK. This column identifies by true or false if the record is an original record or a substitute record respectively.

 

 

Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

 

Modification

Define in the table what fields to change and how to change them:

Input Column: Select the column from the input flow for which you want to generate similar data by modifying its values.

These modifications are based on the function you select in the Function column and the number of modifications you set in the Max Modification Count column.

Function: Select the function that will decide what modification to do in order to generate similar substitutional data. For example, you can decide to have similar values through replacing or adding letters or numbers, replacing values with synonyms from an index file or deleting values by setting the function to null.

The Function list will vary according to the column type. For further information about function behavior, see Function behavior in common PII.

For example, a column of the Long type has a Numeric variance option in the list, while a column of the String type does not have this function. Also, the Function list for a Date column is date-specific: it allows you to decide the type of modification you want to apply to date values.

Parameter: This field is used by some of the functions and is disabled when not applicable. When applicable, enter a number or a letter to decide the behavior of the function you have selected.

Advanced settings

Seed for random generator

Set a random number if you want to generate the same sample of substitute data in each execution of the Job. This field is set to 12345678 by default.

Repeating the execution with a different value for this field will result in a different sample being generated. Keep this field empty if you want to generate a different sample each time you execute the Job.

 

Output the original row

Select this check box to output original data rows in addition to the substitute data. Having both data rows can be useful in debug or test processes.

 

Should null input returns null

This check box is selected by default. When selected, the component outputs null when input values are null. Otherwise, it returns the default value when the input is null, that is an empty string for string values, 0 for numeric values and the current date for date values.

This parameter does not have an effect on the Generate Sequence function. If the input is null, this function will not return null, even if the box is checked.

 

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level as well as at each component level.

Usage in Spark Streaming Jobs

This component, along with the Spark Streaming component Palette it belongs to, appears only when you are creating a Spark Streaming Job.

In a Talend Spark Streaming Job, this component is used as an intermediate step and other components used along with it must be Spark Streaming components, too. They generate native Spark Streaming code that can be executed directly in a Spark cluster.

You need to use the Spark Configuration tab in the Run view to define the connection to a given Spark cluster for the whole Job.

This connection is effective on a per-Job basis.

For further information about a Talend Spark Streaming Job, see the sections describing how to create, convert and configure a Talend Spark Streaming Job of the Talend Big Data Getting Started Guide.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Spark Connection

You need to use the Spark Configuration tab in the Run view to define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, one and only one file system related component from the Storage family is required in the same Job so that Spark can use this component to connect to the file system to which the jar files dependent on the Job are transferred:

This connection is effective on a per-Job basis.

Related scenarios

No scenario is available for the Spark Streaming version of this component yet.