Component family |
Big Data / Pig | |
Basic settings |
Schema and Edit Schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository. Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:
|
|
|
Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide. |
|
|
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide. |
|
Group by |
Click the [+] button to add one or more columns of the input flows to this Group by table, so that these columns are used as the grouping condition. |
|
Output mapping |
This table is automatically filled with the output schema you have defined in the Schema field. Complete this table to configure how the grouped data is aggregated in the output flow:
Function: select the function used to aggregate a given column.
Source schema: select the input flow from which the data is aggregated.
Expression: select the column to be aggregated and, if needed, edit the expression.
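The interaction between the Group by table and the Output mapping table can be sketched outside of Pig. The Python snippet below is only an illustration of the COGROUP idea, not the code the component generates: the `cogroup` helper, the `orders` and `visits` flows, and the column names are all hypothetical. Handling of empty groups follows Python's `sum` and `len`, which may differ from the behavior of Pig's aggregate functions.

```python
from collections import defaultdict


def cogroup(left, right, key):
    """Group two lists of dict rows by a shared key column.

    Returns {key_value: (rows_from_left, rows_from_right)}. A key that is
    present in only one flow gets an empty list for the other flow.
    """
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key]][0].append(row)
    for row in right:
        groups[row[key]][1].append(row)
    return dict(groups)


# Hypothetical input flows (stand-ins for the component's two inputs).
orders = [
    {"customer": "a", "amount": 10},
    {"customer": "a", "amount": 5},
    {"customer": "b", "amount": 7},
]
visits = [
    {"customer": "a", "pages": 3},
    {"customer": "c", "pages": 1},
]

# "Group by": the column used as the grouping condition.
grouped = cogroup(orders, visits, "customer")

# "Output mapping": each output column names a function, a source flow,
# and an expression (here, simply a column of that flow).
result = {
    key: {
        "total_amount": sum(r["amount"] for r in order_rows),  # sum over orders
        "visit_count": len(visit_rows),                        # count over visits
    }
    for key, (order_rows, visit_rows) in grouped.items()
}
```

Each output column draws on exactly one source flow, which is why the Output mapping table asks for a Source schema alongside the Function and the Expression.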
Advanced settings |
Group optimization |
Select the Pig algorithm used to optimize the COGROUP operation, depending on the state of the input data and the loader you are using. For further information, see Apache's documentation about Pig. |
Use partitioner |
Select this check box to call a Hadoop partitioner in order to partition records and return the reduce task or tasks that each record should go to. Note that this partitioner class must be registered in the Register jar table provided by the tPigLoad component that starts the current Pig process. | |
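As a rough illustration of what a partitioner does, the Python sketch below maps each record key to the index of a reduce task. It is not a Hadoop partitioner (those are Java classes registered through the Register jar table); the `partition` function and the choice of CRC32 as the hash are assumptions made only for this example.

```python
import zlib


def partition(key: str, num_reduce_tasks: int) -> int:
    """Return the index of the reduce task that should receive this key."""
    # A deterministic hash guarantees that equal keys always land on the
    # same reduce task, which grouping operations rely on.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


# Hypothetical record keys routed across 4 reduce tasks.
keys = ["a", "b", "c", "a"]
assignments = [partition(k, 4) for k in keys]
```

Both occurrences of the key `"a"` receive the same index, so all records sharing a group key reach the same reduce task.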
|
Increase parallelism |
Select this check box to set the number of reduce tasks for the MapReduce Jobs. |
|
tStatCatcher Statistics |
Select this check box to gather the Job processing metadata at the Job level as well as at each component level. |
Global Variables |
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, provided the component has this check box.
A Flow variable functions during the execution of a component, while an After variable functions after the execution of the component.
To fill in a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.
For further information about variables, see Talend Studio User Guide. | |
Usage |
This component is commonly used as an intermediate step, together with an input component and an output component. | |
Prerequisites |
The Hadoop distribution must be properly installed to guarantee the interaction with Talend Studio. The following list presents MapR-related information as an example.
For further information about how to install a Hadoop distribution, see the manuals corresponding to the Hadoop distribution you are using. | |
Limitation |
Knowledge of Pig scripts is required. |