Analyze the log file and save the result - 7.0

Big Data Job Examples

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Open Studio for Big Data
Talend Real-Time Big Data Platform
task
Design and Development > Designing Jobs
Design and Development > Designing Jobs > Hadoop distributions
Design and Development > Designing Jobs > Job Frameworks > Standard
EnrichPlatform
Talend Studio

Procedure

  1. In the Basic settings view of the tPigFilterRow component, click the [+] button to add a line in the Filter configuration table, and set the filter parameters to remove the records that contain the code 404 and pass the remaining records on to the output flow:
    1. In the Logical field, select AND.
    2. In the Column field, select the code column of the schema.
    3. Select the NOT check box.
    4. In the Operator field, select equal.
    5. In the Value field, enter 404.
  2. In the Basic settings view of the tPigFilterColumns component, click the [...] button to open the [Schema] dialog box. Select the column code in the Input panel and click the single-arrow button to copy the column to the Output panel, so that only the code column is passed on to the output flow. Click OK to confirm the output schema settings and close the dialog box.
  3. In the Basic settings view of the tPigAggregate component, click Sync columns to retrieve the schema from the preceding component and allow it to be propagated to the next component.
  4. Click the [...] button next to Edit schema to open the [Schema] dialog box, and add a new column: count.

    This column will store the number of occurrences of each code among the successful service calls.

  5. Configure the following parameters to count the number of occurrences of each code:
    1. In the Group by area, click the [+] button to add a line in the table, and select the column code in the Column field.
    2. In the Operations area, click the [+] button to add a line in the table, and select the column count in the Additional Output Column field, select count in the Function field, and select the column code in the Input Column field.
  6. In the Basic settings view of the tPigSort component, configure the sorting parameters to sort the data to be passed on:
    1. Click the [+] button to add a line in the Sort key table.
    2. In the Column field, select count to set the column count as the key.
    3. In the Order field, select DESC to sort the data in descending order.
  7. In the Basic settings view of the tPigStoreResult component, configure the component properties to upload the result data to the specified location on the Hadoop system:
    1. Click Sync columns to retrieve the schema from the preceding component.
    2. In the Result file URI field, enter the path to the result file, /user/hdp/weblog/apache_code_cnt in this example.
    3. From the Store function list, select PigStorage.
    4. If needed, select the Remove result directory if exists check box.
  8. Save the schema of this component as a generic schema in the Repository for convenient reuse in the last Job, as we did in Centralize the schema for the access log file for reuse in Job configurations. Name this generic schema code_count. A rough Pig Latin sketch of the chain configured in the steps above is given after this procedure.
  9. In this step, we will configure the fifth Job, E_Pig_Count_IPs, to analyze the uploaded access log file using a Pig chain similar to the one in the previous Job, this time to get the IP addresses of successful service calls and their numbers of visits to the website. We can reuse the component settings of the previous Job, with the following differences (a Pig Latin sketch of this variant also follows the procedure):
    1. In the [Schema] dialog box of the tPigFilterColumns component, copy the column host, instead of code, from the Input panel to the Output panel.
    2. In the tPigAggregate component, select the column host in the Column field of the Group by table and in the Input Column field of the Operations table.
    3. In the tPigStoreResult component, fill the Result file URI field with /user/hdp/weblog/apache_ip_cnt.
    4. Save a generic schema named ip_count in the Repository from the schema of the tPigStoreResult component for convenient reuse in the last Job.
    5. Upon completion of the component settings, press Ctrl+S to save your Job configurations.
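
For reference, the Pig chain configured for the code-count Job corresponds roughly to the Pig Latin sketched below. This is an illustrative sketch only, not the script the Studio generates from the component settings: the LOAD statement, the input path /user/hdp/weblog/access_log, the field list, and the relation names are assumptions about the access log schema, and the count column is written as cnt here because count is a reserved word in Pig Latin.

  -- Illustrative sketch only; the Studio generates its own script from the component settings.
  -- The LOAD statement, input path, and field list are assumptions about the access log schema.
  weblog = LOAD '/user/hdp/weblog/access_log' USING PigStorage(' ')
           AS (host:chararray, logname:chararray, remote_user:chararray, time:chararray,
               request:chararray, code:int, bytes:chararray);

  -- tPigFilterRow: NOT code equal 404, so the 404 records are dropped
  filtered = FILTER weblog BY NOT (code == 404);

  -- tPigFilterColumns: keep only the code column
  codes = FOREACH filtered GENERATE code;

  -- tPigAggregate: group by code and count the occurrences of each code
  grouped = GROUP codes BY code;
  counted = FOREACH grouped GENERATE group AS code, COUNT(codes) AS cnt;

  -- tPigSort: sort by the count in descending order
  sorted = ORDER counted BY cnt DESC;

  -- tPigStoreResult: store the result with PigStorage
  STORE sorted INTO '/user/hdp/weblog/apache_code_cnt' USING PigStorage();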
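
The E_Pig_Count_IPs variant described in step 9 changes only the projected column, the grouping column, and the result path. Under the same assumptions as above, it corresponds roughly to:

  -- Same filtered relation as in the previous sketch; host replaces code throughout
  hosts = FOREACH filtered GENERATE host;
  grouped_hosts = GROUP hosts BY host;
  ip_counted = FOREACH grouped_hosts GENERATE group AS host, COUNT(hosts) AS cnt;
  ip_sorted = ORDER ip_counted BY cnt DESC;
  STORE ip_sorted INTO '/user/hdp/weblog/apache_ip_cnt' USING PigStorage();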