Filtering data on crimes in Google BigQuery tables - Cloud

Talend Cloud Apps Connectors Guide

Version
Cloud
Language
English
Product
Talend Cloud
Module
Talend Data Inventory
Talend Data Preparation
Talend Pipeline Designer
Content
Administration and Monitoring > Managing connections
Design and Development > Designing Pipelines
Last publication date
2024-03-21

This scenario aims at helping you set up and use connectors in a pipeline. You are advised to adapt it to your environment and use case.

Before you begin

  • If you want to reproduce this scenario, you can use the chicago_crime BigQuery open dataset that is publicly available for anyone to use.

Procedure

  1. Click Connections > Add connection.
  2. In the panel that opens, select the type of connection you want to create.

    Example

    BigQuery
  3. Select your engine in the Engine list.
    Note:
    • It is recommended to use the Remote Engine Gen2 rather than the Cloud Engine for Design for advanced processing of data.
    • If no Remote Engine Gen2 has been created from Talend Management Console or if it exists but appears as unavailable which means it is not up and running, you will not be able to select a Connection type in the list nor to save the new connection.
    • The list of available connection types depends on the engine you have selected.
  4. Select the type of connection you want to create.
    Here, select Google BigQuery.
  5. Fill in the connection properties to access your BigQuery project as described in Google BigQuery properties, including your project name and JSON credentials, check the connection and click Add dataset.
  6. In the Add a new dataset panel, name your dataset chicago crime.
  7. Fill in the required properties to access the file located in your BigQuery bucket (dataset name, table name or query) and click View sample to see a preview of your dataset sample.
  8. Click Validate to save your dataset.
  9. Do the same to add the BigQuery table dataset that will be used as Destination in your pipeline.
  10. Click Add pipeline on the Pipelines page. Your new pipeline opens.
  11. Click ADD SOURCE to open the panel allowing you to select your source data, here a public dataset of Chicago crimes stored in a BigQuery table.
  12. Select your dataset and click Select in order to add it to the pipeline.
    Rename it if needed.
  13. Click and add a Field selector processor to the pipeline. The configuration panel opens.
  14. Give a meaningful name to the processor.

    Example

    select crime detail fields
  15. In the Selectors area of the Advanced mode:
    1. Enter or select .case_number in the Input list and case_id in the Output list, as you want to select and rename the case_number field.
    2. Click the + sign to add a new element and enter or select .description in the Input list and desc in the Output list, as you want to select and rename the description field.
    3. Click the + sign to add a new element and enter or select .primary_type in the Input list and crime_type in the Output list, as you want to select the primary_type field.
    4. Click the + sign to add a new element and enter or select .location_description in the Input list and location in the Output list, as you want to select the location_description field.
  16. Click Save to save your configuration.

    (Optional) Look at the preview of the processor to see your data after the restructuring operation.

  17. Click and add a Filter processor to the pipeline. Give it a meaningful name.

    Example

    filter on 1st degree murders
  18. Configure the processor:
    1. Add a new element and select .crime_type in the Input list, as you want to split the local and the domain parts of the customers emails.
    2. Select NONE in the Optionally select a function to apply list.
    3. Select == in the Operator list.
    4. Enter HOMICIDE in the Value field, as you want to filter on homicide cases.
    5. Click Save to save your configuration.
  19. (Optional) Look at the preview of the Filter processor to see your data sample after the filtering operation.
  20. Click ADD DESTINATION and select the dataset table that will hold your reorganized data.
    Rename it if needed.
  21. On the top toolbar of Talend Cloud Pipeline Designer, click the Run button to open the panel allowing you to select your run profile.
  22. Select your run profile in the list (for more information, see Run profiles), then click Run to run your pipeline.

Results

Your pipeline is being executed, the crime data has been processed, homicide cases have been isolated and the output flow is sent to the Google BigQuery table you have indicated.