Crawling for multiple datasets - Cloud

Talend Cloud Data Inventory User Guide

Version: Cloud
Language: English
Product: Talend Cloud
Module: Talend Data Inventory
Content:
  • Administration and Monitoring > Managing connections
  • Data Governance
  • Data Quality and Preparation > Enriching data
  • Data Quality and Preparation > Identifying data
  • Data Quality and Preparation > Managing datasets
Last publication date: 2023-11-08

If you need to import numerous datasets from the same source, instead of manually creating them one by one in Talend Cloud Data Inventory, you can create a crawler to retrieve a full list of assets in a single operation.

Crawling a connection lets you retrieve data at large scale and enrich your inventory more efficiently. After selecting a connection, you can import all of its content, or only part of it using a quick search and filters, and choose which users have access to the newly created datasets.

There are two crawling modes, to be chosen depending on your use case (a short illustrative sketch follows this list):
  • Dynamic selection retrieves all tables that match a specific filter, regardless of the content of your data source at a given time.
  • Manual selection lets you hand-pick the tables to retrieve from the current state of your data source.
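
To make the difference concrete, here is a minimal, purely illustrative Python sketch, not how Talend implements crawling: dynamic selection behaves like a name filter re-evaluated at each run, while manual selection pins the exact tables chosen at configuration time. The table names are hypothetical.

    import fnmatch

    # Tables present in the source at two points in time (hypothetical names).
    tables_at_first_run = ["SALES_2022", "SALES_2023", "CUSTOMERS"]
    tables_at_later_run = ["SALES_2022", "SALES_2023", "SALES_2024", "CUSTOMERS"]

    # Dynamic selection: the filter is re-evaluated at every run, so tables
    # added to the source later are picked up automatically.
    dynamic_filter = "SALES_*"
    print(fnmatch.filter(tables_at_first_run, dynamic_filter))  # 2 tables
    print(fnmatch.filter(tables_at_later_run, dynamic_filter))  # 3 tables

    # Manual selection: the table list is frozen when the crawler is created,
    # so later additions to the source are ignored.
    manual_selection = {"SALES_2022", "SALES_2023"}
    print([t for t in tables_at_later_run if t in manual_selection])  # still 2 tables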

Crawling a connection for multiple datasets comes with the following prerequisites and limitations:

  • The Dataset administrator or Dataset manager role has been assigned to you in Talend Cloud Management Console, or at least the Crawling - Add permission.
  • You are using Remote Engine version 2022-02 or later.
  • You can only crawl data from a JDBC connection (see the sample connection string after this list), and only one crawler can be created from a connection at a time.
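
As an example, a JDBC connection to Snowflake (the source used in the procedure below) is commonly defined with a URL of the following form; the bracketed values are placeholders for your own account, warehouse, database, and schema:

    jdbc:snowflake://<account_identifier>.snowflakecomputing.com/?warehouse=<warehouse>&db=<database>&schema=<schema>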

Procedure

  1. To start creating a crawler for a connection, you can either:
    • Hover over your connection in the connection list, click the Crawl connection icon, and then the Add crawler button.
    • Click your connection in the connection list, select the Crawler tab of the drawer panel, and click Add crawler.
    The crawler configuration window opens.
  2. Select your preferred crawling mode: dynamic selection or manual selection.
  3. Select the tables to import from your data source and click Next.

    You now need to define which users will be able to access the datasets that will be created, and with which rights.

  4. To add users to the list of people who can access the datasets, you can either:
    • Hover over a user or group, click the + icon, and assign the rights you want to grant using the drop-down list in the right column.
    • Select a user or group, click Add as, and assign the rights you want to grant using the drop-down list.

      You can select multiple groups or users at once using Ctrl + Click or Shift + Click.

    Important: You need to select at least one owner for the datasets in order to proceed.
    For more information on sharing and roles, see Sharing a dataset.
  5. Click Next to reach the last configuration step.
  6. Enter a Name for your crawler, Snowflake crawler in this example, and optionally a Description covering the use case and scope of the crawler.
  7. Click Run.
    An asynchronous process is launched in the background to crawl the selected tables from the connection. You are taken back to the connection list, with the Crawler tab of the right drawer panel open, where you can monitor the progress of dataset creation as well as sample availability (an illustrative API polling sketch follows this procedure).
    Note: Once all samples have been fetched, the data quality and Talend Trust Score™ of every crawled dataset are fully computed and visible in the dataset list and in each dataset overview. If you want to start working on one of the crawled datasets before its sample is available, you can retrieve one manually by clicking Refresh sample in the dataset sample view.
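
If you prefer to monitor progress programmatically instead of in the drawer panel, a polling loop along the following lines is one option. This is a minimal sketch: the base URL, endpoint path, and response fields are assumptions rather than the documented Talend Cloud API, so check the API reference for the actual contract.

    import time
    import requests

    # Assumptions: the endpoint path and the "status" field are hypothetical
    # placeholders; the base URL is region-specific. The token is a Talend
    # Cloud personal access token.
    API_BASE = "https://api.us.cloud.talend.com"
    TOKEN = "<personal-access-token>"
    CRAWLER_ID = "<crawler-id>"

    headers = {"Authorization": f"Bearer {TOKEN}"}

    while True:
        # Hypothetical status endpoint for a crawler run.
        resp = requests.get(f"{API_BASE}/crawlers/{CRAWLER_ID}/status", headers=headers)
        resp.raise_for_status()
        status = resp.json().get("status")
        print("crawler status:", status)
        if status in ("FINISHED", "FAILED"):
            break
        time.sleep(30)  # poll every 30 seconds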

Results

Datasets created from your tables are progressively added to the dataset list.

You cannot edit a crawler configuration after it has started running. To crawl the connection again, for example with a different table selection or different sharing parameters, delete the crawler and create a new one.

You can use a crawler name as a facet in the dataset search to see all the datasets linked to a given crawler.

Tip: You can automate your crawler runs via the API to retrieve data from your connection at regular intervals, as sketched below. See Scheduling a crawler run for more information.
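
As a minimal sketch of such automation, the following Python script triggers a run and could be invoked from cron or any other scheduler. The endpoint path and the runId response field are assumptions rather than the documented API; see Scheduling a crawler run for the actual endpoints.

    import requests

    # Assumptions: the run-trigger endpoint and the "runId" field are
    # hypothetical; the base URL is region-specific. The token is a Talend
    # Cloud personal access token.
    API_BASE = "https://api.us.cloud.talend.com"
    TOKEN = "<personal-access-token>"
    CRAWLER_ID = "<crawler-id>"

    def trigger_crawler_run() -> str:
        """Start a crawler run and return its identifier (assumed field)."""
        resp = requests.post(
            f"{API_BASE}/crawlers/{CRAWLER_ID}/run",
            headers={"Authorization": f"Bearer {TOKEN}"},
        )
        resp.raise_for_status()
        return resp.json().get("runId", "")

    if __name__ == "__main__":
        # Schedule this script externally for regular intervals, e.g. in cron:
        #   0 2 * * *  python trigger_crawler.py   (every day at 02:00)
        print("started run:", trigger_crawler_run())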