Crawling for multiple datasets - Cloud

Talend Cloud Data Inventory User Guide

Version: Cloud
Language: English
Product: Talend Cloud
Module: Talend Data Inventory
Content:
  • Administration and Monitoring > Managing connections
  • Data Governance
  • Data Quality and Preparation > Enriching data
  • Data Quality and Preparation > Identifying data
  • Data Quality and Preparation > Managing datasets
Last publication date: 2023-11-08

If you need to import numerous datasets from the same source, instead of manually creating them one by one in Talend Cloud Data Inventory, you can create a crawler to retrieve a full list of assets in a single operation.

Crawling a connection lets you retrieve data at large scale and enrich your inventory more efficiently. After selecting a connection, you can import all of its content, or only part of it using a quick search and filters, and choose which users have access to the newly created datasets.

There are two crawling modes, to be chosen depending on your use case (a short illustrative sketch follows this list):
  • Dynamic selection retrieves all tables that match a specific filter, regardless of the content of your data source at a given time.
  • Manual selection lets you hand-pick the tables to retrieve from the current state of your data source.
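
To make the difference concrete, here is a minimal, purely illustrative Python sketch, not how Talend implements crawling: dynamic selection behaves like a name filter re-evaluated at each run, while manual selection pins the exact tables chosen at configuration time. The table names are hypothetical.

    import fnmatch

    # Tables present in the source at two points in time (hypothetical names).
    tables_at_first_run = ["SALES_2022", "SALES_2023", "CUSTOMERS"]
    tables_at_later_run = ["SALES_2022", "SALES_2023", "SALES_2024", "CUSTOMERS"]

    # Dynamic selection: the filter is re-evaluated at every run, so tables
    # added to the source later are picked up automatically.
    dynamic_filter = "SALES_*"
    print(fnmatch.filter(tables_at_first_run, dynamic_filter))  # 2 tables
    print(fnmatch.filter(tables_at_later_run, dynamic_filter))  # 3 tables

    # Manual selection: the table list is frozen when the crawler is created,
    # so later additions to the source are ignored.
    manual_selection = {"SALES_2022", "SALES_2023"}
    print([t for t in tables_at_later_run if t in manual_selection])  # still 2 tables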

Crawling a connection for multiple datasets comes with the following prerequisites and limitations:

  • The Dataset administrator or Dataset manager role has been assigned to you in Talend Cloud Management Console, or at least the Crawling - Add permission.
  • You are using Remote Engine version 2022-02 or later.
  • You can only crawl data from a JDBC connection (see the sample connection string after this list), and only one crawler can be created from a connection at a time.
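
As an example, a JDBC connection to Snowflake (the source used in the procedure below) is commonly defined with a URL of the following form; the bracketed values are placeholders for your own account, warehouse, database, and schema:

    jdbc:snowflake://<account_identifier>.snowflakecomputing.com/?warehouse=<warehouse>&db=<database>&schema=<schema>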

Procedure

  1. To start creating a crawler for a connection, you can either:
    • Hover over your connection in the connection list, click the Crawl connection icon, and then the Add crawler button.
    • Click your connection in the connection list, select the Crawler tab of the drawer panel, and click Add crawler.
    The crawler configuration window opens.
  2. Select your preferred crawling mode: dynamic selection or manual selection.
  3. Select the tables to import from your data source and click Next.

    You now need to define which users will be able to access the datasets that will be created, and with which rights.

  4. To add users to the list of people who can access the datasets, you can either:
    • Hover over a user or group, click the + icon, and assign the rights you want to grant using the drop-down list in the right column.
    • Select a user or group, click Add as, and assign the rights you want to grant using the drop-down list.

      You can select multiple groups or users at once using Ctrl + Click or Shift + Click.

    Important: You need to select at least one owner for the datasets in order to proceed.
    For more information on sharing and roles, see Sharing a dataset.
  5. Click Next to reach the last configuration step.
  6. Enter a Name for your crawler, Snowflake crawler in this example, and optionally a Description covering the use case and scope of the crawler.
  7. Click Run.
    An asynchronous process is launched in the background to crawl the selected tables from the connection. You are taken back to the connection list, with the Crawler tab of the right drawer panel open, where you can monitor the progress of dataset creation as well as sample availability (an illustrative API polling sketch follows this procedure).
    Note: Once all samples have been fetched, the data quality and Talend Trust Score™ of every crawled dataset are fully computed and visible in the dataset list and in each dataset overview. If you want to start working on one of the crawled datasets before its sample is available, you can retrieve one manually by clicking Refresh sample in the dataset sample view.
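
If you prefer to monitor progress programmatically instead of in the drawer panel, a polling loop along the following lines is one option. This is a minimal sketch: the base URL, endpoint path, and response fields are assumptions rather than the documented Talend Cloud API, so check the API reference for the actual contract.

    import time
    import requests

    # Assumptions: the endpoint path and the "status" field are hypothetical
    # placeholders; the base URL is region-specific. The token is a Talend
    # Cloud personal access token.
    API_BASE = "https://api.us.cloud.talend.com"
    TOKEN = "<personal-access-token>"
    CRAWLER_ID = "<crawler-id>"

    headers = {"Authorization": f"Bearer {TOKEN}"}

    while True:
        # Hypothetical status endpoint for a crawler run.
        resp = requests.get(f"{API_BASE}/crawlers/{CRAWLER_ID}/status", headers=headers)
        resp.raise_for_status()
        status = resp.json().get("status")
        print("crawler status:", status)
        if status in ("FINISHED", "FAILED"):
            break
        time.sleep(30)  # poll every 30 seconds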

Results

Datasets created from your tables are progressively added to the dataset list.

You cannot edit a crawler configuration after it has started running. To crawl the connection again, for example with a different table selection or different sharing parameters, delete the crawler and create a new one.

You can use a crawler name as a facet in the dataset search to see all the datasets linked to a given crawler.

Tip: You can automate your crawler runs via the API to retrieve data from your connection at regular intervals, as sketched below. See Scheduling a crawler run for more information.
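
As a minimal sketch of such automation, the following Python script triggers a run and could be invoked from cron or any other scheduler. The endpoint path and the runId response field are assumptions rather than the documented API; see Scheduling a crawler run for the actual endpoints.

    import requests

    # Assumptions: the run-trigger endpoint and the "runId" field are
    # hypothetical; the base URL is region-specific. The token is a Talend
    # Cloud personal access token.
    API_BASE = "https://api.us.cloud.talend.com"
    TOKEN = "<personal-access-token>"
    CRAWLER_ID = "<crawler-id>"

    def trigger_crawler_run() -> str:
        """Start a crawler run and return its identifier (assumed field)."""
        resp = requests.post(
            f"{API_BASE}/crawlers/{CRAWLER_ID}/run",
            headers={"Authorization": f"Bearer {TOKEN}"},
        )
        resp.raise_for_status()
        return resp.json().get("runId", "")

    if __name__ == "__main__":
        # Schedule this script externally for regular intervals, e.g. in cron:
        #   0 2 * * *  python trigger_crawler.py   (every day at 02:00)
        print("started run:", trigger_crawler_run())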