If you need to import numerous datasets from the same source, instead of manually
creating them one by one in Talend Cloud Data Inventory, you
can create a crawler to retrieve a full list of assets in a single
operation.
Crawling a connection lets you retrieve data at large scale and enrich your
inventory more efficiently. After selecting a connection, you can import all of its
content, or only part of it using a quick search and filter, and select which users
will have access to the newly created datasets.
There are two crawling modes that you can use depending on your use case:
- The dynamic selection, to retrieve all tables that match a specific filter,
regardless of the content of your data source at a given time; the filter is
reapplied on each run, as illustrated in the sketch after this list.
- The manual selection, to pick the tables to retrieve from the current
state of your data source.
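To make the distinction concrete, here is a minimal Python sketch of the two behaviors, assuming a glob-style filter on table names. The table names and the filter pattern are purely illustrative, not values taken from the product:

    import fnmatch

    # Hypothetical table lists, before and after a new table is added to the source.
    tables_today = ["SALES_2023", "SALES_2024", "CUSTOMERS"]
    tables_next_month = tables_today + ["SALES_2025"]

    def dynamic_selection(tables):
        # Dynamic selection: the filter is re-evaluated against the current source,
        # so newly added tables that match are picked up automatically.
        return [t for t in tables if fnmatch.fnmatch(t, "SALES_*")]

    # Manual selection: the list is fixed when the crawler is configured.
    manual_selection = ["SALES_2023", "SALES_2024"]

    print(dynamic_selection(tables_today))       # ['SALES_2023', 'SALES_2024']
    print(dynamic_selection(tables_next_month))  # ['SALES_2023', 'SALES_2024', 'SALES_2025']
    print(manual_selection)                      # unchanged, whatever the source now contains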
Crawling a connection for multiple datasets comes with the following prerequisites and
limitations:
- You have been assigned the Dataset administrator or Dataset
manager role in Talend Cloud Management Console,
or at least the Crawling - Add permission.
- You are using the Remote Engine 2022-02 or later.
- You can only crawl data from a JDBC connection (see the sketch below for a
typical JDBC URL form), and only one crawler can be created from a connection
at a time.
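As a point of reference only, a Snowflake source (the example used later in this procedure) is typically reached through a JDBC URL of the following form. This is a minimal sketch; the account, warehouse, and database identifiers are placeholders, not values from your environment:

    # Hypothetical JDBC URL for a Snowflake connection; adapt the placeholders.
    account = "<account_identifier>"
    jdbc_url = (
        f"jdbc:snowflake://{account}.snowflakecomputing.com/"
        "?warehouse=<warehouse>&db=<database>"
    )
    print(jdbc_url)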
Procedure
-
To start creating a crawler for a connection, you can either:
- Hover over your connection in the connection list, click the
Crawl connection icon, and then the
Add crawler button.
- Click your connection in the connection list, select the
Crawler tab of the drawer panel, and click
Add crawler.
The crawler configuration window opens.
-
Select your preferred crawling mode, dynamic or manual selection.
-
Select the tables to import from your data source and click
Next.
You now need to define which users will be able to access the datasets that
will be created, and with which rights.
-
Add the users who should have access to the datasets and define their rights.
Important: You need to select at least one owner for the datasets
in order to proceed.
For more information on sharing and roles, see
Sharing a dataset.
-
Click Next to reach the last configuration step.
-
Enter a Name for your crawler, Snowflake
crawler in this example, and optionally a
Description of the use case and scope of the crawler.
-
Click Run.
An asynchronous process is launched in the background to crawl the selected
tables from the connection. You are taken back to the connection list, with the
Crawler tab of the right drawer panel opened, where
you can monitor the progress of dataset creation, as well as sample availability.
Note: Once all the samples have been fetched, the data quality and
Talend Trust Score™ of every crawled dataset are fully computed and visible in the
dataset list and in each dataset overview. If you want to start working on one
of the crawled datasets before its sample is available, you can manually
retrieve one by clicking Refresh sample in the
dataset sample view.
Results
Datasets created from your tables are progressively added to the dataset list.
You cannot edit a crawler configuration after it has started running. To crawl the
connection again, for example with a different table selection or different sharing
parameters, delete the crawler and create a new one.
You can use a crawler name as a facet in the dataset search to see all the
datasets linked to a given crawler.
Tip: You can automate your crawler runs via API to retrieve
data from your connection at regular intervals, as sketched below. See
Scheduling a crawler run for more
information.
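For illustration, such an automation could look like the following minimal Python sketch. The base URL, endpoint path, and authentication scheme are assumptions to be replaced with the actual values documented in Scheduling a crawler run, and the crawler identifier is a placeholder:

    import os
    import time

    import requests

    # Hypothetical values; the real endpoint and token type are documented in
    # "Scheduling a crawler run", not verified here.
    API_BASE = "https://api.<region>.cloud.talend.com"  # placeholder region
    CRAWLER_ID = "<your-crawler-id>"                    # placeholder identifier
    TOKEN = os.environ["TALEND_API_TOKEN"]              # personal access token

    def run_crawler():
        """Trigger one crawler run and fail loudly on an HTTP error."""
        response = requests.post(
            f"{API_BASE}/crawlers/{CRAWLER_ID}/run",    # assumed endpoint path
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=30,
        )
        response.raise_for_status()

    if __name__ == "__main__":
        # Naive scheduler: trigger a run once a day. In practice, a cron job or
        # an orchestrator is usually a better fit than a long-lived loop.
        while True:
            run_crawler()
            time.sleep(24 * 60 * 60)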