Centralizing Cassandra metadata - 6.1

Talend Data Fabric Studio User Guide


If you often work with data from a Cassandra database, you can centralize the connection to that database and its schema details in the Metadata folder of the Repository tree view.

Setting up Cassandra metadata consists of two separate but closely related tasks:

  1. Create a connection to a Cassandra database.

  2. Retrieve Cassandra schemas of interest.

Prerequisites:

  • All the required external modules that are missing from Talend Studio due to license restrictions have been installed. For more information, see the Talend Installation Guide.

Creating a connection to a Cassandra database

  1. In the Repository tree view, expand the Metadata node, right-click NoSQL Connection, and select Create Connection from the contextual menu. The connection wizard opens up.

  2. In the connection wizard, fill in the general properties of the connection you need to create, such as Name, Purpose and Description.

    The text you enter in the Description field appears as a tooltip when you move your mouse pointer over the connection.

    When done, click Next to proceed to the next step.

  3. Select Cassandra from the DB Type list, select the Cassandra version of the database you are connecting to from the DB Version list, and specify the following details:

    • From the API type list, either select Datastax to use CQL 3 (Cassandra Query Language) with Cassandra, or select Hector to use CQL 2.

      Note that the Hector API is deprecated for Cassandra 2.0 and later, but it remains available in the Studio so that you can choose which version of the query language to use with Cassandra 2.0.0.

    • Enter the host name or IP address of the Cassandra server in the Server field.

    • Enter the port number of the Cassandra server in the Port field.

      Note

      The wizard can connect to your Cassandra database without you having to specify a port. The port you provide here is only for use in the Cassandra component that you drop onto the design workspace from this centralized connection.

    • If you want to restrict your Cassandra connection to a particular keyspace only, enter the keyspace in the Keyspace field.

      If you leave this field blank, the wizard will list the column families of all the existing keyspaces of the connected database when you retrieve schemas.

    • If your Cassandra server requires authentication for database access, select the Require authentication check box and provide your username and password in the corresponding fields.

  4. Click the Check button to make sure that the connection works.

  5. Click Finish to validate the settings.

    The newly created Cassandra database connection appears under the NoSQL Connection node in the Repository tree view. You can now drop it onto your design workspace as a Cassandra component, but you still need to define the schema information where needed.

    Next, you need to retrieve one or more schemas of interest for your connection.
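The settings collected by the wizard above can be pictured as a small record. The following is a minimal sketch in Python, not the Studio's actual implementation: the `CassandraConnection` class, its field names, and its validation rules are hypothetical, and the fallback ports assume a stock Cassandra configuration (9042 for the native protocol used by the DataStax drivers, 9160 for the Thrift interface used by Hector).

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed stock Cassandra defaults, not values taken from the Studio:
# 9042 = native (CQL binary) protocol, 9160 = Thrift interface.
DEFAULT_PORTS = {"Datastax": 9042, "Hector": 9160}


@dataclass
class CassandraConnection:
    """Hypothetical stand-in for the fields of the connection wizard."""
    name: str
    server: str
    api_type: str = "Datastax"
    port: Optional[int] = None       # optional, as in the wizard
    keyspace: Optional[str] = None   # None means "all keyspaces"
    username: Optional[str] = None   # set only if authentication is required
    password: Optional[str] = None

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty if none)."""
        problems = []
        if not self.name.strip():
            problems.append("Name is required")
        if not self.server.strip():
            problems.append("Server is required")
        if self.api_type not in DEFAULT_PORTS:
            problems.append(f"Unknown API type: {self.api_type}")
        if self.port is not None and not 1 <= self.port <= 65535:
            problems.append(f"Port out of range: {self.port}")
        if (self.username is None) != (self.password is None):
            problems.append("Username and password must be given together")
        return problems

    def effective_port(self) -> int:
        """Use the explicit port if given, else the assumed default for the API."""
        if self.port is not None:
            return self.port
        return DEFAULT_PORTS[self.api_type]


conn = CassandraConnection(name="my_cassandra", server="localhost")
print(conn.validate(), conn.effective_port())
```

Leaving `keyspace` as `None` mirrors leaving the Keyspace field blank, in which case schema retrieval covers every keyspace of the connected database.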

Retrieving schemas

In this step, we will retrieve the schemas of interest from the connected Cassandra database.

  1. In the Repository view, right-click the newly created connection and select Retrieve Schema from the contextual menu.

    The wizard opens a new view that lists all the available column families of the specified keyspace, or all the available keyspaces if you did not specify one in the previous step.

  2. Expand the keyspace (or, if you did not specify a keyspace when creating the connection, each keyspace of interest) and select the column family or column families of interest.

  3. Click Next to proceed to the next step of the wizard where you can edit the generated schema or schemas.

    By default, each generated schema is named after the column family on which it is based.

    Select a schema from the Schema panel to display its details on the right side, and modify the schema if needed. You can rename any schema, and customize the schema structure according to your needs in the Schema area.

    The toolbar allows you to add, remove, or move columns in your schema, or to replace the schema with one defined in an XML file.

    To base a schema on another column family, select the schema name in the Schema panel, select a new column family from the Based on Column Family list, and then click the Guess Schema button to overwrite the schema with that of the selected column family. You may need to click the refresh button to update the list of column families.

    To add a new schema, click the Add Schema button in the Schema panel, which creates an empty schema for you to define.

    To remove a schema, select the schema name in the Schema panel and click the Remove Schema button.

    To discard the modifications you made to the selected schema and restore its default schema, click Guess schema. Note that all your changes to the schema will be lost if you click this button.

  4. Click Finish to complete the schema creation. The resulting schemas appear under your Cassandra connection in the Repository view. You can now drop the connection or any schema node under it onto your design workspace as a Cassandra component, with all the metadata information filled in automatically.

    If you need to further edit a schema, right-click the schema and select Edit Schema from the contextual menu to open this wizard again and make your modifications.

    Warning

    If you modify the schemas, ensure that the data type in the Type column is correctly defined.
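The way the retrieval wizard groups column families by keyspace and names the generated schemas can be illustrated with a short sketch. This is only a model of the behavior described above, written in Python under stated assumptions: the function names and the sample catalog are hypothetical, and nothing here queries a real Cassandra server.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def list_column_families(
    catalog: Iterable[Tuple[str, str]],
    keyspace: Optional[str] = None,
) -> Dict[str, List[str]]:
    """Group (keyspace, column_family) pairs by keyspace.

    If `keyspace` is given, only that keyspace is kept; otherwise every
    keyspace is listed, as when the Keyspace field was left blank.
    """
    tree = defaultdict(list)
    for ks, cf in catalog:
        if keyspace is None or ks == keyspace:
            tree[ks].append(cf)
    # Sort keyspaces and column families for a stable, readable listing.
    return {ks: sorted(cfs) for ks, cfs in sorted(tree.items())}


def default_schema_names(selected: Iterable[str]) -> Dict[str, str]:
    """Map each selected column family to its default schema name,
    which is the name of the column family itself."""
    return {cf: cf for cf in selected}


# Hypothetical sample data standing in for a real database catalog.
catalog = [("shop", "orders"), ("shop", "customers"), ("logs", "events")]
print(list_column_families(catalog))           # all keyspaces
print(list_column_families(catalog, "shop"))   # restricted to one keyspace
print(default_schema_names(["orders", "customers"]))
```

Renaming a schema in the wizard corresponds to changing a value in the mapping returned by `default_schema_names` while the key, the underlying column family, stays the same.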