tCassandraOutput - 6.3

Talend Open Studio for Big Data Components Reference Guide


Function

tCassandraOutput receives data from the preceding component, and writes data into Cassandra.

Purpose

tCassandraOutput allows you to write data into or delete data from a column family of a Cassandra keyspace.

tCassandraOutput properties

Component family

Big Data / Cassandra

 

Basic settings

Property type

Either Built-In or Repository.

Built-In: No property data stored centrally.

Repository: Select the repository file where the properties are stored.

 

Use existing connection

Select this check box and in the Component List click the relevant connection component to reuse the connection details you already defined.

 

DB Version

Select the Cassandra version you are using.

 

API type

This drop-down list is displayed only when you have selected the 2.0 version of Cassandra from the DB version list. From this API type list, you can either select Datastax to use CQL 3 (Cassandra Query Language) with Cassandra, or select Hector to use CQL 2.

Note that the Hector API is deprecated for Cassandra 2.0 and later, but it is still available in the Studio so that you can choose which version of the query language to use with Cassandra 2.0.0.

Because the CQL commands have evolved between versions, the parameters to be set in the Basic settings view vary with the API you select.

 

Host

Hostname or IP address of the Cassandra server.

 

Port

Listening port number of the Cassandra server.

 

Required authentication

Select this check box to provide credentials for the Cassandra authentication.

This check box appears only if you do not select the Use existing connection check box.

 

Username

Fill in this field with the username for the Cassandra authentication.

 

Password

Fill in this field with the password for the Cassandra authentication.

To enter the password, click the [...] button next to the password field, and then in the pop-up dialog box enter the password between double quotes and click OK to save the settings.

 

Use SSL

Select this check box to enable the SSL or TLS encrypted connection.

Then you need to use the tSetKeystore component in the same Job to specify the encryption information.

For further information about tSetKeystore, see tSetKeystore.

Keyspace configuration

Keyspace

Type in the name of the keyspace into which you want to write data.

 

Action on keyspace

Select the operation you want to perform on the keyspace to be used:

  • None: No operation is carried out.

  • Drop and create keyspace: The keyspace is removed and created again.

  • Create keyspace: The keyspace does not exist and gets created.

  • Create keyspace if not exists: A keyspace gets created if it does not exist.

  • Drop keyspace if exists and create: The keyspace is removed if it already exists and created again.
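As an illustration only, the keyspace actions above roughly correspond to the CQL 3 statements built by the following Python sketch. The function name keyspace_statements, the SimpleStrategy replication settings and the keyspace name demo_ks are assumptions made for this example; this guide does not state the exact statements that tCassandraOutput issues, so they may differ.

```python
# Hypothetical mapping from the "Action on keyspace" options to CQL 3.
# The replication settings below are an assumption made for this sketch.
REPLICATION = "{'class': 'SimpleStrategy', 'replication_factor': 1}"


def keyspace_statements(action, keyspace):
    """Return the CQL statements that approximate the given action."""
    create = f"CREATE KEYSPACE {keyspace} WITH replication = {REPLICATION}"
    create_if = (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {REPLICATION}"
    )
    statements = {
        "None": [],
        "Drop and create keyspace": [f"DROP KEYSPACE {keyspace}", create],
        "Create keyspace": [create],
        "Create keyspace if not exists": [create_if],
        "Drop keyspace if exists and create": [
            f"DROP KEYSPACE IF EXISTS {keyspace}",
            create,
        ],
    }
    return statements[action]


for stmt in keyspace_statements("Drop keyspace if exists and create", "demo_ks"):
    print(stmt + ";")
```

The plain Drop and Create variants fail in Cassandra when the keyspace is missing or already present, respectively, which is why the "if exists" and "if not exists" variants are usually the safer choice for Jobs that run repeatedly.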

 

Column family

Type in the name of the column family into which you want to write data.

 

Action on column family

Select the operation you want to perform on the column family to be used:

  • None: no operation is carried out.

  • Drop and create column family: the column family is removed and created again.

  • Create column family: the column family does not exist and gets created.

  • Create column family if not exists: a column family gets created if it does not exist.

  • Drop column family if exists and create: the column family is removed if it already exists and created again.

 

Action on data

Select the action to perform on the data of the table defined:

  • Upsert: insert the columns if they do not exist or update the existing columns.

  • Insert: insert the columns if they do not exist. This action also updates the existing ones.

  • Update: update the existing columns or add the columns that do not exist. This action does not support the Counter Cassandra data type.

  • Delete: remove columns corresponding to the input flow.

Note that the action list varies depending on the Hector or Datastax API you are using. When the API is Datastax, more actions become available.

For more advanced actions, use the Advanced settings view.
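To make the Datastax (CQL 3) actions more concrete, the following Python sketch builds the kind of parameterised statement each action corresponds to for one incoming row. The function name data_statement and the table and column names are hypothetical, and the statements that the component actually generates are not given in this guide. Note that in CQL both INSERT and UPDATE write the cells whether or not the row already exists, which matches the descriptions above.

```python
def data_statement(action, table, key_cols, value_cols):
    """Build a parameterised CQL 3 statement for one row of the input flow."""
    where = " AND ".join(f"{c} = ?" for c in key_cols)
    if action == "Insert":
        cols = key_cols + value_cols
        marks = ", ".join("?" for _ in cols)
        return f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({marks})"
    if action == "Update":
        sets = ", ".join(f"{c} = ?" for c in value_cols)
        return f"UPDATE {table} SET {sets} WHERE {where}"
    if action == "Delete":
        return f"DELETE FROM {table} WHERE {where}"
    raise ValueError(f"unsupported action: {action}")


# Hypothetical table ks.users with primary key "id".
for action in ("Insert", "Update", "Delete"):
    print(data_statement(action, "ks.users", ["id"], ["name", "age"]))
```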

 

Schema and Edit schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

 

 

Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

When the schema to be reused has default values that are integers or functions, ensure that these default values are not enclosed within quotation marks. If they are, you must remove the quotation marks manually.

For more details, see the article Verifying default values in a retrieved schema on Talend Help Center (https://help.talend.com).

 

Sync columns

Click this button to retrieve the schema from the previous component connected in the Job.

 

Die on error

Clear the check box to skip any rows on error and complete the process for error-free rows. When errors are skipped, you can collect the rows on error using a Row > Reject link.

Note that the following features are available to the Hector API only.

Column family type

Standard: Column family is of standard type.

Super: Column family is of super type.

Row key column

Select the row key column from the list.

Include row key in columns

Select this check box to include row key in columns.

Super columns

Select the super column from the list.

This drop-down list appears only if you select Super from the Column family type drop-down list.

Include super columns in standard columns

Select this check box to include the super columns in standard columns.

Delete row

Select this check box to delete the row.

This check box appears only if you select Delete from the Action on data drop-down list.

Delete columns

Customize the columns you want to delete.

Delete super columns

Select this check box to delete super columns.

This check box appears only if you select the Delete row check box.

Advanced settings

Batch Size

Number of lines in each processed batch.

When you are using the Datastax API, this feature is displayed only when you have selected the Use unlogged batch check box.

 

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level as well as at each component level.

Note that the following features are available to the Datastax API only.

Use unlogged batch

Select this check box to handle data in batch but with Cassandra's UNLOGGED approach. This feature is available to the following three actions: Insert, Update and Delete.

Then you need to configure how the batch mode works:

  • Batch size: enter the number of lines in each batch to be processed.

  • Group batch method: select how to group rows into batches:

    1. Partition: rows sharing the same partition keys are grouped.

    2. Replica: rows to be written to the same replica are grouped.

    3. None: rows are grouped randomly. This option is suitable for a single node Cassandra.

  • Cache batch group: select this check box to load rows into memory before grouping them. This way, grouping is not impacted by the order of the rows.

    If you leave this check box clear, only successive rows that meet the same criteria are grouped.

  • Async execute: select this check box if you want tCassandraOutput to send batches in parallel. If you leave it clear, tCassandraOutput waits for the result of a batch before sending another batch to Cassandra.

  • Maximum number of batches executed in parallel: once you have selected Async execute, enter the number of batches to be sent in parallel to Cassandra.

    This number should not be a negative number or 0 and it is also recommended not to use too large a value.
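The difference between grouping with and without Cache batch group can be sketched as follows. This is a simplified Python model, not the component's implementation: the function names and the sample rows are hypothetical, and the first element of each row stands in for the partition key.

```python
from itertools import groupby


def chunk(rows, batch_size):
    """Split a list of rows into consecutive batches of at most batch_size."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def group_batches(rows, key, batch_size, cache=False):
    """Group rows by key, then cut each group into batches.

    cache=True  models "Cache batch group": all rows are loaded first,
                so rows with the same key are grouped regardless of order.
    cache=False groups only successive rows that share the same key.
    """
    if cache:
        groups = {}
        for row in rows:
            groups.setdefault(key(row), []).append(row)
        grouped = list(groups.values())
    else:
        grouped = [list(g) for _, g in groupby(rows, key=key)]
    return [batch for g in grouped for batch in chunk(g, batch_size)]


def partition_key(row):
    return row[0]


rows = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]
print(group_batches(rows, partition_key, batch_size=2))
print(group_batches(rows, partition_key, batch_size=2, cache=True))
```

In this model, the uncached run yields three batches because the first "a" row is separated from the later ones by a "b" row, whereas the cached run collects all "a" rows before cutting them into batches of two.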

The ideal situation to use batches with Cassandra is when a small number of tables must synchronize the data to be inserted or updated.

In this UNLOGGED approach, the Job does not write batches into Cassandra's batchlog system and thus avoids the performance issue incurred by this writing. For further information about Cassandra's BATCH statement and UNLOGGED approach, see Batches and Using unlogged batches.
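For reference, an unlogged batch in CQL wraps several statements between BEGIN UNLOGGED BATCH and APPLY BATCH. The Python sketch below only assembles such a statement as text; the helper name unlogged_batch and the sample table are assumptions, and the component itself sends batches through the Datastax driver rather than as a single string.

```python
def unlogged_batch(statements):
    """Wrap CQL statements in a BEGIN UNLOGGED BATCH ... APPLY BATCH block."""
    body = "\n".join(f"  {s};" for s in statements)
    return f"BEGIN UNLOGGED BATCH\n{body}\nAPPLY BATCH;"


# Hypothetical rows written to the same partition of a table ks.events.
print(unlogged_batch([
    "INSERT INTO ks.events (id, seq, payload) VALUES (1, 1, 'start')",
    "INSERT INTO ks.events (id, seq, payload) VALUES (1, 2, 'stop')",
]))
```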

Insert if not exists

Select this check box to insert rows only when they do not already exist in the target table.

This feature is available to the Insert action only.

Delete if exists

Select this check box to remove from the target table only the rows that have the same records in the incoming flow.

This feature is available only to the Delete action.
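In CQL terms, these two options most likely correspond to the IF NOT EXISTS and IF EXISTS conditions, which Cassandra executes as lightweight transactions. The Python sketch below appends the matching condition to a statement; the function name make_conditional and the sample statements are hypothetical, and the mapping to the component's options is an assumption.

```python
def make_conditional(statement):
    """Append IF NOT EXISTS to an INSERT, or IF EXISTS to a DELETE."""
    verb = statement.split(None, 1)[0].upper()
    if verb == "INSERT":
        return statement + " IF NOT EXISTS"
    if verb == "DELETE":
        return statement + " IF EXISTS"
    raise ValueError(f"no existence condition for {verb} statements")


print(make_conditional("INSERT INTO ks.users (id, name) VALUES (?, ?)"))
print(make_conditional("DELETE FROM ks.users WHERE id = ?"))
```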

Use TTL

Select this check box to write the TTL data in the target table. In the column list that is displayed, you need to select the column to be used as the TTL column. The DB type of this column must be Int.

This feature is available to the Insert action and the Update action only.

Use Timestamp

Select this check box to write the timestamp data in the target table. In the column list that is displayed, you need to select the column to be used to store the timestamp data. The DB type of this column must be BigInt.

This feature is available to the following actions: Insert, Update and Delete.
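In CQL, TTL and timestamp values are passed through a USING clause: the TTL is expressed in seconds and the write timestamp is conventionally in microseconds since the epoch. The following Python sketch shows where that clause sits in INSERT and UPDATE statements. The function names and the table ks.events are hypothetical, and the statements are an approximation of what the component may generate.

```python
def using_clause(ttl=False, timestamp=False):
    """Return ' USING ...' with the requested options, or an empty string."""
    parts = []
    if ttl:
        parts.append("TTL ?")
    if timestamp:
        parts.append("TIMESTAMP ?")
    return " USING " + " AND ".join(parts) if parts else ""


def insert_stmt(table, cols, ttl=False, timestamp=False):
    # In an INSERT, the USING clause follows the VALUES list.
    marks = ", ".join("?" for _ in cols)
    return (
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({marks})"
        + using_clause(ttl, timestamp)
    )


def update_stmt(table, value_cols, key_cols, ttl=False, timestamp=False):
    # In an UPDATE, the USING clause comes before SET.
    sets = ", ".join(f"{c} = ?" for c in value_cols)
    where = " AND ".join(f"{c} = ?" for c in key_cols)
    return f"UPDATE {table}{using_clause(ttl, timestamp)} SET {sets} WHERE {where}"


print(insert_stmt("ks.events", ["id", "payload"], ttl=True, timestamp=True))
print(update_stmt("ks.events", ["payload"], ["id"], ttl=True))
```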

IF condition

Add the condition to be met for the Update or the Delete action to take place. This condition allows you to be more precise about the columns to be updated or deleted.
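As a rough illustration, such a condition corresponds to the IF clause that CQL allows at the end of an UPDATE or DELETE statement. In the Python sketch below, the helper name with_condition and the status column are hypothetical; how the component assembles the clause from the table you fill in is not described in this guide.

```python
def with_condition(statement, conditions):
    """Append an IF clause built from (column, operator) pairs."""
    if not conditions:
        return statement
    clause = " AND ".join(f"{col} {op} ?" for col, op in conditions)
    return f"{statement} IF {clause}"


print(with_condition(
    "UPDATE ks.users SET email = ? WHERE id = ?",
    [("status", "=")],
))
```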

Special assignment operation

Complete this table to construct advanced SET commands of Cassandra to make the Update action more specific. For example, add a record to the beginning or a particular position of a given column.

In the column named Update column of this table, select the column to be updated, and then select the operation to be used from the Operation column. The following operations are available:

  • Append: it adds incoming records to the end of the column to be updated. The Cassandra data types it can handle are Counter, List, Set and Map.

  • Prepend: it adds incoming records to the beginning of the column to be updated. The only Cassandra data type it can handle is List.

  • Remove: it removes records from the target table when the same records exist in the incoming flow. The Cassandra data types it can handle are Counter, List, Set and Map.

  • Assign based on position/key: it adds records to a particular position of the column to be updated. The Cassandra data types it can handle are List and Map.

    Once you select this operation, the Map key/list position column becomes editable. From this column, you need to select the column to be used as reference to locate the position to be updated.

For more details about these operations, see Datastax's related documentation in http://docs.datastax.com/en/cql/3.1/cql/cql_reference/update_r.html?scroll=reference_ds_g4h_qzq_xj__description_unique_34.
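The four operations map onto the assignment forms that CQL accepts in the SET clause of an UPDATE statement. The Python sketch below returns the assignment for a given column and operation; the function name assignment and the column name tags are hypothetical, and the exact text generated by the component is an assumption.

```python
def assignment(column, operation):
    """Return the CQL SET assignment for one special assignment operation."""
    forms = {
        # col = col + ?  : append to a list, add to a set or map,
        #                  or increment a counter
        "Append": f"{column} = {column} + ?",
        # col = ? + col  : prepend to a list
        "Prepend": f"{column} = ? + {column}",
        # col = col - ?  : remove elements, or decrement a counter
        "Remove": f"{column} = {column} - ?",
        # col[?] = ?     : set a list element by index or a map entry by key
        "Assign based on position/key": f"{column}[?] = ?",
    }
    return forms[operation]


stmt = f"UPDATE ks.users SET {assignment('tags', 'Prepend')} WHERE id = ?"
print(stmt)
```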

Row key in the List type

Select the column to be used to construct the WHERE clause of Cassandra to perform the Update or the Delete action on only selected rows. The column(s) to be used in this table should be from the set of the Primary key columns of the Cassandra table.

Delete collection column based on position/key

Select the column to be used as reference to locate the particular row(s) to be removed.

This feature is available only to the Delete action.

Global Variables

NB_LINE: the number of rows read by an input component or transferred to an output component. This is an After variable and it returns an integer.

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage

This component is used as an output component and it always needs an incoming link.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Limitation

n/a

Related Scenario

For a scenario in which tCassandraOutput is used, see Scenario: Handling data with Cassandra.