Function |
tCacheOut writes RDDs (Resilient Distributed Datasets) to the cache for later use in the same Job. | |
Purpose |
This component persists the input RDDs according to the storage level you define, so that these datasets can be accessed faster later in the Job. |
Depending on the Talend solution you are using, this component can be used in one, some or all of the following Job frameworks:
Spark Batch: see tCacheOut Properties in Spark Batch Jobs.
The component in this framework is available only if you have subscribed to one of the Talend solutions with Big Data.
Spark Streaming: see tCacheOut Properties in Spark Streaming Jobs.
The component in this framework is available only if you have subscribed to Talend Real-time Big Data Platform or Talend Data Fabric.
Component family |
Processing | |
Basic settings |
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository. Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:
Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide. |
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide. |
Storage level |
From the Storage level drop-down list, select how the cached RDDs are stored, such as in memory only or in memory and on disk. For further information about each of the storage levels, see https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence. |
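For orientation, here is a minimal hand-written Spark Java sketch, not code generated by Talend Studio, showing how such a storage level choice maps to Spark's RDD persistence API; the sample data and the local master URL are placeholders used for illustration only.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    public class StorageLevelSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("StorageLevelSketch").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);
            JavaRDD<String> rows = sc.parallelize(Arrays.asList("a", "b", "c"));

            // "Memory only": fastest, but partitions that do not fit are recomputed when needed.
            // rows.persist(StorageLevel.MEMORY_ONLY());

            // "Memory and disk": partitions that do not fit in memory spill to disk.
            rows.persist(StorageLevel.MEMORY_AND_DISK());

            System.out.println(rows.count());
            sc.close();
        }
    }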
Usage in Spark Batch Jobs |
This component is used as an end component and requires an input link. This component makes datasets persist and is closely related to tCacheIn. In iterative processes, tCacheOut stores the input data as a cache so that tCacheIn can read that cache without having to recalculate the whole Spark DAG (Directed Acyclic Graph, the model Spark uses to schedule Spark actions). This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. | |
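The following hand-written sketch, again plain Spark Java rather than Talend-generated code, illustrates the pattern the two components rely on: the RDD is persisted once, and the later actions read the cached partitions instead of re-running the read-and-transform lineage. The file path and transformations are hypothetical.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    public class CacheReuseSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("CacheReuseSketch").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // An expensive lineage: read and transform. Without persistence, every
            // action below would re-run this whole DAG.
            JavaRDD<String> cleaned = sc.textFile("/tmp/input.txt") // hypothetical path
                    .map(String::trim)
                    .filter(s -> !s.isEmpty());

            // Counterpart of tCacheOut: keep the result in the cache.
            cleaned.persist(StorageLevel.MEMORY_AND_DISK());

            // Counterpart of tCacheIn: later actions reuse the cached partitions.
            long total = cleaned.count();
            long distinct = cleaned.distinct().count();
            System.out.println(total + " rows, " + distinct + " distinct");

            sc.close();
        }
    }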
Spark Connection | You need to use the Spark Configuration tab in the Run view to define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, exactly one file system related component from the Storage family is required in the same Job, so that Spark can use that component to connect to the file system to which the jar files the Job depends on are transferred.
This connection is effective on a per-Job basis. | |
Log4j | If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide. For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html. |
For a related scenario, see Performing download analysis using a Spark Batch Job.
Component family |
Processing | |
Basic settings |
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository. Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:
Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide. |
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide. |
Storage level |
From the Storage level drop-down list, select how the cached RDDs are stored, such as in memory only or in memory and on disk. For further information about each of the storage levels, see https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence. |
Usage in Spark Streaming Jobs |
This component is used as an end component and requires an input link. This component makes datasets persist and is closely related to tCacheIn. In iterative processes, tCacheOut stores the input data as a cache so that tCacheIn can read that cache without having to recalculate the whole Spark DAG (Directed Acyclic Graph, the model Spark uses to schedule Spark actions). At any given moment, tCacheOut stores only one micro-batch in memory. This component, along with the Spark Streaming component Palette it belongs to, appears only when you are creating a Spark Streaming Job. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. | |
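For illustration only, the hand-written Spark Streaming sketch below, which is not the code Talend generates, assumes a socket text stream on localhost:9999 and shows this per-micro-batch behaviour: each batch RDD is persisted, reused for two actions, then unpersisted so that only one micro-batch stays cached at a time.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.storage.StorageLevel;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class StreamingCacheSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("StreamingCacheSketch").setMaster("local[2]");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));
            JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

            lines.foreachRDD((JavaRDD<String> batch) -> {
                // Cache only the current micro-batch.
                batch.persist(StorageLevel.MEMORY_ONLY());

                // Reuse the cached batch for more than one action.
                long rows = batch.count();
                long nonEmpty = batch.filter(s -> !s.isEmpty()).count();
                System.out.println(rows + " rows, " + nonEmpty + " non-empty");

                // Release it so only one micro-batch stays cached at a time.
                batch.unpersist();
            });

            jssc.start();
            jssc.awaitTermination();
        }
    }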
Spark Connection | You need to use the Spark Configuration tab in the Run view to define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, exactly one file system related component from the Storage family is required in the same Job, so that Spark can use that component to connect to the file system to which the jar files the Job depends on are transferred.
This connection is effective on a per-Job basis. | |
Log4j | If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide. For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html. |