Complete the Cloudera connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.
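The exact fields shown in the Spark configuration tab depend on the distribution version and the Spark mode you select. As a rough, purely illustrative sketch rather than Talend-generated code, a Cloudera connection running on YARN ultimately boils down to standard Spark and Hadoop properties such as the ones below; the class name, application name, host names and ports are placeholders, not values taken from your cluster:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Minimal sketch (assumed, not Talend-generated): the Spark-level equivalents of a
    // Cloudera-on-YARN connection. All host names and ports below are placeholders.
    public class ClouderaConnectionSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("my_spark_batch_job")   // placeholder Job name
                    .setMaster("yarn")                  // submit to the cluster's YARN ResourceManager
                    // Hadoop settings are passed through the spark.hadoop.* prefix
                    .set("spark.hadoop.fs.defaultFS", "hdfs://namenode.example.com:8020")
                    .set("spark.hadoop.yarn.resourcemanager.address", "rm.example.com:8032");

            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... the components of the Job provide the actual processing logic here ...
            sc.stop();
        }
    }

In the Studio, these values are entered in the fields of the Spark configuration tab rather than written by hand; the code above only shows what the configuration amounts to at the Spark level.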
If you cannot find the Cloudera or Hortonworks version to be used in the Version drop-down list, you can add your distribution using the dynamic distribution settings in the Studio. For further information, see Adding the latest Big Data Platform dynamically.
- Dynamic distributions for HDP 3.x and CDH 6.x are in technical preview.
- On the version list of the distributions, some versions are labelled Builtin. These versions were added by Talend through the dynamic distribution mechanism and delivered with the Studio at the time of its release. They are certified by Talend, and are therefore officially supported and ready to use.
The information in this section applies only to users who have subscribed to Talend Data Fabric or to any Talend product with Big Data; it is not applicable to Talend Open Studio for Big Data users.
Procedure
Results
- After the connection is configured, you can optionally tune Spark performance by following the process explained in (a generic illustration follows this list):
  - Tuning Spark for Apache Spark Batch Jobs for Spark Batch Jobs.
  - Tuning Spark for Apache Spark Streaming Jobs for Spark Streaming Jobs.
- It is recommended to activate the Spark logging and checkpointing system in the Spark configuration tab of the Run view of your Spark Job, to help debug and resume your Spark Job when issues arise (see the sketches after this list for the Spark-level equivalents).
- If you are using Cloudera V5.5+ to run your MapReduce or Apache Spark Batch Jobs, you can use Cloudera Navigator to trace the lineage of a given data flow and discover how that data flow was generated by a Job.
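As a generic illustration of the tuning mentioned above, the Studio's tuning options map onto standard Spark properties such as the following. This is an assumed sketch with placeholder values, not recommended settings and not the Studio's actual tuning fields:

    import org.apache.spark.SparkConf;

    // Generic Spark tuning sketch; the values are placeholders for illustration only.
    public class SparkTuningSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("my_tuned_job")                  // placeholder application name
                    .set("spark.executor.memory", "4g")          // memory allocated to each executor
                    .set("spark.executor.cores", "2")            // CPU cores per executor
                    .set("spark.executor.instances", "6")        // number of executors requested on YARN
                    .set("spark.default.parallelism", "24")      // default number of partitions for shuffles
                    .set("spark.serializer",
                         "org.apache.spark.serializer.KryoSerializer"); // faster serialization of shuffled data

            // Print the resulting configuration for inspection.
            System.out.println(conf.toDebugString());
        }
    }

The right values depend on the cluster resources and the Job; the documents listed above describe how to choose them.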
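As for the logging and checkpointing recommendation above, at the Spark level these options correspond to event logging and to a checkpoint directory, typically on HDFS. The following is a generic Spark Streaming sketch under that assumption; the class name, paths and host names are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // Generic sketch of the Spark-level equivalents of the recommended options.
    // In the Studio these values are set in the Spark configuration tab.
    public class LoggingAndCheckpointSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf()
                    .setAppName("my_spark_streaming_job")
                    .setMaster("yarn")
                    .set("spark.eventLog.enabled", "true")                               // activate Spark event logging
                    .set("spark.eventLog.dir", "hdfs:///user/spark/applicationHistory"); // placeholder log directory

            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
            jssc.checkpoint("hdfs:///user/talend/checkpoints/my_job");                   // placeholder checkpoint directory

            // Trivial placeholder stream so the context has an output operation;
            // in a real Job the components define the actual sources and sinks.
            jssc.socketTextStream("localhost", 9999).print();

            jssc.start();
            jssc.awaitTermination();
        }
    }

With the checkpoint directory in place, a restarted streaming Job can resume from the saved state instead of starting from scratch, which is what makes the checkpointing system useful when issues arise.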