Setting up the Knox parameters with CDP Public Cloud Data Hub - 7.3

Spark Batch

Version
7.3
Language
English
Product
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
Module
Talend Studio
Last publication date
2024-02-21

Talend Studio allows you to use Knox authentication for your Spark Streaming and Spark Batch Jobs that run on a CDP Public Cloud Data Hub instance in YARN cluster mode. You can enter the Knox connection parameters either in the Spark configuration tab of the Run view of your Job or in the Hadoop Cluster Connection metadata wizard. This configuration is effective on a per-Job basis.

In this scenario, the Hadoop Cluster Connection metadata wizard is used. Setting up the Knox connection in the Repository saves you from configuring it again in the Spark configuration tab each time one of your Spark Jobs needs it.

For more information about the configuration via the Spark configuration tab of the Run view of your Job, see Defining the Cloudera connection parameters.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data.

Procedure

  1. In the Repository tree view of your Studio, expand Metadata and then right-click Hadoop cluster.
  2. Select Create Hadoop cluster from the contextual menu to open the Hadoop Cluster Connection wizard.
  3. Fill in the generic information about this connection, such as Name and Description, and click Next. The Hadoop Configuration Import Wizard window opens, in which you select the distribution to be used and either the manual or the automatic mode to configure the connection.
    Important: Knox is only supported with CDP 7.1 and onwards.
  4. Select Cloudera from the Distribution drop-down list and Cloudera CDP 7.1 from the Version drop-down list.
  5. Select Enter manually Hadoop services and click Finish.
  6. Select the Use Knox check box and enter the Knox-related connection parameters:
    • Knox URL: enter the Knox URL in the following format: https://<host>/<datahub>/cdp-proxy-api. You can find the Knox URL on the Cloudera Management Console, in the Endpoints section of your Data Hub, under Livy Server.
      Important: If you have the R2021-07 or an earlier patch installed, the URL must not include /livy or any other suffix after cdp-proxy-api. If you have the R2021-08 or a later patch installed, the URL works with or without /livy at the end.
    • Knox user: enter your Workload User Name from Cloudera Management Console.
    • Knox password: enter your Workload Password from Cloudera Management Console.
    • Knox directory: enter the HDFS location in which the loaded file is stored.
    • Knox session timeout: specify the amount of time to wait for the Job to reconnect to the cluster via Knox.
  7. Optional: Click Check services to verify that Talend Studio can connect to the services you have specified in this wizard.
  8. Optional: Click Export as context to create a new context with these data and save it in the repository.
  9. Click Finish to validate your changes and close the wizard.
    The newly set-up Hadoop connection is displayed under the Hadoop cluster folder in the Repository tree view.
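
The Knox URL format described in step 6 can be checked before you paste the value into the wizard. The following Python snippet is a minimal sketch and not part of Talend Studio: the function name check_knox_url, the allow_livy_suffix flag, and the example host and Data Hub names are hypothetical. It assumes that a single trailing slash is harmless and removes it, and it models the patch note by accepting the /livy suffix only when allow_livy_suffix is True (R2021-08 or later).

```python
import re
from urllib.parse import urlsplit

# Path layout taken from step 6: /<datahub>/cdp-proxy-api, optionally
# followed by /livy. The optional trailing slash is an assumption.
_KNOX_PATH = re.compile(r"^/(?P<datahub>[^/]+)/cdp-proxy-api(?P<livy>/livy)?/?$")


def check_knox_url(url: str, allow_livy_suffix: bool = True) -> str:
    """Validate a Knox URL and return it without a trailing slash.

    Hypothetical helper. Pass allow_livy_suffix=False to mimic
    R2021-07 or an earlier patch, where /livy must not be appended.
    """
    parts = urlsplit(url.strip())
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError("expected https://<host>/<datahub>/cdp-proxy-api")
    if parts.query or parts.fragment:
        raise ValueError("the Knox URL must not contain a query or fragment")

    match = _KNOX_PATH.match(parts.path)
    if match is None:
        raise ValueError("the path must be /<datahub>/cdp-proxy-api")
    if match.group("livy") and not allow_livy_suffix:
        raise ValueError("/livy is not accepted with R2021-07 or earlier")

    # Rebuild the URL from the validated parts, dropping the trailing slash.
    return f"https://{parts.netloc}{parts.path.rstrip('/')}"


if __name__ == "__main__":
    # Example values only; replace them with the endpoint shown for your Data Hub.
    print(check_knox_url("https://knox.example.com/my-datahub/cdp-proxy-api/"))
```

A value that fails this check raises ValueError with a short reason, which can be easier to interpret than a failed Check services run in the wizard.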