The Hive-on-Tez issue with Hortonworks in Spark Jobs - 7.1

author
Talend Documentation Team
EnrichVersion
7.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
task
Design and Development > Designing Jobs > Hadoop distributions > Hortonworks
Design and Development > Designing Jobs > Job Frameworks > Spark Batch
Design and Development > Designing Jobs > Job Frameworks > Spark Streaming
EnrichPlatform
Talend Studio

Hive-on-Tez issue in Spark Jobs when using Hortonworks

The Hive configuration in a Hortonworks cluster is specific: the cluster uses Tez as the Hive execution engine. This configuration can lead to a known issue when running the Hive components of Talend Studio in Spark Jobs, as shown in the following stack trace:
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:983)
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:552)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:307)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:321)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:451)
at dev_v6_001.test_hdp26_hive_0_1.test_hdp26_hive.tRowGenerator_1Process(test_hdp26_hive.java:1152)
at dev_v6_001.test_hdp26_hive_0_1.test_hdp26_hive.run(test_hdp26_hive.java:1597)
at dev_v6_001.test_hdp26_hive_0_1.test_hdp26_hive.runJobInTOS(test_hdp26_hive.java:1387)
at dev_v6_001.test_hdp26_hive_0_1.test_hdp26_hive.main(test_hdp26_hive.java:1272)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:980)
... 11 more
Caused by: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':
at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:176)
at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:86)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:101)
at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:100)
at org.apache.spark.sql.internal.SessionState.<init>(SessionState.scala:157)
at org.apache.spark.sql.hive.HiveSessionState.<init>(HiveSessionState.scala:32)
... 16 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:173)
... 24 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:358)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
at org.apache.spark.sql.hive.HiveExternalCatalog.<init>(HiveExternalCatalog.scala:65)
... 29 more
Caused by: java.lang.RuntimeException: org.apache.tez.dag.api.TezUncheckedException: Invalid configuration of tez jars, tez.lib.uris is not defined in the configuration
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:535)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:188)
... 37 more
Caused by: org.apache.tez.dag.api.TezUncheckedException: Invalid configuration of tez jars, tez.lib.uris is not defined in the configuration
at org.apache.tez.client.TezClientUtils.setupTezJarsLocalResources(TezClientUtils.java:166)
at org.apache.tez.client.TezClient.getTezJarResources(TezClient.java:831)
at org.apache.tez.client.TezClient.start(TezClient.java:355)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:184)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:116)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:532)
... 38 more

The known issue demonstrated above occurs during the initialization of Hive by Spark: Hive looks for a Tez configuration, because Tez is the execution engine designated by Hortonworks, but Spark does not expect Tez. This configuration must therefore be overridden.
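
For reference, here is a minimal sketch of the configuration at the root of the issue; the exact values can differ on your cluster. The regular Hive configuration designates Tez as the engine, while the matching Tez settings typically live in tez-site.xml, a file that Spark does not load:

<!-- Sketch of the relevant entry in the regular /etc/hive/conf/hive-site.xml. -->
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>

<!-- The Tez jar location is usually defined in /etc/tez/conf/tez-site.xml,
     which Spark does not load, hence the TezUncheckedException above.
     The value below is the typical HDP default, shown as an assumption. -->
<property>
  <name>tez.lib.uris</name>
  <value>/hdp/apps/${hdp.version}/tez/tez.tar.gz</value>
</property>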

Environment:

  • Subscription-based Talend Studio solution with Big Data

  • Spark Jobs

Use a Spark-specific Hive configuration file to resolve the Hive-on-Tez issue for Spark Jobs on Hortonworks

Hortonworks ships a Spark-specific hive-site.xml file to resolve this Hive-on-Tez issue. You can use this file to define the connection to your Hortonworks cluster in the Studio.

This file is stored in the Spark configuration folder of your Hortonworks cluster: /etc/spark/conf.
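
As a rough sketch, this Spark-specific file typically contains little more than the connection to the Hive metastore and, unlike the regular hive-site.xml, does not designate Tez as the execution engine. The metastore URI below is a placeholder; the actual content depends on your Hortonworks version:

<!-- Hypothetical sketch of /etc/spark/conf/hive-site.xml; replace the
     placeholder host with the metastore host of your own cluster. -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://your-metastore-host:9083</value>
  </property>
</configuration>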

Procedure

  1. Obtain this Spark-specific Hive configuration file from the administrator of your cluster.
  2. Download the regular Hive configuration files from your cluster, for example, using Ambari.
  3. Among these files, replace the /etc/hive/conf/hive-site.xml file with this Spark-specific /etc/spark/conf/hive-site.xml file.
  4. Define the Hadoop connection to your Hortonworks cluster in the Repository if you have not done so.

    For an example of how to define this type of connection, see Create the cluster metadata - Hortonworks 2.4.

  5. Right-click this connection and from the contextual menu, select Edit Hadoop cluster to open the Hadoop cluster connection wizard.
  6. Click Next to open the second step of this wizard and select the Use custom Hadoop configurations check box.
  7. Click the [...] button next to Use custom Hadoop configurations to open the Hadoop configuration import wizard.
  8. Select the Hortonworks version you are using and then select the Import configuration from local files radio button.
  9. Click Next, then click Browse... to locate the Hive configuration files you prepared in the previous steps, including the Spark-specific hive-site.xml file.
  10. Click Finish to close the import wizard and go back to the Hadoop cluster connection wizard.
  11. Click Finish to validate the changes and, in the pop-up dialog box, click Yes to accept the propagation. The wizard closes and the Spark-specific Hive configuration file will be used with this Hadoop connection.

    This new configuration is effective only for the Jobs that use this connection.

    For an example of how to use this type of connection, see Write Data to HDFS - Hortonworks.

Use the Hadoop property filter of the Studio to resolve the Hive-on-Tez issue for Spark Jobs on Hortonworks

If you need to use the original hive-site.xml file of your Hortonworks cluster, or if you do not have access to the Spark-specific configuration files, you can use the property filter provided in the Hadoop metadata wizard of the Studio to solve this issue.
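
As a sketch, the entry that the filter removes looks like the following in the imported hive-site.xml (values may differ on your cluster); once it is filtered out, Hive falls back to its default execution engine, which Spark can initialize without any Tez configuration:

<!-- Entry removed by the Hadoop property filter defined in the procedure
     below; with it gone, Hive uses its default engine (mr in most Hive
     versions) instead of Tez. -->
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>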

Procedure

  1. Define the Hadoop connection to your Hortonworks cluster in the Repository if you have not done so.

    For an example of how to define this type of connection, see Create the cluster metadata - Hortonworks 2.4.

  2. Right-click this connection and from the contextual menu, select Edit Hadoop cluster to open the Hadoop cluster connection wizard.
  3. Click Next to open the second step of this wizard and select the Use custom Hadoop configurations check box.
  4. Click the [...] button next to Use custom Hadoop configurations to open the Hadoop configuration import wizard.
  5. Select the Hortonworks version you are using and then perform one of the following operations:
    • If your Hortonworks cluster has Ambari installed, select the Retrieve configuration from Ambari or Cloudera radio button and click Next. Then do the following:
      1. In the wizard that is opened, enter the Ambari credentials in the corresponding fields and click Connect.

        Then a cluster name is displayed in the Discovered clusters drop-down list.

      2. On the list, select your cluster and click Fetch to retrieve the configuration of the related services.

      3. Click the [...] button next to Hadoop property filter to open the wizard.

    • If your Hortonworks cluster does not have Ambari installed, you have to import the Hive configuration files from a local directory. This means you need to obtain the Hive configuration files from the administrator of your cluster or download them yourself.

      Once you have these files, do the following:

      1. In Hadoop configuration import wizard, select the Import configuration from local files radio button and click Next.

      2. Click Browse... to find the Hive configuration files.

      3. Click the [...] button next to Hadoop property filter to open the wizard.

  6. Click the [+] button to add one row and enter hive.execution.engine in this new row to filter this property out.
  7. Click OK to validate this addition and go back to the Hadoop configuration import wizard.
  8. Click Finish to close the import wizard and go back to the Hadoop cluster connection wizard.
  9. Click Finish to validate the changes and, in the pop-up dialog box, click Yes to accept the propagation. The wizard closes and the filtered Hive configuration, without the hive.execution.engine property, will be used with this Hadoop connection.

    This new configuration is effective only for the Jobs that use this connection.

    For an example of how to use this type of connection, see Write Data to HDFS - Hortonworks.