Complete the Databricks connection configuration in the Spark Configuration tab of the Run view of your Job. This configuration is effective on a per-Job
basis.
Before you begin
- When running a Spark Streaming Job, only one Job can run on a given Databricks cluster at a time.
- When running a Spark Batch Job, you can send more than one Job to run in parallel on the same Databricks cluster only if you have selected the Do not restart the cluster when submitting check box. Otherwise, each run automatically restarts the cluster, so Jobs launched in parallel interrupt each other and fail.
Procedure
-
From the Cloud provider drop-down list, select Azure.
-
Enter the basic connection information to Databricks.
Standalone
-
Use pool:
you can select this check box to leverage a Databricks pool. If you
do, you must indicate the pool ID instead of the cluster ID in the
Spark Configuration. You
must also select the Use transient
cluster check box.
-
In the Endpoint field, enter the URL
address of your Azure Databricks workspace. This URL can be
found in the Overview
blade of your Databricks workspace page on your Azure portal.
For example, this URL could look like https://adb-$workspaceId.$random.azuredatabricks.net.
-
In the Cluster ID field, enter the ID
of the Databricks cluster to be used. This ID is the value of
the spark.databricks.clusterUsageTags.clusterId
property of your Spark cluster. You can find this property on
the properties list in the Environment tab in the Spark UI view of your cluster.
You can also find this ID in the URL of your Databricks cluster: it
appears immediately after cluster/ in this URL.
-
If you selected the
Use pool option, in the
Pool ID field, enter the
ID of the Databricks pool to be used. This ID is the value of the
DatabricksInstancePoolId
key of your pool. You can find this key under Tags in the Configuration tab of your pool. It
is also available in the tags of the clusters that are using the
pool.
You can also find this ID in the URL of your Databricks pool: it
appears immediately after cluster/instance-pools/view/ in this
URL.
-
Click the [...] button next to the Token field to enter the
authentication token generated for your Databricks user account.
You can generate or find this token on the User settings page of your
Databricks workspace. For further information, see Token management
from the Azure documentation.
-
In the DBFS dependencies folder field, enter the directory that is used
to store your Job-related dependencies on the Databricks File System (DBFS)
at runtime, putting a slash (/) at the end of this directory.
For example, enter /jars/
to store the dependencies in a folder named jars. This folder is created on
the fly if it does not already exist.
-
Poll interval when
retrieving Job status (in ms): enter, without the
quotation marks, the time interval (in milliseconds) at the end of
which you want the Studio to ask Spark for the status of your Job.
For example, this status could be Pending or Running.
The default value is 300000, that is, 300 seconds (5 minutes). This interval is
recommended by Databricks to correctly retrieve the Job status.
-
Use transient
cluster: you can select this check box to leverage
transient Databricks clusters.
The custom properties you defined in the Advanced properties table are
automatically taken into account by the transient clusters at
runtime.
- Autoscale: select or clear this check box to define
the number of workers to be used by your transient cluster.
- If you select this check box,
autoscaling is enabled. Then define the minimum number
of workers in Min
workers and the maximum number of
workers in Max
workers. Your transient cluster is
scaled up and down within this range based on its
workload.
According to the Databricks
documentation, autoscaling works best with
Databricks runtime versions 3.0 or later.
- If you clear this check box, autoscaling
is deactivated. Then define the number of workers a
transient cluster is expected to have. This number does
not include the Spark driver node.
- Node type
and Driver node type:
select the node types for the workers and the Spark driver node.
These types determine the capacity of your nodes and their
pricing by Databricks.
For details about
these node types and the Databricks Units they use, see
Supported Instance
Types from the Databricks documentation.
- Elastic
disk: select this check box to enable your
transient cluster to automatically scale up its disk space when
its Spark workers are running low on disk space.
For more details about this elastic disk
feature, search for the section about autoscaling local
storage in the Databricks documentation.
- SSH public
key: if SSH access has been set up for your
cluster, enter the public key of the generated SSH key pair.
This public key is automatically added to each node of your
transient cluster. If SSH access has not been set up, ignore this
field.
For further information about SSH
access to your cluster, see SSH access to
clusters from the Databricks
documentation.
- Configure cluster
log: select this check box to define where to
store your Spark logs for the long term. This storage system could
be S3 or DBFS.
- Do not restart the cluster when
submitting: select this check box to prevent the Studio
from restarting the cluster when it submits your Jobs. However,
if you make changes in your Jobs, clear this check box so that the
Studio restarts your cluster to take these changes into account.
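
The Cluster ID and Pool ID steps above note that each ID can be read from the corresponding Databricks URL. The following is a minimal Python sketch of that lookup, not part of the Studio. The function names are hypothetical, and the exact URL layout is an assumption based on the steps above (the pattern also tolerates a plural clusters/ segment, which is likewise an assumption).

```python
import re
from typing import Optional

# Assumed URL layouts, taken from the steps above: the cluster ID follows
# "cluster/" and the pool ID follows "cluster/instance-pools/view/".
# "clusters?" also accepts a plural "clusters/" segment (an assumption).
_CLUSTER_PATTERN = re.compile(r"clusters?/([^/?#]+)")
_POOL_PATTERN = re.compile(r"clusters?/instance-pools/view/([^/?#]+)")


def pool_id_from_url(url: str) -> Optional[str]:
    """Return the segment right after cluster/instance-pools/view/, if any."""
    match = _POOL_PATTERN.search(url)
    return match.group(1) if match else None


def cluster_id_from_url(url: str) -> Optional[str]:
    """Return the segment right after cluster/, unless the URL is a pool URL."""
    match = _CLUSTER_PATTERN.search(url)
    if match is None or match.group(1) == "instance-pools":
        return None
    return match.group(1)
```

For instance, with a made-up URL such as https://example.invalid/#setting/clusters/0123-456789-abcde123/configuration, cluster_id_from_url returns 0123-456789-abcde123. Always compare the result with the value shown in the Environment tab or in the pool tags before pasting it into the Spark Configuration tab.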
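
To make the transient cluster options above more concrete, here is a hedged Python sketch of how the Autoscale, worker count, Node type, Driver node type and Elastic disk settings could map onto a cluster specification. The field names (num_workers, autoscale, node_type_id, driver_node_type_id, enable_elastic_disk) follow the Databricks Clusters API; the helper transient_cluster_spec is hypothetical, and whether the Studio builds exactly this payload is an assumption.

```python
from typing import Any, Dict, Optional


def transient_cluster_spec(
    node_type: str,
    driver_node_type: str,
    *,
    autoscale: bool,
    min_workers: Optional[int] = None,
    max_workers: Optional[int] = None,
    num_workers: Optional[int] = None,
    elastic_disk: bool = False,
) -> Dict[str, Any]:
    """Build a Clusters-API-style dict for a transient cluster (illustrative)."""
    spec: Dict[str, Any] = {
        "node_type_id": node_type,                # Node type (workers)
        "driver_node_type_id": driver_node_type,  # Driver node type
        "enable_elastic_disk": elastic_disk,      # Elastic disk check box
    }
    if autoscale:
        # Autoscale selected: a worker range is required.
        if min_workers is None or max_workers is None:
            raise ValueError("autoscale requires min_workers and max_workers")
        if not 0 <= min_workers <= max_workers:
            raise ValueError("expected 0 <= min_workers <= max_workers")
        spec["autoscale"] = {
            "min_workers": min_workers,
            "max_workers": max_workers,
        }
    else:
        # Autoscale cleared: a fixed worker count, excluding the driver node.
        if num_workers is None or num_workers < 0:
            raise ValueError("a fixed-size cluster requires num_workers >= 0")
        spec["num_workers"] = num_workers
    return spec
```

Calling transient_cluster_spec("Standard_DS3_v2", "Standard_DS3_v2", autoscale=True, min_workers=2, max_workers=8) yields a dict with an autoscale range and no num_workers key; the node type shown is only an example value. Keeping the two modes mutually exclusive mirrors the check box: either a range or a fixed count is defined, never both.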
Results
If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
Spark checkpointing operation. In the field that is displayed, enter the
directory in the file system of the cluster in which Spark stores the context
data of the computations, such as the metadata and the generated RDDs.
For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing .