
Reading data from an HDFS connection on Spark

Using predefined HDFS metadata, you can read data from an HDFS file system on Spark.

Before you begin

Ensure that your Job contains a tHDFSConfiguration component that holds the connection information for your HDFS file system.

Procedure

  1. In the Designer, add an input component.

    Example

    Add a tFileInputDelimited component.
  2. Double-click the component.
    The component is automatically configured with the connection information from the tHDFSConfiguration component, under Storage.
  3. Click the […] button next to Edit schema.
  4. Click the plus button to add the data columns you need.

    Example

    1. CustomerID
    2. FirstName
    3. LastName
  5. Select the Type of each column.

    Example

    For CustomerID, select the Integer Type.
  6. Click OK.
  7. In the File Name field, enter the file path and name of your choice.
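The File Name can be a path on the HDFS file system. As an illustration only (the host, port, and path below are hypothetical, not values from this procedure), a full HDFS URI typically looks like:

```
hdfs://namenode.example.com:8020/user/talend/customers.csv
```
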

Results

The tFileInputDelimited component is now configured to read data from HDFS on Spark.
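Conceptually, the schema you defined in the steps above maps each delimited record to typed columns. The following is a minimal plain-Java sketch of that mapping, not Talend-generated code; the semicolon separator and the `parseRow` helper are assumptions for illustration, and only the example columns CustomerID (Integer), FirstName, and LastName come from the procedure:

```java
// Sketch only (not Talend-generated code): shows how a schema of
// CustomerID (Integer), FirstName, LastName types the fields of one
// delimited record.
public class SchemaSketch {
    static final String DELIMITER = ";"; // assumed field separator

    static Object[] parseRow(String line) {
        String[] fields = line.split(DELIMITER, -1);
        return new Object[] {
            Integer.parseInt(fields[0].trim()), // CustomerID -> Integer
            fields[1],                          // FirstName  -> String
            fields[2]                           // LastName   -> String
        };
    }

    public static void main(String[] args) {
        Object[] row = parseRow("42;Ada;Lovelace");
        System.out.println(row[0] + " " + row[1] + " " + row[2]); // prints: 42 Ada Lovelace
    }
}
```

On Spark, the component applies the same schema to every record of the file in parallel; the sketch above only illustrates the per-record typing.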
