Installing Python 3 third-party libraries - Cloud

Talend Cloud Pipeline Designer Processors Guide

Version
Cloud
Language
English
Product
Talend Cloud
Module
Talend Pipeline Designer
Content
Design and Development > Designing Pipelines
Last publication date
2024-02-26

The Python 3 processor replaces the Python processor. It supports everything the old processor could do and adds features such as the installation of third-party libraries.

Differences between Python 2 and Python 3

The Python 3 processor works exactly like the old one, with two notable differences:
  • The code must be Python 3 code, not Python 2.
  • There is no concept of Map and Flatmap when modifying your records, so the corresponding drop-down list no longer appears in the user interface.

The latter difference is important as it allows you to write straightforward code to filter, map, or flatmap records.

Example of Python 3 code that no longer requires using MAP or FLATMAP:
if input['type'] == "house":
    # Single family dwellings have a top-level occupant (MAP).
    output = input['occupant']
elif input['type'] == "apartment":
    # Apartment blocks have many occupants (FLATMAP).
    output = [apt['occupant'] for apt in input['subdwellings']]
else:
    # Deleting the record (FILTER).
    output = None
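
To try this logic outside the processor, the following is a minimal sketch that simulates it in plain Python. The names process_record and run are hypothetical, and the fan-out rule (a list produces several records, None drops the record, any other value produces one record) is an assumption drawn from the comments in the example above, not a description of the engine's internals.

```python
def process_record(record):
    """Hypothetical stand-in for the processor body.

    `record` plays the role of `input`; the return value plays the role of `output`.
    """
    if record['type'] == "house":
        # One record in, one record out (MAP).
        return record['occupant']
    elif record['type'] == "apartment":
        # One record in, many records out (FLATMAP).
        return [apt['occupant'] for apt in record['subdwellings']]
    else:
        # Drop the record (FILTER).
        return None


def run(records):
    """Apply process_record to each record using the assumed fan-out rule."""
    results = []
    for record in records:
        out = process_record(record)
        if out is None:
            continue              # filtered out
        if isinstance(out, list):
            results.extend(out)   # flatmap: one output record per list item
        else:
            results.append(out)   # map: exactly one output record
    return results


records = [
    {"type": "house", "occupant": "Ada"},
    {"type": "apartment", "subdwellings": [{"occupant": "Bob"}, {"occupant": "Cy"}]},
    {"type": "garage"},
]
print(run(records))  # ['Ada', 'Bob', 'Cy']
```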

Installing libraries

First you need to set up your Remote Engine Gen2 as usual. The installation of libraries takes place in two different locations:
  • in the previewrunner container
  • in the livy container

You can install libraries either using a file or via command line.

Installing libraries in the Remote Engine Gen2 using the requirements.txt file

In the previewrunner container:

To install libraries in the previewrunner container:
  • Create a folder on your local machine. Name it /tmp/rqmts for example.

  • Open this file to edit it:

    default/docker-compose.yml if you are using the engine in the AWS USA, AWS Europe, AWS Asia-Pacific or Azure regions.

    eap/docker-compose.yml if you are using the engine as part of the Early Adopter Program.

  • Add this parameter in the previewrunner > volumes section of the file:
    - /tmp/rqmts:/opt/rqmts
  • Add this parameter in the previewrunner > environment section and save your changes:
    PYTHON_RQMTS_PATH: /opt/rqmts

    Note that the paths are entirely customizable as long as you have write access to them.

  • Go to Talend Cloud Pipeline Designer and check that creating a pipeline with a Python 3 processor works as usual.
  • Create a requirements.txt file inside your /tmp/rqmts folder. This file should contain libraries to install in the Python Virtual Environment:
    jinja2==2.11.2
  • Go back to your pipeline and add some code that uses the libraries specified in requirements.txt in your Python 3 processor. For example:
    from jinja2 import Template
    
    t = Template("Hello {{something}}!")
    output["hello"] = t.render(something = input["Op"])

Save your changes and check that the data is previewed successfully. If you later modify the requirements.txt file on your local machine and update your code accordingly, the preview should continue to work with the updated libraries.
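
If you prefer to generate the requirements.txt file from a script, the following is a minimal sketch. The helper write_requirements is hypothetical and not part of any Talend tooling, and the demo writes to a temporary folder; in practice you would pass the folder you mounted, such as /tmp/rqmts.

```python
import tempfile
from pathlib import Path


def write_requirements(folder, pins):
    """Write a pip-style requirements.txt with one `name==version` line per entry.

    `pins` maps a package name to the exact version to install.
    """
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)
    lines = [f"{name}=={version}" for name, version in sorted(pins.items())]
    path = target / "requirements.txt"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


# Demo in a throwaway folder; replace demo_dir with your mounted folder.
demo_dir = tempfile.mkdtemp()
demo_path = write_requirements(demo_dir, {"jinja2": "2.11.2"})
print(demo_path.read_text(encoding="utf-8"))
```

Pinning exact versions, as in the jinja2==2.11.2 example above, keeps the previewrunner and livy containers on the same library versions.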

In the livy container:

The procedure is the same as for the previewrunner container; the only difference is that you have to edit the livy section of the docker-compose.yml file.

  • Add the PYTHON_RQMTS_PATH environment variable in your cluster. It should point to a mounted volume and not a folder that is erased whenever the worker server dies.

    For example: /dbfs/tpd-python3-rqmts

  • Repeat the same steps as for the previewrunner container (create the requirements.txt file inside the /dbfs/tpd-python3-rqmts folder, update the pipeline, and so on). The preview should then work as expected.
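
To check whether a library is actually available in a given environment, a small import check can help. The following is a minimal sketch using only the Python standard library; the helper missing_libraries is hypothetical. It takes top-level import names, which can differ from the names used with pip.

```python
import importlib.util


def missing_libraries(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "json" ships with Python; the second name is a made-up module used for illustration.
print(missing_libraries(["json", "surely_not_installed_pkg_xyz"]))
```

An empty list means every listed module can be imported in the environment where the check runs.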

Installing libraries in the Remote Engine Gen2 via the command line

Libraries can also be installed directly from the command line or by launching a shell script.

To do that, you have to install your libraries in both the previewrunner and the livy containers.

To install libraries in the previewrunner container:
  • In Talend Cloud Pipeline Designer, start by creating a pipeline with a Python 3 processor in it and try to preview it.

    This will force the unpacking of all the Python files in your previewrunner container.

  • From the command line, run a command like this to install numpy, for example:
    docker exec -it [previewrunner_docker_img_name] \
        bash -c "source /tmp/luci/local/env/default/bin/activate && pip install numpy"
  • You can then edit your code in the Python 3 processor and save your changes.

To install libraries in the livy container:
  • In Talend Cloud Pipeline Designer, create a pipeline with a Python 3 processor in it to force the unpacking of all the Python files.

  • From the command line, run a command like this to install jinja2, for example:
    docker exec -it [livy_docker_img_name] \
        bash -c "source /tmp/luci/local/env/default/bin/activate && pip install jinja2"
  • Write some code in your Python 3 processor that uses jinja2, save your changes and check that the preview is displayed successfully.
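
The two docker exec commands above differ only in the container and the package, so they can be built from one helper when you script the installation. The following is a minimal sketch: the function pip_install_command and the container names are hypothetical, and the -it flags are left out on the assumption that a scripted run has no interactive terminal attached.

```python
import shlex

# Virtual environment activation script, as used in the commands above.
VENV_ACTIVATE = "/tmp/luci/local/env/default/bin/activate"


def pip_install_command(container, package):
    """Build the docker exec argument list that installs `package` in `container`."""
    inner = f"source {VENV_ACTIVATE} && pip install {shlex.quote(package)}"
    # -it is omitted here (assumption: no TTY is available in a scripted run).
    return ["docker", "exec", container, "bash", "-c", inner]


# Hypothetical container names; use the names reported by your own engine.
for container in ["my_previewrunner", "my_livy"]:
    print(shlex.join(pip_install_command(container, "numpy")))
```

The returned list can be passed to subprocess.run(..., check=True) to execute the installation and raise an error if pip fails.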