Best Practices: Using Git with Talend 6.2 and later versions

author
Irshad Burtally
EnrichVersion
6.5
EnrichProdName
Talend Real-Time Big Data Platform
Talend Data Integration
Talend Data Fabric
Talend Big Data
Talend Big Data Platform
Talend ESB
Talend Data Services Platform
Talend Data Management Platform
Talend MDM Platform
task
Administration and Monitoring > Managing versions
EnrichPlatform
Talend Administration Center
Talend Studio

Talend enhanced its Git support with merge capabilities in the Talend 6.2.1 release. This article describes best practices for using Git with Talend Administration Center and Talend Studio, including how to leverage the merge capabilities.

Overview

Both Subversion (SVN) and Git are supported in the subscription edition of the Talend products.

For more information about how to use both Subversion and Git in Talend Administration Center, see "Can I use both Git and Subversion within the same Talend Administration Center (TAC)".

Once the projects are configured in Talend Administration Center, a developer can open the remote project from the studio. The screenshot below shows a Git project selected. It is easy to identify that it is a Git project since it shows master in the Branch list.
Tip: You can connect to the master branch of the Git repository. However, as a best practice, it is recommended to create a local branch and work locally. When you work on a local Git branch, the studio is still connected to Talend Administration Center for authorization purposes, while all changes are saved locally in the workspace.

The screenshot below shows a Subversion project selected. You can easily identify that it is a Subversion project since it shows trunk in the Branch list. There is no merge functionality for Subversion projects.

Also, Subversion still uses a centralised model where every save is committed back to the repository (unlike Git, which lets you work in a local branch).

Working with Git Local Branches

Talend has added a new dropdown menu for Git functionalities as shown in the screenshot below. This menu only appears if you are logged into a Git project.

Once connected to the master branch, it is a good practice to create a new local branch. The screenshot below shows that you have a developer_local branch created and you are currently working in this local branch.

At this point, the studio is still connected to Talend Administration Center for license management (in particular, to control the number of concurrent users) and for project authorisation.

However, all jobs, joblets, etc. will be saved locally in the workspace. These artefacts only get saved and committed to Git when you do a Push. A developer can also get the latest version from the master and merge into his current local branch by selecting the Pull And Merge Branch menu.

As the developer creates more local branches, it is easy to switch between them without restarting the studio. The developer can also delete a local branch after doing a Push, to clean up the workspace. Note that you will lose your changes if you delete a local branch without doing a Push first.

Tip: Always do a Pull and Merge Branch before doing a Push. The Pull and Merge will happen locally in your workspace.
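For readers who want to see the order of operations outside the studio, the following is a minimal command-line sketch of the same idea: work on a local branch, merge the latest master locally, then push. It is only an analogy and makes no claim about what the studio runs internally. The temporary directories, the file names, the developer identities and the seed repository (which stands in for a second developer) are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: plain-Git analogue of "Pull and Merge, then Push".
# All paths, names and identities below are illustrative assumptions.
set -eu
work=$(mktemp -d)

# Seed repository: stands in for another developer who owns master.
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master
git config user.name "Other Developer"
git config user.email "other@example.com"
echo "v1" > job.txt
git add job.txt
git commit -q -m "Add first version of the job"

# Bare repository: stands in for the central Git server.
git clone -q --bare "$work/seed" "$work/central.git"

# Our own clone, with a local working branch.
git clone -q "$work/central.git" "$work/dev"
cd "$work/dev"
git config user.name "Example Developer"
git config user.email "dev@example.com"
git checkout -q -b developer_local
echo "v2" > job.txt
git commit -q -am "Update the job on the local branch"

# Meanwhile, the other developer publishes a change to master.
cd "$work/seed"
echo "shared routine" > routine.txt
git add routine.txt
git commit -q -m "Add a shared routine"
git push -q "$work/central.git" master

# Pull and merge: bring the latest master into the local branch first.
cd "$work/dev"
git fetch -q origin
git merge -q --no-edit origin/master

# Push: only now publish the local branch.
git push -q origin developer_local
```

Because the merge happens in the local clone, any conflict would surface there, before anything reaches the central repository.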

General Git Best Practices

This section will outline some general Git best practices that developers should follow when using Talend.

Use Branching and Tagging

One of the most important best practices with Git is to use branching and tagging correctly. A number of articles describe this in depth, but a few simple principles are enough to get started. Branching is part of standard usage; tagging is just as important and is often overlooked.

Remember that Git allows multiple developers to work on the same project by committing and pushing their changes to the Git server, and by retrieving the changes of others from it. The main development line is called the master. Branching allows developers to work independently without affecting the master. A 'branch' is a copy of a project taken at a specific point in time. The copy can be taken from the master, from another branch or from a tag.

Tags are used by developers to mark a particular revision as important in the development process and their use is a very good best practice.
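As a command-line illustration of these concepts, the sketch below tags a revision on master and later creates a branch from that tag. The tag name v1.0, the branch name hotfix_1.0, the file name and the identity are assumptions chosen for the example, not names prescribed by Talend.

```shell
#!/bin/sh
# Sketch only: mark a revision with a tag, then branch from that tag.
# Tag, branch and file names are illustrative assumptions.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Developer"
git config user.email "dev@example.com"

echo "v1" > job.txt
git add job.txt
git commit -q -m "Add first version of the job"

# Mark this revision as important with an annotated tag.
git tag -a v1.0 -m "Release 1.0"

# Development continues on master.
echo "v2" > job.txt
git commit -q -am "Start work on the next version"

# Create a branch from the tag: a copy of the project as it was at v1.0.
git checkout -q -b hotfix_1.0 v1.0
cat job.txt
```

The new branch starts from the tagged revision, so the later change on master is not part of it.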

Use Git with a Centralised Workflow

The Centralized Workflow uses a central repository as the single point of entry for all changes. Instead of trunk, as in Subversion, the default development branch is called master, and all changes are committed into this branch. This workflow doesn't require any branches other than the master.

Developers clone the central repository. In their own local copies of the project, they edit files and commit changes. These new commits are stored locally. To publish changes to the official project, developers push their local ‘master branch’ to the central repository.
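The clone, commit and push cycle described above can be sketched on the command line as follows. A local bare repository stands in for the central server; its path, the file name and the identities are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: centralized workflow with a single master branch.
# Paths, file names and identities are illustrative assumptions.
set -eu
work=$(mktemp -d)

# Prepare a "central" bare repository holding one commit on master.
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master
git config user.name "Project Admin"
git config user.email "admin@example.com"
echo "v1" > job.txt
git add job.txt
git commit -q -m "Initial import"
git clone -q --bare "$work/seed" "$work/central.git"

# A developer clones the central repository and commits locally.
git clone -q "$work/central.git" "$work/dev"
cd "$work/dev"
git config user.name "Example Developer"
git config user.email "dev@example.com"
echo "v2" > job.txt
git commit -q -am "Update the job"

# The new commit exists only in the local clone at this point.
unpublished=$(git rev-list --count origin/master..master)
echo "Commits not yet published: $unpublished"

# Publish the local master to the central repository.
git push -q origin master
```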

Make Regular Backups

With Git, every clone is basically a backup. You should still make backups, though, because a clone does not include repository-local Git configuration such as the settings in .git/config or hooks. Having your files backed up on a remote server is a good side effect of having a version control system.
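One possible way to back up both the history and the local configuration is sketched below, using git bundle for the history and a plain file copy for the configuration. The backup location and file names are assumptions; a real backup would normally go to a different machine.

```shell
#!/bin/sh
# Sketch only: back up history with "git bundle" and copy local config.
# Paths and file names are illustrative assumptions.
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Developer"
git config user.email "dev@example.com"
echo "v1" > job.txt
git add job.txt
git commit -q -m "Add first version of the job"

backup="$work/backup"
mkdir -p "$backup"

# All branches and tags go into a single bundle file.
git bundle create "$backup/repo.bundle" --all
git bundle verify "$backup/repo.bundle"

# Repository-local configuration and hooks are not part of the history,
# so copy them separately.
cp .git/config "$backup/config"
if [ -d .git/hooks ]; then
    cp -R .git/hooks "$backup/hooks"
fi

# Restoring: a bundle can be cloned like any other repository.
git clone -q "$backup/repo.bundle" "$work/restored"
```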

Divide Work into Repositories

Repositories often end up storing things that do not belong in them, simply because they are available and easy to access. This is not good practice. Keep unrelated content in separate repositories; where separate repositories need to be used together, Git lets you group them with Git submodules.
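A minimal command-line sketch of grouping two repositories with a submodule follows. The repository names, the libs/shared path and the identities are assumptions. Because the example uses local paths instead of a server URL, it passes protocol.file.allow=always, which recent Git versions require for local submodule clones; with a real remote URL that option would not be needed.

```shell
#!/bin/sh
# Sketch only: reference a shared repository from a project as a submodule.
# Repository names, paths and identities are illustrative assumptions.
set -eu
work=$(mktemp -d)

# A separate repository holding shared content.
git init -q "$work/shared"
cd "$work/shared"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Developer"
git config user.email "dev@example.com"
echo "shared routine" > routines.txt
git add routines.txt
git commit -q -m "Add shared routines"

# The main project repository.
git init -q "$work/project"
cd "$work/project"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Developer"
git config user.email "dev@example.com"
echo "v1" > job.txt
git add job.txt
git commit -q -m "Add first version of the job"

# Link the shared repository into the project under libs/shared.
git -c protocol.file.allow=always submodule --quiet add "$work/shared" libs/shared
git commit -q -m "Add shared routines as a submodule"

# Show which commit of the shared repository the project points to.
git submodule status
```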

Use Commit Messages

Git allows you to use Commit messages via the command:
git commit -m "<message>"
Use descriptive commit messages. They allow colleagues to understand changes without having to read the code.
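As an illustration of what "descriptive" can look like, the sketch below records a commit with a short subject line and a longer body by passing -m twice. The job name and the wording of the message are invented for the example.

```shell
#!/bin/sh
# Sketch only: a commit message with a subject line and an explanatory body.
# The job name and message text are illustrative assumptions.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Developer"
git config user.email "dev@example.com"

echo "v1" > customer_load.txt
git add customer_load.txt

# The first -m becomes the subject; the second becomes the body.
git commit -q \
    -m "Fix null handling in customer load job" \
    -m "Rows with an empty country code were rejected. They are now routed to the default region instead."

# Show the full message as colleagues would see it in the log.
git log -1 --format='%s%n%n%b'
```

A subject that says what changed, followed by a body that says why, lets a reviewer scan the history without opening each change.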

Use a Security Model

Without a security model, everyone has access to everything. This may or may not be acceptable.

A good idea is to limit access so that only certain repositories have read/write access for everyone. Git allows users to set up different types of access control. You may even consider creating a centralized Git master repository with tools such as Gitolite.

Make use of Standards

Using standards, such as naming standards, will improve the quality of your commits and the code-base.

It is best practice to make use of them. Other standards to use are ones surrounding testing, syntax, commit message analysis, etc.
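One way to enforce a commit message standard with command-line Git is a commit-msg hook, sketched below. The convention it checks (a ticket ID such as PROJ-123 followed by a colon) is a made-up example, and the sketch applies to commits made with the git command; whether commits made from the studio trigger such hooks is not covered here.

```shell
#!/bin/sh
# Sketch only: reject commit messages that break a naming convention.
# The "PROJ-123: ..." convention is an illustrative assumption.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Developer"
git config user.email "dev@example.com"

# Install the hook. Git passes the path of the message file as $1.
mkdir -p .git/hooks
cat > .git/hooks/commit-msg <<'EOF'
#!/bin/sh
if ! head -n 1 "$1" | grep -Eq '^[A-Z]+-[0-9]+: .+'; then
    echo "Commit message must start with a ticket ID, e.g. PROJ-123: ..." >&2
    exit 1
fi
EOF
chmod +x .git/hooks/commit-msg

echo "v1" > job.txt
git add job.txt

# A vague message is rejected by the hook.
if git commit -q -m "changes" 2>/dev/null; then
    rejected=no
else
    rejected=yes
fi
echo "Vague message rejected: $rejected"

# A message that follows the convention is accepted.
git commit -q -m "PROJ-123: Add customer load job"
```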

Use of External Tools

There are a number of useful external tools that integrate with Git.

However, Talend Studio already provides all the functionalities you need to leverage Talend with Git. On rare occasions, you may need to use external Git tools, but this is not generally a best practice with Talend. Talend Studio should be used to manage Talend projects stored in the Git repository.

Release Management

Managing your release workflow is a good practice. Ensure that versions are tagged and named according to your naming standards.
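A simple check that tags follow a naming standard could look like the sketch below. The pattern release/<major>.<minor>.<patch> is an assumed convention for the example; substitute whatever your own standard prescribes.

```shell
#!/bin/sh
# Sketch only: list tags that do not follow a release naming standard.
# The "release/X.Y.Z" pattern is an illustrative assumption.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Developer"
git config user.email "dev@example.com"
echo "v1" > job.txt
git add job.txt
git commit -q -m "Add first version of the job"

# Two tags follow the convention, one does not.
git tag -a release/1.0.0 -m "Release 1.0.0"
git tag -a release/1.1.0 -m "Release 1.1.0"
git tag final_version_2

# Report every tag that does not match the expected pattern.
bad=$(git tag -l | grep -Ev '^release/[0-9]+\.[0-9]+\.[0-9]+$' || true)
echo "Tags that break the naming standard: $bad"
```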

Maintain your Repositories

Your repository is only as good as the files kept in there. Old code, missing objects etc. just cause confusion. It is a good idea to do periodic maintenance on repositories. There are a number of useful Git commands to help you. The most useful ones are:

  • Validate and check the integrity of your repository:
    git fsck
  • Compact your repository:
    git gc
    or, for a more thorough but slower optimisation:
    git gc --aggressive
  • Prune remote tracking branches:
    git remote update --prune
  • Check your stash for unused/old work:
    git stash list

Large Files

Try to avoid committing large binary files to Git. If you have to commit large binary files, there are a couple of Git utilities that can help, such as git-annex or git-media.
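To find out whether large files have already been committed, you can list the biggest blobs in the history. The sketch below builds a throwaway repository with one small and one 2 MB file (both invented for the example) and then prints the largest objects. The listing pipeline is a common idiom, not a Talend feature, and in this simple form it truncates paths that contain spaces.

```shell
#!/bin/sh
# Sketch only: list the largest files stored anywhere in the history.
# The sample files are illustrative assumptions.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Developer"
git config user.email "dev@example.com"

echo "small" > job.txt
dd if=/dev/zero of=big.bin bs=1024 count=2048 2>/dev/null
git add job.txt big.bin
git commit -q -m "Add a small file and a 2 MB binary"

# Print "size path" for every blob, largest first.
list_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob" { print $2, $3 }' |
        sort -rn
}

list_blobs | head -n 5
largest=$(list_blobs | head -n 1)
```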

Large file usage is an actively discussed topic within the external Git community.

Large Repositories

Wherever possible, try not to create large repositories in Git, because Git can be slow with them. What counts as 'large' is relative: it depends on factors such as available RAM and I/O speed.

However, having many files, say 100K-200K, in a repository will slow common operations due to system call speeds. In addition, having many large files, as discussed above, can slow many operations.
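A quick way to see where a repository stands is to count its tracked files and look at its object store size, as sketched below. The threshold of 100000 files is taken from the lower end of the range mentioned above and is a rule of thumb, not a hard limit; the sample files are invented for the example.

```shell
#!/bin/sh
# Sketch only: measure tracked file count and object store size.
# The threshold and sample files are illustrative assumptions.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Developer"
git config user.email "dev@example.com"

for name in job_a job_b job_c; do
    echo "$name" > "$name.txt"
done
git add .
git commit -q -m "Add three sample files"

# Number of files tracked in the current revision.
file_count=$(git ls-files | wc -l | tr -d ' ')
echo "Tracked files: $file_count"

# Size of the object store, in human-readable units.
git count-objects -vH

threshold=100000
if [ "$file_count" -ge "$threshold" ]; then
    echo "Consider splitting this repository."
else
    echo "File count is below the $threshold mark."
fi
```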