Best Practices: Using Git with Talend 6.2 and later versions

Applies to versions: 6.4, 6.3, 6.2
Products: Talend ESB, Talend Data Fabric, Talend Big Data, Talend Real-Time Big Data Platform, Talend Data Management Platform, Talend Data Integration, Talend MDM Platform, Talend Data Services Platform, Talend Big Data Platform
Task: Administration and Monitoring > Managing versions
Platform: Talend Studio, Talend Administration Center

The Talend 6.2.1 release added merge capabilities to Talend's Git support. This article describes best practices for using Git with Talend Administration Center and Talend Studio, including how to use the merge capabilities.

Overview

Both Subversion (SVN) and Git are supported in the subscription edition of the Talend products.

For more information about using both Subversion and Git in Talend Administration Center, see the article "Can I use both Git and Subversion within the same Talend Administration Center (TAC)".

Once the projects are configured in Talend Administration Center, a developer can open the remote project from Talend Studio. The screenshot below shows a Git project selected. You can easily identify it as a Git project because it shows master in the Branch list.

Tip: You can connect directly to the master branch of the Git repository. However, as a best practice, it is recommended to create a local branch and work locally. When you work in a local Git branch, the Studio stays connected to Talend Administration Center for authorization, while all saves are made locally in the local workspace.

The screenshot below shows a Subversion project selected. You can easily identify it as a Subversion project because it shows trunk in the Branch list. There is no merge functionality for Subversion projects.

Also, Subversion still uses a centralized model in which every save is committed back to the repository, unlike Git, which lets you work in a local branch.

Working with Git Local Branches

Talend has added a new dropdown menu for Git functionalities as shown in the screenshot below. This menu only appears if you are logged into a Git project.

Once connected to the master branch, it is a good practice to create a new local branch. The screenshot below shows that you have a developer_local branch created and you are currently working in this local branch.

At this point, the Studio is still connected to Talend Administration Center for license management (in particular, control of concurrent users) and project authorization.

However, all Jobs, Joblets, and other items are saved locally in the workspace. These artifacts reach the remote Git repository only when you do a Push. A developer can also get the latest version from the master and merge it into the current local branch by selecting the Pull And Merge Branch menu.

As the developer creates more local branches, it is easy to switch between them without restarting the Studio. The developer can also delete a local branch after doing a Push, to clean up the workspace. Note that you will lose your changes if you delete a local branch without doing a Push first.

Tip: Always do a Pull and Merge Branch before doing a Push. The Pull and Merge will happen locally in your workspace.
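
Talend Studio performs these steps from its Git menu, so the following is not something you need to run for a Talend project. It is only a minimal sketch, in plain Git commands, of what the local-branch cycle described above amounts to: create a local branch, commit locally, pull and merge master, then push. The branch name developer_local comes from this article; the temporary directories, the file names (job.item, other.item), the "Dev" identity, and the choice to push to a remote branch of the same name are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Illustration only: a throwaway remote and workspace, not a real Talend project.
set -eu

# Throwaway identity so commits work even without any Git configuration.
export GIT_AUTHOR_NAME="Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Dev" GIT_COMMITTER_EMAIL="dev@example.com"

work=$(mktemp -d)

# A bare repository stands in for the remote Git server.
git init -q --bare "$work/remote.git"
git --git-dir="$work/remote.git" symbolic-ref HEAD refs/heads/master

# Seed the remote with an initial commit on master.
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master
echo "v1" > job.item
git add job.item
git commit -q -m "Initial job"
git remote add origin "$work/remote.git"
git push -q origin master

# The developer opens the project and creates a local branch.
git clone -q "$work/remote.git" "$work/dev"
cd "$work/dev"
git checkout -q -b developer_local
echo "v2" > job.item
git commit -q -am "Update job"   # stored locally only; the remote is unchanged

# Meanwhile a teammate pushes another change to master.
(cd "$work/seed" && echo "other" > other.item && git add other.item \
  && git commit -q -m "Add other job" && git push -q origin master)

# Pull and merge master locally first, then push the local branch.
git pull -q --no-rebase --no-edit origin master
git push -q origin developer_local
```

The merge happens entirely in the local workspace; the remote only changes at the final push.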

General Git Best Practices

This section outlines general Git best practices that developers should follow when using Talend.

Use Branching and Tagging

One of the most important Git best practices is to use branching and tagging correctly. Many articles describe this in depth; a few simple steps cover the essentials. Branching is standard usage; tagging is equally important and should be used correctly.

Remember that Git allows multiple developers to work on the same project by committing and pushing their changes to the Git server and retrieving each other's changes from it. Branching allows developers to work independently without affecting the main development line, which is called the master. A branch is a copy of a project taken at a specific point in time. The copy is taken from the main development line (the master), from another branch, or from a tag.

Tags are used by developers to mark a particular revision as important in the development process and their use is a very good best practice.
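
As a minimal sketch of how a tag and a branch relate, the following plain Git commands mark a revision with a tag and later create a branch from that tag. In a Talend project you would normally create tags and branches from Talend Studio or Talend Administration Center; the tag name release-1.0, the branch name hotfix-1.0, the file job.item, and the "Dev" identity are hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Illustration only: runs in a throwaway repository.
set -eu

export GIT_AUTHOR_NAME="Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Dev" GIT_COMMITTER_EMAIL="dev@example.com"

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master

echo "v1" > job.item
git add job.item
git commit -q -m "First version of the job"

# Mark this revision as important with an annotated tag.
git tag -a release-1.0 -m "Release 1.0"

# Development continues on master.
echo "v2" > job.item
git commit -q -am "Work towards the next release"

# Later, branch from the tag to get the project exactly as it was released.
git checkout -q -b hotfix-1.0 release-1.0
cat job.item
```

The new branch starts from the tagged revision, so work on it does not affect master.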

Choose A Workflow

Git lets you choose among many different workflows, such as the centralized workflow, the feature branch workflow, and the Gitflow workflow.

  • The choice of workflow model depends on several factors, such as the project, the overall development and deployment workflows, and the number of developers.
  • Select the workflow model carefully and, once it is decided, make sure that everyone agrees to it and follows its principles.
  • Choose a workflow that suits the sprint duration, software development life cycle, and release frequency of your project. Different teams or projects may adopt different workflows.

Remote Versus Local Branches

Git allows you to work either directly on the remote branch or in a local branch. As a best practice, it is recommended to create a local branch and work locally.

  • Remember that when working on a local branch for Git:
    • all save/commit actions are done locally in the local workspace
    • the Studio is still connected to the Talend Administration Center for Project Authorization requirements.
  • It is also recommended to have the local branch name be the same as the remote branch name.

Once the development is completed and an agreed upon milestone is achieved, it is recommended to clean up the local and remote branches before they grow into a branch jungle that can be too difficult to manage.

Pull And Merge Branch

  • It is recommended to always do a Pull and Merge Branch before doing a Push. The Pull and Merge will happen locally in your workspace.
  • All pull or merge operations from the remote master into the local master branch should be fast-forward.
    • Do NOT develop directly in the master branch, periodically updating it with pull and then pushing your local master. Instead, commit your work in a local branch and merge with master right before pushing.
  • Remember that if you are working in a team, a pull request allows you and the other developers to review the changes. If you are an individual contributor, it lets you review the differences in the branch's code.
  • Once all the issues are settled, merge the code into the repository.
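
The fast-forward rule above can be sketched in plain Git as follows. This is an illustration under assumptions, not the way Talend Studio implements Pull And Merge Branch: the branch name feature-123, the file names, the "Dev" identity, and the temporary directories are hypothetical. The --ff-only option makes Git refuse the pull if the local master has diverged, which is a simple guard against accidental development on master.

```shell
#!/usr/bin/env bash
# Illustration only: a throwaway remote and workspace.
set -eu

export GIT_AUTHOR_NAME="Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Dev" GIT_COMMITTER_EMAIL="dev@example.com"

work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git --git-dir="$work/remote.git" symbolic-ref HEAD refs/heads/master

git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master
echo "v1" > job.item
git add job.item
git commit -q -m "Initial job"
git remote add origin "$work/remote.git"
git push -q origin master

# All development happens on a separate local branch, never on master.
git clone -q "$work/remote.git" "$work/dev"
cd "$work/dev"
git checkout -q -b feature-123
echo "v2" > job.item
git commit -q -am "Change job on the feature branch"

# A teammate updates the remote master in the meantime.
(cd "$work/seed" && echo "other" > other.item && git add other.item \
  && git commit -q -m "Add other job" && git push -q origin master)

# Local master has no commits of its own, so this pull is a fast-forward.
# --ff-only makes Git stop with an error instead of creating a merge commit.
git checkout -q master
git pull -q --ff-only origin master

# Merge the updated master into the feature branch right before pushing it.
git checkout -q feature-123
git merge -q --no-edit master
git push -q origin feature-123
```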

Branch Hygiene

  • Using Git opens up a whole lot of possibilities for workflows using branches with different conventions, but if you don’t take the time every so often to delete those old branches, especially on very active projects, it doesn’t take long before you find yourself lost in a forest. The best practice is to delete branches as they’re merged, but doing a periodic sweep works all right too.
  • If you delete a branch on your machine and you know that it’s not needed on the upstream remote anymore, you should delete it there as well. You can do it via the Talend Administration Center or the Branches page for the repo on GitHub or via the command line.
  • Match the branch name to your Jira issue, defect, feature, or sprint number. This way, the branches will be easily identifiable.
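
For the command-line route mentioned above, a periodic sweep could look like the following sketch. It lists the branches already merged into master, deletes one locally, and then deletes it on the remote. The branch name feature-101, the file job.item, the "Dev" identity, and the temporary directories are hypothetical; the setup lines only exist to make the example self-contained.

```shell
#!/usr/bin/env bash
# Illustration only: a throwaway remote and workspace.
set -eu

export GIT_AUTHOR_NAME="Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Dev" GIT_COMMITTER_EMAIL="dev@example.com"

# Setup: a remote, a master branch, and a feature branch that is already merged.
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git --git-dir="$work/remote.git" symbolic-ref HEAD refs/heads/master
git init -q "$work/dev"
cd "$work/dev"
git symbolic-ref HEAD refs/heads/master
echo "v1" > job.item
git add job.item
git commit -q -m "Initial job"
git remote add origin "$work/remote.git"
git checkout -q -b feature-101
echo "v2" > job.item
git commit -q -am "Finish feature 101"
git checkout -q master
git merge -q feature-101
git push -q origin master feature-101

# Sweep: list branches whose commits are all contained in master.
git branch --merged master

# Delete the merged branch locally; -d refuses to delete unmerged work.
git branch -q -d feature-101

# Delete it on the remote as well, then drop stale remote-tracking references.
git push -q origin --delete feature-101
git fetch -q --prune origin
```

Using -d rather than -D is a deliberate choice here: Git then refuses to delete a branch whose commits are not merged, which protects unpushed work.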

Make Regular Backups

  • With Git, every clone is basically a backup. However, a more formal backup system is still needed, because Git clones do not save Git configurations, the working directory, the index, or any non-standard references.
  • Having your files backed up on a remote server is a good side effect of having a version control system.

Divide Work into Repositories

  • Repositories are often used for storing things that they really shouldn't. This is because they are there, are available and are easily accessible. This is not good practice. With Git you can group things together using Git submodules.
  • How the work is divided depends on the individual or organization. Examples include separate repositories for separate datamarts or modules, separate repositories for large files, and separate read-only repositories for review or audit purposes. An international bank, for instance, might consider having separate repositories per country or region, or per currency.
  • For Talend, we recommend having one Git repository per Talend project. It is possible to have more than one project in one Git repository. However, tagging and branching always apply to all projects in that repository, which can be confusing. Hence, sticking to one project per Git repository is the best practice.

Use Commit Messages

  • Using descriptive commit messages is the best thing you can do for others who use the repository, because it allows colleagues to understand changes without having to read code. You need to enable the Custom Log option in Talend Administration Center to do this.
  • Leave a clean commit history.
  • On the command line, Git takes the commit message through the -m option:
    git commit -m "<message>"

Clean Local History Before Pushing

  • With Git, you commit frequently. While you work in your local repository, your commits stay local, and you have complete control to clean, rewrite, or cancel them.
  • It is good practice to let only the meaningful commits reach the remote repository.
  • An example of when to clean the local history before pushing:
    • You added several tLogRow/tJava components to track down a bug, and it took you several commits to actually fix it. Once the bug was fixed, you removed the extra tLogRow/tJava components.
    • You would not want all of this intermediate history to go to the remote, so it is recommended to clean the local history before pushing the code.
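
One way to do this clean-up on the command line is to collapse the unpushed commits into a single commit with a soft reset. The sketch below assumes that master represents the last state that was pushed and that all commits after it on the branch are still local; the branch name bugfix-42, the file contents, and the "Dev" identity are hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: runs in a throwaway repository.
set -eu

export GIT_AUTHOR_NAME="Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Dev" GIT_COMMITTER_EMAIL="dev@example.com"

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
echo "v1" > job.item
git add job.item
git commit -q -m "Initial job"

# Several local commits made while hunting the bug.
git checkout -q -b bugfix-42
echo "v1 + tLogRow" > job.item
git commit -q -am "Add tLogRow for debugging"
echo "v2 + tLogRow" > job.item
git commit -q -am "Try a fix"
echo "v2" > job.item
git commit -q -am "Remove tLogRow"

# Show the commits that exist only on this branch.
git log --oneline master..HEAD

# Move the branch pointer back to master but keep the final file state staged,
# then record it as one meaningful commit.
git reset --soft master
git commit -q -m "Fix the bug in the job"
```

Only do this with commits that have not been pushed. An interactive rebase (git rebase -i) gives finer control when some of the intermediate commits are worth keeping.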

Don't Change Published History In Remote Repository

  • Once you push, your commits and tags are in the upstream repository and, from then on, are visible to everyone who uses it.
  • Consider those commits etched in diamond for all eternity and never change them. Rewriting published history causes problems for everyone who has already retrieved it.

Use a Security Model

  • Without a security model, everyone has access to everything. This may or may not be acceptable.
  • A good idea is to limit access so that only certain repositories have read/write access for everyone. Git allows users to set up different types of access control. You may even consider creating a centralized git master repository with tools such as Gitlite Manager.

Make use of Standards

  • Using standards, such as naming standards, will improve the quality of your commits and the code-base.
  • It is best practice to make use of them. Other standards to use are ones surrounding testing, syntax, commit message analysis, and so on.

Use of External Tools

There are a number of useful external tools that integrate with Git.

  • However, Talend Studio already provides all the functionalities you need to leverage Talend with Git.
  • On rare occasions, you may need to use external Git tools, but this is not generally a best practice with Talend. Talend Studio should be used to manage Talend projects stored in the Git repository.

Release Management

Managing your release workflow is itself a best practice. Ensure versions are tagged and named according to your naming standards.

Maintain your Repositories

  • Your repository is only as good as the files kept in there. Old code, missing objects etc. just cause confusion. It is a good idea to do periodic maintenance on repositories. There are a number of useful Git commands to help you. The most useful ones are:
  • Validate and check the integrity of your repository:
    git fsck
  • Compact your repository:
    git gc
    or, for a more thorough but slower optimization:
    git gc --aggressive
  • Prune remote tracking branches:
    git remote update --prune
  • Check your stash for unused/old work:
    git stash list
  • Your periodic maintenance could involve cleaning the branches, compressing the Git repository to save space and speed up Git operations, and so on.

Large files

  • Try to avoid committing large binary files to Git. If you have to commit large binary files, there are Git utilities you can use, such as Git annex or Git media.
  • Best Practice: Do NOT commit large binary files in Git for Talend. Developers need to pull the project from Git, and large files increase the time it takes to open a project. They also affect Talend Administration Center, which automatically pulls files from Git as well.
  • Best Practice: Binary files should go in the Nexus repositories.
  • Best Practice: Non-Talend files, such as sample data files, html, SQL, and so on, should go in another Git Project and you should avoid polluting the Talend projects with these.
  • Large file usage is an actively discussed topic within the external Git community.

Large Repositories

  • Wherever possible, try not to create large repositories in Git, because Git can be slow with them. What counts as large has no fixed definition; in general it depends on factors such as RAM size and I/O speed.
  • However, having many files, say 100K-200K, in a repository will slow common operations due to system call speeds. In addition, having many large files, as discussed above, can slow many operations.
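
To check where a repository stands against these guidelines, a quick audit along the following lines can help. This is a minimal sketch that looks only at the files tracked in the current commit, not at the full history; the file names job.item and big.bin, the 2 MB size, and the "Dev" identity are hypothetical and exist only to make the example self-contained.

```shell
#!/usr/bin/env bash
# Illustration only: runs in a throwaway repository.
set -eu

export GIT_AUTHOR_NAME="Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Dev" GIT_COMMITTER_EMAIL="dev@example.com"

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
echo "v1" > job.item
head -c 2097152 /dev/zero > big.bin   # a 2 MB stand-in for a large binary
git add job.item big.bin
git commit -q -m "Add a job and a large binary"

# Number of tracked files in the working tree.
count=$(git ls-files | wc -l | tr -d ' ')

# Largest tracked file in the current commit:
# ls-tree -l prints the blob size in the fourth column, followed by a tab and the path.
largest=$(git ls-tree -r -l HEAD | sort -k4,4nr | head -n 1 | cut -f2)

echo "Tracked files: $count"
echo "Largest tracked file: $largest"
```

In a real repository, replace the setup lines with a cd into your clone and raise the head limit to list more than one file.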