GNU Parallel

September 30, 2014

About GNU Parallel

GNU parallel is a tool that lets you run a series of commands in parallel across the cores of a multicore system. It behaves very much like xargs and reads its input from a pipeline (|).

I typically use parallel when I want to perform the same operation on every file in a set. For example, in remote sensing workflows, I’ll have a directory full of files that all need the same treatment: maybe unzip, maybe gdal_translate, or some other command.

GNU parallel works by taking a stream of input, applying each line of that input to a templated command, and executing the resulting commands in parallel on the processors available on the system. To do this it uses the replacement string {}, which is replaced on each call of the command with the corresponding line of input.

Example:

ls *.zip | parallel "unzip {}"

When we run just ls *.zip we might get:

file1.zip
file2.zip
file3.zip
file4.zip
file5.zip
file6.zip

Assuming we’re running on a dual-core computer, GNU parallel will take each line of input and apply it to the template unzip {}, replacing the {} first with file1.zip, then file2.zip, and so on. By default it runs one job per core, so only two processes are running at any given time (we have two cores available). The same approach scales to larger jobs with thousands of files and tens to hundreds of cores.

parallel is very flexible and has many options for running in various ways. The templating language is quite versatile and includes ways to drop file extensions. For example, if you just wanted file1 without the .zip extension, you could use {.} in your template and the trailing .zip will be removed.

The -j option allows you to specify the number of job slots, that is, how many jobs run at the same time. For example, if you were on a system with 24 cores and you only wanted to use 8 of them, you could specify -j 8. This can be useful when you’re on a multiuser system and you don’t want to degrade performance for others, or when your job is highly I/O intensive and would saturate the available I/O if run on too many cores. See the man page for more on templating and flags.

There is even a way to provide a list of ssh-accessible servers and apply the commands across a cluster of machines; see man parallel.

Installing

Mac OS X

First install Homebrew, then:

brew install parallel

Linux

If you are a system administrator, install with a package manager. Otherwise, download the .tar.gz from the GNU parallel website and build it in your home directory. It has very few dependencies.
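For the home-directory build, something along these lines should work. This is a sketch, not the official procedure: build_parallel is a hypothetical helper name, it expects the path to a tarball you have already downloaded, and it assumes the archive unpacks into a single top-level directory containing the usual configure script.

```shell
# Hypothetical helper: unpack a downloaded GNU parallel tarball and
# install it under $HOME (the binary ends up in $HOME/bin).
build_parallel() {
    if [ "$#" -ne 1 ]; then
        echo "usage: build_parallel TARBALL" >&2
        return 1
    fi
    tarball=$1

    # Assumption: every entry in the archive sits under one top-level directory.
    srcdir=$(tar -tf "$tarball" | head -n 1 | cut -d/ -f1) || return 1

    tar -xf "$tarball" || return 1

    # Standard configure/make/make install, with a prefix we can write to.
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$srcdir" && ./configure --prefix="$HOME" && make && make install )
}
```

After it finishes, make sure $HOME/bin is on your PATH.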

Debian/Ubuntu

sudo apt-get install parallel

RedHat/CentOS

sudo yum install parallel
GNU Parallel - September 30, 2014 - Jonah Duckles