#!/usr/bin/perl -w
=head1 GNU Parallel Tutorial
This tutorial shows off much of GNU B<parallel>'s functionality. The
tutorial is meant to teach the options of GNU B<parallel>. It is not
meant to show realistic examples from the real world.
Spend an hour walking through the tutorial. Your command line will
love you for it.
=head1 Prerequisites
To run this tutorial you must have the following:
=over 9
=item parallel >= version 20160822
Install the newest version using your package manager (recommended for
security reasons), the way described in README, or with this command:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
This will also install the newest version of the tutorial which you
can see by running this:
man parallel_tutorial
Most of the tutorial will work on older versions, too.
=item abc-file:
The file can be generated by this command:
parallel -k echo ::: A B C > abc-file
=item def-file:
The file can be generated by this command:
parallel -k echo ::: D E F > def-file
=item abc0-file:
The file can be generated by this command:
perl -e 'printf "A\0B\0C\0"' > abc0-file
=item abc_-file:
The file can be generated by this command:
perl -e 'printf "A_B_C_"' > abc_-file
=item tsv-file.tsv
The file can be generated by this command:
perl -e 'printf "f1\tf2\nA\tB\nC\tD\n"' > tsv-file.tsv
=item num8
The file can be generated by this command:
perl -e 'for(1..8){print "$_\n"}' > num8
=item num128
The file can be generated by this command:
perl -e 'for(1..128){print "$_\n"}' > num128
=item num30000
The file can be generated by this command:
perl -e 'for(1..30000){print "$_\n"}' > num30000
=item num1000000
The file can be generated by this command:
perl -e 'for(1..1000000){print "$_\n"}' > num1000000
=item num_%header
The file can be generated by this command:
(echo %head1; echo %head2; perl -e 'for(1..10){print "$_\n"}') > num_%header
=item For remote running: ssh login on 2 servers with no password in
$SERVER1 and $SERVER2 must work.
SERVER1=server.example.com
SERVER2=server2.example.net
So you must be able to do this:
ssh $SERVER1 echo works
ssh $SERVER2 echo works
It can be set up by running 'ssh-keygen -t dsa; ssh-copy-id $SERVER1'
and using an empty passphrase.
=back
=head1 Input sources
GNU B<parallel> reads input from input sources. These can be files, the
command line, and stdin (standard input or a pipe).
=head2 A single input source
Input can be read from the command line:
parallel echo ::: A B C
Output (the order may be different because the jobs are run in
parallel):
A
B
C
The input source can be a file:
parallel -a abc-file echo
Output: Same as above.
STDIN (standard input) can be the input source:
cat abc-file | parallel echo
Output: Same as above.
=head2 Multiple input sources
GNU B<parallel> can take multiple input sources given on the command
line. GNU B<parallel> then generates all combinations of the input
sources:
parallel echo ::: A B C ::: D E F
Output (the order may be different):
A D
A E
A F
B D
B E
B F
C D
C E
C F
The input sources can be files:
parallel -a abc-file -a def-file echo
Output: Same as above.
STDIN (standard input) can be one of the input sources using B<->:
cat abc-file | parallel -a - -a def-file echo
Output: Same as above.
Instead of B<-a> files can be given after B<::::>:
cat abc-file | parallel echo :::: - def-file
Output: Same as above.
::: and :::: can be mixed:
parallel echo ::: A B C :::: def-file
Output: Same as above.
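
To make "all combinations" concrete, here is a sequential plain-shell sketch of the argument pairs generated for two input sources A B C and D E F. It only illustrates which combinations are produced; nothing is run in parallel, and the function name all_combinations is made up for this example.

```shell
# Sequential sketch: every value of the first input source is paired
# with every value of the second one (9 pairs for 3 x 3 values).
# all_combinations is a hypothetical helper, not part of GNU parallel.
all_combinations() {
  for first in A B C; do
    for second in D E F; do
      echo "$first $second"
    done
  done
}

all_combinations
```

GNU B<parallel> produces the same set of pairs, but since the jobs run concurrently the lines may appear in a different order.
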
=head3 Linking arguments from input sources
With B<--link> you can link the input sources and get one argument
from each input source:
parallel --link echo ::: A B C ::: D E F
Output (the order may be different):
A D
B E
C F
If one of the input sources is too short, its values will wrap:
parallel --link echo ::: A B C D E ::: F G
Output (the order may be different):
A F
B G
C F
D G
E F
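
The wrapping can be pictured as indexing the shorter input source modulo its length. A sequential sketch in plain shell, using the hypothetical helper link_wrap (an illustration of the pairing only, not of how GNU B<parallel> implements it):

```shell
# Pair each value of the long source with value (i mod 2) of the short
# source, so the short source starts over when it runs out.
# link_wrap is a hypothetical helper, not part of GNU parallel.
link_wrap() {
  i=0
  for long in A B C D E; do
    set -- F G                  # the shorter input source
    shift $(( $i % $# ))        # skip to element number (i mod length)
    echo "$long $1"
    i=$(( $i + 1 ))
  done
}

link_wrap
```
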
For more flexible linking you can use B<:::+> and B<::::+>. They work
like B<:::> and B<::::> except they link the previous input source to
this input source.
This will link ABC to GHI:
parallel echo :::: abc-file :::+ G H I :::: def-file
Output (the order may be different):
A G D
A G E
A G F
B H D
B H E
B H F
C I D
C I E
C I F
This will link GHI to DEF:
parallel echo :::: abc-file ::: G H I ::::+ def-file
Output (the order may be different):
A G D
A H E
A I F
B G D
B H E
B I F
C G D
C H E
C I F
If one of the input sources is too short when using B<:::+> or
B<::::+>, the rest will be ignored:
parallel echo ::: A B C D E :::+ F G
Output (the order may be different):
A F
B G
=head2 Changing the argument separator
GNU B<parallel> can use other separators than B<:::> or B<::::>. This is
typically useful if B<:::> or B<::::> is used in the command to run:
parallel --arg-sep ,, echo ,, A B C :::: def-file
Output (the order may be different):
A D
A E
A F
B D
B E
B F
C D
C E
C F
Changing the argument file separator:
parallel --arg-file-sep // echo ::: A B C // def-file
Output: Same as above.
=head2 Changing the argument delimiter
GNU B<parallel> will normally treat a full line as a single argument: It
uses B<\n> as argument delimiter. This can be changed with B<-d>:
parallel -d _ echo :::: abc_-file
Output (the order may be different):
A
B
C
NUL can be given as B<\0>:
parallel -d '\0' echo :::: abc0-file
Output: Same as above.
A shorthand for B<-d '\0'> is B<-0> (this will often be used to read files
from B<find ... -print0>):
parallel -0 echo :::: abc0-file
Output: Same as above.
=head2 End-of-file value for input source
GNU B<parallel> can stop reading when it encounters a certain value:
parallel -E stop echo ::: A B stop C D
Output:
A
B
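
The effect of the end-of-file value can be sketched in plain shell as a loop that stops at the first argument equal to the marker. The helper name read_until is hypothetical and the loop is sequential; it only mirrors which arguments are used.

```shell
# Print arguments until the end-of-file marker is seen, then stop.
# read_until is a hypothetical helper for illustration.
read_until() {
  marker=$1
  shift
  for arg in "$@"; do
    if [ "$arg" = "$marker" ]; then
      break
    fi
    echo "$arg"
  done
}

read_until stop A B stop C D
```
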
=head2 Skipping empty lines
Using B<--no-run-if-empty> GNU B<parallel> will skip empty lines.
(echo 1; echo; echo 2) | parallel --no-run-if-empty echo
Output:
1
2
=head1 Building the command line
=head2 No command means arguments are commands
If no command is given after parallel the arguments themselves are
treated as commands:
parallel ::: ls 'echo foo' pwd
Output (the order may be different):
[list of files in current dir]
foo
[/path/to/current/working/dir]
The command can be a script, a binary or a Bash function if the function is
exported using B<export -f>:
# Only works in Bash
my_func() {
  echo in my_func $1
}
export -f my_func
parallel my_func ::: 1 2 3
Output (the order may be different):
in my_func 1
in my_func 2
in my_func 3
=head2 Replacement strings
=head3 The 7 predefined replacement strings
GNU B<parallel> has several replacement strings. If no replacement
strings are used the default is to append B<{}>:
parallel echo ::: A/B.C
Output:
A/B.C
The default replacement string is B<{}>:
parallel echo {} ::: A/B.C
Output:
A/B.C
The replacement string B<{.}> removes the extension:
parallel echo {.} ::: A/B.C
Output:
A/B
The replacement string B<{/}> removes the path:
parallel echo {/} ::: A/B.C
Output:
B.C
The replacement string B<{//}> keeps only the path:
parallel echo {//} ::: A/B.C
Output:
A
The replacement string B<{/.}> removes the path and the extension:
parallel echo {/.} ::: A/B.C
Output:
B
The replacement string B<{#}> gives the job number:
parallel echo {#} ::: A B C
Output (the order may be different):
1
2
3
The replacement string B<{%}> gives the job slot number (between 1 and
number of jobs to run in parallel):
parallel -j 2 echo {%} ::: A B C
Output (the order may be different and 1 and 2 may be swapped):
1
2
1
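
For readers more used to shell parameter expansion, the path-related replacement strings roughly correspond to the expansions sketched below. This is an approximation that assumes a simple path such as A/B.C; corner cases (for example a dot in a directory name) may differ from GNU B<parallel>'s own rules. The helper name path_parts is hypothetical.

```shell
# Rough plain-shell counterparts of the path-related replacement strings.
# Assumes a simple path such as A/B.C (one directory part, one extension).
# path_parts is a hypothetical helper, not part of GNU parallel.
path_parts() {
  arg=$1
  base=${arg##*/}                 # strip everything up to the last /
  echo "{} $arg"
  echo "{.} ${arg%.*}"            # drop the last .extension
  echo "{/} $base"
  echo "{//} $(dirname "$arg")"   # keep only the directory part
  echo "{/.} ${base%.*}"          # basename without the extension
}

path_parts A/B.C
```
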
=head3 Changing the replacement strings
The replacement string B<{}> can be changed with B<-I>:
parallel -I ,, echo ,, ::: A/B.C
Output:
A/B.C
The replacement string B<{.}> can be changed with B<--extensionreplace>:
parallel --extensionreplace ,, echo ,, ::: A/B.C
Output:
A/B
The replacement string B<{/}> can be changed with B<--basenamereplace>:
parallel --basenamereplace ,, echo ,, ::: A/B.C
Output:
B.C
The replacement string B<{//}> can be changed with B<--dirnamereplace>:
parallel --dirnamereplace ,, echo ,, ::: A/B.C
Output:
A
The replacement string B<{/.}> can be changed with B<--basenameextensionreplace>:
parallel --basenameextensionreplace ,, echo ,, ::: A/B.C
Output:
B
The replacement string B<{#}> can be changed with B<--seqreplace>:
parallel --seqreplace ,, echo ,, ::: A B C
Output (the order may be different):
1
2
3
The replacement string B<{%}> can be changed with B<--slotreplace>:
parallel -j2 --slotreplace ,, echo ,, ::: A B C
Output (the order may be different and 1 and 2 may be swapped):
1
2
1
=head3 Perl expression replacement string
When predefined replacement strings are not flexible enough a perl
expression can be used instead. One example is to remove two
extensions: foo.tar.gz becomes foo
parallel echo '{= s:\.[^.]+$::;s:\.[^.]+$::; =}' ::: foo.tar.gz
Output:
foo
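
The two substitutions can be tried on their own, outside GNU B<parallel>. The sketch below rewrites them for B<sed> so it runs in any POSIX shell; the helper name strip_two_ext is hypothetical, and the B<sed> pattern is a translation of the perl expression above rather than the expression itself.

```shell
# Remove the last two extensions, one substitution per extension.
# [^.][^.]* is the basic-regex spelling of perl's [^.]+
# strip_two_ext is a hypothetical helper for illustration.
strip_two_ext() {
  printf '%s\n' "$1" | sed -e 's/\.[^.][^.]*$//' -e 's/\.[^.][^.]*$//'
}

strip_two_ext foo.tar.gz
```

If the name has only one extension, the second substitution simply does not match and the name is left as it is after the first.
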
In B<{= =}> you can access all of GNU B<parallel>'s internal functions
and variables. A few are worth mentioning.
B<total_jobs()> returns the total number of jobs:
parallel echo Job {#} of {= '$_=total_jobs()' =} ::: {1..5}
Output:
Job 1 of 5
Job 2 of 5
Job 3 of 5
Job 4 of 5
Job 5 of 5
B<Q(...)> shell quotes the string:
parallel echo {} shell quoted is {= '$_=Q($_)' =} ::: '*/!#$'
Output:
*/!#$ shell quoted is \*/\!\#\$
B<skip()> skips the job:
parallel echo {= 'if($_==3) { skip() }' =} ::: {1..5}
Output:
1
2
4
5
B<@arg> contains the input source variables:
parallel echo {= 'if($arg[1]==$arg[2]) { skip() }' =} ::: {1..3} ::: {1..3}
Output:
1 2
1 3
2 1
2 3
3 1
3 2
If the strings B<{=> and B<=}> cause problems they can be replaced with B<--parens>:
parallel --parens ,,,, echo ',, s:\.[^.]+$::;s:\.[^.]+$::; ,,' ::: foo.tar.gz
Output:
foo
To define a shorthand replacement string use B<--rpl>:
parallel --rpl '.. s:\.[^.]+$::;s:\.[^.]+$::;' echo '..' ::: foo.tar.gz
Output: Same as above.
If the shorthand starts with B<{> it can be used as a positional
replacement string, too:
parallel --rpl '{..} s:\.[^.]+$::;s:\.[^.]+$::;' echo '{..}' ::: foo.tar.gz
Output: Same as above.
GNU B<parallel>'s 7 replacement strings are implemented like this:
--rpl '{} '
--rpl '{#} $_=$job->seq()'
--rpl '{%} $_=$job->slot()'
--rpl '{/} s:.*/::'
--rpl '{//} $Global::use{"File::Basename"} ||= eval "use File::Basename; 1;"; $_ = dirname($_);'
--rpl '{/.} s:.*/::; s:\.[^/.]+$::;'
--rpl '{.} s:\.[^/.]+$::'
=head3 Positional replacement strings
With multiple input sources the argument from the individual input
sources can be accessed with B<{>numberB<}>:
parallel echo {1} and {2} ::: A B ::: C D
Output (the order may be different):
A and C
A and D
B and C
B and D
The positional replacement strings can also be modified using B</>, B<//>, B</.>, and B<.>:
parallel echo /={1/} //={1//} /.={1/.} .={1.} ::: A/B.C D/E.F
Output (the order may be different):
/=B.C //=A /.=B .=A/B
/=E.F //=D /.=E .=D/E
If a position is negative, it will refer to the input source counted
from the end:
parallel echo 1={1} 2={2} 3={3} -1={-1} -2={-2} -3={-3} ::: A B ::: C D ::: E F
Output (the order may be different):
1=A 2=C 3=E -1=E -2=C -3=A
1=A 2=C 3=F -1=F -2=C -3=A
1=A 2=D 3=E -1=E -2=D -3=A
1=A 2=D 3=F -1=F -2=D -3=A
1=B 2=C 3=E -1=E -2=C -3=B
1=B 2=C 3=F -1=F -2=C -3=B
1=B 2=D 3=E -1=E -2=D -3=B
1=B 2=D 3=F -1=F -2=D -3=B
=head3 Positional perl expression replacement string
To use a perl expression as a positional replacement string simply
prepend the perl expression with number and space:
parallel echo '{=2 s:\.[^.]+$::;s:\.[^.]+$::; =} {1}' ::: bar ::: foo.tar.gz
Output:
foo bar
If a shorthand defined using B<--rpl> starts with B<{> it can be used as
a positional replacement string, too:
parallel --rpl '{..} s:\.[^.]+$::;s:\.[^.]+$::;' echo '{2..} {1}' ::: bar ::: foo.tar.gz
Output: Same as above.
=head3 Input from columns
The columns in a file can be bound to positional replacement strings
using B<--colsep>. Here the columns are separated by TAB (\t):
parallel --colsep '\t' echo 1={1} 2={2} :::: tsv-file.tsv
Output (the order may be different):
1=f1 2=f2
1=A 2=B
1=C 2=D
=head3 Header defined replacement strings
With B<--header> GNU B<parallel> will use the first value of the input
source as the name of the replacement string. Only the non-modified
version B<{}> is supported:
parallel --header : echo f1={f1} f2={f2} ::: f1 A B ::: f2 C D
Output (the order may be different):
f1=A f2=C
f1=A f2=D
f1=B f2=C
f1=B f2=D
It is useful with B<--colsep> for processing files with TAB separated values:
parallel --header : --colsep '\t' echo f1={f1} f2={f2} :::: tsv-file.tsv
Output (the order may be different):
f1=A f2=B
f1=C f2=D
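
What B<--header :> together with B<--colsep '\t'> does to a TAB separated file can be sketched in plain shell: the first line supplies the names, and each later line supplies the values. The sketch is sequential and limited to two columns; the helper name tsv_with_header is hypothetical.

```shell
# Read a two-column TAB separated table from stdin: the first line holds
# the column names, every following line holds one row of values.
# tsv_with_header is a hypothetical helper, not part of GNU parallel.
tsv_with_header() {
  tab=$(printf '\t')
  IFS="$tab" read -r name1 name2
  while IFS="$tab" read -r col1 col2; do
    echo "$name1=$col1 $name2=$col2"
  done
}

printf 'f1\tf2\nA\tB\nC\tD\n' | tsv_with_header
```
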
=head3 More pre-defined replacement strings
B<--plus> adds the replacement strings B<{+/} {+.} {+..} {+...} {..} {...}
{/..} {/...} {##}>. The idea being that B<{+foo}> matches the opposite of B<{foo}>
and B<{}> = B<{+/}>/B<{/}> = B<{.}>.B<{+.}> = B<{+/}>/B<{/.}>.B<{+.}> = B<{..}>.B<{+..}> =
B<{+/}>/B<{/..}>.B<{+..}> = B<{...}>.B<{+...}> = B<{+/}>/B<{/...}>.B<{+...}>.
parallel --plus echo {} ::: dir/sub/file.ext1.ext2.ext3
parallel --plus echo {+/}/{/} ::: dir/sub/file.ext1.ext2.ext3
parallel --plus echo {.}.{+.} ::: dir/sub/file.ext1.ext2.ext3
parallel --plus echo {+/}/{/.}.{+.} ::: dir/sub/file.ext1.ext2.ext3
parallel --plus echo {..}.{+..} ::: dir/sub/file.ext1.ext2.ext3
parallel --plus echo {+/}/{/..}.{+..} ::: dir/sub/file.ext1.ext2.ext3
parallel --plus echo {...}.{+...} ::: dir/sub/file.ext1.ext2.ext3
parallel --plus echo {+/}/{/...}.{+...} ::: dir/sub/file.ext1.ext2.ext3
Output:
dir/sub/file.ext1.ext2.ext3
B<{##}> is simply the number of jobs:
parallel --plus echo Job {#} of {##} ::: {1..5}
Output:
Job 1 of 5
Job 2 of 5
Job 3 of 5
Job 4 of 5
Job 5 of 5
=head2 More than one argument
With B<--xargs> GNU B<parallel> will fit as many arguments as possible on a
single line:
cat num30000 | parallel --xargs echo | wc -l
Output (if you run this under Bash on GNU/Linux):
2
The 30000 arguments fitted on 2 lines.
The maximal length of a single line can be set with B<-s>. With a maximal
line length of 10000 chars 17 commands will be run:
cat num30000 | parallel --xargs -s 10000 echo | wc -l
Output:
17
For better parallelism GNU B<parallel> can distribute the arguments
between all the parallel jobs when end of file is reached.
Below GNU B<parallel> reads the last argument when generating the second
job. When GNU B<parallel> reads the last argument, it spreads all the
arguments for the second job over 4 jobs instead, as 4 parallel jobs
are requested.
The first job will be the same as the B<--xargs> example above, but the
second job will be split into 4 evenly sized jobs, resulting in a
total of 5 jobs:
cat num30000 | parallel --jobs 4 -m echo | wc -l
Output (if you run this under Bash on GNU/Linux):
5
This is even more visible when running 4 jobs with 10 arguments. The
10 arguments are being spread over 4 jobs:
parallel --jobs 4 -m echo ::: 1 2 3 4 5 6 7 8 9 10
Output:
1 2 3
4 5 6
7 8 9
10
A replacement string can be part of a word. B<-m> will not repeat the context:
parallel --jobs 4 -m echo pre-{}-post ::: A B C D E F G
Output (the order may be different):
pre-A B-post
pre-C D-post
pre-E F-post
pre-G-post
To repeat the context use B<-X> which otherwise works like B<-m>:
parallel --jobs 4 -X echo pre-{}-post ::: A B C D E F G
Output (the order may be different):
pre-A-post pre-B-post
pre-C-post pre-D-post
pre-E-post pre-F-post
pre-G-post
To limit the number of arguments use B<-N>:
parallel -N3 echo ::: A B C D E F G H
Output (the order may be different):
A B C
D E F
G H
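
Readers who know B<xargs> may recognise this grouping: B<xargs -n> also limits the number of arguments per command line. Unlike GNU B<parallel> it runs the commands one at a time, so its output order is fixed. A sketch for comparison (group_by_three is a hypothetical helper name):

```shell
# Group eight arguments three at a time with xargs (run sequentially).
# group_by_three is a hypothetical helper for illustration.
group_by_three() {
  printf '%s\n' A B C D E F G H | xargs -n 3 echo
}

group_by_three
```
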
B<-N> also sets the positional replacement strings:
parallel -N3 echo 1={1} 2={2} 3={3} ::: A B C D E F G H
Output (the order may be different):
1=A 2=B 3=C
1=D 2=E 3=F
1=G 2=H 3=
B<-N0> reads 1 argument but inserts none:
parallel -N0 echo foo ::: 1 2 3
Output:
foo
foo
foo
=head2 Quoting
Command lines that contain special characters may need to be protected from the shell.
The B<perl> program B<print "@ARGV\n"> basically works like B<echo>.
perl -e 'print "@ARGV\n"' A
Output:
A
To run that in parallel the command needs to be quoted:
parallel perl -e 'print "@ARGV\n"' ::: This wont work
Output:
[Nothing]
To quote the command use B<-q>:
parallel -q perl -e 'print "@ARGV\n"' ::: This works
Output (the order may be different):
This
works
Or you can quote the critical part using B<\'>:
parallel perl -e \''print "@ARGV\n"'\' ::: This works, too
Output (the order may be different):
This
works,
too
GNU B<parallel> can also \-quote full lines. Simply run this:
parallel --shellquote
parallel: Warning: Input is read from the terminal. You either know what you
parallel: Warning: are doing (in which case: YOU ARE AWESOME!) or you forgot
parallel: Warning: ::: or :::: or to pipe data into parallel. If so
parallel: Warning: consider going through the tutorial: man parallel_tutorial
parallel: Warning: Press CTRL-D to exit.
perl -e 'print "@ARGV\n"'
[CTRL-D]
Output:
perl\ -e\ \'print\ \"@ARGV\\n\"\'
This can then be used as the command:
parallel perl\ -e\ \'print\ \"@ARGV\\n\"\' ::: This also works
Output (the order may be different):
This
also
works
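
The reason quoting is needed is that the command line is parsed twice: once by the shell you type it in, and once more by the shell that runs the job. A rough analogy in plain shell, where the hypothetical helper run_in_shell joins its arguments and hands the result to a new shell (it mimics the double parsing only; it is not GNU B<parallel>'s actual code path):

```shell
# Join all arguments with spaces and let a second shell parse the result.
# run_in_shell is a hypothetical helper for illustration.
run_in_shell() {
  sh -c "$*"
}

run_in_shell echo 'a   b'      # quotes are gone after the first parse: a b
run_in_shell echo "'a   b'"    # inner quotes reach the second shell: a   b
```

In the first call the second shell sees three separate words, so the run of spaces collapses. In the second call the inner single quotes survive the first parse and protect the string during the second one, which is the same idea as quoting the critical part with B<\'> above.
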
=head2 Trimming space
Space can be trimmed on the arguments using B<--trim>:
parallel --trim r echo pre-{}-post ::: ' A '
Output:
pre- A-post
To trim on the left side:
parallel --trim l echo pre-{}-post ::: ' A '
Output:
pre-A -post
To trim on both sides:
parallel --trim lr echo pre-{}-post ::: ' A '
Output:
pre-A-post
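
The three trimming modes can be imitated in plain shell with B<sed>, which may help when checking what a given mode is expected to produce. The helper name trim_arg is hypothetical; its first argument mirrors the B<l>, B<r> and B<lr> values used above.

```shell
# Whitespace trimming on the left, right, or both sides with sed.
# trim_arg is a hypothetical helper; $1 is the mode, $2 the argument.
trim_arg() {
  case $1 in
    l)  printf '%s\n' "$2" | sed 's/^[[:space:]]*//' ;;
    r)  printf '%s\n' "$2" | sed 's/[[:space:]]*$//' ;;
    lr) printf '%s\n' "$2" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' ;;
  esac
}

echo "pre-$(trim_arg lr ' A ')-post"
```
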
=head2 Respecting the shell
This tutorial uses Bash as the shell. GNU B<parallel> respects which
shell you are using, so in B<zsh> you can do:
parallel echo \={} ::: zsh bash ls
Output:
/usr/bin/zsh
/bin/bash
/bin/ls
In B<csh> you can do:
parallel 'set a="{}"; if( { test -d "$a" } ) echo "$a is a dir"' ::: *
Output:
[somedir] is a dir
This also becomes useful if you use GNU B<parallel> in a shell script:
GNU B<parallel> will use the same shell as the shell script.
=head1 Controlling the output
The output can be prefixed with the argument:
parallel --tag echo foo-{} ::: A B C
Output (the order may be different):
A foo-A
B foo-B
C foo-C
To prefix it with another string use B<--tagstring>:
parallel --tagstring {}-bar echo foo-{} ::: A B C
Output (the order may be different):
A-bar foo-A
B-bar foo-B
C-bar foo-C
To see what commands will be run without running them use B<--dryrun>:
parallel --dryrun echo {} ::: A B C
Output (the order may be different):
echo A
echo B
echo C
To print the commands before running them use B<--verbose>:
parallel --verbose echo {} ::: A B C
Output (the order may be different):
echo A
echo B
A
echo C
B
C
GNU B<parallel> will postpone the output until the command completes:
parallel -j2 'printf "%s-start\n%s" {} {};sleep {};printf "%s\n" -middle;echo {}-end' ::: 4 2 1
Output:
2-start
2-middle
2-end
1-start
1-middle
1-end
4-start
4-middle
4-end
To get the output immediately use B<--ungroup>:
parallel -j2 --ungroup 'printf "%s-start\n%s" {} {};sleep {};printf "%s\n" -middle;echo {}-end' ::: 4 2 1
Output:
4-start
42-start
2-middle
2-end
1-start
1-middle
1-end
-middle
4-end
B<--ungroup> is fast, but can cause half a line from one job to be mixed
with half a line of another job. That has happened in the second line,
where the line '4-middle' is mixed with '2-start'.
To avoid this use B<--linebuffer>:
parallel -j2 --linebuffer 'printf "%s-start\n%s" {} {};sleep {};printf "%s\n" -middle;echo {}-end' ::: 4 2 1
Output:
4-start
2-start
2-middle
2-end
1-start
1-middle
1-end
4-middle
4-end
To force the output in the same order as the arguments use B<--keep-order>/B<-k>:
parallel -j2 -k 'printf "%s-start\n%s" {} {};sleep {};printf "%s\n" -middle;echo {}-end' ::: 4 2 1
Output:
4-start
4-middle
4-end
2-start
2-middle
2-end
1-start
1-middle
1-end
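
Conceptually, grouped output in input order behaves as if each job wrote to its own buffer and the buffers were printed in the order of the arguments. A small plain-shell sketch of that idea using background jobs and temporary files (the function name keep_order_demo and the file layout are assumptions made for this illustration, not GNU B<parallel> internals):

```shell
# Run two jobs at the same time, buffer each job's output in its own
# file, then print the buffers in input order.
# keep_order_demo is a hypothetical helper for illustration.
keep_order_demo() {
  tmpdir=$(mktemp -d) || return 1
  n=0
  for secs in 1 0; do
    n=$(( $n + 1 ))
    # Each job writes to its own file, so lines from two jobs never mix.
    ( sleep "$secs"; echo "job-$secs-end" ) > "$tmpdir/$n" &
  done
  wait                            # the 0-second job finishes first ...
  cat "$tmpdir/1" "$tmpdir/2"     # ... but output follows the input order
  rm -rf "$tmpdir"
}

keep_order_demo
```

The price of this ordering is latency: the output of a fast job is held back until all jobs before it have finished.
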
=head2 Saving output into files
GNU B<parallel> can save the output of each job into files:
parallel --files echo ::: A B C
Output will be similar to this:
/tmp/pAh6uWuQCg.par
/tmp/opjhZCzAX4.par
/tmp/W0AT_Rph2o.par
By default GNU B<parallel> will cache the output in files in B</tmp>. This
can be changed by setting B<$TMPDIR> or B<--tmpdir>:
parallel --tmpdir /var/tmp --files echo ::: A B C
Output will be similar to this:
/var/tmp/N_vk7phQRc.par
/var/tmp/7zA4Ccf3wZ.par
/var/tmp/LIuKgF_2LP.par
Or:
TMPDIR=/var/tmp parallel --files echo ::: A B C
Output: Same as above.
The output files can be saved in a structured way using B<--results>:
parallel --results outdir echo ::: A B C
Output:
A
B
C
These files were also generated containing the standard output
(stdout), standard error (stderr), and the sequence number (seq):
outdir/1/A/seq
outdir/1/A/stderr
outdir/1/A/stdout
outdir/1/B/seq
outdir/1/B/stderr
outdir/1/B/stdout
outdir/1/C/seq
outdir/1/C/stderr
outdir/1/C/stdout
B<--header :> will take the first value as name and use that in the
directory structure. This is useful if you are using multiple input
sources:
parallel --header : --results outdir echo ::: f1 A B ::: f2 C D
Generated files:
outdir/f1/A/f2/C/seq
outdir/f1/A/f2/C/stderr
outdir/f1/A/f2/C/stdout
outdir/f1/A/f2/D/seq
outdir/f1/A/f2/D/stderr
outdir/f1/A/f2/D/stdout
outdir/f1/B/f2/C/seq
outdir/f1/B/f2/C/stderr
outdir/f1/B/f2/C/stdout
outdir/f1/B/f2/D/seq
outdir/f1/B/f2/D/stderr
outdir/f1/B/f2/D/stdout
The directories are named after the variables and their values.
=head1 Controlling the execution
=head2 Number of simultaneous jobs
The number of concurrent jobs is given with B<--jobs>/B<-j>:
/usr/bin/time parallel -N0 -j64 sleep 1 :::: num128
With 64 jobs in parallel the 128 B<sleep>s will take 2-8 seconds to run -
depending on how fast your machine is.
By default B<--jobs> is the same as the number of CPU cores. So this:
/usr/bin/time parallel -N0 sleep 1 :::: num128
should take twice the time of running 2 jobs per CPU core:
/usr/bin/time parallel -N0 --jobs 200% sleep 1 :::: num128
B<--jobs 0> will run as many jobs in parallel as possible:
/usr/bin/time parallel -N0 --jobs 0 sleep 1 :::: num128
which should take 1-7 seconds depending on how fast your machine is.
B<--jobs> can read from a file which is re-read when a job finishes:
echo 50% > my_jobs
/usr/bin/time parallel -N0 --jobs my_jobs sleep 1 :::: num128 &
sleep 1
echo 0 > my_jobs
wait
The first second only 50% of the CPU cores will run a job. Then B<0> is
put into B<my_jobs> and then the rest of the jobs will be started in
parallel.
Instead of basing the percentage on the number of CPU cores
GNU B<parallel> can base it on the number of CPUs:
parallel --use-cpus-instead-of-cores -N0 sleep 1 :::: num8
=head2 Shuffle job order
If you have many jobs (e.g. by multiple combinations of input
sources), it can be handy to shuffle the jobs, so you get different
values run. Use B<--shuf> for that:
parallel --shuf echo ::: 1 2 3 ::: a b c ::: A B C
Output:
All combinations but different order for each run.
=head2 Interactivity
GNU B<parallel> can ask the user if a command should be run using B<--interactive>:
parallel --interactive echo ::: 1 2 3
Output:
echo 1 ?...y
echo 2 ?...n
1
echo 3 ?...y
3
GNU B<parallel> can be used to put arguments on the command line for an
interactive command such as B<emacs> to edit one file at a time:
parallel --tty emacs ::: 1 2 3
Or give multiple arguments in one go to open multiple files:
parallel -X --tty vi ::: 1 2 3
=head2 A terminal for every job
Using B<--tmux> GNU B<parallel> can start a terminal for every job run:
seq 10 20 | parallel --tmux 'echo start {}; sleep {}; echo done {}'
This will tell you to run something similar to:
tmux -S /tmp/tmsrPrO0 attach
Using normal B<tmux> keystrokes (CTRL-b n or CTRL-b p) you can cycle
between windows of the running jobs. When a job is finished it will
pause for 10 seconds before closing the window.
=head2 Timing
Some jobs do heavy I/O when they start. To avoid a thundering herd GNU
B<parallel> can delay starting new jobs. B<--delay> I<X> will make
sure there is at least I<X> seconds between each start:
parallel --delay 2.5 echo Starting {}\;date ::: 1 2 3
Output:
Starting 1
Thu Aug 15 16:24:33 CEST 2013
Starting 2
Thu Aug 15 16:24:35 CEST 2013
Starting 3
Thu Aug 15 16:24:38 CEST 2013
If jobs taking more than a certain amount of time are known to fail,
they can be stopped with B<--timeout>. The accuracy of B<--timeout> is
2 seconds:
parallel --timeout 4.1 sleep {}\; echo {} ::: 2 4 6 8
Output:
2
4
GNU B<parallel> can compute the median runtime for jobs and kill those
that take more than 200% of the median runtime:
parallel --timeout 200% sleep {}\; echo {} ::: 2.1 2.2 3 7 2.3
Output:
2.1
2.2
3
2.3
=head2 Progress information
Based on the runtime of completed jobs GNU B<parallel> can estimate the
total runtime:
parallel --eta sleep ::: 1 3 2 2 1 3 3 2 1
Output:
Computers / CPU cores / Max jobs to run
1:local / 2 / 2
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ETA: 2s 0left 1.11avg local:0/9/100%/1.1s
GNU B<parallel> can give progress information with B<--progress>:
parallel --progress sleep ::: 1 3 2 2 1 3 3 2 1
Output:
Computers / CPU cores / Max jobs to run
1:local / 2 / 2
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/9/100%/1.1s
A progress bar can be shown with B<--bar>:
parallel --bar sleep ::: 1 3 2 2 1 3 3 2 1
And a graphic bar can be shown with B<--bar> and B<zenity>:
seq 1000 | parallel -j10 --bar '(echo -n {};sleep 0.1)' 2> >(zenity --progress --auto-kill --auto-close)
A logfile of the jobs completed so far can be generated with B<--joblog>:
parallel --joblog /tmp/log exit ::: 1 2 3 0
cat /tmp/log
Output:
Seq Host Starttime Runtime Send Receive Exitval Signal Command
1 : 1376577364.974 0.008 0 0 1 0 exit 1
2 : 1376577364.982 0.013 0 0 2 0 exit 2
3 : 1376577364.990 0.013 0 0 3 0 exit 3
4 : 1376577365.003 0.003 0 0 0 0 exit 0
The log contains the job sequence, which host the job was run on, the
start time and run time, how much data was transferred, the exit
value, the signal that killed the job, and finally the command being
run.
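Since the joblog is a plain text file, it can be inspected with standard tools. The sketch below assumes the columns are separated by tabs. It writes the four rows from the output above to a temporary file and lists the jobs whose exit value is not 0; the variable names are only for illustration.

```shell
# Sample joblog with the columns shown above, separated by tabs (assumed layout).
joblog=$(mktemp)
printf 'Seq\tHost\tStarttime\tRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n' > "$joblog"
printf '1\t:\t1376577364.974\t0.008\t0\t0\t1\t0\texit 1\n' >> "$joblog"
printf '2\t:\t1376577364.982\t0.013\t0\t0\t2\t0\texit 2\n' >> "$joblog"
printf '3\t:\t1376577364.990\t0.013\t0\t0\t3\t0\texit 3\n' >> "$joblog"
printf '4\t:\t1376577365.003\t0.003\t0\t0\t0\t0\texit 0\n' >> "$joblog"

# Skip the header line; column 1 is Seq, column 7 is Exitval, column 9 is Command.
failed=$(awk -F '\t' 'NR > 1 && $7 != 0 { print $1 ": " $9 }' "$joblog")
echo "$failed"
rm -f "$joblog"
```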
With a joblog GNU B<parallel> can be stopped and later pick up where it
left off. It is important that the input of the completed jobs is
unchanged.
parallel --joblog /tmp/log exit ::: 1 2 3 0
cat /tmp/log
parallel --resume --joblog /tmp/log exit ::: 1 2 3 0 0 0
cat /tmp/log
Output:
Seq Host Starttime Runtime Send Receive Exitval Signal Command
1 : 1376580069.544 0.008 0 0 1 0 exit 1
2 : 1376580069.552 0.009 0 0 2 0 exit 2
3 : 1376580069.560 0.012 0 0 3 0 exit 3
4 : 1376580069.571 0.005 0 0 0 0 exit 0
Seq Host Starttime Runtime Send Receive Exitval Signal Command
1 : 1376580069.544 0.008 0 0 1 0 exit 1
2 : 1376580069.552 0.009 0 0 2 0 exit 2
3 : 1376580069.560 0.012 0 0 3 0 exit 3
4 : 1376580069.571 0.005 0 0 0 0 exit 0
5 : 1376580070.028 0.009 0 0 0 0 exit 0
6 : 1376580070.038 0.007 0 0 0 0 exit 0
Note how the start time of the last 2 jobs shows that they are clearly from the second run.
With B<--resume-failed> GNU B<parallel> will re-run the jobs that failed:
parallel --resume-failed --joblog /tmp/log exit ::: 1 2 3 0 0 0
cat /tmp/log
Output:
Seq Host Starttime Runtime Send Receive Exitval Signal Command
1 : 1376580069.544 0.008 0 0 1 0 exit 1
2 : 1376580069.552 0.009 0 0 2 0 exit 2
3 : 1376580069.560 0.012 0 0 3 0 exit 3
4 : 1376580069.571 0.005 0 0 0 0 exit 0
5 : 1376580070.028 0.009 0 0 0 0 exit 0
6 : 1376580070.038 0.007 0 0 0 0 exit 0
1 : 1376580154.433 0.010 0 0 1 0 exit 1
2 : 1376580154.444 0.022 0 0 2 0 exit 2
3 : 1376580154.466 0.005 0 0 3 0 exit 3
Note how the jobs with seq 1, 2, and 3 have been repeated because they had
an exit value different from 0.
B<--retry-failed> does almost the same as B<--resume-failed>. Where
B<--resume-failed> reads the commands from the command line (and
ignores the commands in the joblog), B<--retry-failed> ignores the
command line and reruns the commands mentioned in the joblog.
parallel --retry-failed --joblog /tmp/log
cat /tmp/log
Output:
Seq Host Starttime Runtime Send Receive Exitval Signal Command
1 : 1376580069.544 0.008 0 0 1 0 exit 1
2 : 1376580069.552 0.009 0 0 2 0 exit 2
3 : 1376580069.560 0.012 0 0 3 0 exit 3
4 : 1376580069.571 0.005 0 0 0 0 exit 0
5 : 1376580070.028 0.009 0 0 0 0 exit 0
6 : 1376580070.038 0.007 0 0 0 0 exit 0
1 : 1376580154.433 0.010 0 0 1 0 exit 1
2 : 1376580154.444 0.022 0 0 2 0 exit 2
3 : 1376580154.466 0.005 0 0 3 0 exit 3
1 : 1376580164.633 0.010 0 0 1 0 exit 1
2 : 1376580164.644 0.022 0 0 2 0 exit 2
3 : 1376580164.666 0.005 0 0 3 0 exit 3
=head2 Termination
For certain jobs there is no need to continue if one of the jobs fails
and has an exit code different from 0. GNU B<parallel> will stop spawning new jobs
with B<--halt soon,fail=1>:
parallel -j2 --halt soon,fail=1 echo {}\; exit {} ::: 0 0 1 2 3
Output:
0
0
1
parallel: Starting no more jobs. Waiting for 2 jobs to finish. This job failed:
echo 1; exit 1
2
parallel: Starting no more jobs. Waiting for 1 jobs to finish. This job failed:
echo 2; exit 2
With B<--halt now,fail=1> the running jobs will be killed immediately:
parallel -j2 --halt now,fail=1 echo {}\; exit {} ::: 0 0 1 2 3
Output:
0
0
1
parallel: This job failed:
echo 1; exit 1
If B<--halt> is given a percentage this percentage of the jobs must fail
before GNU B<parallel> stops spawning more jobs:
parallel -j2 --halt soon,fail=20% echo {}\; exit {} ::: 0 1 2 3 4 5 6 7 8 9
Output:
0
1
parallel: This job failed:
echo 1; exit 1
2
parallel: This job failed:
echo 2; exit 2
parallel: Starting no more jobs. Waiting for 1 jobs to finish.
3
parallel: This job failed:
echo 3; exit 3
If you are looking for success instead of failures, you can use
B<success>. This will finish as soon as the first job succeeds:
parallel -j2 --halt now,success=1 echo {}\; exit {} ::: 1 2 3 0 4 5 6
Output:
1
2
3
0
parallel: This job succeeded:
echo 0; exit 0
GNU B<parallel> can retry the command with B<--retries>. This is useful if a
command fails for unknown reasons now and then.
parallel -k --retries 3 'echo tried {} >>/tmp/runs; echo completed {}; exit {}' ::: 1 2 0
cat /tmp/runs
Output:
completed 1
completed 2
completed 0
tried 1
tried 2
tried 1
tried 2
tried 1
tried 2
tried 0
Note how jobs 1 and 2 were tried 3 times, but 0 was not retried because it had exit code 0.
=head3 Termination signals (advanced)
Using B<--termseq> you can control which signals are sent when killing
children. Normally children will be killed by sending them B<SIGTERM>,
waiting 200 ms, then another B<SIGTERM>, waiting 100 ms, then another
B<SIGTERM>, waiting 50 ms, then a B<SIGKILL>, finally waiting 25 ms
before giving up. It looks like this:
show_signals() {
perl -e 'for(keys %SIG) { $SIG{$_} = eval "sub { print \"Got $_\\n\"; }";} while(1){sleep 1}'
}
export -f show_signals
echo | parallel --termseq TERM,200,TERM,100,TERM,50,KILL,25 -u --timeout 1 show_signals
Output:
Got TERM
Got TERM
Got TERM
Or just:
echo | parallel -u --timeout 1 show_signals
Output: Same as above.
You can change this to B<SIGINT>, B<SIGTERM>, B<SIGKILL>:
echo | parallel --termseq INT,200,TERM,100,KILL,25 -u --timeout 1 show_signals
Output:
Got INT
Got TERM
The B<SIGKILL> does not show because it cannot be caught, and thus the child dies.
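To picture what a B<--termseq> list means, the following sketch walks such a list by hand: send a signal, wait the given number of milliseconds, move on to the next pair. The helper B<send_termseq> is hypothetical and is not how GNU B<parallel> is implemented; it also assumes a B<sleep> that accepts fractions of a second.

```shell
# Hypothetical helper, not part of GNU parallel: walk a --termseq style list.
# Usage: send_termseq PID SIGNAL,MILLISECONDS,SIGNAL,MILLISECONDS,...
send_termseq() {
  pid=$1
  # Split the comma-separated list into positional parameters.
  old_ifs=$IFS; IFS=,; set -- $2; IFS=$old_ifs
  while [ $# -ge 2 ]; do
    kill -s "$1" "$pid" 2>/dev/null || return 0   # process is already gone
    sleep "$(awk -v ms="$2" 'BEGIN { print ms / 1000 }')"
    shift 2
  done
}

# A child that ignores SIGTERM, so it takes the final SIGKILL to stop it.
sh -c 'trap "" TERM; while :; do sleep 1; done' &
child=$!
sleep 1                       # give the child time to install its trap
send_termseq "$child" TERM,200,TERM,100,KILL,25

status=0
wait "$child" || status=$?    # a status above 128 means a signal ended the child
echo "child exit status: $status"
```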
=head2 Limiting the resources
To avoid overloading systems GNU B<parallel> can look at the system load
before starting another job:
parallel --load 100% echo load is less than {} job per cpu ::: 1
Output:
[when the load is less than the number of cpu cores]
load is less than 1 job per cpu
GNU B<parallel> can also check if the system is swapping.
parallel --noswap echo the system is not swapping ::: now
Output:
[when the system is not swapping]
the system is not swapping now
Some jobs need a lot of memory, and should only be started when there
is enough memory free. Using B<--memfree> GNU B<parallel> can check if
there is enough memory free. Additionally, GNU B<parallel> will kill
off the youngest job if the free memory falls below 50% of the given
size. The killed job will be put back on the queue and retried later.
parallel --memfree 1G echo will run if more than 1 GB is ::: free
GNU B<parallel> can run the jobs with a nice value. This will work both
locally and remotely.
parallel --nice 17 echo this is being run with nice -n ::: 17
Output:
this is being run with nice -n 17
=head1 Remote execution
GNU B<parallel> can run jobs on remote servers. It uses B<ssh> to
communicate with the remote machines.
=head2 Sshlogin
The most basic sshlogin is B<-S> I<host>:
parallel -S $SERVER1 echo running on ::: $SERVER1
Output:
running on [$SERVER1]
To use a different username prepend the server with I<username@>:
parallel -S username@$SERVER1 echo running on ::: username@$SERVER1
Output:
running on [username@$SERVER1]
The special sshlogin B<:> is the local machine:
parallel -S : echo running on ::: the_local_machine
Output:
running on the_local_machine
If B<ssh> is not in $PATH it can be prepended to $SERVER1:
parallel -S '/usr/bin/ssh '$SERVER1 echo custom ::: ssh
Output:
custom ssh
The B<ssh> command can also be given using B<--ssh>:
parallel --ssh /usr/bin/ssh -S $SERVER1 echo custom ::: ssh
or by setting B<$PARALLEL_SSH>:
export PARALLEL_SSH=/usr/bin/ssh
parallel -S $SERVER1 echo custom ::: ssh
Several servers can be given using multiple B<-S>:
parallel -S $SERVER1 -S $SERVER2 echo ::: running on more hosts
Output (the order may be different):
running
on
more
hosts
Or they can be separated by B<,>:
parallel -S $SERVER1,$SERVER2 echo ::: running on more hosts
Output: Same as above.
Or newline:
# This gives a \n between $SERVER1 and $SERVER2
SERVERS="`echo $SERVER1; echo $SERVER2`"
parallel -S "$SERVERS" echo ::: running on more hosts
They can also be read from a file (replace I<user@> with the user on B<$SERVER2>):
echo $SERVER1 > nodefile
# Force 4 cores, special ssh-command, username
echo 4//usr/bin/ssh user@$SERVER2 >> nodefile
parallel --sshloginfile nodefile echo ::: running on more hosts
Output: Same as above.
Every time a job finishes, the B<--sshloginfile> will be re-read, so
it is possible to both add and remove hosts while running.
The special B<--sshloginfile ..> reads from B<~/.parallel/sshloginfile>.
To force GNU B<parallel> to treat a server as having a given number of CPU
cores, prepend the number of cores followed by B</> to the sshlogin:
parallel -S 4/$SERVER1 echo force {} cpus on server ::: 4
Output:
force 4 cpus on server
Servers can be put into groups by prepending I<@groupname> to the
server and the group can then be selected by appending I<@groupname> to
the argument if using B<--hostgroup>:
parallel --hostgroup -S @grp1/$SERVER1 -S @grp2/$SERVER2 echo {} ::: \
run_on_grp1@grp1 run_on_grp2@grp2
Output:
run_on_grp1
run_on_grp2
A host can be in multiple groups by separating the groups with B<+>, and
you can force GNU B<parallel> to limit the groups on which the command
can be run with B<-S> I<@groupname>:
parallel -S @grp1 -S @grp1+grp2/$SERVER1 -S @grp2/$SERVER2 echo {} ::: \
run_on_grp1 also_grp1
Output:
run_on_grp1
also_grp1
=head2 Transferring files
GNU B<parallel> can transfer the files to be processed to the remote
host. It does that using rsync.
echo This is input_file > input_file
parallel -S $SERVER1 --transferfile {} cat ::: input_file
Output:
This is input_file
If the files are processed into another file, the resulting file can be
transferred back:
echo This is input_file > input_file
parallel -S $SERVER1 --transferfile {} --return {}.out cat {} ">"{}.out ::: input_file
cat input_file.out
Output: Same as above.
To remove the input and output file on the remote server use B<--cleanup>:
echo This is input_file > input_file
parallel -S $SERVER1 --transferfile {} --return {}.out --cleanup cat {} ">"{}.out ::: input_file
cat input_file.out
Output: Same as above.
There is a shorthand for B<--transferfile {} --return --cleanup> called B<--trc>:
echo This is input_file > input_file
parallel -S $SERVER1 --trc {}.out cat {} ">"{}.out ::: input_file
cat input_file.out
Output: Same as above.
Some jobs need a common database for all jobs. GNU B<parallel> can
transfer that using B<--basefile> which will transfer the file before the
first job:
echo common data > common_file
parallel --basefile common_file -S $SERVER1 cat common_file\; echo {} ::: foo
Output:
common data
foo
To remove it from the remote host after the last job use B<--cleanup>.
=head2 Working dir
The default working dir on the remote machines is the login dir. This
can be changed with B<--workdir> I<mydir>.
Files transferred using B<--transferfile> and B<--return> will be relative
to I<mydir> on remote computers, and the command will be executed in
the dir I<mydir>.
The special I<mydir> value B<...> will create working dirs under
B<~/.parallel/tmp> on the remote computers. If B<--cleanup> is given
these dirs will be removed.
The special I<mydir> value B<.> uses the current working dir. If the
current working dir is beneath your home dir, the value B<.> is
treated as the relative path to your home dir. This means that if your
home dir is different on remote computers (e.g. if your login is
different) the relative path will still be relative to your home dir.
parallel -S $SERVER1 pwd ::: ""
parallel --workdir . -S $SERVER1 pwd ::: ""
parallel --workdir ... -S $SERVER1 pwd ::: ""
Output:
[the login dir on $SERVER1]
[current dir relative on $SERVER1]
[a dir in ~/.parallel/tmp/...]
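The path handling for B<--workdir .> can be pictured with a few lines of shell. The helper B<relative_to_home> and the paths are hypothetical; this is a sketch of the idea described above, not the code GNU B<parallel> runs.

```shell
# Hypothetical helper: show a directory relative to a home dir,
# or unchanged when it is not beneath that home dir.
relative_to_home() {
  dir=$1
  home=$2
  case $dir in
    "$home"/*) rel=${dir#"$home"/} ;;   # strip the home dir prefix
    *)         rel=$dir ;;              # outside the home dir: keep as is
  esac
  printf '%s\n' "$rel"
}

relative_to_home /home/alice/project/data /home/alice   # prints project/data
relative_to_home /tmp/scratch /home/alice               # prints /tmp/scratch
```

A relative result such as B<project/data> is then resolved against the home dir on the remote computer, whatever that home dir is.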
=head2 Avoid overloading sshd
If many jobs are started on the same server, B<sshd> can be
overloaded. GNU B<parallel> can insert a delay between each job run on
the same server:
parallel -S $SERVER1 --sshdelay 0.2 echo ::: 1 2 3
Output (the order may be different):
1
2
3
B<sshd> will be less overloaded if using B<--controlmaster>, which will
multiplex ssh connections:
parallel --controlmaster -S $SERVER1 echo ::: 1 2 3
Output: Same as above.
=head2 Ignore hosts that are down
In clusters with many hosts a few of them are often down. GNU B<parallel>
can ignore those hosts. In this case the host 173.194.32.46 is down:
parallel --filter-hosts -S 173.194.32.46,$SERVER1 echo ::: bar
Output:
bar
=head2 Running the same commands on all hosts
GNU B<parallel> can run the same command on all the hosts:
parallel --onall -S $SERVER1,$SERVER2 echo ::: foo bar
Output (the order may be different):
foo
bar
foo
bar
Often you will just want to run a single command on all hosts without
arguments. B<--nonall> is a no argument B<--onall>:
parallel --nonall -S $SERVER1,$SERVER2 echo foo bar
Output:
foo bar
foo bar
When B<--tag> is used with B<--nonall> and B<--onall> the B<--tagstring> is the host:
parallel --nonall --tag -S $SERVER1,$SERVER2 echo foo bar
Output (the order may be different):
$SERVER1 foo bar
$SERVER2 foo bar
B<--jobs> sets the number of servers to log in to in parallel.
=head2 Transferring environment variables and functions
B<env_parallel> is a shell function that transfers all aliases,
functions, variables, and arrays. You activate it by running:
source `which env_parallel.bash`
Replace B<bash> with the shell you use.
Now you can use B<env_parallel> instead of B<parallel> and still have
your environment:
alias myecho=echo
myvar="Joe's var is"
env_parallel -S $SERVER1 'myecho $myvar' ::: green
Output:
Joe's var is green
The disadvantage is that if your environment is huge B<env_parallel>
will fail.
When B<env_parallel> fails, you can still use B<--env> to tell GNU
B<parallel> to transfer an environment variable to the remote system.
MYVAR='foo bar'
export MYVAR
parallel --env MYVAR -S $SERVER1 echo '$MYVAR' ::: baz
Output:
foo bar baz
This works for functions, too, if your shell is Bash:
# This only works in Bash
my_func() {
echo in my_func $1
}
export -f my_func
parallel --env my_func -S $SERVER1 my_func ::: baz
Output:
in my_func baz
GNU B<parallel> can copy all defined variables and functions to the
remote system. It just needs to record which ones to ignore in
B<~/.parallel/ignored_vars>. Do that by running this once:
parallel --record-env
cat ~/.parallel/ignored_vars
Output:
[list of variables to ignore - including $PATH and $HOME]
Now all new variables and functions defined will be copied when using
B<--env _>:
# The function is only copied if using Bash
my_func2() {
echo in my_func2 $VAR $1
}
export -f my_func2
VAR=foo
export VAR
parallel --env _ -S $SERVER1 'echo $VAR; my_func2' ::: bar
Output:
foo
in my_func2 foo bar
=head2 Showing what is actually run
B<--verbose> will show the command that would be run on the local
machine.
When using B<--cat>, B<--pipepart>, or when a job is run on a remote
machine, the command is wrapped with helper scripts. B<-vv> shows all
of this.
parallel -vv --pipepart --block 1M wc :::: num30000
Output:
<num30000 perl -e 'while(@ARGV) { sysseek(STDIN,shift,0) || die;
$left = shift; while($read = sysread(STDIN,$buf, ($left > 131072
? 131072 : $left))){ $left -= $read; syswrite(STDOUT,$buf); } }'
0 0 0 168894 | (wc)
30000 30000 168894
When the command gets more complex, the output is so hard to read
that it is only useful for debugging:
my_func3() {
echo in my_func $1 > $1.out
}
export -f my_func3
parallel -vv --workdir ... --nice 17 --env _ --trc {}.out -S $SERVER1 my_func3 {} ::: abc-file
Output will be similar to:
( ssh server -- mkdir -p ./.parallel/tmp/aspire-1928520-1;rsync
--protocol 30 -rlDzR -essh ./abc-file
server:./.parallel/tmp/aspire-1928520-1 );ssh server -- exec perl -e
\''@GNU_Parallel=("use","IPC::Open3;","use","MIME::Base64");
eval"@GNU_Parallel";my$eval=decode_base64(join"",@ARGV);eval$eval;'\'
c3lzdGVtKCJta2RpciIsIi1wIiwiLS0iLCIucGFyYWxsZWwvdG1wL2FzcGlyZS0xOTI4N
TsgY2hkaXIgIi5wYXJhbGxlbC90bXAvYXNwaXJlLTE5Mjg1MjAtMSIgfHxwcmludChTVE
BhcmFsbGVsOiBDYW5ub3QgY2hkaXIgdG8gLnBhcmFsbGVsL3RtcC9hc3BpcmUtMTkyODU
iKSAmJiBleGl0IDI1NTskRU5WeyJPTERQV0QifT0iL2hvbWUvdGFuZ2UvcHJpdmF0L3Bh
IjskRU5WeyJQQVJBTExFTF9QSUQifT0iMTkyODUyMCI7JEVOVnsiUEFSQUxMRUxfU0VRI
0BiYXNoX2Z1bmN0aW9ucz1xdyhteV9mdW5jMyk7IGlmKCRFTlZ7IlNIRUxMIn09fi9jc2
ByaW50IFNUREVSUiAiQ1NIL1RDU0ggRE8gTk9UIFNVUFBPUlQgbmV3bGluZXMgSU4gVkF
TL0ZVTkNUSU9OUy4gVW5zZXQgQGJhc2hfZnVuY3Rpb25zXG4iOyBleGVjICJmYWxzZSI7
YXNoZnVuYyA9ICJteV9mdW5jMygpIHsgIGVjaG8gaW4gbXlfZnVuYyBcJDEgPiBcJDEub
Xhwb3J0IC1mIG15X2Z1bmMzID4vZGV2L251bGw7IjtAQVJHVj0ibXlfZnVuYzMgYWJjLW
RzaGVsbD0iJEVOVntTSEVMTH0iOyR0bXBkaXI9Ii90bXAiOyRuaWNlPTE3O2RveyRFTlZ
MRUxfVE1QfT0kdG1wZGlyLiIvcGFyIi5qb2luIiIsbWFweygwLi45LCJhIi4uInoiLCJB
KVtyYW5kKDYyKV19KDEuLjUpO313aGlsZSgtZSRFTlZ7UEFSQUxMRUxfVE1QfSk7JFNJ
fT1zdWJ7JGRvbmU9MTt9OyRwaWQ9Zm9yazt1bmxlc3MoJHBpZCl7c2V0cGdycDtldmFse
W9yaXR5KDAsMCwkbmljZSl9O2V4ZWMkc2hlbGwsIi1jIiwoJGJhc2hmdW5jLiJAQVJHVi
JleGVjOiQhXG4iO31kb3skcz0kczwxPzAuMDAxKyRzKjEuMDM6JHM7c2VsZWN0KHVuZGV
mLHVuZGVmLCRzKTt9dW50aWwoJGRvbmV8fGdldHBwaWQ9PTEpO2tpbGwoU0lHSFVQLC0k
dW5sZXNzJGRvbmU7d2FpdDtleGl0KCQ/JjEyNz8xMjgrKCQ/JjEyNyk6MSskPz4+OCk=;
_EXIT_status=$?; mkdir -p ./.; rsync --protocol 30 --rsync-path=cd\
./.parallel/tmp/aspire-1928520-1/./.\;\ rsync -rlDzR -essh
server:./abc-file.out ./.;ssh server -- \(rm\ -f\
./.parallel/tmp/aspire-1928520-1/abc-file\;\ sh\ -c\ \'rmdir\
./.parallel/tmp/aspire-1928520-1/\ ./.parallel/tmp/\ ./.parallel/\
2\>/dev/null\'\;rm\ -rf\ ./.parallel/tmp/aspire-1928520-1\;\);ssh
server -- \(rm\ -f\ ./.parallel/tmp/aspire-1928520-1/abc-file.out\;\
sh\ -c\ \'rmdir\ ./.parallel/tmp/aspire-1928520-1/\ ./.parallel/tmp/\
./.parallel/\ 2\>/dev/null\'\;rm\ -rf\
./.parallel/tmp/aspire-1928520-1\;\);ssh server -- rm -rf
.parallel/tmp/aspire-1928520-1; exit $_EXIT_status;
=head1 Saving to an SQL base (advanced)
GNU B<parallel> can save into an SQL base. Point GNU B<parallel> to a
table and it will put the joblog there together with the variables and
the output each in their own column.
=head2 CSV as SQL base
The simplest is to use a CSV file as the storage table:
parallel --sqlandworker csv:////%2Ftmp%2Flog.csv seq ::: 10 ::: 12 13 14
cat /tmp/log.csv
Note how '/' in the path must be written as %2F.
Output will be similar to:
Seq,Host,Starttime,JobRuntime,Send,Receive,Exitval,_Signal,Command,V1,V2,Stdout,Stderr
1,:,1458254498.254,0.069,0,9,0,0,"seq 10 12",10,12,"10
11
12
",
2,:,1458254498.278,0.080,0,12,0,0,"seq 10 13",10,13,"10
11
12
13
",
3,:,1458254498.301,0.083,0,15,0,0,"seq 10 14",10,14,"10
11
12
13
14
",
A proper CSV reader (like LibreOffice or R's read.csv) will read this
format correctly - even with fields containing newlines as above.
If the output is big you may want to put it into files using B<--results>:
parallel --results outdir --sqlandworker csv:////%2Ftmp%2Flog2.csv seq ::: 10 ::: 12 13 14
cat /tmp/log2.csv
Output will be similar to:
Seq,Host,Starttime,JobRuntime,Send,Receive,Exitval,_Signal,Command,V1,V2,Stdout,Stderr
1,:,1458824738.287,0.029,0,9,0,0,"seq 10 12",10,12,outdir/1/10/2/12/stdout,outdir/1/10/2/12/stderr
2,:,1458824738.298,0.025,0,12,0,0,"seq 10 13",10,13,outdir/1/10/2/13/stdout,outdir/1/10/2/13/stderr
3,:,1458824738.309,0.026,0,15,0,0,"seq 10 14",10,14,outdir/1/10/2/14/stdout,outdir/1/10/2/14/stderr
=head2 DBURL as table
The CSV file is an example of a DBURL.
GNU B<parallel> uses a DBURL to address the table. A DBURL has this format:
vendor://[[user][:password]@][host][:port]/[database[/table]]
Example:
mysql://scott:tiger@my.example.com/mydatabase/mytable
postgresql://scott:tiger@pg.example.com/mydatabase/mytable
sqlite3:///%2Ftmp%2Fmydatabase/mytable
csv:////%2Ftmp%2Flog.csv
To refer to B</tmp/mydatabase> with B<sqlite> or B<csv> you need to
encode the B</> as B<%2F>.
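The encoding can be done with B<sed> when the path is held in a variable. The helper B<encode_slashes> is a made-up name for this sketch; the resulting DBURLs are the ones used in the examples of this section.

```shell
# Hypothetical helper: encode every / in a path as %2F.
encode_slashes() {
  printf '%s\n' "$1" | sed 's|/|%2F|g'
}

db=$(encode_slashes /tmp/mydatabase)
sqlite_dburl="sqlite3:///$db/mytable"
csv_dburl="csv:////$(encode_slashes /tmp/log.csv)"

echo "$sqlite_dburl"   # sqlite3:///%2Ftmp%2Fmydatabase/mytable
echo "$csv_dburl"      # csv:////%2Ftmp%2Flog.csv
```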
Run a job using B<sqlite> on B<mytable> in B</tmp/mydatabase>:
DBURL=sqlite3:///%2Ftmp%2Fmydatabase
DBURLTABLE=$DBURL/mytable
parallel --sqlandworker $DBURLTABLE echo ::: foo bar ::: baz quuz
To see the result:
sql $DBURL 'SELECT * FROM mytable ORDER BY Seq;'
Output will be similar to:
Seq|Host|Starttime|JobRuntime|Send|Receive|Exitval|_Signal|Command|V1|V2|Stdout|Stderr
1|:|1451619638.903|0.806||8|0|0|echo foo baz|foo|baz|foo baz
|
2|:|1451619639.265|1.54||9|0|0|echo foo quuz|foo|quuz|foo quuz
|
3|:|1451619640.378|1.43||8|0|0|echo bar baz|bar|baz|bar baz
|
4|:|1451619641.473|0.958||9|0|0|echo bar quuz|bar|quuz|bar quuz
|
The first columns are well known from B<--joblog>. B<V1> and B<V2> are
data from the input sources. B<Stdout> and B<Stderr> are standard
output and standard error, respectively.
=head2 Using multiple workers
Using an SQL base as storage adds overhead on the order of 1 second
per job.
One of the situations where it makes sense is if you have multiple
workers.
You can then have a single master machine that submits jobs to the SQL
base (but does not do any of the work):
parallel --sqlmaster $DBURLTABLE echo ::: foo bar ::: baz quuz
On the worker machines you run exactly the same command except you
replace B<--sqlmaster> with B<--sqlworker>.
parallel --sqlworker $DBURLTABLE echo ::: foo bar ::: baz quuz
To run a master and a worker on the same machine use B<--sqlandworker>
as shown earlier.
=head1 --pipe
The B<--pipe> functionality puts GNU B<parallel> in a different mode:
Instead of treating the data on stdin (standard input) as arguments
for a command to run, the data will be sent to stdin (standard input)
of the command.
The typical situation is:
command_A | command_B | command_C
where command_B is slow, and you want to speed up command_B.
=head2 Chunk size
By default GNU B<parallel> will start an instance of command_B, read a
chunk of 1 MB, and pass that to the instance. Then start another
instance, read another chunk, and pass that to the second instance.
cat num1000000 | parallel --pipe wc
Output (the order may be different):
165668 165668 1048571
149797 149797 1048579
149796 149796 1048572
149797 149797 1048579
149797 149797 1048579
149796 149796 1048572
85349 85349 597444
The size of the chunk is not exactly 1 MB because GNU B<parallel> only
passes full lines - never half a line, thus the blocksize is only
1 MB on average. You can change the block size to 2 MB with B<--block>:
cat num1000000 | parallel --pipe --block 2M wc
Output (the order may be different):
315465 315465 2097150
299593 299593 2097151
299593 299593 2097151
85349 85349 597444
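The number of jobs in the two runs above follows from the input size. This sketch assumes B<num1000000> holds the numbers 1 to 1000000, one per line, which is consistent with the byte counts in the output, and that B<M> means 1024*1024 bytes. The helper B<blocks_needed> is hypothetical. Since blocks are cut at line boundaries, treat the result as an estimate.

```shell
# Size in bytes of the numbers 1..1000000, one per line (assumed content of num1000000).
size=$(( $(seq 1000000 | wc -c) ))

# Hypothetical helper: number of blocks of the given size needed to cover $size bytes.
blocks_needed() {
  echo $(( (size + $1 - 1) / $1 ))
}

blocks_1m=$(blocks_needed $((1024 * 1024)))
blocks_2m=$(blocks_needed $((2 * 1024 * 1024)))

echo "size: $size bytes"
echo "1 MB blocks: $blocks_1m"
echo "2 MB blocks: $blocks_2m"
```

The input is 6888896 bytes, which gives 7 blocks of 1 MB and 4 blocks of 2 MB, matching the number of output lines in the two runs.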
GNU B<parallel> treats each line as a record. If the order of records
is unimportant (e.g. you need all lines processed, but you do not care
which is processed first), then you can use B<--round-robin>. Without
B<--round-robin> GNU B<parallel> will start a command per block; with
B<--round-robin> only the requested number of jobs will be started
(B<--jobs>). The records will then be distributed between the running
jobs:
cat num1000000 | parallel --pipe -j4 --round-robin wc
Output will be similar to:
149797 149797 1048579
299593 299593 2097151
315465 315465 2097150
235145 235145 1646016
One of the 4 instances got a single block, 2 instances got 2 full
blocks each, and one instance got 1 full and 1 partial block.
=head2 Records
GNU B<parallel> sees the input as records. The default record is a single
line.
Using B<-N140000> GNU B<parallel> will read 140000 records at a time:
cat num1000000 | parallel --pipe -N140000 wc
Output (the order may be different):
140000 140000 868895
140000 140000 980000
140000 140000 980000
140000 140000 980000
140000 140000 980000
140000 140000 980000
140000 140000 980000
20000 20000 140001
Note how the last job could not get the full 140000 lines, but
only 20000 lines.
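The job count and the size of the last job are plain integer arithmetic. The variable names in this small sketch are only for illustration.

```shell
total=1000000     # lines in the input
per_job=140000    # value given to -N

jobs=$(( (total + per_job - 1) / per_job ))   # divide, rounding up
last=$(( total - (jobs - 1) * per_job ))      # lines left for the final job

echo "$jobs jobs, the last one gets $last lines"
```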
If a record is 75 lines B<-L> can be used:
cat num1000000 | parallel --pipe -L75 wc
Output (the order may be different):
165600 165600 1048095
149850 149850 1048950
149775 149775 1048425
149775 149775 1048425
149850 149850 1048950
149775 149775 1048425
85350 85350 597450
25 25 176
Note how GNU B<parallel> still reads a block of around 1 MB; but
instead of passing full lines to B<wc> it passes a multiple of 75 lines
at a time. This of course does not hold for the last job (which in this
case got 25 lines).
=head2 Record separators
GNU B<parallel> uses separators to determine where two records split.
B<--recstart> gives the string that starts a record; B<--recend> gives the
string that ends a record. The default is B<--recend '\n'> (newline).
If both B<--recend> and B<--recstart> are given, then the record will only
split if the recend string is immediately followed by the recstart
string.
Here the B<--recend> is set to B<', '>:
echo /foo, bar/, /baz, qux/, | parallel -kN1 --recend ', ' --pipe echo JOB{#}\;cat\;echo END
Output:
JOB1
/foo, END
JOB2
bar/, END
JOB3
/baz, END
JOB4
qux/,
END
Here the B<--recstart> is set to B</>:
echo /foo, bar/, /baz, qux/, | parallel -kN1 --recstart / --pipe echo JOB{#}\;cat\;echo END
Output:
JOB1
/foo, barEND
JOB2
/, END
JOB3
/baz, quxEND
JOB4
/,
END
Here both B<--recend> and B<--recstart> are set:
echo /foo, bar/, /baz, qux/, | parallel -kN1 --recend ', ' --recstart / --pipe echo JOB{#}\;cat\;echo END
Output:
JOB1
/foo, bar/, END
JOB2
/baz, qux/,
END
Note the difference between setting one string and setting both strings.
With B<--regexp> the B<--recend> and B<--recstart> will be treated as a regular expression:
echo foo,bar,_baz,__qux, | parallel -kN1 --regexp --recend ,_+ --pipe echo JOB{#}\;cat\;echo END
Output:
JOB1
foo,bar,_END
JOB2
baz,__END
JOB3
qux,
END
GNU B<parallel> can remove the record separators with B<--remove-rec-sep>/B<--rrs>:
echo foo,bar,_baz,__qux, | parallel -kN1 --rrs --regexp --recend ,_+ --pipe echo JOB{#}\;cat\;echo END
Output:
JOB1
foo,barEND
JOB2
bazEND
JOB3
qux,
END
=head2 Header
If the input data has a header, the header can be repeated for each
job by matching the header with B<--header>. If headers start with
B<%> you can do this:
cat num_%header | parallel --header '(%.*\n)*' --pipe -N3 echo JOB{#}\;cat
Output (the order may be different):
JOB1
%head1
%head2
1
2
3
JOB2
%head1
%head2
4
5
6
JOB3
%head1
%head2
7
8
9
JOB4
%head1
%head2
10
If the header is 2 lines, B<--header 2> will work:
cat num_%header | parallel --header 2 --pipe -N3 echo JOB{#}\;cat
Output: Same as above.
=head2 --pipepart
B<--pipe> is not very efficient: it maxes out at around 500
MB/s. B<--pipepart> can easily deliver 5 GB/s, but it has a few
limitations: the input has to be a normal file (not a pipe) given by
B<-a> or B<::::>, and B<-L>/B<-l>/B<-N> do not work. B<--recend> and
B<--recstart>, however, I<do> work, and records can often be split on
those alone.
parallel --pipepart -a num1000000 --block 3m wc
Output (the order may be different):
444443 444444 3000002
428572 428572 3000004
126985 126984 888890
=head1 Shebang
=head2 Input data and parallel command in the same file
GNU B<parallel> is often called like this:
cat input_file | parallel command
With B<--shebang> the I<input_file> and B<parallel> can be combined into the same script.
UNIX shell scripts start with a shebang line like this:
#!/bin/bash
GNU B<parallel> can do that, too. With B<--shebang> the arguments can be
listed in the file. The B<parallel> command is the first line of the
script:
#!/usr/bin/parallel --shebang -r echo
foo
bar
baz
Output (the order may be different):
foo
bar
baz
=head2 Parallelizing existing scripts
GNU B<parallel> is often called like this:
cat input_file | parallel command
parallel command ::: foo bar
If B<command> is a script, B<parallel> can be combined with it into a
single file, so that this will run the script in parallel:
cat input_file | command
command foo bar
This B<perl> script B<perl_echo> works like B<echo>:
#!/usr/bin/perl
print "@ARGV\n"
It can be called like this:
parallel perl_echo ::: foo bar
By changing the B<#!>-line it can be run in parallel:
#!/usr/bin/parallel --shebang-wrap /usr/bin/perl
print "@ARGV\n"
Thus this will work:
perl_echo foo bar
Output (the order may be different):
foo
bar
This technique can be used for:
=over 9
=item Perl:
#!/usr/bin/parallel --shebang-wrap /usr/bin/perl
print "Arguments @ARGV\n";
=item Python:
#!/usr/bin/parallel --shebang-wrap /usr/bin/python
import sys
print('Arguments ' + str(sys.argv))
=item Bash/sh/zsh/Korn shell:
#!/usr/bin/parallel --shebang-wrap /bin/bash
echo Arguments "$@"
=item csh:
#!/usr/bin/parallel --shebang-wrap /bin/csh
echo Arguments "$argv"
=item Tcl:
#!/usr/bin/parallel --shebang-wrap /usr/bin/tclsh
puts "Arguments $argv"
=item R:
#!/usr/bin/parallel --shebang-wrap /usr/bin/Rscript --vanilla --slave
args <- commandArgs(trailingOnly = TRUE)
print(paste("Arguments ",args))
=item gnuplot:
#!/usr/bin/parallel --shebang-wrap ARG={} /usr/bin/gnuplot
print "Arguments ", system('echo $ARG')
=item Ruby:
#!/usr/bin/parallel --shebang-wrap /usr/bin/ruby
print "Arguments "
puts ARGV
=item Octave:
#!/usr/bin/parallel --shebang-wrap /usr/bin/octave
printf ("Arguments");
arg_list = argv ();
for i = 1:nargin
printf (" %s", arg_list{i});
endfor
printf ("\n");
=item Common LISP:
#!/usr/bin/parallel --shebang-wrap /usr/bin/clisp
(format t "~&~S~&" 'Arguments)
(format t "~&~S~&" *args*)
=item PHP:
#!/usr/bin/parallel --shebang-wrap /usr/bin/php
<?php
echo "Arguments";
foreach(array_slice($argv,1) as $v)
{
echo " $v";
}
echo "\n";
?>
=item Node.js:
#!/usr/bin/parallel --shebang-wrap /usr/bin/node
var myArgs = process.argv.slice(2);
console.log('Arguments ', myArgs);
=item Lua:
#!/usr/bin/parallel --shebang-wrap /usr/bin/lua
io.write "Arguments"
for a = 1, #arg do
io.write(" ")
io.write(arg[a])
end
print("")
=item C#:
#!/usr/bin/parallel --shebang-wrap ARGV={} /usr/bin/csharp
var argv = Environment.GetEnvironmentVariable("ARGV");
print("Arguments "+argv);
=back
=head1 Semaphore
GNU B<parallel> can work as a counting semaphore. This is slower and less
efficient than its normal mode.
A counting semaphore is like a row of toilets. People needing a toilet
can use any toilet, but if there are more people than toilets, they
will have to wait for one of the toilets to become available.
An alias for B<parallel --semaphore> is B<sem>.
B<sem> will follow a person to the toilets, wait until a toilet is
available, leave the person in the toilet and exit.
B<sem --fg> will follow a person to the toilets, wait until a toilet is
available, stay with the person in the toilet and exit when the person
exits.
B<sem --wait> will wait for all persons to leave the toilets.
B<sem> does not have a queue discipline, so the next person is chosen
randomly.
B<-j> sets the number of toilets.
=head2 Mutex
The default is to have only one toilet (this is called a mutex). The
program is started in the background and B<sem> exits immediately. Use
B<--wait> to wait for all B<sem>s to finish:
sem 'sleep 1; echo The first finished' &&
echo The first is now running in the background &&
sem 'sleep 1; echo The second finished' &&
echo The second is now running in the background
sem --wait
Output:
The first is now running in the background
The first finished
The second is now running in the background
The second finished
The command can be run in the foreground with B<--fg>, which will only
exit when the command completes:
sem --fg 'sleep 1; echo The first finished' &&
echo The first finished running in the foreground &&
sem --fg 'sleep 1; echo The second finished' &&
echo The second finished running in the foreground
sem --wait
The difference between this and just running the command is that a
mutex is set, so if other B<sem>s were running in the background, only
one would run at a time.
To control which semaphore is used, use
B<--semaphorename>/B<--id>. Run this in one terminal:
sem --id my_id -u 'echo First started; sleep 10; echo The first finished'
and simultaneously this in another terminal:
sem --id my_id -u 'echo Second started; sleep 10; echo The second finished'
Note how the second will only be started when the first has finished.
=head2 Counting semaphore
A mutex is like having a single toilet: When it is in use everyone
else will have to wait. A counting semaphore is like having multiple
toilets: Several people can use the toilets, but when they all are in
use, everyone else will have to wait.
B<sem> can emulate a counting semaphore. Use B<--jobs> to set the number of
toilets like this:
sem --jobs 3 --id my_id -u 'echo First started; sleep 5; echo The first finished' &&
sem --jobs 3 --id my_id -u 'echo Second started; sleep 6; echo The second finished' &&
sem --jobs 3 --id my_id -u 'echo Third started; sleep 7; echo The third finished' &&
sem --jobs 3 --id my_id -u 'echo Fourth started; sleep 8; echo The fourth finished' &&
sem --wait --id my_id
Output:
First started
Second started
Third started
The first finished
Fourth started
The second finished
The third finished
The fourth finished
=head2 Timeout
With B<--semaphoretimeout> you can force running the command anyway after
a period (positive number) or give up (negative number):
sem --id foo -u 'echo Slow started; sleep 5; echo Slow ended' &&
sem --id foo --semaphoretimeout 1 'echo Force this running after 1 sec' &&
sem --id foo --semaphoretimeout -2 'echo Give up after 2 secs'
sem --id foo --wait
Output:
Slow started
parallel: Warning: Semaphore timed out. Stealing the semaphore.
Force this running after 1 sec
Slow ended
parallel: Warning: Semaphore timed out. Exiting.
Note how the 'Give up' was not run.
=head1 Informational
GNU B<parallel> has some options to give short information about the
configuration.
B<--help> will print a summary of the most important options:
parallel --help
Output:
Usage:
parallel [options] [command [arguments]] < list_of_arguments
parallel [options] [command [arguments]] (::: arguments|:::: argfile(s))...
cat ... | parallel --pipe [options] [command [arguments]]
-j n Run n jobs in parallel
-k Keep same order
-X Multiple arguments with context replace
--colsep regexp Split input on regexp for positional replacements
{} {.} {/} {/.} {#} {%} {= perl code =} Replacement strings
{3} {3.} {3/} {3/.} {=3 perl code =} Positional replacement strings
With --plus: {} = {+/}/{/} = {.}.{+.} = {+/}/{/.}.{+.} = {..}.{+..} =
{+/}/{/..}.{+..} = {...}.{+...} = {+/}/{/...}.{+...}
-S sshlogin Example: foo@server.example.com
--slf .. Use ~/.parallel/sshloginfile as the list of sshlogins
--trc {}.bar Shorthand for --transfer --return {}.bar --cleanup
--onall Run the given command with argument on all sshlogins
--nonall Run the given command with no arguments on all sshlogins
--pipe Split stdin (standard input) to multiple jobs.
--recend str Record end separator for --pipe.
--recstart str Record start separator for --pipe.
See 'man parallel' for details
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.
This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.
When asking for help, always report the full output of this:
parallel --version
Output:
GNU parallel 20160323
Copyright (C) 2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.
Web site: http://www.gnu.org/software/parallel
When using programs that use GNU Parallel to process data for publication
please cite as described in 'parallel --citation'.
In scripts B<--minversion> can be used to ensure the user has at least
this version:
parallel --minversion 20130722 && echo Your version is at least 20130722.
Output:
20160322
Your version is at least 20130722.
If you are using GNU B<parallel> for research the BibTeX citation can be
generated using B<--citation>:
parallel --citation
Output:
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:
@article{Tange2011a,
title = {GNU Parallel - The Command-Line Power Tool},
author = {O. Tange},
address = {Frederiksberg, Denmark},
journal = {;login: The USENIX Magazine},
month = {Feb},
number = {1},
volume = {36},
url = {http://www.gnu.org/s/parallel},
year = {2011},
pages = {42-47},
doi = {10.5281/zenodo.16303}
}
(Feel free to use \nocite{Tange2011a})
This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.
If you send a copy of your published article to tange@gnu.org, it will be
mentioned in the release notes of next version of GNU Parallel.
With B<--max-line-length-allowed> GNU B<parallel> will report the maximal
size of the command line:
parallel --max-line-length-allowed
Output (may vary on different systems):
131071
B<--number-of-cpus> and B<--number-of-cores> run system specific code to
determine the number of CPUs and CPU cores on the system. On
unsupported platforms they will return 1:
parallel --number-of-cpus
parallel --number-of-cores
Output (may vary on different systems):
4
64
=head1 Profiles
The defaults for GNU B<parallel> can be changed systemwide by putting the
command line options in B</etc/parallel/config>. They can be changed for
a user by putting them in B<~/.parallel/config>.
Profiles work the same way, but have to be referred to with B<--profile>:
echo '--nice 17' > ~/.parallel/nicetimeout
echo '--timeout 300%' >> ~/.parallel/nicetimeout
parallel --profile nicetimeout echo ::: A B C
Output:
A
B
C
Profiles can be combined:
echo '-vv --dry-run' > ~/.parallel/dryverbose
parallel --profile dryverbose --profile nicetimeout echo ::: A B C
Output:
echo A
echo B
echo C
=head1 Spread the word
I hope you have learned something from this tutorial.
If you like GNU B<parallel>:
=over 2
=item *
(Re-)walk through the tutorial if you have not done so in the past year
(http://www.gnu.org/software/parallel/parallel_tutorial.html)
=item *
Give a demo at your local user group/your team/your colleagues
=item *
Post the intro videos and the tutorial on Reddit, Diaspora*,
forums, blogs, Identi.ca, Google+, Twitter, Facebook, Linkedin,
and mailing lists
=item *
Request or write a review for your favourite blog or magazine
(especially if you do something cool with GNU B<parallel>)
=item *
Invite me for your next conference
=back
If you use GNU B<parallel> for research:
=over 2
=item *
Please cite GNU B<parallel> in your publications (use B<--citation>)
=back
If GNU B<parallel> saves you money:
=over 2
=item *
(Have your company) donate to FSF or become a member
https://my.fsf.org/donate/
=back
(C) 2013,2014,2015,2016,2017 Ole Tange, GPLv3
=cut