#!/usr/bin/perl -w =encoding utf8 =head1 Design of GNU Parallel This document describes design decisions made in the development of GNU B and the reasoning behind them. It will give an overview of why some of the code looks like it does, and help new maintainers understand the code better. =head2 One file program GNU B is a Perl script in a single file. It is object oriented, but contrary to normal Perl scripts each class is not in its own file. This is due to user experience: The goal is that in a pinch the user will be able to get GNU B working simply by copying a single file: No need messing around with environment variables like PERL5LIB. =head2 Old Perl style GNU B uses some old, deprecated constructs. This is due to a goal of being able to run on old installations. Currently the target is CentOS 3.9 and Perl 5.8.0. =head2 Exponentially back off GNU B busy waits. This is because the reason why a job is not started may be due to load average, and thus it will not make sense to wait for a job to finish. Instead the load average must be checked again. Load average is not the only reason: --timeout has a similar problem. To not burn up too up too much CPU GNU B sleeps exponentially longer and longer if nothing happens, maxing out at 1 second. =head2 Shell compatibility It is a goal to have GNU B work equally well in any shell. However, in practice GNU B is being developed in B and thus testing in other shells is limited to reported bugs. When an incompatibility is found there is often not an easy fix: Fixing the problem in B often breaks it in B. In these cases the fix is often to use a small Perl script and call that. =head2 Job slots The easiest way to explain what GNU B does is to assume that there are a number of job slots, and when a slot becomes available a job from the queue will be run in that slot. But originally GNU B did not model job slots in the code. Job slots have been added to make it possible to use {%} as a replacement string. Job slots were added to the code in 20140522, but while the job sequence number can be computed in advance, the job slot can only be computed the moment a slot becomes available. So it has been implemented as a stack with lazy evaluation: Draw one from an empty stack and the stack is extended by one. When a job is done, push the available job slot back on the stack. This implementation also means that if you use remote executions, you cannot assume that a given job slot will remain on the same remote server. This goes double since number of job slots can be adjusted on the fly (by giving B<--jobs> a file name). =head2 Rsync protocol version B 3.1.x uses protocol 31 which is unsupported by version 2.5.7. That means that you cannot push a file to a remote system using B protocol 31, if the remote system uses 2.5.7. B does not automatically downgrade to protocol 30. GNU B does not require protocol 31, so if the B version is >= 3.1.0 then B<--protocol 30> is added to force newer Bs to talk to version 2.5.7. =head2 Compression B<--compress> compresses the data in the temporary files. This is a bit tricky because there should be no files to clean up if GNU B is killed by a power outage. GNU B first selects a compress program. If the user has not selected one, the first of these that are in $PATH is used: B. They are sorted by speed on a 8 core machine. Schematically the setup is as follows: command started by parallel | compress > tmpfile cattail tmpfile | uncompress | parallel The setup is duplicated for both standard output (stdout) and standard error (stderr). GNU B pipes output from the command run into the compress program which saves to a tmpfile. GNU B records the pid of the compress program. At the same time a small perl script (called B above) is started: It basically does B followed by B, but it also removes the tmpfile as soon as the first byte is read, and it continously checks if the pid of the compress program is dead. If the compress program is dead, B reads the rest of tmpfile and exits. As most compress programs write out a header when they start, the tmpfile in practice is unlinked after around 40 ms. =head2 Wrapping The command given by the user can be wrapped in multiple templates. Templates can be wrapped in other templates. =over 15 =item --shellquote echo <> =item --nice I \nice -n I $shell -c <> The \ is needed to avoid using the builtin nice command, which does not support -n in B. B<$shell -c> is needed to nice composed commands command. =item --cat (cat > {}; <> {}; perl -e '$bash=shift; $csh=shift; for(@ARGV) {unlink;rmdir;} if($bash=~s/h//) {exit$bash;} exit$csh;' "$?h" "$status" {}); {} is set to $PARALLEL_TMP which is a tmpfile. The Perl script saves the exit value, unlinks the tmpfile, and returns the exit value - no matter if the shell is B (using $?) or B<*csh> (using $status). =item --fifo perl -e '($s,$c,$f) = @ARGV; system "mkfifo", $f; $pid = fork || exec $s, "-c", $c; open($o,">",$f) || die $!; while(sysread(STDIN,$buf,32768)){ syswrite $o, $buf; } close $o; waitpid $pid,0; unlink $f; exit $?/256;' $shell <> $PARALLEL_TMP This is an elaborate way of: mkfifo {}; run <> in the background using $shell; copying STDIN to {}; waiting for background to complete; remove {} and exit with the exit code from <>. It is made this way to be compatible with B<*csh>. =item --sshlogin I ssh I <> =item --transfer ( ssh I mkdir -p ./I;rsync --protocol 30 -rlDzR -essh ./{} I:./I ); <> Read about B<--protocol 30> in the section B. =item --basefile <> =item --return I <>; _EXIT_status=$?; mkdir -p I; rsync --protocol 30 --rsync-path=cd\ ./I\;\ rsync -rlDzR -essh I:./I ./I; exit $_EXIT_status; The B<--rsync-path=cd ...> is needed because old versions of B do not support B<--no-implied-dirs>. The B<$_EXIT_status> trick is to postpone the exit value. This makes it incompatible with B<*csh> and should be fixed in the future. Maybe a wrapping 'sh -c' is enough? =item --cleanup <> _EXIT_status=$?; <> ssh I \(rm\ -f\ ./I/{}\;\ rmdir\ ./I\ \>\&/dev/null\;\); exit $_EXIT_status; B<$_EXIT_status>: see B<--return> above. =item --pipe sh -c 'dd bs=1 count=1 of=I 2>/dev/null'; test ! -s "I" && rm -f "I" && exec true; (cat I; rm I; cat - ) | ( <> ); This small wrapper makes sure that <> will never be run if there is no data. B is needed to hide stderr if the user's shell is B (which cannot hide stderr). =item --tmux mkfifo I; tmux -S new-session -s pI -d 'sleep .2' >&/dev/null; tmux -S new-window -t pI -n <> \(<>\)\;\ perl\ -e\ \'while\(\$t++\<3\)\{\ print\ \$ARGV\[0\],\"\\n\"\ \}\'\ \$\?h/\$status\ \>\>\ I\&echo\ <>\;echo\ \Job\ finished\ at:\ \`date\`\;sleep\ 10; exec perl -e '$/="/";$_=<>;$c=<>;unlink $ARGV; /(\d+)h/ and exit($1);exit$c' I First a FIFO is made (.tmx). It is used for communicating exit value. Next a new tmux session is made. This may fail if there is already a session, so the output is ignored. If all job slots finish at the same time, then B will close the session. A temporary socket is made (.tms) to avoid a race condition in B. It is cleaned up when GNU B finishes. The input is used as the name of the windows in B. When the job inside B finishes, the exit value is printed to the FIFO (.tmx). This FIFO is opened by B outside B, and B then removes the FIFO. B blocks until the first value is read from the FIFO, and this value is used as exit value. To make it compatible with B and B the exit value is printed as: $?h/$status and this is parsed by B. Works in B. There is a bug that makes it necessary to print the exit value 3 times. Another bug in B requires the length of the tmux title and command to not have certain limits. When inside these limits, 75 '\ ' are added to the title to force it to be outside the limits. You can map the bad limits using: perl -e 'sub r { int(rand(shift)).($_[0] && "\t".r(@_)) } print map { r(@ARGV)."\n" } 1..10000' 1600 1500 90 | perl -ane '$F[0]+$F[1]+$F[2] < 2037 and print ' | parallel --colsep '\t' --tagstring '{1}\t{2}\t{3}' tmux -S /tmp/p{%}-'{=3 $_="O"x$_ =}' \ new-session -d -n '{=1 $_="O"x$_ =}' true'\ {=2 $_="O"x$_ =};echo $?;rm -f /tmp/p{%}-O*' perl -e 'sub r { int(rand(shift)).($_[0] && "\t".r(@_)) } print map { r(@ARGV)."\n" } 1..10000' 17000 17000 90 | parallel --colsep '\t' --tagstring '{1}\t{2}\t{3}' \ tmux -S /tmp/p{%}-'{=3 $_="O"x$_ =}' new-session -d -n '{=1 $_="O"x$_ =}' true'\ {=2 $_="O"x$_ =};echo $?;rm /tmp/p{%}-O*' > value.csv 2>/dev/null R -e 'a<-read.table("value.csv");X11();plot(a[,1],a[,2],col=a[,3]+5,cex=0.1);Sys.sleep(1000)' For B 17000 can be lowered to 2100. The interesting areas are title 0..1000 with (title + whole command) in 996..1127 and 9331..9636. =back The ordering of the wrapping is important: =over 5 =item * B<--nice>/B<--cat>/B<--fifo> should be done on the remote machine =item * B<--pipepart>/B<--pipe> should be done on the local machine inside B<--tmux> =back =head2 --block-size adjustment Every time GNU B detects a record bigger than B<--block-size> it increases the block size by 30%. A small B<--block-size> gives very poor performance; by exponentially increasing the block size performance will not suffer. =head2 Convenience options --nice --basefile --transfer --return --cleanup --tmux --group --compress --cat --fifo --workdir These are all convenience options that make it easier to do a task. But more importantly: They are tested to work on corner cases, too. Take B<--nice> as an example: nice parallel command ... will work just fine. But when run remotely, you need to move the nice command so it is being run on the server: parallel -S server nice command ... And this will again work just fine, as long as you are running a single command. When you are running a composed command you need nice to apply to the whole command, and it gets harder still: parallel -S server -q nice bash -c 'command1 ...; command2 | command3' It is not impossible, but by using B<--nice> GNU B will do the right thing for you. Similarly when transferring files: It starts to get hard when the file names contain space, :, `, *, or other special characters. To run the commands in a B session you basically just need to quote the command. For simple commands that is easy, but when commands contain special characters, it gets much harder to get right. B<--cat> and B<--fifo> are easy to do by hand, until you want to clean up the tmpfile and keep the exit code of the command. The real killer comes when you try to combine several of these: Doing that correctly for all corner cases is next to impossible to do by hard. =head2 Shell shock The shell shock bug in B did not affect GNU B, but the solutions did. B first introduced functions in variables named: I and later changed that to I. When transferring functions GNU B reads off the function and changes that into a function definition, which is copied to the remote system and executed before the actual command is executed. Therefore GNU B needs to know how to read the function. From version 20150122 GNU B tries both the ()-version and the %%-version, and the function definition works on both pre- and post-shellshock versions of B. =head2 Remote Ctrl-C and standard error (stderr) If the user presses Ctrl-C the user expects jobs to stop. This works out of the box if the jobs are run locally. Unfortunately it is not so simple if the jobs are run remotely. If remote jobs are run in a tty using B, then Ctrl-C works, but all output to standard error (stderr) is sent to standard output (stdout). This is not what the user expects. If remote jobs are run without a tty using B (without B<-tt>), then output to standard error (stderr) is kept on stderr, but Ctrl-C does not kill remote jobs. This is not what the user expects. So what is needed is a way to have both. It seems the reason why Ctrl-C does not kill the remote jobs is because the shell does not propagate the hang-up signal from B. But when B dies, the parent of the login shell becomes B (process id 1). So by exec'ing a Perl wrapper to monitor the parent pid and kill the child if the parent pid becomes 1, then Ctrl-C works and stderr is kept on stderr. The wrapper looks like this: $SIG{CHLD} = sub { $done = 1; }; $pid = fork; unless($pid) { # Make own process group to be able to kill HUP it later setpgrp; exec $ENV{SHELL}, "-c", ($bashfunc."@ARGV"); die "exec: $!\n"; } do { # Parent is not init (ppid=1), so sshd is alive # Exponential sleep up to 1 sec $s = $s < 1 ? 0.001 + $s * 1.03 : $s; select(undef, undef, undef, $s); } until ($done || getppid == 1); # Kill HUP the process group if job not done kill(SIGHUP, -${pid}) unless $done; wait; exit ($?&127 ? 128+($?&127) : 1+$?>>8) =head2 Transferring of variables and functions Transferring of variables and functions given by B<--env> is done by running a Perl script remotely that calls the actual command. The Perl script sets $ENV{variable} to the correct value before exec'ing the a shell that runs the function definition followed by the actual command. B (mentioned in the man page) copies the full current environment into the environment variable B. This variable is picked up by GNU B and used to create the Perl script mentioned above. =head2 Base64 encoded bzip2 B limits words of commands to 1024 chars. This is often too little when GNU B encodes environment variables and wraps the command with different templates. All of these are combined and quoted into one single word, which often is longer than 1024 chars. When the line to run is > 1000 chars, GNU B therefore encodes the line to run. The encoding Bs the line to run, converts this to base64, splits the base64 into 1000 char blocks (so B does not fail), and prepends it with this Perl script that decodes, decompresses and Bs the line. @GNU_Parallel=("use","IPC::Open3;","use","MIME::Base64"); eval "@GNU_Parallel"; $SIG{CHLD}="IGNORE"; # Search for bzip2. Not found => use default path my $zip = (grep { -x $_ } "/usr/local/bin/bzip2")[0] || "bzip2"; # $in = stdin on $zip, $out = stdout from $zip my($in, $out,$eval); open3($in,$out,">&STDERR",$zip,"-dc"); if(my $perlpid = fork) { close $in; $eval = join "", <$out>; close $out; } else { close $out; # Pipe decoded base64 into 'bzip2 -dc' print $in (decode_base64(join"",@ARGV)); close $in; exit; } wait; eval $eval; Perl and B must be installed on the remote system, but a small test showed that B is installed by default on all platforms that runs GNU B, so this is not a big problem. The added bonus of this is that much bigger environments can now be transferred as they will be below B's limit of 131072 chars. =head2 Which shell to use Different shells behave differently. A command that works in B may not work in B. It is therefore important that the correct shell is used when GNU B executes commands. GNU B tries hard to use the right shell. If GNU B is called from B it will use B. If it is called from B it will use B. It does this by looking at the (grand*)parent process: If the (grand*)parent process is a shell, use this shell; otherwise look at the parent of this (grand*)parent. If none of the (grand*)parents are shells, then $SHELL is used. This will do the right thing if called from: =over 2 =item * an interactive shell =item * a shell script =item * a Perl script in `` or using B if called as a single string. =back While these cover most cases, there are situations where it will fail: =over 2 =item * When run using B. =item * When run as the last command using B<-c> from another shell (because some shells use B): zsh% bash -c "parallel 'echo {} is not run in bash; set | grep BASH_VERSION' ::: This" You can work around that by appending '&& true': zsh% bash -c "parallel 'echo {} is run in bash; set | grep BASH_VERSION' ::: This && true" =item * When run in a Perl script using B with parallel as the first string: #!/usr/bin/perl system("parallel",'setenv a {}; echo $a',":::",2); Here it depends on which shell is used to call the Perl script. If the Perl script is called from B it will work just fine, but if it is called from B it will fail, because the command B is not known to B. =back If GNU B guesses wrong in these situation, set the shell using $PARALLEL_SHELL. =head2 Quoting Quoting depends on the shell. For most shells \ is used for all special chars and ' is used for newline. Whether a char is special depends on the shell and the context. Luckily quoting a bit too many chars does not break things. It is fast, but had the distinct disadvantage that if a string needs to be quoted multiple times, the \'s double every time - increasing the string length exponentially. For B/B newline is quoted as \ followed by newline. For B everything is quoted using '. =head2 --pipepart vs. --pipe While B<--pipe> and B<--pipepart> look much the same to the user, they are implemented very differently. With B<--pipe> GNU B reads the blocks from standard input (stdin), which is then given to the command on standard input (stdin); so every block is being processed by GNU B itself. This is the reason why B<--pipe> maxes out at around 100 MB/sec. B<--pipepart>, on the other hand, first identifies at which byte positions blocks start and how long they are. It does that by seeking into the file by the size of a block and then reading until it meets end of a block. The seeking explains why GNU B does not know the line number and why B<-L/-l> and B<-N> do not work. With a reasonable block and file size this seeking is often more than 1000 faster than reading the full file. The byte positions are then given to a small script that reads from position X to Y and sends output to standard output (stdout). This small script is prepended to the command and the full command is executed just as if GNU B had been in its normal mode. The script looks like this: < file perl -e 'while(@ARGV) { sysseek(STDIN,shift,0) || die; $left = shift; while($read = sysread(STDIN,$buf, ($left > 32768 ? 32768 : $left))){ $left -= $read; syswrite(STDOUT,$buf); } }' startbyte length_in_bytes It delivers 1 GB/s per core. Instead of the script B
was tried, but many versions of B
do not support reading from one byte to another and might cause partial data. See this for a surprising example: yes | dd bs=1024k count=10 | wc =head2 --jobs and --onall When running the same commands on many servers what should B<--jobs> signify? Is it the number of servers to run on in parallel? Is it the number of jobs run in parallel on each server? GNU B lets B<--jobs> represent the number of servers to run on in parallel. This is to make it possible to run a sequence of commands (that cannot be parallelized) on each server, but run the same sequence on multiple servers. =head2 Buffering on disk GNU B buffers on disk in $TMPDIR using files, that are removed as soon as they are created, but which are kept open. So even if GNU B is killed by a power outage, there will be no files to clean up afterwards. Another advantage is that the file system is aware that these files will be lost in case of a crash, so it does not need to sync them to disk. It gives the odd situation that a disk can be fully used, but there are no visible files on it. =head2 Disk full GNU B buffers on disk. If the disk is full data may be lost. To check if the disk is full GNU B writes a 8193 byte file when a job finishes. If this file is written successfully, it is removed immediately. If it is not written successfully, the disk is full. The size 8193 was chosen because 8192 gave wrong result on some file systems, whereas 8193 did the correct thing on all tested filesystems. =head2 Perl replacement strings, {= =}, and --rpl The shorthands for replacement strings make a command look more cryptic. Different users will need different replacement strings. Instead of inventing more shorthands you get more more flexible replacement strings if they can be programmed by the user. The language Perl was chosen because GNU B is written in Perl and it was easy and reasonably fast to run the code given by the user. If a user needs the same programmed replacement string again and again, the user may want to make his own shorthand for it. This is what B<--rpl> is for. It works so well, that even GNU B's own shorthands are implemented using B<--rpl>. In Perl code the bigrams {= and =} rarely exist. They look like a matching pair and can be entered on all keyboards. This made them good candidates for enclosing the Perl expression in the replacement strings. Another candidate ,, and ,, was rejected because they do not look like a matching pair. B<--parens> was made, so that the users can still use ,, and ,, if they like: B<--parens ,,,,> Internally, however, the {= and =} are replaced by \257< and \257>. This is to make it simple to make regular expressions: \257 is disallowed on the command line, so when that is matched in a regular expression, it is known that this is a replacement string. =head2 Test suite GNU B uses its own testing framework. This is mostly due to historical reasons. It deals reasonably well with tests that are dependent on how long a given test runs (e.g. more than 10 secs is a pass, but less is a fail). It parallelizes most tests, but it is easy to force a test to run as the single test (which may be important for timing issues). It deals reasonably well with tests that fail intermittently. It detects which tests failed and pushes these to the top, so when running the test suite again, the tests that failed most recently are run first. If GNU B should adopt a real testing framework then those elements would be important. Since many tests are dependent on which hardware it is running on, these tests break when run on a different hardware than what the test was written for. When most bugs are fixed a test is added, so this bug will not reappear. It is, however, sometimes hard to create the environment in which the bug shows up - especially if the bug only shows up sometimes. One of the harder problems was to make a machine start swapping without forcing it to its knees. =head2 Median run time Using a percentage for B<--timeout> causes GNU B to compute the median run time of a job. The median is a better indicator of the expected run time than average, because there will often be outliers taking way longer than the normal run time. To avoid keeping all run times in memory, an implementation of remedian was made (Rousseeuw et al). =head2 Error messages and warnings Error messages like: ERROR, Not found, and 42 are not very helpful. GNU B strives to inform the user: =over 2 =item * What went wrong? =item * Why did it go wrong? =item * What can be done about it? =back Unfortunately it is not always possible to predict the root cause of the error. =head2 Computation of load Contrary to the obvious B<--load> does not use load average. This is due to load average rising too slowly. Instead it uses B to list the number of jobs in running or blocked state (state D, O or R). This gives an instant load. As remote calculation of load can be slow, a process is spawned to run B and put the result in a file, which is then used next time. =head2 Killing jobs B<--memfree>, B<--halt> and when GNU B meets a condition from which it cannot recover, jobs are killed. This is done by finding the (grand)*children of the jobs and killing those processes. More specifically GNU B maintains a list of processes to be killed, sends a signal to all processes (first round this is a TERM). It weeds out the processes that exited from the list then waits a while and weeds out again. It does that until all processes are dead or 200 ms passed. Then it does another round with TERM, and finally a round with KILL. pids = family_pids(jobs) for signal in TERM, TERM, KILL: for pid in pids: kill signal, pid while kill 0, pids and slept < 200 ms: sleep sleeptime pids = kill 0, pids slept += sleeptime sleeptime = sleeptime * 1.1 By doing so there is a tiny risk, that GNU B will kill processes that are not started from GNU B. It, however, requires all of these to be true: * Process A is sent a signal * It dies during a I cycle * A new process B is spawned (by an unrelated process) * This is done during the same I cycle * B is owned by the same user * B reuses the pid of the A It is considered unlikely to ever happen due to: * The longest I sleeps is 10 ms * Re-use of a dead pid rarely happens within a few seconds =head1 Ideas for new design =head2 Multiple processes working together Open3 is slow. Printing is slow. It would be good if they did not tie up ressources, but were run in separate threads. =head2 Transferring of variables and functions from zsh Transferring Bash functions to remote zsh works. Can parallel_bash_environment be used to import zsh functions? =head2 --rrs on remote using a perl wrapper ... | perl -pe '$/=$recend$recstart;BEGIN{ if(substr($_) eq $recstart) substr($_)="" } eof and substr($_) eq $recend) substr($_)="" It ought to be possible to write a filter that removed rec sep on the fly instead of inside GNU B. This could then use more cpus. Will that require 2x record size memory? Will that require 2x block size memory? =head1 Historical decisions =head2 --tollef You can read about the history of GNU B on https://www.gnu.org/software/parallel/history.html B<--tollef> was included to make GNU B switch compatible with the parallel from moreutils (which is made by Tollef Fog Heen). This was done so that users of that parallel easily could port their use to GNU B: Simply set B and that would be it. But several distributions chose to make B<--tollef> global (by putting it into /etc/parallel/config), and that caused much confusion when people tried out the examples from GNU B's man page and these did not work. The users became frustrated because the distribution did not make it clear to them that it has made B<--tollef> global. So to lessen the frustration and the resulting support, B<--tollef> was obsoleted 20130222 and removed one year later. =head2 Transferring of variables and functions Until 20150122 variables and functions were transferred by looking at $SHELL to see whether the shell was a B<*csh> shell. If so the variables would be set using B. Otherwise they would be set using B<=>. The caused the content of the variable to be repeated: echo $SHELL | grep "/t\{0,1\}csh" > /dev/null && setenv VAR foo || export VAR=foo =cut