src/parallel: Better examples

2024-11-22 05:57:54 +00:00 · 2010-05-13 15:41:52 +02:00 · 65b073c7c4
parent d7be89d786
commit 65b073c7c4
1 changed file with 159 additions and 71 deletions
--- a/src/parallel
+++ b/src/parallel
@@ -11,16 +11,17 @@ B<parallel> [-0cdEfghiIkmnpqrtuUvVX] [B<-I> str] [B<-j> num] [--silent]

 =head1 DESCRIPTION

-GNU B<parallel> is a shell tool for executing jobs in parallel. A job is
-typically a single command or a small script that has to be run for
+GNU B<parallel> is a shell tool for executing jobs in parallel. A job
+is typically a single command or a small script that has to be run for
 each of the lines in the input. The typical input is a list of files,
-a list of hosts, a list of users, or a list of tables.
+a list of hosts, a list of users, a list of URLs, or a list of tables.

 If you use B<xargs> today you will find GNU B<parallel> very easy to
-use. If you write loops in shell, you will find GNU B<parallel> may be
-able to replace most of the loops and make them run faster by running
-jobs in parallel. If you use B<ppss> or B<pexec> you will find GNU
-B<parallel> will often make the command easier to read.
+use as GNU B<parallel> is written to have the same options as
+B<xargs>. If you write loops in shell, you will find GNU B<parallel>
+may be able to replace most of the loops and make them run faster by
+running several jobs in parallel. If you use B<ppss> or B<pexec> you will find
+GNU B<parallel> will often make the command easier to read.

 GNU B<parallel> makes sure output from the commands is the same output as
 you would get had you run the commands sequentially. This makes it
@@ -168,9 +169,9 @@ B<-g> is the default. Can be reversed with B<-u>.
 Print a summary of the options to GNU B<parallel> and exit.


-=item B<-I> I<string>
+=item B<-I> I<replace-str>

-Use the replacement string I<string> instead of {}.
+Use the replacement string I<replace-str> instead of {}.


 =item B<--replace>[=I<replace-str>]
@@ -439,11 +440,11 @@ Ungroup output.  Output is printed as soon as possible. This may cause
 output from different commands to be mixed. Can be reversed with B<-g>.


-=item B<--extensionreplace> I<string>
+=item B<--extensionreplace> I<replace-str>

-=item B<-U> I<string>
+=item B<-U> I<replace-str>

-Use the replacement string I<string> instead of {.} for input line without extension.
+Use the replacement string I<replace-str> instead of {.} for input line without extension.


 =item B<--use-cpus-instead-of-cores> (not implemented)
@@ -453,7 +454,7 @@ jobs to run in parallel relative to the number of cores you can ask
 GNU B<parallel> to instead look at the number of CPUs. This will make sense
 for computers that have hyperthreading as two jobs running on one CPU
 with hyperthreading will run slower than two jobs running on two CPUs.
-Normal users will not need this option.
+Most users will not need this option.


 =item B<-v>
@@ -473,56 +474,70 @@ Print the version GNU B<parallel> and exit.

 =item B<-m>

-Multiple. Insert as many arguments as the command line length permits. If
-{} is not used the arguments will be appended to the line.  If {} is
-used multiple times each {} will be replaced with all the arguments.
+Multiple. Insert as many arguments as the command line length
+permits. If {} is not used the arguments will be appended to the line.
+If {} is used multiple times each {} will be replaced with all the
+arguments.


 =item B<-X>

 xargs with context replace. This works like B<-m> except if {} is part
-of a word (like I<pic{}.jpg>) then the whole word will be repeated.
+of a word (like I<pic{}.jpg>) then the whole word will be
+repeated. Normally B<-X> will do the right thing, whereas B<-m> can
+give surprising results if {} is used as part of a word.

 =back

-=head1 EXAMPLE 1: Working as cat | sh. Ressource inexpensive jobs and evaluation
-
-GNU B<parallel> can work similar to B<cat | sh>. 
-
-A ressource inexpensive job is a job that takes very little CPU, disk
-I/O and network I/O. Ping is an example of a ressource inexpensive
-job. wget is too - if the webpages are small.
-
-The content of the file jobs_to_run:
-
-  ping -c 1 10.0.0.1
-  wget http://status-server/status.cgi?ip=10.0.0.1
-  ping -c 1 10.0.0.2
-  wget http://status-server/status.cgi?ip=10.0.0.2
-  ...
-  ping -c 1 10.0.0.255
-  wget http://status-server/status.cgi?ip=10.0.0.255
-
-To run 100 processes simultaneously do:
-
-B<parallel -j 100 < jobs_to_run>
-
-As there is not a B<command> the option B<-c> is default because the
-jobs needs to be evaluated by the shell.
-
-=head1 EXAMPLE 2: Working as xargs -n1. Argument appending
+=head1 EXAMPLE: Working as xargs -n1. Argument appending

 GNU B<parallel> can work similar to B<xargs -n1>.

-To output all html files run:
+To compress all html files using B<gzip> run:

-B<find . -name '*.html' | parallel cat>
+B<find . -name '*.html' | parallel gzip>

-As there is a B<command> the option B<-f> is default because the
-filenames needs to be protected from the shell in case a filename
-contains special characters.

-=head1 EXAMPLE 3: Compute intensive jobs and substitution
+=head1 EXAMPLE: Inserting multiple arguments
+
+When moving a lot of files like this: B<mv * destdir> you will
+sometimes get the error:
+
+B<bash: /bin/mv: Argument list too long>
+
+because there are too many files. You can instead do:
+
+B<ls | parallel mv {} destdir>
+
+This will run B<mv> for each file. It can be done faster if B<mv> gets
+as many arguments as will fit on the line:
+
+B<ls | parallel -m mv {} destdir>
+
+
+=head1 EXAMPLE: Context replace
+
+To remove the files I<pict0000.jpg> .. I<pict9999.jpg> you could do:
+
+B<seq -f %04g 0 9999 | parallel rm pict{}.jpg>
+
+You could also do:
+
+B<seq -f %04g 0 9999 | perl -pe 's/(.*)/pict$1.jpg/' | parallel -m rm>
+
+The first will run B<rm> 10000 times, while the last will only run
+B<rm> as many times as needed to keep the command line length short
+enough to avoid B<Argument list too long> (it typically runs 1-2 times).
+
+You could also run:
+
+B<seq -f %04g 0 9999 | parallel -X rm pict{}.jpg>
+
+This will also only run B<rm> as many times as needed to keep the command
+line length short enough.
+
+
+=head1 EXAMPLE: Compute intensive jobs and substitution

 If ImageMagick is installed this will generate a thumbnail of a jpg
 file:
@@ -541,27 +556,31 @@ B<find . -name '*.jpg' | parallel -j +0 convert -geometry 120 {} {}_thumb.jpg>

 Notice how the argument has to start with {} as {} will include path
 (e.g. running B<convert -geometry 120 ./foo/bar.jpg
-thumb_./foo/bar.jpg> would clearly be wrong). It will result in files
-like ./foo/bar.jpg_thumb.jpg.
+thumb_./foo/bar.jpg> would clearly be wrong). The command will
+generate files like ./foo/bar.jpg_thumb.jpg.

-This will make files like ./foo/bar_thumb.jpg:
+Use B<{.}> to avoid the extra .jpg in the file name. This command will
+make files like ./foo/bar_thumb.jpg:

 B<find . -name '*.jpg' | parallel -j +0 convert -geometry 120 {} {.}_thumb.jpg>

-=head1 EXAMPLE 4: Substitution and redirection

-This will compare all files in the dir to the file foo and save the
-diffs in corresponding .diff files:
+=head1 EXAMPLE: Substitution and redirection

-B<ls | parallel diff {} foo ">>B<"{}.diff>
+This will generate an uncompressed version of .gz-files next to the .gz-file:
+
+B<ls *.gz | parallel zcat {} ">>B<"{.}>

 Quoting of > is necessary to postpone the redirection. Another
 solution is to quote the whole command:

-B<ls | parallel "diff {} foo >>B<{}.diff">
+B<ls *.gz | parallel "zcat {} >>B<{.}">

+Other special shell characters (such as * ; $ > < | >> <<) also need
+to be put in quotes, as they may otherwise be interpreted by the shell
+and not given to GNU B<parallel>.

-=head1 EXAMPLE 5: Composed commands
+=head1 EXAMPLE: Composed commands

 A job can consist of several commands. This will print the number of
 files in each directory:
@@ -573,28 +592,61 @@ To put the output in a file called <name>.dir:
 B<ls | parallel '(echo -n {}" "; ls {}|wc -l) >> B<{}.dir'>


-=head1 EXAMPLE 6: Context replace
+=head1 EXAMPLE: Removing file extension when processing files

-To remove the files I<pict0000.jpg> .. I<pict9999.jpg> you could do:
+When processing files, removing the file extension using {.} is often
+useful.

-B<seq -f %04g 0 9999 | parallel rm pict{}.jpg>
+Create a directory for each zip-file and unzip it in that dir:

-You could also do:
+B<ls *zip | parallel 'mkdir {.}; cd {.}; unzip ../{}'>

-B<seq -f %04g 0 9999 | perl -pe 's/(.*)/pict$1.jpg/' | parallel -m rm>
+Recompress all .gz files in current directory using B<bzip2> running 1
+job per CPU in parallel:

-The first will run B<rm> 10000 times, while the last will only run
-B<rm> as many times needed to keep the command line length short
-enough (typically 1-2 times).
+B<ls *.gz | parallel -j+0 "zcat {} | bzip2 >>B<{.}.bz2 && rm {}">

-You could also run:

-B<seq -f %04g 0 9999 | parallel -X rm pict{}.jpg>
+=head1 EXAMPLE: Rewriting a for-loop and a while-loop

-This will also only run B<rm> as many times needed to keep the command
-line length short enough.
+for-loops like this:

-=head1 EXAMPLE 7: Group output lines
+B<  (for x in `cat list` ; do
+    do_something $x
+  done) | process_output>
+
+and while-loops like this:
+
+B<  cat list | (while read x ; do
+    do_something $x
+  done) | process_output>
+
+can be written like this:
+
+B<cat list | parallel do_something | process_output>
+
+If the processing requires more steps, for-loops like this:
+
+B< (for x in `cat list` ; do
+   no_extension=${x%.png};
+   do_something $x scale $no_extension.jpg
+   do_step2 <$x $no_extension
+ done) | process_output>
+
+and while-loops like this:
+
+B<  cat list | (while read x ; do
+   no_extension=${x%.png};
+   do_something $x scale $no_extension.jpg
+   do_step2 <$x $no_extension
+ done) | process_output>
+
+can be written like this:
+
+B<cat list | parallel "do_something {} scale {.}.jpg ; do_step2 <{} {.}" | process_output>
+
+
+=head1 EXAMPLE: Group output lines

 When runnning jobs that output data, you often do not want the output
 of multiple jobs to run together. GNU B<parallel> defaults to grouping the
@@ -611,14 +663,23 @@ to the output of:
 B<(echo foss.org.my; echo debian.org; echo freenetproject.org) | parallel -u traceroute>


-=head1 EXAMPLE 8: Keep order of output same as order of input
+=head1 EXAMPLE: Keep order of output same as order of input

 Normally the output of a job will be printed as soon as it
 completes. Sometimes you want the order of the output to remain the
-same as the order of the input. B<-k> will make sure the order of
+same as the order of the input. This is often important if the output
+is used as input for another system. B<-k> will make sure the order of
 output will be in the same order as input even if later jobs end
 before earlier jobs.

+Append a string to every line in a text file:
+
+B<cat textfile | parallel -k echo {} append_string>
+
+If you remove B<-k> some of the lines may come out in the wrong order.
+
+Another example is B<traceroute>:
+
 B<(echo foss.org.my; echo debian.org; echo freenetproject.org) | parallel traceroute>

 will give traceroute of foss.org.my, debian.org and
@@ -632,7 +693,7 @@ B<(echo foss.org.my; echo debian.org; echo freenetproject.org) | parallel -k tra
 This will make sure the traceroute to foss.org.my will be printed
 first.

-=head1 EXAMPLE 9: Using remote computers (not implemented)
+=head1 EXAMPLE: Using remote computers (not implemented)

 To run commands on a remote computer SSH needs to be set up and you
 must be able to login without entering a password (B<ssh-agent> may be
@@ -681,7 +742,7 @@ server has 8 CPU cores.
  seq 1 10 | parallel --sshlogin 8/server.example.com echo


-=head1 EXAMPLE 10: Transferring of files (not implemented)
+=head1 EXAMPLE: Transferring of files (not implemented)

 To recompress gzipped files with B<bzip2> using a remote server run:

@@ -745,6 +806,33 @@ With the file I<mymachines> containing the compute machines it becomes:
  find logs/ -name '*.gz' | parallel --sshloginfile mymachines \
    --trc {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2"

+
+=head1 EXAMPLE: Working as cat | sh. Resource inexpensive jobs and evaluation
+
+GNU B<parallel> can work similar to B<cat | sh>.
+
+A resource inexpensive job is a job that takes very little CPU, disk
+I/O and network I/O. Ping is an example of a resource inexpensive
+job. wget is too - if the webpages are small.
+
+The content of the file jobs_to_run:
+
+  ping -c 1 10.0.0.1
+  wget http://status-server/status.cgi?ip=10.0.0.1
+  ping -c 1 10.0.0.2
+  wget http://status-server/status.cgi?ip=10.0.0.2
+  ...
+  ping -c 1 10.0.0.255
+  wget http://status-server/status.cgi?ip=10.0.0.255
+
+To run 100 processes simultaneously do:
+
+B<parallel -j 100 < jobs_to_run>
+
+As there is no B<command>, the option B<-c> is the default because the
+jobs need to be evaluated by the shell.
+
+
 =head1 QUOTING

 For more advanced use quoting may be an issue. The following will
@@ -764,9 +852,9 @@ B<ls | parallel -q  perl -ne '/^\S+\s+\S+$/ and print $ARGV,"\n"'>
 However, this means you cannot make the shell interpret special
 characters. For example this B<will not work>:

-B<ls | parallel -q "diff {} foo >>B<{}.diff"> 
+B<ls *.gz | parallel -q "zcat {} >>B<{.}">

-B<ls | parallel -q "ls {} | wc -l">
+B<ls *.gz | parallel -q "zcat {} | bzip2 >>B<{.}.bz2">

 because > and | need to be interpreted by the shell.

@@ -808,7 +896,7 @@ should send the signal B<SIGTERM> to GNU B<parallel>:
 B<killall -TERM parallel>

 This will tell GNU B<parallel> to not start any new jobs, but wait until
-the currently running jobs are finished.
+the currently running jobs are finished before exiting.


 =head1 DIFFERENCES BETWEEN xargs/find -exec AND parallel
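
Note on the {.} replacement string used in several of the added examples: the plain-shell sketch below approximates what "input line without extension" means, which can help when checking which file names a command such as convert {} {.}_thumb.jpg would produce. The helper name strip_ext is hypothetical and not part of GNU parallel, and the sketch only mimics the documented behaviour under the assumption that the extension is the text after the last dot in the final path component.

```shell
#!/bin/sh
# strip_ext: hypothetical helper (not part of GNU parallel) that
# approximates {.} by removing the last extension from a path.
strip_ext() {
  case "${1##*/}" in
    # The final path component contains a dot: drop the last ".ext".
    *.*) printf '%s\n' "${1%.*}" ;;
    # No dot in the final component: leave the path untouched, even
    # if a directory name earlier in the path contains a dot.
    *)   printf '%s\n' "$1" ;;
  esac
}

# Sequential preview of the names that {.} would yield.
for f in ./foo/bar.jpg archive.tar.gz ./foo.d/bar; do
  printf '%s -> %s\n' "$f" "$(strip_ext "$f")"
done
# ./foo/bar.jpg -> ./foo/bar
# archive.tar.gz -> archive.tar
# ./foo.d/bar -> ./foo.d/bar
```

Only the last extension is removed, so archive.tar.gz becomes archive.tar; this matches the recompression example, where {.}.bz2 replaces .gz with .bz2.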