parallel/doc/promo

469 lines
16 KiB
Plaintext
Raw Normal View History

# SPDX-FileCopyrightText: 2021-2022 Ole Tange, http://ole.tange.dk and Free Software and Foundation, Inc.
2021-04-22 16:20:41 +00:00
# SPDX-License-Identifier: GFDL-1.3-or-later
# SPDX-License-Identifier: CC-BY-SA-4.0
=head1 GNU Parallel 10 year anniversery - 2020-04-22
2019-04-21 18:16:33 +00:00
Git log entry 2010-04-22:
"""
Author: Ole Tange <ole@tange.dk>
Date: Thu Apr 22 01:23:00 2010 +0200
Name change: Parallel is now GNU Parallel.
Basic structure for sshlogin and sshloginfile.
"""
Wow. It has been 10 years since my parallel program was officially
renamed GNU Parallel. It has been quite a ride.
2019-04-21 18:16:33 +00:00
Not all software is maintained for 10 full years, so it is a probably
a good time to take stock.
=head2 The design
The user interface of GNU Parallel has changed very little during the
last 10 years. In total around 10 things have changed in a way that
was not backwards compatible - most of them corner cases that very few
use.
=head2 Videos
In 2010 one of the competitors was PPSS. My colleague, Hans Schou,
saw louwrentius' video showing off PPSS
(https://www.youtube.com/watch?v=32PwsARbePw) and nudged me to make my
own videos and most of the information in those still applies to the
newest version.
=head2 Complete rewrite
Before GNU Parallel was a GNU tool, it started as a wrapper around
`make -j`. But GNU Parallel grew, and was no longer just a small
hack.
The design goals included not requiring a compiler, compatibility with
old operating systems, and single file program. This limited the
languages tremendously.
Perl and Python were in practice the only possibilities. Python was at
the time quite slow, ressource hungry, and not as widely installed as
Perl. So Perl was the choice.
To make the code easier to maintain it was rewritten to object
orientation. This would not have been possible if the test suite had
not been so thorough: It made it much easier to see if a code change
cause change in behaviour.
=head2 --tollef
Tollef's parallel from moreutils was a headache: Before Tollef's
parallel was adopted by moreutils I tried getting Parallel adopted in
moreutils. So it was a bit of a disappointment seeing another program
called exactly the same included some months later.
--tollef was added to make GNU Parallel compatible with Tollef's
parallel, so that if you depended on Tollef's parallel, then you could
drop in GNU Parallel as a replacement.
I honestly don't think anyone used this. Ever. But it silenced the
argument that GNU Parallel would break existing usage.
Unfortunately distributions enabled --tollef by default and did not
stress this to the user. So users experienced no end of frustration
when the examples from GNU Parallel's man page did not work.
moreutils is now generally packaged with Tollef's parallel split off
into a separate package, and the frustration seems to be lower today.
=head2 GNU Paralel on NASA Pleiades supercomputer
In 2013 I stumbled on a happy surprise: NASA seemed to have installed
GNU Parallel on their Pleiades supercomputer.
https://web.archive.org/web/20130221072030/https://www.nas.nasa.gov/hecc/support/kb/using-gnu-parallel-to-package-multiple-jobs-in-a-single-pbs-job_303.html
"""On Pleiades, a copy of GNU parallel is available under /usr/bin."""
Pleiades was 16th on top500.org in 2013.
I have the feeling that GNU Parallel is also used on some of the
bigger supercomputers, but I have found no confirmation of that.
=head2 GNU Parallel on Termux and OpenWRT
At the other end of the system size is Termux on Android and OpenWRT
for accesspoints. GNU Parallel runs on both of them, and while I can
see why you might run GNU Parallel on an access point I still do not
know why you would do it on an Android device.
It is still cool that it can be done at all.
=head2 Attack on funding
A sad chapter is the attack on the funding of GNU Parallel.
2019-04-21 18:16:33 +00:00
You would think such an attack would come from non-free competitors,
but this attack was from packagers that packaged GNU Parallel. Didier
Raboud <odyx@debian.org> (Debian and by inheritance Ubuntu), Andreas
Stieger <astieger@suse.com> (SuSE) and Johannes Löthberg
<johannes@kyriasis.com> (Arch). These are all people sitting in jobs,
where they are paid to use free software, and you would think they
would understand the importance of getting paid.
<<Check before sending
https://www.linkedin.com/in/didierraboud/
https://www.linkedin.com/in/andreas-stieger-74b68a72/
https://www.linkedin.com/in/johanneslothberg?originalSubdomain=se
Maintainer today:
Jan Engelhardt <jengelh@inai.de>
<<debian>>
<johannes@kyriasis.com>
>>
GNU Parallel is funded by me having a job. It is easier to get a paid
job that will allow for maintaining GNU Parallel if GNU Parallel is
cited, because that proves the tool is useful for serious work. This
is especially true in academia.
I saw GNU Parallel being used in scientific articles, which was great,
but without being cited, which was not ideal. So we discussed on the
email list how to make users aware that citing is how GNU Parallel is
financed and why this is important.
So it was decided to make a notice similar to a do-show-this-again box
known from e.g. Firefox. The notice could be silenced in less than 10
seconds.
2019-04-21 18:16:33 +00:00
Unfortunately in a misguided act of short term gain in popularity
SuSE, Debian, and Arch did a disservice to free software and disabled
this notice in the version they currently distribute.
As GNU Parallel is free software they are allowed to fork the
software, but only if they make sure the forked version cannot be
mistaken for GNU Parallel. We have court cases showing this is the
2019-04-21 18:16:33 +00:00
case, but still Didier Raboud <odyx@debian.org> (Debian and by
inheritance Ubuntu), Andreas Stieger <astieger@suse.com> (SUSE) and
Johannes Löthberg <johannes@kyriasis.com> (Arch) refuse to back down,
so the problem still not resolved.
If you would like to see GNU Parallel maintained in the future, please
2019-04-21 18:16:33 +00:00
help by raising this issue with SUSE (current maintainer: Jan
Engelhardt <jengelh@inai.de>), Debian/Ubuntu (current maintainer:
<<>>), and Arch (current maintainer: Johannes Löthberg
<johannes@kyriasis.com>). Their current stance hurts free software by
making it harder to justify spending time on maintaining GNU
Parallel. Not having GNU Parallel distributed by Debian/Ubuntu, Arch,
and SUSE is actually preferable to the current situation, though, the
best outcome would be if they distributed the non-modified version.
For users who are unwiling to spend the 10 seconds on silencing the
notice there is an easy solution: "Don't like it? Don't use it." A
considerable amount of time has been spent on mapping the
alternatives, so there is really no excuse. See `man
parallel_alternatives`.
=head2 The GNU Parallel 2018 book
Hans Schou teased me by calling the man page "the book". In 2018 I
took the consequence of that and wrote a book. The book is available
online (https://doi.org/10.5281/zenodo.1146014) and in print
(http://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html).
=head2 Cheatsheet
A lot of hours has been put into documentation, but the problem with
having a lot of documentation is that is can make some people think
the program is hard to use giving rise to the myth that "You have to
read a full book to be able to use GNU Parallel".
Several people noted that GNU Parallel was missing a cheat sheet. So
in 2019 a one page cheat sheet was included in the package.
2019-04-21 18:16:33 +00:00
=head2 Why so many options?
GNU B<parallel> has a lot of options. A good part of them could be
written as wrapper scripts.
Normally it would not be in the UNIX philosophy to merge the wrapper
scripts into the tool itself. But experience showed that people could
not write these wrapper scripts without bugs.
By having them part of GNU B<parallel> the code will be tested by more
people and will in general be of better quality than code that was
just thrown together in a couple of hours.
An example of this is B<--nice>: For running local jobs the option is
not needed, you simply B<nice> everything:
nice parallel ...
But as soon as you run composed commands on remote systems,
it becomes much harder:
parallel -S .. nice bash -c "'cmd1; cmd2'"
When you combine that with other wrapper scripts (such as
B<--return>), it quickly becomes tricky to get right for all cases.
=head2 Convenience options --nice --basefile --transfer --return
--cleanup --tmux --group --compress --cat --fifo --workdir --tag
2019-04-21 18:16:33 +00:00
=head2 The May 1st incident
I was at a May 1st event for computer professionals where I sat at a
long table opposite a guy. At some point the discussion turned to
parallelism.
"I have found the brilliant program," he said. "It does everything if
you want to parallelize."
The more he explained the more certain was I that I knew this program
quite intimately.
"And it is written by a Dane," he said excited.
2019-04-21 18:16:33 +00:00
"Oh. Are you aware that the author is sitting on my side of the
table?"
We were the only ones sitting at the table, but we had had a few
beers, so it took a while before it dawned to him, who I was.
=head2 Underappreciated functionality
2019-04-21 18:16:33 +00:00
There is some functionality I am particularly proud of, but which is
currently not in wide use.
=head3 env_parallel
When I was shown you could encode variables into a single variable and
move that to a remote system I was intrigued. But why stop at
variables? Why not include aliases, functions, and arrays?
env_parallel started out as a technical challenge: How much can be
copied transparently?
But it quickly got a more practical side: Why should you not be able
to use the variables, aliases and functions defined on the local
system just because you want to run jobs on a remote system?
=head3 parset
Some of GNU Parallel functionality is inspired by other people
problems: How could this problem be solved in general?
2019-04-21 18:16:33 +00:00
B<parset> is one of those. It was inpired by a user who needed the
output from different jobs to be stored in different variables. The
jobs were slow and could be run in parallel. So while the running of
the jobs were clearly a task for GNU Parallel, the storing in
variables was not so clear.
It was fairly easy to code something that would work if the output was
a single line with no spaces, but GNU Parallel tries hard not to set
artifical limits: It is much preferable to a bit slower if the outcome
is predictable - whether the output is a single word or some binary
data.
=head3 --embed
2019-04-21 18:16:33 +00:00
Some of the functionality is inspired by other tools. B<--embed> is one
of those.
2019-04-21 18:16:33 +00:00
B<--embed> was inspired by Lesser Parallel that in turn was inspired
by GNU Parallel. The major feature of Lesser Parallel is to be
embedded in any bash script. The developer will embed the code into
his own bash script and distribute this script.
2019-04-21 18:16:33 +00:00
So with B<--embed> the users of the script will not have to install
GNU Parallel to run it.
=head3 --pipepart with --fifo
=head3 --bar
2019-04-21 18:16:33 +00:00
I see people using B<--bar> too rarely. It is one of the easiest ways
to get a visual representation of when all the jobs are expected done.
=head3 Combining ::: with :::+
=head3 --rpl with dynamic replacement strings
=head3 --results with replacement strings
=head3 --tagstring with replacement strings
=head2 Feedback
Best ever
2019-04-21 18:16:33 +00:00
=head2 Vitality
On average there has been a new release of GNU Parallel every month
since 2010-04-24.
In the autumn of 2010 Henrik Sandklef teased me that he knew when the
next release would be. GNU Parallel just happened to have been
released twice in the 22nd, so he assumed the next release would also
2019-04-21 18:16:33 +00:00
be on the 22nd. And why not? A few releases were not in line with this
rule, but since 2011 there has been a release every month around the
22nd.
The fixed release cycle means there has been more than 100 releases
making GNU Parallel in the top 5 of GNU tools with the most releases.
=head3 Naming releases
At the presenatation at FOSDEM (20110205) I found it might be fun to
give each release code name, so this release was named FOSDEM. After
the Japan release a naming convention started to emerge. And since
then each release has had a name related to an event in the past
month.
I will be honest: Some releases were easier to name than others.
Since the events are not always happy events, the names have now and
then stirred a bit of controversy. But if you want happier names, go
make a happier world :)
2019-04-21 18:16:33 +00:00
=head3 Competitors
Apart from xargs no competitor has had the strength to live for 10
years. And even xargs has not had a steady release cycle with a new
release every month.
=head2 The next 10 years
Parallization has come to stay, and there are a lot of competitors to
GNU Parallel that do specialized tasks better. But I have a feeling
that there is room for a generalized tool like GNU Parallel also in 10
years.
=head1 top photos
http://www.flickr.com/photos/dexxus/5499821986/in/photostream/
https://www.google.com/search?lr=&safe=images&hl=en&tbs=sur:fmc&tbm=isch&q=top+nature+photos&revid=600471240&biw=1024&bih=569
=head1 What is GNU Parallel used for
Searching for transit planets using data from the Kepler space telescope.
Searching 1700 genomes for 1000-10000 protein sequences using Amazon
EC2 compute cloud.
Processing Earth Observation data from satellites to grep for pieces
of information.
Running tons of simulations of granular materials.
Converting formats of movie frames in the film industry.
Computational fluid dynamics. Numerical simulation of the compressible
Navier-Stokes equations.
Analysing data and running simulations for searching for the Higgs
boson at the Tevatron.
=head1 search terms
run commands in parallel
Parallel shell loops
multi threading in bash xargs
# TAGS: parallel | parallel processing | multicore | multiprocessor | Clustering/Distributed Networks
# job control | multiple jobs | parallelization | text processing | cluster | filters
# Clustering Tools | Command Line Tools | Utilities | System Administration
# Bash parallel
GNU parallel execution shell bash script simultaneous concurrent linux
scripting run xargs ppss code.google.com/p/ppss/
@vvuksan @ychaker @ncb000gt
xargs can lead to nasty surprises caused by the separator problem
http://nd.gd/0t GNU Parallel http://nd.gd/0s may be better.
Comments:
2012-01-22 14:06:19 +00:00
http://pi.dk/0 https://www.gnu.org/software/parallel/
http://pi.dk/1 https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
http://pi.dk/2 https://savannah.gnu.org/news/?group=parallel
2012-01-22 14:06:19 +00:00
http://pi.dk/5 https://en.wikipedia.org/wiki/Xargs#The_separator_problem
http://pi.dk/6 https://www.gnu.org/software/parallel/man.html#differences_between_xargs_and_gnu_parallel
http://pi.dk/7 https://www.gnu.org/software/parallel/man.html#example__distributing_work_to_local_and_remote_computers
2012-01-22 14:06:19 +00:00
If you like xargs you may love GNU Parallel: http://pi.dk/1
With GNU Parallel (http://pi.dk/0) you can do:
ls | grep jpeg | parallel mv {} {.}.jpg
2012-01-22 14:06:19 +00:00
Watch the intro video for GNU Parallel: http://pi.dk/1
If your input file names are generated by users, you need to deal with
surprising file names containing space, ', or " in the filename.
xargs can give nasty surprises due to the separator problem
2012-01-22 14:06:19 +00:00
http://pi.dk/5
@jaylyerly @stevenf xargs will bite you if file names contain
2012-01-22 14:06:19 +00:00
space http://pi.dk/5. Use GNU Parallel instead: http://pi.dk/0
Please repay by spreading the word about GNU Parallel to your
contacts/blog/facebook/linkedin/mailing lists/user group
Your use of xargs can lead to nasty surprises because of the separator
problem http://en.wikipedia.org/wiki/Xargs#The_separator_problem
GNU Parallel http://www.gnu.org/software/parallel/ does not have that
problem.
If you have GNU Parallel http://www.gnu.org/software/parallel/ installed you can do this:
You can install GNU Parallel simply by:
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
Watch the intro videos for GNU Parallel to learn more:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
GNU Parallel also makes it possible to run small scripts. Try this:
ls *.zip | parallel 'mkdir {.}; cd {.}; unzip ../{}'