Implemented --halt-on-error.

Make exit status more consistent.
Ole Tange 2010-06-09 22:26:59 +02:00
parent 87b68365dd
commit 27f2829f05
7 changed files with 345 additions and 97 deletions


@ -10,28 +10,24 @@ Transfer scriptfile before first job. Remove it when last job done.
monitor to see which jobs are currently running
http://code.google.com/p/ppss/
Accept signal INT instead of TERM to complete current running jobs but
do not start new jobs. Print out the number of jobs waiting to
complete on STDERR. Accept sig INT again to kill now. This seems to be
hard, as all foreground processes get the INT from the shell.
If there are nomore jobs (STDIN is closed) then make sure to
distribute the arguments evenly if running -X.
Parallelize so this can be done:
mdm.screen find dir -execdir mdm-run cmd {} \;
Maybe:
find dir -execdir par$ --communication-file /tmp/comfile cmd {} \;
=head1 options
find dir -execdir mutex -j4 -b cmd {} \;
One char options not used: F G J K P Q Y
=head2 Comfile
Separator characters in sshlogin:
#=item B<--sshlogin> I<[ncpu/]sshlogin[,[ncpu/]sshlogin[,...]]> (beta testing)
# Separator characters:
# No: "#!&()?\<>|;*'~ shellspecial
# No: @.- part of user@i.p.n.r i.p.n.r host-name
# No: , separates different sshlogins
# No: space Will make it hard to do: 8/server1,server2
# Maybe: / 8//usr/bin/myssh,//usr/bin/ssh
# %/=:_^
This will put a lock on /tmp/comfile. The number of locks is the number of running commands.
If the number is smaller than -j then it will start a process in the background ( cmd & ),
otherwise wait.
par$ --wait /tmp/comfile will wait until no more locks on the file
=head2 mutex
@ -49,16 +45,27 @@ If -b given works like: mutex -l lockfile -n number_of_locks ; (command; mutex -
Can we come up with a lockid that makes sense?
=head1 options
Parallelize so this can be done:
mdm.screen find dir -execdir mdm-run cmd {} \;
Maybe:
find dir -execdir par$ --communication-file /tmp/comfile cmd {} \;
One char options not used: F G J K P Q Y
find dir -execdir mutex -j4 -b cmd {} \;
=head2 Comfile
This will put a lock on /tmp/comfile. The number of locks is the number of running commands.
If the number is smaller than -j then it will start a process in the background ( cmd & ),
otherwise wait.
par$ --wait /tmp/comfile will wait until no more locks on the file
=head1 Unlikely
Accept signal INT instead of TERM to complete current running jobs but
do not start new jobs. Print out the number of jobs waiting to
complete on STDERR. Accept sig INT again to kill now. This seems to be
hard, as all foreground processes get the INT from the shell.
Separator characters in sshlogin:
#=item B<--sshlogin> I<[ncpu/]sshlogin[,[ncpu/]sshlogin[,...]]> (beta testing)
# Separator characters:
# No: "#!&()?\<>|;*'~ shellspecial
# No: @.- part of user@i.p.n.r i.p.n.r host-name
# No: , separates different sshlogins
# No: space Will make it hard to do: 8/server1,server2
# Maybe: / 8//usr/bin/myssh,//usr/bin/ssh
# %/=:_^


@ -175,6 +175,29 @@ B<-g> is the default. Can be reversed with B<-u>.
Print a summary of the options to GNU B<parallel> and exit.
=item B<--halt-on-error> <0|1|2>
=item B<-H> <0|1|2>
=over 3
=item 0
Do not halt if a job fails. This is the default.
=item 1
Do not start new jobs if a job fails, but complete the running jobs
including cleanup. The exit status will be the exit status from the
last failing job.
=item 2
Kill off all jobs immediately and exit without cleanup. The exit
status will be the exit status from the failing job.
=back
=item B<-I> I<replace-str>
@ -958,6 +981,84 @@ This will tell GNU B<parallel> to not start any new jobs, but wait until
the currently running jobs are finished before exiting.
=head1 ENVIRONMENT VARIABLES
=over 9
=item $PARALLEL_PID - unimplemented
The environment variable $PARALLEL_PID is set by GNU B<parallel> and
is visible to the jobs started from GNU B<parallel>. This makes it
possible for the jobs to communicate directly to GNU B<parallel>.
B<Example:> If each of the jobs tests a solution and one of the jobs finds
the solution the job can tell GNU B<parallel> not to start more jobs
by: B<kill -TERM $PARALLEL_PID>. This only works on the local
computer.
=item $PARALLEL
The environment variable $PARALLEL will be used as default options for
GNU B<parallel>. However, because some options take arguments the
options need to be split into groups in which only the last option
takes an argument. Each group of options should be put on a line of its
own.
B<Example:>
B<cat list | parallel -j1 -k -v ls>
can be written as:
B<cat list | PARALLEL="-kvj1" parallel ls>
B<cat list | parallel -j1 -k -v -S"myssh user@server" ls>
can be written as:
B<cat list | PARALLEL="-kvj1>
B<-Smyssh user@server" parallel echo>
Notice the newline in the middle is needed because both B<-S> and
B<-j> take an argument and thus both need to be at the end of a group.
=back
=head1 INIT FILE (RC FILE)
The file ~/.parallelrc will be read if it exists. It should be
formatted like the environment variable $PARALLEL. Lines starting with
'#' will be ignored.
=head1 EXIT STATUS
If B<--halt-on-error> 0 or not specified:
=over 6
=item 0
All jobs ran without error.
=item 1-253
Some of the jobs failed. The exit status gives the number of failed jobs.
=item 254
More than 253 jobs failed.
=item 255
Other error.
=back
If B<--halt-on-error> 1 or 2: Exit status of the failing job.
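The exit status rule for --halt-on-error 0 can be sketched in shell. This is only an illustration, not code from GNU parallel: the function name summary_exit_status is hypothetical, and it assumes the exit codes of all finished jobs are passed as arguments. It counts the failed jobs and caps the result at 254, since 255 is reserved for other errors.

```shell
#!/bin/sh
# Hypothetical helper, not part of GNU parallel: given the exit codes of
# finished jobs as arguments, print the exit status described above for
# --halt-on-error 0.
summary_exit_status() {
    failed=0
    for code in "$@"; do
        # Any non-zero exit code counts as one failed job.
        if [ "$code" -ne 0 ]; then
            failed=$((failed + 1))
        fi
    done
    # Cap the count at 254; 255 is reserved for "other error".
    if [ "$failed" -gt 254 ]; then
        failed=254
    fi
    echo "$failed"
}

# Four jobs, two of which failed (exit codes 1 and 127).
summary_exit_status 0 1 0 127
```

With --halt-on-error 1 or 2 no counting takes place: the exit status is that of the failing job itself.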
=head1 DIFFERENCES BETWEEN GNU Parallel AND ALTERNATIVES
There are a lot of programs with some of the functionality of GNU
@ -1254,58 +1355,6 @@ B<seq 1 19 | parallel -j+0 buffon -o - | sort -n >>B< result>
B<cat files | parallel -j+0 cmd>
=head1 ENVIRONMENT VARIABLES
=over 9
=item $PARALLEL_PID - unimplemented
The environment variable $PARALLEL_PID is set by GNU B<parallel> and
is visible to the jobs started from GNU B<parallel>. This makes it
possible for the jobs to communicate directly to GNU B<parallel>.
B<Example:> If each of the jobs tests a solution and one of the jobs finds
the solution the job can tell GNU B<parallel> not to start more jobs
by: B<kill -TERM $PARALLEL_PID>. This only works on the local
computer.
=item $PARALLEL
The environment variable $PARALLEL will be used as default options for
GNU B<parallel>. However, because some options take arguments the
options need to be split into groups in which only the last option
takes an argument. Each group of options should be put on a line of its
own.
B<Example:>
B<cat list | parallel -j1 -k -v ls>
can be written as:
B<cat list | PARALLEL="-kvj1" parallel ls>
B<cat list | parallel -j1 -k -v -S"myssh user@server" ls>
can be written as:
B<cat list | PARALLEL="-kvj1>
B<-Smyssh user@server" parallel echo>
Notice the newline in the middle is needed because both B<-S> and
B<-j> take an argument and thus both need to be at the end of a group.
=back
=head1 INIT FILE (RC FILE)
The file ~/.parallelrc will be read if it exists. It should be
formatted like the environment variable $PARALLEL. Lines starting with
'#' will be ignored.
=head1 BUGS
Filenames beginning with '-' can cause some commands to give
@ -1467,6 +1516,11 @@ init_run_jobs();
start_more_jobs();
ReapIfNeeded();
drain_job_queue();
if($::opt_halt_on_error) {
exit $Global::halt_on_error_exitstatus;
} else {
exit(min($Global::exitstatus,254));
}
sub parse_options {
# Defaults:
@ -1485,6 +1539,8 @@ sub parse_options {
$Global::interactive = 0;
$Global::stderr_verbose = 0;
$Global::default_simultaneous_sshlogins = 9;
$Global::exitstatus = 0;
$Global::halt_on_error_exitstatus = 0;
Getopt::Long::Configure ("bundling","require_order");
# Add options from .parallelrc
@ -1526,6 +1582,7 @@ sub parse_options {
"trc=s" => \@::opt_trc,
"transfer" => \$::opt_transfer,
"cleanup" => \$::opt_cleanup,
"halt-on-error|H=s" => \$::opt_halt_on_error,
# xargs-compatibility - implemented, man, unittest
"max-procs|P=s" => \$::opt_P,
"delimiter|d=s" => \$::opt_d,
@ -1590,7 +1647,7 @@ sub parse_options {
if(defined $::opt_a) {
if(not open(ARGFILE,"<",$::opt_a)) {
print STDERR "$Global::progname: Cannot open input file `$::opt_a': No such file or directory\n";
exit(-1);
exit(255);
}
$Global::argfile = *ARGFILE;
}
@ -1922,7 +1979,7 @@ sub processes_available_by_system_limit {
# The child takes one process slot
# It will be killed later
sleep 100000;
exit;
exit(0);
} else {
$max_system_proc_reached = 1;
}
@ -2226,6 +2283,7 @@ sub min {
# Variable structure:
# $Global::running{$pid}{'seq'} = printsequence
# $Global::running{$pid}{sshlogin} = server to run on
# $Global::running{$pid}{'exitstatus'} = exit status
# $Global::host{$sshlogin}{'no_of_running'} = number of currently running jobs
# $Global::host{$sshlogin}{'ncpus'} = number of cpus
# $Global::host{$sshlogin}{'maxlength'} = max line length (currently buggy for remote)
@ -2288,7 +2346,11 @@ sub next_command_line_with_sshlogin {
$post .= "$sshcmd $serverlogin rm -f ".shell_quote($file).";";
}
}
return "$pre$sshcmd $serverlogin ".shell_quote($next_command_line)."; $post";
if($post) {
# We need to save the exit status of the job
$post = '_EXIT_status=$?; '.$post.' exit $_EXIT_status;';
}
return "$pre$sshcmd $serverlogin ".shell_quote($next_command_line).";".$post;
} else {
return $next_command_line;
}
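The change above wraps the cleanup commands so that the job's exit status survives them. A minimal shell sketch of the same idea follows. The functions job and cleanup are hypothetical stand-ins (a job that fails with status 3, and a cleanup step that succeeds); only the _EXIT_status pattern is taken from the commit.

```shell
#!/bin/sh
# Hypothetical stand-ins: "job" fails with status 3, "cleanup" succeeds.
job() { return 3; }
cleanup() { return 0; }

# Without saving $?, the status of the last command (cleanup) is reported.
naive() { job; cleanup; }

# Saving $? right after the job lets us report it once cleanup has run.
preserving() {
    job
    _EXIT_status=$?
    cleanup
    return $_EXIT_status
}

naive_status=0
naive || naive_status=$?
preserving_status=0
preserving || preserving_status=$?

echo "naive: $naive_status"
echo "preserving: $preserving_status"
```

The naive variant reports 0 even though the job failed, which is why the wrapper is only needed when there is cleanup to run after the job.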
@ -2591,7 +2653,7 @@ sub sshcommand_of_sshlogin {
} else {
debug($master,"\n");
`$master`;
exit;
exit(0);
}
}
} else {
@ -2661,9 +2723,13 @@ sub Reaper {
# This is one of the ssh -M: ignore
next;
}
if($Global::keeporder) {
$Global::running{$stiff}{'exitstatus'} = $? >> 8;
debug("died ($Global::running{$stiff}{'exitstatus'}): $Global::running{$stiff}{'seq'}");
# Force printing now if the job failed and we are going to exit
my $print_now = ($Global::running{$stiff}{'exitstatus'} and
$::opt_halt_on_error and $::opt_halt_on_error == 2);
if($Global::keeporder and not $print_now) {
$Global::print_later{$Global::running{$stiff}{"seq"}} = $Global::running{$stiff};
debug("died: $Global::running{$stiff}{'seq'}");
while($Global::print_later{$Global::job_end_sequence}) {
debug("Found job end $Global::job_end_sequence");
print_job($Global::print_later{$Global::job_end_sequence});
@ -2673,6 +2739,26 @@ sub Reaper {
} else {
print_job ($Global::running{$stiff});
}
if($Global::running{$stiff}{'exitstatus'}) {
# The jobs had a exit status <> 0, so error
$Global::exitstatus++;
if($::opt_halt_on_error) {
if($::opt_halt_on_error == 1) {
# If halt on error == 1 we should gracefully exit
print STDERR ("$Global::progname: Starting no more jobs. ",
"Waiting for ", scalar(keys %Global::running),
" jobs to finish. This job failed:\n",
$Global::running{$stiff}{"command"},"\n");
$Global::StartNoNewJobs++;
$Global::halt_on_error_exitstatus = $Global::running{$stiff}{'exitstatus'};
} elsif($::opt_halt_on_error == 2) {
# If halt on error == 2 we should exit immediately
print STDERR ("$Global::progname: This job failed:\n",
$Global::running{$stiff}{"command"},"\n");
exit ($Global::running{$stiff}{'exitstatus'});
}
}
}
my $sshlogin = $Global::running{$stiff}{'sshlogin'};
$Global::host{$sshlogin}{'no_of_running'}--;
$Global::running_jobs--;
@ -2690,7 +2776,7 @@ sub Reaper {
sub die_usage {
usage();
exit(1);
exit(255);
}
sub usage {
@ -2808,8 +2894,13 @@ $Global::control_path = 0;
# TODO Debian package
# TODO transfer a script to be run
# TODO check that error code is passed out. echo | parallel /bin/false should give error code
# TODO halt on first error. (/bin/false; E=$?; /bin/true; echo $E; exit $E); echo $?
# TODO halt on first error --soft (let running complete) --hard (killall running)
# TODO to kill from a run script parallel should set PARALLEL_PID that can be sig termed
# -F basefile this file will be transferred to each sshlogin before a
# jobs is started. It will be removed if --cleanup is active. The file
# may be a script to run or some common base data needed for the jobs.
# Multiple -F can be specified to transfer more basefiles.
# TODO to kill from a run script parallel should set PARALLEL_PID that can be sig termed
# TAGS: parallel | parallel processing | multicore | multiprocessor | Clustering/Distributed Networks
# job control | multiple jobs | parallelization | text processing | cluster | filters
# Clustering Tools | Command Line Tools | Utilities | System Administration


@ -1,11 +1,11 @@
### Test $PARALLEL
1
ssh -l parallel parallel-server2 echo\ 1;
ssh -l parallel parallel-server2 echo\ 1;
1
ssh parallel-server1 echo\ 2;
ssh parallel-server1 echo\ 2;
2
### Test ~/.parallelrc
ssh -l parallel parallel-server2 echo\ 1;
ssh -l parallel parallel-server2 echo\ 1;
1
ssh parallel-server1 echo\ 2;
ssh parallel-server1 echo\ 2;
2


@ -0,0 +1,58 @@
### Test exit val
0
1
### Test --halt-on-error
1
parallel: Starting no more jobs. Waiting for 2 jobs to finish. This job failed:
sleep 2;false
1
parallel: This job failed:
sleep 2;false
1
sh: non_exist: command not found
2
parallel: Starting no more jobs. Waiting for 3 jobs to finish. This job failed:
sleep 1;false
sh: non_exist: command not found
parallel: Starting no more jobs. Waiting for 1 jobs to finish. This job failed:
sleep 1; non_exist
127
parallel: This job failed:
sleep 2;false
1
### Test last dying print --halt-on-error
0
1
parallel: Starting no more jobs. Waiting for 9 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 1
2
parallel: Starting no more jobs. Waiting for 8 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 2
3
parallel: Starting no more jobs. Waiting for 7 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 3
4
parallel: Starting no more jobs. Waiting for 6 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 4
5
parallel: Starting no more jobs. Waiting for 5 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 5
6
parallel: Starting no more jobs. Waiting for 4 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 6
7
parallel: Starting no more jobs. Waiting for 3 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 7
8
0
parallel: Starting no more jobs. Waiting for 2 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 8
9
parallel: Starting no more jobs. Waiting for 1 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 9
9
0
1
parallel: This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 1
1

unittest/tests-to-run/test22.sh Executable file

@ -0,0 +1,34 @@
#!/bin/bash
PAR=parallel
SERVER1=parallel-server1
SERVER2=parallel-server2
(
echo '### Test exit val'
echo true | parallel
echo $?
echo false | parallel
echo $?
echo '### Test --halt-on-error'
(echo "sleep 1;true"; echo "sleep 2;false";echo "sleep 3;true") | parallel --halt-on-error 0
echo $?
(echo "sleep 1;true"; echo "sleep 2;false";echo "sleep 3;true") | parallel --halt-on-error 1
echo $?
(echo "sleep 1;true"; echo "sleep 2;false";echo "sleep 3;true") | parallel --halt-on-error 2
echo $?
(echo "sleep 1;true"; echo "sleep 1;false";echo "sleep 1;true";echo "sleep 1; non_exist") | parallel -H0
echo $?
(echo "sleep 1;true"; echo "sleep 1;false";echo "sleep 1;true";echo "sleep 1; non_exist") | parallel -H1
echo $?
(echo "sleep 1;true"; echo "sleep 2;false";echo "sleep 3;true";echo "sleep 4; non_exist") | parallel -H2
echo $?
echo '### Test last dying print --halt-on-error'
(seq 0 8;echo 0; echo 9) | parallel -kqH1 perl -e 'sleep $ARGV[0];print STDERR @ARGV,"\n"; exit shift'
echo $?
(seq 0 8;echo 0; echo 9) | parallel -kqH2 perl -e 'sleep $ARGV[0];print STDERR @ARGV,"\n"; exit shift'
echo $?
) 2>&1


@ -1,11 +1,11 @@
### Test $PARALLEL
1
ssh -l parallel parallel-server2 echo\ 1;
ssh -l parallel parallel-server2 echo\ 1;
1
ssh parallel-server1 echo\ 2;
ssh parallel-server1 echo\ 2;
2
### Test ~/.parallelrc
ssh -l parallel parallel-server2 echo\ 1;
ssh -l parallel parallel-server2 echo\ 1;
1
ssh parallel-server1 echo\ 2;
ssh parallel-server1 echo\ 2;
2


@ -0,0 +1,58 @@
### Test exit val
0
1
### Test --halt-on-error
1
parallel: Starting no more jobs. Waiting for 2 jobs to finish. This job failed:
sleep 2;false
1
parallel: This job failed:
sleep 2;false
1
sh: non_exist: command not found
2
parallel: Starting no more jobs. Waiting for 3 jobs to finish. This job failed:
sleep 1;false
sh: non_exist: command not found
parallel: Starting no more jobs. Waiting for 1 jobs to finish. This job failed:
sleep 1; non_exist
127
parallel: This job failed:
sleep 2;false
1
### Test last dying print --halt-on-error
0
1
parallel: Starting no more jobs. Waiting for 9 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 1
2
parallel: Starting no more jobs. Waiting for 8 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 2
3
parallel: Starting no more jobs. Waiting for 7 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 3
4
parallel: Starting no more jobs. Waiting for 6 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 4
5
parallel: Starting no more jobs. Waiting for 5 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 5
6
parallel: Starting no more jobs. Waiting for 4 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 6
7
parallel: Starting no more jobs. Waiting for 3 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 7
8
0
parallel: Starting no more jobs. Waiting for 2 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 8
9
parallel: Starting no more jobs. Waiting for 1 jobs to finish. This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 9
9
0
1
parallel: This job failed:
perl -e sleep\ \$ARGV[0]\;print\ STDERR\ @ARGV,\"\\n\"\;\ exit\ shift 1
1