Added -k: Keep sequence of output same as the order of input.

If jobs 1 2 3 4 end in the sequence 3 1 4 2 the output will still be 1 2 3 4.
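
The reordering described above can be sketched in plain shell. This is only an illustration of the buffering idea, not the code in this commit: the function name emit_in_order and the "output of job N" strings are hypothetical, and job completion is simulated by passing the end sequence as arguments instead of running real jobs.

```shell
#!/bin/sh
# Hypothetical helper, not part of parallel: replays job completions in the
# order given as arguments and prints each job's output in input order.
emit_in_order() (
    # The subshell body keeps all variables local to this call.
    buf=$(mktemp -d) || exit 1   # finished-but-unprinted output, one file per job
    next=1                       # sequence number that must be printed next
    for seq in "$@"; do
        # Job $seq has ended: buffer its output instead of printing it.
        echo "output of job $seq" > "$buf/$seq"
        # Print every buffered job that is now next in line.
        while [ -f "$buf/$next" ]; do
            cat "$buf/$next"
            rm -f "$buf/$next"
            next=$((next + 1))
        done
    done
    rm -rf "$buf"
)

# Jobs end in the sequence 3 1 4 2; output still comes out as 1 2 3 4.
emit_in_order 3 1 4 2
```

Job 3 is held back until jobs 1 and 2 have been printed, so a slow early job delays the output of every later job but never changes the order.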
2024-11-22 05:57:54 +00:00 · 2009-09-04 07:23:57 +02:00 · 2009-09-04 07:23:57 +02:00 · 2d930320ad
parent 84118cb14b
commit 2d930320ad
5 changed files with 290 additions and 73 deletions
--- a/189
+++ b/189
@ -6,7 +6,7 @@ parallel - build and execute command lines from standard input in parallel
 =head1 SYNOPSIS
-B<parallel> [-0cfgqsuvxX] [-j num] [command [arguments]] < list_of_arguments
+B<parallel> [-0cfgkqsuvxX] [-j num] [command [arguments]] < list_of_arguments
 =head1 DESCRIPTION
@ -72,6 +72,11 @@ If the evaluated number is less than 1 then 1 will be used.
 Multiply N% with the number of CPUs.  Run this many jobs in parallel.
 If the evaluated number is less than 1 then 1 will be used.
 =item B<-k>
 Keep sequence of output same as the order of input. If jobs 1 2 3 4
 end in the sequence 3 1 4 2 the output will still be 1 2 3 4.
 =item B<-q>
 Quote B<command>.  This will quote the command line so special
@ -223,6 +228,26 @@ B<seq -f %04g 0 9999 | parallel -X rm pict{}.jpg>
 This will also only run B<rm> as many times needed to keep the command
 line length short enough.
 =head1 EXAMPLE 7: Keep order of output same as order of input
 Normally the output of a job will be printed as soon as it
 completes. Sometimes you want the order of the output to remain the
 same as the order of the input. B<-k> will make sure the order of
 output will be in the same order as input even if later jobs end
 before earlier jobs.
 If you have a directory with subdirectories that contain different
 amount of files running:
 B<ls | sort | parallel -v "ls {} | wc">
 will give the output of each dir, but it will be sorted according to
 which job completed first.
 To keep the order the same as input run:
 B<ls | sort | parallel -kv "ls {} | wc">
 =head1 QUOTING
@ -255,6 +280,15 @@ B<sh: -c: line 0: syntax error near unexpected token>
 then you might try using B<-q>.
 If you are using B<bash> process substitution like B<<(cat foo)> then
 you may try B<-q> and prepending B<command> with B<bash -c>:
 B<ls | parallel -q bash -c 'wc -c <(echo {})'>
 Or for substituting output:
 B<ls | parallel -q bash -c 'tar c {} | tee >>B<(gzip >>B<{}.tar.gz) | bzip2 >>B<{}.tar.bz2'>
 B<Conclusion>: To avoid dealing with the quoting problems it may be
 easier just to write a small script and have B<parallel> call that
 script.
@ -269,20 +303,17 @@ B<find -exec> only works on files. So processing other input (such as
 hosts or URLs) will require creating these inputs as files. B<find
 -exec> has no support for running commands in parallel.
-B<xargs> deals badly with special characters (such as space, ' and ") unless
+B<xargs> deals badly with special characters (such as space, ' and ")
-B<-0> is specified. Many input generators are not optimized for using
+unless B<-0> or B<-d "\n"> is specified. Many input generators are not
-B<NUL> as separator but are optimized for B<newline> as separator. E.g
+optimized for using B<NUL> as separator but are optimized for
-B<head>, B<tail>, B<awk>, B<ls>, B<echo>, B<sed>, B<tar -v>, B<perl>
+B<newline> as separator. E.g B<head>, B<tail>, B<awk>, B<ls>, B<echo>,
-(-0 and \0 instead of \n), B<locate> (requires using -0), B<find>
+B<sed>, B<tar -v>, B<perl> (-0 and \0 instead of \n), B<locate>
-(requires using -print0), B<grep> (requires user to use -z or -Z).
+(requires using -0), B<find> (requires using -print0), B<grep>
-
+(requires user to use -z or -Z).
 The input I<can> be fixed for B<xargs> with: 
 tr '\n' '\0'
 So B<parallel>'s newline separation can be emulated with:
-cat | tr '\n' '\0' | xargs -0 -n1 I<command>
+B<cat | xargs -d "\n" -n1 I<command>>
 B<xargs> can run a given number of jobs in parallel, but has no
 support for running no_of_cpus jobs in parallel.
@ -291,27 +322,45 @@ B<xargs> has no support for grouping the output, therefore output may
 run together, e.g. the first half of a line is from one process and
 the last half of the line is from another process.
-If no command is given to B<xargs> it defaults to /bin/echo. So the
+B<xargs> has no support for keeping the order of the output, therefore
-B<cat | sh> functionality is missing.
+output of the second job cannot be postponed till the first job is done.
 B<xargs> has no support for context replace, so you will have to create the 
 arguments.
 If you use a replace string in B<xargs> (B<-I>) you can not force
 B<xargs> to use more than one argument.
 Quoting in B<xargs> works like B<-q> in B<parallel>. This means
-composed commands and redirection is impossible: B<ls | parallel "wc
+composed commands and redirection requires using B<bash -c>.
 {} >> B<{}.wc"> or B<ls | parallel "echo {}; ls {}|wc"> cannot be done using
 B<xargs>.
 B<ls | parallel "wc {} >> B<{}.wc"> 
 becomes 
 B<ls | xargs -d "\n" -P10 -I {} bash -c "wc {} >>B< {}.wc">
 and 
 B<ls | parallel "echo {}; ls {}|wc"> 
 becomes 
 B<ls | xargs -d "\n" -P10 -I {} bash -c "echo {}; ls {}|wc">
 =head1 BUGS
 Filenames beginning with '-' can cause some commands to give
 unexpected results, as it will often be interpreted as an option.
 Some Bash specific constructs like '<(cmd)' do not work.  They can be
 fixed by prepending '/bin/bash -c'. How much slower will that be?
 =head1 REPORTING BUGS
 Report bugs to <bug-parallel@tange.dk>.
 =head1 IDEAS
 xargs dropin-replacement.
 Implement the missing --features
 =head1 AUTHOR
@ -360,14 +409,15 @@ use Getopt::Std;
 use strict;
 my ($processes,$command);
-getopts("0cdefgj:qsuvxX") || die_usage();
+getopts("0cdfgj:kqsuvxX") || die_usage();
 # Defaults:
 $Global::debug = 0;
-$processes = 10;
+$Global::processes_to_run = 10;
 $command = undef;
 $Global::verbose = 0;
 $Global::grouped = 1;
 $Global::keeporder = 0;
 $Global::quoting = 0;
 $Global::input_is_filename = (@ARGV);
 $/="\n";
@ -383,6 +433,7 @@ if(defined $::opt_X) {
 }
 if(defined $::opt_v) { $Global::verbose = 1; }
 if(defined $::opt_s) { $Global::verbose = 0; }
 if(defined $::opt_k) { $Global::keeporder = 1; }
 if(defined $::opt_g) { $Global::grouped = 1; }
 if(defined $::opt_u) { $Global::grouped = 0; }
 if(defined $::opt_c) { $Global::input_is_filename = 0; }
@ -398,17 +449,13 @@ if(@ARGV) {
 }
 # Needs to be done after setting $Global::command and $Global::command_line_max_len
 # as '-x' influences the number of commands that needs to be run
-if(defined $::opt_j) { $processes = compute_number_of_processes($::opt_j); }
+if(defined $::opt_j) { $Global::processes_to_run = compute_number_of_processes($::opt_j); }
 $Global::job_end_sequence=1;
 init_run_jobs();
 DoNotReap();
-
+start_more_jobs();
 while($Global::running_jobs < $processes 
      and 
      start_another_job()) {
    # skip
 }
 ReapIfNeeded();
 drain_job_queue();
@ -646,7 +693,7 @@ sub processes_available_by_system_limit {
 		      "Limiting to ", $system_limit, " jobs in parallel.\n");
    }
    # Cleanup: Close the files
-    for (keys %fh) { close $fh{$_} }
+    for (values %fh) { close $_ }
    # Cleanup: Kill the children
    for my $pid (@children) {
 	kill 15, $pid;
@ -658,6 +705,24 @@ sub processes_available_by_system_limit {
    return $system_limit;
 }
 sub enough_file_handles {
    # check that we have enough filehandles available for starting
    # another job
    if($Global::grouped) {
 	my %fh;
 	my $enough_filehandles = 1;
 	# We need a filehandle for STDOUT and STDERR
 	# open3 uses 2 extra filehandles temporarily
 	for my $i (1..4) {
 	    $enough_filehandles &&= open($fh{$i},"</dev/null");
 	}
 	for (values %fh) { close $_; }
 	return $enough_filehandles;
    } else {
 	return 1;
    }
 }
 sub user_requested_processes {
    # Parse the number of processes that the user asked for
    my $opt_j = shift;
@ -787,15 +852,30 @@ sub drain_job_queue {
    }
 }
 sub start_more_jobs {
    my $jobs_started = 0;
    while($Global::running_jobs < $Global::processes_to_run
 	  and 
 	  start_another_job()) {
 	$jobs_started++;
    }
    return $jobs_started;
 }
 sub start_another_job {
    # Grab a job from @Global::command, start it 
    # and remember the pid, the STDOUT and the STDERR handles
    # If no more jobs: do nothing
-    my $command = next_command_line();
+    # Do we have enough file handles to start another job?
-    if(defined $command) {
+    if(enough_file_handles()) {
-	my %jobinfo = start_job($command);
+	my $command = next_command_line();
-	$Global::running{$jobinfo{"pid"}} = \%jobinfo;
+	if(defined $command) {
-	return 1;
+	    my %jobinfo = start_job($command);
 	    $Global::running{$jobinfo{"pid"}} = \%jobinfo;
 	    return 1;
 	} else {
 	    return 0;
 	}
    } else {
 	return 0;
    }
@ -831,13 +911,17 @@ sub start_job {
    open STDOUT, ">&", $Global::original_stdout or die "Can't dup \$oldout: $!";
    open STDERR, ">&", $Global::original_stderr or die "Can't dup \$oldout: $!";
    $Global::job_start_sequence++;
    if($Global::grouped) {
-	return ("pid" => $pid,
+	return ("seq" => $Global::job_start_sequence,
 		"pid" => $pid,
 		"out" => $out{$outname},
 		"err" => $err{$errname},
 		"command" => $command);
    } else {
-	return ("pid" => $pid, "command" => $command);
+	return ("seq" => $Global::job_start_sequence,
 		"pid" => $pid, 
 		"command" => $command);
    }
 }
@ -912,10 +996,24 @@ sub Reaper {
    my $stiff;
    debug("Reaper called $Global::reaperlevel\n");
    while (($stiff = waitpid(-1, &WNOHANG)) > 0) {
-	print_job($Global::running{$stiff});
+	if($Global::keeporder) {
-	delete $Global::running{$stiff};
+	    $Global::print_later{$Global::running{$stiff}{"seq"}} = $Global::running{$stiff};
-	$Global::running_jobs--;
+	    debug("died: $Global::running{$stiff}{'seq'}");
-	start_another_job();
+	    while($Global::print_later{$Global::job_end_sequence}) {
 		debug("Found job end $Global::job_end_sequence");
 		print_job($Global::print_later{$Global::job_end_sequence});
 		delete $Global::print_later{$Global::job_end_sequence};
 		$Global::job_end_sequence++;
 	    }
 	    delete $Global::running{$stiff};
 	    $Global::running_jobs--;
 	    start_more_jobs();
 	} else {
 	    print_job($Global::running{$stiff});
 	    delete $Global::running{$stiff};
 	    $Global::running_jobs--;
 	    start_more_jobs();
 	}
    }
    ReapIfNeeded();
    debug("Reaper exit $Global::reaperlevel\n");
@ -933,7 +1031,7 @@ sub die_usage {
 sub usage {
    print "Usage:\n";
-    print "parallel [-0cfgqsuvxX] [-j num] [command [arguments]] < list_of_arguments\n";
+    print "parallel [-0cdfgkqsuvxX] [-j num] [command [arguments]] < list_of_arguments\n";
 }
 #
@ -999,5 +1097,6 @@ sub my_dump {
 # Keep perl -w happy
 $main::opt_u = $main::opt_c = $main::opt_f = $main::opt_q =
 $main::opt_0 = $main::opt_s = $main::opt_v = $main::opt_g =
-$main::opt_j = $main::opt_d = $main::opt_x = $main::opt_X =1;
+$main::opt_j = $main::opt_d = $main::opt_x = $main::opt_X =
-$Global::xargs = 1;
+$main::opt_k =
 $Global::xargs = $Global::keeporder = 0;
--- a/parallel.1
+++ b/parallel.1
@ -124,7 +124,7 @@
 .\" ========================================================================
 .\"
 .IX Title "PARALLEL 1"
-.TH PARALLEL 1 "2009-08-26" "perl v5.10.0" "User Contributed Perl Documentation"
+.TH PARALLEL 1 "2009-09-02" "perl v5.10.0" "User Contributed Perl Documentation"
 .\" For nroff, turn off justification.  Always turn off hyphenation; it makes
 .\" way too many mistakes in technical documents.
 .if n .ad l
@ -133,7 +133,7 @@
 parallel \- build and execute command lines from standard input in parallel
 .SH "SYNOPSIS"
 .IX Header "SYNOPSIS"
-\&\fBparallel\fR [\-0cfgqsuvxX] [\-j num] [command [arguments]] < list_of_arguments
+\&\fBparallel\fR [\-0cfgkqsuvxX] [\-j num] [command [arguments]] < list_of_arguments
 .SH "DESCRIPTION"
 .IX Header "DESCRIPTION"
 For each line of input \fBparallel\fR will execute \fBcommand\fR with the
@ -186,6 +186,10 @@ If the evaluated number is less than 1 then 1 will be used.
 .IX Item "-j N%"
 Multiply N% with the number of CPUs.  Run this many jobs in parallel.
 If the evaluated number is less than 1 then 1 will be used.
 .IP "\fB\-k\fR" 9
 .IX Item "-k"
 Keep sequence of output same as the order of input. If jobs 1 2 3 4
 end in the sequence 3 1 4 2 the output will still be 1 2 3 4.
 .IP "\fB\-q\fR" 9
 .IX Item "-q"
 Quote \fBcommand\fR.  This will quote the command line so special
@ -305,27 +309,46 @@ files in each directory:
 .PP
 To put the output in a file called <name>.dir:
 .PP
-\&\fBls | parallel '(echo \-n {}\*(L" \*(R"; ls {}|wc \-l) \fR>\fB {}.dir'\fR
+\&\fBls | parallel '(echo \-n {}\*(L" \*(R"; ls {}|wc \-l) \fR> \fB{}.dir'\fR
 .SH "EXAMPLE 6: Context replace"
 .IX Header "EXAMPLE 6: Context replace"
-To remove the files \fIpict1000.jpg\fR .. \fIpict9999.jpg\fR you could do:
+To remove the files \fIpict0000.jpg\fR .. \fIpict9999.jpg\fR you could do:
 .PP
-\&\fBseq 1000 9999 | parallel rm pict{}.jpg\fR
+\&\fBseq \-f \f(CB%04g\fB 0 9999 | parallel rm pict{}.jpg\fR
 .PP
 You could also do:
 .PP
-\&\fBseq 1000 9999 | perl \-pe 's/(.*)/pict$1.jpg/' | parallel \-x rm\fR
+\&\fBseq \-f \f(CB%04g\fB 0 9999 | perl \-pe 's/(.*)/pict$1.jpg/' | parallel \-x rm\fR
 .PP
-The first will run \fBrm\fR 8999 times, while the last will only run
+The first will run \fBrm\fR 10000 times, while the last will only run
 \&\fBrm\fR as many times needed to keep the command line length short
-enough.
+enough (typically 1\-2 times).
 .PP
 You could also run:
 .PP
-\&\fBseq 1000 9999 | parallel \-X rm pict{}.jpg\fR
+\&\fBseq \-f \f(CB%04g\fB 0 9999 | parallel \-X rm pict{}.jpg\fR
 .PP
 This will also only run \fBrm\fR as many times needed to keep the command
 line length short enough.
 .SH "EXAMPLE 7: Keep order of output same as order of input"
 .IX Header "EXAMPLE 7: Keep order of output same as order of input"
 Normally the output of a job will be printed as soon as it
 completes. Sometimes you want the order of the output to remain the
 same as the order of the input. \fB\-k\fR will make sure the order of
 output will be in the same order as input even if later jobs end
 before earlier jobs.
 .PP
 If you have a directory with subdirectories that contain different
 amount of files running:
 .PP
 \&\fBls | sort | parallel \-v \*(L"ls {} | wc\*(R"\fR
 .PP
 will give the output of each dir, but it will be sorted according to
 which job completed first.
 .PP
 To keep the order the same as input run:
 .PP
 \&\fBls | sort | parallel \-kv \*(L"ls {} | wc\*(R"\fR
 .SH "QUOTING"
 .IX Header "QUOTING"
 For more advanced use quoting may be an issue. The following will
@ -357,6 +380,15 @@ If you get errors like:
 .PP
 then you might try using \fB\-q\fR.
 .PP
 If you are using \fBbash\fR process substitution like \fB<(cat foo)\fR then
 you may try \fB\-q\fR and prepending \fBcommand\fR with \fBbash \-c\fR:
 .PP
 \&\fBls | parallel \-q bash \-c 'wc \-c <(echo {})'\fR
 .PP
 Or for substituting output:
 .PP
 \&\fBls | parallel \-q bash \-c 'tar c {} | tee \fR>\fB(gzip \fR>\fB{}.tar.gz) | bzip2 \fR>\fB{}.tar.bz2'\fR
 .PP
 \&\fBConclusion\fR: To avoid dealing with the quoting problems it may be
 easier just to write a small script and have \fBparallel\fR call that
 script.
@ -369,20 +401,17 @@ script.
 hosts or URLs) will require creating these inputs as files. \fBfind
 \&\-exec\fR has no support for running commands in parallel.
 .PP
-\&\fBxargs\fR deals badly with special characters (such as space, ' and ") unless
+\&\fBxargs\fR deals badly with special characters (such as space, ' and ")
-\&\fB\-0\fR is specified. Many input generators are not optimized for using
+unless \fB\-0\fR or \fB\-d \*(L"\en\*(R"\fR is specified. Many input generators are not
-\&\fB\s-1NUL\s0\fR as separator but are optimized for \fBnewline\fR as separator. E.g
+optimized for using \fB\s-1NUL\s0\fR as separator but are optimized for
-\&\fBhead\fR, \fBtail\fR, \fBawk\fR, \fBls\fR, \fBecho\fR, \fBsed\fR, \fBtar \-v\fR, \fBperl\fR
+\&\fBnewline\fR as separator. E.g \fBhead\fR, \fBtail\fR, \fBawk\fR, \fBls\fR, \fBecho\fR,
-(\-0 and \e0 instead of \en), \fBlocate\fR (requires using \-0), \fBfind\fR
+\&\fBsed\fR, \fBtar \-v\fR, \fBperl\fR (\-0 and \e0 instead of \en), \fBlocate\fR
-(requires using \-print0), \fBgrep\fR (requires user to use \-z or \-Z).
+(requires using \-0), \fBfind\fR (requires using \-print0), \fBgrep\fR
-.PP
+(requires user to use \-z or \-Z).
 The input \fIcan\fR be fixed for \fBxargs\fR with:
 .PP
 tr '\en' '\e0'
 .PP
 So \fBparallel\fR's newline separation can be emulated with:
 .PP
-cat | tr '\en' '\e0' | xargs \-0 \-n1 \fIcommand\fR
+\&\fBcat | xargs \-d \*(L"\en\*(R" \-n1 \f(BIcommand\fB\fR
 .PP
 \&\fBxargs\fR can run a given number of jobs in parallel, but has no
 support for running no_of_cpus jobs in parallel.
@ -391,23 +420,42 @@ support for running no_of_cpus jobs in parallel.
 run together, e.g. the first half of a line is from one process and
 the last half of the line is from another process.
 .PP
-If no command is given to \fBxargs\fR it defaults to /bin/echo. So the
+\&\fBxargs\fR has no support for keeping the order of the output, therefore
-\&\fBcat | sh\fR functionality is missing.
+output of the second job cannot be postponed till the first job is done.
 .PP
 \&\fBxargs\fR has no support for context replace, so you will have to create the 
 arguments.
 .PP
 If you use a replace string in \fBxargs\fR (\fB\-I\fR) you can not force
 \&\fBxargs\fR to use more than one argument.
 .PP
 Quoting in \fBxargs\fR works like \fB\-q\fR in \fBparallel\fR. This means
-composed commands and redirection is impossible: \fBls | parallel "wc
+composed commands and redirection requires using \fBbash \-c\fR.
-{} \fR> \fB{}.wc"\fR or \fBls | parallel \*(L"echo {}; ls {}|wc\*(R"\fR cannot be done using
+.PP
-\&\fBxargs\fR.
+\&\fBls | parallel "wc {} \fR> \fB{}.wc"\fR
 .PP
 becomes
 .PP
 \&\fBls | xargs \-d \*(L"\en\*(R" \-P10 \-I {} bash \-c "wc {} \fR>\fB {}.wc"\fR
 .PP
 and
 .PP
 \&\fBls | parallel \*(L"echo {}; ls {}|wc\*(R"\fR
 .PP
 becomes
 .PP
 \&\fBls | xargs \-d \*(L"\en\*(R" \-P10 \-I {} bash \-c \*(L"echo {}; ls {}|wc\*(R"\fR
 .SH "BUGS"
 .IX Header "BUGS"
 Filenames beginning with '\-' can cause some commands to give
 unexpected results, as it will often be interpreted as an option.
 .PP
 Some Bash specific constructs like '<(cmd)' do not work.  They can be
 fixed by prepending '/bin/bash \-c'. How much slower will that be?
 .SH "REPORTING BUGS"
 .IX Header "REPORTING BUGS"
 Report bugs to <bug\-parallel@tange.dk>.
 .SH "IDEAS"
 .IX Header "IDEAS"
 xargs dropin-replacement.
 Implement the missing \-\-features
 .SH "AUTHOR"
 .IX Header "AUTHOR"
 Copyright (C) 2007\-10\-18 Ole Tange, http://ole.tange.dk
--- a/unittest/actual-results/test13
+++ b/unittest/actual-results/test13
@ -0,0 +1,32 @@
 begin
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 end
--- a/unittest/tests-to-run/test13.sh
+++ b/unittest/tests-to-run/test13.sh
@ -0,0 +1,6 @@
 #!/bin/bash
 # Test -k
 ulimit -n 50
 (echo "sleep 3; echo begin"; seq 1 30 | parallel -kq echo "sleep 1; echo {}"; echo "echo end") \
 | parallel -k -j0
--- a/unittest/wanted-results/test13
+++ b/unittest/wanted-results/test13
@ -0,0 +1,32 @@
 begin
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 end