2grep: --header implemented. Passes tests.
parent
d0da0206ea
commit
34e0b4e136
254
2search/2grep
|
@ -8,17 +8,17 @@
|
||||||
|
|
||||||
=head1 SYNOPSIS
|
=head1 SYNOPSIS
|
||||||
|
|
||||||
B<2search> [-nrfB] file string [string...]
|
B<2search> [-nrfHB] file string [string...]
|
||||||
|
|
||||||
B<2search> --grep [-nrf] file string [string...]
|
B<2search> --grep [-nrfH] file string [string...]
|
||||||
|
|
||||||
B<2grep> [-nrf] file string [string...]
|
B<2grep> [-nrfH] file string [string...]
|
||||||
|
|
||||||
... | B<2search> [-nrfB] file
|
... | B<2search> [-nrfHB] file
|
||||||
|
|
||||||
... | B<2search> --grep [-nrf] file
|
... | B<2search> --grep [-nrfH] file
|
||||||
|
|
||||||
... | B<2grep> [-nrf] file
|
... | B<2grep> [-nrfH] file
|
||||||
|
|
||||||
=head1 DESCRIPTION
|
=head1 DESCRIPTION
|
||||||
|
|
||||||
|
@ -52,12 +52,11 @@ print byte position where string would have been
|
||||||
consider only blanks and alphanumeric characters
|
consider only blanks and alphanumeric characters
|
||||||
|
|
||||||
|
|
||||||
=item B<--debug> (not implemented)
|
=item B<--debug>
|
||||||
|
|
||||||
=item B<-D>
|
=item B<-D>
|
||||||
|
|
||||||
annotate the part of the line used to sort, and warn about
|
annotate the part of the line used to sort to stderr
|
||||||
questionable usage to stderr
|
|
||||||
|
|
||||||
|
|
||||||
=item B<--ignore-case>
|
=item B<--ignore-case>
|
||||||
|
@ -81,6 +80,13 @@ search for all lines in I<file>
|
||||||
compare according to general numerical value
|
compare according to general numerical value
|
||||||
|
|
||||||
|
|
||||||
|
=item B<--header>
|
||||||
|
|
||||||
|
=item B<-H>
|
||||||
|
|
||||||
|
treat the first line in I<file> as a header
|
||||||
|
|
||||||
|
|
||||||
=item B<--ignore-nonprinting> (not implemented)
|
=item B<--ignore-nonprinting> (not implemented)
|
||||||
|
|
||||||
=item B<-i>
|
=item B<-i>
|
||||||
|
@ -114,18 +120,20 @@ sort via a key; KEYDEF gives location and type
|
||||||
=item B<-n>
|
=item B<-n>
|
||||||
|
|
||||||
compare according to string numerical value. If numerical values are
|
compare according to string numerical value. If numerical values are
|
||||||
the same: split the string into blocks of numbers and non-numbers, and
|
the same: compare as strings.
|
||||||
compare numbers as numbers and strings as strings.
|
|
||||||
|
|
||||||
This will sort like this: chr3 chr11 3chr 11chr
|
|
||||||
|
|
||||||
|
|
||||||
=item B<--numascii>
|
=item B<--numascii>
|
||||||
|
|
||||||
=item B<-N>
|
=item B<-N>
|
||||||
|
|
||||||
compare according to string numerical value. If numerical values are
|
split the string into blocks of numbers and non-numbers. For each
|
||||||
the same: compare as strings
|
block compare the block as numbers, if the numerical values are the
|
||||||
|
same: compare the block as strings.
|
||||||
|
|
||||||
|
This will sort like this: 3chr 11chr chr3 chr11
|
||||||
|
|
||||||
|
This is similar to B<--version-sort>, but without the exceptions.
|
||||||
|
|
||||||
|
|
||||||
=item B<--random-sort>
|
=item B<--random-sort>
|
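The block-wise comparison described for B<-N> in this hunk can be sketched as below. This is a minimal illustration, not the committed code: `numascii_cmp` is a hypothetical name, and as an assumption it only compares numerically when both blocks are digits, whereas the implementation later in this diff applies `<=>` to every block before falling back to `cmp`.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of 2search): compare two strings block by block.
sub numascii_cmp {
    my ($x, $y) = @_;
    # Split at every boundary between a digit and a non-digit.
    my @x = split /(?<=\d)(?=\D)|(?<=\D)(?=\d)/, $x;
    my @y = split /(?<=\d)(?=\D)|(?<=\D)(?=\d)/, $y;
    for (my $i = 0; $i < @x and $i < @y; $i++) {
        my $c;
        if ($x[$i] =~ /^\d+$/ and $y[$i] =~ /^\d+$/) {
            # Two digit blocks: compare as numbers, then as strings.
            $c = ($x[$i] <=> $y[$i] or $x[$i] cmp $y[$i]);
        } else {
            # Assumption: any other pair of blocks is compared as strings.
            $c = $x[$i] cmp $y[$i];
        }
        return $c if $c;
    }
    # All shared blocks are equal: the string with fewer blocks sorts first.
    return @x <=> @y;
}

my @sorted = sort { numascii_cmp($a, $b) } qw(chr11 11chr chr3 3chr);
print "@sorted\n";   # prints: 3chr 11chr chr3 chr11
```

Comparing mixed blocks as strings keeps the sketch free of "isn't numeric" warnings under `use warnings`.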
||||||
|
@ -152,7 +160,7 @@ B<-M>, numeric B<-n>, random B<-R>, version B<-V>
|
||||||
|
|
||||||
=item B<--field-separator=SEP>
|
=item B<--field-separator=SEP>
|
||||||
|
|
||||||
use SEP instead of non-blank to blank transition
|
use I<SEP> instead of blanks (\s+). I<SEP> is a regexp.
|
||||||
|
|
||||||
|
|
||||||
=item B<-z>
|
=item B<-z>
|
||||||
|
@ -161,39 +169,101 @@ use SEP instead of non-blank to blank transition
|
||||||
|
|
||||||
end lines with 0 byte, not newline
|
end lines with 0 byte, not newline
|
||||||
|
|
||||||
|
|
||||||
=back
|
=back
|
||||||
|
|
||||||
=head1 EXAMPLES
|
=head1 EXAMPLES
|
||||||
|
|
||||||
=head2 Single key
|
=head2 Single key
|
||||||
|
|
||||||
Input is sorted by Chromosome,Position:
|
Given sorted I<input.txt> like:
|
||||||
|
|
||||||
SampleID Position Chromosome
|
A_number B_number Date Duration CellID
|
||||||
foo 10000123 chr3
|
12893827 21034191 2020-03-21T13:38:13 P00:00:20 CPH382
|
||||||
foo 10000125 chr3
|
12893827 80012345 2020-03-20T12:34:23 P00:00:20 CPH382
|
||||||
foo 9999998 chr11
|
12893827 80012345 2020-03-20T12:45:03 P00:05:20 CPH382
|
||||||
foo 10000124 chr11
|
22355591 47827750 2020-03-20T11:28:33 P00:32:27 ALB923
|
||||||
foo 10000126 chr11
|
22355591 81382631 2020-03-21T21:28:33 P00:12:48 CPH382
|
||||||
|
22356142 45701514 2020-03-20T22:41:23 P00:02:48 CPH022
|
||||||
|
22356142 56818446 2020-03-21T08:38:34 P00:31:24 CPH645
|
||||||
|
|
||||||
To find all chr3:
|
To get all records with 22355591 you can run:
|
||||||
|
|
||||||
2grep -n -k3 inputfile chr3
|
grep ^22355591 input.txt
|
||||||
|
|
||||||
-n will split 'chr3' into 'chr' which is compared asciibetically and
|
But if I<input.txt> is several TB big, it can be very slow. B<2grep>
|
||||||
|
uses binary search which only works if the file is sorted, but takes
|
||||||
|
less than 1 second to run:
|
||||||
|
|
||||||
|
2grep -H input.txt 22355591
|
||||||
|
|
||||||
|
You can also search for a shorter string to get all records starting
|
||||||
|
with 2235:
|
||||||
|
|
||||||
|
2grep -H input.txt 2235
|
||||||
|
|
||||||
|
Or you can search for multiple search strings:
|
||||||
|
|
||||||
|
2grep -H input.txt 12893827 22356142
|
||||||
|
|
||||||
|
=head2 Multiple keys
|
||||||
|
|
||||||
|
Input is sorted by SampleID, Chromosome, Position (in that order):
|
||||||
|
|
||||||
|
SampleID Chromosome Position Data
|
||||||
|
PatientA chr3 10002123 CCGTCTAATGGCTTGATTGGTACACCATGACATTGA
|
||||||
|
PatientA chr3 10003125 TCCATCGTCGGCGAGAAGGTACCAGGTAA
|
||||||
|
PatientA chr11 9999998 AATTCACAGTATGGCTGACGGTGTCGTAGCTACACG
|
||||||
|
PatientA chr11 10001240 TCCAGAAGTTTGA
|
||||||
|
PatientA chr11 10001260 ATAACGAGAACTTACGTTTTAAAAGGCCTA
|
||||||
|
PatientB chr3 10000125 GTCTTCACTTTATAAATGGATGATAGCCTTCA
|
||||||
|
|
||||||
|
SampleID is sorted as text. Chromosome is sorted by text first and
|
||||||
|
numerically for the number. Position is sorted by number.
|
||||||
|
|
||||||
|
To find all chr3 for PatientA:
|
||||||
|
|
||||||
|
2grep -H -k1,2N inputfile PatientA chr3
|
||||||
|
|
||||||
|
-N will split 'chr3' into 'chr' which is compared asciibetically and
|
||||||
'3' which is compared numerically.
|
'3' which is compared numerically.
|
||||||
|
|
||||||
=head2 Not implemented
|
To find all chr3 for PatientA and all chr3 for PatientB:
|
||||||
|
|
||||||
To find all lines with chr3,10000125:
|
2grep -H -k1,2N inputfile PatientA chr3 PatientB chr3
|
||||||
|
|
||||||
2grep -k3n,2n inputfile chr3 10000125
|
|
||||||
|
|
||||||
|
=head1 PERFORMANCE
|
||||||
|
|
||||||
|
Binary search requires seeks from the disk. But B<2search> is designed
|
||||||
|
so that multiple searches will reuse cached data. This means searches
|
||||||
|
will be faster the more you run.
|
||||||
|
|
||||||
|
You can improve the speed even more by sorting the input strings. This
|
||||||
|
will make it possible to reuse cached data more.
|
||||||
|
|
||||||
|
It can be even faster if you run multiple searches in parallel.
|
||||||
|
|
||||||
|
This is due to magnetic drives' elevator sorting of requests when
|
||||||
|
seeking and due to NVMe drives working faster with more queues in
|
||||||
|
parallel.
|
||||||
|
|
||||||
|
cat searchstrings | parallel -n50 -j10 2grep inputfile
|
||||||
|
|
||||||
|
|
||||||
|
=head1 BUGS
|
||||||
|
|
||||||
|
B<2search> does not respect your locale setting. It assumes the input
|
||||||
|
is sorted with LC_ALL=C. If it is not, B<2search> may give the wrong
|
||||||
|
result.
|
||||||
|
|
||||||
|
To solve this, sort the input with B<LC_ALL=C sort ...>.
|
||||||
|
|
||||||
|
|
||||||
=head1 REPORTING BUGS
|
=head1 REPORTING BUGS
|
||||||
|
|
||||||
B<2search> is part of tangetools. Report bugs to <tools@tange.dk>.
|
B<2search> is part of tangetools. Report bugs on
|
||||||
|
https://gitlab.com/ole.tange/tangetools/-/issues
|
||||||
|
|
||||||
|
|
||||||
=head1 AUTHOR
|
=head1 AUTHOR
|
||||||
|
@ -342,20 +412,32 @@ GetOptions(
|
||||||
"sort=s" => \$opt::sort,
|
"sort=s" => \$opt::sort,
|
||||||
"V|version-sort" => \$opt::version_sort,
|
"V|version-sort" => \$opt::version_sort,
|
||||||
"k|key=s" => \@opt::key,
|
"k|key=s" => \@opt::key,
|
||||||
|
"H|header" => \$opt::header,
|
||||||
"t|field-separator=s" => \$opt::field_separator,
|
"t|field-separator=s" => \$opt::field_separator,
|
||||||
|
"recend|record-end=s" => \$opt::record_end,
|
||||||
|
"recstart|record-start=s" => \$opt::record_start,
|
||||||
"z|zero-terminated" => \$opt::zero_terminated,
|
"z|zero-terminated" => \$opt::zero_terminated,
|
||||||
);
|
) || exit(255);
|
||||||
$Global::progname = ($0 =~ m:(^|/)([^/]+)$:)[1];
|
$Global::progname = ($0 =~ m:(^|/)([^/]+)$:)[1];
|
||||||
$Global::version = 20200328;
|
$Global::version = 20200328;
|
||||||
if($opt::version) { version(); exit 0; }
|
if($opt::version) { version(); exit 0; }
|
||||||
if($opt::zero_terminated) { $/ = "\0"; }
|
if($opt::zero_terminated) { $/ = "\0"; }
|
||||||
if(@opt::key) {
|
if(@opt::key) {
|
||||||
# Default separator if --key = whitespace
|
# Default separator if --key = whitespace
|
||||||
$Global::sep = '\s+';
|
$Global::fieldsep = '\s+';
|
||||||
if(defined $opt::field_separator) { $Global::sep = $opt::field_separator; }
|
if(defined $opt::field_separator) { $Global::fieldsep = $opt::field_separator; }
|
||||||
}
|
}
|
||||||
if($Global::progname eq "2grep") { $opt::grep = 1; }
|
if($Global::progname eq "2grep") { $opt::grep = 1; }
|
||||||
$Global::debug = $opt::D;
|
$Global::debug = $opt::D;
|
||||||
|
if(defined $opt::record_end or defined $opt::record_start) {
|
||||||
|
if(not defined $opt::record_end) { $opt::record_end = ""; }
|
||||||
|
if(not defined $opt::record_start) { $opt::record_start = ""; }
|
||||||
|
$/ = unquote_printf($opt::record_end).unquote_printf($opt::record_start);
|
||||||
|
} else {
|
||||||
|
# Default = \n
|
||||||
|
$opt::record_end = "\n";
|
||||||
|
$/ = $opt::record_end;
|
||||||
|
}
|
||||||
|
|
||||||
parse_keydef();
|
parse_keydef();
|
||||||
|
|
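The record handling added in this hunk sets `$/` to the record end followed by the record start. The sketch below illustrates why the trailing start marker then has to be stripped from each record. It is an assumption-laden illustration: `read_records` is a hypothetical helper, it reads sequentially from an in-memory string rather than seeking in a file, and it uses `\z` where the diff uses `$`.

```perl
use strict;
use warnings;

# Hypothetical helper: split $data into records that begin with $recstart
# and end with $recend.
sub read_records {
    my ($data, $recend, $recstart) = @_;
    my $sep = $recend . $recstart;
    die "recend and recstart cannot both be empty\n" unless length $sep;
    # A read stops after an end marker that is followed by a start marker.
    local $/ = $sep;
    open(my $fh, "<", \$data) or die "Cannot open in-memory file: $!";
    my @records;
    my $prefix = "";
    while (defined(my $rec = <$fh>)) {
        # Restore the start marker that the previous read consumed.
        $rec = $prefix . $rec;
        # If this read stopped at a separator, the record now ends with the
        # next record's start marker: remove it and remember to restore it.
        if (length $recstart and $rec =~ s/\Q$recstart\E\z//) {
            $prefix = $recstart;
        } else {
            $prefix = "";
        }
        push @records, $rec;
    }
    close $fh;
    return @records;
}
```

For example, `read_records(">a\nAC\n>b\nGT\n", "\n", ">")` returns the two records `">a\nAC\n"` and `">b\nGT\n"`.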
||||||
|
@ -370,6 +452,19 @@ if(@ARGV) {
|
||||||
$opt::stdin = 1;
|
$opt::stdin = 1;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
$Global::headersize = 0;
|
||||||
|
if($opt::header) {
|
||||||
|
if(not open (my $fh, "<", $file)) {
|
||||||
|
error("Cannot open '$file'");
|
||||||
|
exit 1;
|
||||||
|
} else {
|
||||||
|
my $header = <$fh>;
|
||||||
|
$header =~ s/\Q$opt::record_start\E$//;
|
||||||
|
$Global::headersize = length $header;
|
||||||
|
print $header;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
round:
|
round:
|
||||||
while(1) {
|
while(1) {
|
||||||
my @search_vals;
|
my @search_vals;
|
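The header handling in this hunk records the byte length of the first line so that the binary search can start after it. A minimal sketch of that idea follows. The names `first_line_ge` and `grep_sorted` are hypothetical, and the sketch assumes plain byte-wise string comparison on newline-terminated lines, with none of the key, separator, or record options of the real tool.

```perl
use strict;
use warnings;

# Hypothetical helper: byte offset of the first line in [$start, $end)
# that compares ge $key, or $end if there is none.
# $start must be the start of a line.
sub first_line_ge {
    my ($fh, $key, $start, $end) = @_;
    my ($lo, $hi, $ans) = ($start, $end, $end);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($mid > $lo) {
            # Read up to the next newline so we land on the first
            # line start at or after $mid.
            seek($fh, $mid - 1, 0) or die "Cannot seek to " . ($mid - 1);
            my $skipped = <$fh>;
        } else {
            seek($fh, $mid, 0) or die "Cannot seek to $mid";
        }
        my $pos = tell($fh);
        my $line;
        $line = <$fh> if $pos < $ans;
        if (not defined $line or $line ge $key) {
            # Match (or nothing new to test): continue in the lower half.
            $ans = $pos if defined $line;
            $hi = $mid;
        } else {
            # Line sorts before the key: continue after this line.
            $lo = tell($fh);
        }
    }
    return $ans;
}

# Hypothetical helper: return the header (if requested) and all lines
# that start with $key.
sub grep_sorted {
    my ($file, $key, $has_header) = @_;
    open(my $fh, "<", $file) or die "Cannot open '$file': $!";
    my @out;
    my $headersize = 0;
    if ($has_header) {
        my $header = <$fh>;
        $header = "" unless defined $header;
        $headersize = length $header;   # the search starts after the header
        push @out, $header;
    }
    my $pos = first_line_ge($fh, $key, $headersize, -s $file);
    seek($fh, $pos, 0) or die "Cannot seek to $pos";
    while (defined(my $line = <$fh>)) {
        last unless index($line, $key) == 0;
        push @out, $line;
    }
    close $fh;
    return @out;
}
```

With the example file from the EXAMPLES section, `grep_sorted("input.txt", "22355591", 1)` would return the header line followed by the two records that start with 22355591.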
||||||
|
@ -385,7 +480,7 @@ if(@ARGV) {
|
||||||
} else {
|
} else {
|
||||||
print bsearch($file,@search_vals);
|
print bsearch($file,@search_vals);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
{
|
{
|
||||||
my $fh;
|
my $fh;
|
||||||
|
@ -447,7 +542,7 @@ sub bgrep {
|
||||||
sub bsearch {
|
sub bsearch {
|
||||||
my $file = shift;
|
my $file = shift;
|
||||||
my @search_vals = @_;
|
my @search_vals = @_;
|
||||||
my $min = 0;
|
my $min = $Global::headersize;
|
||||||
my $max = -s $file;
|
my $max = -s $file;
|
||||||
my $fh;
|
my $fh;
|
||||||
if(not open ($fh, "<", $file)) {
|
if(not open ($fh, "<", $file)) {
|
||||||
|
@ -474,7 +569,7 @@ sub bsearch {
|
||||||
compare(($line = <$fh>),@search_vals) >= 0) {
|
compare(($line = <$fh>),@search_vals) >= 0) {
|
||||||
# We have seen this newline position before
|
# We have seen this newline position before
|
||||||
# or we are at the end of the file
|
# or we are at the end of the file
|
||||||
# or we should search the upper half
|
# or we should search the lower half
|
||||||
$max = $middle;
|
$max = $middle;
|
||||||
$maxnl = $newline_pos;
|
$maxnl = $newline_pos;
|
||||||
} else {
|
} else {
|
||||||
|
@ -485,19 +580,43 @@ sub bsearch {
|
||||||
}
|
}
|
||||||
seek($fh,$minnl,0) or die("Cannot seek to $minnl");
|
seek($fh,$minnl,0) or die("Cannot seek to $minnl");
|
||||||
$line = <$fh>;
|
$line = <$fh>;
|
||||||
|
my $len = length $opt::record_start;
|
||||||
|
my $retpos;
|
||||||
if(compare($line,@search_vals) >= 0) {
|
if(compare($line,@search_vals) >= 0) {
|
||||||
if($opt::byte_offset) {
|
# Adjust for length of $recstart
|
||||||
return $minnl."\n";
|
$retpos = $minnl - $len;
|
||||||
} else {
|
|
||||||
return $line;
|
|
||||||
}
|
|
||||||
} else {
|
} else {
|
||||||
if($opt::byte_offset) {
|
$retpos = tell($fh) - $len;
|
||||||
return tell($fh)."\n";
|
}
|
||||||
|
$retpos = $retpos < 0 ? 0 : $retpos;
|
||||||
|
if($opt::byte_offset) {
|
||||||
|
return $retpos."\n";
|
||||||
|
} else {
|
||||||
|
seek($fh,$retpos,0) or die("Cannot seek to $retpos");
|
||||||
|
if(length $opt::record_end) {
|
||||||
|
# read record: A...BA
|
||||||
|
# Remove $opt::record_start if it is at the end
|
||||||
|
# (might not be only record)
|
||||||
|
$line = <$fh>;
|
||||||
|
$line =~ s/\Q$opt::record_start\E$//;
|
||||||
} else {
|
} else {
|
||||||
$line=<$fh>;
|
# --recend == ''
|
||||||
return $line;
|
if(length $opt::record_start) {
|
||||||
|
# read record: A...A
|
||||||
|
# Remove $opt::record_start if it is at the end
|
||||||
|
# (might not be only record)
|
||||||
|
$line = <$fh>; # Read: A
|
||||||
|
$line .= <$fh>; # Read: ...A
|
||||||
|
$line =~ s/\Q$opt::record_start\E$//;
|
||||||
|
} else {
|
||||||
|
# Len recstart == Len recend = 0. Does this ever happen?
|
||||||
|
# read record.
|
||||||
|
# Remove $opt::record_start if it is there (might be only record)
|
||||||
|
$line = <$fh>;
|
||||||
|
$line =~ s/\Q$opt::record_start\E$//;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
return $line;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
@ -533,11 +652,11 @@ sub parse_keydef {
|
||||||
);
|
);
|
||||||
|
|
||||||
if(@opt::key) {
|
if(@opt::key) {
|
||||||
|
# skip
|
||||||
} else {
|
} else {
|
||||||
# Convert -n -r to -k1rn
|
# Convert -n -r to -k1rn
|
||||||
# with sep = undef
|
# with sep = undef
|
||||||
$Global::sep = undef;
|
$Global::fieldsep = undef;
|
||||||
my $opt;
|
my $opt;
|
||||||
$opt->{'field'} = 1;
|
$opt->{'field'} = 1;
|
||||||
$opt->{'char'} = 1;
|
$opt->{'char'} = 1;
|
||||||
|
@ -546,7 +665,7 @@ sub parse_keydef {
|
||||||
}
|
}
|
||||||
push(@Global::keydefs,$opt);
|
push(@Global::keydefs,$opt);
|
||||||
}
|
}
|
||||||
|
|
||||||
for my $keydefs (@opt::key) {
|
for my $keydefs (@opt::key) {
|
||||||
for my $keydef (split /,/, $keydefs) {
|
for my $keydef (split /,/, $keydefs) {
|
||||||
my $opt;
|
my $opt;
|
||||||
|
@ -573,11 +692,11 @@ sub compare {
|
||||||
# One key to search for per search column
|
# One key to search for per search column
|
||||||
my($line,@search_vals) = @_;
|
my($line,@search_vals) = @_;
|
||||||
chomp($line);
|
chomp($line);
|
||||||
debug("Compare: $line <=> @search_vals ");
|
debug("Compare: $line <=> @search_vals; ");
|
||||||
my @field;
|
my @field;
|
||||||
if($Global::sep) {
|
if($Global::fieldsep) {
|
||||||
# Split line
|
# Split line
|
||||||
@field = split /$Global::sep/o, $line;
|
@field = split /$Global::fieldsep/o, $line;
|
||||||
} else {
|
} else {
|
||||||
@field = ($line);
|
@field = ($line);
|
||||||
}
|
}
|
||||||
|
@ -628,9 +747,20 @@ sub compare_single {
|
||||||
return ($m{$a} || 0) <=> ($m{$b} || 0);
|
return ($m{$a} || 0) <=> ($m{$b} || 0);
|
||||||
}
|
}
|
||||||
if($opt->{'numeric_sort'}) {
|
if($opt->{'numeric_sort'}) {
|
||||||
return $a <=> $b;
|
return($a <=> $b or $a cmp $b);
|
||||||
} elsif($opt->{'numascii'}) {
|
} elsif($opt->{'numascii'}) {
|
||||||
return $a <=> $b or $a cmp $b;
|
# Split on digit boundary
|
||||||
|
my @a = split /(?<=\d)(?=\D)|(?<=\D)(?=\d)/i, $a;
|
||||||
|
my @b = split /(?<=\d)(?=\D)|(?<=\D)(?=\d)/i, $b;
|
||||||
|
my $c;
|
||||||
|
for(my $t = 0;
|
||||||
|
defined $a[$t] and defined $b[$t];
|
||||||
|
$t++) {
|
||||||
|
$c = ($a[$t] <=> $b[$t] or $a[$t] cmp $b[$t]);
|
||||||
|
$c and return $c;
|
||||||
|
}
|
||||||
|
# All parts match, maybe one is longer
|
||||||
|
return $#a <=> $#b;
|
||||||
} else {
|
} else {
|
||||||
return $a cmp $b;
|
return $a cmp $b;
|
||||||
}
|
}
|
||||||
|
@ -775,3 +905,19 @@ sub debug(@) {
|
||||||
$Global::debug or return;
|
$Global::debug or return;
|
||||||
print @_;
|
print @_;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
sub unquote_printf {
|
||||||
|
# Convert \t \n \r \000 \0
|
||||||
|
# Inputs:
|
||||||
|
# $string = string with \t \n \r \num \0
|
||||||
|
# Returns:
|
||||||
|
# $replaced = string with TAB NEWLINE CR <ascii-num> NUL
|
||||||
|
$_ = shift;
|
||||||
|
s/\\t/\t/g;
|
||||||
|
s/\\n/\n/g;
|
||||||
|
s/\\r/\r/g;
|
||||||
|
s/\\(\d\d\d)/eval 'sprintf "\\'.$1.'"'/ge;
|
||||||
|
s/\\(\d)/eval 'sprintf "\\'.$1.'"'/ge;
|
||||||
|
return $_;
|
||||||
|
}
|
||||||
|
|
||||||
|
|
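The escape conversion that `unquote_printf` performs can also be written without `eval`. The sketch below is a hypothetical variant, not the committed code: `unquote_printf_sketch` is an invented name, and it assumes that numeric escapes are one to three octal digits, while the version in this diff matches exactly three digits or a single digit.

```perl
use strict;
use warnings;

# Hypothetical variant: convert \t \n \r and octal escapes such as \000 or \0.
sub unquote_printf_sketch {
    my ($s) = @_;
    my %simple = (t => "\t", n => "\n", r => "\r");
    # A single pass, so text produced by one replacement is never re-scanned.
    $s =~ s{\\(?:([tnr])|([0-7]{1,3}))}{
        defined $1 ? $simple{$1} : chr(oct($2))
    }ge;
    return $s;
}
```

For instance, `unquote_printf_sketch('\101\n')` returns `"A\n"`, since octal 101 is the character code of `A`.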
254
2search/2search
|
@ -8,17 +8,17 @@
|
||||||
|
|
||||||
=head1 SYNOPSIS
|
=head1 SYNOPSIS
|
||||||
|
|
||||||
B<2search> [-nrfB] file string [string...]
|
B<2search> [-nrfHB] file string [string...]
|
||||||
|
|
||||||
B<2search> --grep [-nrf] file string [string...]
|
B<2search> --grep [-nrfH] file string [string...]
|
||||||
|
|
||||||
B<2grep> [-nrf] file string [string...]
|
B<2grep> [-nrfH] file string [string...]
|
||||||
|
|
||||||
... | B<2search> [-nrfB] file
|
... | B<2search> [-nrfHB] file
|
||||||
|
|
||||||
... | B<2search> --grep [-nrf] file
|
... | B<2search> --grep [-nrfH] file
|
||||||
|
|
||||||
... | B<2grep> [-nrf] file
|
... | B<2grep> [-nrfH] file
|
||||||
|
|
||||||
=head1 DESCRIPTION
|
=head1 DESCRIPTION
|
||||||
|
|
||||||
|
@ -52,12 +52,11 @@ print byte position where string would have been
|
||||||
consider only blanks and alphanumeric characters
|
consider only blanks and alphanumeric characters
|
||||||
|
|
||||||
|
|
||||||
=item B<--debug> (not implemented)
|
=item B<--debug>
|
||||||
|
|
||||||
=item B<-D>
|
=item B<-D>
|
||||||
|
|
||||||
annotate the part of the line used to sort, and warn about
|
annotate the part of the line used to sort to stderr
|
||||||
questionable usage to stderr
|
|
||||||
|
|
||||||
|
|
||||||
=item B<--ignore-case>
|
=item B<--ignore-case>
|
||||||
|
@ -81,6 +80,13 @@ search for all lines in I<file>
|
||||||
compare according to general numerical value
|
compare according to general numerical value
|
||||||
|
|
||||||
|
|
||||||
|
=item B<--header>
|
||||||
|
|
||||||
|
=item B<-H>
|
||||||
|
|
||||||
|
treat the first line in I<file> as a header
|
||||||
|
|
||||||
|
|
||||||
=item B<--ignore-nonprinting> (not implemented)
|
=item B<--ignore-nonprinting> (not implemented)
|
||||||
|
|
||||||
=item B<-i>
|
=item B<-i>
|
||||||
|
@ -114,18 +120,20 @@ sort via a key; KEYDEF gives location and type
|
||||||
=item B<-n>
|
=item B<-n>
|
||||||
|
|
||||||
compare according to string numerical value. If numerical values are
|
compare according to string numerical value. If numerical values are
|
||||||
the same: split the string into blocks of numbers and non-numbers, and
|
the same: compare as strings.
|
||||||
compare numbers as numbers and strings as strings.
|
|
||||||
|
|
||||||
This will sort like this: chr3 chr11 3chr 11chr
|
|
||||||
|
|
||||||
|
|
||||||
=item B<--numascii>
|
=item B<--numascii>
|
||||||
|
|
||||||
=item B<-N>
|
=item B<-N>
|
||||||
|
|
||||||
compare according to string numerical value. If numerical values are
|
split the string into blocks of numbers and non-numbers. For each
|
||||||
the same: compare as strings
|
block compare the block as numbers, if the numerical values are the
|
||||||
|
same: compare the block as strings.
|
||||||
|
|
||||||
|
This will sort like this: 3chr 11chr chr3 chr11
|
||||||
|
|
||||||
|
This is similar to B<--version-sort>, but without the exceptions.
|
||||||
|
|
||||||
|
|
||||||
=item B<--random-sort>
|
=item B<--random-sort>
|
||||||
|
@ -152,7 +160,7 @@ B<-M>, numeric B<-n>, random B<-R>, version B<-V>
|
||||||
|
|
||||||
=item B<--field-separator=SEP>
|
=item B<--field-separator=SEP>
|
||||||
|
|
||||||
use SEP instead of non-blank to blank transition
|
use I<SEP> instead of blanks (\s+). I<SEP> is a regexp.
|
||||||
|
|
||||||
|
|
||||||
=item B<-z>
|
=item B<-z>
|
||||||
|
@ -161,39 +169,101 @@ use SEP instead of non-blank to blank transition
|
||||||
|
|
||||||
end lines with 0 byte, not newline
|
end lines with 0 byte, not newline
|
||||||
|
|
||||||
|
|
||||||
=back
|
=back
|
||||||
|
|
||||||
=head1 EXAMPLES
|
=head1 EXAMPLES
|
||||||
|
|
||||||
=head2 Single key
|
=head2 Single key
|
||||||
|
|
||||||
Input is sorted by Chromosome,Position:
|
Given sorted I<input.txt> like:
|
||||||
|
|
||||||
SampleID Position Chromosome
|
A_number B_number Date Duration CellID
|
||||||
foo 10000123 chr3
|
12893827 21034191 2020-03-21T13:38:13 P00:00:20 CPH382
|
||||||
foo 10000125 chr3
|
12893827 80012345 2020-03-20T12:34:23 P00:00:20 CPH382
|
||||||
foo 9999998 chr11
|
12893827 80012345 2020-03-20T12:45:03 P00:05:20 CPH382
|
||||||
foo 10000124 chr11
|
22355591 47827750 2020-03-20T11:28:33 P00:32:27 ALB923
|
||||||
foo 10000126 chr11
|
22355591 81382631 2020-03-21T21:28:33 P00:12:48 CPH382
|
||||||
|
22356142 45701514 2020-03-20T22:41:23 P00:02:48 CPH022
|
||||||
|
22356142 56818446 2020-03-21T08:38:34 P00:31:24 CPH645
|
||||||
|
|
||||||
To find all chr3:
|
To get all records with 22355591 you can run:
|
||||||
|
|
||||||
2grep -n -k3 inputfile chr3
|
grep ^22355591 input.txt
|
||||||
|
|
||||||
-n will split 'chr3' into 'chr' which is compared asciibetically and
|
But if I<input.txt> is several TB big, it can be very slow. B<2grep>
|
||||||
|
uses binary search which only works if the file is sorted, but takes
|
||||||
|
less than 1 second to run:
|
||||||
|
|
||||||
|
2grep -H input.txt 22355591
|
||||||
|
|
||||||
|
You can also search for a shorter string to get all records starting
|
||||||
|
with 2235:
|
||||||
|
|
||||||
|
2grep -H input.txt 2235
|
||||||
|
|
||||||
|
Or you can search for multiple search strings:
|
||||||
|
|
||||||
|
2grep -H input.txt 12893827 22356142
|
||||||
|
|
||||||
|
=head2 Multiple keys
|
||||||
|
|
||||||
|
Input is sorted by SampleID, Chromosome, Position (in that order):
|
||||||
|
|
||||||
|
SampleID Chromosome Position Data
|
||||||
|
PatientA chr3 10002123 CCGTCTAATGGCTTGATTGGTACACCATGACATTGA
|
||||||
|
PatientA chr3 10003125 TCCATCGTCGGCGAGAAGGTACCAGGTAA
|
||||||
|
PatientA chr11 9999998 AATTCACAGTATGGCTGACGGTGTCGTAGCTACACG
|
||||||
|
PatientA chr11 10001240 TCCAGAAGTTTGA
|
||||||
|
PatientA chr11 10001260 ATAACGAGAACTTACGTTTTAAAAGGCCTA
|
||||||
|
PatientB chr3 10000125 GTCTTCACTTTATAAATGGATGATAGCCTTCA
|
||||||
|
|
||||||
|
SampleID is sorted as text. Chromosome is sorted by text first and
|
||||||
|
numerically for the number. Position is sorted by number.
|
||||||
|
|
||||||
|
To find all chr3 for PatientA:
|
||||||
|
|
||||||
|
2grep -H -k1,2N inputfile PatientA chr3
|
||||||
|
|
||||||
|
-N will split 'chr3' into 'chr' which is compared asciibetically and
|
||||||
'3' which is compared numerically.
|
'3' which is compared numerically.
|
||||||
|
|
||||||
=head2 Not implemented
|
To find all chr3 for PatientA and all chr3 for PatientB:
|
||||||
|
|
||||||
To find all lines with chr3,10000125:
|
2grep -H -k1,2N inputfile PatientA chr3 PatientB chr3
|
||||||
|
|
||||||
2grep -k3n,2n inputfile chr3 10000125
|
|
||||||
|
|
||||||
|
=head1 PERFORMANCE
|
||||||
|
|
||||||
|
Binary search requires seeks from the disk. But B<2search> is designed
|
||||||
|
so that multiple searches will reuse cached data. This means searches
|
||||||
|
will be faster the more you run.
|
||||||
|
|
||||||
|
You can improve the speed even more by sorting the input strings. This
|
||||||
|
will make it possible to reuse cached data more.
|
||||||
|
|
||||||
|
It can be even faster if you run multiple searches in parallel.
|
||||||
|
|
||||||
|
This is due to magnetic drives' elevator sorting of requests when
|
||||||
|
seeking and due to NVMe drives working faster with more queues in
|
||||||
|
parallel.
|
||||||
|
|
||||||
|
cat searchstrings | parallel -n50 -j10 2grep inputfile
|
||||||
|
|
||||||
|
|
||||||
|
=head1 BUGS
|
||||||
|
|
||||||
|
B<2search> does not respect your locale setting. It assumes the input
|
||||||
|
is sorted with LC_ALL=C. If it is not, B<2search> may give the wrong
|
||||||
|
result.
|
||||||
|
|
||||||
|
To solve this, sort the input with B<LC_ALL=C sort ...>.
|
||||||
|
|
||||||
|
|
||||||
=head1 REPORTING BUGS
|
=head1 REPORTING BUGS
|
||||||
|
|
||||||
B<2search> is part of tangetools. Report bugs to <tools@tange.dk>.
|
B<2search> is part of tangetools. Report bugs on
|
||||||
|
https://gitlab.com/ole.tange/tangetools/-/issues
|
||||||
|
|
||||||
|
|
||||||
=head1 AUTHOR
|
=head1 AUTHOR
|
||||||
|
@ -342,20 +412,32 @@ GetOptions(
|
||||||
"sort=s" => \$opt::sort,
|
"sort=s" => \$opt::sort,
|
||||||
"V|version-sort" => \$opt::version_sort,
|
"V|version-sort" => \$opt::version_sort,
|
||||||
"k|key=s" => \@opt::key,
|
"k|key=s" => \@opt::key,
|
||||||
|
"H|header" => \$opt::header,
|
||||||
"t|field-separator=s" => \$opt::field_separator,
|
"t|field-separator=s" => \$opt::field_separator,
|
||||||
|
"recend|record-end=s" => \$opt::record_end,
|
||||||
|
"recstart|record-start=s" => \$opt::record_start,
|
||||||
"z|zero-terminated" => \$opt::zero_terminated,
|
"z|zero-terminated" => \$opt::zero_terminated,
|
||||||
);
|
) || exit(255);
|
||||||
$Global::progname = ($0 =~ m:(^|/)([^/]+)$:)[1];
|
$Global::progname = ($0 =~ m:(^|/)([^/]+)$:)[1];
|
||||||
$Global::version = 20200328;
|
$Global::version = 20200328;
|
||||||
if($opt::version) { version(); exit 0; }
|
if($opt::version) { version(); exit 0; }
|
||||||
if($opt::zero_terminated) { $/ = "\0"; }
|
if($opt::zero_terminated) { $/ = "\0"; }
|
||||||
if(@opt::key) {
|
if(@opt::key) {
|
||||||
# Default separator if --key = whitespace
|
# Default separator if --key = whitespace
|
||||||
$Global::sep = '\s+';
|
$Global::fieldsep = '\s+';
|
||||||
if(defined $opt::field_separator) { $Global::sep = $opt::field_separator; }
|
if(defined $opt::field_separator) { $Global::fieldsep = $opt::field_separator; }
|
||||||
}
|
}
|
||||||
if($Global::progname eq "2grep") { $opt::grep = 1; }
|
if($Global::progname eq "2grep") { $opt::grep = 1; }
|
||||||
$Global::debug = $opt::D;
|
$Global::debug = $opt::D;
|
||||||
|
if(defined $opt::record_end or defined $opt::record_start) {
|
||||||
|
if(not defined $opt::record_end) { $opt::record_end = ""; }
|
||||||
|
if(not defined $opt::record_start) { $opt::record_start = ""; }
|
||||||
|
$/ = unquote_printf($opt::record_end).unquote_printf($opt::record_start);
|
||||||
|
} else {
|
||||||
|
# Default = \n
|
||||||
|
$opt::record_end = "\n";
|
||||||
|
$/ = $opt::record_end;
|
||||||
|
}
|
||||||
|
|
||||||
parse_keydef();
|
parse_keydef();
|
||||||
|
|
||||||
|
@ -370,6 +452,19 @@ if(@ARGV) {
|
||||||
$opt::stdin = 1;
|
$opt::stdin = 1;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
$Global::headersize = 0;
|
||||||
|
if($opt::header) {
|
||||||
|
if(not open (my $fh, "<", $file)) {
|
||||||
|
error("Cannot open '$file'");
|
||||||
|
exit 1;
|
||||||
|
} else {
|
||||||
|
my $header = <$fh>;
|
||||||
|
$header =~ s/\Q$opt::record_start\E$//;
|
||||||
|
$Global::headersize = length $header;
|
||||||
|
print $header;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
round:
|
round:
|
||||||
while(1) {
|
while(1) {
|
||||||
my @search_vals;
|
my @search_vals;
|
||||||
|
@ -385,7 +480,7 @@ if(@ARGV) {
|
||||||
} else {
|
} else {
|
||||||
print bsearch($file,@search_vals);
|
print bsearch($file,@search_vals);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
{
|
{
|
||||||
my $fh;
|
my $fh;
|
||||||
|
@ -447,7 +542,7 @@ sub bgrep {
|
||||||
sub bsearch {
|
sub bsearch {
|
||||||
my $file = shift;
|
my $file = shift;
|
||||||
my @search_vals = @_;
|
my @search_vals = @_;
|
||||||
my $min = 0;
|
my $min = $Global::headersize;
|
||||||
my $max = -s $file;
|
my $max = -s $file;
|
||||||
my $fh;
|
my $fh;
|
||||||
if(not open ($fh, "<", $file)) {
|
if(not open ($fh, "<", $file)) {
|
||||||
|
@ -474,7 +569,7 @@ sub bsearch {
|
||||||
compare(($line = <$fh>),@search_vals) >= 0) {
|
compare(($line = <$fh>),@search_vals) >= 0) {
|
||||||
# We have seen this newline position before
|
# We have seen this newline position before
|
||||||
# or we are at the end of the file
|
# or we are at the end of the file
|
||||||
# or we should search the upper half
|
# or we should search the lower half
|
||||||
$max = $middle;
|
$max = $middle;
|
||||||
$maxnl = $newline_pos;
|
$maxnl = $newline_pos;
|
||||||
} else {
|
} else {
|
||||||
|
@ -485,19 +580,43 @@ sub bsearch {
|
||||||
}
|
}
|
||||||
seek($fh,$minnl,0) or die("Cannot seek to $minnl");
|
seek($fh,$minnl,0) or die("Cannot seek to $minnl");
|
||||||
$line = <$fh>;
|
$line = <$fh>;
|
||||||
|
my $len = length $opt::record_start;
|
||||||
|
my $retpos;
|
||||||
if(compare($line,@search_vals) >= 0) {
|
if(compare($line,@search_vals) >= 0) {
|
||||||
if($opt::byte_offset) {
|
# Adjust for length of $recstart
|
||||||
return $minnl."\n";
|
$retpos = $minnl - $len;
|
||||||
} else {
|
|
||||||
return $line;
|
|
||||||
}
|
|
||||||
} else {
|
} else {
|
||||||
if($opt::byte_offset) {
|
$retpos = tell($fh) - $len;
|
||||||
return tell($fh)."\n";
|
}
|
||||||
|
$retpos = $retpos < 0 ? 0 : $retpos;
|
||||||
|
if($opt::byte_offset) {
|
||||||
|
return $retpos."\n";
|
||||||
|
} else {
|
||||||
|
seek($fh,$retpos,0) or die("Cannot seek to $retpos");
|
||||||
|
if(length $opt::record_end) {
|
||||||
|
# read record: A...BA
|
||||||
|
# Remove $opt::record_start if it is at the end
|
||||||
|
# (might not be only record)
|
||||||
|
$line = <$fh>;
|
||||||
|
$line =~ s/\Q$opt::record_start\E$//;
|
||||||
} else {
|
} else {
|
||||||
$line=<$fh>;
|
# --recend == ''
|
||||||
return $line;
|
if(length $opt::record_start) {
|
||||||
|
# read record: A...A
|
||||||
|
# Remove $opt::record_start if it is at the end
|
||||||
|
# (might not be only record)
|
||||||
|
$line = <$fh>; # Read: A
|
||||||
|
$line .= <$fh>; # Read: ...A
|
||||||
|
$line =~ s/\Q$opt::record_start\E$//;
|
||||||
|
} else {
|
||||||
|
# Len recstart == Len recend = 0. Does this ever happen?
|
||||||
|
# read record.
|
||||||
|
# Remove $opt::record_start if it is there (might be only record)
|
||||||
|
$line = <$fh>;
|
||||||
|
$line =~ s/\Q$opt::record_start\E$//;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
return $line;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
@ -533,11 +652,11 @@ sub parse_keydef {
|
||||||
);
|
);
|
||||||
|
|
||||||
if(@opt::key) {
|
if(@opt::key) {
|
||||||
|
# skip
|
||||||
} else {
|
} else {
|
||||||
# Convert -n -r to -k1rn
|
# Convert -n -r to -k1rn
|
||||||
# with sep = undef
|
# with sep = undef
|
||||||
$Global::sep = undef;
|
$Global::fieldsep = undef;
|
||||||
my $opt;
|
my $opt;
|
||||||
$opt->{'field'} = 1;
|
$opt->{'field'} = 1;
|
||||||
$opt->{'char'} = 1;
|
$opt->{'char'} = 1;
|
||||||
|
@ -546,7 +665,7 @@ sub parse_keydef {
|
||||||
}
|
}
|
||||||
push(@Global::keydefs,$opt);
|
push(@Global::keydefs,$opt);
|
||||||
}
|
}
|
||||||
|
|
||||||
for my $keydefs (@opt::key) {
|
for my $keydefs (@opt::key) {
|
||||||
for my $keydef (split /,/, $keydefs) {
|
for my $keydef (split /,/, $keydefs) {
|
||||||
my $opt;
|
my $opt;
|
||||||
|
@ -573,11 +692,11 @@ sub compare {
|
||||||
# One key to search for per search column
|
# One key to search for per search column
|
||||||
my($line,@search_vals) = @_;
|
my($line,@search_vals) = @_;
|
||||||
chomp($line);
|
chomp($line);
|
||||||
debug("Compare: $line <=> @search_vals ");
|
debug("Compare: $line <=> @search_vals; ");
|
||||||
my @field;
|
my @field;
|
||||||
if($Global::sep) {
|
if($Global::fieldsep) {
|
||||||
# Split line
|
# Split line
|
||||||
@field = split /$Global::sep/o, $line;
|
@field = split /$Global::fieldsep/o, $line;
|
||||||
} else {
|
} else {
|
||||||
@field = ($line);
|
@field = ($line);
|
||||||
}
|
}
|
||||||
|
@ -628,9 +747,20 @@ sub compare_single {
|
||||||
return ($m{$a} || 0) <=> ($m{$b} || 0);
|
return ($m{$a} || 0) <=> ($m{$b} || 0);
|
||||||
}
|
}
|
||||||
if($opt->{'numeric_sort'}) {
|
if($opt->{'numeric_sort'}) {
|
||||||
return $a <=> $b;
|
return($a <=> $b or $a cmp $b);
|
||||||
} elsif($opt->{'numascii'}) {
|
} elsif($opt->{'numascii'}) {
|
||||||
return $a <=> $b or $a cmp $b;
|
# Split on digit boundary
|
||||||
|
my @a = split /(?<=\d)(?=\D)|(?<=\D)(?=\d)/i, $a;
|
||||||
|
my @b = split /(?<=\d)(?=\D)|(?<=\D)(?=\d)/i, $b;
|
||||||
|
my $c;
|
||||||
|
for(my $t = 0;
|
||||||
|
defined $a[$t] and defined $b[$t];
|
||||||
|
$t++) {
|
||||||
|
$c = ($a[$t] <=> $b[$t] or $a[$t] cmp $b[$t]);
|
||||||
|
$c and return $c;
|
||||||
|
}
|
||||||
|
# All parts match, maybe one is longer
|
||||||
|
return $#a <=> $#b;
|
||||||
} else {
|
} else {
|
||||||
return $a cmp $b;
|
return $a cmp $b;
|
||||||
}
|
}
|
||||||
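Note on the `numascii` branch in the hunk above: it splits both values at every digit/non-digit boundary and compares the resulting chunks pairwise, falling back to the chunk count when one value is a prefix of the other. The following is a minimal standalone sketch of that idea, not the code from this commit. The name `numascii_cmp` is hypothetical, and unlike the diff it only compares numerically when both chunks are pure digits, which is an assumption made here so the sketch runs cleanly under `use warnings`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of 2search): compare two strings chunk by
# chunk, where a chunk is a maximal run of digits or of non-digits.
sub numascii_cmp {
    my ($x, $y) = @_;
    # Zero-width split point: digit followed by non-digit, or the reverse
    my $boundary = qr/(?<=\d)(?=\D)|(?<=\D)(?=\d)/;
    my @x = split $boundary, $x;
    my @y = split $boundary, $y;
    for (my $t = 0; defined $x[$t] and defined $y[$t]; $t++) {
        my $c;
        if ($x[$t] =~ /^\d+$/ and $y[$t] =~ /^\d+$/) {
            # Both chunks are numbers: numeric first, string as tie-breaker
            $c = ($x[$t] <=> $y[$t] or $x[$t] cmp $y[$t]);
        } else {
            # At least one chunk is text: plain string comparison
            $c = $x[$t] cmp $y[$t];
        }
        return $c if $c;
    }
    # All shared chunks are equal; the value with more chunks sorts last
    return $#x <=> $#y;
}

my @sorted = sort { numascii_cmp($a, $b) } qw(chr10 chr2 chr1 chr2a);
print "@sorted\n";   # prints "chr1 chr2 chr2a chr10"
```

With this ordering `chr2` sorts before `chr10`, which a plain `cmp` would not give. That is presumably why `test_k3N_2N_1n` searches for `chr2` with the `N` key type.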
|
@ -775,3 +905,19 @@ sub debug(@) {
|
||||||
$Global::debug or return;
|
$Global::debug or return;
|
||||||
print @_;
|
print @_;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
sub unquote_printf() {
|
||||||
|
# Convert \t \n \r \000 \0
|
||||||
|
# Inputs:
|
||||||
|
# $string = string with \t \n \r \num \0
|
||||||
|
# Returns:
|
||||||
|
# $replaced = string with TAB NEWLINE CR <ascii-num> NUL
|
||||||
|
$_ = shift;
|
||||||
|
s/\\t/\t/g;
|
||||||
|
s/\\n/\n/g;
|
||||||
|
s/\\r/\r/g;
|
||||||
|
s/\\(\d\d\d)/eval 'sprintf "\\'.$1.'"'/ge;
|
||||||
|
s/\\(\d)/eval 'sprintf "\\'.$1.'"'/ge;
|
||||||
|
return $_;
|
||||||
|
}
|
||||||
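Note on `unquote_printf` above: it turns the escape sequences accepted on the command line (for example `-t '\t'` or `--recend '\n'` in the tests) into the literal characters, using `eval` of a `sprintf` string for the numeric escapes. The sketch below shows the same conversion without `eval`. It is not the code from this commit: the name `unquote_escapes` is hypothetical, and it accepts one to three octal digits, an assumption that differs slightly from the diff, which handles exactly three digits or exactly one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variant of unquote_printf (name and digit handling are
# assumptions): convert \t \n \r and octal \NNN escapes to characters.
sub unquote_escapes {
    my ($s) = @_;
    $s =~ s/\\t/\t/g;
    $s =~ s/\\n/\n/g;
    $s =~ s/\\r/\r/g;
    # \0, \101 etc.: interpret the digits as an octal character code
    $s =~ s/\\([0-7]{1,3})/chr(oct($1))/ge;
    return $s;
}

# Usage: a separator given as the two characters '\' and 't' becomes a TAB
my $sep = unquote_escapes('\t');
print join($sep, "a", "b"), "\n";
```

Using `chr(oct(...))` avoids evaluating user-supplied text as Perl code, at the cost of only supporting the escapes listed explicitly.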
|
|
||||||
|
|
|
@ -2,6 +2,7 @@
|
||||||
|
|
||||||
test_tmp=`tempfile`
|
test_tmp=`tempfile`
|
||||||
export test_tmp
|
export test_tmp
|
||||||
|
export LC_ALL=C
|
||||||
|
|
||||||
opt_tester() {
|
opt_tester() {
|
||||||
opt="$@"
|
opt="$@"
|
||||||
|
@ -111,10 +112,10 @@ test_rn_opt() {
|
||||||
}
|
}
|
||||||
|
|
||||||
test_r_opt() {
|
test_r_opt() {
|
||||||
opt_tester -rn
|
opt_tester -r
|
||||||
}
|
}
|
||||||
|
|
||||||
test_k32_2n_1n() {
|
test_k3N_2N_1n() {
|
||||||
tmp=$(tempfile)
|
tmp=$(tempfile)
|
||||||
cat >$tmp <<EOF
|
cat >$tmp <<EOF
|
||||||
1 chr1 Sample 1
|
1 chr1 Sample 1
|
||||||
|
@ -172,8 +173,8 @@ test_k32_2n_1n() {
|
||||||
11111 chr10 Sample 10
|
11111 chr10 Sample 10
|
||||||
111111 chr10 Sample 10
|
111111 chr10 Sample 10
|
||||||
EOF
|
EOF
|
||||||
2grep -k3N,2N,1n $tmp 'Sample 10' chr10 111
|
# Find the line with 111,chr2,Sample 10
|
||||||
echo $tmp
|
2grep -t '\t' -k3N,2N,1n $tmp 'Sample 10' chr2 111
|
||||||
}
|
}
|
||||||
|
|
||||||
test_partial_line() {
|
test_partial_line() {
|
||||||
|
@ -188,7 +189,70 @@ test_partial_line() {
|
||||||
rm $tmp
|
rm $tmp
|
||||||
}
|
}
|
||||||
|
|
||||||
|
test_recstart() {
|
||||||
|
tmp=$(tempfile)
|
||||||
|
cat >$tmp <<EOF
|
||||||
|
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11
|
||||||
|
MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGET
|
||||||
|
PGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVAT
|
||||||
|
IPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIEST
|
||||||
|
PELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGRE
|
||||||
|
VENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELE
|
||||||
|
EELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRW
|
||||||
|
TEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVX
|
||||||
|
>sp|P04637|P53_HUMAN Cellular
|
||||||
|
IQVVSRCRLRHTEVLPAEEENDSLGADGT
|
||||||
|
PQLX
|
||||||
|
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606
|
||||||
|
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRD
|
||||||
|
WGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRT
|
||||||
|
SNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCEX
|
||||||
|
>sp|P13674|P4HA1_HUMAN Prolyl
|
||||||
|
VECCPNCRGTGMQIRIHQIGPGMVQQIQS
|
||||||
|
DGQKITFHGEGDQEPGLEPGDIIIVLDQK
|
||||||
|
SHPGQIVKHGDIKCVLNEGMX
|
||||||
|
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transc
|
||||||
|
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKP
|
||||||
|
HDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTI
|
||||||
|
RFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHAC
|
||||||
|
FSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAX
|
||||||
|
>sp|Q7Z4N8|P4HA3_HUMAN
|
||||||
|
MTEQMTLRGTLKGHNGWVTQIA
|
||||||
|
YGIPQRALRGHSHFVSDVVISS
|
||||||
|
GHTKDVLSVAFSSDNRQIVSGS
|
||||||
|
VRFSPNSSNPIIVSX
|
||||||
|
>sp|Q96A73|P33MX_HUMAN Putative
|
||||||
|
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSA
|
||||||
|
VPLIGLPNTQDYKWVDRNSGLTWSGNDTCLY
|
||||||
|
SCQNQTKGLLYQLFRNLFCSYGLTEAHGKWR
|
||||||
|
CADASITNDKGHDGHRTPTWWLTGSNLTLSV
|
||||||
|
NNSGLFFLCGNGVYKGFPPKWSGRCGLGYLV
|
||||||
|
PSLTRYLTLNASQITNLRSFIHKVTPHRX
|
||||||
|
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing
|
||||||
|
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKE
|
||||||
|
IAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTF
|
||||||
|
HGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGF
|
||||||
|
TNVNFGRSRSAQEPARKKQDPPVTHDLX
|
||||||
|
EOF
|
||||||
|
echo "--regstart"
|
||||||
|
echo start
|
||||||
|
2search -t '\|' -k2 --recstart '>' $tmp O14683
|
||||||
|
echo middle
|
||||||
|
2search -t '\|' -k2 --recstart '>' --recend '' $tmp Q96A73
|
||||||
|
echo end
|
||||||
|
2search -t '\|' -k2 --recstart '>' $tmp Q9UHX1
|
||||||
|
|
||||||
|
echo "--regstart + --regend"
|
||||||
|
echo start
|
||||||
|
2search -t '\|' -k2 --recstart '>' --recend '\n' $tmp O14683
|
||||||
|
echo middle
|
||||||
|
2search -t '\|' -k2 --recstart '>' --recend '\n' $tmp Q96A73
|
||||||
|
echo end
|
||||||
|
2search -t '\|' -k2 --recstart '>' --recend '\n' $tmp Q9UHX1
|
||||||
|
|
||||||
|
rm $tmp
|
||||||
|
}
|
||||||
|
|
||||||
export -f $(compgen -A function | grep test_)
|
export -f $(compgen -A function | grep test_)
|
||||||
compgen -A function | grep test_ | sort | parallel -j6 --tag -k '{} 2>&1' > regressiontest.new
|
compgen -A function | grep test_ | sort | parallel -j6 --tag -k '{} 2>&1' > regressiontest.new
|
||||||
diff regressiontest.new regressiontest.out
|
diff -Naur regressiontest.new regressiontest.out
|
||||||
|
|
|
@ -1,15 +1,11 @@
|
||||||
test_k32_2n_1n 111 chr10 Sample 10
|
test_k3N_2N_1n 111 chr2 Sample 10
|
||||||
test_k32_2n_1n 1111 chr10 Sample 10
|
|
||||||
test_k32_2n_1n 11111 chr10 Sample 10
|
|
||||||
test_k32_2n_1n 111111 chr10 Sample 10
|
|
||||||
test_n Search in null file
|
test_n Search in null file
|
||||||
test_n 0
|
test_n 0
|
||||||
test_n 0
|
test_n 0
|
||||||
test_n 0
|
test_n 0
|
||||||
test_n 0
|
test_n 0
|
||||||
test_n Search in newline
|
test_n Search in newline
|
||||||
test_n
|
test_n 1
|
||||||
test_n 0
|
|
||||||
test_n 1
|
test_n 1
|
||||||
test_n 1
|
test_n 1
|
||||||
test_n 1
|
test_n 1
|
||||||
|
@ -65,8 +61,7 @@ test_n_opt 0
|
||||||
test_n_opt 0
|
test_n_opt 0
|
||||||
test_n_opt Search in newline
|
test_n_opt Search in newline
|
||||||
test_n_opt Search in
|
test_n_opt Search in
|
||||||
test_n_opt
|
test_n_opt 1
|
||||||
test_n_opt 0
|
|
||||||
test_n_opt 1
|
test_n_opt 1
|
||||||
test_n_opt 1
|
test_n_opt 1
|
||||||
test_n_opt 1
|
test_n_opt 1
|
||||||
|
@ -150,6 +145,118 @@ test_partial_line 36
|
||||||
test_partial_line 37
|
test_partial_line 37
|
||||||
test_partial_line 38
|
test_partial_line 38
|
||||||
test_partial_line 39
|
test_partial_line 39
|
||||||
|
test_r_opt Search in null file
|
||||||
|
test_r_opt Search in
|
||||||
|
test_r_opt 0
|
||||||
|
test_r_opt 0
|
||||||
|
test_r_opt 0
|
||||||
|
test_r_opt 0
|
||||||
|
test_r_opt Search in newline
|
||||||
|
test_r_opt Search in
|
||||||
|
test_r_opt
|
||||||
|
test_r_opt
|
||||||
|
test_r_opt
|
||||||
|
test_r_opt
|
||||||
|
test_r_opt 0
|
||||||
|
test_r_opt 0
|
||||||
|
test_r_opt 0
|
||||||
|
test_r_opt 0
|
||||||
|
test_r_opt Search in 1.000000000
|
||||||
|
test_r_opt 1.000000000
|
||||||
|
test_r_opt 1.000000000
|
||||||
|
test_r_opt 1.000000000
|
||||||
|
test_r_opt 12
|
||||||
|
test_r_opt 0
|
||||||
|
test_r_opt 0
|
||||||
|
test_r_opt 0
|
||||||
|
test_r_opt Search in 2 1.000000000
|
||||||
|
test_r_opt 2
|
||||||
|
test_r_opt 2
|
||||||
|
test_r_opt 1.000000000
|
||||||
|
test_r_opt 14
|
||||||
|
test_r_opt 0
|
||||||
|
test_r_opt 0
|
||||||
|
test_r_opt 2
|
||||||
|
test_r_opt Search in 2.000000000 1
|
||||||
|
test_r_opt 1
|
||||||
|
test_r_opt 2.000000000
|
||||||
|
test_r_opt 1
|
||||||
|
test_r_opt 14
|
||||||
|
test_r_opt 12
|
||||||
|
test_r_opt 0
|
||||||
|
test_r_opt 12
|
||||||
|
test_r_opt Search in 3 2 1.000000000
|
||||||
|
test_r_opt 2
|
||||||
|
test_r_opt 2
|
||||||
|
test_r_opt 1.000000000
|
||||||
|
test_r_opt 16
|
||||||
|
test_r_opt 2
|
||||||
|
test_r_opt 2
|
||||||
|
test_r_opt 4
|
||||||
|
test_r_opt Search in 3 2.000000000 1
|
||||||
|
test_r_opt 1
|
||||||
|
test_r_opt 2.000000000
|
||||||
|
test_r_opt 1
|
||||||
|
test_r_opt 16
|
||||||
|
test_r_opt 14
|
||||||
|
test_r_opt 2
|
||||||
|
test_r_opt 14
|
||||||
|
test_r_opt Search in 3.000000000 2 1
|
||||||
|
test_r_opt 2
|
||||||
|
test_r_opt 2
|
||||||
|
test_r_opt 1
|
||||||
|
test_r_opt 16
|
||||||
|
test_r_opt 12
|
||||||
|
test_r_opt 12
|
||||||
|
test_r_opt 14
|
||||||
|
test_recstart --regstart
|
||||||
|
test_recstart start
|
||||||
|
test_recstart >sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11
|
||||||
|
test_recstart MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGET
|
||||||
|
test_recstart PGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVAT
|
||||||
|
test_recstart IPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIEST
|
||||||
|
test_recstart PELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGRE
|
||||||
|
test_recstart VENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELE
|
||||||
|
test_recstart EELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRW
|
||||||
|
test_recstart TEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVX
|
||||||
|
test_recstart middle
|
||||||
|
test_recstart >sp|Q96A73|P33MX_HUMAN Putative
|
||||||
|
test_recstart RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSA
|
||||||
|
test_recstart VPLIGLPNTQDYKWVDRNSGLTWSGNDTCLY
|
||||||
|
test_recstart SCQNQTKGLLYQLFRNLFCSYGLTEAHGKWR
|
||||||
|
test_recstart CADASITNDKGHDGHRTPTWWLTGSNLTLSV
|
||||||
|
test_recstart NNSGLFFLCGNGVYKGFPPKWSGRCGLGYLV
|
||||||
|
test_recstart PSLTRYLTLNASQITNLRSFIHKVTPHRX
|
||||||
|
test_recstart end
|
||||||
|
test_recstart >sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing
|
||||||
|
test_recstart MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKE
|
||||||
|
test_recstart IAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTF
|
||||||
|
test_recstart HGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGF
|
||||||
|
test_recstart TNVNFGRSRSAQEPARKKQDPPVTHDLX
|
||||||
|
test_recstart --regstart + --regend
|
||||||
|
test_recstart start
|
||||||
|
test_recstart >sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11
|
||||||
|
test_recstart MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGET
|
||||||
|
test_recstart PGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVAT
|
||||||
|
test_recstart IPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIEST
|
||||||
|
test_recstart PELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGRE
|
||||||
|
test_recstart VENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELE
|
||||||
|
test_recstart EELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRW
|
||||||
|
test_recstart TEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVX
|
||||||
|
test_recstart middle
|
||||||
|
test_recstart >sp|Q96A73|P33MX_HUMAN Putative
|
||||||
|
test_recstart RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSA
|
||||||
|
test_recstart VPLIGLPNTQDYKWVDRNSGLTWSGNDTCLY
|
||||||
|
test_recstart SCQNQTKGLLYQLFRNLFCSYGLTEAHGKWR
|
||||||
|
test_recstart CADASITNDKGHDGHRTPTWWLTGSNLTLSV
|
||||||
|
test_recstart NNSGLFFLCGNGVYKGFPPKWSGRCGLGYLV
|
||||||
|
test_recstart PSLTRYLTLNASQITNLRSFIHKVTPHRX
|
||||||
|
test_recstart end
|
||||||
|
test_recstart >sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing
|
||||||
|
test_recstart MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKE
|
||||||
|
test_recstart IAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTF
|
||||||
|
test_recstart HGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGF
|
||||||
|
test_recstart TNVNFGRSRSAQEPARKKQDPPVTHDLX
|
||||||
test_rn_opt Search in null file
|
test_rn_opt Search in null file
|
||||||
test_rn_opt Search in
|
test_rn_opt Search in
|
||||||
test_rn_opt 0
|
test_rn_opt 0
|
||||||
|
@ -183,11 +290,11 @@ test_rn_opt 0
|
||||||
test_rn_opt 0
|
test_rn_opt 0
|
||||||
test_rn_opt 0
|
test_rn_opt 0
|
||||||
test_rn_opt Search in 2.000000000 1
|
test_rn_opt Search in 2.000000000 1
|
||||||
test_rn_opt 2.000000000
|
test_rn_opt 1
|
||||||
test_rn_opt 2.000000000
|
test_rn_opt 2.000000000
|
||||||
test_rn_opt 2.000000000
|
test_rn_opt 2.000000000
|
||||||
test_rn_opt 14
|
test_rn_opt 14
|
||||||
test_rn_opt 0
|
test_rn_opt 12
|
||||||
test_rn_opt 0
|
test_rn_opt 0
|
||||||
test_rn_opt 0
|
test_rn_opt 0
|
||||||
test_rn_opt Search in 3 2 1.000000000
|
test_rn_opt Search in 3 2 1.000000000
|
||||||
|
@ -199,11 +306,11 @@ test_rn_opt 2
|
||||||
test_rn_opt 2
|
test_rn_opt 2
|
||||||
test_rn_opt 0
|
test_rn_opt 0
|
||||||
test_rn_opt Search in 3 2.000000000 1
|
test_rn_opt Search in 3 2.000000000 1
|
||||||
test_rn_opt 2.000000000
|
test_rn_opt 1
|
||||||
test_rn_opt 2.000000000
|
test_rn_opt 2.000000000
|
||||||
test_rn_opt 3
|
test_rn_opt 3
|
||||||
test_rn_opt 16
|
test_rn_opt 16
|
||||||
test_rn_opt 2
|
test_rn_opt 14
|
||||||
test_rn_opt 2
|
test_rn_opt 2
|
||||||
test_rn_opt 0
|
test_rn_opt 0
|
||||||
test_rn_opt Search in 3.000000000 2 1
|
test_rn_opt Search in 3.000000000 2 1
|
||||||
|
@ -214,67 +321,3 @@ test_rn_opt 16
|
||||||
test_rn_opt 12
|
test_rn_opt 12
|
||||||
test_rn_opt 12
|
test_rn_opt 12
|
||||||
test_rn_opt 0
|
test_rn_opt 0
|
||||||
test_r_opt Search in null file
|
|
||||||
test_r_opt Search in
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt Search in newline
|
|
||||||
test_r_opt Search in
|
|
||||||
test_r_opt
|
|
||||||
test_r_opt
|
|
||||||
test_r_opt
|
|
||||||
test_r_opt
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt Search in 1.000000000
|
|
||||||
test_r_opt 1.000000000
|
|
||||||
test_r_opt 1.000000000
|
|
||||||
test_r_opt 1.000000000
|
|
||||||
test_r_opt 12
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt Search in 2 1.000000000
|
|
||||||
test_r_opt 2
|
|
||||||
test_r_opt 2
|
|
||||||
test_r_opt 2
|
|
||||||
test_r_opt 14
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt Search in 2.000000000 1
|
|
||||||
test_r_opt 2.000000000
|
|
||||||
test_r_opt 2.000000000
|
|
||||||
test_r_opt 2.000000000
|
|
||||||
test_r_opt 14
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt Search in 3 2 1.000000000
|
|
||||||
test_r_opt 2
|
|
||||||
test_r_opt 2
|
|
||||||
test_r_opt 3
|
|
||||||
test_r_opt 16
|
|
||||||
test_r_opt 2
|
|
||||||
test_r_opt 2
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt Search in 3 2.000000000 1
|
|
||||||
test_r_opt 2.000000000
|
|
||||||
test_r_opt 2.000000000
|
|
||||||
test_r_opt 3
|
|
||||||
test_r_opt 16
|
|
||||||
test_r_opt 2
|
|
||||||
test_r_opt 2
|
|
||||||
test_r_opt 0
|
|
||||||
test_r_opt Search in 3.000000000 2 1
|
|
||||||
test_r_opt 2
|
|
||||||
test_r_opt 2
|
|
||||||
test_r_opt 3.000000000
|
|
||||||
test_r_opt 16
|
|
||||||
test_r_opt 12
|
|
||||||
test_r_opt 12
|
|
||||||
test_r_opt 0
|
|
||||||
|
|
8
README
8
README
|
@ -2,9 +2,9 @@ Tools developed by Ole Tange <ole@tange.dk>.
|
||||||
|
|
||||||
Probably not useful for you, but then again you never now.
|
Probably not useful for you, but then again you never now.
|
||||||
|
|
||||||
blink - blink disks in a disk enclosure
|
2search - binary search through sorted text files.
|
||||||
|
|
||||||
bsearch - binary search through sorted text files.
|
blink - blink disks in a disk enclosure.
|
||||||
|
|
||||||
decrypt-root-with-usb - patch for cryptroot to decrypt root with key on USB.
|
decrypt-root-with-usb - patch for cryptroot to decrypt root with key on USB.
|
||||||
|
|
||||||
|
@ -14,6 +14,8 @@ em - force emacs to run in terminal. Use xemacs if installed.
|
||||||
|
|
||||||
field - split on whitespace. Give the given field number. Supports syntax 1-3,6-
|
field - split on whitespace. Give the given field number. Supports syntax 1-3,6-
|
||||||
|
|
||||||
|
find-first-fail - find the lowest argument that makes a command fail.
|
||||||
|
|
||||||
forever - run the same command or list of commands every second.
|
forever - run the same command or list of commands every second.
|
||||||
|
|
||||||
G - shorthand for multi level grep.
|
G - shorthand for multi level grep.
|
||||||
|
|