parallel: man page example with web crawler -> web mirrorer

2024-12-23 05:07:54 +00:00 · 2011-07-28 20:12:02 +02:00 · d70f2eb8ee
parent 5298af094d
commit d70f2eb8ee
1 changed files with 19 additions and 8 deletions
--- a/src/parallel.pod
+++ b/src/parallel.pod
@@ -631,7 +631,8 @@ Implies B<-X> unless B<-m> is set.

 Do not start new jobs on a given computer unless the load is less than
 I<max-load>. I<max-load> uses the same syntax as B<--jobs>, so I<100%>
-for one per CPU is a valid setting.
+for one per CPU is a valid setting. Only difference is 0 which
+actually means 0.

 The load average is only sampled every 10 seconds to avoid stressing
 small computers.
@@ -1523,17 +1524,27 @@ B<$(date -d "today -{1} days" +%Y%m%d)> with give the dates in
 YYYYMMDD with {1} days subtracted.


-=head1 EXAMPLE: Parallel spider
+=head1 EXAMPLE: Parallel web crawler/mirrorer

-This script below will spider a URL in parallel (breadth first). Run
-like this:
+This script below will crawl and mirror a URL in parallel (breadth
+first). Run like this:

-B<PARALLEL=-j50 ./parallel-spider http://www.gnu.org/software/parallel>
+B<PARALLEL=-j100 ./parallel-crawl http://gatt.org.yeslab.org/>
+
+Remove the B<wget> part if you only want a web crawler.
+
+It works by fetching a page from a list of URLs and looking for links
+in that page that are within the same starting URL and that have not
+already been seen. These links are added to a new queue. When all the
+pages from the list are done, the new queue is moved to the list of
+URLs and the process is started over until no unseen links are found.

  #!/bin/bash

-  # E.g. http://www.gnu.org/software/parallel
+  # E.g. http://gatt.org.yeslab.org/
  URL=$1
+  # Stay inside the start dir
+  BASEURL=$(echo $URL | perl -pe 's:#.*::; s:(//.*/)[^/]*:$1:')
  URLLIST=$(mktemp urllist.XXXX)
  URLLIST2=$(mktemp urllist.XXXX)
  SEEN=$(mktemp seen.XXXX)
@@ -1544,9 +1555,9 @@ B<PARALLEL=-j50 ./parallel-spider http://www.gnu.org/software/parallel>

  while [ -s $URLLIST ] ; do
    cat $URLLIST |
-      parallel lynx -listonly -image_links -dump {} \; echo Spidered: {} \>\&2 |
+      parallel lynx -listonly -image_links -dump {} \; wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2 |
      perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and do { $seen{$1}++ or print }' |
-      grep -F $URL |
+      grep -F $BASEURL |
      grep -v -x -F -f $SEEN | tee -a $SEEN > $URLLIST2
    mv $URLLIST2 $URLLIST
  done