REplican: A Regular Expression based web replicator

REplican is a powerful web replicator based on regular expressions. Regular expressions allow either great specificity or generality, giving you a lot of power to retrieve whatever you wish from the web.

Usage: java -jar REplican-1.4.7.jar [args...] URL [URL...]

The jar file, which includes all the source, can be downloaded from http://sourceforge.net/project/showfiles.php?group_id=200370

Any number of Accept's, Reject's, Save's, Refuse's, Examine's, and Ignore's can be specified individually on the command line; their values are ANDed when matching.

Arguments (RE = Regular Expression)

--PathAccept=RE A RE of URL paths to accept.
--PathReject=RE A RE of URL paths to reject.
--PathExamine=RE A RE of URL paths to examine for other links.
--PathIgnore=RE A RE of URL paths to ignore for other links.
--PathSave=RE A RE of URL paths to save.
--PathRefuse=RE A RE of URL paths not to save.
--MIMEAccept=RE A RE of MIME types to accept.
--MIMEReject=RE A RE of MIME types to reject.
--MIMEExamine=RE A RE of MIME types to examine for other links.
--MIMEIgnore=RE A RE of MIME types to ignore for other links.
--MIMESave=RE A RE of MIME types to save.
--MIMERefuse=RE A RE of MIME types not to save.
--LogLevel=LEVEL Possible values for LEVEL are: SEVERE, WARNING, INFO, CONFIG, FINE, FINER, FINEST, ALL, OFF. Default: INFO.
--Username=NAME A username to provide if requested by server.
--Password=PASSWORD A password to provide if requested by server.
--Overwrite[=TRUE|FALSE] Whether or not to overwrite files if they already exist locally. Default: FALSE
--SetLastModified[=TRUE|FALSE] Set the date on the file created to the date on the associated URL.
--IfModifiedSince[=TRUE|FALSE] Only overwrite an existing file if the associated URL is newer. Assumes all files were created using the --SetLastModified flag and the --Overwrite flag is used. Essentially, if you want to keep date information, always use all three flags (--Overwrite, --SetLastModified, and --IfModifiedSince) when replicating.
--Directory=String The directory to replicate to, instead of the current working directory.
--UserAgent=String Set the User-Agent identifier. Default: Java/1.5.0
--LoadCookies=String... Load cookies from file(s).
--SaveCookies=String Save cookies to file.
--IgnoreCookies[=TRUE|FALSE] If you want to ignore all cookies. Default: FALSE
--IndexName=String The file name to use when saving paths that end with a '/'. Default: index.html
--FollowRedirects[=TRUE|FALSE] Follow redirections. Default: TRUE.
--StopOn=HTTP return code... The return codes REplican should stop on. Examples include 403, 404, etc.
--CheckpointEvery=# Write a checkpoint file every # changes (additions, downloads, etc.). REplican checks for the file on startup and uses it to initialize its state. This is very useful if one has a lot of files to process and the machine crashes, the network dies, etc.; REplican does not have to start over examining the entire web site(s).
--CheckpointFile=String The name of the checkpoint file (default: REplican.cp).
--Interesting=Regular expression... Any number of regular expressions that match patterns inside <...> pairs for consideration by REplican. The capturing group(s) contain the URLs to be considered. The default set is:
    <\s*[aA].+[hH][rR][eE][fF]\s*=[\"']?([^\"'#> ]*)
    <\s*[lL][iI][nN][kK].+[hH][rR][eE][fF]\s*=[\"']?([^\"'#> ]*)
    <\s*[iI][mM][gG].+[sS][rR][cC]\s*=[\"']?([^\"'#> ]*)
    <\s*[fF][rR][aA][mM][eE].+[sS][rR][cC]\s*=[\"']?([^\"'#> ]*)
    
To replace these defaults, remove them with the Boring argument and then add your own.
--Boring Remove all the Interesting regular expressions. You'll want to add some afterwards using the Interesting argument.
--Help Print the command line options and exit.
--Version Print the version and exit.

Note: if you neither Examine nor Save a Path or MIME type, there is no reason to Accept it either, and not Accepting it (or Rejecting it) will increase the speed of the program as those URL's won't be considered. Also, basing accept/examine/save decisions on paths is faster and uses less bandwidth, as a URL must be opened to examine its MIME information, whereas paths can be evaluated without opening the remote URL.
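
To see what the capturing group in an Interesting expression picks up, the following sketch applies two of the default expressions to single tags using java.util.regex. It is only an illustration: the class and method names are made up for this example, the sample tags are hypothetical, and the use of find() on one tag at a time is an assumption about how such an expression might be applied, not a description of REplican's internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InterestingDemo {
    // Two of the default Interesting expressions, written as Java string literals.
    static final String ANCHOR_HREF =
        "<\\s*[aA].+[hH][rR][eE][fF]\\s*=[\"']?([^\"'#> ]*)";
    static final String FRAME_SRC =
        "<\\s*[fF][rR][aA][mM][eE].+[sS][rR][cC]\\s*=[\"']?([^\"'#> ]*)";

    /** Returns the text of every capturing group of the first match, or an empty list. */
    static List<String> capturedUrls(String regex, String tag) {
        List<String> urls = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(tag);
        if (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.group(g) != null) {
                    urls.add(m.group(g));
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // The capture stops at the '#', so the fragment is dropped.
        System.out.println(capturedUrls(ANCHOR_HREF, "<a class=\"nav\" href=\"docs/page.html#top\">"));
        // prints [docs/page.html]
        System.out.println(capturedUrls(FRAME_SRC, "<frame src='menu.html'>"));
        // prints [menu.html]
    }
}
```

Because the capture group excludes quotes, '#', '>' and spaces, it yields the bare URL without surrounding quotes or a fragment. A custom Interesting expression can be tried against sample tags the same way before it is used on the command line.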

Regular Expressions

REplican uses the full expressiveness of Java's regular expressions, an introduction to which can be found at: http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html.

Typically, you want to use '\.' in the matching regular expressions for site and extension specifiers, as the dot ('.') is used in regular expressions to mean 'any character'. '.*' is used to specify 'zero or more of any character'. The exception to the '\.' rule is the initial URL's (listed last on the command line), which do not require escaping the dots with a backslash.
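
The effect of escaping the dots can be checked directly with java.util.regex. The sketch below is illustrative only: the class name and the look-alike URL are hypothetical, and whole-string matching via Pattern.matches is an assumption made for the demonstration rather than a statement about how REplican applies its expressions.

```java
import java.util.regex.Pattern;

class DotEscapeDemo {
    /** Returns true if the whole URL matches the regular expression. */
    static boolean matches(String regex, String url) {
        return Pattern.matches(regex, url);
    }

    public static void main(String[] args) {
        // In Java source each backslash is doubled; on the command line it is a single '\.'.
        String escaped = "http://emess\\.mscd\\.edu/.*\\.html";
        String unescaped = "http://emess.mscd.edu/.*.html";

        String intended = "http://emess.mscd.edu/~beaty/Dossier/vitae.html";
        // Hypothetical URL that differs only where the dots should be.
        String lookalike = "http://emessXmscd.edu/~beaty/Dossier/vitae_html";

        System.out.println(matches(escaped, intended));    // prints true
        System.out.println(matches(escaped, lookalike));   // prints false
        System.out.println(matches(unescaped, intended));  // prints true
        System.out.println(matches(unescaped, lookalike)); // prints true
    }
}
```

The unescaped expression still matches the intended URL, which is why the mistake is easy to miss; it simply matches more than was meant.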

With the usage of regular expressions to drive REplican, there is no "recursive" command line argument needed -- if a link is found that matches the criteria, it is followed.

Examples

Grab just one document by PathIgnore'ing all files -- i.e.: don't look for more links in any files. Note the backslashes before the dots in the regular expressions, but the lack of them in the URL.

java -jar REplican-1.3.jar \
    --PathAccept='http://emess\.mscd\.edu/~beaty/Dossier/vitae\.html' \
    --PathIgnore='.*' --Debug=INFO \
    http://emess.mscd.edu/~beaty/Dossier/vitae.html

As above, but overwrite any existing file.

java -jar REplican-1.3.jar \
    --PathAccept='http://emess\.mscd\.edu/~beaty/Dossier/vitae\.html' \
    --PathIgnore='.*' --Debug=INFO \
    --Overwrite \
    http://emess.mscd.edu/~beaty/Dossier/vitae.html

Use MIME types: accept html files, but don't look in any file for more links.

java -jar REplican-1.3.jar \
    --MIMEAccept='text/html.*' --MIMEIgnore='.*' \
    --Debug=finest \
    http://emess.mscd.edu/~beaty/Dossier/vitae.html

Save only PDF's (not even HTML's), look at [hH][tT][mM][lL]* paths for new links, and accept files from only one location.

java -jar REplican-1.3.jar \
    --PathAccept='http://emess\.mscd\.edu/~beaty/Dossier/.*\.html' \
    --PathAccept='http://emess\.mscd\.edu/~beaty/Dossier/.*\.pdf' \
    --PathSave='.*\.pdf' \
    --PathExamine='.*\.[hH][tT][mM][lL]*' \
    --Debug=finest \
    http://emess.mscd.edu/~beaty/Dossier/vitae.html

Process

  1. Check URL against PathAccept and PathReject. Note: if no PathAccept, PathReject, MIMEAccept, or MIMEReject is given, PathAccept is set to only the initial URL(s) specified on the command line so that REplican doesn't fetch the entire web :-)
  2. If PathExamine or MIMEExamine match, and PathIgnore or MIMEIgnore don't, examine the URL for other links. Note: if nothing about examining or ignoring is specified on the command line, MIMEExamine is set to "text/.*", which is a good general value as one typically does not want to examine images, videos, applications, etc.
  3. Whether we examine them or not, should we save the URL? Note: if nothing is specified on the command line, everything is saved.

A list of MIME types can be found at http://www.iana.org/assignments/media-types/.
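
The three steps above can be sketched in Java as follows. This is a simplified illustration under stated assumptions, not REplican's actual code: the Options class, the hit and decide methods, and the cv.pdf and example.com URLs are all hypothetical; each option holds at most one expression; MIME-based Accept, Save, and Refuse are left out; and the only default modeled is "if nothing is specified, everything is saved". The options mirror the "save only PDF's" example, with its two PathAccept expressions merged into one by alternation.

```java
import java.util.regex.Pattern;

class ProcessSketch {
    /** Hypothetical option holder; a null field means the option was not given. */
    static class Options {
        Pattern pathAccept, pathReject;
        Pattern pathExamine, pathIgnore, mimeExamine, mimeIgnore;
        Pattern pathSave, pathRefuse;
    }

    /** True if the pattern was given and matches the whole value. */
    static boolean hit(Pattern p, String value) {
        return p != null && p.matcher(value).matches();
    }

    /** Returns "reject", or "accept" followed by "+examine" and/or "+save". */
    static String decide(Options o, String url, String mimeType) {
        // Step 1: accept or reject the URL by its path.
        if (!hit(o.pathAccept, url) || hit(o.pathReject, url)) {
            return "reject";
        }
        StringBuilder result = new StringBuilder("accept");
        // Step 2: examine for links if an Examine matches and no Ignore does.
        boolean examine = (hit(o.pathExamine, url) || hit(o.mimeExamine, mimeType))
                && !(hit(o.pathIgnore, url) || hit(o.mimeIgnore, mimeType));
        if (examine) {
            result.append("+examine");
        }
        // Step 3: save unless a Save expression was given and misses, or a Refuse matches.
        boolean save = (o.pathSave == null || hit(o.pathSave, url)) && !hit(o.pathRefuse, url);
        if (save) {
            result.append("+save");
        }
        return result.toString();
    }

    /** Options mirroring the "save only PDF's" example. */
    static Options pdfExample() {
        Options o = new Options();
        o.pathAccept = Pattern.compile("http://emess\\.mscd\\.edu/~beaty/Dossier/.*\\.(html|pdf)");
        o.pathExamine = Pattern.compile(".*\\.[hH][tT][mM][lL]*");
        o.pathSave = Pattern.compile(".*\\.pdf");
        return o;
    }

    public static void main(String[] args) {
        Options o = pdfExample();
        System.out.println(decide(o, "http://emess.mscd.edu/~beaty/Dossier/vitae.html", "text/html"));
        // prints accept+examine
        System.out.println(decide(o, "http://emess.mscd.edu/~beaty/Dossier/cv.pdf", "application/pdf"));
        // prints accept+save
        System.out.println(decide(o, "http://example.com/index.html", "text/html"));
        // prints reject
    }
}
```

In this sketch the MIME type is passed in as a plain string; in practice it is only known after the remote URL has been opened, which is why path-based decisions are the cheaper ones.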

SOCKS

Built into Java (not just REplican) is the ability to have all network traffic proxied through a SOCKS proxy. Two command line arguments to java make this work: java -DsocksProxyHost=[hostname] -DsocksProxyPort=[portnumber] -jar REplican-1.3.jar [REplican's arguments]. Important: as with all other options to java, these must appear before the class or jar file on the command line or they will be taken as arguments to the program. Also, if Java cannot make a connection via SOCKS, it will try without using SOCKS, which might not be the behavior you expect.

Possible futures