REplican: A Regular Expression based web replicator

The most flexible way to copy files on the web!

REplican is a powerful web replicator based on regular expressions. Regular expressions allow either great specificity or generality, giving you a lot of power to retrieve whatever you wish from the web.

Usage: java -jar REplican-1.5.0.jar [args...] URL [URL...]

The jar file, which includes all the source, can be downloaded from http://sourceforge.net/project/showfiles.php?group_id=200370

Arguments (RE = Regular Expression)

--PathAccept=RE a RE of URL paths to accept.
--PathReject=RE a RE of URL paths to reject.
--PathExamine=RE a RE of URL paths to examine for other links.
--PathIgnore=RE a RE of URL paths to ignore for other links.
--PathSave=RE a RE of URL paths to save.
--PathRefuse=RE a RE of URL paths not to save.
--MIMEAccept=RE a RE of MIME types to accept.
--MIMEReject=RE a RE of MIME types to reject.
--MIMEExamine=RE a RE of MIME types to examine for other links.
--MIMEIgnore=RE a RE of MIME types to ignore for other links.
--MIMESave=RE a RE of MIME types to save.
--MIMERefuse=RE a RE of MIME types not to save.
--LogLevel=LEVEL Possible values for LEVEL are: SEVERE, WARNING, INFO, CONFIG, FINE, FINER, FINEST, ALL, OFF. Default: INFO.
--Username=NAME A username to provide if requested by server.
--Password=PASSWORD A password to provide if requested by server.
--Overwrite[=TRUE|FALSE] Whether or not to overwrite files if they already exist locally. Default: FALSE
--SetLastModified[=TRUE|FALSE] Set the date on the file created to the date on the associated URL.
--IfModifiedSince[=TRUE|FALSE] Only overwrite an existing file if the associated URL is newer. This assumes all files were created with the --SetLastModified flag and that the --Overwrite flag is used. In short, if you want to keep date information, always use all three flags (--Overwrite, --SetLastModified, and --IfModifiedSince) when replicating.
--Directory=String The directory to replicate to, instead of the current working directory.
--UserAgent=String Set the User-Agent identifier. Default: Java/1.5.0
--LoadCookies=String... Load cookies from file(s).
--SaveCookies=String Save cookies to file.
--IgnoreCookies[=TRUE|FALSE] If you want to ignore all cookies. Default: FALSE
--IndexName=String The file name to save paths ending with a '/'. Default: index.html
--FollowRedirects[=TRUE|FALSE] Follow redirections. Default: TRUE.
--StopOn=HTTP return code... The return codes REplican should stop on. Examples include 403, 404, etc.
--CheckpointEvery=# Write a checkpoint file every # changes (additions, downloads, etc.). REplican checks for the file on startup and uses it to initialize its state. This is very useful if one has a lot of files to process and the machine crashes, the network dies, etc.; REplican does not have to start over examining the entire web site(s).
--CheckpointFile=String The name of the checkpoint file (default: REplican.cp).
--Interesting=Regular expression... Any number of regular expressions that match patterns inside <...> pairs for consideration by REplican. The capturing group(s) contain the URLs to be considered. The default set is:
    <\s*[aA].+[hH][rR][eE][fF]\s*=[\"']?([^\"'#>]*)
    <\s*[lL][iI][nN][kK].+[hH][rR][eE][fF]\s*=[\"']?([^\"'#>]*)
    <\s*[iI][mM][gG].+[sS][rR][cC]\s*=[\"']?([^\"'#>]*)
    <\s*[fF][rR][aA][mM][eE].+[sS][rR][cC]\s*=[\"']?([^\"'#>]*)
    
If you want to remove all of these, see the Boring argument, and add your own.
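As a rough illustration of how such a pattern behaves (a sketch using Java's own regex API, not REplican's internal code), the first default pattern pulls the URL out of an anchor tag via capturing group 1:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InterestingDemo {
    // The first default "Interesting" pattern: an <a ... href=...> tag.
    static final Pattern ANCHOR =
        Pattern.compile("<\\s*[aA].+[hH][rR][eE][fF]\\s*=[\"']?([^\"'#>]*)");

    // Return the URL in capturing group 1, or null if the tag does not match.
    static String extractHref(String tag) {
        Matcher m = ANCHOR.matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Prints: http://example.com/page.html
        System.out.println(extractHref("<a class=\"nav\" href=\"http://example.com/page.html\">"));
    }
}
```

Any pattern you supply with --Interesting should follow the same shape: the capturing group holds the URL that REplican will consider.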
--Boring Remove all the Interesting regular expressions. You'll want to add your own with the --Interesting argument after using this.
--FilenameRewrite=String --FilenameRewrite=String These arguments must come in pairs: the first is a pattern to match, and the second is a replacement string. For example, if you want to remove anything after a '.wmv', you can do:
--FilenameRewrite="\\.wmv.*"
--FilenameRewrite=".wmv"
    
If you want to combine all files from a content distribution network into one directory:
--FilenameRewrite="[1234].cdn.site.com"
--FilenameRewrite="cdn.site.com"
    
If you want to replace certain characters in saved file names:
--FilenameRewrite="%20"
--FilenameRewrite=" "
--FilenameRewrite="\\&"
--FilenameRewrite="AMP"
--FilenameRewrite="\\="
--FilenameRewrite="EQ"
--FilenameRewrite="\\?"
--FilenameRewrite="QUES"
    
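Each pair behaves like a pattern/replacement call to Java's String.replaceAll. A minimal sketch (the rewrite helper is hypothetical, not REplican's internals), using the examples above:

```java
public class RewriteDemo {
    // Apply one pattern/replacement pair, as a --FilenameRewrite pair would.
    static String rewrite(String name, String pattern, String replacement) {
        return name.replaceAll(pattern, replacement);
    }

    public static void main(String[] args) {
        // Truncate anything after ".wmv".
        System.out.println(rewrite("movie.wmv?session=42", "\\.wmv.*", ".wmv"));
        // Fold the numbered CDN hosts into one name.
        System.out.println(rewrite("3.cdn.site.com/clip", "[1234].cdn.site.com", "cdn.site.com"));
    }
}
```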
--PrintAccept[=TRUE|FALSE]
--PrintReject[=TRUE|FALSE]
--PrintExamine[=TRUE|FALSE]
--PrintIgnore[=TRUE|FALSE]
--PrintSave[=TRUE|FALSE]
--PrintRefuse[=TRUE|FALSE]
--PrintAdd[=TRUE|FALSE]
Whether to display the results of specific operations. By default, only --PrintSave is active.
--Help Print the command line options and exit.
--Version Print the version and exit.

Algorithm

First, a URL is checked to see whether it should be accepted or rejected. Note: even the initial URL from the command line must be accepted and not rejected, or nothing will be done. Also note: if you don't specify a set of URLs to accept (or a set to reject), all URLs found will be considered for replication. This probably isn't what you want, as it will attempt to replicate every part of the web linked from the initial URL(s). You control acceptance via the --PathAccept, --PathReject, --MIMEAccept, and --MIMEReject regular expression arguments. The Path arguments take regular expressions that match URLs; the MIME arguments take regular expressions that match MIME types. For each pair, Accepts are evaluated before Rejects, so the Rejects can be a subset of what is Accepted.

Once a URL is accepted, it is added to the list to be examined or saved. The --PathExamine, --PathIgnore, --MIMEExamine, and --MIMEIgnore arguments control whether a given document is searched for links to other documents and objects to be saved. In addition, the --PathSave, --PathRefuse, --MIMESave, and --MIMERefuse regular expressions decide whether to save a URL to the local machine. Therefore, one can examine certain types of files and save other types. If you specify nothing, --MIMEExamine is set to "text/.*" and --PathSave is set to ".*"; i.e.: examine all text MIME types and save everything.

With the usage of regular expressions to drive REplican, there is no "recursive" command line argument needed -- if a link is found that matches the criteria, it is followed.

Regular Expressions

REplican uses the full expressiveness of Java's regular expressions, an introduction to which can be found at: http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html.

Typically, you want to use '\.' in the matching regular expressions for site and extension specifiers, as the dot ('.') in a regular expression means 'any character'; '.*' means 'zero or more of any character'. The exception to the '\.' rule is the initial URLs (listed last on the command line), as they are not regular expressions and do not require escaping the dots with a backslash.
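The difference matters in practice. A small sketch with Java's String.matches (illustrative only):

```java
public class DotDemo {
    public static void main(String[] args) {
        // An unescaped dot matches any character, so this also matches "indexXhtml".
        System.out.println("indexXhtml".matches("index.html"));   // true
        // Escaping the dot restricts the match to a literal '.'.
        System.out.println("indexXhtml".matches("index\\.html")); // false
        System.out.println("index.html".matches("index\\.html")); // true
    }
}
```

An unescaped dot in a --PathAccept pattern is usually harmless, but it can accept more URLs than you intended.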

Examples

Grab just one document by PathIgnore'ing all files -- i.e.: don't look for more links in any files. Note the backslashes before the dots in the regular expressions, but the lack of them in the URL.

java -jar REplican-1.3.jar \
    --PathAccept='http://emess\.mscd\.edu/~beaty/Dossier/vitae\.html' \
    --PathIgnore='.*' --LogLevel=INFO \
    http://emess.mscd.edu/~beaty/Dossier/vitae.html

As above, but overwrite any existing file.

java -jar REplican-1.3.jar \
    --PathAccept='http://emess\.mscd\.edu/~beaty/Dossier/vitae\.html' \
    --PathIgnore='.*' --LogLevel=INFO \
    --Overwrite \
    http://emess.mscd.edu/~beaty/Dossier/vitae.html

Use MIME types: accept html files, but don't look in any file for more links.

java -jar REplican-1.3.jar \
    --MIMEAccept='text/html.*' --MIMEIgnore='.*' \
    --LogLevel=FINEST \
    http://emess.mscd.edu/~beaty/Dossier/vitae.html

Save only PDFs (not even HTML files), look at [hH][tT][mM][lL]* paths for new links, and accept files from only one location.

java -jar REplican-1.3.jar \
    --PathAccept='http://emess\.mscd\.edu/~beaty/Dossier/.*\.html' \
    --PathAccept='http://emess\.mscd\.edu/~beaty/Dossier/.*\.pdf' \
    --PathSave='.*\.pdf' \
    --PathExamine='.*\.[hH][tT][mM][lL]*' \
    --LogLevel=FINEST \
    http://emess.mscd.edu/~beaty/Dossier/vitae.html

Process

  1. Check the URL against PathAccept and PathReject. Note: if no PathAccept, PathReject, MIMEAccept, or MIMEReject is given, PathAccept is set to only the initial URL(s) specified on the command line so that REplican doesn't fetch the entire web :-)
  2. Examine the document for links if PathExamine or MIMEExamine match, and PathIgnore and MIMEIgnore don't. Note: if nothing about examining or ignoring is specified on the command line, MIMEExamine is set to "text/.*", which is a good general value, as one typically does not want to examine images, videos, applications, etc.
  3. Whether we examine it or not, decide whether to save the URL. Note: if nothing is specified on the command line, everything is saved.
A list of MIME types can be found at http://www.iana.org/assignments/media-types/.
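The three steps above can be sketched with hypothetical helper methods (a simplification under the assumption of full-string matching, not REplican's actual code):

```java
import java.util.List;
import java.util.regex.Pattern;

public class DecisionSketch {
    // True if the string fully matches any pattern in the list.
    static boolean matchesAny(List<Pattern> patterns, String s) {
        return patterns.stream().anyMatch(p -> p.matcher(s).matches());
    }

    // Step 1: accepted if an Accept pattern matches and no Reject pattern does.
    static boolean accept(List<Pattern> accepts, List<Pattern> rejects, String url) {
        return matchesAny(accepts, url) && !matchesAny(rejects, url);
    }

    // Step 2: examined for links if Examine matches and Ignore does not.
    static boolean examine(List<Pattern> examines, List<Pattern> ignores, String s) {
        return matchesAny(examines, s) && !matchesAny(ignores, s);
    }

    // Step 3: saved if Save matches and Refuse does not.
    static boolean save(List<Pattern> saves, List<Pattern> refuses, String s) {
        return matchesAny(saves, s) && !matchesAny(refuses, s);
    }
}
```

Each decision runs the same Accept-then-Reject shape against either the URL path or the MIME type.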

SOCKS

Built into Java (not just REplican) is the ability to proxy all network traffic through a SOCKS proxy. Two command line arguments to java make this work: java -DsocksProxyHost=[hostname] -DsocksProxyPort=[portnumber] -jar REplican-1.3.jar [REplican's arguments]. Important: as with all other options to java, these must appear before the class or jar file on the command line, or they will be taken as arguments to the program. Also note that if Java cannot make a connection via SOCKS, it will try again without SOCKS, which might not be the behavior you expect.
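For reference, the -D flags simply set Java system properties. The sketch below shows the equivalent programmatic form (the proxy host name is a made-up example); it only takes effect if run before the first network connection is opened:

```java
public class SocksDemo {
    public static void main(String[] args) {
        // Equivalent to: java -DsocksProxyHost=proxy.example.com -DsocksProxyPort=1080 ...
        System.setProperty("socksProxyHost", "proxy.example.com"); // hypothetical host
        System.setProperty("socksProxyPort", "1080");              // conventional SOCKS port
        System.out.println(System.getProperty("socksProxyHost"));
    }
}
```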

Possible futures