REplican: A Regular Expression based web replicator

The most flexible way to copy files on the web!

REplican is a powerful web replicator based on regular expressions. Regular expressions allow either great specificity or generality, giving you a lot of power to retrieve whatever you wish from the web.

Usage: java -jar REplican-1.5.1.jar [args...] URL [URL...]

N.B.: the URL(s) are automatically added to PathAccept.

The jar file, which includes all the source, can be downloaded from http://sourceforge.net/project/showfiles.php?group_id=200370

Arguments (RE = Regular Expression)

--PathAccept=RE	a RE of URL paths to accept.
--PathReject=RE	a RE of URL paths to reject.
--PathExamine=RE	a RE of URL paths to examine for other links. These are automatically added to PathAccept.
--PathIgnore=RE	a RE of URL paths to ignore for other links.
--PathSave=RE	a RE of URL paths to save. These are automatically added to PathAccept.
--PathRefuse=RE	a RE of URL paths not to save.
--MIMEAccept=RE	a RE of MIME types to accept.
--MIMEReject=RE	a RE of MIME types to reject.
--MIMEExamine=RE	a RE of MIME types to examine for other links.
--MIMEIgnore=RE	a RE of MIME types to ignore for other links.
--MIMESave=RE	a RE of MIME types to save.
--MIMERefuse=RE	a RE of MIME types not to save.
--LogLevel=LEVEL	Possible values for LEVEL are: SEVERE, WARNING, INFO, CONFIG, FINE, FINER, FINEST, ALL, OFF. Default: INFO.
--Username=NAME	A username to provide if requested by server.
--Password=PASSWORD	A password to provide if requested by server.
--Overwrite[=TRUE|FALSE]	Whether or not to overwrite files if they already exist locally. Default: FALSE
--SetLastModified[=TRUE|FALSE]	Set the date on the file created to the date on the associated URL.
--IfModifiedSince[=TRUE|FALSE]	Only overwrite an existing file if the associated URL is newer. Assumes that all files were created using the --SetLastModified flag and that the --Overwrite flag is used. Essentially, if you want to keep date information, always use all three flags (--Overwrite, --SetLastModified, and --IfModifiedSince) when replicating.
--IfNewerThan=file	Only save files that are newer than the specified file.
--Directory=String	The directory to replicate to, instead of the current working directory.
--UserAgent=String	Set the User-Agent identifier. Default: Java/1.5.0
--LoadCookies=String...	Load cookies from file(s).
--SaveCookies=String	Save cookies to file.
--IgnoreCookies[=TRUE|FALSE]	Whether to ignore all cookies. Default: FALSE
--IndexName=String	The file name under which to save paths ending with a '/'. Default: index.html
--FollowRedirects[=TRUE|FALSE]	Follow redirections. Default: TRUE.
--SaveProgress[=TRUE|FALSE]	Show a progress bar for saves. Default: FALSE.
--StopOn=HTTP return code...	The return codes REplican should stop on. Examples include 403, 404, etc.
--CheckpointEvery=#	Write a checkpoint file every # changes (additions, downloads, etc.). REplican checks for the file on startup and uses it to initialize its state. This is very useful if one has a lot of files to process and the machine crashes, the network dies, etc.; REplican does not have to start over examining the entire web site(s).
--CheckpointFile=String	The name of the checkpoint file (default: REplican.cp).
--PauseBetween=#	Pause for # milliseconds between each request.
--PauseAfterSave=#	Pause for # milliseconds after each file saved.
--Interesting=Regular expression...	Any number of regular expressions that match patterns inside <...> pairs for consideration by REplican. The capturing group(s) contain the URLs to be considered. If you specify any, they override all the defaults, which are: [hH][rR][eE][fF]\s*=\s*[\"']?([^\"'#>]*) [sS][rR][cC]\s*=\s*[\"']?([^\"'#>]*)
--FilenameRewrite=String --FilenameRewrite=String	These arguments must come in pairs. The first is a pattern to match, and the second is a replacement string. For example, if you want to remove anything after a '.wmv', you can do: --FilenameRewrite="\\.wmv.*" --FilenameRewrite=".wmv" If you want to combine all files from a content distribution network into one directory: --FilenameRewrite="[1234].cdn.site.com" --FilenameRewrite="cdn.site.com" If you want to remove certain characters from saved files: --FilenameRewrite="%20" --FilenameRewrite=" " --FilenameRewrite="\\&" --FilenameRewrite="AMP" --FilenameRewrite="\\=" --FilenameRewrite="EQ" --FilenameRewrite="\\?" --FilenameRewrite="QUES"
--URLRewrite=String --URLRewrite=String	Rewrite the URLs as with the FilenameRewrite above. Useful if a site dynamically rewrites or randomizes URLs.
--URLFixUp=String --URLFixUp=String	Fix up URLs before matching to reduce the complexity of the regular expressions needed to match URLs. The defaults are: --URLFixUp="[\s]+" --URLFixUp=" " which condense each run of white space into a single space.
--PrintAccept[=TRUE|FALSE] --PrintReject[=TRUE|FALSE] --PrintExamine[=TRUE|FALSE] --PrintIgnore[=TRUE|FALSE] --PrintSave[=TRUE|FALSE] --PrintRefuse[=TRUE|FALSE] --PrintAdd[=TRUE|FALSE]	Whether to display the results of specific operations. By default, only --PrintSave is active.
--Help	Print the command line options and exit.
--Version	Print the version and exit.
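
The pattern/replacement pairing used by --FilenameRewrite (and --URLRewrite) can be pictured with a minimal Java sketch. This is not REplican's own code: the class RewriteSketch and the method rewrite are hypothetical names, and the sketch assumes the pairs are applied in command line order using java.lang.String.replaceAll.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not taken from the REplican source.
class RewriteSketch {
    // Applies (pattern, replacement) pairs in order. The list is flat:
    // element 0 is a pattern, element 1 its replacement, and so on.
    static String rewrite(String name, List<String> pairs) {
        if (pairs.size() % 2 != 0) {
            throw new IllegalArgumentException("rewrite arguments must come in pairs");
        }
        String result = name;
        for (int i = 0; i < pairs.size(); i += 2) {
            // replaceAll treats its first argument as a regular expression.
            result = result.replaceAll(pairs.get(i), pairs.get(i + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> pairs = Arrays.asList("\\.wmv.*", ".wmv", "%20", " ");
        // The file name below is made up for the example.
        System.out.println(rewrite("my%20clip.wmv?session=123", pairs)); // prints: my clip.wmv
    }
}
```

Because each pair sees the output of the previous one, the order of the pairs on the command line can matter under this assumption.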

Algorithm

First, a URL is checked to see whether it should be accepted or rejected. Note: even the initial URL from the command line must be accepted and not rejected, or nothing will be done. Also note: if you do not restrict the set of URLs to accept (or reject), all URLs found will be considered for replication. This probably isn't what you want, as REplican will attempt to replicate every part of the web linked to from the initial URL(s). You control acceptance via the --PathAccept, --PathReject, --MIMEAccept, and --MIMEReject command line regular expression arguments. The Path arguments take regular expressions that match URLs; the MIME arguments take regular expressions that match MIME types. For each pair, Accepts are checked before Rejects, so the Rejects can carve out a subset of what is accepted.
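
The accept-then-reject check for one pair of arguments can be sketched in Java as follows. This is an illustration under stated assumptions, not REplican's implementation: the names AcceptSketch, matchesAny, and accepted are hypothetical, the example.org URLs are made up, and the sketch assumes that a regular expression must match the whole URL (Matcher.matches) rather than a part of it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not taken from the REplican source.
class AcceptSketch {
    // True if the value fully matches at least one of the patterns.
    static boolean matchesAny(List<Pattern> patterns, String value) {
        for (Pattern p : patterns) {
            if (p.matcher(value).matches()) {
                return true;
            }
        }
        return false;
    }

    // Accept patterns are consulted first; a reject pattern can then
    // remove a subset of what was accepted.
    static boolean accepted(String url, List<Pattern> accept, List<Pattern> reject) {
        return matchesAny(accept, url) && !matchesAny(reject, url);
    }

    public static void main(String[] args) {
        List<Pattern> accept = Arrays.asList(Pattern.compile("http://example\\.org/docs/.*"));
        List<Pattern> reject = Arrays.asList(Pattern.compile(".*\\.zip"));
        System.out.println(accepted("http://example.org/docs/a.html", accept, reject)); // true
        System.out.println(accepted("http://example.org/docs/a.zip", accept, reject));  // false
        System.out.println(accepted("http://other.example/a.html", accept, reject));    // false
    }
}
```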

Once a URL is accepted, it is added to the list to be examined or saved. The --PathExamine, --PathIgnore, --MIMEExamine, and --MIMEIgnore arguments control whether a given document is searched for links to other documents and objects to be saved. In addition, the --PathSave, --PathRefuse, --MIMESave, and --MIMERefuse regular expressions decide whether to save a URL to the local machine. Therefore, one can examine certain types of files and save other types. If you specify nothing, --MIMEExamine is set to "text/.*" and --PathSave is set to ".*"; i.e., examine every document with a text MIME type and save everything. Note: if you want to examine or save a URL, you must first accept and not reject it.
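
A small Java sketch of the examine and save decisions with the documented defaults may help. It is hypothetical (DecisionSketch, hit, and decide are not REplican names) and assumes each decision has the shape "the positive Path or MIME expression matches, and neither negative expression does", with whole-string matching; the example.org URL is made up.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not taken from the REplican source.
class DecisionSketch {
    // A null pattern stands for "not specified" and never matches.
    static boolean hit(Pattern p, String s) {
        return p != null && p.matcher(s).matches();
    }

    // (positive Path OR positive MIME) AND NOT (negative Path OR negative MIME).
    static boolean decide(String url, String mime,
                          Pattern pathYes, Pattern mimeYes,
                          Pattern pathNo, Pattern mimeNo) {
        boolean wanted = hit(pathYes, url) || hit(mimeYes, mime);
        boolean vetoed = hit(pathNo, url) || hit(mimeNo, mime);
        return wanted && !vetoed;
    }

    public static void main(String[] args) {
        Pattern mimeExamine = Pattern.compile("text/.*"); // documented default
        Pattern pathSave = Pattern.compile(".*");         // documented default
        String url = "http://example.org/paper.pdf";
        String mime = "application/pdf";
        // A PDF is not searched for links, but it is saved.
        System.out.println("examine: " + decide(url, mime, null, mimeExamine, null, null)); // examine: false
        System.out.println("save: " + decide(url, mime, pathSave, null, null, null));       // save: true
    }
}
```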

Because regular expressions drive REplican, no "recursive" command line argument is needed -- if a link is found that matches the criteria, it is followed.

Regular Expressions

REplican uses the full expressiveness of Java's regular expressions, an introduction to which can be found at: http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html.

Typically, you want to use '\.' in the matching regular expressions for site and extension specifiers, as the dot ('.') is used in regular expressions to mean 'any character'. '.*' is used to specify 'zero or more of any character'. The exception to the '\.' rule is the initial URLs (listed last on the command line), which do not require escaping the dots with a backslash.
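
The effect of an unescaped dot can be checked directly with java.util.regex. The sketch below is illustrative only: DotSketch and full are hypothetical names and the URLs are made up. Pattern.quote is shown as one standard way to turn a literal URL into a pattern; whether REplican treats the initial URLs this way is an assumption.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not taken from the REplican source.
class DotSketch {
    // True if the regular expression matches the entire input.
    static boolean full(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String url = "http://example.org/a.pdf";
        String lookalike = "http://exampleXorg/a.pdf";

        // An unescaped dot matches any character, so the lookalike slips through.
        System.out.println(full("http://example.org/a.pdf", lookalike));     // true
        // An escaped dot matches only a literal dot.
        System.out.println(full("http://example\\.org/a\\.pdf", lookalike)); // false
        System.out.println(full("http://example\\.org/a\\.pdf", url));       // true
        // Pattern.quote escapes a whole literal string at once.
        System.out.println(full(Pattern.quote(url), lookalike));             // false
    }
}
```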

Examples

Grab just one document by PathIgnore'ing all files -- i.e.: don't look for more links in any files. Note the backslashes before the dots in the regular expressions, but the lack of them in the URL.

java -jar REplican-1.3.jar \
    --PathAccept='http://emess\.mscd\.edu/~beaty/Dossier/vitae\.html' \
    --PathIgnore='.*' --LogLevel=INFO \
    http://emess.mscd.edu/~beaty/Dossier/vitae.html

As above, but overwrite any existing file.

java -jar REplican-1.3.jar \
    --PathAccept='http://emess\.mscd\.edu/~beaty/Dossier/vitae\.html' \
    --PathIgnore='.*' --LogLevel=INFO \
    --Overwrite \
    http://emess.mscd.edu/~beaty/Dossier/vitae.html

Use MIME types: accept HTML files, but don't look in any file for more links.

java -jar REplican-1.3.jar \
    --MIMEAccept='text/html.*' --MIMEIgnore='.*' \
    --LogLevel=FINEST \
    http://emess.mscd.edu/~beaty/Dossier/vitae.html

Save only PDFs (not even HTML files), look at [hH][tT][mM][lL]* paths for new links, and accept files from only one location.

java -jar REplican-1.3.jar \
    --PathAccept='http://emess\.mscd\.edu/~beaty/Dossier/.*\.html' \
    --PathAccept='http://emess\.mscd\.edu/~beaty/Dossier/.*\.pdf' \
    --PathSave='.*\.pdf' \
    --PathExamine='.*\.[hH][tT][mM][lL]*' \
    --LogLevel=FINEST \
    http://emess.mscd.edu/~beaty/Dossier/vitae.html

Process

Check the URL against PathAccept and PathReject.
- If it passes, add it to the list of URLs to download.
- When it is time to fetch, open the URL and check it against MIMEAccept and MIMEReject.
Note: if no PathAccept, PathReject, MIMEAccept, and MIMEReject are given, PathAccept is set to only the initial URL(s) specified on the command line so that REplican doesn't fetch the entire web :-)
If PathExamine or MIMEExamine match, and PathIgnore or MIMEIgnore don't:
- Fetch the URL and look for other URLs to fetch (within "<a.*[hH][rR][eE][fF]=", "<link.*[hH][rR][eE][fF]=", and "<img.*[sS][rR][cC]=" tags).
- Those found are put on a list for later retrieval.
Note: if nothing about examining or ignoring is specified on the command line, MIMEExamine is set to "text/.*", which is a good general value as one typically does not want to examine images, videos, applications, etc.
Whether we examine them or not, should we save the URL?
- PathSave, MIMESave, PathRefuse, and MIMERefuse are consulted.
Note: if nothing is specified on the command line, everything is saved.
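
The link-finding step can be sketched in Java. This is a hypothetical illustration, not REplican's code: LinkSketch, TAG, INTERESTING, and links are made-up names, and the two expressions are written here only in the spirit of the href and src matching described above. The sketch looks inside each <...> pair and collects whatever the capturing groups match; it does not resolve relative URLs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not taken from the REplican source.
class LinkSketch {
    // Captures the text between '<' and '>'.
    static final Pattern TAG = Pattern.compile("<([^>]*)>");

    // Assumed patterns; the capturing group holds the candidate URL.
    static final Pattern[] INTERESTING = {
        Pattern.compile("[hH][rR][eE][fF]\\s*=\\s*[\"']?([^\"'#>]*)"),
        Pattern.compile("[sS][rR][cC]\\s*=\\s*[\"']?([^\"'#>]*)")
    };

    static List<String> links(String html) {
        List<String> found = new ArrayList<String>();
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            String inside = tag.group(1);
            for (Pattern p : INTERESTING) {
                Matcher m = p.matcher(inside);
                while (m.find()) {
                    // Every capturing group is a candidate URL.
                    for (int g = 1; g <= m.groupCount(); g++) {
                        String url = m.group(g);
                        if (url != null && !url.isEmpty()) {
                            found.add(url);
                        }
                    }
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html#top\">x</a> <img SRC='pic.png'> href=no";
        System.out.println(links(html)); // prints: [a.html, pic.png]
    }
}
```

Text outside of <...> pairs (the trailing "href=no" above) is never considered, and the fragment after '#' is dropped because the capturing group stops at that character.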

A list of MIME types can be found at http://www.iana.org/assignments/media-types/.

SOCKS

Built into Java (not just REplican) is the ability to have all network traffic proxied through a SOCKS proxy. Two command line arguments to java make this work:

java -DsocksProxyHost=[hostname] -DsocksProxyPort=[portnumber] \
    -jar REplican-1.3.jar [REplican's arguments]

Important: as with all other options to java, these must appear before the class or jar file on the command line or they will be taken as arguments to the program. Also, if Java cannot make a connection via SOCKS, it will try without using SOCKS, which might not be the behavior you expect.
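
To confirm that the -D options reached the JVM rather than the program, the corresponding system properties can be inspected. The following Java sketch is a hypothetical stand-alone check (SocksCheck and describe are made-up names, and it is not part of REplican); it relies only on the standard socksProxyHost and socksProxyPort properties, the latter defaulting to 1080 when unset.

```java
// Hypothetical illustration only; not taken from the REplican source.
class SocksCheck {
    // Reports the SOCKS settings visible to this JVM.
    static String describe() {
        String host = System.getProperty("socksProxyHost");
        // 1080 is Java's default SOCKS port when socksProxyPort is not set.
        String port = System.getProperty("socksProxyPort", "1080");
        return host == null
            ? "no SOCKS proxy configured"
            : "SOCKS proxy " + host + ":" + port;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the -D options are placed after the class or jar name, this check reports that no proxy is configured, because the options arrive in args instead.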

Possible futures

Per-URL usernames and passwords (as is part of the URL specification).
Updating file based on If-modified-since and saving server file times.