REplican is a powerful web replicator based on regular expressions. Regular expressions allow either great specificity or generality, giving you a lot of power to retrieve whatever you wish from the web.
Usage: java -jar REplican-1.5.1.jar [args...] URL [URL...]
N.B.: the URL(s) are automatically added to PathAccept.
The jar file, which includes all the source, can be downloaded from http://sourceforge.net/project/showfiles.php?group_id=200370
--PathAccept=RE | a RE of URL paths to accept. |
--PathReject=RE | a RE of URL paths to reject. |
--PathExamine=RE | a RE of URL paths to examine for other links. These are automatically added to PathAccept. |
--PathIgnore=RE | a RE of URL paths to ignore for other links. |
--PathSave=RE | a RE of URL paths to save. These are automatically added to PathAccept. |
--PathRefuse=RE | a RE of URL paths not to save. |
--MIMEAccept=RE | a RE of MIME types to accept. |
--MIMEReject=RE | a RE of MIME types to reject. |
--MIMEExamine=RE | a RE of MIME types to examine for other links. |
--MIMEIgnore=RE | a RE of MIME types to ignore for other links. |
--MIMESave=RE | a RE of MIME types to save. |
--MIMERefuse=RE | a RE of MIME types not to save. |
--LogLevel=LEVEL | Possible values for LEVEL are: SEVERE, WARNING, INFO, CONFIG, FINE, FINER, FINEST, ALL, OFF. Default: INFO. |
--Username=NAME | A username to provide if requested by server. |
--Password=PASSWORD | A password to provide if requested by server. |
--Overwrite[=TRUE|FALSE] | Whether or not to overwrite files if they already exist locally. Default: FALSE |
--SetLastModified[=TRUE|FALSE] | Set the date on the file created to the date on the associated URL. |
--IfModifiedSince[=TRUE|FALSE] | Only overwrite an existing file if the associated URL is newer. Assumes that all files were created using the --SetLastModified flag and that the --Overwrite flag is also given. Essentially, if you want to keep date information, always use all three flags (--Overwrite, --SetLastModified, and --IfModifiedSince) when replicating. |
--IfNewerThan=file | Only save files that are newer than the specified file. |
--Directory=String | The directory to replicate to, instead of the current working directory. |
--UserAgent=String | Set the User-Agent identifier. Default: Java/1.5.0 |
--LoadCookies=String... | Load cookies from file(s). |
--SaveCookies=String | Save cookies to file. |
--IgnoreCookies[=TRUE|FALSE] | Whether to ignore all cookies. Default: FALSE |
--IndexName=String | The file name under which to save paths ending with a '/'. Default: index.html |
--FollowRedirects[=TRUE|FALSE] | Follow redirections. Default: TRUE. |
--SaveProgress[=TRUE|FALSE] | Show a progress bar for saves. Default: FALSE. |
--StopOn=HTTP return code... | The return codes REplican should stop on. Examples include 403, 404, etc. |
--CheckpointEvery=# | Write a checkpoint file every # changes (additions, downloads, etc.). REplican checks for the file on startup and uses it to initialize its state. This is very useful if one has a lot of files to process and the machine crashes, the network dies, etc.; REplican does not have to start over examining the entire web site(s). |
--CheckpointFile=String | The name of the checkpoint file (default: REplican.cp). |
--PauseBetween=# | Pause for # milliseconds between each request. |
--PauseAfterSave=# | Pause for # milliseconds after each file saved. |
--Interesting=Regular expression... |
Any number of regular expressions that match patterns inside
<...> pairs for consideration by REplican. The capturing
group(s) contain the URLs to be considered.
If you specify any, they override all the defaults, which are:
[hH][rR][eE][fF]\s*=\s*[\"']?([^\"'#>]*)
[sS][rR][cC]\s*=\s*[\"']?([^\"'#>]*) |
--FilenameRewrite=String --FilenameRewrite=String |
These arguments must come in pairs. The first is
a pattern to match, and the second is a replacement string. For
example, if you want to remove anything after a '.wmv', you can do:
--FilenameRewrite="\\.wmv.*" --FilenameRewrite=".wmv"
If you want to combine all files from a content distribution network into one directory:
--FilenameRewrite="[1234].cdn.site.com" --FilenameRewrite="cdn.site.com"
If you want to remove certain characters from saved files:
--FilenameRewrite="%20" --FilenameRewrite=" " --FilenameRewrite="\\&" --FilenameRewrite="AMP" --FilenameRewrite="\\=" --FilenameRewrite="EQ" --FilenameRewrite="\\?" --FilenameRewrite="QUES" |
--URLRewrite=String --URLRewrite=String | Rewrite the URLs as with the FilenameRewrite above. Useful if a site dynamically rewrites or randomizes URLs. |
--URLFixUp=String --URLFixUp=String |
Fix up URLs before matching to reduce the complexity of regular
expressions to match URLs. The defaults are:
--URLFixUp="[\s]+" --URLFixUp=" "
which replaces each run of white space with a single space. |
--PrintAccept[=TRUE|FALSE] --PrintReject[=TRUE|FALSE] --PrintExamine[=TRUE|FALSE] --PrintIgnore[=TRUE|FALSE] --PrintSave[=TRUE|FALSE] --PrintRefuse[=TRUE|FALSE] --PrintAdd[=TRUE|FALSE] |
Whether to display the results of specific operations. By default, only --PrintSave is active. |
--Help | Print the command line options and exit. |
--Version | Print the version and exit. |
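The paired options in the table (--FilenameRewrite, --URLRewrite, --URLFixUp) each take a pattern followed by a replacement. A minimal Java sketch of that idea follows. It assumes the pairs are applied in the order given and that every match is replaced; the class and method names are hypothetical and are not taken from REplican's source.

```java
// Hypothetical helper, not REplican's own code: applies
// pattern/replacement pairs in order, the way the paired
// --FilenameRewrite arguments are described in the table.
class RewritePairs {
    // pairs holds alternating values: pattern, replacement, pattern, ...
    static String rewrite(String input, String... pairs) {
        if (pairs.length % 2 != 0) {
            throw new IllegalArgumentException("rewrite arguments must come in pairs");
        }
        String result = input;
        for (int i = 0; i < pairs.length; i += 2) {
            // Assumption: all matches are replaced, not only the first.
            result = result.replaceAll(pairs[i], pairs[i + 1]);
        }
        return result;
    }

    public static void main(String[] unused) {
        // Drop anything after '.wmv' (first example in the table).
        System.out.println(rewrite("clip.wmv?session=42", "\\.wmv.*", ".wmv"));
        // Fold numbered CDN hosts into one name.
        System.out.println(rewrite("3.cdn.site.com/a.jpg",
                "[1234].cdn.site.com", "cdn.site.com"));
        // The default URLFixUp pair: a run of white space becomes one space.
        System.out.println(rewrite("a  \t b", "[\\s]+", " "));
    }
}
```

Because the replacement goes through `String.replaceAll`, a `$` or `\` in a replacement string would be treated specially in this sketch; whether REplican does the same is not stated here.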
Once a URL is accepted, it is added to the list to be examined or saved. The --PathExamine, --PathIgnore, --MIMEExamine, and --MIMEIgnore options control whether a given document is searched for links to other documents and objects to be saved. In addition, the --PathSave, --PathRefuse, --MIMESave, and --MIMERefuse regular expressions decide whether to save a URL to the local machine. Therefore, one can examine certain types of files, and save other types. If you specify nothing, --MIMEExamine is set to "text/.*" and --PathSave is set to ".*"; i.e.: examine every file with a text MIME type and save everything. Note: if you want to examine or save a URL, you must first accept and not reject it.
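The accept, examine, and save decisions described above can be modeled roughly as follows. This is a sketch under stated assumptions, not REplican's implementation: it handles URL paths only (no MIME types), requires a regular expression to match the whole URL, treats Ignore and Refuse as vetoes, and leaves out the automatic additions to PathAccept. The class `PathPolicy` and its methods are hypothetical names.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model of the path-based decisions; MIME types and the
// automatic additions to PathAccept are deliberately left out.
class PathPolicy {
    private final List<String> accept, reject, examine, ignore, save, refuse;

    PathPolicy(List<String> accept, List<String> reject,
               List<String> examine, List<String> ignore,
               List<String> save, List<String> refuse) {
        this.accept = accept;   this.reject = reject;
        this.examine = examine; this.ignore = ignore;
        this.save = save;       this.refuse = refuse;
    }

    // Assumption: an expression must match the entire URL.
    private static boolean matchesAny(List<String> expressions, String url) {
        for (String re : expressions) {
            if (Pattern.matches(re, url)) {
                return true;
            }
        }
        return false;
    }

    // A URL must be accepted and not rejected before anything else happens.
    boolean accepted(String url) {
        return matchesAny(accept, url) && !matchesAny(reject, url);
    }

    // Examined: searched for links to other documents.
    boolean examined(String url) {
        return accepted(url) && matchesAny(examine, url) && !matchesAny(ignore, url);
    }

    // Saved: written to the local machine.
    boolean saved(String url) {
        return accepted(url) && matchesAny(save, url) && !matchesAny(refuse, url);
    }

    public static void main(String[] unused) {
        PathPolicy policy = new PathPolicy(
                List.of("http://emess\\.mscd\\.edu/~beaty/Dossier/.*\\.html",
                        "http://emess\\.mscd\\.edu/~beaty/Dossier/.*\\.pdf"),
                List.of(),
                List.of(".*\\.[hH][tT][mM][lL]*"),
                List.of(),
                List.of(".*\\.pdf"),
                List.of());
        String page = "http://emess.mscd.edu/~beaty/Dossier/vitae.html";
        String pdf = "http://emess.mscd.edu/~beaty/Dossier/vitae.pdf";
        System.out.println(policy.examined(page) + " " + policy.saved(page)); // true false
        System.out.println(policy.examined(pdf) + " " + policy.saved(pdf));   // false true
    }
}
```

With these settings the HTML page is examined but not saved, and the PDF is saved but not examined, which is the "examine certain types of files, and save other types" case.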
Because regular expressions drive REplican, no "recursive" command line argument is needed: if a link is found that matches the criteria, it is followed.
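How links are found can be illustrated with the two default --Interesting expressions from the option table. The sketch below assumes a tag is any text between '<' and the next '>', and collects the first capturing group of every match inside each tag. `LinkSketch` and `links` are hypothetical names, and REplican's real scanning may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of link discovery, not REplican's own code.
class LinkSketch {
    // The two default --Interesting expressions, as Java string literals.
    static final Pattern[] INTERESTING = {
        Pattern.compile("[hH][rR][eE][fF]\\s*=\\s*[\"']?([^\"'#>]*)"),
        Pattern.compile("[sS][rR][cC]\\s*=\\s*[\"']?([^\"'#>]*)")
    };

    // Assumption: a tag runs from '<' to the next '>'.
    static final Pattern TAG = Pattern.compile("<[^>]*>");

    static List<String> links(String html) {
        List<String> found = new ArrayList<>();
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            for (Pattern interesting : INTERESTING) {
                Matcher m = interesting.matcher(tag.group());
                while (m.find()) {
                    // Group 1 holds the candidate URL, cut at any quote or '#'.
                    found.add(m.group(1));
                }
            }
        }
        return found;
    }

    public static void main(String[] unused) {
        String html = "<a HREF=\"paper.pdf#page=2\">paper</a> <img src='logo.png'>";
        System.out.println(links(html)); // [paper.pdf, logo.png]
    }
}
```

Each URL found this way would then be tested against the accept and reject expressions before being followed.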
REplican uses the full expressiveness of Java's regular expressions, an introduction to which can be found at: http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html.
Typically, you want to use '\.' in the matching regular expressions for site and extension specifiers, as the dot ('.') is used in regular expressions to mean 'any character'. '.*' is used to specify 'zero or more of any character'. The exception to the '\.' rule is the initial URLs (listed last on the command line): they are not regular expressions, so their dots do not need a backslash.
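The effect of escaping the dot can be checked directly with java.util.regex. In this small sketch the host `emessXmscd.edu` is a made-up lookalike, and `DotEscaping.accepts` is a hypothetical helper that requires the whole URL to match:

```java
import java.util.regex.Pattern;

// Shows why '\.' matters: an unescaped dot matches any character.
class DotEscaping {
    // Hypothetical helper: true if the expression matches the entire URL.
    static boolean accepts(String regex, String url) {
        return Pattern.matches(regex, url);
    }

    public static void main(String[] unused) {
        String unescaped = "http://emess.mscd.edu/.*";
        String escaped = "http://emess\\.mscd\\.edu/.*";
        // Pattern.quote escapes a literal prefix without hand-written backslashes.
        String quoted = Pattern.quote("http://emess.mscd.edu/") + ".*";

        String real = "http://emess.mscd.edu/~beaty/Dossier/vitae.html";
        String lookalike = "http://emessXmscd.edu/index.html";

        System.out.println(accepts(unescaped, real));      // true
        System.out.println(accepts(unescaped, lookalike)); // true: '.' matched 'X'
        System.out.println(accepts(escaped, lookalike));   // false
        System.out.println(accepts(quoted, real));         // true
        System.out.println(accepts(quoted, lookalike));    // false
    }
}
```

On the command line the same escaped expression is written with single backslashes inside single quotes, as in the examples that follow; the doubled backslashes here are only Java string-literal syntax.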
Grab just one document by PathIgnore'ing all files -- i.e.: don't look for more links in any files. Note the backslashes before the dots in the regular expressions, but the lack of them in the URL.
java -jar REplican-1.3.jar \
    --PathAccept='http://emess\.mscd\.edu/~beaty/Dossier/vitae\.html' \
    --PathIgnore='.*' --Debug=INFO \
    http://emess.mscd.edu/~beaty/Dossier/vitae.html
As above, but overwrite any existing file.
java -jar REplican-1.3.jar \
    --PathAccept='http://emess\.mscd\.edu/~beaty/Dossier/vitae\.html' \
    --PathIgnore='.*' --Debug=INFO \
    --Overwrite \
    http://emess.mscd.edu/~beaty/Dossier/vitae.html
Use MIME types: accept html files, but don't look in any file for more links.
java -jar REplican-1.3.jar \
    --MIMEAccept='text/html.*' --MIMEIgnore='.*' \
    --Debug=finest \
    http://emess.mscd.edu/~beaty/Dossier/vitae.html
Save only PDFs (not even HTML files), look at [hH][tT][mM][lL]* paths for new links, and accept files from only one location.
java -jar REplican-1.3.jar \
    --PathAccept='http://emess\.mscd\.edu/~beaty/Dossier/.*\.html' \
    --PathAccept='http://emess\.mscd\.edu/~beaty/Dossier/.*\.pdf' \
    --PathSave='.*\.pdf' \
    --PathExamine='.*\.[hH][tT][mM][lL]*' \
    --Debug=finest \
    http://emess.mscd.edu/~beaty/Dossier/vitae.html
("<a.*[hH][rR][eE][fF]=", "<link.*[hH][rR][eE][fF]=", and "<img.*[sS][rR][cC]=" tags).
java -DsocksProxyHost=[hostname] -DsocksProxyPort=[portnumber] -jar REplican-1.3.jar [REplican's arguments]
Important: as with all other options to java, these must appear before the class or jar file on the command line or they will be taken as arguments to the program. Also, if Java cannot make a connection via SOCKS, it will try without using SOCKS, which might not be the behavior you expect.
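The placement rule for the -D options can be checked with a small stand-alone program. This is a hypothetical helper, not part of REplican; it only reports which SOCKS system properties the JVM received and which strings arrived as ordinary program arguments. The fallback port 1080 is the default that Java's networking properties document for socksProxyPort.

```java
// Hypothetical checker, not part of REplican: reports the SOCKS
// system properties seen by the JVM and the plain program arguments.
class ProxyCheck {
    // Returns "host:port" if socksProxyHost is set, otherwise "none".
    static String socksSettings() {
        String host = System.getProperty("socksProxyHost");
        if (host == null) {
            return "none";
        }
        // Java's documented default SOCKS port is 1080.
        String port = System.getProperty("socksProxyPort", "1080");
        return host + ":" + port;
    }

    public static void main(String[] args) {
        System.out.println("SOCKS proxy: " + socksSettings());
        System.out.println("Program arguments: " + String.join(" ", args));
    }
}
```

Once compiled, running it as `java -DsocksProxyHost=proxy.example ProxyCheck` (with `proxy.example` standing in for a real host) reports the proxy, while `java ProxyCheck -DsocksProxyHost=proxy.example` reports none and shows the option among the program arguments instead.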