http://www.satzbau-gmbh.de/staff/abel/httrack-py/
The httrack-py library allows to write HTTrack callbacks in Python, or to use the httrack engine in a Python program.
It allows to use nearly all features of the httrack callbacks with a Python module.
as an httrack plugin:
gcc -g -I <path-to-Python-header-files> \ -I <path-to-httrack-source-dir> \ -I <httrack-source-dir>/src \ -l python<version-number> -O -g3 -Wall -D_REENTRANT -DINET6 \ -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -DPLUGIN \ -shared -pthread -o httrack-py.so httrack-py.c
Next, copy the file httrack-py.so to /usr/local/lib or to some other place in the library search path.
as a Python extension:
gcc -g -I <path-to-Python-header-files> \ -I <path-to-httrack-source-dir> \ -I <httrack-source-dir>/src \ -l python<version-number> -O -g3 -Wall -D_REENTRANT -DINET6 \ -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE \ -shared -pthread -o httracklib.so httrack-py.c
Next, copy the file httracklib.so to
start, end, pause, query2, query3, change_options, check_html, preprocess_html, postprocess_html, loop, check_link, save_file, link_detected, link_detected2 (*), save_name, send_header, receive_header.
(*) for httrack version 3.33-beta4 and newer
--wrapper start=httrack-py.so:hts_py_start
In the start callback, the httrack-py plugin calls register() to get an instance of a class, whose methods are called in the other callbacks.
If methods with the names mentioned above exist, the start callback registers the corresponding callbacks in the httrack core, and the methods will be called by the callback functions of the httrack-py library.
See the example file httrack.py for details on these methods.
NOTE: The automatic registration of other callbacks by the start callback works only in httrack 3.33-beta3 or newer! For older httrack versions, you'll need to specify each callback as an httrack command line parameter.
The extension module defines only one function, httrack, which invokes the httrack engine. The function needs two parameters; the first is an instance of a callback class; the second parameter must be a sequence. The sequence should contain strings, which are valid httrack command line arguments. The first element is interpreted as the program name, which must always specified, i.e., passing a sequence of length 0 raises an execption.
A trivial example:
import sys, httracklib class CbClass: # we don't want to do anything fancy, but we simply need a # class definition pass callback = CbClass() # simply pass the command line parameters to the httrack engine httracklib.httrack(callback, sys.argv)
This example should work exactly like the "regular" httrack command line program.
Since the httrack plugin has no main Python program, exceptions raised by a method of a callback class instance must be caught within the plugin.
The C extension class providing the Python function httrack has the same behaviour for exceptions raised in callbacks. (The httrack engine is IMO too complex to try to forward a Python exception from a callback up to the caller of the module function httrack.)
While the recommended way to handle execptions is to use try/except/finally statements within a callback method, it can sometimes be difficult to predict all possible exceptions. Moreover, exceptions can also occur in httrack-py itself, e.g. for memory allocation errors, or if a callback method defines the wrong number of arguments.
From version 0.6 on, the "exception behaviour" of httrack-py is configurable.
The main problem with exceptions in Python callback methods is the fact, that not every httrack callback allows to abort the mirror. An example: While I assume that Xavier Roche intended the save-name callback for relatively simple things like directory layout manipulation or for dealing with the 8.3 filename limit under MSDOS, i.e., for code, where errors are highly unlikely (at least after debugging...), Python callbacks make it easy to "abuse" this callback. For example, one can record the relation between original URLs and file paths in an SQL database. If this relation is essential for the usage of the mirrored data, and if the access to the database raises an exception, it does not make sense to continue the mirroring process. In another application, an execption in this callback may be completely irrelevant.
Only the callbacks start, end, change-options, loop, send-header and receive-header can abort the mirror by returning 0; all other callbacks have no way to do this directly. httrack-py offers two ways to deal with this problem:
Alternatively, an exception can be completely ignored (IGNORE_EXCEPTION).
Both the plugin version of httrack-py and the C extension module provide definitions for the constants IMMEDIATE_STOP, REGULAR_STOP, IGNORE_EXCEPTION.
The default behaviour is REGULAR_STOP. It can be modified as follows:
error_handler
def error_handler(self, cbname, exctype, value, traceback):
this method is called, when httrack-py detects an exception in any of the callbacks. cbname is the name of the method raising the exception; exctype and value provide Python's exception type and exception value as described for example in chapter 8 of the Python Tutorial; traceback is the traceback object.
The method must return one of the values IMMEDIATE_STOP, REGULAR_STOP and IGNORE_EXCEPTION.
Python's error information is not automatically printed to stderr, except for uncaught exceptions raised in the error handler itself.
If an exception is raised in the error_handler, httrack-py uses IMMEDIATE_STOP to abort the mirror.
error_policy
. This attribute must be a mapping. Its keys
should be the names of the callbacks, or '__default__'; the values
must be IMMEDIATE_STOP, REGULAR_STOP or IGNORE_EXCEPTION. If an exception is raised, and if the mapping has an entry with the name of callback that raised the exception, the value of the entry define the error behaviour. Otherwise, the __default__ value is used, if defined.
__default__
, httrack-py tries to read the environment
variable HTTRACK_PY_ERROR_POLICY. If defined, this variable must have
the value -1 (IMMEDIATE_STOP), 0 (REGULAR_STOP) or 1(IGNORE_EXCEPTION).If any of the ways to define the error behaviour result in another value than IMMEDIATE_STOP, REGULAR_STOP or IGNORE_EXCEPTION, the value IMMEDIATE_STOP is assumed.
In the cases 2, 3 and 4, Python's error information is printed to stderr.
Both the plugin version of httrack-py and the C extension module provide definitions for the constants IMMEDIATE_STOP, REGULAR_STOP, IGNORE_EXCEPTION. For the plugin version, integer variables are inserted into the namespace of the Python module during plugin initialization, so you can use them for example in the constructor of the callback class:
def cbClass: def __init__(self): self.error_policy = {'__default__': IGNORE_EXCEPTION}
NOTE: Since the plugin library can create the variables only after the Python module has been loaded, you can't use them for code that is executed at module initialization time. Hence, the following example will lead to an error:
def cbClass: error_policy = {'__default__': IGNORE_EXCEPTION}
The C extension module provides the variables in its namespace:
import httracklib def cbClass: error_policy = {'__default__': httracklib.IGNORE_EXCEPTION}
Most exceptions raised in httrack-py itself are regular Python exceptions. Additionally, httrack-py can raise execptions of the specific type 'httrack-error':
If an error occurs during plugin initialization, the mirror process is aborted in the start callback, except if all of the following conditions are met:
1) the plugin is the default httrack plugin and if it is loaded as the default plugin, i.e., without specifying a --wrapper option
2) the environment variable HTTRACK_PYTHON is not set
3) an ImportError is raised.
This can mean that the user simply does not need a Python callback module for this mirror, hence the plugin simply prints the Python error message, but continues the mirror.
For all other exceptions (e.g, a syntax error in the Python source code), or if HTTRACK_PYTHON is set, or if a --wrapper option is specified, the mirror is aborted.
httrack-py is released under the GPL
Only the source code is available.
httrack-py-0.6.tar.gz (recent version)
For questions, bug reports etc., send me a mail