# WARCbench 🛠️
A tool for exploring, analyzing, transforming, recombining, and extracting data from WARC (Web ARChive) files.
<a href="https://tools.perma.cc"><img src="https://github.com/harvard-lil/tools.perma.cc/blob/main/perma-tools.png?raw=1" alt="Perma Tools" width="150"></a>
---
## Contents
- [Quickstart](#quickstart)
- [About](#about)
- [Command line usage](#command-line-usage)
- [Python usage](#python-usage)
- [Configuration](#configuration)
- [Development setup](#development-setup)
---
## Quickstart
To install WARCbench, use pip:
```sh
# From PyPI (recommended):
pip install warcbench
# Or directly from GitHub using HTTPS...
pip install git+https://github.com/harvard-lil/warcbench.git
# ...or SSH:
pip install git+ssh://git@github.com/harvard-lil/warcbench.git
```
Once WARCbench is installed, you may run it on the command line...
```sh
wb summarize example.com.warc
```
...or import it in your Python project:
```python
from warcbench import WARCParser

with open('example.com.warc', 'rb') as warc_file:
    parser = WARCParser(warc_file)
    parser.parse()
```
[⇧ Back to top](#contents)
---
## About
WARCbench has been designed as a resilient, efficient, and highly configurable tool for working with WARC files in all their variety. Among our motivations for the project:
- Enable users to explore a WARC without prior knowledge of the format
- Support inspection of malformed or misbehaving WARCs
- Everything is configurable: plenty of hooks and custom callbacks
- Flexibility to optimize for memory, speed, or convenience as needed
- As little magic as possible: e.g., don't decode bytes into strings or deserialize headers until you need to
Many other useful open-source WARC packages can be found online. Among the inspirations for WARCbench are:
- [Warchaeology](https://github.com/nlnwa/warchaeology)
- [WARCAT](https://github.com/chfoo/warcat)
- [WARCIO](https://github.com/webrecorder/warcio)
- [Warctools](https://github.com/internetarchive/warctools)
- [warc](https://github.com/internetarchive/warc)
WARCbench is a project of the [Harvard Library Innovation Lab](https://lil.law.harvard.edu).
[⇧ Back to top](#contents)
---
## Command line usage
After installing WARCbench, you may use `wb` to interact with WARC files on the command line:
```console
user@host~$ wb inspect example.com.warc
Record bytes 0-280
WARC/1.1
WARC-Filename: archive.warc
WARC-Date: 2024-11-04T19:10:55.900Z
WARC-Type: warcinfo
...
```
All commands support `.warc`, `.warc.gz`, and `.wacz` file formats.
To view a complete summary of WARCbench commands and options, invoke the `--help` flag:
```console
user@host~$ wb --help

Usage: wb [OPTIONS] COMMAND [ARGS]...

  WARCbench command framework

Options:
  -o, --out [raw|json]            Format subcommand output as a human-readable
                                  report (raw) or as JSON.
  -v, --verbose                   Logging verbosity; repeatable.
  -d, --decompression [python|system]
                                  Use native Python or system tools for
                                  extracting archives. [default: python]
  --gunzip / --no-gunzip          Gunzip the input archive before parsing, if
                                  it is gzipped. [default: no-gunzip]
  -V, --version                   Show the version and exit.
  -h, --help                      Show this message and exit.

Commands:
  compare-headers     Compare the record headers of two archives.
  compare-parsers     Compare all available parsing strategies.
  extract             Extract files of MIMETYPE to disk.
  filter-records      Filter records; optionally extract to a new archive.
  inspect             Get detailed record metadata.
  match-record-pairs  Match requests/responses into pairs.
  summarize           Summarize the contents of an archive.
...
```
Each subcommand has its own, more detailed `--help` text. For example, `filter-records`:
```console
user@host~$ wb filter-records --help

Usage: wb filter-records [OPTIONS] FILEPATH

  Applies the specified filters (if any) to the archive's records. If no
  filters are specified, all WARC records are considered to match.

  By default, outputs the number of matching records. Use the `--output-*`
  options to include more detailed information about matching records, or
  `--no-output-count` to suppress the count.

  Can also extract the matching records to a new WARC file (`--extract-to-
  warc`, `--extract-to-gzipped-warc`). To ensure the new WARC includes a
  `WARC-Type: warcinfo` record (if present in the original), even if it would
  otherwise be filtered out by any applied filters, run with `--force-include-
  warcinfo`.

  If extracting records to a new WARC file, by default, no other output is
  produced. To produce a summary report as well, run with `--extract-summary-
  to`.

  To apply your own, custom filters, use `--custom-filter-path` to specify the
  path to a python file where the custom filter functions are listed, in
  desired order of application, in `__all__`. See `tests/assets/custom-
  filters.py` for an example. See the "Filters" section of the README for more
  information on constructing filters.

  This command also supports custom record handlers, which can be used to do
  arbitrary work on records that pass through the supplied filters. For
  example, you could use record handlers to construct a custom report, or
  export records one-at-a-time to an upstream service. Use `--custom-record-
  handler-path` to specify the path to a python file where the custom handler
  functions are listed, in desired order of application, in `__all__`. See
  `tests/assets/custom-handlers.py` for an example. See the "Handlers" section
  of the README for more information on constructing handlers.

  ---

  Example:

      $ wb filter-records --filter-by-warc-named-field Type response tests/assets/example.com.warc
      Found 6 records.

Options:
  --filter-by-http-header TEXT...
                                  Find records with WARC-Type: {request,
                                  response} and look for the supplied HTTP
                                  header name and value.
  --filter-by-http-response-content-type TEXT
                                  Find records with WARC-Type: response, and
                                  then filters by Content-Type.
  --filter-by-http-status-code INTEGER
                                  Find records with WARC-Type: response, and
                                  then filters by HTTP status code.
  --filter-by-http-verb TEXT      Find records with WARC-Type: request, and
                                  then filter by HTTP verb.
...
```
See `tests/assets` for sample outputs.
[⇧ Back to top](#contents)
---
## Python usage
### Parsing a WARC file
The `WARCParser` class is typically the best way to start interacting with a WARC file in Python:
```python
from warcbench import WARCParser

# Instantiate a parser, passing in a file handle along with any other config
with open('example.com.warc', 'rb') as warc_file:
    parser = WARCParser(warc_file)

    # Iterate lazily over each record in the WARC...
    for record in parser.iterator():
        print(record.bytes)

    # ...or parse the entire file and produce a list of all records
    parser.parse(cache_records=True)
    print(len(parser.records))
    print(parser.records[3].header.bytes)
```
### Parsing a Gzipped WARC file
You can parse and interact with a gzipped WARC file without gunzipping it using the `WARCGZParser` class.
This is not only for convenience, but for utility: by convention, WARCs are frequently gzipped [one record at a time](http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#record-at-time-compression), such that a complete `warc.gz` file is in fact a series of concatenated, individually valid gzip files, or "members". This makes it possible to extract individual WARC records, if the byte offsets of their members are known in advance, without needing to gunzip the entire file, which in certain applications can be a significant performance improvement.
```python
from warcbench import WARCGZParser

# Instantiate a parser, passing in a file handle along with any other config
with open('example.com.warc.gz', 'rb') as warcgz_file:
    parser = WARCGZParser(warcgz_file)

    # Iterate lazily over each record in the WARC...
    for record in parser.iterator(yield_type="records"):
        print(record.start, record.length)

    # ...or over each gzipped member...
    for member in parser.iterator(yield_type="members"):
        print(member.start, member.length, member.record.bytes)

    # ...or parse the entire file and produce a list of all members and records
    parser.parse(cache_members=True)
    print(len(parser.members))
    print(len(parser.records))
    print(parser.records[3].header.bytes)
```
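To make the member layout described above concrete, here is a minimal sketch that uses only Python's standard `gzip` module rather than WARCbench itself. The three byte strings are stand-ins for real WARC records, and the offset bookkeeping is purely illustrative; it does not claim to mirror how `WARCGZParser` works internally.

```python
import gzip

# Stand-ins for WARC records (real records have WARC headers and content blocks).
records = [b"record one\r\n\r\n", b"record two\r\n\r\n", b"record three\r\n\r\n"]

# Compress each record as its own gzip member, noting each member's (start, length).
members = [gzip.compress(record) for record in records]
offsets = []
position = 0
for member in members:
    offsets.append((position, len(member)))
    position += len(member)

# A "warc.gz"-style archive is simply the members concatenated together.
archive = b"".join(members)

# Decompressing the whole stream yields every record in order...
everything = gzip.decompress(archive)

# ...but with a known offset, one member can be decompressed on its own.
start, length = offsets[1]
second_record = gzip.decompress(archive[start:start + length])
print(second_record)
```

Because each member is an independently valid gzip file, the slice for the second member decompresses without reading the first or third.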
### Utility functions
For other use cases, such as extracting and working with WARCs in a WACZ file, you may wish to use WARCbench's utility functions:
```python
from warcbench import WARCGZParser, WARCParser
from warcbench.utils import python_open_archive, system_open_archive

# Slower: uses Python zip/gzip to decompress
with python_open_archive('example.com.wacz') as warcgz_file:
    parser = WARCGZParser(warcgz_file)

with python_open_archive('example.com.wacz', gunzip=True) as warc_file:
    parser = WARCParser(warc_file)

# Faster: uses system zip/gzip to decompress where possible
with system_open_archive('example.com.wacz') as warcgz_file:
    parser = WARCGZParser(warcgz_file)

with system_open_archive('example.com.wacz', gunzip=True) as warc_file:
    parser = WARCParser(warc_file)
```
### Filters, handlers, and callbacks
WARCbench includes several additional mechanisms for wrangling WARC records: filters, handlers, and callbacks.
#### Filters
**Record Filters** are functions that include or exclude a WARC record based on a given condition. You can pass in any function that accepts a `warcbench.models.Record` as its sole argument and returns a Boolean value. (A number of built-in filters are included in the `warcbench.filters` module.) Example:
```python
from warcbench import WARCGZParser
from warcbench.config import WARCGZProcessorConfig
from warcbench.filters import warc_named_field_filter
from warcbench.utils import system_open_archive

with system_open_archive('example.com.wacz') as warcgz_file:
    parser = WARCGZParser(
        warcgz_file,
        processors=WARCGZProcessorConfig(
            record_filters=[
                warc_named_field_filter('type', 'request'),
            ]
        )
    )
```
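A custom record filter is just a one-argument function that returns a Boolean. The sketch below is one possible shape for such a function. It relies only on attributes that appear elsewhere in this README (`record.length` and `record.header.bytes`); the `fake_record` helper and the 4 KiB threshold are hypothetical and exist only so the example runs without a WARC file.

```python
from types import SimpleNamespace


def small_response_filter(record):
    """Return True for records under 4 KiB whose header mentions a response."""
    header = getattr(record, "header", None)
    if header is None:  # assumption: the header may be absent if records are not split
        return False
    return record.length < 4096 and b"WARC-Type: response" in header.bytes


def fake_record(length, warc_type):
    """Hypothetical stand-in for a parsed record; real records come from the parser."""
    header = SimpleNamespace(bytes=b"WARC/1.1\r\nWARC-Type: " + warc_type + b"\r\n")
    return SimpleNamespace(length=length, header=header)


small_response = fake_record(512, b"response")
large_response = fake_record(10_000, b"response")
small_request = fake_record(512, b"request")

candidates = (small_response, large_response, small_request)
kept = [record for record in candidates if small_response_filter(record)]
print(len(kept))
```

A function like this could be placed in the `record_filters` list in the same way as the built-in filter above.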
**Member Filters** (only supported when using `WARCGZParser`) behave just like record filters, except they work with `warcbench.models.GzippedMember` objects instead of `Record`s. Example:
```python
from warcbench import WARCGZParser
from warcbench.config import WARCGZProcessorConfig
from warcbench.utils import system_open_archive

with system_open_archive('example.com.wacz') as warcgz_file:
    parser = WARCGZParser(
        warcgz_file,
        processors=WARCGZProcessorConfig(
            member_filters=[
                # only yield malformed members
                lambda member: bool(member.uncompressed_non_warc_data),
            ]
        )
    )
```
#### Handlers
**Record handlers** are functions that process a record once it is parsed. For example, you could use a record handler to print each record's content in bytes for debugging purposes, or write each record to disk as a separate file. As with filters, you may pass in an arbitrary handler function that accepts a `warcbench.models.Record` as its sole argument; a handler's return value is ignored. Example:
```python
from warcbench import WARCParser
from warcbench.config import WARCProcessorConfig
from warcbench.record_handlers import get_record_offsets
from warcbench.utils import system_open_archive

with system_open_archive('example.com.warc') as warc_file:
    parser = WARCParser(
        warc_file,
        processors=WARCProcessorConfig(
            record_handlers=[
                get_record_offsets(),
            ]
        )
    )
```
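Since a handler's return value is ignored, a custom handler usually records its results somewhere else, for example in a variable captured by a closure. The following is a minimal sketch of that pattern. The `make_size_tally` name is hypothetical, and the `SimpleNamespace` objects are stand-ins that mimic only the `length` attribute shown earlier in this README.

```python
from types import SimpleNamespace


def make_size_tally():
    """Return (handler, totals); the handler adds each record's length to totals."""
    totals = {"records": 0, "bytes": 0}

    def handler(record):
        totals["records"] += 1
        totals["bytes"] += record.length
        # No return value: handlers are called for their side effects.

    return handler, totals


handler, totals = make_size_tally()

# Stand-in records for demonstration; a parser would call the handler itself.
for fake in (SimpleNamespace(length=280), SimpleNamespace(length=1024)):
    handler(fake)

print(totals)
```

The returned `handler` could be listed in `record_handlers`, and `totals` inspected once parsing is finished.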
To support inspection of WARC files that contain invalid records, WARCbench also includes a way to specify handlers for unparsable lines. **Unparsable line handlers** behave just like record handlers, except that they accept `warcbench.models.UnparsableLine` objects instead of `Record`s. You could use these handlers to print information about unparsable lines, or even repair them. Example:
```python
from warcbench import WARCParser
from warcbench.config import WARCProcessorConfig
from warcbench.utils import system_open_archive

# gunzip=True so that WARCParser receives an uncompressed WARC
with system_open_archive('example.com.wacz', gunzip=True) as warc_file:
    parser = WARCParser(
        warc_file,
        processors=WARCProcessorConfig(
            unparsable_line_handlers=[
                lambda line: print(line),
            ]
        )
    )
```
#### Callbacks
**Callbacks** are functions that run after the WARCbench parser finishes parsing a WARC file. A callback can be any function that accepts a `warcbench.WARCParser` or `warcbench.WARCGZParser` object as its sole argument. You could use a callback to print the number of records parsed, write the records out to disk, pass the full set of records over to another function, and so on.
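As an illustration, a callback can be written as a plain function that receives the parser and reads from it. The sketch below assumes that records were cached during parsing, so that `parser.records` is populated as in the examples above. The `make_count_callback` helper and the stand-in parser object are hypothetical, and how a callback is registered with a real parser is not shown here; see `config.py`.

```python
from types import SimpleNamespace


def make_count_callback(results):
    """Return a callback that appends the number of parsed records to results."""

    def callback(parser):
        # Assumes records were cached during parsing, so parser.records is a list.
        results.append(len(parser.records))

    return callback


results = []
count_records = make_count_callback(results)

# Stand-in for a finished parser; a real parser would invoke the callback itself.
fake_parser = SimpleNamespace(records=[b"record-1", b"record-2", b"record-3"])
count_records(fake_parser)

print(results)
```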
#### Combining filters, handlers, and callbacks
Filters, handlers, and callbacks are additive: each one you supply is applied in addition to the others. To express more complex logic, you can also compose several of them into a single function. Example:
```python
from warcbench import WARCGZParser
from warcbench.config import WARCGZProcessorConfig
# Assumption: these filters live in warcbench.filters alongside warc_named_field_filter
from warcbench.filters import (
    http_status_filter,
    http_verb_filter,
    record_content_length_filter,
    warc_named_field_filter,
)
from warcbench.utils import system_open_archive

def combo_filter(record):
    is_warc_info = lambda r: warc_named_field_filter('type', 'warcinfo')(r)

    targets_example_page = lambda r: warc_named_field_filter(
        'target-uri',
        'http://example.com/',
        exact_match=True
    )(r)

    # Keep warcinfo records, plus successful GETs of the example page
    return is_warc_info(record) or (
        targets_example_page(record) and
        http_verb_filter('get')(record) and
        http_status_filter(200)(record)
    )

with system_open_archive('example.com.wacz') as warcgz_file:
    parser = WARCGZParser(
        warcgz_file,
        processors=WARCGZProcessorConfig(
            record_filters=[
                combo_filter,
                record_content_length_filter('2056', 'le'),
            ]
        )
    )
```
### Configuration
WARCbench supports a number of configuration options:
- You can parse a WARC file by reading the WARC record headers' `Content-Length` fields (faster), or by scanning and splitting on the delimiter expected between WARC records (slower; may rarely detect false positives; more robust against mangled or broken WARCs).
- You can parse a gzipped WARC file by reading and parsing the file member by member (much slower; simpler), or by gunzipping the entire file while making note of member/record boundaries, and then further processing the bytes of the decompressed records (much faster; may use more disk space).
- You can choose whether or not to attempt to split WARC records into headers and content blocks.
- You can choose whether to cache record properties (such as the bytes of headers or content blocks) during parsing, or to consume those bytes lazily on access, or both. These features are independent and can be used together.
See `config.py` for details.
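The trade-off between the two WARC parsing strategies in the first bullet can be illustrated with a small standard-library sketch. This is not WARCbench's implementation: the record layout is simplified, and the delimiter pattern (a blank line followed by `WARC/`) is an assumption chosen for illustration. The second payload deliberately contains text that looks like a record boundary.

```python
import re

CRLF2 = b"\r\n\r\n"


def build_record(payload):
    """Build a simplified WARC-like record: header, blank line, payload, CRLFCRLF."""
    header = (
        b"WARC/1.1\r\n"
        b"WARC-Type: resource\r\n"
        b"Content-Length: " + str(len(payload)).encode("ascii") + CRLF2
    )
    return header + payload + CRLF2


def split_by_content_length(data):
    """Trust each header's Content-Length to find where the record ends."""
    records, pos = [], 0
    while pos < len(data):
        header_end = data.index(CRLF2, pos) + len(CRLF2)
        match = re.search(rb"Content-Length: (\d+)", data[pos:header_end])
        end = header_end + int(match.group(1)) + len(CRLF2)
        records.append(data[pos:end])
        pos = end
    return records


def split_by_delimiter(data):
    """Assume a new record starts wherever CRLFCRLF is followed by 'WARC/'."""
    starts = [0] + [m.end() for m in re.finditer(rb"\r\n\r\n(?=WARC/)", data)]
    ends = starts[1:] + [len(data)]
    return [data[s:e] for s, e in zip(starts, ends)]


first = build_record(b"plain text")
second = build_record(b"tricky\r\n\r\nWARC/ inside the body")
data = first + second

by_length = split_by_content_length(data)
by_delimiter = split_by_delimiter(data)
print(len(by_length), len(by_delimiter))
```

Here the `Content-Length` approach recovers both records exactly, while the delimiter scan is fooled by the payload and reports an extra boundary. If a `Content-Length` header were wrong or missing, the situation would reverse: the first function would fail, and the delimiter scan could still recover the records.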
[⇧ Back to top](#contents)
---
## Development setup
We use [uv](https://docs.astral.sh/uv/) for package dependency management, [Ruff](https://docs.astral.sh/ruff/) for code linting/formatting, and [pytest](https://docs.pytest.org/en/stable/) for testing.
To set up a local development environment, follow these steps:
- [Install uv](https://docs.astral.sh/uv/getting-started/installation/) if it is not already installed
- Clone this repository
- From the project root, run `uv sync` to set up a virtual environment and install dependencies
### Linting/formatting
Run the linting process like so:
```sh
uv run ruff check
```
Run the formatting process like so:
```sh
# Check formatting changes before applying
uv run ruff format --check
# Apply formatting changes
uv run ruff format
```
### Tests
Run tests like so:
```sh
uv run pytest
```
### Coverage
Run tests with coverage reporting:
```sh
# Terminal coverage report
uv run pytest --cov=src/warcbench
# HTML coverage report (written to htmlcov/)
uv run pytest --cov=src/warcbench --cov-report=html
```
The HTML report will be generated in the `htmlcov/` directory. Open `htmlcov/index.html` in your browser to view the detailed coverage report.
### Type checking
Run type checking with mypy:
```sh
uv run mypy
```
The mypy configuration is defined in `mypy.ini`.
[⇧ Back to top](#contents)