warcio

Name	warcio JSON
Version	1.7.5 JSON
	download
home_page	https://github.com/webrecorder/warcio
Summary	Streaming WARC (and ARC) IO library
upload_time	2024-12-10 21:31:48
maintainer	None
docs_url	None
author	Ilya Kreymer
requires_python	None
license	Apache 2.0
keywords
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage

            WARCIO: WARC (and ARC) Streaming Library
========================================
.. image:: https://travis-ci.org/webrecorder/warcio.svg?branch=master
      :target: https://travis-ci.org/webrecorder/warcio
.. image:: https://codecov.io/gh/webrecorder/warcio/branch/master/graph/badge.svg
      :target: https://codecov.io/gh/webrecorder/warcio


Background
----------

This library provides a fast, standalone way to read and write `WARC
Format <https://en.wikipedia.org/wiki/Web_ARChive>`__ commonly used in
web archives. Python 3.7+ (minimally only needing
`six <https://pythonhosted.org/six/>`__ as an external dependency)

warcio supports reading and writing of WARC files compliant with both the `WARC 1.0 <http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf>`__
and `WARC 1.1 <http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1-1_latestdraft.pdf>`__ ISO standards.

Install with: ``pip install warcio`` (or ``pip install warcio[all]`` to get optional features)

This library is a spin-off of the WARC reading and writing component of
the `pywb <https://github.com/webrecorder/pywb>`__ high-fidelity replay
library, a key component of
`Webrecorder <https://github.com/webrecorder/webrecorder>`__

The library is designed for fast, low-level access to web archival
content, oriented around a stream of WARC records rather than files.

Reading WARC Records
--------------------

A key feature of the library is to be able to iterate over a stream of
WARC records using the ``ArchiveIterator``.

It includes the following features:

- Reading a WARC 1.0, WARC 1.1 or ARC stream
- On the fly ARC to WARC record conversion
- Decompressing and de-chunking HTTP payload content stored in WARC/ARC files.

For example, the following prints the the url for each WARC ``response``
record:

.. code:: python

    from warcio.archiveiterator import ArchiveIterator

    with open('path/to/file', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                print(record.rec_headers.get_header('WARC-Target-URI'))

The stream object could be a file on disk or a remote network stream.
The ``ArchiveIterator`` reads the WARC content in a single pass. The
``record`` is represented by an ``ArcWarcRecord`` object which contains
the format (ARC or WARC), record type, the record headers, http headers
(if any), and raw stream for reading the payload.

.. code:: python

    class ArcWarcRecord(object):
        def __init__(self, *args):
            (self.format, self.rec_type, self.rec_headers, self.raw_stream,
             self.http_headers, self.content_type, self.length) = args

Reading WARC Content
~~~~~~~~~~~~~~~~~~~~

The ``raw_stream`` can be used to read the rest of the payload directly.
A special ``ArcWarcRecord.content_stream()`` function provides a stream that
automatically decompresses and de-chunks the HTTP payload, if it is
compressed and/or transfer-encoding chunked.

ARC Files
~~~~~~~~~

The library provides support for reading (but not writing ARC) files.
The ARC format is legacy but is important to support in a consistent
matter. The ``ArchiveIterator`` can equally iterate over ARC and WARC
files to emit ``ArcWarcRecord`` objects. The special ``arc2warc`` option
converts ARC records to WARCs on the fly, allowing for them to be
accessed using the same API.

(Special ``WARCIterator`` and ``ARCIterator`` subclasses of ``ArchiveIterator``
are also available to read only WARC or only ARC files).

WARC and ARC Streaming
~~~~~~~~~~~~~~~~~~~~~~
For example, here is a snippet for reading an ARC and a WARC using the
same API.

The example streams a WARC and ARC file over HTTP using
`requests <http://docs.python-requests.org/en/master/>`__, printing the
``warcinfo`` record (or ARC header) and any response records (or all ARC
records) that contain HTML:

.. code:: python

    import requests
    from warcio.archiveiterator import ArchiveIterator

    def print_records(url):
        resp = requests.get(url, stream=True)

        for record in ArchiveIterator(resp.raw, arc2warc=True):
            if record.rec_type == 'warcinfo':
                print(record.raw_stream.read())

            elif record.rec_type == 'response':
                if record.http_headers.get_header('Content-Type') == 'text/html':
                    print(record.rec_headers.get_header('WARC-Target-URI'))
                    print(record.content_stream().read())
                    print('')

    # WARC
    print_records('https://archive.org/download/ExampleArcAndWarcFiles/IAH-20080430204825-00000-blackbook.warc.gz')


    # ARC with arc2warc
    print_records('https://archive.org/download/ExampleArcAndWarcFiles/IAH-20080430204825-00000-blackbook.arc.gz')


Writing WARC Records
--------------------

Starting with 1.6, warcio introduces a way to capture HTTP/S traffic directly
to a WARC file, by monkey-patching Python's ``http.client`` library.

This approach works well with the popular ``requests`` library often used to fetch
HTTP/S content. Note that ``requests`` must be imported after the ``capture_http`` module.

Quick Start to Writing a WARC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fetching the url ``https://example.com/`` while capturing the response and request
into a gzip compressed WARC file named ``example.warc.gz`` can be done with the following four lines:

.. code:: python

    from warcio.capture_http import capture_http
    import requests  # requests must be imported after capture_http

    with capture_http('example.warc.gz'):
        requests.get('https://example.com/')


The WARC ``example.warc.gz`` will contain two records (the response is written first, then the request).

To write to a default in-memory buffer (``BufferWARCWriter``), don't specify a filename, using ``with capture_http() as writer:``.

Additional requests in the ``capture_http`` context and will be appended to the WARC as expected.

The ``WARC-IP-Address`` header will also be added for each record if the IP address is available.

The following example (similar to a `unit test from the test suite <test/test_capture_http.py>`__) demonstrates the resulting records created with ``capture_http``:

.. code:: python

    with capture_http() as writer:
        requests.get('http://example.com/')
        requests.get('https://google.com/')

    expected = [('http://example.com/', 'response', True),
                ('http://example.com/', 'request', True),
                ('https://google.com/', 'response', True),
                ('https://google.com/', 'request', True),
                ('https://www.google.com/', 'response', True),
                ('https://www.google.com/', 'request', True)
               ]

     actual = [
                (record.rec_headers['WARC-Target-URI'],
                 record.rec_type,
                 'WARC-IP-Address' in record.rec_headers)

                for record in ArchiveIterator(writer.get_stream())
              ]

     assert actual == expected
        

Customizing WARC Writing
~~~~~~~~~~~~~~~~~~~~~~~~

The library provides a simple and extensible interface for writing
standards-compliant WARC files.

The library comes with a basic ``WARCWriter`` class for writing to a
single WARC file and ``BufferWARCWriter`` for writing to an in-memory
buffer. The ``BaseWARCWriter`` can be extended to support more complex
operations.

(There is no support for writing legacy ARC files)

For more flexibility, such as to use a custom ``WARCWriter`` class,
the above example can be written as:

.. code:: python

    from warcio.capture_http import capture_http
    from warcio import WARCWriter
    import requests  # requests *must* be imported after capture_http

    with open('example.warc.gz', 'wb') as fh:
        warc_writer = WARCWriter(fh)
        with capture_http(warc_writer):
            requests.get('https://example.com/')
            
WARC/1.1 Support
~~~~~~~~~~~~~~~~

By default, warcio creates WARC 1.0 records for maximum compatibility with existing tools.
To create WARC/1.1 records, simply specify the warc version as follows:

.. code:: python
   
    with capture_http('example.warc.gz', warc_version='1.1'):
        ...


.. code:: python

    WARCWriter(fh, warc_version='1.1)
    ...
    
When using WARC 1.1, the main difference is that the ``WARC-Date`` timestamp header
will be written with microsecond precision, while WARC 1.0 only supports second precision.

WARC 1.0:

.. code::
 
    WARC/1.0
    ...
    WARC-Date: 2018-12-26T10:11:12Z

WARC 1.1:

.. code::

    WARC/1.1
    ...
    WARC-Date: 2018-12-26T10:11:12.456789Z
    
    

Filtering HTTP Capture
~~~~~~~~~~~~~~~~~~~~~~

When capturing via HTTP, it is possible to provide a custom filter function, 
which can be used to determine if a particular request and response records
should be written to the WARC file or skipped.

The filter function is called with the request and response record
before they are written, and can be used to substitute a different record (for example, a revisit
instead of a response), or to skip writing altogether by returning nothing, as shown below:

.. code:: python

    def filter_records(request, response, request_recorder):
        # return None, None to indicate records should be skipped
        if response.http_headers.get_statuscode() != '200':
            return None, None
            
        # the response record can be replaced with a revisit record
        elif check_for_dedup():
            response = create_revisit_record(...)
            
        return request, response

    with capture_http('example.warc.gz', filter_records):
         requests.get('https://example.com/')
         
Please refer to
`test/test\_capture_http.py <test/test_capture_http.py>`__ for additional examples
of capturing ``requests`` traffic to WARC.

Manual/Advanced WARC Writing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before 1.6, this was the primary method for fetching a url and then
writing to a WARC. This process is a bit more verbose,
but provides for full control of WARC creation and avoid monkey-patching.

The following example loads ``http://example.com/``, creates a WARC
response record, and writes it, gzip compressed, to ``example.warc.gz``
The block and payload digests are computed automatically.

.. code:: python

    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    import requests

    with open('example.warc.gz', 'wb') as output:
        writer = WARCWriter(output, gzip=True)

        resp = requests.get('http://example.com/',
                            headers={'Accept-Encoding': 'identity'},
                            stream=True)

        # get raw headers from urllib3
        headers_list = resp.raw.headers.items()

        http_headers = StatusAndHeaders('200 OK', headers_list, protocol='HTTP/1.0')

        record = writer.create_warc_record('http://example.com/', 'response',
                                            payload=resp.raw,
                                            http_headers=http_headers)

        writer.write_record(record)


The library also includes additional semantics for:
 - Creating ``warcinfo`` and ``revisit`` records
 - Writing ``response`` and ``request`` records together
 - Writing custom WARC records
 - Reading a full WARC record from a stream

Please refer to `warcwriter.py <warcio/warcwriter.py>`__ and
`test/test\_writer.py <test/test_writer.py>`__ for additional examples.

WARCIO CLI: Indexing and Recompression
--------------------------------------

The library currently ships with a few simple command line tools.

Index
~~~~~

The ``warcio index`` cmd will print a simple index of the records in the
warc file as newline delimited JSON lines (NDJSON).

WARC header fields to include in the index can be specified via the
``-f`` flag, and are included in the JSON block (in order, for
convenience).

::

    warcio index ./test/data/example-iana.org-chunked.warc -f warc-type,warc-target-uri,content-length
    {"warc-type": "warcinfo", "content-length": "137"}
    {"warc-type": "response", "warc-target-uri": "http://www.iana.org/", "content-length": "7566"}
    {"warc-type": "request", "warc-target-uri": "http://www.iana.org/", "content-length": "76"}


HTTP header fields can be included by prefixing them with the prefix
``http:``. The special field ``offset`` refers to the record offset within
the warc file.

::

    warcio index ./test/data/example-iana.org-chunked.warc -f offset,content-type,http:content-type,warc-target-uri
    {"offset": "0", "content-type": "application/warc-fields"}
    {"offset": "405", "content-type": "application/http;msgtype=response", "http:content-type": "text/html; charset=UTF-8", "warc-target-uri": "http://www.iana.org/"}
    {"offset": "8379", "content-type": "application/http;msgtype=request", "warc-target-uri": "http://www.iana.org/"}

(Note: this library does not produce CDX or CDXJ format indexes often
associated with web archives. To create these indexes, please see the
`cdxj-indexer <https://github.com/webrecorder/cdxj-indexer>`__ tool which extends warcio indexing to provide this functionality)

Check
~~~~~

The ``warcio check`` command will check the payload and block digests
of WARC records, if possible. An exit value of 1 indicates a failure.
``warcio check -v`` will print verbose output for each record in the
WARC file.

Recompress
~~~~~~~~~~

The ``recompress`` command allows for re-compressing or normalizing WARC
(or ARC) files to a record-compressed, gzipped WARC file.

Each WARC record is compressed individually and concatenated. This is
the 'canonical' WARC storage format used by
`Webrecorder <https://github.com/webrecorder/webrecorder>`__ and other
web archiving institutions, and usually stored with a ``.warc.gz``
extension.

It can be used to: - Compress an uncompressed WARC - Convert any ARC
file to a compressed WARC - Fix an improperly compressed WARC file (eg.
a WARC compressed entirely instead of by record)

::

    warcio recompress ./input.arc.gz ./output.warc.gz


Extract
~~~~~~~

The  ``extract`` command provides a way to extract either the WARC and HTTP headers and/or payload of a WARC record
to stdout. Given a WARC filename and an offset, ``extract`` will print the (decompressed) record at that offset
in the file to stdout

Specifying --payload or --headers will output only the payload or only the WARC + HTTP headers (if any), respectively.

::

    warcio extract [--payload | --headers] filename offset


License
~~~~~~~

``warcio`` is licensed under the Apache 2.0 License and is part of the
Webrecorder project.

See `NOTICE <NOTICE>`__ and `LICENSE <LICENSE>`__ for details.

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/webrecorder/warcio",
    "name": "warcio",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": null,
    "author": "Ilya Kreymer",
    "author_email": "ikreymer@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/f2/2b/d825506924cb4508c90cd950dbda2a4dbfa9f5609e2ae76b53deaba656db/warcio-1.7.5.tar.gz",
    "platform": null,
    "description": "WARCIO: WARC (and ARC) Streaming Library\n========================================\n.. image:: https://travis-ci.org/webrecorder/warcio.svg?branch=master\n      :target: https://travis-ci.org/webrecorder/warcio\n.. image:: https://codecov.io/gh/webrecorder/warcio/branch/master/graph/badge.svg\n      :target: https://codecov.io/gh/webrecorder/warcio\n\n\nBackground\n----------\n\nThis library provides a fast, standalone way to read and write `WARC\nFormat <https://en.wikipedia.org/wiki/Web_ARChive>`__ commonly used in\nweb archives. Python 3.7+ (minimally only needing\n`six <https://pythonhosted.org/six/>`__ as an external dependency)\n\nwarcio supports reading and writing of WARC files compliant with both the `WARC 1.0 <http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf>`__\nand `WARC 1.1 <http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1-1_latestdraft.pdf>`__ ISO standards.\n\nInstall with: ``pip install warcio`` (or ``pip install warcio[all]`` to get optional features)\n\nThis library is a spin-off of the WARC reading and writing component of\nthe `pywb <https://github.com/webrecorder/pywb>`__ high-fidelity replay\nlibrary, a key component of\n`Webrecorder <https://github.com/webrecorder/webrecorder>`__\n\nThe library is designed for fast, low-level access to web archival\ncontent, oriented around a stream of WARC records rather than files.\n\nReading WARC Records\n--------------------\n\nA key feature of the library is to be able to iterate over a stream of\nWARC records using the ``ArchiveIterator``.\n\nIt includes the following features:\n\n- Reading a WARC 1.0, WARC 1.1 or ARC stream\n- On the fly ARC to WARC record conversion\n- Decompressing and de-chunking HTTP payload content stored in WARC/ARC files.\n\nFor example, the following prints the the url for each WARC ``response``\nrecord:\n\n.. code:: python\n\n    from warcio.archiveiterator import ArchiveIterator\n\n    with open('path/to/file', 'rb') as stream:\n        for record in ArchiveIterator(stream):\n            if record.rec_type == 'response':\n                print(record.rec_headers.get_header('WARC-Target-URI'))\n\nThe stream object could be a file on disk or a remote network stream.\nThe ``ArchiveIterator`` reads the WARC content in a single pass. The\n``record`` is represented by an ``ArcWarcRecord`` object which contains\nthe format (ARC or WARC), record type, the record headers, http headers\n(if any), and raw stream for reading the payload.\n\n.. code:: python\n\n    class ArcWarcRecord(object):\n        def __init__(self, *args):\n            (self.format, self.rec_type, self.rec_headers, self.raw_stream,\n             self.http_headers, self.content_type, self.length) = args\n\nReading WARC Content\n~~~~~~~~~~~~~~~~~~~~\n\nThe ``raw_stream`` can be used to read the rest of the payload directly.\nA special ``ArcWarcRecord.content_stream()`` function provides a stream that\nautomatically decompresses and de-chunks the HTTP payload, if it is\ncompressed and/or transfer-encoding chunked.\n\nARC Files\n~~~~~~~~~\n\nThe library provides support for reading (but not writing ARC) files.\nThe ARC format is legacy but is important to support in a consistent\nmatter. The ``ArchiveIterator`` can equally iterate over ARC and WARC\nfiles to emit ``ArcWarcRecord`` objects. The special ``arc2warc`` option\nconverts ARC records to WARCs on the fly, allowing for them to be\naccessed using the same API.\n\n(Special ``WARCIterator`` and ``ARCIterator`` subclasses of ``ArchiveIterator``\nare also available to read only WARC or only ARC files).\n\nWARC and ARC Streaming\n~~~~~~~~~~~~~~~~~~~~~~\nFor example, here is a snippet for reading an ARC and a WARC using the\nsame API.\n\nThe example streams a WARC and ARC file over HTTP using\n`requests <http://docs.python-requests.org/en/master/>`__, printing the\n``warcinfo`` record (or ARC header) and any response records (or all ARC\nrecords) that contain HTML:\n\n.. code:: python\n\n    import requests\n    from warcio.archiveiterator import ArchiveIterator\n\n    def print_records(url):\n        resp = requests.get(url, stream=True)\n\n        for record in ArchiveIterator(resp.raw, arc2warc=True):\n            if record.rec_type == 'warcinfo':\n                print(record.raw_stream.read())\n\n            elif record.rec_type == 'response':\n                if record.http_headers.get_header('Content-Type') == 'text/html':\n                    print(record.rec_headers.get_header('WARC-Target-URI'))\n                    print(record.content_stream().read())\n                    print('')\n\n    # WARC\n    print_records('https://archive.org/download/ExampleArcAndWarcFiles/IAH-20080430204825-00000-blackbook.warc.gz')\n\n\n    # ARC with arc2warc\n    print_records('https://archive.org/download/ExampleArcAndWarcFiles/IAH-20080430204825-00000-blackbook.arc.gz')\n\n\nWriting WARC Records\n--------------------\n\nStarting with 1.6, warcio introduces a way to capture HTTP/S traffic directly\nto a WARC file, by monkey-patching Python's ``http.client`` library.\n\nThis approach works well with the popular ``requests`` library often used to fetch\nHTTP/S content. Note that ``requests`` must be imported after the ``capture_http`` module.\n\nQuick Start to Writing a WARC\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nFetching the url ``https://example.com/`` while capturing the response and request\ninto a gzip compressed WARC file named ``example.warc.gz`` can be done with the following four lines:\n\n.. code:: python\n\n    from warcio.capture_http import capture_http\n    import requests  # requests must be imported after capture_http\n\n    with capture_http('example.warc.gz'):\n        requests.get('https://example.com/')\n\n\nThe WARC ``example.warc.gz`` will contain two records (the response is written first, then the request).\n\nTo write to a default in-memory buffer (``BufferWARCWriter``), don't specify a filename, using ``with capture_http() as writer:``.\n\nAdditional requests in the ``capture_http`` context and will be appended to the WARC as expected.\n\nThe ``WARC-IP-Address`` header will also be added for each record if the IP address is available.\n\nThe following example (similar to a `unit test from the test suite <test/test_capture_http.py>`__) demonstrates the resulting records created with ``capture_http``:\n\n.. code:: python\n\n    with capture_http() as writer:\n        requests.get('http://example.com/')\n        requests.get('https://google.com/')\n\n    expected = [('http://example.com/', 'response', True),\n                ('http://example.com/', 'request', True),\n                ('https://google.com/', 'response', True),\n                ('https://google.com/', 'request', True),\n                ('https://www.google.com/', 'response', True),\n                ('https://www.google.com/', 'request', True)\n               ]\n\n     actual = [\n                (record.rec_headers['WARC-Target-URI'],\n                 record.rec_type,\n                 'WARC-IP-Address' in record.rec_headers)\n\n                for record in ArchiveIterator(writer.get_stream())\n              ]\n\n     assert actual == expected\n        \n\nCustomizing WARC Writing\n~~~~~~~~~~~~~~~~~~~~~~~~\n\nThe library provides a simple and extensible interface for writing\nstandards-compliant WARC files.\n\nThe library comes with a basic ``WARCWriter`` class for writing to a\nsingle WARC file and ``BufferWARCWriter`` for writing to an in-memory\nbuffer. The ``BaseWARCWriter`` can be extended to support more complex\noperations.\n\n(There is no support for writing legacy ARC files)\n\nFor more flexibility, such as to use a custom ``WARCWriter`` class,\nthe above example can be written as:\n\n.. code:: python\n\n    from warcio.capture_http import capture_http\n    from warcio import WARCWriter\n    import requests  # requests *must* be imported after capture_http\n\n    with open('example.warc.gz', 'wb') as fh:\n        warc_writer = WARCWriter(fh)\n        with capture_http(warc_writer):\n            requests.get('https://example.com/')\n            \nWARC/1.1 Support\n~~~~~~~~~~~~~~~~\n\nBy default, warcio creates WARC 1.0 records for maximum compatibility with existing tools.\nTo create WARC/1.1 records, simply specify the warc version as follows:\n\n.. code:: python\n   \n    with capture_http('example.warc.gz', warc_version='1.1'):\n        ...\n\n\n.. code:: python\n\n    WARCWriter(fh, warc_version='1.1)\n    ...\n    \nWhen using WARC 1.1, the main difference is that the ``WARC-Date`` timestamp header\nwill be written with microsecond precision, while WARC 1.0 only supports second precision.\n\nWARC 1.0:\n\n.. code::\n \n    WARC/1.0\n    ...\n    WARC-Date: 2018-12-26T10:11:12Z\n\nWARC 1.1:\n\n.. code::\n\n    WARC/1.1\n    ...\n    WARC-Date: 2018-12-26T10:11:12.456789Z\n    \n    \n\nFiltering HTTP Capture\n~~~~~~~~~~~~~~~~~~~~~~\n\nWhen capturing via HTTP, it is possible to provide a custom filter function, \nwhich can be used to determine if a particular request and response records\nshould be written to the WARC file or skipped.\n\nThe filter function is called with the request and response record\nbefore they are written, and can be used to substitute a different record (for example, a revisit\ninstead of a response), or to skip writing altogether by returning nothing, as shown below:\n\n.. code:: python\n\n    def filter_records(request, response, request_recorder):\n        # return None, None to indicate records should be skipped\n        if response.http_headers.get_statuscode() != '200':\n            return None, None\n            \n        # the response record can be replaced with a revisit record\n        elif check_for_dedup():\n            response = create_revisit_record(...)\n            \n        return request, response\n\n    with capture_http('example.warc.gz', filter_records):\n         requests.get('https://example.com/')\n         \nPlease refer to\n`test/test\\_capture_http.py <test/test_capture_http.py>`__ for additional examples\nof capturing ``requests`` traffic to WARC.\n\nManual/Advanced WARC Writing\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nBefore 1.6, this was the primary method for fetching a url and then\nwriting to a WARC. This process is a bit more verbose,\nbut provides for full control of WARC creation and avoid monkey-patching.\n\nThe following example loads ``http://example.com/``, creates a WARC\nresponse record, and writes it, gzip compressed, to ``example.warc.gz``\nThe block and payload digests are computed automatically.\n\n.. code:: python\n\n    from warcio.warcwriter import WARCWriter\n    from warcio.statusandheaders import StatusAndHeaders\n\n    import requests\n\n    with open('example.warc.gz', 'wb') as output:\n        writer = WARCWriter(output, gzip=True)\n\n        resp = requests.get('http://example.com/',\n                            headers={'Accept-Encoding': 'identity'},\n                            stream=True)\n\n        # get raw headers from urllib3\n        headers_list = resp.raw.headers.items()\n\n        http_headers = StatusAndHeaders('200 OK', headers_list, protocol='HTTP/1.0')\n\n        record = writer.create_warc_record('http://example.com/', 'response',\n                                            payload=resp.raw,\n                                            http_headers=http_headers)\n\n        writer.write_record(record)\n\n\nThe library also includes additional semantics for:\n - Creating ``warcinfo`` and ``revisit`` records\n - Writing ``response`` and ``request`` records together\n - Writing custom WARC records\n - Reading a full WARC record from a stream\n\nPlease refer to `warcwriter.py <warcio/warcwriter.py>`__ and\n`test/test\\_writer.py <test/test_writer.py>`__ for additional examples.\n\nWARCIO CLI: Indexing and Recompression\n--------------------------------------\n\nThe library currently ships with a few simple command line tools.\n\nIndex\n~~~~~\n\nThe ``warcio index`` cmd will print a simple index of the records in the\nwarc file as newline delimited JSON lines (NDJSON).\n\nWARC header fields to include in the index can be specified via the\n``-f`` flag, and are included in the JSON block (in order, for\nconvenience).\n\n::\n\n    warcio index ./test/data/example-iana.org-chunked.warc -f warc-type,warc-target-uri,content-length\n    {\"warc-type\": \"warcinfo\", \"content-length\": \"137\"}\n    {\"warc-type\": \"response\", \"warc-target-uri\": \"http://www.iana.org/\", \"content-length\": \"7566\"}\n    {\"warc-type\": \"request\", \"warc-target-uri\": \"http://www.iana.org/\", \"content-length\": \"76\"}\n\n\nHTTP header fields can be included by prefixing them with the prefix\n``http:``. The special field ``offset`` refers to the record offset within\nthe warc file.\n\n::\n\n    warcio index ./test/data/example-iana.org-chunked.warc -f offset,content-type,http:content-type,warc-target-uri\n    {\"offset\": \"0\", \"content-type\": \"application/warc-fields\"}\n    {\"offset\": \"405\", \"content-type\": \"application/http;msgtype=response\", \"http:content-type\": \"text/html; charset=UTF-8\", \"warc-target-uri\": \"http://www.iana.org/\"}\n    {\"offset\": \"8379\", \"content-type\": \"application/http;msgtype=request\", \"warc-target-uri\": \"http://www.iana.org/\"}\n\n(Note: this library does not produce CDX or CDXJ format indexes often\nassociated with web archives. To create these indexes, please see the\n`cdxj-indexer <https://github.com/webrecorder/cdxj-indexer>`__ tool which extends warcio indexing to provide this functionality)\n\nCheck\n~~~~~\n\nThe ``warcio check`` command will check the payload and block digests\nof WARC records, if possible. An exit value of 1 indicates a failure.\n``warcio check -v`` will print verbose output for each record in the\nWARC file.\n\nRecompress\n~~~~~~~~~~\n\nThe ``recompress`` command allows for re-compressing or normalizing WARC\n(or ARC) files to a record-compressed, gzipped WARC file.\n\nEach WARC record is compressed individually and concatenated. This is\nthe 'canonical' WARC storage format used by\n`Webrecorder <https://github.com/webrecorder/webrecorder>`__ and other\nweb archiving institutions, and usually stored with a ``.warc.gz``\nextension.\n\nIt can be used to: - Compress an uncompressed WARC - Convert any ARC\nfile to a compressed WARC - Fix an improperly compressed WARC file (eg.\na WARC compressed entirely instead of by record)\n\n::\n\n    warcio recompress ./input.arc.gz ./output.warc.gz\n\n\nExtract\n~~~~~~~\n\nThe  ``extract`` command provides a way to extract either the WARC and HTTP headers and/or payload of a WARC record\nto stdout. Given a WARC filename and an offset, ``extract`` will print the (decompressed) record at that offset\nin the file to stdout\n\nSpecifying --payload or --headers will output only the payload or only the WARC + HTTP headers (if any), respectively.\n\n::\n\n    warcio extract [--payload | --headers] filename offset\n\n\nLicense\n~~~~~~~\n\n``warcio`` is licensed under the Apache 2.0 License and is part of the\nWebrecorder project.\n\nSee `NOTICE <NOTICE>`__ and `LICENSE <LICENSE>`__ for details.\n",
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "Streaming WARC (and ARC) IO library",
    "version": "1.7.5",
    "project_urls": {
        "Homepage": "https://github.com/webrecorder/warcio"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5cf03f19085980f8a4485f4265cc9dba1099b2fa35ef7552390a8446e149c293",
                "md5": "703ffc6ed3e6e62fa20b9bfbfa48430b",
                "sha256": "ca96130bde7747e49da714097d144c6ff939458d4f93e1beb1e42455db4326d4"
            },
            "downloads": -1,
            "filename": "warcio-1.7.5-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "703ffc6ed3e6e62fa20b9bfbfa48430b",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": null,
            "size": 40568,
            "upload_time": "2024-12-10T21:31:46",
            "upload_time_iso_8601": "2024-12-10T21:31:46.291881Z",
            "url": "https://files.pythonhosted.org/packages/5c/f0/3f19085980f8a4485f4265cc9dba1099b2fa35ef7552390a8446e149c293/warcio-1.7.5-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f22bd825506924cb4508c90cd950dbda2a4dbfa9f5609e2ae76b53deaba656db",
                "md5": "34752e14cc47c6f7a174803b62e51c45",
                "sha256": "7247b57e68074cfd9433cb6dc226f8567d6777052abec2d3c78346cffa4d19b9"
            },
            "downloads": -1,
            "filename": "warcio-1.7.5.tar.gz",
            "has_sig": false,
            "md5_digest": "34752e14cc47c6f7a174803b62e51c45",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 61691,
            "upload_time": "2024-12-10T21:31:48",
            "upload_time_iso_8601": "2024-12-10T21:31:48.939725Z",
            "url": "https://files.pythonhosted.org/packages/f2/2b/d825506924cb4508c90cd950dbda2a4dbfa9f5609e2ae76b53deaba656db/warcio-1.7.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-10 21:31:48",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "webrecorder",
    "github_project": "warcio",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "appveyor": true,
    "lcname": "warcio"
}

Ilya Kreymer