cdxj-indexer


:Name: cdxj-indexer
:Version: 1.4.6
:Home page: https://github.com/webrecorder/cdxj-indexer
:Summary: CDXJ Indexer for WARC and ARC files
:Upload time: 2024-12-10 21:39:01
:Author: Ilya Kreymer
:License: Apache 2.0

CDXJ Indexer
~~~~~~~~~~~~

A command-line tool for generating CDXJ (and CDX) indexes from WARC and ARC files.
The indexer is a new tool, redesigned for fast and flexible indexing and based on the indexing functionality from `pywb <https://github.com/ikreymer/pywb>`_.

Install with ``pip install cdxj-indexer``, or install locally from a source checkout with ``python setup.py install``.


The indexer supports the classic CDX index format as well as the more flexible CDXJ. With CDXJ, the indexer supports custom fields and access to ``request`` records in WARC files. See the examples below and the command-line ``-h`` option for the latest features. (This is a work in progress.)


Usage examples
~~~~~~~~~~~~~~~~~~~~

Generate a CDXJ index:

.. code:: console

    > cdxj-indexer /path/to/archive-file.warc.gz
    com,example)/ 20170730223850 {"url": "http://example.com/", "mime": "text/html", "status": "200", "digest": "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK", "length": "1219", "offset": "771", "filename": "example-20170730223917.warc.gz"}

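Each CDXJ line shown above consists of a URL key, a 14-digit timestamp, and a JSON block, separated by single spaces. A minimal sketch of reading such a line from Python follows. The names ``CdxjEntry`` and ``parse_cdxj_line`` are hypothetical helpers written for illustration and are not part of cdxj-indexer.

.. code:: python

    import json
    from typing import NamedTuple


    class CdxjEntry(NamedTuple):
        """Hypothetical container for one parsed CDXJ line."""
        urlkey: str
        timestamp: str
        fields: dict


    def parse_cdxj_line(line: str) -> CdxjEntry:
        # Split on the first two spaces only, because the JSON block
        # itself contains spaces.
        urlkey, timestamp, raw_json = line.rstrip("\n").split(" ", 2)
        return CdxjEntry(urlkey, timestamp, json.loads(raw_json))


    # Sample line copied from the indexer output above.
    sample = (
        'com,example)/ 20170730223850 {"url": "http://example.com/", '
        '"mime": "text/html", "status": "200", '
        '"digest": "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK", "length": "1219", '
        '"offset": "771", "filename": "example-20170730223917.warc.gz"}'
    )

    entry = parse_cdxj_line(sample)
    print(entry.urlkey, entry.timestamp, entry.fields["offset"])

Note that numeric values such as ``length`` and ``offset`` appear as JSON strings in this output, so convert them with ``int()`` before doing arithmetic on them.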

Generate an 11-field CDX index:

.. code:: console

    > cdxj-indexer -11 /path/to/archive-file.warc.gz
    CDX N b a m s k r M S V g
    com,example)/ 20170730223850 http://example.com/ text/html 200 G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK - - 1219 771 example-20170730223917.warc.gz

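In the classic CDX format the header line names each column with a single letter, and every following line holds the values in the same order. The sketch below pairs the two, assuming the space-separated layout shown above. ``parse_cdx`` is a hypothetical helper, not a cdxj-indexer function. Comparing the two outputs above, the ``S`` and ``V`` columns carry the same values as the CDXJ ``length`` and ``offset`` fields.

.. code:: python

    def parse_cdx(lines):
        """Yield one dict per CDX record, keyed by the header's field letters."""
        letters = None
        for line in lines:
            line = line.strip()
            if not line:
                continue
            if line.startswith("CDX "):
                # Header line: drop the "CDX" marker, keep the field letters.
                letters = line.split(" ")[1:]
                continue
            if letters is None:
                raise ValueError("CDX data line seen before any header line")
            values = line.split(" ")
            if len(values) != len(letters):
                raise ValueError("expected %d fields, got %d" % (len(letters), len(values)))
            yield dict(zip(letters, values))


    # Sample lines copied from the indexer output above.
    sample_lines = [
        "CDX N b a m s k r M S V g",
        "com,example)/ 20170730223850 http://example.com/ text/html 200 "
        "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK - - 1219 771 "
        "example-20170730223917.warc.gz",
    ]

    records = list(parse_cdx(sample_lines))
    print(records[0]["a"], records[0]["S"], records[0]["V"])

Keying by the header letters rather than by fixed positions lets the same loop read other CDX variants, as long as each file carries its own header line.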

More advanced use cases can add HTTP headers as extra fields with the ``-f`` option. The ``http:`` prefix selects headers from the current record, while ``req.http:`` selects headers from the corresponding request record. The following adds the ``Date`` and ``Referer`` headers and the request method to the index:

.. code:: console

    > cdxj-indexer -f req.http:method,http:date,req.http:referer /path/to/archive-file.warc.gz
    com,example)/ 20170801032435 {"url": "http://example.com/", "mime": "text/html", "status": "200", "digest": "A6DESOVDZ3WLYF57CS5E4RIC4ARPWRK7", "length": "1207", "offset": "834", "filename": "temp-20170801032445.warc.gz", "req.http:method": "GET", "http:date": "Tue, 01 Aug 2017 03:24:35 GMT", "referrer": "https://webrecorder.io/temp-NU34HBNO/temp/recording-session/record/http://example.com/"}
    org,iana)/domains/example 20170801032437 {"url": "http://www.iana.org/domains/example", "mime": "text/html", "status": "302", "digest": "RP3Y66FDBYBZKSFYQ4VJ4RMDA5BPDJX2", "length": "675", "offset": "2652", "filename": "temp-20170801032445.warc.gz", "req.http:method": "GET", "http:date": "Tue, 01 Aug 2017 02:35:05 GMT", "referrer": "http://example.com/"}

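Custom fields land in the JSON block of each CDXJ line, so they can be read back with a few lines of Python. The sketch below collects (referrer, URL) pairs from output like the example above. ``referrer_pairs`` is a hypothetical helper, the sample lines are trimmed to a subset of the fields shown above, and the ``referrer`` key name is taken from that sample output rather than from any documented schema.

.. code:: python

    import json


    def referrer_pairs(cdxj_lines, key="referrer"):
        """Yield (referrer, url) for every CDXJ line that carries `key`."""
        for line in cdxj_lines:
            line = line.strip()
            if not line:
                continue
            # URL key and timestamp come first; the rest of the line is JSON.
            _urlkey, _timestamp, raw_json = line.split(" ", 2)
            fields = json.loads(raw_json)
            if key in fields:
                yield fields[key], fields["url"]


    # Trimmed copies of the two sample lines above (some fields omitted).
    sample_lines = [
        'com,example)/ 20170801032435 {"url": "http://example.com/", '
        '"status": "200", "req.http:method": "GET", "referrer": '
        '"https://webrecorder.io/temp-NU34HBNO/temp/recording-session/record/http://example.com/"}',
        'org,iana)/domains/example 20170801032437 '
        '{"url": "http://www.iana.org/domains/example", "status": "302", '
        '"req.http:method": "GET", "referrer": "http://example.com/"}',
    ]

    pairs = list(referrer_pairs(sample_lines))
    for referrer, url in pairs:
        print(referrer, "->", url)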

The CDXJ Indexer extends the ``Indexer`` functionality in `warcio <https://github.com/webrecorder/warcio>`_ and is intended to be easy to extend.
