cdxj-indexer


:Name: cdxj-indexer
:Version: 1.4.5
:Home page: https://github.com/webrecorder/cdxj-indexer
:Summary: CDXJ Indexer for WARC and ARC files
:Upload time: 2022-06-25 22:19:45
:Author: Ilya Kreymer
:License: Apache 2.0
:Requirements: No requirements were recorded.

CDXJ Indexer
~~~~~~~~~~~~

A command-line tool for generating CDXJ (and CDX) indexes from WARC and ARC files.
The indexer is a new tool, redesigned for fast and flexible indexing and based on the indexing functionality from `pywb <https://github.com/ikreymer/pywb>`_.

Install with ``pip install cdxj-indexer``, or install locally with ``python setup.py install``.


The indexer supports the classic CDX index format as well as the more flexible CDXJ format. With CDXJ, the indexer supports custom fields and ``request`` record access for WARC files. See the examples below and the command-line ``-h`` option for the latest features. (This is a work in progress.)


Usage examples
~~~~~~~~~~~~~~~~~~~~

Generate a CDXJ index:

.. code:: console

    > cdxj-indexer /path/to/archive-file.warc.gz
    com,example)/ 20170730223850 {"url": "http://example.com/", "mime": "text/html", "status": "200", "digest": "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK", "length": "1219", "offset": "771", "filename": "example-20170730223917.warc.gz"}
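
Judging from the sample output above, each CDXJ line consists of a sort key, a timestamp, and a JSON object, separated by single spaces. The following is a minimal sketch of reading such a line in Python; the helper name ``parse_cdxj_line`` is hypothetical and not part of cdxj-indexer.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (key, timestamp, fields).

    Assumes the layout seen in the sample output: two space-separated
    tokens followed by a JSON object that may itself contain spaces.
    """
    key, timestamp, raw_json = line.rstrip("\n").split(" ", 2)
    return key, timestamp, json.loads(raw_json)


# Sample line copied from the output above.
line = (
    'com,example)/ 20170730223850 {"url": "http://example.com/", '
    '"mime": "text/html", "status": "200", '
    '"digest": "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK", "length": "1219", '
    '"offset": "771", "filename": "example-20170730223917.warc.gz"}'
)

key, timestamp, fields = parse_cdxj_line(line)
# Numeric values are stored as strings in the sample, so convert as needed.
print(key, timestamp, fields["url"], int(fields["offset"]))
```

Splitting with a maximum of two splits keeps the JSON part intact even though it contains spaces.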


Generate a CDX index (11 fields):

.. code:: console

    > cdxj-indexer -11 /path/to/archive-file.warc.gz
    CDX N b a m s k r M S V g
    com,example)/ 20170730223850 http://example.com/ text/html 200 G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK - - 1219 771 example-20170730223917.warc.gz
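
In the CDX output above, the header line names each column with a single letter, and every record line carries one space-separated value per letter. As a rough sketch (the function name ``parse_cdx_lines`` is hypothetical), the header can be used to turn each record into a dictionary:

```python
def parse_cdx_lines(lines):
    """Yield one dict per CDX record, keyed by the header's field letters.

    Assumes a header line starting with the token "CDX" precedes the
    records, and that no field value contains whitespace.
    """
    field_letters = None
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] == "CDX":
            field_letters = parts[1:]
            continue
        if field_letters is None:
            raise ValueError("CDX header line not seen before first record")
        yield dict(zip(field_letters, parts))


# Sample lines copied from the output above.
cdx_lines = [
    "CDX N b a m s k r M S V g",
    "com,example)/ 20170730223850 http://example.com/ text/html 200 "
    "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK - - 1219 771 "
    "example-20170730223917.warc.gz",
]

records = list(parse_cdx_lines(cdx_lines))
print(records[0]["a"], records[0]["S"], records[0]["V"])
```

Comparing this record with the CDXJ example suggests that ``S`` and ``V`` line up with the CDXJ ``length`` and ``offset`` fields, and that ``-`` marks an empty column.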


More advanced use cases: add additional HTTP headers as fields. The ``http:`` prefix specifies headers of the current record, while ``req.http:`` specifies headers of the corresponding request record. The following adds the ``Date`` and ``Referer`` headers and the request method to the index:

.. code:: console

    > cdxj-indexer -f req.http:method,http:date,req.http:referer /path/to/archive-file.warc.gz
    com,example)/ 20170801032435 {"url": "http://example.com/", "mime": "text/html", "status": "200", "digest": "A6DESOVDZ3WLYF57CS5E4RIC4ARPWRK7", "length": "1207", "offset": "834", "filename": "temp-20170801032445.warc.gz", "req.http:method": "GET", "http:date": "Tue, 01 Aug 2017 03:24:35 GMT", "referrer": "https://webrecorder.io/temp-NU34HBNO/temp/recording-session/record/http://example.com/"}
    org,iana)/domains/example 20170801032437 {"url": "http://www.iana.org/domains/example", "mime": "text/html", "status": "302", "digest": "RP3Y66FDBYBZKSFYQ4VJ4RMDA5BPDJX2", "length": "675", "offset": "2652", "filename": "temp-20170801032445.warc.gz", "req.http:method": "GET", "http:date": "Tue, 01 Aug 2017 02:35:05 GMT", "referrer": "http://example.com/"}
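
Once extra fields are in the index, they can be read like any other JSON key. The sketch below is only an illustration of one possible use: it finds records whose referrer is itself an indexed URL. It reads the ``referrer`` key because that is how the field appears in the sample output above; the variable names are hypothetical.

```python
import json

# Two records from the output above, trimmed to the fields used here.
cdxj_lines = [
    'com,example)/ 20170801032435 {"url": "http://example.com/", '
    '"status": "200", "req.http:method": "GET", '
    '"referrer": "https://webrecorder.io/temp-NU34HBNO/temp/'
    'recording-session/record/http://example.com/"}',
    'org,iana)/domains/example 20170801032437 '
    '{"url": "http://www.iana.org/domains/example", "status": "302", '
    '"req.http:method": "GET", "referrer": "http://example.com/"}',
]

# Parse each line: sort key, timestamp, then the JSON fields.
records = []
for line in cdxj_lines:
    key, timestamp, raw_json = line.split(" ", 2)
    records.append(json.loads(raw_json))

# Pair each record with its referrer when the referrer was also indexed.
indexed_urls = {record["url"] for record in records}
linked = [
    (record["referrer"], record["url"])
    for record in records
    if record.get("referrer") in indexed_urls
]

for source, target in linked:
    print(source, "->", target)
```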


The CDXJ Indexer extends the ``Indexer`` functionality in `warcio <https://github.com/webrecorder/warcio>`_ and is intended to be easy to extend.







            

Raw data

.. code:: json

    {
        "_id": null,
        "home_page": "https://github.com/webrecorder/cdxj-indexer",
        "name": "cdxj-indexer",
        "maintainer": "",
        "docs_url": null,
        "requires_python": "",
        "maintainer_email": "",
        "keywords": "",
        "author": "Ilya Kreymer",
        "author_email": "ikreymer@gmail.com",
        "download_url": "https://files.pythonhosted.org/packages/5d/2e/c245b73d2897afc0f1eb369e30b56dae1cf8ec4762086e74b200448f1fa0/cdxj_indexer-1.4.5.tar.gz",
        "platform": null,
        "bugtrack_url": null,
        "license": "Apache 2.0",
        "summary": "CDXJ Indexer for WARC and ARC files",
        "version": "1.4.5",
        "project_urls": {
            "Homepage": "https://github.com/webrecorder/cdxj-indexer"
        },
        "split_keywords": [],
        "urls": [
            {
                "comment_text": "",
                "digests": {
                    "blake2b_256": "c1e7f627442ccdd7b441728672125282f556f230e0f26a0859e987d5101a35e5",
                    "md5": "81f4048e41aab2ff2c91c7dbbdc5a2f6",
                    "sha256": "7a459511c4635734c44323bdd3589a92ba6e5d8097d24757be0b7add0bbd6153"
                },
                "downloads": -1,
                "filename": "cdxj_indexer-1.4.5-py3-none-any.whl",
                "has_sig": false,
                "md5_digest": "81f4048e41aab2ff2c91c7dbbdc5a2f6",
                "packagetype": "bdist_wheel",
                "python_version": "py3",
                "requires_python": null,
                "size": 15089,
                "upload_time": "2022-06-25T22:19:43",
                "upload_time_iso_8601": "2022-06-25T22:19:43.323498Z",
                "url": "https://files.pythonhosted.org/packages/c1/e7/f627442ccdd7b441728672125282f556f230e0f26a0859e987d5101a35e5/cdxj_indexer-1.4.5-py3-none-any.whl",
                "yanked": false,
                "yanked_reason": null
            },
            {
                "comment_text": "",
                "digests": {
                    "blake2b_256": "5d2ec245b73d2897afc0f1eb369e30b56dae1cf8ec4762086e74b200448f1fa0",
                    "md5": "2d6cd1ef19587c70d1935f2472bba6f7",
                    "sha256": "95ebd479ef103c0bfdccee9bff21bd260c611b486c868e498d9a352857f5e27a"
                },
                "downloads": -1,
                "filename": "cdxj_indexer-1.4.5.tar.gz",
                "has_sig": false,
                "md5_digest": "2d6cd1ef19587c70d1935f2472bba6f7",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 19127,
                "upload_time": "2022-06-25T22:19:45",
                "upload_time_iso_8601": "2022-06-25T22:19:45.533267Z",
                "url": "https://files.pythonhosted.org/packages/5d/2e/c245b73d2897afc0f1eb369e30b56dae1cf8ec4762086e74b200448f1fa0/cdxj_indexer-1.4.5.tar.gz",
                "yanked": false,
                "yanked_reason": null
            }
        ],
        "upload_time": "2022-06-25 22:19:45",
        "github": true,
        "gitlab": false,
        "bitbucket": false,
        "codeberg": false,
        "github_user": "webrecorder",
        "github_project": "cdxj-indexer",
        "travis_ci": false,
        "coveralls": true,
        "github_actions": true,
        "lcname": "cdxj-indexer"
    }