CDXJ Indexer
~~~~~~~~~~~~
A command-line tool for generating CDXJ (and CDX) indexes from WARC and ARC files.
The indexer is a new tool, redesigned for fast and flexible indexing and based on the indexing functionality from `pywb <https://github.com/ikreymer/pywb>`_.

Install with ``pip install cdxj-indexer``, or install locally with ``python setup.py install``.

The indexer supports the classic CDX index format as well as the more flexible CDXJ format. With CDXJ, the indexer supports custom fields and access to ``request`` records for WARC files. See the examples below and the command-line ``-h`` option for the latest features. (This is a work in progress.)

Usage examples
~~~~~~~~~~~~~~~~~~~~
Generate CDXJ index:

.. code:: console

    > cdxj-indexer /path/to/archive-file.warc.gz
    com,example)/ 20170730223850 {"url": "http://example.com/", "mime": "text/html", "status": "200", "digest": "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK", "length": "1219", "offset": "771", "filename": "example-20170730223917.warc.gz"}

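To make the line layout concrete, here is a minimal Python sketch that splits one CDXJ line into its parts. It assumes, based only on the sample output above, that each line is a key, a timestamp, and a JSON object separated by single spaces. The helper name ``parse_cdxj_line`` is hypothetical and is not part of cdxj-indexer.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (key, timestamp, fields).

    Hypothetical helper: assumes the layout seen in the sample output,
    i.e. "<key> <timestamp> <json object>".
    """
    # Split on the first two spaces only, so spaces inside the JSON
    # object are left untouched.
    key, timestamp, json_part = line.strip().split(" ", 2)
    return key, timestamp, json.loads(json_part)


# The sample line from the example above.
sample = (
    'com,example)/ 20170730223850 {"url": "http://example.com/", '
    '"mime": "text/html", "status": "200", '
    '"digest": "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK", "length": "1219", '
    '"offset": "771", "filename": "example-20170730223917.warc.gz"}'
)

key, timestamp, fields = parse_cdxj_line(sample)
print(key, timestamp, fields["url"], fields["offset"])
```

Note that numeric values such as ``length`` and ``offset`` appear as JSON strings in the sample, so a consumer would need to convert them with ``int()`` before using them as byte positions.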
Generate an 11-field CDX index:

.. code:: console

    > cdxj-indexer -11 /path/to/archive-file.warc.gz
    CDX N b a m s k r M S V g
    com,example)/ 20170730223850 http://example.com/ text/html 200 G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK - - 1219 771 example-20170730223917.warc.gz

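The classic CDX output can be read back by pairing each value with the field letter in the header line. The sketch below is one way to do that in Python. It assumes that fields are separated by whitespace and contain no spaces themselves, which holds for the sample above but is not guaranteed here. ``parse_cdx`` is a hypothetical helper name.

```python
def parse_cdx(lines):
    """Yield one dict per record, keyed by the field letters in the header.

    Hypothetical helper: assumes whitespace-separated fields with no
    embedded spaces.
    """
    rows = iter(lines)
    header = next(rows, "").split()
    if not header or header[0] != "CDX":
        raise ValueError("expected a header line starting with 'CDX'")
    letters = header[1:]
    for row in rows:
        values = row.split()
        if not values:
            continue  # skip blank lines
        if len(values) != len(letters):
            raise ValueError("field count does not match header: %r" % row)
        yield dict(zip(letters, values))


# The two output lines from the example above.
sample_lines = [
    "CDX N b a m s k r M S V g",
    "com,example)/ 20170730223850 http://example.com/ text/html 200 "
    "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK - - 1219 771 "
    "example-20170730223917.warc.gz",
]

records = list(parse_cdx(sample_lines))
print(records[0]["a"], records[0]["S"], records[0]["V"])
```

By position, the ``S`` and ``V`` columns carry the same values as ``length`` and ``offset`` in the CDXJ example, and ``-`` marks a field with no value.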
For more advanced use cases, additional HTTP headers can be added as fields. The ``http:`` prefix selects headers from the current record, while the ``req.http:`` prefix selects headers from the corresponding request record. The following adds the ``Date`` header, the request ``Referer`` header, and the request method to the index:

.. code:: console

    > cdxj-indexer -f req.http:method,http:date,req.http:referer /path/to/archive-file.warc.gz
    com,example)/ 20170801032435 {"url": "http://example.com/", "mime": "text/html", "status": "200", "digest": "A6DESOVDZ3WLYF57CS5E4RIC4ARPWRK7", "length": "1207", "offset": "834", "filename": "temp-20170801032445.warc.gz", "req.http:method": "GET", "http:date": "Tue, 01 Aug 2017 03:24:35 GMT", "referrer": "https://webrecorder.io/temp-NU34HBNO/temp/recording-session/record/http://example.com/"}
    org,iana)/domains/example 20170801032437 {"url": "http://www.iana.org/domains/example", "mime": "text/html", "status": "302", "digest": "RP3Y66FDBYBZKSFYQ4VJ4RMDA5BPDJX2", "length": "675", "offset": "2652", "filename": "temp-20170801032445.warc.gz", "req.http:method": "GET", "http:date": "Tue, 01 Aug 2017 02:35:05 GMT", "referrer": "http://example.com/"}

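Once custom fields are in the index, they can be queried like any other JSON key. The following Python sketch lists the redirect responses together with the page that referred to them. The input lines are trimmed copies of the two records above, keeping only the keys used here, and the key names (``status``, ``req.http:method``, ``referrer``) are taken as they appear in that sample output. ``iter_cdxj`` is a hypothetical helper, not part of cdxj-indexer.

```python
import json


def iter_cdxj(lines):
    """Yield (key, timestamp, fields) for each non-empty CDXJ line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, timestamp, json_part = line.split(" ", 2)
        yield key, timestamp, json.loads(json_part)


# Trimmed copies of the two sample records (only the keys used below).
lines = [
    'com,example)/ 20170801032435 {"url": "http://example.com/", '
    '"status": "200", "req.http:method": "GET", '
    '"referrer": "https://webrecorder.io/temp-NU34HBNO/temp/'
    'recording-session/record/http://example.com/"}',
    'org,iana)/domains/example 20170801032437 '
    '{"url": "http://www.iana.org/domains/example", "status": "302", '
    '"req.http:method": "GET", "referrer": "http://example.com/"}',
]

# Keep GET requests whose response status is a 3xx redirect.
redirects = [
    (fields["url"], fields["referrer"])
    for _key, _timestamp, fields in iter_cdxj(lines)
    if fields["req.http:method"] == "GET" and fields["status"].startswith("3")
]

for url, referrer in redirects:
    print(url, "was reached from", referrer)
```

Using ``.get()`` instead of direct indexing would be the safer choice on a real index, since a custom field may be absent from records that lack the corresponding header.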
The CDXJ Indexer extends the ``Indexer`` functionality in `warcio <https://github.com/webrecorder/warcio>`_ and is intended to be easy to extend.

Raw data
~~~~~~~~

.. code:: json

    {
      "_id": null,
      "home_page": "https://github.com/webrecorder/cdxj-indexer",
      "name": "cdxj-indexer",
      "maintainer": null,
      "docs_url": null,
      "requires_python": null,
      "maintainer_email": null,
      "keywords": null,
      "author": "Ilya Kreymer",
      "author_email": "ikreymer@gmail.com",
      "download_url": "https://files.pythonhosted.org/packages/ef/b1/01fd5e8316cdfca178e951ba554309b9fe20910328039ca03830d7ca11ac/cdxj_indexer-1.4.6.tar.gz",
      "platform": null,
      "bugtrack_url": null,
      "license": "Apache 2.0",
      "summary": "CDXJ Indexer for WARC and ARC files",
      "version": "1.4.6",
      "project_urls": {
        "Homepage": "https://github.com/webrecorder/cdxj-indexer"
      },
      "split_keywords": [],
      "urls": [
        {
          "comment_text": "",
          "digests": {
            "blake2b_256": "ea57b2489ba52732906a1ec59c0a852ce35f806a5afe2d4e5c83c3f407e44891",
            "md5": "00cab0a6ccc737708f48ec9dc8862c5a",
            "sha256": "91ff88e0ca8f39f9e772ccfb6e3d245344b8e80db04cca5e88f184f8cbbd6604"
          },
          "downloads": -1,
          "filename": "cdxj_indexer-1.4.6-py2.py3-none-any.whl",
          "has_sig": false,
          "md5_digest": "00cab0a6ccc737708f48ec9dc8862c5a",
          "packagetype": "bdist_wheel",
          "python_version": "py2.py3",
          "requires_python": null,
          "size": 14993,
          "upload_time": "2024-12-10T21:39:00",
          "upload_time_iso_8601": "2024-12-10T21:39:00.091309Z",
          "url": "https://files.pythonhosted.org/packages/ea/57/b2489ba52732906a1ec59c0a852ce35f806a5afe2d4e5c83c3f407e44891/cdxj_indexer-1.4.6-py2.py3-none-any.whl",
          "yanked": false,
          "yanked_reason": null
        },
        {
          "comment_text": "",
          "digests": {
            "blake2b_256": "efb101fd5e8316cdfca178e951ba554309b9fe20910328039ca03830d7ca11ac",
            "md5": "2611c92414704fd14b7e26cda381c71d",
            "sha256": "7606d0c3eeba530323f6fafa62647c74c86ddefdca1edffa2d9d303388112238"
          },
          "downloads": -1,
          "filename": "cdxj_indexer-1.4.6.tar.gz",
          "has_sig": false,
          "md5_digest": "2611c92414704fd14b7e26cda381c71d",
          "packagetype": "sdist",
          "python_version": "source",
          "requires_python": null,
          "size": 19024,
          "upload_time": "2024-12-10T21:39:01",
          "upload_time_iso_8601": "2024-12-10T21:39:01.640836Z",
          "url": "https://files.pythonhosted.org/packages/ef/b1/01fd5e8316cdfca178e951ba554309b9fe20910328039ca03830d7ca11ac/cdxj_indexer-1.4.6.tar.gz",
          "yanked": false,
          "yanked_reason": null
        }
      ],
      "upload_time": "2024-12-10 21:39:01",
      "github": true,
      "gitlab": false,
      "bitbucket": false,
      "codeberg": false,
      "github_user": "webrecorder",
      "github_project": "cdxj-indexer",
      "travis_ci": false,
      "coveralls": true,
      "github_actions": true,
      "lcname": "cdxj-indexer"
    }