CDXJ Indexer
~~~~~~~~~~~~
A command-line tool for generating CDXJ (and CDX) indexes from WARC and ARC files.
The indexer is a new tool, redesigned for fast and flexible indexing and based on the indexing functionality from `pywb <https://github.com/ikreymer/pywb>`_.
Install with ``pip install cdxj-indexer``, or install locally with ``python setup.py install``.
The indexer supports the classic CDX index format as well as the more flexible CDXJ. With CDXJ, the indexer supports custom fields and ``request`` record access for WARC files. See the examples below and the command-line ``-h`` option for the latest features. This is a work in progress.
Usage examples
~~~~~~~~~~~~~~~~~~~~
Generate a CDXJ index:
.. code:: console

    > cdxj-indexer /path/to/archive-file.warc.gz
    com,example)/ 20170730223850 {"url": "http://example.com/", "mime": "text/html", "status": "200", "digest": "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK", "length": "1219", "offset": "771", "filename": "example-20170730223917.warc.gz"}

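
Each CDXJ line shown above consists of a sort key, a timestamp, and a JSON object, separated by single spaces. As a minimal sketch of how such a line could be consumed from Python (the helper name ``parse_cdxj_line`` is hypothetical and not part of cdxj-indexer, and the sketch assumes the key and timestamp contain no spaces):

.. code:: python

    import json


    def parse_cdxj_line(line):
        """Split one CDXJ line into (key, timestamp, fields).

        Hypothetical helper for illustration; not part of cdxj-indexer.
        """
        # The first two space-separated tokens are the key and the timestamp;
        # the remainder of the line is parsed as a JSON object.
        key, timestamp, json_part = line.rstrip("\n").split(" ", 2)
        return key, timestamp, json.loads(json_part)


    # The sample line from the example output above.
    line = (
        'com,example)/ 20170730223850 {"url": "http://example.com/", '
        '"mime": "text/html", "status": "200", '
        '"digest": "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK", "length": "1219", '
        '"offset": "771", "filename": "example-20170730223917.warc.gz"}'
    )

    key, timestamp, fields = parse_cdxj_line(line)
    print(key, timestamp, fields["url"], fields["status"])

Note that the values in the JSON object are strings in the sample output, so numeric fields such as ``offset`` and ``length`` need an explicit ``int()`` conversion before use.
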
Generate a CDX index (11 fields):
.. code:: console

    > cdxj-indexer -11 /path/to/archive-file.warc.gz
    CDX N b a m s k r M S V g
    com,example)/ 20170730223850 http://example.com/ text/html 200 G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK - - 1219 771 example-20170730223917.warc.gz

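
In the classic CDX output, the first line is a legend of single-letter field codes, and each following line holds the values in the same order. The sketch below pairs the two up. The function ``parse_cdx`` and the readable names in ``CDX11_NAMES`` are assumptions made for illustration (the names are borrowed from the CDXJ keys shown earlier); neither is defined by cdxj-indexer.

.. code:: python

    # Readable labels for the 11 field letters; these labels are an
    # assumption for illustration, chosen to mirror the CDXJ keys.
    CDX11_NAMES = {
        "N": "urlkey",
        "b": "timestamp",
        "a": "url",
        "m": "mime",
        "s": "status",
        "k": "digest",
        "r": "redirect",
        "M": "meta",
        "S": "length",
        "V": "offset",
        "g": "filename",
    }


    def parse_cdx(text):
        """Parse CDX text (legend line plus records) into a list of dicts."""
        lines = text.strip().splitlines()
        # The legend starts with the literal token "CDX", then the field letters.
        letters = lines[0].split()[1:]
        records = []
        for line in lines[1:]:
            values = line.split(" ")
            records.append(
                {CDX11_NAMES.get(letter, letter): value
                 for letter, value in zip(letters, values)}
            )
        return records


    # The sample output from the example above.
    cdx_output = "\n".join([
        "CDX N b a m s k r M S V g",
        "com,example)/ 20170730223850 http://example.com/ text/html 200 "
        "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK - - 1219 771 "
        "example-20170730223917.warc.gz",
    ])

    for record in parse_cdx(cdx_output):
        print(record["url"], record["offset"], record["filename"])

Fields that are not available appear as ``-`` in the output and are kept as that string here.
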
More advanced use cases: add additional HTTP headers as fields. The ``http:`` prefix specifies headers of the current record, while ``req.http:`` specifies headers of the corresponding request record. The following adds the ``Date`` header, the request ``Referer`` header, and the request method to the index:
.. code:: console

    > cdxj-indexer -f req.http:method,http:date,req.http:referer /path/to/archive-file.warc.gz
    com,example)/ 20170801032435 {"url": "http://example.com/", "mime": "text/html", "status": "200", "digest": "A6DESOVDZ3WLYF57CS5E4RIC4ARPWRK7", "length": "1207", "offset": "834", "filename": "temp-20170801032445.warc.gz", "req.http:method": "GET", "http:date": "Tue, 01 Aug 2017 03:24:35 GMT", "referrer": "https://webrecorder.io/temp-NU34HBNO/temp/recording-session/record/http://example.com/"}
    org,iana)/domains/example 20170801032437 {"url": "http://www.iana.org/domains/example", "mime": "text/html", "status": "302", "digest": "RP3Y66FDBYBZKSFYQ4VJ4RMDA5BPDJX2", "length": "675", "offset": "2652", "filename": "temp-20170801032445.warc.gz", "req.http:method": "GET", "http:date": "Tue, 01 Aug 2017 02:35:05 GMT", "referrer": "http://example.com/"}

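
Custom fields end up as extra keys in the JSON part of each line, so they can be read like any other field. As a rough sketch, the following builds a mapping from each captured URL to the page that referred to it, using the ``referrer`` key as it appears in the sample output above. The function ``referrer_map`` is hypothetical, and the sample lines are trimmed to a few keys for brevity.

.. code:: python

    import json


    def referrer_map(lines):
        """Map each record's URL to its ``referrer`` value (None if absent).

        Hypothetical helper for illustration; not part of cdxj-indexer.
        """
        result = {}
        for line in lines:
            line = line.strip()
            if not line:
                continue
            # Skip the key and timestamp; the rest of the line is JSON.
            _key, _timestamp, json_part = line.split(" ", 2)
            fields = json.loads(json_part)
            result[fields["url"]] = fields.get("referrer")
        return result


    # Trimmed versions of the two sample records above.
    sample_lines = [
        'com,example)/ 20170801032435 {"url": "http://example.com/", '
        '"status": "200", "req.http:method": "GET", '
        '"referrer": "https://webrecorder.io/temp-NU34HBNO/temp/'
        'recording-session/record/http://example.com/"}',
        'org,iana)/domains/example 20170801032437 '
        '{"url": "http://www.iana.org/domains/example", "status": "302", '
        '"req.http:method": "GET", "referrer": "http://example.com/"}',
    ]

    for url, referrer in referrer_map(sample_lines).items():
        print(url, "<-", referrer)
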
The CDXJ Indexer extends the ``Indexer`` functionality in `warcio <https://github.com/webrecorder/warcio>`_ and is intended to be easy to extend.
Raw data
{
"_id": null,
"home_page": "https://github.com/webrecorder/cdxj-indexer",
"name": "cdxj-indexer",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "",
"author": "Ilya Kreymer",
"author_email": "ikreymer@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/5d/2e/c245b73d2897afc0f1eb369e30b56dae1cf8ec4762086e74b200448f1fa0/cdxj_indexer-1.4.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache 2.0",
"summary": "CDXJ Indexer for WARC and ARC files",
"version": "1.4.5",
"project_urls": {
"Homepage": "https://github.com/webrecorder/cdxj-indexer"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "c1e7f627442ccdd7b441728672125282f556f230e0f26a0859e987d5101a35e5",
"md5": "81f4048e41aab2ff2c91c7dbbdc5a2f6",
"sha256": "7a459511c4635734c44323bdd3589a92ba6e5d8097d24757be0b7add0bbd6153"
},
"downloads": -1,
"filename": "cdxj_indexer-1.4.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "81f4048e41aab2ff2c91c7dbbdc5a2f6",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 15089,
"upload_time": "2022-06-25T22:19:43",
"upload_time_iso_8601": "2022-06-25T22:19:43.323498Z",
"url": "https://files.pythonhosted.org/packages/c1/e7/f627442ccdd7b441728672125282f556f230e0f26a0859e987d5101a35e5/cdxj_indexer-1.4.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "5d2ec245b73d2897afc0f1eb369e30b56dae1cf8ec4762086e74b200448f1fa0",
"md5": "2d6cd1ef19587c70d1935f2472bba6f7",
"sha256": "95ebd479ef103c0bfdccee9bff21bd260c611b486c868e498d9a352857f5e27a"
},
"downloads": -1,
"filename": "cdxj_indexer-1.4.5.tar.gz",
"has_sig": false,
"md5_digest": "2d6cd1ef19587c70d1935f2472bba6f7",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 19127,
"upload_time": "2022-06-25T22:19:45",
"upload_time_iso_8601": "2022-06-25T22:19:45.533267Z",
"url": "https://files.pythonhosted.org/packages/5d/2e/c245b73d2897afc0f1eb369e30b56dae1cf8ec4762086e74b200448f1fa0/cdxj_indexer-1.4.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-06-25 22:19:45",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "webrecorder",
"github_project": "cdxj-indexer",
"travis_ci": false,
"coveralls": true,
"github_actions": true,
"lcname": "cdxj-indexer"
}