# cdxj_util
cdxj_util is a Python library for processing CDXJ files, the JSON-based variant of the CDX index format for web archives. It provides functionality for loading, searching, and analyzing large CDXJ files.
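
For readers new to the format, here is a minimal sketch of what a single CDXJ line looks like and how it can be split using only the standard library. It assumes the layout written by tools such as pywb (a URL key, a 14-digit timestamp, then a JSON object). The `parse_cdxj_line` helper and the sample line are hypothetical illustrations and are not part of cdxj_util.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (urlkey, timestamp, fields)."""
    # The first two space-separated tokens are the URL key and the timestamp;
    # everything after them is a JSON object with the remaining fields.
    urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(payload)


# Hypothetical sample line, for illustration only.
sample = (
    'com,example)/ 20240806072627 '
    '{"url": "http://example.com/", "mime": "text/html", "status": "200"}'
)
urlkey, timestamp, fields = parse_cdxj_line(sample)
print(urlkey, timestamp, fields["mime"])
```

Splitting with a maximum of two splits keeps any spaces inside the JSON payload intact.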
## Features
- Asynchronous and synchronous loading of CDXJ files
- URL-based searching (exact and partial matching)
- Filtering by timestamp range
- Bulk searching of multiple URLs
- Generation of CDXJ file statistics (total records, unique URLs, subdomain distribution, MIME type distribution, etc.)
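
To make the timestamp-range feature concrete, the following sketch filters plain dictionaries by a 14-digit timestamp. It is not the library's API: the `filter_by_timestamp` function and the `"timestamp"` and `"url"` keys are assumptions chosen for illustration.

```python
def filter_by_timestamp(records, start, end):
    """Keep records whose 14-digit timestamp lies in [start, end]."""
    # Fixed-width digit strings sort in the same order as the times they
    # encode, so plain string comparison is sufficient here.
    return [r for r in records if start <= r["timestamp"] <= end]


# Hypothetical records, for illustration only.
records = [
    {"url": "http://example.com/", "timestamp": "20230101000000"},
    {"url": "http://example.com/a", "timestamp": "20240315120000"},
    {"url": "http://example.com/b", "timestamp": "20250101000000"},
]
in_2024 = filter_by_timestamp(records, "20240101000000", "20241231235959")
print([r["url"] for r in in_2024])
```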
## Installation
```bash
pip install cdxj_util
```
## Usage
### Loading a CDXJ file
```python
from cdxj_util.core import CDXJCore
core = CDXJCore("path/to/your.cdxj")
records = core.load_all_records()
```
### Searching URLs
```python
from cdxj_util.search import CDXJSearch

# `records` is the list returned by load_all_records() in the previous example.
search = CDXJSearch(records)
results = search.search_by_url("http://example.com/", exact_match=True)
```
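
The difference between exact and partial matching can be sketched in plain Python as follows. This is an illustration of the idea only, not cdxj_util's implementation: the `match_url` helper, the substring rule for partial matches, and the `"url"` key are all assumptions.

```python
def match_url(records, url, exact_match=True):
    """Return records whose URL equals `url`, or merely contains it."""
    if exact_match:
        return [r for r in records if r["url"] == url]
    # Partial matching is modeled here as a simple substring test.
    return [r for r in records if url in r["url"]]


# Hypothetical records, for illustration only.
sample_records = [
    {"url": "http://example.com/"},
    {"url": "http://example.com/about"},
    {"url": "http://example.org/"},
]
exact = match_url(sample_records, "http://example.com/", exact_match=True)
partial = match_url(sample_records, "http://example.com/", exact_match=False)
print(len(exact), len(partial))
```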
### Generating statistics
```python
from cdxj_util.stats import CDXJStats

# `records` is the list returned by load_all_records() in the loading example.
stats = CDXJStats(records)
total_records = stats.total_records()
unique_urls = stats.unique_urls()
mime_distribution = stats.mime_type_distribution()
```
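
As a rough picture of what such statistics represent, the sketch below computes a record count, a unique URL count, and a MIME type distribution over plain dictionaries with `collections.Counter`. It does not use cdxj_util, and the `"url"` and `"mime"` keys are assumptions made for this example.

```python
from collections import Counter

# Hypothetical records, for illustration only.
sample_records = [
    {"url": "http://example.com/", "mime": "text/html"},
    {"url": "http://example.com/", "mime": "text/html"},
    {"url": "http://example.com/logo.png", "mime": "image/png"},
]

total = len(sample_records)
# A set comprehension collapses repeated captures of the same URL.
unique = len({r["url"] for r in sample_records})
# Counter maps each MIME type to the number of records that carry it.
mime_counts = Counter(r["mime"] for r in sample_records)

print(total, unique, dict(mime_counts))
```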
## Asynchronous Support
cdxj_util also supports asynchronous processing, which is particularly useful for handling large CDXJ files:
```python
import asyncio
from cdxj_util.async_core import AsyncCDXJCore


async def process_cdxj():
    async_core = AsyncCDXJCore("path/to/your.cdxj")
    records = await async_core.load_all_records()
    # Further processing...


asyncio.run(process_cdxj())
```
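
One common reason to reach for an async loader is to read several files concurrently. The sketch below shows that pattern with `asyncio.gather`. To stay self-contained it uses a hypothetical stand-in coroutine, `load_records`, in place of a real loader, and the file names are placeholders.

```python
import asyncio


async def load_records(path):
    """Hypothetical stand-in for an async loader; returns fake records."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return [{"source": path}]


async def load_many(paths):
    # gather() runs the loaders concurrently and returns results in input order.
    per_file = await asyncio.gather(*(load_records(p) for p in paths))
    # Flatten the per-file lists into a single list of records.
    return [record for records in per_file for record in records]


all_records = asyncio.run(load_many(["a.cdxj", "b.cdxj"]))
print(len(all_records))
```

In real use, the stand-in would be replaced by a coroutine that awaits an actual loader for each path.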
## Examples
For more detailed usage examples, please refer to the demo scripts in the `examples/` directory.
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
Raw data
{
"_id": null,
"home_page": "https://github.com/r74tech/cdxj-util",
"name": "cdxj-util",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "cdxj openwayback wayback",
"author": "r74tech",
"author_email": "r74tech@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/db/1c/48864bb2649fda27b74eeaa5f1abb732baeae4cebb27477c3a4645b2d596/cdxj_util-1.0.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A utility library for working with CDXJ files",
"version": "1.0.0",
"project_urls": {
"Homepage": "https://github.com/r74tech/cdxj-util"
},
"split_keywords": [
"cdxj",
"openwayback",
"wayback"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "25e88e2c69bd52e0c98884a365354d40926f6995bae37bb4a447823d5ad22934",
"md5": "24816bbc1f2bd0f368843c2f254f8db3",
"sha256": "644a045ed8c47f013a3689a1e76cf465d44923934b76d27c8f81afd6cc113246"
},
"downloads": -1,
"filename": "cdxj_util-1.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "24816bbc1f2bd0f368843c2f254f8db3",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 8491,
"upload_time": "2024-08-06T07:26:27",
"upload_time_iso_8601": "2024-08-06T07:26:27.241232Z",
"url": "https://files.pythonhosted.org/packages/25/e8/8e2c69bd52e0c98884a365354d40926f6995bae37bb4a447823d5ad22934/cdxj_util-1.0.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "db1c48864bb2649fda27b74eeaa5f1abb732baeae4cebb27477c3a4645b2d596",
"md5": "a58682afcea93f02149ce927f0900889",
"sha256": "018329539984e110129f4e26070a75d523228bc1ba207831dc1a99a4656f196a"
},
"downloads": -1,
"filename": "cdxj_util-1.0.0.tar.gz",
"has_sig": false,
"md5_digest": "a58682afcea93f02149ce927f0900889",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 7147,
"upload_time": "2024-08-06T07:26:28",
"upload_time_iso_8601": "2024-08-06T07:26:28.851446Z",
"url": "https://files.pythonhosted.org/packages/db/1c/48864bb2649fda27b74eeaa5f1abb732baeae4cebb27477c3a4645b2d596/cdxj_util-1.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-06 07:26:28",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "r74tech",
"github_project": "cdxj-util",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "cdxj-util"
}