cdxj-util


- Name: cdxj-util
- Version: 1.0.0
- Home page: https://github.com/r74tech/cdxj-util
- Summary: A utility library for working with CDXJ files
- Upload time: 2024-08-06 07:26:28
- Maintainer: None
- Docs URL: None
- Author: r74tech
- Requires Python: >=3.7
- License: None
- Keywords: cdxj, openwayback, wayback
- Requirements: No requirements were recorded.

# cdxj_util

cdxj_util is a Python library for efficiently processing CDXJ files, a web archive index format in which each line pairs a CDX-style URL key and timestamp with a JSON object. It provides functionality for loading, searching, and analyzing large CDXJ files.
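
For orientation, a CDXJ line as written by tools such as pywb typically consists of a SURT-style URL key, a 14-digit timestamp, and a JSON object, separated by single spaces. The sketch below parses one such line with the standard library only. The helper `parse_cdxj_line` and the sample line are illustrative assumptions and are not part of cdxj_util's API.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (url_key, timestamp, fields).

    Assumes the common `key timestamp {json}` layout; hypothetical helper,
    not part of cdxj_util.
    """
    # Split on the first two spaces only, so spaces inside the JSON survive.
    url_key, timestamp, json_block = line.strip().split(" ", 2)
    return url_key, timestamp, json.loads(json_block)


# Made-up sample line for illustration.
sample = 'com,example)/ 20240806072628 {"url": "http://example.com/", "mime": "text/html", "status": "200"}'
url_key, timestamp, fields = parse_cdxj_line(sample)
print(url_key, timestamp, fields["url"])
```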

## Features

- Asynchronous and synchronous loading of CDXJ files
- URL-based searching (exact and partial matching)
- Filtering by timestamp range
- Bulk searching of multiple URLs
- Generation of CDXJ file statistics (total records, unique URLs, subdomain distribution, MIME type distribution, etc.)
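
The timestamp filtering and bulk URL search listed above can be pictured with plain Python. This is a minimal sketch, assuming each record is a dict with a `url` and a 14-digit `timestamp` string. The record layout and the helper names `filter_by_timestamp` and `bulk_lookup` are hypothetical and do not reflect cdxj_util's actual API.

```python
from collections import defaultdict

# Made-up records; the dict layout is an assumption for illustration.
records = [
    {"url": "http://example.com/", "timestamp": "20230101000000"},
    {"url": "http://example.com/", "timestamp": "20240615120000"},
    {"url": "http://example.org/a", "timestamp": "20240806072628"},
]


def filter_by_timestamp(records, start, end):
    """Keep records whose timestamp lies in the inclusive range [start, end].

    14-digit timestamps (YYYYMMDDhhmmss) sort lexicographically in
    chronological order, so plain string comparison is enough.
    """
    return [r for r in records if start <= r["timestamp"] <= end]


def bulk_lookup(records, urls):
    """Return {url: [matching records]} for several URLs in a single pass."""
    wanted = set(urls)
    hits = defaultdict(list)
    for r in records:
        if r["url"] in wanted:
            hits[r["url"]].append(r)
    return dict(hits)


in_2024 = filter_by_timestamp(records, "20240101000000", "20241231235959")
by_url = bulk_lookup(records, ["http://example.com/", "http://missing.example/"])
print(len(in_2024), sorted(by_url))
```

Building a set of wanted URLs first keeps the bulk lookup to one scan of the records, instead of one scan per URL.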

## Installation

```bash
pip install cdxj_util
```

## Usage

### Loading a CDXJ file

```python
from cdxj_util.core import CDXJCore

core = CDXJCore("path/to/your.cdxj")
records = core.load_all_records()
```

### Searching URLs

```python
from cdxj_util.search import CDXJSearch

search = CDXJSearch(records)
results = search.search_by_url("http://example.com/", exact_match=True)
```

### Generating statistics

```python
from cdxj_util.stats import CDXJStats

stats = CDXJStats(records)
total_records = stats.total_records()
unique_urls = stats.unique_urls()
mime_distribution = stats.mime_type_distribution()
```
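
To make the kind of numbers these methods report more concrete, here is a standard-library sketch that computes comparable statistics over a list of dicts. The record fields (`url`, `mime`) and the sample data are assumptions for illustration; the return types of the `CDXJStats` methods are not documented here and may differ.

```python
from collections import Counter
from urllib.parse import urlsplit

# Made-up records; the dict layout is an assumption for illustration.
sample_records = [
    {"url": "http://example.com/", "mime": "text/html"},
    {"url": "http://example.com/", "mime": "text/html"},
    {"url": "http://img.example.com/logo.png", "mime": "image/png"},
]

total = len(sample_records)                          # total records
unique = len({r["url"] for r in sample_records})     # distinct URLs
mime_counts = Counter(r["mime"] for r in sample_records)
# Counting by hostname gives a rough subdomain distribution.
host_counts = Counter(urlsplit(r["url"]).hostname for r in sample_records)

print(total, unique, mime_counts.most_common(), host_counts.most_common())
```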

## Asynchronous Support

cdxj_util also supports asynchronous processing, which is particularly useful for handling large CDXJ files:

```python
import asyncio
from cdxj_util.async_core import AsyncCDXJCore

async def process_cdxj():
    async_core = AsyncCDXJCore("path/to/your.cdxj")
    records = await async_core.load_all_records()
    # Further processing...

asyncio.run(process_cdxj())
```
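
How `AsyncCDXJCore` reads files internally is not described here. As a general illustration of why asynchronous loading helps, the sketch below offloads a blocking file read to a thread pool and processes several index files concurrently. All names in it (`count_lines`, `count_lines_async`, `main`, and the temporary file names) are hypothetical and independent of cdxj_util.

```python
import asyncio
import os
import tempfile


def count_lines(path):
    """Blocking helper: count the non-empty lines in a file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


async def count_lines_async(path):
    # Run the blocking read in the default thread pool so the event loop stays free.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, count_lines, path)


async def main(paths):
    # gather() runs the reads concurrently and returns results in input order.
    return await asyncio.gather(*(count_lines_async(p) for p in paths))


# Write two small stand-in index files, then count their lines concurrently.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, n in (("a.cdxj", 2), ("b.cdxj", 3)):
        path = os.path.join(tmp, name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("com,example)/ 20240806072628 {}\n" * n)
        paths.append(path)
    counts = asyncio.run(main(paths))

print(counts)
```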

## Examples

For more detailed usage examples, please refer to the demo scripts in the `examples/` directory.

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/r74tech/cdxj-util",
    "name": "cdxj-util",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "cdxj openwayback wayback",
    "author": "r74tech",
    "author_email": "r74tech@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/db/1c/48864bb2649fda27b74eeaa5f1abb732baeae4cebb27477c3a4645b2d596/cdxj_util-1.0.0.tar.gz",
    "platform": null,
    "description": "# cdxj_util\n\ncdxj_util is a Python library for efficiently processing CDXJ (Compressed DeduplicateD Web Archive Index JSON) files. This library provides functionality for loading, searching, and analyzing large CDXJ files.\n\n## Features\n\n- Asynchronous and synchronous loading of CDXJ files\n- URL-based searching (exact and partial matching)\n- Filtering by timestamp range\n- Bulk searching of multiple URLs\n- Generation of CDXJ file statistics (total records, unique URLs, subdomain distribution, MIME type distribution, etc.)\n\n## Installation\n\n```bash\npip install cdxj_util\n```\n\n## Usage\n\n### Loading a CDXJ file\n\n```python\nfrom cdxj_util.core import CDXJCore\n\ncore = CDXJCore(\"path/to/your.cdxj\")\nrecords = core.load_all_records()\n```\n\n### Searching URLs\n\n```python\nfrom cdxj_util.search import CDXJSearch\n\nsearch = CDXJSearch(records)\nresults = search.search_by_url(\"http://example.com/\", exact_match=True)\n```\n\n### Generating statistics\n\n```python\nfrom cdxj_util.stats import CDXJStats\n\nstats = CDXJStats(records)\ntotal_records = stats.total_records()\nunique_urls = stats.unique_urls()\nmime_distribution = stats.mime_type_distribution()\n```\n\n## Asynchronous Support\n\ncdxj_util also supports asynchronous processing, which is particularly useful for handling large CDXJ files:\n\n```python\nimport asyncio\nfrom cdxj_util.async_core import AsyncCDXJCore\n\nasync def process_cdxj():\n    async_core = AsyncCDXJCore(\"path/to/your.cdxj\")\n    records = await async_core.load_all_records()\n    # Further processing...\n\nasyncio.run(process_cdxj())\n```\n\n## Examples\n\nFor more detailed usage examples, please refer to the demo scripts in the `examples/` directory.\n\n## License\n\nThis project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "A utility library for working with CDXJ files",
    "version": "1.0.0",
    "project_urls": {
        "Homepage": "https://github.com/r74tech/cdxj-util"
    },
    "split_keywords": [
        "cdxj",
        "openwayback",
        "wayback"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "25e88e2c69bd52e0c98884a365354d40926f6995bae37bb4a447823d5ad22934",
                "md5": "24816bbc1f2bd0f368843c2f254f8db3",
                "sha256": "644a045ed8c47f013a3689a1e76cf465d44923934b76d27c8f81afd6cc113246"
            },
            "downloads": -1,
            "filename": "cdxj_util-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "24816bbc1f2bd0f368843c2f254f8db3",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 8491,
            "upload_time": "2024-08-06T07:26:27",
            "upload_time_iso_8601": "2024-08-06T07:26:27.241232Z",
            "url": "https://files.pythonhosted.org/packages/25/e8/8e2c69bd52e0c98884a365354d40926f6995bae37bb4a447823d5ad22934/cdxj_util-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "db1c48864bb2649fda27b74eeaa5f1abb732baeae4cebb27477c3a4645b2d596",
                "md5": "a58682afcea93f02149ce927f0900889",
                "sha256": "018329539984e110129f4e26070a75d523228bc1ba207831dc1a99a4656f196a"
            },
            "downloads": -1,
            "filename": "cdxj_util-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "a58682afcea93f02149ce927f0900889",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 7147,
            "upload_time": "2024-08-06T07:26:28",
            "upload_time_iso_8601": "2024-08-06T07:26:28.851446Z",
            "url": "https://files.pythonhosted.org/packages/db/1c/48864bb2649fda27b74eeaa5f1abb732baeae4cebb27477c3a4645b2d596/cdxj_util-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-06 07:26:28",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "r74tech",
    "github_project": "cdxj-util",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "cdxj-util"
}
        