- Name: ciff-toolkit
- Version: 0.1.1
- Home page: https://opencode.it4i.eu/openwebsearcheu-public/ciff-toolkit/
- Summary: Toolkit for working with Common Index File Format (CIFF) files.
- Upload time: 2023-06-22 12:27:13
- Author: Gijs Hendriksen
- Requires Python: >=3.10,<4.0
- License: MIT
- Keywords: ciff

# CIFF Toolkit

This repository contains a Python toolkit for working with [Common Index File Format (CIFF)](https://github.com/osirrc/ciff/) files.

Specifically, it provides a `CiffReader` and a `CiffWriter` for reading and writing CIFF files. It also provides a handful of CLI tools, for example to merge CIFF files or dump their contents.

## Installation

To use the CIFF toolkit, install it from PyPI:

```bash
$ pip install ciff-toolkit
```

## Usage

### Reading

To read a CIFF file, you can use the `CiffReader` class. It returns the posting lists and documents as lazy generators, so operations that need to process large CIFF files do not need to load the entire index into memory.

The `CiffReader` can be used as a context manager, automatically opening files if a path is supplied as a `str` or `pathlib.Path`. 

```python
from ciff_toolkit.read import CiffReader

with CiffReader('./path/to/index.ciff') as reader:
    header = reader.read_header()

    for pl in reader.read_postings_lists():
        print(pl)

    for doc in reader.read_documents():
        print(doc)
```
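
Because the postings lists and documents arrive as generators, an index can be summarized in a single pass without holding it in memory. The sketch below relies only on the three reader methods shown above. `summarize_index` is a hypothetical helper, not part of the toolkit, and it assumes the file stores postings lists before documents, as in the example above.

```python
from typing import Any


def summarize_index(reader: Any) -> dict[str, Any]:
    """Count postings lists and documents in one streaming pass.

    `reader` is any object exposing read_header(), read_postings_lists()
    and read_documents(), such as an opened CiffReader.
    """
    header = reader.read_header()
    # sum() consumes each generator item by item, so nothing is accumulated.
    num_postings_lists = sum(1 for _ in reader.read_postings_lists())
    num_documents = sum(1 for _ in reader.read_documents())
    return {
        'header': header,
        'postings_lists': num_postings_lists,
        'documents': num_documents,
    }


# Usage with the toolkit (requires ciff-toolkit to be installed):
#
#     from ciff_toolkit.read import CiffReader
#     with CiffReader('./path/to/index.ciff') as reader:
#         print(summarize_index(reader))
```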

The `CiffReader` also accepts an iterable of bytes instead of a file path. This can be useful if, for instance, the index is in a remote location:

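```python
import requests
from ciff_toolkit.read import CiffReader

url = 'https://example.com/remote-index.ciff'

# Stream the response body in 1024-byte chunks instead of downloading it all at once.
with requests.get(url, stream=True) as response:
    response.raise_for_status()  # fail early on HTTP errors
    with CiffReader(response.iter_content(1024)) as reader:
        header = reader.read_header()
        ...
```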
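
Any iterable that yields `bytes` chunks should work the same way, not only an HTTP response. As a minimal sketch, the hypothetical helper `iter_chunks` below streams a local file in fixed-size chunks; the helper name and the default chunk size are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterator, Union


def iter_chunks(path: Union[str, Path], chunk_size: int = 1024) -> Iterator[bytes]:
    """Yield the contents of `path` as consecutive chunks of bytes."""
    with open(path, 'rb') as f:
        # iter() with a sentinel stops once read() returns b'' at end of file.
        yield from iter(lambda: f.read(chunk_size), b'')


# Usage with the toolkit (requires ciff-toolkit to be installed):
#
#     from ciff_toolkit.read import CiffReader
#     with CiffReader(iter_chunks('./path/to/index.ciff')) as reader:
#         header = reader.read_header()
```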

### Writing

The `CiffWriter` offers a similar context manager API:

```python
from ciff_toolkit.ciff_pb2 import Header, PostingsList, DocRecord
from ciff_toolkit.write import CiffWriter

header: Header = ...
postings_lists: list[PostingsList] = ...
doc_records: list[DocRecord] = ...

with CiffWriter('./path/to/index.ciff') as writer:
    writer.write_header(header)
    writer.write_postings_lists(postings_lists)
    writer.write_documents(doc_records)
```
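
Combining both classes, an index can be copied without materializing it. The sketch below uses only the reader and writer methods shown above. `copy_index` is a hypothetical helper, and it assumes that the writer methods accept any iterable (including the reader's generators) and that the input stores postings lists before documents.

```python
from typing import Any


def copy_index(reader: Any, writer: Any) -> None:
    """Stream header, postings lists and documents from reader to writer.

    `reader` and `writer` are expected to behave like an opened CiffReader
    and CiffWriter respectively.
    """
    writer.write_header(reader.read_header())
    # The generators are handed over directly; assuming the writer iterates
    # over its argument once, messages are read only as they are written.
    writer.write_postings_lists(reader.read_postings_lists())
    writer.write_documents(reader.read_documents())


# Usage with the toolkit (requires ciff-toolkit to be installed):
#
#     from ciff_toolkit.read import CiffReader
#     from ciff_toolkit.write import CiffWriter
#     with CiffReader('./in.ciff') as reader, CiffWriter('./out.ciff') as writer:
#         copy_index(reader, writer)
```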

### Command Line Interface

The following CLI commands are supported:

- `ciff_dump INPUT`

  Dumps the contents of a CIFF file so that they can be inspected.
- `ciff_merge INPUT... OUTPUT`

  Merges two or more CIFF files into a single CIFF file. Ensures documents and terms are ordered correctly, and reads and writes in a streaming manner (i.e. without loading all data into memory at once).

  Note: `ciff_merge` requires that the `DocRecord` messages occur before the `PostingsList` messages in the CIFF file, as it needs to remap the internal document identifiers before merging the posting lists. See `ciff_swap` below for more information on how to achieve that. 
- `ciff_swap --input-order [hpd|hdp] INPUT OUTPUT`

  Swaps the `PostingsList` and `DocRecord` messages in a CIFF file (e.g. in order to prepare for merging). The `--input-order` argument specifies the current format of the CIFF file: `hpd` for header - posting lists - documents, and `hdp` for header - documents - posting lists.
- `ciff_zero_index INPUT OUTPUT`

  Takes a CIFF file with 1-indexed documents and rewrites it with 0-indexed documents.
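
To merge files that are still in header - posting lists - documents order, each input first goes through `ciff_swap`, and the swapped files are then passed to `ciff_merge`. The sketch below builds these command lines using only the commands and the `--input-order` flag described above. The helper names, the `.hdp.ciff` suffix for intermediate files, and the assumption that every input is in `hpd` order are illustrative choices, not part of the toolkit.

```python
import subprocess
from pathlib import Path


def build_merge_commands(inputs: list[str], output: str) -> list[list[str]]:
    """Return the argv lists for swapping each input and merging the results."""
    commands = []
    swapped_files = []
    for path in inputs:
        # Hypothetical naming scheme for the intermediate, swapped files.
        swapped = str(Path(path).with_suffix('.hdp.ciff'))
        commands.append(['ciff_swap', '--input-order', 'hpd', path, swapped])
        swapped_files.append(swapped)
    # ciff_merge takes all inputs first and the output file last.
    commands.append(['ciff_merge', *swapped_files, output])
    return commands


def run_merge(inputs: list[str], output: str) -> None:
    """Run the commands in order, stopping at the first failure."""
    for argv in build_merge_commands(inputs, output):
        subprocess.run(argv, check=True)
```

Keeping command construction separate from execution makes it possible to inspect or log the commands before anything is run.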

## Development

This project uses [Poetry](https://python-poetry.org/) to manage dependencies, configure the project and publish it to PyPI.

To get started, use Poetry to install all dependencies:

```bash
$ poetry install
```

Then, either activate the virtual environment so that all Python code runs inside it, or prefix every command with `poetry run`.

```bash
$ poetry shell
(venv) $ ciff_dump index.ciff
```

or:

```bash
$ poetry run ciff_dump index.ciff
```

            
