tika-client


- Name: tika-client
- Version: 0.8.1
- Summary: A modern REST client for Apache Tika server
- Upload time: 2024-12-17 19:34:26
- Requires Python: >=3.9
- Keywords: api, client, html, office, pdf, tika

# Tika Rest Client

[![PyPI - Version](https://img.shields.io/pypi/v/tika-client.svg)](https://pypi.org/project/tika-client)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/tika-client.svg)](https://pypi.org/project/tika-client)
[![codecov](https://codecov.io/github/stumpylog/tika-client/branch/main/graph/badge.svg?token=PTESS6YUK5)](https://codecov.io/github/stumpylog/tika-client)

---

## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Why](#why)
- [License](#license)

## Features

- Simplified: No need to worry about XML or JSON responses, downloading a Tika jar file or Python 2
- Support for Tika 2+ only (including Tika v3, which didn't change the API)
- Based on the modern [httpx](https://github.com/encode/httpx) library
- Full support for type hinting
- Nearly full test coverage run against an actual Tika server for multiple Python and PyPy versions
- Uses HTTP multipart/form-data to stream files to the server (instead of reading into memory)
- Optional compression when parsing file content that is already held in a buffer (as opposed to a file on disk)

## Installation

```console
pip3 install tika-client
```

## Usage

```python3
from pathlib import Path
from tika_client import TikaClient

test_file = Path("sample.docx")


with TikaClient("http://localhost:9998") as client:

    # Extract a document's metadata
    metadata = client.metadata.from_file(test_file)

    # Get the content of a document as HTML
    data = client.tika.as_html.from_file(test_file)

    # Or as plain text
    text = client.tika.as_text.from_file(test_file)

    # Content and metadata combined
    data = client.rmeta.as_text.from_file(test_file)

    # The mime type can also be given
    # This allows Content-Type to be set most accurately
    text = client.tika.as_text.from_file(test_file,
                                         "application/vnd.openxmlformats-officedocument.wordprocessingml.document")

```

The Tika REST API documentation can be found [here](https://cwiki.apache.org/confluence/display/TIKA/TikaServer).
At the moment, only the metadata, tika and recursive metadata endpoints are implemented.
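
When the mime type is not known ahead of time, it can often be derived from the file name before it is handed to `from_file` as in the last call above. The following is a minimal sketch using only the standard library; the helper name `guess_mime_type` is hypothetical and not part of `tika-client`, and the explicit `.docx` registration is there because not every platform's mime database is assumed to include it.

```python
import mimetypes
from pathlib import Path
from typing import Optional

# Register the .docx type explicitly, since the default table may lack it
# on some systems (an assumption made for portability of this sketch).
mimetypes.add_type(
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".docx",
)


def guess_mime_type(path: Path) -> Optional[str]:
    """Return a mime type guessed from the file extension, or None if unknown."""
    mime_type, _encoding = mimetypes.guess_type(path.name)
    return mime_type


if __name__ == "__main__":
    for name in ("sample.docx", "report.pdf", "notes.unknownext"):
        print(name, "->", guess_mime_type(Path(name)))
```

If the helper returns `None`, the mime type argument can simply be left out, as in the earlier calls.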

Unfortunately, the set of possible return values of the Tika API is not very well documented. The library makes
a best effort to extract relevant fields into typed properties where it understands more about the mime type
of the document (as returned by Tika). This includes values such as creation and modification times, exposed as
time-zone-aware `datetime` objects. The full JSON response is always available to the user under the `.data`
attribute.

When a particular key is not present in the response, the corresponding property returns `None` instead.
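
The pattern described above (typed properties layered over the raw response, with `None` for missing keys) can be sketched roughly as follows. This is an illustration only, not the library's actual implementation: the class name `DocumentMetadataSketch` is hypothetical, and the keys `Content-Type` and `dcterms:created` are assumed examples of what a Tika response may contain.

```python
from datetime import datetime
from typing import Any, Dict, Optional


class DocumentMetadataSketch:
    """Hypothetical wrapper around a parsed Tika JSON response."""

    def __init__(self, data: Dict[str, Any]) -> None:
        # The full parsed response stays available, mirroring `.data`
        self.data = data

    @property
    def content_type(self) -> Optional[str]:
        # Missing key -> None rather than a KeyError
        return self.data.get("Content-Type")

    @property
    def created(self) -> Optional[datetime]:
        raw = self.data.get("dcterms:created")
        if raw is None:
            return None
        # Python 3.9's fromisoformat() does not accept a trailing "Z",
        # so normalise it to an explicit UTC offset first.
        return datetime.fromisoformat(raw.replace("Z", "+00:00"))


if __name__ == "__main__":
    doc = DocumentMetadataSketch(
        {"Content-Type": "application/pdf", "dcterms:created": "2023-05-17T16:30:00Z"}
    )
    print(doc.content_type, doc.created)
    print(DocumentMetadataSketch({}).created)
```

Keeping the raw dictionary alongside the properties means a caller can still reach any field the typed layer does not know about.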

## Why

I know of only one other library for interfacing with Tika. I find it too complicated, because it tries to handle
many differing use cases.

The biggest issue I have with the library is its downloading and running of a jar file if needed. To me, an
API client should only interface to the API and not try to provide functionality to start
the API as well. The user is responsible for providing the server with the Tika version they desire.

The library also provides a lot of knobs to turn, but I argue most developers will not want to configure XML as
the response type. They just want the data, already parsed to the maximum extent possible.

This library attempts to provide a simpler interface, minimal lines of code and typing of the parsed response.

## License

`tika-client` is distributed under the terms of the [Mozilla Public License 2.0](https://spdx.org/licenses/MPL-2.0.html) license.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "tika-client",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "api, client, html, office, pdf, tika",
    "author": null,
    "author_email": "Trenton H <rda0128ou@mozmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/eb/e7/30706d16a66280a8d2a83f737b9a172c585ea824b8dc4a65ae63ed5fc3e2/tika_client-0.8.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A modern REST client for Apache Tika server",
    "version": "0.8.1",
    "project_urls": {
        "Changelog": "https://github.com/stumpylog/tika-rest-client/blob/main/CHANGELOG.md",
        "Documentation": "https://github.com/stumpylog/tika-rest-client#readme",
        "Issues": "https://github.com/stumpylog/tika-rest-client/issues",
        "Source": "https://github.com/stumpylog/tika-rest-client"
    },
    "split_keywords": [
        "api",
        " client",
        " html",
        " office",
        " pdf",
        " tika"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a47938bd9397e67eeafdda296216669d601bfca7cbda135e761bf3ee2efedb99",
                "md5": "7b43aebc06c0084973677cf0bc2c467a",
                "sha256": "7562afde0629134e9d8ec48fc55d83c44d0487f9382db30623f3bbc5aa60394a"
            },
            "downloads": -1,
            "filename": "tika_client-0.8.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7b43aebc06c0084973677cf0bc2c467a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 16932,
            "upload_time": "2024-12-17T19:34:24",
            "upload_time_iso_8601": "2024-12-17T19:34:24.149456Z",
            "url": "https://files.pythonhosted.org/packages/a4/79/38bd9397e67eeafdda296216669d601bfca7cbda135e761bf3ee2efedb99/tika_client-0.8.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ebe730706d16a66280a8d2a83f737b9a172c585ea824b8dc4a65ae63ed5fc3e2",
                "md5": "d99cc51ba56af96ff52a6f587da07500",
                "sha256": "c2cdd437dd01f37ca354a6f899ff35df9e333a3f3e7570638a47d71a026ff173"
            },
            "downloads": -1,
            "filename": "tika_client-0.8.1.tar.gz",
            "has_sig": false,
            "md5_digest": "d99cc51ba56af96ff52a6f587da07500",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 2172396,
            "upload_time": "2024-12-17T19:34:26",
            "upload_time_iso_8601": "2024-12-17T19:34:26.845489Z",
            "url": "https://files.pythonhosted.org/packages/eb/e7/30706d16a66280a8d2a83f737b9a172c585ea824b8dc4a65ae63ed5fc3e2/tika_client-0.8.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-17 19:34:26",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "stumpylog",
    "github_project": "tika-rest-client",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "tika-client"
}