# Tikara
<img src="https://raw.githubusercontent.com/baughmann/tikara/refs/heads/master/tikara_logo.svg" width="100" alt="Tikara Logo" />
![Coverage](https://img.shields.io/badge/dynamic/xml?url=https://raw.githubusercontent.com/baughmann/tikara/refs/heads/master/coverage.xml&query=/coverage/@line-rate%20*%20100&suffix=%25&color=brightgreen&label=coverage) ![Tests](https://img.shields.io/badge/dynamic/xml?url=https://raw.githubusercontent.com/baughmann/tikara/refs/heads/master/junit.xml&query=/testsuites/testsuite/@tests&label=tests&color=green) ![PyPI](https://img.shields.io/pypi/v/tikara) ![GitHub License](https://img.shields.io/github/license/baughmann/tikara) ![PyPI - Downloads](https://img.shields.io/pypi/dm/tikara) ![GitHub issues](https://img.shields.io/github/issues/baughmann/tikara) ![GitHub pull requests](https://img.shields.io/github/issues-pr/baughmann/tikara) ![GitHub stars](https://img.shields.io/github/stars/baughmann/tikara?style=social)
## 🚀 Overview
Tikara is a modern, type-hinted Python wrapper for Apache Tika, supporting over 1600 file formats for content extraction, metadata analysis, and language detection. It provides direct JNI integration through JPype for optimal performance.
```python
from tikara import Tika
tika = Tika()
content, metadata = tika.parse("document.pdf")
```
## ⚡️ Key Features
- Modern Python 3.12+ with complete type hints
- Direct JVM integration via JPype (no HTTP server required)
- Streaming support for large files
- Recursive document unpacking
- Language detection
- MIME type detection
- Custom parser and detector support
- Comprehensive metadata extraction
- Ships with embedded Tika JAR: works in air-gapped networks. No need to manage libraries.
## 📦 Supported Formats
🌈 **1682 supported media types and counting!**
- [See the full list →](https://github.com/baughmann/tikara/tree/master/SUPPORTED_MIME_TYPES.md)
- [Tika parsers list ⇗](https://tika.apache.org/1.21/formats.html#Supported_Document_Formats)
## 🛠️ Installation
```bash
pip install tikara
```
### System Dependencies
#### Required Dependencies
- Python 3.12+
- Java Development Kit 11+ (OpenJDK recommended)
#### Optional Dependencies
##### Image and PDF OCR Enhancements _(recommended)_
- **Tesseract OCR** (strongly recommended if you process images) ([Reference ⇗](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454096#TikaOCR-InstallingTesseractonUbuntu))
```bash
# Ubuntu
apt-get install tesseract-ocr
```
Additional language packs for Tesseract (optional):
```bash
# Ubuntu
apt-get install tesseract-ocr-deu tesseract-ocr-fra tesseract-ocr-ita tesseract-ocr-spa
```
- **ImageMagick** for advanced image processing ([Reference ⇗](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454096#TikaOCR-InstallImageMagick))
```bash
# Ubuntu
apt-get install imagemagick
```
##### Multimedia Enhancements _(recommended)_
- **FFMPEG** for enhanced multimedia file support ([Reference ⇗](https://cwiki.apache.org/confluence/display/TIKA/FFMPEGParser))
```bash
# Ubuntu
apt-get install ffmpeg
```
##### Enhanced PDF Support _(recommended)_
- [**PDFBox** ⇗](https://pdfbox.apache.org/2.0/dependencies.html#optional-components) for enhanced PDF support ([Reference ⇗](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066))
```bash
# Ubuntu
apt-get install pdfbox
```
##### Metadata Enhancements _(recommended)_
- **EXIFTool** for metadata extraction from images ([Reference ⇗](https://cwiki.apache.org/confluence/display/TIKA/EXIFToolParser))
```bash
# Ubuntu
apt-get install libimage-exiftool-perl
```
##### Geospatial Enhancements
- **GDAL** for geospatial file support ([Reference ⇗](https://tika.apache.org/1.18/api/org/apache/tika/parser/gdal/GDALParser))
```bash
# Ubuntu
apt-get install gdal-bin
```
##### Additional Font Support _(recommended)_
- **MSCore Fonts** for enhanced Office file handling ([Reference ⇗](https://github.com/apache/tika-docker/blob/main/full/Dockerfile))
```bash
# Ubuntu
apt-get install xfonts-utils fonts-freefont-ttf fonts-liberation ttf-mscorefonts-installer
```
For more OS dependency information including MSCore fonts setup and additional configuration, see the [official Apache Tika Dockerfile](https://github.com/apache/tika-docker/blob/main/full/Dockerfile).
## 📖 Usage
[Example Jupyter Notebooks](https://github.com/baughmann/tikara/tree/master/examples) 📔
### Basic Content Extraction
```python
from tikara import Tika
from pathlib import Path
tika = Tika()
# Basic string output
content, metadata = tika.parse("document.pdf")
# Stream large files
stream, metadata = tika.parse(
    "large.pdf",
    output_stream=True,
    output_format="txt",
)
# Save to file
output_path, metadata = tika.parse(
    "input.docx",
    output_file=Path("output.txt"),
    output_format="txt",
)
```
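If `output_stream=True` returns a binary file-like object with a `read()` method (an assumption here; check the API documentation for the exact type), the content can be consumed in fixed-size chunks so that the whole document never sits in memory at once. A minimal sketch; `iter_chunks` and `copy_stream` are hypothetical helpers, not part of Tikara:

```python
from collections.abc import Iterator
from typing import BinaryIO


def iter_chunks(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield successive chunks from a binary file-like object until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        yield chunk


def copy_stream(stream: BinaryIO, destination: BinaryIO, chunk_size: int = 64 * 1024) -> int:
    """Copy a stream to a destination chunk by chunk and return the bytes written."""
    total = 0
    for chunk in iter_chunks(stream, chunk_size):
        destination.write(chunk)
        total += len(chunk)
    return total
```

With the assumption above, `copy_stream(stream, open("large.txt", "wb"))` would write the extracted text to disk without buffering it all in Python.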
### Language Detection
```python
from tikara import Tika
tika = Tika()
result = tika.detect_language("El rápido zorro marrón salta sobre el perro perezoso")
print(f"Language: {result.language}, Confidence: {result.confidence}")
```
### MIME Type Detection
```python
from tikara import Tika
tika = Tika()
mime_type = tika.detect_mime_type("unknown_file")
print(f"Detected type: {mime_type}")
```
### Recursive Document Unpacking
```python
from tikara import Tika
from pathlib import Path
tika = Tika()
results = tika.unpack(
    "container.docx",
    output_dir=Path("extracted"),
    max_depth=3,
)
for item in results:
    print(f"Extracted {item.metadata['Content-Type']} to {item.file_path}")
```
## 🔧 Development
### Environment Setup
1. Ensure that you have the [system dependencies](#system-dependencies) installed
2. Install uv:
```bash
pip install uv
```
3. Install the Python dependencies and create the virtual environment: `uv sync`
### Common Tasks
```bash
make ruff # Format and lint code
make test # Run test suite
make docs # Generate documentation
make stubs # Generate Java stubs
make prepush # Run all checks (ruff, test, coverage, safety)
```
## 🤔 When to Use Tikara
### Ideal Use Cases
- Python applications needing document processing
- Microservices and containerized environments
- Data processing pipelines ([Ray](https://ray.io), [Dask](https://dask.org), [Prefect](https://prefect.io))
- Applications requiring direct Tika integration without HTTP overhead
### Advanced Usage
For detailed documentation on:
- Custom parser implementation
- Custom detector creation
- MIME type handling
See the [Example Jupyter Notebooks](https://github.com/baughmann/tikara/tree/master/examples) 📔
## 🎯 Inspiration
Tikara stands on the shoulders of giants:
- [Apache Tika](https://tika.apache.org/) - The powerful content detection and extraction toolkit
- [tika-python](https://github.com/chrismattmann/tika-python) - The original Python Tika wrapper using HTTP that inspired this project
- [JPype](https://jpype.readthedocs.io/) - The bridge between Python and Java
### Considerations
- Process isolation: Tika crashes will affect the host application
- Memory management: Large documents require careful handling
- JVM startup: Initial overhead for first operation
- Custom implementations: Parser/detector development requires Java interface knowledge
## 📊 Performance Considerations
### Memory Management
- Use streaming for large files
- Monitor JVM heap usage
- Consider process isolation for critical applications
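One way to get process isolation is to run each parse in a short-lived child interpreter, so that a JVM crash or runaway parse cannot take down the host application. The sketch below is only an illustration: `run_isolated` and `CHILD_SCRIPT` are hypothetical names, and the child script assumes the `Tika().parse(path)` call shown in the usage section.

```python
import subprocess
import sys

# Hypothetical child script: parse one file with Tikara and write the text to stdout.
CHILD_SCRIPT = """
import sys
from tikara import Tika
sys.stdout.reconfigure(encoding="utf-8")
content, _metadata = Tika().parse(sys.argv[1])
sys.stdout.write(content)
"""


def run_isolated(script: str, *args: str, timeout: float = 120.0) -> str:
    """Run a Python script in a separate interpreter and return its stdout.

    Raises RuntimeError if the child exits with a non-zero status and
    subprocess.TimeoutExpired if it runs longer than `timeout` seconds.
    """
    completed = subprocess.run(
        [sys.executable, "-c", script, *args],
        capture_output=True,
        text=True,
        encoding="utf-8",
        timeout=timeout,
        check=False,
    )
    if completed.returncode != 0:
        raise RuntimeError(
            f"worker exited with {completed.returncode}: {completed.stderr.strip()}"
        )
    return completed.stdout
```

Usage would look like `text = run_isolated(CHILD_SCRIPT, "document.pdf")`. The trade-off is that every child pays the JVM startup cost again, so this fits critical or untrusted inputs better than high-throughput batches.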
### Optimization Tips
- Reuse Tika instances
- Use appropriate output formats
- Implement custom parsers for specific needs
- Configure JVM parameters for your use case
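Two of these tips can be sketched together: setting JVM flags before the JVM starts, and creating the `Tika` instance once and reusing it. `JAVA_TOOL_OPTIONS` is an environment variable that HotSpot-based JVMs read at startup; whether Tikara exposes its own way to pass JVM arguments is not covered here, so treat this as an assumption to verify. `add_jvm_option` and `get_tika` are hypothetical helpers.

```python
import os
from typing import Any, Callable

_tika: Any = None  # process-wide instance, created on first use


def add_jvm_option(option: str) -> str:
    """Append a flag to JAVA_TOOL_OPTIONS; only effective before the JVM starts."""
    current = os.environ.get("JAVA_TOOL_OPTIONS", "")
    if option not in current.split():
        current = f"{current} {option}".strip()
        os.environ["JAVA_TOOL_OPTIONS"] = current
    return current


def _default_factory() -> Any:
    # Deferred import so that environment variables are set before Tikara loads.
    from tikara import Tika

    return Tika()


def get_tika(factory: Callable[[], Any] = _default_factory) -> Any:
    """Return the shared instance, calling `factory` only on the first call."""
    global _tika
    if _tika is None:
        _tika = factory()
    return _tika
```

A typical call order would be `add_jvm_option("-Xmx2g")` once at startup, then `get_tika().parse(...)` wherever parsing is needed, so the JVM startup overhead is paid only once per process.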
## 🔐 Security Considerations
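- Validate inputs (path, size, and type) before parsing
- Set resource limits (memory and time) for parsing jobs
- Handle files securely, including temporary and extracted files
- Control access to extracted content
- Review custom parsers carefully, since they run inside the application's process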
## 🤝 Contributing
Contributions welcome! The project uses Make for development tasks:
```bash
make prepush # Run all checks (format, lint, test, coverage, safety)
```
For developing custom parsers/detectors, Java stubs can be generated:
```bash
make stubs # Generate Java stubs for Apache Tika interfaces
```
Note: Generated stubs are git-ignored but provide IDE support and type hints when implementing custom parsers/detectors.
## Common Problems
- Verify Java installation and `JAVA_HOME` environment variable
- Ensure Tesseract and required language packs are installed
- Check file permissions and paths
- Monitor memory usage when processing large files
- Use streaming output for large documents
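The first two checks can be scripted. The sketch below only inspects `JAVA_HOME` and looks for `java` and `tesseract` on `PATH`; the JVM discovery that JPype performs may use other locations as well, so a missing entry is a hint rather than proof of a broken setup. `check_environment` is a hypothetical helper.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def check_environment() -> dict[str, Optional[str]]:
    """Report where Java and Tesseract were found, or None when they were not."""
    java_home = os.environ.get("JAVA_HOME")
    return {
        # Only report JAVA_HOME when it points at an existing directory.
        "JAVA_HOME": java_home if java_home and Path(java_home).is_dir() else None,
        "java": shutil.which("java"),
        "tesseract": shutil.which("tesseract"),
    }


if __name__ == "__main__":
    for name, location in check_environment().items():
        print(f"{name}: {location or 'NOT FOUND'}")
```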
## 📚 Reference
See [API Documentation](https://baughmann.github.io/tikara/autoapi/tikara/index.html#tikara.Tika) for complete details.
## 📄 License
Apache License 2.0 - See [LICENSE](https://raw.githubusercontent.com/baughmann/tikara/refs/heads/master/LICENSE.txt) for details.