parsethisio

Name	parsethisio
Version	0.2.3
home_page	None
Summary	A Python library to extract text from various sources for LLM preprocessing.
upload_time	2025-07-24 03:29:15
maintainer	None
docs_url	None
author	None
requires_python	<4.0,>=3.10
license	GNU Affero General Public License v3.0
keywords	text extraction llm pdf web preprocessing
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# ParseThisIO

![Coverage](./coverage.svg)
![PyPI](https://img.shields.io/pypi/v/ParseThis)
![Build Status](https://img.shields.io/github/workflow/status/jdde/ParseThis/CI)
![License](https://img.shields.io/github/license/jdde/ParseThis)


**ParseThisIO** is a powerful and flexible tool with zero additional OS dependencies that makes raw data readable and structured for your AI and data processing workflows. Whether you're extracting information from PDFs, transforming files into Markdown, or preparing data for LLMs and RAG pipelines, **ParseThisIO** gets the job done quickly, effectively, and with a touch of magic.
Install it as a pip package and start using it; there are no third-party tools to configure first. Just parseThis.io.

Some parsers require API keys. You only need a key for the parsers you actually use; a parser that needs one will error at the point of use if no API key is found.

ParseThis aggregates multiple open-source projects to avoid re-implementing a file type mapping for content conversion.

## Table of Contents
- [Features](#features)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [ParserMatrix - Dependency overview](#parsermatrix)
- [Testing](#testing)
- [License](#license)

---

## Features
- Auto-detects file types (pdf, docx, csv, pptx, xlsx, xls, json, xml, zip, mp3, mp4 and more).
- Converts any file into readable Markdown or plain text.
- Extracts structured data for use in LLM and RAG pipelines.
- Simple API for seamless integration into your workflows.
- Just forward user input to ParseThis and get plain text or Markdown back.

The mapping of parsers to file types can be found in the [ParserMatrix](#parsermatrix).

```python
import parsethisio

# Get the list of supported file extensions
parsethisio.get_supported_extensions()
```
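
A minimal sketch of how that list could be used to pre-filter files before parsing. It assumes `get_supported_extensions()` returns an iterable of extension strings, with or without a leading dot; the return format is not specified here, and `filter_supported` is a hypothetical helper, not part of parsethisio.

```python
from pathlib import Path
from typing import Iterable


def filter_supported(paths: Iterable[str], extensions: Iterable[str]) -> list[str]:
    """Keep only the paths whose suffix appears in `extensions` (hypothetical helper)."""
    # Normalise to lowercase without a leading dot, because the exact format
    # returned by get_supported_extensions() is an assumption in this sketch.
    allowed = {ext.lower().lstrip(".") for ext in extensions}
    return [p for p in paths if Path(p).suffix.lower().lstrip(".") in allowed]
```

Under that assumption, `filter_supported(paths, parsethisio.get_supported_extensions())` would drop unsupported files before they reach `parse()`.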


---

## Prerequisites
Use Python 3.12 in a virtual environment. It is the highest version supported by PyO3, a dependency of scrapegraph-ai.
```sh
python3.12 -m venv myenv
source myenv/bin/activate
```
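
If you want to verify the interpreter from inside a script, the check could be sketched as below. The lower bound 3.10 comes from the package's `requires_python` metadata and the upper bound 3.12 from the recommendation above; `is_recommended_python` is a hypothetical helper, not part of parsethisio.

```python
import sys


def is_recommended_python(version: tuple = sys.version_info[:2]) -> bool:
    """Return True if (major, minor) lies between 3.10 and 3.12 inclusive."""
    # Tuples compare element by element, so (3, 9) < (3, 10) < (3, 12) < (3, 13).
    return (3, 10) <= tuple(version) <= (3, 12)
```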

---

## Installation

To install **ParseThisIO**, use pip:

```bash
pip install parsethisio
```

---

## Usage
Use the `parse()` function to auto-detect the type of the content. If auto-detection fails, you can provide more information to help detect the type.
It accepts any input: a file path, a URL string, or the file content as bytes.
```python
import parsethisio
from parsethisio import ResultFormat

# Extract an image description for an LLM
with open('tests/fixtures/test_data_diagram.png', 'rb') as f:
    image_description = parsethisio.parse(f.read(), result_format=ResultFormat.TXT)

# Get the transcript of an audio file
with open('tests/fixtures/test_data_ttsmaker-test-generated-file.mp3', 'rb') as f:
    audio_transcript = parsethisio.parse(f.read(), result_format=ResultFormat.TXT)
```

The generic `parse()` function automatically selects the parser based on the file content.

```python
import parsethisio
from parsethisio import ResultFormat

# Automatic parse based on a file path
parsed_pdf_text = parsethisio.parse('tests/fixtures/text_data_meeting_notes.pdf', result_format=ResultFormat.TXT)

# Automatic parse based on file content
with open('tests/fixtures/text_data_meeting_notes.pdf', 'rb') as f:
    parsed_pdf_text = parsethisio.parse(f.read(), result_format=ResultFormat.TXT)  # works with any bytes content

# Automatic parse based on a URL string
parsed_github_repository = parsethisio.parse('https://github.com/jdde/ParseThis', result_format=ResultFormat.TXT)

# Automatic parse based on a YouTube URL
transcribed_youtube_text = parsethisio.parse('https://www.youtube.com/watch?v=ca7QkcAGe', result_format=ResultFormat.TXT)
```
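
When forwarding many inputs at once, it can help to keep one failing input from aborting the rest. The following is a minimal sketch, not part of the library: `parse_many` is a hypothetical helper that takes the parse function as an argument, and it catches the generic `Exception` because the library's own exception types are not documented here.

```python
from typing import Callable, Iterable, Union

Source = Union[str, bytes]


def parse_many(
    sources: Iterable[Source],
    parse_fn: Callable[[Source], str],
) -> tuple[dict[int, str], dict[int, Exception]]:
    """Parse each source; return (results, errors), both keyed by input position."""
    results: dict[int, str] = {}
    errors: dict[int, Exception] = {}
    for index, source in enumerate(sources):
        try:
            results[index] = parse_fn(source)
        except Exception as exc:  # assumption: the specific exception types are unknown
            errors[index] = exc
    return results, errors
```

It could be called as `parse_many(inputs, lambda s: parsethisio.parse(s, result_format=ResultFormat.TXT))`, after which `errors` lists the positions that need attention.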

Use parser detection when you only want to find the matching parser, for example to configure it differently before it parses the content.
```python
import parsethisio

with open('tests/fixtures/text_data_meeting_notes.pdf', 'rb') as f:
    file_content = f.read()
    parser = parsethisio.get_parser(file_content)
    text = parser.parse(file_content)
```

Or just directly use a parser.
```python
from parsethisio import PDFParser

with open('tests/fixtures/text_data_meeting_notes.pdf', 'rb') as f:
    file_content = f.read()  # read the bytes before handing them to the parser
    text = PDFParser.parse(file_content)
```

For more usage examples, see the [tests](tests/test_automatic_parser_selection.py).

---

## ParserMatrix
Overview of dependencies used for specific parsing processes.

| File Type | Parser        | Dependency             | External Access Required |
|-----------|---------------|------------------------|--------------------------|
| PDF       | PDFParser     | PyPDF2, Markitdown     | ❌ |
| Image     | ImageParser   | OpenAI GPT             | ✅ env.OPENAI_API_KEY |
| Audio     | AudioParser   | OpenAI Whisper         | ✅ env.OPENAI_API_KEY |
| URL       | TextParser    | scrapegraphai          | ✅ env.OPENAI_API_KEY |
| YouTube   | TextParser    | youtube-transcript-api | ❌ |
| GitHub    | TextParser    | gitingest              | ❌ |
| DOCX      | OfficeParser  | Markitdown             | ❌ |
| PPTX      | OfficeParser  | Markitdown             | ❌ |
| XLSX/XLS  | OfficeParser  | Markitdown             | ❌ |
| CSV       | DataParser    | Markitdown             | ❌ |
| JSON      | DataParser    | Markitdown             | ❌ |
| XML       | DataParser    | Markitdown             | ❌ |
| ZIP       | ArchiveParser | Markitdown             | ❌ |
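
Because the image, audio and URL rows depend on `OPENAI_API_KEY`, an application may want to check for the key up front instead of waiting for the parser to fail. This is a minimal sketch based only on the table above; `missing_api_key` and the `KINDS_NEEDING_OPENAI_KEY` set are hypothetical names, not part of parsethisio.

```python
import os
from typing import Mapping, Optional

# File types that the ParserMatrix marks as requiring env.OPENAI_API_KEY.
KINDS_NEEDING_OPENAI_KEY = {"Image", "Audio", "URL"}


def missing_api_key(input_kind: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the missing environment variable, or None if nothing is missing."""
    if input_kind in KINDS_NEEDING_OPENAI_KEY and not env.get("OPENAI_API_KEY"):
        return "OPENAI_API_KEY"
    return None
```

Passing the environment as a parameter keeps the check easy to test without touching the real process environment.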


If you're working with the source code, you can install all dependencies using:

```bash
pip install .
```
For more information, see [how we install it in our GitHub Action](.github/workflows/coverage.yml).


## Testing
To run the tests:

```bash
coverage run -m pytest
# or, for a single test:
pytest -k test_text_parser_github_url
```


## License
This project is licensed under the GNU Affero General Public License v3.0 - see the [LICENSE](LICENSE) file for details.

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "parsethisio",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.10",
    "maintainer_email": null,
    "keywords": "text extraction, LLM, PDF, web, preprocessing",
    "author": null,
    "author_email": "J\u00f6rn Depenbrock <joern@jdde.de>",
    "download_url": "https://files.pythonhosted.org/packages/1e/eb/069b7bcb3033f8028353153b851a5397e400fd55c647264ce37b77df404d/parsethisio-0.2.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "GNU Affero General Public License v3.0",
    "summary": "A Python library to extract text from various sources for LLM preprocessing.",
    "version": "0.2.3",
    "project_urls": null,
    "split_keywords": [
        "text extraction",
        " llm",
        " pdf",
        " web",
        " preprocessing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "ca103770135a2c8b4c0ca516cc398efcbed3a7364f3b1ec7e31dfeb49c6d1a56",
                "md5": "3ac651a065bc8a1befccf7f468a34fb0",
                "sha256": "f08a9d48a9fb4ee1d2c419ccb60321bf69b647cc72bfae585b2ea2809fb7b55b"
            },
            "downloads": -1,
            "filename": "parsethisio-0.2.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3ac651a065bc8a1befccf7f468a34fb0",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.10",
            "size": 26802,
            "upload_time": "2025-07-24T03:29:14",
            "upload_time_iso_8601": "2025-07-24T03:29:14.464462Z",
            "url": "https://files.pythonhosted.org/packages/ca/10/3770135a2c8b4c0ca516cc398efcbed3a7364f3b1ec7e31dfeb49c6d1a56/parsethisio-0.2.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1eeb069b7bcb3033f8028353153b851a5397e400fd55c647264ce37b77df404d",
                "md5": "c5e581da2b3b265cdaf2189788ce00e2",
                "sha256": "3862b8c04f4a4e0ee36b4c4280288540f0741a15025c29b9f8ffb5fa41716fb9"
            },
            "downloads": -1,
            "filename": "parsethisio-0.2.3.tar.gz",
            "has_sig": false,
            "md5_digest": "c5e581da2b3b265cdaf2189788ce00e2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.10",
            "size": 30238,
            "upload_time": "2025-07-24T03:29:15",
            "upload_time_iso_8601": "2025-07-24T03:29:15.535563Z",
            "url": "https://files.pythonhosted.org/packages/1e/eb/069b7bcb3033f8028353153b851a5397e400fd55c647264ce37b77df404d/parsethisio-0.2.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-24 03:29:15",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "parsethisio"
}
