texta-parsers

| Field | Value |
|---|---|
| Name | texta-parsers |
| Version | 3.0.0 |
| Summary | texta-parsers |
| Upload time | 2024-09-23 10:22:08 |
| Author | TEXTA |
| License | GPLv3 |
| Requirements | No requirements were recorded. |

# texta-parsers

A Python package for file parsing.

The main class in the package is **DocParser**. Email parsing is implemented in a separate class, **EmailParser**, which DocParser includes. If you only need to parse emails, you can restrict DocParser to email file types with the `parse_only_extensions` parameter. **EmailParser** can also be used on its own, but attachments will not be parsed in that case.


## Requirements

***NB!*** Starting from version 3.0.0, only Elasticsearch 8 clusters are supported.


Most file types are parsed with **[tika](http://tika.apache.org/)**. The following additional tools are required:

| Tool | File Type |
|---|---|
| pst-utils | .pst  |
| digidoc-tool | .ddoc .bdoc .asics .asice |
| rar-nonfree  | .rar |
| lxml | XML HTML |

Installation of required packages on Ubuntu/Debian:

`sudo apt-get install pst-utils rar python3-lxml cmake build-essential -y`

`sudo sh install-digidoc.sh`
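
After installation, it can help to confirm that the external tools are actually on the `PATH`. The following is a minimal sketch using only the Python standard library. The executable names (`readpst` for pst-utils, `rar`, `digidoc-tool`) are assumptions about what the packages above install, and `find_missing_tools` is a hypothetical helper, not part of texta-parsers.

```python
import shutil

# Assumed executable names for the tools listed above; adjust for your system.
EXPECTED_TOOLS = {
    ".pst": "readpst",  # assumed to be provided by pst-utils
    ".rar": "rar",  # assumed to be provided by the rar package
    ".ddoc/.bdoc/.asics/.asice": "digidoc-tool",  # assumed name of the DigiDoc CLI
}


def find_missing_tools(tools=EXPECTED_TOOLS):
    """Return a sorted list of executables from `tools` that are not on PATH."""
    return sorted(name for name in tools.values() if shutil.which(name) is None)
```

Calling `find_missing_tools()` returns an empty list when every expected executable is found, and the names of the missing ones otherwise.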

The package requires our custom version of Apache Tika with the relevant Tesseract language packs installed. It can be started with Docker:

`sudo docker run -p 9998:9998 docker.texta.ee/texta/texta-parsers-python/tikaserver:latest`
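
To check that the Tika container is up before parsing, a small reachability probe can be useful. The sketch below assumes the server listens on `localhost:9998` (the port published by the command above) and that the custom image still exposes the `/version` endpoint of the stock Tika server. `tika_is_reachable` is a hypothetical helper, not part of texta-parsers.

```python
import http.client
import urllib.request

# Assumed address: matches the port published by the docker command above.
TIKA_URL = "http://localhost:9998"


def tika_is_reachable(base_url=TIKA_URL, timeout=5.0):
    """Return True if a GET on the server's /version endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/version", timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, HTTP error or malformed response.
        return False
```

`tika_is_reachable()` returns `False` instead of raising when nothing is listening, so it can be used as a simple guard at startup.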

## Installation

Base install (without MLP & Face Analyzer):

`pip install texta-parsers`

Install with MLP:

`pip install texta-parsers[mlp]`


Install with the whole bundle:

`pip install texta-parsers[mlp]`


## Testing

`python -m pytest -rx -v tests`


## Description

#### DocParser

A file parser. Input can be either bytes or a path to the file as a string. See the [user guide](https://git.texta.ee/texta/email-parser/-/wikis/DocParser/User-Guide/Getting-started) for more information. DocParser also includes EmailParser.

#### EmailParser

For parsing email messages and mailboxes. Supported file formats are Outlook Data File (**.pst**), mbox (**.mbox**) and EML (**.eml**). EmailParser can be used separately from DocParser, but attachments are then not parsed.
The user guide can be found [here](https://git.texta.ee/texta/email-parser/-/wikis/EmailParser/User-Guide/Getting-started) and the documentation [here](https://git.texta.ee/texta/email-parser/-/wikis/EmailParser/Documentation/1.2.1).
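
To illustrate what a single **.eml** message contains (headers, a body and optional attachments), here is a minimal sketch built on Python's standard `email` module. It does not use or reproduce EmailParser's API or implementation; `summarize_eml` and `build_sample_eml` are hypothetical names chosen for this example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def summarize_eml(raw: bytes) -> dict:
    """Parse one message and return its subject, plain-text body and attachment names."""
    message = message_from_bytes(raw, policy=policy.default)
    body_part = message.get_body(preferencelist=("plain",))
    return {
        "subject": str(message.get("subject", "")),
        "body": body_part.get_content() if body_part is not None else "",
        "attachments": [part.get_filename() for part in message.iter_attachments()],
    }


def build_sample_eml() -> bytes:
    """Build a small message with one attachment, used only for demonstration."""
    message = EmailMessage()
    message["From"] = "alice@example.com"
    message["To"] = "bob@example.com"
    message["Subject"] = "Report"
    message.set_content("See the attached file.")
    message.add_attachment(
        b"col1,col2\n1,2\n",
        maintype="application",
        subtype="octet-stream",
        filename="data.csv",
    )
    return message.as_bytes()
```

`summarize_eml(build_sample_eml())` yields the subject, the body text and the list of attachment file names. Note that this sketch only lists attachments by name and does not parse their contents.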
