| Field | Value |
|---|---|
| Name | texta-parsers |
| Version | 2.8.2 |
| Summary | texta-parsers |
| Upload time | 2023-11-08 12:23:53 |
| Author | TEXTA |
| License | GPLv3 |
# texta-parsers
A Python package for file parsing.
The main class in the package is **DocParser**. The package also parses emails, which is implemented in the class **EmailParser**. If you only need to parse emails, you can restrict parsing with the parameter `parse_only_extensions`. **EmailParser** can also be used independently, but then attachments will not be parsed.
## Requirements
Most of the file types are parsed with **[tika](http://tika.apache.org/)**. Other tools that are required:
| Tool | File Type |
|---|---|
| pst-utils | .pst |
| digidoc-tool | .ddoc .bdoc .asics .asice |
| rar-nonfree | .rar |
| lxml | XML HTML |
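The table above can be read as a lookup from file extension to the tool that handles it. The following is only a minimal sketch of that reading, not the package's own dispatch logic: the names `TOOL_BY_EXTENSION` and `tool_for` are hypothetical, the `.xml` and `.html` extensions are an assumption (the table only says "XML HTML"), and falling back to Tika is inferred from the statement that most file types are parsed with it.

```python
from pathlib import Path

# Mapping copied from the table above. The lxml extensions are an assumption,
# since the table lists "XML HTML" rather than concrete extensions.
TOOL_BY_EXTENSION = {
    ".pst": "pst-utils",
    ".ddoc": "digidoc-tool",
    ".bdoc": "digidoc-tool",
    ".asics": "digidoc-tool",
    ".asice": "digidoc-tool",
    ".rar": "rar-nonfree",
    ".xml": "lxml",
    ".html": "lxml",
}


def tool_for(path: str) -> str:
    """Return the tool expected to handle the file, defaulting to Tika."""
    extension = Path(path).suffix.lower()
    return TOOL_BY_EXTENSION.get(extension, "tika")
```

For example, `tool_for("mail/archive.PST")` gives `"pst-utils"`, while a file type that is not listed, such as `report.docx`, falls through to `"tika"`.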
Installation of required packages on Ubuntu/Debian:
`sudo apt-get install pst-utils rar python3-lxml cmake build-essential -y`
`sudo sh install-digidoc.sh`
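After installing, it can be useful to confirm that the external tools are actually on `PATH`. The sketch below uses only the standard library. The executable names are assumptions, not taken from this README: `readpst` is assumed to be the binary shipped by pst-utils, `rar` by the rar package, and `digidoc-tool` by `install-digidoc.sh`. The function name `missing_executables` is hypothetical.

```python
import shutil

# Assumed executable names; adjust them if your distribution names them differently.
REQUIRED_EXECUTABLES = ("readpst", "rar", "digidoc-tool")


def missing_executables(names=REQUIRED_EXECUTABLES):
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_executables()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("All external tools found.")
```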
The package also requires our custom version of Apache Tika with the relevant Tesseract language packs installed. It can be started with Docker:
`sudo docker run -p 9998:9998 docker.texta.ee/texta/texta-parsers-python/tikaserver:latest`
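Before parsing, you may want to check that the Tika server started by the Docker command is reachable on port 9998. This is a minimal sketch under two assumptions: that the custom image exposes the `/version` endpoint of the standard Apache Tika Server, and that the server listens on `localhost`. The function name `tika_is_reachable` is hypothetical and not part of texta-parsers.

```python
import urllib.error
import urllib.request


def tika_is_reachable(base_url="http://localhost:9998", timeout=3.0):
    """Return True if a GET request to <base_url>/version succeeds."""
    try:
        with urllib.request.urlopen(f"{base_url}/version", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a malformed URL.
        return False
```

Calling `tika_is_reachable()` with no arguments checks the port published by the `docker run` command above.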
## Installation
Base install (without MLP & Face Analyzer):
`pip install texta-parsers`
Install with MLP:
`pip install texta-parsers[mlp]`
Install with whole bundle:
`pip install texta-parsers[mlp]`
## Testing
`python -m pytest -rx -v tests`
## Description
#### DocParser
A file parser. Input can be either bytes or a path to the file as a string. See the [user guide](https://git.texta.ee/texta/email-parser/-/wikis/DocParser/User-Guide/Getting-started) for more information. DocParser also includes EmailParser.
#### EmailParser
For parsing email messages and mailboxes. Supported file formats are Outlook Data File (**.pst**), mbox (**.mbox**) and EML (**.eml**). It can be used separately from DocParser, but then attachments are not parsed.
The user guide can be found [here](https://git.texta.ee/texta/email-parser/-/wikis/EmailParser/User-Guide/Getting-started) and the documentation [here](https://git.texta.ee/texta/email-parser/-/wikis/EmailParser/Documentation/1.2.1).
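To illustrate what an **.eml** file contains and why attachments are a separate concern, the sketch below uses Python's standard `email` module. It is not EmailParser and does not reflect its API or output format. The helper names `summarize_eml` and `build_sample` are hypothetical. The example extracts headers and the plain-text body, and only lists attachment file names without parsing their contents.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def summarize_eml(raw: bytes) -> dict:
    """Extract headers, plain-text body, and attachment names from raw EML bytes."""
    message = message_from_bytes(raw, policy=policy.default)
    body = message.get_body(preferencelist=("plain",))
    return {
        "subject": str(message["subject"]),
        "from": str(message["from"]),
        "body": body.get_content().strip() if body is not None else "",
        # Attachments are only listed here; their contents are not parsed.
        "attachments": [part.get_filename() for part in message.iter_attachments()],
    }


def build_sample() -> bytes:
    """Build a small in-memory message with one attachment for demonstration."""
    sample = EmailMessage()
    sample["Subject"] = "Quarterly report"
    sample["From"] = "alice@example.com"
    sample["To"] = "bob@example.com"
    sample.set_content("Report attached.")
    sample.add_attachment(
        b"%PDF-1.4 placeholder",
        maintype="application",
        subtype="pdf",
        filename="report.pdf",
    )
    return sample.as_bytes()


if __name__ == "__main__":
    print(summarize_eml(build_sample()))
```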
Raw data
{
"_id": null,
"home_page": "",
"name": "texta-parsers",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "",
"author": "TEXTA",
"author_email": "info@texta.ee",
"download_url": "https://files.pythonhosted.org/packages/ba/e0/0bc3359017acbac066d3e8003f7d84e8e66f78a21eaaaec449e4f3a431dd/texta-parsers-2.8.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "GPLv3",
"summary": "texta-parsers",
"version": "2.8.2",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "bae00bc3359017acbac066d3e8003f7d84e8e66f78a21eaaaec449e4f3a431dd",
"md5": "7b2f50ef709bd41f83690f02a7c08a71",
"sha256": "067b40b64457b2c21c6285d73dd9571c10ababbcacc45558937313f23be385c2"
},
"downloads": -1,
"filename": "texta-parsers-2.8.2.tar.gz",
"has_sig": false,
"md5_digest": "7b2f50ef709bd41f83690f02a7c08a71",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 34069,
"upload_time": "2023-11-08T12:23:53",
"upload_time_iso_8601": "2023-11-08T12:23:53.378136Z",
"url": "https://files.pythonhosted.org/packages/ba/e0/0bc3359017acbac066d3e8003f7d84e8e66f78a21eaaaec449e4f3a431dd/texta-parsers-2.8.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-11-08 12:23:53",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "texta-parsers"
}