texta-parsers


| Field | Value |
|---|---|
| Name | texta-parsers |
| Version | 2.8.2 |
| Summary | texta-parsers |
| Upload time | 2023-11-08 12:23:53 |
| Author | TEXTA |
| License | GPLv3 |
| Requirements | No requirements were recorded. |

# texta-parsers

A Python package for file parsing.

The main class in the package is **DocParser**. Email parsing is implemented in a separate class, **EmailParser**, which DocParser includes. If you only need to parse emails, you can restrict DocParser to email files with the `parse_only_extensions` parameter. **EmailParser** can also be used on its own, but then attachments are not parsed.


## Requirements

Most file types are parsed with **[tika](http://tika.apache.org/)**. The following additional tools are required:

| Tool | File Type |
|---|---|
| pst-utils | .pst  |
| digidoc-tool | .ddoc .bdoc .asics .asice |
| rar-nonfree  | .rar |
| lxml | XML HTML |

Installation of required packages on Ubuntu/Debian:

`sudo apt-get install pst-utils rar python3-lxml cmake build-essential -y`

`sudo sh install-digidoc.sh`
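
After installation, it can help to confirm that the tools from the table above are actually available. The following is a minimal sketch of such a check, not part of texta-parsers itself. The executable names (`readpst`, `digidoc-tool`, `rar`) are assumptions about what the system packages put on `PATH`, and `check_tools` is a hypothetical helper.

```python
import importlib.util
import shutil

# Tool -> file types, copied from the requirements table.
TOOL_FILE_TYPES = {
    "pst-utils": [".pst"],
    "digidoc-tool": [".ddoc", ".bdoc", ".asics", ".asice"],
    "rar-nonfree": [".rar"],
    "lxml": ["XML", "HTML"],
}

# Assumption: the executable each system package is expected to provide.
TOOL_EXECUTABLES = {
    "pst-utils": "readpst",
    "digidoc-tool": "digidoc-tool",
    "rar-nonfree": "rar",
}


def check_tools() -> dict:
    """Return {tool name: True/False} for every tool in the table."""
    status = {
        tool: shutil.which(exe) is not None
        for tool, exe in TOOL_EXECUTABLES.items()
    }
    # lxml is a Python library, so look for an importable module instead.
    status["lxml"] = importlib.util.find_spec("lxml") is not None
    return status


if __name__ == "__main__":
    for tool, ok in check_tools().items():
        types = " ".join(TOOL_FILE_TYPES[tool])
        print(f"{tool:<13} {'found' if ok else 'MISSING':<8} needed for {types}")
```

A missing tool only matters if you intend to parse the file types listed next to it.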

The package requires our custom version of Apache Tika, which has the relevant Tesseract language packs installed. It can be started with Docker:

`sudo docker run -p 9998:9998 docker.texta.ee/texta/texta-parsers-python/tikaserver:latest`
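
Before parsing, you may want to verify that the Tika container is reachable on the published port. The sketch below assumes the custom image keeps the `GET /version` endpoint of the stock Tika Server and that the port mapping is `9998` as in the command above. `tika_is_up` is a hypothetical helper, not part of texta-parsers.

```python
import urllib.error
import urllib.request


def tika_is_up(base_url: str = "http://localhost:9998", timeout: float = 3.0) -> bool:
    """Return True if a Tika server answers at base_url, False otherwise."""
    try:
        # Assumption: the server exposes the stock Tika Server /version endpoint.
        with urllib.request.urlopen(f"{base_url}/version", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error or malformed URL.
        return False


if __name__ == "__main__":
    print("Tika reachable:", tika_is_up())
```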

## Installation

Base install (without MLP & Face Analyzer):

`pip install texta-parsers`

Install with MLP:

`pip install texta-parsers[mlp]`


Install with whole bundle:

`pip install texta-parsers[mlp]`
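
To confirm which version ended up in the current environment, the standard library's `importlib.metadata` can be queried. This is a small sketch using only the Python standard library. `installed_version` is a hypothetical helper name.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "texta-parsers") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(version if version else "texta-parsers is not installed")
```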


## Testing

`python -m pytest -rx -v tests`


## Description

#### DocParser

A file parser. Input can be given either as bytes or as a string containing the path to the file. See the [user guide](https://git.texta.ee/texta/email-parser/-/wikis/DocParser/User-Guide/Getting-started) for more information. DocParser also includes EmailParser.
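
The two input forms can be illustrated with a small normalizing function. This is only a sketch of how calling code might handle "bytes or path" uniformly. It does not show DocParser's internals, and `read_input` is a hypothetical helper that is not part of texta-parsers.

```python
from pathlib import Path
from typing import Union


def read_input(source: Union[bytes, str]) -> bytes:
    """Return file content as bytes for either supported input form."""
    if isinstance(source, (bytes, bytearray)):
        # Content was passed directly.
        return bytes(source)
    if isinstance(source, str):
        # A string is interpreted as a path to the file.
        return Path(source).read_bytes()
    raise TypeError(f"expected bytes or str path, got {type(source).__name__}")
```

Rejecting other types early keeps a mistyped argument from being silently treated as a path.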

#### EmailParser

Parses email messages and mailboxes. Supported file formats are Outlook Data File (**.pst**), mbox (**.mbox**) and EML (**.eml**). It can be used separately from DocParser, but then attachments are not parsed.
The user guide can be found [here](https://git.texta.ee/texta/email-parser/-/wikis/EmailParser/User-Guide/Getting-started) and the documentation [here](https://git.texta.ee/texta/email-parser/-/wikis/EmailParser/Documentation/1.2.1).
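
To make the distinction between a message and its attachments concrete, the sketch below builds a small EML message and reads it back with Python's standard `email` module. It deliberately does not use EmailParser, whose API is described in the linked guides. The sample addresses, the file name and both helper functions are made up for illustration.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample_eml() -> bytes:
    """Create an in-memory EML message with one attachment (illustrative data)."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = "Report"
    msg.set_content("See the attached file.")
    msg.add_attachment(
        b"binary payload",
        maintype="application",
        subtype="octet-stream",
        filename="report.bin",
    )
    return msg.as_bytes()


def summarize(raw: bytes) -> dict:
    """Extract subject, plain-text body and attachment names from EML bytes."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "subject": str(msg["Subject"]),
        "body": body.get_content().strip() if body else "",
        # Attachments are separate MIME parts; their content needs its own parser.
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }


if __name__ == "__main__":
    print(summarize(build_sample_eml()))
```

The attachment appears only as a named part here, which mirrors why a separate file parser is needed to extract its content.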

            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "texta-parsers",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "",
    "author": "TEXTA",
    "author_email": "info@texta.ee",
    "download_url": "https://files.pythonhosted.org/packages/ba/e0/0bc3359017acbac066d3e8003f7d84e8e66f78a21eaaaec449e4f3a431dd/texta-parsers-2.8.2.tar.gz",
    "platform": null,
    "description": "# texta-parsers\n\nA Python package for file parsing.\n\nThe main class in the package is **DocParser**. The package also supports sophisticated parsing of emails which is implemented in class **EmailParser**. If you only need to parse emails then you can specify it with parameter `parse_only_extensions`. It is possible to use **EmailParser** independently as well but then attachments will not be parsed. \n\n\n## Requirements\n\nMost of the file types are parsed with **[tika](http://tika.apache.org/)**. Other tools that are required:\n\n| Tool | File Type |\n|---|---|\n| pst-utils | .pst  |\n| digidoc-tool | .ddoc .bdoc .asics .asice |\n| rar-nonfree  | .rar |\n| lxml | XML HTML |\n\nInstallation of required packages on Ubuntu/Debian:\n\n`sudo apt-get install pst-utils rar python3-lxml cmake build-essential -y`\n\n`sudo sh install-digidoc.sh`\n\nRequires our custom version of Apache TIKA with relevant Tesseract language packs installed:\n\n`sudo docker run -p 9998:9998 docker.texta.ee/texta/texta-parsers-python/tikaserver:latest`\n\n## Installation\n\nBase install (without MLP & Face Analyzer):\n\n`pip install texta-parsers`\n\nInstall with MLP:\n\n`pip install texta-parsers[mlp]`\n\n\nInstall with whole bundle:\n\n`pip install texta-parsers[mlp]`\n\n\n## Testing\n\n`python -m  pytest -rx -v tests`\n\n\n## Description\n\n#### DocParser\n\nA file parser. Input can either be in bytes or a path to the file as a string. See [user guide](https://git.texta.ee/texta/email-parser/-/wikis/DocParser/User-Guide/Getting-started) more information. DocParser also includes EmailParser.\n\n#### EmailParser\n\nFor parsing email messages and mailboxes. Supported file formats are Outlook Data File (**.pst**), mbox (**.mbox**) and EML (**.eml**). 
Can be used separately from DocParser but then attachments are not parsed.\nUser guide can be found [here](https://git.texta.ee/texta/email-parser/-/wikis/EmailParser/User-Guide/Getting-started) and documentation [here](https://git.texta.ee/texta/email-parser/-/wikis/EmailParser/Documentation/1.2.1).\n",
    "bugtrack_url": null,
    "license": "GPLv3",
    "summary": "texta-parsers",
    "version": "2.8.2",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bae00bc3359017acbac066d3e8003f7d84e8e66f78a21eaaaec449e4f3a431dd",
                "md5": "7b2f50ef709bd41f83690f02a7c08a71",
                "sha256": "067b40b64457b2c21c6285d73dd9571c10ababbcacc45558937313f23be385c2"
            },
            "downloads": -1,
            "filename": "texta-parsers-2.8.2.tar.gz",
            "has_sig": false,
            "md5_digest": "7b2f50ef709bd41f83690f02a7c08a71",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 34069,
            "upload_time": "2023-11-08T12:23:53",
            "upload_time_iso_8601": "2023-11-08T12:23:53.378136Z",
            "url": "https://files.pythonhosted.org/packages/ba/e0/0bc3359017acbac066d3e8003f7d84e8e66f78a21eaaaec449e4f3a431dd/texta-parsers-2.8.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-11-08 12:23:53",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "texta-parsers"
}
        