# NativeExtractor module for Python
This is the official Python binding for the [NativeExtractor](https://github.com/SpongeData-cz/nativeextractor) project.
<p align="center"><img src="https://raw.githubusercontent.com/SpongeData-cz/nativeextractor/main/logo.svg" width="400" /></p>
<p align="center"><img src="logo_python.png" width="400" /></p>
# Installation
## Requirements
* Python >= 2.7 (Python 3 is highly recommended)
* `pip`
* `build-essential` (gcc, make)
* `libglib2.0`, `libglib2.0-dev`, `libpythonX-dev`
We recommend using a virtual environment:
```bash
virtualenv myproject
source myproject/bin/activate
```
or
```bash
python -m venv myproject
source myproject/bin/activate
```
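To confirm that the environment is actually active before installing, a small check like the following may help. This is a minimal sketch, not part of pynativeextractor; the helper name `in_virtualenv` is hypothetical, and it relies only on the standard `sys` module.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtualenv or venv."""
    # Older `virtualenv` releases set sys.real_prefix; `venv` (Python 3.3+)
    # makes sys.prefix differ from sys.base_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


print("virtual environment active:", in_virtualenv())
```

If this prints `False`, re-run the `source myproject/bin/activate` step in the current shell.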
## Instant PyPI solution
```bash
pip install pynativeextractor
```
## Manual
* Clone the repo
`git clone --recurse-submodules https://github.com/SpongeData-cz/pynativeextractor.git`
* Install via `pip` or `pip3`
```bash
pip install -e ./pynativeextractor/
```
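After either installation route, one way to check that the package is visible to the current interpreter is sketched below. The helper `is_installed` is hypothetical (not part of pynativeextractor) and assumes Python 3.4 or newer, where `importlib.util.find_spec` is available.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be located by the import system."""
    # find_spec returns None when no module with that name can be found.
    return importlib.util.find_spec(package_name) is not None


print("pynativeextractor installed:", is_installed("pynativeextractor"))
```

This only confirms that the package can be found; it does not verify that the native miners were compiled correctly.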
# Typical usage
```python
import os
from pynativeextractor.extractor import BufferStream, Extractor, DEFAULT_MINERS_PATH

# Construct a new Extractor instance
ex = Extractor()
# Add the match_url miner from web_entities.so; it matches URLs
ex.add_miner_so(os.path.join(DEFAULT_MINERS_PATH, 'web_entities.so'), 'match_url')
text = "https://spongedata.cz"

# Create a stream from the in-memory text (for files, use FileStream, which uses mmap internally)
with BufferStream(text) as bf:
    # Start with an empty list of occurrences
    occurrences = []
    # Attach the stream to the extractor
    with ex.set_stream(bf):
        # Mine until the end of the stream is reached
        while not ex.eof():
            # Accumulate the occurrences found in this step
            occurrences += ex.next()

print(occurrences)  # Prints [{'label': 'URL', 'value': 'https://spongedata.cz', 'pos': 0, 'len': 13, 'prob': 1.0}]
```
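The mined occurrences are plain dictionaries, so they can be post-processed with ordinary Python. The following is a minimal sketch that groups values by label and drops low-probability matches. The helper `group_by_label` and its `min_prob` parameter are hypothetical, and the second sample entry is made up for illustration; only the dictionary keys are taken from the printed output above.

```python
from collections import defaultdict


def group_by_label(occurrences, min_prob=0.0):
    """Group occurrence values by 'label', keeping those with prob >= min_prob."""
    grouped = defaultdict(list)
    for occ in occurrences:
        # Entries without a 'prob' key are treated as probability 0.0.
        if occ.get("prob", 0.0) >= min_prob:
            grouped[occ["label"]].append(occ["value"])
    return dict(grouped)


# Sample data shaped like the printed output above.
# The first entry is copied from it; the second is illustrative only.
sample = [
    {"label": "URL", "value": "https://spongedata.cz", "pos": 0, "len": 13, "prob": 1.0},
    {"label": "URL", "value": "https://example.org", "pos": 30, "len": 19, "prob": 0.4},
]

print(group_by_label(sample, min_prob=0.5))
```

In real use, `sample` would be replaced by the `occurrences` list collected in the mining loop.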
Raw data
{
"_id": null,
"home_page": "https://github.com/SpongeData-cz/pynativeextractor",
"name": "pynativeextractor",
"maintainer": "",
"docs_url": null,
"requires_python": ">=2.7",
"maintainer_email": "",
"keywords": "",
"author": "SpongeData s.r.o.",
"author_email": "info@spongedata.cz",
"download_url": "https://files.pythonhosted.org/packages/ae/61/7be4bb317ee6434504f3b34d823f7339e525e5fc6a74bdee0514ceb0182f/pynativeextractor-10.0.12.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "Python binding for nativeextractor",
"version": "10.0.12",
"project_urls": {
"Bug Tracker": "https://github.com/SpongeData-cz/pynativeextractor/issues",
"Homepage": "https://github.com/SpongeData-cz/pynativeextractor"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "ae617be4bb317ee6434504f3b34d823f7339e525e5fc6a74bdee0514ceb0182f",
"md5": "a40a10cb26e4df22fe3a6c91d63dfcc9",
"sha256": "eb6d9bc85bd74d46bf2c0393d1f2ddbf94b41f3c94548b5f40df8bedb65789fd"
},
"downloads": -1,
"filename": "pynativeextractor-10.0.12.tar.gz",
"has_sig": false,
"md5_digest": "a40a10cb26e4df22fe3a6c91d63dfcc9",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=2.7",
"size": 41443,
"upload_time": "2022-07-13T11:32:51",
"upload_time_iso_8601": "2022-07-13T11:32:51.734370Z",
"url": "https://files.pythonhosted.org/packages/ae/61/7be4bb317ee6434504f3b34d823f7339e525e5fc6a74bdee0514ceb0182f/pynativeextractor-10.0.12.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-07-13 11:32:51",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "SpongeData-cz",
"github_project": "pynativeextractor",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "pynativeextractor"
}