<p align="center">
<img src="https://raw.githubusercontent.com/neuml/txtmarker/master/logo.png"/>
</p>
<p align="center">
<b>Highlight text in documents</b>
</p>
<p align="center">
<a href="https://github.com/neuml/txtmarker/releases">
<img src="https://img.shields.io/github/release/neuml/txtmarker.svg?style=flat&color=success" alt="Version"/>
</a>
<a href="https://github.com/neuml/txtmarker/releases">
<img src="https://img.shields.io/github/release-date/neuml/txtmarker.svg?style=flat&color=blue" alt="GitHub Release Date"/>
</a>
<a href="https://github.com/neuml/txtmarker/issues">
<img src="https://img.shields.io/github/issues/neuml/txtmarker.svg?style=flat&color=success" alt="GitHub issues"/>
</a>
<a href="https://github.com/neuml/txtmarker">
<img src="https://img.shields.io/github/last-commit/neuml/txtmarker.svg?style=flat&color=blue" alt="GitHub last commit"/>
</a>
<a href="https://github.com/neuml/txtmarker/actions?query=workflow%3Abuild">
<img src="https://github.com/neuml/txtmarker/workflows/build/badge.svg" alt="Build Status"/>
</a>
<a href="https://coveralls.io/github/neuml/txtmarker?branch=master">
<img src="https://img.shields.io/coverallsCoverage/github/neuml/txtmarker" alt="Coverage Status"/>
</a>
</p>
-------------------------------------------------------------------------------------------------------------------------------------------------------
![demo](https://raw.githubusercontent.com/neuml/txtmarker/master/demo.png)
txtmarker highlights text in documents. txtmarker takes a list of (name, text) pairs, scans an input document and creates a modified version with highlights embedded.
Current file formats supported:
- pdf
## Installation
The easiest way to install is via pip and PyPI.
```
pip install txtmarker
```
Python 3.9+ is supported. Using a Python [virtual environment](https://docs.python.org/3/library/venv.html) is recommended.
txtmarker can also be installed directly from GitHub to access the latest, unreleased features.
```
pip install git+https://github.com/neuml/txtmarker
```
## Examples
The examples directory has a series of examples and notebooks giving an overview of txtmarker. See the list of notebooks below.
### Notebooks
| Notebook | Description | |
|:----------|:-------------|------:|
| [Introducing txtmarker](https://github.com/neuml/txtmarker/blob/master/examples/01_Introducing_txtmarker.ipynb) | Overview of the functionality provided by txtmarker | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtmarker/blob/master/examples/01_Introducing_txtmarker.ipynb) |
| [Highlighting with Transformers](https://github.com/neuml/txtmarker/blob/master/examples/02_Highlighting_with_Transformers.ipynb) | AI-driven highlighting with Transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtmarker/blob/master/examples/02_Highlighting_with_Transformers.ipynb) |
## Configuration
The following section gives an overview of highlighters and available methods/configuration. See the notebooks above for detailed examples.
### Create a new highlighter
Creates a new highlighter instance.
```python
from txtmarker.factory import Factory
highlighter = Factory.create("pdf")
```
#### extension
```yaml
extension: string
```
Type of highlighter to create, given as a file extension (currently only `pdf`).
#### Optional constructor arguments:
#### formatter
```yaml
formatter: callable
```
Formats queries and input text using this method. Helps with cleanup of files with lots of symbols and other content.
#### chunks
```yaml
chunks: int
```
Splits queries into multiple chunks. This is designed for very long text matches.
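As a rough sketch of how the optional arguments might be used, the snippet below defines a formatter that strips symbols and collapses whitespace, then passes it to the factory. The `clean` and `build_highlighter` names are illustrative only. Passing `formatter` and `chunks` as keyword arguments, and a formatter that takes and returns a single string, are assumptions based on the parameter names above.

```python
import re

def clean(text):
    """Hypothetical formatter: replace runs of symbols/whitespace with one space."""
    # Anything that is not a word character becomes a single space
    text = re.sub(r"[^\w]+", " ", text)
    return text.strip()

def build_highlighter():
    # Imported lazily so clean() can be tried without txtmarker installed
    from txtmarker.factory import Factory

    # Assumption: formatter and chunks are accepted as keyword arguments
    return Factory.create("pdf", formatter=clean, chunks=4)
```

Since the formatter is described as applying to both queries and input text, the same cleanup should be reflected on both sides of a match.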
### Page text
Extracts page text from `infile` and returns it as a generator. This enables analysis of the text exactly as it will appear to the highlighter.
```python
highlighter.pages("input.pdf")
```
#### infile
```yaml
infile: string
```
Full path to input file
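One possible use of the page text is to check where a pattern occurs before highlighting. The sketch below assumes each item yielded by `pages` is the text of one page as a string; the `matching_pages` helper and the sample data are hypothetical and stand in for real output.

```python
import re

def matching_pages(pages, pattern):
    """Return (page index, matched text) for each page that matches pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    results = []
    for index, text in enumerate(pages):
        match = regex.search(text)
        if match:
            results.append((index, match.group(0)))
    return results

# Stand-in page text; with txtmarker this would be highlighter.pages("input.pdf")
sample = ["Introduction and scope", "Methods: random forest model", "Results"]
print(matching_pages(sample, r"random\s+forest"))
```

Because the helper accepts any iterable of strings, it works with a generator without loading every page into memory first.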
### Highlight text
Highlights `infile` using the provided annotations. The annotated file is written to `outfile`.
```python
highlighter.highlight("input.pdf", "output.pdf", [("name", "text to highlight")])
```
#### infile
```yaml
infile: string
```
Full path to input file
#### outfile
```yaml
outfile: string
```
Full path to output file, i.e. the highlighted file
#### highlights
```yaml
highlights: list of (string, string|regex)
```
List of highlight elements. Each pair has a name (can be None) and a text value. The text can be either a string or a regular expression. To match a string literally, escape any regex metacharacters first (e.g. with `re.escape`).
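The difference between literal and regex entries can be sketched as follows. The `literal` helper and the example values are illustrative, and the final call is left commented out because it needs txtmarker and a real input file.

```python
import re

def literal(text):
    """Escape regex metacharacters so the text is matched literally."""
    return re.escape(text)

highlights = [
    # Literal string: "$", "(", ")" and "." would otherwise act as regex syntax
    ("Cost", literal("$1,200 (est.)")),
    # Regular expression with no name: tolerates spacing around "<"
    (None, r"p\s*<\s*0\.05"),
]

# highlighter.highlight("input.pdf", "output.pdf", highlights)
```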
Raw data
{
"_id": null,
"home_page": "https://github.com/neuml/txtmarker",
"name": "txtmarker",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "pdf highlight text search",
"author": "NeuML",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/20/e5/b2d638be7575b10620dc06816fe68707c9c4aad6da462f23ffb443453cd1/txtmarker-1.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache 2.0: http://www.apache.org/licenses/LICENSE-2.0",
"summary": "Finds and highlights text in documents",
"version": "1.1.0",
"project_urls": {
"Documentation": "https://github.com/neuml/txtmarker",
"Homepage": "https://github.com/neuml/txtmarker",
"Issue Tracker": "https://github.com/neuml/txtmarker/issues",
"Source Code": "https://github.com/neuml/txtmarker"
},
"split_keywords": [
"pdf",
"highlight",
"text",
"search"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "bab1dfa1daf40cce4a85d2a1363c3e1afd27718f273b20ebe8a08756a0ac6966",
"md5": "adf513582c6898cd98d3e1aa6a5c46e0",
"sha256": "372a01c6808ead16974522260cbe232fb546cde1601a1ef930f960d0be5cc63f"
},
"downloads": -1,
"filename": "txtmarker-1.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "adf513582c6898cd98d3e1aa6a5c46e0",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 11773,
"upload_time": "2024-12-13T11:11:27",
"upload_time_iso_8601": "2024-12-13T11:11:27.744838Z",
"url": "https://files.pythonhosted.org/packages/ba/b1/dfa1daf40cce4a85d2a1363c3e1afd27718f273b20ebe8a08756a0ac6966/txtmarker-1.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "20e5b2d638be7575b10620dc06816fe68707c9c4aad6da462f23ffb443453cd1",
"md5": "bbba92eb52fe40a35f12e9e147e46248",
"sha256": "eeba11e6835a0a2ad6073dba5816f338f4136f6c9773e27a818e8c3d7591b05a"
},
"downloads": -1,
"filename": "txtmarker-1.1.0.tar.gz",
"has_sig": false,
"md5_digest": "bbba92eb52fe40a35f12e9e147e46248",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 12670,
"upload_time": "2024-12-13T11:11:29",
"upload_time_iso_8601": "2024-12-13T11:11:29.801808Z",
"url": "https://files.pythonhosted.org/packages/20/e5/b2d638be7575b10620dc06816fe68707c9c4aad6da462f23ffb443453cd1/txtmarker-1.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-13 11:11:29",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "neuml",
"github_project": "txtmarker",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "txtmarker"
}