txtmarker


- Name: txtmarker
- Version: 1.1.0
- Home page: https://github.com/neuml/txtmarker
- Summary: Finds and highlights text in documents
- Upload time: 2024-12-13 11:11:29
- Author: NeuML
- Requires Python: >=3.9
- License: Apache 2.0: http://www.apache.org/licenses/LICENSE-2.0
- Keywords: pdf, highlight, text, search

<p align="center">
    <img src="https://raw.githubusercontent.com/neuml/txtmarker/master/logo.png"/>
</p>

<p align="center">
    <b>Highlight text in documents</b>
</p>

<p align="center">
    <a href="https://github.com/neuml/txtmarker/releases">
        <img src="https://img.shields.io/github/release/neuml/txtmarker.svg?style=flat&color=success" alt="Version"/>
    </a>
    <a href="https://github.com/neuml/txtmarker/releases">
        <img src="https://img.shields.io/github/release-date/neuml/txtmarker.svg?style=flat&color=blue" alt="GitHub Release Date"/>
    </a>
    <a href="https://github.com/neuml/txtmarker/issues">
        <img src="https://img.shields.io/github/issues/neuml/txtmarker.svg?style=flat&color=success" alt="GitHub issues"/>
    </a>
    <a href="https://github.com/neuml/txtmarker">
        <img src="https://img.shields.io/github/last-commit/neuml/txtmarker.svg?style=flat&color=blue" alt="GitHub last commit"/>
    </a>
    <a href="https://github.com/neuml/txtmarker/actions?query=workflow%3Abuild">
        <img src="https://github.com/neuml/txtmarker/workflows/build/badge.svg" alt="Build Status"/>
    </a>
    <a href="https://coveralls.io/github/neuml/txtmarker?branch=master">
        <img src="https://img.shields.io/coverallsCoverage/github/neuml/txtmarker" alt="Coverage Status">
    </a>
</p>

-------------------------------------------------------------------------------------------------------------------------------------------------------

![demo](https://raw.githubusercontent.com/neuml/txtmarker/master/demo.png)

txtmarker highlights text in documents. txtmarker takes a list of (name, text) pairs, scans an input document and creates a modified version with highlights embedded.

Current file formats supported:

- pdf

## Installation
The easiest way to install is via pip and PyPI.

```
pip install txtmarker
```

Python 3.9+ is supported. Using a Python [virtual environment](https://docs.python.org/3/library/venv.html) is recommended.

txtmarker can also be installed directly from GitHub to access the latest, unreleased features.

```
pip install git+https://github.com/neuml/txtmarker
```


## Examples

The examples directory has a series of examples and notebooks giving an overview of txtmarker. See the list of notebooks below.

### Notebooks

| Notebook     |      Description      |   |
|:----------|:-------------|------:|
| [Introducing txtmarker](https://github.com/neuml/txtmarker/blob/master/examples/01_Introducing_txtmarker.ipynb) | Overview of the functionality provided by txtmarker | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtmarker/blob/master/examples/01_Introducing_txtmarker.ipynb) |
| [Highlighting with Transformers](https://github.com/neuml/txtmarker/blob/master/examples/02_Highlighting_with_Transformers.ipynb) | AI-driven highlighting with Transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtmarker/blob/master/examples/02_Highlighting_with_Transformers.ipynb) |


## Configuration

The following section gives an overview of highlighters and available methods/configuration. See the notebooks above for detailed examples.

### Create a new highlighter

Creates a new highlighter instance.

```python
from txtmarker.factory import Factory
highlighter = Factory.create("pdf")
```

#### extension
```yaml
extension: string
```

Type of highlighter to create (e.g. pdf)

#### Optional constructor arguments:

#### formatter
```yaml
formatter: callable
```

Formats queries and input text using this method. Helps with cleanup of files with lots of symbols and other content.
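As a rough illustration of what a formatter might look like, the sketch below defines a callable that strips symbols and collapses whitespace. The `clean` function and its cleanup rules are hypothetical, and passing it as the keyword `formatter` assumes the keyword name matches the option name above.

```python
import re


def clean(text):
    """Hypothetical formatter: strips symbols and collapses whitespace."""
    # Replace anything that is not a letter, digit or whitespace with a space
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()


def create_highlighter():
    # Imported here so clean() can be tried without txtmarker installed
    from txtmarker.factory import Factory

    # Assumption: the keyword is named `formatter`, as in the option above
    return Factory.create("pdf", formatter=clean)


print(clean("Results*  (n = 42) ... see Table-3"))
```

Because the formatter is described as running on both queries and input text, a query and the page text it should match are normalized the same way.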

#### chunks
```yaml
chunks: int
```

Splits queries into multiple chunks. This is designed for very long text matches.
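To make the idea of chunking concrete, the sketch below splits a long query into roughly equal groups of words. It is an illustration only and is not taken from txtmarker's source: `split_into_chunks` is a hypothetical helper, and the library may split queries differently. Passing `chunks=4` as a keyword likewise assumes the keyword name matches the option name above.

```python
def split_into_chunks(query, chunks):
    """Illustration only: split a query into roughly equal word groups."""
    words = query.split()
    # Ceiling division so that no words are dropped from the last group
    size = max(1, -(-len(words) // chunks))
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def create_highlighter():
    # Imported here so the helper above can be tried without txtmarker installed
    from txtmarker.factory import Factory

    # Assumption: the keyword is named `chunks`, as in the option above
    return Factory.create("pdf", chunks=4)


print(split_into_chunks("one two three four five six seven", 3))
```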

### Page text

Extracts page text from `infile` and returns it as a generator. This makes it possible to analyze the text exactly as it will appear to the highlighter.

```python
highlighter.pages("input.pdf")
```

#### infile
```yaml
infile: string
```

Full path to input file
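One possible use of the page generator is to check which pages contain a phrase before highlighting. The sketch below assumes each item yielded by `pages` is the text of one page as a string. `pages_containing` is a hypothetical helper, and the sample generator stands in for `highlighter.pages("input.pdf")`.

```python
def pages_containing(pages, needle):
    """Return 1-based page numbers whose text contains needle (case-insensitive)."""
    needle = needle.lower()
    return [
        number
        for number, text in enumerate(pages, start=1)
        if needle in text.lower()
    ]


# Stand-in generator so the helper can be tried without a PDF;
# with txtmarker this would be highlighter.pages("input.pdf")
sample = (text for text in ["Introduction", "Methods and data", "Data tables"])
print(pages_containing(sample, "data"))
```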

### Highlight text

Highlights text in `infile` using the provided annotations. The annotated file is written to `outfile`.

```python
highlighter.highlight("input.pdf", "output.pdf", [("name", "text to highlight")])
```

#### infile
```yaml
infile: string
```

Full path to input file

#### outfile
```yaml
outfile: string
```

Full path to output file, i.e. the highlighted file

#### highlights
```yaml
highlights: list of (string, string|regex)
```

List of highlight elements. Each pair has a name (can be None) and a text value. The text can be either a string or a regular expression. To match a literal string, escape any regular expression metacharacters first (e.g. by calling `re.escape`).
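The difference between literal and regular expression entries can be sketched as follows. The sample strings, names and file names are made up for illustration, and the sketch assumes a regular expression is supplied as a pattern string.

```python
import re

literal = "Total (USD): $1,000.00"

highlights = [
    # Literal text: escaped so "(", ")", "$" and "." are matched as written
    ("Total", re.escape(literal)),
    # Regular expression, passed through unescaped; the name may be None
    (None, r"Invoice\s+#\d+"),
]


def run(highlighter):
    # `highlighter` is an instance from Factory.create("pdf");
    # the file names are placeholders
    highlighter.highlight("input.pdf", "output.pdf", highlights)


# The escaped pattern matches the literal text; the raw text used as a pattern does not
print(bool(re.fullmatch(highlights[0][1], literal)))
print(bool(re.fullmatch(literal, literal)))
```

Without escaping, the parentheses in the sample text would be read as a group and `$` as an end-of-string anchor, so the pattern would fail to match the text it was copied from.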

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/neuml/txtmarker",
    "name": "txtmarker",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "pdf highlight text search",
    "author": "NeuML",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/20/e5/b2d638be7575b10620dc06816fe68707c9c4aad6da462f23ffb443453cd1/txtmarker-1.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache 2.0: http://www.apache.org/licenses/LICENSE-2.0",
    "summary": "Finds and highlights text in documents",
    "version": "1.1.0",
    "project_urls": {
        "Documentation": "https://github.com/neuml/txtmarker",
        "Homepage": "https://github.com/neuml/txtmarker",
        "Issue Tracker": "https://github.com/neuml/txtmarker/issues",
        "Source Code": "https://github.com/neuml/txtmarker"
    },
    "split_keywords": [
        "pdf",
        "highlight",
        "text",
        "search"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bab1dfa1daf40cce4a85d2a1363c3e1afd27718f273b20ebe8a08756a0ac6966",
                "md5": "adf513582c6898cd98d3e1aa6a5c46e0",
                "sha256": "372a01c6808ead16974522260cbe232fb546cde1601a1ef930f960d0be5cc63f"
            },
            "downloads": -1,
            "filename": "txtmarker-1.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "adf513582c6898cd98d3e1aa6a5c46e0",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 11773,
            "upload_time": "2024-12-13T11:11:27",
            "upload_time_iso_8601": "2024-12-13T11:11:27.744838Z",
            "url": "https://files.pythonhosted.org/packages/ba/b1/dfa1daf40cce4a85d2a1363c3e1afd27718f273b20ebe8a08756a0ac6966/txtmarker-1.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "20e5b2d638be7575b10620dc06816fe68707c9c4aad6da462f23ffb443453cd1",
                "md5": "bbba92eb52fe40a35f12e9e147e46248",
                "sha256": "eeba11e6835a0a2ad6073dba5816f338f4136f6c9773e27a818e8c3d7591b05a"
            },
            "downloads": -1,
            "filename": "txtmarker-1.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "bbba92eb52fe40a35f12e9e147e46248",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 12670,
            "upload_time": "2024-12-13T11:11:29",
            "upload_time_iso_8601": "2024-12-13T11:11:29.801808Z",
            "url": "https://files.pythonhosted.org/packages/20/e5/b2d638be7575b10620dc06816fe68707c9c4aad6da462f23ffb443453cd1/txtmarker-1.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-13 11:11:29",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "neuml",
    "github_project": "txtmarker",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "txtmarker"
}
        