nougat-ocr

Name	nougat-ocr
Version	0.1.17
home_page	https://github.com/facebookresearch/nougat
Summary	Nougat: Neural Optical Understanding for Academic Documents
upload_time	2023-10-04 09:29:52
maintainer
docs_url	None
author	Lukas Blecher
requires_python	>=3.7
license	MIT
keywords
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            <div align="center">
<h1>Nougat: Neural Optical Understanding for Academic Documents</h1>

[![Paper](https://img.shields.io/badge/Paper-arxiv.2308.13418-white)](https://arxiv.org/abs/2308.13418)
[![GitHub](https://img.shields.io/github/license/facebookresearch/nougat)](https://github.com/facebookresearch/nougat)
[![PyPI](https://img.shields.io/pypi/v/nougat-ocr?logo=pypi)](https://pypi.org/project/nougat-ocr)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Hugging Face Spaces](https://img.shields.io/badge/🤗%20Hugging%20Face-Community%20Space-blue)](https://huggingface.co/spaces/ysharma/nougat)

</div>

This is the official repository for Nougat, the academic document PDF parser that understands LaTeX math and tables.

Project page: https://facebookresearch.github.io/nougat/

## Install

From pip:
```
pip install nougat-ocr
```

From repository:
```
pip install git+https://github.com/facebookresearch/nougat
```

> Note, on Windows: If you want to utilize a GPU, make sure you first install the correct PyTorch version. Follow instructions [here](https://pytorch.org/get-started/locally/)

There are extra dependencies if you want to call the model from an API or generate a dataset.
Install via

`pip install "nougat-ocr[api]"` or `pip install "nougat-ocr[dataset]"`

### Get prediction for a PDF
#### CLI

To get predictions for a PDF, run:

```
$ nougat path/to/file.pdf -o output_directory
```

You can also pass a directory, or a text file in which each line is a path to a PDF, as the positional argument:

```
$ nougat path/to/directory -o output_directory
```
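
If the PDFs are spread over nested folders, the list-file form can be produced with a few lines of Python. This is a minimal sketch: the helper name `write_pdf_list` and the file name `pdf_list.txt` are made up for illustration and are not part of Nougat.

```python
from pathlib import Path


def write_pdf_list(pdf_root, list_file):
    """Write one PDF path per line to list_file and return the number of paths."""
    # Collect every file below pdf_root whose extension is .pdf (case-insensitive).
    pdf_paths = sorted(
        p for p in Path(pdf_root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    Path(list_file).write_text(
        "".join(f"{p}\n" for p in pdf_paths), encoding="utf-8"
    )
    return len(pdf_paths)
```

Calling `write_pdf_list("papers", "pdf_list.txt")` and then running `nougat pdf_list.txt -o output_directory` should process every listed file.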

```
usage: nougat [-h] [--batchsize BATCHSIZE] [--checkpoint CHECKPOINT] [--model MODEL] [--out OUT]
              [--recompute] [--markdown] [--no-skipping] pdf [pdf ...]

positional arguments:
  pdf                   PDF(s) to process.

options:
  -h, --help            show this help message and exit
  --batchsize BATCHSIZE, -b BATCHSIZE
                        Batch size to use.
  --checkpoint CHECKPOINT, -c CHECKPOINT
                        Path to checkpoint directory.
  --model MODEL_TAG, -m MODEL_TAG
                        Model tag to use.
  --out OUT, -o OUT     Output directory.
  --recompute           Recompute already computed PDF, discarding previous predictions.
  --full-precision      Use float32 instead of bfloat16. Can speed up CPU conversion for some setups.
  --no-markdown         Do not add postprocessing step for markdown compatibility.
  --markdown            Add postprocessing step for markdown compatibility (default).
  --no-skipping         Don't apply failure detection heuristic.
  --pages PAGES, -p PAGES
                        Provide page numbers like '1-4,7' for pages 1 through 4 and page 7. Only works for single PDFs.
```
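
To make the `--pages` format concrete, the sketch below expands a specification such as `'1-4,7'` into explicit page numbers. It is an illustration only and not Nougat's own parser; the function name `parse_page_spec` and the assumption that page numbers start at 1 are mine.

```python
def parse_page_spec(spec):
    """Turn a spec such as '1-4,7' into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first, last = part.split("-", 1)
            start, stop = int(first), int(last)
            if start > stop:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, stop + 1))  # both ends included
        else:
            pages.add(int(part))
    # Assumption: pages are counted from 1, as in the help text example.
    if any(p < 1 for p in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)
```

For example, `parse_page_spec("1-4,7")` returns `[1, 2, 3, 4, 7]`.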

The default model tag is `0.1.0-small`. If you want to use the base model, use `0.1.0-base`.
```
$ nougat path/to/file.pdf -o output_directory -m 0.1.0-base
```
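
When the conversion is part of a larger Python script, one option is to call the CLI through `subprocess`. The sketch below only uses flags listed in the help text above; the helper names `build_nougat_command` and `run_nougat` are hypothetical.

```python
import subprocess


def build_nougat_command(pdf, out_dir, model=None, pages=None, no_skipping=False):
    """Assemble the argument list for the nougat CLI."""
    cmd = ["nougat", str(pdf), "-o", str(out_dir)]
    if model is not None:
        cmd += ["-m", model]   # e.g. "0.1.0-base"
    if pages is not None:
        cmd += ["-p", pages]   # e.g. "1-4,7", single PDFs only
    if no_skipping:
        cmd.append("--no-skipping")
    return cmd


def run_nougat(pdf, out_dir, **options):
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_nougat_command(pdf, out_dir, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.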

In the output directory, every PDF is saved as a `.mmd` file. The format is a lightweight markup language, mostly compatible with [Mathpix Markdown](https://github.com/Mathpix/mathpix-markdown-it) (we make use of the LaTeX tables).

> Note: On some devices the failure detection heuristic is not working properly. If you experience a lot of `[MISSING_PAGE]` responses, try to run with the `--no-skipping` flag. Related: [#11](https://github.com/facebookresearch/nougat/issues/11), [#67](https://github.com/facebookresearch/nougat/issues/67)
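
To decide whether a rerun with `--no-skipping` is worthwhile, it can help to count the markers in the generated files. The sketch below is a rough check, not part of Nougat. It matches on the prefix `[MISSING_PAGE` only, which is an assumption about how the marker is written in the output.

```python
from pathlib import Path

# Prefix only: an assumption, so that variants of the marker are counted too.
MISSING_MARKER = "[MISSING_PAGE"


def count_missing_pages(text):
    """Count how often the missing-page marker occurs in a piece of text."""
    return text.count(MISSING_MARKER)


def files_with_missing_pages(output_dir):
    """Map each .mmd file in output_dir to its marker count, skipping clean files."""
    report = {}
    for mmd in sorted(Path(output_dir).glob("*.mmd")):
        n = count_missing_pages(mmd.read_text(encoding="utf-8"))
        if n:
            report[mmd.name] = n
    return report
```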

#### API

With the extra dependencies installed, you can use `app.py` to start an API. Call

```sh
$ nougat_api
```

Then get a prediction for a PDF file by making a POST request to http://127.0.0.1:8503/predict/. The endpoint also accepts the parameters `start` and `stop` to limit the computation to selected page numbers (boundaries are included).

The response is a string with the markdown text of the document.

```sh
curl -X 'POST' \
  'http://127.0.0.1:8503/predict/' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@<PDFFILE.pdf>;type=application/pdf'
```
To limit the conversion to pages 1 to 5, use the `start`/`stop` parameters in the request URL: http://127.0.0.1:8503/predict/?start=1&stop=5
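
The same request can be sent from Python. The sketch below mirrors the `curl` call above and assumes the third-party `requests` package is installed (it is not stated here whether the `api` extra provides it). The helper names are hypothetical, and the exact encoding of the response body should be checked against a running server.

```python
from pathlib import Path
from urllib.parse import urlencode

PREDICT_URL = "http://127.0.0.1:8503/predict/"


def build_predict_url(start=None, stop=None, base_url=PREDICT_URL):
    """Append the optional start/stop page limits as query parameters."""
    params = {}
    if start is not None:
        params["start"] = start
    if stop is not None:
        params["stop"] = stop
    return f"{base_url}?{urlencode(params)}" if params else base_url


def predict(pdf_path, start=None, stop=None):
    """POST a PDF to the local API and return the raw response body."""
    import requests  # third-party; assumed to be installed separately

    with open(pdf_path, "rb") as fh:
        response = requests.post(
            build_predict_url(start, stop),
            files={"file": (Path(pdf_path).name, fh, "application/pdf")},
            headers={"accept": "application/json"},
        )
    response.raise_for_status()
    # Assumption: the body may be a JSON-encoded string; decode it if needed.
    return response.text
```

`build_predict_url(1, 5)` yields the URL shown above, and `requests` sets the multipart `Content-Type` header itself when `files=` is used.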

## Dataset
### Generate dataset

To generate a dataset you need:

1. A directory containing the PDFs
2. A directory containing the `.html` files (processed `.tex` files by [LaTeXML](https://math.nist.gov/~BMiller/LaTeXML/)) with the same folder structure
3. A binary file of [pdffigures2](https://github.com/allenai/pdffigures2) and a corresponding environment variable `export PDFFIGURES_PATH="/path/to/binary.jar"`
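
Before starting a long dataset run, the three prerequisites above can be checked up front. This is a minimal sketch with a hypothetical helper name, `check_dataset_inputs`. It only verifies that the paths exist, not that the folder structures match.

```python
import os
from pathlib import Path


def check_dataset_inputs(pdf_root, html_root, env=None):
    """Return a list of problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    # Items 1 and 2: both input directories must exist.
    for label, root in (("PDF", pdf_root), ("HTML", html_root)):
        if not Path(root).is_dir():
            problems.append(f"{label} directory not found: {root}")
    # Item 3: the environment variable must point to the pdffigures2 binary.
    jar = env.get("PDFFIGURES_PATH")
    if not jar:
        problems.append("PDFFIGURES_PATH is not set")
    elif not Path(jar).is_file():
        problems.append(f"PDFFIGURES_PATH does not point to a file: {jar}")
    return problems
```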

Next run

```
python -m nougat.dataset.split_htmls_to_pages --html path/html/root --pdfs path/pdf/root --out path/paired/output --figure path/pdffigures/outputs
```

Additional arguments include

| Argument              | Description                                |
| --------------------- | ------------------------------------------ |
| `--recompute`         | Recompute all splits                       |
| `--markdown MARKDOWN` | Markdown output directory                  |
| `--workers WORKERS`   | Number of processes to use                 |
| `--dpi DPI`           | Resolution at which the pages are saved    |
| `--timeout TIMEOUT`   | Maximum time per paper in seconds          |
| `--tesseract`         | Tesseract OCR prediction for each page     |

Finally, create a `jsonl` file that contains all the image paths, markdown text, and meta information.

```
python -m nougat.dataset.create_index --dir path/paired/output --out index.jsonl
```

For each `jsonl` file you also need to generate a seek map for faster data loading:

```
python -m nougat.dataset.gen_seek file.jsonl
```
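
The idea behind a seek map is that a loader can jump straight to a record instead of reading the whole file. The sketch below illustrates that general technique with byte offsets per line. It is not the implementation of `nougat.dataset.gen_seek`, and it makes no claim about the on-disk format of the `.seek.map` files.

```python
import json


def build_line_offsets(jsonl_path):
    """Return the byte offset at which each non-empty line starts."""
    offsets = []
    with open(jsonl_path, "rb") as fh:
        while True:
            position = fh.tell()
            line = fh.readline()
            if not line:
                break  # end of file
            if line.strip():
                offsets.append(position)
    return offsets


def read_record(jsonl_path, offsets, index):
    """Load a single record by seeking to its stored offset."""
    with open(jsonl_path, "rb") as fh:
        fh.seek(offsets[index])
        return json.loads(fh.readline().decode("utf-8"))
```

With the offsets computed once, reading record `i` costs one seek and one line read, regardless of the file size.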

The resulting directory structure can look as follows:

```
root/
├── images
├── train.jsonl
├── train.seek.map
├── test.jsonl
├── test.seek.map
├── validation.jsonl
└── validation.seek.map
```

Note that the `.mmd` and `.json` files in `path/paired/output` (here `images`) are no longer required.
Removing them halves the number of files, which can be useful when pushing to an S3 bucket.
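
A quick way to confirm that every `jsonl` file in such a layout has its seek map is sketched below. The helper name `missing_seek_maps` is hypothetical, and the pairing rule (`train.jsonl` with `train.seek.map`) is taken from the directory tree above.

```python
from pathlib import Path


def missing_seek_maps(root):
    """List the jsonl files in root that have no matching .seek.map file."""
    root = Path(root)
    return sorted(
        p.name
        for p in root.glob("*.jsonl")
        # train.jsonl is expected to sit next to train.seek.map
        if not p.with_name(p.stem + ".seek.map").exists()
    )
```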

## Training

To train or fine-tune a Nougat model, run

```
python train.py --config config/train_nougat.yaml
```

## Evaluation

Run 

```
python test.py --checkpoint path/to/checkpoint --dataset path/to/test.jsonl --save_path path/to/results.json
```

To get the results for the different text modalities, run

```
python -m nougat.metrics path/to/results.json
```

## FAQ

- Why am I only getting `[MISSING_PAGE]`?

  Nougat was trained on scientific papers found on arXiv and PMC. Is the document you're processing similar to that?
  What language is the document in? Nougat works best with English papers; other Latin-based languages might work. **Chinese, Russian, Japanese etc. will not work**.
  If these requirements are fulfilled, the cause might be false positives in the failure detection when computing on CPU or older GPUs ([#11](https://github.com/facebookresearch/nougat/issues/11)). Try passing the `--no-skipping` flag for now.

- Where can I download the model checkpoint from?

  The checkpoints are uploaded on GitHub in the release section. They can also be downloaded during the first execution of the program. Choose the preferred model by passing `--model 0.1.0-{base,small}`.

## Citation

```
@misc{blecher2023nougat,
      title={Nougat: Neural Optical Understanding for Academic Documents}, 
      author={Lukas Blecher and Guillem Cucurull and Thomas Scialom and Robert Stojnic},
      year={2023},
      eprint={2308.13418},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```

## Acknowledgments

This repository builds on top of the [Donut](https://github.com/clovaai/donut/) repository.

## License

The Nougat codebase is licensed under MIT.

The Nougat model weights are licensed under CC-BY-NC.

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/facebookresearch/nougat",
    "name": "nougat-ocr",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "",
    "author": "Lukas Blecher",
    "author_email": "lblecher@meta.com",
    "download_url": "",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Nougat: Neural Optical Understanding for Academic Documents",
    "version": "0.1.17",
    "project_urls": {
        "Homepage": "https://github.com/facebookresearch/nougat"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8105accfaf31a1a6e50f379f0504d374d563707bc62190501f6f0cbad815be3f",
                "md5": "b1bd9a7b1b40a768db422a946cbd6ea1",
                "sha256": "f776732c716250972c7de11a47b36e94fa48e271d67045a427f19f12eeeef118"
            },
            "downloads": -1,
            "filename": "nougat_ocr-0.1.17-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b1bd9a7b1b40a768db422a946cbd6ea1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 82497,
            "upload_time": "2023-10-04T09:29:52",
            "upload_time_iso_8601": "2023-10-04T09:29:52.357780Z",
            "url": "https://files.pythonhosted.org/packages/81/05/accfaf31a1a6e50f379f0504d374d563707bc62190501f6f0cbad815be3f/nougat_ocr-0.1.17-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-10-04 09:29:52",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "facebookresearch",
    "github_project": "nougat",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "nougat-ocr"
}

Lukas Blecher