| Field | Value |
| --- | --- |
| Name | pdf-language-detector |
| Version | 0.0.11 |
| Summary | A python script to iterate over a list of PDF in a directory and try to guess their language with Tesseract OCR. |
| upload_time | 2023-06-20 10:22:03 |
| author | ICIJ |
| requires_python | >=3.8.1,<4.0.0 |
# PLD (PDF Language Detector)
PLD is a Python program that analyzes PDF files, extracts images, processes them using Optical Character Recognition (OCR), and detects the dominant language of the text. It provides language detection information in JSON format and calculates the average confidence coefficient for each language.
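As a rough illustration of the last step, the sketch below averages per-page OCR confidences for each candidate language and picks the language with the highest average. The input shape (one dictionary per page) and the function names `average_confidences` and `dominant_language` are assumptions made for this example, not PLD's actual data model or API.

```python
from collections import defaultdict
from statistics import mean


def average_confidences(pages):
    """Average the confidence of each language over all pages.

    `pages` is a hypothetical structure: one dict per page, mapping a
    language code to the OCR confidence obtained for that language.
    """
    by_language = defaultdict(list)
    for page in pages:
        for language, confidence in page.items():
            by_language[language].append(confidence)
    return {language: mean(values) for language, values in by_language.items()}


def dominant_language(averages):
    """Return the language code with the highest average confidence."""
    return max(averages, key=averages.get)


# Made-up confidences for a three-page document.
pages = [
    {"eng": 90.0, "spa": 60.0},
    {"eng": 80.0, "spa": 70.0},
    {"eng": 85.0, "spa": 50.0},
]
averages = average_confidences(pages)
print(averages)
print(dominant_language(averages))
```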
## Requirements
- [Python](https://www.python.org/downloads/) 3.8.1 or above
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)
- [pdftoppm](https://poppler.freedesktop.org/)
## Installation
Install Tesseract OCR and pdftoppm using your package manager. For example, on Ubuntu:
```bash
sudo apt install tesseract-ocr tesseract-ocr-all poppler-utils
```
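To confirm that both system dependencies are reachable before running PLD, a small check along these lines may help. It is only a sketch: the helper name `missing_tools` is made up for this example, and it assumes the executables are installed under their usual names, `tesseract` and `pdftoppm`.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of the executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing system dependencies: " + ", ".join(missing))
else:
    print("Tesseract OCR and pdftoppm are available.")
```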
### From PyPI
Install with pip:
```bash
python3 -m pip install --user pdf-language-detector
```
Then run directly from your terminal:
```bash
pld --help
```
### From the sources
Clone the PLD repository:
```bash
git clone git@github.com:ICIJ/pld.git
cd pld
```
Install the required Python packages with [Poetry](https://python-poetry.org/):
```bash
poetry install
```
Then run PLD inside the virtual environment managed by Poetry:
```bash
poetry run pld --help
```
### From Docker
Install with Docker:
```bash
docker pull icij/pld
```
Then run inside a container:
```bash
docker run -it icij/pld pld --help
```
## Usage
### Detect
This command processes PDF files and detects their dominant language.
```
pld detect --help

  --language: Three-letter ISO language codes to detect. Repeat the option for each language.
  --input-dir: Path to the input directory containing PDF files. Default is the current directory.
  --output-dir (optional): Path to the output directory. Default is the 'out' directory in the current directory.
  --max-pages (optional): Maximum number of pages to process per PDF file. Default is 5.
  --resume (optional): Skip PDF files that have already been analyzed.
  --skip-images (optional): Skip the extraction of PDF files as images.
  --skip-ocr (optional): Skip the OCR of images from PDF files.
  --parallel (optional): Number of threads to run in parallel.
  --relative-to (optional): Path to the directory relative to which the output directory path is built.
```
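When PLD is driven from another Python script, the command line can be assembled programmatically. The sketch below uses only options listed in the help output above and assumes they are passed to the `detect` subcommand; the helper `build_detect_command` is hypothetical and not part of PLD.

```python
def build_detect_command(languages, input_dir=".", output_dir="out", max_pages=5):
    """Build the argument list for `pld detect`.

    The defaults mirror the documented ones: current directory as input,
    'out' as output directory, and 5 pages per PDF file.
    """
    command = ["pld", "detect"]
    for language in languages:
        # The --language option is repeated once per language code.
        command += ["--language", language]
    command += ["--input-dir", input_dir]
    command += ["--output-dir", output_dir]
    command += ["--max-pages", str(max_pages)]
    return command


command = build_detect_command(["eng", "spa"], input_dir="documents", output_dir="results")
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once `pld` is installed and on the `PATH`.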
### Report
This command prints a report from the previously detected languages (using the same output directory).
```
pld report --help
--output-dir: Path to the output directory. Default is 'out' directory in the current directory.
```
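The exact layout of the report is not described here. Purely as an illustration of the kind of summary such a report could contain, the sketch below counts how many files were assigned to each dominant language. The mapping of file names to language codes and the `summarize` helper are made up for this example and do not reflect PLD's real output files.

```python
from collections import Counter


def summarize(dominant_by_file):
    """Count how many files were assigned to each dominant language."""
    return Counter(dominant_by_file.values())


# Made-up detection results: file name -> dominant language code.
dominant_by_file = {
    "a.pdf": "eng",
    "b.pdf": "spa",
    "c.pdf": "eng",
}
for language, count in summarize(dominant_by_file).most_common():
    print(f"{language}\t{count}")
```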
## Test
You can run the test suite (powered by pytest) with this command:
```bash
make test
```
## Examples
Process PDF files in the 'documents' directory, detect English and Spanish, and save the results in the 'results' directory:
```bash
pld detect --language eng --language spa --input-dir documents --output-dir results
```
Process PDF files in the 'documents' directory, detect French and Greek, and limit processing to 3 pages per file:
```bash
pld detect --language fra --language ell --input-dir documents --max-pages 3
```
## License
This project is licensed under the MIT License.
## Raw data
{
"_id": null,
"home_page": "",
"name": "pdf-language-detector",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8.1,<4.0.0",
"maintainer_email": "",
"keywords": "",
"author": "ICIJ",
"author_email": "engineering@icij.org",
"download_url": "https://files.pythonhosted.org/packages/ed/38/9b1cf4d9ae963d4a5bec0957da7b626be1de6fc8cb4dd7bc1aa9894681ff/pdf_language_detector-0.0.11.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "A python script to iterate over a list of PDF in a directory and try to guess their language with Tesseract OCR.",
"version": "0.0.11",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "9b97b2e42125a4ca9d5a8eaf4766b4bd97ff83e68711c591d97c62991afcb113",
"md5": "e675bd001687ca977126d8589affa49c",
"sha256": "4c624a4fd8664a8e856a39bca1864b55b790111e6a17c023ca6b033bfeacd5fd"
},
"downloads": -1,
"filename": "pdf_language_detector-0.0.11-py3-none-any.whl",
"has_sig": false,
"md5_digest": "e675bd001687ca977126d8589affa49c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8.1,<4.0.0",
"size": 9540,
"upload_time": "2023-06-20T10:22:01",
"upload_time_iso_8601": "2023-06-20T10:22:01.305404Z",
"url": "https://files.pythonhosted.org/packages/9b/97/b2e42125a4ca9d5a8eaf4766b4bd97ff83e68711c591d97c62991afcb113/pdf_language_detector-0.0.11-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "ed389b1cf4d9ae963d4a5bec0957da7b626be1de6fc8cb4dd7bc1aa9894681ff",
"md5": "c4c96ab956eeab90333861d10e339432",
"sha256": "9a483cf13d0d32d246c82671a8e1a49ea05cb01a6d635d59206f562bf2c59e76"
},
"downloads": -1,
"filename": "pdf_language_detector-0.0.11.tar.gz",
"has_sig": false,
"md5_digest": "c4c96ab956eeab90333861d10e339432",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8.1,<4.0.0",
"size": 9493,
"upload_time": "2023-06-20T10:22:03",
"upload_time_iso_8601": "2023-06-20T10:22:03.963865Z",
"url": "https://files.pythonhosted.org/packages/ed/38/9b1cf4d9ae963d4a5bec0957da7b626be1de6fc8cb4dd7bc1aa9894681ff/pdf_language_detector-0.0.11.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-06-20 10:22:03",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "pdf-language-detector"
}