bangla-pdf-ocr


- Name: bangla-pdf-ocr
- Version: 0.1.1
- Home page: https://github.com/asiff00/bangla-pdf-ocr
- Summary: A package to extract Bengali text from PDFs using OCR
- Upload time: 2024-10-12 18:18:54
- Maintainer: None
- Docs URL: None
- Author: Abdullah Al Asif
- Requires Python: >=3.6
- License: None
- Requirements: tqdm, Pillow, pdf2image, pytesseract, colorama

# Bangla PDF OCR

Bangla PDF OCR extracts Bengali text from PDF files using OCR. It is designed for simplicity and works on Windows, macOS, and Linux; a one-time setup command installs the system dependencies it needs.

## Key Features

- Extracts Bengali text from PDFs quickly and accurately
- Works on Windows, macOS, and Linux
- Easy to use from both command line and Python scripts
- Installs the required system components with a single setup command
- Supports other languages besides Bengali

## Quick Start

1. Install the package:
   ```bash
   pip install bangla-pdf-ocr
   ```

2. Run the setup command to install dependencies:
   ```bash
   bangla-pdf-ocr-setup
   ```

3. Start using it right away!

   From command line:
   ```bash
   bangla-pdf-ocr your_file.pdf
   ```

   In your Python script:
   ```python
   from bangla_pdf_ocr import process_pdf
   text = process_pdf("your_file.pdf")
   print(text)
   ```

That's it! Beyond the setup command, no additional downloads or configuration are needed.
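
Once you have the extracted text, a quick sanity check can tell you whether the OCR actually produced Bengali script. The sketch below is not part of the package; `bengali_ratio` is a hypothetical helper that measures how many non-whitespace characters fall in the Bengali Unicode block (U+0980 to U+09FF).

```python
def bengali_ratio(text: str) -> float:
    """Return the share of non-whitespace characters in the Bengali Unicode block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # The Bengali block spans U+0980 to U+09FF.
    bengali = sum(1 for c in chars if "\u0980" <= c <= "\u09ff")
    return bengali / len(chars)


# Hypothetical usage with the package's documented entry point:
# from bangla_pdf_ocr import process_pdf
# text = process_pdf("your_file.pdf")
# if bengali_ratio(text) < 0.5:  # the threshold is an arbitrary assumption
#     print("Warning: output contains little Bengali script")
```

Digits, Latin words, and punctuation such as the danda (which lives in the Devanagari block) lower the ratio, so treat a low value as a hint to inspect the output rather than as a definite failure.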

## Features

- Extract Bengali text from PDF files
- Support for other languages through Tesseract OCR
- Easy-to-use command-line interface
- Automatic installation of dependencies (OS-specific)
- Multi-threaded processing for improved performance
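
The package's internal threading is not shown here, so the following is only a sketch of how page-level OCR can be parallelised in general. `ocr_pages` and `join_pages` are hypothetical names; the OCR step is passed in as a callable so the same code works with any backend.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

Page = TypeVar("Page")


def ocr_pages(
    pages: Iterable[Page],
    ocr_page: Callable[[Page], str],
    max_workers: int = 4,
) -> List[str]:
    """Run ocr_page on every page in a thread pool, keeping page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # so the pages come back in document order.
        return list(pool.map(ocr_page, pages))


def join_pages(texts: Iterable[str]) -> str:
    """Join per-page text with a blank line between pages."""
    return "\n\n".join(t.strip() for t in texts)


# Possible wiring with the libraries listed in the requirements:
# from pdf2image import convert_from_path
# import pytesseract
# pages = convert_from_path("your_file.pdf")
# text = join_pages(
#     ocr_pages(pages, lambda img: pytesseract.image_to_string(img, lang="ben"))
# )
```

Threads are a reasonable fit because pytesseract runs the `tesseract` binary as a subprocess, so Python mostly waits on that process instead of competing for the interpreter lock.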

## Prerequisites

- Python 3.6 or higher
- pip (Python package installer)

## Installation

1. Install the package from PyPI:
   ```bash
   pip install bangla-pdf-ocr
   ```

2. Set up system dependencies:
   ```bash
   bangla-pdf-ocr-setup
   ```
   This command installs necessary dependencies based on your operating system:
   - Linux: Installs `tesseract-ocr`, `poppler-utils`, and `tesseract-ocr-ben`
   - macOS: Installs `tesseract`, `poppler`, and `tesseract-lang` via Homebrew
   - Windows: Downloads and installs Tesseract OCR and Poppler, adding them to the system PATH

   Note: On Windows, you may need to run the command prompt as administrator.

3. Verify the installation:
   ```bash
   bangla-pdf-ocr-verify
   ```
   This command checks if all required dependencies are properly installed and accessible.

4. Try a sample PDF extraction:
   ```bash
   bangla-pdf-ocr
   ```
   This command processes a sample Bengali PDF file included with the package, demonstrating the text extraction capabilities.
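
To see which system packages the setup step is expected to install on your machine, the per-OS list above can be expressed as a small lookup. This is an illustrative sketch, not the setup command's actual code; `SYSTEM_PACKAGES` and `system_packages` are hypothetical names, and the Windows entries are descriptive labels, not installer package names.

```python
import platform
from typing import Dict, List, Optional

# Package names as listed in the installation notes, keyed by platform.system().
SYSTEM_PACKAGES: Dict[str, List[str]] = {
    "Linux": ["tesseract-ocr", "poppler-utils", "tesseract-ocr-ben"],
    "Darwin": ["tesseract", "poppler", "tesseract-lang"],  # macOS, via Homebrew
    "Windows": ["Tesseract OCR", "Poppler"],  # downloaded and added to PATH
}


def system_packages(system: Optional[str] = None) -> List[str]:
    """Return the system packages expected for the given (or current) OS."""
    system = system or platform.system()
    try:
        return list(SYSTEM_PACKAGES[system])
    except KeyError:
        raise ValueError(f"Unsupported operating system: {system}") from None
```

Calling `system_packages()` without arguments uses the current platform; passing a name such as `"Darwin"` lets you inspect another platform's list.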
   
## Usage

### Command-line Interface

Basic usage:
```bash
bangla-pdf-ocr [input_pdf] [-o output_file] [-l language]
```

#### Options
- `input_pdf`: Path to the input PDF file (optional, uses a sample PDF if not provided)
- `-o, --output`: Specify the output file path (default: input filename with `.txt` extension)
- `-l, --language`: Specify the OCR language (default: 'ben' for Bengali)
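
For readers who want to mirror this interface in their own scripts, the options above map naturally onto `argparse`. The sketch below is an assumption about how such a parser could be declared, not the package's source; `build_parser` and `resolve_output` are hypothetical names.

```python
import argparse
from pathlib import Path
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Declare a parser with the same options as the documented CLI."""
    parser = argparse.ArgumentParser(prog="bangla-pdf-ocr")
    # Optional positional argument: None means "use the bundled sample PDF".
    parser.add_argument("input_pdf", nargs="?", default=None,
                        help="path to the input PDF file")
    parser.add_argument("-o", "--output", default=None,
                        help="output file path")
    parser.add_argument("-l", "--language", default="ben",
                        help="OCR language code (default: ben)")
    return parser


def resolve_output(input_pdf: str, output: Optional[str]) -> str:
    """Apply the documented default: input filename with a .txt extension."""
    return output if output else str(Path(input_pdf).with_suffix(".txt"))
```

For example, `build_parser().parse_args(["doc.pdf", "-l", "eng"])` yields `input_pdf="doc.pdf"`, `language="eng"`, and `output=None`, which `resolve_output` then turns into `doc.txt`.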

#### Examples

1. Process the default sample PDF:
   ```bash
   bangla-pdf-ocr
   ```

2. Process a specific PDF:
   ```bash
   bangla-pdf-ocr path/to/my_document.pdf
   ```

3. Specify an output file:
   ```bash
   bangla-pdf-ocr path/to/my_document.pdf -o path/to/extracted_text.txt
   ```
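
If you need to process a whole folder, the commands above can be driven from a short script. This is a minimal sketch: `build_command` and `run_batch` are hypothetical helpers, and the code assumes the CLI exits with a non-zero status on failure, which the documentation does not confirm.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_command(pdf: Path, out_dir: Optional[Path] = None,
                  language: str = "ben") -> List[str]:
    """Build the argument list for one bangla-pdf-ocr invocation."""
    cmd = ["bangla-pdf-ocr", str(pdf), "-l", language]
    if out_dir is not None:
        # Write <name>.txt into out_dir instead of next to the PDF.
        cmd += ["-o", str(out_dir / (pdf.stem + ".txt"))]
    return cmd


def run_batch(folder: Path, out_dir: Path, language: str = "ben") -> List[Path]:
    """Run the CLI on every PDF in folder; return the PDFs that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(folder.glob("*.pdf")):
        result = subprocess.run(build_command(pdf, out_dir, language))
        if result.returncode != 0:  # assumption: non-zero means failure
            failed.append(pdf)
    return failed
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.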


### Using as a Python Module

You can also use Bangla PDF OCR as a module in your Python scripts. Here's an example:

```python
from bangla_pdf_ocr import process_pdf

# Process a PDF file
input_pdf = "path/to/your/document.pdf"
output_file = "path/to/output/extracted_text.txt"
language = "ben"  # Use "ben" for Bengali or other language codes as needed

extracted_text = process_pdf(input_pdf, output_file, language)

# The extracted text is now in the 'extracted_text' variable
# and has also been saved to the output file

print(f"Text extracted and saved to: {output_file}")
```

This allows you to integrate Bangla PDF OCR functionality directly into your Python projects, giving you more control over the OCR process and enabling you to use the extracted text in your applications.
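
As an illustration of that kind of integration, the sketch below wraps `process_pdf` to handle several files in one call. `extract_many` is a hypothetical helper; it relies only on the three-argument call shown above and accepts a replacement `ocr` callable so it can be tested without the OCR stack installed.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def extract_many(
    pdf_paths: Iterable[str],
    out_dir: str,
    language: str = "ben",
    ocr: Optional[Callable[[str, str, str], str]] = None,
) -> Dict[str, str]:
    """OCR each PDF into out_dir/<name>.txt and return {pdf path: text}."""
    if ocr is None:
        # Imported lazily so the helper can be loaded without the package.
        from bangla_pdf_ocr import process_pdf
        ocr = process_pdf

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    results: Dict[str, str] = {}
    for pdf in pdf_paths:
        pdf_path = Path(pdf)
        target = out / (pdf_path.stem + ".txt")
        # Same argument order as the documented call:
        # process_pdf(input_pdf, output_file, language)
        results[str(pdf_path)] = ocr(str(pdf_path), str(target), language)
    return results
```

A call such as `extract_many(["a.pdf", "b.pdf"], "texts")` would then leave `texts/a.txt` and `texts/b.txt` on disk and return the extracted text keyed by input path.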

## Troubleshooting

If you encounter any issues:

1. Run the verification command:
   ```bash
   bangla-pdf-ocr-verify
   ```

2. For Windows users:
   - Run the `bangla-pdf-ocr-setup` and `bangla-pdf-ocr-verify` commands from a command prompt opened as administrator if you encounter permission issues.
   - Restart your command prompt or IDE after installation to ensure PATH changes take effect.

3. Check the console output and logs for any error messages.

4. If automatic installation fails, refer to the manual installation instructions provided by the setup command.

5. Ensure you have the latest version of the package:
   ```bash
   pip install --upgrade bangla-pdf-ocr
   ```

6. If problems persist, please open an issue on our GitHub repository with detailed information about the error and your system configuration.
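
When filing an issue, it helps to state whether the underlying tools are reachable at all. The sketch below is independent of `bangla-pdf-ocr-verify` and only approximates what such a check might do. `missing_tools` and `has_language` are hypothetical helpers. The executable names are assumptions based on what Tesseract and Poppler normally install (`tesseract` and `pdftoppm`).

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumed executable names: Tesseract's CLI and Poppler's PDF-to-image tool.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def has_language(list_langs_output: str, code: str = "ben") -> bool:
    """Check the output of `tesseract --list-langs` for a language code."""
    lines = list_langs_output.splitlines()
    # The first line is a header; the remaining lines are language codes.
    return code in (line.strip() for line in lines[1:])


# Possible usage (stderr is merged in case the listing is printed there):
# import subprocess
# out = subprocess.run(["tesseract", "--list-langs"], text=True,
#                      stdout=subprocess.PIPE, stderr=subprocess.STDOUT).stdout
# print("Missing tools:", missing_tools())
# print("Bengali data installed:", has_language(out))
```

Including the output of these two checks in a bug report makes it easier to tell a PATH problem from a missing Bengali language pack.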


## Reporting Issues

If you encounter any problems or have suggestions for Bangla PDF OCR:

1. Check [existing issues](https://github.com/asiff00/bangla-pdf-ocr/issues) to see if your issue has already been reported.
2. If not, [create a new issue](https://github.com/asiff00/bangla-pdf-ocr/issues/new) on our GitHub repository.
3. Provide detailed information about the problem, including steps to reproduce it.

We appreciate your feedback to help improve Bangla PDF OCR!

Happy OCR processing!

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/asiff00/bangla-pdf-ocr",
    "name": "bangla-pdf-ocr",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": null,
    "author": "Abdullah Al Asif",
    "author_email": "asif.dev.bd@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/b0/24/3d6a0e0fefbbf0222b76e510c1397a4b1d34fbe5b1f5e9e5edfdaabe9035/bangla_pdf_ocr-0.1.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A package to extract Bengali text from PDFs using OCR",
    "version": "0.1.1",
    "project_urls": {
        "Homepage": "https://github.com/asiff00/bangla-pdf-ocr"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f692e0e2788966a0e0c15b647fa6816eb99f54b487232a1278dc3a76d40101d5",
                "md5": "d1c7e2e40cd8dd36f088a5fc7ebbb4d0",
                "sha256": "95dd886ca747898c27cf72b8bad620e931c457ee2cf6d5c59688a32104595cde"
            },
            "downloads": -1,
            "filename": "bangla_pdf_ocr-0.1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d1c7e2e40cd8dd36f088a5fc7ebbb4d0",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 78262,
            "upload_time": "2024-10-12T18:18:52",
            "upload_time_iso_8601": "2024-10-12T18:18:52.145016Z",
            "url": "https://files.pythonhosted.org/packages/f6/92/e0e2788966a0e0c15b647fa6816eb99f54b487232a1278dc3a76d40101d5/bangla_pdf_ocr-0.1.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b0243d6a0e0fefbbf0222b76e510c1397a4b1d34fbe5b1f5e9e5edfdaabe9035",
                "md5": "6aab3ee2934bbb3ad741326324eb87d6",
                "sha256": "f3e1fb6691d2217ccc31009cc2613c9f8b29445cd1e384ce5becc18e42107f65"
            },
            "downloads": -1,
            "filename": "bangla_pdf_ocr-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "6aab3ee2934bbb3ad741326324eb87d6",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 81889,
            "upload_time": "2024-10-12T18:18:54",
            "upload_time_iso_8601": "2024-10-12T18:18:54.762421Z",
            "url": "https://files.pythonhosted.org/packages/b0/24/3d6a0e0fefbbf0222b76e510c1397a4b1d34fbe5b1f5e9e5edfdaabe9035/bangla_pdf_ocr-0.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-12 18:18:54",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "asiff00",
    "github_project": "bangla-pdf-ocr",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "tqdm",
            "specs": []
        },
        {
            "name": "Pillow",
            "specs": []
        },
        {
            "name": "pdf2image",
            "specs": []
        },
        {
            "name": "pytesseract",
            "specs": []
        },
        {
            "name": "colorama",
            "specs": []
        }
    ],
    "lcname": "bangla-pdf-ocr"
}
        