NT-TextFileLoader


- **Name**: NT-TextFileLoader
- **Version**: 2.0.1
- **Summary**: Python library to extract text from various file formats. The supported formats are: JPEG, PNG, PDF, DOCX, DOC, and TEXT.
- **Upload time**: 2023-12-13 16:13:32
- **Author**: Vishnu.D
- **License**: MIT
- **Keywords**: pip install nt-textloader, pip install textfileloader, pip install nt loader, pip install textloader, pip install nt-textfileloader
- **Requirements**: No requirements were recorded.

# NT-TextLoader

[![N|Solid](https://narmtech.com/img/companylogo.png)](https://nodesource.com/products/nsolid)


### Description

  *A Python module for extracting text content from various file types including PDFs, DOCX, DOC, text files, and images using Optical Character Recognition (OCR).*


### Installation Instructions

Before using this package, ensure you have installed the following system-level dependencies:

### 1. On Linux
- Tesseract OCR and LibreOffice:

  ```bash
  # Run as root or with sudo.
  # In a notebook (Colab, Jupyter), prefix each command with "!".
  apt install tesseract-ocr
  apt install libtesseract-dev
  apt-get --no-install-recommends install libreoffice -y
  apt-get install -y libreoffice-java-common
  ```

### 2. On Windows

Steps to install Tesseract on Windows:

1. Download the Tesseract installer (`.exe`) from https://github.com/UB-Mannheim/tesseract/wiki.
2. Install it to `C:\Program Files (x86)\Tesseract-OCR`.
3. Open a command prompt (in your virtual environment, if you use one) or an Anaconda Prompt.
4. Run `pip install pytesseract`.

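To check that Tesseract is installed and that pytesseract can find it, run the following at a Python prompt:
```python
import pytesseract

# Raises TesseractNotFoundError if the Tesseract binary cannot be located.
print(pytesseract.get_tesseract_version())
```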
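
On Windows, pytesseract may fail to find the Tesseract binary when its folder is not on `PATH`. The sketch below shows one way to handle that, assuming the install location from the Windows steps above. `find_tesseract` and `DEFAULT_WINDOWS_PATH` are hypothetical names introduced here for illustration; they are not part of this package.

```python
import os
import shutil

# Assumed install location from the Windows steps; adjust if you chose another folder.
DEFAULT_WINDOWS_PATH = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallbacks=(DEFAULT_WINDOWS_PATH,)):
    """Return the path to the Tesseract executable, or None if it is not found."""
    # Look on PATH first, which is the usual case on Linux.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise try the known fallback locations.
    for candidate in fallbacks:
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found; install it or add it to PATH.")
    else:
        print("Tesseract found at", path)
        try:
            import pytesseract

            # Tell pytesseract explicitly which binary to use.
            pytesseract.pytesseract.tesseract_cmd = path
        except ImportError:
            print("pytesseract is not installed; run: pip install pytesseract")
```

Setting `pytesseract.pytesseract.tesseract_cmd` only affects the current Python process, so it needs to run before the first OCR call.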

## Installation

Install the package using pip:

```bash
pip install NT-TextFileLoader
```

## Usage

```python
from NT_TextFileLoader.text_loader import TextFileLoader

# Load text from a file
file_path = 'path/to/your/file'
extracted_text = TextFileLoader.load_text(file_path, min_text_length=50)
# If the extracted text is shorter than min_text_length (50 here), OCR is used to extract the text instead.
# Increase min_text_length to make the OCR fallback more likely.
print(extracted_text)
```
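
The `min_text_length` comment above describes a fallback: try direct extraction first, then switch to OCR when too little text comes back. Here is a minimal sketch of that idea. It is not the package's actual implementation, and `load_with_fallback`, `extract_direct`, and `extract_ocr` are hypothetical names standing in for a real parser and a real OCR call.

```python
from typing import Callable


def load_with_fallback(
    path: str,
    extract_direct: Callable[[str], str],
    extract_ocr: Callable[[str], str],
    min_text_length: int = 50,
) -> str:
    """Return directly extracted text, or OCR text if the direct result is too short."""
    text = extract_direct(path).strip()
    if len(text) < min_text_length:
        # Too little text (for example, a scanned PDF): fall back to OCR.
        return extract_ocr(path).strip()
    return text


if __name__ == "__main__":
    # Stand-in extractors so the sketch runs without any files or OCR engine.
    def direct(path):
        return ""  # direct parsing finds no text layer

    def ocr(path):
        return "text recovered by OCR"

    print(load_with_fallback("scan.pdf", direct, ocr))
```

Under this reading, a larger `min_text_length` sends more files through OCR, which is slower but helps with scanned documents that carry little or no embedded text.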

## Supported File Types

- **PDF**: Extracts text from PDF files.
- **DOCX**: Extracts text from DOCX files.
- **DOC**: Extracts text from legacy DOC files.
- **Text files**: Loads text content from TXT files.
- **Images (JPG, PNG, JPEG, WEBP)**: Uses OCR to extract text from images.
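
It can help to check a file's extension before passing it to the loader. The sketch below maps the extensions listed above to a category. The `SUPPORTED` table and `file_kind` function are illustrative assumptions, and the package's own checks may differ.

```python
from pathlib import Path

# Extensions taken from the list above; the package's own checks may differ.
SUPPORTED = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".doc": "doc",
    ".txt": "text",
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
    ".webp": "image",
}


def file_kind(path):
    """Return the category for a path, or None if the extension is not listed."""
    return SUPPORTED.get(Path(path).suffix.lower())


if __name__ == "__main__":
    for name in ["report.PDF", "scan.jpeg", "notes.txt", "archive.zip"]:
        print(name, "->", file_kind(name))
```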

## Requirements

- PyPDF2
- python-docx
- Pillow
- pytesseract (for image-based text extraction)
- langchain
- unstructured
- docx2txt
- PyMuPDF
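
The package metadata records no requirements, so pip may not install these libraries automatically. The sketch below checks which of them can be imported in the current environment. Several of them import under a name that differs from the name used with `pip install`, which the `IMPORT_NAMES` table accounts for. `check_requirements` is a hypothetical helper, not part of this package.

```python
import importlib.util

# Distribution name -> import name. Several packages import under a different name.
IMPORT_NAMES = {
    "PyPDF2": "PyPDF2",
    "python-docx": "docx",
    "Pillow": "PIL",
    "pytesseract": "pytesseract",
    "langchain": "langchain",
    "unstructured": "unstructured",
    "docx2txt": "docx2txt",
    "PyMuPDF": "fitz",
}


def check_requirements():
    """Map each distribution name to True if its module can be found."""
    return {
        dist: importlib.util.find_spec(module) is not None
        for dist, module in IMPORT_NAMES.items()
    }


if __name__ == "__main__":
    for dist, available in check_requirements().items():
        status = "ok" if available else "missing (pip install " + dist + ")"
        print(dist + ": " + status)
```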

## Contributions

Contributions, issues, and feature requests are welcome!

## License

This project is licensed under the MIT License.

            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "NT-TextFileLoader",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "pip install NT-TextLoader,pip install TextFileLoader,pip install NT Loader,pip install textloader,pip install nt-textfileloader,pip install textfileloader",
    "author": "Vishnu.D",
    "author_email": "vishnu.d@narmtech.com",
    "download_url": "https://files.pythonhosted.org/packages/ee/60/3a28d710dc60fa079e8822857ec991a9e238065696e57c8b72b4a744d342/NT_TextFileLoader-2.0.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Python library to extract text from various file formats. The supported formats are: JPEG, PNG, PDF, DOCX, DOC, and TEXT.",
    "version": "2.0.1",
    "project_urls": null,
    "split_keywords": [
        "pip install nt-textloader",
        "pip install textfileloader",
        "pip install nt loader",
        "pip install textloader",
        "pip install nt-textfileloader",
        "pip install textfileloader"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ee603a28d710dc60fa079e8822857ec991a9e238065696e57c8b72b4a744d342",
                "md5": "89496d6f2005652e87adcf4ce67107e4",
                "sha256": "797e967222b4dad517090df6038e45e78843b584b15c7f90ceab7ef50f4ffe4c"
            },
            "downloads": -1,
            "filename": "NT_TextFileLoader-2.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "89496d6f2005652e87adcf4ce67107e4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 5454,
            "upload_time": "2023-12-13T16:13:32",
            "upload_time_iso_8601": "2023-12-13T16:13:32.375927Z",
            "url": "https://files.pythonhosted.org/packages/ee/60/3a28d710dc60fa079e8822857ec991a9e238065696e57c8b72b4a744d342/NT_TextFileLoader-2.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-12-13 16:13:32",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "nt-textfileloader"
}
        