# NT-TextLoader
[![N|Solid](https://narmtech.com/img/companylogo.png)](https://nodesource.com/products/nsolid)
### Description
*A Python module for extracting text content from various file types including PDFs, DOCX, DOC, text files, and images using Optical Character Recognition (OCR).*
### Installation Instructions
Before using this package, ensure you have installed the following system-level dependencies:
### 1. On Linux
- Tesseract OCR and LibreOffice:
```bash
# Run as root or with sudo; in a Jupyter/Colab notebook, prefix each command with "!".
apt install tesseract-ocr
apt install libtesseract-dev
apt-get --no-install-recommends install libreoffice -y
apt-get install -y libreoffice-java-common
```
### 2. On Windows
Steps to install Tesseract on Windows:
1. Download the Tesseract installer from https://github.com/UB-Mannheim/tesseract/wiki.
2. Install it to C:\Program Files (x86)\Tesseract-OCR.
3. Open a Windows Command Prompt or an Anaconda Prompt.
4. Run `pip install pytesseract`.
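To check that pytesseract can find Tesseract, run the following at a Python prompt:
```python
import pytesseract
# Prints the Tesseract version; raises TesseractNotFoundError if the binary cannot be found.
print(pytesseract.get_tesseract_version())
```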
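If the check fails on Windows, a common cause is that the install directory is not on `PATH`. The following is a minimal sketch, using only the standard library, for locating the executable. The helper name `find_tesseract` is hypothetical and not part of this package; the fallback directory is the install location from step 2 of the Windows instructions.

```python
import shutil
from pathlib import Path

# Install directory suggested in the Windows steps; adjust if you installed elsewhere.
DEFAULT_WINDOWS_DIR = Path(r"C:\Program Files (x86)\Tesseract-OCR")


def find_tesseract(extra_dirs=(DEFAULT_WINDOWS_DIR,)):
    """Return the path to the tesseract executable as a string, or None if not found."""
    # First try whatever is reachable through PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Then try the extra directories, checking both executable names.
    for directory in extra_dirs:
        for name in ("tesseract.exe", "tesseract"):
            candidate = Path(directory) / name
            if candidate.is_file():
                return str(candidate)
    return None


if __name__ == "__main__":
    print(find_tesseract() or "tesseract not found")
```

If this returns a path that is not on `PATH`, pytesseract can be pointed at it by assigning that path to `pytesseract.pytesseract.tesseract_cmd` before calling any OCR function.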
## Installation
Install the package using pip:
```bash
pip install NT-TextFileLoader
```
## Usage
```python
from NT_TextFileLoader.text_loader import TextFileLoader
# Load text from a file
file_path = 'path/to/your/file'
extracted_text = TextFileLoader.load_text(file_path, min_text_length=50)
# If the extracted text is shorter than min_text_length (50 here), OCR is used instead.
# Increase min_text_length to make the OCR fallback more likely.
print(extracted_text)
```
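The `min_text_length` threshold can be pictured as a simple fallback rule. The sketch below is an illustration of that idea only, not the package's actual implementation: `extract_with_fallback`, `direct_extract`, and `ocr_extract` are hypothetical names, and stripping whitespace before measuring the length is an assumption.

```python
from typing import Callable


def extract_with_fallback(
    path: str,
    direct_extract: Callable[[str], str],
    ocr_extract: Callable[[str], str],
    min_text_length: int = 50,
) -> str:
    """Return directly extracted text, or OCR text when the direct result is too short."""
    text = direct_extract(path)
    # Very short output often means a scanned document with no embedded text layer.
    if len(text.strip()) < min_text_length:
        return ocr_extract(path)
    return text


if __name__ == "__main__":
    # Stand-in extractors so the sketch runs without any OCR dependency.
    scanned = extract_with_fallback("scan.pdf", lambda p: "", lambda p: "text from OCR")
    print(scanned)
```

Under this rule, a larger `min_text_length` sends more documents down the OCR path, which is slower but helps with scanned PDFs that contain only a few characters of embedded text.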
## Supported File Types
- **PDF**: Extracts text from PDF files.
- **DOCX**: Extracts text from DOCX files.
- **DOC**: Extracts text from legacy DOC files.
- **Text files**: Loads text content from TXT files.
- **Images (JPG, PNG, JPEG, WEBP)**: Uses OCR to extract text from images.
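To check up front whether a file falls into one of the supported types, a small extension lookup is enough. This is a hedged sketch: the mapping mirrors the list above, but the function `classify_file` and the category names are hypothetical, and the package may detect file types differently.

```python
from pathlib import Path

# Extension groups taken from the list above; the category names are illustrative.
EXTENSION_CATEGORIES = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".doc": "doc",
    ".txt": "text",
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
    ".webp": "image",
}


def classify_file(path):
    """Return the category for a path, or raise ValueError for unsupported types."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_CATEGORIES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(no extension)'}") from None


if __name__ == "__main__":
    for name in ("report.PDF", "notes.txt", "scan.webp"):
        print(name, "->", classify_file(name))
```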
## Requirements
- PyPDF2
- python-docx
- Pillow
- pytesseract (for image-based text extraction)
- langchain
- unstructured
- docx2txt
- PyMuPDF
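Several of these packages are imported under a different name than the one used with pip (for example, `python-docx` is imported as `docx` and `PyMuPDF` as `fitz`). The sketch below checks which of them are importable in the current environment; `missing_requirements` is a hypothetical helper, not something this package provides.

```python
from importlib.util import find_spec

# Distribution name -> import name. Several differ from the name used with pip.
IMPORT_NAMES = {
    "PyPDF2": "PyPDF2",
    "python-docx": "docx",
    "Pillow": "PIL",
    "pytesseract": "pytesseract",
    "langchain": "langchain",
    "unstructured": "unstructured",
    "docx2txt": "docx2txt",
    "PyMuPDF": "fitz",
}


def missing_requirements(import_names=IMPORT_NAMES):
    """Return the distribution names whose top-level module cannot be found."""
    return [dist for dist, module in import_names.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_requirements()
    print("Missing:", ", ".join(missing) if missing else "none")
```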
## Contributions
Contributions, issues, and feature requests are welcome!
## License
This project is licensed under the MIT License.
Raw data
{
"_id": null,
"home_page": "",
"name": "NT-TextFileLoader",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "pip install NT-TextLoader,pip install TextFileLoader,pip install NT Loader,pip install textloader,pip install nt-textfileloader,pip install textfileloader",
"author": "Vishnu.D",
"author_email": "vishnu.d@narmtech.com",
"download_url": "https://files.pythonhosted.org/packages/ee/60/3a28d710dc60fa079e8822857ec991a9e238065696e57c8b72b4a744d342/NT_TextFileLoader-2.0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Python library to extract text from various file formats. The supported formats are: JPEG, PNG, PDF, DOCX, DOC, and TEXT.",
"version": "2.0.1",
"project_urls": null,
"split_keywords": [
"pip install nt-textloader",
"pip install textfileloader",
"pip install nt loader",
"pip install textloader",
"pip install nt-textfileloader",
"pip install textfileloader"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "ee603a28d710dc60fa079e8822857ec991a9e238065696e57c8b72b4a744d342",
"md5": "89496d6f2005652e87adcf4ce67107e4",
"sha256": "797e967222b4dad517090df6038e45e78843b584b15c7f90ceab7ef50f4ffe4c"
},
"downloads": -1,
"filename": "NT_TextFileLoader-2.0.1.tar.gz",
"has_sig": false,
"md5_digest": "89496d6f2005652e87adcf4ce67107e4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 5454,
"upload_time": "2023-12-13T16:13:32",
"upload_time_iso_8601": "2023-12-13T16:13:32.375927Z",
"url": "https://files.pythonhosted.org/packages/ee/60/3a28d710dc60fa079e8822857ec991a9e238065696e57c8b72b4a744d342/NT_TextFileLoader-2.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-12-13 16:13:32",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "nt-textfileloader"
}