Name | DataXtractor |
Version | 1.0.6 |
home_page | https://github.com/Rahulkatoch99/DataXtractor |
Summary | DataXtractor is a versatile Python library designed to simplify the extraction of valuable data from a variety of sources, including images and PDF documents. Whether you need to extract text, tables, or structured content, DataXtractor provides powerful and intuitive tools to streamline the process. |
upload_time | 2023-10-12 14:01:10 |
maintainer | |
docs_url | None |
author | Rahul Katoch |
requires_python | >=3.6 |
license | MIT |
keywords | data, extraction, pdf, image |
requirements | No requirements were recorded. |
# DataXtractor Library
DataXtractor is a versatile library designed to extract text from PDF documents, with the ability to handle images and multi-column layouts. This README file provides an overview of the library's capabilities and how to use it effectively.
## Features
The DataXtractor library offers the following key features:
### 1. Image to Text Extraction
DataXtractor is equipped to handle PDFs containing images. It utilizes Optical Character Recognition (OCR) to convert images embedded in PDF files into machine-readable text. This allows you to access and manipulate the textual content within images present in your PDF documents.
### 2. Multi-Column Text Extraction
If your PDF contains text arranged in multiple columns, DataXtractor can separate the columns and extract the content of each one independently, so the text is returned in a structured and organized manner instead of being interleaved across columns.
### 3. Language Support
DataXtractor supports multiple languages for OCR operations. You can specify the language code string using the `lang` parameter. If no language is specified, the library defaults to English (`eng`). You can also specify multiple languages for a more comprehensive text extraction process. The supported language codes are:
```
supported_language_codes = [
"ara", "aze", "aze_cyrl", "bel", "ben", "bod", "bos", "bul", "cat", "ceb", "ces", "chi_sim", "chi_sim_vert", "chi_tra",
"chi_tra_vert", "chr", "cym", "dan", "deu", "deu-frak", "ell", "eng", "enm", "epo", "est", "eus", "fas", "fil", "fin",
"fra", "frk", "frm", "fry", "gle", "glg", "grc", "guj", "hat", "heb", "hin", "hrv", "hun", "hye", "iku", "ind", "isl",
"ita", "ita-old", "jav", "jpn", "jpn_vert", "kan", "kat", "kat-old", "kaz", "khm", "kir", "kor", "kor_vert", "lao",
"lat", "lav", "lit", "ltz", "mal", "mar", "mkd", "mlt", "mon", "mri", "msa", "mya", "nep", "nld", "nor", "oci", "ori",
"osd", "pan", "pol", "por", "pus", "ron", "rus", "san", "sin", "slk", "slv", "snd", "spa", "spa_old", "sqi", "srp",
"srp_latn", "sun", "swa", "swe", "syr", "tam", "tat", "tel", "tgk", "tha", "tir", "ton", "tur", "uig", "ukr", "urd",
"uzb", "uzb_cyrl", "vie", "yid", "yor"
]
```
Specifying several languages is especially useful when working with PDFs that contain text in more than one language.
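The README does not show how several languages are combined into one `lang` string. The codes listed above match the Tesseract OCR language codes, and Tesseract joins multiple languages with `+` (for example `eng+hin`). The sketch below assumes DataXtractor follows the same convention; `build_lang` and the `SUPPORTED` subset are hypothetical helpers for illustration, not part of the library.

```python
# Hypothetical helper, not part of DataXtractor's documented API.
# SUPPORTED holds only a small subset of the codes listed above.
SUPPORTED = frozenset({"eng", "fra", "deu", "spa", "hin", "chi_sim"})


def build_lang(codes):
    """Join language codes Tesseract-style, e.g. ["eng", "hin"] -> "eng+hin"."""
    if not codes:
        return "eng"  # the README states that English is the default
    unknown = [c for c in codes if c not in SUPPORTED]
    if unknown:
        raise ValueError(f"unsupported language code(s): {', '.join(unknown)}")
    # Drop duplicates while keeping the caller's order.
    return "+".join(dict.fromkeys(codes))


print(build_lang(["eng", "hin"]))  # eng+hin
```

Validating the codes before calling the OCR step turns a typo such as `en` into an immediate, readable error.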
### 4. PDF Text Extraction
DataXtractor is not limited to image-based PDFs. It can also extract text directly from PDF documents that contain text content. This feature allows you to process PDF files, whether they contain text alone or a combination of text and images.
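One common way to handle PDFs that may or may not carry a text layer is to try direct extraction first and fall back to OCR when almost nothing comes back. The sketch below illustrates that pattern only; it is not taken from DataXtractor's source. The function name, the `min_chars` threshold, and the two injected callables are assumptions, and the lambdas are stand-ins so the example runs without a PDF on disk.

```python
from typing import Callable


def extract_with_fallback(
    path: str,
    read_text_layer: Callable[[str], str],
    run_ocr: Callable[[str], str],
    min_chars: int = 20,  # assumed threshold; tune it for your documents
) -> str:
    """Try the PDF's text layer first; fall back to OCR when it looks empty."""
    text = read_text_layer(path)
    if len(text.strip()) >= min_chars:
        return text
    return run_ocr(path)


# Stand-in extractors: the "text layer" is blank, so the OCR result is used.
scanned = extract_with_fallback(
    "scan.pdf",
    read_text_layer=lambda p: "  \n",
    run_ocr=lambda p: "text recovered by OCR",
)
print(scanned)  # text recovered by OCR
```

In real use, the two callables could wrap the PDF and OCR functions shown in the examples later in this README.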
## Getting Started
To get started with the DataXtractor library, follow these steps:
1. **Installation**: Install the DataXtractor library from PyPI with `pip install DataXtractor`, or manually include it in your project.
2. **Library Initialization**: Initialize the DataXtractor library in your code, specifying the language(s) to use for OCR, as well as any other required parameters.
3. **PDF Processing**: Load your PDF document and apply the appropriate extraction functions based on your needs. For image-based PDFs, use OCR to convert images to text. For text-based PDFs, extract text directly.
4. **Output Handling**: Receive the extracted text and use it as needed for further processing or analysis within your application.
## Example
You can convert a PDF into an image and then perform OCR on that image using two different languages. Additionally, you can crop the image into two parts for separate OCR processing.
```python
from pdf2image_dataextraction import data_extraction2parts
path = "sample.pdf"
# Widths of the two parts, passed as strings; presumably percentages of the page width.
left_partition = "40"
right_partition = "60"
# OCR language code for each part, taken from the supported codes listed above.
lang_part_first = "eng"
lang_part_second = "eng"
data = data_extraction2parts.DATA_EXTRACTION_2_PARTS(
path, left_partition, right_partition, lang_part_first, lang_part_second
)
print(data)
```
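The README does not document what the partition values mean. Assuming they are percentages of the page width, the split could be turned into two crop rectangles as sketched below. `column_boxes` is a hypothetical helper, not a DataXtractor function; the boxes use the `(left, upper, right, lower)` layout that Pillow's `Image.crop` accepts.

```python
import math


def column_boxes(width, height, left_pct, right_pct):
    """Return (left, upper, right, lower) pixel boxes for the two parts."""
    # Accept strings such as "40", as used in the example above.
    left_pct, right_pct = float(left_pct), float(right_pct)
    if left_pct <= 0 or right_pct <= 0:
        raise ValueError("percentages must be positive")
    if not math.isclose(left_pct + right_pct, 100.0):
        raise ValueError("percentages must sum to 100")
    split_x = round(width * left_pct / 100)
    return (0, 0, split_x, height), (split_x, 0, width, height)


left_box, right_box = column_boxes(1000, 1400, "40", "60")
print(left_box, right_box)  # (0, 0, 400, 1400) (400, 0, 1000, 1400)
```

Each box could then be cropped from the page image and sent to OCR with its own language code.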
## You can convert a PDF into an image, perform OCR on that image, and extract data
```python
from pdf2image_dataextraction import data_extract
path = "sample.pdf"
# Assumes DATA_EXTRACTION is provided by the imported data_extract module.
data = data_extract.DATA_EXTRACTION(path)
print(data)
```
## You can also extract data from a PDF
```python
from pdf_dataextraction import data_extraction_pdf
path = "sample.pdf"
data = data_extraction_pdf.extract_text_from_pdf(path)
print(data)
```
## You can also extract data from images
```python
from image_dataextraction import data_imageextraction
path = "sample.jpeg"
data = data_imageextraction.Image_extraction(path)
print(data)
```
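OCR output often contains stray spaces, extra blank lines, and words hyphenated across line breaks. The sketch below shows one possible clean-up step for the returned text. It assumes the extraction functions return a plain string, which the README does not state explicitly, and `tidy_ocr_text` is a hypothetical helper rather than part of DataXtractor.

```python
import re


def tidy_ocr_text(raw: str) -> str:
    """Normalise whitespace and rejoin words hyphenated across line breaks."""
    text = raw.replace("\r\n", "\n")
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # "extrac-\ntion" -> "extraction"
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)          # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line
    return text.strip()


sample = "Data  extrac-\ntion from   images \n\n\n\nworks."
print(tidy_ocr_text(sample))
```

Note that the hyphen rule also joins genuinely hyphenated words that happen to break at a line end, so it may need adjusting for your documents.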
## Contribute
If you find any issues or want to contribute to the DataXtractor library, please check the project's repository for information on how to get involved.
## License
This library is released under the [MIT License](LICENSE) to encourage collaboration and use in various applications.
---
DataXtractor is a powerful library for extracting text from PDF documents, whether they contain images, multi-column layouts, or plain text. It supports multiple languages and can be a valuable tool for text extraction and data analysis in a wide range of applications.
Raw data
{
"_id": null,
"home_page": "https://github.com/Rahulkatoch99/DataXtractor",
"name": "DataXtractor",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "data extraction pdf image",
"author": "Rahul Katoch",
"author_email": "rahulkatoch99@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/d2/53/44dfd52559841686ba6b473efb185cba43d7c185844d71d71557de9aaad6/DataXtractor-1.0.6.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "DataXtractor is a versatile Python library designed to simplify the extraction of valuable data from a variety of sources, including images and PDF documents. Whether you need to extract text, tables, or structured content, DataXtractor provides powerful and intuitive tools to streamline the process.",
"version": "1.0.6",
"project_urls": {
"Homepage": "https://github.com/Rahulkatoch99/DataXtractor"
},
"split_keywords": [
"data",
"extraction",
"pdf",
"image"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "6675a29e9460daea6f28161c11276acc2012ed272003c7086913504d5a1aaf6a",
"md5": "2ea0b54c663cbc35877bd95537af4f37",
"sha256": "26ff0f2d3858dd301987518b409b2bc64739d6110e75dc9dda5707bbed57054a"
},
"downloads": -1,
"filename": "DataXtractor-1.0.6-py3-none-any.whl",
"has_sig": false,
"md5_digest": "2ea0b54c663cbc35877bd95537af4f37",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 6278,
"upload_time": "2023-10-12T14:01:07",
"upload_time_iso_8601": "2023-10-12T14:01:07.609565Z",
"url": "https://files.pythonhosted.org/packages/66/75/a29e9460daea6f28161c11276acc2012ed272003c7086913504d5a1aaf6a/DataXtractor-1.0.6-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "d25344dfd52559841686ba6b473efb185cba43d7c185844d71d71557de9aaad6",
"md5": "cc45a9b1f2fa706278ef542d412af0f4",
"sha256": "0fe2dd617dccbe7ca376178662fa1e5cfe7e3dc8a1f29268667b9084c0c10ad6"
},
"downloads": -1,
"filename": "DataXtractor-1.0.6.tar.gz",
"has_sig": false,
"md5_digest": "cc45a9b1f2fa706278ef542d412af0f4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 5213,
"upload_time": "2023-10-12T14:01:10",
"upload_time_iso_8601": "2023-10-12T14:01:10.291669Z",
"url": "https://files.pythonhosted.org/packages/d2/53/44dfd52559841686ba6b473efb185cba43d7c185844d71d71557de9aaad6/DataXtractor-1.0.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-10-12 14:01:10",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Rahulkatoch99",
"github_project": "DataXtractor",
"github_not_found": true,
"lcname": "dataxtractor"
}