# Extractable: Table Extraction from PDFs using Machine Learning
Extractable is an open-source library designed to bring the power of state-of-the-art machine learning to everyone. Our goal is to make it easy for anyone to extract tables from PDFs, regardless of their technical expertise. Extractable is built on top of Microsoft's Open Source Table Transformer (TATR) library, which we have expanded to include a variety of new features and improvements.
## Features
Extractable is designed to be easy to use and highly flexible. Some of its key features include:
- **Table Extraction from PDFs**: Extractable uses machine learning models to extract tables from PDFs, enabling users to easily extract data from large datasets.
- **Open-Source and Collaborative**: Extractable is an open-source library designed for easy collaboration and contributions from the community.
- **PDF Test Table Generator**: We have developed a unique dataset that simulates real-world scenarios, which we use to benchmark machine learning models, identify their challenges, and improve on specific areas.
- **Comparative Analyses**: We have conducted extensive comparative analyses of various machine learning models to determine their effectiveness in extracting tables from PDFs.
- **Robust Data Pipelines**: We have designed and implemented robust data pipelines for processing and analyzing large volumes of PDF data, with a focus on code-readability and sustainability.
## Installation
To install Extractable, simply use pip:
```pip install Extractable```
Extractable is designed to be used with Python 3.10.
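The package metadata for this release declares `requires_python` as `>=3.9,<3.12`. Below is a minimal sketch of checking an interpreter against that range before installing. The helper name `is_supported_python` is hypothetical and not part of Extractable.

```python
import sys

# Bounds taken from the package metadata (requires_python: >=3.9,<3.12).
MIN_VERSION = (3, 9)
MAX_VERSION_EXCLUSIVE = (3, 12)


def is_supported_python(version_info=None):
    """Return True if the (major, minor) version falls in the declared range.

    Hypothetical helper for illustration; not part of the Extractable API.
    """
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    print("Supported interpreter:", is_supported_python())
```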
## Usage
To use Extractable, simply import the library and use its functions. We provide comprehensive documentation to get started with the library.
```python
import extractable
input_file = "path_to/your_input.pdf"
output_file = "path_to/your_preferred_output"
# Extract tables from a PDF file
tables = extractable.Extractor.extract(input_file, output_file)
# That's how simple it is!
```
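To process a whole folder of PDFs, the call above can be wrapped in a loop. The sketch below assumes only the call shape shown in the example, `extract(input_file, output_file)`. The helper `extract_all` is hypothetical, and naming one output location per PDF is an assumption, since this README does not say whether the output path is a file or a folder.

```python
from pathlib import Path


def extract_all(pdf_dir, output_dir, extract_fn):
    """Call extract_fn(input_file, output_file) for every PDF in pdf_dir.

    extract_fn is expected to follow the call shape from the usage example,
    e.g. extractable.Extractor.extract. Returns a dict that maps each PDF
    file name to whatever extract_fn returned for it.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # Assumption: one output location per input, named after the PDF.
        output_file = output_dir / pdf_path.stem
        results[pdf_path.name] = extract_fn(str(pdf_path), str(output_file))
    return results
```

With the library installed, this could be called as `extract_all("path_to/pdfs", "path_to/output", extractable.Extractor.extract)`. Passing the extract function in as an argument also makes it possible to try the loop with a stub before running the real models.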
## Architecture
The architecture and internal dependencies of this codebase are visualized with the 'Codesee' extension.
On every pull request, a Codesee bot analyzes the changes in the architecture and visualizes them. Codesee also provides on-demand insights into the code. The result looks as follows:
![Code and dependency Architecture of the codebase](Extractable_Architecture_3_10_2023.png)
## Contributing
Extractable is an open-source project and we welcome contributions from the community. If you would like to contribute, please take a look at our contribution guidelines and feel free to reach out to us on our GitHub repository.
## Maintainers
As a maintainer, you will want to test and publish new versions of Extractable. Testing is done with pytest, and publishing goes through PyPI, which makes a release available via
```pip install```. The guides below cover the different tasks:
### For Testing
Before publishing, run the tests to check that your newly written code has not changed the existing functionality of the codebase.
This is called regression testing, and it is done by running the unit and e2e tests with the following command:
```pdm run pytest -k "tests/ and Test_"```
After all tests have passed, you may want to check how the library works in real-life use. For this, do not upload the library to the official pypi.org index yet; upload it to test.pypi.org first:
1. In the command line, type ```pdm build```, which automatically builds the ```extractable-[version].tar.gz``` and ```extractable-[version]-py3-none-any.whl``` files needed for PyPI.
2. To push the build to the testing environment, enter the following in the command line: ```twine upload --repository-url https://test.pypi.org/legacy/ dist/extractable-[version]* --verbose``` and then enter your login info.
3. Open a new project in your IDE. To be safe, first enter ```pip uninstall Extractable```, then install the testing library with: ```python -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple Extractable==[version]```

That's it: you have now built and published your new version of the library to the testing environment.
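The build and upload steps can also be scripted. The sketch below only assembles the same `pdm build` and `twine upload` commands listed in this guide and runs them with `subprocess`. The function names are hypothetical, and the script assumes `pdm` and `twine` are on the `PATH` and that `twine` will prompt for credentials as usual.

```python
import glob
import subprocess

TEST_PYPI_UPLOAD_URL = "https://test.pypi.org/legacy/"


def find_dist_files(version, dist_dir="dist"):
    """Expand dist/extractable-[version]* the way the shell would."""
    return sorted(glob.glob(f"{dist_dir}/extractable-{version}*"))


def build_upload_command(dist_files, use_test_pypi=True):
    """Return the twine argument list for uploading dist_files."""
    command = ["twine", "upload"]
    if use_test_pypi:
        # Same flag and URL as the manual test.pypi.org upload step.
        command += ["--repository-url", TEST_PYPI_UPLOAD_URL]
    command += list(dist_files)
    command.append("--verbose")
    return command


def release(version, use_test_pypi=True):
    """Build with pdm, then upload the artifacts for the given version."""
    subprocess.run(["pdm", "build"], check=True)
    files = find_dist_files(version)
    if not files:
        raise FileNotFoundError(f"no build artifacts found for version {version}")
    subprocess.run(build_upload_command(files, use_test_pypi), check=True)
```

Calling `release("[version]")` targets test.pypi.org by default, so uploading to the real index needs an explicit `use_test_pypi=False`.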
### For Publishing
1. Go to ```src/version.py``` and change the version to the new version (old version +1). Then go to the ```pyproject.toml``` file and, inside the ```[project]``` section, change the ```version``` to the same new version.
2. In the command line, type ```pdm build```, which automatically builds the ```extractable-[version].tar.gz``` and ```extractable-[version]-py3-none-any.whl``` files needed for PyPI.
3. To push the build to the real PyPI environment, enter the following in the command line: ```twine upload dist/extractable-[version]* --verbose``` and then enter your login info.
4. Open a new project in your IDE. To be safe, first enter ```pip uninstall Extractable```, then install the published library with: ```python -m pip install Extractable==[version]```

That's it: you have now built and published your new version of the library to PyPI.
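The version bump in step 1 is easy to get out of sync between the two files. The sketch below shows one way to compute the next version and rewrite the ```version``` key in the ```[project]``` section of ```pyproject.toml```. Both function names are hypothetical. Reading "old version +1" as "increment the last numeric component" is an assumption. The rewrite is line-based rather than a full TOML parser, and it does not touch ```src/version.py```, whose layout is not described here.

```python
import re


def bump_version(version):
    """Increment the last numeric component, e.g. '0.0.119' -> '0.0.120'."""
    parts = version.split(".")
    parts[-1] = str(int(parts[-1]) + 1)
    return ".".join(parts)


def set_project_version(pyproject_text, new_version):
    """Replace the version value inside the [project] table of pyproject.toml text."""
    lines = pyproject_text.splitlines(keepends=True)
    in_project = False
    replaced = False
    for index, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("["):
            # A new table header starts; track whether it is [project].
            in_project = stripped == "[project]"
        elif in_project and not replaced and re.match(r"version\s*=", stripped):
            newline = "\n" if line.endswith("\n") else ""
            lines[index] = f'version = "{new_version}"{newline}'
            replaced = True
    if not replaced:
        raise ValueError("no version key found in the [project] table")
    return "".join(lines)
```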
## License
This software is free to use, and I encourage anyone who finds it useful to use it in any way they see fit. While I have not applied any license to the software, I do ask that users respect Microsoft's authorship of the TATR software and give appropriate attribution when sharing or distributing it. Please note that I make no warranties or guarantees about the software's functionality, and I am not liable for any damages resulting from its use.
## Acknowledgments
We would like to thank Microsoft for developing the TATR library and making it open-source. We have built upon their work to create Extractable, and we are grateful for their contribution to the open-source community.
Raw data
{
"_id": null,
"home_page": "",
"name": "Extractable",
"maintainer": "",
"docs_url": null,
"requires_python": "<3.12,>=3.9",
"maintainer_email": "",
"keywords": "python table-extraction pdf TATR Table Transformer Computer Vision",
"author": "",
"author_email": "Suleymen C. Kandrouch <suleyleeuw@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/42/ed/45167fc1e8b7bfa6a75e3b57920d296be9002ca3c34f1b5dc9b121d7a6eb/extractable-0.0.119.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Extract tables from PDFs",
"version": "0.0.119",
"project_urls": {
"Downloads": "https://github.com/SuleyNL/Extractable/archive/refs/tags/v1.0.0.tar.gz",
"Homepage": "https://github.com/SuleyNL/Extractable"
},
"split_keywords": [
"python",
"table-extraction",
"pdf",
"tatr",
"table",
"transformer",
"computer",
"vision"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "d59fcf69b1a60046f12f8b39fa41c3265811a9f0227c56d2ce2bc30c90a96264",
"md5": "f71746f4f96b42d9bb89c1e4d21ee2f7",
"sha256": "33db599d66cc6fd8dae69f17db317ce272e3101c0d2fde9a748a5898e58732ac"
},
"downloads": -1,
"filename": "extractable-0.0.119-py3-none-any.whl",
"has_sig": false,
"md5_digest": "f71746f4f96b42d9bb89c1e4d21ee2f7",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.12,>=3.9",
"size": 539350,
"upload_time": "2023-10-21T10:32:34",
"upload_time_iso_8601": "2023-10-21T10:32:34.923687Z",
"url": "https://files.pythonhosted.org/packages/d5/9f/cf69b1a60046f12f8b39fa41c3265811a9f0227c56d2ce2bc30c90a96264/extractable-0.0.119-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "42ed45167fc1e8b7bfa6a75e3b57920d296be9002ca3c34f1b5dc9b121d7a6eb",
"md5": "6becaf29d94284cac90857babdbd6b6b",
"sha256": "390ce9d7674eaa3725675ede807bd5e3cf70328ebcf72dfec9a17705dedaa49a"
},
"downloads": -1,
"filename": "extractable-0.0.119.tar.gz",
"has_sig": false,
"md5_digest": "6becaf29d94284cac90857babdbd6b6b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<3.12,>=3.9",
"size": 10511188,
"upload_time": "2023-10-21T10:32:36",
"upload_time_iso_8601": "2023-10-21T10:32:36.929141Z",
"url": "https://files.pythonhosted.org/packages/42/ed/45167fc1e8b7bfa6a75e3b57920d296be9002ca3c34f1b5dc9b121d7a6eb/extractable-0.0.119.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-10-21 10:32:36",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "SuleyNL",
"github_project": "Extractable",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "extractable"
}