patent-parsing-tools
====================
USPTO patents dataset generator.
[![Documentation Status](https://readthedocs.org/projects/patent-parsing-tools/badge/?version=latest)](https://patent-parsing-tools.readthedocs.io/en/latest/?badge=latest)
[![patent-parsing-tools CI](https://github.com/pprzetacznik/patent-parsing-tools/workflows/patent-parsing-tools%20CI/badge.svg)](https://github.com/pprzetacznik/patent-parsing-tools/actions?query=workflow%3A"patent-parsing-tools+CI")
[![PyPI version](https://badge.fury.io/py/patent-parsing-tools.svg)](https://pypi.org/project/patent-parsing-tools/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/patent-parsing-tools)](https://pypi.org/project/patent-parsing-tools/)
## Documentation
[Read the docs](https://patent-parsing-tools.readthedocs.io/en/latest/)
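## System requirements
The command below targets RPM-based distributions (those using `yum`). It installs the Python, libxslt and libxml2 development headers, which are typically needed when `lxml` has to be built from source.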
```Bash
sudo yum install python-devel libxslt-devel libxml2-devel
```
## Installation
```Bash
pip install patent-parsing-tools
```
## Examples
Download the dataset:
```Bash
python -m patent_parsing_tools.downloader \
--directory dataset \
--year-from 2010 \
--year-to 2010
```
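When the download has to be scripted, for example one run per year, the same command can be assembled from Python. The sketch below is a minimal example that relies only on the flags shown above. The helper name `downloader_command` is hypothetical and is not part of the package.

```python
import sys


def downloader_command(directory, year_from, year_to):
    """Build the argv list for the downloader CLI shown above.

    Hypothetical helper: it only assembles the documented flags
    (--directory, --year-from, --year-to) and does not run anything.
    """
    if year_from > year_to:
        raise ValueError("year_from must not be greater than year_to")
    return [
        sys.executable, "-m", "patent_parsing_tools.downloader",
        "--directory", str(directory),
        "--year-from", str(year_from),
        "--year-to", str(year_to),
    ]


if __name__ == "__main__":
    # Print one command per year. To execute a command instead,
    # pass the list to subprocess.run(cmd, check=True).
    for year in range(2010, 2013):
        print(" ".join(downloader_command("dataset", year, year)))
```

Building the command as a list rather than a single string avoids shell quoting problems if the target directory contains spaces.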
Collect and serialize the data:
```Bash
python -m patent_parsing_tools.supervisor \
--working-directory patents/working_directory \
--train-destination patents/train_destination \
--test-destination patents/test_destination \
--year-from 2014 \
--year-to 2015
```
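After the supervisor run finishes, it can be useful to check how much data landed in each destination. The sketch below simply counts files in the two directories used above. It makes no assumption about the serialization format, so a file count is not necessarily a patent count. The function names `count_files` and `split_summary` are hypothetical helpers, not part of the package.

```python
from pathlib import Path


def count_files(directory):
    """Count regular files under a directory, recursively (0 if it does not exist)."""
    root = Path(directory)
    if not root.is_dir():
        return 0
    return sum(1 for path in root.rglob("*") if path.is_file())


def split_summary(train_dir, test_dir):
    """Return (train_count, test_count, test_fraction) for two output directories."""
    train = count_files(train_dir)
    test = count_files(test_dir)
    total = train + test
    fraction = test / total if total else 0.0
    return train, test, fraction


if __name__ == "__main__":
    # Directory names taken from the supervisor example above.
    train, test, fraction = split_summary(
        "patents/train_destination", "patents/test_destination"
    )
    print(f"train files: {train}, test files: {test}, test fraction: {fraction:.2%}")
```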
Generate a dictionary from the training set:
```Bash
python -m patent_parsing_tools.bow.dictionary_maker \
--train-directory patents/train_destination \
--max-patents 1000000000 \
--dictionary dictionary.txt \
--dict-max-size 4096
```
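To illustrate what a size-capped dictionary means, here is a minimal sketch that keeps the most frequent tokens of a small corpus. It is not the package's implementation: it assumes plain whitespace tokenization and lowercasing, whereas the real tool may stem words or filter stop words. The function name `build_dictionary` and its parameters are hypothetical.

```python
from collections import Counter


def build_dictionary(documents, max_size):
    """Return up to max_size of the most frequent lowercase tokens.

    Illustrative only: whitespace tokenization, no stemming,
    no stop-word removal.
    """
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    # Sort by descending frequency, then alphabetically,
    # so that ties are resolved deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_size]]


if __name__ == "__main__":
    corpus = ["a method for a device", "device and method"]
    print(build_dictionary(corpus, max_size=3))
```

In this sketch `max_size` plays the role that `--dict-max-size` appears to play in the command above: it bounds the vocabulary, and therefore the length of every bag-of-words vector built from it.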
Generate bag-of-words data for the training set and the test set, using the same dictionary for both:
```Bash
python -m patent_parsing_tools.bow.bag_of_words \
--serialized-patents patents/train_destination \
--destination-directory patents/final_dataset_train \
--dictionary dictionary.txt \
--batch-size 1048576
python -m patent_parsing_tools.bow.bag_of_words \
--serialized-patents patents/test_destination \
--destination-directory patents/final_dataset_test \
--dictionary dictionary.txt \
--batch-size 1048576
```
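As a reminder of what a bag-of-words representation is, the sketch below turns one text into a vector of word counts indexed by a fixed dictionary. It is a generic illustration under simple assumptions (whitespace tokenization, lowercasing, words outside the dictionary ignored) and does not reproduce the package's output format. The function name `bag_of_words` is hypothetical here.

```python
def bag_of_words(text, dictionary):
    """Count how often each dictionary word occurs in text.

    Returns a list with one count per dictionary entry, in dictionary
    order. Tokens that are not in the dictionary are ignored.
    """
    index = {word: position for position, word in enumerate(dictionary)}
    vector = [0] * len(dictionary)
    for token in text.lower().split():
        position = index.get(token)
        if position is not None:
            vector[position] += 1
    return vector


if __name__ == "__main__":
    dictionary = ["a", "device", "method"]
    print(bag_of_words("A device and a method", dictionary))
```

Because both the training and the test vectors are indexed by the same dictionary, their dimensions line up, which is why both commands above receive the same `--dictionary` file.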
## Testing
```Bash
pytest
```
## Contributing and development
```Bash
$ mkvirtualenv ppt
$ workon ppt
(ppt) $ pip install -r requirements.txt
```
## Publishing a new release
```Bash
$ git tag v1.0
$ git push origin v1.0
```
## Building documentation
```Bash
(ppt) $ sphinx-build -M html docs docs_build
```
## References
Usage:
* Elton, *Using natural language processing techniques to extract information on the properties and functionalities of energetic materials from large text corpora*, 2019, online: [https://arxiv.org/abs/1903.00415](https://arxiv.org/abs/1903.00415).
* Lee, *Natural Language Processing Techniques for Advancing Materials Discovery: A Short Review*, 2023, online: [https://doi.org/10.1007/s40684-023-00523-6](https://doi.org/10.1007/s40684-023-00523-6).
## License
The MIT License (MIT). Copyright (c) 2014 Michał Dul, Piotr Przetacznik, Krzysztof Strojny. See the [LICENSE](LICENSE) file for more information.
Raw data
{
"_id": null,
"home_page": "https://github.com/pprzetacznik/patent-parsing-tools",
"name": "patent-parsing-tools",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "deeplearning dbn rbm rsm backpropagation precission recall",
"author": "Michal Dul, Piotr Przetacznik, Krzysztof Strojny",
"author_email": "piotr.przetacznik@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/f0/18/0b8a5cbd4e2fb669e2c34b0c10839bf089ded88e02ff94e4f80c1b749896/patent-parsing-tools-0.9.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "patent-parsing-tools is a library providing tools for generating training and test set from Google's USPTO data helpful with for testing machine learning algorithms",
"version": "0.9.5",
"project_urls": {
"Homepage": "https://github.com/pprzetacznik/patent-parsing-tools"
},
"split_keywords": [
"deeplearning",
"dbn",
"rbm",
"rsm",
"backpropagation",
"precission",
"recall"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "65f78b2f6a6f49f85107f3660af525809cb6714a8b61fb9673b09dfc8c0f8b40",
"md5": "07c93439d0b248945ae9db1e24e62b20",
"sha256": "7bb52a2deaaaec6faa49ac3d78f59f959189d4e4215a15776e5cadbb40dd3802"
},
"downloads": -1,
"filename": "patent_parsing_tools-0.9.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "07c93439d0b248945ae9db1e24e62b20",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 1568951,
"upload_time": "2024-12-29T21:11:15",
"upload_time_iso_8601": "2024-12-29T21:11:15.367646Z",
"url": "https://files.pythonhosted.org/packages/65/f7/8b2f6a6f49f85107f3660af525809cb6714a8b61fb9673b09dfc8c0f8b40/patent_parsing_tools-0.9.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "f0180b8a5cbd4e2fb669e2c34b0c10839bf089ded88e02ff94e4f80c1b749896",
"md5": "4fe1f2bf42c6a2f3fb84b245ef36f67f",
"sha256": "8a4c2da98468fde1c87ca20d01cc1988b077e9a5493588b2e192f22e9c7883ef"
},
"downloads": -1,
"filename": "patent-parsing-tools-0.9.5.tar.gz",
"has_sig": false,
"md5_digest": "4fe1f2bf42c6a2f3fb84b245ef36f67f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 1524932,
"upload_time": "2024-12-29T21:11:17",
"upload_time_iso_8601": "2024-12-29T21:11:17.948824Z",
"url": "https://files.pythonhosted.org/packages/f0/18/0b8a5cbd4e2fb669e2c34b0c10839bf089ded88e02ff94e4f80c1b749896/patent-parsing-tools-0.9.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-29 21:11:17",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "pprzetacznik",
"github_project": "patent-parsing-tools",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "lxml",
"specs": []
},
{
"name": "nltk",
"specs": []
},
{
"name": "stemming",
"specs": []
},
{
"name": "Sphinx",
"specs": []
},
{
"name": "sphinx_rtd_theme",
"specs": []
},
{
"name": "requests",
"specs": []
},
{
"name": "numpy",
"specs": [
[
"==",
"1.22.0"
]
]
},
{
"name": "Theano",
"specs": [
[
"==",
"0.9.0"
]
]
},
{
"name": "mypy",
"specs": []
},
{
"name": "pytest",
"specs": []
},
{
"name": "pytest-cov",
"specs": []
}
],
"lcname": "patent-parsing-tools"
}