# PatentGPT
This repository hosts a prototype for patent analysis, with a particular focus on extracting specific technical measurements and their associated values.
[PyPI](https://pypi.org/project/patentgpt-extract/) · [License](https://github.com/arminnorouzi/patentGPT/blob/main/LICENSE)
## Quick Start
To run this package in Google Colab:
[Open in Colab](https://colab.research.google.com/github/arminnorouzi/patentGPT/blob/main/quick_start.ipynb)
## Installation
Step 1. To install `patentgpt-extract`, simply use `pip`:
`pip install patentgpt-extract`
Step 2. Authentication: Before using the package, you need to authenticate with OpenAI. To do this:
```
import os
from getpass import getpass
token = getpass("Enter your OpenAI token: ")
os.environ["OPENAI_API_KEY"] = str(token)
```
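Alternatively, since `python-dotenv` is among the pinned dependencies, you could keep the key in a `.env` file and load it before running. This is a minimal sketch of that common pattern, not part of the package's documented flow:
```
# Assumes a .env file in the working directory containing: OPENAI_API_KEY=...
from dotenv import load_dotenv

load_dotenv()  # copies OPENAI_API_KEY from .env into the process environment
```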
Step 3. Importing and Running:
```
from patentgpt.main import main
main()
```
Then answer these questions:
- Enter a date in the format 'YYYY-MM-DD' (for example, 2023-01-12)
- Enter the number of patents you want to analyze (for example, 5; this randomly selects 5 parsed patents)
- Do you want to log the results? (yes/no)
- Select a model for analysis: 1. gpt-3.5-turbo 2. gpt-4
## Quick Start Using the Repository
1. Clone this repository.
2. Install the required Python packages by running `pip install -r requirements.txt`.
3. Run the Jupyter notebook `quick_start.ipynb`.
4. Authenticate with your OpenAI token, then answer these questions:
   - Enter a date in the format 'YYYY-MM-DD' (for example, 2023-01-12)
   - Enter the number of patents you want to analyze (for example, 5; this randomly selects 5 parsed patents)
   - Do you want to log the results? (yes/no)
   - Select a model for analysis: 1. gpt-3.5-turbo 2. gpt-4
5. JSON results will be saved in the `output` folder; a sketch for inspecting them follows below.
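Each run writes its results as JSON. A minimal, purely illustrative sketch for inspecting them (the exact file names are whatever the pipeline produces):
```
# Load and print every JSON result found in the output folder.
import json
from pathlib import Path

for result_file in Path("output").glob("*.json"):
    with result_file.open(encoding="utf-8") as f:
        measurements = json.load(f)
    print(result_file.name, measurements)
```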
## System Design
Below is an image that illustrates the main design and workflow of the patent analysis system.

The project involves the following steps:
1. Downloading a ZIP archive that contains granted patent full-text data (without images) from https://bulkdata.uspto.gov/.
2. Reading the contained patents, extracting each individual XML file, parsing it to a text file, and saving it in the `data` directory.
3. Implementing an approach based on a large language model (LLM) to extract measurements from the patents using the Chroma vector store. The measurements are returned in a structured format (such as JSON); a sketch of this step follows below.
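For orientation, here is a minimal sketch of step 3 built on the LangChain 0.0.x API pinned in requirements.txt. It is not the package's actual pipeline; the file path, chunking parameters, and prompt wording are illustrative assumptions.
```
# Sketch: index one parsed patent in Chroma, then ask an LLM for measurements.
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Read one parsed patent text file (assumed to live under data/).
with open("data/example_patent.txt", encoding="utf-8") as f:
    patent_text = f.read()

# Split the patent into overlapping chunks so each fits in the model context.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(patent_text)

# Embed the chunks and index them in a Chroma vector store.
vectordb = Chroma.from_texts(chunks, OpenAIEmbeddings())

# Retrieve the most relevant chunks and ask the LLM for structured output.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    retriever=vectordb.as_retriever(),
)
print(qa.run(
    "List every technical measurement and its value mentioned in this patent, "
    "formatted as JSON."
))
```
Retrieval keeps only the most relevant chunks in the prompt, which bounds token usage even for long patents.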
## Requirements
- Python 3.10+
- See `requirements.txt` for Python packages and versions.
## License
This project is licensed under the terms of the MIT license. See the [LICENSE](LICENSE) file.
## Collaboration
We welcome contributions to patentgpt-extract! If you're interested in improving the package, adding features, or even fixing bugs, here's how you can get started:
1. Fork the Repository: Start by forking the repository. This creates your own personal copy of the entire project.
2. Clone Your Fork: Once you've forked the repo, clone your fork to your local machine to start making changes.
```
# replace <your-username> with your GitHub username
git clone https://github.com/<your-username>/patentGPT.git
```
3. Create a New Branch: Before making changes, create a new branch. This helps in segregating your changes and makes it easier to merge later.
```
git checkout -b new-feature-branch
```
Replace `new-feature-branch` with a descriptive name for your changes.
4. Make Your Changes: Now, you can start making changes, adding new features, fixing bugs, or improving documentation.
5. Commit and Push: Once you're done, commit your changes and push them to your fork on GitHub.
```
git add .
git commit -m "feat or fix: Description of changes made"
git push origin new-feature-branch
```
6. Open a Pull Request: Go to your fork on GitHub and click the "New pull request" button. Ensure you're comparing the correct branches and then submit your pull request with a description of the changes you made.
## Additional Resources
For further questions or if you encounter any problems, please do not hesitate to open an issue.
- [My Website](https://arminnorouzi.github.io/) - More details about me and my projects.
- [LinkedIn](https://www.linkedin.com/in/arminnorouzi/) - See my posts related to machine learning and software development on LinkedIn.
- [Medium](https://arminnorouzi.medium.com/) - See my posts related to ML/AI, algorithms, and system design.