docembedder

Name: docembedder
Version: 0.1.0
Summary: Package for creating document embeddings of patents and analysis tools.
Upload time: 2023-12-13 09:14:54
Requires Python: >=3.8
License: MIT License, Copyright (c) 2022 Utrecht University
Keywords: patents, embeddings, machine learning
# Patent breakthrough

![tests](https://github.com/UtrechtUniversity/patent-breakthrough/actions/workflows/python-package.yml/badge.svg)


The code in this repository is used to identify breakthrough innovations in historical patents from the [USPTO](https://www.uspto.gov/).
The `docembedder` Python package contains a variety of methods for creating document embeddings. We have optimized and tested these methods for their ability to predict similarity between patents: the optimization maximizes the cosine similarity between patents classified into the same technology class and minimizes it between patents in different technology classes. The methods with optimized parameters are then used to create document embeddings. From these embeddings, novelty scores are computed using cosine similarities between the focal patent and the patents in the previous n years and the subsequent n years.
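
As a minimal sketch of this idea (not the package's exact formula), a novelty score for a focal patent could be computed as one minus its average cosine similarity to the patents of the preceding window:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def novelty_sketch(focal_emb: np.ndarray, past_embs: np.ndarray) -> float:
    """Toy novelty score: 1 minus the mean cosine similarity between the
    focal patent and the patents of the previous n years. Illustrative only;
    the package computes its own scores in the analysis step below."""
    sims = cosine_similarity(focal_emb.reshape(1, -1), past_embs)[0]
    return 1.0 - sims.mean()

rng = np.random.default_rng(0)
focal, past = rng.random(16), rng.random((50, 16))
print(novelty_sketch(focal, past))
```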

## Getting Started

Clone this repository to your workstation to obtain the example notebooks and Python scripts:

```sh
git clone https://github.com/UtrechtUniversity/patent-breakthrough.git
```

### Prerequisites

To install and run this project, you need to have the following prerequisites installed.

```
- Python [>=3.8, <3.11]
- jupyterlab (or any other program to run jupyter notebooks)
```
To install jupyterlab:
```
pip install jupyterlab
```

### Installation

To run the project, install the package and its dependencies:

```sh
pip install git+https://github.com/UtrechtUniversity/patent-breakthrough.git
```

### Built with

These packages are automatically installed in the step above:

- [scikit-learn](https://scikit-learn.org/)
- [gensim](https://pypi.org/project/gensim/)
- [sbert](https://www.sbert.net/)
- [bpemb](https://bpemb.h-its.org/)


## Usage

### 1. Preparation

First you need to make sure that you have the data prepared. There should be a directory with *.xz files named after the year they cover, e.g. 1923.xz, 1924.xz, 1925.xz. If this is not the case and you only have the raw .txt files, then you have to compress your data first:

```python
from docembedder.preprocessor.parser import compress_raw

# some_file_name (the raw .txt input) and some_output_dir (the directory for
# the compressed .xz output) are placeholders to be filled in by the user.
compress_raw(some_file_name, "year.csv", some_output_dir)
```

Here, "year.csv" should be a file that that contains the patent ids and the year in which they were issued.



### 2. Hyperparameter optimization

There are procedures to optimize the preprocessor and ML models with respect to predicting CPC classifications. This is not a necessary step to compute the novelties and impacts, and has already been done for patents 1838-1951. For more information on how to optimize the models, see the [documentation](docs/hyperparameter.md).

### 3. Preprocessing

To improve the quality of the patent texts, and to process or remove the start sections and similar boilerplate, it is necessary to preprocess these raw files. This is done using the `Preprocessor` and `OldPreprocessor` classes, for example:

```python
from docembedder.preprocessor import Preprocessor, OldPreprocessor

prep = Preprocessor()
old_prep = OldPreprocessor()
documents = prep.preprocess_file("1928.xz")
```

Normally, however, preprocessing does not need to be done as a separate step: we can compute the embeddings directly, as explained in the next section.


### 4. Embedding models

Five different embedding models are implemented for computing the embeddings:

```python
from docembedder.models import CountVecEmbedder, D2VEmbedder, BPembEmbedder
from docembedder.models import TfidfEmbedder, BERTEmbedder
model = BERTEmbedder()
model.fit(documents)
embeddings = model.transform(documents)
```

These models can take different parameters for training; see the section on hyperparameter optimization. The resulting embeddings can be either sparse or dense matrices, and the functions and methods in this package work with both in the same way.
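
Because downstream code treats both representations the same way, you can, for instance, feed either directly into scikit-learn's cosine similarity (a small illustration, not part of the package API):

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

dense = np.random.rand(3, 8)                   # dense embeddings (e.g. BERT-style)
sparse_emb = sparse.random(3, 8, density=0.3)  # sparse embeddings (e.g. TF-IDF-style)

# cosine_similarity accepts both dense arrays and scipy sparse matrices.
print(cosine_similarity(dense).shape)       # (3, 3)
print(cosine_similarity(sparse_emb).shape)  # (3, 3)
```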

### 5. Computing embeddings

The prepared data can be analysed to compute the embeddings for each of the patents using the `run_models` function. This function can run in parallel, for example if you have more than one core on your CPU.

Before we can run the models, we have to tell docembedder the parameters of the run, which is done through the `SimulationSpecification` class:

```python
from docembedder.utils import SimulationSpecification
sim_spec = SimulationSpecification(
    year_start=1838,  # Starting year for computing the embeddings.
    year_end=1951, # Last year for computing the embeddings.
    window_size=21,  # Size of the window to compute the embeddings for.
    window_shift=1,  # How many years between subsequent windows.
    debug_max_patents=100  # For a trial run we sample the patents instead, remove for final run.
)
```
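
As a rough illustration (an assumption about the windowing, not the package's exact logic), the specification above describes 21-year windows that shift forward one year at a time:

```python
# Enumerate the (start, end) year windows implied by the specification above.
year_start, year_end, window_size, window_shift = 1838, 1951, 21, 1
windows = [(start, start + window_size - 1)
           for start in range(year_start, year_end - window_size + 2, window_shift)]
print(windows[0], windows[-1])  # (1838, 1858) (1931, 1951)
```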

An example to create a file with the embeddings is:

```python
from docembedder.utils import run_models

# output_fp: path for the output HDF5 file; cpc_fp: path to the CPC classification data.
run_models({"bert": BERTEmbedder()}, model, sim_spec, output_fp, cpc_fp)
```

The output file is an HDF5 file, which stores the embeddings for all patents in all windows.
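
Once the run has finished, the file can be inspected with any HDF5 tool. The exact group layout is not documented here, so the sketch below simply walks the hierarchy using h5py (which may need to be installed separately; the file name is a placeholder):

```python
import h5py

# Print every group and dataset path stored in the output file.
with h5py.File("embeddings.h5", "r") as fh:
    fh.visit(print)
```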

### 6. Computing novelty and impact

To compute the novelty and impact we use the `DocAnalysis` class:
```python
from docembedder.analysis import DocAnalysis
from docembedder.datamodel import DataModel  # module path assumed

with DataModel(output_fp, read_only=False) as data:
    analysis = DocAnalysis(data)
    results = analysis.compute_impact_novelty("1920-1940", "bert")
```

The result is a dictionary that contains the novelties and impacts for each of the patents in that window (in this case 1920-1940).
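
The exact layout of this dictionary is not documented here; a quick way to see what was computed is to inspect it directly:

```python
# List the keys returned by compute_impact_novelty and the type of each value.
for key, value in results.items():
    print(key, type(value))
```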

## About the Project

**Date**: February 2023

**Researcher(s)**:

- Benjamin Cornejo Costas (b.j.cornejocostas@uu.nl)

**Research Software Engineer(s)**:

- Raoul Schram
- Shiva Nadi
- Maarten Schermer
- Casper Kaandorp
- Jelle Treep (h.j.treep@uu.nl)

### License

The code in this project is released under [MIT license](LICENSE).

### Attribution and academic use

Manuscript in preparation.

## Contributing

Contributions are what make the open source community an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.

To contribute:

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## Contact

Benjamin Cornejo Costas - b.j.cornejocostas@uu.nl

Project Link: [https://github.com/UtrechtUniversity/patent-breakthrough](https://github.com/UtrechtUniversity/patent-breakthrough)

            
