inspire-classifier


* Name: inspire-classifier
* Version: 2.0.4
* Summary: INSPIRE package aimed to automatically classify the new papers that are added to INSPIRE, such as if they are core or not.
* Upload time: 2025-10-24 13:39:10
* Author: CERN
* Requires Python: <=3.13,>=3.11
* License: MIT
* Repository: https://github.com/inspirehep/inspire-classifier
# Inspire Classifier

## About
An INSPIRE package that automatically classifies new papers added to INSPIRE, for example as core or non-core.

The current implementation uses the ULMFiT (Universal Language Model Fine-tuning) approach, a method for training text classifiers. A language model is first pre-trained on a large corpus to learn general language features; here, a pre-loaded model trained on the WikiText-103 dataset is used. The pre-trained model is then fine-tuned on the titles and abstracts of the INSPIRE dataset, and the classifier is trained on top of it.


## Package Usage
```
from inspire_classifier import Classifier

classifier = Classifier(model_path="PATH/TO/MODEL.h5")

title = "Search for new physics in high-energy particle collisions"
abstract = "We present results from a search for beyond..."

result = classifier.predict_coreness(title, abstract)
print(result)
# Example output: {'prediction': 'core', 'scores': {'rejected': 0.1, 'non_core': 0.3, 'core': 0.6}}
```
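
The result shown in the usage example is a plain dictionary, so it can be post-processed without the package installed. The following is a minimal sketch, assuming the result always has the `prediction` and `scores` keys shown above. The helper name `top_label` and the `min_score` threshold are hypothetical and not part of `inspire_classifier`.

```python
def top_label(result, min_score=0.5):
    """Return (label, score, is_confident) for the highest-scoring class.

    `min_score` is an illustrative threshold, not a value used by the package.
    """
    scores = result["scores"]
    # Pick the class whose score is largest.
    label = max(scores, key=scores.get)
    score = scores[label]
    return label, score, score >= min_score


# Example result copied from the usage section above.
example = {
    "prediction": "core",
    "scores": {"rejected": 0.1, "non_core": 0.3, "core": 0.6},
}

label, score, is_confident = top_label(example)
print(label, score, is_confident)
```

A threshold like this could be used to route low-confidence predictions to manual review, but whether that suits a given workflow is a judgment call.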



## Installation for local usage and training
* Install and activate a Python 3.11 environment (for example, using pyenv)
* Install Poetry: `pip install poetry==1.8.3`
* Install the dependencies: `poetry install`
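
Before running `poetry install`, it can help to confirm that the active interpreter falls inside the range declared in the package metadata (`<=3.13,>=3.11`). The sketch below is a coarse check on the major and minor version only. The function name `is_supported` is hypothetical, and installers evaluate the specifier more strictly, so patch releases above 3.13.0 may still be rejected.

```python
import sys

# Bounds taken from the package metadata: requires_python "<=3.13,>=3.11".
MIN_VERSION = (3, 11)
MAX_VERSION = (3, 13)


def is_supported(version=None):
    """Return True if the (major, minor) pair lies within the declared range.

    This ignores the patch level, so it is only an approximation of how
    installers evaluate the version specifier.
    """
    major, minor = (version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) <= MAX_VERSION


print("Python", sys.version.split()[0], "supported:", is_supported())
```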


## Train new classifier model
### 1. Gather training data
Set the environment variables for the inspire-prod ES database and run the [`create_dataset.py`](scripts/create_dataset.py) script, passing the range of years. This creates `inspire_classifier_dataset.pkl`, which contains the label (core, non-core, rejected) together with the title and abstract of each fetched record. This data is used in the next step to train the model, so make sure the generated file is named `inspire_classifier_dataset.pkl`.

```
export ES_USERNAME=XXXX
export ES_PASSWORD=XXXX

poetry run python scripts/create_dataset.py --year-from $YEAR_FROM --month-from $MONTH_FROM --year-to $YEAR_TO --month-to $MONTH_TO

# --month-from and --month-to are optional
```
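
When the dataset is generated from another script, the command line above can be assembled programmatically so that the optional month flags are only included when given. This is a minimal sketch using only the flags and environment variables shown above. The helper names and the validation rules (months between 1 and 12, start year not after end year) are assumptions, not behaviour confirmed by `create_dataset.py`.

```python
import os
import shlex


def missing_credentials(env=os.environ):
    """Return the names of the required ES variables that are unset or empty."""
    return [name for name in ("ES_USERNAME", "ES_PASSWORD") if not env.get(name)]


def build_create_dataset_cmd(year_from, year_to, month_from=None, month_to=None):
    """Build the argument list for scripts/create_dataset.py."""
    if year_from > year_to:
        raise ValueError("year_from must not be later than year_to")
    for name, month in (("month_from", month_from), ("month_to", month_to)):
        # Assumed convention: months are numbered 1 to 12.
        if month is not None and not 1 <= month <= 12:
            raise ValueError(f"{name} must be between 1 and 12, got {month}")

    cmd = ["poetry", "run", "python", "scripts/create_dataset.py",
           "--year-from", str(year_from)]
    if month_from is not None:
        cmd += ["--month-from", str(month_from)]
    cmd += ["--year-to", str(year_to)]
    if month_to is not None:
        cmd += ["--month-to", str(month_to)]
    return cmd


# Print the command instead of running it; pass the list to subprocess.run to execute.
print(shlex.join(build_create_dataset_cmd(2020, 2024, month_from=6)))
```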


### 2. Run training and validate model
The [`train_classifier.py`](scripts/train_classifier.py) script runs the commands to train and validate a new model. Configuration changes, such as the number of training epochs and the train-test split, can be made there. In short, the script first splits the pkl file from the first step into a training and a test dataset inside the `classifier/data` folder. The training set is then used to train the model, and the test set is used to evaluate the model once training has finished. The model is saved to `classifier/models/language_model/finetuned_language_model_encoder.h5`.

```
poetry run python scripts/train_classifier.py
```
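
To illustrate what the train-test split in this step does, here is a minimal sketch that shuffles a list of records and holds out a fraction for evaluation. It is not the implementation in `train_classifier.py`: the function name `split_records`, the 20% test fraction, the seed, and the record layout are all assumptions chosen for the example.

```python
import random


def split_records(records, test_fraction=0.2, seed=42):
    """Shuffle the records reproducibly and return (train, test) lists."""
    shuffled = list(records)
    # A seeded generator keeps the split identical between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records with the three fields described in step 1.
records = [
    {"label": "core", "title": f"Paper {i}", "abstract": "..."}
    for i in range(10)
]

train, test = split_records(records)
print(len(train), "training records,", len(test), "test records")
```

Keeping the test records out of training is what makes the evaluation after training a meaningful estimate of how the model handles unseen papers.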


### 3. Upload the model to CERN S3
To use the new model in production, upload it to CERN S3 and follow [this writeup](https://confluence.cern.ch/display/RCSSIS/Update+Airflow+Base+Image+%28with+classifier+model%29+for+INSPIRE).


            
