sequifier


- Name: sequifier
- Version: 0.4.0.0
- Home page: https://github.com/0xideas/sequifier
- Summary: Train a transformer model with the command line
- Upload time: 2024-10-02 10:29:33
- Author: Leon Luithlen
- Requires Python: <4.0,>=3.10
- License: BSD 3-Clause
- Keywords: transformer, sequence classification, machine learning, sequence, sequence modelling, nlp, language, language modelling, torch, pytorch
            <img src="./design/sequifier.png">


### one-to-one and many-to-one autoregression made easy

Sequifier enables sequence classification or regression for time-based sequences using transformer models, via the command line.
Each of the three steps is configured through a yaml file: preprocessing, which takes a single- or multi-variable columnar data file and creates
training, validation and test sequences; training, which trains a transformer model; and inference, which calculates
model outputs for data (usually the test data from preprocessing).

\
\
\
## Overview
The sequifier package enables:
  - the extraction of sequences for training
  - the configuration and training of a transformer classification or regression model
  - the use of multiple input and output sequences
  - inference on data with a trained model


## Other materials
If you want to first get a more specific understanding of the transformer architecture, have a look at
the [Wikipedia article.](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model))

If you want to see a benchmark on a small synthetic dataset with 10k cases, against a random forest,
an xgboost model and a logistic regression, check out [this notebook.](./documentation/demos/benchmark-small-data.ipynb)


## Complete example of how to build and apply a transformer sequence classifier with sequifier

1. create a conda environment with python >=3.10, activate it and run
```console
pip install sequifier
```
2. run
```console
git clone https://github.com/0xideas/sequifier-config YOUR_PROJECT_NAME
```
3. cd into the `YOUR_PROJECT_NAME` folder, create a `data` folder and add your data to it, then adapt the config file `preprocess.yaml` in the `configs` folder to point to the path of the data
4. run
```console
sequifier preprocess
```
5. the preprocessing step outputs a "data driven config" at `configs/ddconfigs/[FILE NAME]`. It contains the number of classes found in the data, a map of classes to indices and the paths to the train, validation and test splits of the data. Set the `dd_config` parameter in `train.yaml` and `infer.yaml` to the path `configs/ddconfigs/[FILE NAME]`
6. Adapt the config file `train.yaml` to specify the transformer hyperparameters you want and run
```console
sequifier train
```
7. adapt `data_path` in `infer.yaml` to one of the files output in the preprocessing step
8. run
```console
sequifier infer
```
9. find your predictions at `[PROJECT PATH]/outputs/predictions/sequifier-default-best-predictions.csv`
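
The three commands above can also be driven from a small script. The following is a minimal sketch, not part of sequifier itself: the helper names `build_commands` and `run_steps` are made up for illustration, and it assumes the config files keep their default names in a `configs` folder and are passed with the `--config_path` option described further below. Because the `dd_config` parameter has to be updated after preprocessing, the sketch lets you run the steps separately.

```python
import subprocess

# The three steps, in the order in which they are run.
STEPS = ("preprocess", "train", "infer")


def build_commands(steps=STEPS, config_dir="configs"):
    """Return one command (as an argument list) per step.

    Assumption: each step reads its config from <config_dir>/<step>.yaml.
    """
    return [
        ["sequifier", step, f"--config_path={config_dir}/{step}.yaml"]
        for step in steps
    ]


def run_steps(steps=STEPS, config_dir="configs"):
    """Run the given steps in order and stop at the first failure."""
    for command in build_commands(steps, config_dir):
        subprocess.run(command, check=True)


# Show the commands without executing them.
for command in build_commands():
    print(" ".join(command))
```

For example, `run_steps(("preprocess",))` could be called first, and `run_steps(("train", "infer"))` once `dd_config` has been set in `train.yaml` and `infer.yaml`.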


## More detailed explanations of the three steps
#### Preprocessing of data into sequences for training

The preprocessing step is designed for timeseries or timeseries-like data where the goal is
to predict the next data point of one or more variables from prior values of these
variables and (optionally) other variables.

This step presupposes input data with at least three columns: "sequenceId", "itemPosition", and a column
with the variable that is the prediction target.
The "sequenceId" column separates different sequences and the "itemPosition" column
provides values that enable sequential sorting. Often this will simply be a timestamp.
You can find an example of the preprocessing input data at [documentation/example_inputs/preprocessing_input.csv](./documentation/example_inputs/preprocessing_input.csv)
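
As a rough illustration of the input layout described above, the following sketch writes a tiny preprocessing input using only the Python standard library. Only the column names `sequenceId` and `itemPosition` come from the description above; the target column name `itemId`, the helper name and the values are made up for illustration.

```python
import csv
import io

# Hypothetical example rows: two sequences, deliberately out of order.
# "itemId" is an assumed name for the target column, not one required by sequifier.
ROWS = [
    {"sequenceId": 1, "itemPosition": 0, "itemId": "x"},
    {"sequenceId": 0, "itemPosition": 1, "itemId": "b"},
    {"sequenceId": 0, "itemPosition": 0, "itemId": "a"},
    {"sequenceId": 0, "itemPosition": 2, "itemId": "c"},
    {"sequenceId": 1, "itemPosition": 1, "itemId": "y"},
]


def write_preprocessing_input(rows, handle):
    """Write the rows as CSV, sorted by sequence and position within it."""
    fieldnames = ["sequenceId", "itemPosition", "itemId"]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    for row in sorted(rows, key=lambda r: (r["sequenceId"], r["itemPosition"])):
        writer.writerow(row)


# Write to an in-memory buffer; pass an open file instead to write to disk.
buffer = io.StringIO()
write_preprocessing_input(ROWS, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```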

The data can then be processed and split into training, validation and testing datasets of all
valid subsequences in the original data with the command:

```console
sequifier preprocess --config_path=[CONFIG PATH]
```

The config path is the path to the preprocessing config, and the project
path is the path to the (preferably empty) folder that the output files of the different
steps are written to.

The default config can be found on this path:

[configs/preprocess.yaml](./configs/preprocess.yaml)



#### Configuring and training the sequence classification model

The training step is executed with the command:

```console
sequifier train --config_path=[CONFIG PATH]
```

If the data on which the model is trained DOES NOT come from the preprocessing step, the flag

```console
--on-unprocessed
```

should be added.

If the training data does not come from the preprocessing step, both train and validation
data have to take the form of a csv file with the columns "sequenceId", "subsequenceId", "col_name", [SEQ LENGTH], [SEQ LENGTH - 1],...,"1", "0".
You can find an example of the training input data at [documentation/example_inputs/training_input.csv](./documentation/example_inputs/training_input.csv)
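
To make the column layout described above more concrete, here is a minimal sketch that turns one ordered sequence into rows of that shape. It rests on assumptions that the example file should be checked against: that each row holds `SEQ LENGTH + 1` consecutive values, that the column named "0" holds the most recent value in the window, and that "col_name" holds the name of the variable. The function name `to_training_rows` is made up for illustration.

```python
def to_training_rows(sequence_id, col_name, values, seq_length):
    """Slide a window over `values` and emit one row per window.

    Assumptions: a window spans seq_length + 1 values, and the column
    headers count down so that "0" is the last (most recent) value.
    """
    window = seq_length + 1
    header = ["sequenceId", "subsequenceId", "col_name"] + [
        str(i) for i in range(seq_length, -1, -1)
    ]
    rows = []
    # Sequences shorter than one window produce no rows.
    for sub_id, start in enumerate(range(len(values) - window + 1)):
        rows.append([sequence_id, sub_id, col_name] + list(values[start:start + window]))
    return header, rows


header, rows = to_training_rows(0, "itemId", [10, 11, 12, 13, 14], seq_length=3)
print(header)
for row in rows:
    print(row)
```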

The training step is configured using the config file. The default config can be found here:

[configs/train.yaml](./configs/train.yaml)


#### Inferring on test data using the trained model

Inference is done using the command:

```console
sequifier infer --config_path=[CONFIG PATH]
```

and configured using a config file. The default version can be found here:

[configs/infer.yaml](./configs/infer.yaml)
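
Once inference has run, the predictions file can be inspected with a few lines of Python. This is only a sketch: it assumes nothing about the columns of the predictions file beyond it being a CSV with a header row, and the helper name `preview_predictions` is made up for illustration.

```python
import csv
from pathlib import Path


def preview_predictions(path, limit=5):
    """Return the header and up to `limit` data rows of a predictions CSV."""
    with Path(path).open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        rows = [row for _, row in zip(range(limit), reader)]
    return header, rows
```

With the default setup from the complete example, the path would be `outputs/predictions/sequifier-default-best-predictions.csv` inside the project folder.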

            
