ft-drift

Name	ft-drift JSON
Version	0.0.13 JSON
	download
home_page	https://github.com/hamelsmu/ft-drift
Summary	Check for data drift with OAI data
upload_time	2024-04-10 20:28:36
maintainer	None
docs_url	None
author	Hamel Husain
requires_python	>=3.7
license	Apache Software License 2.0
keywords	nbdev jupyter notebook python
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            # ft-drift


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

`ft-drift` helps you check for data drift by comparing two OpenAI
[multi-turn chat jsonl
files](https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset).

## Install

``` sh
pip install ft_drift
```

## Background

Checking for dataset drift can help you debug if:

1.  Your model is trained on data that doesn’t reflect production
    (different prompts, functions, etc).
2.  Your training data contains unexpected or accidental artifacts.

In either situation, you can compare data from relevant sources
(i.e. production vs fine-tuning) to find unwanted changes. This is one
of the most common source of errors when fine-tuning models!

The demo below shows a cli tool used to detect data drift between two
files, `file_a.jsonl` and `file_b.jsonl`. Afterwards, a table of
important tokens that account for the drift are shown, such as:

- `END-UI-FORMAT`
- `UI-FORMAT`
- “\`\`\`json”
- etc.

**Currently, `ft_drift` only detects drift in prompt templates, schemas
and other token-based drift (as opposed to semantic drift)**.

## Usage

After installing `ft_drift`, the cli command `detect_drift` will be
available to you.

![](drift_cli.gif)

## How Does it Work?

This works by doing the following steps:

1.  Fit a binary classifier (random forest) to discriminate between two
    datasets.
2.  If the classifier can predict a material difference (ex: AUC \>=
    0.60) then we know there is drift (something is systematically
    different b/w the two datasets).
3.  We show the most important features from the classifier which are
    tokens (segments of text) to help you debug what is different.

If this tool doesn’t detect drift, it doesn’t mean drift doesn’t exist.
It just means we didn’t find it. For more background on this approach,
see this slide from [my talk on MLOps
tools](https://www.youtube.com/watch?v=GHk5HMW4XMA):

![](drift_tfx.png)

## TODO

Other things that could be added:

- [ ] Semantic drift by incorporating embeddings.
- [ ] More features: length of messages, \# of turns etc.
- [ ] Wiring up the function definition diff to the CLI (I don’t need
  this yet for my use case).

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/hamelsmu/ft-drift",
    "name": "ft-drift",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "nbdev jupyter notebook python",
    "author": "Hamel Husain",
    "author_email": "hamel.husain@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/91/5e/8993172c38d56eb702898717c4414ded66b9b94dd73e907c25038edcadf8/ft-drift-0.0.13.tar.gz",
    "platform": null,
    "description": "# ft-drift\n\n\n<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->\n\n`ft-drift` helps you check for data drift by comparing two OpenAI\n[multi-turn chat jsonl\nfiles](https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset).\n\n## Install\n\n``` sh\npip install ft_drift\n```\n\n## Background\n\nChecking for dataset drift can help you debug if:\n\n1.  Your model is trained on data that doesn\u2019t reflect production\n    (different prompts, functions, etc).\n2.  Your training data contains unexpected or accidental artifacts.\n\nIn either situation, you can compare data from relevant sources\n(i.e.\u00a0production vs fine-tuning) to find unwanted changes. This is one\nof the most common source of errors when fine-tuning models!\n\nThe demo below shows a cli tool used to detect data drift between two\nfiles, `file_a.jsonl` and `file_b.jsonl`. Afterwards, a table of\nimportant tokens that account for the drift are shown, such as:\n\n- `END-UI-FORMAT`\n- `UI-FORMAT`\n- \u201c\\`\\`\\`json\u201d\n- etc.\n\n**Currently, `ft_drift` only detects drift in prompt templates, schemas\nand other token-based drift (as opposed to semantic drift)**.\n\n## Usage\n\nAfter installing `ft_drift`, the cli command `detect_drift` will be\navailable to you.\n\n![](drift_cli.gif)\n\n## How Does it Work?\n\nThis works by doing the following steps:\n\n1.  Fit a binary classifier (random forest) to discriminate between two\n    datasets.\n2.  If the classifier can predict a material difference (ex: AUC \\>=\n    0.60) then we know there is drift (something is systematically\n    different b/w the two datasets).\n3.  We show the most important features from the classifier which are\n    tokens (segments of text) to help you debug what is different.\n\nIf this tool doesn\u2019t detect drift, it doesn\u2019t mean drift doesn\u2019t exist.\nIt just means we didn\u2019t find it. For more background on this approach,\nsee this slide from [my talk on MLOps\ntools](https://www.youtube.com/watch?v=GHk5HMW4XMA):\n\n![](drift_tfx.png)\n\n## TODO\n\nOther things that could be added:\n\n- [ ] Semantic drift by incorporating embeddings.\n- [ ] More features: length of messages, \\# of turns etc.\n- [ ] Wiring up the function definition diff to the CLI (I don\u2019t need\n  this yet for my use case).\n",
    "bugtrack_url": null,
    "license": "Apache Software License 2.0",
    "summary": "Check for data drift with OAI data",
    "version": "0.0.13",
    "project_urls": {
        "Homepage": "https://github.com/hamelsmu/ft-drift"
    },
    "split_keywords": [
        "nbdev",
        "jupyter",
        "notebook",
        "python"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "cced7f2ce627c1ff40f3c55060650f2efe33dada52d91aab24e90dba5bed68c1",
                "md5": "dd116bdd4e9717349724af2f64b5f91d",
                "sha256": "dc204c7ddb4eb367fdeed69d3599be8d89412a08694a9efd773c46efca69e630"
            },
            "downloads": -1,
            "filename": "ft_drift-0.0.13-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "dd116bdd4e9717349724af2f64b5f91d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 12583,
            "upload_time": "2024-04-10T20:28:35",
            "upload_time_iso_8601": "2024-04-10T20:28:35.267914Z",
            "url": "https://files.pythonhosted.org/packages/cc/ed/7f2ce627c1ff40f3c55060650f2efe33dada52d91aab24e90dba5bed68c1/ft_drift-0.0.13-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "915e8993172c38d56eb702898717c4414ded66b9b94dd73e907c25038edcadf8",
                "md5": "1c129cd5b79c9e799f8f28b6ccb65678",
                "sha256": "e19df9da79362e83cc800b67a1648685f4932f93ad95953fd7560f6562ff917e"
            },
            "downloads": -1,
            "filename": "ft-drift-0.0.13.tar.gz",
            "has_sig": false,
            "md5_digest": "1c129cd5b79c9e799f8f28b6ccb65678",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 13478,
            "upload_time": "2024-04-10T20:28:36",
            "upload_time_iso_8601": "2024-04-10T20:28:36.403810Z",
            "url": "https://files.pythonhosted.org/packages/91/5e/8993172c38d56eb702898717c4414ded66b9b94dd73e907c25038edcadf8/ft-drift-0.0.13.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-10 20:28:36",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "hamelsmu",
    "github_project": "ft-drift",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "ft-drift"
}

Hamel Husain