FancySchmancyTestsplit

Name	FancySchmancyTestsplit
Version	0.1.10
home_page	None
Summary	a more in-depth testsplit splitting intercategorical
upload_time	2024-05-07 10:27:22
maintainer	None
docs_url	None
author	None
requires_python	>=3.8
license	MIT
keywords	test split testsplit train test split
requirements	No requirements were recorded.

# fancy schmancy testsplit
#### it's like a testsplit, but fancy and also schmancy
----
for reference:
 package | fancy | schmancy | testsplit
 :- | :- | :- | :-
 sklearn.model_selection | &#128078; | &#128078; | &#128077;
 fancy schmancy testsplit | &#128077; | &#128077; | &#128077;

A separate train/test split is made per label category, so that every category stays represented after splitting.

----
### Examples

Assume the following DataFrame:
```Python
from pandas import DataFrame

df = DataFrame(data={"Column A": [10, 14, 12, 13, 9, 5, 13, 16, 18, 4, 12],
                     "Column B": ["Cat1", "Cat1", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2"]})
print(df)
```
|| Column A | Column B
:- | -: | -:
0 | 10 | Cat1
1 | 14 | Cat1
2 | 12 | Cat2
3 | 13 | Cat2
4 | 9 | Cat2
5 | 5 | Cat2
6 | 13 | Cat2
7 | 16 | Cat2
8 | 18 | Cat2
9 | 4 | Cat2
10 | 12 | Cat2

If we further assume that Column B contains the label categories, a plain random 50% train/test split runs the risk of leaving Cat1, which has only two rows, out of the training or the test set entirely.
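
To put a rough number on that risk, here is a small worked calculation using only the standard library. It assumes that a 50% split of the 11 rows puts 6 rows into the test set and 5 into the training set, and that rows are drawn uniformly at random. The helper name `prob_category_missing` is made up for this illustration.

```python
from fractions import Fraction
from math import comb

n_rows, n_cat1 = 11, 2        # rows in df, rows labelled "Cat1"
n_test = 6                    # assumption: 50% of 11 rows, rounded up
n_train = n_rows - n_test     # 5


def prob_category_missing(n_total, n_category, n_drawn):
    """Probability that a uniform random draw of n_drawn rows
    contains no row of the given category (hypothetical helper)."""
    return Fraction(comb(n_total - n_category, n_drawn), comb(n_total, n_drawn))


p_missing_train = prob_category_missing(n_rows, n_cat1, n_train)
p_missing_test = prob_category_missing(n_rows, n_cat1, n_test)

# The two events cannot happen together, so their probabilities add up.
print(p_missing_train, p_missing_test, p_missing_train + p_missing_test)
# prints: 3/11 2/11 5/11
```

Under these assumptions, Cat1 would be missing from one of the two sets in 5 out of 11 random splits.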

So, to preserve every existing category, the split is instead made separately within each category's subset of rows.

As an example for Cat1:
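```Python
from pandas import DataFrame, Series
from sklearn.model_selection import train_test_split

subset = df[df["Column B"] == "Cat1"]
X = subset.drop("Column B", axis=1)
y = subset["Column B"]
# Keep the labels as a DataFrame rather than a Series
if isinstance(y, Series):
    y = DataFrame(y)
X_tr, X_te, y_tr, y_te = \
    train_test_split(X, y, test_size=0.5, random_state=42)
print(y_tr)
```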
|| Column B
:- | -:
0 | Cat1

This is repeated for every unique value in the given label column, so that training and test rows are drawn at random within each category separately.

Doing this for both "Cat1" and "Cat2" and combining the training labels gives:

|| Column B
:- | -:
0 | Cat1
4 | Cat2
6 | Cat2
5 | Cat2
8 | Cat2
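
The manual procedure described above can be sketched as a loop over the categories. This is a minimal illustration built on pandas and scikit-learn, not the package's actual implementation; the helper name `split_per_category` and its parameters are assumptions made for this example.

```python
from pandas import DataFrame, concat
from sklearn.model_selection import train_test_split


def split_per_category(data, label_column, test_split, seed):
    """Hypothetical helper: split each label category on its own,
    then stack the per-category results back together."""
    train_parts, test_parts = [], []
    for _, subset in data.groupby(label_column, sort=False):
        # Random split restricted to the rows of one category
        train, test = train_test_split(subset, test_size=test_split,
                                       random_state=seed)
        train_parts.append(train)
        test_parts.append(test)
    train, test = concat(train_parts), concat(test_parts)
    # Separate features from labels; labels stay a one-column DataFrame
    return (train.drop(columns=label_column), test.drop(columns=label_column),
            train[[label_column]], test[[label_column]])


df = DataFrame(data={"Column A": [10, 14, 12, 13, 9, 5, 13, 16, 18, 4, 12],
                     "Column B": ["Cat1"] * 2 + ["Cat2"] * 9})
X_train, X_test, y_train, y_test = split_per_category(df, "Column B", 0.5, 42)
print(sorted(set(y_train["Column B"])), sorted(set(y_test["Column B"])))
```

Because every category is split separately, both label sets contain "Cat1" and "Cat2". A category with only a single row cannot be divided this way and would need special handling.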

To shorten the process, the function `fancy_schmancy_testsplit` can be used as follows:

```Python
from FancySchmancyTestsplit.fst import fancy_schmancy_testsplit
from pandas import DataFrame
df = DataFrame(data= {"Column A":[10, 14, 12, 13, 9, 5, 13, 16, 18, 4, 12],
"Column B": ["Cat1", "Cat1", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2"]})
X_train, X_test, y_train, y_test = \
    fancy_schmancy_testsplit(data= df,
                            label_column= "Column B",
                            test_split= 0.5,
                            seed= 42
                            )
print(y_train)
```
|| Column B
:- | -:
0 | Cat1
4 | Cat2
6 | Cat2
5 | Cat2
8 | Cat2
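
A quick way to confirm that no category was lost is to compare the label sets of both parts. The sketch below is only an illustration: `missing_categories` is a hypothetical helper, `y_train` is rebuilt by hand from the table above, and `y_test` is assumed to hold the remaining rows of the example DataFrame.

```python
from pandas import DataFrame


def missing_categories(y_train, y_test, label_column):
    """Hypothetical check: categories that appear in only one of the two parts."""
    train_categories = set(y_train[label_column])
    test_categories = set(y_test[label_column])
    return train_categories ^ test_categories  # symmetric difference


# Training labels as shown in the table above
y_train = DataFrame({"Column B": ["Cat1", "Cat2", "Cat2", "Cat2", "Cat2"]},
                    index=[0, 4, 6, 5, 8])
# Assumption: the test labels are the rows not listed in the training table
y_test = DataFrame({"Column B": ["Cat1", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2"]},
                   index=[1, 2, 3, 7, 9, 10])

print(missing_categories(y_train, y_test, "Column B"))
# prints: set()
```

An empty set means that every category is present in both the training and the test labels.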

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "FancySchmancyTestsplit",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "Kevin Pohl <pohl.kevin@gmail.com>",
    "keywords": "test split, testsplit, train test split",
    "author": null,
    "author_email": "Kevin Pohl <pohl.kevin@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/c4/37/0320d5afe376f1673a717a400b90195819429c386d251770bb7263f260b1/fancyschmancytestsplit-0.1.10.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "a more in-depth testsplit splitting intercategorical",
    "version": "0.1.10",
    "project_urls": null,
    "split_keywords": [
        "test split",
        " testsplit",
        " train test split"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b4739ddba650e277f3e57d2f45a5bea367704491b9d7c5b1cd8f89dad4335f25",
                "md5": "51996aca7a9e5a288406a9656dbdf1b7",
                "sha256": "87b8d2e0070f540844483d8d8411f0b5cc5d36ce0119f80287c40a89ef9b0cae"
            },
            "downloads": -1,
            "filename": "FancySchmancyTestsplit-0.1.10-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "51996aca7a9e5a288406a9656dbdf1b7",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 4015,
            "upload_time": "2024-05-07T10:27:17",
            "upload_time_iso_8601": "2024-05-07T10:27:17.551338Z",
            "url": "https://files.pythonhosted.org/packages/b4/73/9ddba650e277f3e57d2f45a5bea367704491b9d7c5b1cd8f89dad4335f25/FancySchmancyTestsplit-0.1.10-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c4370320d5afe376f1673a717a400b90195819429c386d251770bb7263f260b1",
                "md5": "ce1fdee4f08fac567cae9d13f4c86e5b",
                "sha256": "247e44bfa5d610cf7a567825dc6858068fee1a756a17d4c5cbc2904a61ab4f37"
            },
            "downloads": -1,
            "filename": "fancyschmancytestsplit-0.1.10.tar.gz",
            "has_sig": false,
            "md5_digest": "ce1fdee4f08fac567cae9d13f4c86e5b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 3984,
            "upload_time": "2024-05-07T10:27:22",
            "upload_time_iso_8601": "2024-05-07T10:27:22.490730Z",
            "url": "https://files.pythonhosted.org/packages/c4/37/0320d5afe376f1673a717a400b90195819429c386d251770bb7263f260b1/fancyschmancytestsplit-0.1.10.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-07 10:27:22",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "fancyschmancytestsplit"
}
