# fancy schmancy testsplit
#### it's like a testsplit, but fancy and also schmancy
----
for reference:
package | fancy | schmancy | testsplit
:- | :- | :- | :-
sklearn.model_selection | 👎 | 👎 | 👍
fancy schmancy testsplit | 👍 | 👍 | 👍

A train/test split performed per label category, so that every category stays represented after the split.

----
### Examples
Assume the following DataFrame:
```Python
from pandas import DataFrame

df = DataFrame(data={"Column A": [10, 14, 12, 13, 9, 5, 13, 16, 18, 4, 12],
                     "Column B": ["Cat1", "Cat1", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2"]})
print(df)
```
|| Column A | Column B
:- | -: | -:
0 | 10 | Cat1
1 | 14 | Cat1
2 | 12 | Cat2
3 | 13 | Cat2
4 | 9 | Cat2
5 | 5 | Cat2
6 | 13 | Cat2
7 | 16 | Cat2
8 | 18 | Cat2
9 | 4 | Cat2
10 | 12 | Cat2

If we further assume that Column B contains the label categories, a plain 50% train/test split over the whole DataFrame risks losing Cat1: both Cat1 rows can land on the same side of the split, leaving the other side without that category.

To preserve every existing category, the split is instead made separately on each category's subset of rows.
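To put a rough number on that risk, the chance that a plain random split misses Cat1 can be worked out by counting. The sketch below assumes the 11 rows are divided into 5 training and 6 test rows (scikit-learn's `train_test_split` rounds the test share up when `test_size` is a float) and that rows are drawn uniformly at random. `prob_category_missing` is a helper name made up for this illustration, not part of any package.

```python
from math import comb


def prob_category_missing(n_rows: int, n_category: int, n_drawn: int) -> float:
    """Probability that n_drawn rows, drawn uniformly without replacement
    from n_rows rows, contain none of the n_category rows of one category."""
    return comb(n_rows - n_category, n_drawn) / comb(n_rows, n_drawn)


n_rows, n_cat1 = 11, 2      # the example DataFrame: 2 x Cat1, 9 x Cat2
n_test = 6                  # assumption: ceil(0.5 * 11) rows go to the test set
n_train = n_rows - n_test   # the remaining 5 rows form the training set

p_missing_train = prob_category_missing(n_rows, n_cat1, n_train)
p_missing_test = prob_category_missing(n_rows, n_cat1, n_test)

print(f"Cat1 missing from train: {p_missing_train:.3f}")  # 0.273
print(f"Cat1 missing from test:  {p_missing_test:.3f}")   # 0.182
```

Under these assumptions, Cat1 is absent from one side of the split in roughly 45% of random draws, which is why the split is done per category instead.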
As an example for Cat1:
```Python
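from pandas import DataFrame, Series
from sklearn.model_selection import train_test_split

# df is the DataFrame defined above; keep only the rows labelled Cat1
subset = df[df["Column B"] == "Cat1"]
X = subset.drop("Column B", axis=1)
y = subset["Column B"]
# Selecting a single column yields a Series; wrap it in a DataFrame
if isinstance(y, Series):
    y = DataFrame(y)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=42)
print(y_tr)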
```
|| Column B
:- | -:
0 | Cat1

This is repeated for every unique entry of the given label column, so that train and test rows are picked at random within each category separately.

If this were done for "Cat1" and "Cat2", the combined training labels would look like this:

|| Column B
:- | -:
0 | Cat1
4 | Cat2
6 | Cat2
5 | Cat2
8 | Cat2
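The per-category procedure described above could be sketched roughly as follows. This is an illustration under assumptions, not the package's actual source: the function name `split_per_category` is hypothetical, and the parameter names simply mirror the ones used in this README.

```python
from pandas import DataFrame, concat
from sklearn.model_selection import train_test_split


def split_per_category(data, label_column, test_split=0.5, seed=42):
    """Split each label category separately, then stack the pieces."""
    X_train, X_test, y_train, y_test = [], [], [], []
    for category in data[label_column].unique():
        subset = data[data[label_column] == category]
        X = subset.drop(label_column, axis=1)
        y = subset[[label_column]]  # double brackets keep y a DataFrame
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_split, random_state=seed
        )
        X_train.append(X_tr)
        X_test.append(X_te)
        y_train.append(y_tr)
        y_test.append(y_te)
    # Concatenate the per-category pieces; original row indices are kept
    return concat(X_train), concat(X_test), concat(y_train), concat(y_test)


df = DataFrame(data={"Column A": [10, 14, 12, 13, 9, 5, 13, 16, 18, 4, 12],
                     "Column B": ["Cat1"] * 2 + ["Cat2"] * 9})
X_train, X_test, y_train, y_test = split_per_category(df, "Column B")
print(y_train["Column B"].value_counts())
```

With the example data, each side of the split contains both categories. Note that in this sketch a category with only a single row cannot be split, because `train_test_split` raises an error when one side would end up empty.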
To shorten the process, the function `fancy_schmancy_testsplit` can be used like this:
```Python
from FancySchmancyTestsplit.fst import fancy_schmancy_testsplit
from pandas import DataFrame

df = DataFrame(data={"Column A": [10, 14, 12, 13, 9, 5, 13, 16, 18, 4, 12],
                     "Column B": ["Cat1", "Cat1", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2"]})
X_train, X_test, y_train, y_test = fancy_schmancy_testsplit(
    data=df,
    label_column="Column B",
    test_split=0.5,
    seed=42,
)
print(y_train)
```
|| Column B
:- | -:
0 | Cat1
4 | Cat2
6 | Cat2
5 | Cat2
8 | Cat2
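A quick way to confirm that a split kept every category is to compare the set of labels before and after. The helper below is a hypothetical sketch (`missing_categories` is not part of the package). It accepts any iterable of labels, so with the DataFrames above it could be called as `missing_categories(df["Column B"], y_train["Column B"])`. Plain lists copied from the tables above stand in for the real data here.

```python
def missing_categories(all_labels, split_labels):
    """Return the sorted categories present in all_labels but absent from split_labels."""
    return sorted(set(all_labels) - set(split_labels))


# Labels of the full example DataFrame: 2 x Cat1, 9 x Cat2
all_labels = ["Cat1"] * 2 + ["Cat2"] * 9
# Training labels as printed in the table above
train_labels = ["Cat1", "Cat2", "Cat2", "Cat2", "Cat2"]
# A hypothetical unlucky plain split that lost Cat1
unlucky_labels = ["Cat2"] * 5

print(missing_categories(all_labels, train_labels))    # []
print(missing_categories(all_labels, unlucky_labels))  # ['Cat1']
```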
Raw data
{
"_id": null,
"home_page": null,
"name": "FancySchmancyTestsplit",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "Kevin Pohl <pohl.kevin@gmail.com>",
"keywords": "test split, testsplit, train test split",
"author": null,
"author_email": "Kevin Pohl <pohl.kevin@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/c4/37/0320d5afe376f1673a717a400b90195819429c386d251770bb7263f260b1/fancyschmancytestsplit-0.1.10.tar.gz",
"platform": null,
"description": "# fancy schmancy testsplit\r\n#### it's like a testsplit, but fancy and also schmancy\r\n----\r\nfor reference:\r\n package | fancy | schmancy | testsplit\r\n :- | :- | :- | :-\r\n sklearn.model_selection | 👎 | 👎 | 👍\r\n fancy schmancy testsplit | 👍 | 👍 | 👍\r\n\r\na testsplit per label category, to ensure that every category is present\r\n \r\n----\r\n### Examples\r\n\r\nAssume the following DataFrame:\r\n```Python\r\ndf = DataFrame(data= {\"Column A\":[10, 14, 12, 13, 9, 5, 13, 16, 18, 4, 12],\r\n\"Column B\": [\"Cat1\", \"Cat1\", \"Cat2\", \"Cat2\", \"Cat2\", \"Cat2\", \"Cat2\", \"Cat2\", \"Cat2\", \"Cat2\", \"Cat2\"]})\r\nprint(df)\r\n```\r\n|| Column A | Column B\r\n:- | -: | -:\r\n0 | 10 | Cat1\r\n1 | 14 | Cat1\r\n2 | 12 | Cat2\r\n3 | 13 | Cat2\r\n4 | 9 | Cat2\r\n5 | 5 | Cat2\r\n6 | 13 | Cat2\r\n7 | 16 | Cat2\r\n8 | 18 | Cat2\r\n9 | 4 | Cat2\r\n10 | 12 | Cat2\r\n\r\nIf we assume further that Column B contains the label categories, we'd\r\nrun the risk of eliminating Cat1 by doing a train test split at 50%.\r\n\r\nSo, to preserve every existing category, the split will instead be made\r\non every single subset of categories.\r\n\r\nAs an example for Cat1:\r\n```Python\r\nsubset = df[df[\"Column B\"] == \"Cat1\"]\r\nX = subset.drop(\"Column B\", axis= 1)\r\ny = subset[\"Column B\"]\r\nif isinstance(y, Series): y = DataFrame(y)\r\nX_tr, X_te, y_tr, y_te = \\\r\n train_test_split(X, y, test_size = 0.5, random_state = 42)\r\nprint(y_tr)\r\n```\r\n|| Column B\r\n:- | -:\r\n0 | Cat1\r\n\r\nThis is done for every unique entry of the given label column, so that a random pick of train and test data is done for every category separately.\r\n\r\nIf this was done for \"Cat1\" and \"Cat2\", it would look like this:\r\n\r\n|| Column B\r\n:- | -:\r\n0 | Cat1\r\n4 | Cat2\r\n6 | Cat2\r\n5 | Cat2\r\n8 | Cat2\r\n\r\nTo shorten the process, the method fancy_schmancy_testsplit can be used in this way:\r\n\r\n```Python\r\nfrom FancySchmancyTestsplit.fst import 
fancy_schmancy_testsplit\r\nfrom pandas import DataFrame\r\ndf = DataFrame(data= {\"Column A\":[10, 14, 12, 13, 9, 5, 13, 16, 18, 4, 12],\r\n\"Column B\": [\"Cat1\", \"Cat1\", \"Cat2\", \"Cat2\", \"Cat2\", \"Cat2\", \"Cat2\", \"Cat2\", \"Cat2\", \"Cat2\", \"Cat2\"]})\r\nX_train, X_test, y_train, y_test = \\\r\n fancy_schmancy_testsplit(data= df,\r\n label_column= \"Column B\",\r\n test_split= 0.5,\r\n seed= 42\r\n )\r\nprint(y_train)\r\n```\r\n|| Column B\r\n:- | -:\r\n0 | Cat1\r\n4 | Cat2\r\n6 | Cat2\r\n5 | Cat2\r\n8 | Cat2\r\n\r\n\r\n\r\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "a more in-depth testsplit splitting intercategorical",
"version": "0.1.10",
"project_urls": null,
"split_keywords": [
"test split",
" testsplit",
" train test split"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "b4739ddba650e277f3e57d2f45a5bea367704491b9d7c5b1cd8f89dad4335f25",
"md5": "51996aca7a9e5a288406a9656dbdf1b7",
"sha256": "87b8d2e0070f540844483d8d8411f0b5cc5d36ce0119f80287c40a89ef9b0cae"
},
"downloads": -1,
"filename": "FancySchmancyTestsplit-0.1.10-py3-none-any.whl",
"has_sig": false,
"md5_digest": "51996aca7a9e5a288406a9656dbdf1b7",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 4015,
"upload_time": "2024-05-07T10:27:17",
"upload_time_iso_8601": "2024-05-07T10:27:17.551338Z",
"url": "https://files.pythonhosted.org/packages/b4/73/9ddba650e277f3e57d2f45a5bea367704491b9d7c5b1cd8f89dad4335f25/FancySchmancyTestsplit-0.1.10-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "c4370320d5afe376f1673a717a400b90195819429c386d251770bb7263f260b1",
"md5": "ce1fdee4f08fac567cae9d13f4c86e5b",
"sha256": "247e44bfa5d610cf7a567825dc6858068fee1a756a17d4c5cbc2904a61ab4f37"
},
"downloads": -1,
"filename": "fancyschmancytestsplit-0.1.10.tar.gz",
"has_sig": false,
"md5_digest": "ce1fdee4f08fac567cae9d13f4c86e5b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 3984,
"upload_time": "2024-05-07T10:27:22",
"upload_time_iso_8601": "2024-05-07T10:27:22.490730Z",
"url": "https://files.pythonhosted.org/packages/c4/37/0320d5afe376f1673a717a400b90195819429c386d251770bb7263f260b1/fancyschmancytestsplit-0.1.10.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-05-07 10:27:22",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "fancyschmancytestsplit"
}