Name | kennard-stone |
Version | 3.0.1 |
Summary | A method for selecting samples by spreading the training data evenly. |
upload_time | 2025-08-28 13:24:02 |
maintainer | yu9824 |
author | yu9824 |
requires_python | >=3.8 |
license | MIT License
Copyright (c) 2021 yu9824
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
|
keywords | kennard_stone, scikit-learn, train_test_split, kfold |
# Kennard Stone
[PyPI version](https://pypi.org/project/kennard-stone/)
[Python versions](https://pypi.org/project/kennard-stone/)
[License](https://pypi.org/project/kennard-stone/)
[Downloads](https://pepy.tech/project/kennard-stone)
[Tests](https://github.com/yu9824/kennard_stone/actions/workflows/pytest-on-each-version.yaml)
[Docs](https://github.com/yu9824/kennard_stone/actions/workflows/docs.yml)
[Release to PyPI](https://github.com/yu9824/kennard_stone/actions/workflows/release-pypi.yml)
[conda-forge](https://anaconda.org/conda-forge/kennard-stone)
[conda-forge downloads](https://anaconda.org/conda-forge/kennard-stone)
[Ruff](https://github.com/astral-sh/ruff)
[mypy](https://github.com/python/mypy)
## What is this?
This package implements the Kennard-Stone algorithm, which partitions data evenly, behind a `scikit-learn`-like interface.
(See [References](#references) for details of the algorithm.)
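
The core idea can be sketched in a few lines of plain Python: pick the two samples farthest apart, then repeatedly add the sample whose nearest already-selected neighbor is farthest away (a max-min criterion). The sketch below is illustrative only; it is not the package's optimized implementation, and the function name `kennard_stone_select` is made up for this example.

```python
# Illustrative, pure-Python sketch of Kennard-Stone selection.
# The kennard_stone package uses an optimized, vectorized
# implementation; this only demonstrates the idea.

def kennard_stone_select(points, n_select):
    """Greedily pick n_select indices that spread evenly over the data."""

    def dist2(a, b):
        # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    n = len(points)
    # 1. Start from the two samples farthest apart.
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: dist2(points[ij[0]], points[ij[1]]),
    )
    selected = [i0, j0]
    remaining = set(range(n)) - {i0, j0}
    # 2. Repeatedly add the sample whose nearest selected neighbor
    #    is farthest away (max-min criterion).
    while len(selected) < n_select and remaining:
        nxt = max(
            remaining,
            key=lambda r: min(dist2(points[r], points[s]) for s in selected),
        )
        selected.append(nxt)
        remaining.remove(nxt)
    return selected

pts = [(0.0, 0.0), (10.0, 0.0), (5.0, 0.1), (0.0, 1.0), (9.0, 1.0)]
print(kennard_stone_select(pts, 3))  # indices in selection order
```

Because the procedure is a deterministic greedy loop over the data, the same dataset always yields the same selection, which is why the package has no `random_state` (see the Notes section).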

## How to install
### PyPI
```bash
pip install kennard-stone
```
The project site is [here](https://pypi.org/project/kennard-stone/).
### Anaconda
```bash
conda install -c conda-forge kennard-stone
```
The project site is [here](https://anaconda.org/conda-forge/kennard-stone).
`numpy>=1.20` and `scikit-learn` are required at runtime.
## How to use
You can use these classes and functions like their [scikit-learn](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) counterparts.
See [examples](https://github.com/yu9824/kennard_stone/tree/main/examples) for details.
In the following, `X` denotes an arbitrary explanatory variable and `y` an arbitrary objective variable;
`estimator` denotes an arbitrary scikit-learn-compatible prediction model.
### train_test_split
#### kennard_stone
```python
from kennard_stone import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
#### scikit-learn
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=334
)
```
### KFold
#### kennard_stone
```python
from kennard_stone import KFold
# Always shuffled and uniquely determined for a data set.
kf = KFold(n_splits=5)
for i_train, i_test in kf.split(X, y):
    X_train = X[i_train]
    y_train = y[i_train]
    X_test = X[i_test]
    y_test = y[i_test]
```
#### scikit-learn
```python
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=334)
for i_train, i_test in kf.split(X, y):
    X_train = X[i_train]
    y_train = y[i_train]
    X_test = X[i_test]
    y_test = y[i_test]
```
### Other usages
Wherever scikit-learn accepts a `cv` argument, you can pass a `KFold` instance, so it works with various functions.
An example is [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html).
#### kennard_stone
```python
from kennard_stone import KFold
from sklearn.model_selection import cross_validate
kf = KFold(n_splits=5)
print(cross_validate(estimator, X, y, cv=kf))
```
#### scikit-learn
```python
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
kf = KFold(n_splits=5, shuffle=True, random_state=334)
print(cross_validate(estimator, X, y, cv=kf))
```
OR
```python
from sklearn.model_selection import cross_validate
print(cross_validate(estimator, X, y, cv=5))
```
## Notes
There is no notion of `random_state` or `shuffle` because the partitioning is uniquely determined by the dataset.
Passing these arguments does not raise an error; they simply have no effect on the result, so be careful.
If you want to run the notebooks in the `examples` directory,
you will need to install `pandas`, `matplotlib`, `seaborn`, `tqdm`, and `jupyter` in addition to the packages in requirements.txt.
## Distance metrics
See the documentation of
- `scipy.spatial.distance.pdist`
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html
- `sklearn.metrics.pairwise_distances`
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html
Valid values for `metric` (default: `"euclidean"`) are:
- From scikit-learn: `'cityblock'`, `'cosine'`, `'euclidean'`, `'l1'`,
  `'l2'`, `'manhattan'`. These metrics support sparse matrix inputs.
  `'nan_euclidean'` is also accepted but does not yet support sparse matrices.
- From scipy.spatial.distance: `'braycurtis'`, `'canberra'`,
  `'chebyshev'`, `'correlation'`, `'dice'`, `'hamming'`, `'jaccard'`,
  `'mahalanobis'`, `'minkowski'`, `'rogerstanimoto'`,
  `'russellrao'`, `'seuclidean'`, `'sokalmichener'`, `'sokalsneath'`,
  `'sqeuclidean'`, `'yule'`. See the scipy.spatial.distance
  documentation for details on these metrics.
  These metrics do not support sparse matrix inputs.
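
The choice of metric can change which samples count as farthest apart, and therefore which samples Kennard-Stone picks first. A tiny, self-contained illustration (independent of the package; the helper names `farthest_pair`, `euclidean`, and `cityblock` are made up for this example) comparing Euclidean and cityblock distances:

```python
# How the metric changes the "farthest pair" that seeds the selection.
# Self-contained toy example; it does not use the kennard_stone package.
import math

def euclidean(a, b):
    # L2 distance
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cityblock(a, b):
    # Manhattan / L1 distance
    return sum(abs(x - y) for x, y in zip(a, b))

def farthest_pair(points, dist):
    names = sorted(points)
    return max(
        ((a, b) for i, a in enumerate(names) for b in names[i + 1:]),
        key=lambda ab: dist(points[ab[0]], points[ab[1]]),
    )

pts = {"p": (0.0, 0.0), "q": (3.0, 3.0), "r": (4.5, 0.0)}
print(farthest_pair(pts, euclidean))  # the pair separated most under L2
print(farthest_pair(pts, cityblock))  # a different pair under L1
```

Here `p` and `r` are farthest under the Euclidean metric (4.5 vs. about 4.24 for `p`-`q`), while `p` and `q` are farthest under the cityblock metric (6.0 vs. 4.5), so the two metrics would seed the selection differently.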
## Parallelization (since v2.1.0)
The Kennard-Stone algorithm is computationally intensive and can be slow on large datasets.
To address this, the algorithm has been optimized and parallelized since v2.1.0.
As in the scikit-learn API, `n_jobs` can be specified to enable parallelization.
```python
# parallelized KFold
kf = KFold(n_splits=5, n_jobs=-1)

# parallelized train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, n_jobs=-1
)
```
Parallelization is used only when computing the distance matrix,
so it does not conflict with outer-level parallelism such as running `cross_validate` with `n_jobs` while using `KFold`.
```python
# OK: the two levels of parallelism do not conflict
cross_validate(estimator, X, y, cv=KFold(5, n_jobs=-1), n_jobs=-1)
```
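
The reason the two levels of parallelism compose is that `n_jobs` here only parallelizes the pairwise-distance computation, whose rows are independent of one another. A toy stdlib sketch of that idea (not the package's actual implementation, which is NumPy-based; `distance_row` is a made-up helper):

```python
# Toy sketch: rows of a squared-Euclidean distance matrix are
# independent, so they can be computed by a pool of workers.
# Illustration only; not the package's implementation.
from concurrent.futures import ThreadPoolExecutor

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]

def distance_row(i):
    """Squared Euclidean distances from point i to every point."""
    xi = pts[i]
    return [sum((a - b) ** 2 for a, b in zip(xi, xj)) for xj in pts]

with ThreadPoolExecutor() as ex:
    # each row is an independent task
    dmat = list(ex.map(distance_row, range(len(pts))))

print(dmat[0])  # distances from the first point to all points
```

Because this work happens entirely inside the distance-matrix step, an outer `cross_validate(..., n_jobs=-1)` loop over folds remains free to parallelize across a separate axis.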
## Using GPU
If you have a GPU and PyTorch installed,
you can use it to compute Minkowski distances (including the Manhattan, Euclidean, and Chebyshev distances).
```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, device="cuda"
)
```
## LICENSE
MIT License
Copyright (c) 2021 yu9824
## References
### Papers
- R. W. Kennard & L. A. Stone (1969) Computer Aided Design of Experiments, Technometrics, 11:1, 137-148, DOI: [10.1080/00401706.1969.10490666](https://doi.org/10.1080/00401706.1969.10490666)
### Sites
- [https://datachemeng.com/trainingtestdivision/](https://datachemeng.com/trainingtestdivision/) (Japanese site)
## History
### v2.0.0 (deprecated)
- Define the extended Kennard-Stone algorithm (multi-class), i.e., improve the `KFold` algorithm.
- Delete `alternate` argument in `KFold`.
- Delete requirements of `pandas`.
### v2.0.1
- Fix bug with Python 3.7.
### v2.1.0 (deprecated)
- Optimize the algorithm.
- Handle large amounts of data.
  - Parallelize the distance calculation (add `n_jobs` argument).
  - Replace recursive functions with for-loops.
- Add calculation methods other than "euclidean" (add `metric` argument).
### v2.1.1 (deprecated)
- Fix bug when `metric="nan_euclidean"`.
### v2.1.2 (deprecated)
- Fix details.
  - Update docstrings and typings.
### v2.1.3 (deprecated)
- Fix details.
  - Update some typings. (A list of strings usable as the metric is now exposed.)
### v2.1.4
- Fix bugs when `metric="seuclidean"` and `metric="mahalanobis"`.
  - Add tests covering all metrics.
- Add requirement `numpy>=1.20`.
### v2.1.5
- Delete the "kulsinski" metric to support scipy>=1.11.
### v2.1.6
- Improve typing in `kennard_stone.train_test_split`
- Add some docstrings.
### v2.2.0
- Support GPU calculations (when metric is 'euclidean', 'manhattan', 'chebyshev', or 'minkowski').
- Support Python 3.12.
### v2.2.1
- Fix setup.cfg
- Update 'typing'