umap-stratified-split


- **Name**: umap-stratified-split
- **Version**: 0.1.0
- **Summary**: Stratified dataset splitting via UMAP-based pseudo-labels
- **Upload time**: 2025-07-19 13:17:31
- **Requires Python**: >=3.8
- **License**: MIT License Copyright (c) 2025 Bilal Q. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- **Keywords**: umap, stratification, pytorch, dataset, split
# UMAP-Stratified Dataset Split
![Python](https://img.shields.io/badge/Python-3.8+-blue)
> Stratified dataset splitting via UMAP‑based pseudo‑labels  
> A drop‑in alternative to `torch.utils.data.random_split` that draws every split from all regions of your data's manifold.

## 📖 Overview

When you randomly split a dataset, you risk concentrating your validation set in a narrow region of feature space, especially in small or highly clustered datasets.  
`umap-stratified-split` embeds your entire dataset with UMAP (using a fixed seed), clusters the embedding to obtain “pseudo‑labels,” and then performs a stratified split on those labels so that each subset samples every cluster, and hence every region of the manifold, in proportion to its size.

This approach assumes that your dataset lies on a meaningful low-dimensional manifold and that sampling evenly across that manifold yields more representative splits.
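The stratification step can be illustrated in isolation. The following is a minimal sketch in plain Python, not the package's actual implementation: it assumes the pseudo-labels have already been obtained by clustering a UMAP embedding, and the function name `stratified_indices` and its parameters are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_indices(
    pseudo_labels: Sequence[int],
    val_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Split sample indices so that every pseudo-label with at least two
    members appears in both subsets, roughly in proportion to its size."""
    rng = random.Random(seed)  # local RNG keeps the split repeatable

    # Group sample indices by their cluster id (the pseudo-label).
    by_cluster: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(pseudo_labels):
        by_cluster[label].append(idx)

    train: List[int] = []
    val: List[int] = []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        rng.shuffle(members)
        if len(members) >= 2:
            # Keep at least one sample of the cluster on each side.
            n_val = round(len(members) * val_fraction)
            n_val = min(max(n_val, 1), len(members) - 1)
        else:
            # A singleton cluster cannot be shared; it stays in training.
            n_val = 0
        val.extend(members[:n_val])
        train.extend(members[n_val:])
    return sorted(train), sorted(val)


# Toy pseudo-labels: three clusters of sizes 10, 5 and 5.
labels = [0] * 10 + [1] * 5 + [2] * 5
train_idx, val_idx = stratified_indices(labels, val_fraction=0.2, seed=42)
```

With index lists like these, `torch.utils.data.Subset(dataset, val_idx)` is one way to build the actual PyTorch subsets.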

---

## 🧠 Theoretical Motivation

This method builds on the foundation laid by the UMAP algorithm:

> **Uniform Manifold Approximation and Projection (UMAP)** is a general-purpose nonlinear dimension reduction technique. Its core idea is to construct a graph-based fuzzy topological representation of data, then optimize a low-dimensional embedding that preserves local neighborhood structure [1].

UMAP relies on the following assumptions:

1. **The data is uniformly distributed on a Riemannian manifold**
2. **The Riemannian metric is locally constant** (or approximately so)
3. **The manifold is locally connected**

When these assumptions hold at least approximately, UMAP can build a low-dimensional embedding that captures meaningful cluster and local neighborhood structure [2].

By clustering the UMAP embedding and stratifying across those clusters, this package provides a principled way to sample validation and training sets that are *representative of the entire data manifold*.

---

## ✅ Advantages of this Approach

- **Manifold-aware validation**: Ensures your validation set covers the same regions as your training set
- **Less validation bias**: Avoids selecting validation samples from narrow regions of the space
- **General-purpose**: Works for any data type (images, time series, embeddings, tabular, etc.)
- **Drop-in replacement**: Mimics the `random_split` API from PyTorch
- **Fully unsupervised**: Uses the geometry of the data, no true labels required
- **Reproducible**: With a fixed UMAP seed and a deterministic clustering step, the same data yields the same split
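One way to check the "less validation bias" claim on your own data is to compare the cluster shares of a subset with those of the full dataset. The sketch below is an illustration only; the helper names `cluster_share` and `max_share_gap` and the toy labels are assumptions, not part of this package.

```python
from collections import Counter
from typing import Dict, Sequence


def cluster_share(labels: Sequence[int], indices: Sequence[int]) -> Dict[int, float]:
    """Fraction of the given (non-empty) subset that falls into each cluster."""
    counts = Counter(labels[i] for i in indices)
    total = len(indices)
    return {label: counts.get(label, 0) / total for label in sorted(set(labels))}


def max_share_gap(labels: Sequence[int], subset_indices: Sequence[int]) -> float:
    """Largest absolute difference between subset and full-dataset cluster shares."""
    full = cluster_share(labels, range(len(labels)))
    sub = cluster_share(labels, subset_indices)
    return max(abs(full[k] - sub[k]) for k in full)


labels = [0] * 6 + [1] * 3 + [2] * 3   # hypothetical pseudo-labels
balanced_val = [0, 1, 6, 9]            # two from cluster 0, one from each other cluster
skewed_val = [0, 1, 2, 3]              # all four samples from cluster 0

gap_balanced = max_share_gap(labels, balanced_val)
gap_skewed = max_share_gap(labels, skewed_val)
```

A gap near zero means the subset mirrors the cluster proportions of the whole dataset; a large gap signals that some regions are over- or under-represented.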

---

## ⚠️ When to Use / Limitations

This method works best when:

- Your data lies on a structured manifold (e.g. clustered, continuous trajectories)
- The standard random split leads to class imbalance or structural bias
- You don’t have true labels, but want stratified-like splits based on learned structure

Avoid using this method if:

- Your data is completely uniform or already well-distributed
- You are carving out a held-out *test set*: the embedding is fitted on the entire dataset, so test samples would influence the split (risk of data leakage)
- Your feature extractor or UMAP embedding fails to capture meaningful structure
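If a held-out test set is needed as well, one common way to sidestep the leakage concern is to reserve the test indices at random first and apply the UMAP-based split only to the remainder. The following is a sketch under that assumption; `reserve_test_indices` is a hypothetical helper, not a function of this package.

```python
import random
from typing import List, Tuple


def reserve_test_indices(
    n_samples: int,
    test_fraction: float,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Randomly set aside test indices before any embedding is computed.

    Returns (remaining, test); only `remaining` would later be embedded,
    clustered and split into training and validation sets.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


remaining_idx, test_idx = reserve_test_indices(100, test_fraction=0.1, seed=7)
```

Because the test samples never enter the embedding or the clustering, they cannot shape the pseudo-labels used for the train/validation split.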

---

## 📚 References

1. McInnes, L., Healy, J., & Melville, J. (2018). *UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction*. arXiv preprint [arXiv:1802.03426](https://arxiv.org/abs/1802.03426)
2. Official UMAP implementation: https://github.com/lmcinnes/umap

---

## 🔧 Installation

### ✅ Recommended (fast & modern):

```bash
# Install via uv (recommended)
uv pip install umap-stratified-split
```

### 🛠️ From GitHub (latest main):

```bash
uv pip install git+https://github.com/bilalqur/umap-stratified-split.git#egg=umap-stratified-split
```

### 🧪 Local development mode:

```bash
git clone https://github.com/bilalqur/umap-stratified-split.git
cd umap-stratified-split
uv pip install -e .
```

---

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "umap-stratified-split",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "umap, stratification, pytorch, dataset, split",
    "author": null,
    "author_email": "\"Bilal Qureshi, M.Sc.\" <bilal.qureshi@outlook.de>",
    "download_url": "https://files.pythonhosted.org/packages/69/19/a23847fcf89c1550c5bde89dd23669eb9f3982fd410185b3170dc7b01c66/umap_stratified_split-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2025 Bilal Q.  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "Stratified dataset splitting via UMAP-based pseudo-labels",
    "version": "0.1.0",
    "project_urls": {
        "Documentation": "https://github.com/bilalqur/umap-stratified-split/docs",
        "Homepage": "https://github.com/bilalqur/umap-stratified-split",
        "Issue Tracker": "https://github.com/bilalqur/umap-stratified-split/issues"
    },
    "split_keywords": [
        "umap",
        " stratification",
        " pytorch",
        " dataset",
        " split"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "88d4da4aef8ec38b8ccb60ed94f861a5cbbb95dca1d036ca7865eb7f3e85aaea",
                "md5": "734e1c08896a60477e4593a2a148cf99",
                "sha256": "783ad267fa0cefb3158dc88ebb9fbdb9d32b1df589e6c1b4e3c4b979b48ba615"
            },
            "downloads": -1,
            "filename": "umap_stratified_split-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "734e1c08896a60477e4593a2a148cf99",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 6373,
            "upload_time": "2025-07-19T13:17:29",
            "upload_time_iso_8601": "2025-07-19T13:17:29.771834Z",
            "url": "https://files.pythonhosted.org/packages/88/d4/da4aef8ec38b8ccb60ed94f861a5cbbb95dca1d036ca7865eb7f3e85aaea/umap_stratified_split-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "6919a23847fcf89c1550c5bde89dd23669eb9f3982fd410185b3170dc7b01c66",
                "md5": "5ccc2ab4ed6e2e09879ae098b102dc0a",
                "sha256": "5670dd1d5ee2fefc52d066f26e6d71e4eb2b1ff8639c63a7835a13db3408a6e9"
            },
            "downloads": -1,
            "filename": "umap_stratified_split-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "5ccc2ab4ed6e2e09879ae098b102dc0a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 5564,
            "upload_time": "2025-07-19T13:17:31",
            "upload_time_iso_8601": "2025-07-19T13:17:31.129424Z",
            "url": "https://files.pythonhosted.org/packages/69/19/a23847fcf89c1550c5bde89dd23669eb9f3982fd410185b3170dc7b01c66/umap_stratified_split-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-19 13:17:31",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "bilalqur",
    "github_project": "umap-stratified-split",
    "github_not_found": true,
    "lcname": "umap-stratified-split"
}
        