dprune

Name: dprune
Version: 0.0.1 (PyPI)
Summary: A lightweight, extensible Python library for data pruning with Hugging Face datasets and transformers
Upload time: 2025-07-12 21:11:00
Requires Python: >=3.8
License: MIT License (Copyright (c) 2024 Abdul Hameed Azeemi)
Keywords: machine-learning, data-pruning, hugging-face, transformers, datasets, nlp, deep-learning
# 🌿 dPrune: A Framework for Data Pruning

[![CI](https://github.com/ahazeemi/dPrune/actions/workflows/ci.yml/badge.svg)](https://github.com/ahazeemi/dPrune/actions/workflows/ci.yml)
[![PyPI version](https://badge.fury.io/py/dprune.svg)](https://badge.fury.io/py/dprune)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

`dPrune` is a lightweight, extensible Python library designed to make data selection and pruning simple and accessible for NLP and speech tasks, with first-class support for Hugging Face `datasets` and `transformers`.

Data pruning is the process of selecting a smaller, more informative, higher-quality subset of a large training dataset. This can lead to faster training, lower computational costs, and even better model performance by removing noisy or redundant examples. `dPrune` provides a modular framework for experimenting with various pruning strategies.

---

## ⭐ Key Features

- **Hugging Face Integration**: Works seamlessly with Hugging Face `datasets` and `transformers`.
- **Modular Design**: Separates the scoring logic from the pruning criteria.
- **Extensible**: Easily create your own custom scoring functions and pruning methods.
- **Supervised & Unsupervised Scoring Methods**: Includes a variety of common pruning techniques.
  - **Supervised**: Score data based on model outputs (e.g., cross-entropy loss, forgetting scores).
  - **Unsupervised**: Score data based on intrinsic properties (e.g., clustering embeddings, perplexity scores).
- **Multiple Pruning Strategies**: Supports top/bottom-k pruning, stratified sampling, and random pruning.

## 📦 Installation

You can install `dPrune` via pip:

```bash
pip install dprune
```

Alternatively, you can use [`uv`](https://github.com/astral-sh/uv):

```bash
uv pip install dprune
```

To install the library with all testing dependencies, run:

```bash
pip install "dprune[test]"
```

## 🚀 Quick Start

Here's a simple example of how to prune a dataset using unsupervised KMeans clustering. This approach keeps the most representative examples (closest to cluster centroids) without requiring labels or fine-tuning.

```python
from datasets import Dataset
from transformers import AutoModel, AutoTokenizer
from dprune import PruningPipeline, KMeansCentroidDistanceScorer, BottomKPruner

# Any Hugging Face encoder can provide the embeddings; bert-base-uncased is just an example.
model = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

data = {'text': ['A great movie!', 'Waste of time.', 'Amazing.', 'So predictable.']}
raw_dataset = Dataset.from_dict(data)

scorer = KMeansCentroidDistanceScorer(
    model=model,
    tokenizer=tokenizer,
    text_column='text',
    num_clusters=2
)
pruner = BottomKPruner(k=0.5)  # keep the 50% of examples closest to their cluster centroid

pipeline = PruningPipeline(scorer=scorer, pruner=pruner)
pruned_dataset = pipeline.run(raw_dataset)

print(f"Original dataset size: {len(raw_dataset)}")
print(f"Pruned dataset size: {len(pruned_dataset)}")
```

## 💡 Core Concepts

`dPrune` is built around three core components:

#### `Scorer`

A `Scorer` takes a `Dataset` and adds a new `score` column to it. The score is a numerical value that represents some property of the example (e.g., how hard it is for the model to classify).

#### `Pruner`

A `Pruner` takes a scored `Dataset` and selects a subset of it based on the `score` column.

#### `PruningPipeline`

The `PruningPipeline` is a convenience wrapper that chains a `Scorer` and a `Pruner` together into a single, easy-to-use workflow.
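
If you want to inspect or cache the intermediate scores, you can also call the two stages yourself. A minimal sketch, assuming a `scorer` and `pruner` constructed as in the Quick Start; this is roughly what `PruningPipeline.run` does for you:

```python
# Score first: adds a 'score' column to the dataset.
scored_dataset = scorer.score(raw_dataset)
print(scored_dataset['score'])

# Then prune: select a subset based on the 'score' column.
pruned_dataset = pruner.prune(scored_dataset)
```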

## 🛠️ Available Components

### Scorers

- **`KMeansCentroidDistanceScorer`**: (Unsupervised) Embeds the data, performs k-means clustering, and scores each example by its distance to its cluster centroid.
- **`PerplexityScorer`**: (Unsupervised) Calculates a perplexity score for each example using a KenLM n-gram language model.
- **`CrossEntropyScorer`**: (Supervised) Scores examples based on the cross-entropy loss from a given model.
- **`ForgettingScorer`**: (Supervised) Works with a `ForgettingCallback` to score examples based on how many times they are "forgotten" during training.
- ...many more coming soon!


### Pruners

- **`TopKPruner`**: Selects the `k` examples with the highest scores.
- **`BottomKPruner`**: Selects the `k` examples with the lowest scores.
- **`StratifiedPruner`**: Divides the data into strata based on score quantiles and samples proportionally from each.
- **`RandomPruner`**: Randomly selects `k` examples, ignoring scores. Useful for establishing a baseline.
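
As a quick illustration, here is a hedged sketch comparing score-based selection against a random baseline on an already-scored dataset. `TopKPruner` and `BottomKPruner` take a fractional `k` as in the examples in this README; the `RandomPruner` constructor is assumed to accept the same fractional `k`:

```python
from dprune import TopKPruner, BottomKPruner, RandomPruner

# `scored_dataset` is any Dataset that already has a 'score' column.
hardest_quarter = TopKPruner(k=0.25).prune(scored_dataset)     # 25% highest-scoring examples
easiest_quarter = BottomKPruner(k=0.25).prune(scored_dataset)  # 25% lowest-scoring examples

# Same-sized random subset, useful as a baseline for comparison.
random_quarter = RandomPruner(k=0.25).prune(scored_dataset)
```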

### Callbacks

- **`ForgettingCallback`**: A `TrainerCallback` that records learning events during training to be used with the `ForgettingScorer`.

## 🎨 Extending dPrune

Creating your own custom components is straightforward.

### Custom Scorer

Simply inherit from the `Scorer` base class and implement the `score` method.

```python
from dprune import Scorer
from datasets import Dataset
import random

class RandomScorer(Scorer):
    def score(self, dataset: Dataset, **kwargs) -> Dataset:
        scores = [random.random() for _ in range(len(dataset))]
        return dataset.add_column("score", scores)
```

### Custom Pruner

Inherit from the `Pruner` base class and implement the `prune` method.

```python
from dprune import Pruner
from datasets import Dataset

class ThresholdPruner(Pruner):
    def __init__(self, threshold: float):
        self.threshold = threshold

    def prune(self, scored_dataset: Dataset, **kwargs) -> Dataset:
        indices_to_keep = [i for i, score in enumerate(scored_dataset['score']) if score > self.threshold]
        return scored_dataset.select(indices_to_keep)
```
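
Custom components drop into the same workflow as the built-in ones. As a toy sketch, the two classes above can be chained in a `PruningPipeline`; since `RandomScorer` assigns scores uniformly in [0, 1), a threshold of 0.5 keeps roughly half of the examples:

```python
from datasets import Dataset
from dprune import PruningPipeline

data = {'text': ['A great movie!', 'Waste of time.', 'Amazing.', 'So predictable.']}
raw_dataset = Dataset.from_dict(data)

pipeline = PruningPipeline(scorer=RandomScorer(), pruner=ThresholdPruner(threshold=0.5))
pruned_dataset = pipeline.run(raw_dataset)
print(f"Kept {len(pruned_dataset)} of {len(raw_dataset)} examples")
```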

## 📓 Example Notebooks

### 1. Supervised Pruning with Forgetting Score
`examples/supervised_pruning_with_forgetting_score.ipynb`

Shows how to use forgetting scores to prune a dataset.

### 2. Unsupervised Pruning with K-Means
`examples/unsupervised_pruning_with_kmeans.ipynb` 

Demonstrates clustering-based pruning using K-means to remove outlier examples.

### 3. Unsupervised Pruning with Perplexity
`examples/unsupervised_pruning_with_perplexity.ipynb`

Shows how to use perplexity scoring for data pruning in text summarization.

## 🎓 Advanced Usage: Forgetting Score

Some pruning strategies require observing the model's behavior _during_ training. `dPrune` supports this via Hugging Face's `TrainerCallback` mechanism. Here is how you would use the `ForgettingScorer`:

```python
from transformers import Trainer
from dprune import ForgettingCallback, ForgettingScorer, PruningPipeline, TopKPruner

# `model` and `raw_dataset` are assumed to be defined already
# (a model being fine-tuned and its tokenized training dataset).

# 1. Initialize the callback and trainer
forgetting_callback = ForgettingCallback()
trainer = Trainer(
    model=model,
    train_dataset=raw_dataset,
    callbacks=[forgetting_callback],
)

# 2. Assign the trainer to the callback
forgetting_callback.trainer = trainer

# 3. Train the model. The callback will record events automatically.
trainer.train()

# 4. Create the scorer from the populated callback
scorer = ForgettingScorer(forgetting_callback)

# 5. Use the scorer in a pipeline as usual
pipeline = PruningPipeline(scorer=scorer, pruner=TopKPruner(k=0.8)) # Keep 80%
pruned_dataset = pipeline.run(raw_dataset)

print(f"Pruned with forgetting scores, final size: {len(pruned_dataset)}")
```
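
Note that `TopKPruner` in this recipe keeps the examples with the highest forgetting scores; the usual motivation is that examples that are rarely or never forgotten tend to be easy and can often be dropped with little impact on final performance.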

## 🧪 Running Tests

To run the full test suite, clone the repository and run `pytest` from the root directory:

```bash
git clone https://github.com/ahazeemi/dPrune.git
cd dPrune
# Install in editable mode with test dependencies
pip install -e ".[test]"
# Or, with uv
uv pip install -e ".[test]"

pytest
```

## 🤝 Contributing

Contributions are welcome! If you have a feature request, bug report, or want to add a new scorer or pruner, please open an issue or submit a pull request on GitHub.

## 📄 License

This project is licensed under the MIT License. See the `LICENSE` file for details.


## 📝 Citation

If you use `dPrune` in your research, please cite it as follows:

```bibtex
@software{dprune2025,
  author = {Azeemi, Abdul Hameed and Qazi, Ihsan Ayyub and Raza, Agha Ali},
  title = {dPrune: A Framework for Data Pruning},
  year = {2025},
  url = {https://github.com/ahazeemi/dPrune}
}
```

Alternatively, you can cite it in text as:

> Abdul Hameed Azeemi, Ihsan Ayyub Qazi, and Agha Ali Raza. (2025). dPrune: A Framework for Data Pruning. https://github.com/ahazeemi/dPrune

            
