fewlab


| Field | Value |
| --- | --- |
| Name | fewlab |
| Version | 0.3.1 |
| Summary | Pick the fewest items to label for unbiased OLS on shares |
| Upload time | 2025-11-01 22:29:56 |
| Author | Gaurav Sood |
| Requires Python | >=3.11 |
| Home page | None |
| Maintainer | None |
| Docs URL | None |
| License | None |
| Requirements | No requirements were recorded. |

## fewlab: fewest items to label for most efficient unbiased OLS on shares

[![Python application](https://github.com/finite-sample/fewlab/actions/workflows/ci.yml/badge.svg)](https://github.com/finite-sample/fewlab/actions/workflows/ci.yml)
[![Documentation](https://img.shields.io/badge/docs-github.io-blue)](https://finite-sample.github.io/fewlab/)
[![PyPI version](https://img.shields.io/pypi/v/fewlab.svg)](https://pypi.org/project/fewlab/)
[![Downloads](https://pepy.tech/badge/fewlab)](https://pepy.tech/project/fewlab)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)

**Problem**: You have usage data (users × items) and want to understand how user traits relate to item preferences. But you can't afford to label every item. This tool tells you which items to label first to get the most accurate analysis.

## When You Need This

You have:
- A usage matrix: rows are users, columns are items (websites, products, apps)
- User features you want to analyze (demographics, behavior patterns)
- Limited budget to label items (safe/unsafe, brand affiliation, category)

You want to run a regression to understand relationships between user features and item traits, but labeling is expensive. Random sampling wastes budget on items that don't affect your analysis.
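
To make the target concrete, here is a minimal sketch of the quantity being modeled, assuming binary item labels and a single user feature. The toy data and the helper names `trait_shares` and `ols_fit` are made up for illustration and are not part of fewlab's API.

```python
# Illustrative only: toy data and helper names are assumptions, not fewlab's API.

def trait_shares(counts, labels):
    """Per-user share of usage that falls on items labeled 1."""
    shares = []
    for row in counts:
        total = sum(row)
        on_trait = sum(c * a for c, a in zip(row, labels))
        shares.append(on_trait / total)
    return shares


def ols_fit(x, y):
    """Slope and intercept of a one-feature least-squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Rows are users, columns are items.
counts = [
    [8, 2, 0],
    [5, 5, 0],
    [2, 6, 2],
    [0, 5, 5],
]
labels = [1, 0, 0]               # only the first item has the trait
feature = [1.0, 2.0, 3.0, 4.0]   # one feature value per user

shares = trait_shares(counts, labels)
slope, intercept = ols_fit(feature, shares)
print(shares)
print(slope, intercept)
```

Computing these shares exactly requires a label for every item, which is the cost the tool is meant to reduce.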

## How It Works

The tool identifies items that most influence your regression coefficients. It prioritizes items that:
1. Are used by many people
2. Show different usage patterns across your user segments
3. Would most change your conclusions if mislabeled

Think of it as "statistical leverage": some items matter more than others for understanding user-trait relationships.
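
One way to picture this kind of leverage, sketched under strong simplifying assumptions: with a single centered user feature, project each item's per-user usage share onto that feature and square the result. Items whose usage varies with the feature get large scores, while items used equally by everyone score near zero. The function `influence_scores` and the toy data are hypothetical, and this is not presented as fewlab's actual scoring rule.

```python
# Illustrative only: a single-feature simplification, not fewlab's scoring rule.

def influence_scores(counts, feature):
    """Squared projection of each item's usage-share column onto a centered feature."""
    n_users = len(counts)
    n_items = len(counts[0])
    mean = sum(feature) / n_users
    x = [f - mean for f in feature]
    sxx = sum(v * v for v in x)
    totals = [sum(row) for row in counts]
    scores = []
    for j in range(n_items):
        # counts[i][j] / totals[i] is the fraction of user i's usage on item j
        proj = sum(x[i] * counts[i][j] / totals[i] for i in range(n_users)) / sxx
        scores.append(proj * proj)
    return scores


counts = [
    [8, 1, 1],
    [6, 3, 1],
    [3, 6, 1],
    [1, 8, 1],
]
feature = [1.0, 2.0, 3.0, 4.0]

scores = influence_scores(counts, feature)
ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores)
print(ranked)
```

In this toy data the third item has the same usage share for every user, so labeling it reveals nothing about how the trait varies with the feature, and it ranks last.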

## Quick Start

```python
from fewlab import items_to_label
import pandas as pd

# Your data: user features and item usage
user_features = pd.DataFrame(...)  # User characteristics
item_usage = pd.DataFrame(...)     # Usage counts per user-item

# Get top 100 items to label
priority_items = items_to_label(
    counts=item_usage,
    X=user_features,
    K=100
)

# Send priority_items to your labeling team
print(f"Label these items first: {priority_items}")
```

## Advanced Usage

```python
from fewlab import pi_aopt_for_budget, balanced_fixed_size, row_se_min_labels

# item_usage and user_features are the DataFrames from the Quick Start.
# influence_projections and error_budget_per_row are placeholders that
# you must define before running this snippet.

# Get inclusion probabilities for expected budget
probabilities = pi_aopt_for_budget(
    counts=item_usage,
    X=user_features,
    K=100
)

# Balanced sampling with probability constraints
selected_items = balanced_fixed_size(
    pi=probabilities,
    g=influence_projections,
    K=100,
    seed=42
)

# Minimize row-wise standard errors
optimal_items = row_se_min_labels(
    counts=item_usage,
    eps2=error_budget_per_row
)
```

## What You Get

**Multiple approaches** for optimal item selection:

- **`items_to_label()`**: Deterministic top-K items for maximum precision
- **`pi_aopt_for_budget()`**: Inclusion probabilities for randomized sampling
- **`balanced_fixed_size()`**: Balanced sampling with probability constraints
- **`row_se_min_labels()`**: Minimize row-wise standard errors
- **`topk()`**: Efficient O(n) top-k selection algorithm
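
For intuition on the linear-time selection mentioned in the last bullet, here is a minimal quickselect-style sketch that returns the indices of the k largest scores in expected O(n) time, without fully sorting. It is a generic illustration, not fewlab's `topk()` implementation, and the name `topk_indices` is made up here.

```python
import random

# Illustrative only: a generic quickselect sketch, not fewlab's topk().

def topk_indices(scores, k, seed=0):
    """Indices of the k largest scores (order unspecified), expected O(n) time."""
    if k <= 0:
        return []
    idx = list(range(len(scores)))
    if k >= len(idx):
        return idx
    rng = random.Random(seed)
    chosen = []
    while True:
        need = k - len(chosen)
        pivot = scores[rng.choice(idx)]
        greater = [i for i in idx if scores[i] > pivot]
        equal = [i for i in idx if scores[i] == pivot]
        less = [i for i in idx if scores[i] < pivot]
        if len(greater) > need:
            # All remaining winners are strictly above the pivot.
            idx = greater
        elif len(greater) + len(equal) >= need:
            # Take everything above the pivot, then fill up with ties.
            chosen += greater + equal[: need - len(greater)]
            return chosen
        else:
            # Everything at or above the pivot is in; keep looking below it.
            chosen += greater + equal
            idx = less


scores = [0.2, 0.9, 0.1, 0.7, 0.9, 0.4]
top3 = sorted(topk_indices(scores, 3))
print(top3)
```

Each pass partitions only the candidates that are still undecided, so the expected total work is linear in the number of items rather than the O(n log n) of a full sort.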

All methods consider:
- Item usage patterns across user segments
- Statistical leverage for regression coefficients
- Optimal allocation of labeling budget

## Practical Considerations

**Choosing K**: Start with 10-20% of items. You can always label more if needed.

**Validation**: Compare regression stability with different K values. When coefficients stop changing significantly, you have enough labels.
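
A minimal sketch of that stopping rule, assuming you already have coefficient estimates from fits at increasing K. The helper `first_stable_k`, the tolerance, and the numbers are hypothetical and not part of fewlab.

```python
# Illustrative only: helper name, tolerance, and coefficients are made up.

def first_stable_k(coefs_by_k, tol=0.05):
    """Smallest K whose coefficients moved by at most `tol` (max absolute
    change) relative to the previous K, or None if none qualifies."""
    ks = sorted(coefs_by_k)
    for prev, curr in zip(ks, ks[1:]):
        change = max(
            abs(a - b) for a, b in zip(coefs_by_k[prev], coefs_by_k[curr])
        )
        if change <= tol:
            return curr
    return None


# Coefficient vectors from regressions fit with K labeled items.
coefs_by_k = {
    50: [0.90, -0.10],
    100: [0.62, -0.25],
    200: [0.55, -0.31],
    400: [0.54, -0.30],
}

stable_k = first_stable_k(coefs_by_k)
print(stable_k)
```

Maximum absolute change is only one possible criterion. Comparing the change against the coefficients' standard errors is another reasonable choice.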

**Limitations**:
- Works best when usage patterns correlate with user features
- Assumes item labels are binary (has trait / doesn't have trait)
- Most effective for sparse usage matrices

## Advanced: Ensuring Unbiased Estimates

The basic approach returns a deterministic priority list of items to label. Strictly unbiased estimates, however, require some randomization in which items get labeled. If you need formal statistical guarantees, add a small random sample on top of the priority list. See the [statistical details](link) for more.
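
The role of randomization can be illustrated with inverse-probability (Horvitz-Thompson) weighting: if item j is sampled with a known probability pi_j > 0 and each sampled item's contribution is divided by pi_j, the weighted total is unbiased for the full total. The sketch below checks this by enumerating every possible sample under independent sampling. The data, probabilities, and function names are illustrative assumptions, not fewlab's estimator.

```python
from itertools import product

# Illustrative only: toy values and probabilities, not fewlab's estimator.

def ht_total(values, pi, sampled):
    """Horvitz-Thompson estimate of sum(values) from the sampled indices."""
    return sum(values[j] / pi[j] for j in sampled)


def expected_ht_total(values, pi):
    """Exact expectation of the HT estimate when items are sampled independently."""
    expectation = 0.0
    for mask in product([0, 1], repeat=len(values)):
        prob = 1.0
        for j, keep in enumerate(mask):
            prob *= pi[j] if keep else 1.0 - pi[j]
        sampled = [j for j, keep in enumerate(mask) if keep]
        expectation += prob * ht_total(values, pi, sampled)
    return expectation


# One user's usage on trait-carrying items (count times label), per item.
values = [6.0, 3.0, 0.0, 1.0]
pi = [0.9, 0.5, 0.2, 0.25]   # inclusion probabilities, all strictly positive

true_total = sum(values)
expected = expected_ht_total(values, pi)
print(true_total, expected)
```

A purely deterministic top-K list corresponds to inclusion probabilities of 1 and 0, and reweighting can never recover the contribution of items that have probability 0, which is why some randomization is needed.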

## Installation

```bash
pip install fewlab
```

**Requirements**: Python 3.11+, numpy ≥1.23, pandas ≥1.5

**Development**:
```bash
pip install -e ".[dev]"  # Includes testing, linting, pre-commit hooks
pip install -e ".[docs]" # Includes documentation building
```

## What's New in v0.3.0

- 🐍 **Modern Python**: Requires Python 3.11+ (breaking change)
- 📋 **Smart Config**: Docs automatically sync with pyproject.toml metadata
- 🚀 **Performance**: O(n) top-k selection algorithm (vs O(n log n))
- 🔧 **Code Quality**: Type hints, constants, eliminated dead code
- 📚 **Modern Docs**: Furo theme with dark/light mode support
- 🧪 **Developer Experience**: Pre-commit hooks, comprehensive testing
- 📦 **Expanded API**: 5 functions for different sampling strategies

## Development

Contributors: see [CONTRIBUTING.md](CONTRIBUTING.md) for setup instructions, including the required pre-commit hooks.

## License

MIT

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "fewlab",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": null,
    "keywords": null,
    "author": "Gaurav Sood",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/3c/a6/1d68b33d05ede2717aa9d01a871c75c3aadc5d2c8d506556b59cb768dbac/fewlab-0.3.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Pick the fewest items to label for unbiased OLS on shares",
    "version": "0.3.1",
    "project_urls": {
        "Homepage": "https://github.com/finite-sample/fewlab"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "f3c2b5fe9672846db184babee58ca3260bdf50f1e9f729f10e872d1d5298b0aa",
                "md5": "4d4e832b3c8e4ec4e1b1e2ca8c050568",
                "sha256": "8b16136b8be71aead675e1dbc2b92d7cde00af417ec01c47e92ab8810a5cf282"
            },
            "downloads": -1,
            "filename": "fewlab-0.3.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "4d4e832b3c8e4ec4e1b1e2ca8c050568",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.11",
            "size": 10718,
            "upload_time": "2025-11-01T22:29:55",
            "upload_time_iso_8601": "2025-11-01T22:29:55.159841Z",
            "url": "https://files.pythonhosted.org/packages/f3/c2/b5fe9672846db184babee58ca3260bdf50f1e9f729f10e872d1d5298b0aa/fewlab-0.3.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "3ca61d68b33d05ede2717aa9d01a871c75c3aadc5d2c8d506556b59cb768dbac",
                "md5": "f99b630b5f1c5d2f455ae25b55a498ef",
                "sha256": "089d670c0bbc5cf69678a2f190114a4c095ed3aa145ab319a41231a50b21396a"
            },
            "downloads": -1,
            "filename": "fewlab-0.3.1.tar.gz",
            "has_sig": false,
            "md5_digest": "f99b630b5f1c5d2f455ae25b55a498ef",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 12391,
            "upload_time": "2025-11-01T22:29:56",
            "upload_time_iso_8601": "2025-11-01T22:29:56.665685Z",
            "url": "https://files.pythonhosted.org/packages/3c/a6/1d68b33d05ede2717aa9d01a871c75c3aadc5d2c8d506556b59cb768dbac/fewlab-0.3.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-11-01 22:29:56",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "finite-sample",
    "github_project": "fewlab",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "fewlab"
}
        