hypersets

Name: hypersets
Version: 0.0.2 (PyPI)
Summary: Fast, efficient alternative to Hugging Face load_dataset using DuckDB for querying, sampling and transforming remote datasets
Upload time: 2025-07-20 22:53:45
Requires Python: >=3.8
License: MIT
Keywords: datasets, duckdb, huggingface, machine-learning, parquet
Requirements: No requirements were recorded.

# Hypersets

**Efficient SQL interface for HuggingFace datasets using DuckDB.**

Hypersets is a library for working with massive datasets without downloading them entirely. Query terabytes of data with simple SQL while downloading only what you need.

_Hypersets is currently in pre-alpha. Use at your own risk._

## ✨ Features

- 🚀 **Fast metadata retrieval** - Get dataset info without downloading
- 💾 **Memory-only operation** - No disk caching unless requested  
- 🎯 **Efficient querying** - SQL interface with DuckDB optimization
- 📊 **Download tracking** - See exactly how much data you're saving
- 🧠 **Smart caching** - Avoid repeated API calls
- 🔄 **Multiple formats** - Output as pandas DataFrame or HuggingFace Dataset
- ⚡ **Rate limit handling** - Built-in exponential backoff for 429 errors (see the sketch after this list)
- 🛡️ **Proper error handling** - Clear exceptions for common issues
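
The rate-limit handling above boils down to retrying 429 responses with increasing delays. As a rough illustration only (this is the generic pattern, not Hypersets' actual implementation; `fetch_with_backoff` is a hypothetical helper):

```python
import random
import time

import requests


def fetch_with_backoff(url: str, max_retries: int = 5, base_delay: float = 1.0) -> requests.Response:
    """Retry an HTTP GET with exponential backoff whenever the server returns 429."""
    for attempt in range(max_retries):
        response = requests.get(url, timeout=30)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Honor Retry-After when the server sends it in seconds; otherwise back off exponentially with jitter.
        retry_after = response.headers.get("Retry-After")
        if retry_after and retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * (2 ** attempt)
        time.sleep(delay + random.uniform(0, 0.5))
    raise RuntimeError(f"Still rate limited after {max_retries} attempts: {url}")
```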

## 🚦 Validation Status

What has been tested and confirmed so far:
- **Dataset info retrieval**: Fast YAML frontmatter parsing
- **Efficient querying**: DuckDB SQL with HTTP optimization and 429 retry logic
- **Smart caching**: 1000x+ speedup on repeated calls  
- **Download tracking**: 99.9% data savings demonstrated on real datasets (0.04 GB downloaded from a 59 GB dataset for simple operations)
- **Multiple formats**: pandas DataFrame and HuggingFace Dataset support
- **Error handling**: Proper exceptions and retry logic for production use
- **Memory efficiency**: Handles TB-scale datasets with only megabytes to gigabytes of RAM and bandwidth

## 📦 Installation

```bash
pip install hypersets
```

## 🎯 Quick Start

```python
import hypersets as hs

# Get dataset info without downloading
info = hs.info("omarkamali/wikipedia-monthly") 
print(f"Dataset size: {info.estimated_total_size_gb:.1f} GB")
print(f"Configs: {len(info.config_names)}")
print(f"Available configs: {info.config_names[:5]}")

# Query with SQL - only downloads what's needed
result = hs.query(
    "SELECT title, LENGTH(text) as text_length FROM dataset LIMIT 10",
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)

# Convert to pandas for analysis
df = result.to_pandas()
print(f"Retrieved {len(df)} articles")
```

## 🚀 Core API

### Dataset Information
```python
# Get comprehensive dataset metadata
info = hs.info("omarkamali/wikipedia-monthly")
print(f"Total files: {info.total_parquet_files}")
print(f"Size estimate: {info.estimated_total_size_gb:.1f} GB")

# List available configurations
configs = hs.list_configs("omarkamali/wikipedia-monthly")
print(f"Available configs: {configs[:10]}")  # First 10

# Clear cached metadata
hs.clear_cache()
```

### SQL Querying
```python
# Basic querying
result = hs.query(
    "SELECT title, url FROM dataset WHERE LENGTH(text) > 10000 LIMIT 100",
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)

# Aggregation queries
count = hs.count(
    dataset="omarkamali/wikipedia-monthly", 
    config="latest.en"
)
print(f"Total articles: {count:,}")

# Advanced analytics
stats = hs.query(
    """
    SELECT 
        COUNT(*) as total_articles,
        AVG(LENGTH(text)) as avg_length,
        MAX(LENGTH(text)) as max_length
    FROM dataset
    """,
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)
```

### Sampling & Exploration
```python
# Random sampling with DuckDB optimization
sample = hs.sample(
    n=1000,
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en",
    columns=["title", "url", "LENGTH(text) as text_length"]
)

# Quick data preview
preview = hs.head(
    n=5,
    dataset="omarkamali/wikipedia-monthly", 
    config="latest.en",
    columns=["title", "url"]
)

# Schema inspection
schema = hs.schema(
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)
print(f"Columns: {[col['name'] for col in schema.columns]}")
```

### Output Formats
```python
result = hs.query("SELECT * FROM dataset LIMIT 100", ...)

# As pandas DataFrame
df = result.to_pandas()
print(df.head())

# As HuggingFace Dataset
hf_dataset = result.to_hf_dataset()
print(hf_dataset.features)

# Query result metadata
print(f"Shape: {result.shape}")
print(f"Columns: {result.columns}")
```
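
Since queries run through DuckDB, these conversions can be thought of as thin wrappers over standard APIs. A minimal sketch of the underlying pattern, using plain `duckdb` and `datasets` rather than Hypersets itself:

```python
import duckdb
from datasets import Dataset

con = duckdb.connect()

# A tiny in-memory relation standing in for a mounted remote dataset.
df = con.execute("SELECT 1 AS id, 'hello' AS title").df()  # DuckDB result -> pandas DataFrame
hf_dataset = Dataset.from_pandas(df)                       # pandas DataFrame -> HuggingFace Dataset

print(df.shape, list(df.columns))
print(hf_dataset.features)
```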

### Download Tracking
```python
# Enable download tracking to see data savings
result = hs.query(
    "SELECT title FROM dataset LIMIT 1000",
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en",
    track_downloads=True
)

# Check savings
if result.download_stats:
    stats = result.download_stats
    print(f"Total dataset: {stats.total_dataset_size_gb:.1f} GB")
    print(f"Downloaded: {stats.estimated_downloaded_gb:.2f} GB")
    print(f"Savings: {stats.savings_percentage:.1f}%")
```

## 📁 Examples

Explore our comprehensive examples to see Hypersets in action:

### 🏃 Quick Demo
```bash
python examples/demo.py
```
**Complete feature demonstration** - Shows all Hypersets capabilities with real datasets.

### 📚 Basic Usage
```bash
python examples/basic_usage.py  
```
**Learn the fundamentals** - Dataset info, querying, sampling, caching, and output formats.

### 🔬 Advanced Queries
```bash  
python examples/advanced_queries.py
```
**Sophisticated analytics** - Text analysis, pattern matching, quality metrics, and performance optimization.

## 🏗️ Architecture

Hypersets consists of four core components:

1. **Dataset Info Retriever** - Discovers parquet files, configs, and schema from YAML frontmatter
2. **DuckDB Mount System** - Mounts remote parquet files as virtual tables with HTTP optimization
3. **Query Interface** - Clean API with SQL support, download tracking, and multiple output formats
4. **Smart Caching** - TTL-based caching of dataset metadata to avoid repeated API calls

All components include proper 429 rate limit handling with exponential backoff.
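
The mount step can be pictured with plain DuckDB: the `httpfs` extension lets `read_parquet` scan remote files over HTTP, fetching only the byte ranges a query actually touches. A minimal sketch of that technique (the parquet URL is a placeholder, and this is not Hypersets' exact internal code):

```python
import duckdb

con = duckdb.connect()  # in-memory database, nothing persisted to disk
con.execute("INSTALL httpfs; LOAD httpfs;")

# Hypothetical remote parquet file; in practice the URLs come from dataset metadata.
parquet_url = "https://huggingface.co/datasets/some/dataset/resolve/main/data/train-00000-of-00001.parquet"

# Expose the remote file as a virtual table, then query it with ordinary SQL.
con.execute(f"CREATE VIEW dataset AS SELECT * FROM read_parquet('{parquet_url}')")
print(con.execute("SELECT COUNT(*) FROM dataset").fetchone())
```

Conceptually, a view like this is what the `dataset` table in the SQL examples above refers to.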

## 🔧 Advanced Configuration

### Memory Management
```python
# Configure DuckDB memory limit (default: 4GB)
result = hs.query(
    "SELECT * FROM dataset LIMIT 1000",
    dataset="large/dataset",
    memory_limit="8GB"  # Increase for large datasets
)

# For extremely large datasets
result = hs.query(
    "SELECT * FROM dataset LIMIT 10000", 
    dataset="massive/dataset",
    memory_limit="16GB",  # More memory
    threads=8             # More threads
)

# Memory-efficient column selection
result = hs.query(
    "SELECT id, title FROM dataset LIMIT 100000",  # Only select needed columns
    dataset="large/dataset",
    memory_limit="2GB"  # Can use less memory
)
```

**Memory Limit Guidelines** (see the DuckDB sketch after this list):
- **Default (4GB)**: Good for most datasets up to ~50GB
- **8GB**: For large datasets (50-200GB) or complex queries  
- **16GB+**: For massive datasets (200GB+) or heavy aggregations
- **Column selection**: Always select only needed columns for better memory efficiency
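
Presumably the `memory_limit` and `threads` parameters map onto DuckDB's own settings; under that assumption, the equivalent raw DuckDB configuration would look like this:

```python
import duckdb

con = duckdb.connect()
con.execute("SET memory_limit = '8GB'")  # cap DuckDB's working memory
con.execute("SET threads = 8")           # parallelism for scans and aggregations

# Inspect the effective settings.
print(con.execute("SELECT current_setting('memory_limit'), current_setting('threads')").fetchone())
```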

### Custom Caching
```python
# Cache with custom TTL (Time To Live)
info = hs.info("dataset", cache_ttl=3600)  # 1 hour

# Disable caching for fresh data
info = hs.info("dataset", use_cache=False)
```

### Authentication
```python
# Use a HuggingFace token for private datasets
result = hs.query(
    "SELECT * FROM dataset LIMIT 10",
    dataset="private/dataset",
    token="hf_your_token_here"
)
```

### Performance Tuning
```python
# Optimize for your use case
result = hs.query(
    "SELECT * FROM dataset USING SAMPLE 10000",
    dataset="large/dataset",
    memory_limit="6GB",    # Adequate memory
    threads=4,            # Balanced parallelism  
    track_downloads=True  # Monitor efficiency
)

# For aggregation-heavy workloads
stats = hs.query(
    """
    SELECT 
        category,
        COUNT(*) as count,
        AVG(LENGTH(text)) as avg_length
    FROM dataset 
    GROUP BY category
    """,
    dataset="large/dataset",
    memory_limit="12GB",  # More memory for grouping
    threads=8            # More threads for aggregation
)
```

## Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make changes and add tests
4. Run tests: `pytest tests/`
5. Submit a pull request

## License

MIT License - see [LICENSE](LICENSE) file for details.

## Acknowledgments

- **DuckDB** for incredible SQL analytics on remote data
- **Parquet** for being the de facto standard for columnar data storage
- **HuggingFace** for democratizing access to datasets
- **The open source community** for inspiration and feedback

## Contributors

[Omar Kamali](https://omarkama.li)

## 📝 Citation

If you use Hypersets in your research, please cite:

```bibtex
@misc{hypersets,
    title={Hypersets: Efficient dataset transfer, querying and transformation},
    author={Omar Kamali},
    year={2025},
    url={https://github.com/omarkamali/hypersets},
    note={Project developed under Omneity Labs}
}
```


---

**🚀 Ready to query terabytes of data efficiently?** Start with `examples/demo.py` to see Hypersets in action! 
            
