Name | hypersets |
Version | 0.0.2 |
home_page | None |
Summary | Fast, efficient alternative to Hugging Face load_dataset using DuckDB for querying, sampling and transforming remote datasets |
upload_time | 2025-07-20 22:53:45 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.8 |
license | MIT |
keywords | datasets, duckdb, huggingface, machine-learning, parquet |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# Hypersets
**Efficient SQL interface for HuggingFace datasets using DuckDB.**
Hypersets is a library for working with massive datasets without downloading them in full. Query terabytes of data with simple SQL while downloading only what you need.
_Hypersets is currently in pre-alpha stage. Use at your own risk._
## ✨ Features
- 🚀 **Fast metadata retrieval** - Get dataset info without downloading
- 💾 **Memory-only operation** - No disk caching unless requested
- 🎯 **Efficient querying** - SQL interface with DuckDB optimization
- 📊 **Download tracking** - See exactly how much data you're saving
- 🧠 **Smart caching** - Avoid repeated API calls
- 🔄 **Multiple formats** - Output as pandas DataFrame or HuggingFace Dataset
- ⚡ **Rate limit handling** - Built-in exponential backoff for 429 errors
- 🛡️ **Proper error handling** - Clear exceptions for common issues
## 🚦 Validation Status
What has been tested and confirmed so far:
- **Dataset info retrieval**: Fast YAML frontmatter parsing
- **Efficient querying**: DuckDB SQL with HTTP optimization and 429 retry logic
- **Smart caching**: 1000x+ speedup on repeated calls
- **Download tracking**: 99.9% data savings demonstrated on real datasets (0.04 GB downloaded from a 59 GB dataset for simple operations; see the quick check below)
- **Multiple formats**: pandas DataFrame and HuggingFace Dataset support
- **Error handling**: Proper exceptions and retry logic for production use
- **Memory efficiency**: Handles TB-scale datasets using only MBs to GBs of RAM and bandwidth
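For context, the savings figure above follows from simple arithmetic; the numbers below are the ones quoted in the list, not output produced by the library:
```python
# Illustrative check of the quoted savings figure (not part of the Hypersets API).
total_gb = 59.0       # size of the full dataset
downloaded_gb = 0.04  # data actually fetched for a simple query

savings_pct = (1 - downloaded_gb / total_gb) * 100
print(f"Savings: {savings_pct:.1f}%")  # Savings: 99.9%
```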
## 📦 Installation
```bash
pip install hypersets
```
## 🎯 Quick Start
```python
import hypersets as hs

# Get dataset info without downloading
info = hs.info("omarkamali/wikipedia-monthly")
print(f"Dataset size: {info.estimated_total_size_gb:.1f} GB")
print(f"Configs: {len(info.config_names)}")
print(f"Available configs: {info.config_names[:5]}")

# Query with SQL - only downloads what's needed
result = hs.query(
    "SELECT title, LENGTH(text) as text_length FROM dataset LIMIT 10",
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)

# Convert to pandas for analysis
df = result.to_pandas()
print(f"Retrieved {len(df)} articles")
```
## 🚀 Core API
### Dataset Information
```python
# Get comprehensive dataset metadata
info = hs.info("omarkamali/wikipedia-monthly")
print(f"Total files: {info.total_parquet_files}")
print(f"Size estimate: {info.estimated_total_size_gb:.1f} GB")

# List available configurations
configs = hs.list_configs("omarkamali/wikipedia-monthly")
print(f"Available configs: {configs[:10]}")  # First 10

# Clear cached metadata
hs.clear_cache()
```
### SQL Querying
```python
# Basic querying
result = hs.query(
    "SELECT title, url FROM dataset WHERE LENGTH(text) > 10000 LIMIT 100",
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)

# Aggregation queries
count = hs.count(
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)
print(f"Total articles: {count:,}")

# Advanced analytics
stats = hs.query(
    """
    SELECT
        COUNT(*) as total_articles,
        AVG(LENGTH(text)) as avg_length,
        MAX(LENGTH(text)) as max_length
    FROM dataset
    """,
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)
```
### Sampling & Exploration
```python
# Random sampling with DuckDB optimization
sample = hs.sample(
    n=1000,
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en",
    columns=["title", "url", "LENGTH(text) as text_length"]
)

# Quick data preview
preview = hs.head(
    n=5,
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en",
    columns=["title", "url"]
)

# Schema inspection
schema = hs.schema(
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)
print(f"Columns: {[col['name'] for col in schema.columns]}")
```
### Output Formats
```python
result = hs.query("SELECT * FROM dataset LIMIT 100", ...)
# As pandas DataFrame
df = result.to_pandas()
print(df.head())
# As HuggingFace Dataset
hf_dataset = result.to_hf_dataset()
print(hf_dataset.features)
# Query result metadata
print(f"Shape: {result.shape}")
print(f"Columns: {result.columns}")
```
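For reference, converting a pandas DataFrame into a HuggingFace Dataset is typically a one-liner with the `datasets` library. The sketch below shows that generic conversion; it is not a claim about how `to_hf_dataset()` is implemented internally.
```python
import pandas as pd
from datasets import Dataset

# Stand-in for a query result converted to pandas
df = pd.DataFrame({"title": ["A", "B"], "text_length": [120, 340]})

# Generic pandas -> HuggingFace Dataset conversion
hf_dataset = Dataset.from_pandas(df)
print(hf_dataset.features)
```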
### Download Tracking
```python
# Enable download tracking to see data savings
result = hs.query(
    "SELECT title FROM dataset LIMIT 1000",
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en",
    track_downloads=True
)

# Check savings
if result.download_stats:
    stats = result.download_stats
    print(f"Total dataset: {stats.total_dataset_size_gb:.1f} GB")
    print(f"Downloaded: {stats.estimated_downloaded_gb:.2f} GB")
    print(f"Savings: {stats.savings_percentage:.1f}%")
```
## 📁 Examples
Explore our comprehensive examples to see Hypersets in action:
### 🏃 Quick Demo
```bash
python examples/demo.py
```
**Complete feature demonstration** - Shows all Hypersets capabilities with real datasets.
### 📚 Basic Usage
```bash
python examples/basic_usage.py
```
**Learn the fundamentals** - Dataset info, querying, sampling, caching, and output formats.
### 🔬 Advanced Queries
```bash
python examples/advanced_queries.py
```
**Sophisticated analytics** - Text analysis, pattern matching, quality metrics, and performance optimization.
## 🏗️ Architecture
Hypersets consists of four core components:
1. **Dataset Info Retriever** - Discovers parquet files, configs, and schema from YAML frontmatter
2. **DuckDB Mount System** - Mounts remote parquet files as virtual tables with HTTP optimization
3. **Query Interface** - Clean API with SQL support, download tracking, and multiple output formats
4. **Smart Caching** - TTL-based caching of dataset metadata to avoid repeated API calls
All components include 429 rate limit handling with exponential backoff; the mounting and backoff patterns are sketched below.
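As a rough illustration of these ideas, here is a minimal sketch using placeholder URLs and a hypothetical `get_with_backoff` helper; it shows the general patterns, not Hypersets' actual internals:
```python
import time

import duckdb
import requests

# --- Mounting remote parquet files as a virtual table ----------------------
parquet_urls = [
    "https://example.com/data/part-00000.parquet",  # placeholder shard URLs
    "https://example.com/data/part-00001.parquet",
]

con = duckdb.connect()                       # in-memory database, no disk cache
con.execute("INSTALL httpfs; LOAD httpfs;")  # let DuckDB read over HTTP(S)

url_list = ", ".join(f"'{u}'" for u in parquet_urls)
con.execute(f"CREATE VIEW dataset AS SELECT * FROM read_parquet([{url_list}])")

# Only the byte ranges needed to answer the query are actually fetched.
print(con.execute("SELECT COUNT(*) FROM dataset").fetchone())

# --- Exponential backoff for HTTP 429 responses -----------------------------
def get_with_backoff(url, max_retries=5):
    """Retry a metadata request, sleeping 1s, 2s, 4s, ... on rate limits."""
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        time.sleep(2 ** attempt)
    raise RuntimeError(f"Still rate limited after {max_retries} attempts: {url}")
```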
## 🔧 Advanced Configuration
### Memory Management
```python
# Configure DuckDB memory limit (default: 4GB)
result = hs.query(
    "SELECT * FROM dataset LIMIT 1000",
    dataset="large/dataset",
    memory_limit="8GB"  # Increase for large datasets
)

# For extremely large datasets
result = hs.query(
    "SELECT * FROM dataset LIMIT 10000",
    dataset="massive/dataset",
    memory_limit="16GB",  # More memory
    threads=8  # More threads
)

# Memory-efficient column selection
result = hs.query(
    "SELECT id, title FROM dataset LIMIT 100000",  # Only select needed columns
    dataset="large/dataset",
    memory_limit="2GB"  # Can use less memory
)
```
**Memory Limit Guidelines** (the sketch after this list shows the underlying DuckDB settings):
- **Default (4GB)**: Good for most datasets up to ~50GB
- **8GB**: For large datasets (50-200GB) or complex queries
- **16GB+**: For massive datasets (200GB+) or heavy aggregations
- **Column selection**: Always select only needed columns for better memory efficiency
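Under the hood, `memory_limit` and `threads` presumably correspond to DuckDB's standard settings. The sketch below shows those settings in raw DuckDB; it is an assumption about the mapping, not a description of Hypersets' code:
```python
import duckdb

con = duckdb.connect()
con.execute("SET memory_limit = '8GB'")  # cap DuckDB's working memory
con.execute("SET threads = 8")           # parallelism for scans and aggregations

# Confirm the active settings
print(con.execute(
    "SELECT current_setting('memory_limit'), current_setting('threads')"
).fetchone())
```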
### Custom Caching
```python
# Cache with custom TTL (Time To Live)
info = hs.info("dataset", cache_ttl=3600)  # 1 hour

# Disable caching for fresh data
info = hs.info("dataset", use_cache=False)
```
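Conceptually, TTL caching just stores a timestamp alongside each cached value and refetches once an entry is older than the TTL. Here is a minimal illustrative sketch with a hypothetical helper, not Hypersets' implementation:
```python
import time

_cache = {}  # key -> (stored_at, value)

def cached(key, fetch, ttl=3600):
    """Return a cached value younger than `ttl` seconds, otherwise refetch it."""
    now = time.time()
    entry = _cache.get(key)
    if entry is not None and now - entry[0] < ttl:
        return entry[1]
    value = fetch()  # e.g. an HTTP call for dataset metadata
    _cache[key] = (now, value)
    return value
```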
### Authentication
```python
# Use HuggingFace token for private datasets
result = hs.query(
    "SELECT * FROM dataset LIMIT 10",
    dataset="private/dataset",
    token="hf_your_token_here"
)
```
### Performance Tuning
```python
# Optimize for your use case
result = hs.query(
    "SELECT * FROM dataset USING SAMPLE 10000",
    dataset="large/dataset",
    memory_limit="6GB",  # Adequate memory
    threads=4,  # Balanced parallelism
    track_downloads=True  # Monitor efficiency
)

# For aggregation-heavy workloads
stats = hs.query(
    """
    SELECT
        category,
        COUNT(*) as count,
        AVG(LENGTH(text)) as avg_length
    FROM dataset
    GROUP BY category
    """,
    dataset="large/dataset",
    memory_limit="12GB",  # More memory for grouping
    threads=8  # More threads for aggregation
)
```
## Contributing
1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make changes and add tests
4. Run tests: `pytest tests/`
5. Submit a pull request
## License
MIT License - see [LICENSE](LICENSE) file for details.
## Acknowledgments
- **DuckDB** for incredible SQL analytics on remote data
- **Parquet** for providing the de facto standard for columnar data storage
- **HuggingFace** for democratizing access to datasets
- **The open source community** for inspiration and feedback
## Contributors
[Omar Kamali](https://omarkama.li)
## 📝 Citation
If you use Hypersets in your research, please cite:
```bibtex
@misc{hypersets,
  title={Hypersets: Efficient dataset transfer, querying and transformation},
  author={Omar Kamali},
  year={2025},
  url={https://github.com/omarkamali/hypersets},
  note={Project developed under Omneity Labs}
}
```
---
**🚀 Ready to query terabytes of data efficiently?** Start with `examples/demo.py` to see Hypersets in action!
Raw data
{
"_id": null,
"home_page": null,
"name": "hypersets",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "datasets, duckdb, huggingface, machine-learning, parquet",
"author": null,
"author_email": "Omar Kamali <hypersets@omarkama.li>",
"download_url": "https://files.pythonhosted.org/packages/35/34/fced14d37b98dd7ae30175d3c9e7332645433c8d91fd7939c9751f6597fa/hypersets-0.0.2.tar.gz",
"platform": null,
"description": "# Hypersets\n\n**Efficient SQL interface for HuggingFace datasets using DuckDB.**\n\nHypersets is a library to work with massive datasets without downloading them entirely. Query terabytes of data using simple SQL while only downloading what you need.\n\n_Hypersets is currently in pre-alpha stage. Use at your own risk._\n\n## \u2728 Features\n\n- \ud83d\ude80 **Fast metadata retrieval** - Get dataset info without downloading\n- \ud83d\udcbe **Memory-only operation** - No disk caching unless requested \n- \ud83c\udfaf **Efficient querying** - SQL interface with DuckDB optimization\n- \ud83d\udcca **Download tracking** - See exactly how much data you're saving\n- \ud83e\udde0 **Smart caching** - Avoid repeated API calls\n- \ud83d\udd04 **Multiple formats** - Output as pandas DataFrame or HuggingFace Dataset\n- \u26a1 **Rate limit handling** - Built-in exponential backoff for 429 errors\n- \ud83d\udee1\ufe0f **Proper error handling** - Clear exceptions for common issues\n\n## \ud83d\udea6 Validation Status\n\nWhat has been tested and confirmed so far:\n- **Dataset info retrieval**: Fast YAML frontmatter parsing\n- **Efficient querying**: DuckDB SQL with HTTP optimization and 429 retry logic\n- **Smart caching**: 1000x+ speedup on repeated calls \n- **Download tracking**: 99.9% data savings demonstrated on real datasets (0.04GB on a 59GB dataset for simple operations)\n- **Multiple formats**: pandas DataFrame and HuggingFace Dataset support\n- **Error handling**: Proper exceptions and retry logic for production use\n- **Memory efficiency**: Handles TB-scale datasets in MBs or GBs of RAM and bandwidth\n\n## \ud83d\udce6 Installation\n\n```bash\npip install hypersets\n```\n\n## \ud83c\udfaf Quick Start\n\n```python\nimport hypersets as hs\n\n# Get dataset info without downloading\ninfo = hs.info(\"omarkamali/wikipedia-monthly\") \nprint(f\"Dataset size: {info.estimated_total_size_gb:.1f} GB\")\nprint(f\"Configs: {len(info.config_names)}\")\nprint(f\"Available configs: {info.config_names[:5]}\")\n\n# Query with SQL - only downloads what's needed\nresult = hs.query(\n \"SELECT title, LENGTH(text) as text_length FROM dataset LIMIT 10\",\n dataset=\"omarkamali/wikipedia-monthly\",\n config=\"latest.en\"\n)\n\n# Convert to pandas for analysis\ndf = result.to_pandas()\nprint(f\"Retrieved {len(df)} articles\")\n```\n\n## \ud83d\ude80 Core API\n\n### Dataset Information\n```python\n# Get comprehensive dataset metadata\ninfo = hs.info(\"omarkamali/wikipedia-monthly\")\nprint(f\"Total files: {info.total_parquet_files}\")\nprint(f\"Size estimate: {info.estimated_total_size_gb:.1f} GB\")\n\n# List available configurations\nconfigs = hs.list_configs(\"omarkamali/wikipedia-monthly\")\nprint(f\"Available configs: {configs[:10]}\") # First 10\n\n# Clear cached metadata\nhs.clear_cache()\n```\n\n### SQL Querying\n```python\n# Basic querying\nresult = hs.query(\n \"SELECT title, url FROM dataset WHERE LENGTH(text) > 10000 LIMIT 100\",\n dataset=\"omarkamali/wikipedia-monthly\",\n config=\"latest.en\"\n)\n\n# Aggregation queries\ncount = hs.count(\n dataset=\"omarkamali/wikipedia-monthly\", \n config=\"latest.en\"\n)\nprint(f\"Total articles: {count:,}\")\n\n# Advanced analytics\nstats = hs.query(\n \"\"\"\n SELECT \n COUNT(*) as total_articles,\n AVG(LENGTH(text)) as avg_length,\n MAX(LENGTH(text)) as max_length\n FROM dataset\n \"\"\",\n dataset=\"omarkamali/wikipedia-monthly\",\n config=\"latest.en\"\n)\n```\n\n### Sampling & Exploration\n```python\n# Random sampling with DuckDB 
optimization\nsample = hs.sample(\n n=1000,\n dataset=\"omarkamali/wikipedia-monthly\",\n config=\"latest.en\",\n columns=[\"title\", \"url\", \"LENGTH(text) as text_length\"]\n)\n\n# Quick data preview\npreview = hs.head(\n n=5,\n dataset=\"omarkamali/wikipedia-monthly\", \n config=\"latest.en\",\n columns=[\"title\", \"url\"]\n)\n\n# Schema inspection\nschema = hs.schema(\n dataset=\"omarkamali/wikipedia-monthly\",\n config=\"latest.en\"\n)\nprint(f\"Columns: {[col['name'] for col in schema.columns]}\")\n```\n\n### Output Formats\n```python\nresult = hs.query(\"SELECT * FROM dataset LIMIT 100\", ...)\n\n# As pandas DataFrame\ndf = result.to_pandas()\nprint(df.head())\n\n# As HuggingFace Dataset\nhf_dataset = result.to_hf_dataset()\nprint(hf_dataset.features)\n\n# Query result metadata\nprint(f\"Shape: {result.shape}\")\nprint(f\"Columns: {result.columns}\")\n```\n\n### Download Tracking\n```python\n# Enable download tracking to see data savings\nresult = hs.query(\n \"SELECT title FROM dataset LIMIT 1000\",\n dataset=\"omarkamali/wikipedia-monthly\",\n config=\"latest.en\",\n track_downloads=True\n)\n\n# Check savings\nif result.download_stats:\n stats = result.download_stats\n print(f\"Total dataset: {stats.total_dataset_size_gb:.1f} GB\")\n print(f\"Downloaded: {stats.estimated_downloaded_gb:.2f} GB\")\n print(f\"Savings: {stats.savings_percentage:.1f}%\")\n```\n\n## \ud83d\udcc1 Examples\n\nExplore our comprehensive examples to see Hypersets in action:\n\n### \ud83c\udfc3 Quick Demo\n```bash\npython examples/demo.py\n```\n**Complete feature demonstration** - Shows all Hypersets capabilities with real datasets.\n\n### \ud83d\udcda Basic Usage\n```bash\npython examples/basic_usage.py \n```\n**Learn the fundamentals** - Dataset info, querying, sampling, caching, and output formats.\n\n### \ud83d\udd2c Advanced Queries\n```bash \npython examples/advanced_queries.py\n```\n**Sophisticated analytics** - Text analysis, pattern matching, quality metrics, and performance optimization.\n\n## \ud83c\udfd7\ufe0f Architecture\n\nHypersets consists of four core components:\n\n1. **Dataset Info Retriever** - Discovers parquet files, configs, and schema from YAML frontmatter\n2. **DuckDB Mount System** - Mounts remote parquet files as virtual tables with HTTP optimization\n3. **Query Interface** - Clean API with SQL support, download tracking, and multiple output formats\n4. 
**Smart Caching** - TTL-based caching of dataset metadata to avoid repeated API calls\n\nAll components include proper 429 rate limit handling with exponential backoff.\n\n## \ud83d\udd27 Advanced Configuration\n\n### Memory Management\n```python\n# Configure DuckDB memory limit (default: 4GB)\nresult = hs.query(\n \"SELECT * FROM dataset LIMIT 1000\",\n dataset=\"large/dataset\",\n memory_limit=\"8GB\" # Increase for large datasets\n)\n\n# For extremely large datasets\nresult = hs.query(\n \"SELECT * FROM dataset LIMIT 10000\", \n dataset=\"massive/dataset\",\n memory_limit=\"16GB\", # More memory\n threads=8 # More threads\n)\n\n# Memory-efficient column selection\nresult = hs.query(\n \"SELECT id, title FROM dataset LIMIT 100000\", # Only select needed columns\n dataset=\"large/dataset\",\n memory_limit=\"2GB\" # Can use less memory\n)\n```\n\n**Memory Limit Guidelines:**\n- **Default (4GB)**: Good for most datasets up to ~50GB\n- **8GB**: For large datasets (50-200GB) or complex queries \n- **16GB+**: For massive datasets (200GB+) or heavy aggregations\n- **Column selection**: Always select only needed columns for better memory efficiency\n\n### Custom Caching\n```python\n# Cache with custom TTL (Time To Live)\ninfo = hs.info(\"dataset\", cache_ttl=3600) # 1 hour\n\n# Disable caching for fresh data\ninfo = hs.info(\"dataset\", use_cache=False)\n```\n\n### Authentication\n```python\n# Use HuggingFace token for private datasets\nresult = hs.query(\n \"SELECT * FROM dataset LIMIT 10\",\n dataset=\"private/dataset\",\n token=\"hf_your_token_here\"\n)\n```\n\n### Performance Tuning\n```python\n# Optimize for your use case\nresult = hs.query(\n \"SELECT * FROM dataset USING SAMPLE 10000\",\n dataset=\"large/dataset\",\n memory_limit=\"6GB\", # Adequate memory\n threads=4, # Balanced parallelism \n track_downloads=True # Monitor efficiency\n)\n\n# For aggregation-heavy workloads\nstats = hs.query(\n \"\"\"\n SELECT \n category,\n COUNT(*) as count,\n AVG(LENGTH(text)) as avg_length\n FROM dataset \n GROUP BY category\n \"\"\",\n dataset=\"large/dataset\",\n memory_limit=\"12GB\", # More memory for grouping\n threads=8 # More threads for aggregation\n)\n```\n\n## Contributing\n\n1. Fork the repository\n2. Create a feature branch: `git checkout -b feature-name`\n3. Make changes and add tests\n4. Run tests: `pytest tests/`\n5. Submit a pull request\n\n## License\n\nMIT License - see [LICENSE](LICENSE) file for details.\n\n## Acknowledgments\n\n- **DuckDB** for incredible SQL analytics on remote data\n- **Parquet** for the de facto standard for columnar data storage\n- **HuggingFace** for democratizing access to datasets\n- **The open source community** for inspiration and feedback\n\n## Contributors\n\n[Omar Kamali](https://omarkama.li)\n\n## \ud83d\udcdd Citation\n\nIf you use Hypersets in your research, please cite:\n\n```bibtex\n@misc{hypersets,\n title={Hypersets: Efficient dataset transfer, querying and transformation},\n author={Omar Kamali},\n year={2025},\n url={https://github.com/omarkamali/hypersets}\n note={Project developed under Omneity Labs}\n}\n```\n\n\n---\n\n**\ud83d\ude80 Ready to query terabytes of data efficiently?** Start with `examples/demo.py` to see Hypersets in action! ",
"bugtrack_url": null,
"license": "MIT",
"summary": "Fast, efficient alternative to Hugging Face load_dataset using DuckDB for querying, sampling and transforming remote datasets",
"version": "0.0.2",
"project_urls": {
"Documentation": "https://github.com/omarkamali/hypersets#readme",
"Homepage": "https://github.com/omarkamali/hypersets",
"Issues": "https://github.com/omarkamali/hypersets/issues",
"Repository": "https://github.com/omarkamali/hypersets"
},
"split_keywords": [
"datasets",
" duckdb",
" huggingface",
" machine-learning",
" parquet"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "4d187450baae94d3e24bcad56a3f629d5fe3d24fbd3017a6172395de0cc95072",
"md5": "c9ee6f4f38a0e53ad561ce9140c9b93b",
"sha256": "fdc05a6c7805a40561886acbb6e6c471b4c99b24447c57ba636dc1f79e28f97a"
},
"downloads": -1,
"filename": "hypersets-0.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "c9ee6f4f38a0e53ad561ce9140c9b93b",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 31824,
"upload_time": "2025-07-20T22:53:44",
"upload_time_iso_8601": "2025-07-20T22:53:44.138355Z",
"url": "https://files.pythonhosted.org/packages/4d/18/7450baae94d3e24bcad56a3f629d5fe3d24fbd3017a6172395de0cc95072/hypersets-0.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "3534fced14d37b98dd7ae30175d3c9e7332645433c8d91fd7939c9751f6597fa",
"md5": "ef88ea85e394a4c8c8583b20de59cffa",
"sha256": "5de7da766477af0af48dd47b6a522c5dba899aef6b60444b6b145712ddac7b15"
},
"downloads": -1,
"filename": "hypersets-0.0.2.tar.gz",
"has_sig": false,
"md5_digest": "ef88ea85e394a4c8c8583b20de59cffa",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 37857,
"upload_time": "2025-07-20T22:53:45",
"upload_time_iso_8601": "2025-07-20T22:53:45.256007Z",
"url": "https://files.pythonhosted.org/packages/35/34/fced14d37b98dd7ae30175d3c9e7332645433c8d91fd7939c9751f6597fa/hypersets-0.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-20 22:53:45",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "omarkamali",
"github_project": "hypersets#readme",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "hypersets"
}