# Wikisets
[PyPI](https://badge.fury.io/py/wikisets) · [Python](https://www.python.org/downloads/) · [License: MIT](https://opensource.org/licenses/MIT)
Flexible Wikipedia dataset builder with sampling and pretraining support. Built on top of [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly), providing fresh, clean Wikipedia dumps updated monthly.
## Features
- 🌍 **Multi-language support** - Access Wikipedia in any language
- 📊 **Flexible sampling** - Use exact sizes, percentages, or prebuilt samples (1k/5k/10k)
- ⚡ **Memory efficient** - Reservoir sampling for large datasets (sketched below)
- 🔄 **Reproducible** - Deterministic sampling with seeds
- 📦 **HuggingFace compatible** - Subclasses `datasets.Dataset`
- ✂️ **Pretraining ready** - Built-in text chunking with tokenizer support
- 📝 **Auto-generated cards** - Comprehensive dataset documentation
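
The memory-efficiency bullet above refers to reservoir sampling. As a rough illustration of the technique (not Wikisets' internal code), the classic Algorithm R keeps a uniform sample of `k` items from an arbitrarily long stream using memory proportional to `k` only:

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Illustrative Algorithm R: uniform k-item sample of a stream
    of unknown length, using O(k) memory."""
    rng = random.Random(seed)  # fixed seed => reproducible sample
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randint(0, i)    # pick an index in [0, i]
            if j < k:
                reservoir[j] = item  # replace with decreasing probability
    return reservoir

# e.g. sample 5 "articles" from a million-item stream without materializing it
sample = reservoir_sample((f"article-{i}" for i in range(1_000_000)), k=5)
```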
## Installation
```bash
pip install wikisets
```
Or with uv:
```bash
# Preferred: Add to your project
uv add wikisets
# Or just install
uv pip install wikisets
```
## Quick Start
```python
from wikisets import Wikiset, WikisetConfig
# Create a multi-language dataset
config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 10000},  # 10k sample
        {"lang": "fr", "size": "50%"},  # 50% of French Wikipedia
        {"lang": "ar", "size": 0.1},    # 10% of Arabic Wikipedia
    ],
    seed=42
)
dataset = Wikiset.create(config)
# Access like any HuggingFace dataset
print(len(dataset))
print(dataset[0])
# View dataset card
print(dataset.get_card())
```
## Configuration Options
### WikisetConfig Parameters
- **languages** (required): List of `{lang: str, size: int|float|str}` dictionaries
  - `lang`: Language code (e.g., "en", "fr", "ar", "simple")
  - `size`: Can be:
    - Integer (e.g., `1000`, `5000`, `10000`) - Uses prebuilt samples when available
    - Percentage string (e.g., `"50%"`) - Samples that percentage
    - Float 0-1 (e.g., `0.5`) - Samples that fraction
- **date** (optional, default: `"latest"`): Wikipedia dump date in yyyymmdd format
- **use_train_split** (optional, default: `False`): Force sampling from the full "train" split, ignoring prebuilt samples
- **shuffle** (optional, default: `False`): Proportionally interleave languages
- **seed** (optional, default: `42`): Random seed for reproducibility
- **num_proc** (optional): Number of parallel processes
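
Putting these together, here is an illustrative configuration exercising most of the options above (the concrete values, including the `"20250601"` dump date, are placeholders, not recommendations):

```python
from wikisets import Wikiset, WikisetConfig

config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 10000},    # integer: prebuilt 10k sample when available
        {"lang": "fr", "size": "25%"},    # percentage string
        {"lang": "simple", "size": 0.5},  # float fraction (0-1)
    ],
    date="20250601",        # a specific dump in yyyymmdd format, or "latest"
    use_train_split=False,  # True would force sampling from the full "train" split
    shuffle=True,           # proportionally interleave the three languages
    seed=123,               # deterministic sampling
    num_proc=4,             # parallel processing
)
dataset = Wikiset.create(config)
```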
## Usage Examples
### Basic Usage
```python
from wikisets import Wikiset, WikisetConfig
config = WikisetConfig(
    languages=[{"lang": "en", "size": 5000}]
)
dataset = Wikiset.create(config)
# Wikiset is just an HF Dataset
dataset.push_to_hub("my-wiki-dataset")
```
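
Because the result is a plain `datasets.Dataset`, the pushed repo can be loaded back with the standard `datasets` API. This sketch assumes the default `train` split name and reuses the placeholder repo id from the example above:

```python
from datasets import load_dataset

ds = load_dataset("my-wiki-dataset", split="train")  # placeholder repo id
print(len(ds), ds.column_names)
```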
### Pretraining with Chunking
```python
# Create base dataset
config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 10000},
        {"lang": "ar", "size": 5000},
    ]
)
dataset = Wikiset.create(config)
# Convert to pretraining format with 2048 token chunks
pretrain_dataset = dataset.to_pretrain(
    split_token_len=2048,
    tokenizer="gpt2",
    nearest_delimiter="newline",
    num_proc=4
)
# Do whatever you want with it (note: map expects the function to return a dict)
pretrain_dataset = pretrain_dataset.map(lambda x: {"text": x["text"].upper()})
# It's still just a HuggingFace Dataset
pretrain_dataset.push_to_hub("my-wiki-pretraining-dataset")
```
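
A quick way to sanity-check the chunking is to re-tokenize a few rows with the same tokenizer. This sketch assumes the chunked dataset exposes a `text` column (as the `map` example above suggests); lengths may land somewhat under the `split_token_len` budget depending on where the nearest newline falls:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

for row in pretrain_dataset.select(range(3)):      # first three chunks
    n_tokens = len(tok(row["text"])["input_ids"])  # re-tokenize with gpt2
    print(n_tokens)
```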
## Documentation
- **[Quick Start Guide](docs/quickstart.md)** - Get started in 5 minutes
- **[API Reference](docs/api.md)** - Complete API documentation
- **[Examples](docs/examples.md)** - Common usage patterns
- **[Technical Specification](SPEC.md)** - Design and implementation details
## Builds on wikipedia-monthly
Wikisets is built on top of [omarkamali/wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly), which provides:
- Fresh Wikipedia dumps updated monthly
- Clean, preprocessed text
- 300+ languages
- Prebuilt 1k/5k/10k samples for large languages
Wikisets adds:
- Simple configuration-based building
- Intelligent sampling strategies
- Multi-language mixing
- Pretraining utilities
- Comprehensive dataset cards
## Citation
```bibtex
@software{wikisets2025,
  author = {Omar Kamali},
  title = {Wikisets: Flexible Wikipedia Dataset Builder},
  year = {2025},
  url = {https://github.com/omarkamali/wikisets}
}
```
## License
MIT License - see [LICENSE](LICENSE) for details.
## Links
- **GitHub**: https://github.com/omarkamali/wikisets
- **PyPI**: https://pypi.org/project/wikisets/
- **Documentation**: https://github.com/omarkamali/wikisets/tree/main/docs
- **Wikipedia Monthly**: https://huggingface.co/datasets/omarkamali/wikipedia-monthly