wikisets

Name: wikisets
Version: 0.1.0
Summary: Flexible Wikipedia dataset builder with sampling and pretraining support
Upload time: 2025-10-27 22:41:33
Requires Python: >=3.9
License: MIT
Keywords: dataset, machine-learning, nlp, pretraining, wikipedia

# Wikisets

[![PyPI version](https://badge.fury.io/py/wikisets.svg)](https://badge.fury.io/py/wikisets)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Flexible Wikipedia dataset builder with sampling and pretraining support. Built on top of [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly), which provides fresh, clean Wikipedia dumps updated monthly.

## Features

- 🌍 **Multi-language support** - Access Wikipedia in any language
- 📊 **Flexible sampling** - Use exact sizes, percentages, or prebuilt samples (1k/5k/10k)
- ⚡ **Memory efficient** - Reservoir sampling for large datasets (see the sketch after this list)
- 🔄 **Reproducible** - Deterministic sampling with seeds
- 📦 **HuggingFace compatible** - Subclasses `datasets.Dataset`
- ✂️ **Pretraining ready** - Built-in text chunking with tokenizer support
- 📝 **Auto-generated cards** - Comprehensive dataset documentation
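
The memory-efficient sampling above refers to reservoir sampling, which keeps a fixed-size uniform sample while streaming over a split once. A minimal sketch of the general technique, for illustration only (this is not Wikisets' internal code):

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Algorithm R: keep a uniform sample of k items from a stream of
    unknown length, using O(k) memory and a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Replace an existing element with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), 5))
```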

## Installation

```bash
pip install wikisets
```

Or with uv:
```bash
# Preferred: Add to your project
uv add wikisets

# Or just install
uv pip install wikisets
```

## Quick Start

```python
from wikisets import Wikiset, WikisetConfig

# Create a multi-language dataset
config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 10000},      # 10k sample
        {"lang": "fr", "size": "50%"},      # 50% of French Wikipedia
        {"lang": "ar", "size": 0.1},        # 10% of Arabic Wikipedia
    ],
    seed=42
)

dataset = Wikiset.create(config)

# Access like any HuggingFace dataset
print(len(dataset))
print(dataset[0])

# View dataset card
print(dataset.get_card())
```

## Configuration Options

### WikisetConfig Parameters

- **languages** (required): List of `{lang: str, size: int|float|str}` dictionaries
  - `lang`: Language code (e.g., "en", "fr", "ar", "simple")
  - `size`: Can be:
    - Integer (e.g., `1000`, `5000`, `10000`) - Uses prebuilt samples when available
    - Percentage string (e.g., `"50%"`) - Samples that percentage
    - Float 0-1 (e.g., `0.5`) - Samples that fraction
- **date** (optional, default: `"latest"`): Wikipedia dump date in yyyymmdd format
- **use_train_split** (optional, default: `False`): Force sampling from full "train" split, ignoring prebuilt samples
- **shuffle** (optional, default: `False`): Proportionally interleave languages
- **seed** (optional, default: `42`): Random seed for reproducibility
- **num_proc** (optional): Number of parallel processes
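
Putting these options together, a sketch combining the remaining parameters (the dump date below is a placeholder; check wikipedia-monthly for the dates actually published):

```python
from wikisets import Wikiset, WikisetConfig

config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 10000},       # prebuilt 10k sample when available
        {"lang": "simple", "size": "25%"},   # 25% of Simple English Wikipedia
    ],
    date="20250601",        # placeholder yyyymmdd dump date instead of "latest"
    use_train_split=True,   # always sample from the full train split
    shuffle=True,           # proportionally interleave the two languages
    seed=7,
    num_proc=4,
)
dataset = Wikiset.create(config)
```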

## Usage Examples

### Basic Usage

```python
from wikisets import Wikiset, WikisetConfig

config = WikisetConfig(
    languages=[{"lang": "en", "size": 5000}]
)
dataset = Wikiset.create(config)

# Wikiset is just an HF Dataset
dataset.push_to_hub("my-wiki-dataset")
```

### Pretraining with Chunking

```python
# Create base dataset
config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 10000},
        {"lang": "ar", "size": 5000},
    ]
)
dataset = Wikiset.create(config)

# Convert to pretraining format with 2048 token chunks
pretrain_dataset = dataset.to_pretrain(
    split_token_len=2048,
    tokenizer="gpt2",
    nearest_delimiter="newline",
    num_proc=4
)

# Do whatever you want with it
pretrain_dataset.map(lambda x: x["text"].upper())

# It's still just a HuggingFace Dataset
pretrain_dataset.push_to_hub("my-wiki-pretraining-dataset")
```
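
For intuition, this is roughly what chunking to a token budget at newline boundaries looks like. The sketch below only illustrates the idea and is not Wikisets' implementation; `chunk_at_newlines` is a hypothetical helper, and a single line longer than the budget is left oversized for simplicity:

```python
from transformers import AutoTokenizer

def chunk_at_newlines(text, tokenizer, max_tokens=2048):
    """Group lines into chunks whose token count stays under max_tokens,
    breaking at the newline nearest to the budget."""
    chunks, current, current_tokens = [], [], 0
    for line in text.split("\n"):
        n = len(tokenizer.encode(line, add_special_tokens=False))
        if current and current_tokens + n > max_tokens:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += n
    if current:
        chunks.append("\n".join(current))
    return chunks

tok = AutoTokenizer.from_pretrained("gpt2")
print(len(chunk_at_newlines("First paragraph.\n\nSecond paragraph.", tok)))
```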

## Documentation

- **[Quick Start Guide](docs/quickstart.md)** - Get started in 5 minutes
- **[API Reference](docs/api.md)** - Complete API documentation
- **[Examples](docs/examples.md)** - Common usage patterns
- **[Technical Specification](SPEC.md)** - Design and implementation details

## Builds on wikipedia-monthly

Wikisets is built on top of [omarkamali/wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly), which provides:

- Fresh Wikipedia dumps updated monthly
- Clean, preprocessed text
- 300+ languages
- Prebuilt 1k/5k/10k samples for large languages

Wikisets adds:
- Simple configuration-based building
- Intelligent sampling strategies
- Multi-language mixing (see the sketch below)
- Pretraining utilities
- Comprehensive dataset cards
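
The multi-language mixing behind `shuffle=True` is conceptually similar to proportional interleaving in the `datasets` library. A rough, self-contained illustration using `interleave_datasets` on toy data (not Wikisets' actual code path):

```python
from datasets import Dataset, interleave_datasets

# Toy stand-ins for two per-language samples.
en = Dataset.from_dict({"text": [f"en article {i}" for i in range(8)]})
fr = Dataset.from_dict({"text": [f"fr article {i}" for i in range(4)]})

# Sample from each dataset in proportion to its size, with a fixed seed.
total = len(en) + len(fr)
mixed = interleave_datasets(
    [en, fr],
    probabilities=[len(en) / total, len(fr) / total],
    seed=42,
)
print(mixed["text"][:6])
```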

## Citation

```bibtex
@software{wikisets2025,
  author = {Omar Kamali},
  title = {Wikisets: Flexible Wikipedia Dataset Builder},
  year = {2025},
  url = {https://github.com/omarkamali/wikisets}
}
```

## License

MIT License - see [LICENSE](LICENSE) for details.

## Links

- **GitHub**: https://github.com/omarkamali/wikisets
- **PyPI**: https://pypi.org/project/wikisets/
- **Documentation**: https://github.com/omarkamali/wikisets/tree/main/docs
- **Wikipedia Monthly**: https://huggingface.co/datasets/omarkamali/wikipedia-monthly

            
