| Field | Value |
| --- | --- |
| Name | ts-preprocessing |
| Version | 0.1.3 |
| Summary | A preprocessing library for foundation time series models |
| home_page | None |
| Repository | https://github.com/TimeBinFM/data |
| upload_time | 2025-09-07 20:21:12 |
| maintainer | None |
| docs_url | None |
| author | Your Name |
| requires_python | <4.0,>=3.11 |
| license | MIT |
| keywords | time-series, preprocessing, ml |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# Time Series Preprocessing Library
A comprehensive library for building streaming data pipelines for time series datasets, providing tools for downloading, transforming, and combining time series data in real time.
## Features
- **Streaming Dataset Transforms**: Build composable data pipelines that process time series data as it streams
- **Dataset Downloaders**: Download datasets from Hugging Face Hub with caching support
- **Synthetic Data Generation**: Generate synthetic time series data for testing and development
- **Flexible Data Pipeline Builder**: Chain multiple transforms together using a fluent builder pattern
- **PyTorch Integration**: Full compatibility with PyTorch's IterableDataset interface
- **Type Safety**: Comprehensive type annotations for better development experience
## Installation
Using poetry:
```bash
poetry install
```
Using pip:
```bash
pip install .
```
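The package is also published on PyPI as `ts-preprocessing` (see the metadata above), so installing straight from the index should work as well:
```bash
pip install ts-preprocessing
```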
## Core Components
### Dataset Transforms
The library provides a rich set of streaming dataset transforms that can be chained together:
#### Basic Transforms
- **`TransformingDataset`**: Apply arbitrary functions to each item in a dataset
- **`BatchingIterableDataset`**: Group items into batches with configurable batch sizes
- **`UnbatchingIterableDataset`**: Flatten batched datasets back to individual items
- **`SlidingWindowIterableDataset`**: Create sliding windows over time series data
#### Combining Transforms
- **`ConcatDataset`**: Concatenate multiple datasets sequentially
- **`CombiningDataset`**: Combine multiple datasets using custom operations (e.g., element-wise addition)
- **`ProbabilisticMixingDataset`**: Mix datasets with configurable probabilities and seeding
#### Pipeline Builder
- **`Builder`**: Fluent interface for building complex data pipelines
### Dataset Downloaders
- **`HuggingFaceDownloader`**: Download datasets from Hugging Face Hub with automatic caching
- **`GiftEvalWrapperDataset`**: Wrapper for Salesforce's GiftEval pretraining datasets
### Synthetic Data
- **`LinearTrendDataset`**: Generate synthetic time series with linear trends and configurable noise
### Serialization
- **`serialize_tensor_stream`**: Save tensor streams to disk in sharded format
- **`SerializedTensorDataset`**: Load serialized tensor streams with lazy loading support
## Usage Examples
### Basic Transform Pipeline
```python
from preprocessing.transform.dataset_builder import Builder
from preprocessing.transform.batching_dataset import BatchingIterableDataset
from preprocessing.transform.transforming_dataset import TransformingDataset

# Create a simple pipeline
pipeline = (
    Builder(your_dataset)
    .batch(batch_size=32)
    .map(lambda x: x * 2)  # Double all values
    .build()
)

# Iterate over the pipeline
for batch in pipeline:
    print(batch.shape)  # (32, sequence_length, features)
```
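Because the built pipeline is a PyTorch `IterableDataset` (see Features), it should also plug directly into a standard `DataLoader`. A minimal sketch, assuming the pipeline yields tensors; `batch_size=None` is used so the `DataLoader` does not re-batch the already-batched stream:
```python
from torch.utils.data import DataLoader

# Stream the pipeline through DataLoader; batch_size=None disables
# automatic batching, keeping the batches produced by .batch(batch_size=32) intact.
loader = DataLoader(pipeline, batch_size=None)

for batch in loader:
    print(batch.shape)  # same (32, sequence_length, features) batches as above
```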
### Combining Multiple Datasets
```python
from preprocessing.transform.combining_dataset import CombiningDataset

# Combine two datasets element-wise
def add_operation(x, y):
    return x + y

combined = CombiningDataset([dataset1, dataset2], op=add_operation)

for result in combined:
    print(result)  # x + y for each pair
```
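Any two-argument callable can serve as `op`. For example, assuming both datasets yield tensors of the same shape, an element-wise average is just another lambda:
```python
from preprocessing.transform.combining_dataset import CombiningDataset

# Element-wise average of two aligned datasets (op is any binary callable)
averaged = CombiningDataset([dataset1, dataset2], op=lambda x, y: (x + y) / 2)
```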
### Concatenating Datasets
```python
from preprocessing.transform.concat_dataset import ConcatDataset
# Concatenate datasets sequentially
concatenated = ConcatDataset([dataset1, dataset2, dataset3])

for item in concatenated:
    # Items from dataset1, then dataset2, then dataset3
    print(item)
```
### Probabilistic Mixing
```python
from preprocessing.transform.probabilistic_mixing_dataset import ProbabilisticMixingDataset
# Mix datasets with custom probabilities
mixed = ProbabilisticMixingDataset(
    datasets={"train": train_data, "val": val_data},
    probabilities={"train": 0.8, "val": 0.2},
    seed=42  # For reproducibility
)

for item in mixed:
    # 80% chance from train, 20% chance from val
    print(item)
```
### Sliding Windows
```python
from preprocessing.transform.sliding_window_dataset import SlidingWindowIterableDataset
# Create sliding windows over time series
windowed = SlidingWindowIterableDataset(
    dataset=your_dataset,
    window_size=100,
    step=50
)

for window in windowed:
    print(window.shape)  # (100, features)
```
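Assuming the usual sliding-window convention (a window starts every `step` items and only full windows are emitted), `window_size=100` with `step=50` produces windows that overlap by 50 time steps, and a series of length 1000 yields (1000 - 100) / 50 + 1 = 19 windows.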
### Downloading Datasets
```python
from preprocessing.downloader.huggingface import HuggingFaceDownloader
from preprocessing.config import DatasetConfig
# Configure dataset download
config = DatasetConfig(
    name="air-passengers",
    repo_id="duol/airpassengers",
    files=["AP.csv"],
    cache_dir="data/cache"
)
# Download dataset
downloader = HuggingFaceDownloader(config)
data = downloader.download()
```
### Synthetic Data Generation
```python
from preprocessing.synthetic.linear_trend import LinearTrendDataset
# Generate synthetic time series
synthetic_data = LinearTrendDataset(
    sequence_length=1000,
    num_sequences=100,
    trend_slope=0.01,
    noise_std=0.1,
    seed=42
)

for sequence in synthetic_data:
    print(sequence.shape)  # (1000, 1)
```
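The synthetic generator pairs naturally with the pipeline builder. A minimal end-to-end sketch, reusing the constructor arguments above and assuming the generated sequences are tensors that support `mean()`/`std()`:
```python
from preprocessing.synthetic.linear_trend import LinearTrendDataset
from preprocessing.transform.dataset_builder import Builder

# Synthetic source feeding a normalized, batched pipeline
source = LinearTrendDataset(
    sequence_length=1000,
    num_sequences=100,
    trend_slope=0.01,
    noise_std=0.1,
    seed=42
)

pipeline = (
    Builder(source)
    .map(lambda x: (x - x.mean()) / (x.std() + 1e-8))  # per-sequence normalization
    .batch(batch_size=16)
    .build()
)

for batch in pipeline:
    print(batch.shape)  # expected (16, 1000, 1) under the assumptions above
```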
### Serialization
```python
from preprocessing.serialization.serialize import serialize_tensor_stream
from preprocessing.serialization.deserialize import SerializedTensorDataset
# Save dataset to disk
serialize_tensor_stream(
    dataset=your_dataset,
    output_dir="data/serialized",
    max_tensors_per_file=1000
)

# Load dataset from disk
loaded_dataset = SerializedTensorDataset(
    filepaths=["data/serialized/shard_00000.pt"],
    lazy=True  # Load on-demand
)
```
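When a stream has been split across many shards, the `filepaths` list can be assembled with the standard library. A small sketch, assuming shards follow the `shard_*.pt` naming shown above:
```python
from glob import glob

from preprocessing.serialization.deserialize import SerializedTensorDataset

# Collect every shard written by serialize_tensor_stream (assumed naming pattern)
shard_paths = sorted(glob("data/serialized/shard_*.pt"))
loaded_dataset = SerializedTensorDataset(filepaths=shard_paths, lazy=True)
```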
## Project Structure
```
preprocessing/
├── common/           # Common types and utilities
│   └── tensor_dataset.py
├── config/           # Configuration management
│   ├── __init__.py
│   └── examples/
├── downloader/       # Dataset downloaders
│   ├── huggingface.py
│   └── gift_eval.py
├── serialization/    # Data serialization
│   ├── serialize.py
│   └── deserialize.py
├── synthetic/        # Synthetic data generation
│   └── linear_trend.py
└── transform/        # Dataset transforms
    ├── batching_dataset.py
    ├── combining_dataset.py
    ├── concat_dataset.py
    ├── dataset_builder.py
    ├── probabilistic_mixing_dataset.py
    ├── sliding_window_dataset.py
    ├── transforming_dataset.py
    └── unbatching_dataset.py
```
## Development
### Setup Development Environment
```bash
poetry install --with dev
```
### Run Tests
```bash
poetry run pytest
```
### Run Tests with Coverage
```bash
poetry run pytest --cov=preprocessing
```
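pytest-cov can also write a browsable HTML report via its standard `--cov-report` option:
```bash
poetry run pytest --cov=preprocessing --cov-report=html
```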
## Key Design Principles
1. **Streaming First**: All transforms work with streaming data, enabling processing of large datasets that don't fit in memory
2. **Composability**: Transforms can be easily chained together using the Builder pattern
3. **Type Safety**: Comprehensive type annotations for better development experience
4. **PyTorch Integration**: Full compatibility with PyTorch's data loading ecosystem
5. **Reproducibility**: Built-in support for random seeds and deterministic operations
6. **Flexibility**: Support for custom operations and transformations
## License
MIT License - see LICENSE file for details