shardate

Name: shardate
Version: 2025.9.9.2
Summary: A lightweight Python library for efficiently reading year-month-day partitioned Parquet datasets.
Upload time: 2025-09-09 04:09:27
Requires Python: >=3.12
License: MIT
Keywords: data, parquet, partitioning, pyspark
# shardate

A lightweight Python library for efficiently reading year-month-day partitioned Parquet datasets with PySpark.

## Installation

```bash
pip install shardate
```

## Overview

`shardate` provides a clean, dataclass-based API for working with Parquet datasets that are partitioned in a `year/month/day` directory structure (e.g., `y=2025/m=01/d=15/`). It's built on PySpark and designed for efficient date-based filtering and data retrieval.
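The format strings use `strftime`-style tokens, so a given date presumably maps to its partition directory in the obvious way (a minimal illustration, independent of the library itself):

```python
from datetime import date

# The default partition format is a standard strftime pattern.
partition_format = "y=%Y/m=%m/d=%d"

# date(2025, 1, 15) -> "y=2025/m=01/d=15"
print(date(2025, 1, 15).strftime(partition_format))
```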

## Features

- **Date-based reading**: Read data for specific dates, date ranges, or collections of dates
- **End-of-month support**: Dedicated functionality for reading end-of-month data
- **Flexible partitioning**: Customizable partition format (defaults to `y=%Y/m=%m/d=%d`)
- **PySpark integration**: Seamlessly works with existing PySpark workflows
- **Type hints**: Full type annotation support for better development experience
- **Well-tested**: Comprehensive test suite ensuring reliability

## Quick Start

### Basic Usage

```python
from datetime import date
from shardate import Shardate

# Create a Shardate instance for your data path
reader = Shardate("/path/to/your/partitioned/data")

# Read data for a specific date
df = reader.read_by_date(date(2025, 1, 15))

# Read data between two dates (inclusive)
df = reader.read_between(date(2025, 1, 1), date(2025, 1, 31))

# Read data for specific dates
target_dates = [date(2025, 1, 1), date(2025, 1, 15), date(2025, 1, 31)]
df = reader.read_by_dates(target_dates)

# Read only end-of-month data within a date range
df = reader.read_eoms_between(date(2025, 1, 1), date(2025, 3, 31))
```

### Custom Partition Format

```python
# If your data uses a different partition format
reader = Shardate("/path/to/data", partition_format="year=%Y/month=%m/day=%d")
df = reader.read_by_date(date(2025, 1, 15))
```

### Working with PySpark

```python
from pyspark.sql import SparkSession
from shardate import Shardate

# Ensure you have an active SparkSession
spark = SparkSession.builder.appName("SharDate Example").getOrCreate()

# Use Shardate as normal
reader = Shardate("/path/to/data")
df = reader.read_by_date(date(2025, 1, 15))

# The returned DataFrame is a standard PySpark DataFrame
df.show()
df.filter(df.column_name == "some_value").count()
```

## API Reference

### Shardate Class

```python
@dataclass
class Shardate:
    path: str
    partition_format: str = "y=%Y/m=%m/d=%d"
```
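A reasonable mental model (a hypothetical sketch, not shardate's actual implementation, which may differ) is that each read method formats one or more dates into partition paths under `path` and loads them with Spark's Parquet reader:

```python
from dataclasses import dataclass
from datetime import date

from pyspark.sql import DataFrame, SparkSession


# Hypothetical stand-in to illustrate the idea; not shardate's real code.
@dataclass
class ShardateSketch:
    path: str
    partition_format: str = "y=%Y/m=%m/d=%d"

    def read_by_date(self, target_date: date) -> DataFrame:
        # Format the date into its partition subdirectory and read it.
        spark = SparkSession.getActiveSession()
        subdir = target_date.strftime(self.partition_format)
        return spark.read.parquet(f"{self.path}/{subdir}")
```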

#### Methods

- `read_by_date(target_date: date) -> DataFrame`: Read data for a specific date
- `read_between(start_date: date, end_date: date) -> DataFrame`: Read data between two dates (inclusive)  
- `read_by_dates(target_dates: Iterable[date]) -> DataFrame`: Read data for specific dates
- `read_eoms_between(start_date: date, end_date: date) -> DataFrame`: Read end-of-month data within a date range
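`read_eoms_between` presumably resolves the month-end dates that fall inside the range. Purely as an illustration of that logic, here is one way to enumerate them with `python-dateutil` (a listed requirement of this package; the helper itself is hypothetical, not part of shardate):

```python
from datetime import date

from dateutil.relativedelta import relativedelta


def eoms_between(start: date, end: date) -> list[date]:
    """Hypothetical helper: every end-of-month date in [start, end]."""
    eoms = []
    # relativedelta(day=31) snaps a date to the last day of its month.
    current = start + relativedelta(day=31)
    while current <= end:
        eoms.append(current)
        current = current + relativedelta(months=1, day=31)
    return eoms


# eoms_between(date(2025, 1, 1), date(2025, 3, 31))
# -> [date(2025, 1, 31), date(2025, 2, 28), date(2025, 3, 31)]
```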

## Data Structure Requirements

Your Parquet data should be partitioned in directories following this structure:

```
data/
├── y=2025/
│   ├── m=01/
│   │   ├── d=01/
│   │   │   └── part-*.parquet
│   │   ├── d=02/
│   │   │   └── part-*.parquet
│   │   └── d=31/
│   │       └── part-*.parquet
│   └── m=02/
│       └── ...
└── y=2024/
    └── ...
```
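If you need to produce this layout from PySpark in the first place, one common approach is to derive zero-padded `y`/`m`/`d` string columns and partition on them. This is a sketch assuming your DataFrame has a date column named `dt` (a placeholder, not something shardate requires):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format

spark = SparkSession.builder.appName("write-partitioned").getOrCreate()
df = spark.read.parquet("/path/to/unpartitioned/data")  # assumed input with a "dt" date column

# "MM" and "dd" patterns yield zero-padded values, matching m=01/d=15.
(
    df.withColumn("y", date_format(col("dt"), "yyyy"))
    .withColumn("m", date_format(col("dt"), "MM"))
    .withColumn("d", date_format(col("dt"), "dd"))
    .write.mode("overwrite")
    .partitionBy("y", "m", "d")
    .parquet("/path/to/your/partitioned/data")
)
```

Spark's `partitionBy` writes each partition column as a `name=value` directory, which yields exactly the `y=2025/m=01/d=15/` structure shown above.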

## Development

### Setup

```bash
# Clone the repository
git clone https://github.com/yoichiojima-2/shardate.git
cd shardate

# Install development dependencies (using uv)
uv sync --dev
```

### Testing

```bash
# Run all tests
make test

# Or use uv directly
uv run pytest -vvv
```

### Code Quality

```bash
# Run linting and formatting
make lint

# Or use uv directly
uv run ruff check --fix .
uv run ruff format .
```

### Building

```bash
# Build the package
make build

# Clean build artifacts
make clean
```

## Requirements

- **Python**: 3.12+
- **PySpark**: 4.0+
- **python-dateutil**: 2.9.0.post0+

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes and add tests
4. Run the test suite (`make test`)
5. Run code quality checks (`make lint`)
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Links

- **Homepage**: https://github.com/yoichiojima-2/shardate
- **Issues**: https://github.com/yoichiojima-2/shardate/issues
- **PyPI**: https://pypi.org/project/shardate/
            
