statcast-pitches


Name: statcast-pitches
Version: 1.0.0
Summary: A package for loading MLB Statcast pitch data quickly using HF Dataset
Upload time: 2025-03-19 21:33:11
Requires Python: >=3.8
Keywords: baseball, data, mlb, statcast
Requirements: No requirements were recorded.

# statcast-pitches

[![Latest Update](https://github.com/Jensen-holm/statcast-era-pitches/actions/workflows/update_statcast_data.yml/badge.svg)](https://github.com/Jensen-holm/statcast-era-pitches/actions/workflows/update_statcast_data.yml)

[pybaseball](https://github.com/jldbc/pybaseball) is a great tool for downloading baseball data. Even though the library is optimized and scrapes this data in parallel, it can be time-consuming.

This repository uses GitHub Actions to scrape new baseball data weekly during the MLB season and update a parquet file hosted as a Hugging Face dataset. Reading that file is much faster than scraping the data again each time you rerun your code or simply want up-to-date Statcast pitch data.

The `update.py` script runs each week during the MLB season and refreshes the [statcast-era-pitches HuggingFace Dataset](https://huggingface.co/datasets/Jensen-holm/statcast-era-pitches) so that you don't have to re-scrape this data yourself.

You can explore the entire dataset in your browser [at this link](https://huggingface.co/datasets/Jensen-holm/statcast-era-pitches/viewer/default/train).

# Installation

```bash
pip install statcast-pitches
```

# Usage

### With statcast_pitches package

**Example 1 w/ polars (suggested)**
```python
import statcast_pitches
import polars as pl

# load all pitches from 2015-present
pitches_lf = statcast_pitches.load()

# filter to get 2024 bat speed data
bat_speed_24_df = (pitches_lf
                    .filter(pl.col("game_date").dt.year() == 2024)
                    .select("bat_speed", "swing_length")
                    .collect())

print(bat_speed_24_df.head(3))
```

output: 
| | bat_speed  | swing_length |
|-|------------|--------------|
| 0 | 73.61710 | 6.92448 |
| 1 | 58.63812 | 7.56904 |
| 2 | 71.71226 | 6.46088 |

**Notes**
- Because `statcast_pitches.load()` returns a LazyFrame, the data loads much faster, and operations can be applied to it before it is 'collected' into memory. Loaded eagerly as a DataFrame, this code would take ~30-60 seconds; as a LazyFrame it runs in 2-8 seconds.

**Example 2 w/ DuckDB**
```python
import statcast_pitches

# get bat tracking data from 2024
# (the year is bound to the `?` placeholder as an integer)
params = (2024,)
query_2024_bat_speed = """
    SELECT bat_speed, swing_length
    FROM pitches
    WHERE
        YEAR(game_date) = ?
        AND bat_speed IS NOT NULL;
    """

bat_speed_24_df = statcast_pitches.load(
    query=query_2024_bat_speed,
    params=params,
).collect()

print(bat_speed_24_df.head(3))
```

output: 
| | bat_speed  | swing_length |
|-|------------|--------------|
| 0 | 73.61710 | 6.92448 |
| 1 | 58.63812 | 7.56904 |
| 2 | 71.71226 | 6.46088 |

**Notes**
- If no query is specified, all data from 2015-present is returned as a LazyFrame.
- The table in your query MUST be called 'pitches', or the query will fail.
- Since `load()` returns a LazyFrame, notice that I had to call `pl.LazyFrame.collect()` before calling `head()`.
- This is slower than the polars approach in Example 1, but sometimes using SQL is fun.

### With HuggingFace API (not recommended)

***Pandas***

```python
import pandas as pd

df = pd.read_parquet("hf://datasets/Jensen-holm/statcast-era-pitches/data/statcast_era_pitches.parquet")
```

***Polars***

```python
import polars as pl

df = pl.read_parquet('hf://datasets/Jensen-holm/statcast-era-pitches/data/statcast_era_pitches.parquet')
```

***Duckdb***

```sql
SELECT *
FROM 'hf://datasets/Jensen-holm/statcast-era-pitches/data/statcast_era_pitches.parquet';
```

***HuggingFace Dataset***

```python
from datasets import load_dataset

ds = load_dataset("Jensen-holm/statcast-era-pitches")
```

***Tidyverse***
```r
library(tidyverse)
library(arrow)  # provides read_parquet(); it is not part of the tidyverse

statcast_pitches <- read_parquet(
    "https://huggingface.co/datasets/Jensen-holm/statcast-era-pitches/resolve/main/data/statcast_era_pitches.parquet"
)
```

See the [dataset](https://huggingface.co/datasets/Jensen-holm/statcast-era-pitches) on Hugging Face itself for more details.

## Eager Benchmarking

![dataset_load_times](dataset_load_times.png)

| Eager Load Time (s) | API |
|---------------|-----|
| 1421.103 | pybaseball |
| 26.899 | polars |
| 33.093 | pandas |
| 68.692 | duckdb |
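
For a sense of scale, the table works out to roughly 53x (polars), 43x (pandas), and 21x (duckdb) faster than scraping with pybaseball. A short sketch of that arithmetic, using only the numbers in the table above:

```python
# Eager load times in seconds, copied from the benchmark table.
load_times_s = {
    "pybaseball": 1421.103,
    "polars": 26.899,
    "pandas": 33.093,
    "duckdb": 68.692,
}

baseline = load_times_s["pybaseball"]

# Speedup factor of each API relative to scraping with pybaseball.
speedups = {
    api: baseline / seconds
    for api, seconds in load_times_s.items()
    if api != "pybaseball"
}

for api, factor in sorted(speedups.items(), key=lambda kv: -kv[1]):
    print(f"{api}: ~{factor:.0f}x faster than pybaseball")
```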

# ⚠️ Data-Quality Warning ⚠️

MLB states that real-time `pitch_type` classification is automated and subject to change as data gets reviewed. The Hugging Face dataset updates currently do not account for this, so recently added `pitch_type` values may differ from MLB's reviewed labels. `pitch_type` is the only column affected.

# Contributing

Feel free to submit issues and PRs if you have a contribution you would like to make.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "statcast-pitches",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "baseball, data, mlb, statcast",
    "author": null,
    "author_email": "Jensen Holm <jensenh87@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/c5/58/48aa01df910682e74dad9ed56668ab20e8e1ab04c019312422de7ef041b1/statcast_pitches-1.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A package for loading MLB Statcast pitch data quickly using HF Dataset",
    "version": "1.0.0",
    "project_urls": {
        "HuggingFaceDataset": "https://huggingface.com/Jensen-holm/statcast-era-pitches",
        "Repository": "https://github.com/Jensen-holm/statcast-era-pitches"
    },
    "split_keywords": [
        "baseball",
        " data",
        " mlb",
        " statcast"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "fd0a0436cd99728daf928a8fe1c0d17b9b316026c4eae3e19373dac3f1240287",
                "md5": "3e77aec9eb9cd5f9b96370c1f2fbf2e6",
                "sha256": "fa20e00a920805b7557b81003947c238c3471cac8fa2cc290895cff2c3767285"
            },
            "downloads": -1,
            "filename": "statcast_pitches-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3e77aec9eb9cd5f9b96370c1f2fbf2e6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 4440,
            "upload_time": "2025-03-19T21:33:10",
            "upload_time_iso_8601": "2025-03-19T21:33:10.492099Z",
            "url": "https://files.pythonhosted.org/packages/fd/0a/0436cd99728daf928a8fe1c0d17b9b316026c4eae3e19373dac3f1240287/statcast_pitches-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "c55848aa01df910682e74dad9ed56668ab20e8e1ab04c019312422de7ef041b1",
                "md5": "3f16c8501a35814d514889667b3d64d3",
                "sha256": "8f43cd267db9cfb7ea725f0f5815d4f6f30f06ded21bf5f56a88782d580217be"
            },
            "downloads": -1,
            "filename": "statcast_pitches-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "3f16c8501a35814d514889667b3d64d3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 45550,
            "upload_time": "2025-03-19T21:33:11",
            "upload_time_iso_8601": "2025-03-19T21:33:11.437820Z",
            "url": "https://files.pythonhosted.org/packages/c5/58/48aa01df910682e74dad9ed56668ab20e8e1ab04c019312422de7ef041b1/statcast_pitches-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-03-19 21:33:11",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Jensen-holm",
    "github_project": "statcast-era-pitches",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "statcast-pitches"
}
        