pybaseball


Name: pybaseball
Version: 2.2.7
Home page: https://github.com/jldbc/pybaseball
Summary: Retrieve baseball data in Python
Upload time: 2023-09-08 12:42:45
Maintainer: Moshe Schorr
Author: James LeDoux
License: MIT
Keywords: baseball sabermetrics data statistics statcast web scraping

# pybaseball

Baseball data scraping and analysis tools in Python

## Overview

`pybaseball` is a Python package for baseball data analysis. This package scrapes Baseball Reference, Baseball Savant, and FanGraphs so you don't have to. The package retrieves Statcast data, pitching stats, batting stats, division standings/team records, awards data, and more. Data is available at the individual pitch level, as well as aggregated at the season level and over custom time periods. See the [docs](https://github.com/jldbc/pybaseball/tree/master/docs) for a comprehensive list of data acquisition functions.

## Installation

Pybaseball can be installed via pip:

```bash
pip install pybaseball
```

or from the repo (which may at times be more up to date):

```bash
git clone https://github.com/jldbc/pybaseball
cd pybaseball
pip install -e .
```

We try to publish periodic updates through GitHub releases and the PyPI CI, but published versions may lag behind the repo at times.

## Community

Discussion about pybaseball use and development is hosted on our group Discord; the sign-up link is [here](https://discord.gg/TnJVyUDDn8). Issues with the codebase should still be raised and addressed on GitHub.

## Documentation

Full documentation on available functions and their arguments, along with examples, is located in the [docs](https://github.com/jldbc/pybaseball/tree/master/docs) folder. This section contains a brief overview of the main functionality of this library.


### Statcast: Pull advanced metrics from Major League Baseball's Statcast system

Statcast data include pitch-level information, pulled from baseballsavant.com.

```python
>>> from pybaseball import statcast
>>> statcast(start_dt="2019-06-24", end_dt="2019-06-25").columns
Index(['pitch_type', 'game_date', 'release_speed', 'release_pos_x',
       'release_pos_z', 'player_name', 'batter', 'pitcher', 'events',
       'description', 'spin_dir', 'spin_rate_deprecated',
       'break_angle_deprecated', 'break_length_deprecated', 'zone', 'des',
       'game_type', 'stand', 'p_throws', 'home_team', 'away_team', 'type',
       'hit_location', 'bb_type', 'balls', 'strikes', 'game_year', 'pfx_x',
       'pfx_z', 'plate_x', 'plate_z', 'on_3b', 'on_2b', 'on_1b',
       'outs_when_up', 'inning', 'inning_topbot', 'hc_x', 'hc_y',
       'tfs_deprecated', 'tfs_zulu_deprecated', 'fielder_2', 'umpire', 'sv_id',
       'vx0', 'vy0', 'vz0', 'ax', 'ay', 'az', 'sz_top', 'sz_bot',
       'hit_distance_sc', 'launch_speed', 'launch_angle', 'effective_speed',
       'release_spin_rate', 'release_extension', 'game_pk', 'pitcher.1',
       'fielder_2.1', 'fielder_3', 'fielder_4', 'fielder_5', 'fielder_6',
       'fielder_7', 'fielder_8', 'fielder_9', 'release_pos_y',
       'estimated_ba_using_speedangle', 'estimated_woba_using_speedangle',
       'woba_value', 'woba_denom', 'babip_value', 'iso_value',
       'launch_speed_angle', 'at_bat_number', 'pitch_number', 'pitch_name',
       'home_score', 'away_score', 'bat_score', 'fld_score', 'post_away_score',
       'post_home_score', 'post_bat_score', 'post_fld_score',
       'if_fielding_alignment', 'of_fielding_alignment', 'spin_axis',
       'delta_home_win_exp', 'delta_run_exp'],
      dtype='object')
```

For documentation on the definitions of these columns, see the [Statcast Search CSV Documentation](https://baseballsavant.mlb.com/csv-docs).

If `start_dt` and `end_dt` are supplied, `statcast` returns all Statcast data between those two dates. If not, it returns yesterday's data. The optional `verbose` argument controls whether the library reports its progress while it pulls the data.
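
The defaulting rule described above can be sketched as follows. This is an illustration only, not pybaseball's own code: `resolve_window` and `fetch_statcast` are hypothetical helper names, and the pybaseball import is deferred so the date logic can be read on its own. Only the `start_dt`, `end_dt`, and `verbose` arguments mentioned above are used.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def resolve_window(start_dt: Optional[str], end_dt: Optional[str],
                   today: Optional[date] = None) -> Tuple[str, str]:
    """Mirror the documented default: with no dates given, use yesterday.

    Hypothetical helper for illustration; pybaseball applies its own
    defaulting internally when the dates are omitted.
    """
    if start_dt is None and end_dt is None:
        today = today or date.today()
        yesterday = (today - timedelta(days=1)).isoformat()
        return yesterday, yesterday
    if start_dt is None or end_dt is None:
        # This sketch simply insists on both dates or neither.
        raise ValueError("supply both start_dt and end_dt, or neither")
    return start_dt, end_dt


def fetch_statcast(start_dt: Optional[str] = None,
                   end_dt: Optional[str] = None,
                   verbose: bool = False):
    """Pull pitch-level data for the resolved window (needs network access)."""
    from pybaseball import statcast  # imported lazily; requires pybaseball

    start, end = resolve_window(start_dt, end_dt)
    return statcast(start_dt=start, end_dt=end, verbose=verbose)
```

Calling `fetch_statcast("2019-06-24", "2019-06-25")` would then be equivalent to the `statcast` call shown earlier, with progress messages switched off.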

#### Player-Specific Queries

For a player-specific Statcast query, pull pitching or batting data using the `statcast_pitcher` and `statcast_batter` functions. These take the same `start_dt` and `end_dt` arguments as the `statcast` function, as well as a `player_id` argument. This ID comes from MLB Advanced Media, and can be obtained using the function `playerid_lookup`. The returned columns match the set above, but the rows are filtered to that specific pitcher or batter. A complete example:

```python
>>> # Find Clayton Kershaw's player ID
>>> from pybaseball import playerid_lookup, statcast_pitcher
>>> playerid_lookup('kershaw', 'clayton')
  name_last name_first  key_mlbam key_retro  key_bbref  key_fangraphs  mlb_played_first  mlb_played_last
0   kershaw    clayton     477132  kersc001  kershcl01           2036            2008.0           2022.0

>>> # His MLBAM ID is 477132, so we pass that as the player_id argument
>>> kershaw_stats = statcast_pitcher('2017-06-01', '2017-07-01', 477132)
>>> kershaw_stats.groupby("pitch_type").release_speed.agg("mean")
pitch_type
CH    86.725000
CU    73.133333
FF    92.844622
SI    94.515385
SL    87.962381
Name: release_speed, dtype: float64
```
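
The same two-step pattern (look up the ID, then query) applies to batters through `statcast_batter`. The sketch below is a hedged illustration: `first_mlbam_id` and `batter_pitches` are hypothetical helper names, and it assumes the lookup result exposes a `key_mlbam` column as in the output above. The `lookup` and `fetch` parameters exist only so the flow can be exercised without network access.

```python
from typing import Any, Callable, Optional


def first_mlbam_id(lookup_result: Any) -> int:
    """Return the first `key_mlbam` value from a playerid_lookup-style result."""
    ids = list(lookup_result["key_mlbam"])
    if not ids:
        raise LookupError("no player matched the given name")
    return int(ids[0])


def batter_pitches(last: str, first: str, start_dt: str, end_dt: str,
                   lookup: Optional[Callable] = None,
                   fetch: Optional[Callable] = None):
    """Look up a batter by name, then pull his Statcast rows for the window.

    `lookup` and `fetch` default to pybaseball's playerid_lookup and
    statcast_batter (assumed to share statcast_pitcher's argument order).
    """
    if lookup is None or fetch is None:
        import pybaseball  # imported lazily; real calls need network access

        lookup = lookup or pybaseball.playerid_lookup
        fetch = fetch or pybaseball.statcast_batter
    player_id = first_mlbam_id(lookup(last, first))
    return fetch(start_dt, end_dt, player_id)
```

Taking the first match is a simplification: a common name may return several rows, in which case the lookup table should be inspected by hand before choosing an ID.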

#### A note on Statcast data

Statcast data is subject to change (even for prior seasons):

<div>
   <blockquote class="twitter-tweet">
      <p lang="en" dir="ltr">
         Each season has 700,000+ pitches, and is subject to update. You should code accordingly.
      </p>&mdash; Tangotiger (@tangotiger)
      <a href="https://twitter.com/tangotiger/status/1362064972025634821?ref_src=twsrc%5Etfw">February 17, 2021</a>
   </blockquote>
</div>

### Aggregate Statistics

For league-wide season-level pitching data, use the function `pitching_stats(start_season, end_season)`. This will return one row per player per season, and provide all metrics made available by FanGraphs.

For a custom date range, `pitching_stats_range(start_dt, end_dt)` pulls data for a specific time interval from Baseball Reference. Note that all dates should be in `YYYY-MM-DD` format.

```python
>>> from pybaseball import pitching_stats
>>> data = pitching_stats(2014, 2016)
>>> data.columns
Index(['IDfg', 'Season', 'Name', 'Team', 'Age', 'W', 'L', 'WAR', 'ERA', 'G',
       ...
       'LA', 'Barrels', 'Barrel%', 'maxEV', 'HardHit', 'HardHit%', 'Events',
       'CStr%', 'CSW%', 'xERA'],
      dtype='object', length=334)
```
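
For the date-range variant, it can help to check the `YYYY-MM-DD` format before sending a request. The following is a minimal sketch under that assumption: `as_iso_date` and `pitching_range` are hypothetical helper names, not part of pybaseball, and the only library call used is `pitching_stats_range(start_dt, end_dt)` as described above.

```python
from datetime import datetime


def as_iso_date(text: str) -> str:
    """Return `text` normalized to YYYY-MM-DD, or raise ValueError."""
    return datetime.strptime(text, "%Y-%m-%d").date().isoformat()


def pitching_range(start_dt: str, end_dt: str):
    """Validate both dates, then query Baseball Reference (needs network)."""
    from pybaseball import pitching_stats_range  # imported lazily

    start, end = as_iso_date(start_dt), as_iso_date(end_dt)
    # ISO-formatted dates compare correctly as plain strings.
    if start > end:
        raise ValueError("start_dt must not be after end_dt")
    return pitching_stats_range(start, end)
```

A date such as `06/01/2017` fails fast with a `ValueError` instead of producing a confusing error from the remote query.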

Batting stats are obtained similarly. The function call for getting season-level stats is `batting_stats(start_season, end_season)`, and for a particular time range it is `batting_stats_range(start_dt, end_dt)`. The Baseball Reference equivalent for season-level data is `batting_stats_bref(season)`.

(For season-level queries, if you prefer Baseball Reference to FanGraphs, there is a third option, `pitching_stats_bref(season)`. This works the same way as `pitching_stats`, but retrieves its data from Baseball Reference instead. This is *not recommended*, however, because the Baseball Reference query can currently retrieve only one season's worth of data per request.)
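
If Baseball Reference data is still wanted for several seasons, the one-season-per-request limit means issuing one call per year. A rough sketch of that loop follows; `bref_pitching_by_season` is a hypothetical helper name, and the `fetch` parameter is there only so the loop can be tried without network access.

```python
from typing import Any, Callable, Dict, Optional


def bref_pitching_by_season(start_season: int, end_season: int,
                            fetch: Optional[Callable[[int], Any]] = None
                            ) -> Dict[int, Any]:
    """Issue one Baseball Reference request per season, keyed by year."""
    if start_season > end_season:
        raise ValueError("start_season must not exceed end_season")
    if fetch is None:
        from pybaseball import pitching_stats_bref  # lazy; needs network

        fetch = pitching_stats_bref
    return {season: fetch(season)
            for season in range(start_season, end_season + 1)}
```

The resulting dictionary of DataFrames can be combined with `pandas.concat(frames)`, which uses the dictionary keys (here, the seasons) as the outer index level.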

### Game-by-Game Results and Schedule

The `schedule_and_record` function returns a team's game-by-game results for a given season. The function's only two arguments are `season` and `team`, where `team` is the team's abbreviation (e.g., NYY for the New York Yankees).

```python
# Example: Say we want to know the 1927 Yankees record on May 16 
from pybaseball import schedule_and_record
data = schedule_and_record(1927, 'NYY')
data.loc[data.Date.str.contains("May 16"), :]
              Date   Tm Home_Away  Opp W/L    R   RA  Inn   W-L  Rank      GB      Win      Loss   Save  Time D/N  Attendance   cLI  Streak Orig. Scheduled
28  Monday, May 16  NYY         @  DET   W  6.0  2.0  9.0  19-8   1.0  up 3.0  Ruether  Holloway  Moore  2:28   D      4000.0  5.15       5            None
```
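
The `W-L` column holds the running record as a string such as `19-8`. A small sketch of turning that into numbers is shown below; `parse_record` and `win_pct` are hypothetical helper names and assume the `wins-losses` layout seen in the output above.

```python
from typing import Tuple


def parse_record(w_l: str) -> Tuple[int, int]:
    """Split a 'wins-losses' string such as '19-8' into two integers."""
    wins, losses = w_l.strip().split("-")
    return int(wins), int(losses)


def win_pct(w_l: str) -> float:
    """Winning percentage for a 'wins-losses' string (0.0 if no games)."""
    wins, losses = parse_record(w_l)
    games = wins + losses
    return wins / games if games else 0.0


# parse_record("19-8") returns (19, 8)
```

With the DataFrame from the example, `data["W-L"].map(win_pct)` would give the running winning percentage after each game.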


### Standings: up-to-date or historical division standings, W/L records

The `standings(season)` function gives division standings for a given season. If the current season is chosen, it will give the most current set of standings. Otherwise, it will give the end-of-season standings for each division for the chosen season. This function returns a list of DataFrames, one for each of MLB's six divisions.

```python
>>> from pybaseball import standings
>>> data = standings(2016)[4]
>>> print(data)
                    Tm    W   L  W-L%    GB
1         Chicago Cubs  103  58  .640    --
2  St. Louis Cardinals   86  76  .531  17.5
3   Pittsburgh Pirates   78  83  .484  25.0
4    Milwaukee Brewers   73  89  .451  30.5
5      Cincinnati Reds   68  94  .420  35.5
```
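
Because the order of the divisions within the returned list is not spelled out here, picking a table by position (as with `[4]` above) can be fragile. One possible alternative, sketched below, is to search for the table containing a given team. `division_for` is a hypothetical helper name, and it assumes each table has a `Tm` column holding full team names as in the output above.

```python
from typing import Any, Optional, Sequence


def division_for(tables: Sequence[Any], team: str) -> Optional[Any]:
    """Return the first table whose `Tm` column contains `team`, else None.

    Works with any object that can be indexed by "Tm" to give an
    iterable of names, such as a pandas DataFrame.
    """
    for table in tables:
        if team in set(table["Tm"]):
            return table
    return None
```

For example, `division_for(standings(2016), "Chicago Cubs")` should return the same table as the indexed lookup shown above.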

### Caching

To speed up repeated calls, a local cache can store a copy of the requested data. The cache is disabled by default
so that disk space is not used without the user's permission. However, enabling it is simple.

The cache can be turned on by importing the `pybaseball.cache` module and enabling it like so:

```python
from pybaseball import cache

cache.enable()
```

## FAQ

### Stale Cache

If you call a statcast method for a future date, the cache will log empty datasets for those dates. If you're not getting the results you expect for a given date, first try clearing your cache:

```python
from pybaseball import cache
cache.purge()
```

### Multiprocessing

If you're getting an error with `concurrent.futures.process.BrokenProcessPool`, wrap your call in an `if __name__ == '__main__':` guard, e.g.

```python
from pybaseball import statcast

if __name__ == '__main__':
    stats = statcast()
```

This may be necessary on systems that use spawn-based processes (often Windows and macOS).

For other problems, please submit an issue.

## Contributing

See [contributing.md](https://github.com/jldbc/pybaseball/tree/master/contributing.md) for a guide to contributing to this library.


------

## Credit

This package was developed by James LeDoux and is maintained by [Moshe Schorr](https://github.com/schorrm).

This package was inspired by Bill Petti's excellent R package [baseballr](https://github.com/billpetti/baseballr), which at the time of this package's development had no Python equivalent. Our hope is to fill that void with this package.

The Lahman data comes from [Sean Lahman's baseball database](http://www.seanlahman.com/baseball-archive/statistics/).

All other data comes from FanGraphs, Baseball Reference, the Chadwick Bureau, Retrosheet, and Baseball Savant.

            
