hockey-scraper
==============

:Version: 1.40.3
:Summary: Python Package for scraping NHL Play-by-Play and Shift data.
:Author: Harry Shomer
:Home page: https://github.com/HarryShomer/Hockey-Scraper
:License: GNU General Public License v3 (GPLv3)
:Keywords: NHL
:Upload time: 2024-04-21 00:30:34

.. image:: https://badge.fury.io/py/hockey-scraper.svg
   :target: https://badge.fury.io/py/hockey-scraper
.. image:: https://readthedocs.org/projects/hockey-scraper/badge/?version=latest
   :target: https://readthedocs.org/projects/hockey-scraper/?badge=latest
   :alt: Documentation Status


**Please upgrade to version 1.40 or higher as earlier versions won't work.**

Hockey-Scraper
==============

.. inclusion-marker-for-sphinx


Purpose
-------

Scrapes NHL data from the NHL API and website. This includes the play-by-play and shift data for each game, as well as the schedule information.
It currently supports all preseason, regular season, and playoff games from the 2007-2008 season onwards.

Prerequisites
-------------

You will need Python installed. The package should work on both Python 2.7 and 3, but I recommend
at least version 3.6.0; earlier versions should still be fine.
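
If you want to double-check which interpreter you're running before installing, a quick
standard-library check works (this snippet is just an illustration, not part of the package):

::

    import sys

    # Warn rather than fail, since 2.7 and early 3.x may still work as noted above
    if sys.version_info < (3, 6):
        print("Warning: Python 3.6+ is recommended for hockey_scraper")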

Installation
------------

To install, all you need to do is open your terminal and run:

::

    pip install hockey_scraper


NHL Usage
---------

The full documentation can be found `here <http://hockey-scraper.readthedocs.io/en/latest/>`_.

Standard Scrape Functions
~~~~~~~~~~~~~~~~~~~~~~~~~

Scrape data on a season by season level:

::

    import hockey_scraper

    # Scrapes the 2015 & 2016 seasons with shifts and stores the data in a CSV file
    hockey_scraper.scrape_seasons([2015, 2016], True)

    # Scrapes the 2008 season without shifts and returns a dictionary containing the pbp Pandas DataFrame
    scraped_data = hockey_scraper.scrape_seasons([2008], False, data_format='Pandas')

Scrape a list of games:

::

    import hockey_scraper

    # Scrapes the first game of the 2014, 2015, and 2016 seasons with shifts and stores the data in a CSV file
    hockey_scraper.scrape_games([2014020001, 2015020001, 2016020001], True)

    # Scrapes the first game of the 2007, 2008, and 2009 seasons with shifts and returns a dictionary with the Pandas DataFrames
    scraped_data = hockey_scraper.scrape_games([2007020001, 2008020001, 2009020001], True, data_format='Pandas')
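
For reference, NHL game IDs encode the season's starting year (4 digits), the game type
(2 digits: 01 preseason, 02 regular season, 03 playoffs), and the game number (4 digits).
A small helper like the hypothetical `make_game_id` below (not part of the package) can build them:

::

    def make_game_id(season, game_type, game_number):
        """Build an NHL game ID, e.g. make_game_id(2016, 2, 1) -> 2016020001.

        season: starting year of the season (2016 means 2016-2017)
        game_type: 1 = preseason, 2 = regular season, 3 = playoffs
        game_number: game number within that type
        """
        return int("{}{:02d}{:04d}".format(season, game_type, game_number))

    # The first regular-season game of 2016-2017, as used above
    assert make_game_id(2016, 2, 1) == 2016020001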

Scrape all games in a given date range:

::

    import hockey_scraper

    # Scrapes all games between 2016-10-10 and 2016-10-20 without shifts and stores the data in a CSV file
    hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False)

    # Scrapes all games between 2015-1-1 and 2015-1-15 without shifts and returns a dictionary with the pbp Pandas DataFrame
    scraped_data = hockey_scraper.scrape_date_range('2015-1-1', '2015-1-15', False, data_format='Pandas')


The dictionary returned when the keyword argument `data_format` is set to `'Pandas'` is structured like:

::

    {
      # Both of these are always included
      'pbp': pbp_df,

      # This is only included when the argument 'if_scrape_shifts' is set equal to True
      'shifts': shifts_df
    }
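
Putting it together, unpacking the returned dictionary might look like this sketch (assuming
shifts were scraped, so both keys are present):

::

    import hockey_scraper

    scraped_data = hockey_scraper.scrape_seasons([2008], True, data_format='Pandas')

    pbp_df = scraped_data['pbp']        # always included
    shifts_df = scraped_data['shifts']  # included because shifts were scraped

    print(pbp_df.shape, shifts_df.shape)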


Schedule
~~~~~~~~

The schedule for any past or future games can be scraped as follows:

::

    import hockey_scraper

    # As opposed to the other calls, the default format here is 'Pandas', which returns a DataFrame
    sched_df = hockey_scraper.scrape_schedule("2019-10-01", "2020-07-01")

The columns returned are: `['game_id', 'date', 'venue', 'home_team', 'away_team', 'start_time', 'home_score', 'away_score', 'status']`
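
Since the result is a plain Pandas DataFrame, you can filter it with the usual tools. For
example, a sketch that keeps one team's home games (the `'TOR'` value is just an illustrative
guess at the team encoding; check your actual output):

::

    import hockey_scraper

    sched_df = hockey_scraper.scrape_schedule("2019-10-01", "2020-07-01")

    # Standard Pandas boolean filtering on the columns listed above
    tor_home = sched_df[sched_df['home_team'] == 'TOR']
    print(tor_home[['game_id', 'date', 'away_team', 'status']])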


Persistent Data
~~~~~~~~~~~~~~~

All the raw game data files retrieved can also be saved to your disk. This allows for faster rescraping (we don't need to re-retrieve them) 
and the ability to parse the data yourself.

This is achieved by setting the keyword argument `docs_dir=True`. This will store the data in a directory called `~/hockey_scraper_data`. 
You can provide your own directory where you want everything to be stored (it must exist beforehand). By default `docs_dir=False`.

For example, let's say we are scraping the JSON PBP data for game `2019020001 <http://statsapi.web.nhl.com/api/v1/game/2019020001/feed/live>`_. 
If `docs_dir` isn't `False`, it will first check whether the data is already in the directory. If so, it will load the data from that file instead of making a GET
request to the NHL API. However, if it doesn't exist, it will make a GET request and then save the output to the directory.
This ensures that the next time you request that data it can be loaded from a file.
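
Conceptually, the cache-or-fetch behavior is equivalent to the sketch below (an illustration of
the idea, not the package's actual internals; the `get_json` helper and its arguments are made up):

::

    import os
    import requests

    def get_json(url, cache_path, rescrape=False):
        """Return cached text if we have it; otherwise fetch it and cache it."""
        if not rescrape and os.path.exists(cache_path):
            with open(cache_path, encoding="utf-8") as f:
                return f.read()

        resp = requests.get(url)
        resp.raise_for_status()
        with open(cache_path, "w", encoding="utf-8") as f:
            f.write(resp.text)
        return resp.text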

Here are some examples.

The default saving location is `~/hockey_scraper_data`.


::

    # Create or try to refer to a directory in the home directory
    # Will create a directory called 'hockey_scraper_data' in the home directory (if it doesn't exist)
    hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=True)


A user-defined directory:

::

    USER_PATH = "/...."
    hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH)


You can overwrite the existing files by specifying `rescrape=True`. It will retrieve all the files from the source and save the newer versions to `docs_dir`.

::

    hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH, rescrape=True)



Live Scraping
~~~~~~~~~~~~~

Here is a simple example of one way to set up live scraping. I strongly suggest checking out
`this section <https://hockey-scraper.readthedocs.io/en/latest/live_scrape.html>`_ of the docs if you plan on using this.

::

   import hockey_scraper as hs


   def to_csv(game):
       """
       Store each game DataFrame in a file

       :param game: LiveGame object

       :return: None
       """

       # Only process the game if it:
       # 1. Has started (we recorded at least one event)
       # 2. Is not in intermission
       # 3. Is not over
       if game.is_ongoing():
           # Print the description of the last event
           print(game.game_id, "->", game.pbp_df.iloc[-1]['Description'])

           # Store in CSV files
           game.pbp_df.to_csv(f"../hockey_scraper_data/{game.game_id}_pbp.csv", sep=',')
           game.shifts_df.to_csv(f"../hockey_scraper_data/{game.game_id}_shifts.csv", sep=',')

   if __name__ == "__main__":
       # Before we start, set the directory where the files will be stored
       # You don't have to do this but I recommend it
       hs.live_scrape.set_docs_dir("../hockey_scraper_data")

       # Scrape the info for all the games on 2018-11-15
       games = hs.ScrapeLiveGames("2018-11-15", if_scrape_shifts=True, pause=20)

       # While all the games aren't finished
       while not games.finished():
           # Update for all the games currently being played
           games.update_live_games(sleep_next=True)

           # Go through every LiveGame object and apply some function
           # You can of course do whatever you want here.
           for game in games.live_games:
               to_csv(game)



Contact
-------

Please contact me with any issues or suggestions. For bugs or anything related to the code, please open an issue.
Otherwise you can email me at Harryshomer@gmail.com.


Copyright
---------
::

    Copyright (C) 2019-2022 Harry Shomer
    This file is part of hockey_scraper

    hockey_scraper is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.

            
