# scrape-schema-recipe
[![PyPI](https://img.shields.io/pypi/v/scrape-schema-recipe)](https://pypi.org/project/scrape-schema-recipe/)
![Build Status](https://github.com/micahcochran/scrape-schema-recipe/actions/workflows/python-package.yml/badge.svg)
[![Downloads](https://pepy.tech/badge/scrape-schema-recipe)](https://pepy.tech/project/scrape-schema-recipe)
Scrapes recipes from HTML pages containing https://schema.org/Recipe structured data (Microdata or JSON-LD) into Python dictionaries.
## Install
```
pip install scrape-schema-recipe
```
## Requirements
Python version 3.6+
This library relies heavily upon [extruct](https://github.com/scrapinghub/extruct).
Other requirements:
* isodate (>=0.5.1)
* requests
## Online Example
```python
>>> import scrape_schema_recipe
>>> url = 'https://www.foodnetwork.com/recipes/alton-brown/honey-mustard-dressing-recipe-1939031'
>>> recipe_list = scrape_schema_recipe.scrape_url(url, python_objects=True)
>>> len(recipe_list)
1
>>> recipe = recipe_list[0]
# Name of the recipe
>>> recipe['name']
'Honey Mustard Dressing'
# List of the Ingredients
>>> recipe['recipeIngredient']
['5 tablespoons medium body honey (sourwood is nice)',
'3 tablespoons smooth Dijon mustard',
'2 tablespoons rice wine vinegar']
# List of the Instructions
>>> recipe['recipeInstructions']
['Combine all ingredients in a bowl and whisk until smooth. Serve as a dressing or a dip.']
# Author
>>> recipe['author']
[{'@type': 'Person',
'name': 'Alton Brown',
'url': 'https://www.foodnetwork.com/profiles/talent/alton-brown'}]
```
`'@type': 'Person'` is a [https://schema.org/Person](https://schema.org/Person) object.
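Since each Person entry is a plain dictionary, pulling out just the author names is a one-line list comprehension (a sketch using the `author` value shown above):

```python
# The 'author' value from the recipe dictionary shown above.
authors = [{'@type': 'Person',
            'name': 'Alton Brown',
            'url': 'https://www.foodnetwork.com/profiles/talent/alton-brown'}]

# Each schema.org/Person entry is a plain dict, so names are easy to extract.
author_names = [person['name'] for person in authors]
print(author_names)  # ['Alton Brown']
```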
```python
# Preparation Time
>>> recipe['prepTime']
datetime.timedelta(0, 300)
# The library pendulum can give you something a little easier to read.
>>> import pendulum
# for pendulum version 1.0
>>> pendulum.Interval.instanceof(recipe['prepTime'])
<Interval [5 minutes]>
# for version 2.0 of pendulum
>>> pendulum.Duration(seconds=recipe['prepTime'].total_seconds())
<Duration [5 minutes]>
```
If `python_objects` is set to `False`, this returns the ISO 8601 string representation of the duration, `'PT5M'`.
See [pendulum's website](https://pendulum.eustace.io/) for more information.
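If you'd rather avoid an extra dependency, a `datetime.timedelta` can be made readable with a few lines of standard-library code (a sketch; the `format_minutes` helper below is ours, not part of this library):

```python
from datetime import timedelta

def format_minutes(td: timedelta) -> str:
    """Render a duration as whole minutes, e.g. '5 minutes'."""
    minutes = int(td.total_seconds() // 60)
    return f"{minutes} minute" + ("s" if minutes != 1 else "")

# recipe['prepTime'] from the example above is timedelta(seconds=300)
print(format_minutes(timedelta(seconds=300)))  # 5 minutes
```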
```python
# Publication date
>>> recipe['datePublished']
datetime.datetime(2016, 11, 13, 21, 5, 50, 518000, tzinfo=<FixedOffset '-05:00'>)
>>> str(recipe['datePublished'])
'2016-11-13 21:05:50.518000-05:00'
# Identifying this as http://schema.org/Recipe data (in JSON-LD format)
>>> recipe['@context'], recipe['@type']
('http://schema.org', 'Recipe')
# Content's URL
>>> recipe['url']
'https://www.foodnetwork.com/recipes/alton-brown/honey-mustard-dressing-recipe-1939031'
# all the keys in this dictionary
>>> recipe.keys()
dict_keys(['recipeYield', 'totalTime', 'dateModified', 'url', '@context', 'name', 'publisher', 'prepTime', 'datePublished', 'recipeIngredient', '@type', 'recipeInstructions', 'author', 'mainEntityOfPage', 'aggregateRating', 'recipeCategory', 'image', 'headline', 'review'])
```
## Example from a File (alternative representations)
Also works with locally saved [HTML example file](/test_data/google-recipe-example.html).
```python
>>> filelocation = 'test_data/google-recipe-example.html'
>>> recipe_list = scrape_schema_recipe.scrape(filelocation, python_objects=True)
>>> recipe = recipe_list[0]
>>> recipe['name']
'Party Coffee Cake'
>>> recipe['datePublished']
datetime.date(2018, 3, 10)
# Recipe Instructions using the HowToStep
>>> recipe['recipeInstructions']
[{'@type': 'HowToStep',
'text': 'Preheat the oven to 350 degrees F. Grease and flour a 9x9 inch pan.'},
{'@type': 'HowToStep',
'text': 'In a large bowl, combine flour, sugar, baking powder, and salt.'},
{'@type': 'HowToStep', 'text': 'Mix in the butter, eggs, and milk.'},
{'@type': 'HowToStep', 'text': 'Spread into the prepared pan.'},
{'@type': 'HowToStep', 'text': 'Bake for 30 to 35 minutes, or until firm.'},
{'@type': 'HowToStep', 'text': 'Allow to cool.'}]
```
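Some consumers want plain instruction strings whether a site publishes bare strings (as in the first example) or `HowToStep` objects (as here). A small normalizing helper (ours, not part of this library) could look like:

```python
def instruction_texts(instructions):
    """Return plain strings from either bare strings or HowToStep dicts."""
    texts = []
    for step in instructions:
        if isinstance(step, dict):   # a schema.org/HowToStep object
            texts.append(step.get('text', ''))
        else:                        # already a plain string
            texts.append(step)
    return texts

steps = [{'@type': 'HowToStep', 'text': 'Mix in the butter, eggs, and milk.'},
         {'@type': 'HowToStep', 'text': 'Spread into the prepared pan.'}]
print(instruction_texts(steps))
```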
## What Happens When Things Go Wrong
If there aren't any http://schema.org/Recipe formatted recipes on the site, an empty list is returned:
```python
>>> url = 'https://www.google.com'
>>> recipe_list = scrape_schema_recipe.scrape(url, python_objects=True)
>>> len(recipe_list)
0
```
Some websites will raise an `HTTPError`.
You may get around a 403 Forbidden error by supplying an alternative user-agent
via the parameter `user_agent_str`.
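As a sketch, passing `user_agent_str` looks like the following (the user-agent string below is purely illustrative, and the actual call is shown commented out because it requires network access):

```python
# import scrape_schema_recipe   # pip install scrape-schema-recipe

# A browser-like user-agent string (illustrative value, not a recommendation).
ua = 'Mozilla/5.0 (X11; Linux x86_64) example-recipe-client/1.0'

# url = 'https://example.com/some-recipe-page'
# recipe_list = scrape_schema_recipe.scrape_url(url, python_objects=True,
#                                               user_agent_str=ua)
```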
## Functions
* `load()` - load HTML schema.org/Recipe structured data from a file or file-like object
* `loads()` - loads HTML schema.org/Recipe structured data from a string
* `scrape_url()` - scrape a URL for HTML schema.org/Recipe structured data
* `scrape()` - load HTML schema.org/Recipe structured data from a file, file-like object, string, or URL
```
Parameters
----------
location : string or file-like object
A url, filename, or text_string of HTML, or a file-like object.
python_objects : bool, list, or tuple (optional)
    when True, translates certain data types into python objects:
    dates into datetime.date, datetimes into datetime.datetime,
    and durations into datetime.timedelta.
    when set to a list or tuple, only the specified types are converted
    to python objects:
    * [datetime.date] or [datetime.datetime] (either one) converts
      dates and datetimes
    * [datetime.timedelta] converts durations
    when False, no conversion is performed
    (defaults to False)
nonstandard_attrs : bool, optional
    when True, adds nonstandard attributes (outside the schema.org/Recipe
    specification) to the resulting dictionaries, such as:
    '_format' - either 'json-ld' or 'microdata' (how the schema.org/Recipe
    was encoded into HTML)
    '_source_url' - the source URL, when 'url' has already been defined
    as another value
    (defaults to False)
migrate_old_schema : bool, optional
when True it migrates the schema from older version to current version
(defaults to True)
user_agent_str : string, optional ***only for scrape_url() and scrape()***
    override the user-agent string with this value.
    (defaults to None)
Returns
-------
list
    a list of dictionaries in the style of schema.org/Recipe JSON-LD
    (an empty list if there are no results)
```
This documentation is also available via `help()` in the Python console.
## Example function
The `example_output()` function gives quick access to data for prototyping and debugging.
It accepts the same parameters as `load()`, but the first parameter, `name`, takes an example name instead of a file location.
```python
>>> from scrape_schema_recipe import example_names, example_output
>>> example_names
('irish-coffee', 'google', 'taco-salad', 'tart', 'tea-cake', 'truffles')
>>> recipes = example_output('truffles')
>>> recipes[0]['name']
'Rum & Tonka Bean Dark Chocolate Truffles'
```
## Files
License: Apache 2.0 see [LICENSE](LICENSE)
Test data attribution and licensing: [ATTRIBUTION.md](ATTRIBUTION.md)
## Development
Unit tests must be run from a copy of the repository folder:
```
schema-recipe-scraper$ python3 test_scrape.py
```
mypy is used for static type checking. From the project directory:
```
schema-recipe-scraper$ mypy schema_recipe_scraper/scrape.py
```
If you run mypy from another directory, add the `--ignore-missing-imports` flag:
`$ mypy --ignore-missing-imports scrape.py`
The flag is needed because most libraries don't include static typing information
in their own code or in typeshed.
## Reference Documentation
Here are some references for how schema.org/Recipe *should* be structured:
* [https://schema.org/Recipe](https://schema.org/Recipe) - official specification
* [Recipe Google Search Guide](https://developers.google.com/search/docs/data-types/recipe) - material teaching developers how to use the schema (with emphasis on how structured data impacts search results)
## Other Similar Python Libraries
* [recipe_scrapers](https://github.com/hhursev/recipe-scrapers) - scrapes
recipes using extruct for the schema.org/Recipe format, or BeautifulSoup for HTML tags.
It has drivers for many different websites that further parse the information.
A solid alternative to scrape-schema-recipe, focused on a different kind of simplicity.