=======
extruct
=======
.. image:: https://github.com/scrapinghub/extruct/workflows/build/badge.svg?branch=master
:target: https://github.com/scrapinghub/extruct/actions
:alt: Build Status
.. image:: https://img.shields.io/codecov/c/github/scrapinghub/extruct/master.svg?maxAge=2592000
:target: https://codecov.io/gh/scrapinghub/extruct
:alt: Coverage report
.. image:: https://img.shields.io/pypi/v/extruct.svg
:target: https://pypi.python.org/pypi/extruct
:alt: PyPI Version
*extruct* is a library for extracting embedded metadata from HTML markup.
Currently, *extruct* supports:
- `W3C's HTML Microdata`_
- `embedded JSON-LD`_
- `Microformat`_ via `mf2py`_
- `Facebook's Open Graph`_
- (experimental) `RDFa`_ via `rdflib`_
- `Dublin Core Metadata (DC-HTML-2003)`_
.. _W3C's HTML Microdata: http://www.w3.org/TR/microdata/
.. _embedded JSON-LD: http://www.w3.org/TR/json-ld/#embedding-json-ld-in-html-documents
.. _RDFa: https://www.w3.org/TR/html-rdfa/
.. _rdflib: https://pypi.python.org/pypi/rdflib/
.. _Microformat: http://microformats.org/wiki/Main_Page
.. _mf2py: https://github.com/microformats/mf2py
.. _Facebook's Open Graph: http://ogp.me/
.. _Dublin Core Metadata (DC-HTML-2003): https://www.dublincore.org/specifications/dublin-core/dcq-html/2003-11-30/
The microdata algorithm is a revisit of `this Scrapinghub blog post`_ showing how to use EXSLT extensions.
.. _this Scrapinghub blog post: http://blog.scrapinghub.com/2014/06/18/extracting-schema-org-microdata-using-scrapy-selectors-and-xpath/
Installation
------------
::
pip install extruct
Usage
-----
All-in-one extraction
+++++++++++++++++++++
The simplest way to use *extruct* is to call
``extruct.extract(htmlstring, base_url=base_url)``
with an HTML string and an optional base URL.
Let's try this on a webpage that uses all of the supported syntaxes (RDFa with `ogp`_).
First, fetch the HTML using python-requests, then feed the response body to ``extruct``::
>>> import extruct
>>> import requests
>>> import pprint
>>> from w3lib.html import get_base_url
>>>
>>> pp = pprint.PrettyPrinter(indent=2)
>>> r = requests.get('https://www.optimizesmart.com/how-to-use-open-graph-protocol/')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url=base_url)
>>>
>>> pp.pprint(data)
{ 'dublincore': [ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/description',
'content': 'What is Open Graph Protocol '
'and why you need it? Learn to '
'implement Open Graph Protocol '
'for Facebook on your website. '
'Open Graph Protocol Meta Tags.',
'name': 'description'}],
'namespaces': {},
'terms': []}],
'json-ld': [ { '@context': 'https://schema.org',
'@id': '#organization',
'@type': 'Organization',
'logo': 'https://www.optimizesmart.com/wp-content/uploads/2016/03/optimize-smart-Twitter-logo.jpg',
'name': 'Optimize Smart',
'sameAs': [ 'https://www.facebook.com/optimizesmart/',
'https://uk.linkedin.com/in/analyticsnerd',
'https://www.youtube.com/user/optimizesmart',
'https://twitter.com/analyticsnerd'],
'url': 'https://www.optimizesmart.com/'}],
'microdata': [ { 'properties': {'headline': ''},
'type': 'http://schema.org/WPHeader'}],
'microformat': [ { 'children': [ { 'properties': { 'category': [ 'specialized-tracking'],
'name': [ 'Open Graph '
'Protocol for '
'Facebook '
'explained with '
'examples\n'
'\n'
'Specialized '
'Tracking\n'
'\n'
'\n'
(...)
'Follow '
'@analyticsnerd\n'
'!function(d,s,id){var '
"js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
"'script', "
"'twitter-wjs');"]},
'type': ['h-entry']}],
'properties': { 'name': [ 'Open Graph Protocol for '
'Facebook explained with '
'examples\n'
(...)
'Follow @analyticsnerd\n'
'!function(d,s,id){var '
"js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
"'script', 'twitter-wjs');"]},
'type': ['h-feed']}],
'opengraph': [ { 'namespace': {'og': 'http://ogp.me/ns#'},
'properties': [ ('og:locale', 'en_US'),
('og:type', 'article'),
( 'og:title',
'Open Graph Protocol for Facebook '
'explained with examples'),
( 'og:description',
'What is Open Graph Protocol and why you '
'need it? Learn to implement Open Graph '
'Protocol for Facebook on your website. '
'Open Graph Protocol Meta Tags.'),
( 'og:url',
'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'),
('og:site_name', 'Optimize Smart'),
( 'og:updated_time',
'2018-03-09T16:26:35+00:00'),
( 'og:image',
'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'),
( 'og:image:secure_url',
'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg')]}],
'rdfa': [ { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/#header',
'http://www.w3.org/1999/xhtml/vocab#role': [ { '@id': 'http://www.w3.org/1999/xhtml/vocab#banner'}]},
{ '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/',
'article:modified_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
'article:published_time': [ { '@value': '2010-07-02T18:57:23+00:00'}],
'article:publisher': [ { '@value': 'https://www.facebook.com/optimizesmart/'}],
'article:section': [{'@value': 'Specialized Tracking'}],
'http://ogp.me/ns#description': [ { '@value': 'What is Open '
'Graph Protocol '
'and why you need '
'it? Learn to '
'implement Open '
'Graph Protocol '
'for Facebook on '
'your website. '
'Open Graph '
'Protocol Meta '
'Tags.'}],
'http://ogp.me/ns#image': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
'http://ogp.me/ns#image:secure_url': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
'http://ogp.me/ns#locale': [{'@value': 'en_US'}],
'http://ogp.me/ns#site_name': [{'@value': 'Optimize Smart'}],
'http://ogp.me/ns#title': [ { '@value': 'Open Graph Protocol for '
'Facebook explained with '
'examples'}],
'http://ogp.me/ns#type': [{'@value': 'article'}],
'http://ogp.me/ns#updated_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
'http://ogp.me/ns#url': [ { '@value': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'}],
'https://api.w.org/': [ { '@id': 'https://www.optimizesmart.com/wp-json/'}]}]}
Select syntaxes
+++++++++++++++
It is possible to select which syntaxes to extract by passing a list of the desired ones. Valid values are 'microdata', 'json-ld', 'opengraph', 'microformat', 'rdfa' and 'dublincore'. If no list is passed, all syntaxes are extracted and returned::
>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])
>>>
>>> pp.pprint(data)
{ 'microdata': [],
'opengraph': [ { 'namespace': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
'fb': 'http://www.facebook.com/2008/fbml',
'og': 'http://ogp.me/ns#'},
'properties': [ ('fb:app_id', '308540029359'),
('og:site_name', 'Songkick'),
('og:type', 'songkick-concerts:artist'),
('og:title', 'Elysian Fields'),
( 'og:description',
'Find out when Elysian Fields is next '
'playing live near you. List of all '
'Elysian Fields tour dates and concerts.'),
( 'og:url',
'https://www.songkick.com/artists/236156-elysian-fields'),
( 'og:image',
'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg')]}],
'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
'al:ios:app_store_id': [{'@value': '438690886'}],
'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
'http://ogp.me/ns#description': [ { '@value': 'Find out when '
'Elysian Fields is '
'next playing live '
'near you. List of '
'all Elysian '
'Fields tour dates '
'and concerts.'}],
'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}
Alternatively, if you have already parsed the HTML before calling *extruct*, you can pass the parsed tree instead of the HTML string: ::
>>> # using the request from the previous example
>>> base_url = get_base_url(r.text, r.url)
>>> from extruct.utils import parse_html
>>> tree = parse_html(r.text)
>>> data = extruct.extract(tree, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])
The microformat syntax doesn't support a parsed HTML tree, so you need to pass an HTML string.
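A minimal sketch (reusing ``r``, ``tree`` and ``base_url`` from the example above) that parses the HTML once for the other syntaxes and falls back to the raw string only for microformats::

    >>> data = extruct.extract(tree, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])
    >>> # microformats need the original string, so extract them separately and merge
    >>> data.update(extruct.extract(r.text, base_url, syntaxes=['microformat']))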
Uniform
+++++++
Another option is to make the output of the microformat, opengraph, microdata, dublincore and json-ld syntaxes uniform, following this structure: ::
{'@context': 'http://example.com',
'@type': 'example_type',
/* all the other properties as keys here */
}
To do so, set ``uniform=True`` when calling ``extract``; it is ``False`` by default for backward compatibility. Here is the same example as before, but with ``uniform`` set to ``True``: ::
>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'], uniform=True)
>>>
>>> pp.pprint(data)
{ 'microdata': [],
'opengraph': [ { '@context': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
'fb': 'http://www.facebook.com/2008/fbml',
'og': 'http://ogp.me/ns#'},
'@type': 'songkick-concerts:artist',
'fb:app_id': '308540029359',
'og:description': 'Find out when Elysian Fields is next '
'playing live near you. List of all '
'Elysian Fields tour dates and concerts.',
'og:image': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg',
'og:site_name': 'Songkick',
'og:title': 'Elysian Fields',
'og:url': 'https://www.songkick.com/artists/236156-elysian-fields'}],
'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
'al:ios:app_store_id': [{'@value': '438690886'}],
'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
'http://ogp.me/ns#description': [ { '@value': 'Find out when '
'Elysian Fields is '
'next playing live '
'near you. List of '
'all Elysian '
'Fields tour dates '
'and concerts.'}],
'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}
NB: the rdfa structure is not uniformed yet.
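With ``uniform=True`` every Open Graph item becomes a flat dictionary, so individual properties can be read directly. Continuing with ``data`` from the call above::

    >>> og = data['opengraph'][0]
    >>> og['@type']
    'songkick-concerts:artist'
    >>> og['og:title']
    'Elysian Fields'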
Returning HTML node
+++++++++++++++++++
It is also possible to get a reference to the HTML node for every extracted metadata item.
The feature is currently supported only by the microdata syntax.
To use it, set the ``return_html_node`` option of the ``extract`` method to ``True``.
As a result, an additional key ``"htmlNode"`` will be included in the result for every
item. Each node is of ``lxml.etree.Element`` type: ::
>>> r = requests.get('http://www.rugpadcorner.com/shop/no-muv/')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata'], return_html_node=True)
>>>
>>> pp.pprint(data)
{ 'microdata': [ { 'htmlNode': <Element div at 0x7f10f8e6d3b8>,
'properties': { 'description': 'KEEP RUGS FLAT ON CARPET!\n'
'Not your thin sticky pad, '
'No-Muv is truly the best!',
'image': ['', ''],
'name': ['No-Muv', 'No-Muv'],
'offers': [ { 'htmlNode': <Element div at 0x7f10f8e6d138>,
'properties': { 'availability': 'http://schema.org/InStock',
'price': 'Price: '
'$45'},
'type': 'http://schema.org/Offer'},
{ 'htmlNode': <Element div at 0x7f10f8e60f48>,
'properties': { 'availability': 'http://schema.org/InStock',
'price': '(Select '
'Size/Shape '
'for '
'Pricing)'},
'type': 'http://schema.org/Offer'}],
'ratingValue': ['5.00', '5.00']},
'type': 'http://schema.org/Product'}]}
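Since each ``htmlNode`` is a plain ``lxml`` element, you can serialize it back to markup or run further XPath queries against it. A short sketch, continuing with ``data`` from above (the exact markup returned depends on the page)::

    >>> from lxml import etree
    >>> node = data['microdata'][0]['htmlNode']
    >>> markup = etree.tostring(node, encoding='unicode')  # raw HTML of the item
    >>> node.xpath('.//*[@itemprop]')  # descendant elements carrying itemprop attributes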
Single extractors
-----------------
You can also use each extractor individually. See below.
Microdata extraction
++++++++++++++++++++
::
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.w3cmicrodata import MicrodataExtractor
>>>
>>> # example from http://www.w3.org/TR/microdata/#associating-names-with-items
>>> html = """<!DOCTYPE HTML>
... <html>
... <head>
... <title>Photo gallery</title>
... </head>
... <body>
... <h1>My photos</h1>
... <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
... <img itemprop="work" src="images/house.jpeg" alt="A white house, boarded up, sits in a forest.">
... <figcaption itemprop="title">The house I found.</figcaption>
... </figure>
... <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
... <img itemprop="work" src="images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside.">
... <figcaption itemprop="title">The mailbox.</figcaption>
... </figure>
... <footer>
... <p id="licenses">All images licensed under the <a itemprop="license"
... href="http://www.opensource.org/licenses/mit-license.php">MIT
... license</a>.</p>
... </footer>
... </body>
... </html>"""
>>>
>>> mde = MicrodataExtractor()
>>> data = mde.extract(html)
>>> pp.pprint(data)
[{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
'title': 'The house I found.',
'work': 'http://www.example.com/images/house.jpeg'},
'type': 'http://n.whatwg.org/work'},
{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
'title': 'The mailbox.',
'work': 'http://www.example.com/images/mailbox.jpeg'},
'type': 'http://n.whatwg.org/work'}]
JSON-LD extraction
++++++++++++++++++
::
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.jsonld import JsonLdExtractor
>>>
>>> html = """<!DOCTYPE HTML>
... <html>
... <head>
... <title>Some Person Page</title>
... </head>
... <body>
... <h1>This guys</h1>
... <script type="application/ld+json">
... {
... "@context": "http://schema.org",
... "@type": "Person",
... "name": "John Doe",
... "jobTitle": "Graduate research assistant",
... "affiliation": "University of Dreams",
... "additionalName": "Johnny",
... "url": "http://www.example.com",
... "address": {
... "@type": "PostalAddress",
... "streetAddress": "1234 Peach Drive",
... "addressLocality": "Wonderland",
... "addressRegion": "Georgia"
... }
... }
... </script>
... </body>
... </html>"""
>>>
>>> jslde = JsonLdExtractor()
>>>
>>> data = jslde.extract(html)
>>> pp.pprint(data)
[{'@context': 'http://schema.org',
'@type': 'Person',
'additionalName': 'Johnny',
'address': {'@type': 'PostalAddress',
'addressLocality': 'Wonderland',
'addressRegion': 'Georgia',
'streetAddress': '1234 Peach Drive'},
'affiliation': 'University of Dreams',
'jobTitle': 'Graduate research assistant',
'name': 'John Doe',
'url': 'http://www.example.com'}]
RDFa extraction (experimental)
++++++++++++++++++++++++++++++
::
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>> from extruct.rdfa import RDFaExtractor # you can ignore the warning about html5lib not being available
INFO:rdflib:RDFLib Version: 4.2.1
/home/paul/.virtualenvs/extruct.wheel.test/lib/python3.5/site-packages/rdflib/plugins/parsers/structureddata.py:30: UserWarning: html5lib not found! RDFa and Microdata parsers will not be available.
'parsers will not be available.')
>>>
>>> html = """<html>
... <head>
... ...
... </head>
... <body prefix="dc: http://purl.org/dc/terms/ schema: http://schema.org/">
... <div resource="/alice/posts/trouble_with_bob" typeof="schema:BlogPosting">
... <h2 property="dc:title">The trouble with Bob</h2>
... ...
... <h3 property="dc:creator schema:creator" resource="#me">Alice</h3>
... <div property="schema:articleBody">
... <p>The trouble with Bob is that he takes much better photos than I do:</p>
... </div>
... ...
... </div>
... </body>
... </html>
... """
>>>
>>> rdfae = RDFaExtractor()
>>> pp.pprint(rdfae.extract(html, base_url='http://www.example.com/index.html'))
[{'@id': 'http://www.example.com/alice/posts/trouble_with_bob',
'@type': ['http://schema.org/BlogPosting'],
'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}],
'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}],
'http://schema.org/articleBody': [{'@value': '\n'
' The trouble with Bob '
'is that he takes much better '
'photos than I do:\n'
' '}],
'http://schema.org/creator': [{'@id': 'http://www.example.com/index.html#me'}]}]
You'll get a list of expanded JSON-LD nodes.
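The expanded form is verbose but straightforward to query. For example, pulling the Dublin Core title out of the nodes returned above::

    >>> nodes = rdfae.extract(html, base_url='http://www.example.com/index.html')
    >>> [v['@value']
    ...  for node in nodes
    ...  for v in node.get('http://purl.org/dc/terms/title', [])]
    ['The trouble with Bob']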
Open Graph extraction
++++++++++++++++++++++++++++++
::
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.opengraph import OpenGraphExtractor
>>>
>>> html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
... <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#" xmlns:fb="https://www.facebook.com/2008/fbml">
... <head>
... <title>Himanshu's Open Graph Protocol</title>
... <meta http-equiv="Content-Type" content="text/html;charset=WINDOWS-1252" />
... <meta http-equiv="Content-Language" content="en-us" />
... <link rel="stylesheet" type="text/css" href="event-education.css" />
... <meta name="verify-v1" content="so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=" >
... <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
... <meta property="og:type" content="article"/>
... <meta property="og:url" content="https://www.eventeducation.com/test.php"/>
... <meta property="og:image" content="https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"/>
... <meta property="fb:admins" content="himanshu160"/>
... <meta property="og:site_name" content="Event Education"/>
... <meta property="og:description" content="Event Education provides free courses on event planning and management to event professionals worldwide."/>
... </head>
... <body>
... <div id="fb-root"></div>
... <script>(function(d, s, id) {
... var js, fjs = d.getElementsByTagName(s)[0];
... if (d.getElementById(id)) return;
... js = d.createElement(s); js.id = id;
... js.src = "//connect.facebook.net/en_US/all.js#xfbml=1&appId=501839739845103";
... fjs.parentNode.insertBefore(js, fjs);
... }(document, 'script', 'facebook-jssdk'));</script>
... </body>
... </html>"""
>>>
>>> opengraphe = OpenGraphExtractor()
>>> pp.pprint(opengraphe.extract(html))
[{"namespace": {
"og": "http://ogp.me/ns#"
},
"properties": [
[
"og:title",
"Himanshu's Open Graph Protocol"
],
[
"og:type",
"article"
],
[
"og:url",
"https://www.eventeducation.com/test.php"
],
[
"og:image",
"https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"
],
[
"og:site_name",
"Event Education"
],
[
"og:description",
"Event Education provides free courses on event planning and management to event professionals worldwide."
]
]
}]
Microformat extraction
++++++++++++++++++++++++++++++
::
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.microformat import MicroformatExtractor
>>>
>>> html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
... <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#" xmlns:fb="https://www.facebook.com/2008/fbml">
... <head>
... <title>Himanshu's Open Graph Protocol</title>
... <meta http-equiv="Content-Type" content="text/html;charset=WINDOWS-1252" />
... <meta http-equiv="Content-Language" content="en-us" />
... <link rel="stylesheet" type="text/css" href="event-education.css" />
... <meta name="verify-v1" content="so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=" >
... <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
... <article class="h-entry">
... <h1 class="p-name">Microformats are amazing</h1>
... <p>Published by <a class="p-author h-card" href="http://example.com">W. Developer</a>
... on <time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time></p>
... <p class="p-summary">In which I extoll the virtues of using microformats.</p>
... <div class="e-content">
... <p>Blah blah blah</p>
... </div>
... </article>
... </head>
... <body></body>
... </html>"""
>>>
>>> microformate = MicroformatExtractor()
>>> data = microformate.extract(html)
>>> pp.pprint(data)
[{"type": [
"h-entry"
],
"properties": {
"name": [
"Microformats are amazing"
],
"author": [
{
"type": [
"h-card"
],
"properties": {
"name": [
"W. Developer"
],
"url": [
"http://example.com"
]
},
"value": "W. Developer"
}
],
"published": [
"2013-06-13 12:00:00"
],
"summary": [
"In which I extoll the virtues of using microformats."
],
"content": [
{
"html": "\n<p>Blah blah blah</p>\n",
"value": "\nBlah blah blah\n"
}
]
}
}]
DublinCore extraction
++++++++++++++++++++++++++++++
::
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>> from extruct.dublincore import DublinCoreExtractor
>>> html = '''<head profile="http://dublincore.org/documents/dcq-html/">
... <title>Expressing Dublin Core in HTML/XHTML meta and link elements</title>
... <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
... <link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />
...
...
... <meta name="DC.title" lang="en" content="Expressing Dublin Core
... in HTML/XHTML meta and link elements" />
... <meta name="DC.creator" content="Andy Powell, UKOLN, University of Bath" />
... <meta name="DCTERMS.issued" scheme="DCTERMS.W3CDTF" content="2003-11-01" />
... <meta name="DC.identifier" scheme="DCTERMS.URI"
... content="http://dublincore.org/documents/dcq-html/" />
... <link rel="DCTERMS.replaces" hreflang="en"
... href="http://dublincore.org/documents/2000/08/15/dcq-html/" />
... <meta name="DCTERMS.abstract" content="This document describes how
... qualified Dublin Core metadata can be encoded
... in HTML/XHTML <meta> elements" />
... <meta name="DC.format" scheme="DCTERMS.IMT" content="text/html" />
... <meta name="DC.type" scheme="DCTERMS.DCMIType" content="Text" />
... <meta name="DC.Date.modified" content="2001-07-18" />
... <meta name="DCTERMS.modified" content="2001-07-18" />'''
>>> dublinlde = DublinCoreExtractor()
>>> data = dublinlde.extract(html)
>>> pp.pprint(data)
[ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/title',
'content': 'Expressing Dublin Core\n'
'in HTML/XHTML meta and link elements',
'lang': 'en',
'name': 'DC.title'},
{ 'URI': 'http://purl.org/dc/elements/1.1/creator',
'content': 'Andy Powell, UKOLN, University of Bath',
'name': 'DC.creator'},
{ 'URI': 'http://purl.org/dc/elements/1.1/identifier',
'content': 'http://dublincore.org/documents/dcq-html/',
'name': 'DC.identifier',
'scheme': 'DCTERMS.URI'},
{ 'URI': 'http://purl.org/dc/elements/1.1/format',
'content': 'text/html',
'name': 'DC.format',
'scheme': 'DCTERMS.IMT'},
{ 'URI': 'http://purl.org/dc/elements/1.1/type',
'content': 'Text',
'name': 'DC.type',
'scheme': 'DCTERMS.DCMIType'}],
'namespaces': { 'DC': 'http://purl.org/dc/elements/1.1/',
'DCTERMS': 'http://purl.org/dc/terms/'},
'terms': [ { 'URI': 'http://purl.org/dc/terms/issued',
'content': '2003-11-01',
'name': 'DCTERMS.issued',
'scheme': 'DCTERMS.W3CDTF'},
{ 'URI': 'http://purl.org/dc/terms/abstract',
'content': 'This document describes how\n'
'qualified Dublin Core metadata can be encoded\n'
'in HTML/XHTML <meta> elements',
'name': 'DCTERMS.abstract'},
{ 'URI': 'http://purl.org/dc/terms/modified',
'content': '2001-07-18',
'name': 'DC.Date.modified'},
{ 'URI': 'http://purl.org/dc/terms/modified',
'content': '2001-07-18',
'name': 'DCTERMS.modified'},
{ 'URI': 'http://purl.org/dc/terms/replaces',
'href': 'http://dublincore.org/documents/2000/08/15/dcq-html/',
'hreflang': 'en',
'rel': 'DCTERMS.replaces'}]}]
Command Line Tool
-----------------
*extruct* provides a command line tool that allows you to fetch a page and
extract the metadata from it directly from the command line.
Dependencies
++++++++++++
The command line tool depends on ``requests``, which is not installed by default
when you install **extruct**. In order to use the command line tool, you can
install **extruct** with the ``cli`` extra requirements::
pip install 'extruct[cli]'
Usage
+++++
::
extruct "http://example.com"
Downloads "http://example.com" and outputs the Microdata, JSON-LD and RDFa, Open Graph
and Microformat metadata to `stdout`.
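The metadata is written to ``stdout`` as JSON, so it can be redirected to a file or piped to other tools, for example::

    extruct "http://example.com" > metadata.json
    extruct "http://example.com" | python -m json.tool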
Supported Parameters
++++++++++++++++++++
By default, the command line tool will try to extract all the supported
metadata formats from the page (currently Microdata, JSON-LD, RDFa, Open Graph
and Microformat). If you want to restrict the output to just one or a subset of
those, you can pass their names (space-separated) via the ``--syntaxes`` argument.
For example, this command extracts only Microdata and JSON-LD metadata from
"http://example.com"::
extruct "http://example.com" --syntaxes microdata json-ld
NB: the syntax names passed must be among the following: microdata, json-ld, rdfa, opengraph, microformat
Development version
-------------------
::
mkvirtualenv extruct
pip install -r requirements-dev.txt
Tests
-----
Run tests in current environment::
py.test tests
Use tox_ to run tests with different Python versions::
tox
.. _tox: https://testrun.org/tox/latest/
.. _ogp: https://ogp.me/