********
xextract
********
Extract structured data from HTML and XML documents like a boss.
**xextract** is simple enough for writing a one-line parser, yet powerful enough to be used in a big project.
**Features**
- Parsing of HTML and XML documents
- Supports **xpath** and **css** selectors
- Simple declarative style of parsers
- Built-in self-validation to let you know when the structure of the website has changed
- Speed - under the hood the library uses `lxml library <http://lxml.de/>`_ with compiled xpath selectors
**Table of Contents**
.. contents::
:local:
:depth: 2
:backlinks: none
====================
A little taste of it
====================
Let's parse `The Shawshank Redemption <http://www.imdb.com/title/tt0111161/>`_'s IMDB page:
.. code-block:: python
# fetch the website
>>> import requests
>>> response = requests.get('http://www.imdb.com/title/tt0111161/')
# parse like a boss
>>> from xextract import String, Group
# extract title with css selector
>>> String(css='h1[itemprop="name"]', count=1).parse(response.text)
'The Shawshank Redemption'
# extract release year with xpath selector
>>> String(xpath='//*[@id="titleYear"]/a', count=1, callback=int).parse(response.text)
1994
# extract structured data
>>> Group(css='.cast_list tr:not(:first-child)', children=[
... String(name='name', css='[itemprop="actor"]', attr='_all_text', count=1),
... String(name='character', css='.character', attr='_all_text', count=1)
... ]).parse(response.text)
[
{'name': 'Tim Robbins', 'character': 'Andy Dufresne'},
{'name': 'Morgan Freeman', 'character': "Ellis Boyd 'Red' Redding"},
...
]
============
Installation
============
To install **xextract**, simply run:
.. code-block:: bash
$ pip install xextract
Requirements: lxml, cssselect
Supported Python versions are 3.5 - 3.11.
Windows users can download lxml binary `here <http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml>`_.
=======
Parsers
=======
------
String
------
**Parameters**: `name`_ (optional), `css / xpath`_ (optional, default ``"self::*"``), `count`_ (optional, default ``"*"``), `attr`_ (optional, default ``"_text"``), `callback`_ (optional), `namespaces`_ (optional)
Extract string data from the matched element(s).
The extracted value is always unicode.
By default, ``String`` extracts the text content of only the matched element, not of its descendants.
To extract and concatenate the text of all descendant elements, pass the ``attr`` parameter with the special value ``"_all_text"``.
Use the ``attr`` parameter to extract data from an HTML/XML attribute.
Use the ``callback`` parameter to post-process extracted values.
Example:
.. code-block:: python
>>> from xextract import String
>>> String(css='span', count=1).parse('<span>Hello <b>world</b>!</span>')
'Hello !'
>>> String(css='span', count=1, attr='class').parse('<span class="text-success"></span>')
'text-success'
   # use special `attr` value `_all_text` to extract and concatenate text out of all descendants
>>> String(css='span', count=1, attr='_all_text').parse('<span>Hello <b>world</b>!</span>')
'Hello world!'
# use special `attr` value `_name` to extract tag name of the matched element
>>> String(css='span', count=1, attr='_name').parse('<span>hello</span>')
'span'
>>> String(css='span', callback=int).parse('<span>1</span><span>2</span>')
[1, 2]
---
Url
---
**Parameters**: `name`_ (optional), `css / xpath`_ (optional, default ``"self::*"``), `count`_ (optional, default ``"*"``), `attr`_ (optional, default ``"href"``), `callback`_ (optional), `namespaces`_ (optional)
Behaves like ``String`` parser, but with two exceptions:
* default value for ``attr`` parameter is ``"href"``
* if you pass ``url`` parameter to ``parse()`` method, the absolute url will be constructed and returned
If ``callback`` is specified, it is called *after* the absolute urls are constructed.
Example:
.. code-block:: python
>>> from xextract import Url, Prefix
>>> content = '<div id="main"> <a href="/test">Link</a> </div>'
>>> Url(css='a', count=1).parse(content)
'/test'
>>> Url(css='a', count=1).parse(content, url='http://github.com/Mimino666')
'http://github.com/test' # absolute url address. Told ya!
>>> Prefix(css='#main', children=[
... Url(css='a', count=1)
... ]).parse(content, url='http://github.com/Mimino666') # you can pass url also to ancestor's parse(). It will propagate down.
'http://github.com/test'
--------
DateTime
--------
**Parameters**: `name`_ (optional), `css / xpath`_ (optional, default ``"self::*"``), ``format`` (**required**), `count`_ (optional, default ``"*"``), `attr`_ (optional, default ``"_text"``), `callback`_ (optional), `namespaces`_ (optional)
Returns the ``datetime.datetime`` object constructed out of the extracted data: ``datetime.strptime(extracted_data, format)``.
``format`` syntax is described in the `Python documentation <https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior>`_.
If ``callback`` is specified, it is called *after* the datetime objects are constructed.
Example:
.. code-block:: python
>>> from xextract import DateTime
>>> DateTime(css='span', count=1, format='%d.%m.%Y %H:%M').parse('<span>24.12.2015 5:30</span>')
   datetime.datetime(2015, 12, 24, 5, 30)
----
Date
----
**Parameters**: `name`_ (optional), `css / xpath`_ (optional, default ``"self::*"``), ``format`` (**required**), `count`_ (optional, default ``"*"``), `attr`_ (optional, default ``"_text"``), `callback`_ (optional), `namespaces`_ (optional)
Returns the ``datetime.date`` object constructed out of the extracted data: ``datetime.strptime(extracted_data, format).date()``.
``format`` syntax is described in the `Python documentation <https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior>`_.
If ``callback`` is specified, it is called *after* the datetime objects are constructed.
Example:
.. code-block:: python
>>> from xextract import Date
>>> Date(css='span', count=1, format='%d.%m.%Y').parse('<span>24.12.2015</span>')
datetime.date(2015, 12, 24)
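The underlying standard-library call is easy to verify on its own; this snippet mirrors what the ``Date`` parser does with the extracted text, as described above:

```python
from datetime import datetime

# Mirror the datetime.strptime(extracted_data, format).date() call
# described above: parse the text, then keep only the date part.
extracted = '24.12.2015'
parsed = datetime.strptime(extracted, '%d.%m.%Y').date()
print(parsed)  # 2015-12-24
```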
-------
Element
-------
**Parameters**: `name`_ (optional), `css / xpath`_ (optional, default ``"self::*"``), `count`_ (optional, default ``"*"``), `callback`_ (optional), `namespaces`_ (optional)
Returns the lxml instance (``lxml.etree._Element``) of the matched element(s).
If your xpath expression matches the text content of an element (e.g. ``text()`` or ``@attr``), unicode is returned instead.
If ``callback`` is specified, it is called with ``lxml.etree._Element`` instance.
Example:
.. code-block:: python
>>> from xextract import Element
>>> Element(css='span', count=1).parse('<span>Hello</span>')
<Element span at 0x2ac2990>
>>> Element(css='span', count=1, callback=lambda el: el.text).parse('<span>Hello</span>')
'Hello'
# same as above
>>> Element(xpath='//span/text()', count=1).parse('<span>Hello</span>')
'Hello'
-----
Group
-----
**Parameters**: `name`_ (optional), `css / xpath`_ (optional, default ``"self::*"``), `children`_ (**required**), `count`_ (optional, default ``"*"``), `callback`_ (optional), `namespaces`_ (optional)
For each element matched by the css/xpath selector, returns a dictionary containing the data extracted by the parsers listed in the ``children`` parameter.
All parsers listed in the ``children`` parameter **must** have ``name`` specified - it is used as the key in the dictionary.
Typical use case for this parser is when you want to parse structured data, e.g. list of user profiles, where each profile contains fields like name, address, etc. Use ``Group`` parser to group the fields of each user profile together.
If ``callback`` is specified, it is called with the dictionary of parsed children values.
Example:
.. code-block:: python
>>> from xextract import Group
>>> content = '<ul><li id="id1">michal</li> <li id="id2">peter</li></ul>'
>>> Group(css='li', count=2, children=[
... String(name='id', xpath='self::*', count=1, attr='id'),
... String(name='name', xpath='self::*', count=1)
... ]).parse(content)
[{'name': 'michal', 'id': 'id1'},
{'name': 'peter', 'id': 'id2'}]
------
Prefix
------
**Parameters**: `css / xpath`_ (optional, default ``"self::*"``), `children`_ (**required**), `namespaces`_ (optional)
This parser doesn't parse any data on its own. Instead, use it when many of your parsers share the same css/xpath selector prefix.
The ``Prefix`` parser always returns a single dictionary containing the data extracted by the parsers listed in the ``children`` parameter.
All parsers listed in the ``children`` parameter **must** have ``name`` specified - it is used as the key in the dictionary.
Example:
.. code-block:: python
# instead of...
>>> String(css='#main .name').parse(...)
>>> String(css='#main .date').parse(...)
# ...you can use
>>> from xextract import Prefix
>>> Prefix(css='#main', children=[
... String(name="name", css='.name'),
... String(name="date", css='.date')
... ]).parse(...)
=================
Parser parameters
=================
----
name
----
**Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_, `Element`_, `Group`_
**Default value**: ``None``
If specified, then the extracted data will be returned in a dictionary, with the ``name`` as the key and the data as the value.
All parsers listed in the ``children`` parameter of a ``Group`` or ``Prefix`` parser **must** have ``name`` specified.
If multiple children parsers have the same ``name``, the behavior is undefined.
Example:
.. code-block:: python
# when `name` is not specified, raw value is returned
>>> String(css='span', count=1).parse('<span>Hello!</span>')
'Hello!'
# when `name` is specified, dictionary is returned with `name` as the key
>>> String(name='message', css='span', count=1).parse('<span>Hello!</span>')
{'message': 'Hello!'}
-----------
css / xpath
-----------
**Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_, `Element`_, `Group`_, `Prefix`_
**Default value (xpath)**: ``"self::*"``
Use either ``css`` or ``xpath`` parameter (but not both) to select the elements from which to extract the data.
Under the hood css selectors are translated into equivalent xpath selectors.
For the children of ``Prefix`` or ``Group`` parsers, the elements are selected relative to the elements matched by the parent parser.
Example:
.. code-block:: python
Prefix(xpath='//*[@id="profile"]', children=[
# equivalent to: //*[@id="profile"]/descendant-or-self::*[@class="name"]
String(name='name', css='.name', count=1),
# equivalent to: //*[@id="profile"]/*[@class="title"]
String(name='title', xpath='*[@class="title"]', count=1),
# equivalent to: //*[@class="subtitle"]
String(name='subtitle', xpath='//*[@class="subtitle"]', count=1)
])
-----
count
-----
**Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_, `Element`_, `Group`_
**Default value**: ``"*"``
``count`` specifies the expected number of elements to be matched by the css/xpath selector. It serves two purposes:
1. The number of matched elements is checked against the ``count`` parameter. If it doesn't match the expected count, an ``xextract.parsers.ParsingError`` exception is raised. This way you will be notified when the website has changed its structure.
2. It tells the parser whether to return a single extracted value or a list of values. See the table below.
The syntax for ``count`` mimics regular expressions.
You can pass the value as a string, a single integer, or a tuple of two integers.
Depending on the value of ``count``, the parser returns either a single extracted value or a list of values.
+-------------------+-----------------------------------------------+-----------------------------+
| Value of ``count``| Meaning | Extracted data |
+===================+===============================================+=============================+
| ``"*"`` (default) | Zero or more elements. | List of values |
+-------------------+-----------------------------------------------+-----------------------------+
| ``"+"`` | One or more elements. | List of values |
+-------------------+-----------------------------------------------+-----------------------------+
| ``"?"`` | Zero or one element. | Single value or ``None`` |
+-------------------+-----------------------------------------------+-----------------------------+
| ``num`` | Exactly ``num`` elements. | ``num`` == 0: ``None`` |
| | | |
| | You can pass either string or integer. | ``num`` == 1: Single value |
| | | |
| | | ``num`` > 1: List of values |
+-------------------+-----------------------------------------------+-----------------------------+
| ``(num1, num2)`` | Number of elements has to be between | List of values |
| | ``num1`` and ``num2``, inclusive. | |
| | | |
| | You can pass either a string or 2-tuple. | |
+-------------------+-----------------------------------------------+-----------------------------+
Example:
.. code-block:: python
>>> String(css='.full-name', count=1).parse(content) # return single value
'John Rambo'
>>> String(css='.full-name', count='1').parse(content) # same as above
'John Rambo'
>>> String(css='.full-name', count=(1,2)).parse(content) # return list of values
['John Rambo']
>>> String(css='.full-name', count='1,2').parse(content) # same as above
['John Rambo']
>>> String(css='.middle-name', count='?').parse(content) # return single value or None
None
>>> String(css='.job-titles', count='+').parse(content) # return list of values
['President', 'US Senator', 'State Senator', 'Senior Lecturer in Law']
>>> String(css='.friends', count='*').parse(content) # return possibly empty list of values
[]
>>> String(css='.friends', count='+').parse(content) # raise exception, when no elements are matched
xextract.parsers.ParsingError: Parser String matched 0 elements ("+" expected).
----
attr
----
**Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_
**Default value**: ``"href"`` for ``Url`` parser. ``"_text"`` otherwise.
Use ``attr`` parameter to specify what data to extract from the matched element.
+-------------------+-----------------------------------------------------+
| Value of ``attr`` | Meaning |
+===================+=====================================================+
| ``"_text"`` | Extract the text content of the matched element. |
+-------------------+-----------------------------------------------------+
| ``"_all_text"`` | Extract and concatenate the text content of |
| | the matched element and all its descendants. |
+-------------------+-----------------------------------------------------+
| ``"_name"`` | Extract tag name of the matched element. |
+-------------------+-----------------------------------------------------+
| ``att_name`` | Extract the value out of ``att_name`` attribute of |
| | the matched element. |
| | |
| | If such attribute doesn't exist, empty string is |
| | returned. |
+-------------------+-----------------------------------------------------+
Example:
.. code-block:: python
>>> from xextract import String, Url
>>> content = '<span class="name">Barack <strong>Obama</strong> III.</span> <a href="/test">Link</a>'
>>> String(css='.name', count=1).parse(content) # default attr is "_text"
'Barack III.'
>>> String(css='.name', count=1, attr='_text').parse(content) # same as above
'Barack III.'
>>> String(css='.name', count=1, attr='_all_text').parse(content) # all text
'Barack Obama III.'
>>> String(css='.name', count=1, attr='_name').parse(content) # tag name
'span'
>>> Url(css='a', count='1').parse(content) # Url extracts href by default
'/test'
>>> String(css='a', count='1', attr='id').parse(content) # non-existent attributes return empty string
''
--------
callback
--------
**Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_, `Element`_, `Group`_
Provides an easy way to post-process extracted values.
It should be a function that takes a single argument, the extracted value, and returns the post-processed value.
Example:
.. code-block:: python
>>> String(css='span', callback=int).parse('<span>1</span><span>2</span>')
[1, 2]
>>> Element(css='span', count=1, callback=lambda el: el.text).parse('<span>Hello</span>')
'Hello'
--------
children
--------
**Parsers**: `Group`_, `Prefix`_
Specifies the children parsers for the ``Group`` and ``Prefix`` parsers.
All parsers listed in the ``children`` parameter **must** have ``name`` specified.
Css/xpath selectors in the children parsers are relative to the selectors specified in the parent parser.
Example:
.. code-block:: python
Prefix(xpath='//*[@id="profile"]', children=[
# equivalent to: //*[@id="profile"]/descendant-or-self::*[@class="name"]
String(name='name', css='.name', count=1),
# equivalent to: //*[@id="profile"]/*[@class="title"]
String(name='title', xpath='*[@class="title"]', count=1),
# equivalent to: //*[@class="subtitle"]
String(name='subtitle', xpath='//*[@class="subtitle"]', count=1)
])
----------
namespaces
----------
**Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_, `Element`_, `Group`_, `Prefix`_
When parsing XML documents that contain namespace prefixes, pass a dictionary mapping namespace prefixes to namespace URIs.
Then refer to elements in the xpath selector by their fully qualified name, in the form ``"prefix:element"``.
For the moment, you **cannot use the default namespace** for parsing (see `lxml docs <http://lxml.de/FAQ.html#how-can-i-specify-a-default-namespace-for-xpath-expressions>`_ for more information). Just map an arbitrary prefix to it instead.
Example:
.. code-block:: python
>>> content = '''<?xml version='1.0' encoding='UTF-8'?>
... <movie xmlns="http://imdb.com/ns/">
... <title>The Shawshank Redemption</title>
... <year>1994</year>
... </movie>'''
>>> nsmap = {'imdb': 'http://imdb.com/ns/'} # use arbitrary prefix for default namespace
>>> Prefix(xpath='//imdb:movie', namespaces=nsmap, children=[ # pass namespaces to the outermost parser
... String(name='title', xpath='imdb:title', count=1),
... String(name='year', xpath='imdb:year', count=1)
... ]).parse(content)
{'title': 'The Shawshank Redemption', 'year': '1994'}
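The same caveat exists in the Python standard library. This stdlib-only sketch (no xextract involved) shows why an arbitrary prefix is needed to match elements in a default namespace:

```python
import xml.etree.ElementTree as ET

content = '<movie xmlns="http://imdb.com/ns/"><title>The Shawshank Redemption</title></movie>'
root = ET.fromstring(content)

# A bare element name does not match elements in a default namespace...
print(root.findall('title'))  # []

# ...but mapping an arbitrary prefix to the namespace URI does.
print(root.findall('imdb:title', {'imdb': 'http://imdb.com/ns/'})[0].text)
# The Shawshank Redemption
```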
====================
HTML vs. XML parsing
====================
To extract data from an HTML or XML document, simply call the ``parse()`` method of the parser:
.. code-block:: python
>>> from xextract import *
>>> parser = Prefix(..., children=[...])
>>> extracted_data = parser.parse(content)
``content`` can be either a string or unicode, containing the content of the document.
Under the hood **xextract** uses either ``lxml.etree.XMLParser`` or ``lxml.etree.HTMLParser`` to parse the document.
To select between them, **xextract** looks for the ``"<?xml"`` string in the first 128 bytes of the document. If it is found, ``XMLParser`` is used; otherwise ``HTMLParser`` is used.
To force either of the parsers, you can call ``parse_html()`` or ``parse_xml()`` method:
.. code-block:: python
>>> parser.parse_html(content) # force lxml.etree.HTMLParser
>>> parser.parse_xml(content) # force lxml.etree.XMLParser
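The parser-selection rule above can be sketched as a simple heuristic (an illustrative ``looks_like_xml`` helper, not the library's actual code):

```python
def looks_like_xml(content: str) -> bool:
    # Treat the document as XML if the "<?xml" declaration
    # appears within the first 128 characters.
    return '<?xml' in content[:128]

print(looks_like_xml('<?xml version="1.0"?><movie/>'))  # True
print(looks_like_xml('<!DOCTYPE html><html></html>'))   # False
```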
Raw data
{
"_id": null,
"home_page": "https://github.com/Mimino666/python-xextract",
"name": "xextract",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "HTML parse parsing extraction extract crawl",
"author": "Michal \"Mimino\" Danilak",
"author_email": "michal.danilak@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/c8/49/04fb500d61dd6406c274af919cabbff867c34c27b932576371ac0b4fee3e/xextract-0.1.9.tar.gz",
"platform": null,
"description": "********\nxextract\n********\n\nExtract structured data from HTML and XML documents like a boss.\n\n**xextract** is simple enough for writing a one-line parser, yet powerful enough to be used in a big project.\n\n\n**Features**\n\n- Parsing of HTML and XML documents\n- Supports **xpath** and **css** selectors\n- Simple declarative style of parsers\n- Built-in self-validation to let you know when the structure of the website has changed\n- Speed - under the hood the library uses `lxml library <http://lxml.de/>`_ with compiled xpath selectors\n\n\n**Table of Contents**\n\n.. contents::\n :local:\n :depth: 2\n :backlinks: none\n\n\n====================\nA little taste of it\n====================\n\nLet's parse `The Shawshank Redemption <http://www.imdb.com/title/tt0111161/>`_'s IMDB page:\n\n.. code-block:: python\n\n # fetch the website\n >>> import requests\n >>> response = requests.get('http://www.imdb.com/title/tt0111161/')\n\n # parse like a boss\n >>> from xextract import String, Group\n\n # extract title with css selector\n >>> String(css='h1[itemprop=\"name\"]', count=1).parse(response.text)\n 'The Shawshank Redemption'\n\n # extract release year with xpath selector\n >>> String(xpath='//*[@id=\"titleYear\"]/a', count=1, callback=int).parse(response.text)\n 1994\n\n # extract structured data\n >>> Group(css='.cast_list tr:not(:first-child)', children=[\n ... String(name='name', css='[itemprop=\"actor\"]', attr='_all_text', count=1),\n ... String(name='character', css='.character', attr='_all_text', count=1)\n ... ]).parse(response.text)\n [\n {'name': 'Tim Robbins', 'character': 'Andy Dufresne'},\n {'name': 'Morgan Freeman', 'character': \"Ellis Boyd 'Red' Redding\"},\n ...\n ]\n\n\n============\nInstallation\n============\n\nTo install **xextract**, simply run:\n\n.. 
code-block:: bash\n\n $ pip install xextract\n\nRequirements: lxml, cssselect\n\nSupported Python versions are 3.5 - 3.11.\n\nWindows users can download lxml binary `here <http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml>`_.\n\n\n=======\nParsers\n=======\n\n------\nString\n------\n\n**Parameters**: `name`_ (optional), `css / xpath`_ (optional, default ``\"self::*\"``), `count`_ (optional, default ``\"*\"``), `attr`_ (optional, default ``\"_text\"``), `callback`_ (optional), `namespaces`_ (optional)\n\nExtract string data from the matched element(s).\nExtracted value is always unicode.\n\nBy default, ``String`` extracts the text content of only the matched element, but not its descendants.\nTo extract and concatenate the text out of every descendant element, use ``attr`` parameter with the special value ``\"_all_text\"``:\n\nUse ``attr`` parameter to extract the data from an HTML/XML attribute.\n\nUse ``callback`` parameter to post-process extracted values.\n\nExample:\n\n.. code-block:: python\n\n >>> from xextract import String\n >>> String(css='span', count=1).parse('<span>Hello <b>world</b>!</span>')\n 'Hello !'\n\n >>> String(css='span', count=1, attr='class').parse('<span class=\"text-success\"></span>')\n 'text-success'\n\n # use special `attr` value `_all_text` to extract and concantenate text out of all descendants\n >>> String(css='span', count=1, attr='_all_text').parse('<span>Hello <b>world</b>!</span>')\n 'Hello world!'\n\n # use special `attr` value `_name` to extract tag name of the matched element\n >>> String(css='span', count=1, attr='_name').parse('<span>hello</span>')\n 'span'\n\n >>> String(css='span', callback=int).parse('<span>1</span><span>2</span>')\n [1, 2]\n\n---\nUrl\n---\n\n**Parameters**: `name`_ (optional), `css / xpath`_ (optional, default ``\"self::*\"``), `count`_ (optional, default ``\"*\"``), `attr`_ (optional, default ``\"href\"``), `callback`_ (optional), `namespaces`_ (optional)\n\nBehaves like ``String`` parser, but with two 
exceptions:\n\n* default value for ``attr`` parameter is ``\"href\"``\n* if you pass ``url`` parameter to ``parse()`` method, the absolute url will be constructed and returned\n\nIf ``callback`` is specified, it is called *after* the absolute urls are constructed.\n\nExample:\n\n.. code-block:: python\n\n >>> from xextract import Url, Prefix\n >>> content = '<div id=\"main\"> <a href=\"/test\">Link</a> </div>'\n\n >>> Url(css='a', count=1).parse(content)\n '/test'\n\n >>> Url(css='a', count=1).parse(content, url='http://github.com/Mimino666')\n 'http://github.com/test' # absolute url address. Told ya!\n\n >>> Prefix(css='#main', children=[\n ... Url(css='a', count=1)\n ... ]).parse(content, url='http://github.com/Mimino666') # you can pass url also to ancestor's parse(). It will propagate down.\n 'http://github.com/test'\n\n\n--------\nDateTime\n--------\n\n**Parameters**: `name`_ (optional), `css / xpath`_ (optional, default ``\"self::*\"``), ``format`` (**required**), `count`_ (optional, default ``\"*\"``), `attr`_ (optional, default ``\"_text\"``), `callback`_ (optional) `namespaces`_ (optional)\n\nReturns the ``datetime.datetime`` object constructed out of the extracted data: ``datetime.strptime(extracted_data, format)``.\n\n``format`` syntax is described in the `Python documentation <https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior>`_.\n\nIf ``callback`` is specified, it is called *after* the datetime objects are constructed.\n\nExample:\n\n.. 
code-block:: python\n\n >>> from xextract import DateTime\n >>> DateTime(css='span', count=1, format='%d.%m.%Y %H:%M').parse('<span>24.12.2015 5:30</span>')\n datetime.datetime(2015, 12, 24, 50, 30)\n\n\n----\nDate\n----\n\n**Parameters**: `name`_ (optional), `css / xpath`_ (optional, default ``\"self::*\"``), ``format`` (**required**), `count`_ (optional, default ``\"*\"``), `attr`_ (optional, default ``\"_text\"``), `callback`_ (optional) `namespaces`_ (optional)\n\nReturns the ``datetime.date`` object constructed out of the extracted data: ``datetime.strptime(extracted_data, format).date()``.\n\n``format`` syntax is described in the `Python documentation <https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior>`_.\n\nIf ``callback`` is specified, it is called *after* the datetime objects are constructed.\n\nExample:\n\n.. code-block:: python\n\n >>> from xextract import Date\n >>> Date(css='span', count=1, format='%d.%m.%Y').parse('<span>24.12.2015</span>')\n datetime.date(2015, 12, 24)\n\n\n-------\nElement\n-------\n\n**Parameters**: `name`_ (optional), `css / xpath`_ (optional, default ``\"self::*\"``), `count`_ (optional, default ``\"*\"``), `callback`_ (optional), `namespaces`_ (optional)\n\nReturns lxml instance (``lxml.etree._Element``) of the matched element(s).\nIf you use xpath expression and match the text content of the element (e.g. ``text()`` or ``@attr``), unicode is returned.\n\nIf ``callback`` is specified, it is called with ``lxml.etree._Element`` instance.\n\nExample:\n\n.. 
code-block:: python\n\n >>> from xextract import Element\n >>> Element(css='span', count=1).parse('<span>Hello</span>')\n <Element span at 0x2ac2990>\n\n >>> Element(css='span', count=1, callback=lambda el: el.text).parse('<span>Hello</span>')\n 'Hello'\n\n # same as above\n >>> Element(xpath='//span/text()', count=1).parse('<span>Hello</span>')\n 'Hello'\n\n\n-----\nGroup\n-----\n\n**Parameters**: `name`_ (optional), `css / xpath`_ (optional, default ``\"self::*\"``), `children`_ (**required**), `count`_ (optional, default ``\"*\"``), `callback`_ (optional), `namespaces`_ (optional)\n\nFor each element matched by css/xpath selector returns the dictionary containing the data extracted by the parsers listed in ``children`` parameter.\nAll parsers listed in ``children`` parameter **must** have ``name`` specified - this is then used as the key in dictionary.\n\nTypical use case for this parser is when you want to parse structured data, e.g. list of user profiles, where each profile contains fields like name, address, etc. Use ``Group`` parser to group the fields of each user profile together.\n\nIf ``callback`` is specified, it is called with the dictionary of parsed children values.\n\nExample:\n\n.. code-block:: python\n\n >>> from xextract import Group\n >>> content = '<ul><li id=\"id1\">michal</li> <li id=\"id2\">peter</li></ul>'\n\n >>> Group(css='li', count=2, children=[\n ... String(name='id', xpath='self::*', count=1, attr='id'),\n ... String(name='name', xpath='self::*', count=1)\n ... ]).parse(content)\n [{'name': 'michal', 'id': 'id1'},\n {'name': 'peter', 'id': 'id2'}]\n\n\n------\nPrefix\n------\n\n**Parameters**: `css / xpath`_ (optional, default ``\"self::*\"``), `children`_ (**required**), `namespaces`_ (optional)\n\nThis parser doesn't actually parse any data on its own. 
Instead you can use it, when many of your parsers share the same css/xpath selector prefix.\n\n``Prefix`` parser always returns a single dictionary containing the data extracted by the parsers listed in ``children`` parameter.\nAll parsers listed in ``children`` parameter **must** have ``name`` specified - this is then used as the key in dictionary.\n\nExample:\n\n.. code-block:: python\n\n # instead of...\n >>> String(css='#main .name').parse(...)\n >>> String(css='#main .date').parse(...)\n\n # ...you can use\n >>> from xextract import Prefix\n >>> Prefix(css='#main', children=[\n ... String(name=\"name\", css='.name'),\n ... String(name=\"date\", css='.date')\n ... ]).parse(...)\n\n\n=================\nParser parameters\n=================\n\n----\nname\n----\n\n**Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_, `Element`_, `Group`_\n\n**Default value**: ``None``\n\nIf specified, then the extracted data will be returned in a dictionary, with the ``name`` as the key and the data as the value.\n\nAll parsers listed in ``children`` parameter of ``Group`` or ``Prefix`` parser **must** have ``name`` specified.\nIf multiple children parsers have the same ``name``, the behavior is undefined.\n\nExample:\n\n.. 
code-block:: python\n\n # when `name` is not specified, raw value is returned\n >>> String(css='span', count=1).parse('<span>Hello!</span>')\n 'Hello!'\n\n # when `name` is specified, dictionary is returned with `name` as the key\n >>> String(name='message', css='span', count=1).parse('<span>Hello!</span>')\n {'message': 'Hello!'}\n\n\n-----------\ncss / xpath\n-----------\n\n**Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_, `Element`_, `Group`_, `Prefix`_\n\n**Default value (xpath)**: ``\"self::*\"``\n\nUse either ``css`` or ``xpath`` parameter (but not both) to select the elements from which to extract the data.\n\nUnder the hood css selectors are translated into equivalent xpath selectors.\n\nFor the children of ``Prefix`` or ``Group`` parsers, the elements are selected relative to the elements matched by the parent parser.\n\nExample:\n\n.. code-block:: python\n\n Prefix(xpath='//*[@id=\"profile\"]', children=[\n # equivalent to: //*[@id=\"profile\"]/descendant-or-self::*[@class=\"name\"]\n String(name='name', css='.name', count=1),\n\n # equivalent to: //*[@id=\"profile\"]/*[@class=\"title\"]\n String(name='title', xpath='*[@class=\"title\"]', count=1),\n\n # equivalent to: //*[@class=\"subtitle\"]\n String(name='subtitle', xpath='//*[@class=\"subtitle\"]', count=1)\n ])\n\n\n-----\ncount\n-----\n\n**Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_, `Element`_, `Group`_\n\n**Default value**: ``\"*\"``\n\n``count`` specifies the expected number of elements to be matched with css/xpath selector. It serves two purposes:\n\n1. Number of matched elements is checked against the ``count`` parameter. If the number of elements doesn't match the expected countity, ``xextract.parsers.ParsingError`` exception is raised. This way you will be notified, when the website has changed its structure.\n2. It tells the parser whether to return a single extracted value or a list of values. 
See the table below.\n\nSyntax for ``count`` mimics the regular expressions.\nYou can either pass the value as a string, single integer or tuple of two integers.\n\nDepending on the value of ``count``, the parser returns either a single extracted value or a list of values.\n\n+-------------------+-----------------------------------------------+-----------------------------+\n| Value of ``count``| Meaning | Extracted data |\n+===================+===============================================+=============================+\n| ``\"*\"`` (default) | Zero or more elements. | List of values |\n+-------------------+-----------------------------------------------+-----------------------------+\n| ``\"+\"`` | One or more elements. | List of values |\n+-------------------+-----------------------------------------------+-----------------------------+\n| ``\"?\"`` | Zero or one element. | Single value or ``None`` |\n+-------------------+-----------------------------------------------+-----------------------------+\n| ``num`` | Exactly ``num`` elements. | ``num`` == 0: ``None`` |\n| | | |\n| | You can pass either string or integer. | ``num`` == 1: Single value |\n| | | |\n| | | ``num`` > 1: List of values |\n+-------------------+-----------------------------------------------+-----------------------------+\n| ``(num1, num2)`` | Number of elements has to be between | List of values |\n| | ``num1`` and ``num2``, inclusive. | |\n| | | |\n| | You can pass either a string or 2-tuple. | |\n+-------------------+-----------------------------------------------+-----------------------------+\n\nExample:\n\n.. 
.. code-block:: python

    >>> String(css='.full-name', count=1).parse(content)  # return single value
    'John Rambo'

    >>> String(css='.full-name', count='1').parse(content)  # same as above
    'John Rambo'

    >>> String(css='.full-name', count=(1, 2)).parse(content)  # return list of values
    ['John Rambo']

    >>> String(css='.full-name', count='1,2').parse(content)  # same as above
    ['John Rambo']

    >>> String(css='.middle-name', count='?').parse(content)  # return single value or None
    None

    >>> String(css='.job-titles', count='+').parse(content)  # return list of values
    ['President', 'US Senator', 'State Senator', 'Senior Lecturer in Law']

    >>> String(css='.friends', count='*').parse(content)  # return possibly empty list of values
    []

    >>> String(css='.friends', count='+').parse(content)  # raise exception when no elements are matched
    xextract.parsers.ParsingError: Parser String matched 0 elements ("+" expected).


----
attr
----

**Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_

**Default value**: ``"href"`` for the ``Url`` parser, ``"_text"`` otherwise.

Use the ``attr`` parameter to specify what data to extract from the matched element.

+-------------------+-----------------------------------------------------+
| Value of ``attr`` | Meaning                                             |
+===================+=====================================================+
| ``"_text"``       | Extract the text content of the matched element.    |
+-------------------+-----------------------------------------------------+
| ``"_all_text"``   | Extract and concatenate the text content of         |
|                   | the matched element and all its descendants.        |
+-------------------+-----------------------------------------------------+
| ``"_name"``       | Extract the tag name of the matched element.        |
+-------------------+-----------------------------------------------------+
| ``att_name``      | Extract the value out of ``att_name`` attribute of  |
|                   | the matched element.                                |
|                   |                                                     |
|                   | If such attribute doesn't exist, an empty string is |
|                   | returned.                                           |
+-------------------+-----------------------------------------------------+

Example:

.. code-block:: python

    >>> from xextract import String, Url
    >>> content = '<span class="name">Barack <strong>Obama</strong> III.</span> <a href="/test">Link</a>'

    >>> String(css='.name', count=1).parse(content)  # default attr is "_text"
    'Barack III.'

    >>> String(css='.name', count=1, attr='_text').parse(content)  # same as above
    'Barack III.'

    >>> String(css='.name', count=1, attr='_all_text').parse(content)  # all text
    'Barack Obama III.'

    >>> String(css='.name', count=1, attr='_name').parse(content)  # tag name
    'span'

    >>> Url(css='a', count='1').parse(content)  # Url extracts href by default
    '/test'

    >>> String(css='a', count='1', attr='id').parse(content)  # non-existent attributes return an empty string
    ''


--------
callback
--------

**Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_, `Element`_, `Group`_

``callback`` provides an easy way to post-process extracted values.
It should be a function that takes a single argument, the extracted value, and returns the post-processed value.

Example:

.. code-block:: python

    >>> String(css='span', callback=int).parse('<span>1</span><span>2</span>')
    [1, 2]

    >>> Element(css='span', count=1, callback=lambda el: el.text).parse('<span>Hello</span>')
    'Hello'


--------
children
--------

**Parsers**: `Group`_, `Prefix`_

Specifies the children parsers for the ``Group`` and ``Prefix`` parsers.
All parsers listed in the ``children`` parameter **must** have ``name`` specified.

Css/xpath selectors in the children parsers are relative to the selectors specified in the parent parser.

Example:

.. code-block:: python

    Prefix(xpath='//*[@id="profile"]', children=[
        # equivalent to: //*[@id="profile"]/descendant-or-self::*[@class="name"]
        String(name='name', css='.name', count=1),

        # equivalent to: //*[@id="profile"]/*[@class="title"]
        String(name='title', xpath='*[@class="title"]', count=1),

        # equivalent to: //*[@class="subtitle"]
        String(name='subtitle', xpath='//*[@class="subtitle"]', count=1)
    ])


----------
namespaces
----------

**Parsers**: `String`_, `Url`_, `DateTime`_, `Date`_, `Element`_, `Group`_, `Prefix`_

When parsing XML documents containing namespace prefixes, pass a dictionary mapping namespace prefixes to namespace URIs.
Then use the fully qualified name for elements in the xpath selector, in the form ``"prefix:element"``.

For the moment, you **cannot use a default namespace** for parsing (see the `lxml docs <http://lxml.de/FAQ.html#how-can-i-specify-a-default-namespace-for-xpath-expressions>`_ for more information). Just use an arbitrary prefix instead.

Example:

.. code-block:: python

    >>> content = '''<?xml version='1.0' encoding='UTF-8'?>
    ... <movie xmlns="http://imdb.com/ns/">
    ...     <title>The Shawshank Redemption</title>
    ...     <year>1994</year>
    ... </movie>'''
    >>> nsmap = {'imdb': 'http://imdb.com/ns/'}  # use an arbitrary prefix for the default namespace

    >>> Prefix(xpath='//imdb:movie', namespaces=nsmap, children=[  # pass namespaces to the outermost parser
    ...     String(name='title', xpath='imdb:title', count=1),
    ...     String(name='year', xpath='imdb:year', count=1)
    ... ]).parse(content)
    {'title': 'The Shawshank Redemption', 'year': '1994'}


====================
HTML vs. XML parsing
====================

To extract data from an HTML or XML document, simply call the ``parse()`` method of the parser:

.. code-block:: python

    >>> from xextract import *
    >>> parser = Prefix(..., children=[...])
    >>> extracted_data = parser.parse(content)

``content`` is a string containing the content of the document.

Under the hood, **xextract** uses either ``lxml.etree.XMLParser`` or ``lxml.etree.HTMLParser`` to parse the document.
To select the parser, **xextract** looks for the ``"<?xml"`` string in the first 128 bytes of the document. If it is found, ``XMLParser`` is used; otherwise ``HTMLParser`` is used.

To force either of the parsers, you can call the ``parse_html()`` or ``parse_xml()`` method:

.. code-block:: python

    >>> parser.parse_html(content)  # force lxml.etree.HTMLParser
    >>> parser.parse_xml(content)  # force lxml.etree.XMLParser
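The parser-selection heuristic described above can be sketched in plain Python. This is an illustrative approximation, not the library's actual code, and the function name ``looks_like_xml`` is made up for this example:

```python
# Illustrative sketch of the parser-selection heuristic described above.
# NOT xextract's actual implementation; the function name is hypothetical.
def looks_like_xml(content: str) -> bool:
    """Return True if an XML declaration appears near the start of the document."""
    # xextract looks for the "<?xml" marker in the first 128 bytes
    return '<?xml' in content[:128]

print(looks_like_xml("<?xml version='1.0'?><movie/>"))  # True  -> XMLParser
print(looks_like_xml('<!DOCTYPE html><html></html>'))   # False -> HTMLParser
```

Note that a declaration appearing only after the first 128 bytes would not be detected by such a check, which is one more reason to use ``parse_xml()`` explicitly when you know the document is XML.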