:Name: autopager
:Version: 0.3.1
:Home page: https://github.com/TeamHG-Memex/autopager
:Summary: Detect and classify pagination links on web pages
:Upload time: 2020-09-09 10:21:51
:Author: Mikhail Korobov
:License: MIT license
:Requirements: lxml, sklearn-crfsuite, parsel, tldextract, backports.csv, scikit-learn, eli5, docopt

=========
Autopager
=========

.. image:: https://img.shields.io/pypi/v/autopager.svg
   :target: https://pypi.python.org/pypi/autopager
   :alt: PyPI Version

.. image:: https://img.shields.io/travis/TeamHG-Memex/autopager/master.svg
   :target: http://travis-ci.org/TeamHG-Memex/autopager
   :alt: Build Status

.. image:: http://codecov.io/github/TeamHG-Memex/autopager/coverage.svg?branch=master
   :target: http://codecov.io/github/TeamHG-Memex/autopager?branch=master
   :alt: Code Coverage


Autopager is a Python package which detects and classifies pagination links.

License is MIT.

Installation
============

Install autopager with pip::

   pip install autopager

Autopager depends on a few other packages, such as lxml_ and python-crfsuite_.
pip will try to install them automatically, but you may need to consult
the installation docs for these packages if installation fails.

.. _lxml: http://lxml.de/
.. _python-crfsuite: http://python-crfsuite.readthedocs.org/en/latest/

Autopager works in Python 3.6+.

Usage
=====

The ``autopager.urls`` function returns a list of pagination URLs::

   >>> import autopager
   >>> import requests
   >>> autopager.urls(requests.get('http://my-url.org'))
   ['http://my-url.org/page/1', 'http://my-url.org/page/3', 'http://my-url.org/page/4']

The ``autopager.select`` function returns all pagination ``<a>`` elements
as a ``parsel.SelectorList`` (the same type of object that Scrapy's
``response.css`` and ``response.xpath`` methods return).

The ``autopager.extract`` function returns a list of ``(link_type, link)``
tuples, where ``link_type`` is one of ``"PAGE"``, ``"PREV"``, ``"NEXT"`` and
``link`` is a ``parsel.Selector`` instance.

These functions accept HTML page contents (as a Unicode string),
a requests Response or a Scrapy Response as the first argument.
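
As a minimal sketch of how the ``(link_type, link)`` tuples described above
might be consumed, the helper below groups links by their type.
``group_links_by_type`` is a hypothetical name and not part of autopager; it
accepts any iterable of pairs, so plain strings stand in here for the
``parsel.Selector`` objects a real call would return.

```python
from collections import defaultdict


def group_links_by_type(extracted):
    """Group (link_type, link) pairs into a dict keyed by link type."""
    grouped = defaultdict(list)
    for link_type, link in extracted:
        grouped[link_type].append(link)
    return dict(grouped)


# Stand-in data shaped like the result of autopager.extract(html);
# a real result holds parsel.Selector objects instead of strings.
sample = [
    ("PREV", "/page/1"),
    ("PAGE", "/page/1"),
    ("PAGE", "/page/3"),
    ("NEXT", "/page/3"),
]
grouped = group_links_by_type(sample)
print(grouped["NEXT"])
```

With real data, the same helper could be called as
``group_links_by_type(autopager.extract(html))``.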

By default, a prebuilt extraction model is used. If you want to use
your own model, use the ``autopager.AutoPager`` class; it has the same
methods but lets you provide a model path or the model itself::

   >>> import autopager
   >>> pager = autopager.AutoPager('my_model.crf')
   >>> pager.urls(html)

You also have to use the ``AutoPager`` class if you've cloned the repository
from git; the prebuilt model is only available in PyPI releases.

Detection Quality
=================

Web pages can be very different; autopager tries to work for all websites,
but some errors are inevitable. As a very rough estimate, expect it to work
properly for **9/10** paginators on websites sampled from the 1M most popular
international websites (according to `Alexa Top`_).

.. _Alexa Top: https://support.alexa.com/hc/en-us/articles/200449834-Does-Alexa-have-a-list-of-its-top-ranked-websites-

Contributing
============

* Source code: https://github.com/TeamHG-Memex/autopager
* Issue tracker: https://github.com/TeamHG-Memex/autopager/issues

How It Works
============

Autopager uses machine learning to detect paginators. It classifies
``<a>`` HTML elements into 4 classes:

* PREV - previous page link
* PAGE - a link to a specific page
* NEXT - next page link
* OTHER - not a pagination link

To do that it uses features like link text, CSS class names,
URL parts and right/left contexts. A CRF_ model is used for learning.

A web page is represented as a sequence of ``<a>`` elements. Only ``<a>``
elements with non-empty ``href`` attributes are in this sequence.
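
The sequence representation described above can be illustrated with a short
sketch. This is not autopager's own code (the package depends on parsel and
lxml); it uses only the standard library, and the names
``LinkSequenceParser`` and ``link_sequence`` are hypothetical. It collects
``<a>`` elements with a non-empty ``href`` in document order, which is the
kind of sequence a CRF model would then label.

```python
from html.parser import HTMLParser


class LinkSequenceParser(HTMLParser):
    """Collect <a> elements with a non-empty href, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished links as {"href": ..., "text": ...}
        self._current = None   # link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skips a missing or empty href
                self._current = {"href": href, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


def link_sequence(html):
    """Return the list of links that would form the page's sequence."""
    parser = LinkSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.links


html = """
<a href="/about">About</a>
<a name="top">anchor without href</a>
<div class="pager">
  <a href="/page/1">&laquo; Prev</a>
  <a href="/page/1">1</a>
  <a href="">2</a>
  <a href="/page/3">3</a>
  <a href="/page/3">Next &raquo;</a>
</div>
"""
sequence = link_sequence(html)
for link in sequence:
    print(link["href"], repr(link["text"]))
```

In this sample the anchor without ``href`` and the link with an empty ``href``
(the current page) are left out of the sequence.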

See also: https://github.com/TeamHG-Memex/autopager/blob/master/notebooks/Training.ipynb

.. _CRF: https://en.wikipedia.org/wiki/Conditional_random_field

Training Data
=============

Data is stored in ``autopager/data``. Raw HTML source code
is in the ``autopager/data/html`` folder. Annotations are in the
``autopager/data/data.csv`` file; elements are stored as CSS selectors.

Training data is annotated with 5 non-empty classes:

* PREV - previous page link
* PAGE - a link to a specific page
* NEXT - next page link
* LAST - 'go to last page' link which is not just a number
* FIRST - 'go to first page' link which is not just the number '1'

Because LAST and FIRST are relatively rare, the pagination model converts
them to PAGE. Keeping these classes in the annotations may make it possible
for the model to predict them as well in the future, given more training
examples.
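
The label conversion described above could look roughly like the sketch
below. The names ``MERGED_CLASSES`` and ``to_model_label`` are hypothetical,
and treating an empty annotation as OTHER is an assumption based on the
"non-empty classes" wording, not a documented detail of autopager.

```python
# Assumed mapping: the rare annotation classes are folded into PAGE.
MERGED_CLASSES = {"FIRST": "PAGE", "LAST": "PAGE"}


def to_model_label(annotation):
    """Map an annotation class to the label used for training."""
    if not annotation:
        return "OTHER"  # assumption: an empty annotation is not a pagination link
    return MERGED_CLASSES.get(annotation, annotation)


labels = [to_model_label(a) for a in ["PREV", "FIRST", "PAGE", "LAST", "NEXT", ""]]
print(labels)
```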

To add a new page to the training data, save it to an HTML file
and add a row to the data.csv file. The SelectorGadget extension
(http://selectorgadget.com/) is helpful for getting CSS selectors.

Don't worry if your CSS selectors don't return ``<a>`` elements directly
(it is easy to accidentally select a parent or a child of an ``<a>`` element
when using SelectorGadget). If a selection itself is not an ``<a>`` element,
then parent ``<a>`` elements and child ``<a>`` elements are tried. This is
usually what is wanted because ``<a>`` tags are not nested on valid websites.
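
The fallback described above can be sketched as follows. This is an
illustration under stated assumptions, not autopager's implementation: it
uses the standard library's ``xml.etree.ElementTree`` on a well-formed
snippet, the function name ``resolve_to_links`` is hypothetical, and checking
ancestors before descendants is an assumed order.

```python
import xml.etree.ElementTree as ET


def resolve_to_links(root, selected):
    """Return the <a> element(s) that a selected element refers to."""
    if selected.tag == "a":
        return [selected]
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = selected
    while node in parents:
        node = parents[node]
        if node.tag == "a":
            return [node]  # the selection was inside an <a>
    # No enclosing <a>: fall back to <a> descendants of the selection.
    return list(selected.iter("a"))


root = ET.fromstring(
    '<div><ul class="pager">'
    '<li><a href="/page/2"><span>2</span></a></li>'
    '<li><a href="/page/3">3</a></li>'
    "</ul></div>"
)
span = root.find(".//span")          # a child of an <a> element
second_li = root.findall(".//li")[1]  # a parent of an <a> element

print([a.get("href") for a in resolve_to_links(root, span)])
print([a.get("href") for a in resolve_to_links(root, second_li)])
```

Because ``<a>`` tags are not nested in valid HTML, a selection has at most
one enclosing ``<a>``, so the ancestor search can stop at the first match.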

When using SelectorGadget, take special care not to select anything other
than pagination elements. Always check the element count displayed by
SelectorGadget and compare it to the number of elements you wanted to select.

Some websites change their DOM after rendering. This rarely affects paginator
elements, but it can happen. To avoid problems, instead of downloading
the HTML file using the "Save As..." browser menu option, it is better to use
"Copy Outer HTML" in developer tools or render the HTML using a headless
browser (e.g. Splash_). If you do so, make sure to put UTF-8 as the encoding
in data.csv, regardless of the page encoding defined in HTTP headers or
``<meta>`` tags.

.. _Splash: https://github.com/scrapinghub/splash

----

.. image:: https://hyperiongray.s3.amazonaws.com/define-hg.svg
	:target: https://www.hyperiongray.com/?pk_campaign=github&pk_kwd=autopager
	:alt: define hyperiongray


Changes
=======

0.3.1 (2020-09-09)
------------------

* Fixing the distribution;
* backports.csv is no longer required in setup.py

0.3 (2020-09-09)
----------------

* Minimum Python requirement is now 3.6. Older versions may still work,
  but they're no longer tested on CI.
* Memory usage is limited, to avoid spikes on pathological pages.

0.2 (2016-04-26)
----------------

* more training examples;
* fixed Scrapy < 1.1 support;
* fixed a bug in text-before and text-after features.

0.1 (2016-03-15)
----------------

Initial release

            
