courlan

Name: courlan
Version: 0.9.5
Home page: https://github.com/adbar/courlan
Summary: Clean, filter and sample URLs to optimize data collection – includes spam, content type and language filters.
Upload time: 2023-11-28 11:34:33
Author: Adrien Barbaresi
Requires Python: >=3.6
License: GPLv3+
Keywords: cleaner, crawler, preprocessing, url-parsing, url-manipulation, urls, validation, webcrawling

coURLan: Clean, filter, normalize, and sample URLs
===================================================


.. image:: https://img.shields.io/pypi/v/courlan.svg
    :target: https://pypi.python.org/pypi/courlan
    :alt: Python package

.. image:: https://img.shields.io/pypi/pyversions/courlan.svg
    :target: https://pypi.python.org/pypi/courlan
    :alt: Python versions

.. image:: https://img.shields.io/codecov/c/github/adbar/courlan.svg
    :target: https://codecov.io/gh/adbar/courlan
    :alt: Code Coverage

.. image:: https://img.shields.io/badge/code%20style-black-000000.svg
   :target: https://github.com/psf/black
   :alt: Code style: black


Why coURLan?
------------

    “It is important for the crawler to visit "important" pages first, so that the fraction of the Web that is visited (and kept up to date) is more meaningful.” (Cho et al. 1998)

    “Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained.” (Edwards et al. 2001)


This library provides an additional “brain” for web crawling, scraping and document management. It facilitates web navigation through a set of filters, enhancing the quality of resulting document collections:

- Save bandwidth and processing time by steering clear of pages deemed low-value
- Identify specific pages based on language or text content
- Pinpoint pages relevant for efficient link gathering

Additional utilities include URL storage, filtering, and deduplication.


Features
--------

Separate the wheat from the chaff and optimize document discovery and retrieval:

- URL handling
   - Validation
   - Normalization
   - Sampling
- Heuristics for link filtering
   - Spam, trackers, and content-types
   - Language/Locale-aware processing
   - Web crawling (frontier, scheduling)
- Data store specifically designed for URLs
- Usable with Python or on the command-line


**Let the coURLan fish up juicy bits for you!**

.. image:: courlan_harns-march.jpg
    :alt: Courlan 
    :align: center
    :width: 65%
    :target: https://commons.wikimedia.org/wiki/File:Limpkin,_harns_marsh_(33723700146).jpg

Here is a `courlan <https://en.wiktionary.org/wiki/courlan>`_ (source: `Limpkin at Harn's Marsh by Russ <https://commons.wikimedia.org/wiki/File:Limpkin,_harns_marsh_(33723700146).jpg>`_, CC BY 2.0).



Installation
------------

This package is compatible with all common versions of Python and is tested on Linux, macOS and Windows systems.

Courlan is available on the package repository `PyPI <https://pypi.org/>`_ and can notably be installed with the Python package manager ``pip``:

.. code-block:: bash

    $ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
    $ pip install --upgrade courlan # to make sure you have the latest version
    $ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)


Python
------

Most filters revolve around the ``strict`` and ``language`` arguments.


check_url()
~~~~~~~~~~~

All useful operations are chained together in ``check_url(url)``:

.. code-block:: python

    >>> from courlan import check_url

    # return url and domain name
    >>> check_url('https://github.com/adbar/courlan')
    ('https://github.com/adbar/courlan', 'github.com')

    # filter out bogus domains
    >>> check_url('http://666.0.0.1/')
    >>>

    # tracker removal
    >>> check_url('http://test.net/foo.html?utm_source=twitter#gclid=123')
    ('http://test.net/foo.html', 'test.net')

    # use strict for further trimming
    >>> my_url = 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org'
    >>> check_url(my_url, strict=True)
    ('https://httpbin.org/redirect-to', 'httpbin.org')

    # check for redirects (HEAD request)
    >>> url, domain_name = check_url(my_url, with_redirects=True)


Language-aware heuristics, notably internationalization in URLs, are available in ``lang_filter(url, language)``:

.. code-block:: python

    # optional language argument
    >>> url = 'https://www.un.org/en/about-us'

    # success: returns clean URL and domain name
    >>> check_url(url, language='en')
    ('https://www.un.org/en/about-us', 'un.org')

    # failure: doesn't return anything
    >>> check_url(url, language='de')
    >>>

    # optional argument: strict
    >>> url = 'https://en.wikipedia.org/'
    >>> check_url(url, language='de', strict=False)
    ('https://en.wikipedia.org', 'wikipedia.org')
    >>> check_url(url, language='de', strict=True)
    >>>
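
The underlying helper named above can also be called directly. A minimal sketch, assuming ``lang_filter`` is importable from the ``courlan.filters`` module and returns a boolean:

.. code-block:: python

    >>> from courlan.filters import lang_filter
    >>> lang_filter('https://www.un.org/en/about-us', 'en')
    True
    >>> lang_filter('https://www.un.org/en/about-us', 'de')
    False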


Define stricter restrictions on the expected content type with ``strict=True``. This setting also blocks certain platforms and page types that crawlers should avoid unless they target them explicitly, as well as other black holes where machines get lost.

.. code-block:: python

    # strict filtering: blocked as it is a major platform
    >>> check_url('https://www.twitch.com/', strict=True)
    >>>



Sampling by domain name
~~~~~~~~~~~~~~~~~~~~~~~

Draw a balanced sample of the input, limited to a given number of URLs per domain name:

.. code-block:: python

    >>> from courlan import sample_urls
    >>> my_urls = ['https://example.org/' + str(x) for x in range(100)]
    >>> my_sample = sample_urls(my_urls, 10)
    # optional: exclude_min=None, exclude_max=None, strict=False, verbose=False
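
The optional keyword arguments listed in the comment above can be used, for instance, to skip domains with too few or too many known URLs (the threshold values below are arbitrary):

.. code-block:: python

    >>> my_sample = sample_urls(my_urls, 10, exclude_min=10, exclude_max=10000)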


Web crawling and URL handling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Determine if a link leads to another host:

.. code-block:: python

    >>> from courlan import is_external
    >>> is_external('https://github.com/', 'https://www.microsoft.com/')
    True
    # default
    >>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
    False
    # taking suffixes into account
    >>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
    True


Other useful functions dedicated to URL handling:

- ``extract_domain(url, fast=True)``: find domain and subdomain or just domain with ``fast=False``
- ``get_base_url(url)``: strip the URL of some of its parts
- ``get_host_and_path(url)``: decompose URLs in two parts: protocol + host/domain and path
- ``get_hostinfo(url)``: extract domain and host info (protocol + host/domain)
- ``fix_relative_urls(baseurl, url)``: prepend necessary information to relative links


.. code-block:: python

    >>> from courlan import get_base_url, get_host_and_path, get_hostinfo, fix_relative_urls
    >>> url = 'https://www.un.org/en/about-us'

    >>> get_base_url(url)
    'https://www.un.org'

    >>> get_host_and_path(url)
    ('https://www.un.org', '/en/about-us')

    >>> get_hostinfo(url)
    ('un.org', 'https://www.un.org')

    >>> fix_relative_urls('https://www.un.org', 'en/about-us')
    'https://www.un.org/en/about-us'


Other filters dedicated to crawl frontier management:

- ``is_not_crawlable(url)``: check for deep web or pages generally not usable in a crawling context
- ``is_navigation_page(url)``: check for navigation and overview pages


.. code-block:: python

    >>> from courlan import is_navigation_page, is_not_crawlable
    >>> is_navigation_page('https://www.randomblog.net/category/myposts')
    True
    >>> is_not_crawlable('https://www.randomblog.net/login')
    True


Python helpers
~~~~~~~~~~~~~~

Helper function to scrub and normalize URLs:

.. code-block:: python

    >>> from courlan import clean_url
    >>> clean_url('HTTPS://WWW.DWDS.DE:80/')
    'https://www.dwds.de'


Basic scrubbing only:

.. code-block:: python

    >>> from courlan import scrub_url
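    # performs the scrubbing part of clean_url(), without normalization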


Basic canonicalization/normalization only, i.e. modifying and standardizing URLs in a consistent manner:

.. code-block:: python

    >>> from urllib.parse import urlparse
    >>> from courlan import normalize_url
    >>> my_url = 'https://www.un.org/en/about-us'
    >>> my_url = normalize_url(urlparse(my_url))
    # passing URL strings directly also works
    >>> my_url = normalize_url(my_url)
    # remove unnecessary components and re-order query elements
    >>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
    'http://test.net/foo.html?page=2&post=abc'


Basic URL validation only:

.. code-block:: python

    >>> from courlan import validate_url
    >>> validate_url('http://1234')
    (False, None)
    >>> validate_url('http://www.example.org/')
    (True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))


Troubleshooting
~~~~~~~~~~~~~~~

Courlan uses an internal cache to speed up URL parsing. It can be reset as follows:

.. code-block:: python

    >>> from courlan.meta import clear_caches
    >>> clear_caches()



UrlStore class
~~~~~~~~~~~~~~

The ``UrlStore`` class allows for storing and retrieving domain-classified URLs, where a URL like ``https://example.org/path/testpage`` is stored as the path ``/path/testpage`` within the domain ``https://example.org``. It features the following methods (a short usage sketch follows the lists below):

- URL management
   - ``add_urls(urls=[], appendleft=None, visited=False)``: Add a list of URLs to the (possibly) existing one. Optional: append certain URLs to the left, specify if the URLs have already been visited.
   - ``add_from_html(htmlstring, url, external=False, lang=None, with_nav=True)``: Extract and filter links in an HTML string.
   - ``discard(domains)``: Declare domains void and prune the store.
   - ``dump_urls()``: Return a list of all known URLs.
   - ``print_urls()``: Print all URLs in store (URL + TAB + visited or not).
   - ``print_unvisited_urls()``: Print all unvisited URLs in store.
   - ``get_all_counts()``: Return all download counts for the hosts in store.
   - ``get_known_domains()``: Return all known domains as a list.
   - ``get_unvisited_domains()``: Find all domains for which there are unvisited URLs.
   - ``total_url_number()``: Find number of all URLs in store.
   - ``is_known(url)``: Check if the given URL has already been stored.
   - ``has_been_visited(url)``: Check if the given URL has already been visited.
   - ``filter_unknown_urls(urls)``: Take a list of URLs and return the currently unknown ones.
   - ``filter_unvisited_urls(urls)``: Take a list of URLs and return the currently unvisited ones.
   - ``find_known_urls(domain)``: Get all already known URLs for the given domain (ex. "https://example.org").
   - ``find_unvisited_urls(domain)``: Get all unvisited URLs for the given domain.
   - ``reset()``: Re-initialize the URL store.
- Crawling and downloads
   - ``get_url(domain)``: Retrieve a single URL and consider it to be visited (with corresponding timestamp).
   - ``get_rules(domain)``: Return the stored crawling rules for the given website.
   - ``store_rules(website, rules=None)``: Store crawling rules for a given website.
   - ``get_crawl_delay()``: Return the delay as extracted from robots.txt, or a given default.
   - ``get_download_urls(timelimit=10)``: Get a list of immediately downloadable URLs according to the given time limit per domain.
   - ``establish_download_schedule(max_urls=100, time_limit=10)``: Get up to the specified number of URLs along with a suitable backoff schedule (in seconds).
   - ``download_threshold_reached(threshold)``: Find out if the download limit (in seconds) has been reached for one of the websites in store.
   - ``unvisited_websites_number()``: Return the number of websites for which there are still URLs to visit.
   - ``is_exhausted_domain(domain)``: Tell if all known URLs for the website have been visited.

Optional settings:

- ``compressed=True``: activate compression of URLs and rules
- ``language=XX``: focus on a particular target language (two-letter code)
- ``strict=True``: stricter URL filtering
- ``verbose=True``: dump URLs if interrupted (requires use of ``signal``)
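
A minimal usage sketch based on the methods listed above (the example URLs are placeholders and the scheduling output is omitted):

.. code-block:: python

    >>> from courlan import UrlStore

    # create a store and add URLs, which are classified by domain internally
    >>> my_store = UrlStore()
    >>> my_store.add_urls(['https://example.org/path/testpage', 'https://example.org/otherpage'])
    >>> my_store.total_url_number()
    2

    # check and retrieve URLs; get_url() marks the returned URL as visited
    >>> my_store.is_known('https://example.org/path/testpage')
    True
    >>> next_url = my_store.get_url('https://example.org')

    # plan downloads: up to 100 URLs along with a backoff schedule in seconds
    >>> schedule = my_store.establish_download_schedule(max_urls=100, time_limit=10)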


Command-line
------------

The main functions are also available through a command-line utility.

.. code-block:: bash

    $ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
    $ courlan --help
    usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v]
                   [-p PARALLEL] [--strict] [-l LANGUAGE] [-r] [--sample SAMPLE]
                   [--exclude-max EXCLUDE_MAX] [--exclude-min EXCLUDE_MIN]


    optional arguments:
      -h, --help            show this help message and exit

    I/O:
      Manage input and output

      -i INPUTFILE, --inputfile INPUTFILE
                            name of input file (required)
      -o OUTPUTFILE, --outputfile OUTPUTFILE
                            name of output file (required)
      -d DISCARDEDFILE, --discardedfile DISCARDEDFILE
                            name of file to store discarded URLs (optional)
      -v, --verbose         increase output verbosity
      -p PARALLEL, --parallel PARALLEL
                            number of parallel processes (not used for sampling)

    Filtering:
      Configure URL filters

      --strict              perform more restrictive tests
      -l LANGUAGE, --language LANGUAGE
                            use language filter (ISO 639-1 code)
      -r, --redirects       check redirects

    Sampling:
      Use sampling by host, configure sample size

      --sample SAMPLE       size of sample per domain
      --exclude-max EXCLUDE_MAX
                            exclude domains with more than n URLs
      --exclude-min EXCLUDE_MIN
                            exclude domains with less than n URLs
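
For instance, filtering and per-domain sampling can be combined in one call, using the options shown in the help above (file names and threshold are placeholders):

.. code-block:: bash

    $ courlan --inputfile url-list.txt --outputfile sample.txt --language de --strict --sample 10 --exclude-min 10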


License
-------

*coURLan* is distributed under the `GNU General Public License v3.0 <https://github.com/adbar/courlan/blob/master/LICENSE>`_. If you wish to redistribute this library but feel bound by the license conditions, please try interacting `at arm's length <https://www.gnu.org/licenses/gpl-faq.html#GPLInProprietarySystem>`_, `multi-licensing <https://en.wikipedia.org/wiki/Multi-licensing>`_ with `compatible licenses <https://en.wikipedia.org/wiki/GNU_General_Public_License#Compatibility_and_multi-licensing>`_, or `contacting me <https://github.com/adbar/courlan#author>`_.

See also `GPL and free software licensing: What's in it for business? <https://web.archive.org/web/20230127221311/https://www.techrepublic.com/article/gpl-and-free-software-licensing-whats-in-it-for-business/>`_.



Settings
--------

``courlan`` is optimized for English and German but its generic approach is also usable in other contexts.

Details of strict URL filtering can be reviewed and changed in the file ``settings.py``. To override the default settings, `clone the repository <https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/cloning-a-repository-from-github>`_ and `re-install the package locally <https://packaging.python.org/tutorials/installing-packages/#installing-from-a-local-src-tree>`_.
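
A possible workflow following these instructions, assuming a standard ``git``/``pip`` setup (the exact package layout may differ):

.. code-block:: bash

    $ git clone https://github.com/adbar/courlan.git
    $ cd courlan
    # edit the filter lists in settings.py (inside the courlan/ package directory), then re-install locally
    $ pip install --upgrade .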



Contributing
------------

`Contributions <https://github.com/adbar/courlan/blob/master/CONTRIBUTING.md>`_ are welcome!

Feel free to file issues on the `dedicated page <https://github.com/adbar/courlan/issues>`_.


Author
------

This effort is part of methods to derive information from web documents in order to build `text databases for research <https://www.dwds.de/d/k-web>`_ (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. Web corpus construction involves numerous design decisions, and this software package can help facilitate text data collection and enhance corpus quality.

- Barbaresi, A. "`Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction <https://aclanthology.org/2021.acl-demo.15/>`_." *Proceedings of ACL/IJCNLP 2021: System Demonstrations*, 2021, pp. 122-131.
- Barbaresi, A. "`Generic Web Content Extraction with Open-Source Software <https://konvens.org/proceedings/2019/papers/kaleidoskop/camera_ready_barbaresi.pdf>`_." *Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019)*, 2019, pp. 267-268.

Contact: see `homepage <https://adrien.barbaresi.eu/>`_ or `GitHub <https://github.com/adbar>`_.

Software ecosystem: see `this graphic <https://github.com/adbar/trafilatura/blob/master/docs/software-ecosystem.png>`_.



Similar work
------------

These Python libraries perform similar handling and normalization tasks but do not include language or content filters. They also do not primarily focus on crawl optimization:

- `furl <https://github.com/gruns/furl>`_
- `ural <https://github.com/medialab/ural>`_
- `yarl <https://github.com/aio-libs/yarl>`_


References
----------

- Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. *Computer Networks and ISDN Systems*, 30(1-7), 161–172.
- Edwards, J., McCurley, K. S., & Tomlin, J. A. (2001). An adaptive model for optimizing performance of an incremental web crawler. In *Proceedings of the 10th International Conference on World Wide Web (WWW '01)*, pp. 106–113.

            
