courlan

- Name: courlan
- Version: 1.3.2
- Summary: Clean, filter and sample URLs to optimize data collection – includes spam, content type and language filters.
- Upload time: 2024-10-29 16:40:20
- Requires Python: >=3.8
- License: Apache 2.0
- Keywords: cleaner, crawler, uri, url-parsing, url-manipulation, urls, validation, webcrawling

            # coURLan: Clean, filter, normalize, and sample URLs


[![Python package](https://img.shields.io/pypi/v/courlan.svg)](https://pypi.python.org/pypi/courlan)
[![Python versions](https://img.shields.io/pypi/pyversions/courlan.svg)](https://pypi.python.org/pypi/courlan)
[![Code Coverage](https://img.shields.io/codecov/c/github/adbar/courlan.svg)](https://codecov.io/gh/adbar/courlan)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)


## Why coURLan?

> "It is important for the crawler to visit 'important' pages first,
> so that the fraction of the Web that is visited (and kept up to date)
> is more meaningful." (Cho et al. 1998)
>
> "Given that the bandwidth for conducting crawls is neither infinite
> nor free, it is becoming essential to crawl the Web in not only a
> scalable, but efficient way, if some reasonable measure of quality or
> freshness is to be maintained." (Edwards et al. 2001)

This library provides an additional "brain" for web crawling, scraping
and document management. It facilitates web navigation through a set of
filters, enhancing the quality of resulting document collections:

- Save bandwidth and processing time by steering clear of pages deemed
  low-value
- Identify specific pages based on language or text content
- Pinpoint pages relevant for efficient link gathering

It also provides the additional utilities needed for this task: URL
storage, filtering, and deduplication.

## Features

Separate the wheat from the chaff and optimize document discovery and
retrieval:


- URL handling
   - Validation
   - Normalization
   - Sampling
- Heuristics for link filtering
   - Spam, trackers, and content-types
   - Locales and internationalization
   - Web crawling (frontier, scheduling)
- Data store specifically designed for URLs
- Usable with Python or on the command-line


**Let the coURLan fish up juicy bits for you!**

<img src="https://raw.githubusercontent.com/adbar/courlan/master/courlan_harns-march.jpg" width="65%" alt="Courlan bird"/>

Here is a [courlan](https://en.wiktionary.org/wiki/courlan) (source:
[Limpkin at Harn's Marsh by
Russ](https://commons.wikimedia.org/wiki/File:Limpkin,_harns_marsh_(33723700146).jpg),
CC BY 2.0).


## Installation

This package is compatible with all common versions of Python and is
tested on Linux, macOS and Windows systems.

Courlan is available on the package repository [PyPI](https://pypi.org/)
and can notably be installed with the Python package manager `pip`:

``` bash
$ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
$ pip install --upgrade courlan # to make sure you have the latest version
$ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)
```

The last version to support Python 3.6 and 3.7 is `courlan==1.2.0`.


## Python

Most filters revolve around the `strict` and `language` arguments.

### check_url()

All useful operations are chained in `check_url(url)`:

``` python
>>> from courlan import check_url

# return url and domain name
>>> check_url('https://github.com/adbar/courlan')
('https://github.com/adbar/courlan', 'github.com')

# filter out bogus domains
>>> check_url('http://666.0.0.1/')
>>>

# tracker removal
>>> check_url('http://test.net/foo.html?utm_source=twitter#gclid=123')
('http://test.net/foo.html', 'test.net')

# use strict for further trimming
>>> my_url = 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org'
>>> check_url(my_url, strict=True)
('https://httpbin.org/redirect-to', 'httpbin.org')

# check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)

# include navigation pages instead of discarding them
>>> check_url('http://www.example.org/page/10/', with_nav=True)

# remove trailing slash
>>> check_url('https://github.com/adbar/courlan/', trailing_slash=False)
```

Language-aware heuristics, notably internationalization in URLs, are
available in `lang_filter(url, language)`:

``` python
# optional language argument
>>> url = 'https://www.un.org/en/about-us'

# success: returns clean URL and domain name
>>> check_url(url, language='en')
('https://www.un.org/en/about-us', 'un.org')

# failure: doesn't return anything
>>> check_url(url, language='de')
>>>

# optional argument: strict
>>> url = 'https://en.wikipedia.org/'
>>> check_url(url, language='de', strict=False)
('https://en.wikipedia.org', 'wikipedia.org')
>>> check_url(url, language='de', strict=True)
>>>
```

Define stricter restrictions on the expected content type with
`strict=True`. This also blocks certain platforms and page types
where machines get lost.

``` python
# strict filtering: blocked as it is a major platform
>>> check_url('https://www.twitch.com/', strict=True)
>>>
```

### Sampling by domain name

``` python
>>> from courlan import sample_urls
>>> my_urls = ['https://example.org/' + str(x) for x in range(100)]
>>> my_sample = sample_urls(my_urls, 10)
# optional: exclude_min=None, exclude_max=None, strict=False, verbose=False
```

### Web crawling and URL handling

Link extraction and preprocessing:

``` python
>>> from courlan import extract_links
>>> doc = '<html><body><a href="test/link.html">Link</a></body></html>'
>>> url = "https://example.org"
>>> extract_links(doc, url)
{'https://example.org/test/link.html'}
# other options: external_bool, no_filter, language, strict, redirects, ...
```

The `filter_links()` function provides additional filters for crawling purposes:
use of robots.txt rules and link prioritization. See `courlan.core` for details.
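
Here is a minimal, hypothetical sketch of combining it with robots.txt rules; the exact signature and return values should be checked in `courlan.core` (the call below assumes the function takes the HTML string, the page URL and optional rules, and returns regular and priority links):

``` python
>>> from urllib import robotparser
>>> from courlan.core import filter_links

>>> doc = '<html><body><a href="/page/2/">Next</a><a href="/post.html">Post</a></body></html>'

# parse a few robots.txt directives with the standard library
>>> rules = robotparser.RobotFileParser()
>>> rules.parse(['User-agent: *', 'Disallow: /private/'])

# assumed interface: returns two lists, regular links and priority (navigation) links
>>> links, links_priority = filter_links(doc, 'https://example.org', rules=rules)
```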

Determine if a link leads to another host:

``` python
>>> from courlan import is_external
>>> is_external('https://github.com/', 'https://www.microsoft.com/')
True
# default
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
False
# taking suffixes into account
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
True
```

Other useful functions dedicated to URL handling:

-   `extract_domain(url, fast=True)`: find domain and subdomain or just
    domain with `fast=False`
-   `get_base_url(url)`: strip the URL of some of its parts
-   `get_host_and_path(url)`: decompose URLs into two parts: protocol +
    host/domain and path
-   `get_hostinfo(url)`: extract domain and host info (protocol +
    host/domain)
-   `fix_relative_urls(baseurl, url)`: prepend necessary information to
    relative links

``` python
>>> from courlan import *
>>> url = 'https://www.un.org/en/about-us'

>>> get_base_url(url)
'https://www.un.org'

>>> get_host_and_path(url)
('https://www.un.org', '/en/about-us')

>>> get_hostinfo(url)
('un.org', 'https://www.un.org')

>>> fix_relative_urls('https://www.un.org', 'en/about-us')
'https://www.un.org/en/about-us'
```
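
`extract_domain()` is not shown above; here is a brief sketch, assuming the default fast mode and an output consistent with `get_hostinfo()`:

``` python
>>> from courlan import extract_domain

# registered domain, matching the first element returned by get_hostinfo() above
>>> extract_domain('https://www.un.org/en/about-us')
'un.org'
```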

Other filters dedicated to crawl frontier management:

-   `is_not_crawlable(url)`: check for deep web or pages generally not
    usable in a crawling context
-   `is_navigation_page(url)`: check for navigation and overview pages

``` python
>>> from courlan import is_navigation_page, is_not_crawlable
>>> is_navigation_page('https://www.randomblog.net/category/myposts')
True
>>> is_not_crawlable('https://www.randomblog.net/login')
True
```

See also [URL management page](https://trafilatura.readthedocs.io/en/latest/url-management.html)
of the Trafilatura documentation.


### Python helpers

Helper function to scrub and normalize:

``` python
>>> from courlan import clean_url
>>> clean_url('HTTPS://WWW.DWDS.DE:80/')
'https://www.dwds.de'
```

Basic scrubbing only:

``` python
>>> from courlan import scrub_url
```

Basic canonicalization/normalization only, i.e. modifying and
standardizing URLs in a consistent manner:

``` python
>>> from urllib.parse import urlparse
>>> from courlan import normalize_url
>>> my_url = normalize_url(urlparse(my_url))
# passing URL strings directly also works
>>> my_url = normalize_url(my_url)
# remove unnecessary components and re-order query elements
>>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
'http://test.net/foo.html?page=2&post=abc'
```

Basic URL validation only:

``` python
>>> from courlan import validate_url
>>> validate_url('http://1234')
(False, None)
>>> validate_url('http://www.example.org/')
(True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))
```

### Troubleshooting

Courlan uses an internal cache to speed up URL parsing. It can be reset
as follows:

``` python
>>> from courlan.meta import clear_caches
>>> clear_caches()
```

## UrlStore class

The `UrlStore` class allows for storing and retrieving domain-classified
URLs: a URL like `https://example.org/path/testpage` is stored as
the path `/path/testpage` under the domain `https://example.org`. It
features the following methods (a usage sketch follows the method overview):

- URL management
   - `add_urls(urls=[], appendleft=None, visited=False)`: Add a
     list of URLs to the store. Optional: append certain URLs to
     the left of the queue, or mark the URLs as already visited.
   - `add_from_html(htmlstring, url, external=False, lang=None, with_nav=True)`:
     Extract and filter links in an HTML string.
   - `discard(domains)`: Declare domains void and prune the store.
   - `dump_urls()`: Return a list of all known URLs.
   - `print_urls()`: Print all URLs in store (URL + TAB + visited or not).
   - `print_unvisited_urls()`: Print all unvisited URLs in store.
   - `get_all_counts()`: Return all download counts for the hosts in store.
   - `get_known_domains()`: Return all known domains as a list.
   - `get_unvisited_domains()`: Find all domains for which there are unvisited URLs.
   - `total_url_number()`: Find number of all URLs in store.
   - `is_known(url)`: Check if the given URL has already been stored.
   - `has_been_visited(url)`: Check if the given URL has already been visited.
   - `filter_unknown_urls(urls)`: Take a list of URLs and return the currently unknown ones.
   - `filter_unvisited_urls(urls)`: Take a list of URLs and return the currently unvisited ones.
   - `find_known_urls(domain)`: Get all already known URLs for the
     given domain (e.g. `https://example.org`).
   - `find_unvisited_urls(domain)`: Get all unvisited URLs for the given domain.
   - `reset()`: Re-initialize the URL store.

- Crawling and downloads
   - `get_url(domain)`: Retrieve a single URL and consider it to
     be visited (with corresponding timestamp).
   - `get_rules(domain)`: Return the stored crawling rules for the given website.
   - `store_rules(website, rules=None)`: Store crawling rules for a given website.
   - `get_crawl_delay()`: Return the delay as extracted from robots.txt, or a given default.
   - `get_download_urls(max_urls=100, time_limit=10)`: Get a list of immediately
     downloadable URLs according to the given time limit per domain.
   - `establish_download_schedule(max_urls=100, time_limit=10)`:
     Get up to the specified number of URLs along with a suitable
     backoff schedule (in seconds).
   - `download_threshold_reached(threshold)`: Find out if the
     download limit (in seconds) has been reached for one of the
     websites in store.
   - `unvisited_websites_number()`: Return the number of websites
     for which there are still URLs to visit.
   - `is_exhausted_domain(domain)`: Tell if all known URLs for
     the website have been visited.

- Persistence
   - `write(filename)`: Save the store to disk.
   - `load_store(filename)`: Read a UrlStore from disk (separate function, not class method).

- Optional settings:
   - `compressed=True`: activate compression of URLs and rules
   - `language=XX`: focus on a particular target language (two-letter code)
   - `strict=True`: stricter URL filtering
   - `verbose=True`: dump URLs if interrupted (requires use of `signal`)
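
A minimal usage sketch tying a few of the methods above together (the URLs are made up for illustration):

``` python
>>> from courlan import UrlStore

>>> url_store = UrlStore()  # optional settings listed above, e.g. language or strict
>>> url_store.add_urls(['https://example.org/1', 'https://example.org/2', 'https://test.org/1'])

>>> url_store.total_url_number()
3
>>> url_store.is_known('https://example.org/1')
True

# retrieve one URL for the domain and mark it as visited
>>> next_url = url_store.get_url('https://example.org')

# persistence: save to disk (see write() and load_store() above)
>>> url_store.write('urlstore-backup')
```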


## Command-line

The main functions are also available through a command-line utility:

``` bash
$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help
usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v]
               [-p PARALLEL] [--strict] [-l LANGUAGE] [-r] [--sample SAMPLE]
               [--exclude-max EXCLUDE_MAX] [--exclude-min EXCLUDE_MIN]

Command-line interface for Courlan

options:
  -h, --help            show this help message and exit

I/O:
  Manage input and output

  -i INPUTFILE, --inputfile INPUTFILE
                        name of input file (required)
  -o OUTPUTFILE, --outputfile OUTPUTFILE
                        name of output file (required)
  -d DISCARDEDFILE, --discardedfile DISCARDEDFILE
                        name of file to store discarded URLs (optional)
  -v, --verbose         increase output verbosity
  -p PARALLEL, --parallel PARALLEL
                        number of parallel processes (not used for sampling)

Filtering:
  Configure URL filters

  --strict              perform more restrictive tests
  -l LANGUAGE, --language LANGUAGE
                        use language filter (ISO 639-1 code)
  -r, --redirects       check redirects

Sampling:
  Use sampling by host, configure sample size

  --sample SAMPLE       size of sample per domain
  --exclude-max EXCLUDE_MAX
                        exclude domains with more than n URLs
  --exclude-min EXCLUDE_MIN
                        exclude domains with less than n URLs
```


## License

*coURLan* is distributed under the [Apache 2.0
license](https://www.apache.org/licenses/LICENSE-2.0.html).

Versions prior to v1 were released under the GPLv3+ license.


## Settings

`courlan` is optimized for English and German but its generic approach
is also usable in other contexts.

Details of strict URL filtering can be reviewed and changed in the file
`settings.py`. To override the default settings, clone the repository and
[re-install the package
locally](https://packaging.python.org/tutorials/installing-packages/#installing-from-a-local-src-tree).


## Contributing

[Contributions](https://github.com/adbar/courlan/blob/master/CONTRIBUTING.md)
are welcome!

Feel free to file issues on the [dedicated
page](https://github.com/adbar/courlan/issues).


## Author

Developed with practical applications of academic research in mind, this software
is part of a broader effort to derive information from web documents.
Extracting and pre-processing web texts to the exacting standards of
scientific research presents a substantial challenge.
This software package simplifies text data collection and enhances corpus quality;
it is currently used to build [text databases for research](https://www.dwds.de/d/k-web).

- Barbaresi, A. "[Trafilatura: A Web Scraping Library and
  Command-Line Tool for Text Discovery and
  Extraction](https://aclanthology.org/2021.acl-demo.15/)."
  *Proceedings of ACL/IJCNLP 2021: System Demonstrations*, 2021, pp. 122-131.

Contact: see [homepage](https://adrien.barbaresi.eu/).

Software ecosystem: see [this
graphic](https://github.com/adbar/trafilatura/blob/master/docs/software-ecosystem.png).


## Similar work

These Python libraries perform similar handling and normalization tasks
but do not include language or content filters, and they do not
primarily focus on crawl optimization:

-   [furl](https://github.com/gruns/furl)
-   [ural](https://github.com/medialab/ural)
-   [yarl](https://github.com/aio-libs/yarl)


## References

-   Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling
    through URL ordering. *Computer networks and ISDN systems*, 30(1-7),
    161–172.
-   Edwards, J., McCurley, K. S., and Tomlin, J. A. (2001). "An
    adaptive model for optimizing performance of an incremental web
    crawler". In *Proceedings of the 10th international conference on
    World Wide Web - WWW'01*, pp. 106–113.
