# coURLan: Clean, filter, normalize, and sample URLs
[![Python package](https://img.shields.io/pypi/v/courlan.svg)](https://pypi.python.org/pypi/courlan)
[![Python versions](https://img.shields.io/pypi/pyversions/courlan.svg)](https://pypi.python.org/pypi/courlan)
[![Code Coverage](https://img.shields.io/codecov/c/github/adbar/courlan.svg)](https://codecov.io/gh/adbar/courlan)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
## Why coURLan?
> "It is important for the crawler to visit 'important' pages first,
> so that the fraction of the Web that is visited (and kept up to date)
> is more meaningful." (Cho et al. 1998)
>
> "Given that the bandwidth for conducting crawls is neither infinite
> nor free, it is becoming essential to crawl the Web in not only a
> scalable, but efficient way, if some reasonable measure of quality or
> freshness is to be maintained." (Edwards et al. 2001)
This library provides an additional "brain" for web crawling, scraping
and document management. It facilitates web navigation through a set of
filters, enhancing the quality of resulting document collections:
- Save bandwidth and processing time by steering clear of pages deemed
low-value
- Identify specific pages based on language or text content
- Pinpoint pages relevant for efficient link gathering
Additional utilities are provided for URL storage, filtering, and
deduplication.
## Features
Separate the wheat from the chaff and optimize document discovery and
retrieval:
- URL handling
- Validation
- Normalization
- Sampling
- Heuristics for link filtering
- Spam, trackers, and content-types
- Locales and internationalization
- Web crawling (frontier, scheduling)
- Data store specifically designed for URLs
- Usable with Python or on the command-line
**Let the coURLan fish up juicy bits for you!**
<img src="https://raw.githubusercontent.com/adbar/courlan/master/courlan_harns-march.jpg" width="65%" alt="Courlan bird"/>
Here is a [courlan](https://en.wiktionary.org/wiki/courlan) (source:
[Limpkin at Harn's Marsh by
Russ](https://commons.wikimedia.org/wiki/File:Limpkin,_harns_marsh_(33723700146).jpg),
CC BY 2.0).
## Installation
This package is compatible with all common versions of Python and is
tested on Linux, macOS, and Windows.
Courlan is available on the package repository [PyPI](https://pypi.org/)
and can be installed with the Python package manager `pip`:
``` bash
$ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
$ pip install --upgrade courlan # to make sure you have the latest version
$ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)
```
The last version to support Python 3.6 and 3.7 is `courlan==1.2.0`.
## Python
Most filters revolve around the `strict` and `language` arguments.
### check_url()
All useful operations are chained in `check_url(url)`:
``` python
>>> from courlan import check_url
# return url and domain name
>>> check_url('https://github.com/adbar/courlan')
('https://github.com/adbar/courlan', 'github.com')
# filter out bogus domains
>>> check_url('http://666.0.0.1/')
>>>
# tracker removal
>>> check_url('http://test.net/foo.html?utm_source=twitter#gclid=123')
('http://test.net/foo.html', 'test.net')
# use strict for further trimming
>>> my_url = 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org'
>>> check_url(my_url, strict=True)
('https://httpbin.org/redirect-to', 'httpbin.org')
# check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)
# include navigation pages instead of discarding them
>>> check_url('http://www.example.org/page/10/', with_nav=True)
# remove trailing slash
>>> check_url('https://github.com/adbar/courlan/', trailing_slash=False)
```
Language-aware heuristics, notably internationalization in URLs, are
available in `lang_filter(url, language)`:
``` python
# optional language argument
>>> url = 'https://www.un.org/en/about-us'
# success: returns clean URL and domain name
>>> check_url(url, language='en')
('https://www.un.org/en/about-us', 'un.org')
# failure: doesn't return anything
>>> check_url(url, language='de')
>>>
# optional argument: strict
>>> url = 'https://en.wikipedia.org/'
>>> check_url(url, language='de', strict=False)
('https://en.wikipedia.org', 'wikipedia.org')
>>> check_url(url, language='de', strict=True)
>>>
```
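The same heuristic can also be called directly. The snippet below is a sketch: it assumes `lang_filter()` is importable from the top-level package and returns a boolean, mirroring the `check_url()` results shown above.
``` python
# assumption: lang_filter() is exported at the package level and returns
# True/False, consistent with the check_url() behavior demonstrated above
>>> from courlan import lang_filter
>>> lang_filter('https://www.un.org/en/about-us', 'en')
True
>>> lang_filter('https://www.un.org/en/about-us', 'de')
False
```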
Define stricter restrictions on the expected content type with
`strict=True`. This also blocks certain platforms and page types
where machines get lost.
``` python
# strict filtering: blocked as it is a major platform
>>> check_url('https://www.twitch.com/', strict=True)
>>>
```
### Sampling by domain name
``` python
>>> from courlan import sample_urls
>>> my_urls = ['https://example.org/' + str(x) for x in range(100)]
>>> my_sample = sample_urls(my_urls, 10)
# optional: exclude_min=None, exclude_max=None, strict=False, verbose=False
```
### Web crawling and URL handling
Link extraction and preprocessing:
``` python
>>> from courlan import extract_links
>>> doc = '<html><body><a href="test/link.html">Link</a></body></html>'
>>> url = "https://example.org"
>>> extract_links(doc, url)
{'https://example.org/test/link.html'}
# other options: external_bool, no_filter, language, strict, redirects, ...
```
The `filter_links()` function provides additional filters for crawling purposes,
namely the application of robots.txt rules and link prioritization. See
`courlan.core` for details; a short illustration follows.
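The sketch below is illustrative only: it assumes `filter_links()` takes the HTML string and the page URL (as `extract_links()` does), accepts robots.txt rules as a `urllib.robotparser.RobotFileParser` object, and returns two lists (regular links and priority links). Check `courlan.core` for the actual parameters before relying on it.
``` python
# hypothetical sketch: the rules= keyword and the two-list return value
# are assumptions, see courlan.core for the actual interface
>>> from urllib import robotparser
>>> from courlan.core import filter_links
>>> doc = '<html><body><a href="/page/2/">2</a> <a href="/post/abc">abc</a></body></html>'
>>> rules = robotparser.RobotFileParser()
>>> rules.parse(['User-agent: *', 'Disallow: /private/'])
>>> links, priority_links = filter_links(doc, 'https://example.org', rules=rules)
```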
Determine if a link leads to another host:
``` python
>>> from courlan import is_external
>>> is_external('https://github.com/', 'https://www.microsoft.com/')
True
# default
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
False
# taking suffixes into account
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
True
```
Other useful functions dedicated to URL handling:
- `extract_domain(url, fast=True)`: find domain and subdomain, or just the registrable domain with `fast=False` (see the short example after the snippet below)
- `get_base_url(url)`: strip the URL down to its base, i.e. protocol + host/domain
- `get_host_and_path(url)`: decompose a URL into two parts: protocol + host/domain, and path
- `get_hostinfo(url)`: extract domain and host info (protocol + host/domain)
- `fix_relative_urls(baseurl, url)`: prepend the necessary information to relative links
``` python
>>> from courlan import *
>>> url = 'https://www.un.org/en/about-us'
>>> get_base_url(url)
'https://www.un.org'
>>> get_host_and_path(url)
('https://www.un.org', '/en/about-us')
>>> get_hostinfo(url)
('un.org', 'https://www.un.org')
>>> fix_relative_urls('https://www.un.org', 'en/about-us')
'https://www.un.org/en/about-us'
```
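`extract_domain()` is not part of the snippet above; here is a brief sketch, with the output assumed from the description of `fast=False` and the `get_hostinfo()` example:
``` python
>>> from courlan import extract_domain
# fast=False performs the more thorough extraction and returns just the domain
# (output assumed for illustration)
>>> extract_domain('https://www.un.org/en/about-us', fast=False)
'un.org'
```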
Other filters dedicated to crawl frontier management:
- `is_not_crawlable(url)`: check for deep web or pages generally not
usable in a crawling context
- `is_navigation_page(url)`: check for navigation and overview pages
``` python
>>> from courlan import is_navigation_page, is_not_crawlable
>>> is_navigation_page('https://www.randomblog.net/category/myposts')
True
>>> is_not_crawlable('https://www.randomblog.net/login')
True
```
See also [URL management page](https://trafilatura.readthedocs.io/en/latest/url-management.html)
of the Trafilatura documentation.
### Python helpers
Helper function to scrub and normalize:
``` python
>>> from courlan import clean_url
>>> clean_url('HTTPS://WWW.DWDS.DE:80/')
'https://www.dwds.de'
```
Basic scrubbing only:
``` python
>>> from courlan import scrub_url
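# scrub_url() applies the scrubbing part of clean_url() shown above,
# without the normalization step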
```
Basic canonicalization/normalization only, i.e. modifying and
standardizing URLs in a consistent manner:
``` python
>>> from urllib.parse import urlparse
>>> from courlan import normalize_url
>>> my_url = normalize_url(urlparse(my_url))
# passing URL strings directly also works
>>> my_url = normalize_url(my_url)
# remove unnecessary components and re-order query elements
>>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
'http://test.net/foo.html?page=2&post=abc'
```
Basic URL validation only:
``` python
>>> from courlan import validate_url
>>> validate_url('http://1234')
(False, None)
>>> validate_url('http://www.example.org/')
(True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))
```
### Troubleshooting
Courlan uses an internal cache to speed up URL parsing. It can be reset
as follows:
``` python
>>> from courlan.meta import clear_caches
>>> clear_caches()
```
## UrlStore class
The `UrlStore` class allows for storing and retrieving domain-classified
URLs: a URL like `https://example.org/path/testpage` is stored as
the path `/path/testpage` within the domain `https://example.org`. It
features the following methods (a short usage sketch follows the list):
- URL management
- `add_urls(urls=[], appendleft=None, visited=False)`: Add a
list of URLs to the (possibly) existing one. Optional:
append certain URLs to the left, specify if the URLs have
already been visited.
    - `add_from_html(htmlstring, url, external=False, lang=None, with_nav=True)`:
        Extract and filter links in an HTML string.
- `discard(domains)`: Declare domains void and prune the store.
- `dump_urls()`: Return a list of all known URLs.
- `print_urls()`: Print all URLs in store (URL + TAB + visited or not).
- `print_unvisited_urls()`: Print all unvisited URLs in store.
- `get_all_counts()`: Return all download counts for the hosts in store.
- `get_known_domains()`: Return all known domains as a list.
- `get_unvisited_domains()`: Find all domains for which there are unvisited URLs.
- `total_url_number()`: Find number of all URLs in store.
- `is_known(url)`: Check if the given URL has already been stored.
- `has_been_visited(url)`: Check if the given URL has already been visited.
- `filter_unknown_urls(urls)`: Take a list of URLs and return the currently unknown ones.
- `filter_unvisited_urls(urls)`: Take a list of URLs and return the currently unvisited ones.
    - `find_known_urls(domain)`: Get all already known URLs for the
        given domain (e.g. `https://example.org`).
- `find_unvisited_urls(domain)`: Get all unvisited URLs for the given domain.
- `reset()`: Re-initialize the URL store.
- Crawling and downloads
- `get_url(domain)`: Retrieve a single URL and consider it to
be visited (with corresponding timestamp).
- `get_rules(domain)`: Return the stored crawling rules for the given website.
- `store_rules(website, rules=None)`: Store crawling rules for a given website.
- `get_crawl_delay()`: Return the delay as extracted from robots.txt, or a given default.
- `get_download_urls(max_urls=100, time_limit=10)`: Get a list of immediately
downloadable URLs according to the given time limit per domain.
- `establish_download_schedule(max_urls=100, time_limit=10)`:
Get up to the specified number of URLs along with a suitable
backoff schedule (in seconds).
- `download_threshold_reached(threshold)`: Find out if the
download limit (in seconds) has been reached for one of the
websites in store.
- `unvisited_websites_number()`: Return the number of websites
for which there are still URLs to visit.
- `is_exhausted_domain(domain)`: Tell if all known URLs for
the website have been visited.
- Persistence
- `write(filename)`: Save the store to disk.
- `load_store(filename)`: Read a UrlStore from disk (separate function, not class method).
- Optional settings:
- `compressed=True`: activate compression of URLs and rules
- `language=XX`: focus on a particular target language (two-letter code)
- `strict=True`: stricter URL filtering
- `verbose=True`: dump URLs if interrupted (requires use of `signal`)
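A minimal usage sketch based on the methods listed above; the shown outputs are assumed for illustration, and the crawling rules are passed as a standard `urllib.robotparser.RobotFileParser` object (an assumption, see `store_rules()` above):
``` python
>>> from courlan import UrlStore
>>> my_store = UrlStore()
# both URLs are grouped under the domain https://example.org
>>> my_store.add_urls(['https://example.org/path/testpage', 'https://example.org/about'])
>>> my_store.is_known('https://example.org/path/testpage')
True
>>> my_store.total_url_number()
2
# retrieve one URL for the domain and mark it as visited
>>> url = my_store.get_url('https://example.org')
>>> my_store.has_been_visited(url)
True
# store crawling rules (assumption: urllib.robotparser objects are expected)
>>> from urllib import robotparser
>>> rules = robotparser.RobotFileParser()
>>> rules.parse(['User-agent: *', 'Crawl-delay: 10'])
>>> my_store.store_rules('https://example.org', rules)
>>> stored_rules = my_store.get_rules('https://example.org')
# batch planning: up to 10 URLs with a backoff schedule (format not shown here)
>>> schedule = my_store.establish_download_schedule(max_urls=10)
```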
## Command-line
The main functions are also available through a command-line utility:
``` bash
$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help
usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v]
[-p PARALLEL] [--strict] [-l LANGUAGE] [-r] [--sample SAMPLE]
[--exclude-max EXCLUDE_MAX] [--exclude-min EXCLUDE_MIN]
Command-line interface for Courlan
options:
-h, --help show this help message and exit
I/O:
Manage input and output
-i INPUTFILE, --inputfile INPUTFILE
name of input file (required)
-o OUTPUTFILE, --outputfile OUTPUTFILE
name of output file (required)
-d DISCARDEDFILE, --discardedfile DISCARDEDFILE
name of file to store discarded URLs (optional)
-v, --verbose increase output verbosity
-p PARALLEL, --parallel PARALLEL
number of parallel processes (not used for sampling)
Filtering:
Configure URL filters
--strict perform more restrictive tests
-l LANGUAGE, --language LANGUAGE
use language filter (ISO 639-1 code)
-r, --redirects check redirects
Sampling:
Use sampling by host, configure sample size
--sample SAMPLE size of sample per domain
--exclude-max EXCLUDE_MAX
exclude domains with more than n URLs
--exclude-min EXCLUDE_MIN
exclude domains with less than n URLs
```
## License
*coURLan* is distributed under the [Apache 2.0
license](https://www.apache.org/licenses/LICENSE-2.0.html).
Versions prior to v1 were under GPLv3+ license.
## Settings
`courlan` is optimized for English and German but its generic approach
is also usable in other contexts.
Details of strict URL filtering can be reviewed and changed in the file
`settings.py`. To override the default settings, clone the repository and
[re-install the package
locally](https://packaging.python.org/tutorials/installing-packages/#installing-from-a-local-src-tree).
## Contributing
[Contributions](https://github.com/adbar/courlan/blob/master/CONTRIBUTING.md)
are welcome!
Feel free to file issues on the [dedicated
page](https://github.com/adbar/courlan/issues).
## Author
Developed with practical applications of academic research in mind, this software
is part of a broader effort to derive information from web documents.
Extracting and pre-processing web texts to the exacting standards of
scientific research presents a substantial challenge.
This software package simplifies text data collection and enhances corpus quality;
it is currently used to build [text databases for research](https://www.dwds.de/d/k-web).
- Barbaresi, A. "[Trafilatura: A Web Scraping Library and
Command-Line Tool for Text Discovery and
Extraction](https://aclanthology.org/2021.acl-demo.15/)."
*Proceedings of ACL/IJCNLP 2021: System Demonstrations*, 2021, pp. 122-131.
Contact: see [homepage](https://adrien.barbaresi.eu/).
Software ecosystem: see [this
graphic](https://github.com/adbar/trafilatura/blob/master/docs/software-ecosystem.png).
## Similar work
These Python libraries perform similar URL handling and normalization
tasks but do not include language or content filters, and they do not
primarily focus on crawl optimization:
- [furl](https://github.com/gruns/furl)
- [ural](https://github.com/medialab/ural)
- [yarl](https://github.com/aio-libs/yarl)
## References
- Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling
  through URL ordering. *Computer Networks and ISDN Systems*, 30(1-7),
  161–172.
- Edwards, J., McCurley, K. S., & Tomlin, J. A. (2001). An adaptive
  model for optimizing performance of an incremental web crawler.
  In *Proceedings of the 10th International Conference on World Wide
  Web (WWW '01)*, pp. 106–113.