tldextract

| Field | Value |
| --- | --- |
| Name | tldextract |
| Version | 5.1.2 |
| Summary | Accurately separates a URL's subdomain, domain, and public suffix, using the Public Suffix List (PSL). By default, this includes the public ICANN TLDs and their exceptions. You can optionally support the Public Suffix List's private domains as well. |
| Upload time | 2024-03-19 04:08:10 |
| Requires Python | >=3.8 |
| License | BSD-3-Clause |
| Keywords | tld, domain, subdomain, url, parse, extract, urlparse, urlsplit, public, suffix, list, publicsuffix, publicsuffixlist |

# tldextract [![PyPI version](https://badge.fury.io/py/tldextract.svg)](https://badge.fury.io/py/tldextract) [![Build Status](https://github.com/john-kurkowski/tldextract/actions/workflows/ci.yml/badge.svg)](https://github.com/john-kurkowski/tldextract/actions/workflows/ci.yml)

`tldextract` accurately separates a URL's subdomain, domain, and public suffix,
using [the Public Suffix List (PSL)](https://publicsuffix.org).

Say you want just the "google" part of https://www.google.com. *Everybody gets
this wrong.* Splitting on the "." and taking the 2nd-to-last element only works
for simple domains, e.g. .com. Consider
[http://forums.bbc.co.uk](http://forums.bbc.co.uk): the naive splitting method
will give you "co" as the domain, instead of "bbc". Rather than juggle TLDs,
gTLDs, or ccTLDs yourself, `tldextract` looks up the currently active public
suffixes according to [the Public Suffix List](https://publicsuffix.org).
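
To make the failure concrete, here is a minimal sketch of the naive approach described above. `naive_domain` is a hypothetical helper written for illustration only; it is not part of `tldextract`.

```python
def naive_domain(hostname: str) -> str:
    """Return the second-to-last dot-separated label of a hostname."""
    labels = hostname.split(".")
    if len(labels) < 2:
        # Nothing to split on, so hand the input back unchanged.
        return hostname
    return labels[-2]


print(naive_domain("www.google.com"))    # google
print(naive_domain("forums.bbc.co.uk"))  # co
```

The second call returns "co" rather than "bbc", because the helper has no way of knowing that "co.uk" is a single public suffix.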

> A "public suffix" is one under which Internet users can directly register
> names.

A public suffix is also sometimes called an effective TLD (eTLD).

## Usage

```python
>>> import tldextract

>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com', is_private=False)

>>> tldextract.extract('http://forums.bbc.co.uk/') # United Kingdom
ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk', is_private=False)

>>> tldextract.extract('http://www.worldbank.org.kg/') # Kyrgyzstan
ExtractResult(subdomain='www', domain='worldbank', suffix='org.kg', is_private=False)
```

Note that the subdomain and suffix are _optional_. Not all URL-like inputs have a
subdomain or a valid suffix.

```python
>>> tldextract.extract('google.com')
ExtractResult(subdomain='', domain='google', suffix='com', is_private=False)

>>> tldextract.extract('google.notavalidsuffix')
ExtractResult(subdomain='google', domain='notavalidsuffix', suffix='', is_private=False)

>>> tldextract.extract('http://127.0.0.1:8080/deployed/')
ExtractResult(subdomain='', domain='127.0.0.1', suffix='', is_private=False)
```

To rejoin the parts into the original hostname, provided it was a valid, registered hostname:

```python
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> ext.registered_domain
'bbc.co.uk'
>>> ext.fqdn
'forums.bbc.co.uk'
```

By default, this package supports the public ICANN TLDs and their exceptions.
You can optionally support the Public Suffix List's private domains as well.

This package started by implementing the chosen answer from [this StackOverflow question on
getting the "domain name" from a URL](http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url/569219#569219).
However, the proposed regex solution doesn't address many country codes like
com.au, or the exceptions to country codes like the registered domain
parliament.uk. The Public Suffix List does, and so does this package.
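
The kind of rule matching the Public Suffix List enables can be sketched roughly as follows. This is an illustrative simplification, not `tldextract`'s implementation: `SAMPLE_RULES` is a hand-picked handful of rules written in PSL syntax (not the live list), and `public_suffix` is a hypothetical helper.

```python
# Hand-picked sample rules in Public Suffix List syntax, NOT the live list.
# "*." marks a wildcard rule and "!" marks an exception to a wildcard.
SAMPLE_RULES = frozenset({"com", "uk", "co.uk", "au", "com.au", "*.ck", "!www.ck"})


def public_suffix(hostname: str, rules: frozenset = SAMPLE_RULES) -> str:
    """Return the longest matching public suffix, or "" if no rule matches."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try the longest candidate first, so the first hit is the longest match.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        parent = ".".join(labels[i + 1:])
        if "!" + candidate in rules:
            # Exception rule: the suffix is the candidate minus its leftmost label.
            return parent
        if candidate in rules or (parent and "*." + parent in rules):
            return candidate
    return ""


print(public_suffix("forums.bbc.co.uk"))     # co.uk
print(public_suffix("shop.example.com.au"))  # com.au
print(public_suffix("shop.example.ck"))      # example.ck
print(public_suffix("www.ck"))               # ck
```

A single regex struggles to express this, because the answer depends on which multi-label, wildcard, and exception rules exist in the list at lookup time.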

## Install

Latest release on PyPI:

```zsh
pip install tldextract
```

Or the latest dev version:

```zsh
pip install -e 'git+https://github.com/john-kurkowski/tldextract.git#egg=tldextract'
```

On the command line, `tldextract` prints the URL's components separated by spaces:

```zsh
tldextract http://forums.bbc.co.uk
# forums bbc co.uk
```
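
If you consume that output from a script, a small parser along these lines may help. It is a sketch that assumes each output line holds exactly three fields separated by single spaces, with a missing component printed as an empty string; `CliParts` and `parse_cli_line` are hypothetical names, not part of the package.

```python
from typing import NamedTuple


class CliParts(NamedTuple):
    subdomain: str
    domain: str
    suffix: str


def parse_cli_line(line: str) -> CliParts:
    """Split one line of space-separated CLI output into its three parts."""
    # Split on single spaces (not arbitrary whitespace) so that an empty
    # subdomain or suffix survives as an empty string.
    fields = line.rstrip("\n").split(" ")
    if len(fields) != 3:
        raise ValueError(f"expected 3 space-separated fields, got {len(fields)}")
    return CliParts(*fields)


print(parse_cli_line("forums bbc co.uk\n"))
# CliParts(subdomain='forums', domain='bbc', suffix='co.uk')
```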

## Note about caching

Beware: the first time you call `tldextract`, it updates its TLD list with a live HTTP
request. The updated TLD set is usually cached indefinitely in `$HOME/.cache/python-tldextract`.
To control the cache's location, set the `TLDEXTRACT_CACHE` environment variable or pass the
`cache_dir` argument when constructing `TLDExtract`.

(Arguably, runtime bootstrapping like that shouldn't be the default behavior,
especially for production systems. But I want you to have the latest TLDs, especially
when I haven't kept this code up to date.)


```python
import tldextract

# extract callable that falls back to the included TLD snapshot, no live HTTP fetching
no_fetch_extract = tldextract.TLDExtract(suffix_list_urls=())
no_fetch_extract('http://www.google.com')

# extract callable that reads/writes the updated TLD set to a different path
custom_cache_extract = tldextract.TLDExtract(cache_dir='/path/to/your/cache/')
custom_cache_extract('http://www.google.com')

# extract callable that doesn't use caching
no_cache_extract = tldextract.TLDExtract(cache_dir=None)
no_cache_extract('http://www.google.com')
```

If you want to stay fresh with the TLD definitions--though they don't change
often--delete the cache file occasionally, or run

```zsh
tldextract --update
```

or:

```zsh
env TLDEXTRACT_CACHE="$HOME/tldextract.cache" tldextract --update
```

It is also recommended to delete the file after upgrading this lib.
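
When driving the update from Python rather than a shell, keep in mind that a shell does not expand `~` inside double quotes, so it is safest to hand over an already-expanded absolute path. The sketch below builds such an environment; `env_with_cache` is a hypothetical helper, and whether the library expands `~` on its own is not assumed here.

```python
import os


def env_with_cache(cache_path: str, base_env=None) -> dict:
    """Copy an environment mapping and point TLDEXTRACT_CACHE at an absolute path."""
    env = dict(os.environ if base_env is None else base_env)
    # Expand "~" and relative segments up front so nothing downstream has to guess.
    env["TLDEXTRACT_CACHE"] = os.path.abspath(os.path.expanduser(cache_path))
    return env


env = env_with_cache("~/tldextract.cache")
print(env["TLDEXTRACT_CACHE"])
```

The resulting mapping can then be passed along, for example as `subprocess.run(["tldextract", "--update"], env=env, check=True)`.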

## Advanced usage

### Public vs. private domains

The PSL [maintains a concept of "private"
domains](https://publicsuffix.org/list/).

> PRIVATE domains are amendments submitted by the domain holder, as an
> expression of how they operate their domain security policy. … While some
> applications, such as browsers when considering cookie-setting, treat all
> entries the same, other applications may wish to treat ICANN domains and
> PRIVATE domains differently.

By default, `tldextract` treats public and private domains the same.

```python
>>> extract = tldextract.TLDExtract()
>>> extract('waiterrant.blogspot.com')
ExtractResult(subdomain='waiterrant', domain='blogspot', suffix='com', is_private=False)
```

To override this for a single call, pass `include_psl_private_domains=True`:

```python
>>> extract = tldextract.TLDExtract()
>>> extract('waiterrant.blogspot.com', include_psl_private_domains=True)
ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)
```

Or, to change the default for all extract calls:

```python
>>> extract = tldextract.TLDExtract(include_psl_private_domains=True)
>>> extract('waiterrant.blogspot.com')
ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)
```

The thinking behind the default is that it matches the more common case of how people
mentally parse a domain name. It doesn't assume familiarity with the PSL nor
that the PSL makes a public/private distinction. Note this default may run
counter to the default parsing behavior of other, PSL-based libraries.
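
The effect of the flag can be pictured with a toy model of the two PSL sections. This is only a sketch: `ICANN_RULES` and `PRIVATE_RULES` are tiny stand-ins for the real sections, `split_host` is a hypothetical function, and wildcard and exception rules are left out.

```python
from typing import NamedTuple

# Tiny stand-ins for the two PSL sections; the real list is far larger.
ICANN_RULES = frozenset({"com"})
PRIVATE_RULES = frozenset({"blogspot.com"})


class Parts(NamedTuple):
    subdomain: str
    domain: str
    suffix: str
    is_private: bool


def split_host(hostname: str, include_private: bool = False) -> Parts:
    """Split a hostname, optionally honoring the private section."""
    labels = hostname.lower().split(".")
    # Longest candidate first, so a private suffix wins over its ICANN parent.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        private = include_private and candidate in PRIVATE_RULES
        if private or candidate in ICANN_RULES:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: i - 1]) if i > 1 else ""
            return Parts(subdomain, domain, candidate, private)
    # No rule matched: treat the last label as the domain, with no suffix.
    return Parts(".".join(labels[:-1]), labels[-1], "", False)


print(split_host("waiterrant.blogspot.com"))
# Parts(subdomain='waiterrant', domain='blogspot', suffix='com', is_private=False)
print(split_host("waiterrant.blogspot.com", include_private=True))
# Parts(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)
```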

### Specifying your own URL or file for Public Suffix List data

You can specify your own input data in place of the default Mozilla Public Suffix List:

```python
extract = tldextract.TLDExtract(
    suffix_list_urls=["http://foo.bar.baz"],
    # Recommended: Specify your own cache file, to minimize ambiguities about where
    # tldextract is getting its data, or cached data, from.
    cache_dir='/path/to/your/cache/',
    fallback_to_snapshot=False)
```

The above snippet will fetch from the URL *you* specified, upon first need to download the
suffix list (i.e. if the cached version doesn't exist).

If you want to use input data from your local filesystem, just use the `file://` protocol:

```python
extract = tldextract.TLDExtract(
    suffix_list_urls=["file://" + "/absolute/path/to/your/local/suffix/list/file"],
    cache_dir='/path/to/your/cache/',
    fallback_to_snapshot=False)
```

Use an absolute path when specifying the `suffix_list_urls` keyword argument.
`os.path` is your friend.
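
As an alternative to string concatenation, `pathlib` can build the `file://` URL for you. A minimal sketch, where `suffix_list_file_url` is a hypothetical helper name:

```python
from pathlib import Path


def suffix_list_file_url(path: str) -> str:
    """Turn a local path into an absolute file:// URL."""
    # as_uri() refuses relative paths, so make the path absolute first.
    return Path(path).expanduser().absolute().as_uri()


print(suffix_list_file_url("/absolute/path/to/your/local/suffix/list/file"))
```

The returned string could then be passed as an element of `suffix_list_urls`.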

The command line update command can be used with a URL or local file you specify:

```zsh
tldextract --update --suffix_list_url "http://foo.bar.baz"
```

This could be useful in production when you don't want the delay associated with updating the suffix
list on first use, or if you are behind a complex firewall that prevents a simple update from working.

## FAQ

### Can you add suffix \_\_\_\_? Can you make an exception for domain \_\_\_\_?

This project doesn't contain an actual list of public suffixes. That comes from
[the Public Suffix List (PSL)](https://publicsuffix.org/). Submit amendments there.

(In the meantime, you can tell tldextract about your exception by either
forking the PSL and using your fork in the `suffix_list_urls` param, or adding
your suffix piecemeal with the `extra_suffixes` param.)

### I see my suffix in [the Public Suffix List (PSL)](https://publicsuffix.org/), but this library doesn't extract it.

Check if your suffix is in the private section of the list. See [this
documentation](#public-vs-private-domains).

### If I pass an invalid URL, I still get a result, no error. What gives?

To keep `tldextract` light in lines of code and overhead, and because there are plenty of
URL validators out there, this library is very lenient with input. If valid
URLs are important to you, validate them before calling `tldextract`.
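
A pre-check might look like the sketch below. It uses only the standard library, is deliberately minimal rather than a full validator, and `looks_like_http_url` is a hypothetical helper, not part of `tldextract`.

```python
from urllib.parse import urlsplit


def looks_like_http_url(url: str) -> bool:
    """Cheap sanity check: an http(s) scheme plus a non-empty hostname."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit raises on some malformed inputs, e.g. an unclosed IPv6 bracket.
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)


print(looks_like_http_url("http://forums.bbc.co.uk/"))  # True
print(looks_like_http_url("google.com"))                # False
```

Inputs that fail the check can be rejected or logged before they ever reach the extractor. Adjust the accepted schemes to whatever your application considers valid.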

To avoid parsing a string twice, you can pass `tldextract` the output of
[`urllib.parse`](https://docs.python.org/3/library/urllib.parse.html) methods.
For example:

```py
import urllib.parse

from tldextract import TLDExtract

extractor = TLDExtract()
split_url = urllib.parse.urlsplit("https://foo.bar.com:8080")
split_suffix = extractor.extract_urllib(split_url)
url_to_crawl = f"{split_url.scheme}://{split_suffix.registered_domain}:{split_url.port}"
```

`tldextract`'s lenient string parsing stance lowers the learning curve of using
the library, at the cost of desensitizing users to the nuances of URLs. This
could be overhauled. For example, users could opt into validation, either
receiving exceptions or error metadata on results.

## Contribute

### Setting up

1. `git clone` this repository.
2. Change into the new directory.
3. `pip install --upgrade --editable '.[testing]'`

### Running the test suite

Run all tests against all supported Python versions:

```zsh
tox --parallel
```

Run all tests against a specific Python environment configuration:

```zsh
tox -l
tox -e py311
```

### Code Style

Automatically format all code:

```zsh
black .
```

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "tldextract",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "tld,domain,subdomain,url,parse,extract,urlparse,urlsplit,public,suffix,list,publicsuffix,publicsuffixlist",
    "author": "",
    "author_email": "John Kurkowski <john.kurkowski@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/db/ed/c92a5d6edaafec52f388c2d2946b4664294299cebf52bb1ef9cbc44ae739/tldextract-5.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD-3-Clause",
    "summary": "Accurately separates a URL's subdomain, domain, and public suffix, using the Public Suffix List (PSL). By default, this includes the public ICANN TLDs and their exceptions. You can optionally support the Public Suffix List's private domains as well.",
    "version": "5.1.2",
    "project_urls": {
        "Homepage": "https://github.com/john-kurkowski/tldextract"
    },
    "split_keywords": [
        "tld",
        "domain",
        "subdomain",
        "url",
        "parse",
        "extract",
        "urlparse",
        "urlsplit",
        "public",
        "suffix",
        "list",
        "publicsuffix",
        "publicsuffixlist"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fc6d8eaafb735b39c4ab3bb8fe4324ef8f0f0af27a7df9bb4cd503927bd5475d",
                "md5": "249d44935a378689bb4a260f7d359774",
                "sha256": "4dfc4c277b6b97fa053899fcdb892d2dc27295851ab5fac4e07797b6a21b2e46"
            },
            "downloads": -1,
            "filename": "tldextract-5.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "249d44935a378689bb4a260f7d359774",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 97560,
            "upload_time": "2024-03-19T04:08:08",
            "upload_time_iso_8601": "2024-03-19T04:08:08.492286Z",
            "url": "https://files.pythonhosted.org/packages/fc/6d/8eaafb735b39c4ab3bb8fe4324ef8f0f0af27a7df9bb4cd503927bd5475d/tldextract-5.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "dbedc92a5d6edaafec52f388c2d2946b4664294299cebf52bb1ef9cbc44ae739",
                "md5": "2edf4652cfd3a5ac96f350f17a76d492",
                "sha256": "c9e17f756f05afb5abac04fe8f766e7e70f9fe387adb1859f0f52408ee060200"
            },
            "downloads": -1,
            "filename": "tldextract-5.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "2edf4652cfd3a5ac96f350f17a76d492",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 116825,
            "upload_time": "2024-03-19T04:08:10",
            "upload_time_iso_8601": "2024-03-19T04:08:10.962488Z",
            "url": "https://files.pythonhosted.org/packages/db/ed/c92a5d6edaafec52f388c2d2946b4664294299cebf52bb1ef9cbc44ae739/tldextract-5.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-19 04:08:10",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "john-kurkowski",
    "github_project": "tldextract",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "tox": true,
    "lcname": "tldextract"
}