# cleanurl
Remove clutter from URLs and return a canonicalized version
# Install
```
pip install cleanurl
```
Or, if you're using Poetry:
```
poetry add cleanurl
```
# Usage
By default, *cleanurl* returns a cleaned URL without respecting semantics.
For example:
```
>>> import cleanurl
>>> r = cleanurl.cleanurl('https://www.xojoc.pw/blog/focus.html?utm_content=buffercf3b2&utm_medium=social&utm_source=snapchat.com&utm_campaign=buffe')
>>> r.url
'https://xojoc.pw/blog/focus'
>>> r.parsed_url
ParseResult(scheme='https', netloc='xojoc.pw', path='/blog/focus', params='', query='', fragment='')
```
The default parameters are useful if you want a *canonical* URL and don't care whether the resulting URL is still valid.
If you want a clean URL that is still valid, call it like this:
```
>>> r = cleanurl.cleanurl('https://www.xojoc.pw/blog/////focus.html', respect_semantics=True)
>>> r.url
'https://www.xojoc.pw/blog/focus.html'
```
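To make the idea of a semantics-preserving cleanup concrete, here is a minimal standard-library sketch. It drops `utm_*` tracking parameters and collapses repeated slashes in the path, but leaves the host and the file extension alone. It is an illustration only, not how *cleanurl* is implemented, and `strip_tracking` is a hypothetical name.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking(url):
    """Drop utm_* query parameters and collapse repeated slashes in the path."""
    parts = urlsplit(url)
    # Keep every query parameter whose name does not start with "utm_".
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    # "/blog/////focus.html" becomes "/blog/focus.html".
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, urlencode(kept), parts.fragment)
    )


print(strip_tracking(
    "https://www.xojoc.pw/blog/////focus.html?utm_source=snapchat.com&id=7"
))
# prints https://www.xojoc.pw/blog/focus.html?id=7
```

Unlike the default mode shown earlier, this kind of cleanup never touches `www.` or `.html`, so the result should still resolve to the same page.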
```cleanurl.cleanurl``` parameters:
- ```generic``` -> if True, don't use site-specific rules
- ```respect_semantics``` -> if True, make sure the returned URL is still valid, although it may still contain some superfluous elements
- ```host_remap``` -> whether to remap hosts. Example:
```
>>> import cleanurl
>>> cleanurl.cleanurl('https://threadreaderapp.com/thread/1453753924960219145', host_remap=True).url
'https://twitter.com/i/status/1453753924960219145'
>>> cleanurl.cleanurl('https://threadreaderapp.com/thread/1453753924960219145', host_remap=False).url
'https://threadreaderapp.com/thread/1453753924960219145'
```
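The host remapping above can be pictured as a lookup from a host to a rewrite rule. The following is a rough sketch under that assumption; the `HOST_REMAP` table and `remap_host` function are hypothetical and are not part of *cleanurl*'s API. The single rule mirrors the threadreaderapp.com example.

```python
import re
from urllib.parse import urlsplit

# Hypothetical remap table: host -> (path pattern, canonical URL template).
HOST_REMAP = {
    "threadreaderapp.com": (
        re.compile(r"^/thread/(\d+)$"),
        "https://twitter.com/i/status/{0}",
    ),
}


def remap_host(url):
    """Return the remapped URL if a rule matches, otherwise the URL unchanged."""
    parts = urlsplit(url)
    rule = HOST_REMAP.get(parts.hostname or "")
    if rule is None:
        return url
    pattern, template = rule
    match = pattern.match(parts.path)
    # Fill the template with the captured thread id when the path matches.
    return template.format(*match.groups()) if match else url


print(remap_host("https://threadreaderapp.com/thread/1453753924960219145"))
# prints https://twitter.com/i/status/1453753924960219145
```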
For more examples see the [unit tests](https://github.com/xojoc/cleanurl/blob/main/src/test_cleanurl.py).
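One way a canonical URL can be put to use is to treat different links to the same page as a single item. The sketch below groups URLs by their canonical form. `group_by_canonical` is a hypothetical helper, and `drop_query` is a stand-in canonicalizer so that the example runs without *cleanurl* installed.

```python
from collections import defaultdict


def group_by_canonical(urls, canonicalize):
    """Group URLs that share the same canonical form."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonicalize(url)].append(url)
    return dict(groups)


def drop_query(url):
    """Stand-in canonicalizer: only removes the query string."""
    return url.split("?", 1)[0]


links = [
    "https://example.com/post?utm_source=a",
    "https://example.com/post?utm_source=b",
    "https://example.com/other",
]
for canonical, originals in group_by_canonical(links, drop_query).items():
    print(canonical, len(originals))
```

With *cleanurl* installed, `lambda u: cleanurl.cleanurl(u).url` could be passed as `canonicalize` instead, assuming the call returns a result object with a `url` attribute for every input, as in the examples above.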
# Why?
While there are some libraries that handle general cases, this library has website-specific rules that normalize URLs more aggressively.
# Users
Initially used for [discu.eu](https://discu.eu).
[Discussions around the web](https://discu.eu/q/https://github.com/xojoc/cleanurl)
# Who?
*cleanurl* was written by [Alexandru Cojocaru](https://xojoc.pw).
# License
*cleanurl* is [Free Software](https://www.gnu.org/philosophy/free-sw.html) and is released under the [AGPLv3](https://github.com/xojoc/cleanurl/blob/main/LICENSE).
Raw data
{
"_id": null,
"home_page": "https://github.com/xojoc/cleanurl",
"name": "cleanurl",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.9,<4.0",
"maintainer_email": "",
"keywords": "url,canonical",
"author": "Alexandru Cojocaru",
"author_email": "hi@xojoc.pw",
"download_url": "https://files.pythonhosted.org/packages/92/fb/bf71e2b1060f36fb26f1b62f26f8a9d27c13a95b9a86310118f963071619/cleanurl-0.1.15.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "AGPL-3.0-or-later",
"summary": "Remove clutter from URLs and return a canonicalized version",
"version": "0.1.15",
"split_keywords": [
"url",
"canonical"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "e6d932b98ad854a35cde655f462d0d0fc55ae052188eb54c7c835dfb8dd0b35e",
"md5": "bbb78e4c47d93892e1252e7af9e817d2",
"sha256": "24edd6f8d4d01b8781c709b122e0f0d55defa081535ef416f7f04aaedf9bde7a"
},
"downloads": -1,
"filename": "cleanurl-0.1.15-py3-none-any.whl",
"has_sig": false,
"md5_digest": "bbb78e4c47d93892e1252e7af9e817d2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9,<4.0",
"size": 18637,
"upload_time": "2023-03-21T21:32:01",
"upload_time_iso_8601": "2023-03-21T21:32:01.198359Z",
"url": "https://files.pythonhosted.org/packages/e6/d9/32b98ad854a35cde655f462d0d0fc55ae052188eb54c7c835dfb8dd0b35e/cleanurl-0.1.15-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "92fbbf71e2b1060f36fb26f1b62f26f8a9d27c13a95b9a86310118f963071619",
"md5": "dedb6c91e75b7d7c9e4279b620e385fe",
"sha256": "e05e9fe59491a5df51dd4a08015d82259cdd1c2fe2f6b573205d8ec09877bbaa"
},
"downloads": -1,
"filename": "cleanurl-0.1.15.tar.gz",
"has_sig": false,
"md5_digest": "dedb6c91e75b7d7c9e4279b620e385fe",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9,<4.0",
"size": 18287,
"upload_time": "2023-03-21T21:32:03",
"upload_time_iso_8601": "2023-03-21T21:32:03.225420Z",
"url": "https://files.pythonhosted.org/packages/92/fb/bf71e2b1060f36fb26f1b62f26f8a9d27c13a95b9a86310118f963071619/cleanurl-0.1.15.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-03-21 21:32:03",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "xojoc",
"github_project": "cleanurl",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "cleanurl"
}