# tk-normalizer
URL normalization library for creating consistent URL representations.
## Purpose
URL normalization provides a way to treat URLs as equivalent when they differ only in case, scheme, trailing slashes, tracking parameters, or query parameter ordering. This library creates normalized representations of URLs for consistent storage, comparison, and analysis.
## Installation
```bash
pip install tk-normalizer
```
## Quick Start
```python
from tk_normalizer import TkNormalizer
# Simple usage - str() returns just the normalized URL
normalized = TkNormalizer("http://www.Example.com/path?b=2&a=1&utm_source=test")
print(str(normalized)) # Output: example.com/path?a=1&b=2
# Get full details with dict()
print(dict(normalized)) # Returns all fields including query_string, path, and hashes
```
## Features
### URL Normalization
The following URLs all normalize to the same form:
```
https://example.com/
http://www.example.com/
http://www.example.com
http://www.example.com/#my_search_engine_is_great
https://www.example.com/?utm_campaign=SomeGoogleCampaign
https://www.example.com/?utm_source=because&utm_campaign=SomeGoogleCampaign
```
All normalize to: `example.com`
### Normalization Process
URLs are normalized through the following steps:
- ✅ Protocol and www subdomains removed
- ✅ Lowercased
- ✅ Trailing slashes removed
- ✅ Query parameters reordered alphabetically by key
- ✅ Duplicate query parameter key/value pairs removed
- ✅ Common tracking parameters removed (utm_*, gclid, fbclid, etc.)
- ✅ Non-HTTP(S) protocols rejected
- ✅ Localhost URLs rejected
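The steps above can be sketched with the standard library alone. This is an illustrative toy normalizer, not the library's implementation (use `TkNormalizer` for real work); the scheme and localhost checks, `www.` stripping, and parameter handling shown here are simplified assumptions:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

def normalize(url: str) -> str:
    """Toy normalizer following the documented steps; not tk-normalizer's code."""
    parts = urlsplit(url.strip().lower())  # lowercase the whole URL
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    host = parts.hostname or ""
    if host in ("localhost", "127.0.0.1"):
        raise ValueError("localhost URLs are rejected")
    if host.startswith("www."):
        host = host[4:]  # drop the www subdomain
    path = parts.path.rstrip("/")  # remove trailing slashes
    # Drop utm_* params, de-duplicate key/value pairs, sort the rest by key.
    # (The real removal list is longer; see "Tracking Parameters Removed".)
    pairs = sorted(set(
        (k, v) for k, v in parse_qsl(parts.query)
        if not k.startswith("utm_")
    ))
    query = urlencode(pairs)
    return host + path + ("?" + query if query else "")

examples = [
    "https://example.com/",
    "http://www.example.com/",
    "http://www.example.com/#my_search_engine_is_great",
    "https://www.example.com/?utm_campaign=SomeGoogleCampaign",
]
print({normalize(u) for u in examples})  # {'example.com'}
```

Note that the fragment never appears in the output because `urlsplit` separates it from the path and query, and the function simply ignores it.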
### Tracking Parameters Removed
The following tracking parameters are automatically removed during normalization:
- `utm_*` (all utm parameters)
- `gclid`, `fbclid`, `dclid` (click identifiers)
- `_ga`, `_gid`, `_fbp`, `_hjid` (analytics cookies)
- `msclkid` (Microsoft Ads)
- `aff_id`, `affid` (affiliate tracking)
- `referrer`, `adgroupid`, `srsltid`
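A stdlib sketch of how such a filter might work on a query string; the list above is the source of truth, and the constant below is an illustrative subset, not the library's internal data:

```python
from urllib.parse import parse_qsl, urlencode

# Illustrative subset of the documented removal list (assumption, not the
# library's internal constant).
TRACKING_KEYS = {
    "gclid", "fbclid", "dclid", "_ga", "_gid", "_fbp", "_hjid",
    "msclkid", "aff_id", "affid", "referrer", "adgroupid", "srsltid",
}

def strip_tracking(query: str) -> str:
    """Drop utm_* and known tracking keys; keep everything else in order."""
    kept = [
        (k, v) for k, v in parse_qsl(query, keep_blank_values=True)
        if not k.lower().startswith("utm_") and k.lower() not in TRACKING_KEYS
    ]
    return urlencode(kept)

print(strip_tracking("a=1&utm_source=x&gclid=abc&b=2"))  # a=1&b=2
```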
## Advanced Usage
### Getting Full Normalization Details
```python
from tk_normalizer import TkNormalizer
normalizer = TkNormalizer("http://blog.example.com/page?b=2&a=1")
# Use str() for just the normalized URL
print(str(normalizer)) # blog.example.com/page?a=1&b=2
# Use dict() for complete normalization data
result = dict(normalizer)
print(result)
# {
# 'normalized_url': 'blog.example.com/page?a=1&b=2',
# 'parent_normalized_url': 'blog.example.com',
# 'root_normalized_url': 'example.com',
# 'query_string': 'a=1&b=2',
# 'path': '/page',
# 'normalized_url_hash': '...',
# 'parent_normalized_url_hash': '...',
# 'root_normalized_url_hash': '...'
# }
```
### Error Handling
```python
from tk_normalizer import TkNormalizer, InvalidUrlException
try:
normalizer = TkNormalizer("not a valid url")
except InvalidUrlException as e:
print(f"Invalid URL: {e}")
```
### Accessing Individual Components
```python
from tk_normalizer import TkNormalizer
normalizer = TkNormalizer("https://blog.example.com/path?a=1")
# Dict-like access to individual fields
print(normalizer["normalized_url"]) # blog.example.com/path?a=1
print(normalizer["parent_normalized_url"]) # blog.example.com
print(normalizer["root_normalized_url"]) # example.com
print(normalizer["query_string"]) # a=1
print(normalizer["path"]) # /path
# Iterate over available fields
for key in normalizer:
print(f"{key}: {normalizer[key]}")
# Get all field names
print(normalizer.keys())
```
## Hashing
For efficient storage and comparison, SHA-256 hashes are computed for:
- The normalized URL
- The parent normalized URL (domain without path)
- The root normalized URL (root domain without subdomains)
This provides fixed-length representations suitable for database indexing.
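A SHA-256 hex digest is always 64 characters, which is what makes it convenient as a fixed-width index column. The exact encoding the library uses before hashing is an assumption here; this is only a sketch of the idea:

```python
import hashlib

def url_hash(normalized_url: str) -> str:
    """SHA-256 hex digest of a normalized URL string (assumes UTF-8 encoding)."""
    return hashlib.sha256(normalized_url.encode("utf-8")).hexdigest()

# Equal normalized URLs always hash equal; length is fixed at 64 hex chars.
print(len(url_hash("example.com")))  # 64
```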
## Important Caveats
While this normalization process works well for most use cases, there are some limitations:
1. **www subdomain removal**: Technically, `www.example.com` and `example.com` could serve different content, though this is rare in practice.
2. **Case sensitivity**: URLs are lowercased, but some servers are case-sensitive for paths.
3. **Tracking parameters**: New tracking parameters emerge over time and may not be in the removal list.
4. **Fragment removal**: URL fragments (#anchors) are removed, which may affect single-page applications.
## Development
### Setting Up Development Environment
```bash
# Clone the repository
git clone https://github.com/terakeet/tk-normalizer.git
cd tk-normalizer
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=tk_normalizer
# Run linting
ruff check src tests
```
### Running Tests
```bash
# Run all tests
pytest
# Run with verbose output
pytest -v
# Run specific test file
pytest tests/test_normalizer.py
# Run with coverage report
pytest --cov=tk_normalizer --cov-report=html
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Support
For issues and questions, please use the [GitHub issue tracker](https://github.com/terakeet/tk-normalizer/issues).
## Credits
Based on the URL normalization functionality from [tk-core](https://github.com/terakeet/tk-core), extracted and packaged for standalone use.