tk-normalizer

Name: tk-normalizer
Version: 1.0.1
Summary: URL normalization library for consistent URL representation
Upload time: 2025-08-20 16:50:57
Requires Python: >=3.11
License: MIT
Keywords: url, normalization, canonicalization, web, utilities
# tk-normalizer

[![Python](https://img.shields.io/pypi/pyversions/tk-normalizer.svg)](https://pypi.org/project/tk-normalizer/)
[![PyPI](https://img.shields.io/pypi/v/tk-normalizer.svg)](https://pypi.org/project/tk-normalizer/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

URL normalization library for creating consistent URL representations.

## Purpose

URL normalization establishes equivalence between URLs that differ only in case, scheme, `www` prefix, trailing slashes, or query parameter ordering. This library creates normalized representations of URLs for consistent storage, comparison, and analysis.

## Installation

```bash
pip install tk-normalizer
```

## Quick Start

```python
from tk_normalizer import TkNormalizer

# Simple usage - str() returns just the normalized URL
normalized = TkNormalizer("http://www.Example.com/path?b=2&a=1&utm_source=test")
print(str(normalized))  # Output: example.com/path?a=1&b=2

# Get full details with dict()
print(dict(normalized))  # Returns all fields including query_string, path, and hashes
```

## Features

### URL Normalization

The following URLs all normalize to the same form:

```
https://example.com/
http://www.example.com/
http://www.example.com
http://www.example.com/#my_search_engine_is_great
https://www.example.com/?utm_campaign=SomeGoogleCampaign
https://www.example.com/?utm_source=because&utm_campaign=SomeGoogleCampaign
```

All normalize to: `example.com`

### Normalization Process

URLs are normalized through the following steps:

- ✅ Protocol and www subdomains removed
- ✅ Lowercased
- ✅ Trailing slashes removed
- ✅ Query parameters reordered alphabetically by key
- ✅ Duplicate query parameter key/value pairs removed
- ✅ Common tracking parameters removed (utm_*, gclid, fbclid, etc.)
- ✅ Non-HTTP(S) protocols rejected
- ✅ Localhost URLs rejected
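
The steps above can be sketched with Python's standard library alone. This is an illustrative reimplementation, not tk-normalizer's actual code, and the tracking-parameter list here follows the one documented in this README rather than the library's internals:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

TRACKING_KEYS = {"gclid", "fbclid", "dclid", "_ga", "_gid", "_fbp", "_hjid",
                 "msclkid", "aff_id", "affid", "referrer", "adgroupid", "srsltid"}

def normalize(url: str) -> str:
    """Lowercase, strip scheme/www/fragment/trailing slash,
    drop tracking params, dedupe and sort the remaining query pairs."""
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https"):
        raise ValueError("non-HTTP(S) protocols are rejected")
    host = parts.hostname  # already lowercased by urlsplit
    if not host:
        raise ValueError("URL has no host")
    if host in ("localhost", "127.0.0.1"):
        raise ValueError("localhost URLs are rejected")
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.lower().rstrip("/")
    # Drop tracking params; the set comprehension dedupes exact key/value pairs
    pairs = {(k.lower(), v) for k, v in parse_qsl(parts.query)
             if k.lower() not in TRACKING_KEYS
             and not k.lower().startswith("utm_")}
    query = urlencode(sorted(pairs))  # reorder alphabetically by key
    return host + path + ("?" + query if query else "")

print(normalize("https://www.Example.com/Path/?b=2&a=1&utm_source=x"))
# example.com/path?a=1&b=2
```

Fragments are dropped implicitly: `urlsplit` separates them, and the sketch never re-attaches them.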

### Tracking Parameters Removed

The following tracking parameters are automatically removed during normalization:

- `utm_*` (all utm parameters)
- `gclid`, `fbclid`, `dclid` (click identifiers)
- `_ga`, `_gid`, `_fbp`, `_hjid` (analytics cookies)
- `msclkid` (Microsoft Ads)
- `aff_id`, `affid` (affiliate tracking)
- `referrer`, `adgroupid`, `srsltid`
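
Matching combines a prefix rule (`utm_*`) with a fixed set of exact keys. A hypothetical predicate capturing the list above (the library's internal list may differ):

```python
# Exact-match tracking keys from the list above; utm_* is a prefix rule
TRACKING_EXACT = {"gclid", "fbclid", "dclid", "_ga", "_gid", "_fbp", "_hjid",
                  "msclkid", "aff_id", "affid", "referrer", "adgroupid", "srsltid"}

def is_tracking_param(key: str) -> bool:
    k = key.lower()
    return k.startswith("utm_") or k in TRACKING_EXACT

kept = [k for k in ("utm_medium", "gclid", "page", "q") if not is_tracking_param(k)]
print(kept)  # ['page', 'q']
```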

## Advanced Usage

### Getting Full Normalization Details

```python
from tk_normalizer import TkNormalizer

normalizer = TkNormalizer("http://blog.example.com/page?b=2&a=1")

# Use str() for just the normalized URL
print(str(normalizer))  # blog.example.com/page?a=1&b=2

# Use dict() for complete normalization data
result = dict(normalizer)
print(result)
# {
#   'normalized_url': 'blog.example.com/page?a=1&b=2',
#   'parent_normalized_url': 'blog.example.com',
#   'root_normalized_url': 'example.com',
#   'query_string': 'a=1&b=2',
#   'path': '/page',
#   'normalized_url_hash': '...',
#   'parent_normalized_url_hash': '...',
#   'root_normalized_url_hash': '...'
# }
```

### Error Handling

```python
from tk_normalizer import TkNormalizer, InvalidUrlException

try:
    normalizer = TkNormalizer("not a valid url")
except InvalidUrlException as e:
    print(f"Invalid URL: {e}")
```

### Accessing Individual Components

```python
from tk_normalizer import TkNormalizer

normalizer = TkNormalizer("https://blog.example.com/path?a=1")

# Dict-like access to individual fields
print(normalizer["normalized_url"])       # blog.example.com/path?a=1
print(normalizer["parent_normalized_url"]) # blog.example.com
print(normalizer["root_normalized_url"])   # example.com
print(normalizer["query_string"])          # a=1
print(normalizer["path"])                  # /path

# Iterate over available fields
for key in normalizer:
    print(f"{key}: {normalizer[key]}")

# Get all field names
print(normalizer.keys())
```
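
The `str()`/`dict()`/iteration behavior shown above relies on Python's standard protocols. A minimal sketch of how a class can expose that interface (field values here are illustrative, not the library's internals):

```python
class NormalizedView:
    """Dict-like view: supports str(), dict(), iteration, and key access."""

    def __init__(self, fields: dict):
        self._fields = fields

    def __str__(self):
        # str() returns just the normalized URL
        return self._fields["normalized_url"]

    def __getitem__(self, key):
        return self._fields[key]

    def __iter__(self):
        return iter(self._fields)

    def keys(self):
        # keys() + __getitem__ is the mapping protocol dict() relies on
        return self._fields.keys()

view = NormalizedView({"normalized_url": "example.com/p?a=1", "path": "/p"})
print(str(view))   # example.com/p?a=1
print(dict(view))  # {'normalized_url': 'example.com/p?a=1', 'path': '/p'}
```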

## Hashing

For efficient storage and comparison, SHA-256 hashes are computed for:
- The normalized URL
- The parent normalized URL (domain without path)
- The root normalized URL (root domain without subdomains)

This provides fixed-length representations suitable for database indexing.
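
The general idea can be shown with `hashlib`; note that the exact input encoding the library hashes is an assumption here:

```python
import hashlib

def sha256_hex(text: str) -> str:
    # Hash the UTF-8 bytes; hexdigest is always 64 characters
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

for label, value in [
    ("normalized_url_hash", "blog.example.com/page?a=1&b=2"),
    ("parent_normalized_url_hash", "blog.example.com"),
    ("root_normalized_url_hash", "example.com"),
]:
    print(label, sha256_hex(value))
```

Because the digest length never varies, the hash columns can carry a fixed-width index regardless of how long the original URL was.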

## Important Caveats

While this normalization process works well for most use cases, there are some limitations:

1. **www subdomain removal**: Technically, `www.example.com` and `example.com` could serve different content, though this is rare in practice.

2. **Case sensitivity**: URLs are lowercased, but some servers are case-sensitive for paths.

3. **Tracking parameters**: New tracking parameters emerge over time and may not be in the removal list.

4. **Fragment removal**: URL fragments (#anchors) are removed, which may affect single-page applications.

## Development

### Setting Up Development Environment

```bash
# Clone the repository
git clone https://github.com/terakeet/tk-normalizer.git
cd tk-normalizer

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=tk_normalizer

# Run linting
ruff check src tests
```

### Running Tests

```bash
# Run all tests
pytest

# Run with verbose output
pytest -v

# Run specific test file
pytest tests/test_normalizer.py

# Run with coverage report
pytest --cov=tk_normalizer --cov-report=html
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Support

For issues and questions, please use the [GitHub issue tracker](https://github.com/terakeet/tk-normalizer/issues).

## Credits

Based on the URL normalization functionality from [tk-core](https://github.com/terakeet/tk-core), extracted and packaged for standalone use.
