filter-url

Name	filter-url
Version	1.2.0
home_page	None
Summary	A simple, fast, and configurable URL sensitive data filter
upload_time	2025-07-15 16:07:47
maintainer	None
docs_url	None
author	None
requires_python	>=3.10
license	MIT
keywords	url filter filtering url filtering logging
requirements	iniconfig packaging pluggy pygments pytest

filter-url
==========

[![PyPI version](https://img.shields.io/pypi/v/filter-url.svg)](https://pypi.org/project/filter-url/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/filter-url.svg)](https://pypi.org/project/filter-url/)
[![PyPI - License](https://img.shields.io/pypi/l/filter-url.svg)](https://pypi.org/project/filter-url/)
[![Coverage Status](https://coveralls.io/repos/github/alexsemenyaka/filter_url/badge.svg?branch=main)](https://coveralls.io/github/alexsemenyaka/filter_url?branch=main)
[![CI/CD Status](https://github.com/alexsemenyaka/filter_url/actions/workflows/ci.yml/badge.svg)](https://github.com/alexsemenyaka/filter_url/actions/workflows/ci.yml)

A simple, fast, and configurable Python utility to censor sensitive data (passwords, API keys, tokens) from URLs, making them safe for logging, monitoring, and debugging.

Key Features
------------

* **Comprehensive Censoring**: Censors passwords in userinfo (`user:[...]@host`), query parameter values, and parts of the URL path.
* **Flexible Rules**: Filter query parameters by exact key names or by powerful regular expressions.
* **Advanced Path Filtering**: Use regex with named capture groups to censor specific dynamic parts of a URL path while leaving the rest intact.
* **Order Preserving**: Guarantees that the order of query parameters in the output is identical to the input.
* **Logging Integration**: Provides a ready-to-use `logging.Filter` subclass for seamless integration into your application's logging setup.
* **Lightweight**: Zero external dependencies.

Installation
------------

    pip install filter-url

Quick Start
-----------

The quickest way to use the library is the standalone `filter_url()` function, which uses a default set of rules to catch common sensitive keys.

    from filter_url import filter_url

    dirty_url = "https://user:my-secret-password@example.com/data?token=abc-123-xyz"

    # Use the function with default filters
    clean_url = filter_url(dirty_url)

    print(clean_url)
    # >> https://user:[...]@example.com/data?token=[...]

Usage & Examples
----------------

### Basic Filtering (Standalone Function)

The `filter_url()` function is great for one-off tasks. You can pass your own filtering rules directly to it. If a rule is not provided, a sensible default is used.

    from filter_url import filter_url

    # Define custom rules
    custom_path_re = r'/user/(?P<user_id>\d+)/profile'

    dirty_url = "https://example.com/user/123456/profile?credit_card_number=5555"

    # Censor using a custom path regex
    clean_url = filter_url(
        url=dirty_url,
        bad_path_re=custom_path_re
    )

    print(clean_url)
    # >> https://example.com/user/[...]/profile?credit_card_number=5555
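
To make the named-group idea concrete, here is a minimal standard-library sketch of how each captured group of a path regex could be swapped for a placeholder. It is not the library's actual implementation; the helper name `redact_groups` is hypothetical, and the sketch assumes the groups do not overlap.

```python
import re


def redact_groups(path: str, pattern: str, placeholder: str = "[...]") -> str:
    """Replace every captured group of the first match with a placeholder."""
    match = re.search(pattern, path)
    if match is None:
        return path  # nothing to redact
    # Spans of all groups that took part in the match.
    spans = sorted(
        match.span(i)
        for i in range(1, match.re.groups + 1)
        if match.span(i) != (-1, -1)
    )
    # Rewrite from right to left so earlier offsets stay valid.
    for start, end in reversed(spans):
        path = path[:start] + placeholder + path[end:]
    return path


print(redact_groups("/user/123456/profile", r"/user/(?P<user_id>\d+)/profile"))
# >> /user/[...]/profile
```

Only the text inside the group is replaced, so the surrounding `/user/` and `/profile` segments stay readable in the log.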

### Advanced: Using the `FilterURL` Class for Performance

When you need to filter a large number of URLs with the same configuration, it's much more efficient to instantiate the `FilterURL` class once. This pre-compiles the regular expressions and avoids redundant work in a loop.

    from filter_url import FilterURL

    # Create the filter instance ONCE with your custom rules.
    # The regexes are compiled here.
    my_filter = FilterURL(
        bad_keys={'api_key'},
        bad_keys_re=[r'session']
    )

    urls_to_process = [
        "https://service.com/api?api_key=key-1",
        "https://service.com/api?user_session=sess-2",
        "https://service.com/api?id=3"
    ]

    # Reuse the same instance in a loop for high performance
    clean_urls = [my_filter.remove_sensitive(url) for url in urls_to_process]

    # clean_urls will be:
    # [
    #   'https://service.com/api?api_key=[...]',
    #   'https://service.com/api?user_session=[...]',
    #   'https://service.com/api?id=3'
    # ]

The class has an internal cache for filtered URLs. You can tune it or turn it off completely with the `cache_size` parameter (see the API Reference below).
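
The effect of such a cache can be illustrated with the standard library alone. The sketch below is an assumption about the general technique (an LRU cache keyed by the input URL), not a description of how `FilterURL` implements it; `make_cached` and `slow_filter` are hypothetical names used only for the demonstration.

```python
from functools import lru_cache


def make_cached(func, cache_size=512):
    """Wrap func in an LRU cache; 0 or None disables caching."""
    if not cache_size:
        return func
    return lru_cache(maxsize=cache_size)(func)


calls = []


def slow_filter(url: str) -> str:
    """Stand-in for an expensive censoring routine."""
    calls.append(url)
    return url.split("?", 1)[0]


cached = make_cached(slow_filter, cache_size=2)
cached("https://service.com/api?id=1")
cached("https://service.com/api?id=1")  # served from the cache

print(len(calls))
# >> 1
```

A larger cache trades memory for fewer repeated computations, which matters most when the same URLs appear many times.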

### Integration with Python's `logging` Module

This is the most powerful feature for real-world applications. The `URLFilter` automatically censors URLs in your logs. The filter works in two ways:
1. **(Preferred)** It looks for a `url` key in the `extra` dictionary of your logging call.
2. **(Fallback)** If `fallback=True` (the default), it searches for URLs in the positional arguments of the log message.


```python
    import logging
    import sys
    from filter_url import URLFilter

    # 1. Configure a logger

    logger = logging.getLogger('my_app')
    logger.setLevel(logging.INFO)
    if logger.hasHandlers():
        logger.handlers.clear()

    # 2. Simply add our filter. Let's use custom rules for this example

    custom_filter = URLFilter(
        bad_keys={'access_token'},
        fallback=True # Default, but shown for clarity
    )
    logger.addFilter(custom_filter)

    # 3. Use a standard Formatter. No special formatter is needed

    handler = logging.StreamHandler(sys.stdout)
    formatter = logging.Formatter('%(levelname)s: %(message)s')
    handler.setFormatter(formatter)
    logger.addHandler(handler)

    # --- Usage Examples ---

    # Case 1: (Preferred) Pass the URL via 'extra'

    logger.info(
        "User login attempt failed",
        extra={'url': "https://auth.service.com/login?access_token=12345"}
    )

    # Case 2: (Fallback) The URL is an argument in the message string

    logger.info(
        "API call to %s resulted in a 404 error.",
        "https://api.service.com/data/v1/user?password=abc"
    )

    # Case 3: No URL in the message. Nothing extra is added

    logger.info("Application started successfully.")
```

Be aware of a minor trade-off between the logging filter and the `FilterURL` class.
If each URL is logged only once, the logging filter is the better fit: it keeps your code simpler and cleaner.
If you process URLs and output them several times at different stages, censor them once in advance with the `FilterURL` class to save CPU cycles.
Filtered URLs are stored in the internal cache of `FilterURL`, which narrows this difference, but it can still be noticeable under load.

**Expected Output:**

    INFO: User login attempt failed | (URL data: https://auth.service.com/login?access_token=[...])
    INFO: API call to https://api.service.com/data/v1/user?password=[...] resulted in a 404 error. | (URL data: https://api.service.com/data/v1/user?password=[...])
    INFO: Application started successfully.

Corner Cases & Considerations
-----------------------------

* **Log String vs. Valid URL**: The primary goal of this library is to produce a human-readable, safe string for logging. The output string containing `[...]` in the userinfo (password) section is not a valid URL according to RFC standards and may fail if you try to parse it again with `urllib.parse`.
* **Performance**: For filtering a large number of URLs, always instantiate the `FilterURL` class once and reuse the instance. The standalone `filter_url()` function re-compiles regexes on every call and is less performant for batch jobs.
* **Logging Filter Precedence**: When using `URLFilter`, passing the URL in the `extra` dictionary is the preferred method. The `fallback` search only triggers if no `url` key is found in `extra`. The fallback search also costs extra CPU cycles, which may be unwanted.
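
The precedence rule can be sketched with a small `logging.Filter` subclass built only on the standard library. This is an illustration of the "extra first, fallback second" order, not the code of `URLFilter`: the class `SketchURLFilter`, the `censor` helper, and the simple URL regex are all assumptions, and unlike the real filter this sketch does not append anything to the message.

```python
import logging
import re

URL_RE = re.compile(r"https?://\S+")


def censor(url: str) -> str:
    """Stand-in censoring function: hides every query value."""
    return re.sub(r"=[^&\s]*", "=[...]", url)


class SketchURLFilter(logging.Filter):
    def __init__(self, fallback: bool = True, name: str = ""):
        super().__init__(name)
        self.fallback = fallback

    def filter(self, record: logging.LogRecord) -> bool:
        url = getattr(record, "url", None)
        if url is not None:
            # Preferred path: the URL came in via extra={'url': ...}.
            record.url = censor(url)
        elif self.fallback and isinstance(record.args, tuple):
            # Fallback path: scan string arguments for URLs (costs more CPU).
            record.args = tuple(
                URL_RE.sub(lambda m: censor(m.group(0)), arg)
                if isinstance(arg, str) else arg
                for arg in record.args
            )
        return True  # never drop the record, only rewrite it


preferred_record = logging.makeLogRecord(
    {"msg": "login failed", "url": "https://auth.service.com/login?access_token=12345"}
)
fallback_record = logging.LogRecord(
    "demo", logging.INFO, "sketch.py", 0,
    "API call to %s failed", ("https://api.service.com/data?password=abc",), None,
)

url_filter = SketchURLFilter()
url_filter.filter(preferred_record)
url_filter.filter(fallback_record)

print(preferred_record.url)
# >> https://auth.service.com/login?access_token=[...]
print(fallback_record.getMessage())
# >> API call to https://api.service.com/data?password=[...] failed
```

The fallback branch has to inspect every argument of every record, which is why passing the URL through `extra` is the cheaper option.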

API Reference
-------------

* `filter_url(url, censored, bad_keys, bad_keys_re, bad_path_re)`: A standalone function for one-off URL censoring.
     * **url:str              - (required)** the URL to censor
     * **censored:str         - (optional)** a placeholder to use instead of the redacted parts, '[...]' by default
     * **bad_keys:list        - (optional)** a list of query parameter keys that may contain sensitive data. Default:

    [ "password", "token", "key", "secret", "auth", "apikey", "credentials", ]

     * **bad_keys_re:list     - (optional)** a list of regexes matching query parameter keys that may contain sensitive data. Default:

    [ r"session", r"csrf", r".*_secret", r".*_token", r".*_key", ]

     * **bad_path_re:str      - (optional)** a regex to match against the path part of the URL; each group defined in it will be redacted. Default: None. Examples:

    custom_path_re_named = r"/api/v1/(?P<api_key>[^/]+)/resource"
    custom_path_re_simple = r"(?<=/user/)\d+(?=/delete)"

* `FilterURL(bad_keys, bad_keys_re, bad_path_re, cache_size)`: A class that holds a compiled filter configuration for efficient, repeated use.
     * **bad_keys:list, bad_keys_re:list, bad_path_re:str** have the same meaning and defaults as for filter\_url() (see above)
     * **cache_size:int       - (optional)** Size of the cache to keep filtered URLs, 0 or None means no caching. Default: 512
  * `.remove_sensitive(url, censored)`: The method that performs the censoring.
     * **censored:str         - (optional)** a placeholder to use instead of the redacted parts, '[...]' by default
* `URLFilter(bad_keys, bad_keys_re, bad_path_re, fmt, url_filter_instance, fallback, cache_size, name)`: A `logging.Filter` subclass for easy integration with Python's logging module.
     * **bad_keys:list, bad_keys_re:list, bad_path_re:str** are the same as for filter\_url() (see above)
     * **fmt:str                       - (optional)** Format used to append the filtered URL to the log message, default: ' | (URL={filtered\_url})' ({filtered\_url} will be replaced with your filtered URL)
     * **url_filter_instance:FilterURL - (optional)** Pre-configured instance of a FilterURL-like class to use for filtering. Default: None (will be created by the filter)
     * **fallback:bool                 - (optional)** Whether to look for URLs in the message arguments when no URL is passed explicitly with extra={'url':...}. Default: True
     * **cache_size:int                - (optional)** Size of the cache to keep filtered URLs, 0 or None means no caching. Default: 512
     * **name:str                      - (optional)** The name of the filter (inherited from the logging.Filter)
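
As a rough illustration of how the default `bad_keys` and `bad_keys_re` lists above might be applied, the following standard-library sketch censors query values while keeping parameter order. It is not the library's code: `is_sensitive` and `censor_query` are hypothetical helpers, matching is assumed to be case-sensitive, and regexes are applied with `re.search` because the earlier example shows `r'session'` catching `user_session`.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Defaults copied from the API reference above.
BAD_KEYS = {"password", "token", "key", "secret", "auth", "apikey", "credentials"}
BAD_KEYS_RE = [
    re.compile(p)
    for p in (r"session", r"csrf", r".*_secret", r".*_token", r".*_key")
]


def is_sensitive(key: str) -> bool:
    """Exact-name lookup first, then the regex list."""
    return key in BAD_KEYS or any(rx.search(key) for rx in BAD_KEYS_RE)


def censor_query(url: str, placeholder: str = "[...]") -> str:
    """Censor sensitive query values, keeping parameters in input order."""
    parts = urlsplit(url)
    items = []
    for item in parts.query.split("&"):
        key, sep, _value = item.partition("=")
        if sep and is_sensitive(key):
            items.append(f"{key}={placeholder}")
        else:
            items.append(item)  # untouched, including its position
    return urlunsplit(parts._replace(query="&".join(items)))


print(censor_query("https://service.com/api?id=3&user_session=sess-2&api_key=key-1"))
# >> https://service.com/api?id=3&user_session=[...]&api_key=[...]
```

Walking the raw query string instead of building a dictionary is what keeps the parameter order and any repeated keys intact.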

License
-------

This project is licensed under the MIT License.

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "filter-url",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "url, filter, filtering, URL filtering, logging",
    "author": null,
    "author_email": "Alex Semenyaka <alex.semenyaka@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/ca/38/b0243052d7f287f219bd47339f523ff97e3b38b8c7902b6576aa0866e67f/filter_url-1.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A simple, fast, and configurable URL sensitive data filter",
    "version": "1.2.0",
    "project_urls": {
        "Homepage": "https://github.com/alexsemenyaka/filter_url",
        "Issues": "https://github.com/alexsemenyaka/filter_url/issues",
        "Repository": "https://github.com/alexsemenyaka/filter_url"
    },
    "split_keywords": [
        "url",
        " filter",
        " filtering",
        " url filtering",
        " logging"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "0b9ae38227deafaa6934017f94980286727548e4f4f44b0497cdb4be39649e2e",
                "md5": "4c14f13da5c34b02335cac8483b805af",
                "sha256": "37e47a9170d7bb7d2eb1f11bd9afc9d1d33fba9413b4ba8c503604fae853bf95"
            },
            "downloads": -1,
            "filename": "filter_url-1.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "4c14f13da5c34b02335cac8483b805af",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 9566,
            "upload_time": "2025-07-15T16:07:47",
            "upload_time_iso_8601": "2025-07-15T16:07:47.149530Z",
            "url": "https://files.pythonhosted.org/packages/0b/9a/e38227deafaa6934017f94980286727548e4f4f44b0497cdb4be39649e2e/filter_url-1.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "ca38b0243052d7f287f219bd47339f523ff97e3b38b8c7902b6576aa0866e67f",
                "md5": "e39d07221feda451fe039cb42455bb32",
                "sha256": "d0138995c96917aa75048227d714e0eee849dfa4fdff28a918af3f228403a66e"
            },
            "downloads": -1,
            "filename": "filter_url-1.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "e39d07221feda451fe039cb42455bb32",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 11233,
            "upload_time": "2025-07-15T16:07:47",
            "upload_time_iso_8601": "2025-07-15T16:07:47.954222Z",
            "url": "https://files.pythonhosted.org/packages/ca/38/b0243052d7f287f219bd47339f523ff97e3b38b8c7902b6576aa0866e67f/filter_url-1.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-15 16:07:47",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "alexsemenyaka",
    "github_project": "filter_url",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "requirements": [
        {
            "name": "iniconfig",
            "specs": [
                [
                    "==",
                    "2.1.0"
                ]
            ]
        },
        {
            "name": "packaging",
            "specs": [
                [
                    "==",
                    "25.0"
                ]
            ]
        },
        {
            "name": "pluggy",
            "specs": [
                [
                    "==",
                    "1.6.0"
                ]
            ]
        },
        {
            "name": "pygments",
            "specs": [
                [
                    "==",
                    "2.19.2"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    "==",
                    "8.4.1"
                ]
            ]
        }
    ],
    "lcname": "filter-url"
}
