urlfinderlib

Name	urlfinderlib
Version	0.18.6
home_page	https://github.com/ace-ecosystem/urlfinderlib
Summary	Library to find URLs and check their validity.
upload_time	2022-12-01 14:33:13
maintainer
docs_url	None
author	Matthew Wilson
requires_python
license	Apache 2.0
keywords	urlfinderlib
VCS
bugtrack_url
requirements	icalendar idna lxml pytest pytest-cov python-magic tld validators
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# urlfinderlib

This is a Python (3.6+) library for finding URLs in documents and checking their validity.

## Supported Documents

Extracts URLs from the following types of documents:

* Binary files (finds URLs within strings)
* CSV files
* HTML files
* iCalendar/vCalendar files
* PDF files
* Text files (ASCII or UTF-8)
* XML files

Every extracted URL is validated: it must contain a domain with a valid TLD (or a valid IP address) and must not contain any invalid characters.
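
The validation rule above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `looks_valid`, `KNOWN_TLDS`, and `INVALID_CHARS` are hypothetical names, the TLD list is deliberately tiny, and the set of "invalid characters" is an assumption.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny TLD list for illustration only; a real check
# needs the full, regularly updated list of TLDs.
KNOWN_TLDS = {"com", "org", "net", "io"}

# Characters treated as invalid in this sketch (an assumption, not the library's rules).
INVALID_CHARS = set(' <>"{}|\\^`')


def looks_valid(url: str) -> bool:
    """Return True if the host is an IP address or ends in a known TLD."""
    if any(ch in INVALID_CHARS for ch in url):
        return False
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return False
    if not host:
        return False
    try:
        ipaddress.ip_address(host)  # accepts IPv4 and IPv6 literals
        return True
    except ValueError:
        pass
    labels = host.split(".")
    # Require at least "name.tld", no empty labels, and a recognized TLD.
    return len(labels) >= 2 and all(labels) and labels[-1] in KNOWN_TLDS
```

With these assumptions, `looks_valid("http://192.168.1.1/x")` passes on the IP branch, while a host ending in an unknown TLD is rejected.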

## URL Permutations

This library was originally written to find both valid URLs and the obfuscated or slightly malformed URLs used by malicious actors, so that they can be used as indicators of compromise (IOCs). As such, the extracted URLs also include the following permutations:

* URL with any Unicode characters in its domain
* URL with any Unicode characters in its domain converted to their IDNA equivalent

For both domain variations, the following permutations are also returned:

* URL with its path %-encoded
* URL with its path %-decoded
* URL with encoded HTML entities in its path
* URL with decoded HTML entities in its path
* URL with its path %-decoded and HTML entities decoded
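
To make the permutations listed above concrete, here is a rough standard-library sketch. The function names `path_permutations` and `idna_domain` are hypothetical, and the library's own permutation logic may differ; the built-in `idna` codec used here implements the older IDNA 2003 rules.

```python
import html
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def path_permutations(url: str) -> set:
    """Return the URL plus variants of its path (%-encoding and HTML entities)."""
    parts = urlsplit(url)
    paths = {
        parts.path,
        quote(parts.path, safe="/%"),          # %-encoded; existing escapes left alone
        unquote(parts.path),                   # %-decoded
        html.escape(parts.path, quote=False),  # HTML entities encoded
        html.unescape(parts.path),             # HTML entities decoded
        html.unescape(unquote(parts.path)),    # %-decoded and entities decoded
    }
    return {urlunsplit(parts._replace(path=p)) for p in paths}


def idna_domain(url: str) -> str:
    """Return the URL with a Unicode hostname converted to its IDNA form."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    ascii_host = host.encode("idna").decode("ascii")
    # Simplification: any user:password@ prefix in the netloc is dropped.
    netloc = ascii_host if parts.port is None else f"{ascii_host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))
```

For a URL such as `http://example.com/a%20b&amp;c`, the returned set includes the original, `http://example.com/a b&amp;c`, `http://example.com/a%20b&c`, and `http://example.com/a b&c`.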

## Child URLs

This library also attempts to extract or decode child URLs found in the paths of URLs. The following formats are supported:

* Barracuda protected URLs
* Base64-encoded URLs found within the URL's path
* Google redirect URLs
* Mandrill/Mailchimp redirect URLs
* Outlook Safe Links URLs
* Proofpoint protected URLs
* URLs found in the URL's path query parameters
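
Two of the formats above (URLs in query parameters and Base64-encoded URLs in the path) can be illustrated with a short standard-library sketch. The helper names `query_child_urls` and `base64_child_urls` are hypothetical, and the vendor-specific formats (Barracuda, Proofpoint, Safe Links, and so on) are not covered here.

```python
import base64
import binascii
from urllib.parse import parse_qs, urlsplit


def query_child_urls(url: str) -> set:
    """Return query-parameter values that themselves look like http(s) URLs."""
    values = parse_qs(urlsplit(url).query)  # parse_qs also %-decodes the values
    return {
        v
        for vs in values.values()
        for v in vs
        if v.startswith(("http://", "https://"))
    }


def base64_child_urls(url: str) -> set:
    """Return URLs recovered by Base64-decoding individual path segments."""
    found = set()
    for segment in urlsplit(url).path.split("/"):
        if not segment:
            continue
        padded = segment + "=" * (-len(segment) % 4)  # restore stripped padding
        try:
            decoded = base64.urlsafe_b64decode(padded).decode("ascii")
        except (binascii.Error, ValueError):
            continue  # not Base64, or not ASCII text once decoded
        if decoded.startswith(("http://", "https://")):
            found.add(decoded)
    return found
```

For example, a redirect of the form `https://www.google.com/url?q=http%3A%2F%2Fexample.com%2Fpage&sa=D` yields the child URL `http://example.com/page` from its `q` parameter.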

## Basic Usage

    from urlfinderlib import find_urls
    
    with open('/path/to/file', 'rb') as f:
        print(find_urls(f.read()))

### base_url Parameter

When you are finding URLs inside an HTML file, the paths in the URLs are often relative to the file's location on the server hosting the HTML. In this case, you can pass the *base_url* parameter to extract these "relative" URLs.

    from urlfinderlib import find_urls
    
    with open('/path/to/file', 'rb') as f:
        print(find_urls(f.read(), base_url='http://example.com'))
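
To illustrate what resolving relative URLs against a base means, the sketch below uses the standard library's `urljoin`. It shows the general technique only; `resolve_links` is a hypothetical helper, and how urlfinderlib applies *base_url* internally is not confirmed here.

```python
from urllib.parse import urljoin


def resolve_links(hrefs, base_url):
    """Resolve each href against base_url; absolute hrefs are returned unchanged."""
    return [urljoin(base_url, href) for href in hrefs]


links = resolve_links(
    ["/about", "img/logo.png", "http://other.example/x"],
    "http://example.com",
)
print(links)
```

The first two entries resolve to `http://example.com/about` and `http://example.com/img/logo.png`, while the already absolute third entry is left as it is.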

Raw data

{
    "_id": null,
    "home_page": "https://github.com/ace-ecosystem/urlfinderlib",
    "name": "urlfinderlib",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "urlfinderlib",
    "author": "Matthew Wilson",
    "author_email": "dev@bytecafe.io",
    "download_url": "https://files.pythonhosted.org/packages/f7/43/bb555dc65a18849062bc69f494b90bb47da0d4553c41f747d70b693c08b9/urlfinderlib-0.18.6.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "Library to find URLs and check their validity.",
    "version": "0.18.6",
    "split_keywords": [
        "urlfinderlib"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "9aff751ab6361523cc4d6b02cab0a611",
                "sha256": "234fc41df1ecd1da0d2f1f2e55f20cecc33981b4a6cbea5ccf45d4224e40f13a"
            },
            "downloads": -1,
            "filename": "urlfinderlib-0.18.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "9aff751ab6361523cc4d6b02cab0a611",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 22471,
            "upload_time": "2022-12-01T14:33:12",
            "upload_time_iso_8601": "2022-12-01T14:33:12.639000Z",
            "url": "https://files.pythonhosted.org/packages/dc/f1/c4b845e1f02a9382bd330f9ed0124b6bab1213e6c25cb5e94acab5e0d4bf/urlfinderlib-0.18.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "3c5a6a19c2becb6b69b1699875c8367f",
                "sha256": "5e100a04459da0834f08901a6c99eee48aa94da1ea740ae82d0cebd8425c58ce"
            },
            "downloads": -1,
            "filename": "urlfinderlib-0.18.6.tar.gz",
            "has_sig": false,
            "md5_digest": "3c5a6a19c2becb6b69b1699875c8367f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 20063,
            "upload_time": "2022-12-01T14:33:13",
            "upload_time_iso_8601": "2022-12-01T14:33:13.724953Z",
            "url": "https://files.pythonhosted.org/packages/f7/43/bb555dc65a18849062bc69f494b90bb47da0d4553c41f747d70b693c08b9/urlfinderlib-0.18.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-01 14:33:13",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "ace-ecosystem",
    "github_project": "urlfinderlib",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "icalendar",
            "specs": [
                [
                    ">=",
                    "4.0.7"
                ]
            ]
        },
        {
            "name": "idna",
            "specs": [
                [
                    ">=",
                    "2.10"
                ]
            ]
        },
        {
            "name": "lxml",
            "specs": [
                [
                    ">=",
                    "4.5.2"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": []
        },
        {
            "name": "pytest-cov",
            "specs": []
        },
        {
            "name": "python-magic",
            "specs": [
                [
                    ">=",
                    "0.4.18"
                ]
            ]
        },
        {
            "name": "tld",
            "specs": [
                [
                    ">=",
                    "0.12.2"
                ]
            ]
        },
        {
            "name": "validators",
            "specs": [
                [
                    ">=",
                    "0.16.0"
                ]
            ]
        }
    ],
    "lcname": "urlfinderlib"
}
