bleach-extras

Name	bleach-extras
Version	0.2.1
home_page	http://github.com/jvanasco/bleach_extras
Summary	some extensions for bleach
upload_time	2023-06-09 21:19:34
maintainer
docs_url	None
author	Jonathan Vanasco
requires_python
license	MIT License
keywords	bleach html-sanitizing
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

![Python package](https://github.com/jvanasco/bleach_extras/workflows/Python%20package/badge.svg)

`bleach_extras` is a package of *unofficial* "extras" and utilities designed for
use with the `bleach` library.

The first utility is `TagTreeFilter`, which is used by `clean_strip_content`
and `cleaner_factory__strip_content`.

# Compatibility

`bleach_extras` currently requires `bleach>=3.2.1` and `bleach<5`.
Earlier versions of `bleach` have security concerns; later versions are not
compatible due to API changes (future support is planned).
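
A minimal sketch of checking that range at runtime, assuming plain numeric
version strings. `bleach_version_supported` is a hypothetical helper, not part
of `bleach_extras`, and it ignores pre-release suffixes such as `rc1`:

```python
import re


def bleach_version_supported(version):
    """Return True if `version` satisfies bleach>=3.2.1 and bleach<5."""
    # Keep only the leading numeric release segment, e.g. "4.1.0" from "4.1.0.post1".
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return False
    release = tuple(int(part) for part in match.group(0).split("."))
    # Tuple comparison orders releases component by component.
    return (3, 2, 1) <= release < (5,)


print(bleach_version_supported("4.1.0"))  # True
print(bleach_version_supported("5.0.0"))  # False
```

In practice the string could come from `importlib.metadata.version("bleach")`.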


# `TagTreeFilter`, `clean_strip_content`, `cleaner_factory__strip_content`

`clean_strip_content` is the counterpart to `bleach.clean`; the only intended
difference is that it can strip a tag's entire content tree, not just the tag
node itself.  `cleaner_factory__strip_content` is a factory function used to
create configured `bleach.Cleaner` instances.

`bleach` has a `strip` flag that controls how "unsafe" tags are handled:

`strip = False` renders the tags as escaped HTML entities, as in this
replacement:

	- foo.<div>1<script>alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");</script>2</div>.bar
	+ foo.<div>1&lt;script&gt;alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");&lt;/script&gt;2</div>.bar
	
`strip = True` strips the tags, but leaves their contents in place as plain text:

	- foo.<div>1<script>alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");</script>2</div>.bar
	+ foo.<div>1alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");2</div>.bar
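
To make the two modes concrete, here is a toy model built on the standard
library's `html.parser`. It is only a sketch of the behavior described above:
`ToySanitizer` and `toy_clean` are hypothetical names, the code is not how
`bleach` is implemented, it does not sanitize attributes, and it must not be
used as a real sanitizer. A shortened payload stands in for the one above.

```python
from html import escape
from html.parser import HTMLParser


class ToySanitizer(HTMLParser):
    """Toy model of the two `strip` modes; NOT bleach's implementation."""

    def __init__(self, allowed, strip):
        super().__init__(convert_charrefs=False)
        self.allowed = set(allowed)
        self.strip = strip
        self.out = []

    def _emit_tag(self, name, raw):
        if name in self.allowed:
            self.out.append(raw)  # allowed tag: pass through unchanged
        elif not self.strip:
            self.out.append(escape(raw, quote=False))  # escape the tag markup
        # strip=True: drop the tag itself; its contents still arrive as data

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit_tag(tag, "</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)  # keep existing entities as written

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def toy_clean(text, allowed, strip):
    parser = ToySanitizer(allowed, strip)
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


dangerous = 'foo.<div>1<script>alert("x");</script>2</div>.bar'
print(toy_clean(dangerous, ["div"], strip=False))
# foo.<div>1&lt;script&gt;alert("x");&lt;/script&gt;2</div>.bar
print(toy_clean(dangerous, ["div"], strip=True))
# foo.<div>1alert("x");2</div>.bar
```

In both modes the script body survives as visible text, which is the gap
`clean_strip_content` is meant to close.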

Many users of `bleach` want to remove both the tag and contents of unsafe tags
for a variety of reasons, such as:

* escaping the tags makes the text safe, but unreadable
* leaving the tags' content without the tags negatively affects readability and
  comprehension
* leaving the tags' content allows a malicious user to still have some sort of
  fallback payload which is displayed

`clean_strip_content` is a function that mimics `bleach.clean`, with one key difference in how it works:

* tags destined for content stripping are fed into a `Cleaner` instance as allowed
* the tags are stripped during the filter process via `TagTreeFilter`

With `div` as the only allowed tag, the expected transformation is:

	- foo.<div>1<script>alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");</script>2</div>.bar
	+ foo.<div>12</div>.bar

Look at that! All of the evil payload is gone, including the bitcoin wallet address
that f---- spammers tried to slip through.

## Why do this filtering with `bleach` and not something else?

Parsing/tokenizing HTML is not very efficient. Performing this outside of `bleach`
would require performing these operations on the HTML fragments at least twice.

`bleach`'s design encodes/strips 'unsafe' tags during the
parsing/tokenizing process, before the plugin filtering process starts. In order
to filter the tags out correctly, they must be allowed during the generation of
the DOM tree, then removed during the filter step. This trips a lot of people up;
offering this in a public library with tests that can grow is ideal.
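
The allow-then-remove idea can be sketched on a plain token stream. The dict
layout below imitates the tokens an `html5lib` tree walker yields, which is an
assumption made for illustration; `strip_tag_trees` and `render` are
hypothetical helpers and are not the actual `TagTreeFilter` code.

```python
def strip_tag_trees(tokens, tags_strip_content=("script", "style")):
    """Yield tokens, dropping the named tags and everything inside them.

    Assumes a balanced stream (every StartTag has a matching EndTag),
    as you would get when walking an already-built tree.
    """
    depth = 0  # number of stripped tags currently open
    for token in tokens:
        kind, name = token["type"], token.get("name")
        if kind == "StartTag" and name in tags_strip_content:
            depth += 1
            continue
        if kind == "EndTag" and name in tags_strip_content:
            depth = max(depth - 1, 0)
            continue
        if depth == 0:
            yield token  # outside any stripped subtree: keep it


def render(tokens):
    """Tiny serializer for the demo; handles only the token types used here."""
    parts = []
    for token in tokens:
        if token["type"] == "Characters":
            parts.append(token["data"])
        elif token["type"] == "StartTag":
            parts.append("<%s>" % token["name"])
        elif token["type"] == "EndTag":
            parts.append("</%s>" % token["name"])
    return "".join(parts)


# The parser was told that `script` is allowed, so it appears as real tags.
tokens = [
    {"type": "Characters", "data": "foo."},
    {"type": "StartTag", "name": "div", "data": {}},
    {"type": "Characters", "data": "1"},
    {"type": "StartTag", "name": "script", "data": {}},
    {"type": "Characters", "data": 'alert("x");'},
    {"type": "EndTag", "name": "script"},
    {"type": "Characters", "data": "2"},
    {"type": "EndTag", "name": "div"},
    {"type": "Characters", "data": ".bar"},
]
print(render(strip_tag_trees(tokens)))
# foo.<div>12</div>.bar
```

If `script` had been escaped or stripped at parse time, the filter would only
ever see its body as ordinary text and could not tell it apart from safe content.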


Example:

	import bleach
	import bleach_extras

	dangerous = """foo.<div>1<script>alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");</script>2</div>.bar"""

	print(bleach.clean(dangerous, tags=['div', ], strip=False))
	# foo.<div>1&lt;script&gt;alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");&lt;/script&gt;2</div>.bar

	print(bleach.clean(dangerous, tags=['div', ], strip=True))
	# foo.<div>1alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");2</div>.bar

	print(bleach_extras.clean_strip_content(dangerous, tags=['div'], ))
	# foo.<div>12</div>.bar

	cleaner = bleach_extras.cleaner_factory__strip_content(tags=['div'],)
	print(cleaner.clean(dangerous))
	# foo.<div>12</div>.bar

	print(bleach_extras.clean_strip_content(dangerous, tags=['div', ], strip=True, ))
	# foo.<div>12</div>.bar

## Custom replacement of stripped nodes

Maybe you need to replace the evil content with a warning. This "extra" has you
covered!

	dangerous2 = """foo.<div>1<script>alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");<iframe>iiffrraammee</iframe></script>2</div>.bar"""

	class IFrameFilter2(bleach_extras.TagTreeFilter):
		tags_strip_content = ('script', 'style', 'iframe')
		tag_replace_string = "&lt;unsafe garbage/&gt;"

	print(bleach_extras.clean_strip_content(dangerous2, tags=['div', ], filters=[IFrameFilter2, ]))
	# foo.<div>1&amp;lt;unsafe garbage/&amp;gt;2</div>.bar
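
The doubled escaping in the output above (`&amp;lt;` rather than `&lt;`) is
what you would expect if the replacement string is inserted as plain text and
escaped again on output. That reading is an assumption based on the output
shown, not something confirmed from the `bleach_extras` source. A minimal
sketch, using the standard library's `html.escape` as a stand-in for the
output escaping step (`serialize_text` is a hypothetical helper):

```python
from html import escape


def serialize_text(text):
    """Stand-in for output escaping; an assumption, not bleach's serializer."""
    return escape(text, quote=False)


pre_escaped = "&lt;unsafe garbage/&gt;"  # value used in the example above
plain = "<unsafe garbage/>"              # hypothetical unescaped alternative

print(serialize_text(pre_escaped))  # &amp;lt;unsafe garbage/&amp;gt;
print(serialize_text(plain))        # &lt;unsafe garbage/&gt;
```

If that assumption holds, an unescaped `tag_replace_string` would display as
the literal text `<unsafe garbage/>` in a browser; verify against your
installed version before relying on it.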

Raw data

            {
    "_id": null,
    "home_page": "http://github.com/jvanasco/bleach_extras",
    "name": "bleach-extras",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "bleach html-sanitizing",
    "author": "Jonathan Vanasco",
    "author_email": "jonathan@findmeon.com",
    "download_url": "https://files.pythonhosted.org/packages/9d/32/66ca245a0bcc3a28542137348642149603aad4d5faafa21da6c832283cfb/bleach_extras-0.2.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "some extensions for bleach",
    "version": "0.2.1",
    "project_urls": {
        "Homepage": "http://github.com/jvanasco/bleach_extras"
    },
    "split_keywords": [
        "bleach",
        "html-sanitizing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9d3266ca245a0bcc3a28542137348642149603aad4d5faafa21da6c832283cfb",
                "md5": "27d4224af1574096b46449091fbf4dfd",
                "sha256": "d3d10961c4376d93db188b0c054ae0a970a82eb799db4d5328b7f90607d29ab9"
            },
            "downloads": -1,
            "filename": "bleach_extras-0.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "27d4224af1574096b46449091fbf4dfd",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 9464,
            "upload_time": "2023-06-09T21:19:34",
            "upload_time_iso_8601": "2023-06-09T21:19:34.336371Z",
            "url": "https://files.pythonhosted.org/packages/9d/32/66ca245a0bcc3a28542137348642149603aad4d5faafa21da6c832283cfb/bleach_extras-0.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-06-09 21:19:34",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "jvanasco",
    "github_project": "bleach_extras",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "tox": true,
    "lcname": "bleach-extras"
}
