# strip-tags
[![PyPI](https://img.shields.io/pypi/v/strip-tags.svg)](https://pypi.org/project/strip-tags/)
[![Changelog](https://img.shields.io/github/v/release/simonw/strip-tags?include_prereleases&label=changelog)](https://github.com/simonw/strip-tags/releases)
[![Tests](https://github.com/simonw/strip-tags/workflows/Test/badge.svg)](https://github.com/simonw/strip-tags/actions?query=workflow%3ATest)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/simonw/strip-tags/blob/master/LICENSE)
Strip tags from HTML, optionally from areas identified by CSS selectors.
See [llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs](https://simonwillison.net/2023/May/18/cli-tools-for-llms/) for more on this project.
## Installation
Install this tool using `pip`:
```bash
pip install strip-tags
```
## Usage
Pipe content into this tool to strip tags from it:
```bash
cat input.html | strip-tags > output.txt
```
Or pass a filename:
```bash
strip-tags -i input.html > output.txt
```
To run against just specific areas identified by CSS selectors:
```bash
strip-tags '.content' -i input.html > output.txt
```
This can be called with multiple selectors:
```bash
cat input.html | strip-tags '.content' '.sidebar' > output.txt
```
To return just the first element on the page that matches one of the selectors, use `--first`:
```bash
cat input.html | strip-tags .content --first > output.txt
```
To remove content contained by specific selectors, such as the `<nav>` section of a page, use `-r` or `--remove`:
```bash
cat input.html | strip-tags -r nav > output.txt
```
To minify whitespace - reducing multiple space and tab characters to a single space, and multiple newlines and spaces to a maximum of two newlines - add `-m` or `--minify`:
```bash
cat input.html | strip-tags -m > output.txt
```
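The whitespace rule described above can be approximated with two regular expressions. This is a minimal sketch using only the standard library; the function name `minify_whitespace` is chosen here for illustration, and the tool's own implementation may differ in edge cases.

```python
import re


def minify_whitespace(text: str) -> str:
    """Illustrative approximation of the -m/--minify whitespace rule."""
    # Runs of spaces and tabs become a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Any whitespace run containing two or more newlines becomes exactly two newlines.
    text = re.sub(r"\s*\n\s*\n\s*", "\n\n", text)
    return text


print(repr(minify_whitespace("Title \t here\n\n\n\nBody   text")))
```

A single newline is left alone in this sketch, so line breaks inside a paragraph survive while large vertical gaps shrink to one blank line.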
You can also run this command using `python -m` like this:
```bash
python -m strip_tags --help
```
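When the command is driven from a Python script rather than a shell, the options shown above can be assembled into an argument list. The helper below is hypothetical (`build_strip_tags_command` is not part of the package) and assumes that `-r` may be repeated once per selector, in the same way the README describes for `-t`.

```python
import shlex
from typing import Iterable, List, Optional


def build_strip_tags_command(
    selectors: Iterable[str] = (),
    *,
    input_path: Optional[str] = None,
    removes: Iterable[str] = (),
    keep_tags: Iterable[str] = (),
    minify: bool = False,
    first: bool = False,
) -> List[str]:
    """Hypothetical helper: assemble an argv list for the strip-tags CLI."""
    command = ["strip-tags", *selectors]
    if input_path is not None:
        command += ["-i", input_path]
    for selector in removes:
        # Assumption: -r can be given once per selector to remove.
        command += ["-r", selector]
    for tag in keep_tags:
        command += ["-t", tag]
    if minify:
        command.append("-m")
    if first:
        command.append("--first")
    return command


command = build_strip_tags_command(
    [".content"], input_path="input.html", removes=["nav"], minify=True
)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, capture_output=True, text=True, check=True)`, which avoids shell quoting issues with selectors that contain spaces or brackets.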
### Keeping the markup for specified tags
When passing content to a language model, it can sometimes be useful to leave in a subset of HTML tags - `<h1>This is the heading</h1>` for example - to provide extra hints to the model.
The `-t/--keep-tag` option can be passed multiple times to specify tags that should be kept.
This example looks at the `<header>` section of https://datasette.io/ and keeps the tags around the list items and `<h1>` elements:
```bash
curl -s https://datasette.io/ | strip-tags header -t h1 -t li
```
```html
<li>Uses</li>
<li>Documentation Docs</li>
<li>Tutorials</li>
<li>Examples</li>
<li>Plugins</li>
<li>Tools</li>
<li>News</li>
<h1>
Datasette
</h1>
Find stories in data
```
All attributes will be removed from the tags, except for the `id=` and `class=` attributes, since those may provide further useful hints to the language model.
The `href` attribute on links, the `alt` attribute on images and the `name` and `value` attributes on `meta` tags are kept as well.
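The attribute rules above amount to a small allow-list. The following is a sketch of that rule only, not the package's code: the names `ALWAYS_KEPT`, `KEPT_BY_TAG` and `filter_attributes` are made up for illustration, "links" is assumed to mean `<a>` tags, and the `all_attrs` flag mirrors the `--all-attrs` option.

```python
from typing import Dict

# Attributes kept on every tag, per the description above.
ALWAYS_KEPT = {"id", "class"}
# Extra attributes kept on specific tags, per the description above.
KEPT_BY_TAG = {
    "a": {"href"},
    "img": {"alt"},
    "meta": {"name", "value"},
}


def filter_attributes(
    tag: str, attrs: Dict[str, str], all_attrs: bool = False
) -> Dict[str, str]:
    """Return the attributes that would survive on a kept tag (illustrative)."""
    if all_attrs:
        # With all_attrs set, nothing is filtered out.
        return dict(attrs)
    allowed = ALWAYS_KEPT | KEPT_BY_TAG.get(tag, set())
    return {name: value for name, value in attrs.items() if name in allowed}


print(filter_attributes("a", {"href": "/docs", "target": "_blank", "class": "nav"}))
```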
You can also specify a bundle of tags. For example, `strip-tags -t hs` will keep the tag markup for all levels of headings.
The following bundles can be used:
<!-- [[[cog
import cog
from strip_tags.lib import BUNDLES
lines = []
for name, tags in BUNDLES.items():
    lines.append("- `-t {}`: {}".format(name, ", ".join("`<{}>`".format(tag) for tag in tags)))
cog.out("\n".join(lines))
]]] -->
- `-t hs`: `<h1>`, `<h2>`, `<h3>`, `<h4>`, `<h5>`, `<h6>`
- `-t metadata`: `<title>`, `<meta>`
- `-t structure`: `<header>`, `<nav>`, `<main>`, `<article>`, `<section>`, `<aside>`, `<footer>`
- `-t tables`: `<table>`, `<tr>`, `<td>`, `<th>`, `<thead>`, `<tbody>`, `<tfoot>`, `<caption>`, `<colgroup>`, `<col>`
- `-t lists`: `<ul>`, `<ol>`, `<li>`, `<dl>`, `<dd>`, `<dt>`
<!-- [[[end]]] -->
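To show how bundle names and individual tag names can be mixed, here is a sketch that expands a list of `-t` values into concrete tag names. The `BUNDLES` dictionary is a local copy of the list above, and `expand_keep_tags` is a hypothetical helper; the package's own expansion logic may behave differently.

```python
from typing import Iterable, List

# Local copy of the bundle table listed above.
BUNDLES = {
    "hs": ["h1", "h2", "h3", "h4", "h5", "h6"],
    "metadata": ["title", "meta"],
    "structure": ["header", "nav", "main", "article", "section", "aside", "footer"],
    "tables": ["table", "tr", "td", "th", "thead", "tbody", "tfoot", "caption", "colgroup", "col"],
    "lists": ["ul", "ol", "li", "dl", "dd", "dt"],
}


def expand_keep_tags(names: Iterable[str]) -> List[str]:
    """Expand bundle names into tags; other names pass through unchanged."""
    expanded: List[str] = []
    for name in names:
        for tag in BUNDLES.get(name, [name]):
            # Skip duplicates while preserving first-seen order.
            if tag not in expanded:
                expanded.append(tag)
    return expanded


print(expand_keep_tags(["hs", "li"]))
```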
## As a Python library
You can use `strip-tags` from Python code too. The function signature looks like this:
<!-- [[[cog
import ast
module = ast.parse(open("strip_tags/lib.py").read())
strip_tags = [
    fn for fn in module.body
    if getattr(fn, 'name', None) == 'strip_tags'
][0]
code = ast.unparse(strip_tags)
defline = code.split("\n")[0]
code = (
    ',\n    '.join(defline.split(', ')).replace(") ->", "\n) ->").replace("strip_tags(", "strip_tags(\n    ")
)
cog.out("```python\n{}\n```".format(code))
]]] -->
```python
def strip_tags(
    input: str,
    selectors: Optional[Iterable[str]]=None,
    *,
    removes: Optional[Iterable[str]]=None,
    minify: bool=False,
    first: bool=False,
    keep_tags: Optional[Iterable[str]]=None,
    all_attrs: bool=False
) -> str:
```
<!-- [[[end]]] -->
Here's an example:
```python
from strip_tags import strip_tags

html = """
<div>
<h1>This has tags</h1>

<p>And whitespace too</p>
</div>
Ignore this bit.
"""
stripped = strip_tags(html, ["div"], minify=True, keep_tags=["h1"])
print(stripped)
```
Output:
```
<h1>This has tags</h1>

And whitespace too
```
## strip-tags --help
<!-- [[[cog
import cog
from strip_tags import cli
from click.testing import CliRunner
runner = CliRunner()
result = runner.invoke(cli.cli, ["--help"])
help = result.output.replace("Usage: cli", "Usage: strip-tags")
cog.out(
    "```\n{}\n```".format(help)
)
]]] -->
```
Usage: strip-tags [OPTIONS] [SELECTORS]...

  Strip tags from HTML, optionally from areas identified by CSS selectors

  Example usage:

      cat input.html | strip-tags > output.txt

  To run against just specific areas identified by CSS selectors:

      cat input.html | strip-tags .entry .footer > output.txt

Options:
  --version             Show the version and exit.
  -r, --remove TEXT     Remove content in these selectors
  -i, --input FILENAME  Input file
  -m, --minify          Minify whitespace
  -t, --keep-tag TEXT   Keep these <tags>
  --all-attrs           Include all attributes on kept tags
  --first               First element matching the selectors
  --help                Show this message and exit.
```
<!-- [[[end]]] -->
## Development
To contribute to this tool, first check out the code. Then create a new virtual environment:
```bash
cd strip-tags
python -m venv venv
source venv/bin/activate
```
Now install the dependencies and test dependencies:
```bash
pip install -e '.[test]'
```
To run the tests:
```bash
pytest
```
## Raw data
{
"_id": null,
"home_page": "https://github.com/simonw/strip-tags",
"name": "strip-tags",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "",
"keywords": "",
"author": "Simon Willison",
"author_email": "",
"download_url": "https://files.pythonhosted.org/packages/59/22/1b50f0c98d35c7e958b080aa7947a90bd74b3dc7e72b759034727edc10e3/strip-tags-0.5.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache License, Version 2.0",
"summary": "Strip tags from HTML, optionally from areas identified by CSS selectors",
"version": "0.5.1",
"project_urls": {
"CI": "https://github.com/simonw/strip-tags/actions",
"Changelog": "https://github.com/simonw/strip-tags/releases",
"Homepage": "https://github.com/simonw/strip-tags",
"Issues": "https://github.com/simonw/strip-tags/issues"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "f5b60bf8b369ca8b07f8b74b867fdbdd2e693452ab715d726a8f7f134aee44d3",
"md5": "d393fbd17d696e9bada559c88bb12e88",
"sha256": "2ced3d245bab6cd2ea34948baabbc244e1ee734c89e65705eff0e8ac6fdef46e"
},
"downloads": -1,
"filename": "strip_tags-0.5.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "d393fbd17d696e9bada559c88bb12e88",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 11195,
"upload_time": "2023-07-09T21:53:09",
"upload_time_iso_8601": "2023-07-09T21:53:09.849379Z",
"url": "https://files.pythonhosted.org/packages/f5/b6/0bf8b369ca8b07f8b74b867fdbdd2e693452ab715d726a8f7f134aee44d3/strip_tags-0.5.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "59221b50f0c98d35c7e958b080aa7947a90bd74b3dc7e72b759034727edc10e3",
"md5": "082e61f5591611c4500ae4baf7b3e984",
"sha256": "841a158bc8f57e3a891d45132e78c1eb8fdd9b978b8a40e68028446118dedad3"
},
"downloads": -1,
"filename": "strip-tags-0.5.1.tar.gz",
"has_sig": false,
"md5_digest": "082e61f5591611c4500ae4baf7b3e984",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 11159,
"upload_time": "2023-07-09T21:53:11",
"upload_time_iso_8601": "2023-07-09T21:53:11.365078Z",
"url": "https://files.pythonhosted.org/packages/59/22/1b50f0c98d35c7e958b080aa7947a90bd74b3dc7e72b759034727edc10e3/strip-tags-0.5.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-07-09 21:53:11",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "simonw",
"github_project": "strip-tags",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "strip-tags"
}