strip-tags

- Name: strip-tags
- Version: 0.5.1
- Home page: https://github.com/simonw/strip-tags
- Summary: Strip tags from HTML, optionally from areas identified by CSS selectors
- Upload time: 2023-07-09 21:53:11
- Author: Simon Willison
- Requires Python: >=3.7
- License: Apache License, Version 2.0
- Requirements: No requirements were recorded.

# strip-tags

[![PyPI](https://img.shields.io/pypi/v/strip-tags.svg)](https://pypi.org/project/strip-tags/)
[![Changelog](https://img.shields.io/github/v/release/simonw/strip-tags?include_prereleases&label=changelog)](https://github.com/simonw/strip-tags/releases)
[![Tests](https://github.com/simonw/strip-tags/workflows/Test/badge.svg)](https://github.com/simonw/strip-tags/actions?query=workflow%3ATest)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/simonw/strip-tags/blob/master/LICENSE)

Strip tags from HTML, optionally from areas identified by CSS selectors

See [llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs](https://simonwillison.net/2023/May/18/cli-tools-for-llms/) for more on this project.

## Installation

Install this tool using `pip`:
```bash
pip install strip-tags
```
## Usage

Pipe content into this tool to strip tags from it:
```bash
cat input.html | strip-tags > output.txt
```
Or pass a filename:
```bash
strip-tags -i input.html > output.txt
```
To run against just specific areas identified by CSS selectors:
```bash
strip-tags '.content' -i input.html > output.txt
```
This can be called with multiple selectors:
```bash
cat input.html | strip-tags '.content' '.sidebar' > output.txt
```
To return just the first element on the page that matches one of the selectors, use `--first`:
```bash
cat input.html | strip-tags .content --first > output.txt
```
To remove content contained by specific selectors, such as the `<nav>` section of a page, use `-r` or `--remove`:
```bash
cat input.html | strip-tags -r nav > output.txt
```
To minify whitespace, add `-m` or `--minify`. This reduces runs of space and tab characters to a single space, and runs of newlines and spaces to a maximum of two newlines:
```bash
cat input.html | strip-tags -m > output.txt
```
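The whitespace rules described above can be approximated in a few lines of Python. This is a minimal sketch using only the standard library, not the tool's actual implementation; the function name `minify_whitespace` and the final `strip()` call are assumptions made for illustration.

```python
import re


def minify_whitespace(text: str) -> str:
    """Approximate the whitespace rules described for -m/--minify (hypothetical helper)."""
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse any whitespace run containing two or more newlines
    # into exactly two newlines (a single blank line).
    text = re.sub(r"\s*\n\s*\n\s*", "\n\n", text)
    # Assumption: also trim leading and trailing whitespace.
    return text.strip()


sample = "Title\t\t here\n\n\n\n   Body   text\n"
print(repr(minify_whitespace(sample)))
# 'Title here\n\nBody text'
```

Single newlines are left alone, so line-oriented text keeps its shape while large gaps shrink to one blank line.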
You can also run this command using `python -m` like this:
```bash
python -m strip_tags --help
```
### Keeping the markup for specified tags

When passing content to a language model, it can sometimes be useful to leave in a subset of HTML tags - `<h1>This is the heading</h1>` for example - to provide extra hints to the model.

The `-t/--keep-tag` option can be passed multiple times to specify tags that should be kept.

This example looks at the `<header>` section of https://datasette.io/ and keeps the tags around the list items and `<h1>` elements:

```bash
curl -s https://datasette.io/ | strip-tags header -t h1 -t li
```
```html
<li>Uses</li>
<li>Documentation Docs</li>
<li>Tutorials</li>
<li>Examples</li>
<li>Plugins</li>
<li>Tools</li>
<li>News</li>
<h1>
    Datasette
</h1>
Find stories in data
```
All attributes will be removed from the tags, except for the `id=` and `class=` attributes, since those may provide further useful hints to the language model.

The `href` attribute on links, the `alt` attribute on images and the `name` and `value` attributes on `meta` tags are kept as well.
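To make the attribute rules concrete, here is a rough standard-library sketch of the same idea: drop every tag except the kept ones, and filter the attributes on those. It is an illustration under stated assumptions, not how `strip-tags` itself is implemented; the names `KeepTagsSketch`, `keep_tags_sketch`, `GLOBAL_ATTRS` and `PER_TAG_ATTRS` are all hypothetical.

```python
from html import escape
from html.parser import HTMLParser

# Attributes the text above describes as kept (hypothetical constants).
GLOBAL_ATTRS = {"id", "class"}
PER_TAG_ATTRS = {"a": {"href"}, "img": {"alt"}, "meta": {"name", "value"}}


class KeepTagsSketch(HTMLParser):
    """Drop every tag except those in keep_tags, filtering their attributes."""

    def __init__(self, keep_tags):
        super().__init__(convert_charrefs=True)
        self.keep_tags = set(keep_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.keep_tags:
            return  # tag is stripped; its text content is still collected
        allowed = GLOBAL_ATTRS | PER_TAG_ATTRS.get(tag, set())
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in allowed
        )
        self.parts.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in self.keep_tags:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        self.parts.append(data)


def keep_tags_sketch(html, keep_tags):
    parser = KeepTagsSketch(keep_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(keep_tags_sketch(
    '<p><a href="/docs" id="d" style="color:red" target="_blank">Docs</a></p>',
    ["a"],
))
# <a href="/docs" id="d">Docs</a>
```

The sketch ignores CSS selectors and does not re-escape text content, so treat it only as a picture of which attributes survive.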

You can also specify a bundle of tags. For example, `strip-tags -t hs` will keep the tag markup for all levels of headings.

The following bundles can be used:

<!-- [[[cog
import cog
from strip_tags.lib import BUNDLES
lines = []
for name, tags in BUNDLES.items():
    lines.append("- `-t {}`: {}".format(name, ", ".join("`<{}>`".format(tag) for tag in tags)))
cog.out("\n".join(lines))
]]] -->
- `-t hs`: `<h1>`, `<h2>`, `<h3>`, `<h4>`, `<h5>`, `<h6>`
- `-t metadata`: `<title>`, `<meta>`
- `-t structure`: `<header>`, `<nav>`, `<main>`, `<article>`, `<section>`, `<aside>`, `<footer>`
- `-t tables`: `<table>`, `<tr>`, `<td>`, `<th>`, `<thead>`, `<tbody>`, `<tfoot>`, `<caption>`, `<colgroup>`, `<col>`
- `-t lists`: `<ul>`, `<ol>`, `<li>`, `<dl>`, `<dd>`, `<dt>`
<!-- [[[end]]] -->
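One way to picture bundle handling is as a lookup that expands bundle names and passes ordinary tag names through unchanged. The sketch below copies the bundle contents from the list above into a local dictionary; the helper `expand_keep_tags` is hypothetical, and the de-duplication and ordering behaviour are assumptions rather than documented behaviour of the tool.

```python
# Bundle names and tags copied from the list above.
BUNDLES = {
    "hs": ("h1", "h2", "h3", "h4", "h5", "h6"),
    "metadata": ("title", "meta"),
    "structure": ("header", "nav", "main", "article", "section", "aside", "footer"),
    "tables": ("table", "tr", "td", "th", "thead", "tbody", "tfoot",
               "caption", "colgroup", "col"),
    "lists": ("ul", "ol", "li", "dl", "dd", "dt"),
}


def expand_keep_tags(values):
    """Expand bundle names into tags; plain tag names pass through (hypothetical helper)."""
    expanded = []
    for value in values:
        for tag in BUNDLES.get(value, (value,)):
            if tag not in expanded:  # keep first-seen order, skip duplicates
                expanded.append(tag)
    return expanded


print(expand_keep_tags(["hs", "li", "h1"]))
# ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li']
```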

## As a Python library

You can use `strip-tags` from Python code too. The function signature looks like this:

<!-- [[[cog
import ast
module = ast.parse(open("strip_tags/lib.py").read())
strip_tags = [
    fn for fn in module.body
    if getattr(fn, 'name', None) == 'strip_tags'
][0]
code = ast.unparse(strip_tags)
defline = code.split("\n")[0]
code = (
    ',\n    '.join(defline.split(', ')).replace(") ->", "\n) ->").replace("strip_tags(", "strip_tags(\n    ")
)
cog.out("```python\n{}\n```".format(code))
]]] -->
```python
def strip_tags(
    input: str,
    selectors: Optional[Iterable[str]]=None,
    *,
    removes: Optional[Iterable[str]]=None,
    minify: bool=False,
    first: bool=False,
    keep_tags: Optional[Iterable[str]]=None,
    all_attrs: bool=False
) -> str:
```
<!-- [[[end]]] -->

Here's an example:
```python
from strip_tags import strip_tags

html = """
<div>
<h1>This has tags</h1>

<p>And whitespace too</p>
</div>
Ignore this bit.
"""
stripped = strip_tags(html, ["div"], minify=True, keep_tags=["h1"])
print(stripped)
```
Output:
```
<h1>This has tags</h1>

And whitespace too
```

## strip-tags --help

<!-- [[[cog
import cog
from strip_tags import cli
from click.testing import CliRunner
runner = CliRunner()
result = runner.invoke(cli.cli, ["--help"])
help = result.output.replace("Usage: cli", "Usage: strip-tags")
cog.out(
    "```\n{}\n```".format(help)
)
]]] -->
```
Usage: strip-tags [OPTIONS] [SELECTORS]...

  Strip tags from HTML, optionally from areas identified by CSS selectors

  Example usage:

      cat input.html | strip-tags > output.txt

  To run against just specific areas identified by CSS selectors:

      cat input.html | strip-tags .entry .footer > output.txt

Options:
  --version             Show the version and exit.
  -r, --remove TEXT     Remove content in these selectors
  -i, --input FILENAME  Input file
  -m, --minify          Minify whitespace
  -t, --keep-tag TEXT   Keep these <tags>
  --all-attrs           Include all attributes on kept tags
  --first               First element matching the selectors
  --help                Show this message and exit.

```
<!-- [[[end]]] -->

## Development

To contribute to this tool, first check out the code. Then create a new virtual environment:
```bash
cd strip-tags
python -m venv venv
source venv/bin/activate
```
Now install the dependencies and test dependencies:
```bash
pip install -e '.[test]'
```
To run the tests:
```bash
pytest
```

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/simonw/strip-tags",
    "name": "strip-tags",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "",
    "author": "Simon Willison",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/59/22/1b50f0c98d35c7e958b080aa7947a90bd74b3dc7e72b759034727edc10e3/strip-tags-0.5.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache License, Version 2.0",
    "summary": "Strip tags from HTML, optionally from areas identified by CSS selectors",
    "version": "0.5.1",
    "project_urls": {
        "CI": "https://github.com/simonw/strip-tags/actions",
        "Changelog": "https://github.com/simonw/strip-tags/releases",
        "Homepage": "https://github.com/simonw/strip-tags",
        "Issues": "https://github.com/simonw/strip-tags/issues"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f5b60bf8b369ca8b07f8b74b867fdbdd2e693452ab715d726a8f7f134aee44d3",
                "md5": "d393fbd17d696e9bada559c88bb12e88",
                "sha256": "2ced3d245bab6cd2ea34948baabbc244e1ee734c89e65705eff0e8ac6fdef46e"
            },
            "downloads": -1,
            "filename": "strip_tags-0.5.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d393fbd17d696e9bada559c88bb12e88",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 11195,
            "upload_time": "2023-07-09T21:53:09",
            "upload_time_iso_8601": "2023-07-09T21:53:09.849379Z",
            "url": "https://files.pythonhosted.org/packages/f5/b6/0bf8b369ca8b07f8b74b867fdbdd2e693452ab715d726a8f7f134aee44d3/strip_tags-0.5.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "59221b50f0c98d35c7e958b080aa7947a90bd74b3dc7e72b759034727edc10e3",
                "md5": "082e61f5591611c4500ae4baf7b3e984",
                "sha256": "841a158bc8f57e3a891d45132e78c1eb8fdd9b978b8a40e68028446118dedad3"
            },
            "downloads": -1,
            "filename": "strip-tags-0.5.1.tar.gz",
            "has_sig": false,
            "md5_digest": "082e61f5591611c4500ae4baf7b3e984",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 11159,
            "upload_time": "2023-07-09T21:53:11",
            "upload_time_iso_8601": "2023-07-09T21:53:11.365078Z",
            "url": "https://files.pythonhosted.org/packages/59/22/1b50f0c98d35c7e958b080aa7947a90bd74b3dc7e72b759034727edc10e3/strip-tags-0.5.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-07-09 21:53:11",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "simonw",
    "github_project": "strip-tags",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "strip-tags"
}
        