mkdocs-anti-ai-scraper-plugin

Name: mkdocs-anti-ai-scraper-plugin
Version: 0.1.0
Summary: Make it slightly harder for bots to steal your content
Author: six-two
License: MIT
Requires Python: >=3.9
Upload time: 2025-08-28 18:50:01
# MkDocs Anti AI Scraper Plugin

This plugin tries to prevent AI scrapers from easily ingesting your website's contents.
It is probably implemented pretty badly and, by design, it can be bypassed by anyone who invests a bit of time, but it is still probably better than nothing.

## Installation

Install the plugin with `pip`:
```bash
pip install mkdocs-anti-ai-scraper-plugin
```

Then add the plugin to your `mkdocs.yml`:
```yaml
plugins:
- search
- anti_ai_scraper
```

Or with all config options:
```yaml
plugins:
- search
- anti_ai_scraper:
    robots_txt: True
    sitemap_xml: True
    encode_html: True
    debug: False
```

## Implemented Techniques

Technique | Scraper Protection | Impact on human visitors | Enabled by default
--- | --- | --- | ---
Add robots.txt | weak | none | yes
Remove sitemap.xml | very weak | none | yes
Encode HTML | only against simple HTML-parser-based scrapers | slows down page loading, may break page events | yes

### Add robots.txt

This technique is enabled by default, and can be disabled by setting the option `robots_txt: False` in `mkdocs.yml`.
If enabled, it adds a `robots.txt` with the following contents to the output directory:
```
User-agent: *
Disallow: /
```
This hints to crawlers that they should not crawl your site.
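
For illustration, emitting such a file from a MkDocs plugin only takes a few lines in the `on_post_build` hook. The sketch below is not the plugin's actual source, just a minimal example of the idea (the class and option names are made up):

```python
# Minimal sketch, not the plugin's real code: write a blocking robots.txt
# into the built site directory after MkDocs finishes the build.
import os

from mkdocs.config import config_options
from mkdocs.plugins import BasePlugin


class RobotsTxtSketch(BasePlugin):
    config_scheme = (
        ("robots_txt", config_options.Type(bool, default=True)),
    )

    def on_post_build(self, config, **kwargs):
        if self.config["robots_txt"]:
            # config["site_dir"] is the output directory of the build
            path = os.path.join(config["site_dir"], "robots.txt")
            with open(path, "w") as f:
                f.write("User-agent: *\nDisallow: /\n")
```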

This technique does not hinder normal users from using the site at all.
However, the `robots.txt` does not enforce anything.
It just tells well-behaved bots how you would like them to behave.
Many AI bots may just ignore it ([Source](https://www.tomshardware.com/tech-industry/artificial-intelligence/several-ai-companies-said-to-be-ignoring-robots-dot-txt-exclusion-scraping-content-without-permission-report)).

### Remove sitemap.xml

This technique is enabled by default, and can be disabled by setting the option `sitemap_xml: False` in `mkdocs.yml`.
If enabled, it removes the `sitemap.xml` and `sitemap.xml.gz` files.
This prevents leaking the paths to pages not referenced by your navigation.
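
For illustration, deleting these files after the build could look roughly like this (a sketch of the idea, not the plugin's actual code):

```python
# Minimal sketch, not the plugin's real code: remove the sitemap files
# that MkDocs generated in the output directory.
from pathlib import Path


def remove_sitemaps(site_dir: str) -> None:
    for name in ("sitemap.xml", "sitemap.xml.gz"):
        # missing_ok avoids an error if MkDocs did not generate the file
        (Path(site_dir) / name).unlink(missing_ok=True)
```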

### Encode HTML

This technique is enabled by default, and can be disabled by setting the option `encode_html: False` in `mkdocs.yml`.
If enabled, it encodes (zip + ASCII85) each page's contents and decodes them again in the user's browser with JavaScript.
This obscures the page contents from simple scrapers that just download and parse your HTML.
It will not work against bots that use remote-controlled browsers (Selenium or similar technology).

The decoding takes some time, and browser events (like `onload`) will fire before the page is decoded.
This may break functionality that listens to these events and expects the page content to already be present when they fire.
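
The exact pipeline is not spelled out here, but "zip + ASCII85" roughly corresponds to `zlib` compression plus `base64.a85encode` in Python's standard library. The round trip looks like this (the plugin's real output format and the JavaScript decoder may differ):

```python
# Rough illustration of the zip + ASCII85 round trip; the plugin's real
# output format and embedded JavaScript decoder may differ.
import base64
import zlib

html = "<main><h1>Example page</h1><p>Hello</p></main>"

# Build time: compress the page contents and encode them as printable ASCII85
encoded = base64.a85encode(zlib.compress(html.encode("utf-8"))).decode("ascii")

# In the browser this step is done in JavaScript; the inverse in Python is:
decoded = zlib.decompress(base64.a85decode(encoded)).decode("utf-8")
assert decoded == html
```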

## Planned Techniques

- Remove sitemap.xml(.gz): just obscures things a bit, the nav will still point to most pages.
- Encode the page contents and decode them with JS: will prevent basic HTML parsers from getting the contents, but anything using a browser (Selenium, Puppeteer, etc.) will still work.
- Encrypt page contents and add a client-side "CAPTCHA" to generate the key: should help against primitive browser-based bots (see the sketch after this list).
    It would probably make sense to just let the user solve the CAPTCHA once and cache the key as a cookie or in `localStorage`.
- Bot detection JS: will be a cat-and-mouse game, but should help against badly written crawlers.
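
For the CAPTCHA idea above, one plausible approach is to derive the decryption key from the user's answer with a key derivation function and cache it after the first solve. A speculative sketch of just that step (all names and parameters are invented for illustration):

```python
# Speculative sketch only: derive an encryption key from a CAPTCHA answer.
# Nothing like this is implemented in the plugin yet.
import hashlib


def derive_key(captcha_answer: str, salt: bytes, iterations: int = 200_000) -> bytes:
    # The same derivation would have to run in the browser (e.g. PBKDF2 via
    # WebCrypto) so the key cached in localStorage matches the build-time key.
    return hashlib.pbkdf2_hmac("sha256", captcha_answer.encode("utf-8"), salt, iterations)
```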

Suggestions welcome: if you know bot detection mechanisms that can be used with static websites, feel free to open an issue :D

## Problems and Considerations

- Similar to the [encryption plugin](https://github.com/unverbuggt/mkdocs-encryptcontent-plugin), encrypting the search index is hard.
    So it is best to disable search (for example by omitting `search` from the `plugins` list in `mkdocs.yml`) to prevent anyone from accessing its index.
- Obviously, to protect your contents from scraping, you should not host their source code in public repos ;D
- By blocking bots, you also prevent search engines like Google from properly indexing your site.

## Notable changes

### Version 0.1.0

- Added `encode_html` option
- Added `sitemap_xml` option

### Version 0.0.1

- Added `robots_txt` option

## Development Commands

This repo is managed using [poetry](https://github.com/python-poetry/poetry?tab=readme-ov-file).
You can install `poetry` with `pip install poetry` or `pipx install poetry`.

Clone repo:
```bash
git clone git@github.com:six-two/mkdocs-anti-ai-scraper-plugin.git
```

Install/update extension locally:
```bash
poetry install
```

Build test site:
```bash
poetry run mkdocs build
```

Serve test site:
```bash
poetry run mkdocs serve
```

### Release

Set PyPI API token (only needed once):
```bash
poetry config pypi-token.pypi YOUR_PYPI_TOKEN_HERE
```

Build extension:
```bash
poetry build
```

Upload extension:
```bash
poetry publish
```


            
