web2llm

- **Version:** 0.5.3
- **Summary:** A tool to scrape web content into clean Markdown for LLMs.
- **Author:** Juan Herruzo <juan@herruzo.dev>
- **Homepage:** https://github.com/herruzo99/web2llm
- **Requires Python:** >=3.10
- **License:** MIT License
- **Keywords:** scraper, llm, markdown, web scraping, pdf, github, rag
- **Uploaded:** 2025-09-03 18:47:31
# Web2LLM

[![CI/CD Pipeline](https://github.com/herruzo99/web2llm/actions/workflows/ci.yml/badge.svg)](https://github.com/herruzo99/web2llm/actions/workflows/ci.yml)

A command-line tool to scrape web pages, GitHub repos, local folders, and PDFs into clean, aggregated Markdown suitable for Large Language Models.

## Description

This tool provides a unified interface to process various sources—from live websites and code repositories to local directories and PDF files—and convert them into a structured Markdown format. The clean, token-efficient output is ideal for use as context in prompts for Large Language Models, for Retrieval-Augmented Generation (RAG) pipelines, or for documentation archiving.

## Installation

For standard scraping of static websites, local files, and GitHub repositories, install the base package:
```bash
pip install web2llm
```
To enable JavaScript rendering for Single-Page Applications (SPAs) and other dynamic websites, you must install the `[js]` extra, which includes Playwright:
```bash
pip install "web2llm[js]"
```
After installing the `js` extra, you must also download the necessary browser binaries for Playwright to function:
```bash
playwright install
```
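
If download size is a concern, Playwright can also install a single browser engine. Chromium is the usual choice for headless scraping, though whether `web2llm` specifically drives Chromium is an assumption here:
```bash
# Install only the Chromium binaries instead of all three engines
# (assumption: web2llm's headless rendering uses Chromium).
playwright install chromium
```
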
## Usage

### Command-Line Interface

The tool is run from the command line with the following structure:

```bash
web2llm <SOURCE> -o <OUTPUT_NAME> [OPTIONS]
```
-   `<SOURCE>`: The URL or local path to scrape.
-   `-o, --output`: The base name for the output folder and the `.md` and `.json` files created inside it.

All scraped content is saved to a new directory at `output/<OUTPUT_NAME>/`.
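
For example, a run like the one below (hypothetical URL) produces a folder named after the `-o` value; the exact file names are inferred from the description of `-o` above:
```bash
# Hypothetical run; the URL is illustrative.
web2llm 'https://example.com/docs' -o example-docs

# Expected layout, inferred from the -o description:
#   output/example-docs/example-docs.md    <- aggregated Markdown
#   output/example-docs/example-docs.json  <- accompanying metadata
```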

#### General Options:
- `--debug`: Enable debug mode for verbose, step-by-step output to stderr.
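
Because the debug trace goes to stderr, it can be captured separately from normal console output using standard shell redirection:
```bash
# Keep the console clean by sending the verbose trace to a file.
web2llm '.' -o my-project --debug 2>debug.log
```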

#### Web Scraper Options (For URLs):
- `--render-js`: Render JavaScript using a headless browser. Slower but necessary for SPAs. Requires installation with the `[js]` extra.
- `--check-content-type`: Force a network request to check the page's `Content-Type` header. Use for URLs that serve PDFs without a `.pdf` extension.
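
Both flags are applied per URL; the addresses below are hypothetical:
```bash
# A single-page app that needs a headless browser (requires the [js] extra).
web2llm 'https://app.example.com/' --render-js -o spa-app

# An endpoint that serves a PDF without a .pdf extension.
web2llm 'https://example.com/download?id=42' --check-content-type -o report-42
```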

#### Filesystem Options (For GitHub & Local Folders):
When scraping a local folder or a GitHub repository, `web2llm` will automatically find and respect the rules in the project's `.gitignore` file. This ensures that the scrape accurately reflects the intended source code of the project.

-   `--exclude <PATTERN>`: A `.gitignore`-style pattern for files/directories to exclude. Can be used multiple times.
-   `--include <PATTERN>`: A pattern to re-include a file that would otherwise be ignored by default or by an `--exclude` rule. Can be used multiple times.
-   `--include-all`: Disables all default, project-level, and `.gitignore` ignore patterns, providing a complete scrape of all text-based files. Explicit `--exclude` flags are still respected.

### Configuration

`web2llm` uses a hierarchical configuration system that gives you precise control over the scraping process:

1.  **Default Config**: The tool comes with a built-in `default_config.yaml` containing a robust set of ignore patterns for common development files and selectors for web scraping.
2.  **Project-Specific Config**: You can create a `.web2llm.yaml` file in the root of your project to override or extend the default settings. This is the recommended way to manage project-specific rules; a sketch follows this list.
3.  **CLI Arguments**: Command-line flags provide the final layer of control, overriding any settings from the configuration files for a single run.
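
As a rough illustration of layer 2, a project override could be created as below. The key names are hypothetical; the real schema lives in the tool's built-in `default_config.yaml` and is not reproduced in this README:
```bash
# Hypothetical sketch only: the 'exclude'/'include' keys are assumed,
# not taken from web2llm's documented schema.
cat > .web2llm.yaml <<'EOF'
exclude:
  - 'tests/'
  - 'docs/'
include:
  - '!LICENSE'
EOF
```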

## Examples

**1. Scrape a specific directory within a GitHub repo:**
```bash
web2llm 'https://github.com/tiangolo/fastapi' -o fastapi-src --include 'fastapi/'
```

**2. Scrape a local project, excluding test and documentation folders:**
```bash
web2llm '~/dev/my-project' -o my-project-code --exclude 'tests/' --exclude 'docs/'
```

**3. Scrape a local project but re-include the `LICENSE` file, which is ignored by default:**
```bash
web2llm '.' -o my-project-with-license --include '!LICENSE'
```

**4. Scrape everything in a project, including files normally ignored by `.gitignore`:**
```bash
web2llm . -o my-project-full --include-all --exclude '.git/'
```

**5. Scrape just the "Installation" section from a webpage:**
```bash
web2llm 'https://fastapi.tiangolo.com/#installation' -o fastapi-install
```

**6. Scrape a PDF from an arXiv URL:**
```bash
web2llm 'https://arxiv.org/pdf/1706.03762.pdf' -o attention-is-all-you-need
```

## Contributing

Contributions are welcome. Please refer to the project's issue tracker and `CONTRIBUTING.md` file for information on how to participate.

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

            
