# web2llm 0.2.0

- **Summary**: A tool to scrape web content into clean Markdown for LLMs.
- **Author**: Juan Herruzo <juan@herruzo.dev>
- **Homepage**: https://github.com/herruzo99/web2llm
- **Bug tracker**: https://github.com/herruzo99/web2llm/issues
- **Requires Python**: >=3.10
- **License**: MIT License
- **Keywords**: scraper, llm, markdown, web scraping, pdf, github, rag
- **Uploaded**: 2025-07-28 17:43:14
# Web2LLM

[![CI/CD Pipeline](https://github.com/herruzo99/web2llm/actions/workflows/ci.yml/badge.svg)](https://github.com/herruzo99/web2llm/actions/workflows/ci.yml)

A command-line tool to scrape web pages, GitHub repos, local folders, and PDFs into clean, aggregated Markdown suitable for Large Language Models.

## Description

This tool provides a unified interface to process various sources—from live websites and code repositories to local directories and PDF files—and convert them into a structured Markdown format. The clean, token-efficient output is ideal for use as context in prompts for Large Language Models, for Retrieval-Augmented Generation (RAG) pipelines, or for documentation archiving.

## Key Features

-   **Multi-Source Scraping**: Handles public web pages, GitHub repositories, local project folders, and both local and remote PDF files.
-   **Content-Aware Extraction**: For web pages, it intelligently identifies and extracts the main content, ignoring common clutter like navigation bars, sidebars, and footers.
-   **Targeted Section Scraping**: Use a URL with a hash fragment (e.g., `page.html#usage`) to scrape just that specific section of a webpage.
-   **Code-Aware Filesystem Processing**: For GitHub repos and local folders, it generates a file tree and concatenates all text-based source files into a single document, complete with syntax-highlighting hints (see the sketch after this list).
-   **Intelligent & Extensible Filtering**: Automatically ignores common non-source files (`.git`, `node_modules`, lockfiles, images) using a comprehensive set of default `.gitignore`-style patterns.
-   **Advanced Configuration**: Customize scraping behavior by placing a `.web2llm.yaml` file in your project root to override default settings or by using command-line flags for on-the-fly adjustments.
-   **Specialized PDF Handling**: Extracts text from PDFs and includes special logic for arXiv papers to pull structured metadata (title, abstract) from the landing page.
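
For orientation, the aggregated output for a repo or folder scrape plausibly looks like the mock-up below. This is an illustrative sketch based on the feature description above (a file tree followed by each file's contents fenced with a language hint), not verbatim tool output:

````markdown
# my-project

## File Tree

```
my-project/
└── src/
    └── main.py
```

## src/main.py

```python
print("hello from main")
```
````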

## Installation

```bash
pip install web2llm
```
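
Since the package installs a `web2llm` console script, it should also work with `pipx` if you prefer an isolated environment; this is standard behavior for PyPI CLI tools rather than anything specific to this package:

```bash
pipx install web2llm
```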

## Usage

### Command-Line Interface

The tool is run from the command line with the following structure:

```bash
web2llm <SOURCE> -o <OUTPUT_NAME> [OPTIONS]
```

-   `<SOURCE>`: The URL or local path to scrape.
-   `-o, --output`: The base name for the output folder and the `.md` and `.json` files created inside it.

All scraped content is saved to a new directory at `output/<OUTPUT_NAME>/`.
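
Putting that together, a run with `-o my-scrape` should yield a layout along these lines (file names inferred from the `-o` description above, not confirmed against the tool's output):

```
output/
└── my-scrape/
    ├── my-scrape.md    # aggregated Markdown
    └── my-scrape.json  # accompanying JSON output
```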

#### Filesystem Options (GitHub & Local Folders)

-   `--exclude <PATTERN>`: A `.gitignore`-style pattern for files or directories to exclude. Can be repeated, e.g. `--exclude 'docs/' --exclude '*.log'`.
-   `--include <PATTERN>`: A pattern that re-includes a file that would otherwise be ignored by default or by an `--exclude` rule. This is typically a negation pattern, e.g. `--include '!LICENSE'`. Can be repeated.
-   `--include-all`: Disables all default and project-level ignore patterns, processing every text file encountered. Explicit `--exclude` flags are still respected. These flags compose, as in the sketch below.
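
As a quick illustration of how the filtering flags combine in a single run (the target repository here is only an example), a minimal sketch:

```bash
# Exclude tests and logs, but pull the LICENSE file back in.
web2llm 'https://github.com/herruzo99/web2llm' -o web2llm-src \
  --exclude 'tests/' \
  --exclude '*.log' \
  --include '!LICENSE'
```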

### Configuration

`web2llm` uses a hierarchical configuration system that gives you precise control over the scraping process:

1.  **Default Config**: The tool comes with a built-in `default_config.yaml` containing a robust set of ignore patterns for common development files and selectors for web scraping.
2.  **Project-Specific Config**: You can create a `.web2llm.yaml` file in the root of your project to override or extend the default settings. This is the recommended way to manage project-specific rules, e.g. ignoring a `dist` folder or a custom log file (a sketch follows this list).
3.  **CLI Arguments**: Flags like `--exclude` and `--include-all` provide the final layer of control, overriding any settings from the configuration files for a single run.
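
As a rough sketch of a project-level override, a hypothetical `.web2llm.yaml` might look like the following. The key names here (`exclude_patterns`, `include_patterns`) are assumptions for illustration only; the authoritative schema is the bundled `default_config.yaml`:

```yaml
# Hypothetical .web2llm.yaml -- key names are illustrative and unverified;
# consult the tool's bundled default_config.yaml for the real schema.
exclude_patterns:
  - 'dist/'      # build artifacts
  - '*.tmp'      # scratch files
include_patterns:
  - '!LICENSE'   # re-include a file that default patterns would drop
```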

## Examples

**1. Scrape a specific directory within a GitHub repo:**
```bash
web2llm 'https://github.com/tiangolo/fastapi' -o fastapi-src --include 'fastapi/'
```

**2. Scrape a local project, excluding test and documentation folders:**
```bash
web2llm '~/dev/my-project' -o my-project-code --exclude 'tests/' --exclude 'docs/'
```

**3. Scrape a local project but re-include the `LICENSE` file, which is ignored by default:**
```bash
web2llm '.' -o my-project-with-license --include '!LICENSE'
```

**4. Scrape everything in a project except the `.git` directory:**
```bash
web2llm . -o my-project-full --include-all --exclude '.git/'
```

**5. Scrape just the "Installation" section from a webpage:**
```bash
web2llm 'https://fastapi.tiangolo.com/#installation' -o fastapi-install
```

**6. Scrape a PDF from an arXiv URL:**
```bash
web2llm 'https://arxiv.org/pdf/1706.03762.pdf' -o attention-is-all-you-need
```

## Contributing

Contributions are welcome. Please refer to the project's issue tracker and contribution guidelines for information on how to participate.

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

            
