grabit-md

Name	grabit-md
Version	1.0.0
home_page	None
Summary	Grabit is a library that allows you to download web pages, extract their readable content, convert it to Markdown, and save it locally.
upload_time	2025-10-24 12:54:41
maintainer	None
docs_url	None
author	Vlad Iliescu
requires_python	>=3.11
license	None
keywords	python markdown cli web scraping downloader scraper
requirements	No requirements were recorded.

# Grabit

Grabit is a command-line tool that allows you to download web pages, extract their readable content, convert it to Markdown, and save it locally.

It's ideal for archiving articles, blog posts, or any web content you may want to save forever and ever. It works well for feeding web content into LLMs too.

I'm using it to save bookmarks in [Obsidian](https://obsidian.md/), so you'll see a lot of focus in this area (the YAML front matter, the domain subdirectory, etc.). But it's flexible enough to be used in other contexts as well.


| It gets you from this                                    | to this                                     |
|-------------------------------------------|-------------------------------------------|
| ![Raw html](https://vladiliescu.net/grabit-web-downloader/img/before.png "Before") | ![Markdown](https://vladiliescu.net/grabit-web-downloader/img/after.png "After") |



## Features

- **Download and convert web pages to Markdown**: Fetches the content from a URL and converts it into clean Markdown format
- **Supports multiple output formats**: Save content as Markdown, readable or raw HTML, or just send it to stdout so you can pipe it into another app
- **Customizable output**: Include YAML front matter, page titles, source links, and control the output directory structure. This is especially useful for integrating with knowledge management systems such as [Obsidian](https://obsidian.md/)
- **Uses Readability.js**: Extracts the main content from web pages for cleaner outputs (requires Node.js to be installed)
- **Supports Reddit posts**: Handles Reddit posts (both text and link posts), including their comments

## Installation

1. Ensure [uv](https://docs.astral.sh/uv/) is installed
2. Ensure [Node.js](https://nodejs.org/) is installed (optional; only required for Readability.js, see the `--use-readability-js` option below)
3. That's it. You can now run `grabit` directly with `uvx`:

```sh
uvx grabit [OPTIONS] URL
```


## Usage

```sh
uvx grabit [OPTIONS] URL
```

### Options

- `--yaml-frontmatter / --no-yaml-frontmatter`: Include YAML front matter with metadata, useful for saving & viewing content in [Obsidian](https://obsidian.md) (default: `enabled`).
- `--include-title / --no-include-title`: Include the page title as an H1 heading. A bit redundant when rendering the YAML front matter, but I like it anyway (default: `enabled`).
- `--include-source / --no-include-source`: Include the page source URL at the top of the document. Also a bit redundant when rendering the YAML front matter, but this one I don't like so much (default: `disabled`).
- `--user-agent TEXT`: Set a custom User-Agent to be used for retrieving web pages (default: `Grabit/<version>`).
- `--fallback-title TEXT`: Fallback title if no title is found. Use `{date}` for the current date (default: `Untitled {date}`).
- `--use-readability-js / --no-use-readability-js`: Use Readability.js for processing pages (requires Node.js). Disabling it still gives you **some** processing courtesy of [ReadabiliPy](https://github.com/alan-turing-institute/ReadabiliPy), but the result doesn't look so great, to be honest (default: `enabled`).
- `--create-domain-subdir / --no-create-domain-subdir`: Save the resulting files in a subdirectory named after the domain. Useful when saving a **lot** of bookmarks in the same Obsidian vault (default: `enabled`).
- `--overwrite / --no-overwrite`: Overwrite existing files (default: `disabled`).
- `-f, --format [md|stdout.md|html|raw.html]`: Output format(s) to save the content in. The most useful are `md`, which saves the content to a Markdown file, and `stdout.md`, which writes the Markdown to stdout so you can pipe it to something else, like the clipboard or Simon Willison's [llm CLI](https://github.com/simonw/llm). Can be specified multiple times (default: `md`).
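If you script Grabit from Python, the on/off flag pairs and the repeatable `-f` option above map naturally onto keyword arguments. The following is only a sketch: `build_grabit_command` is a hypothetical helper (not part of Grabit), and it assumes Grabit is invoked as `uvx grabit`, as in the usage line above. Its defaults mirror the documented defaults.

```python
def build_grabit_command(
    url: str,
    *,
    yaml_frontmatter: bool = True,
    include_title: bool = True,
    include_source: bool = False,
    create_domain_subdir: bool = True,
    overwrite: bool = False,
    formats: tuple = ("md",),
) -> list:
    """Build a `uvx grabit` argument list (hypothetical helper, not Grabit's API)."""
    # Each documented boolean option has a `--name` / `--no-name` pair.
    toggles = {
        "yaml-frontmatter": yaml_frontmatter,
        "include-title": include_title,
        "include-source": include_source,
        "create-domain-subdir": create_domain_subdir,
        "overwrite": overwrite,
    }
    command = ["uvx", "grabit"]
    for name, enabled in toggles.items():
        command.append(f"--{name}" if enabled else f"--no-{name}")
    # `-f` can be given multiple times, once per output format.
    for fmt in formats:
        command += ["-f", fmt]
    command.append(url)
    return command
```

The resulting list can be passed to `subprocess.run`; spelling out every flag explicitly keeps the call independent of any future change in defaults.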


### Examples

- **Save a web page as Markdown with the default options:**
```sh
uvx grabit https://example.com/article
```

- **Save as both Markdown and readable HTML:**
```sh
uvx grabit -f md -f html https://example.com/article
```

- **Set a custom User-Agent:**
```sh
uvx grabit --user-agent "MyCustomAgent/1.0" https://example.com/article
```

- **Output Markdown content to stdout:**
```sh
uvx grabit -f stdout.md https://example.com/article
```

- **Output Markdown content to the clipboard (macOS):**
```sh
uvx grabit -f stdout.md https://example.com/article | pbcopy
```

- **Disable YAML front matter and include source URL:**
```sh
uvx grabit --no-yaml-frontmatter --include-source https://example.com/article
```

- **Save files in the working directory, without creating a domain subdirectory:**
```sh
uvx grabit --no-create-domain-subdir https://example.com/article
```
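The `stdout.md` format also makes it possible to capture a page from a script, for example before handing it to an LLM. Below is a minimal sketch, assuming `uvx` is on your `PATH` and that Grabit writes only the Markdown to stdout in this mode. `fetch_markdown` and its `run` parameter are hypothetical names introduced for illustration; `run` exists only so the subprocess call can be swapped out in tests.

```python
import subprocess
from typing import Callable


def fetch_markdown(url: str, run: Callable = subprocess.run) -> str:
    """Return the page at `url` as Markdown by capturing Grabit's stdout."""
    # Same command as the stdout example above.
    command = ["uvx", "grabit", "-f", "stdout.md", url]
    # check=True raises CalledProcessError if Grabit exits with an error.
    result = run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `fetch_markdown("https://example.com/article")` should then return the same text that the `pbcopy` example places on the clipboard.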

## Requirements

- [uv](https://docs.astral.sh/uv/) (for running the script)
- [Node.js](https://nodejs.org) (if using Readability.js)

## License

**Grabit**, a tool for archiving web content, copyright (C) 2025 **Vlad Iliescu**

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. See the [LICENSE](./LICENSE) for details.

