webeater

- Name: webeater
- Version: 0.1.1
- Summary: A web content extraction tool designed to fetch and process web pages efficiently
- Author email: Tiago Ribeiro <webeater@tiagoribeiro.pt>
- Homepage: https://github.com/tiagrib/webeater
- Upload time: 2025-09-03 16:19:12
- Requires Python: >=3.9
- Keywords: web, scraping, extraction, selenium, beautifulsoup, content
- Requirements: annotated-types, attrs, beautifulsoup4, bs4, certifi, cffi, coloredlogs, h11, humanfriendly, idna, outcome, pycparser, pydantic, pydantic_core, pyreadline3, PySocks, selenium, sniffio, sortedcontainers, soupsieve, trio, trio-websocket, typing-inspection, typing_extensions, urllib3, websocket-client, wsproto

            <img src="img/logo.png" alt="Logo" style="max-height: 100px;">

# WebEater (weat)

WebEater is a web content extraction tool designed to fetch and process web pages.\
It is made for developers and researchers who need to extract structured data from web pages efficiently.\
The tool goes straight to the point, focusing on extracting text and structured data,
while providing additional configuration options and hints for better effectiveness.

Its main purpose is to serve as a go-to component that works out of the box for most general use cases.

As it's currently at an early stage, it may not cover all edge cases or complex scenarios.\
We welcome contributions and feedback to help improve its capabilities.

## Main Features
- Fetches web pages and extracts text content into Markdown format.
- Returns clean, plain text or a JSON object optionally containing lists of images and links found on the page.
- Handles JavaScript-heavy pages using Selenium and BeautifulSoup.
- Can be used both as a library and as a command-line tool (CLI).

## Quick Start (CLI)
To use WebEater from the command line, first install it using `pip`:

```
pip install webeater
```

Then, you can run it with a URL using the `weat` CLI tool:

```
weat https://example.com
```

This will fetch the content of the page and print the extracted text to the console.

### CLI Options
You can customize the behavior of WebEater using various command-line options:

- url (positional): URL to fetch content from. If omitted, WebEater starts an interactive prompt.
- -c, --config FILE (default: weat.json): Config file to use.
- --hints FILE [FILE ...]: Additional hint files to load (space-separated paths).
- --debug: Enable debug logging.
- --silent: Silent mode; suppress debug/info messages and only print results or errors, so the tool can be called from scripts or subprocesses.
- --json: Return content as JSON instead of plain text.
- --content-only: Return only the main extracted content (skip extracting links and images).

Examples:

```
# Basic usage
webeater https://example.com

# JSON output and content-only
webeater --json --content-only https://example.com

# Using a custom config and multiple hint files
webeater -c weat.json --hints hints/news.json hints/sports.json https://example.com
```

Interactive mode (when no URL is provided):

- Enter a URL when prompted to fetch content.
- Prefix shortcuts per request (a short parsing sketch follows this list):
    - j!<url> → return JSON
    - c!<url> → content only
    - jc!<url> or cj!<url> → JSON + content only
- q → quit the interactive session
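
As an illustration only, the sketch below shows one way the prefix convention above could be parsed. It is a hypothetical helper written for this README, not Webeater's actual implementation.

```
# Hypothetical sketch of the interactive-mode prefix convention described above.
# This is not Webeater's own parser; the function name and structure are illustrative.

def parse_interactive_line(line: str) -> dict:
    """Split an interactive command into flags and a URL."""
    line = line.strip()
    if line == "q":
        return {"quit": True}

    as_json = False
    content_only = False
    url = line

    # Prefixes end with "!" and may combine, e.g. "jc!" or "cj!".
    if "!" in line:
        prefix, _, rest = line.partition("!")
        if prefix and set(prefix) <= {"j", "c"}:
            as_json = "j" in prefix
            content_only = "c" in prefix
            url = rest

    return {"quit": False, "json": as_json, "content_only": content_only, "url": url}


# Example: "jc!https://example.com" -> JSON output, content only.
print(parse_interactive_line("jc!https://example.com"))
```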

Notes:

- URLs must start with http:// or https://.
- In silent mode, only the result or an error line (prefixed with "Error:") is printed, which makes the output straightforward to consume from scripts (see the sketch below).
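
As an illustration of the silent-mode behaviour described in the notes above, here is a minimal sketch of calling the CLI from a Python script via `subprocess`. The `webeater` entry point, the `--silent` and `--json` flags, and the "Error:" prefix come from this README; the shape of the JSON output is not specified here, so the sketch only checks that the output parses.

```
# Minimal sketch: invoking the webeater CLI from a script using --silent and --json.
# Assumes the "webeater" entry point is on PATH; nothing is assumed about the JSON
# structure beyond it being valid JSON.
import json
import subprocess

result = subprocess.run(
    ["webeater", "--silent", "--json", "https://example.com"],
    capture_output=True,
    text=True,
)

output = result.stdout.strip()
if result.returncode != 0 or output.startswith("Error:"):
    # In silent mode, errors are printed as a single line prefixed with "Error:".
    print(f"Extraction failed: {output or result.stderr.strip()}")
else:
    data = json.loads(output)
    print(data)
```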


## Quick Start (Python)
To use WebEater, first install it using `pip`:

```
pip install webeater
```

You can then import the `Webeater` class and create an instance of it.\
The engine automatically loads the necessary configuration
and provides methods for web content extraction.

Note that it must be created and used within an async context.

Below is a minimal example:

```
import asyncio
from webeater import Webeater

async def main():
    weat = await Webeater.create()
    content = await weat.get(url="https://www.tiagoribeiro.pt")
    print(content)

asyncio.run(main())
```
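
Building on the minimal example, the following sketch reuses one `Webeater` instance for several URLs. Only `Webeater.create()` and `get(url=...)` appear in this README; reusing a single instance across calls and catching per-URL errors are assumptions made for illustration.

```
import asyncio
from webeater import Webeater

# Sketch: reuse one Webeater instance for several pages.
# Assumes a single instance can serve multiple get() calls, which is not
# explicitly stated in the documentation above.
async def main():
    weat = await Webeater.create()
    urls = [
        "https://example.com",
        "https://www.tiagoribeiro.pt",
    ]
    for url in urls:
        try:
            content = await weat.get(url=url)
            print(f"--- {url} ---")
            print(content)
        except Exception as exc:  # report per-URL failures instead of aborting the loop
            print(f"Error fetching {url}: {exc}")

asyncio.run(main())
```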

## Help and Contributions

For questions or discussions about changes and new features, please start a new [Discussion in the Webeater GitHub repository](https://github.com/tiagrib/webeater/discussions).

If you find bugs or want to contribute, please open an [Issue](https://github.com/tiagrib/webeater/issues).

## Develop with Source

To develop with WebEater from source code, clone the repository:
```
git clone https://github.com/tiagrib/webeater.git
```

then navigate to the project directory and install the required dependencies:

```
pip install -r requirements.txt
```
The current code was tested with Python 3.12.3, though other versions may work.


## Configuration and Advanced Documentation
WebEater uses a configuration file to manage its settings.
The configuration file is typically located at `config/weat.yaml`.

You can customize the settings in this file to suit your needs,
such as specifying the default user agent, timeout settings, and other parameters.

For more detailed documentation on configuration options and advanced usage,
please refer to the [Hints Documentation](hints/README.md).
