# scrapeghost
![scrapeghost logo](docs/assets/scrapeghost.png)
`scrapeghost` is an experimental library for scraping websites using OpenAI's GPT.
Source: [https://github.com/jamesturk/scrapeghost](https://github.com/jamesturk/scrapeghost)
Documentation: [https://jamesturk.github.io/scrapeghost/](https://jamesturk.github.io/scrapeghost/)
Issues: [https://github.com/jamesturk/scrapeghost/issues](https://github.com/jamesturk/scrapeghost/issues)
[![PyPI badge](https://badge.fury.io/py/scrapeghost.svg)](https://badge.fury.io/py/scrapeghost)
[![Test badge](https://github.com/jamesturk/scrapeghost/workflows/Test%20&%20Lint/badge.svg)](https://github.com/jamesturk/scrapeghost/actions?query=workflow%3A%22Test+%26+Lint%22)
**Use at your own risk. This library makes potentially expensive API calls (roughly $0.36 for a single GPT-4 call on a moderately sized page). Cost estimates are based on the [OpenAI pricing page](https://beta.openai.com/pricing) and are not guaranteed to be accurate.**
![scrapeghost demo screenshot](screenshot.png)
## Features
The purpose of this library is to provide a convenient interface for exploring web scraping with GPT.
While the bulk of the work is done by the GPT model, `scrapeghost` provides a number of features to make it easier to use.
**Python-based schema definition** - Define the shape of the data you want to extract as any Python object, with as much or as little detail as you want.
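For illustration, here is a minimal sketch of the documented `SchemaScraper` entry point; the schema fields and target URL are taken from the project's own docs and are illustrative only:

```python
from scrapeghost import SchemaScraper

# The schema is a plain Python object describing the shape of the
# output; nested dicts and lists describe nested data.
scrape_legislators = SchemaScraper(
    schema={
        "name": "string",
        "url": "url",
        "district": "string",
        "party": "string",
        "offices": [{"name": "string", "address": "string", "phone": "string"}],
    }
)

# Calling the scraper fetches the page, sends the cleaned HTML to GPT,
# and returns a response whose .data attribute holds the extracted JSON.
response = scrape_legislators("https://www.ilga.gov/house/rep.asp?MemberID=3071")
print(response.data)
```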
**Preprocessing**
* **HTML cleaning** - Remove unnecessary HTML to reduce the size and cost of API requests.
* **CSS and XPath selectors** - Pre-filter HTML by writing a single CSS or XPath selector.
* **Auto-splitting** - Optionally split the HTML into multiple calls to the model, allowing larger pages to be scraped (see the sketch after this list).
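A sketch of how these preprocessing options combine, assuming the `CSS` helper, `extra_preprocessors` argument, and `auto_split_length` parameter described in the project docs; the selector and split length here are illustrative:

```python
from scrapeghost import SchemaScraper, CSS

scraper = SchemaScraper(
    schema={"name": "string", "price": "string"},
    # Pre-filter the page down to matching elements so that only
    # relevant HTML is sent to the model.
    extra_preprocessors=[CSS("div.product-card")],
    # If the filtered HTML is still too large, split it into chunks of
    # roughly this many tokens and make one model call per chunk.
    auto_split_length=2000,
)
```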
**Postprocessing**
* **JSON validation** - Ensure that the response is valid JSON, with the option to send it back to GPT for repair if it isn't.
* **Schema validation** - Go a step further and use a [`pydantic`](https://pydantic-docs.helpmanual.io/) schema to validate the response.
* **Hallucination check** - Verify that the data in the response actually exists on the page (see the sketch after this list).
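A sketch of schema validation, assuming (per the docs) that a `pydantic` model can be passed directly as the schema; the model fields and URL are illustrative:

```python
from pydantic import BaseModel

from scrapeghost import SchemaScraper

class Legislator(BaseModel):
    name: str
    district: str
    party: str

# Passing a pydantic model as the schema lets the library validate the
# model's JSON output against the declared fields and types.
scrape_legislator = SchemaScraper(schema=Legislator)
response = scrape_legislator("https://www.ilga.gov/house/rep.asp?MemberID=3071")
legislator = Legislator(**response.data)  # explicit re-validation
```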
**Cost Controls**
* Scrapers keep running totals of how many tokens have been sent and received, so costs can be tracked.
* Support for automatic fallbacks (e.g. use cost-saving GPT-3.5-Turbo by default, fall back to GPT-4 if needed.)
* Budget support: set a spending cap and the scraper stops if it is exceeded (see the sketch after this list).
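A sketch of the cost controls, assuming the `models` fallback list and `stats()` method shown in the docs; the `max_cost` budget parameter is an assumption and the real name may differ:

```python
from scrapeghost import SchemaScraper

scraper = SchemaScraper(
    schema={"title": "string", "date": "string"},
    # Try the cheaper model first and fall back to GPT-4 only when the
    # cheaper model's response fails validation.
    models=["gpt-3.5-turbo", "gpt-4"],
    # Assumed budget parameter: stop once estimated spend (USD) exceeds it.
    max_cost=1.00,
)

# Running totals of tokens sent/received and estimated cost so far.
print(scraper.stats())
```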
## Raw data

```json
{
    "_id": null,
    "home_page": "",
    "name": "scrapeghost",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.11,<4.0",
    "maintainer_email": "",
    "keywords": "",
    "author": "James Turk",
    "author_email": "dev@jamesturk.net",
    "download_url": "https://files.pythonhosted.org/packages/06/60/fd7bc7b7f3bac6dff42d019de01372120bd91a14055e7ae3db1a3d825f55/scrapeghost-0.6.0.tar.gz",
    "platform": null,
"description": "# scrapeghost\n\n![scrapeghost logo](docs/assets/scrapeghost.png)\n\n`scrapeghost` is an experimental library for scraping websites using OpenAI's GPT.\n\nSource: [https://github.com/jamesturk/scrapeghost](https://github.com/jamesturk/scrapeghost)\n\nDocumentation: [https://jamesturk.github.io/scrapeghost/](https://jamesturk.github.io/scrapeghost/)\n\nIssues: [https://github.com/jamesturk/scrapeghost/issues](https://github.com/jamesturk/scrapeghost/issues)\n\n[![PyPI badge](https://badge.fury.io/py/scrapeghost.svg)](https://badge.fury.io/py/scrapeghost)\n[![Test badge](https://github.com/jamesturk/scrapeghost/workflows/Test%20&%20Lint/badge.svg)](https://github.com/jamesturk/scrapeghost/actions?query=workflow%3A%22Test+%26+Lint%22)\n\n**Use at your own risk. This library makes considerably expensive calls ($0.36 for a GPT-4 call on a moderately sized page.) Cost estimates are based on the [OpenAI pricing page](https://beta.openai.com/pricing) and not guaranteed to be accurate.**\n\n![](screenshot.png)\n\n## Features\n\nThe purpose of this library is to provide a convenient interface for exploring web scraping with GPT.\n\nWhile the bulk of the work is done by the GPT model, `scrapeghost` provides a number of features to make it easier to use.\n\n**Python-based schema definition** - Define the shape of the data you want to extract as any Python object, with as much or little detail as you want.\n\n**Preprocessing**\n\n* **HTML cleaning** - Remove unnecessary HTML to reduce the size and cost of API requests.\n* **CSS and XPath selectors** - Pre-filter HTML by writing a single CSS or XPath selector.\n* **Auto-splitting** - Optionally split the HTML into multiple calls to the model, allowing for larger pages to be scraped.\n\n**Postprocessing**\n\n* **JSON validation** - Ensure that the response is valid JSON. (With the option to kick it back to GPT for fixes if it's not.)\n* **Schema validation** - Go a step further, use a [`pydantic`](https://pydantic-docs.helpmanual.io/) schema to validate the response.\n* **Hallucination check** - Does the data in the response truly exist on the page?\n\n**Cost Controls**\n\n* Scrapers keep running totals of how many tokens have been sent and received, so costs can be tracked.\n* Support for automatic fallbacks (e.g. use cost-saving GPT-3.5-Turbo by default, fall back to GPT-4 if needed.)\n* Allows setting a budget and stops the scraper if the budget is exceeded.",
"bugtrack_url": null,
"license": "Hippocratic License HL3-EXTR-FFD-LAW-MIL-SV",
"summary": "An experimental library for scraping websites using GPT.",
"version": "0.6.0",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "d9d1d779b4e7ca8195d0514814bc2432e282aaca72be122801a1467000dc17f2",
"md5": "9c625a5b17be0cc5bd1f44eef960def6",
"sha256": "3afa8d6e48cfcc37704c500930763fd5022062e6447f5748ce4771992fac87e1"
},
"downloads": -1,
"filename": "scrapeghost-0.6.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "9c625a5b17be0cc5bd1f44eef960def6",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.11,<4.0",
"size": 19669,
"upload_time": "2023-11-25T01:10:51",
"upload_time_iso_8601": "2023-11-25T01:10:51.657564Z",
"url": "https://files.pythonhosted.org/packages/d9/d1/d779b4e7ca8195d0514814bc2432e282aaca72be122801a1467000dc17f2/scrapeghost-0.6.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "0660fd7bc7b7f3bac6dff42d019de01372120bd91a14055e7ae3db1a3d825f55",
"md5": "97c7222b804123182dcd49093de18354",
"sha256": "78d49016c59d907f659b6bd3f470555b374fc7a9b085a8feeac31a4f0df59404"
},
"downloads": -1,
"filename": "scrapeghost-0.6.0.tar.gz",
"has_sig": false,
"md5_digest": "97c7222b804123182dcd49093de18354",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.11,<4.0",
"size": 17322,
"upload_time": "2023-11-25T01:10:53",
"upload_time_iso_8601": "2023-11-25T01:10:53.350113Z",
"url": "https://files.pythonhosted.org/packages/06/60/fd7bc7b7f3bac6dff42d019de01372120bd91a14055e7ae3db1a3d825f55/scrapeghost-0.6.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-11-25 01:10:53",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "scrapeghost"
}