*Write scraping rules, get dictionaries.*
`scrapedict` is a Python module that simplifies writing web scraping code. The goal is to make scrapers easy to adapt and maintain, with straightforward, readable code.
# Features
- The rules dictionary is straightforward and easy to read
- Once you define the rules for one item, you can extract multiple items
- You get ✨dictionaries✨ of the data you want
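
The Usage section below runs against a live page; as a quick self-contained taste, here is a minimal sketch using an inline HTML snippet (the HTML, selectors, and expected output are illustrative assumptions, not taken from the library's docs):

```python
import scrapedict as sd

# A tiny inline document stands in for a fetched page (illustrative only).
html = '<div class="item"><h2 class="name">Widget</h2><span class="price">9.99</span></div>'

# The rules dictionary maps output keys to CSS-selector rules.
rules = {
    "name": sd.text(".name"),
    "price": sd.text(".price"),
}

# extract() applies the rules to the HTML and returns a plain dict.
item = sd.extract(rules, html)
assert item == {"name": "Widget", "price": "9.99"}
```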
# Installation
```
$ pip install scrapedict
```
# Usage
```python
import scrapedict as sd
from urllib.request import urlopen

# Fetch the content from the Urban Dictionary page for "larping"
url = "https://www.urbandictionary.com/define.php?term=larping"
content = urlopen(url).read().decode()

# Define the fields to be extracted
fields = {
    "word": sd.text(".word"),
    "meaning": sd.text(".meaning"),
    "example": sd.text(".example"),
}

# Extract the data using scrapedict
item = sd.extract(fields, content)

# The result is a dictionary with the word, its meaning, and an example usage.
# The assertions below demonstrate the expected structure and content.
assert isinstance(item, dict)
assert item["word"] == "Larping"
```
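
Since the result is just a dictionary, it composes with the standard library with no extra glue. For example, continuing from the snippet above:

```python
import json

# item is a plain dict, so it serializes directly with the standard library.
print(json.dumps(item, indent=2))
```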
# The orange site example
```python
import scrapedict as sd
from urllib.request import urlopen

# Fetch the content from the Hacker News homepage
url = "https://news.ycombinator.com/"
content = urlopen(url).read().decode()

# Define the fields to extract: title and URL for each news item
fields = {
    "title": sd.text(".titleline a"),
    "url": sd.attr(".titleline a", "href"),
}

# Use scrapedict to extract all news items as a list of dictionaries
items = sd.extract_all(".athing", fields, content)

# The result is a list of dictionaries, each containing the title and URL
# of a news item. The homepage typically lists 30 items.
assert len(items) == 30
```
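
As with the single-item case, the output is plain data, so persisting it needs only the standard library. A sketch, continuing from the example above (the filename is arbitrary):

```python
import csv

# Each item is a dict with "title" and "url" keys, so csv.DictWriter
# can write the whole list in a few lines.
with open("hn_front_page.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(items)
```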
# Development
Dependencies are managed with [Poetry](https://python-poetry.org/).
Testing is done with [Tox](https://tox.readthedocs.io/en/latest/).
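
Assuming the standard workflow for those tools (not spelled out in this README), local setup and a test run would look something like:

```
$ poetry install
$ tox
```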