*Write scraping rules, get dictionaries.*
`scrapedict` is a Python module that simplifies writing web scraping code. The goal is to make scrapers easy to adapt and maintain, with straightforward, readable code.
# Features
- The rules dictionary is straightforward and easy to read
- Once you define the rules for one item, you can extract multiple items
- You get ✨dictionaries✨ of the data you want
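
The Usage section below runs against a live page; as a quick self-contained taste, here is a minimal sketch using an inline HTML snippet (the HTML, selectors, and expected output are illustrative assumptions, not taken from the library's docs):

```python
import scrapedict as sd

# A tiny inline document stands in for a fetched page (illustrative only).
html = '<div class="item"><h2 class="name">Widget</h2><span class="price">9.99</span></div>'

# The rules dictionary maps output keys to CSS-selector rules.
rules = {
    "name": sd.text(".name"),
    "price": sd.text(".price"),
}

# extract() applies the rules to the HTML and returns a plain dict.
item = sd.extract(rules, html)
assert item == {"name": "Widget", "price": "9.99"}
```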
# Installation
```
$ pip install scrapedict
```
# Usage
```python
import scrapedict as sd
from urllib.request import urlopen

# Fetch the content from the Urban Dictionary page for "larping"
url = "https://www.urbandictionary.com/define.php?term=larping"
content = urlopen(url).read().decode()

# Define the fields to be extracted
fields = {
    "word": sd.text(".word"),
    "meaning": sd.text(".meaning"),
    "example": sd.text(".example"),
}

# Extract the data using scrapedict
item = sd.extract(fields, content)

# The result is a dictionary with the word, its meaning, and an example usage.
# The assertions below demonstrate the expected structure and content.
assert isinstance(item, dict)
assert item["word"] == "Larping"
```
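
Since the result is just a dictionary, it composes with the standard library with no extra glue. For example, continuing from the snippet above:

```python
import json

# item is a plain dict, so it serializes directly with the standard library.
print(json.dumps(item, indent=2))
```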
# The orange site example
```python
import scrapedict as sd
from urllib.request import urlopen

# Fetch the content from the Hacker News homepage
url = "https://news.ycombinator.com/"
content = urlopen(url).read().decode()

# Define the fields to extract: title and URL for each news item
fields = {
    "title": sd.text(".titleline a"),
    "url": sd.attr(".titleline a", "href"),
}

# Use scrapedict to extract all news items as a list of dictionaries
items = sd.extract_all(".athing", fields, content)

# The result is a list of dictionaries, each containing the title and URL
# of a news item. The homepage typically lists 30 items.
assert len(items) == 30
```
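
As with the single-item case, the output is plain data, so persisting it needs only the standard library. A sketch, continuing from the example above (the filename is arbitrary):

```python
import csv

# Each item is a dict with "title" and "url" keys, so csv.DictWriter
# can write the whole list in a few lines.
with open("hn_front_page.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(items)
```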
# Development
Dependencies are managed with [Poetry](https://python-poetry.org/).
Testing is done with [Tox](https://tox.readthedocs.io/en/latest/).
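
Assuming the standard workflow for those tools (not spelled out in this README), local setup and a test run would look something like:

```
$ poetry install
$ tox
```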