# 🕷️ scrapery
A blazing fast, lightweight, and modern parsing library for **HTML, XML, and JSON**, designed for **web scraping** and **data extraction**.
It supports both **XPath** and **CSS** selectors, along with seamless **DOM navigation**, making parsing and data extraction straightforward and intuitive.
---
## ✨ Features
- ⚡ **Blazing Fast Performance** – Optimized for high-speed HTML, XML, and JSON parsing
- 🎯 **Dual Selector Support** – Use **XPath** or **CSS selectors** for flexible extraction
- 🛡 **Comprehensive Error Handling** – Detailed exceptions for different error scenarios (see the sketch after this list)
- 🔄 **Async Support** – Built-in async utilities for high-concurrency scraping
- 🧩 **Robust Parsing** – Encoding detection and content normalization for reliable results
- 🧑‍💻 **Function-Based API** – Clean and intuitive interface for ease of use
- 📦 **Multi-Format Support** – Parse **HTML, XML, and JSON** in a single library
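
For example, parse failures can be handled explicitly rather than crashing a scraping run. A minimal sketch of that pattern, catching a broad `Exception` because this README does not pin down scrapery's specific exception classes:

```python
import scrapery as spy

malformed = "{not valid json"

# Hedged sketch: the library's concrete exception types are not listed
# here, so we catch Exception broadly and report the failure.
try:
    data = spy.parse_json(malformed)
except Exception as exc:
    print(f"Failed to parse JSON: {exc}")
```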
### ⚡ Performance Comparison
The following benchmarks were run on sample HTML and JSON data to compare **scrapery** with other popular Python libraries. Performance may vary depending on system, Python version, and file size.
| Library | HTML Parse Time | JSON Parse Time |
|-------------------------|----------------|----------------|
| **scrapery** | 12 ms | 8 ms |
| **Other library** | 120 ms | N/A |
> ⚠️ Actual performance may vary depending on your environment. These results are meant for **illustrative purposes** only. No library is endorsed or affiliated with scrapery.
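
To get rough numbers for your own environment, a small harness built on Python's standard `timeit` module is enough. This sketch assumes only the `parse_html` and `parse_json` entry points shown in the Quick Start below:

```python
import timeit

import scrapery as spy

html = "<html><body>" + "<p>row</p>" * 500 + "</body></html>"
json_text = '{"users": [{"name": "John", "age": 30}]}'

# Time 1,000 parses of each document and report totals in milliseconds.
html_ms = timeit.timeit(lambda: spy.parse_html(html), number=1000) * 1000
json_ms = timeit.timeit(lambda: spy.parse_json(json_text), number=1000) * 1000
print(f"HTML: {html_ms:.1f} ms   JSON: {json_ms:.1f} ms")
```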
---
## 📦 Installation
```bash
pip install scrapery
```

---

## 🚀 Quick Start

### HTML Example

```python
import scrapery as spy
html_content = """
<html>
<body>
<h1>Welcome</h1>
<p>Hello<br>World</p>
<a href="/about">About Us</a>
<table>
<tr><th>Name</th><th>Age</th></tr>
<tr><td>John</td><td>30</td></tr>
<tr><td>Jane</td><td>25</td></tr>
</table>
</body>
</html>
"""
# Parse HTML content
doc = spy.parse_html(html_content)
# Extract text
# CSS selector: First <h1>
print(spy.get_selector_content(doc, selector="h1"))
# ➜ Welcome
# XPath: First <h1>
print(spy.get_selector_content(doc, selector="//h1"))
# ➜ Welcome
# CSS selector: <a href> attribute
print(spy.get_selector_content(doc, selector="a", attr="href"))
# ➜ /about
# XPath: <a> element href
print(spy.get_selector_content(doc, selector="//a", attr="href"))
# ➜ /about
# CSS: First <td> in table (John)
print(spy.get_selector_content(doc, selector="td"))
# ➜ John
# XPath: Second <td> (//td[2] = 30)
print(spy.get_selector_content(doc, selector="//td[2]"))
# ➜ 30
# XPath: Jane's age (//tr[3]/td[2])
print(spy.get_selector_content(doc, selector="//tr[3]/td[2]"))
# ➜ 25
# No CSS selector or XPath: full document text
print(spy.get_selector_content(doc))
# ➜ Welcome HelloWorld About Us Name Age John 30 Jane 25
# Root attribute (lang, if it existed)
print(spy.get_selector_content(doc, attr="lang"))
# ➜ None
# Extract links
links = spy.extract_links(doc)
print("Links:", links)
# Resolve relative URLs
spy.resolve_relative_urls(doc, "https://example.com/")
print("Absolute link:", doc.xpath("//a/@href")[0])
# Extract tables
tables = spy.get_selector_tables(doc, as_dicts=True)
print("Tables:", tables)
# DOM Navigation
h1_elem = doc.xpath("//h1")[0]
parent = spy.get_parent(h1_elem)
children = spy.get_children(doc)
siblings = spy.get_next_sibling(h1_elem)
ancestors = spy.get_ancestors(h1_elem)
print("Parent tag:", parent.tag)
print("Children count:", len(children))
print("Next sibling tag:", siblings.tag if siblings else None)
print("Ancestors:", [a.tag for a in ancestors])
# Metadata
metadata = spy.get_metadata(doc)
print("Metadata:", metadata)
```
### XML Example

```python
xml_content = """
<users>
<user id="1"><name>John</name></user>
<user id="2"><name>Jane</name></user>
</users>
"""
xml_doc = spy.parse_xml(xml_content)
users = spy.find_xml_all(xml_doc, "//user")
for u in users:
    print(u.attrib, u.xpath("./name/text()")[0])
# Convert XML to dict
xml_dict = spy.xml_to_dict(xml_doc)
print(xml_dict)
```
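
Because `find_xml_all` takes an XPath expression, standard XPath predicates can narrow the selection. A sketch, assuming ordinary XPath semantics apply here:

```python
# Select a single user by attribute with an XPath predicate.
for u in spy.find_xml_all(xml_doc, "//user[@id='2']"):
    print(u.xpath("./name/text()")[0])
# ➜ Jane
```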
### JSON Example

```python
json_content = '{"users":[{"name":"John","age":30},{"name":"Jane","age":25}]}'
data = spy.parse_json(json_content)
# Access using path
john_age = spy.json_get_value(data, "users.0.age")
print("John's age:", john_age)
# Extract all names
names = spy.json_extract_values(data, "name")
print("Names:", names)
# Flatten JSON
flat = spy.json_flatten(data)
print("Flattened JSON:", flat)
```
### Async Fetch Example

```python
import asyncio
urls = ["https://example.com", "https://httpbin.org/get"]
async def fetch_urls():
    result = await spy.fetch_multiple_urls(urls)
    print(result)

asyncio.run(fetch_urls())
```
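
Fetched pages can then go straight back into the parsers. A sketch, assuming `fetch_multiple_urls` returns one response body per URL in input order (the exact return shape is not documented above):

```python
async def fetch_and_parse():
    # Assumption: one HTML string per URL, aligned with the input order.
    bodies = await spy.fetch_multiple_urls(urls)
    for url, body in zip(urls, bodies):
        doc = spy.parse_html(body)
        print(url, spy.get_selector_content(doc, selector="title"))

asyncio.run(fetch_and_parse())
```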