scrapery

Name: scrapery
Version: 0.1.3
Summary: Scrapery: A fast, lightweight library to scrape HTML, XML, and JSON using XPath, CSS selectors, and intuitive DOM navigation.
Upload time: 2025-08-31 11:14:37
Author: Ramesh Chandra
Requires Python: >=3.8
License: MIT
Keywords: web scraping, html parser, xml parser, json parser, aiohttp, lxml, ujson, data extraction, scraping tools
# 🕷️ scrapery

A blazing-fast, lightweight, and modern parsing library for **HTML, XML, and JSON**, designed for **web scraping** and **data extraction**.  
It supports both **XPath** and **CSS** selectors, along with seamless **DOM navigation**, making parsing and data extraction straightforward and intuitive.

---

## ✨ Features

- ⚡ **Blazing Fast Performance** – Optimized for high-speed HTML, XML, and JSON parsing  
- 🎯 **Dual Selector Support** – Use **XPath** or **CSS selectors** for flexible extraction  
- 🛡 **Comprehensive Error Handling** – Detailed exceptions for different error scenarios (see the sketch after this list)  
- 🔄 **Async Support** – Built-in async utilities for high-concurrency scraping  
- 🧩 **Robust Parsing** – Encoding detection and content normalization for reliable results  
- 🧑‍💻 **Function-Based API** – Clean and intuitive interface for ease of use  
- 📦 **Multi-Format Support** – Parse **HTML, XML, and JSON** in a single library  
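
The exception classes themselves are not enumerated in this README, so the following is only a hedged sketch of defensive usage: the broad `except Exception` is a stand-in for whatever dedicated exceptions scrapery actually raises.

```python
import scrapery as spy

malformed = "<html><body><h1>Unclosed"

try:
    # Lenient HTML parsers usually recover from unclosed tags, but badly
    # broken input may still fail to parse.
    doc = spy.parse_html(malformed)
    print(spy.get_selector_content(doc, selector="h1"))
except Exception as exc:  # stand-in: swap in scrapery's specific exceptions
    print(f"Parse failed: {exc!r}")
```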


### ⚡ Performance Comparison

The following benchmarks were run on sample HTML and JSON data to compare **scrapery** with other popular Python libraries. Performance may vary depending on system, Python version, and file size.

| Library                 | HTML Parse Time | JSON Parse Time |
|-------------------------|----------------|----------------|
| **scrapery**            | 12 ms          | 8 ms           |
| **Other library**       | 120 ms         | N/A            |

> ⚠️ Actual performance may vary depending on your environment. These results are meant for **illustrative purposes** only. No library is endorsed or affiliated with scrapery.
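
As a rough guide to reproducing numbers like these yourself, here is a minimal sketch using the standard-library `timeit` module. The payloads and iteration count are arbitrary, and only the `parse_html`/`parse_json` calls shown later in this README are assumed:

```python
import timeit

import scrapery as spy

# Arbitrary sample payloads; a real benchmark should use representative data.
html_sample = "<html><body>" + "<p>row</p>" * 1_000 + "</body></html>"
json_sample = '{"rows": [' + ",".join(['{"v": 1}'] * 1_000) + "]}"

# Mean wall-clock time per parse over 100 runs, in milliseconds.
html_ms = timeit.timeit(lambda: spy.parse_html(html_sample), number=100) / 100 * 1e3
json_ms = timeit.timeit(lambda: spy.parse_json(json_sample), number=100) / 100 * 1e3
print(f"HTML parse: {html_ms:.2f} ms, JSON parse: {json_ms:.2f} ms")
```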


---

## 📦 Installation

```bash
pip install scrapery
```

---

## HTML Example

```python
import scrapery as spy

html_content = """
<html>
    <body>
        <h1>Welcome</h1>
        <p>Hello<br>World</p>
        <a href="/about">About Us</a>
        <table>
            <tr><th>Name</th><th>Age</th></tr>
            <tr><td>John</td><td>30</td></tr>
            <tr><td>Jane</td><td>25</td></tr>
        </table>
    </body>
</html>
"""

# Parse HTML content
doc = spy.parse_html(html_content)

# Extract text
# CSS selector: First <h1>
print(spy.get_selector_content(doc, selector="h1"))  
# ➜ Welcome

# XPath: First <h1>
print(spy.get_selector_content(doc, selector="//h1"))  
# ➜ Welcome

# CSS selector: <a href> attribute
print(spy.get_selector_content(doc, selector="a", attr="href"))  
# ➜ /about

# XPath: <a> element href
print(spy.get_selector_content(doc, selector="//a", attr="href"))  
# ➜ /about

# CSS: First <td> in table (John)
print(spy.get_selector_content(doc, selector="td"))  
# ➜ John

# XPath: second <td> in a row (first match: John's age)
print(spy.get_selector_content(doc, selector="//td[2]"))
# ➜ 30

# XPath: Jane's age (//tr[3]/td[2])
print(spy.get_selector_content(doc, selector="//tr[3]/td[2]"))  
# ➜ 25

# No CSS selector or XPath: full document text
print(spy.get_selector_content(doc))  
# ➜ Welcome HelloWorld About Us Name Age John 30 Jane 25

# Root attribute (lang, if it existed)
print(spy.get_selector_content(doc, attr="lang"))  
# ➜ None

# Extract links
links = spy.extract_links(doc)
print("Links:", links)

# Resolve relative URLs
spy.resolve_relative_urls(doc, "https://example.com/")
print("Absolute link:", doc.xpath("//a/@href")[0])

# Extract tables
tables = spy.get_selector_tables(doc, as_dicts=True)
print("Tables:", tables)

# DOM Navigation
h1_elem = doc.xpath("//h1")[0]
parent = spy.get_parent(h1_elem)
children = spy.get_children(doc)
siblings = spy.get_next_sibling(h1_elem)
ancestors = spy.get_ancestors(h1_elem)
print("Parent tag:", parent.tag)
print("Children count:", len(children))
print("Next sibling tag:", siblings.tag if siblings else None)
print("Ancestors:", [a.tag for a in ancestors])

# Metadata
metadata = spy.get_metadata(doc)
print("Metadata:", metadata)

```

## XML Example

```python
xml_content = """
<users>
    <user id="1"><name>John</name></user>
    <user id="2"><name>Jane</name></user>
</users>
"""

xml_doc = spy.parse_xml(xml_content)
users = spy.find_xml_all(xml_doc, "//user")
for u in users:
    print(u.attrib, u.xpath("./name/text()")[0])

# Convert XML to dict
xml_dict = spy.xml_to_dict(xml_doc)
print(xml_dict)

```

## JSON Example

```python
json_content = '{"users":[{"name":"John","age":30},{"name":"Jane","age":25}]}'
data = spy.parse_json(json_content)

# Access using path
john_age = spy.json_get_value(data, "users.0.age")
print("John's age:", john_age)

# Extract all names
names = spy.json_extract_values(data, "name")
print("Names:", names)

# Flatten JSON
flat = spy.json_flatten(data)
print("Flattened JSON:", flat)

```

## Async Fetch Example

```python
import asyncio

urls = ["https://example.com", "https://httpbin.org/get"]

async def fetch_urls():
    result = await spy.fetch_multiple_urls(urls)
    print(result)

asyncio.run(fetch_urls())
```
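
The shape of `fetch_multiple_urls`' return value isn't documented above; assuming it yields one response body per URL, in input order (an assumption, not a confirmed contract), fetching and parsing compose naturally:

```python
import asyncio

import scrapery as spy

urls = ["https://example.com", "https://httpbin.org/get"]

async def fetch_and_parse():
    # Assumption: fetch_multiple_urls returns one response body (str) per URL,
    # in the same order as the input list. Adjust if the actual API differs.
    bodies = await spy.fetch_multiple_urls(urls)
    for url, body in zip(urls, bodies):
        doc = spy.parse_html(body)
        print(url, "->", spy.get_selector_content(doc, selector="title"))

asyncio.run(fetch_and_parse())
```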



            
