# 🕷️ scrapery
A blazing fast, lightweight, and modern parsing library for **HTML, XML, and JSON**, designed for **web scraping** and **data extraction**.
It supports both **XPath** and **CSS** selectors, along with seamless **DOM navigation**, making parsing and extracting data straightforward and intuitive.
📘 **Full Documentation**: [https://scrapery.readthedocs.io](https://scrapery.readthedocs.io)
---
## ✨ Features
- ⚡ **Blazing Fast Performance** – Optimized for high-speed HTML, XML, and JSON parsing
- 🎯 **Dual Selector Support** – Use **XPath** or **CSS selectors** for flexible extraction
- 🛡 **Comprehensive Error Handling** – Detailed exceptions for different error scenarios
- 🧩 **Robust Parsing** – Encoding detection and content normalization for reliable results
- 🧑‍💻 **Function-Based API** – Clean and intuitive interface for ease of use
- 📦 **Multi-Format Support** – Parse **HTML, XML, and JSON** in a single library
- ⚙️ **Versatile File Management** – Create directories, list files, and handle paths effortlessly
- 📝 **Smart String Normalization** – Clean text by fixing encodings, removing HTML tags, and standardizing whitespace
- 🔍 **Flexible CSV, Excel & Database Handling** – Read, filter, save, and append data
- 🔄 **Efficient JSON Streaming & Reading** – Stream large JSON files or load fully with encoding detection
- 💾 **Robust File Reading & Writing** – Auto-detect encoding, support large files with mmap, and save JSON or plain text cleanly
- 🌐 **URL & Domain Utilities** – Extract base domains accurately using industry-standard parsing (see the sketch after this list)
- 🛡 **Input Validation & Error Handling** – Custom validations to ensure reliable data processing
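
A quick illustration of the base-domain idea behind the **URL & Domain Utilities** bullet: accurate extractors consult the Public Suffix List rather than splitting hostnames on dots. scrapery's own helper is covered in the full documentation; the sketch below uses the standalone `tldextract` package purely to show the underlying approach, not scrapery's API.

```python
# Illustration only: Public Suffix List-based extraction via tldextract,
# not scrapery's own API (see the scrapery docs for its helper).
import tldextract

ext = tldextract.extract("https://forums.news.example.co.uk/path")
print(ext.registered_domain)  # example.co.uk  (the true base domain)
print(ext.subdomain)          # forums.news
```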
### ⚡ Performance Comparison
The following benchmarks were run on sample HTML and JSON data to compare **scrapery** with other popular Python libraries.
| Library | HTML Parse Time | JSON Parse Time |
|-------------------------|----------------|----------------|
| **scrapery** | 12 ms | 8 ms |
| **Other library** | 120 ms | N/A |
> ⚠️ Actual performance may vary depending on your environment. These results are for **illustrative purposes** only; no other library is endorsed by, or affiliated with, scrapery.
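
To sanity-check these numbers in your own environment, a sketch like the following can be used once scrapery is installed. The sample file names are placeholders; `parse_html` and `parse_json` are the entry points demonstrated later in this README.

```python
# Rough benchmark sketch: average parse time in milliseconds over 100 runs.
# sample.html / sample.json are placeholders for your own test data.
import timeit

setup = (
    "from scrapery import parse_html, parse_json\n"
    'html = open("sample.html", encoding="utf-8").read()\n'
    'raw = open("sample.json", encoding="utf-8").read()\n'
)

html_ms = timeit.timeit("parse_html(html)", setup=setup, number=100) / 100 * 1e3
json_ms = timeit.timeit("parse_json(raw)", setup=setup, number=100) / 100 * 1e3
print(f"HTML parse: {html_ms:.1f} ms | JSON parse: {json_ms:.1f} ms")
```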
---
## 📦 Installation
```bash
pip install scrapery
```

## 🚀 Usage Examples

```python
# -------------------------------
# HTML Example
# -------------------------------
from scrapery import *
html_content = """
<html>
<body>
<h1>Welcome</h1>
<p>Hello<br>World</p>
<a href="/about">About Us</a>
<img src="/images/logo.png">
<table>
<tr><th>Name</th><th>Age</th></tr>
<tr><td>John</td><td>30</td></tr>
<tr><td>Jane</td><td>25</td></tr>
</table>
</body>
</html>
"""
# Parse HTML content
html_doc = parse_html(html_content)
# Pretty-print the parsed HTML
print(prettify(html_doc))
# Get all table rows
rows = select_all(html_doc, "table tr")
print("All table rows:")
for row in rows:
    print(selector_content(row))

# Output:
#   All table rows:
#   NameAge
#   John30
#   Jane25
# Get first paragraph
paragraph = select_one(html_doc, "p")
print("First paragraph text:", selector_content(paragraph))
# ➜ First paragraph text: HelloWorld
# CSS selector: First <h1>
print(selector_content(html_doc, selector="h1"))
# ➜ Welcome
# XPath: First <h1>
print(selector_content(html_doc, selector="//h1"))
# ➜ Welcome
# CSS selector: <a href> attribute
print(selector_content(html_doc, selector="a", attr="href"))
# ➜ /about
# XPath: <a> element href
print(selector_content(html_doc, selector="//a", attr="href"))
# ➜ /about
# CSS: First <td> in table (John)
print(selector_content(html_doc, selector="td"))
# ➜ John
# XPath: Second <td> (//td[2] = 30)
print(selector_content(html_doc, selector="//td[2]"))
# ➜ 30
# XPath: Jane's age (//tr[3]/td[2])
print(selector_content(html_doc, selector="//tr[3]/td[2]"))
# ➜ 25
# No CSS selector or XPath: full document text
print(selector_content(html_doc))
# ➜ Welcome HelloWorld About Us Name Age John 30 Jane 25
# Root attribute (lang, if it existed)
print(selector_content(html_doc, attr="lang"))
# ➜ None
#-------------------------
# Embedded Data
#-------------------------
html_content = """
<html>
<head>
<script>
window.__INITIAL_STATE__ = {
"user": {"id": 1, "name": "Alice"},
"isLoggedIn": true
};
</script>
</head>
<body></body>
</html>
"""
json_data = embedded_json(page_source=html_content, start_keyword="window.__INITIAL_STATE__ =")
print(json_data)
# Output:
# {
#     "user": {"id": 1, "name": "Alice"},
#     "isLoggedIn": True
# }
html_with_ldjson = """
<html>
<head>
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "Person",
"name": "Alice"
}
</script>
</head>
</html>
"""
ld_json = embedded_json(page_source=html_with_ldjson, selector="[type*='application/ld+json']")
print(ld_json)

# Output:
# [{
#     "@context": "http://schema.org",
#     "@type": "Person",
#     "name": "Alice"
# }]
#-------------------------
# DOM navigation
#-------------------------
# Example 1: parent, children, siblings
p_elem = select_one(html_doc,"p")
print("Parent tag of <p>:", parent(p_elem).tag)
print("Children of <p>:", [c.tag for c in children(p_elem)])
print("Siblings of <p>:", [s.tag for s in siblings(p_elem)])
# Example 2: next_sibling, prev_sibling
print("Next sibling of <p>:", next_sibling(p_elem).tag)
print("Previous sibling of <p>:", prev_sibling(p_elem).tag)
# Example 3: ancestors and descendants
ancs = ancestors(p_elem)
print("Ancestor tags of <p>:", [a.tag for a in ancs])
desc = descendants(select_one(html_doc,"table"))
print("Descendant tags of <table>:", [d.tag for d in desc])
# Example 4: class utilities
div_html = '<div class="card primary"></div>'
div_elem = parse_html(div_html)
print("Has class 'card'? ->", has_class(div_elem, "card"))
print("Classes:", get_classes(div_elem))
# -------------------------------
# Resolve relative URLs
# -------------------------------
base = "https://example.com"
# Get <a> links
print(absolute_url(html_doc, "a", base_url=base))
# → 'https://example.com/about'
# Get <img> sources
print(absolute_url(html_doc, "img", base_url=base, attr="src"))
# → 'https://example.com/images/logo.png'
# -------------------------------
# XML Example
# -------------------------------
# Parsing XML from a string
xml_content = """<root>
<child>Test</child>
</root>
"""
xml_doc = parse_xml(xml_content)
print(xml_doc)
# Pretty print XML
print(prettify(xml_doc))
# Select all child elements using CSS selector
all_elements = select_all(xml_doc, "child")
print(all_elements)
# Select one child element using XPath selector
child = select_one(xml_doc, "//child")
print(child)
# Extract content from an element
content = selector_content(xml_doc, "child")
print(content)
# Get the parent element of a child
parent_element = parent(child)
print(parent_element)
# Get all children of the root element
# (use a distinct variable name so the children() helper is not shadowed)
root_children = children(xml_doc)
print(root_children)

# Find the first child element with a specific tag
first_child = xml_find(xml_doc, "child")
print(first_child)

# Find all child elements with a specific tag
matching_children = xml_find_all(xml_doc, "child")
print(matching_children)
# Execute XPath expression
result = xml_xpath(xml_doc, "//child")
print(result)
# Apply XSLT transformation
xslt = """<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="/">
<html>
<body>
<xsl:value-of select="/root/child"/>
</body>
</html>
</xsl:template>
</xsl:stylesheet>"""
transformed = xml_transform(xml_doc, xslt)
print(prettify(transformed))
# Validate XML against an XSD schema (requires schema.xsd on disk)
from pathlib import Path

is_valid = xml_validate_xsd(xml_doc, Path("schema.xsd"))
print(is_valid)
# Create a new element and add it to the root
new_element = xml_create_element("newTag", text="This is new", id="123")
xml_add_child(xml_doc, new_element)
print(prettify(xml_doc))
# Set an attribute on an element
xml_set_attr(new_element, "id", "456")
print(prettify(new_element))
# -------------------------------
# JSON Example
# -------------------------------
json_str = '{"user": {"profile": {"name": "Alice"}}}'
data = parse_json(json_str)  # returns a Python dict
# Get first key match
print(json_content(json_str, keys=["name"], position="first"))
# ➜ {'name': 'Alice'}
# Follow nested path
print(json_content(json_str, keys=["user", "profile", "name"], position="last"))
# ➜ Alice
# -------------------------------
# Utility Example
# -------------------------------
# 1. Create a Directory
from scrapery import create_directory
# Creates a directory if it doesn't already exist.
# Example 1: Creating a new directory
create_directory("new_folder")
# Example 2: Creating nested directories
create_directory("parent_folder/sub_folder")
# ================================================================
# 2. Standardize a String
from scrapery import standardized_string
# This function standardizes the input string by removing escape sequences like \n, \t, and \r, removing HTML tags, collapsing multiple spaces, and trimming leading/trailing spaces.
# Example 1: Standardize a string with newlines, tabs, and HTML tags
input_string_1 = "<html><body> Hello \nWorld! \tThis is a test. </body></html>"
print("Standardized String 1:", standardized_string(input_string_1))
# Example 2: Input string with multiple spaces and line breaks
input_string_2 = " This is a \n\n string with spaces and \t tabs. "
print("Standardized String 2:", standardized_string(input_string_2))
# Example 3: Pass an empty string
input_string_3 = ""
print("Standardized String 3:", standardized_string(input_string_3))
# Example 4: Pass None (invalid input)
input_string_4 = None
print("Standardized String 4:", standardized_string(input_string_4))
# ================================================================
# 3. Replace a String
from scrapery import replace_content
text = "posting posting posting"
# Example 1: Replace all occurrences
result = replace_content(text, "posting", "UPDATED")
print(result)
# Output: "UPDATED UPDATED UPDATED"
# Example 2: Replace only the 2nd occurrence (position)
result = replace_content(text, "posting", "UPDATED", position=2)
print(result)
# Output: "posting UPDATED posting"
# Example 3: Case-insensitive replacement
text = "Posting POSTING posting"
result = replace_content(text, "posting", "edited", ignore_case=True, position=2)
print(result)
# Output: "Posting edited posting"
# Example 4: Limit number of replacements (count)
text = "apple apple apple"
result = replace_content(text, "apple", "orange", count=2)
print(result)
# Output: "orange orange apple"
# Example 5: Replace in a file
# example.txt contains: "error error error"
replace_content("example.txt", "error", "warning", ignore_case=True)
# The file now contains: "warning warning warning"
# ================================================================
# 4. Read CSV
from scrapery import read_csv
csv_file_path = 'data.csv'
get_value_by_col_name = 'URL'
filter_col_name = 'Category'
include_filter_col_values = ['Tech']
result = read_csv(csv_file_path, get_value_by_col_name, filter_col_name, include_filter_col_values)
print(result)
# Sample CSV (data.csv):
#   Category,URL
#   Tech,https://tech1.com
#   Tech,https://tech2.com
#   Science,https://science1.com
#
# Result:
#   ['https://tech1.com', 'https://tech2.com']
# ================================================================
# 5. Save to CSV
from scrapery import save_to_csv
data_list = [[1, 'Alice', 23], [2, 'Bob', 30], [3, 'Charlie', 25]]
headers = ['ID', 'Name', 'Age']
output_file_path = 'output_data.csv'
# Default separator (comma)
save_to_csv(data_list, headers, output_file_path)
# Tab separator
save_to_csv(data_list, headers, output_file_path, sep="\t")
# Semicolon separator
save_to_csv(data_list, headers, output_file_path, sep=";")
# Output (default, sep=","):
#   ID,Name,Age
#   1,Alice,23
#   2,Bob,30
#   3,Charlie,25
#
# Output (sep="\t"):
#   ID Name Age
#   1 Alice 23
#   2 Bob 30
#   3 Charlie 25
# ================================================================
# 6. Save to Excel file
from scrapery import save_to_xls

save_to_xls(data_list, headers, "output_data.xlsx")
# ================================================================
# 7. Save to SQLite database
from scrapery import save_to_db

# Creates a SQLite database file named data.sqlite in the current folder
# and adds a table called "data".
save_to_db(data_list, headers)

# Creates mydb.sqlite in the given folder ("report") and adds a table called "User".
save_to_db(data_list, headers, auto_data_type=False, output_file_path="report/mydb.sqlite", table_name="User")
# ================================================================
# 8. List files in a directory
from scrapery import list_files

output_dir = "output"  # directory to scan
files = list_files(directory=output_dir, extension="csv")
print("CSV files in output directory:", files)
# ================================================================
# 9. Read back file content
from scrapery import read_file_content
# Example 1: Read small JSON file fully
file_path_small_json = 'small_data.json'
content = read_file_content(file_path_small_json, stream_json=False)
print("Small JSON file content (fully loaded):")
print(content) # content will be a dict or list depending on JSON structure
# Example 2: Read large JSON file by streaming (returns a generator)
from typing import Generator

file_path_large_json = 'large_data.json'
json_stream: Generator[dict, None, None] = read_file_content(file_path_large_json, stream_json=True)
print("\nLarge JSON file content streamed:")
for item in json_stream:
    print(item)  # process each streamed JSON object one by one
# Example 3: Read a large text file using mmap
file_path_large_txt = 'large_text.txt'
text_content = read_file_content(file_path_large_txt)
print("\nLarge text file content (using mmap):")
print(text_content[:500]) # print first 500 characters
# Example 4: Read a small text file with encoding detection
file_path_small_txt = 'small_text.txt'
text_content = read_file_content(file_path_small_txt)
print("\nSmall text file content (with encoding detection):")
print(text_content)
# ================================================================
# 10. Save to file
from scrapery import save_file_content
# Example 1: Save plain text content to a file
text_content = "Hello, this is a sample text file.\nWelcome to file handling in Python!"
save_file_content("output/text_file.txt", text_content)
# Output: Content successfully written to output/text_file.txt
# Example 2: Save JSON content to a file
json_content = {
"name": "Alice",
"age": 30,
"skills": ["Python", "Data Science", "Machine Learning"]
}
save_file_content("output/data.json", json_content)
# Output: JSON content successfully written to output/data.json
# Example 3: Save number (non-string content) to a file
number_content = 12345
save_file_content("output/number.txt", number_content)
# Output: Content successfully written to output/number.txt
# Example 4: Append text content to an existing file
append_text = "\nThis line is appended."
save_file_content("output/text_file.txt", append_text, mode="a")
# Output: Content successfully written to output/text_file.txt
```
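
Putting several of the pieces above together, here is a compact end-to-end sketch: fetch a page, extract link text and hrefs, and save them to CSV. The URL is a placeholder, fetching uses the standard library (scrapery is a parsing library, not an HTTP client), and it assumes `selector_content` accepts an element plus `attr`, as the attribute examples above suggest.

```python
# End-to-end sketch: fetch -> parse -> extract -> save.
# https://example.com is a placeholder URL.
from urllib.request import urlopen

from scrapery import parse_html, save_to_csv, select_all, selector_content

html = urlopen("https://example.com").read().decode("utf-8")
doc = parse_html(html)

rows = [
    [selector_content(a), selector_content(a, attr="href")]
    for a in select_all(doc, "a")
]
save_to_csv(rows, ["text", "href"], "links.csv")
```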