scrapery

Name: scrapery
Version: 0.1.19
Summary: Scrapery: A fast, lightweight library to scrape HTML, XML, and JSON using XPath, CSS selectors, and intuitive DOM navigation.
Upload time: 2025-10-27 08:25:43
Author: Ramesh Chandra
Requires Python: >=3.8
License: MIT
Keywords: web scraping, html parser, xml parser, json parser, lxml, ujson, data extraction, scraping tools
Requirements: No requirements were recorded.
# 🕷️ scrapery


[![PyPI Version](https://img.shields.io/pypi/v/scrapery)](https://pypi.org/project/scrapery/)
[![Python Versions](https://img.shields.io/pypi/pyversions/scrapery)](https://pypi.org/project/scrapery/)
[![Downloads](https://img.shields.io/pypi/dm/scrapery)](https://pypi.org/project/scrapery/)
![License](https://img.shields.io/badge/License-MIT-brightgreen)
[![Documentation Status](https://readthedocs.org/projects/scrapery/badge/?version=latest)](https://scrapery.readthedocs.io/en/latest/?badge=latest)

A blazing fast, lightweight, and modern parsing library for **HTML, XML, and JSON**, designed for **web scraping** and **data extraction**.  
It supports both **XPath** and **CSS** selectors, along with seamless **DOM navigation**, making parsing and extracting data straightforward and intuitive.

📘 **Full Documentation**: [https://scrapery.readthedocs.io](https://scrapery.readthedocs.io)
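
For a feel of the API before the detailed walkthrough below, here is a minimal quick-start sketch built only from functions shown later in this README (`parse_html`, `selector_content`):

```python
from scrapery import parse_html, selector_content

# Parse a small HTML snippet and query it with either CSS or XPath.
doc = parse_html("<html><body><h1>Hi</h1><a href='/docs'>Docs</a></body></html>")

print(selector_content(doc, selector="h1"))                 # CSS   ➜ Hi
print(selector_content(doc, selector="//a", attr="href"))   # XPath ➜ /docs
```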

---

## ✨ Features

- ⚡ **Blazing Fast Performance** – Optimized for high-speed HTML, XML, and JSON parsing  
- 🎯 **Dual Selector Support** – Use **XPath** or **CSS selectors** for flexible extraction  
- 🛡 **Comprehensive Error Handling** – Detailed exceptions for different error scenarios
- 🧩 **Robust Parsing** – Encoding detection and content normalization for reliable results  
- 🧑‍💻 **Function-Based API** – Clean and intuitive interface for ease of use  
- 📦 **Multi-Format Support** – Parse **HTML, XML, and JSON** in a single library
- ⚙️ **Versatile File Management** – Create directories, list files, and handle paths effortlessly
- 📝 **Smart String Normalization** – Clean text by fixing encodings, removing HTML tags, and standardizing whitespace
- 🔍 **Flexible CSV, Excel & Database Handling** – Read, filter, save, and append data
- 🔄 **Efficient JSON Streaming & Reading** – Stream large JSON files or load fully with encoding detection
- 💾 **Robust File Reading & Writing** – Auto-detect encoding, support large files with mmap, and save JSON or plain text cleanly
- 🌐 **URL & Domain Utilities** – Extract base domains accurately using industry-standard parsing
- 🛡 **Input Validation & Error Handling** – Custom validations to ensure reliable data processing



### ⚡ Performance Comparison

The following benchmarks were run on sample HTML and JSON data to compare **scrapery** with another popular Python parsing library.


| Library                 | HTML Parse Time | JSON Parse Time |
|-------------------------|----------------|----------------|
| **scrapery**            | 12 ms          | 8 ms           |
| **Other library**       | 120 ms         | N/A            |

> ⚠️ Actual performance may vary depending on your environment. These results are meant for **illustrative purposes** only. No library is endorsed or affiliated with scrapery.
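
If you want numbers for your own environment, a rough timing harness can be put together with nothing more than `time.perf_counter` and the `parse_html` / `parse_json` functions shown below. This is an illustrative sketch, not the script used for the table above; the sample payloads are stand-ins.

```python
import time
from scrapery import parse_html, parse_json

# Synthetic payloads purely for illustration.
html_sample = "<html><body>" + "<p>row</p>" * 1_000 + "</body></html>"
json_sample = '{"items": [' + ",".join('{"id": %d}' % i for i in range(1_000)) + "]}"

def time_it(func, payload, runs=100):
    # Average parse time in milliseconds over `runs` iterations.
    start = time.perf_counter()
    for _ in range(runs):
        func(payload)
    return (time.perf_counter() - start) / runs * 1000

print(f"HTML parse: {time_it(parse_html, html_sample):.2f} ms")
print(f"JSON parse: {time_it(parse_json, json_sample):.2f} ms")
```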


---

## 📦 Installation

```bash
pip install scrapery
```

## 🚀 Usage Examples

```python
# -------------------------------
# HTML Example
# -------------------------------
from scrapery import *

html_content = """
<html>
    <body>
        <h1>Welcome</h1>
        <p>Hello<br>World</p>
        <a href="/about">About Us</a>
        <img src="/images/logo.png">
        <table>
            <tr><th>Name</th><th>Age</th></tr>
            <tr><td>John</td><td>30</td></tr>
            <tr><td>Jane</td><td>25</td></tr>
        </table>
    </body>
</html>
"""

# Parse HTML content
html_doc = parse_html(html_content)

# Pretty-print the parsed HTML
print(prettify(html_doc))

# Get all table rows
rows = select_all(html_doc, "table tr")
print("All table rows:")
for row in rows:
    print(selector_content(row))

# Output:
#   All table rows:
#   NameAge
#   John30
#   Jane25

# Get first paragraph
paragraph = select_one(html_doc, "p")
print("First paragraph text:", selector_content(paragraph))
# ➜ First paragraph text: HelloWorld

# CSS selector: First <h1>
print(selector_content(html_doc, selector="h1"))  
# ➜ Welcome

# XPath: First <h1>
print(selector_content(html_doc, selector="//h1"))  
# ➜ Welcome

# CSS selector: <a href> attribute
print(selector_content(html_doc, selector="a", attr="href"))  
# ➜ /about

# XPath: <a> element href
print(selector_content(html_doc, selector="//a", attr="href"))  
# ➜ /about

# CSS: First <td> in table (John)
print(selector_content(html_doc, selector="td"))  
# ➜ John

# XPath: Second <td> (//td[2] = 30)
print(selector_content(html_doc, selector="//td[2]"))  
# ➜ 30

# XPath: Jane's age (//tr[3]/td[2])
print(selector_content(html_doc, selector="//tr[3]/td[2]"))  
# ➜ 25

# No CSS selector or XPath: full document text
print(selector_content(html_doc))  
# ➜ Welcome HelloWorld About Us Name Age John 30 Jane 25

# Root attribute (lang, if it existed)
print(selector_content(html_doc, attr="lang"))  
# ➜ None

#-------------------------
# Embedded Data
#-------------------------

html_content = """
<html>
<head>
  <script>
    window.__INITIAL_STATE__ = {
      "user": {"id": 1, "name": "Alice"},
      "isLoggedIn": true
    };
  </script>
</head>
<body></body>
</html>
"""

json_data = embedded_json(page_source=html_content, start_keyword="window.__INITIAL_STATE__ =")
print(json_data)

# Output:
# {
#   "user": {"id": 1, "name": "Alice"},
#   "isLoggedIn": True
# }


html_with_ldjson = """
<html>
  <head>
    <script type="application/ld+json">
      {
        "@context": "http://schema.org",
        "@type": "Person",
        "name": "Alice"
      }
    </script>
  </head>
</html>
"""

ld_json = embedded_json(page_source=html_with_ldjson, selector="[type*='application/ld+json']")
print(ld_json)

# Output:
# [{
#   "@context": "http://schema.org",
#   "@type": "Person",
#   "name": "Alice"
# }]

#-------------------------
# DOM navigation
#-------------------------
# Example 1: parent, children, siblings
p_elem = select_one(html_doc,"p")
print("Parent tag of <p>:", parent(p_elem).tag)
print("Children of <p>:", [c.tag for c in children(p_elem)])
print("Siblings of <p>:", [s.tag for s in siblings(p_elem)])

# Example 2: next_sibling, prev_sibling
print("Next sibling of <p>:", next_sibling(p_elem).tag)
print("Previous sibling of <p>:", prev_sibling(p_elem).tag)

# Example 3: ancestors and descendants
ancs = ancestors(p_elem)
print("Ancestor tags of <p>:", [a.tag for a in ancs])
desc = descendants(select_one(html_doc,"table"))
print("Descendant tags of <table>:", [d.tag for d in desc])

# Example 4: class utilities
div_html = '<div class="card primary"></div>'
div_elem = parse_html(div_html)
print("Has class 'card'? ->", has_class(div_elem, "card"))
print("Classes:", get_classes(div_elem))

# -------------------------------
# Resolve relative URLs
# -------------------------------
base = "https://example.com"

# Get <a> links
print(absolute_url(html_doc, "a", base_url=base))
# → 'https://example.com/about'

# Get <img> sources
print(absolute_url(html_doc, "img", base_url=base, attr="src"))
# → 'https://example.com/images/logo.png'

# -------------------------------
# XML Example
# -------------------------------

# Parsing XML from a string
xml_content = """<root>
                    <child>Test</child>
                </root>
            """

xml_doc = parse_xml(xml_content)
print(xml_doc)

# Pretty print XML
print(prettify(xml_doc))

# Select all child elements using CSS selector
all_elements = select_all(xml_doc, "child")
print(all_elements)

# Select one child element using XPath selector
child = select_one(xml_doc, "//child")
print(child)

# Extract content from an element
content = selector_content(xml_doc, "child")
print(content)

# Get the parent element of a child
parent_element = parent(child)
print(parent_element)

# Get all children of the root element
root_children = children(xml_doc)
print(root_children)

# Find the first child element with a specific tag
child = xml_find(xml_doc, "child")
print(child)

# Find all child elements with a specific tag
matching_children = xml_find_all(xml_doc, "child")
print(matching_children)

# Execute XPath expression
result = xml_xpath(xml_doc, "//child")
print(result)

# Apply XSLT transformation
xslt = """<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="/">
        <html>
            <body>
                <xsl:value-of select="/root/child"/>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>"""

transformed = xml_transform(xml_doc, xslt)
print(prettify(transformed))

# Validate XML against an XSD schema (Path comes from Python's pathlib)
from pathlib import Path
is_valid = xml_validate_xsd(xml_doc, Path("schema.xsd"))
print(is_valid)

# Create a new element and add it to the root
new_element = xml_create_element("newTag", text="This is new", id="123")
xml_add_child(xml_doc, new_element)
print(prettify(xml_doc))

# Set an attribute on an element
xml_set_attr(new_element, "id", "456")
print(prettify(new_element))

# -------------------------------
# JSON Example
# -------------------------------

json_str = '{"user": {"profile": {"name": "Alice"}}}'
data = parse_json(json_str)  # parse_json returns the parsed data; json_content below also works on the raw string

# Get first key match
print(json_content(json_str, keys=["name"], position="first"))
# ➜ {'name': 'Alice'}

# Follow nested path
print(json_content(json_str, keys=["user", "profile", "name"], position="last"))
# ➜ Alice

# -------------------------------
# Utility Example
# -------------------------------

# 1. Create a Directory

from scrapery import create_directory
# Creates a directory if it doesn't already exist.


# Example 1: Creating a new directory
create_directory("new_folder")

# Example 2: Creating nested directories
create_directory("parent_folder/sub_folder")

# ================================================================
# 2. Standardize a String

from scrapery import standardized_string
# This function standardizes the input string by removing escape sequences like \n, \t, and \r, removing HTML tags, collapsing multiple spaces, and trimming leading/trailing spaces.

# Example 1: Standardize a string with newlines, tabs, and HTML tags
input_string_1 = "<html><body>  Hello \nWorld!  \tThis is a test.  </body></html>"
print("Standardized String 1:", standardized_string(input_string_1))

# Example 2: Input string with multiple spaces and line breaks
input_string_2 = "  This   is   a  \n\n   string   with  spaces and \t tabs.  "
print("Standardized String 2:", standardized_string(input_string_2))

# Example 3: Pass an empty string
input_string_3 = ""
print("Standardized String 3:", standardized_string(input_string_3))

# Example 4: Pass None (invalid input)
input_string_4 = None
print("Standardized String 4:", standardized_string(input_string_4))

# ================================================================
# 3. Replace a String

from scrapery import replace_content

text = "posting posting posting"

# Example 1: Replace all occurrences
result = replace_content(text, "posting", "UPDATED")
print(result)
# Output: "UPDATED UPDATED UPDATED"

# Example 2: Replace only the 2nd occurrence (position)
result = replace_content(text, "posting", "UPDATED", position=2)
print(result)
# Output: "posting UPDATED posting"

# Example 3: Case-insensitive replacement
text = "Posting POSTING posting"
result = replace_content(text, "posting", "edited", ignore_case=True, position=2)
print(result)
# Output: "Posting edited posting"

# Example 4: Limit number of replacements (count)
text = "apple apple apple"
result = replace_content(text, "apple", "orange", count=2)
print(result)
# Output: "orange orange apple"

# Example 5: Replace in a file

# example.txt contains: "error error error"
replace_content("example.txt", "error", "warning", ignore_case=True)
# The file now contains: "warning warning warning"

# ================================================================
# 4. Read CSV

from scrapery import read_csv

csv_file_path = 'data.csv'
get_value_by_col_name = 'URL'
filter_col_name = 'Category'
include_filter_col_values = ['Tech']

result = read_csv(csv_file_path, get_value_by_col_name, filter_col_name, include_filter_col_values)
print(result)

# Sample CSV (data.csv):
#   Category,URL
#   Tech,https://tech1.com
#   Tech,https://tech2.com
#   Science,https://science1.com
#
# Result:
#   ['https://tech1.com', 'https://tech2.com']

# ================================================================
# 5. Save to CSV

from scrapery import save_to_csv

data_list = [[1, 'Alice', 23], [2, 'Bob', 30], [3, 'Charlie', 25]]
headers = ['ID', 'Name', 'Age']
output_file_path = 'output_data.csv'

# Default separator (comma)
save_to_csv(data_list, headers, output_file_path)

# Tab separator
save_to_csv(data_list, headers, output_file_path, sep="\t")

# Semicolon separator
save_to_csv(data_list, headers, output_file_path, sep=";")

# Output (default, sep=","):
#   ID,Name,Age
#   1,Alice,23
#   2,Bob,30
#   3,Charlie,25
#
# Output (sep="\t"):
#   ID  Name    Age
#   1   Alice   23
#   2   Bob     30
#   3   Charlie 25

# ================================================================
# 6. Save to Excel file

from scrapery import save_to_xls

save_to_xls(data_list, headers, output_file_path)

# ================================================================
# 7. Save to SQLite database

from scrapery import save_to_db

# Creates a SQLite database file named data.sqlite in the current folder and adds a table called data.
save_to_db(data_list, headers)

# Creates a SQLite database file named mydb.sqlite in the given folder (report) and adds a table called User.
save_to_db(data_list, headers, auto_data_type=False, output_file_path="report/mydb.sqlite", table_name="User")

# ================================================================
# 8. List files in a directory

from scrapery import list_files

output_dir = "output"  # example directory to scan
files = list_files(directory=output_dir, extension="csv")
print("CSV files in output directory:", files)

# ================================================================
# 9. Read back file content

from scrapery import read_file_content

# Example 1: Read small JSON file fully
file_path_small_json = 'small_data.json'
content = read_file_content(file_path_small_json, stream_json=False)
print("Small JSON file content (fully loaded):")
print(content)  # content will be a dict or list depending on JSON structure

# Example 2: Read large JSON file by streaming (returns a generator)
from typing import Generator  # needed for the annotation below

file_path_large_json = 'large_data.json'
json_stream: Generator[dict, None, None] = read_file_content(file_path_large_json, stream_json=True)
print("\nLarge JSON file content streamed:")
for item in json_stream:
    print(item)  # process each streamed JSON object one by one

# Example 3: Read a large text file using mmap
file_path_large_txt = 'large_text.txt'
text_content = read_file_content(file_path_large_txt)
print("\nLarge text file content (using mmap):")
print(text_content[:500])  # print first 500 characters

# Example 4: Read a small text file with encoding detection
file_path_small_txt = 'small_text.txt'
text_content = read_file_content(file_path_small_txt)
print("\nSmall text file content (with encoding detection):")
print(text_content)

# ================================================================
# 10. Save to file

from scrapery import save_file_content

# Example 1: Save plain text content to a file
text_content = "Hello, this is a sample text file.\nWelcome to file handling in Python!"
save_file_content("output/text_file.txt", text_content)

# Output: Content successfully written to output/text_file.txt

# Example 2: Save JSON content to a file
json_content = {
    "name": "Alice",
    "age": 30,
    "skills": ["Python", "Data Science", "Machine Learning"]
}
save_file_content("output/data.json", json_content)

# Output: JSON content successfully written to output/data.json

# Example 3: Save number (non-string content) to a file
number_content = 12345
save_file_content("output/number.txt", number_content)

# Output: Content successfully written to output/number.txt

# Example 4: Append text content to an existing file
append_text = "\nThis line is appended."
save_file_content("output/text_file.txt", append_text, mode="a")

# Output: Content successfully written to output/text_file.txt
```
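
Putting several of the pieces above together, here is a small end-to-end sketch: parse an HTML snippet, pull the table rows out with `select_all` / `selector_content`, and write them to CSV with `save_to_csv`. The HTML is inlined purely for illustration (a real scraper would get the page source from its HTTP client), and the sketch assumes `select_all` can be called on an element node the same way `selector_content` is used on row elements above.

```python
from scrapery import parse_html, select_all, selector_content, save_to_csv

page_source = """
<table>
    <tr><th>Name</th><th>Age</th></tr>
    <tr><td>John</td><td>30</td></tr>
    <tr><td>Jane</td><td>25</td></tr>
</table>
"""

doc = parse_html(page_source)

# Skip the header row and collect each data cell's text.
# Assumption: select_all accepts an element (a row) as well as a document.
rows = [
    [selector_content(cell) for cell in select_all(row, "td")]
    for row in select_all(doc, "tr")[1:]
]

save_to_csv(rows, ["Name", "Age"], "people.csv")
# people.csv now contains:
#   Name,Age
#   John,30
#   Jane,25
```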



            
