ChainableSoup

- Name: ChainableSoup
- Version: 0.1.3
- Summary: A fluent, pipeline-based interface for querying HTML/XML with BeautifulSoup.
- Home page: https://github.com/thefcraft/ChainableSoup
- Author: ThefCraft
- Requires Python: >=3.8
- Upload time: 2025-07-16 19:47:02
- Keywords: beautifulsoup, bs4, scraping, parser, html, xml, fluent, chainable, pipeline
- Requirements: No requirements were recorded.
# ChainableSoup

[![Build Status](https://img.shields.io/badge/build-passing-brightgreen)](https://github.com/thefcraft/ChainableSoup)
[![PyPI version](https://badge.fury.io/py/ChainableSoup.svg)](https://badge.fury.io/py/ChainableSoup)

**ChainableSoup** provides a fluent, pipeline-based interface for querying HTML and XML documents with BeautifulSoup, turning complex nested searches into clean, readable, and chainable method calls.

## The Problem

Working with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is great, but navigating deeply nested structures can lead to verbose and hard-to-read code:

```python
# Standard BeautifulSoup
try:
    doc = soup.find('div', class_='document')
    wrapper = doc.find('div', class_='documentwrapper')
    body_wrapper = wrapper.find('div', class_='bodywrapper')
    body = body_wrapper.find('div', class_='body')
    section = body.find('section', recursive=False)
    p_tag = section.find_all('p', recursive=False)[0]
    print(p_tag.text)
except AttributeError:
    print("One of the tags was not found.")

```

This pattern is repetitive, and the error handling can obscure the main logic.

## The Solution: A Fluent Pipeline

ChainableSoup elegantly solves this by introducing a `Pipeline` that lets you chain `find` operations. The same query becomes:

```python
from ChainableSoup import Pipeline

# With ChainableSoup
pipeline = (Pipeline()
    .find_tag('div', class_='document')
    .find_tag('div', class_='documentwrapper')
    .find_tag('div', class_='bodywrapper')
    .find_tag('div', class_='body')
    .find_tag('section', recursive=False)
    .find_all_tags('p', recursive=False)[0])

# Execute the pipeline and get the result
first_p = pipeline.raise_for_error.run(soup)
print(first_p.text)
```

or

```python 
from ChainableSoup import Pipeline, NestedArg, SpecalArg

# With ChainableSoup
pipeline = Pipeline().find_nested_tag(
    name = NestedArg() >> 'div' >> 'div' >> 'div' >> 'div' >> 'section',
    class_ = NestedArg() >> 'document' >> 'documentwrapper' >> 'bodywrapper' >> 'body',
    recursive = NestedArg() >> True >> True >> True >> True >> False >> SpecalArg.EXPANDLAST
).find_all_tags('p', recursive=False)[0]

# Execute the pipeline and get the result
first_p = pipeline.raise_for_error.run(soup)
print(first_p.text)
```

## Features

-   **Fluent Chaining:** Link `find_tag` and `find_all_tags` calls in a natural, readable sequence.
-   **Powerful Nested Searches:** Use `find_nested_tag` with `NestedArg` to perform complex deep searches with a single method call.
-   **Sequence Operations:** After a `find_all_tags` call, you can `filter`, `map`, and perform assertions on the sequence of results.
-   **Robust Error Handling:** Choose your style: either get a descriptive `Error` object back or have an exception raised automatically on failure.
-   **Intelligent Argument Resolution:** Automatically handle varying arguments for each level of a nested search.

## Installation

```bash
pip install ChainableSoup
```

## Quickstart

### 1. Basic Find

Create a `Pipeline` and chain `find_tag` calls to navigate to a specific element.

```python
from bs4 import BeautifulSoup
from ChainableSoup import Pipeline

html = '''
<body>
  <div id="content">
    <h1>Title</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body>
'''
soup = BeautifulSoup(html, 'html.parser')

# Build the pipeline
pipeline = Pipeline().find_tag('body').find_tag('div', id='content').find_tag('p')

# Execute it and raise an exception if any tag is not found
first_p = pipeline.raise_for_error.run(soup)
print(first_p.text)
# Output: First paragraph.

# Alternatively, execute without raising an error
result = pipeline.run(soup)
if not result:
    print(f"Pipeline failed: {result.msg}")
else:
    print(result.text)
```
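The `if not result:` check above relies on the failed result being falsy. The snippet below is a minimal plain-Python sketch of that result-object pattern (an assumption about the library's design, not its actual code): an `Error` carries a message and evaluates as `False`, so callers can branch on truthiness without `try`/`except`.

```python
# Minimal sketch (not ChainableSoup's implementation) of a falsy
# error-result object, as used in the `if not result:` idiom above.
class Error:
    def __init__(self, msg):
        self.msg = msg

    def __bool__(self):
        # A failed result is falsy, so `if not result:` catches it.
        return False

result = Error("tag <p> not found")
if not result:
    print(f"Pipeline failed: {result.msg}")  # Pipeline failed: tag <p> not found
```

Successful results (real `Tag` objects) are truthy, so the same branch cleanly separates the two outcomes.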

### 2. Finding All Tags and Filtering

Use `find_all_tags` to get a sequence of results. This returns a `PipelineSequence` object, which you can use to filter, map, or select items.

```python
# Continues from the previous example...

# Find all <p> tags inside the div
p_sequence = Pipeline().find_tag('div', id='content').find_all_tags('p')

# Select the second paragraph (index 1)
second_p_pipeline = p_sequence[1]
print(second_p_pipeline.raise_for_error.run(soup).text)
# Output: Second paragraph.

# Or use .first / .last properties
first_p_pipeline = p_sequence.first
print(first_p_pipeline.raise_for_error.run(soup).text)
# Output: First paragraph.

# Filter the sequence
contains_second = lambda tag: "Second" in tag.text
filtered_sequence = p_sequence.filter(contains_second)

# This will now find the first (and only) tag that matches the filter
result = filtered_sequence.first.raise_for_error.run(soup)
print(result.text)
# Output: Second paragraph.
```

## Advanced Usage: `find_nested_tag`

The `find_nested_tag` method is the most powerful feature of ChainableSoup. It allows you to define an entire path of `find` operations in a single, declarative call using `NestedArg`.

### `NestedArg`

A `NestedArg` is a fluent builder for creating a list of arguments, one for each level of the search. You can chain values using the `>>` operator or the `.add()` method.
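The `>>` chaining above can be implemented along these lines. This is a minimal pure-Python sketch of the pattern, not ChainableSoup's actual `NestedArg` class: overloading `__rshift__` to append a value and return the builder itself is what lets each `>>` extend the chain.

```python
# Sketch of a >>-chainable argument builder (hypothetical class name).
class ArgChain:
    def __init__(self):
        self.values = []

    def __rshift__(self, value):
        # Each >> records the argument for the next nesting level
        # and returns self so the chain can continue.
        self.values.append(value)
        return self

    def add(self, value):
        # Method-call equivalent of >>.
        return self >> value

chain = ArgChain() >> 'div' >> 'section' >> 'p'
print(chain.values)  # ['div', 'section', 'p']
```

Because `__rshift__` returns the builder, `A() >> x >> y` evaluates left to right and accumulates one value per level.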

### Example

Let's revisit the complex example from the introduction.

```python
from ChainableSoup import Pipeline, NestedArg, SpecalArg

# ... setup soup ...

pipeline = Pipeline().find_nested_tag(
    # For each level of the search, specify the tag 'name'
    name = NestedArg() >> 'body' >> 'div' >> 'div' >> 'div' >> 'div',

    # Specify attributes for each level. The lists are matched by index.
    attrs={
        'class': NestedArg() >> None >> 'document' >> 'documentwrapper' >> 'bodywrapper' >> 'body'
    },
    
    # Specify the `recursive` flag. Here, we use a Special Argument.
    # It will be True, then False, and EXPANDLAST will repeat `False` for the rest.
    recursive = NestedArg() >> True >> False >> SpecalArg.EXPANDLAST

).find_all_tags(
    name='section',
    recursive=False
).first.find_all_tags(
    name='p',
    recursive=False
)

# Create two branches of the pipeline to get the first and second <p> tags
first_p_pipeline = pipeline[0]
second_p_pipeline = pipeline[1]

# Execute both
print(first_p_pipeline.raise_for_error.run(soup).text)
print(second_p_pipeline.raise_for_error.run(soup).text)
```

### `SpecalArg` Enum

When argument lists have different lengths, `SpecalArg` controls how the shorter lists are padded to match the longest one.

-   `SpecalArg.EXPANDLAST`: Repeats the last provided value.
-   `SpecalArg.FILLNONE`: Fills with `None` (the default).
-   `SpecalArg.FILLTRUE`: Fills with `True`.
-   `SpecalArg.FILLFALSE`: Fills with `False`.
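The padding rules above can be illustrated with a short self-contained sketch (again, not the library's code; the function name and the assumption that the longest list sets the target length are mine):

```python
# Sketch of the list-padding behavior described by SpecalArg.
from enum import Enum, auto

class Pad(Enum):
    EXPANDLAST = auto()   # repeat the last provided value
    FILLNONE = auto()     # pad with None (the default)

def pad(values, target_len, mode=Pad.FILLNONE):
    fill = values[-1] if mode is Pad.EXPANDLAST else None
    return values + [fill] * (target_len - len(values))

print(pad([True, False], 5, Pad.EXPANDLAST))  # [True, False, False, False, False]
print(pad(['document'], 3))                   # ['document', None, None]
```

This matches the earlier `recursive = NestedArg() >> True >> False >> SpecalArg.EXPANDLAST` example: the two explicit values cover the first two levels, and `False` repeats for every deeper level.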

## API Overview

-   **`Pipeline`**: The main object for building a query that results in a **single `Tag`**.
    -   `.find_tag(...)`: Appends a `find` operation.
    -   `.find_nested_tag(...)`: Appends a series of `find` operations.
    -   `.find_all_tags(...)`: Transitions the query into a `PipelineSequence`.
    -   `.run(soup)`: Executes the pipeline and returns a `Tag` or `Error` object.
    -   `.run_and_raise_for_error(soup)`: Executes and raises an `Error` on failure.

-   **`PipelineSequence`**: An object for building a query that results in a **sequence of `Tag`s**.
    -   `.filter(fn)`: Filters the sequence.
    -   `.map(fn)`: Applies a function to each tag in the sequence.
    -   `.assert_all(fn)`: Asserts a condition for all tags.
    -   `.first`, `.last`, `[index]`: Selects a single element, returning control to a `Pipeline`.

-   **`NestedArg`**: A helper class to build argument lists for `find_nested_tag`.
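The key design idea behind `Pipeline` is deferred execution: method calls only record steps, and nothing touches the document until `run()` is called. The following is a minimal, self-contained sketch of that idea (hypothetical `Node`/`MiniPipeline` names; ChainableSoup's real classes add error objects, sequences, and keyword arguments on top):

```python
# Sketch of deferred execution: calls record steps; run() applies them.
class Node:
    """Tiny stand-in for a parsed tag tree."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def find(self, name):
        return next((c for c in self.children if c.name == name), None)

class MiniPipeline:
    def __init__(self, steps=()):
        self.steps = list(steps)

    def find_tag(self, name):
        # Immutable chaining: return a new pipeline with one more step,
        # so partial pipelines can be shared and branched safely.
        return MiniPipeline(self.steps + [name])

    def run(self, root):
        node = root
        for name in self.steps:
            if node is None:
                return None  # a real Pipeline would return an Error here
            node = node.find(name)
        return node

tree = Node('body', [Node('div', [Node('p')])])
pipe = MiniPipeline().find_tag('div').find_tag('p')
print(pipe.run(tree).name)      # p
print(pipe.run(Node('body')))   # None
```

Returning a fresh pipeline from each call is also what makes the branching in the example above (`pipeline[0]`, `pipeline[1]`) safe: neither branch mutates the other.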

## Contributing

Contributions are welcome! If you have a feature request, find a bug, or want to improve the documentation, please open an issue or submit a pull request on the [GitHub repository](https://github.com/thefcraft/ChainableSoup).

## License

This project is licensed under the MIT License.


            
