soupsavvy

- Name: soupsavvy
- Version: 0.1.4
- Summary: Python package for advanced web scraping
- Upload time: 2024-04-15 15:47:42
- Author: sewcio543
- Requires Python: >=3.9
- License: MIT License, Copyright (c) 2024 Wojciech Seweryn
- Keywords: web-scraping, html, soup, bs4, markup
- Requirements: beautifulsoup4==4.12.2, lxml==4.9.2

# soupsavvy

-----------------

## Python package that makes web-scraping more manageable

| | |
| --- | --- |
| Package | ![Deployment PyPI](https://github.com/sewcio543/soupsavvy/actions/workflows/production_release.yml/badge.svg) ![Deployment test PyPI](https://github.com/sewcio543/soupsavvy/actions/workflows/development_release.yml/badge.svg) [![GitHub](https://img.shields.io/badge/GitHub-sewcio543-181717.svg?style=flat&logo=github)](https://github.com/sewcio543) [![PyPI](https://img.shields.io/pypi/v/soupsavvy?color=orange)](https://pypi.org/project/soupsavvy/) [![Python Versions](https://img.shields.io/pypi/pyversions/soupsavvy)](https://www.python.org/)|
| Testing | ![Tests](https://github.com/sewcio543/soupsavvy/actions/workflows/tests.yml/badge.svg) [![Codecov](https://codecov.io/gh/sewcio543/soupsavvy/graph/badge.svg?token=RZ51VS3QLB)](https://codecov.io/gh/sewcio543/soupsavvy)|
| Code Quality | ![Build](https://github.com/sewcio543/soupsavvy/actions/workflows/build_package.yml/badge.svg) ![Linting](https://github.com/sewcio543/soupsavvy/actions/workflows/linting.yml/badge.svg) [![pre-commit.ci status](https://results.pre-commit.ci/badge/github/sewcio543/soupsavvy/main.svg)](https://results.pre-commit.ci/latest/github/sewcio543/soupsavvy/main)|
| Documentation | [![readthedocs](https://img.shields.io/readthedocs/pip?logo=readthedocs)](https://github.com/sewcio543/soupsavvy/actions/workflows/documentation.yml/badge.svg) [![Docs link](https://img.shields.io/badge/docs-check_out-blue)](https://sewcio543.github.io/soupsavvy/)|

## Table of Contents

- [About](#about)
- [Key features](#key-features)
- [Installation](#installation)
- [Usage](#usage)
- [Contributing](#contributing)
- [License](#license)
- [Acknowledgements](#acknowledgements)
- [In the future](#in-the-future)

## About

**soupsavvy** is a library designed to make web scraping tasks more efficient and manageable. Automating web scraping can be a thankless and time-consuming job. soupsavvy builds on top of the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) library, enabling developers to create more complex workflows and advanced searches with ease.

## Key Features

- **Automated Web Scraping**: soupsavvy simplifies the process of web scraping by providing intuitive interfaces and tools for automating tasks.

- **Complex Workflows**: With soupsavvy, developers can create complex scraping workflows effortlessly, allowing for more intricate data extraction.

- **Productionalize Scraping Code**: By providing structured workflows and error-handling mechanisms, soupsavvy makes scraping code easier to move into production and to integrate into larger projects and pipelines.

## Getting Started

### Installation

soupsavvy is published on PyPI and can be installed via pip:

```bash
pip install soupsavvy
```

### Usage

Given a simple HTML snippet parsed into a bs4 `Tag`, we want to extract specific tag(s):

```html
    <div class="menu" role="search">
        <a class="option" href="twitter.com/page">Twitter Page</a>
        <a class="option" href="github.com/fake">Fake Ghb Page</a>
        <a class="blank" href="github.com/blank">Blank Github Page</a>
        <a class="option" href="github.com/correct">Correct Github Page</a>
    </div>
    <div class="menu" role="placeholder">
        <a class="option" href="github.com/oos">Out of Scope Github Page</a>
    </div>
```
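
The snippets that follow assume this markup has already been parsed into a `markup` object. A minimal sketch of that step (the `html` variable name and the choice of the `lxml` parser are illustrative, not prescribed by soupsavvy):

```python
from bs4 import BeautifulSoup

# the HTML snippet above, stored as a plain string (illustrative name)
html = """
<div class="menu" role="search">
    <a class="option" href="twitter.com/page">Twitter Page</a>
    <a class="option" href="github.com/fake">Fake Ghb Page</a>
    <a class="blank" href="github.com/blank">Blank Github Page</a>
    <a class="option" href="github.com/correct">Correct Github Page</a>
</div>
<div class="menu" role="placeholder">
    <a class="option" href="github.com/oos">Out of Scope Github Page</a>
</div>
"""

# parse it into a BeautifulSoup object; this is the `markup` searched below
markup = BeautifulSoup(html, "lxml")
```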

Replace the convoluted BeautifulSoup approach:

```python
import re

from bs4 import Tag

# `markup` is the snippet above, parsed into a BeautifulSoup object
div = markup.find(
    "div",
    class_="menu",
    role="search",
)

if not isinstance(div, Tag):
    raise ValueError("No element found")

a = div.find(
    "a",
    class_="option",
    href=re.compile("github.com"),
    string=re.compile("Github"),
)

if not isinstance(a, Tag):
    raise ValueError("No element found")
```

with the savvier version:

```python
import re

from soupsavvy import AttributeTag, ElementTag, PatternElementTag, StepsElementTag

# define your complex tag once
tag = StepsElementTag(
    ElementTag(
        "div",
        attributes=[
            AttributeTag(name="class", value="menu"),
            AttributeTag(name="role", value="search"),
        ],
    ),
    PatternElementTag(
        tag=ElementTag(
            "a",
            attributes=[
                AttributeTag(name="class", value="option"),
                AttributeTag(name="href", value="github.com", re=True),
            ],
        ),
        pattern="Github",
        re=True,
    ),
)
# reuse it anywhere to search any markup; with strict=True, an exception is raised if nothing is found
a = tag.find(markup, strict=True)
```
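
Because the selector is a plain Python object, it can be reused across documents. Below is a minimal sketch of non-strict usage, assuming that without `strict=True` the `find` method returns `None` when nothing matches (an assumption here; see the docs linked below for the exact behaviour):

```python
# reuse the same selector object on another parsed document
# `other_markup` is another bs4-parsed page (illustrative name)
result = tag.find(other_markup)  # non-strict mode: assumed to return None on no match
if result is None:
    print("no matching tag found")
else:
    print(result)
```

The explicit `None` check is also what keeps static type checkers happy, one of the pain points mentioned below.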

This streamlined soupsavvy approach, which encapsulates complex tag(s) in single, reusable objects, transforms web scraping tasks from a potential 'soup sandwich' 🥪 into a 'duck soup' 🦆 scenario.

With soupsavvy's robust features, developers can sidestep common problems encountered in web scraping, such as ad-hoc exception handling or poor integration with type checkers.

Full documentation can be found at **[soupsavvy Docs](https://sewcio543.github.io/soupsavvy/)**

## Contributing

If you'd like to contribute to soupsavvy, feel free to check out the [GitHub repository](https://github.com/sewcio543/soupsavvy) and submit pull requests against one of the development branches. Any feedback, bug reports, or feature requests are welcome!

## License

[![MIT License](https://img.shields.io/badge/license-MIT-green?style=plastic)](https://choosealicense.com/licenses/mit/)  
soupsavvy is licensed under the [MIT License](https://opensource.org/licenses/MIT), allowing for both personal and commercial use. See the `LICENSE` file for more information.

## Acknowledgements

soupsavvy is built upon the foundation of the excellent BeautifulSoup. We extend our gratitude to the developers and contributors of this project for their invaluable contributions to the Python community and for making our lives a lot easier!

-----------------

**Let's soap this soup!**  
**Happy scraping!** ✨

## In the future

- Scraping workflows from soup to nuts
- New Tag components
- Enhanced CI pipeline
- Documentation