shearer

- Name: shearer
- Version: 0.1.1
- Author: Edward
- Requires Python: <4.0,>=3.11
- Uploaded: 2024-06-10 14:24:14
<h1 align="center">Shearer</h1>
<p align="center"><i>`shearer` is a large-language-model-driven package that helps you scrape webpages automatically.</i></p>

<div align="center">
  <a href="https://github.com/edwardmfho/shearer/stargazers"><img src="https://img.shields.io/github/stars/elangosundar/shearer" alt="Stars Badge"/></a>
<a href="https://github.com/edwardmfho/shearer/network/members"><img src="https://img.shields.io/github/forks/edwardmfho/shearer" alt="Forks Badge"/></a>
<a href="https://github.com/edwardmfho/shearer/pulls"><img src="https://img.shields.io/github/issues-pr/edwardmfho/shearer" alt="Pull Requests Badge"/></a>
<a href="https://github.com/edwardmfho/shearer/shearer/issues"><img src="https://img.shields.io/github/issues/edwardmfho/shearer" alt="Issues Badge"/></a>
<a href="https://github.com/edwardmfho/shearer/graphs/contributors"><img alt="GitHub contributors" src="https://img.shields.io/github/contributors/edwardmfho/shearer?color=2b9348"></a>
<a href="https://github.com/edwardmfho/shearer/blob/master/LICENSE"><img src="https://img.shields.io/github/license/edwardmfho/shearer?color=2b9348" alt="License Badge"/></a>
</div>

<br>

This repo helps you extract the tags you need from the source code of a webpage. Whether it is an XML or HTML file, it uses a large language model to pull out the relevant data you request.

If you like this repo, please give it a :star:

## Installation

You can install `shearer` with `pip` or `poetry`.

### Install via pip
```bash
pip install shearer
```
### Install via poetry
```bash
poetry add shearer
poetry install
```
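To quickly verify the installation (assuming the import name matches the package name, as the usage examples below suggest):
```bash
python -c "import shearer; print(shearer.__name__)"
```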
## Usage
`shearer` currently supports only `XML` files, but aims to support `HTML` in the future.

### Getting Started
If you are using Poetry, install dependencies and activate the environment first:
```bash
poetry install
poetry shell
```
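The example below loads credentials with `load_dotenv()` and instantiates the `OpenAI` client, which reads `OPENAI_API_KEY` from the environment. Put the key in a `.env` file at the project root (the value shown is a placeholder):
```bash
# Contents of .env — read by load_dotenv(); the openai client picks up OPENAI_API_KEY
OPENAI_API_KEY=sk-...
```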
### XML
To scrape an XML file into a structured format:

```python
from shearer.scraper import XMLScraper
from shearer.models import ScrappingOptions

from dotenv import load_dotenv
from openai import OpenAI

# Load OPENAI_API_KEY (and any other secrets) from a .env file
load_dotenv()
client = OpenAI()

options = ScrappingOptions(
    model="gpt-4o",
    temperature=0.0,
    content_type="xml",
    required_data="author name, article title, article_url",
)
```

Once you have configured the options, we recommend fetching the `page_schema` before scraping. This will save you a lot of time in the long run.

```python
scraper = XMLScraper.from_input_options(client=client, options=options)
page_schema = scraper.get_site_schema(url="https://{substack_name}.substack.com/feed")

# Check the site schema
print(page_schema)
```
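For a Substack feed, the returned schema might look something like the nested mapping below. This is illustrative only; the exact output depends on the feed and the model. Note the field that sits outside the `item` tag, which the steps further down deal with:

```python
# Illustrative page_schema only, not actual library output:
{
    "rss": {
        "channel": {
            "title": "target_field",  # sits outside the `item` tag (see below)
            "item": {
                "title": "target_field",
                "link": "target_field",
                "dc:creator": "target_field",
            },
        }
    }
}
```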

The current update introduces the concept of a `key`. The `key` refers to the tag that captures all of the required fields you want. For example:

```xml
<rss>
  <item>
    <title>A great article</title>
    <author>John Doe</author>
  </item>
  <item>
    <title>An even better article</title>
    <author>Jane Doe</author>
  </item>
</rss>
```

In this scenario, the `item` tag is the `key` for the `page_schema`. This matters because it determines how your data is grouped into a structured format.

The page schema for Substack is slightly different from what we might expect: a required field sits outside of your `key` tag.

One way to handle this is to manually modify the `page_schema`, removing the fields that fall outside of your `key` tag. The modified `page_schema` then looks like this:

```python
updated_page_schema = {
    'rss': {
        'channel': {
            'item': {
                'title': 'target_field',
                'link': 'target_field',
                'dc:creator': 'target_field',
            }
        }
    }
}
```
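If you would rather derive the trimmed schema programmatically than edit it by hand, a minimal sketch is shown below. It assumes `page_schema` is a plain nested dict like the examples above; `prune_to_key` is a hypothetical helper, not part of `shearer`:

```python
# Hypothetical helper (not part of shearer): keep only the subtree
# that leads to the `key` tag, dropping sibling fields along the way.
def prune_to_key(schema: dict, key: str) -> dict:
    pruned = {}
    for tag, value in schema.items():
        if tag == key:
            pruned[tag] = value          # keep the key subtree intact
        elif isinstance(value, dict):
            child = prune_to_key(value, key)
            if child:                    # keep ancestors that reach the key
                pruned[tag] = child
    return pruned

updated_page_schema = prune_to_key(page_schema, key="item")
```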

Then pass the new `page_schema` into the `scrape` method.

```python
output = scraper.scrape(
    url="https://{substack_name}.substack.com/feed",
    page_schema=updated_page_schema,
    key="item",
)
```
And done! You have extracted your first piece of structured data.
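The README does not pin down the exact shape of `output`. Assuming it comes back as one record per `item` tag, as the `key` concept suggests, you might inspect it along these lines:

```python
# Assumed shape: an iterable of per-item records; adjust to the actual return type.
for record in output:
    print(record)
```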


## Contributing
If you want to contribute to the project, do the following:

1. Create your feature branch (`git checkout -b feature/AmazingFeature`)
2. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
3. Push to the branch (`git push origin feature/AmazingFeature`)
4. Open a Pull Request

## License
This project is licensed under the [MIT](https://opensource.org/licenses/MIT) license.