| Field | Value |
|---|---|
| Name | shearer |
| Version | 0.1.1 |
| Summary | None |
| Home page | None |
| Author | Edward |
| Maintainer | None |
| License | None |
| Requires Python | <4.0,>=3.11 |
| Upload time | 2024-06-10 14:24:14 |
| docs_url | None |
| Keywords | None |
| Requirements | No requirements were recorded. |
<h1 align="center">Shearer</h1>
<p align="center"><i>`shearer` is a large-language-model-driven package that helps you scrape webpages automatically.</i></p>
<div align="center">
<a href="https://github.com/edwardmfho/shearer/stargazers"><img src="https://img.shields.io/github/stars/edwardmfho/shearer" alt="Stars Badge"/></a>
<a href="https://github.com/edwardmfho/shearer/network/members"><img src="https://img.shields.io/github/forks/edwardmfho/shearer" alt="Forks Badge"/></a>
<a href="https://github.com/edwardmfho/shearer/pulls"><img src="https://img.shields.io/github/issues-pr/edwardmfho/shearer" alt="Pull Requests Badge"/></a>
<a href="https://github.com/edwardmfho/shearer/issues"><img src="https://img.shields.io/github/issues/edwardmfho/shearer" alt="Issues Badge"/></a>
<a href="https://github.com/edwardmfho/shearer/graphs/contributors"><img alt="GitHub contributors" src="https://img.shields.io/github/contributors/edwardmfho/shearer?color=2b9348"></a>
<a href="https://github.com/edwardmfho/shearer/blob/master/LICENSE"><img src="https://img.shields.io/github/license/edwardmfho/shearer?color=2b9348" alt="License Badge"/></a>
</div>
<br>
This repo helps you extract the right tags from the source code of a webpage. Whether it is an XML or an HTML file, `shearer` pulls out the relevant data you request using a large language model.
If you like this repo, please give it a :star:
## Installation
You can install `shearer` using `pip` or `poetry`.
### Install via pip
```bash
pip install shearer
```
### Install via poetry
```bash
poetry add shearer
poetry install
```
## Usage
`shearer` currently supports only `XML` files; `HTML` support is planned.
### Getting Started
```bash
poetry install
poetry shell
```
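The examples below use `python-dotenv` together with the OpenAI client, which reads its credentials from the `OPENAI_API_KEY` environment variable. A minimal `.env` file (the key shown is a placeholder) might look like:

```shell
# .env, loaded by load_dotenv(); replace the value with your own API key
OPENAI_API_KEY=sk-your-key-here
```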
### XML
To scrape an XML file into a structured format:
```python
from shearer.scraper import XMLScraper
from shearer.models import ScrappingOptions
from dotenv import load_dotenv
from openai import OpenAI
load_dotenv()
client = OpenAI()
options = ScrappingOptions(model="gpt-4o", temperature=0.0, content_type="xml", required_data="author name, article title, article_url")
```
Once you have configured the options, we recommend fetching the `page_schema` before scraping. This will save you a lot of time in the long run.
```python
scraper = XMLScraper.from_input_options(client=client, options=options)
page_schema = scraper.get_site_schema(url="https://{substack_name}.substack.com/feed")
# Check the site schema
print(page_schema)
```
In the current update, we introduce the concept of a `key`. The `key` refers
to the tag that captures all the required fields you want. For example:
```xml
<rss>
<item>
<title>A great article</title>
<author>John Doe</author>
</item>
<item>
<title>An even better article</title>
<author>Jane Doe</author>
</item>
</rss>
```
In this scenario, the `item` tag will be the `key` for the `page_schema`.
It is important as it will help you organize your data in a structured
format.
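The role of the `key` can be illustrated with the standard library's `ElementTree`, independently of `shearer`: grouping by the `key` tag turns each `<item>` into one record. (This is a plain-Python sketch of the idea, not `shearer`'s internal implementation.)

```python
import xml.etree.ElementTree as ET

xml_source = """
<rss>
  <item>
    <title>A great article</title>
    <author>John Doe</author>
  </item>
  <item>
    <title>An even better article</title>
    <author>Jane Doe</author>
  </item>
</rss>
"""

# Each occurrence of the key tag ("item") becomes one structured record.
records = []
for item in ET.fromstring(xml_source).iter("item"):
    records.append({child.tag: child.text for child in item})

print(records)
```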
The page schema on Substack is slightly different from what we expected:
one required field sits outside of your `key` field.
One way to handle this is to manually modify the `page_schema` and remove the
field that falls outside your `key` field. The modified `page_schema` then
looks like this:
```python
updated_page_schema = {
    'rss': {
        'channel': {
            'item': {
                'title': 'target_field',
                'link': 'target_field',
                'dc:creator': 'target_field'
            }
        }
    }
}
```
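If you prefer not to edit the dict by hand, the pruning can also be done programmatically. The helper below is hypothetical, not part of `shearer`; it assumes the `page_schema` is a plain nested `dict` and keeps only the branch that leads down to the `key`:

```python
def prune_to_key(schema: dict, key: str) -> dict:
    """Keep only the nested branch that leads to `key`, dropping sibling fields."""
    if key in schema:
        return {key: schema[key]}
    pruned = {}
    for name, value in schema.items():
        if isinstance(value, dict):
            branch = prune_to_key(value, key)
            if branch:
                pruned[name] = branch
    return pruned

# Example: drop the 'title' field that sits outside the 'item' key.
page_schema = {
    'rss': {
        'channel': {
            'title': 'target_field',  # outside the key; removed by pruning
            'item': {
                'title': 'target_field',
                'link': 'target_field',
                'dc:creator': 'target_field'
            }
        }
    }
}

updated_page_schema = prune_to_key(page_schema, 'item')
```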
Then pass the new `page_schema` into the `scrape` method.
```python
output = scraper.scrape(
url="https://{substack_name}.substack.com/feed",
page_schema=updated_page_schema,
key="item"
)
```
And done! You have extracted your first structured data.
## Contributing
If you want to contribute to the project, do the following:
1. Create your feature branch (`git checkout -b feature/AmazingFeature`)
2. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
3. Push to the branch (`git push origin feature/AmazingFeature`)
4. Open a Pull Request
## License
This project is licensed under the [MIT](https://opensource.org/licenses/MIT) license.