<!--
![aio-scrapy](./doc/images/aio-scrapy.png)
-->
### aio-scrapy
An asyncio + aio-libs crawler framework modeled on Scrapy.
English | [中文](./doc/README_ZH.md)
### Overview
- The aio-scrapy framework is based on the open-source projects Scrapy and scrapy_redis.
- aio-scrapy is compatible with scrapyd.
- aio-scrapy provides both a Redis queue and a RabbitMQ queue.
- aio-scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.
- Distributed crawling/scraping (a hedged configuration sketch follows this list).
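For the distributed case, a minimal sketch of pointing a spider at the Redis queue might look like the following. The setting keys (`SCHEDULER_QUEUE_CLASS`, `REDIS_ARGS`) and the queue class path are assumptions drawn from the project's docs, not confirmed by this README, so verify them against [doc](./doc/documentation.md):

```python
from aioscrapy.spiders import Spider


class DemoRedisSpider(Spider):
    name = 'DemoRedisSpider'
    custom_settings = {
        # Assumed setting names -- check ./doc/documentation.md for the
        # authoritative keys before relying on these.
        'SCHEDULER_QUEUE_CLASS': 'aioscrapy.queue.redis.SpiderPriorityQueue',
        'REDIS_ARGS': {
            'queue': {'url': 'redis://127.0.0.1:6379/0'},
        },
    }
    start_urls = ['https://quotes.toscrape.com']

    async def parse(self, response):
        yield {'title': response.css('title::text').get()}
```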
### Requirements
- Python 3.9+
- Works on Linux, Windows, macOS, BSD
### Install
The quick way:
```shell
# Install the latest development version from GitHub
pip install git+https://github.com/ConlinH/aio-scrapy

# Install the latest release from PyPI
pip install aio-scrapy

# Install with all optional dependencies
pip install aio-scrapy[all]

# Install only the extras you need (MySQL / httpx / RabbitMQ / MongoDB)
pip install aio-scrapy[aiomysql,httpx,aio-pika,mongo]
```
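To sanity-check the install (assuming the package exposes a `__version__` attribute, which is not confirmed by this README):

```shell
python -c "import aioscrapy; print(aioscrapy.__version__)"
```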
### Usage
#### Create a project spider:
```shell
aioscrapy startproject project_quotes
```
```shell
cd project_quotes
aioscrapy genspider quotes
```
quotes.py:
```python
from aioscrapy.spiders import Spider


class QuotesMemorySpider(Spider):
    name = 'QuotesMemorySpider'

    start_urls = ['https://quotes.toscrape.com']

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)


if __name__ == '__main__':
    QuotesMemorySpider.start()
```
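As in Scrapy, each dict yielded from `parse` is collected as an item, while `response.follow(next_page, self.parse)` schedules the next page, so the spider walks the entire paginated site.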
Run the spider by its `name` attribute:
```shell
aioscrapy crawl QuotesMemorySpider
```
#### Create a single-script spider:
```shell
aioscrapy genspider single_quotes -t single
```
single_quotes.py:
```python
from aioscrapy.spiders import Spider


class QuotesMemorySpider(Spider):
    name = 'QuotesMemorySpider'
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
        'CLOSE_SPIDER_ON_IDLE': True,
        # 'DOWNLOAD_DELAY': 3,
        # 'RANDOMIZE_DOWNLOAD_DELAY': True,
        # 'CONCURRENT_REQUESTS': 1,
        # 'LOG_LEVEL': 'INFO'
    }

    start_urls = ['https://quotes.toscrape.com']

    @staticmethod
    async def process_request(request, spider):
        """ request middleware """
        pass

    @staticmethod
    async def process_response(request, response, spider):
        """ response middleware """
        return response

    @staticmethod
    async def process_exception(request, exception, spider):
        """ exception middleware """
        pass

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

    async def process_item(self, item):
        print(item)


if __name__ == '__main__':
    QuotesMemorySpider.start()
```
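The three `@staticmethod` hooks above act as per-spider middleware: `process_request` runs before each download, `process_response` after it, and `process_exception` on errors. As a minimal sketch of filling one in (the `TaggedSpider` name and header are hypothetical, and the convention that returning `None` lets the request proceed is assumed to follow Scrapy's middleware semantics, which this framework imitates):

```python
from aioscrapy.spiders import Spider


class TaggedSpider(Spider):
    """Hypothetical spider with one filled-in middleware hook."""
    name = 'TaggedSpider'
    start_urls = ['https://quotes.toscrape.com']

    @staticmethod
    async def process_request(request, spider):
        """Request middleware: runs before each request is downloaded."""
        # Assumption: request.headers supports item assignment, as in
        # Scrapy; returning None lets the request continue unchanged.
        request.headers['X-Example'] = 'aio-scrapy-demo'

    async def parse(self, response):
        yield {'title': response.css('title::text').get()}
```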
Run the spider:
```shell
aioscrapy runspider single_quotes.py
```
### More commands
```shell
aioscrapy -h
```
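For reference, the subcommands used in this README:

```shell
aioscrapy startproject <project>      # create a new project
aioscrapy genspider <name>            # generate a spider inside a project
aioscrapy genspider <name> -t single  # generate a single-script spider
aioscrapy crawl <spider-name>         # run a spider by its name attribute
aioscrapy runspider <file.py>         # run a single-script spider file
```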
#### [More examples](./example)
### Documentation
[doc](./doc/documentation.md)
### Feedback
Please submit your suggestions to the owner by opening an issue.
## Thanks

[aiohttp](https://github.com/aio-libs/aiohttp/)

[scrapy](https://github.com/scrapy/scrapy)