aio-scrapy


Name: aio-scrapy
Version: 2.1.0
Home page: https://github.com/conlin-huang/aio-scrapy.git
Summary: A high-level Web Crawling and Web Scraping framework based on Asyncio
Upload time: 2024-04-17 06:40:53
Author: conlin
Requires Python: >=3.9
License: MIT
Keywords: aio-scrapy, scrapy, aioscrapy, scrapy redis, asyncio, spider
<!--
![aio-scrapy](./doc/images/aio-scrapy.png)
-->
### aio-scrapy

An asyncio + aio-libs based crawler framework that imitates Scrapy

English | [中文](./doc/README_ZH.md)

### Overview
- The aio-scrapy framework is based on the open-source projects Scrapy and scrapy_redis.
- aio-scrapy is compatible with scrapyd.
- aio-scrapy provides both Redis and RabbitMQ task queues.
- aio-scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.
- Distributed crawling/scraping (see the configuration sketch below).
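
To run distributed, the scheduler queue has to point at a shared Redis (or RabbitMQ) instance via the spider settings. The sketch below is only an assumption of what such a configuration looks like: apart from `CLOSE_SPIDER_ON_IDLE` (which appears in the example further down), the setting names and class paths are not confirmed by this README, so check [doc](./doc/documentation.md) for the actual keys.

```python
# Hedged sketch of a distributed setup; the setting names and class paths
# below are assumptions and may not match the real aio-scrapy configuration.
custom_settings = {
    # Point the scheduler at a Redis-backed queue shared by all workers.
    'SCHEDULER_QUEUE_CLASS': 'aioscrapy.queue.redis.SpiderPriorityQueue',
    # Connection parameters for the shared Redis instance (hypothetical URL).
    'REDIS_ARGS': {
        'queue': {'url': 'redis://127.0.0.1:6379/0', 'max_connections': 2},
    },
    # Keep workers alive while the shared queue is momentarily empty.
    'CLOSE_SPIDER_ON_IDLE': False,
}
```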
### Requirements

- Python 3.9+
- Works on Linux, Windows, macOS, BSD

### Install

The quick way:

```shell
# Install the latest development version from GitHub
pip install git+https://github.com/conlin-huang/aio-scrapy

# Install the latest release from PyPI (default)
pip install aio-scrapy

# Install with all optional dependencies
pip install aio-scrapy[all]

# Install only the extras you need, e.g. mysql/httpx/rabbitmq/mongo
pip install aio-scrapy[aiomysql,httpx,aio-pika,mongo]
```
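
A quick way to confirm the install worked, using only the standard library (note that the import name is `aioscrapy`, while the distribution name on PyPI is `aio-scrapy`):

```python
# Sanity-check the installation.
from importlib.metadata import version

import aioscrapy  # import name; raises ImportError if the install failed

print(version("aio-scrapy"))  # distribution name, prints e.g. "2.1.0"
```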

### Usage

#### Create a project spider:

```shell
aioscrapy startproject project_quotes
```

```
cd project_quotes
aioscrapy genspider quotes 
```

quotes.py:

```python
from aioscrapy.spiders import Spider


class QuotesSpider(Spider):
    name = 'quotes'

    start_urls = ['https://quotes.toscrape.com']

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)


if __name__ == '__main__':
    QuotesSpider.start()

```

Run the spider:

```shell
aioscrapy crawl quotes
```
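
Each item yielded by `parse` is a plain dict with `author` and `text` keys; one scraped quote looks roughly like this (illustrative values, not real output):

```python
# Illustrative shape of a single scraped item (example values only):
{
    'author': 'Albert Einstein',
    'text': 'The world as we have created it is a process of our thinking.',
}
```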

#### Create a single-script spider:

```shell
aioscrapy genspider single_quotes -t single
```

single_quotes.py:

```python
from aioscrapy.spiders import Spider


class QuotesMemorySpider(Spider):
    name = 'QuotesMemorySpider'
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
        'CLOSE_SPIDER_ON_IDLE': True,
        # 'DOWNLOAD_DELAY': 3,
        # 'RANDOMIZE_DOWNLOAD_DELAY': True,
        # 'CONCURRENT_REQUESTS': 1,
        # 'LOG_LEVEL': 'INFO'
    }

    start_urls = ['https://quotes.toscrape.com']

    @staticmethod
    async def process_request(request, spider):
        """ request middleware """
        pass

    @staticmethod
    async def process_response(request, response, spider):
        """ response middleware """
        return response

    @staticmethod
    async def process_exception(request, exception, spider):
        """ exception middleware """
        pass

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

    async def process_item(self, item):
        print(item)


if __name__ == '__main__':
    QuotesMemorySpider.start()

```

Run the spider:

```shell
aioscrapy runspider single_quotes.py
```
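
The static `process_request` / `process_response` / `process_exception` methods act as per-spider middleware hooks, and `process_item` as a per-spider item pipeline. Below is a minimal sketch of how they might be filled in, assuming a Scrapy-style `request.headers` mapping and a `spider.logger` attribute (both assumptions, not confirmed by this README):

```python
from aioscrapy.spiders import Spider


class QuotesWithHooksSpider(Spider):
    # Hypothetical spider used only for this sketch.
    name = 'QuotesWithHooksSpider'
    start_urls = ['https://quotes.toscrape.com']
    custom_settings = {'CLOSE_SPIDER_ON_IDLE': True}

    @staticmethod
    async def process_request(request, spider):
        """ request middleware """
        # e.g. attach a header before the page is downloaded
        # (assumes a Scrapy-style request.headers mapping)
        request.headers.setdefault('Referer', 'https://quotes.toscrape.com')

    @staticmethod
    async def process_exception(request, exception, spider):
        """ exception middleware """
        # e.g. log the failure and let the framework handle retries
        # (assumes spider.logger exists, as it does in Scrapy)
        spider.logger.warning(f"{request.url} failed: {exception!r}")

    async def parse(self, response):
        # Minimal callback so the sketch is a complete, runnable spider.
        yield {'url': response.url}

    async def process_item(self, item):
        # Per-spider item pipeline: here we just print each scraped item.
        print(item)


if __name__ == '__main__':
    QuotesWithHooksSpider.start()
```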


### More commands:

```shell
aioscrapy -h
```

#### [More examples](./example)

### Documentation
[doc](./doc/documentation.md)

### Feedback

Please submit your suggestions to the owner by opening an issue.

## Thanks

[aiohttp](https://github.com/aio-libs/aiohttp/)

[scrapy](https://github.com/scrapy/scrapy)


            
