aio-scrapy


Name: aio-scrapy
Version: 2.1.0
Home page: https://github.com/conlin-huang/aio-scrapy.git
Summary: A high-level Web Crawling and Web Scraping framework based on Asyncio
Upload time: 2024-04-17 06:40:53
Author: conlin
Requires Python: >=3.9
License: MIT
Keywords: aio-scrapy, scrapy, aioscrapy, scrapy redis, asyncio, spider
<!--
![aio-scrapy](./doc/images/aio-scrapy.png)
-->
### aio-scrapy

An asyncio + aio-libs based crawler framework that imitates Scrapy

English | [中文](./doc/README_ZH.md)

### Overview
- The aio-scrapy framework is based on the open-source projects Scrapy and scrapy_redis.
- aio-scrapy is compatible with scrapyd.
- aio-scrapy provides both Redis and RabbitMQ task queues.
- aio-scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.
- Distributed crawling/scraping (see the configuration sketch below).
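
To run distributed, the scheduler queue has to point at a shared Redis (or RabbitMQ) instance via the spider settings. The sketch below is only an assumption of what such a configuration looks like: apart from `CLOSE_SPIDER_ON_IDLE` (which appears in the example further down), the setting names and class paths are not confirmed by this README, so check [doc](./doc/documentation.md) for the actual keys.

```python
# Hedged sketch of a distributed setup; the setting names and class paths
# below are assumptions and may not match the real aio-scrapy configuration.
custom_settings = {
    # Point the scheduler at a Redis-backed queue shared by all workers.
    'SCHEDULER_QUEUE_CLASS': 'aioscrapy.queue.redis.SpiderPriorityQueue',
    # Connection parameters for the shared Redis instance (hypothetical URL).
    'REDIS_ARGS': {
        'queue': {'url': 'redis://127.0.0.1:6379/0', 'max_connections': 2},
    },
    # Keep workers alive while the shared queue is momentarily empty.
    'CLOSE_SPIDER_ON_IDLE': False,
}
```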
### Requirements

- Python 3.9+
- Works on Linux, Windows, macOS, BSD

### Install

The quick way:

```shell
# Install the latest development version from GitHub
pip install git+https://github.com/conlin-huang/aio-scrapy

# Install the latest release from PyPI (default)
pip install aio-scrapy

# Install with all optional dependencies
pip install aio-scrapy[all]

# Install only the extras you need, e.g. mysql/httpx/rabbitmq/mongo
pip install aio-scrapy[aiomysql,httpx,aio-pika,mongo]
```
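
A quick way to confirm the install worked, using only the standard library (note that the import name is `aioscrapy`, while the distribution name on PyPI is `aio-scrapy`):

```python
# Sanity-check the installation.
from importlib.metadata import version

import aioscrapy  # import name; raises ImportError if the install failed

print(version("aio-scrapy"))  # distribution name, prints e.g. "2.1.0"
```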

### Usage

#### Create a project spider:

```shell
aioscrapy startproject project_quotes
```

```
cd project_quotes
aioscrapy genspider quotes 
```

quotes.py:

```python
from aioscrapy.spiders import Spider


class QuotesSpider(Spider):
    name = 'quotes'

    start_urls = ['https://quotes.toscrape.com']

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)


if __name__ == '__main__':
    QuotesSpider.start()

```

Run the spider:

```shell
aioscrapy crawl quotes
```
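
Each item yielded by `parse` is a plain dict with `author` and `text` keys; one scraped quote looks roughly like this (illustrative values, not real output):

```python
# Illustrative shape of a single scraped item (example values only):
{
    'author': 'Albert Einstein',
    'text': 'The world as we have created it is a process of our thinking.',
}
```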

#### Create a single-script spider:

```shell
aioscrapy genspider single_quotes -t single
```

single_quotes.py:

```python
from aioscrapy.spiders import Spider


class QuotesMemorySpider(Spider):
    name = 'QuotesMemorySpider'
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
        'CLOSE_SPIDER_ON_IDLE': True,
        # 'DOWNLOAD_DELAY': 3,
        # 'RANDOMIZE_DOWNLOAD_DELAY': True,
        # 'CONCURRENT_REQUESTS': 1,
        # 'LOG_LEVEL': 'INFO'
    }

    start_urls = ['https://quotes.toscrape.com']

    @staticmethod
    async def process_request(request, spider):
        """ request middleware """
        pass

    @staticmethod
    async def process_response(request, response, spider):
        """ response middleware """
        return response

    @staticmethod
    async def process_exception(request, exception, spider):
        """ exception middleware """
        pass

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

    async def process_item(self, item):
        print(item)


if __name__ == '__main__':
    QuotesMemorySpider.start()

```

Run the spider:

```shell
aioscrapy runspider single_quotes.py
```
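
The static `process_request` / `process_response` / `process_exception` methods act as per-spider middleware hooks, and `process_item` as a per-spider item pipeline. Below is a minimal sketch of how they might be filled in, assuming a Scrapy-style `request.headers` mapping and a `spider.logger` attribute (both assumptions, not confirmed by this README):

```python
from aioscrapy.spiders import Spider


class QuotesWithHooksSpider(Spider):
    # Hypothetical spider used only for this sketch.
    name = 'QuotesWithHooksSpider'
    start_urls = ['https://quotes.toscrape.com']
    custom_settings = {'CLOSE_SPIDER_ON_IDLE': True}

    @staticmethod
    async def process_request(request, spider):
        """ request middleware """
        # e.g. attach a header before the page is downloaded
        # (assumes a Scrapy-style request.headers mapping)
        request.headers.setdefault('Referer', 'https://quotes.toscrape.com')

    @staticmethod
    async def process_exception(request, exception, spider):
        """ exception middleware """
        # e.g. log the failure and let the framework handle retries
        # (assumes spider.logger exists, as it does in Scrapy)
        spider.logger.warning(f"{request.url} failed: {exception!r}")

    async def parse(self, response):
        # Minimal callback so the sketch is a complete, runnable spider.
        yield {'url': response.url}

    async def process_item(self, item):
        # Per-spider item pipeline: here we just print each scraped item.
        print(item)


if __name__ == '__main__':
    QuotesWithHooksSpider.start()
```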


### More commands:

```shell
aioscrapy -h
```

#### [More examples](./example)

### Documentation
[doc](./doc/documentation.md)

### Feedback

Please submit your suggestions to the owner by opening an issue.

## Thanks

[aiohttp](https://github.com/aio-libs/aiohttp/)

[scrapy](https://github.com/scrapy/scrapy)


            
