# 1. boost_spider, a Distributed High-Speed Python Crawler Framework
## Installation:
```shell
pip install boost_spider
```
## For more detailed usage of boost_spider, see the funboost documentation
boost_spider is based on funboost, adding a crawler-friendly request class and convenient database sinks.
[Full documentation of the distributed function-scheduling framework: https://funboost.readthedocs.io/zh-cn/latest/index.html](https://funboost.readthedocs.io/zh-cn/latest/index.html)
## Introduction:
boost_spider is powered by funboost, with an added request class that is convenient for crawling (using it is optional; you can send HTTP requests with any package you like).
Fundamentally, funboost is a function-scheduling framework, whereas scrapy and the domestic crawler frameworks that imitate the scrapy API are URL-request-scheduling frameworks.
Because the user can write any logic inside the function, boost_spider's range of application and degree of user freedom crush the scrapy-style frameworks that hard-code how a single HTTP request is sent on the user's behalf.
A function-scheduling framework crushes a URL-request-scheduling framework; it is an attack from a higher dimension.
Features of boost_spider:
```
boost_spider supports both synchronous crawlers and asyncio-based asynchronous crawlers.
boost_spider is a free-form crawler framework with no constraints: your code looks
exactly like a plain, hand-written crawl function, written in a straight-ahead style.
No callback parse methods, no BaseSpider class to inherit from (there is no BaseSpider
class at all); the code reads as what-you-see-is-what-runs.
There is never a class MySpider(BaseSpider) declaration.
There is never a yield Request(url=url, callback=self.my_parse) call.
There is never a yield item statement.
Everything you write lives inside one function, so you never need to jump back and
forth between several files to check the code.
Remove the @boost decorator and the function still works as an ordinary crawler;
adding or removing it is trivial. That is freedom.
People who prefer hand-written, framework-free crawlers driven by a thread pool can
switch to boost_spider with almost no changes.
With scrapy-API-style crawler frameworks, adding or removing the framework requires a
sweeping reorganization of the code. That is a framework of constraints.
Crawler code written with boost_spider can have the @boost decorator removed and still
run normally: what you see is what you get.
Just adding the @boost decorator automatically parallelizes the function and attaches
20+ control features to the function and its messages, far more control knobs than
traditional crawler frameworks offer.
boost_spider supports threading, gevent, eventlet and asyncio, and can stack
multiprocess consumption on top, so it runs far faster than domestic crawler frameworks.
Most domestic frameworks only support synchronous multithreaded crawlers and cannot
support asyncio code, while boost_spider is compatible with both requests-style and
aiohttp-style user code.
```
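The "plain function, trivially swappable" point above can be illustrated with the standard library alone. The `crawl` function and URLs below are hypothetical stand-ins (nothing here requires boost_spider); moving such hand-rolled thread-pool code to boost_spider would only mean adding an `@boost('queue_name', ...)` decorator on top of `crawl` while its body stays identical:

```python
from concurrent.futures import ThreadPoolExecutor

# A plain, framework-free crawl function: ordinary Python, no base class,
# no callbacks. With boost_spider you would only add @boost(...) above it.
def crawl(url: str) -> dict:
    # a real crawler would fetch and parse here; this stub just echoes the url
    return {'url': url, 'ok': True}

def run_with_threadpool(urls):
    # the hand-rolled thread-pool style many people already use
    with ThreadPoolExecutor(max_workers=10) as pool:
        return list(pool.map(crawl, urls))

results = run_with_threadpool([f'https://example.com/{i}' for i in range(3)])
```

Because the function body has no framework binding, the same `crawl` can be unit-tested by calling it directly.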
Features of scrapy and the various domestic frameworks that copy the scrapy API:
```
With funboost's function scheduling, the user is completely free.

A scrapy-style framework is only a URL scheduler: it hard-codes how a URL is
requested on the user's behalf. To support complex request logic such as rotating
proxy IPs, the framework then has to expose so-called middlewares for customizing
requests, and the user must learn how to override request sending inside each
framework, which makes the framework hard again.

The genuinely hard parts of a crawler framework are automatic concurrency,
automatic retry and resumable crawling, not sending a request. Sending one HTTP
request with requests takes a single line, and wrapping requests in your own helper
function is also trivial. A framework that insists on sending requests for you
cannot satisfy unusual request patterns anyway, so a crawler framework does not
need a built-in request sender.
```
```
With scrapy you must:
write a BaseSpider subclass in the spiders folder,
define items in the items folder,
write how to persist crawled data in the pipelines folder,
configure settings.py (ITEM_PIPELINES for which pipelines to run, DOWNLOADER_MIDDLEWARES for middleware priorities, plus assorted settings),
write proxy and request-header rotation in middlewares.py,
and learn the command-line incantation to launch the crawler.

You constantly switch between code files to write and check code; the amount of
ceremony is staggering.

Domestic crawler frameworks show no originality: they all copy the scrapy API, so
they inherit essentially all of scrapy's tedious-to-write drawbacks. A framework as
tedious to write as scrapy is not worth reinventing.
```
The qps control in boost_spider crushes the fixed thread counts of every other crawler framework:
```
Domestic scrapy-style frameworks can only fix the amount of concurrency, usually as
a fixed number of threads.

Suppose the requirement is: crawl exactly 10 API calls or pages per second and save
them to the database. How do you do that?
Most people assume "open 10 threads", which is wrong: nobody said the endpoint
happens to respond in exactly 1 second.

If the endpoint takes 0.1 s, 10 threads crawl 100 pages per second.
If a page takes 20 s (common once proxy IPs are added), 10 threads crawl only
0.5 pages per second.
Using a thread count to set the crawl rate is absurd: only when the request latency
is always exactly 1 s does the thread count equal the pages-per-second; otherwise
there is no fixed relationship between the two.

boost_spider can set a concurrency limit, and it can also set a qps.
The qps parameter ignores the site's latency entirely: without estimating the
average response time up front, it achieves rate control. Whether responses take
0.01, 0.07, 0.3, 0.7, 3, 7, 13, 19 or 37 seconds, fluctuating arbitrarily, the
crawl rate stays constant.

Holding a constant qps is something domestic frameworks cannot do: they require you
to estimate the endpoint latency in advance and compute the thread count that
yields the target qps, and when the latency changes you must edit the thread count
in your code again.
```
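The thread-count arithmetic above boils down to effective rate = threads / latency, which is only the target rate when latency happens to be exactly 1 s. A minimal pure-Python sketch, using the illustrative numbers from the text:

```python
# Effective requests/sec for a fixed thread pool, assuming each thread issues
# its next request as soon as the previous one returns.
def effective_rps(threads: int, latency_seconds: float) -> float:
    return threads / latency_seconds

assert effective_rps(10, 1.0) == 10.0   # the lucky 1 s case: 10 threads = 10 rps
assert effective_rps(10, 0.1) == 100.0  # fast site: 10x over the target
assert effective_rps(10, 20.0) == 0.5   # slow proxied site: 20x under the target

# A qps limiter instead fixes the interval between task starts, independent of
# latency: one task start every 1/qps seconds, however long each request takes.
def start_interval(qps: float) -> float:
    return 1.0 / qps
```

This is why a qps parameter stays accurate while a thread count drifts with every change in response time.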
# 2. Code Example:
```python
from boost_spider import boost, BrokerEnum, RequestClient, MongoSink, json, re, MysqlSink
from db_conn_kwargs import MONGO_CONNECT_URL, MYSQL_CONN_KWARGS  # keeps credentials out of this file

"""
The classic list-page / detail-page two-level crawl topology. Once you master two
levels, three-level and deeper crawls are easy to imitate.
The list page handles pagination and extracts detail-page urls, pushing detail-page
tasks onto the detail-page message queue.
"""


@boost('car_home_list', broker_kind=BrokerEnum.REDIS_ACK_ABLE, max_retry_times=5, qps=2,
       do_task_filtering=False)  # boost offers many more control options.
def crawl_list_page(news_type, page, do_page_turning=False):
    """ Inside the function you can write whatever you want; the body has no binding
    to the framework. For example you may fetch with urllib3 and parse with regular
    expressions; nothing forces you to use requests and the parsel package.
    """
    url = f'https://www.autohome.com.cn/{news_type}/{page}/#liststart'
    sel = RequestClient(proxy_name_list=['noproxy'], request_retry_times=3,
                        using_platfrom='autohome news list page').get(url).selector
    for li in sel.css('ul.article > li'):
        if len(li.extract()) > 100:  # skip ad placeholders like <li id="ad_tw_04" style="display: none;">
            url_detail = 'https:' + li.xpath('./a/@href').extract_first()
            title = li.xpath('./a/h3/text()').extract_first()
            crawl_detail_page.push(url_detail, title=title, news_type=news_type)  # publish a detail-page task
    if do_page_turning:
        last_page = int(sel.css('#channelPage > a:nth-child(12)::text').extract_first())
        for p in range(2, last_page + 1):
            crawl_list_page.push(news_type, p)  # paginate the list page


@boost('car_home_detail', broker_kind=BrokerEnum.REDIS_ACK_ABLE, qps=5,
       do_task_filtering=True, is_using_distributed_frequency_control=True)
def crawl_detail_page(url: str, title: str, news_type: str):
    sel = RequestClient(using_platfrom='autohome news detail page').get(url).selector
    author = sel.css('#articlewrap > div.article-info > div > a::text').extract_first() or sel.css(
        '#articlewrap > div.article-info > div::text').extract_first() or ''
    author = author.replace("\n", "").strip()
    news_id = re.search(r'/(\d+)\.html', url).group(1)
    item = {'news_type': news_type, 'title': title, 'author': author, 'news_id': news_id, 'url': url}
    # A MongoSink class is provided as well; both sinks manage their own connection pools.
    # MongoSink(db='test', col='car_home_news', uniqu_key='news_id', mongo_connect_url=MONGO_CONNECT_URL).save(item)
    MysqlSink(db='test', table='car_home_news', **MYSQL_CONN_KWARGS).save(item)  # create the mysql table yourself first


if __name__ == '__main__':
    # crawl_list_page('news', 1)  # direct function call, to test without the framework

    crawl_list_page.clear()  # empty the seed queues
    crawl_detail_page.clear()

    crawl_list_page.push('news', 1, do_page_turning=True)  # seed the news channel front page into the list queue
    crawl_list_page.push('advice', page=1, do_page_turning=True)  # buying guides
    crawl_list_page.push(news_type='drive', page=1, do_page_turning=True)  # driving reviews

    crawl_list_page.consume()  # start consuming list-page tasks
    crawl_detail_page.consume()  # start consuming detail-page tasks

    # even faster: stack multiprocess consumption on top
    # crawl_detail_page.multi_process_consume(4)
```
## Code notes:
```
1.
The method parameters and return value of RequestClient are identical to the
requests package, making switching easy. The response adds crawler-friendly
parsing attributes and methods on top of requests.Response.
RequestClient supports subclassing: to add your own proxy-handling request
method, just declare it in PROXYNAME__REQUEST_METHED_MAP.

2.
The crawl function takes arbitrary parameters; adding the @boost decorator makes
it concurrent automatically.

3.
Crawl seeds can be persisted to any of 30 supported message queues.

4.
qps specifies how many pages per second to crawl; qps control is far, far
stronger than specifying a fixed concurrency.
```
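The RequestClient subclassing hook mentioned in note 1 might be wired roughly as follows. This is a hedged sketch only: `RequestClient` below is a minimal stand-in class (not boost_spider's real one) so the snippet is self-contained, and the method name and dict shape are assumptions taken from the text, not the library's actual internals:

```python
# Hypothetical sketch of the PROXYNAME__REQUEST_METHED_MAP extension point.
class RequestClient:  # stand-in, NOT the real boost_spider class
    PROXYNAME__REQUEST_METHED_MAP = {}  # proxy name -> request method

    def __init__(self, proxy_name_list=('noproxy',)):
        self.proxy_name_list = list(proxy_name_list)

    def get(self, url):
        # dispatch to the request method registered for the chosen proxy name
        method = self.PROXYNAME__REQUEST_METHED_MAP[self.proxy_name_list[0]]
        return method(self, url)


class MyRequestClient(RequestClient):
    def _request_with_my_proxy(self, url):
        # real code would attach proxies={'https': ...} to a requests call
        return f'fetched {url} via my proxy'

    # declare the new method under a proxy name, as the note above describes
    PROXYNAME__REQUEST_METHED_MAP = {'myproxy': _request_with_my_proxy}


resp = MyRequestClient(proxy_name_list=['myproxy']).get('https://example.com')
```

Consult the funboost/boost_spider documentation for the real registration contract before relying on this pattern.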
## boost_spider supports the asyncio ecosystem
Most domestic crawler frameworks only support the synchronous programming ecosystem and cannot accommodate users' existing asyncio code.
boost_spider supports both synchronous and asyncio programming (and also supports gevent and eventlet).
```python
import threading

import httpx
from funboost import boost, BrokerEnum, ConcurrentModeEnum

thread_local = threading.local()


def get_client() -> httpx.AsyncClient:
    # one shared AsyncClient per thread, so connections are pooled and reused
    if not getattr(thread_local, 'httpx_async_client', None):
        thread_local.httpx_async_client = httpx.AsyncClient()
    return thread_local.httpx_async_client


@boost('test_httpx_q2', broker_kind=BrokerEnum.REDIS, concurrent_mode=ConcurrentModeEnum.ASYNC, concurrent_num=500)
async def f(url):
    # client = httpx.AsyncClient()  # slow: do not create a new AsyncClient per request
    r = await get_client().get(url)  # good: reuse the pooled client
    print(r.status_code, len(r.text))


if __name__ == '__main__':
    f.consume()
    for i in range(10000):
        f.push('https://www.baidu.com/')
```