# cobweb
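cobweb is a Python-based distributed crawler scheduling framework. It currently supports both distributed and single-machine crawling, as well as custom databases, custom data storage, and custom data processing.
cobweb consists of three modules and one configuration file: the Launcher, the Crawler, the Pipeline, and the setting configuration file.
1. Launcher: starts crawl tasks and controls their execution flow, along with data storage and data processing.
The framework provides two launcher modes, LauncherAir and LauncherPro, which correspond to single-machine crawling and distributed scheduling respectively.
2. Crawler: controls the crawl flow, data download, and data processing.
The framework provides a basic crawler for this purpose. Users can also supply their own request, download, and parse methods when creating a task; see the usage section below.
3. Pipeline: stores the collected data and supports custom data storage and data processing. The framework provides two storage backends, Console and Loghub; users can also subclass the Pipeline abstract class to define their own.
4. setting configuration file: configures the crawler, the storage backend, queue length, number of crawl threads, and other parameters. The framework ships with a default configuration, which users can override.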
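
As a rough illustration of the Pipeline idea described above, the sketch below defines a stand-in abstract base class and two concrete stores. Every name in it (`PipelineSketch`, `save`, `ConsoleStore`, `MemoryStore`) is hypothetical and invented for this example; it is not cobweb's actual `Pipeline` interface, whose method names this document does not specify.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class PipelineSketch(ABC):
    """Hypothetical stand-in for a storage pipeline (not cobweb's real class)."""

    @abstractmethod
    def save(self, table: str, rows: List[Dict]) -> int:
        """Persist rows for a table and return how many were stored."""


class ConsoleStore(PipelineSketch):
    """Prints every row, similar in spirit to a console backend."""

    def save(self, table: str, rows: List[Dict]) -> int:
        for row in rows:
            print(f"[{table}] {row}")
        return len(rows)


class MemoryStore(PipelineSketch):
    """Keeps rows in a dict keyed by table name, handy for tests."""

    def __init__(self) -> None:
        self.tables: Dict[str, List[Dict]] = {}

    def save(self, table: str, rows: List[Dict]) -> int:
        self.tables.setdefault(table, []).extend(rows)
        return len(rows)


store = MemoryStore()
stored = store.save("test_data", [{"field1": 1}, {"field1": 2}])
```

The point of the abstract base class is that the rest of the crawl flow only needs to call one method, so swapping the storage backend does not touch the request, download, or parse code.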
## Installation
```
pip3 install --upgrade cobweb-launcher
```
## Usage
### 1. Creating a task
- Creating a LauncherAir task
```python
from cobweb import LauncherAir

# Create the launcher
app = LauncherAir(task="test", project="test")

# Set the crawl seeds
app.SEEDS = [{
    "url": "https://www.baidu.com"
}]
...
# Start the task
app.start()
```
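- Creating a LauncherPro task
LauncherPro relies on redis for distributed scheduling. Before using it, configure redis either through environment variables or through the redis settings in a custom setting file; see section 2 below for how to do this.
```python
from cobweb import LauncherPro

# Create the launcher
app = LauncherPro(
    task="test",
    project="test"
)
...
# Start the task
app.start()
```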
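### 2. Customizing configuration parameters
- Use a custom setting file, passed as an import string
> Default configuration file: import cobweb.setting
> Not recommended: this option currently has a bug, so use it at your own risk.
For example, with a custom setting.py file created in the same directory:
```python
from cobweb import LauncherAir

app = LauncherAir(
    task="test",
    project="test",
    setting="import setting"
)

...

app.start()
```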
- Override individual setting values directly
```python
from cobweb import LauncherPro

# Create the launcher
app = LauncherPro(
    task="test",
    project="test",
    REDIS_CONFIG={
        "host": ...,
        "password": ...,
        "port": ...,
        "db": ...
    }
)
...
# Start the task
app.start()
```
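
When the Redis connection details live in the environment, one option is to build the `REDIS_CONFIG` dictionary yourself and pass it to the launcher as a keyword argument. The sketch below assumes hypothetical variable names (`MY_REDIS_HOST` and so on) and default values; they are chosen for this example and are not the names or defaults that cobweb itself reads.

```python
import os
from typing import Mapping, Optional


def redis_config_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build a REDIS_CONFIG-style dict from environment variables.

    The MY_REDIS_* names and the fallback values are hypothetical,
    not cobweb's own.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("MY_REDIS_HOST", "localhost"),
        # An unset or empty password becomes None.
        "password": env.get("MY_REDIS_PASSWORD") or None,
        "port": int(env.get("MY_REDIS_PORT", "6379")),
        "db": int(env.get("MY_REDIS_DB", "0")),
    }


# Example with an explicit mapping instead of the real environment.
config = redis_config_from_env({"MY_REDIS_HOST": "10.0.0.5", "MY_REDIS_PORT": "6380"})
```

The resulting dictionary has the same four keys as the `REDIS_CONFIG` argument shown above, so it could be passed as `REDIS_CONFIG=config` when creating a `LauncherPro`.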
### 3. Custom request
`@app.request` is a decorator that registers a custom request method. The method runs before the request is sent and yields either a Request object or an instance of a BaseItem subclass, which lets you control the request parameters.
```python
from typing import Union
from cobweb import LauncherAir
from cobweb.base import Seed, Request, BaseItem

app = LauncherAir(
    task="test",
    project="test"
)

...

@app.request
def request(seed: Seed) -> Union[Request, BaseItem]:
    # Customize headers, proxies, request parameters, and so on
    proxies = {"http": ..., "https": ...}
    yield Request(seed.url, seed, ..., proxies=proxies, timeout=15)
    # yield xxxItem(seed, ...)  # skip the request and parsing, and go straight to storage

...

app.start()
```
> Default request method
> def request(seed: Seed) -> Union[Request, BaseItem]:
> yield Request(seed.url, seed, timeout=5)
### 4. Custom download
`@app.download` is a decorator that registers a custom download method. The method runs when the request is made and yields either a Response object or an instance of a BaseItem subclass, which lets you control how the request is executed.
```python
from typing import Union
from cobweb import LauncherAir
from cobweb.base import Request, Response, BaseItem

app = LauncherAir(
    task="test",
    project="test"
)

...

@app.download
def download(item: Request) -> Union[BaseItem, Response]:
    ...
    response = ...
    ...
    yield Response(item.seed, response, ...)  # yield a Response object to be parsed
    # yield xxxItem(item.seed, ...)  # skip the request and parsing, and go straight to storage

...

app.start()
```
> Default download method
> def download(item: Request) -> Union[Seed, BaseItem, Response, str]:
> response = item.download()
> yield Response(item.seed, response, **item.to_dict)
### 5. Custom parsing
Custom parsing consists of a storage item class and a parse method. The item class inherits from BaseItem and specifies the target table name and its fields.
The parse method yields instances of BaseItem subclasses, and what it yields controls the data storage flow.
```python
from typing import Union
from cobweb import LauncherAir
from cobweb.base import Seed, Response, BaseItem

class TestItem(BaseItem):
    __TABLE__ = "test_data"  # table name
    __FIELDS__ = "field1, field2, field3"  # field names

app = LauncherAir(
    task="test",
    project="test"
)

...

@app.parse
def parse(item: Response) -> Union[Seed, BaseItem]:
    ...
    yield TestItem(item.seed, field1=..., field2=..., field3=...)
    # yield Seed(...)  # build a new seed and push it onto the consumer queue

...

app.start()
```
> Default parse method
> def parse(item: Response) -> Union[Seed, BaseItem]:
> upload_item = item.to_dict
> upload_item["text"] = item.response.text
> yield ConsoleItem(item.seed, data=json.dumps(upload_item, ensure_ascii=False))
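
The three hooks above all yield objects, and the type of each yielded object decides where it goes next. The sketch below models that routing with stand-in classes (`SeedStub`, `RequestStub`, `ResponseStub`, `ItemStub`) and a hypothetical `run_seed` driver. It is an illustration of the flow as described in this document, not cobweb's internal implementation, and it performs no real network access.

```python
from dataclasses import dataclass, field
from typing import List


# Stand-in types for this sketch only; they are NOT cobweb's classes.
@dataclass
class SeedStub:
    url: str


@dataclass
class RequestStub:
    seed: SeedStub
    timeout: int = 5


@dataclass
class ResponseStub:
    seed: SeedStub
    text: str


@dataclass
class ItemStub:
    seed: SeedStub
    data: dict = field(default_factory=dict)


def run_seed(seed, request_fn, download_fn, parse_fn):
    """Run one seed through the three hooks, routing each yielded object by type."""
    stored: List[ItemStub] = []
    new_seeds: List[SeedStub] = []
    for req in request_fn(seed):
        if isinstance(req, ItemStub):          # request hook went straight to storage
            stored.append(req)
            continue
        for resp in download_fn(req):
            if isinstance(resp, ItemStub):     # download hook skipped parsing
                stored.append(resp)
                continue
            for out in parse_fn(resp):
                if isinstance(out, SeedStub):  # new seed goes back to the queue
                    new_seeds.append(out)
                else:
                    stored.append(out)
    return stored, new_seeds


def request(seed):
    yield RequestStub(seed, timeout=15)


def download(item):
    # A real hook would perform the HTTP call here; this one fakes a body.
    yield ResponseStub(item.seed, text=f"<html>{item.seed.url}</html>")


def parse(item):
    yield ItemStub(item.seed, data={"length": len(item.text)})
    yield SeedStub(item.seed.url + "/next")


stored, new_seeds = run_seed(SeedStub("https://example.com"), request, download, parse)
```

Writing the hooks as generators is what allows a single call to produce several results, for example one stored item plus a follow-up seed from the same response.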
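## TODO
- Improve the queue: use a queue-based wait() mechanism to synchronize execution across modules?
- Improve logging: in single-machine mode, write scheduling state and saved data to files, and emit structured logs for each task
- Deduplication filtering (Bloom filter, etc.)
- Protection against data loss in single-machine mode
- Complete the excel, mysql, and redis data support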
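
For the deduplication item in the list above, one possible direction is a Bloom filter over seed URLs. The sketch below is a generic in-memory illustration and not part of cobweb; the class name, default sizes, and hashing scheme (double hashing over a SHA-256 digest) are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, value: str):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )


seen = BloomFilter()
seen.add("https://www.baidu.com")
already_seen = "https://www.baidu.com" in seen
```

A filter like this trades a small false-positive rate for constant memory, so a URL reported as seen may occasionally be new, while a URL reported as unseen is definitely new.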
> The flowchart below has not been updated yet.
![img.png](https://image-luyuan.oss-cn-hangzhou.aliyuncs.com/image/D2388CDC-B9E5-4CE4-9F2C-7D173763B6A8.png)
Raw data
{
"_id": null,
"home_page": "https://github.com/Juannie-PP/cobweb",
"name": "cobweb-launcher",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "cobweb-launcher, cobweb",
"author": "Juannie-PP",
"author_email": "2604868278@qq.com",
"download_url": "https://files.pythonhosted.org/packages/1f/74/beec92116013b41777359ccfd6e5c96abb0da67674941cce90d09f6e236c/cobweb-launcher-1.3.15.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "spider_hole",
"version": "1.3.15",
"project_urls": {
"Homepage": "https://github.com/Juannie-PP/cobweb"
},
"split_keywords": [
"cobweb-launcher",
" cobweb"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "f1539952e125cf675ddfe1e5306eb081cbae94baaef45ef76f60f182f606ff53",
"md5": "f420a0c5a125e9b56c6be360b080327d",
"sha256": "0922d2143b16dc028689f0c7b13e4f0a7369b09ebc90871173a81b763381d0d3"
},
"downloads": -1,
"filename": "cobweb_launcher-1.3.15-py3-none-any.whl",
"has_sig": false,
"md5_digest": "f420a0c5a125e9b56c6be360b080327d",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 33043,
"upload_time": "2024-11-21T09:49:18",
"upload_time_iso_8601": "2024-11-21T09:49:18.053021Z",
"url": "https://files.pythonhosted.org/packages/f1/53/9952e125cf675ddfe1e5306eb081cbae94baaef45ef76f60f182f606ff53/cobweb_launcher-1.3.15-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "1f74beec92116013b41777359ccfd6e5c96abb0da67674941cce90d09f6e236c",
"md5": "dbabf8141b5d52d6e7241172e62ecd5d",
"sha256": "46541acbf3099277ccd8ca3d52154f450b756ac13ae3fb37b7c965da584dd2a6"
},
"downloads": -1,
"filename": "cobweb-launcher-1.3.15.tar.gz",
"has_sig": false,
"md5_digest": "dbabf8141b5d52d6e7241172e62ecd5d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 25060,
"upload_time": "2024-11-21T09:49:19",
"upload_time_iso_8601": "2024-11-21T09:49:19.210460Z",
"url": "https://files.pythonhosted.org/packages/1f/74/beec92116013b41777359ccfd6e5c96abb0da67674941cce90d09f6e236c/cobweb-launcher-1.3.15.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-21 09:49:19",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Juannie-PP",
"github_project": "cobweb",
"github_not_found": true,
"lcname": "cobweb-launcher"
}