cobweb-launcher


- Name: cobweb-launcher
- Version: 1.2.11
- Home page: https://github.com/Juannie-PP/cobweb
- Summary: spider_hole
- Upload time: 2024-09-25 11:51:43
- Author: Juannie-PP
- Requires Python: >=3.7
- License: MIT
- Keywords: cobweb-launcher, cobweb
- Requirements: No requirements were recorded.

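# cobweb
cobweb is a Python-based distributed crawler scheduling framework. It currently supports distributed and single-machine crawling, along with custom databases, custom data storage, and custom data processing.  

cobweb consists of three modules and one configuration file: the Launcher, the Crawler, the Pipeline, and the setting configuration file.
1. Launcher: starts crawl tasks and controls their execution flow, as well as data storage and data processing.
The framework provides two launcher modes, LauncherAir and LauncherPro, for single-machine crawling and distributed scheduling respectively.
2. Crawler: controls the crawl flow, data download, and data processing.
The framework provides a basic crawler for these steps. Users can also customize the request, download, and parse methods when creating a task; see the usage section for details.
3. Pipeline: stores the collected data and supports custom data storage and data processing. The framework provides two storage backends, Console and Loghub, and users can also subclass the Pipeline abstract class to define their own.
4. setting configuration file: configures the crawler, the pipeline, the queue length, the number of crawl threads, and other parameters. The framework provides a default configuration, which users can override.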
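
To make the division of labor between these modules concrete, here is a minimal, framework-independent sketch of the request, download, parse, and store flow. None of the names below come from cobweb: `FakeRequest`, `FakeResponse`, `FakeItem`, and `run` are hypothetical stand-ins, and no network access is performed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List


# Hypothetical stand-ins for illustration only; these are NOT cobweb classes.
@dataclass
class FakeRequest:
    url: str


@dataclass
class FakeResponse:
    url: str
    text: str


@dataclass
class FakeItem:
    data: Dict[str, str]


def request(seed: Dict[str, str]) -> Iterator[FakeRequest]:
    # Crawler step 1: turn a seed into a request.
    yield FakeRequest(seed["url"])


def download(req: FakeRequest) -> Iterator[FakeResponse]:
    # Crawler step 2: a real implementation would fetch the URL here.
    yield FakeResponse(req.url, f"<html>{req.url}</html>")


def parse(resp: FakeResponse) -> Iterator[FakeItem]:
    # Crawler step 3: turn a response into storable items.
    yield FakeItem({"url": resp.url, "length": str(len(resp.text))})


def run(seeds: List[Dict[str, str]], store: Callable[[FakeItem], None]) -> None:
    # Launcher role: drive every seed through the three crawler steps,
    # then hand each item to the pipeline (`store`).
    for seed in seeds:
        for req in request(seed):
            for resp in download(req):
                for item in parse(resp):
                    store(item)


stored: List[FakeItem] = []
run([{"url": "https://www.baidu.com"}], stored.append)
print(stored[0].data)
```

Each step is a generator, mirroring the `yield`-based style of the customization hooks described in the usage section.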
## Installation
```
pip3 install --upgrade cobweb-launcher
```
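## Usage
### 1. Creating a task
- Creating a LauncherAir task
```python
from cobweb import LauncherAir

# Create the launcher
app = LauncherAir(task="test", project="test")

# Set the crawl seeds
app.SEEDS = [{
    "url": "https://www.baidu.com"
}]
...
# Start the task
app.start()
```
- Creating a LauncherPro task  
LauncherPro relies on Redis for distributed scheduling. Before using it, configure Redis either through environment variables or through the redis settings in a custom setting file; see section 2 on custom configuration parameters for how to do this.
```python
from cobweb import LauncherPro

# Create the launcher
app = LauncherPro(
    task="test",
    project="test"
)
...
# Start the task
app.start()
```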
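
The LauncherAir example above sets `SEEDS` to a list of dicts with a `url` key. The following sketch builds such a list from plain URLs, dropping blanks and duplicates while keeping order. `build_seeds` is a hypothetical helper, not part of cobweb, and any seed keys other than `url` are not documented on this page.

```python
from typing import Dict, Iterable, List


def build_seeds(urls: Iterable[str]) -> List[Dict[str, str]]:
    """Hypothetical helper: turn URLs into seed dicts shaped like the example above."""
    seen = set()
    seeds = []
    for url in urls:
        url = url.strip()
        if not url or url in seen:
            continue  # skip blanks and repeated URLs
        seen.add(url)
        seeds.append({"url": url})
    return seeds


seeds = build_seeds(["https://www.baidu.com", "https://www.baidu.com", " "])
print(seeds)
# The result could then be assigned to the launcher: app.SEEDS = seeds
```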
### 2. Custom configuration parameters
- Using a custom setting file, passed as an import string  
> Default configuration file: import cobweb.setting  
> Not recommended! This option currently has bugs, so use it at your own risk.

For example, with a custom setting.py file created in the same directory:
```python
from cobweb import LauncherAir

app = LauncherAir(
    task="test",
    project="test",
    setting="import setting"
)

...

app.start()
```
- Overriding individual setting values
```python
from cobweb import LauncherPro

# Create the launcher
app = LauncherPro(
    task="test",
    project="test",
    REDIS_CONFIG={
        "host": ...,
        "password": ...,
        "port": ...,
        "db": ...
    }
)
...
# Start the task
app.start()
```
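
One way to avoid hard-coding Redis credentials is to assemble the `REDIS_CONFIG` dict from environment variables before passing it to the launcher. This is a sketch under assumptions: the variable names (`REDIS_HOST`, `REDIS_PASSWORD`, `REDIS_PORT`, `REDIS_DB`) and the defaults are invented for illustration, and they are not necessarily the names cobweb itself reads.

```python
import os
from typing import Any, Dict, Mapping


def redis_config_from_env(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Build a dict with the four keys shown in the REDIS_CONFIG example.

    The variable names and defaults are assumptions made for this sketch;
    they are not taken from the cobweb documentation.
    """
    return {
        "host": env.get("REDIS_HOST", "localhost"),
        "password": env.get("REDIS_PASSWORD") or None,
        "port": int(env.get("REDIS_PORT", "6379")),
        "db": int(env.get("REDIS_DB", "0")),
    }


config = redis_config_from_env({"REDIS_HOST": "10.0.0.5", "REDIS_PORT": "6380"})
print(config)
# It could then be passed as LauncherPro(task=..., project=..., REDIS_CONFIG=config)
```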
### 3. Custom request
Use the `@app.request` decorator to register a custom request method. It runs before the request is sent and yields either a Request object or an object that inherits from BaseItem, which lets you control the request parameters.
```python
from typing import Union
from cobweb import LauncherAir
from cobweb.base import Seed, Request, BaseItem

app = LauncherAir(
    task="test",
    project="test"
)

...

@app.request
def request(seed: Seed) -> Union[Request, BaseItem]:
    # Customize headers, proxies, request parameters, and so on
    proxies = {"http": ..., "https": ...}
    yield Request(seed.url, seed, ..., proxies=proxies, timeout=15)
    # yield xxxItem(seed, ...)  # skip the request and parsing, go straight to data storage

...

app.start()
```
> Default request method  
> def request(seed: Seed) -> Union[Request, BaseItem]:  
>     yield Request(seed.url, seed, timeout=5)
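
The custom request example passes a `proxies` dict with `http` and `https` keys. A small sketch of rotating through several proxies in round-robin order follows. `proxy_rotation` is a hypothetical helper and the proxy addresses are placeholders; nothing here is provided by cobweb.

```python
from itertools import cycle
from typing import Dict, Iterator, List


def proxy_rotation(proxy_urls: List[str]) -> Iterator[Dict[str, str]]:
    """Yield `proxies` dicts in round-robin order, one per request."""
    for proxy in cycle(proxy_urls):
        yield {"http": proxy, "https": proxy}


# Placeholder addresses for illustration only.
rotation = proxy_rotation(["http://127.0.0.1:8001", "http://127.0.0.1:8002"])
first, second, third = next(rotation), next(rotation), next(rotation)
print(first, second, third)
```

Inside a function decorated with `@app.request`, `next(rotation)` could supply the value for the `proxies=` argument.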
### 4. Custom download
Use the `@app.download` decorator to register a custom download method. It runs when the request is made and yields either a Response object or an object that inherits from BaseItem, which lets you control the download step.
```python
from typing import Union
from cobweb import LauncherAir
from cobweb.base import Request, Response, BaseItem

app = LauncherAir(
    task="test",
    project="test"
)

...

@app.download
def download(item: Request) -> Union[BaseItem, Response]:
    ...
    response = ...
    ...
    yield Response(item.seed, response, ...)  # yield a Response object to be parsed
    # yield xxxItem(item.seed, ...)  # skip parsing and go straight to data storage

...

app.start()
```
> Default download method  
> def download(item: Request) -> Union[Seed, BaseItem, Response, str]:  
>     response = item.download()  
>     yield Response(item.seed, response, **item.to_dict)
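
A common reason to customize the download step is to retry transient failures. The sketch below shows a generic retry wrapper with linear backoff. `with_retries` and `flaky_fetch` are hypothetical names, and the demo uses a fake fetch function instead of a real network call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fetch: Callable[[], T], attempts: int = 3, delay: float = 0.0) -> T:
    """Call `fetch` until it succeeds or `attempts` is exhausted."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as error:  # a real downloader would catch narrower errors
            last_error = error
            if attempt < attempts:
                time.sleep(delay * attempt)  # linear backoff between tries
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


# Demo with a fake fetch that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


body = with_retries(flaky_fetch)
print(body, calls["count"])
```

Since the default download method shown above calls `item.download()`, a custom `@app.download` function could try `response = with_retries(item.download)` before yielding the Response.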
### 5. Custom parsing
Custom parsing consists of a storage data class and a parse method. The data class inherits from BaseItem and declares the target table name and its fields.
The parse method yields objects that inherit from BaseItem, and each yielded item is handed on to the data storage flow.
```python
from typing import Union
from cobweb import LauncherAir
from cobweb.base import Seed, Response, BaseItem

class TestItem(BaseItem):
    __TABLE__ = "test_data"  # table name
    __FIELDS__ = "field1, field2, field3"  # field names

app = LauncherAir(
    task="test",
    project="test"
)

...

@app.parse
def parse(item: Response) -> Union[Seed, BaseItem]:
    ...
    yield TestItem(item.seed, field1=..., field2=..., field3=...)
    # yield Seed(...)  # build a new seed and push it to the consumption queue

...

app.start()
```
> Default parse method  
> def parse(item: Request) -> Union[Seed, BaseItem]:  
>     upload_item = item.to_dict  
>     upload_item["text"] = item.response.text  
>     yield ConsoleItem(item.seed, data=json.dumps(upload_item, ensure_ascii=False))
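
A parse method often needs to discover follow-up URLs so that new seeds can be yielded. The sketch below extracts absolute links from an HTML string using only the standard library. `LinkCollector` and `extract_links` are hypothetical helpers, not cobweb APIs, and how a link is turned into a `Seed` is left out because the `Seed(...)` arguments are not shown on this page.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute href targets from <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url: str, html: str) -> List[str]:
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


page = '<a href="/news">News</a> <a href="https://example.com/x">X</a> <a>no href</a>'
links = extract_links("https://www.baidu.com", page)
print(links)
```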
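## To do
- Improve the queues: use a queue-style wait() mechanism to synchronize execution across modules?
- Improve logging: in single-machine mode, write scheduling and saved data to file, and emit structured logs for each task
- Deduplication filtering (Bloom filter, etc.)
- Data-loss protection in single-machine mode
- Complete Excel, MySQL, and Redis data support

> The flow chart below has not been updated!
![img.png](https://image-luyuan.oss-cn-hangzhou.aliyuncs.com/image/D2388CDC-B9E5-4CE4-9F2C-7D173763B6A8.png)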
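
The to-do list above mentions deduplication filtering with a Bloom filter. As an illustration of the idea only, and not of anything cobweb currently ships, here is a tiny Bloom filter built on `hashlib`. The class name, bit-array size, and hash count are arbitrary choices for this sketch.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, value: str):
        # Derive several bit positions by salting one hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


seen = BloomFilter()
seen.add("https://www.baidu.com")
print("https://www.baidu.com" in seen)
```

A crawler could check `url in seen` before queuing a seed and call `seen.add(url)` afterwards. The trade-off is that a rare false positive would cause a URL to be skipped.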



            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/Juannie-PP/cobweb",
    "name": "cobweb-launcher",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "cobweb-launcher, cobweb",
    "author": "Juannie-PP",
    "author_email": "2604868278@qq.com",
    "download_url": "https://files.pythonhosted.org/packages/eb/aa/acb3e303b8c3676791b45f2195c392043470caa3bfd19bec13ffacb4d1a9/cobweb-launcher-1.2.11.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "spider_hole",
    "version": "1.2.11",
    "project_urls": {
        "Homepage": "https://github.com/Juannie-PP/cobweb"
    },
    "split_keywords": [
        "cobweb-launcher",
        " cobweb"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a09578934ad1479861e54d4362fa22dcfb72d8111c6443b7e6fdf4de39b65a12",
                "md5": "56547c39591bc380d0118aba2e990af1",
                "sha256": "9af8e8d29658ebc81ba3b9f785cd071e566d4f85a338521fec9f7b24bc5569bd"
            },
            "downloads": -1,
            "filename": "cobweb_launcher-1.2.11-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "56547c39591bc380d0118aba2e990af1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 29417,
            "upload_time": "2024-09-25T11:51:41",
            "upload_time_iso_8601": "2024-09-25T11:51:41.520784Z",
            "url": "https://files.pythonhosted.org/packages/a0/95/78934ad1479861e54d4362fa22dcfb72d8111c6443b7e6fdf4de39b65a12/cobweb_launcher-1.2.11-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ebaaacb3e303b8c3676791b45f2195c392043470caa3bfd19bec13ffacb4d1a9",
                "md5": "42083ac10b160e3b34e7ff9abd0c3c6d",
                "sha256": "9c9c7fb8f1262d06490d5b76333e12fe0680d362d26a2a1f0543a3e01acfb2af"
            },
            "downloads": -1,
            "filename": "cobweb-launcher-1.2.11.tar.gz",
            "has_sig": false,
            "md5_digest": "42083ac10b160e3b34e7ff9abd0c3c6d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 20733,
            "upload_time": "2024-09-25T11:51:43",
            "upload_time_iso_8601": "2024-09-25T11:51:43.652273Z",
            "url": "https://files.pythonhosted.org/packages/eb/aa/acb3e303b8c3676791b45f2195c392043470caa3bfd19bec13ffacb4d1a9/cobweb-launcher-1.2.11.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-25 11:51:43",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Juannie-PP",
    "github_project": "cobweb",
    "github_not_found": true,
    "lcname": "cobweb-launcher"
}
        