ZhihuSpiderPlus

| Field | Value |
| :--- | :--- |
| Name | ZhihuSpiderPlus |
| Version | 1.2.5 |
| Home page | https://github.com/yanjlee/ZhihuSpider |
| Summary | Python crawler for Zhihu user information |
| Upload time | 2024-06-01 08:47:23 |
| Author | yanjlee |
| Requirements | beautifulsoup4, bs4, PyMySQL, redis, requests |

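# Python Zhihu User Information Crawler



## Features

* Besides crawling user information, it can optionally crawl the follow relationships between users
* Crawls with multiple threads; the number of threads is configurable
* Uses Redis as the task queue
* Crawls through high-anonymity proxy IPs and assigns a new working proxy when one fails, so that frequent requests do not get the local IP banned
* Optional periodic email notifications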
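
The multi-threaded, queue-driven design described above can be sketched as follows. This is only an illustration: an in-memory `queue.Queue` stands in for the Redis task queue, and `crawl_all` and `fake_fetch` are hypothetical names that do not come from the project.

```python
import queue
import threading


def crawl_all(tokens, fetch, num_threads=4):
    """Fetch every token with a pool of worker threads; return {token: result}."""
    tasks = queue.Queue()  # in-memory stand-in for the Redis task queue
    for token in tokens:
        tasks.put(token)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                token = tasks.get_nowait()
            except queue.Empty:
                return  # no work left, let this thread exit
            value = fetch(token)
            with lock:  # protect the shared result dictionary
                results[token] = value

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_fetch(token):
    """Hypothetical downloader used only for this demo; no network access."""
    return "profile of " + token


if __name__ == "__main__":
    print(crawl_all(["excited-vczh", "xzer"], fake_fetch, num_threads=2))
```

A shared queue lets the number of workers change without touching the task producer, which is presumably why the thread count can be a plain configuration value.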



## Requirements

* Python version: 3.0 or later
* Databases: MySQL, Redis




## Libraries used

The project uses the following third-party Python libraries:

###### Third-party libraries:

- requests: a very convenient HTTP request library, http://docs.python-requests.org/en/master/
- pymysql: connects Python to MySQL, https://github.com/PyMySQL/PyMySQL
- BeautifulSoup: a simple but powerful HTML document parsing library, https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- Redis-py: Redis client for Python, [How To Configure a Redis Cluster on CentOS 7](https://www.digitalocean.com/community/tutorials/how-to-configure-a-redis-cluster-on-centos-7)



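## Before you start

### User token



![](https://raw.githubusercontent.com/KEN-LJQ/MarkdownPics/master/Resource/2017-3-10/zhihuSpider-7.PNG)



![](https://raw.githubusercontent.com/KEN-LJQ/MarkdownPics/master/Resource/2017-3-10/zhihuSpider-8.PNG)

The user token is a non-Chinese nickname chosen when registering a Zhihu account, and it uniquely identifies a user. Because Zhihu URLs also use this token to distinguish the pages of different users, the token makes it easy to crawl a given user.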



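### URL analysis

The crawler uses three kinds of URL:

* User page, used to obtain detailed user information:

  ```
  https://www.zhihu.com/people/excited-vczh/pins
  ```
  As far as I can tell, the detailed user information is rendered on the server and delivered together with the user page when it loads. The data sits in the `data-state` attribute of the `<div>` tag whose `id` is `data`. I have not found an API that returns this data directly, so the crawler fetches a whole page, choosing one that carries relatively little data.

* List of users that a user is following:

  ```
  http://www.zhihu.com/api/v4/members/xzer/followees?limit=20&offset=0
  ```

  This URL returns data only to a logged-in user. The response format is JSON. URL parameters: `limit` is the page size of the list, `offset` is the paging offset.

* List of a user's followers:

  ```
  http://www.zhihu.com/api/v4/members/xzer/followers?limit=20&offset=0
  ```


  This URL returns data only to a logged-in user. The response format is JSON. URL parameters: `limit` is the page size of the list, `offset` is the paging offset.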
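
The three URL types above could be built from a token, and the embedded `data-state` attribute read back out of a page, roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the standard-library `html.parser` is used instead of the project's BeautifulSoup so that the example has no dependencies, and the attribute value is assumed to be HTML-escaped JSON.

```python
import json
from html.parser import HTMLParser
from urllib.parse import quote, urlencode

# URL prefixes copied from the examples above.
PROFILE_BASE = "https://www.zhihu.com/people"
API_BASE = "http://www.zhihu.com/api/v4/members"


def profile_url(token):
    """URL of the lightweight profile page for one user token."""
    return f"{PROFILE_BASE}/{quote(token)}/pins"


def follow_list_url(token, kind, limit=20, offset=0):
    """URL of one page of a follow list; kind is 'followees' or 'followers'."""
    if kind not in ("followees", "followers"):
        raise ValueError("kind must be 'followees' or 'followers'")
    query = urlencode({"limit": limit, "offset": offset})
    return f"{API_BASE}/{quote(token)}/{kind}?{query}"


class _DataStateParser(HTMLParser):
    """Remembers the data-state attribute of the first <div id="data">."""

    def __init__(self):
        super().__init__()
        self.data_state = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "div" and attributes.get("id") == "data" and self.data_state is None:
            self.data_state = attributes.get("data-state")


def extract_data_state(html):
    """Return the decoded data-state JSON, or None if the div is missing."""
    parser = _DataStateParser()
    parser.feed(html)
    if parser.data_state is None:
        return None
    return json.loads(parser.data_state)


if __name__ == "__main__":
    print(profile_url("excited-vczh"))
    print(follow_list_url("xzer", "followees", limit=20, offset=20))
    print(extract_data_state('<div id="data" data-state="{&quot;a&quot;: 1}"></div>'))
```

To walk a whole follow list, a caller would increase `offset` by `limit` on each request until the response contains no more entries.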



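## User information that is crawled

The crawler targets the personal information that Zhihu users have made public, for example:

![](https://raw.githubusercontent.com/KEN-LJQ/MarkdownPics/master/Resource/2017-2-15/%E7%9F%A5%E4%B9%8E%E7%88%AC%E8%99%AB-%E7%88%AC%E5%8F%96%E5%86%85%E5%AE%B9%E4%BB%8B%E7%BB%8D1.png)

A profile contains a lot of information, so the crawler collects only the more meaningful fields. These are:

|     Field      |                       Meaning                        |
| :------------: | :--------------------------------------------------: |
|   avator_url   |                   User avatar URL                    |
|     token      |                   User identifier                    |
|    headline    |             User's one-line introduction             |
|    location    |                  Place of residence                  |
|    business    |                       Industry                       |
|  employments   |                  Employment history                  |
|   educations   |                  Education history                   |
|  description   |                   User description                   |
| sinaweibo_url  | Sina Weibo URL (Zhihu no longer seems to provide it) |
|     gender     |                        Gender                        |
| followingCount |         Number of users this user follows            |
| followerCount  |         Number of users following this user          |
|  answerCount   |     Number of questions this user has answered       |
| questionCount  |       Number of questions this user has asked        |
|  voteupCount   |      Number of upvotes this user has received        |
|    userName    |                    User nickname                     |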


## Architecture



![](https://raw.githubusercontent.com/KEN-LJQ/MarkdownPics/master/Resource/2017-3-22/spider-7.PNG)

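## How to run

0. Install the required version of Python
1. Run `pip3 install -r requirements.txt` to install the required third-party libraries, including the database client libraries
2. Configure the database settings used by the program

   1. Open the `SpiderCoreConfig.conf` file and edit the MySQL settings

      ![](https://raw.githubusercontent.com/KEN-LJQ/MarkdownPics/master/Resource/2017-3-22/spider-1.PNG)

   2. In the same file, edit the Redis settings

      ![](https://raw.githubusercontent.com/KEN-LJQ/MarkdownPics/master/Resource/2017-3-22/spider-2.PNG)

3. Execute the `db.sql` file to create the database and tables used by the crawler
4. Add one or more initial user tokens; the crawler starts searching from these users

   1. In `SpiderCoreConfig.conf`, set `initToken` to the initial user token (several can be given)

      ```
      # Initial tokens (separate multiple tokens with ',')
      initToken = excited-vczh
      ```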

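5. Configure the number of download threads and processing threads

   1. Download threads: edit `downloadThreadNum` in `SpiderCoreConfig.conf`; the default is 10 threads (the sample configuration under the configurable options sets 5)

   ![](https://raw.githubusercontent.com/KEN-LJQ/MarkdownPics/master/Resource/2017-3-22/spider-4.PNG)

   2. Processing threads: edit `processThreadNum` in `SpiderCoreConfig.conf`; the default is 3 threads (the sample configuration under the configurable options sets 2)

   ![](https://raw.githubusercontent.com/KEN-LJQ/MarkdownPics/master/Resource/2017-3-22/spider-5.PNG)

6. Choose whether to use proxies

   1. Using proxies keeps the crawler's frequent requests from getting your IP blocked. Edit `isProxyServiceEnable` in `SpiderCoreConfig.conf`: `1` enables proxies, `0` disables them

      ```
      # Whether to enable the proxy service (1 = yes, 0 = no)
      isProxyServiceEnable = 1
      ```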

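7. Zhihu account configuration

   1. Choose the login method. Set the `isLoginByCookie` field in the configuration file: `1` logs in with a cookie, `0` logs in the normal way (email address or mobile number)

      ```
      # Whether to log in with a cookie
      isLoginByCookie = 1
      ```

   2. Configure the login credentials for one of the following two methods

      1. Cookie login. First log in to your Zhihu account manually in a desktop browser, then copy the cookie issued after a successful login into the crawler's configuration file. The cookie to configure is `z_c0`. (How to read cookies from the browser is not covered here.)

      ```
      # Cookie login settings
      z_c0 = XXX
      ```

      2. Normal login (currently unavailable). Configure the username and password of a Zhihu account, preferably not your main account. Zhihu's email and mobile-number logins currently require either a regular CAPTCHA or a "select the upside-down characters" CAPTCHA, which the crawler does not yet handle.

      ![](https://raw.githubusercontent.com/KEN-LJQ/MarkdownPics/master/Resource/2017-3-22/spider-3.PNG)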

8. Logging configuration

   1. Runtime information can be printed to the console or written to a log file; which one is used is configured in `Logger.py`. The log level and other detailed settings are configured in `SpiderLoggingConfig.conf`

      ![](https://raw.githubusercontent.com/KEN-LJQ/MarkdownPics/master/Resource/2017-3-22/spider-6.PNG)

9. On Windows, open CMD and change to the project's root directory

   ![](https://raw.githubusercontent.com/KEN-LJQ/MarkdownPics/master/Resource/2017-2-15/%E7%9F%A5%E4%B9%8E%E7%88%AC%E8%99%AB-%E5%A6%82%E4%BD%95%E4%BD%BF%E7%94%A84.png)

10. Run `startup.py` to start the program (for example, `python startup.py`)

    ![](https://raw.githubusercontent.com/KEN-LJQ/MarkdownPics/master/Resource/2017-2-15/%E7%9F%A5%E4%B9%8E%E7%88%AC%E8%99%AB-%E5%A6%82%E4%BD%95%E4%BD%BF%E7%94%A85.png)

 Note that the CMD code page must be set to UTF-8 (for example with `chcp 65001`), otherwise problems may occur

11. The program starts running

    * Result

      ![](https://raw.githubusercontent.com/KEN-LJQ/MarkdownPics/master/Resource/2017-2-15/%E7%9F%A5%E4%B9%8E%E7%88%AC%E8%99%AB-%E7%88%AC%E5%8F%96%E5%86%85%E5%AE%B9%E4%BB%8B%E7%BB%8D2.png)
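
The cookie login from step 7 amounts to sending the `z_c0` cookie with each request. The sketch below shows one way to do that with the standard library only; the project itself lists `requests` as a dependency, so this is not its actual code. The function name and the `User-Agent` value are assumptions, and whether Zhihu still accepts `z_c0` on its own has not been verified here.

```python
import urllib.request


def build_authenticated_request(url, z_c0, user_agent="Mozilla/5.0"):
    """Build (but do not send) a request that carries the z_c0 login cookie."""
    request = urllib.request.Request(url)
    request.add_header("Cookie", "z_c0=" + z_c0)
    request.add_header("User-Agent", user_agent)  # assumed browser-like header
    return request


if __name__ == "__main__":
    req = build_authenticated_request(
        "http://www.zhihu.com/api/v4/members/xzer/followees?limit=20&offset=0",
        "XXX",  # placeholder, as in the configuration file
    )
    print(req.full_url)
    print(req.get_header("Cookie"))
```

Passing the returned object to `urllib.request.urlopen(req, timeout=30)` would actually send it; the example stops before that so that it runs without network access.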




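## Configurable options

The crawler's parameters are set in the configuration file `SpiderCore.conf` (called `SpiderCoreConfig.conf` in the steps above). The options are: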

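```
[spider_core]

# Download settings
# Whether to enable the proxy service (1 = yes, 0 = no)
isProxyServiceEnable = 1
# Size of the session pool
sessionPoolSize = 20
# Number of download threads
downloadThreadNum = 5
# Number of retries after a network connection error
networkRetryTimes = 3
# Network connection timeout (seconds)
connectTimeout = 30
# Interval between downloads
downloadInterval = 6

# Data processing settings
# Number of data processing threads
processThreadNum = 2
# Whether to parse the following list (obtain the next batch of tokens from the users this user follows)
isParserFollowingList = 1
# Whether to parse the follower list (obtain the next batch of tokens from this user's followers)
isParserFollowerList = 0

# URL scheduling settings
# Ratio of user-info URLs to follow-list URLs (user-info URLs / total URLs; for example, 8 means 80% of the URLs in each scheduling round are user-info URLs)
urlRate = 8

# Persistence settings
# Write-cache size for the user-info table (number of records)
persistentCacheSize = 100
# Write-cache size for the follow-relationship table
followRelationPersistentCacheSize = 500

# Email service settings
# Whether to enable email notifications (1 = yes, 0 = no)
isEmailServiceEnable = 0
# SMTP server host name
smtpServerHost = smtp.mxhichina.com
# SMTP server port
smtpServerPort = 25
# SMTP server login password
smtpServerPassword = XXX
# Sender address
smtpFromAddr = centosserver@ken-ljq.xyz
# Recipient address
smtpToAddr = ljq1120799726@outlook.com
# Email subject
smtpEmailHeader = ZhiZhuSpiderNotification
# Interval between emails (seconds)
smtpSendInterval = 3600

# Redis settings
redisHost = localhost
redisPort = 6379
redisDB = 1
redisPassword = XXX

# MySQL settings
mysqlHost = localhost
mysqlUsername = root
mysqlPassword = XXX
mysqlDatabase = spider_user
mysqlCharset = utf8

# Zhihu login settings
# Whether to log in with a cookie
isLoginByCookie = 1
# Cookie login settings
z_c0 = XXX
# Normal login settings
loginToken = XXX
password = XXX

# Initial tokens (separate multiple tokens with ',')
initToken = excited-vczh
```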
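
A file in this format can be read with Python's standard `configparser` module. The sketch below is a guess at how such settings might be loaded, not the project's actual loader: `load_settings`, the keys of the returned dictionary, and the shortened `SAMPLE` text are all hypothetical.

```python
import configparser

# A shortened sample in the same format as the configuration shown above.
SAMPLE = """
[spider_core]
# Whether to enable the proxy service (1 = yes, 0 = no)
isProxyServiceEnable = 1
downloadThreadNum = 5
processThreadNum = 2
isLoginByCookie = 1
initToken = excited-vczh, xzer
"""


def load_settings(text):
    """Parse the [spider_core] section into typed Python values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    core = parser["spider_core"]
    raw_tokens = core.get("initToken", "")
    return {
        # getboolean accepts "1"/"0" as well as yes/no and true/false
        "use_proxy": core.getboolean("isProxyServiceEnable"),
        "download_threads": core.getint("downloadThreadNum"),
        "process_threads": core.getint("processThreadNum"),
        "login_by_cookie": core.getboolean("isLoginByCookie"),
        # several initial tokens are separated by ','
        "init_tokens": [t.strip() for t in raw_tokens.split(",") if t.strip()],
    }


if __name__ == "__main__":
    print(load_settings(SAMPLE))
```

To read a file on disk, `parser.read(path, encoding="utf-8")` can replace `read_string`. `configparser` matches option names case-insensitively by default, so the camelCase keys need no special handling.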

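The proxy module's parameters are set in the configuration file `proxyConfiguration.conf`. The options are:

```
[proxy_core]
# Connection timeout for proxy validation (seconds)
proxyValidate_connectTimeout = 30
# Number of reconnection attempts for proxy validation
proxyValidate_networkReconnectTimes = 3
# Connection timeout for fetching proxy data (seconds)
dataFetch_connectTimeout = 30
# Interval between reconnection attempts when fetching proxy data (seconds)
dataFetch_networkReconnectInterval = 30
# Number of reconnection attempts when fetching proxy data
dataFetch_networkReconnectionTimes = 3
# First page of the proxy listing to fetch
proxyCore_fetchStartPage = 1
# Last page of the proxy listing to fetch
proxyCore_fetchEndPage = 5
# Proxy pool size (at most 100)
proxyCore_proxyPoolSize = 10
# Interval between proxy pool update scans
proxyCore_proxyPoolScanInterval = 300
# Number of proxy validation threads
proxyCore_proxyValidateThreadNum = 5
```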
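
To make the pool-size and validation settings more concrete, here is a minimal sketch of a proxy pool that keeps up to a fixed number of validated proxies and replaces one when it fails. It is not the project's implementation: the class `ProxyPool`, its methods, and the `validate` callback are hypothetical, and real validation would involve the network timeouts and retries configured above.

```python
from collections import deque


class ProxyPool:
    """Keeps at most pool_size validated proxies and replaces failed ones."""

    def __init__(self, candidates, validate, pool_size=10):
        self._candidates = deque(candidates)  # proxies not yet checked
        self._validate = validate             # callable: proxy -> bool
        self._pool_size = pool_size
        self._active = []
        self.refill()

    @property
    def active(self):
        """Copy of the proxies currently in the pool."""
        return list(self._active)

    def refill(self):
        """Validate candidates until the pool is full or none are left."""
        while len(self._active) < self._pool_size and self._candidates:
            proxy = self._candidates.popleft()
            if self._validate(proxy):
                self._active.append(proxy)

    def get(self):
        """Return the next proxy in round-robin order, or None if empty."""
        if not self._active:
            return None
        proxy = self._active.pop(0)
        self._active.append(proxy)
        return proxy

    def mark_failed(self, proxy):
        """Drop a proxy that stopped working and try to replace it."""
        if proxy in self._active:
            self._active.remove(proxy)
        self.refill()


if __name__ == "__main__":
    pool = ProxyPool(["a", "bad", "b", "c"], lambda p: p != "bad", pool_size=2)
    print(pool.active)
    pool.mark_failed("a")
    print(pool.active)
```

A periodic scan, as suggested by `proxyCore_proxyPoolScanInterval`, could call `refill` on a timer after fetching fresh candidates.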



## Data analysis

[Jianshu - Analysis of Zhihu user information](http://www.jianshu.com/p/962bc581e03a)






            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/yanjlee/ZhihuSpider",
    "name": "ZhihuSpiderPlus",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": null,
    "author": "yanjlee",
    "author_email": "yanjlee@163.com",
    "download_url": "https://files.pythonhosted.org/packages/ff/cd/6544211ed7cd09fe1bbc9c3ee45dea735de9d02d51dcba7e102b53f7262e/zhihuspiderplus-1.2.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "# Python \u77e5\u4e4e\u7528\u6237\u4fe1\u606f\u722c\u866b",
    "version": "1.2.5",
    "project_urls": {
        "Homepage": "https://github.com/yanjlee/ZhihuSpider"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "767a1f9aae75c747c3d57facbe89a3fa676b10b2b15d58d8be2e5c625535fead",
                "md5": "194e52507f17c8b9a465d00d8b88a957",
                "sha256": "71d2a939163da7c0678718c35da428c41150d0367d98aeab384f9a51e1a33609"
            },
            "downloads": -1,
            "filename": "ZhihuSpiderPlus-1.2.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "194e52507f17c8b9a465d00d8b88a957",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 29645,
            "upload_time": "2024-06-01T08:47:21",
            "upload_time_iso_8601": "2024-06-01T08:47:21.336189Z",
            "url": "https://files.pythonhosted.org/packages/76/7a/1f9aae75c747c3d57facbe89a3fa676b10b2b15d58d8be2e5c625535fead/ZhihuSpiderPlus-1.2.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ffcd6544211ed7cd09fe1bbc9c3ee45dea735de9d02d51dcba7e102b53f7262e",
                "md5": "ce3ad65779a2d496b0e5361bc9b9387d",
                "sha256": "9dffa397159a14432a4a6b404c462822f192b1a4b1d608650e4789271006f4ab"
            },
            "downloads": -1,
            "filename": "zhihuspiderplus-1.2.5.tar.gz",
            "has_sig": false,
            "md5_digest": "ce3ad65779a2d496b0e5361bc9b9387d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 28784,
            "upload_time": "2024-06-01T08:47:23",
            "upload_time_iso_8601": "2024-06-01T08:47:23.561322Z",
            "url": "https://files.pythonhosted.org/packages/ff/cd/6544211ed7cd09fe1bbc9c3ee45dea735de9d02d51dcba7e102b53f7262e/zhihuspiderplus-1.2.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-01 08:47:23",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "yanjlee",
    "github_project": "ZhihuSpider",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "beautifulsoup4",
            "specs": [
                [
                    "==",
                    "4.5.3"
                ]
            ]
        },
        {
            "name": "bs4",
            "specs": [
                [
                    "==",
                    "0.0.1"
                ]
            ]
        },
        {
            "name": "PyMySQL",
            "specs": [
                [
                    "==",
                    "0.7.10"
                ]
            ]
        },
        {
            "name": "redis",
            "specs": [
                [
                    "==",
                    "2.10.5"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    "==",
                    "2.13.0"
                ]
            ]
        }
    ],
    "lcname": "zhihuspiderplus"
}
        