scrapy-zhihu-github


- Name: scrapy-zhihu-github
- Version: 1.2.5
- Download: https://files.pythonhosted.org/packages/81/63/591ff1c5ae3bd134f6c10dd2da6ba62868def620ac749fce0e501236a0a4/scrapy_zhihu_github-1.2.5.tar.gz
- Home page: https://github.com/yanjlee/scrapy-zhihu-github
- Summary: Code for crawling zhihu and github; the data is stored in MongoDB.
- Upload time: 2024-06-01 08:23:08
- Author: yanjlee
- Maintainer: None
- Docs URL: None
- Requires Python: None
- License: None
- Requirements: No requirements were recorded.
- Travis-CI: None
- Coveralls test coverage: None

scrapy-zhihu-github
===================

Code for crawling zhihu and github; the crawled data is stored in MongoDB.


# Install

For Scrapy installation, see [Using Scrapy to Scrape Data](http://blog.javachen.com/2014/05/24/using-scrapy-to-cralw-data.html).

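MongoDB is installed on the local machine and listens on the default port. The database is named `zhihu` and contains the following collections:

   - `zh_user`: Zhihu users
   - `zh_ask`: Zhihu questions
   - `zh_answer`: Zhihu answers
   - `zh_followee`: Zhihu followee lists
   - `zh_follower`: Zhihu follower lists
   - `gh_user`: github users
   - `gh_repo`: github repositories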
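
The layout above can be checked from Python. The following is a minimal sketch and not code from this project: it assumes the third-party `pymongo` driver is installed, and the helper names `open_local_db` and `missing_collections` are hypothetical.

```python
# Collections that the list above places in the `zhihu` database.
EXPECTED_COLLECTIONS = {
    "zh_user": "Zhihu users",
    "zh_ask": "Zhihu questions",
    "zh_answer": "Zhihu answers",
    "zh_followee": "Zhihu followee lists",
    "zh_follower": "Zhihu follower lists",
    "gh_user": "github users",
    "gh_repo": "github repositories",
}


def open_local_db(name="zhihu"):
    """Connect to MongoDB on localhost at the default port (27017)."""
    # Imported lazily so the rest of this sketch works without pymongo.
    from pymongo import MongoClient

    return MongoClient("localhost", 27017)[name]


def missing_collections(db):
    """Return the expected collection names that `db` does not have yet."""
    existing = set(db.list_collection_names())
    return sorted(name for name in EXPECTED_COLLECTIONS if name not in existing)
```

With a server running, `missing_collections(open_local_db())` should return an empty list once each collection has received at least one document, because MongoDB creates a collection on its first insert.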

# zhihu

The spiders in this section use Scrapy to crawl Zhihu data. For details, see [Crawling the Zhihu Website with Scrapy](http://blog.javachen.com/2014/06/08/using-scrapy-to-cralw-zhihu.html).

The zhihu user collection (db.zhihu.zh_user) has the following structure:

```
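_id int, # user id
url string,
username string, # username, e.g. zhouyuan
nickname string, # display name, e.g. 周源
location string, # place of residence
industry string, # industry, e.g. 互联网 (Internet)
sex int, # gender, 1: male, 2: female, 0: unknown
jobs [],
educations [],
description string, # self-introduction
sinaweibo string, # Sina Weibo account
tencentweibo string, # Tencent Weibo account
# qq string, # QQ number
ask_num int, # number of questions asked, e.g. 590
answer_num int, # number of answers, e.g. 340
post_num int, # number of column articles, e.g. 3
collection_num int, # number of collections, e.g. 9
log_num int, # number of edits, e.g. 14980
agree_num int, # upvotes received, e.g. 15316
thank_num int, # thanks received, e.g. 3500
fav_num int, # times favorited by others, e.g. 9424
share_num int, # times shared, e.g. 922
followee_num int, # number of users followed, e.g. 1515
follower_num int, # number of followers, e.g. 319529
update_time datetime # time the record was last updated, e.g. 2014-05-17 11:15:00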
```
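
To make the structure concrete, here is a hedged sketch of a checker for `zh_user` documents. The mapping from the listed types to Python types (`int`, `str`, `list`, `datetime`) is an assumption, as is the rule that every field is always present. `ZH_USER_FIELDS` and `check_zh_user` are hypothetical names that do not come from the project.

```python
from datetime import datetime

# Field name -> expected Python type, mirroring the structure listed above.
ZH_USER_FIELDS = {
    "_id": int, "url": str, "username": str, "nickname": str,
    "location": str, "industry": str, "sex": int,
    "jobs": list, "educations": list, "description": str,
    "sinaweibo": str, "tencentweibo": str,
    "ask_num": int, "answer_num": int, "post_num": int,
    "collection_num": int, "log_num": int, "agree_num": int,
    "thank_num": int, "fav_num": int, "share_num": int,
    "followee_num": int, "follower_num": int,
    "update_time": datetime,
}


def check_zh_user(doc):
    """Return a list of human-readable problems found in `doc`."""
    problems = []
    for field, expected in ZH_USER_FIELDS.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    # The schema allows only three gender codes.
    if "sex" in doc and doc["sex"] not in (0, 1, 2):
        problems.append("sex: must be 0 (unknown), 1 (male) or 2 (female)")
    return problems
```

An empty result means the document matches the listed structure. Real crawled data may omit fields or store `None`, in which case the checker would need to be relaxed.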

First, run the following command to collect user information along with each user's followee and follower lists:

```shell
scrapy crawl zhihu_user
```

Then collect the questions and answers:

```shell
scrapy crawl zhihu_ask

scrapy crawl zhihu_answer
```
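
The crawl order described above (users first, then questions and answers) could be scripted as follows. This is a sketch under assumptions: it shells out to the `scrapy` command, the helper names `crawl_command` and `run_in_order` are hypothetical, and it simply preserves the listed order, since the source does not say whether `zhihu_ask` must finish before `zhihu_answer`.

```python
import subprocess

# Spider names and order taken from the commands above.
ZHIHU_SPIDERS = ["zhihu_user", "zhihu_ask", "zhihu_answer"]


def crawl_command(spider):
    """Build the argument list for `scrapy crawl <spider>`."""
    return ["scrapy", "crawl", spider]


def run_in_order(spiders=ZHIHU_SPIDERS, cwd="."):
    """Run each spider to completion before starting the next one.

    `cwd` must be the Scrapy project directory (the one with scrapy.cfg).
    """
    for spider in spiders:
        # check=True raises CalledProcessError if a crawl exits non-zero,
        # so later spiders do not run after a failed one.
        subprocess.run(crawl_command(spider), cwd=cwd, check=True)
```

Calling `run_in_order()` from the project directory runs the three crawls back to back and stops at the first failure.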



# github

The github user collection (db.zhihu.gh_user) has the following structure:

```
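_id, # user id
url, # profile page URL
username, # username
nickname, # display name
user_id, # user id
type, # type: 1 = organization; 0 = individual

company, # company
location, # location
website, # website
email, # email
update_time, # time the crawler last updated the record

join_date, # join date
followee_num, # number of users followed
follower_num, # number of followers
star_num, # number of stars
organizations, # organizations joined

member_num, # number of organization members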
```
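
The fields above lend themselves to simple queries. The sketch below builds a MongoDB filter document for `gh_user`. It is illustrative only: `gh_user_filter` is a hypothetical helper, and it assumes `follower_num` is stored as a number and not as a string.

```python
def gh_user_filter(is_org=None, min_followers=None, location=None):
    """Build a MongoDB query filter for the `gh_user` collection."""
    query = {}
    if is_org is not None:
        # `type` is 1 for organizations and 0 for individuals.
        query["type"] = 1 if is_org else 0
    if min_followers is not None:
        # Assumes follower_num is stored as a number, not a string.
        query["follower_num"] = {"$gte": min_followers}
    if location is not None:
        # Exact match on the location string as crawled.
        query["location"] = location
    return query
```

The returned dictionary can be passed to pymongo's `Collection.find`, for example `db.gh_user.find(gh_user_filter(is_org=False, min_followers=1000))`.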

First, run the following command to collect user information:

```shell
scrapy crawl github_user
```

Then crawl user information together with each user's followers:

```shell
scrapy crawl github_follower
```

Inspect the crawl results in the mongo shell:

```
> use zhihu
switched to db zhihu
> db.gh_user.count()
126135
```
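
The same check can be made from Python. This is a minimal sketch assuming a pymongo database object; `collection_counts` is a hypothetical helper. It uses `count_documents({})` because `Collection.count()` was removed in PyMongo 4.

```python
def collection_counts(db, names=("gh_user", "gh_repo")):
    """Return {collection name: number of documents} for `names`."""
    # An empty filter matches every document in the collection.
    return {name: db[name].count_documents({}) for name in names}
```

For the local setup described in this README, `db` would be `MongoClient("localhost", 27017)["zhihu"]`.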

            
