EpubCrawler


Name: EpubCrawler
Version: 2023.7.9.2
Home page: https://github.com/apachecn/epub-crawler
Summary: EpubCrawler, a small tool for crawling web page content and producing EPUB files
Upload time: 2023-07-09 17:39:03
Author: wizardforcel
Requires Python: >=3.6
Keywords: ebook, epub, crawler, 爬虫, 电子书
Requirements: No requirements were recorded.
Coveralls test coverage: No coveralls.

# epub-crawler

A small tool for crawling web page content and producing EPUB files.

## Installation

Via pip (recommended):

```
pip install EpubCrawler
```

From source:

```
pip install git+https://github.com/apachecn/epub-crawler
```

## Usage

```
crawl-epub [CONFIG]

CONFIG: configuration file in JSON format; defaults to config.json in the current working directory
```

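The configuration file contains the following properties:

+   `name: String`

    The book title in the metadata; also used as the name of the file saved in the current working directory.

+   `url: String` (set either this or `list`)

    URL of the table-of-contents page.

+   `link: String` (required if `url` is non-empty)

    Selector for the link `<a>` elements.

+   `list: [String]` (set either this or `url`)

    List of pages to crawl. If this list is non-empty, the pages in it are crawled.

    ⚠ This option overrides `url`, `link`, and `external` ⚠
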
    
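+   `title: String` (optional)

    Selector for the title on article pages (defaults to `title`).

+   `content: String` (optional)

    Selector for the content on article pages; if empty, the content is detected automatically.

+   `remove: String` (optional)

    Selector for elements to remove from article pages.

+   `credit: Boolean` (optional)

    Whether to show a link to the original article.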
    
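+   `headers: {String: String}` (optional)

    HTTP request headers; defaults to `{"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"}`

+   `retry: Integer` (optional)

    Number of retries for HTTP requests; defaults to 10.

+   `wait: Float` (optional)

    Interval between two requests, in seconds; defaults to 0.

+   `timeout: Integer` (optional)

    Sets both the connection timeout and the read timeout for HTTP requests, in seconds.

    ⚠ Overrides `connTimeout` and `readTimeout`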

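+   `connTimeout: Integer` (optional)

    Connection timeout for HTTP requests, in seconds; defaults to 1.

+   `readTimeout: Integer` (optional)

    Read timeout for HTTP requests, in seconds; defaults to 60.

+   `encoding: String` (optional)

    Page encoding; defaults to UTF-8.

+   `optiMode: String` (optional)

    Image processing mode. `'none'` means no processing; for other values, see the modes supported by imgyaso. Defaults to `'quant'`.

+   `colors: Integer` (optional)

    The `colors` parameter passed to imgyaso; defaults to 8.

+   `imgSrc: [String]` (optional)

    Attributes to read the image source from; defaults to `["data-src", "data-original-src", "src"]`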
	
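+   `proxy: String` (optional)

    Proxy to use, in the format `<protocol>://<host>:<port>`

+   `checkStatus: Boolean` (optional)

    Whether to check the status code. If `true` and the status code is not 2XX, the request is treated as a failure. Defaults to `false`.

+   `textThreads: Integer` (optional)

    Number of threads for crawling text; defaults to 5.

+   `imgThreads: Integer` (optional)

    Number of threads for crawling images; defaults to 5.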
	
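+   `external: String` (optional)

    Path to an external script. The script may define `get_toc` and `get_article` functions to customize how the table of contents and the article body are obtained.

    `get_toc(html: string, url: string): [string]`

    Takes the page HTML and URL and returns the table-of-contents list.

    `get_article(html: string, url: string): {'title': string, 'content': string}`

    Takes the page HTML and URL and returns a dictionary whose `title` key is the title and whose `content` key is the body.

    ⚠ This option overrides `link`, `title`, and `content`, but does not override `list` ⚠

+   `sizeLimit: String` (optional)

    EPUB size limit, in the format number plus letter unit; defaults to `100m`.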
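
To make the `external` option more concrete, here is a minimal sketch of what such a script might look like. It only follows the two signatures listed above. The file name `external.py`, the helper classes, the choice to return absolute URLs from `get_toc`, and the choice to take the title from the first `<h1>` are all assumptions made for illustration, not behavior confirmed by the project.

```python
# external.py (hypothetical file name, referenced from the "external" key)
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


class _FirstH1Collector(HTMLParser):
    """Collects the text inside the first <h1> element."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self._done:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "h1" and self._inside:
            self._inside = False
            self._done = True

    def handle_data(self, data):
        if self._inside:
            self.parts.append(data)


def get_toc(html, url):
    """Return the de-duplicated list of links found on the contents page."""
    collector = _LinkCollector()
    collector.feed(html)
    toc = []
    for href in collector.hrefs:
        # Assumption: the crawler wants absolute URLs, so resolve relative ones.
        absolute = urljoin(url, href)
        if absolute not in toc:
            toc.append(absolute)
    return toc


def get_article(html, url):
    """Return the title and body of one article page as a dictionary."""
    collector = _FirstH1Collector()
    collector.feed(html)
    # Fall back to the URL when the page has no <h1>.
    title = "".join(collector.parts).strip() or url
    # Assumption: the whole page is passed through as the content.
    return {"title": title, "content": html}
```

A real script would usually narrow `get_toc` to the links inside the contents container and cut `content` down to the article body.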

An example configuration for crawling our PyTorch 1.4 documentation:

```json
{
    "name": "PyTorch 1.4 中文文档 & 教程",
    "url": "https://gitee.com/apachecn/pytorch-doc-zh/blob/master/docs/1.4/SUMMARY.md",
    "link": ".markdown-body li a",
    "remove": "a.anchor",
    "headers": {"Referer": "https://gitee.com/"}
}
```
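
A configuration can also be generated from a script, for example when the page list is computed rather than read from a contents page. The sketch below enforces the two rules stated in the property list (set either `url` or `list`; `link` is required when `url` is set) and writes the result as JSON. The function names `build_config` and `write_config` and the example URLs are hypothetical; they are not part of EpubCrawler.

```python
import json


def build_config(name, pages=None, toc_url=None, link=None, **options):
    """Build a config dict that follows the documented either/or rules."""
    if bool(pages) == bool(toc_url):
        raise ValueError("set exactly one of `list` (pages) or `url` (toc_url)")
    config = {"name": name}
    if pages:
        config["list"] = list(pages)
    else:
        if not link:
            raise ValueError("`link` is required when `url` is set")
        config["url"] = toc_url
        config["link"] = link
    # Remaining keyword arguments are copied as-is (wait, timeout, headers, ...).
    config.update(options)
    return config


def write_config(config, path="config.json"):
    """Write the config as UTF-8 JSON, keeping non-ASCII titles readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, ensure_ascii=False, indent=4)


example = build_config(
    "My Book",
    pages=["https://example.com/ch1.html", "https://example.com/ch2.html"],
    wait=0.5,
    timeout=30,
)
```

Calling `write_config(example)` produces a `config.json` in the current working directory, which is the file `crawl-epub` reads when no argument is given.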

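## License

This project is released under the SATA license.

You are obliged to star this open-source project and to consider giving the author an appropriate additional reward.

## Sponsor us

![](https://home.apachecn.org/img/about/donate.jpg)

## See also

+   [ApacheCN Learning Resources](https://docs.apachecn.org/)
+   [Computer E-books](http://it-ebooks.flygon.net)
+   [布客新知](http://flygon.net/ixinzhi/)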

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/apachecn/epub-crawler",
    "name": "EpubCrawler",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "ebook,epub,crawler,\u722c\u866b,\u7535\u5b50\u4e66",
    "author": "wizardforcel",
    "author_email": "wizard.z@qq.com",
    "download_url": "https://files.pythonhosted.org/packages/19/2c/3dff1b707a29b36d3e6c12e20ae3580d2a90ef13e80a48a5a15cf8288f05/EpubCrawler-2023.7.9.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "EpubCrawler\uff0c\u7528\u4e8e\u6293\u53d6\u7f51\u9875\u5185\u5bb9\u5e76\u5236\u4f5c EPUB \u7684\u5c0f\u5de5\u5177",
    "version": "2023.7.9.2",
    "project_urls": {
        "Homepage": "https://github.com/apachecn/epub-crawler"
    },
    "split_keywords": [
        "ebook",
        "epub",
        "crawler",
        "\u722c\u866b",
        "\u7535\u5b50\u4e66"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9f69644c775278f637ca2aaeeec2179617eda4bc3bcccc1e27f4042501edcb68",
                "md5": "8eef2d98ea639a0900a1a264fb20d015",
                "sha256": "e956eb057387816d6c72767d57b2b9923d4723d100bebd029de82ec92fd55772"
            },
            "downloads": -1,
            "filename": "EpubCrawler-2023.7.9.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8eef2d98ea639a0900a1a264fb20d015",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 14189,
            "upload_time": "2023-07-09T17:39:01",
            "upload_time_iso_8601": "2023-07-09T17:39:01.711400Z",
            "url": "https://files.pythonhosted.org/packages/9f/69/644c775278f637ca2aaeeec2179617eda4bc3bcccc1e27f4042501edcb68/EpubCrawler-2023.7.9.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "192c3dff1b707a29b36d3e6c12e20ae3580d2a90ef13e80a48a5a15cf8288f05",
                "md5": "fcd11afb69fa26893e53e226316640f6",
                "sha256": "ffdc2c66ced9d371c2625d104eaa579184175421e108af9e4a0aef2178507e19"
            },
            "downloads": -1,
            "filename": "EpubCrawler-2023.7.9.2.tar.gz",
            "has_sig": false,
            "md5_digest": "fcd11afb69fa26893e53e226316640f6",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 10986,
            "upload_time": "2023-07-09T17:39:03",
            "upload_time_iso_8601": "2023-07-09T17:39:03.315023Z",
            "url": "https://files.pythonhosted.org/packages/19/2c/3dff1b707a29b36d3e6c12e20ae3580d2a90ef13e80a48a5a15cf8288f05/EpubCrawler-2023.7.9.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-07-09 17:39:03",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "apachecn",
    "github_project": "epub-crawler",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "epubcrawler"
}
        