| Field | Value |
|---|---|
| Name | article-crawler |
| Version | 0.0.4 |
| Summary | A package for crawling Markdown-formatted articles from certain webpages and storing them locally. |
| Author | ltyzzz (Tycho) |
| Keywords | python, markdown, pdf, article, crawler |
| Upload time | 2023-08-12 07:36:24 |
| Requirements | None recorded |
# Article Crawler
[![PyPI Latest Release](https://img.shields.io/pypi/v/article-crawler.svg)](https://pypi.org/project/article-crawler/)
[![PyPI Downloads](https://img.shields.io/pypi/dm/article-crawler?label=PyPI%20downloads)](https://pypi.org/project/article-crawler/)
[![](https://img.shields.io/github/v/release/ltyzzzxxx/article_crawler?display_name=tag)](https://github.com/ltyzzzxxx/article_crawler/releases/tag/v0.0.1)
[![](https://img.shields.io/github/stars/ltyzzzxxx/article_crawler)](https://github.com/ltyzzzxxx/article_crawler)
[![](https://img.shields.io/github/forks/ltyzzzxxx/article_crawler)](https://github.com/ltyzzzxxx/article_crawler)
[![](https://img.shields.io/github/issues/ltyzzzxxx/article_crawler)](https://github.com/ltyzzzxxx/article_crawler/issues)
[![](https://img.shields.io/badge/license-MIT%20-yellow.svg)](https://github.com/ltyzzzxxx/article_crawler/issues)
[English Doc](./README_EN.md) | [中文文档](./README_CN.md)
## ✨ Introduction
Article Crawler is a package for crawling Markdown-formatted articles from a given webpage and storing them locally in HTML or Markdown format.
## 🚀 Quick Start
1. Install through `pip`
```shell
pip install article-crawler
```
2. Usage
Usage: `python3 -m article_crawler -u [url] -t [type] -o [output_folder] -c [class_] -i [id]`
```
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-u URL, --url=URL crawled url (required)
-t TYPE, --type=TYPE crawled article type [csdn] | [juejin] | [zhihu] | [jianshu]
-o OUTPUT_FOLDER, --output_folder=OUTPUT_FOLDER
output html / markdown / pdf folder (required)
-w WEBSITE_TAG, --website_tag=WEBSITE_TAG
position of the article content in HTML (not required if 'type' is specified)
-c CLASS_, --class=CLASS_
position of the article content in HTML (not required if 'type' is specified)
-i ID, --id=ID position of the article content in HTML (not required if 'type' is specified)
```
- `type`: the source website; currently supported values are `csdn`, `zhihu`, `juejin`, and `jianshu`.
- website_tag / class_ / id:
e.g. `<div id="article_content" class="article_content clearfix"></div>`
  - In this element, `website_tag`, `class_`, and `id` are `div`, `article_content clearfix`, and `article_content`, respectively.
> 1. You don't need to specify `type` when you specify `website_tag / class_ / id`.
> 2. You need to use the web console to locate the position of the article.
> 3. `website_tag / class_ / id` locate the position of the article in the HTML. You may specify only one or two of them rather than all three.
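To make the selection logic above concrete, here is a minimal stdlib-only sketch (not the package's actual implementation) of how a `website_tag` / `class_` / `id` triple can locate the article container in an HTML page, matching on whichever of the three are given. The `ArticleLocator` class and its names are hypothetical, for illustration only:

```python
from html.parser import HTMLParser

class ArticleLocator(HTMLParser):
    """Hypothetical illustration: find the first element matching
    any subset of (website_tag, class_, id), as described above."""

    def __init__(self, tag=None, class_=None, id_=None):
        super().__init__()
        self.tag, self.class_, self.id_ = tag, class_, id_
        self.matched = None  # (tag name, attribute dict) of first match

    def handle_starttag(self, tag, attrs):
        if self.matched is not None:
            return  # keep only the first match
        attrs = dict(attrs)
        # Each criterion is checked only if it was specified.
        if self.tag and tag != self.tag:
            return
        if self.class_ and attrs.get("class") != self.class_:
            return
        if self.id_ and attrs.get("id") != self.id_:
            return
        self.matched = (tag, attrs)

html = '<body><div id="article_content" class="article_content clearfix">text</div></body>'
locator = ArticleLocator(tag="div", id_="article_content")
locator.feed(html)
print(locator.matched[0])  # the matching element's tag name
```

As in the note above, specifying only `tag` and `id_` is enough here; the `class_` check is simply skipped when it is not given.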
## Open Source License
MIT License. See https://opensource.org/license/mit/ for details.