| Field | Value |
|---|---|
| Name | article-crawler |
| Version | 0.0.4 |
| Summary | A package for crawling Markdown-formatted articles from certain webpages and storing them locally. |
| Author | ltyzzz (Tycho) |
| Keywords | python, markdown, pdf, article, crawler |
| Upload time | 2023-08-12 07:36:24 |
| Requirements | None recorded |
# Article Crawler
[![PyPI Latest Release](https://img.shields.io/pypi/v/article-crawler.svg)](https://pypi.org/project/article-crawler/)
[![PyPI Downloads](https://img.shields.io/pypi/dm/article-crawler?label=PyPI%20downloads)](https://pypi.org/project/article-crawler/)
[![](https://img.shields.io/github/v/release/ltyzzzxxx/article_crawler?display_name=tag)](https://github.com/ltyzzzxxx/article_crawler/releases/tag/v0.0.1)
[![](https://img.shields.io/github/stars/ltyzzzxxx/article_crawler)](https://github.com/ltyzzzxxx/article_crawler)
[![](https://img.shields.io/github/forks/ltyzzzxxx/article_crawler)](https://github.com/ltyzzzxxx/article_crawler)
[![](https://img.shields.io/github/issues/ltyzzzxxx/article_crawler)](https://github.com/ltyzzzxxx/article_crawler/issues)
[![](https://img.shields.io/badge/license-MIT%20-yellow.svg)](https://github.com/ltyzzzxxx/article_crawler/issues)
[English Doc](./README_EN.md) | [中文文档](./README_CN.md)
## ✨ Introduction
Article Crawler is a package for crawling Markdown-formatted articles from a given webpage and storing them locally in HTML or Markdown format.
## 🚀 Quick Start
1. Install through `pip`
```shell
pip install article-crawler
```
2. Usage
Usage: `python3 -m article_crawler -u [url] -t [type] -o [output_folder] -c [class_] -i [id]`
```
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-u URL, --url=URL crawled url (required)
-t TYPE, --type=TYPE crawled article type [csdn] | [juejin] | [zhihu] | [jianshu]
-o OUTPUT_FOLDER, --output_folder=OUTPUT_FOLDER
output html / markdown / pdf folder (required)
-w WEBSITE_TAG, --website_tag=WEBSITE_TAG
position of the article content in HTML (not required if 'type' is specified)
-c CLASS_, --class=CLASS_
position of the article content in HTML (not required if 'type' is specified)
-i ID, --id=ID position of the article content in HTML (not required if 'type' is specified)
```
- `type`: the source website; currently supported values are `csdn`, `zhihu`, `juejin`, and `jianshu`.
- website_tag / class_ / id:
e.g. `<div id="article_content" class="article_content clearfix"></div>`
  - In this element, `website_tag`, `class_`, and `id` are `div`, `article_content clearfix`, and `article_content`, respectively.
> 1. You don't need to specify `type` when you specify `website_tag / class_ / id`.
> 2. You need to use the web console to locate the position of the article.
> 3. `website_tag / class_ / id` locate the position of the article in the HTML. You may specify only one or two of them rather than all three.
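To make the selection logic above concrete, here is a minimal stdlib-only sketch (not the package's actual implementation) of how a `website_tag` / `class_` / `id` triple can locate the article container in an HTML page, matching on whichever of the three are given. The `ArticleLocator` class and its names are hypothetical, for illustration only:

```python
from html.parser import HTMLParser

class ArticleLocator(HTMLParser):
    """Hypothetical illustration: find the first element matching
    any subset of (website_tag, class_, id), as described above."""

    def __init__(self, tag=None, class_=None, id_=None):
        super().__init__()
        self.tag, self.class_, self.id_ = tag, class_, id_
        self.matched = None  # (tag name, attribute dict) of first match

    def handle_starttag(self, tag, attrs):
        if self.matched is not None:
            return  # keep only the first match
        attrs = dict(attrs)
        # Each criterion is checked only if it was specified.
        if self.tag and tag != self.tag:
            return
        if self.class_ and attrs.get("class") != self.class_:
            return
        if self.id_ and attrs.get("id") != self.id_:
            return
        self.matched = (tag, attrs)

html = '<body><div id="article_content" class="article_content clearfix">text</div></body>'
locator = ArticleLocator(tag="div", id_="article_content")
locator.feed(html)
print(locator.matched[0])  # the matching element's tag name
```

As in the note above, specifying only `tag` and `id_` is enough here; the `class_` check is simply skipped when it is not given.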
## Open Source License
MIT License. See https://opensource.org/license/mit/ for details.