article-parser


| Field | Value |
| --- | --- |
| Name | article-parser |
| Version | 1.8.0 |
| Home page | https://github.com/myifeng/article-parser |
| Summary | A parser that parses articles from any url or html |
| Upload time | 2024-06-04 02:50:48 |
| Maintainer | myifeng |
| Docs URL | None |
| Author | myifeng |
| Requires Python | >=3.8 |
| License | MIT |
| Keywords | article, news, html, parser, extract, extractor, body |
| VCS | |
| Bug tracker URL | |
| Requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| Coveralls test coverage | No coveralls. |

# article-parser

![GitHub Repo stars](https://img.shields.io/github/stars/myifeng/article-parser)
[![python](https://img.shields.io/pypi/pyversions/article-parser)](https://pypi.org/project/article-parser/)
[![pypi](https://img.shields.io/pypi/v/article-parser)](https://pypi.org/project/article-parser/)
[![wheel](https://img.shields.io/pypi/wheel/article-parser)](https://pypi.org/project/article-parser/)
[![license](https://img.shields.io/github/license/myifeng/article-parser)](https://pypi.org/project/article-parser/)
![PyPI - Downloads](https://img.shields.io/pypi/dm/article-parser)


**Extract an article or news story from a URL or raw HTML, and parse its title and content.**

*[English](https://github.com/myifeng/article-parser/blob/master/README.md)  ∙ [简体中文](https://github.com/myifeng/article-parser/blob/master/README.zh-CN.md)*

## How to install

[`article-parser`](https://pypi.org/project/article-parser/) is available on PyPI:

```
$ pip install article-parser
```
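
Note that the package is installed as `article-parser` but imported as `article_parser`. To confirm the module is importable after installation, a minimal sketch using only the standard library could look like this (`is_installed` is a hypothetical helper, not part of `article-parser`):

```python
import importlib.util


def is_installed(module_name: str = "article_parser") -> bool:
    """Return True if a top-level module with this name can be located."""
    # find_spec returns None when no such top-level module exists.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print(is_installed())
```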

## Basic Usage

```python
>>> import article_parser

## Signature overview
# article_parser.parse(
#   url='',               # The URL of the article.
#   html='',              # The HTML of the article.
#   threshold=0.9,        # The ratio of text to the entire document, default 0.9.
#   output='html',        # Result output format: ``markdown`` or ``html``, default ``html``.
#   **kwargs              # Optional arguments that `request` takes, e.g. ``timeout``.
# )

## output markdown
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", output='markdown', timeout=5)

## output html
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", timeout=5)

```
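
The `threshold` comment above describes a ratio of text to the entire document. This README does not say how the library computes that ratio, so the following is only one plausible reading, sketched with the standard library: the share of a page's visible text that sits inside a candidate fragment. The names `visible_text` and `text_share` are hypothetical and are not part of `article-parser`.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(markup: str) -> str:
    """Return the visible text of an HTML string with whitespace normalized."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(" ".join(collector.chunks).split())


def text_share(fragment: str, document: str) -> float:
    """Fraction of the document's visible text found in the fragment (assumed metric)."""
    total = len(visible_text(document))
    return len(visible_text(fragment)) / total if total else 0.0


if __name__ == "__main__":
    body = "<div id='a'><p>" + "word " * 50 + "</p></div>"
    page = "<html><body><nav>Home</nav>" + body + "</body></html>"
    # Under this reading, a fragment would count as the article
    # when its share reaches the threshold (0.9 by default).
    print(text_share(body, page) >= 0.9)
```

Under this reading, a higher `threshold` demands that the chosen fragment hold nearly all of the page's text, while a lower value tolerates more surrounding navigation or boilerplate.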

## Example
[Djokovic wins record 36th Masters title in Rome - Chinadaily.com.cn](http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html)


* Markdown

```python
>>> import article_parser
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", output='markdown', timeout=5)
>>> print(title)
>>> print('----------------')
>>> print(content)

Djokovic wins record 36th Masters title in Rome
----------------
![](http://img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg)
Serbia's Novak Djokovic kisses the trophy after winning the final against
Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept
21, 2020. [Photo/Agencies]

ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego
Schwartzman in the men's final of the ATP Italian Open on Monday.

Djokovic, the world number one and the top seed at the tournament, won 7-5,
6-3 against Argentine Schwartzman to lift his 36th Masters title, one more
than Rafael Nadal.

The Serb said he did not play his best tennis this time in Rome, but could
find it when needed.

Simona Halep, top seed of the women's draw, won her first title in Rome after
defending champion Karolina Pliskova of the Czech Republic retired while
trailing 6-0, 2-1 in the final.
```


* HTML
```python
>>> import article_parser
>>> title, content = article_parser.parse(url="http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html", timeout=5)
>>> print(title)
>>> print('----------------')
>>> print(content)

Djokovic wins record 36th Masters title in Rome
----------------
<div id="Content">

<figure class="image" style="display: table;">
<img data-from="newsroom" id="img-5f6962b2a31024adbd959228" src="//img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg"/>
<figcaption style="font-size: 14px; display: table-caption; caption-side: bottom;">
   Serbia's Novak Djokovic kisses the trophy after winning the final against Argentina's Diego Schwartzman at Italian Open, Foro Italico, Rome, Italy, Sept 21, 2020. [Photo/Agencies]
 </figcaption>
</figure>
<p dir="ltr">ROME - Novak Djokovic won a record 36th Masters crown as he beat Diego Schwartzman in the men's final of the ATP Italian Open on Monday.</p>
<p dir="ltr">Djokovic, the world number one and the top seed at the tournament, won 7-5, 6-3 against Argentine Schwartzman to lift his 36th Masters title, one more than Rafael Nadal.</p>
<p dir="ltr">The Serb said he did not play his best tennis this time in Rome, but could find it when needed.</p>
<p dir="ltr">Simona Halep, top seed of the women's draw, won her first title in Rome after defending champion Karolina Pliskova of the Czech Republic retired while trailing 6-0, 2-1 in the final.</p>
</div>
```
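
In the HTML output above, the image `src` is protocol-relative (`//img2.chinadaily.com.cn/...`), whereas the Markdown output carries a full `http://` URL. If the HTML needs to be used outside the original page, the URLs can be resolved against the article URL. A minimal sketch, assuming double-quoted `src`/`href` attributes (`absolutize_urls` is a hypothetical helper, not part of `article-parser`):

```python
import re
from urllib.parse import urljoin

# Matches double-quoted src/href attributes only; a full HTML parser
# would be more robust for arbitrary markup.
_URL_ATTR = re.compile(r'(\s(?:src|href)=")([^"]+)(")')


def absolutize_urls(html: str, base_url: str) -> str:
    """Resolve relative and protocol-relative URLs against base_url."""

    def _fix(match):
        return match.group(1) + urljoin(base_url, match.group(2)) + match.group(3)

    return _URL_ATTR.sub(_fix, html)


if __name__ == "__main__":
    page_url = "http://www.chinadaily.com.cn/a/202009/22/WS5f6962b2a31024ad0ba7afcb.html"
    snippet = '<img src="//img2.chinadaily.com.cn/images/202009/22/5f6962b2a31024adbd959228.jpeg"/>'
    print(absolutize_urls(snippet, page_url))
```

`urljoin` takes the scheme from the base URL, so the result follows whichever scheme the article itself was fetched with.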
## Contributors

[![All contributions](https://contrib.rocks/image?repo=myifeng/article-parser)](https://github.com/myifeng/article-parser/graphs/contributors)

## Stargazers over time
[![Stargazers over time](https://starchart.cc/myifeng/article-parser.svg?variant=adaptive)](https://github.com/myifeng/article-parser)


            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/myifeng/article-parser",
    "name": "article-parser",
    "maintainer": "myifeng",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "myifengs@gmail.com",
    "keywords": "article news html parser Extract extractor body",
    "author": "myifeng",
    "author_email": "myifengs@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/c6/a7/213ebca6eb776362e377efc7ddb8d3c588844ba7881d6dc3f407d55b4264/article_parser-1.8.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A parser that parses articles from any url or html",
    "version": "1.8.0",
    "project_urls": {
        "Homepage": "https://github.com/myifeng/article-parser"
    },
    "split_keywords": [
        "article",
        "news",
        "html",
        "parser",
        "extract",
        "extractor",
        "body"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "23228def5b461d8c42d1e9428c9ff3b07e3ccf491e0391b42093a5cf84967648",
                "md5": "e60d07cd45dd2b78bd646350657acd7f",
                "sha256": "b848f87968b5b71b0b7d182da2abf0b2b6a95f53f76906593c3bae4aee7f3d48"
            },
            "downloads": -1,
            "filename": "article_parser-1.8.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e60d07cd45dd2b78bd646350657acd7f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 5636,
            "upload_time": "2024-06-04T02:50:47",
            "upload_time_iso_8601": "2024-06-04T02:50:47.112053Z",
            "url": "https://files.pythonhosted.org/packages/23/22/8def5b461d8c42d1e9428c9ff3b07e3ccf491e0391b42093a5cf84967648/article_parser-1.8.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c6a7213ebca6eb776362e377efc7ddb8d3c588844ba7881d6dc3f407d55b4264",
                "md5": "16572af3198521f63b76184111a01330",
                "sha256": "bf56405a6b3c0aad1dcdada74874dbb6932ce177e813aa4b4e8198634d663c52"
            },
            "downloads": -1,
            "filename": "article_parser-1.8.0.tar.gz",
            "has_sig": false,
            "md5_digest": "16572af3198521f63b76184111a01330",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 5541,
            "upload_time": "2024-06-04T02:50:48",
            "upload_time_iso_8601": "2024-06-04T02:50:48.701631Z",
            "url": "https://files.pythonhosted.org/packages/c6/a7/213ebca6eb776362e377efc7ddb8d3c588844ba7881d6dc3f407d55b4264/article_parser-1.8.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-04 02:50:48",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "myifeng",
    "github_project": "article-parser",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "article-parser"
}
        