# newspaperV3
A library for news extraction, article parsing, and content analysis. It is a fork of the original `newspaper` library by Lucas Ou-Yang.
## Installation
Install the package using pip:
```bash
pip install newspaperV3
```
## Basic Usage
The following example downloads an article, parses it, and runs the NLP step to produce a summary and keywords:
```python
from newspaperV3 import Article
import nltk
# NLTK tokenizer data must be downloaded once before the first run;
# uncomment the next line to fetch it
# nltk.download('punkt')
url = 'https://www.cnn.com/2023/11/15/politics/us-china-meeting-biden-xi/index.html'
# Create an Article object
article = Article(url)
# Download and parse the article
article.download()
article.parse()
# Perform Natural Language Processing (NLP)
article.nlp()
# Print the results
print("Title:", article.title)
print("Authors:", article.authors)
print("Publish Date:", article.publish_date)
print("Top Image:", article.top_image)
print("\nSummary:")
print(article.summary)
print("\nKeywords:", article.keywords)
```
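The fields printed above can also be gathered into a plain dictionary, which is easier to store or serialize than separate `print` calls. The sketch below is illustrative only: `article_to_record` is a hypothetical helper (not part of newspaperV3), and it runs against a stand-in object with made-up values so that it needs no network access. It assumes `publish_date` is either a `datetime` or `None`.

```python
import json
from datetime import datetime
from types import SimpleNamespace


def article_to_record(article):
    """Collect the fields printed in the example above into a plain dict.

    Hypothetical helper: it only reads attributes by name, so any object
    exposing them (such as a parsed Article) can be passed in.
    """
    publish_date = getattr(article, "publish_date", None)
    return {
        "title": getattr(article, "title", "") or "",
        "authors": list(getattr(article, "authors", None) or []),
        # datetime objects are not JSON-serializable, so store ISO 8601 text or None
        "publish_date": publish_date.isoformat() if publish_date else None,
        "top_image": getattr(article, "top_image", "") or "",
        "summary": getattr(article, "summary", "") or "",
        "keywords": list(getattr(article, "keywords", None) or []),
    }


# Stand-in with made-up values; in real use, pass the Article object after
# download(), parse(), and nlp() have been called.
fake_article = SimpleNamespace(
    title="Example headline",
    authors=["A. Writer"],
    publish_date=datetime(2023, 11, 15),
    top_image="https://example.com/image.jpg",
    summary="One-sentence summary.",
    keywords=["example", "news"],
)

record = article_to_record(fake_article)
print(json.dumps(record, indent=2))
```

Missing or empty attributes fall back to empty values, so the helper does not raise if a page yields no authors or no publication date.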
## Features
* **Article Extraction** : Automatically extract clean article text from web pages
* **Metadata Parsing** : Extract titles, authors, publication dates, and images
* **Natural Language Processing** : Generate summaries and extract keywords
* **Multi-language Support** : Process articles in various languages
* **Image Processing** : Extract and analyze article images
* **Content Analysis** : Advanced text processing and analysis capabilities
## Requirements
* Python 3.8+ (the published package metadata declares `requires_python` as `<4.0,>=3.8`)
* NLTK (for natural language processing)
* Additional dependencies are installed automatically by pip
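
The package metadata declares `requires_python` as `<4.0,>=3.8`. As a minimal sketch, a script can check the running interpreter against that range before importing the library. `python_supported` is a hypothetical helper written for this example, not something the package provides.

```python
import sys


def python_supported(version_info=None):
    """Return True if the (major, minor) version satisfies >=3.8,<4.0."""
    major, minor = (version_info or sys.version_info)[:2]
    return (3, 8) <= (major, minor) < (4, 0)


if not python_supported():
    print("Warning: this interpreter is outside the declared >=3.8,<4.0 range.")
```

pip enforces the same constraint at install time, so a check like this is mainly useful in scripts that may be run in an environment other than the one where the package was installed.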
## License
This project is licensed under the MIT License.
Raw data
{
"_id": null,
"home_page": "https://github.com/salah55s/newspaperV3",
"name": "newspaperV3",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.8",
"maintainer_email": null,
"keywords": "newspaper, news, article, extraction, scraping, nlp, content, parsing",
"author": "Lucas Ou-Yang",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/df/96/4295c7b2d9e3ccadbc452e4b0c05d3c93e7bd67e14ef72950f3cb4da5637/newspaperv3-0.3.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Advanced news extraction, article parsing, and content analysis.",
"version": "0.3.0",
"project_urls": {
"Homepage": "https://github.com/salah55s/newspaperV3",
"Repository": "https://github.com/salah55s/newspaperV3"
},
"split_keywords": [
"newspaper",
" news",
" article",
" extraction",
" scraping",
" nlp",
" content",
" parsing"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "62ced1292dc7774bb54694e0ab4848c18eaae6ad3f40cb93401337730f7ed502",
"md5": "1046ba9e1cc5ad4423baa50066d70281",
"sha256": "714d650e768c8bed1354d7e9949ff1f0703ddf723d89a64f3b4d865ac684a033"
},
"downloads": -1,
"filename": "newspaperv3-0.3.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1046ba9e1cc5ad4423baa50066d70281",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.8",
"size": 210928,
"upload_time": "2025-07-30T10:22:10",
"upload_time_iso_8601": "2025-07-30T10:22:10.982349Z",
"url": "https://files.pythonhosted.org/packages/62/ce/d1292dc7774bb54694e0ab4848c18eaae6ad3f40cb93401337730f7ed502/newspaperv3-0.3.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "df964295c7b2d9e3ccadbc452e4b0c05d3c93e7bd67e14ef72950f3cb4da5637",
"md5": "ed06273e0574cac9373a5c7a0f5e90b4",
"sha256": "9bfbb93d5054f52c21e918cd35b6954475d4f441b61c1aba4208011a6faaf22f"
},
"downloads": -1,
"filename": "newspaperv3-0.3.0.tar.gz",
"has_sig": false,
"md5_digest": "ed06273e0574cac9373a5c7a0f5e90b4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.8",
"size": 198486,
"upload_time": "2025-07-30T10:22:13",
"upload_time_iso_8601": "2025-07-30T10:22:13.890352Z",
"url": "https://files.pythonhosted.org/packages/df/96/4295c7b2d9e3ccadbc452e4b0c05d3c93e7bd67e14ef72950f3cb4da5637/newspaperv3-0.3.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-30 10:22:13",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "salah55s",
"github_project": "newspaperV3",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "requests",
"specs": [
[
">=",
"2.31.0"
]
]
},
{
"name": "python-dotenv",
"specs": [
[
">=",
"1.0.0"
]
]
},
{
"name": "responses",
"specs": []
}
],
"tox": true,
"lcname": "newspaperv3"
}