# Journalist
[PyPI version](https://badge.fury.io/py/journ4list)
[PyPI project](https://pypi.org/project/journ4list/)
[License: MIT](https://opensource.org/licenses/MIT)
[Build status](https://github.com/username/journalist/actions)
[Coverage](https://codecov.io/gh/username/journalist)
A powerful async news content extraction library with a modern API for web scraping and article analysis.
## Features
- 🚀 **Modern Async API** - Built with asyncio for high-performance concurrent scraping
- 📰 **Universal News Support** - Works with news websites and content from any language or region
- 🎯 **Smart Content Extraction** - Multiple extraction methods (readability, CSS selectors, JSON-LD)
- 🔄 **Flexible Persistence** - Memory-only or filesystem persistence modes
- 🛡️ **Error Handling** - Robust error handling with custom exception types
- 📊 **Session Management** - Built-in session management with race condition protection
- 🧪 **Well Tested** - Comprehensive unit tests with high coverage
## Installation
### Option 1: Using pip (Recommended)
```bash
pip install journ4list
```
### Option 2: Using Poetry
```bash
poetry add journ4list
```
### Option 3: Development Installation
#### Using Poetry (Recommended for Development)
```bash
# Clone the repository
git clone https://github.com/username/journalist.git
cd journalist
# Install with Poetry
poetry install
# Activate virtual environment
poetry shell
```
#### Using pip-tools (Alternative)
```bash
# Clone the repository
git clone https://github.com/username/journalist.git
cd journalist
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install pip-tools
pip install pip-tools
# Compile and install dependencies
pip-compile requirements.in --output-file requirements.txt
pip install -r requirements.txt
```
## Quick Start
### Basic Usage
```python
import asyncio
from journalist import Journalist

async def main():
    # Create journalist instance
    journalist = Journalist(persist=True, scrape_depth=1)

    # Extract content from news sites
    result = await journalist.read(
        urls=[
            "https://www.bbc.com/news",
            "https://www.reuters.com/"
        ],
        keywords=["teknologi", "spor", "ekonomi"]
    )

    # Access extracted articles
    for article in result['articles']:
        print(f"Title: {article['title']}")
        print(f"URL: {article['url']}")
        print(f"Content: {article['content'][:200]}...")
        print("-" * 50)

    # Check extraction summary
    summary = result['extraction_summary']
    print(f"Processed {summary['urls_processed']} URLs")
    print(f"Found {summary['articles_extracted']} articles")
    print(f"Extraction took {summary['extraction_time_seconds']} seconds")

# Run the example
asyncio.run(main())
```
### Memory-Only Mode (No File Persistence)
```python
import asyncio
from journalist import Journalist

async def main():
    # Use memory-only mode for temporary scraping
    journalist = Journalist(persist=False)

    result = await journalist.read(
        urls=["https://www.cnn.com/"],
        keywords=["news", "breaking"]
    )

    # Articles are stored in memory only
    print(f"Found {len(result['articles'])} articles")
    print(f"Session ID: {result['session_id']}")

asyncio.run(main())
```
### Concurrent Scraping
```python
import asyncio
from journalist import Journalist

async def scrape_multiple_sources():
    """Example of concurrent scraping with multiple journalist instances."""

    # Create tasks for different news sources
    async def scrape_sports():
        journalist = Journalist(persist=True, scrape_depth=2)
        return await journalist.read(
            urls=["https://www.espn.com/", "https://www.skysports.com/"],
            keywords=["futbol", "basketbol"]
        )

    async def scrape_tech():
        journalist = Journalist(persist=True, scrape_depth=1)
        return await journalist.read(
            urls=["https://www.techcrunch.com/", "https://www.wired.com/"],
            keywords=["teknologi", "yazılım"]
        )

    # Run concurrently
    sports_task = asyncio.create_task(scrape_sports())
    tech_task = asyncio.create_task(scrape_tech())

    sports_result, tech_result = await asyncio.gather(sports_task, tech_task)

    print(f"Sports articles: {len(sports_result['articles'])}")
    print(f"Tech articles: {len(tech_result['articles'])}")

asyncio.run(scrape_multiple_sources())
```
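By default, `asyncio.gather` propagates the first exception it sees, so one failing source can cost you the other result. A minimal sketch of a more forgiving variant using the standard `return_exceptions=True` flag; the `Journalist` calls mirror the example above, and the helper function is illustrative:

```python
import asyncio
from journalist import Journalist

async def scrape_one(urls, keywords):
    journalist = Journalist(persist=False)
    return await journalist.read(urls=urls, keywords=keywords)

async def scrape_all():
    tasks = [
        scrape_one(["https://www.espn.com/"], ["futbol"]),
        scrape_one(["https://www.techcrunch.com/"], ["teknologi"]),
    ]
    # return_exceptions=True keeps one failed source from cancelling the rest
    results = await asyncio.gather(*tasks, return_exceptions=True)

    for result in results:
        if isinstance(result, Exception):
            print(f"Source failed: {result}")
        else:
            print(f"Extracted {len(result['articles'])} articles")

asyncio.run(scrape_all())
```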
## Configuration
### Journalist Parameters
- **`persist`** (bool, default: `True`) - Enable filesystem persistence for session data
- **`scrape_depth`** (int, default: `1`) - Depth level for link discovery and scraping
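The two parameters above can be combined freely; a minimal sketch of the two ends of that spectrum, based on the constructor shown in the Quick Start (the non-default values are illustrative):

```python
from journalist import Journalist

# Defaults: filesystem persistence, single-level scraping
default_journalist = Journalist()  # persist=True, scrape_depth=1

# Throwaway run: keep everything in memory, use deeper link discovery
deep_in_memory = Journalist(persist=False, scrape_depth=2)
```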
### Environment Configuration
The library uses sensible defaults but can be configured via the `JournalistConfig` class:
```python
from journalist.config import JournalistConfig
# Get current workspace path
workspace = JournalistConfig.get_base_workspace_path()
print(f"Workspace: {workspace}") # Output: .journalist_workspace
```
## Error Handling
The library provides custom exception types for better error handling:
```python
import asyncio
from journalist import Journalist
from journalist.exceptions import NetworkError, ExtractionError, ValidationError

async def robust_scraping():
    try:
        journalist = Journalist()
        result = await journalist.read(
            urls=["https://example-news-site.com/"],
            keywords=["important", "news"]
        )
        return result

    except NetworkError as e:
        print(f"Network error: {e}")
        if hasattr(e, 'status_code'):
            print(f"HTTP Status: {e.status_code}")

    except ExtractionError as e:
        print(f"Content extraction failed: {e}")
        if hasattr(e, 'url'):
            print(f"Failed URL: {e.url}")

    except ValidationError as e:
        print(f"Input validation error: {e}")

    except Exception as e:
        print(f"Unexpected error: {e}")

asyncio.run(robust_scraping())
```
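For transient failures, `NetworkError` can also drive a simple retry loop. A minimal sketch assuming only the exceptions shown above; the `read_with_retries` helper and back-off values are illustrative, not part of the library:

```python
import asyncio
from journalist import Journalist
from journalist.exceptions import NetworkError

async def read_with_retries(urls, keywords, attempts=3):
    journalist = Journalist(persist=False)
    for attempt in range(1, attempts + 1):
        try:
            return await journalist.read(urls=urls, keywords=keywords)
        except NetworkError as e:
            print(f"Attempt {attempt} failed: {e}")
            if attempt == attempts:
                raise  # out of retries, surface the error
            # simple exponential back-off before trying again
            await asyncio.sleep(2 ** attempt)

result = asyncio.run(read_with_retries(["https://www.bbc.com/news"], ["news"]))
print(f"Found {len(result['articles'])} articles")
```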
## API Reference
### Journalist Class
#### `__init__(persist=True, scrape_depth=1)`
Initialize a new Journalist instance.
**Parameters:**
- `persist` (bool): Enable filesystem persistence
- `scrape_depth` (int): Link discovery depth level
#### `async read(urls, keywords=None)`
Extract content from provided URLs with optional keyword filtering.
**Parameters:**
- `urls` (List[str]): List of website URLs to process
- `keywords` (Optional[List[str]]): Keywords for relevance filtering
**Returns:**
- `Dict[str, Any]`: Dictionary containing extracted articles and metadata
**Return Structure:**
```python
{
    'articles': [
        {
            'title': str,
            'url': str,
            'content': str,
            'author': str,
            'published_date': str,
            'keywords_found': List[str]
        }
    ],
    'session_id': str,
    'extraction_summary': {
        'session_id': str,
        'urls_requested': int,
        'urls_processed': int,
        'articles_extracted': int,
        'extraction_time_seconds': float,
        'keywords_used': List[str]
    }
}
```
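A short sketch of consuming this structure, for example keeping only articles that matched at least one keyword and writing them to JSON (field names as listed above; the URLs, keywords, and output path are illustrative):

```python
import asyncio
import json
from journalist import Journalist

async def main():
    journalist = Journalist(persist=False)
    result = await journalist.read(
        urls=["https://www.bbc.com/news"],
        keywords=["economy", "technology"]
    )

    # Keep only articles that matched at least one keyword
    matched = [a for a in result['articles'] if a['keywords_found']]

    with open("articles.json", "w", encoding="utf-8") as fh:
        json.dump(matched, fh, ensure_ascii=False, indent=2)

    summary = result['extraction_summary']
    print(f"Saved {len(matched)} of {summary['articles_extracted']} articles")

asyncio.run(main())
```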
## Development
### Running Tests
```bash
# Using Poetry
poetry run pytest
# Using pip
pytest
# With coverage
pytest --cov=journalist --cov-report=html
```
### Code Quality
```bash
# Format code
black src tests
# Sort imports
isort src tests
# Type checking
mypy src
# Linting
pylint src
```
### Development Dependencies
The project supports both Poetry and pip-tools for dependency management:
**Poetry (pyproject.toml):**
```bash
poetry install --with dev
```
**pip-tools (requirements.in):**
```bash
pip-compile requirements.in --output-file requirements.txt
python -m pip install -r requirements.txt
```
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests for new functionality
5. Ensure tests pass (`pytest`)
6. Format code (`black src tests`)
7. Commit changes (`git commit -m 'Add amazing feature'`)
8. Push to branch (`git push origin feature/amazing-feature`)
9. Open a Pull Request
## Changelog
### v0.1.0 (2025-06-17)
- Initial release
- Async API for universal news content extraction
- Support for multiple extraction methods
- Memory and filesystem persistence modes
- Comprehensive error handling
- Session management with race condition protection
- Concurrent scraping support
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Author
**Oktay Burak Ertas**
Email: oktay.burak.ertas@gmail.com
## Acknowledgments
- Built with modern Python async/await patterns
- Optimized for global news websites
- Inspired by newspaper3k and readability libraries
- Uses BeautifulSoup4 and lxml for HTML parsing