journ4list


Name: journ4list
Version: 0.11.0
Summary: A powerful async news content extraction library with modern API for web scraping and article analysis
Author: Oktay Burak Ertas
Maintainer: Oktay Burak Ertas
Upload time: 2025-07-24 21:57:44
Requires Python: >=3.8
License: MIT
Keywords: scraping, news, content, extraction, journalism, articles, web-scraping
# Journalist

[![PyPI version](https://badge.fury.io/py/journ4list.svg)](https://badge.fury.io/py/journ4list)
[![Python Versions](https://img.shields.io/pypi/pyversions/journ4list)](https://pypi.org/project/journ4list/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Tests](https://github.com/username/journalist/workflows/Tests/badge.svg)](https://github.com/username/journalist/actions)
[![Coverage](https://codecov.io/gh/username/journalist/branch/main/graph/badge.svg)](https://codecov.io/gh/username/journalist)

A powerful async news content extraction library with a modern API for web scraping and article analysis.

## Features

🚀 **Modern Async API** - Built with asyncio for high-performance concurrent scraping  
📰 **Universal News Support** - Works with news websites and content from any language or region  
🎯 **Smart Content Extraction** - Multiple extraction methods (readability, CSS selectors, JSON-LD)  
🔄 **Flexible Persistence** - Memory-only or filesystem persistence modes  
🛡️ **Error Handling** - Robust error handling with custom exception types  
📊 **Session Management** - Built-in session management with race condition protection  
🧪 **Well Tested** - Comprehensive unit tests with high coverage

## Installation

### Option 1: Using pip (Recommended)

```bash
pip install journ4list
```

### Option 2: Using Poetry

```bash
poetry add journ4list
```
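
The distribution name on PyPI is `journ4list`, but every example in this README imports the package as `journalist`. A quick post-install sanity check (a minimal sketch, assuming that import name):

```python
# Quick import check; assumes the package exposes the `journalist`
# import name used throughout the examples in this README.
from journalist import Journalist

print(Journalist)  # importing without an ImportError confirms the install
```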

### Option 3: Development Installation

#### Using Poetry (Recommended for Development)

```bash
# Clone the repository
git clone https://github.com/username/journalist.git
cd journalist

# Install with Poetry
poetry install

# Activate virtual environment
poetry shell
```

#### Using pip-tools (Alternative)

```bash
# Clone the repository
git clone https://github.com/username/journalist.git
cd journalist

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install pip-tools
pip install pip-tools

# Compile and install dependencies
pip-compile requirements.in --output-file requirements.txt
pip install -r requirements.txt
```

## Quick Start

### Basic Usage

```python
import asyncio
from journalist import Journalist

async def main():
    # Create journalist instance
    journalist = Journalist(persist=True, scrape_depth=1)

    # Extract content from news sites
    result = await journalist.read(
        urls=[
            "https://www.bbc.com/news",
            "https://www.reuters.com/"
        ],
        keywords=["teknologi", "spor", "ekonomi"]
    )

    # Access extracted articles
    for article in result['articles']:
        print(f"Title: {article['title']}")
        print(f"URL: {article['url']}")
        print(f"Content: {article['content'][:200]}...")
        print("-" * 50)

    # Check extraction summary
    summary = result['extraction_summary']
    print(f"Processed {summary['urls_processed']} URLs")
    print(f"Found {summary['articles_extracted']} articles")
    print(f"Extraction took {summary['extraction_time_seconds']} seconds")

# Run the example
asyncio.run(main())
```

### Memory-Only Mode (No File Persistence)

```python
import asyncio
from journalist import Journalist

async def main():
    # Use memory-only mode for temporary scraping
    journalist = Journalist(persist=False)

    result = await journalist.read(
        urls=["https://www.cnn.com/"],
        keywords=["news", "breaking"]
    )

    # Articles are stored in memory only
    print(f"Found {len(result['articles'])} articles")
    print(f"Session ID: {result['session_id']}")

asyncio.run(main())
```

### Concurrent Scraping

```python
import asyncio
from journalist import Journalist

async def scrape_multiple_sources():
    """Example of concurrent scraping with multiple journalist instances."""

    # Create tasks for different news sources
    async def scrape_sports():
        journalist = Journalist(persist=True, scrape_depth=2)
        return await journalist.read(
            urls=["https://www.espn.com/", "https://www.skysports.com/"],
            keywords=["futbol", "basketbol"]
        )

    async def scrape_tech():
        journalist = Journalist(persist=True, scrape_depth=1)
        return await journalist.read(
            urls=["https://www.techcrunch.com/", "https://www.wired.com/"],
            keywords=["teknologi", "yazฤฑlฤฑm"]
        )

    # Run concurrently
    sports_task = asyncio.create_task(scrape_sports())
    tech_task = asyncio.create_task(scrape_tech())

    sports_result, tech_result = await asyncio.gather(sports_task, tech_task)

    print(f"Sports articles: {len(sports_result['articles'])}")
    print(f"Tech articles: {len(tech_result['articles'])}")

asyncio.run(scrape_multiple_sources())
```
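
When scraping many sources, you may want to cap how many `read()` calls run at once. The sketch below bounds concurrency with `asyncio.Semaphore`; it uses only the `read()` API shown above, and the limit of 3 is an arbitrary choice:

```python
import asyncio
from journalist import Journalist

async def scrape_with_limit(url_groups, keywords, max_concurrent=3):
    """Scrape several URL groups, allowing at most `max_concurrent` reads at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def scrape_group(urls):
        async with semaphore:  # wait for a free slot before starting this group
            journalist = Journalist(persist=False)
            return await journalist.read(urls=urls, keywords=keywords)

    return await asyncio.gather(*(scrape_group(urls) for urls in url_groups))
```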

## Configuration

### Journalist Parameters

- **`persist`** (bool, default: `True`) - Enable filesystem persistence for session data
- **`scrape_depth`** (int, default: `1`) - Depth level for link discovery and scraping

### Environment Configuration

The library uses sensible defaults but can be configured via the `JournalistConfig` class:

```python
from journalist.config import JournalistConfig

# Get current workspace path
workspace = JournalistConfig.get_base_workspace_path()
print(f"Workspace: {workspace}")  # Output: .journalist_workspace
```
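
When `persist=True`, session data is written beneath this workspace directory. A small sketch for inspecting what a persisted run left behind (it assumes only that the path returned above exists on disk; the layout inside it is an implementation detail and may change between versions):

```python
from pathlib import Path
from journalist.config import JournalistConfig

workspace = Path(JournalistConfig.get_base_workspace_path())

# Walk the workspace and print whatever a persisted session wrote.
if workspace.exists():
    for path in sorted(workspace.rglob("*")):
        print(path.relative_to(workspace))
```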

## Error Handling

The library provides custom exception types for better error handling:

```python
import asyncio
from journalist import Journalist
from journalist.exceptions import NetworkError, ExtractionError, ValidationError

async def robust_scraping():
    try:
        journalist = Journalist()
        result = await journalist.read(
            urls=["https://example-news-site.com/"],
            keywords=["important", "news"]
        )
        return result

    except NetworkError as e:
        print(f"Network error: {e}")
        if hasattr(e, 'status_code'):
            print(f"HTTP Status: {e.status_code}")

    except ExtractionError as e:
        print(f"Content extraction failed: {e}")
        if hasattr(e, 'url'):
            print(f"Failed URL: {e.url}")

    except ValidationError as e:
        print(f"Input validation error: {e}")

    except Exception as e:
        print(f"Unexpected error: {e}")

asyncio.run(robust_scraping())
```
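
`NetworkError` usually signals a transient failure, so retrying with a short backoff before giving up is often worthwhile. A minimal sketch built on the exceptions above; the attempt count and delays are arbitrary choices:

```python
import asyncio
from journalist import Journalist
from journalist.exceptions import NetworkError

async def read_with_retries(urls, keywords, attempts=3, base_delay=1.0):
    """Retry transient network failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            journalist = Journalist(persist=False)
            return await journalist.read(urls=urls, keywords=keywords)
        except NetworkError as exc:
            if attempt == attempts:
                raise  # out of retries, surface the error to the caller
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Network error ({exc}), retrying in {delay:.0f}s...")
            await asyncio.sleep(delay)
```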

## API Reference

### Journalist Class

#### `__init__(persist=True, scrape_depth=1)`

Initialize a new Journalist instance.

**Parameters:**

- `persist` (bool): Enable filesystem persistence
- `scrape_depth` (int): Link discovery depth level

#### `async read(urls, keywords=None)`

Extract content from provided URLs with optional keyword filtering.

**Parameters:**

- `urls` (List[str]): List of website URLs to process
- `keywords` (Optional[List[str]]): Keywords for relevance filtering

**Returns:**

- `Dict[str, Any]`: Dictionary containing extracted articles and metadata

**Return Structure:**

```python
{
    'articles': [
        {
            'title': str,
            'url': str,
            'content': str,
            'author': str,
            'published_date': str,
            'keywords_found': List[str]
        }
    ],
    'session_id': str,
    'extraction_summary': {
        'session_id': str,
        'urls_requested': int,
        'urls_processed': int,
        'articles_extracted': int,
        'extraction_time_seconds': float,
        'keywords_used': List[str]
    }
}
```
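
The `keywords_found` field makes it easy to rank or filter articles after extraction. A small post-processing sketch over the structure above:

```python
def rank_by_keyword_hits(result):
    """Sort extracted articles by how many requested keywords they matched."""
    return sorted(
        result["articles"],
        key=lambda article: len(article["keywords_found"]),
        reverse=True,
    )

# Usage, given `result = await journalist.read(...)`:
# for article in rank_by_keyword_hits(result)[:5]:
#     print(article["title"], article["keywords_found"])
```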

## Development

### Running Tests

```bash
# Using Poetry
poetry run pytest

# Using pip
pytest

# With coverage
pytest --cov=journalist --cov-report=html
```

### Code Quality

```bash
# Format code
black src tests

# Sort imports
isort src tests

# Type checking
mypy src

# Linting
pylint src
```

### Development Dependencies

The project supports both Poetry and pip-tools for dependency management:

**Poetry (pyproject.toml):**

```bash
poetry install --with dev
```

**pip-tools (requirements.in):**

```bash
pip-compile requirements.in --output-file requirements.txt
python -m pip install -r requirements.txt
```

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests for new functionality
5. Ensure tests pass (`pytest`)
6. Format code (`black src tests`)
7. Commit changes (`git commit -m 'Add amazing feature'`)
8. Push to branch (`git push origin feature/amazing-feature`)
9. Open a Pull Request

## Changelog

### v0.1.0 (2025-06-17)

- Initial release
- Async API for universal news content extraction
- Support for multiple extraction methods
- Memory and filesystem persistence modes
- Comprehensive error handling
- Session management with race condition protection
- Concurrent scraping support

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Author

**Oktay Burak Ertas**  
Email: oktay.burak.ertas@gmail.com

## Acknowledgments

- Built with modern Python async/await patterns
- Optimized for global news websites
- Inspired by newspaper3k and readability libraries
- Uses BeautifulSoup4 and lxml for HTML parsing


            
