# genai-processors-url-fetch
[PyPI](https://pypi.org/project/genai-processors-url-fetch/)
[CI](https://github.com/mbeacom/genai-processors-url-fetch/actions/workflows/validate.yml)
[Coverage](https://codecov.io/github/mbeacom/genai-processors-url-fetch)
[License](LICENSE)
A URL Fetch Processor for Google's genai-processors framework that detects URLs in text, fetches their content concurrently, and yields new ProcessorParts containing the page content.
## UrlFetchProcessor
The UrlFetchProcessor is a PartProcessor that detects URLs in incoming text parts, fetches their content concurrently, and yields new ProcessorParts containing the page content. It is a powerful and secure tool for enabling AI agents to access and process information from the web.
### Motivation
Many advanced AI applications, especially those involving Retrieval-Augmented Generation (RAG) or agentic behavior, need to interact with the outside world. This processor provides the fundamental capability of "reading" a webpage.
* **Enables RAG:** Fetches the content of source URLs so an LLM can use up-to-date information to answer questions.
* **Automates Research:** Allows an agent to follow links to gather context for a research task.
* **Simplifies Tooling:** Abstracts away the complexities of asynchronous HTTP requests, rate-limiting, security validation, and HTML parsing.
### Installation
Install the package using pip:
```bash
pip install genai-processors-url-fetch
```
For enhanced content processing with markitdown support:
```bash
pip install "genai-processors-url-fetch[markitdown]"
```
Or using uv (recommended):
```bash
uv add genai-processors-url-fetch
# or with markitdown support
uv add "genai-processors-url-fetch[markitdown]"
```
### Quick Start
```python
from genai_processors import processor
from genai_processors_url_fetch import UrlFetchProcessor, FetchConfig
# Basic usage with defaults (BeautifulSoup text extraction)
fetcher = UrlFetchProcessor()
# Use markitdown for richer content processing
config = FetchConfig(content_processor="markitdown")
markitdown_fetcher = UrlFetchProcessor(config)
# Process text containing URLs
input_text = "Check out https://developers.googleblog.com/en/genai-processors/ for more information"
input_part = processor.ProcessorPart(input_text)
# (run inside an async function / event loop)
async for result_part in fetcher.call(input_part):
    if result_part.metadata.get("fetch_status") == "success":
        print(f"Fetched: {result_part.metadata['source_url']}")
        print(f"Content: {result_part.text[:100]}...")
```
### Security Features
A primary design goal of this processor is to fetch web content safely. By default, it includes several security controls to prevent common vulnerabilities like Server-Side Request Forgery (SSRF).
* **IP Address Blocking:** Prevents requests to private, reserved, and loopback (localhost) IP address ranges (e.g., 192.168.1.1, 10.0.0.1, 127.0.0.1).
* **Cloud Metadata Protection:** Blocks requests to known cloud provider metadata endpoints (e.g., 169.254.169.254), which can expose sensitive instance information.
* **Domain Filtering:** Allows you to restrict fetches to an explicit list of allowed domains or deny requests to a list of blocked domains.
* **Scheme Enforcement:** By default, only allows http and https schemes, preventing requests to other protocols like file:// or ftp://.
* **Response Size Limiting:** Protects against "zip bomb" type attacks by enforcing a maximum size for response bodies (default is 10MB).
All security features are enabled by default but can be configured via the FetchConfig object.
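For example, with the defaults a URL pointing at localhost is rejected before any request is made and surfaces as a failure part rather than an exception. A minimal sketch of that behavior (the exact `fetch_error` wording may differ):

```python
import asyncio

from genai_processors import processor
from genai_processors_url_fetch import FetchConfig, UrlFetchProcessor

fetcher = UrlFetchProcessor(FetchConfig())  # all security controls on by default

async def main() -> None:
    # 127.0.0.1 is rejected by block_localhost before any request goes out
    part = processor.ProcessorPart("Internal page: http://127.0.0.1/admin")
    async for result in fetcher.call(part):
        if result.metadata.get("fetch_status") == "failure":
            # fetch_error describes which control rejected the URL
            print(result.metadata["fetch_error"])

asyncio.run(main())
```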
### Configuration
The processor uses a dataclass-based configuration system for clean, type-safe settings. You can customize the processor's behavior by passing a FetchConfig object during initialization.
```python
from genai_processors_url_fetch import UrlFetchProcessor, FetchConfig
# Example of a customized security configuration
config = FetchConfig(
    timeout=10.0,
    allowed_domains=["github.com", "pypi.org"],  # only allow these domains
    fail_on_error=True,
    max_response_size=5 * 1024 * 1024,  # 5MB limit
)
secure_fetcher = UrlFetchProcessor(config=config)
```
#### FetchConfig Parameters
The `FetchConfig` dataclass provides comprehensive configuration options organized into logical categories:
##### Basic Behavior
* **timeout** (float, default: 15.0): The timeout in seconds for each HTTP request.
* **max_concurrent_fetches_per_host** (int, default: 3): The maximum number of parallel requests to a single hostname.
* **user_agent** (str, default: "GenAI-Processors/UrlFetchProcessor"): The User-Agent string to send with HTTP requests.
* **include_original_part** (bool, default: True): If True, the original ProcessorPart that contained the URL(s) will be yielded at the end of processing.
* **fail_on_error** (bool, default: False): If True, the processor will raise a RuntimeError on the first failed fetch.
* **content_processor** (Literal["beautifulsoup", "markitdown", "raw"], default: "beautifulsoup"): Content processing method.
- `"beautifulsoup"`: Extract clean text using BeautifulSoup (fastest, good for simple HTML)
- `"markitdown"`: Convert content to markdown using Microsoft's markitdown library (best for rich content, requires optional dependency)
- `"raw"`: Return the raw HTML content without processing
* **markitdown_options** (dict[str, Any], default: {}): Options passed to the markitdown MarkItDown constructor when using markitdown processor.
* **extract_text_only** (bool | None, default: None): **Deprecated.** Use `content_processor` instead. For backward compatibility: `True` maps to `"beautifulsoup"`, `False` maps to `"raw"`.
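A short sketch of the migration away from the deprecated flag, using the mapping stated above:

```python
from genai_processors_url_fetch import FetchConfig

# Legacy configuration (still accepted for backward compatibility, but deprecated):
legacy_text = FetchConfig(extract_text_only=True)   # behaves like "beautifulsoup"
legacy_html = FetchConfig(extract_text_only=False)  # behaves like "raw"

# Preferred equivalents:
text_config = FetchConfig(content_processor="beautifulsoup")
html_config = FetchConfig(content_processor="raw")
```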
##### Security Controls
* **block_private_ips** (bool, default: True): If True, blocks requests to RFC 1918 and other reserved IP ranges.
* **block_localhost** (bool, default: True): If True, blocks requests to 127.0.0.1, ::1, and localhost.
* **block_metadata_endpoints** (bool, default: True): If True, blocks requests to common cloud metadata services.
* **allowed_domains** (list[str] | None, default: None): If set, only URLs matching a domain in this list (or its subdomains) will be fetched (see the matching sketch after this list).
* **blocked_domains** (list[str] | None, default: None): If set, any URL matching a domain in this list (or its subdomains) will be blocked.
* **allowed_schemes** (list[str], default: ['http', 'https']): A list of allowed URL schemes.
* **max_response_size** (int, default: 10485760): The maximum size of the response body in bytes (10MB).
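To illustrate the documented "or its subdomains" matching for domain filters (the hostnames below are hypothetical):

```python
from genai_processors_url_fetch import FetchConfig

config = FetchConfig(allowed_domains=["example.com"])

# With this config:
#   https://example.com/page       -> allowed (exact domain match)
#   https://docs.example.com/page  -> allowed (subdomain of example.com)
#   https://example.com.evil.net/  -> blocked (not example.com or a subdomain)
#   https://pypi.org/              -> blocked (not in allowed_domains)
```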
### Content Processing Options
The UrlFetchProcessor supports three content processing methods via the `content_processor` configuration:
#### BeautifulSoup (Default)
```python
config = FetchConfig(content_processor="beautifulsoup")
fetcher = UrlFetchProcessor(config)
# Returns: Clean text extracted from HTML, fastest processing
# Mimetype: "text/plain; charset=utf-8"
```
#### Markitdown (Rich Content Processing)
The markitdown processor provides the richest content extraction by converting HTML to structured markdown. It's ideal for preserving formatting, tables, links, and document structure.
```python
config = FetchConfig(
    content_processor="markitdown",
    markitdown_options={
        "extract_tables": True,   # preserve table structure
        "preserve_links": True,   # keep link formatting
        # additional markitdown options can be specified here
    },
)
fetcher = UrlFetchProcessor(config)
# Returns: Rich markdown with preserved formatting, tables, links
# Mimetype: "text/markdown; charset=utf-8"
# Requires: pip install genai-processors-url-fetch[markitdown]
```
**When to use markitdown:**
* Processing documentation pages or wikis
* Extracting structured content with tables and lists
* Preserving links and formatting for downstream processing
* Working with rich content that benefits from markdown structure
**Comparison with other processors:**
* **BeautifulSoup**: Fast text extraction, loses formatting
* **Markitdown**: Rich markdown, preserves structure, slower processing
* **Raw**: Full HTML control, requires custom parsing
#### Raw HTML
```python
config = FetchConfig(content_processor="raw")
fetcher = UrlFetchProcessor(config)
# Returns: Original HTML content without processing
# Mimetype: "text/html; charset=utf-8"
```
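Downstream code can branch on these mimetypes to decide how to handle each content part. A minimal sketch, assuming only the three documented values; the handler functions are hypothetical placeholders:

```python
def handle_plain_text(text: str) -> str:  # hypothetical downstream handler
    return text  # e.g. chunk clean text for a RAG index

def handle_markdown(text: str) -> str:    # hypothetical downstream handler
    return text  # e.g. pass structured markdown straight into a prompt

def handle_raw_html(text: str) -> str:    # hypothetical downstream handler
    return text  # e.g. run a custom parser over the full document

def route_content(part) -> str:
    """Dispatch on the mimetype set by the configured content processor."""
    if part.mimetype.startswith("text/markdown"):
        return handle_markdown(part.text)
    if part.mimetype.startswith("text/html"):
        return handle_raw_html(part.text)
    return handle_plain_text(part.text)  # "text/plain" from beautifulsoup
```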
### Usage Examples
#### High Security Configuration
```python
config = FetchConfig(
    allowed_domains=["trusted.com", "docs.python.org"],
    allowed_schemes=["https"],
    block_private_ips=True,
    max_response_size=1024 * 1024,  # 1MB
    timeout=10.0,
)
fetcher = UrlFetchProcessor(config)
```
#### Fast and Flexible
```python
config = FetchConfig(
    timeout=5.0,
    content_processor="raw",      # keep raw HTML (replaces the deprecated extract_text_only=False)
    include_original_part=False,  # only return fetched content
    fail_on_error=True,           # stop on first error
)
fetcher = UrlFetchProcessor(config)
```
#### Pipeline Processing
```python
# Configure for pipeline use
config = FetchConfig(include_original_part=False)
fetcher = UrlFetchProcessor(config)
# Process input (inside an async function)
results = [part async for part in fetcher.call(input_part)]
# Filter successful fetches
successful_content = [
    part for part in results
    if part.metadata.get("fetch_status") == "success"
]
# Further process the content
for content_part in successful_content:
    source_url = content_part.metadata["source_url"]
    text_content = content_part.text
    # ... your processing logic here
```
#### Markitdown Processing Example
```python
from genai_processors import streams
from genai_processors_url_fetch import UrlFetchProcessor, FetchConfig
# Configure markitdown processor for rich content extraction
config = FetchConfig(
    content_processor="markitdown",
    include_original_part=False,
    markitdown_options={
        "extract_tables": True,
        "preserve_links": True,
    },
)
fetcher = UrlFetchProcessor(config)
# Process URLs in text
text_with_urls = "Check out https://github.com/microsoft/markitdown for examples"
input_stream = streams.stream_content([text_with_urls])
async for part in fetcher(input_stream):
    if part.metadata.get("fetch_status") == "success":
        print(f"📄 Fetched from: {part.metadata['source_url']}")
        print(f"📝 Markdown content:\n{part.text}")
        print(f"✨ Content type: {part.mimetype}")
    elif part.substream_name == "status":
        print(f"Status: {part.text}")
```
### Behavior and Output
For each ProcessorPart that contains one or more URLs, the UrlFetchProcessor yields several new parts:
1. **Status Parts:** For each URL, a processor.status() message is yielded, indicating the outcome (✅ Fetched successfully... or ❌ Fetch failed...).
2. **Content Parts:** For each valid and successful fetch, a corresponding content part is yielded.
3. **Failure Parts:** For each fetch that fails due to a network error or security violation, a failure part is yielded (if fail_on_error is False).
4. **Original Part (Optional):** If include_original_part is True, the original part is yielded last.
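Taken together, a consumer can bucket the yielded parts by kind. A minimal sketch, assuming the `"status"` substream name used in the examples above:

```python
async def collect_parts(fetcher, input_part):
    """Bucket yielded parts into the four categories described above (a sketch)."""
    statuses, contents, failures, originals = [], [], [], []
    async for part in fetcher.call(input_part):
        if part.substream_name == "status":                   # 1. status parts
            statuses.append(part)
        elif part.metadata.get("fetch_status") == "success":  # 2. content parts
            contents.append(part)
        elif part.metadata.get("fetch_status") == "failure":  # 3. failure parts
            failures.append(part)
        else:                                                 # 4. original part (if enabled)
            originals.append(part)
    return statuses, contents, failures, originals
```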
#### On Successful Fetch
* A ProcessorPart is yielded with metadata['fetch_status'] set to 'success'.
* The metadata['source_url'] contains the URL that was fetched.
* The text of the part contains the page content.
* The mimetype indicates the content type ('text/plain; charset=utf-8', 'text/markdown; charset=utf-8', or 'text/html; charset=utf-8', depending on the configured content_processor).
#### On Failed Fetch
* A ProcessorPart is yielded with metadata['fetch_status'] set to 'failure'.
* The metadata['fetch_error'] contains a string describing the error (e.g., "Security validation failed: Domain '...' is blocked" or "HTTP Error: 404...").
* The part's text will be empty.
### Error Handling
The processor provides detailed error information through metadata:
#### Metadata Fields
* **fetch_status**: "success" or "failure"
* **source_url**: The URL that was fetched
* **fetch_error**: Error message for failed fetches (only present on failures)
#### Example Error Handling
```python
from genai_processors import processor  # for processor.STATUS_STREAM

# `fetcher` and `input_part` as constructed in the Quick Start above
async for part in fetcher.call(input_part):
    status = part.metadata.get("fetch_status")

    if status == "success":
        print(f"✅ Fetched: {part.metadata['source_url']}")
        print(f"Content: {part.text}")
    elif status == "failure":
        print(f"❌ Failed: {part.metadata['source_url']}")
        print(f"Error: {part.metadata['fetch_error']}")
    elif part.substream_name == processor.STATUS_STREAM:
        print(f"📄 Status: {part.text}")
```
### Examples and Testing
#### Working Examples
For complete, runnable examples that demonstrate the UrlFetchProcessor, see:
**URL Content Summarizer** (`examples/url_content_summarizer.py`):
This example builds a URL content summarizer that:
* Fetches content from URLs in user input
* Uses secure configuration (HTTPS only, blocks private IPs)
* Integrates with GenAI models for content summarization
* Shows proper error handling and pipeline construction
* Provides a practical CLI interface
**Markitdown Content Processing** (`examples/markitdown_example.py`):
This example demonstrates markitdown processor capabilities:
* Shows different content processor options (BeautifulSoup, Markitdown, Raw)
* Compares output formats and use cases
* Demonstrates markitdown configuration options
* Shows how to handle different content types
To run the examples:
```bash
# Content summarizer (requires API key)
export GEMINI_API_KEY=your_api_key_here
python examples/url_content_summarizer.py
# Markitdown demo (requires markitdown optional dependency)
pip install "genai-processors-url-fetch[markitdown]"
python examples/markitdown_example.py
```
#### Test Suite
For comprehensive test coverage including security features, error handling, and all configuration options, see: `genai_processors_url_fetch/tests/test_url_fetch.py`
The test suite includes:
* Basic functionality tests
* Security feature validation
* Error handling scenarios
* Configuration option testing
* Mock implementations for reliable testing
### Considerations
1. **Security First**: Always configure appropriate security controls for your use case
2. **Resource Limits**: Set reasonable timeout and size limits to prevent resource exhaustion
3. **Error Handling**: Handle both successful and failed fetches appropriately in your application logic
4. **Rate Limiting**: Use the built-in per-host rate limiting to be respectful to target servers (see the sketch after this list)
5. **Content Processing**: Choose between text extraction and raw HTML based on your downstream processing needs
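A conservative configuration along these lines, built only from the documented `FetchConfig` parameters (the User-Agent string is a hypothetical example):

```python
from genai_processors_url_fetch import FetchConfig, UrlFetchProcessor

config = FetchConfig(
    max_concurrent_fetches_per_host=1,  # serialize requests to each hostname
    timeout=10.0,                       # fail fast instead of holding sockets open
    max_response_size=2 * 1024 * 1024,  # 2MB cap guards against oversized bodies
    user_agent="MyAgent/1.0 (contact@example.com)",  # hypothetical identifying UA
)
polite_fetcher = UrlFetchProcessor(config)
```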
## Development
For development setup, testing, and contribution guidelines, see [CONTRIBUTING.md](CONTRIBUTING.md).
This project uses [uv](https://docs.astral.sh/uv/) for dependency management and [Poe the Poet](https://poethepoet.natn.io/) for task automation.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
## License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.