spiderforce4ai

Name: spiderforce4ai
Version: 2.6.7
Home page: https://petertam.pro
Summary: Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service with LLM post-processing
Upload time: 2025-02-16 14:44:55
Author: Piotr Tamulewicz
Requires Python: >=3.11
Keywords: web-scraping, markdown, html-to-markdown, llm, ai, content-extraction, async, parallel-processing

# SpiderForce4AI Python Wrapper

A Python package for web content crawling and HTML-to-Markdown conversion, built for seamless integration with the SpiderForce4AI service.

## Features

- HTML to Markdown conversion
- Parallel and async crawling support
- Sitemap processing
- Custom content selection
- Automatic retry mechanism
- Detailed progress tracking
- Webhook notifications
- Customizable reporting

## Installation

```bash
pip install spiderforce4ai
```

## Quick Start

```python
from spiderforce4ai import SpiderForce4AI, CrawlConfig
from pathlib import Path

# Initialize crawler
spider = SpiderForce4AI("http://localhost:3004")

# Configure crawling options
config = CrawlConfig(
    target_selector="article",
    remove_selectors=[".ads", ".navigation"],
    max_concurrent_requests=5,
    save_reports=True
)

# Crawl a sitemap
results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)
```

## Key Features

### 1. Smart Retry Mechanism
- Automatically retries failed URLs
- Monitors failure ratio to prevent server overload
- Detailed retry statistics and progress tracking
- Aborts retries if failure rate exceeds 20%

```python
import asyncio

# Retry behavior is automatic
config = CrawlConfig(
    max_concurrent_requests=5,
    request_delay=1.0  # Delay between retries (seconds)
)
results = asyncio.run(spider.crawl_urls_async(urls, config))
```

### 2. Custom Webhook Integration
- Flexible payload formatting
- Custom headers support
- Variable substitution in templates

```python
config = CrawlConfig(
    webhook_url="https://your-webhook.com",
    webhook_headers={
        "Authorization": "Bearer token",
        "X-Custom-Header": "value"
    },
    webhook_payload_template='''{
        "url": "{url}",
        "content": "{markdown}",
        "status": "{status}",
        "custom_field": "value"
    }'''
)
```
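
For illustration, a notification delivered with the template above would look roughly like the following once the `{url}`, `{markdown}`, and `{status}` placeholders are substituted (the values below are made up, and the exact status strings are not documented here):

```json
{
    "url": "https://example.com/page1",
    "content": "# Page Title\n\nConverted Markdown content...",
    "status": "success",
    "custom_field": "value"
}
```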

### 3. Flexible Report Generation
- Optional report saving
- Customizable report location
- Detailed success/failure statistics

```python
config = CrawlConfig(
    save_reports=True,
    report_file=Path("custom_report.json"),
    output_dir=Path("content")
)
```

## Crawling Methods

### 1. Single URL Processing

```python
# Synchronous
result = spider.crawl_url("https://example.com", config)

# Asynchronous
async def crawl():
    return await spider.crawl_url_async("https://example.com", config)
```
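
If you are not already inside an event loop, the coroutine above can be driven with standard `asyncio` (a minimal sketch; `crawl()` is the function defined in the snippet above):

```python
import asyncio

# Run the async crawl from synchronous code and collect the result
result = asyncio.run(crawl())
print(result)
```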

### 2. Multiple URLs

```python
urls = ["https://example.com/page1", "https://example.com/page2"]

# Server-side parallel (recommended)
results = spider.crawl_urls_server_parallel(urls, config)

# Client-side parallel
results = spider.crawl_urls_parallel(urls, config)

# Asynchronous
async def crawl():
    return await spider.crawl_urls_async(urls, config)
```

### 3. Sitemap Processing

```python
# Server-side parallel (recommended)
results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)

# Client-side parallel
results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)

# Asynchronous
async def crawl():
    return await spider.crawl_sitemap_async("https://example.com/sitemap.xml", config)
```

## Configuration Options

```python
config = CrawlConfig(
    # Content Selection
    target_selector="article",              # Target element to extract
    remove_selectors=[".ads", "#popup"],    # Elements to remove
    remove_selectors_regex=["modal-\\d+"],  # Regex patterns for removal
    
    # Processing
    max_concurrent_requests=5,              # Parallel processing limit
    request_delay=0.5,                      # Delay between requests (seconds)
    timeout=30,                             # Request timeout (seconds)
    
    # Output
    output_dir=Path("content"),             # Output directory
    save_reports=False,                     # Enable/disable report saving
    report_file=Path("report.json"),        # Report location
    
    # Webhook
    webhook_url="https://webhook.com",      # Webhook endpoint
    webhook_timeout=10,                     # Webhook timeout (seconds)
    webhook_headers={                       # Custom headers
        "Authorization": "Bearer token"
    },
    # Custom payload format (variables are substituted per URL)
    webhook_payload_template='''{
        "url": "{url}",
        "content": "{markdown}",
        "status": "{status}",
        "error": "{error}",
        "time": "{timestamp}"
    }'''
)
```

## Progress Tracking

The package provides detailed progress information:

```
Fetching sitemap from https://example.com/sitemap.xml...
Found 156 URLs in sitemap
[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 156/156 URLs

Retrying failed URLs: 18 (11.5% failed)
[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 18/18 retries

Crawling Summary:
Total URLs processed: 156
Initial failures: 18 (11.5%)
Final results:
  ✓ Successful: 150
  ✗ Failed: 6
Retry success rate: 12/18 (66.7%)
```

## Output Structure

### 1. Directory Layout
```
content/                    # Output directory
├── example-com-page1.md   # Markdown files
├── example-com-page2.md
└── report.json            # Crawl report
```
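
Because converted pages are written as plain `.md` files in `output_dir`, they can be consumed with nothing more than the standard library. A minimal sketch, assuming the `content/` directory from the configuration above:

```python
from pathlib import Path

output_dir = Path("content")

# Read each generated Markdown file and report its size
for md_file in sorted(output_dir.glob("*.md")):
    markdown = md_file.read_text(encoding="utf-8")
    print(f"{md_file.name}: {len(markdown)} characters")
```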

### 2. Report Format
```json
{
  "timestamp": "2025-02-15T10:30:00",
  "config": {
    "target_selector": "article",
    "remove_selectors": [".ads"]
  },
  "results": {
    "successful": [...],
    "failed": [...]
  },
  "summary": {
    "total": 156,
    "successful": 150,
    "failed": 6
  }
}
```
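
The report is plain JSON, so post-crawl statistics can be pulled out with the standard library. A sketch based on the structure shown above (the exact shape of each entry in `successful`/`failed` is an assumption):

```python
import json
from pathlib import Path

report = json.loads(Path("content/report.json").read_text(encoding="utf-8"))

summary = report["summary"]
print(f"Total: {summary['total']}, successful: {summary['successful']}, failed: {summary['failed']}")

# List URLs that still failed after retries (a "url" key per entry is assumed)
for entry in report["results"]["failed"]:
    url = entry.get("url", entry) if isinstance(entry, dict) else entry
    print("Failed:", url)
```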

## Performance Optimization

1. Server-side Parallel Processing
   - Recommended for most cases
   - Single HTTP request
   - Reduced network overhead
   - Built-in load balancing

2. Client-side Parallel Processing
   - Better control over processing
   - Customizable concurrency
   - Progress tracking per URL
   - Automatic retry handling

3. Asynchronous Processing
   - Ideal for async applications (see the sketch after this list)
   - Non-blocking operation
   - Real-time progress updates
   - Efficient resource usage
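
As a sketch of the asynchronous approach referenced above, `crawl_urls_async` can be awaited from an existing `asyncio` application so the crawl runs without blocking other coroutines (the `main` wrapper here is illustrative, not part of the package):

```python
import asyncio

from spiderforce4ai import SpiderForce4AI, CrawlConfig

async def main():
    spider = SpiderForce4AI("http://localhost:3004")
    config = CrawlConfig(max_concurrent_requests=5)
    urls = ["https://example.com/page1", "https://example.com/page2"]

    # Schedule the crawl as a task; the event loop stays free for other work
    crawl_task = asyncio.create_task(spider.crawl_urls_async(urls, config))
    results = await crawl_task
    return results

if __name__ == "__main__":
    results = asyncio.run(main())
```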

## Error Handling

The package provides comprehensive error handling (a post-crawl check is sketched after the list below):

- Automatic retry for failed URLs
- Failure ratio monitoring
- Detailed error reporting
- Webhook error notifications
- Progress tracking during retries
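
Beyond the built-in retries, results can also be checked after the crawl. A minimal sketch, assuming each result object exposes `url`, `status`, and `error` attributes (the same names used in the webhook template; the actual attribute names are not confirmed here):

```python
results = spider.crawl_urls_server_parallel(urls, config)

# Collect failures for follow-up; attribute names are assumptions (see note above)
failed = [r for r in results if getattr(r, "status", None) != "success"]
for r in failed:
    print(f"{getattr(r, 'url', '?')} failed: {getattr(r, 'error', 'unknown error')}")
```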

## Requirements

- Python 3.11+
- Running SpiderForce4AI service
- Internet connection

## Dependencies

- aiohttp
- asyncio
- rich
- aiofiles
- httpx

## License

MIT License

## Credits

Created by [Peter Tam](https://petertam.pro)

            
