intelliscraper-core

- **Name**: intelliscraper-core
- **Version**: 0.1.2
- **Summary**: Smart web scraper that abstracts away complexity - from simple sites to highly protected ones.
- **Uploaded**: 2025-10-18 20:46:45
- **Requires Python**: >=3.12
- **License**: MIT License, Copyright (c) 2025 omkar musale
- **Keywords**: anti-detection, crawling, playwright, proxy, scraper, web-scraping
# IntelliScraper

A powerful web scraping solution with built-in anti-bot detection, powered by Playwright and designed for scraping protected sites such as Himalayas Jobs and other platforms that require authentication. It features session management, proxy support, and advanced HTML parsing.

![Python Version](https://img.shields.io/badge/python-3.12%2B-blue)
![License](https://img.shields.io/badge/license-MIT-green)
![Status](https://img.shields.io/badge/status-active-success)

## ✨ Features

- **🔐 Session Management**: Capture and reuse authentication sessions with cookies, local storage, and browser fingerprints
- **🛡️ Anti-Detection**: Advanced techniques to prevent bot detection
- **🌐 Proxy Support**: Integrated support for Bright Data and custom proxy solutions
- **📝 HTML Parsing**: Extract text, links, and convert to Markdown format (including LLM-optimized output)
- **🎯 CLI Tool**: Easy-to-use command-line interface for session generation
- **⚡ Playwright-Powered**: Built on the robust Playwright automation framework

## 🚀 Quick Start

### Installation

```bash
# Install the package
pip install intelliscraper-core

# Install Playwright browser (Chromium)
playwright install chromium
```
> [!NOTE]  
> Playwright requires browser binaries to be installed separately.  
> The command above installs Chromium, which is necessary for this library to work.  

> For more details, see the package on PyPI: https://pypi.org/project/intelliscraper-core/

### Basic Scraping (No Authentication)

```python
from intelliscraper import Scraper, ScrapStatus

# Simple scraping without authentication
scraper = Scraper()
response = scraper.scrape("https://example.com")

if response.status == ScrapStatus.COMPLETED:
    print(response.scrap_html_content)
```

### Creating Session Data

Use the CLI tool to create session data for authenticated scraping. The tool will open a browser where you can manually log in:

```bash
intelliscraper-session --url "https://himalayas.app" --site "himalayas" --output "./himalayas_session.json"
```

**How it works:**
1. 🌐 Opens a browser with the specified URL
2. 🔐 You manually log in with your credentials
3. ⏎ Press Enter after successful login
4. 💾 Session data (cookies, storage, fingerprints) is saved to a JSON file
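Since the output is plain JSON, you can sanity-check a captured session before using it. A minimal sketch (the top-level field names such as `cookies` and `storage` are assumptions based on the description above, not a documented schema):

```python
import json
from pathlib import Path

def summarize_session(path: str) -> dict[str, int]:
    """Count entries in each top-level section of a saved session file.

    NOTE: field names like "cookies" and "storage" are assumptions;
    inspect your own session file to see the actual schema.
    """
    data = json.loads(Path(path).read_text())
    return {
        key: len(value) if isinstance(value, (list, dict)) else 1
        for key, value in data.items()
    }
```

This gives a quick view of how much state was captured without printing sensitive cookie values.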

> [!IMPORTANT]  
> Each session internally maintains time-series statistics of scraping events including timestamps, request start times, and statuses. 
> These metrics are useful for analyzing scraping behavior, rate limits, and identifying performance bottlenecks. 
> During testing, we observed that increasing concurrency too aggressively can lead to failures, while controlled, slower scraping rates maintain higher success rates and better session stability.
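The pacing advice above can be sketched as a small throttle that enforces a minimum interval between requests. This is a generic stdlib helper, not part of IntelliScraper's API:

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last: float | None = None

    def wait(self) -> None:
        # Sleep just long enough so that at least min_interval seconds
        # have passed since the previous call returned.
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

# Hypothetical usage in a scraping loop (scraper and urls are placeholders):
# throttle = Throttle(min_interval=2.0)
# for url in urls:
#     throttle.wait()
#     response = scraper.scrape(url)
```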

### Authenticated Scraping with Session

```python
import json
from intelliscraper import Scraper, Session, ScrapStatus

# Load session data
with open("himalayas_session.json") as f:
    session = Session(**json.load(f))

# Scrape with authentication
scraper = Scraper(session_data=session)
response = scraper.scrape("https://himalayas.app/jobs/python?experience=entry-level%2Cmid-level")

if response.status == ScrapStatus.COMPLETED:
    print("Successfully scraped authenticated page!")
    print(response.scrap_html_content)
```

## πŸ“ HTML Parsing

Parse scraped content to extract text, links, and markdown:

```python
from intelliscraper import Scraper, ScrapStatus, HTMLParser

scraper = Scraper()
response = scraper.scrape("https://example.com")

if response.status == ScrapStatus.COMPLETED:
    # Initialize parser
    parser = HTMLParser(
        url=response.scrape_request.url,
        html=response.scrap_html_content
    )
    
    # Extract different formats
    print(parser.text)              # Plain text
    print(parser.links)             # All links (normalized URLs)
    print(parser.markdown)          # Full markdown
    print(parser.markdown_for_llm)  # Clean markdown for AI (removes nav, footer, ads)
```

The `markdown_for_llm` property is optimized for AI processing: it removes navigation, footers, advertisements, and forms, keeping only the useful content.
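For context, `parser.links` returns normalized URLs. The sketch below illustrates the usual normalization technique (resolving relative hrefs against the page URL and dropping fragments) with `urllib.parse`; it is an illustration of the idea, not the library's implementation:

```python
from urllib.parse import urljoin, urldefrag

def normalize_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve relative hrefs against the page URL, drop fragments, dedupe."""
    seen: set[str] = set()
    out: list[str] = []
    for href in hrefs:
        # urljoin makes the href absolute; urldefrag strips any "#fragment".
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute and absolute not in seen:
            seen.add(absolute)
            out.append(absolute)
    return out

print(normalize_links("https://example.com/jobs/", ["python", "/about", "#top"]))
```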

## 🌐 Proxy Support

IntelliScraper supports proxy configurations including Bright Data and custom solutions:

```python
from intelliscraper import Scraper, ProxyConfig

proxy = ProxyConfig(
    url="http://brd.superproxy.io:22225",
    username="your-username",
    password="your-password"
)

scraper = Scraper(proxy=proxy)
response = scraper.scrape("https://example.com")
```

> 📁 **More examples**, including proxy configurations and advanced usage, can be found in the [`examples/`](./examples) folder.

## 📋 Requirements

- Python 3.12+
- Playwright
- Compatible with Windows, macOS, and Linux

## πŸ—ΊοΈ Roadmap

- ✅ Session management with CLI tool
- ✅ Proxy support (Bright Data)
- ✅ HTML parsing and Markdown conversion
- ✅ Anti-detection features
- ✅ PyPI package
- 🔄 Async scraping support
- 🔄 Web crawler
- 🔄 AI integration

## 📄 License

This project is licensed under the MIT License.


## 📧 Support

For issues, questions, or contributions, please visit our [GitHub repository's issues page](https://github.com/omkarmusale0910/IntelliScraper/issues).

---

**Note**: This project is under active development. The package is available on PyPI as `intelliscraper-core`.
            
