playwright-simple-scraper

Name: playwright-simple-scraper
Version: 0.1.5.dev0
Summary: A simple Playwright-based scraper that returns a simple result (a List[str])
Upload time: 2025-08-11 14:37:07
Requires Python: >=3.9
License: MIT
Keywords: playwright, scraping, web-scraping, automation, crawler
Project: https://github.com/elecbrandy/playwright-simple-scraper
# playwright-simple-scraper

A tiny scraper built on Playwright.
Give it a URL and a CSS selector → get back the texts or the links that match.

It tries a few safe strategies (stealth, human-like, Firefox, mobile, proxy-like headers) and returns as soon as one works.

<br>

## Requirements
- Python 3.9+
- Playwright package
- At least one Playwright browser installed (we’ll install Chromium below)

<br>

## Install (local dev)

From the project root (where `pyproject.toml` is):

``` bash
# (optional) create a virtual env
python3 -m venv .venv
source .venv/bin/activate    # on Windows: .venv\Scripts\activate

# install your package in editable mode + dev tools
pip install -e ".[dev]"

# download a browser once (required by Playwright)
playwright install chromium
# (optional) also install:  playwright install firefox
```

<br>

## Quick start

``` python
from playwright_simple_scraper import scrape_context, scrape_href

# 1) Get inner text of matched elements
res1 = scrape_context(
    "https://news.ycombinator.com",
    ".athing .titleline > a"
)

# 2) Get href attributes of matched elements
res2 = scrape_href(
    "https://news.ycombinator.com",
    ".athing .titleline > a"
)
```

Or run the example script:

```bash
python examples/simple_usage.py
```

## What the functions return

Both functions return a `ScrapeResult` dataclass:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class ScrapeResult:
    url: str
    selector: str
    result: List[str]       # your texts or hrefs
    count: int              # len(result)
    fetched_at: datetime    # UTC timestamp

    def first(self) -> Optional[str]: ...
    def to_dict(self) -> dict: ...
```
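As a self-contained sketch, here is how these helper methods might behave. The method bodies below are assumptions for illustration, not the package's actual implementation:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class ScrapeResult:
    url: str
    selector: str
    result: List[str]
    count: int
    fetched_at: datetime

    def first(self) -> Optional[str]:
        # First matched string, or None when nothing matched.
        return self.result[0] if self.result else None

    def to_dict(self) -> dict:
        # Plain-dict form with a JSON-friendly timestamp.
        d = asdict(self)
        d["fetched_at"] = self.fetched_at.isoformat()
        return d

res = ScrapeResult(
    url="https://example.com",
    selector="h1",
    result=["Example Domain"],
    count=1,
    fetched_at=datetime.now(timezone.utc),
)
print(res.first())              # Example Domain
print(res.to_dict()["count"])   # 1
```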

* `scrape_context(url, selector, respect_robots=True, user_agent="*")`

  * Returns texts (innerText) of the matched elements.
* `scrape_href(url, selector, respect_robots=True, user_agent="*")`

  * Returns the `href` attribute of the matched elements.

Note: `respect_robots` and `user_agent` are placeholders for now (not implemented yet).

## CSS selector tips

* Start simple (e.g., `h1`, `a.article-link`, `#main .title > a`).
* If nothing is returned, inspect the page in DevTools and try a different selector.
* Many sites load content late; the scraper already waits, but strong bot protection may still block you.

## How it works (short)

1. Validate inputs.
2. Try strategies in this order until one works:

   * `stealth` (hides automation hints, blocks heavy assets, light human moves)
   * `human_like` (slower loads, extra waits, human scroll/click)
   * `diff_browser` (Firefox)
   * `mobile` (mobile UA/layout)
   * `proxy` (adds proxy-like headers and random IP-ish headers)
3. Return results on first success; otherwise raise an error.
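The fallback logic in steps 2–3 can be sketched like this. The function name, exception handling, and dummy strategies are assumptions for illustration, not the package's internals:

```python
from typing import Callable, List

# A strategy takes (url, selector) and returns matched strings,
# raising on failure. Real strategies would drive a Playwright browser.
Strategy = Callable[[str, str], List[str]]

def scrape_with_fallback(url: str, selector: str,
                         strategies: List[Strategy]) -> List[str]:
    errors = []
    for strategy in strategies:
        try:
            return strategy(url, selector)   # first success wins
        except Exception as exc:
            errors.append(exc)               # remember why this one failed
    raise RuntimeError(f"All {len(strategies)} strategies failed: {errors}")

# Dummy stand-ins for stealth / human_like / ...
def always_fails(url, selector):
    raise TimeoutError("blocked")

def succeeds(url, selector):
    return ["Example Domain"]

print(scrape_with_fallback("https://example.com", "h1",
                           [always_fails, succeeds]))  # ['Example Domain']
```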

## Jupyter note

Jupyter already runs an event loop.
This library uses `nest_asyncio` internally so you can call `scrape_*()` without `await`.
If you still see loop errors, restart the kernel and try again.

## Running tests

```bash
pytest -q
```

The basic test runs `examples/simple_usage.py` and checks it finishes without errors.

## Troubleshooting

* “Playwright browser not installed”

  * Run: `playwright install chromium` (or `firefox` if you use the Firefox strategy).
* Empty results

  * Your selector may not match. Test it in DevTools.
  * Some sites have strong bot protection. Slow down, try again later, or provide your own proxies/cookies.
* Import errors

  * The package folder must be `playwright_simple_scraper/` (with `__init__.py`) and you should install with `pip install -e ".[dev]"` from the project root.
* Timeouts / hanging

  * Sites can be slow or blocked. The strategies use waits between ~15–45s. If needed, adjust timeouts in `strategies/*.py`.

## FAQ

* Does it respect robots.txt?

  * Not yet. The flags exist but are not implemented.
* Can I change headless mode?

  * Currently it runs headless by default. You can change it in `core.py` (look for `_run_sync(..., True)`).
* Which browsers are used?

  * Chromium by default, plus a Firefox strategy.
* Can I set a custom timeout or headers?

  * Not via public API yet. You can tweak each strategy in `playwright_simple_scraper/strategies/`.

## Project layout

```
playwright-simple-scraper/
├─ playwright_simple_scraper/
│  ├─ __init__.py
│  ├─ core.py
│  ├─ browser.py
│  ├─ model.py
│  ├─ utils.py
│  └─ strategies/
│     ├─ stealth.py
│     ├─ human_like.py
│     ├─ diff_browser.py
│     ├─ mobile.py
│     └─ proxy.py
├─ examples/
│  └─ simple_usage.py
├─ tests/
│  └─ test_core.py
├─ pyproject.toml
└─ README.md
```

## License

MIT

## Notes & ethics

* Follow each site’s Terms of Service and local laws.
* Keep request rates polite.
* Do not use this tool to harm services or violate privacy.

            
