# playwright-simple-scraper
A tiny scraper built on Playwright.
Give it a URL and a CSS selector → get back the texts or the links that match.
It tries a few safe strategies (stealth, human-like, Firefox, mobile, proxy-like headers) and returns as soon as one works.
<br>
## Requirements
- Python 3.9+
- Playwright package
- At least one Playwright browser installed (we’ll install Chromium below)
<br>
## Install (local dev)
From the project root (where `pyproject.toml` is):
```bash
# (optional) create a virtual env
python3 -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate

# install your package in editable mode + dev tools
pip install -e ".[dev]"

# download a browser once (required by Playwright)
playwright install chromium
# (optional) also install Firefox: playwright install firefox
```
<br>
## Quick start
```python
from playwright_simple_scraper import scrape_context, scrape_href

# 1) Get inner text of matched elements
res1 = scrape_context(
    "https://news.ycombinator.com",
    ".athing .titleline > a",
)

# 2) Get href attributes of matched elements
res2 = scrape_href(
    "https://news.ycombinator.com",
    ".athing .titleline > a",
)
```
Or run the example script:
```bash
python examples/simple_usage.py
```
## What the functions return
Both functions return a `ScrapeResult` dataclass:
```python
@dataclass
class ScrapeResult:
    url: str
    selector: str
    result: List[str]      # your texts or hrefs
    count: int             # len(result)
    fetched_at: datetime   # UTC timestamp

    def first(self) -> Optional[str]: ...
    def to_dict(self) -> dict: ...
```
* `scrape_context(url, selector, respect_robots=True, user_agent="*")`
  * Returns the inner text (`innerText`) of the matched elements.
* `scrape_href(url, selector, respect_robots=True, user_agent="*")`
  * Returns the `href` attribute of the matched elements.

Note: `respect_robots` and `user_agent` are placeholders for now (not yet implemented).
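To show how the result object is typically consumed, here is a minimal, self-contained stand-in that mirrors the `ScrapeResult` fields above (the field values are made up; the real object comes back from `scrape_context`/`scrape_href`):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

# Stand-in mirroring the ScrapeResult fields shown above, so the
# accessor pattern can be tried without a live scrape.
@dataclass
class ScrapeResult:
    url: str
    selector: str
    result: List[str]
    count: int
    fetched_at: datetime

    def first(self) -> Optional[str]:
        # First matched text/href, or None when nothing matched
        return self.result[0] if self.result else None

    def to_dict(self) -> dict:
        return {
            "url": self.url,
            "selector": self.selector,
            "result": self.result,
            "count": self.count,
            "fetched_at": self.fetched_at.isoformat(),
        }

res = ScrapeResult(
    url="https://news.ycombinator.com",
    selector=".athing .titleline > a",
    result=["Story A", "Story B"],
    count=2,
    fetched_at=datetime.now(timezone.utc),
)
print(res.first())              # first match
print(res.to_dict()["count"])   # handy for JSON serialization
```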
## CSS selector tips
* Start simple (e.g., `h1`, `a.article-link`, `#main .title > a`).
* If nothing comes back, inspect the page in DevTools and try a different selector.
* Many sites load content late; the scraper already waits, but strong bot protection may still block you.
## How it works (short)
1. Validate inputs.
2. Try strategies in this order until one works:
   * stealth (hides automation hints, blocks heavy assets, light human-like moves)
   * human_like (slower loads, extra waits, human-like scroll/click)
   * diff_browser (Firefox)
   * mobile (mobile UA/layout)
   * proxy (adds proxy-like headers and random IP-ish headers)
3. Return results on first success; otherwise raise an error.
## Jupyter note
Jupyter already runs an event loop.
This library uses `nest_asyncio` internally so you can call `scrape_*()` without `await`.
If you still see loop errors, restart the kernel and try again.
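For the curious, the pattern the library relies on looks roughly like this (a sketch, not the actual `core.py` code): in a plain script no loop is running and `asyncio.run` works directly; inside Jupyter a loop is already running, so `nest_asyncio` patches it to allow re-entry.

```python
import asyncio

async def _scrape():
    await asyncio.sleep(0)   # placeholder for the real Playwright work
    return ["Story A"]

def scrape_sync():
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No loop running (plain script): just run the coroutine.
        return asyncio.run(_scrape())
    # A loop is already running (e.g. Jupyter): patch it so
    # run_until_complete can be called re-entrantly.
    import nest_asyncio
    nest_asyncio.apply()
    return loop.run_until_complete(_scrape())

print(scrape_sync())
```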
## Running tests
```bash
pytest -q
```
The basic test runs `examples/simple_usage.py` and checks it finishes without errors.
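A smoke test of that shape can be written with `subprocess`; this sketch uses a small helper (the `test_example_runs` body in the comment is an assumption about `tests/test_core.py`, not its actual contents):

```python
import subprocess
import sys

def run_script(args):
    """Run the current Python interpreter with the given args in a subprocess."""
    return subprocess.run(
        [sys.executable, *args],
        capture_output=True, text=True, timeout=120,
    )

# In tests/test_core.py the smoke test would be roughly:
# def test_example_runs():
#     proc = run_script(["examples/simple_usage.py"])
#     assert proc.returncode == 0, proc.stderr

# Quick self-check with an inline script instead of the example file:
proc = run_script(["-c", "print('ok')"])
print(proc.returncode)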
## Troubleshooting
* “Playwright browser not installed”
* Run: `playwright install chromium` (or `firefox` if you use the Firefox strategy).
* Empty results
* Your selector may not match. Test it in DevTools.
* Some sites have strong bot protection. Slow down, try again later, or provide your own proxies/cookies.
* Import errors
* The package folder must be `playwright_simple_scraper/` (with `__init__.py`) and you should install with `pip install -e ".[dev]"` from the project root.
* Timeouts / hanging
* Sites can be slow or blocked. The strategies use waits between \~15–45s. If needed, adjust timeouts in `strategies/*.py`.
## FAQ
* Does it respect robots.txt?
* Not yet. The flags exist but are not implemented.
* Can I change headless mode?
* Currently it runs headless by default. You can change it in `core.py` (look for `_run_sync(..., True)`).
* Which browsers are used?
* Chromium by default, plus a Firefox strategy.
* Can I set a custom timeout or headers?
* Not via public API yet. You can tweak each strategy in `playwright_simple_scraper/strategies/`.
## Project layout
```
playwright-simple-scraper/
├─ playwright_simple_scraper/
│ ├─ __init__.py
│ ├─ core.py
│ ├─ browser.py
│ ├─ model.py
│ ├─ utils.py
│ └─ strategies/
│ ├─ stealth.py
│ ├─ human_like.py
│ ├─ diff_browser.py
│ ├─ mobile.py
│ └─ proxy.py
├─ examples/
│ └─ simple_usage.py
├─ tests/
│ └─ test_core.py
├─ pyproject.toml
└─ README.md
```
## License
MIT
## Notes & ethics
* Follow each site’s Terms of Service and local laws.
* Keep request rates polite.
* Do not use this tool to harm services or violate privacy.
Raw data
{
"_id": null,
"home_page": null,
"name": "playwright-simple-scraper",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "playwright, scraping, web-scraping, automation, crawler",
"author": null,
"author_email": "elecbrandy <elecbrandy@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/33/3c/b11a34572e11a74d3f15e79ff8fb6de6e1e5c3e4640330e66699a1928a4b/playwright_simple_scraper-0.1.5.dev0.tar.gz",
"platform": null,
"description": "# playwright-simple-scraper\n\nA tiny scraper built on Playwright.\nGive it a URL and a CSS selector \u2192 get back the texts or the links that match.\n\nIt tries a few safe strategies (stealth, human-like, Firefox, mobile, proxy-like headers) and returns as soon as one works.\n\n<br>\n\n## Requirements\n- Python 3.9+\n- Playwright package\n- At least one Playwright browser installed (we\u2019ll install Chromium below)\n\n<br>\n\n## Install (local dev)\n\nFrom the project root (where `pyproject.toml` is):\n\n``` bash\n# (optional) create a virtual env\npython3 -m venv .venv\nsource .venv/bin/activate # on Windows: .venv\\Scripts\\activate\n\n# install your package in editable mode + dev tools\npip install -e \".[dev]\"\n\n# download a browser once (required by Playwright)\nplaywright install chromium\n# (optional) also install: playwright install firefox\n```\n\n<br>\n\n## Quick start\n\n``` python\nfrom playwright_simple_scraper import scrape_context, scrape_href\n\n# 1) Get inner text of matched elements\nres1 = scrape_context(\n \"https://news.ycombinator.com\",\n \".athing .titleline > a\"\n)\n\n# 2) Get href attributes of matched elements\nres2 = scrape_href(\n \"https://news.ycombinator.com\",\n \".athing .titleline > a\"\n)\n```\n\nOr run the example script:\n\n```bash\npython examples/simple_usage.py\n```\n\n## What the functions return\n\nBoth functions return a `ScrapeResult` dataclass:\n\n```python\n@dataclass\nclass ScrapeResult:\n url: str\n selector: str\n result: List[str] # your texts or hrefs\n count: int # len(result)\n fetched_at: datetime # UTC timestamp\n\n def first(self) -> Optional[str]: ...\n def to_dict(self) -> dict: ...\n)\n```\n\n* `scrape_context(url, selector, respect_robots=True, user_agent=\"*\")`\n\n * Returns texts (innerText) of the matched elements.\n* `scrape_href(url, selector, respect_robots=True, user_agent=\"*\")`\n\n * Returns the `href` attribute of the matched elements.\n\nNote: `respect_robots` and 
`user_agent` are placeholders for now (not implemented yet).\n\n## CSS selector tips\n\n* Start simple (e.g., `h1`, `a.article-link`, `#main .title > a`).\n* If nothing returns, check the page in DevTools and try a different selector.\n* Many sites load content late; the scraper already waits, but strong bot protection may still block you.\n\n## How it works (short)\n\n1. Validate inputs.\n2. Try strategies in this order until one works:\n\n * stealth (hides automation hints, blocks heavy assets, light human moves)\n * human\\_like (slower loads, extra waits, human scroll/click)\n * diff\\_browser (Firefox)\n * mobile (mobile UA/layout)\n * proxy (adds proxy-like headers and random IP-ish headers)\n3. Return results on first success; otherwise raise an error.\n\n## Jupyter note\n\nJupyter already runs an event loop.\nThis library uses `nest_asyncio` internally so you can call `scrape_*()` without `await`.\nIf you still see loop errors, restart the kernel and try again.\n\n## Running tests\n\n```bash\npytest -q\n```\n\nThe basic test runs `examples/simple_usage.py` and checks it finishes without errors.\n\n## Troubleshooting\n\n* \u201cPlaywright browser not installed\u201d\n\n * Run: `playwright install chromium` (or `firefox` if you use the Firefox strategy).\n* Empty results\n\n * Your selector may not match. Test it in DevTools.\n * Some sites have strong bot protection. Slow down, try again later, or provide your own proxies/cookies.\n* Import errors\n\n * The package folder must be `playwright_simple_scraper/` (with `__init__.py`) and you should install with `pip install -e \".[dev]\"` from the project root.\n* Timeouts / hanging\n\n * Sites can be slow or blocked. The strategies use waits between \\~15\u201345s. If needed, adjust timeouts in `strategies/*.py`.\n\n## FAQ\n\n* Does it respect robots.txt?\n\n * Not yet. The flags exist but are not implemented.\n* Can I change headless mode?\n\n * Currently it runs headless by default. 
You can change it in `core.py` (look for `_run_sync(..., True)`).\n* Which browsers are used?\n\n * Chromium by default, plus a Firefox strategy.\n* Can I set a custom timeout or headers?\n\n * Not via public API yet. You can tweak each strategy in `playwright_simple_scraper/strategies/`.\n\n## Project layout\n\n```\nplaywright-simple-scraper/\n\u251c\u2500 playwright_simple_scraper/\n\u2502 \u251c\u2500 __init__.py\n\u2502 \u251c\u2500 core.py\n\u2502 \u251c\u2500 browser.py\n\u2502 \u251c\u2500 model.py\n\u2502 \u251c\u2500 utils.py\n\u2502 \u2514\u2500 strategies/\n\u2502 \u251c\u2500 stealth.py\n\u2502 \u251c\u2500 human_like.py\n\u2502 \u251c\u2500 diff_browser.py\n\u2502 \u251c\u2500 mobile.py\n\u2502 \u2514\u2500 proxy.py\n\u251c\u2500 examples/\n\u2502 \u2514\u2500 simple_usage.py\n\u251c\u2500 tests/\n\u2502 \u2514\u2500 test_core.py\n\u251c\u2500 pyproject.toml\n\u2514\u2500 README.md\n```\n\n## License\n\nMIT\n\n## Notes & ethics\n\n* Follow each site\u2019s Terms of Service and local laws.\n* Keep request rates polite.\n* Do not use this tool to harm services or violate privacy.\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "A simple Playwright-based scraper that has simple result (based on List[str])",
"version": "0.1.5.dev0",
"project_urls": {
"Homepage": "https://github.com/elecbrandy/playwright-simple-scraper",
"Issues": "https://github.com/elecbrandy/playwright-simple-scraper/issues",
"Repository": "https://github.com/elecbrandy/playwright-simple-scraper"
},
"split_keywords": [
"playwright",
" scraping",
" web-scraping",
" automation",
" crawler"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "436570f73cf9ec8709c9eb39d4a5a52db561ef899249dc2eb6e9b12c88400dff",
"md5": "7a21d58c6cb59c158319972de98bc6ab",
"sha256": "7fc923be9594ba418040c0c2ea3c13844a5d897a33aa71eb8590a9790f573d68"
},
"downloads": -1,
"filename": "playwright_simple_scraper-0.1.5.dev0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "7a21d58c6cb59c158319972de98bc6ab",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 11673,
"upload_time": "2025-08-11T14:37:06",
"upload_time_iso_8601": "2025-08-11T14:37:06.516101Z",
"url": "https://files.pythonhosted.org/packages/43/65/70f73cf9ec8709c9eb39d4a5a52db561ef899249dc2eb6e9b12c88400dff/playwright_simple_scraper-0.1.5.dev0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "333cb11a34572e11a74d3f15e79ff8fb6de6e1e5c3e4640330e66699a1928a4b",
"md5": "b530f1590765746c1a54d771e8fc3172",
"sha256": "ee86283eb9b178bb9c4154521c4f17437f3ccbe52d06200fc64306c9a4b6629e"
},
"downloads": -1,
"filename": "playwright_simple_scraper-0.1.5.dev0.tar.gz",
"has_sig": false,
"md5_digest": "b530f1590765746c1a54d771e8fc3172",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 17438,
"upload_time": "2025-08-11T14:37:07",
"upload_time_iso_8601": "2025-08-11T14:37:07.827219Z",
"url": "https://files.pythonhosted.org/packages/33/3c/b11a34572e11a74d3f15e79ff8fb6de6e1e5c3e4640330e66699a1928a4b/playwright_simple_scraper-0.1.5.dev0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-11 14:37:07",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "elecbrandy",
"github_project": "playwright-simple-scraper",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "playwright-simple-scraper"
}