# pw-simple-scraper
[PyPI](https://pypi.org/project/pw-simple-scraper/)
[Python Versions](https://pypi.org/project/pw-simple-scraper/)
[License](#license)
<br>
> **‼️ Forget the hassle of creating browsers or setting headers. Just focus on scraping ‼️**
<br>
## Table of Contents
- [1. Main Features](#1-main-features)
- [2. Installation](#2-installation)
- [3. How to Use](#3-how-to-use)
- [4. Examples](#4-examples)
- [5. Playwright Method Reference](#5-playwright-method-reference)
- [6. FAQ](#6-faq)
<br>
<br>
## 1. Main Features
- A scraper library built on top of [Playwright](https://playwright.dev).
- Automatically manages the lifecycle of browsers and pages with `async with`.
- Returns Playwright objects, so you can use **all the powerful Playwright features** as they are.
- ⚡️ Fast ⚡️
<br>
<br>
## 2. Installation
``` bash
# 1. Install Playwright
pip install playwright
# 2-1. Install Chromium (macOS / Windows)
python -m playwright install chromium
# 2-2. Install Chromium (Linux)
python -m playwright install --with-deps chromium
# 3. Install pw-simple-scraper
pip install pw-simple-scraper
```
- Since this scraper is based on `Playwright`, you need both the `Playwright` library and the `Chromium` browser.
<br>
<br>
## 3. How to Use
> Not sure how to handle the `Page` object returned by `get_page`? -> [Playwright Method Reference](#5-playwright-method-reference)
<br>
1. `async with PlaywrightScraper() as scraper:`
   Create an instance of the scraper.
2. `async with scraper.get_page("http://www.example.com/") as page:`
   Get a page context using the `get_page` method.
3. Now you can use all of Playwright's features directly on `page`.
<br>
#### 🖥️ Code Example
``` python
import asyncio
from pw_simple_scraper import PlaywrightScraper

async def main():
    # Create scraper instance
    async with PlaywrightScraper() as scraper:
        # Get page context
        async with scraper.get_page("http://www.example.com/") as page:
            # >>>> Use `page` in this block! <<<<
            print(await page.title())

if __name__ == "__main__":
    asyncio.run(main())
```
<br>
<br>
## 4. Examples
> Not sure how to handle the `Page` object returned by `get_page`? -> [Playwright Method Reference](#5-playwright-method-reference)
<br>
### 4-1. Extract title / text / attributes
#### 🖥️ Code Example
``` python
import asyncio
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://quotes.toscrape.com/") as page:
            title = await page.title()
            first_quote = await page.locator("span.text").first.text_content()
            quotes = await page.locator("span.text").all_text_contents()
            first_author_link = await page.locator(".quote a").first.get_attribute("href")

            print("Page Title:", title)
            print("First Quote:", first_quote)
            print("Quote List (first 3):", quotes[:3])
            print("First Author Link:", first_author_link)

if __name__ == "__main__":
    asyncio.run(main())
```
<br>
#### ⬇️ Example Output
``` bash
Page Title: Quotes to Scrape
First Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Quote List (first 3): ['“The world as we have created it is a process of our thinking...', '“It is our choices, Harry, that show what we truly are...', '“There are only two ways to live your life...']
First Author Link: /author/Albert-Einstein
```
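Since `all_text_contents()` returns a plain Python list, pairing parallel selectors (e.g. `span.text` and `small.author` from the same page) is ordinary list work. A minimal sketch with hypothetical stand-in data in place of real scrape results:

```python
# Stand-ins for `all_text_contents()` results from two parallel selectors
quotes = [
    "“The world as we have created it is a process of our thinking.”",
    "“It is our choices, Harry, that show what we truly are.”",
]
authors = ["Albert Einstein", "J.K. Rowling"]

# zip the parallel lists into a list of records
records = [{"quote": q, "author": a} for q, a in zip(quotes, authors)]
print(records[0]["author"])  # Albert Einstein
```

This works because the selectors match elements in the same document order, so index `i` of each list refers to the same quote card.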
<br>
<br>
### 4-2. Images & links — collect absolute paths
#### 🖥️ Code Example
```python
import asyncio
from urllib.parse import urljoin
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://books.toscrape.com/") as page:
            img_urls = await page.locator("article.product_pod img").evaluate_all(
                "els => els.map(el => el.getAttribute('src'))"
            )
            abs_imgs = [urljoin(page.url, u) for u in img_urls if u]

            book_urls = await page.locator("article.product_pod h3 a").evaluate_all(
                "els => els.map(el => el.getAttribute('href'))"
            )
            abs_books = [urljoin(page.url, u) for u in book_urls if u]

            print("Image URLs (5):", abs_imgs[:5])
            print("Book Links (5):", abs_books[:5])

if __name__ == "__main__":
    asyncio.run(main())
```
<br>
#### ⬇️ Example Output
``` bash
Image URLs (5): [
'https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
'https://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg',
'https://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg',
'https://books.toscrape.com/media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg',
'https://books.toscrape.com/media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg'
]
Book Links (5): [
'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
'https://books.toscrape.com/catalogue/soumission_998/index.html',
'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html'
]
```
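The example resolves scraped `src`/`href` values with `urljoin` because `getAttribute` returns them exactly as written in the HTML, which is usually a relative path. A quick standalone illustration (the file names are made up):

```python
from urllib.parse import urljoin

base = "https://books.toscrape.com/"

# A relative path, as getAttribute('src') typically returns it
print(urljoin(base, "media/cache/2c/da/example.jpg"))
# → https://books.toscrape.com/media/cache/2c/da/example.jpg

# An already-absolute URL passes through unchanged
print(urljoin(base, "https://cdn.example.com/x.png"))
# → https://cdn.example.com/x.png
```

Using `page.url` as the base (rather than a hard-coded string) keeps the join correct even after redirects.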
<br>
<br>
### 4-3. Evaluate JSON — convert DOM to JSON
#### 🖥️ Code Example
```python
import asyncio
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://books.toscrape.com/") as page:
            cards = page.locator("article.product_pod")
            items = await cards.evaluate_all("""
                els => els.map(el => ({
                    title: el.querySelector("h3 a")?.getAttribute("title"),
                    price: el.querySelector(".price_color")?.innerText.trim(),
                    inStock: !!el.querySelector(".instock.availability"),
                }))
            """)
            print(items[:5])

if __name__ == "__main__":
    asyncio.run(main())
```
<br>
#### ⬇️ Example Output
``` bash
# (split across lines for readability; print emits it on one line)
[
    {'title': 'A Light in the Attic', 'price': '£51.77', 'inStock': True},
    {'title': 'Tipping the Velvet', 'price': '£53.74', 'inStock': True},
    {'title': 'Soumission', 'price': '£50.10', 'inStock': True},
    {'title': 'Sharp Objects', 'price': '£47.82', 'inStock': True},
    {'title': 'Sapiens: A Brief History of Humankind', 'price': '£54.23', 'inStock': True}
]
```
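Because `evaluate_all` hands the JSON-serializable result back as plain Python objects, post-processing needs no Playwright at all. A sketch that converts the price strings to floats (the sample records mirror the shape above; `parse_price` is a hypothetical helper, not part of this library):

```python
# Sample records shaped like the `items` list in the example above
items = [
    {"title": "A Light in the Attic", "price": "£51.77", "inStock": True},
    {"title": "Soumission", "price": "£50.10", "inStock": True},
]

def parse_price(text: str) -> float:
    """Drop a leading currency symbol and parse the rest as a float."""
    return float(text.lstrip("£$€").strip())

prices = [parse_price(item["price"]) for item in items]
print(prices)  # [51.77, 50.1]
```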
<br>
<br>
## 5. Playwright Method Reference
- If you’re not sure how to handle the `Page` object returned by `get_page`, check the table below.
- 🚨 **Note**
  - HTML attribute: `<input value="default">` → `get_attribute('value')` always returns `"default"`
  - JS property: `input.value` → changes to `"user input"` as the user types
| Category | Method | Description | Notes / Comparison |
| ------------------- | ----------------------------- | ------------------------------------------------- | ------------------------------------------ |
| **Text** | `all_text_contents()` | Returns a list of text from **all elements** | Similar to `all_inner_texts()` |
| | `text_content()` | Returns `textContent` of the **first element** | Uses `textContent` (includes hidden text), not `innerText` |
| | `inner_text()` | Returns rendered (visible) text of the first element | Uses `innerText`; skips hidden text |
| | `all_inner_texts()` | List of visible text from all elements | Similar to `all_text_contents()` |
| **Attribute** | `get_attribute('attr')` | Returns HTML attribute (`href`, `src`, `class`) | Static, as written in HTML |
| **Property** | `get_property('prop')` | Returns live DOM property (`value`, `checked`) | Useful for dynamic state |
| **HTML / Value** | `inner_html()` | Returns **inner HTML** of element | Only inside structure |
| | `evaluate("el => el.outerHTML")` | Returns the element's full HTML | Includes the element itself; there is no built-in `outer_html()` |
| | `input_value()` | Returns current value of form elements | More accurate than `get_attribute('value')`|
| | `select_option()` | Selects `<option>`s in a `<select>` | Returns the list of selected values |
| **State (Boolean)** | `is_visible()` | Is element visible | True/False |
| | `is_hidden()` | Is element hidden | True/False |
| | `is_enabled()` | Is element enabled (clickable) | True/False |
| | `is_disabled()` | Is element disabled | True/False |
| | `is_editable()` | Is element editable | True/False |
| | `is_checked()` | Is checkbox/radio checked | True/False |
| **Advanced** | `evaluate("JS func", arg)` | Runs JS on first element | Flexible extraction |
| | `evaluate_all("JS func", arg)`| Runs JS on all elements, returns list | Useful for batch data |
<br>
<br>
## 6. FAQ
- **Browser launch error after install**
  - Install the browser first: `python -m playwright install chromium` (on Linux, add `--with-deps`).
- **Doesn’t work on some URLs**
  - Please open a GitHub issue so we can take a look.
<br>
<br>