pw-simple-scraper

Name: pw-simple-scraper
Version: 0.1.4
Summary: A simple and light Playwright-based scraper
Upload time: 2025-10-12 12:51:27
Requires Python: >=3.9
License: MIT
Keywords: playwright, scraping, web-scraping, automation, crawler
# pw-simple-scraper

[![PyPI](https://img.shields.io/pypi/v/pw-simple-scraper.svg)](https://pypi.org/project/pw-simple-scraper/)
[![Python](https://img.shields.io/pypi/pyversions/pw-simple-scraper.svg)](https://pypi.org/project/pw-simple-scraper/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](#license)

<br>

> **‼️ Forget the hassle of creating browsers or setting headers. Just focus on scraping ‼️**

<br>

## Table of Contents
- [1. Main Features](#1-main-features)
- [2. Installation](#2-installation)
- [3. How to Use](#3-how-to-use)
- [4. Examples](#4-examples)
- [5. Playwright Method Reference](#5-playwright-method-reference)
- [6. FAQ](#6-faq)

<br>
<br>

## 1. Main Features
- A scraper library built on top of [Playwright](https://playwright.dev).
- Automatically manages the lifecycle of browsers and pages with `async with`.
- Returns Playwright objects, so you can use **all the powerful Playwright features** as they are.
- ⚡️ Fast ⚡️

<br>
<br>

## 2. Installation

``` bash
# 1. Install Playwright
pip install playwright

# 2-1. Install Chromium (macOS / Windows)
python -m playwright install chromium

# 2-2. Install Chromium (Linux)
python -m playwright install --with-deps chromium

# 3. Install pw-simple-scraper
pip install pw-simple-scraper
```

- Since this scraper is based on `Playwright`, you need both the `Playwright` library and the `Chromium` browser.

<br>
<br>

## 3. How to Use

> Not sure how to handle the `Page` object returned by `get_page`? -> [Playwright Method Reference](#5-playwright-method-reference)

<br>

1. `async with PlaywrightScraper() as scraper`  
   Create an instance of the scraper.
2. `async with scraper.get_page("http://www.example.com/") as page:`  
   Get a page context using the `get_page` method.
3. Now you can directly use all the Playwright features on `page`.

<br>

#### 🖥️ Code Example
``` python
import asyncio
from pw_simple_scraper import PlaywrightScraper

async def main():
    # Create scraper instance
    async with PlaywrightScraper() as scraper:
        # Get page context
        async with scraper.get_page("http://www.example.com/") as page:
            # >>>> Use `page` in this block! <<<<
            print(await page.title())

if __name__ == "__main__":
    asyncio.run(main())
```

<br>
<br>

## 4. Examples

> Not sure how to handle the `Page` object returned by `get_page`? -> [Playwright Method Reference](#5-playwright-method-reference)

<br>

### 4-1. Extract title / text / attributes

#### 🖥️ Code Example

``` python
import asyncio
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://quotes.toscrape.com/") as page:
            title = await page.title()
            first_quote = await page.locator("span.text").first.text_content()
            quotes = await page.locator("span.text").all_text_contents()
            first_author_link = await page.locator(".quote a").first.get_attribute("href")

            print("Page Title:", title)
            print("First Quote:", first_quote)
            print("Quote List (first 3):", quotes[:3])
            print("First Author Link:", first_author_link)

if __name__ == "__main__":
    asyncio.run(main())
```

<br>

#### ⬇️ Example Output

``` bash
Page Title: Quotes to Scrape
First Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Quote List (first 3): ["The world as we have created it is a process of our thinking...", "It is our choices, Harry, that show what we truly are...", "There are only two ways to live your life..."]
First Author Link: /author/Albert-Einstein
```

<br>
<br>

### 4-2. Images & links — collect absolute paths

#### 🖥️ Code Example

```python
import asyncio
from urllib.parse import urljoin
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://books.toscrape.com/") as page:
            img_urls = await page.locator("article.product_pod img").evaluate_all(
                "els => els.map(el => el.getAttribute('src'))"
            )
            abs_imgs = [urljoin(page.url, u) for u in img_urls if u]

            book_urls = await page.locator("article.product_pod h3 a").evaluate_all(
                "els => els.map(el => el.getAttribute('href'))"
            )
            abs_books = [urljoin(page.url, u) for u in book_urls if u]

            print("Image URLs (5):", abs_imgs[:5])
            print("Book Links (5):", abs_books[:5])

if __name__ == "__main__":
    asyncio.run(main())
```
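The `urljoin` calls above are plain standard-library behavior, independent of Playwright. A stdlib-only sketch of how relative `src`/`href` values resolve against a page URL (the base URL here stands in for `page.url`):

```python
from urllib.parse import urljoin

base = "https://books.toscrape.com/"  # stands in for page.url

# A relative path resolves against the base URL
print(urljoin(base, "media/cache/2c/da/example.jpg"))
# → https://books.toscrape.com/media/cache/2c/da/example.jpg

# An already-absolute URL passes through unchanged
print(urljoin(base, "https://example.com/a.png"))
# → https://example.com/a.png
```

This is why the list comprehension can mix relative and absolute values without special-casing either.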

<br>

#### ⬇️ Example Output

``` bash
Image URLs (5): [
  'https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
  'https://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg',
  'https://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg',
  'https://books.toscrape.com/media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg',
  'https://books.toscrape.com/media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg'
]
Book Links (5): [
  'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
  'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
  'https://books.toscrape.com/catalogue/soumission_998/index.html',
  'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
  'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html'
]
```

<br>
<br>

### 4-3. Evaluate JSON — convert DOM to JSON

#### 🖥️ Code Example

```python
import asyncio
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://books.toscrape.com/") as page:
            cards = page.locator("article.product_pod")
            items = await cards.evaluate_all("""
                els => els.map(el => ({
                    title: el.querySelector("h3 a")?.getAttribute("title"),
                    price: el.querySelector(".price_color")?.innerText.trim(),
                    inStock: !!el.querySelector(".instock.availability"),
                }))
            """)
            print(items[:5])

if __name__ == "__main__":
    asyncio.run(main())
```
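Because `evaluate_all` returns plain Python dicts, everything after it is ordinary Python. A hedged, stdlib-only sketch of post-processing (the field names match the example above; `parse_price` is a hypothetical helper, not part of this library):

```python
from decimal import Decimal

def parse_price(price: str) -> Decimal:
    """Strip a leading currency symbol (e.g. '£') and parse the number."""
    return Decimal(price.lstrip("£$€"))

# Sample of what `evaluate_all` might return for the page above
items = [
    {"title": "A Light in the Attic", "price": "£51.77", "inStock": True},
    {"title": "Soumission", "price": "£50.10", "inStock": True},
]

total = sum(parse_price(i["price"]) for i in items)
print(total)  # → 101.87
```

`Decimal` avoids the rounding surprises of `float` when summing prices.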

<br>

#### ⬇️ Example Output

``` bash
[
  {"title": "A Light in the Attic", "price": "£51.77", "inStock": true},
  {"title": "Tipping the Velvet", "price": "£53.74", "inStock": true},
  {"title": "Soumission", "price": "£50.10", "inStock": true},
  {"title": "Sharp Objects", "price": "£47.82", "inStock": true},
  {"title": "Sapiens: A Brief History of Humankind", "price": "£54.23", "inStock": true}
]
```
<br>
<br>

## 5. Playwright Method Reference

- If you’re not sure how to handle the `Page` object returned by `get_page`, check the table below.
- 🚨 **Note**
    - HTML attribute: on `<input value="default">`, `get_attribute('value')` always returns `"default"`, even after the user types
    - JS property: `input.value` reflects the live state, e.g. `"user input"` after typing

| Category            | Method                        | Description                                       | Notes / Comparison                         |
| ------------------- | ----------------------------- | ------------------------------------------------- | ------------------------------------------ |
| **Text**            | `all_text_contents()`         | Returns a list of text from **all elements**      | Similar to `all_inner_texts()`             |
|                     | `text_content()`              | Returns `textContent` of the **first element**    | Includes hidden text                       |
|                     | `inner_text()`                | Returns rendered (visible) text of first element  | `innerText`; not the same as `text_content()` |
|                     | `all_inner_texts()`           | List of visible text from all elements            | Similar to `all_text_contents()`           |
| **Attribute**       | `get_attribute('attr')`       | Returns HTML attribute (`href`, `src`, `class`)   | Static, as written in HTML                 |
| **Property**        | `get_property('prop')`        | Returns live DOM property (`value`, `checked`)    | `ElementHandle` method; with a `Locator`, use `evaluate()` |
| **HTML / Value**    | `inner_html()`                | Returns **inner HTML** of element                 | Only inside structure                      |
|                     | `evaluate("el => el.outerHTML")` | Returns element’s full HTML                    | No built-in `outer_html()`; includes the element itself |
|                     | `input_value()`               | Returns current value of form elements            | More accurate than `get_attribute('value')`|
|                     | `select_option()`             | Selects `<option>`(s) in a `<select>`             | Action; returns the selected values        |
| **State (Boolean)** | `is_visible()`                | Is element visible                                | True/False                                 |
|                     | `is_hidden()`                 | Is element hidden                                 | True/False                                 |
|                     | `is_enabled()`                | Is element enabled (clickable)                    | True/False                                 |
|                     | `is_disabled()`               | Is element disabled                               | True/False                                 |
|                     | `is_editable()`               | Is element editable                               | True/False                                 |
|                     | `is_checked()`                | Is checkbox/radio checked                         | True/False                                 |
| **Advanced**        | `evaluate("JS func", arg)`    | Runs JS on first element                          | Flexible extraction                        |
|                     | `evaluate_all("JS func", arg)`| Runs JS on all elements, returns list             | Useful for batch data                      |

<br>
<br>

## 6. FAQ

- **Browser launch error after install**
    - You must install the browser with:  
`python -m playwright install chromium` (on Linux: `python -m playwright install --with-deps chromium`)

- **Doesn’t work on some URLs**
    - Please open a GitHub issue so we can check.

<br>
<br>

            
