stepwright

Name: stepwright
Version: 0.1.3
Home page: https://github.com/lablnet/stepwright
Summary: A powerful web scraping library built with Playwright that provides a declarative, step-by-step approach to web automation and data extraction
Upload time: 2025-10-24 05:16:57
Author: Muhammad Umer Farooq
Requires Python: >=3.8
License: MIT
Keywords: web scraping, playwright, automation, data extraction, web automation
Requirements: playwright (>=1.40.0)

# StepWright

A powerful web scraping library built with Playwright that provides a declarative, step-by-step approach to web automation and data extraction.

## Features

- 🚀 **Declarative Scraping**: Define scraping workflows using Python dictionaries or dataclasses
- 🔄 **Pagination Support**: Built-in support for next button and scroll-based pagination
- 📊 **Data Collection**: Extract text, HTML, values, and files from web pages
- 🔗 **Multi-tab Support**: Handle multiple tabs and complex navigation flows
- 📄 **PDF Generation**: Save pages as PDFs or trigger print-to-PDF actions
- 📥 **File Downloads**: Download files with automatic directory creation
- 🔁 **Looping & Iteration**: ForEach loops for processing multiple elements
- 📡 **Streaming Results**: Real-time result processing with callbacks
- 🎯 **Error Handling**: Graceful error handling with configurable termination
- 🔧 **Flexible Selectors**: Support for ID, class, tag, and XPath selectors

## Installation

```bash
# Using pip
pip install stepwright

# Using pip with development dependencies
pip install "stepwright[dev]"

# From source
git clone https://github.com/lablnet/stepwright.git
cd stepwright
pip install -e .
```
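
After installing the package, Playwright also needs its browser binaries; if you have not installed them before, run `playwright install chromium` (the same command used in the development setup below).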

## Quick Start

### Basic Usage

```python
import asyncio
from stepwright import run_scraper, TabTemplate, BaseStep

async def main():
    templates = [
        TabTemplate(
            tab="example",
            steps=[
                BaseStep(
                    id="navigate",
                    action="navigate",
                    value="https://example.com"
                ),
                BaseStep(
                    id="get_title",
                    action="data",
                    object_type="tag",
                    object="h1",
                    key="title",
                    data_type="text"
                )
            ]
        )
    ]

    results = await run_scraper(templates)
    print(results)

if __name__ == "__main__":
    asyncio.run(main())
```
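
The feature list also mentions defining workflows with plain Python dictionaries. A minimal sketch of the same workflow in dict form, converted explicitly to the dataclasses (the dict keys simply mirror the dataclass field names documented below; whether `run_scraper` also accepts raw dicts directly is not verified here):

```python
from stepwright import run_scraper, TabTemplate, BaseStep

# Workflow described as a plain dict; keys mirror the TabTemplate/BaseStep fields.
workflow = {
    "tab": "example",
    "steps": [
        {"id": "navigate", "action": "navigate", "value": "https://example.com"},
        {"id": "get_title", "action": "data", "object_type": "tag",
         "object": "h1", "key": "title", "data_type": "text"},
    ],
}

# Building the dataclasses explicitly always works, regardless of whether
# the runner accepts raw dicts.
template = TabTemplate(
    tab=workflow["tab"],
    steps=[BaseStep(**step) for step in workflow["steps"]],
)

results = await run_scraper([template])  # inside an async function, as above
```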

## API Reference

### Core Functions

#### `run_scraper(templates, options=None)`

Main function to execute scraping templates.

**Parameters:**
- `templates`: List of `TabTemplate` objects
- `options`: Optional `RunOptions` object

**Returns:** `List[Dict[str, Any]]`

```python
results = await run_scraper(templates, RunOptions(
    browser={"headless": True}
))
```

#### `run_scraper_with_callback(templates, on_result, options=None)`

Execute scraping with streaming results via callback.

**Parameters:**
- `templates`: List of `TabTemplate` objects
- `on_result`: Callback function for each result (can be sync or async)
- `options`: Optional `RunOptions` object

```python
async def process_result(result, index):
    print(f"Result {index}: {result}")

await run_scraper_with_callback(templates, process_result)
```
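
Because the callback may also be synchronous, a plain function works as well; a minimal sketch that simply collects results into a list:

```python
collected = []

def collect(result, index):
    # Synchronous callback: accumulate results for later processing
    collected.append(result)

await run_scraper_with_callback(templates, collect)
```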

### Types

#### `TabTemplate`

```python
@dataclass
class TabTemplate:
    tab: str
    initSteps: Optional[List[BaseStep]] = None      # Steps executed once before pagination
    perPageSteps: Optional[List[BaseStep]] = None   # Steps executed for each page
    steps: Optional[List[BaseStep]] = None          # Single steps array
    pagination: Optional[PaginationConfig] = None
```

#### `BaseStep`

```python
@dataclass
class BaseStep:
    id: str
    description: Optional[str] = None
    object_type: Optional[SelectorType] = None  # 'id' | 'class' | 'tag' | 'xpath'
    object: Optional[str] = None
    action: Literal[
        "navigate", "input", "click", "data", "scroll", 
        "eventBaseDownload", "foreach", "open", "savePDF", 
        "printToPDF", "downloadPDF", "downloadFile"
    ] = "navigate"
    value: Optional[str] = None
    key: Optional[str] = None
    data_type: Optional[DataType] = None        # 'text' | 'html' | 'value' | 'default' | 'attribute'
    wait: Optional[int] = None
    terminateonerror: Optional[bool] = None
    subSteps: Optional[List["BaseStep"]] = None
    autoScroll: Optional[bool] = None
```

#### `RunOptions`

```python
@dataclass
class RunOptions:
    browser: Optional[dict] = None  # Playwright launch options
    onResult: Optional[Callable] = None
```

## Step Actions

### Navigate
Navigate to a URL.

```python
BaseStep(
    id="go_to_page",
    action="navigate",
    value="https://example.com"
)
```

### Input
Fill form fields.

```python
BaseStep(
    id="search",
    action="input",
    object_type="id",
    object="search-box",
    value="search term"
)
```

### Click
Click on elements.

```python
BaseStep(
    id="submit",
    action="click",
    object_type="class",
    object="submit-button"
)
```

### Data Extraction
Extract data from elements.

```python
BaseStep(
    id="get_title",
    action="data",
    object_type="tag",
    object="h1",
    key="title",
    data_type="text"
)
```
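
The other `data_type` values listed in `BaseStep` are requested the same way; for example, a sketch extracting an element's HTML instead of its text (the selector is illustrative):

```python
BaseStep(
    id="get_description_html",
    action="data",
    object_type="class",
    object="description",
    key="description_html",
    data_type="html"  # 'value', 'default', and 'attribute' are also listed in BaseStep
)
```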

### ForEach Loop
Process multiple elements.

```python
BaseStep(
    id="process_items",
    action="foreach",
    object_type="class",
    object="item",
    subSteps=[
        BaseStep(
            id="get_item_title",
            action="data",
            object_type="tag",
            object="h2",
            key="title",
            data_type="text"
        )
    ]
)
```

### File Operations

#### Event-Based Download
```python
BaseStep(
    id="download_file",
    action="eventBaseDownload",
    object_type="class",
    object="download-link",
    value="./downloads/file.pdf",
    key="downloaded_file"
)
```

#### Download PDF/File
```python
BaseStep(
    id="download_pdf",
    action="downloadPDF",
    object_type="class",
    object="pdf-link",
    value="./output/document.pdf",
    key="pdf_file"
)
```

#### Save PDF
```python
BaseStep(
    id="save_pdf",
    action="savePDF",
    value="./output/page.pdf",
    key="pdf_file"
)
```

## Pagination

### Next Button Pagination
```python
PaginationConfig(
    strategy="next",
    nextButton=NextButtonConfig(
        object_type="class",
        object="next-page",
        wait=2000
    ),
    maxPages=10
)
```

### Scroll Pagination
```python
PaginationConfig(
    strategy="scroll",
    scroll=ScrollConfig(
        offset=800,
        delay=1500
    ),
    maxPages=5
)
```
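
A hedged sketch of how scroll pagination might attach to a full template for an infinite-scroll listing, reusing `PaginationConfig` and `ScrollConfig` from the snippet above (site and selectors are illustrative):

```python
TabTemplate(
    tab="feed",
    initSteps=[
        BaseStep(id="open_feed", action="navigate", value="https://example.com/feed")
    ],
    perPageSteps=[
        BaseStep(
            id="collect_posts",
            action="foreach",
            object_type="class",
            object="post",
            subSteps=[
                BaseStep(id="post_text", action="data", object_type="tag",
                         object="p", key="text", data_type="text")
            ]
        )
    ],
    pagination=PaginationConfig(
        strategy="scroll",
        scroll=ScrollConfig(offset=800, delay=1500),
        maxPages=5
    )
)
```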

### Pagination Strategies

#### paginationFirst
Paginate first, then collect data from each page:

```python
TabTemplate(
    tab="news",
    initSteps=[...],
    perPageSteps=[...],  # Collect data from each page
    pagination=PaginationConfig(
        strategy="next",
        nextButton=NextButtonConfig(...),
        paginationFirst=True  # Go to next page before collecting
    )
)
```

#### paginateAllFirst
Paginate through all pages first, then collect all data at once:

```python
TabTemplate(
    tab="articles",
    initSteps=[...],
    perPageSteps=[...],  # Collect all data after all pagination
    pagination=PaginationConfig(
        strategy="next",
        nextButton=NextButtonConfig(...),
        paginateAllFirst=True  # Load all pages first
    )
)
```

## Advanced Features

### Proxy Support
```python
from stepwright import run_scraper, RunOptions

results = await run_scraper(templates, RunOptions(
    browser={
        "proxy": {
            "server": "http://proxy-server:8080",
            "username": "user",
            "password": "pass"
        }
    }
))
```

### Custom Browser Options
```python
results = await run_scraper(templates, RunOptions(
    browser={
        "headless": False,
        "slow_mo": 1000,
        "args": ["--no-sandbox", "--disable-setuid-sandbox"]
    }
))
```

### Streaming Results
```python
async def process_result(result, index):
    print(f"Result {index}: {result}")
    # Process result immediately (e.g., save to database)
    await save_to_database(result)  # save_to_database is a user-defined coroutine (placeholder)

await run_scraper_with_callback(
    templates, 
    process_result,
    RunOptions(browser={"headless": True})
)
```

### Data Placeholders
Use collected data in subsequent steps:

```python
BaseStep(
    id="get_title",
    action="data",
    object_type="id",
    object="page-title",
    key="page_title",
    data_type="text"
),
BaseStep(
    id="save_with_title",
    action="savePDF",
    value="./output/{{page_title}}.pdf",  # Uses collected page_title
    key="pdf_file"
)
```

### Index Placeholders
Use loop index in foreach steps:

```python
BaseStep(
    id="process_items",
    action="foreach",
    object_type="class",
    object="item",
    subSteps=[
        BaseStep(
            id="save_item",
            action="savePDF",
            value="./output/item_{{i}}.pdf",      # i = 0, 1, 2, ...
            # or
            value="./output/item_{{i_plus1}}.pdf" # i_plus1 = 1, 2, 3, ...
        )
    ]
)
```

## Error Handling

Steps can be configured to terminate on error:

```python
BaseStep(
    id="critical_step",
    action="click",
    object_type="id",
    object="important-button",
    terminateonerror=True  # Stop execution if this fails
)
```

Without `terminateonerror=True`, errors are logged but execution continues.
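
This default is useful for optional interactions; a small sketch where a possibly-missing banner close button does not stop the later data step (selectors are illustrative):

```python
steps = [
    BaseStep(
        id="dismiss_banner",
        action="click",
        object_type="class",
        object="cookie-banner-close"  # may not exist; the error is logged and the run continues
    ),
    BaseStep(
        id="get_title",
        action="data",
        object_type="tag",
        object="h1",
        key="title",
        data_type="text"
    )
]
```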

## Complete Example

```python
import asyncio
from pathlib import Path
from stepwright import (
    run_scraper,
    TabTemplate,
    BaseStep,
    PaginationConfig,
    NextButtonConfig,
    RunOptions
)

async def main():
    templates = [
        TabTemplate(
            tab="news_scraper",
            initSteps=[
                BaseStep(
                    id="navigate",
                    action="navigate",
                    value="https://news-site.com"
                ),
                BaseStep(
                    id="search",
                    action="input",
                    object_type="id",
                    object="search-box",
                    value="technology"
                )
            ],
            perPageSteps=[
                BaseStep(
                    id="collect_articles",
                    action="foreach",
                    object_type="class",
                    object="article",
                    subSteps=[
                        BaseStep(
                            id="get_title",
                            action="data",
                            object_type="tag",
                            object="h2",
                            key="title",
                            data_type="text"
                        ),
                        BaseStep(
                            id="get_content",
                            action="data",
                            object_type="tag",
                            object="p",
                            key="content",
                            data_type="text"
                        ),
                        BaseStep(
                            id="get_link",
                            action="data",
                            object_type="tag",
                            object="a",
                            key="link",
                            data_type="value"
                        )
                    ]
                )
            ],
            pagination=PaginationConfig(
                strategy="next",
                nextButton=NextButtonConfig(
                    object_type="id",
                    object="next-page",
                    wait=2000
                ),
                maxPages=5
            )
        )
    ]

    # Run scraper
    results = await run_scraper(templates, RunOptions(
        browser={"headless": True}
    ))

    # Process results
    for i, article in enumerate(results):
        print(f"\nArticle {i + 1}:")
        print(f"Title: {article.get('title')}")
        print(f"Content: {article.get('content')[:100]}...")
        print(f"Link: {article.get('link')}")

if __name__ == "__main__":
    asyncio.run(main())
```

## Development

### Setup

```bash
# Clone repository
git clone https://github.com/lablnet/stepwright.git
cd stepwright

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Install Playwright browsers
playwright install chromium
```

### Running Tests

```bash
# Run all tests
pytest

# Run with verbose output
pytest -v

# Run specific test file
pytest tests/test_scraper.py

# Run specific test class
pytest tests/test_scraper.py::TestGetBrowser

# Run specific test
pytest tests/test_scraper.py::TestGetBrowser::test_create_browser_instance

# Run with coverage
pytest --cov=src --cov-report=html

# Run integration tests only
pytest tests/test_integration.py
```

### Project Structure

```
stepwright/
├── src/
│   ├── __init__.py
│   ├── step_types.py      # Type definitions and dataclasses
│   ├── helpers.py         # Utility functions
│   ├── executor.py        # Core step execution logic
│   ├── parser.py          # Public API (run_scraper)
│   ├── scraper.py         # Low-level browser automation
│   └── scraper_parser.py  # Backward compatibility
├── tests/
│   ├── __init__.py
│   ├── conftest.py        # Pytest configuration
│   ├── test_page.html     # Test HTML page
│   ├── test_scraper.py    # Core scraper tests
│   ├── test_parser.py     # Parser function tests
│   └── test_integration.py # Integration tests
├── pyproject.toml         # Package configuration
├── setup.py               # Setup script
├── pytest.ini             # Pytest configuration
├── README.md              # This file
└── README_TESTS.md        # Detailed test documentation
```

### Code Quality

```bash
# Format code with black
black src/ tests/

# Lint with flake8
flake8 src/ tests/

# Type checking with mypy
mypy src/
```

## Module Organization

The codebase follows separation of concerns:

- **step_types.py**: All type definitions (BaseStep, TabTemplate, etc.)
- **helpers.py**: Utility functions (placeholder replacement, locator creation)
- **executor.py**: Core execution logic (execute steps, handle pagination)
- **parser.py**: Public API (run_scraper, run_scraper_with_callback)
- **scraper.py**: Low-level Playwright wrapper (navigate, click, get_data)
- **scraper_parser.py**: Backward compatibility wrapper

You can import from the main module or specific submodules:

```python
# From main module (recommended)
from stepwright import run_scraper, TabTemplate, BaseStep

# From specific modules
from stepwright.step_types import TabTemplate, BaseStep
from stepwright.parser import run_scraper
from stepwright.helpers import replace_data_placeholders
```

## Testing

See [README_TESTS.md](README_TESTS.md) for detailed testing documentation.

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests for new functionality
5. Ensure all tests pass (`pytest`)
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request

## License

MIT License - see [LICENSE](LICENSE) file for details.

## Support

- 🐛 Issues: [GitHub Issues](https://github.com/lablnet/stepwright/issues)
- 📖 Documentation: [README.md](README.md) and [README_TESTS.md](README_TESTS.md)
- 💬 Discussions: [GitHub Discussions](https://github.com/lablnet/stepwright/discussions)

## Acknowledgments

- Built with [Playwright](https://playwright.dev/)
- Inspired by declarative web scraping patterns
- Original TypeScript version: [framework-Island/stepwright](https://github.com/framework-Island/stepwright)

## Author

Muhammad Umer Farooq ([@lablnet](https://github.com/lablnet))


            
