# StepWright
A powerful web scraping library built with Playwright that provides a declarative, step-by-step approach to web automation and data extraction.
## Features
- 🚀 **Declarative Scraping**: Define scraping workflows using Python dictionaries or dataclasses
- 🔄 **Pagination Support**: Built-in support for next-button and scroll-based pagination
- 📊 **Data Collection**: Extract text, HTML, values, and files from web pages
- 🔗 **Multi-tab Support**: Handle multiple tabs and complex navigation flows
- 📄 **PDF Generation**: Save pages as PDFs or trigger print-to-PDF actions
- 📥 **File Downloads**: Download files with automatic directory creation
- 🔁 **Looping & Iteration**: ForEach loops for processing multiple elements
- 📡 **Streaming Results**: Real-time result processing with callbacks
- 🎯 **Error Handling**: Graceful error handling with configurable termination
- 🔧 **Flexible Selectors**: Support for ID, class, tag, and XPath selectors
## Installation
```bash
# Using pip
pip install stepwright

# Using pip with development dependencies (quoted to avoid shell globbing, e.g. in zsh)
pip install "stepwright[dev]"

# From source
git clone https://github.com/lablnet/stepwright.git
cd stepwright
pip install -e .
```
## Quick Start
### Basic Usage
```python
import asyncio
from stepwright import run_scraper, TabTemplate, BaseStep

async def main():
    templates = [
        TabTemplate(
            tab="example",
            steps=[
                BaseStep(
                    id="navigate",
                    action="navigate",
                    value="https://example.com"
                ),
                BaseStep(
                    id="get_title",
                    action="data",
                    object_type="tag",
                    object="h1",
                    key="title",
                    data_type="text"
                )
            ]
        )
    ]

    results = await run_scraper(templates)
    print(results)

if __name__ == "__main__":
    asyncio.run(main())
```
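Per the feature list, workflows can also be defined as plain Python dictionaries. A minimal sketch, assuming dict keys mirror the dataclass field names shown in the API reference below (verify against your installed version):
```python
# Hypothetical dict form of the same workflow; assumes run_scraper accepts
# dicts whose keys mirror the TabTemplate/BaseStep fields.
templates = [
    {
        "tab": "example",
        "steps": [
            {"id": "navigate", "action": "navigate", "value": "https://example.com"},
            {
                "id": "get_title",
                "action": "data",
                "object_type": "tag",
                "object": "h1",
                "key": "title",
                "data_type": "text",
            },
        ],
    }
]
```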
## API Reference
### Core Functions
#### `run_scraper(templates, options=None)`
Main function to execute scraping templates.
**Parameters:**
- `templates`: List of `TabTemplate` objects
- `options`: Optional `RunOptions` object
**Returns:** `List[Dict[str, Any]]`
```python
results = await run_scraper(templates, RunOptions(
    browser={"headless": True}
))
```
#### `run_scraper_with_callback(templates, on_result, options=None)`
Execute scraping with streaming results via callback.
**Parameters:**
- `templates`: List of `TabTemplate` objects
- `on_result`: Callback function for each result (can be sync or async)
- `options`: Optional `RunOptions` object
```python
async def process_result(result, index):
    print(f"Result {index}: {result}")

await run_scraper_with_callback(templates, process_result)
```
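Because the callback may be synchronous (per the parameter note above), a plain function works as well:
```python
def log_result(result, index):
    # Synchronous callbacks are accepted too; no awaiting required
    print(f"Result {index}: {result}")

await run_scraper_with_callback(templates, log_result)
```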
### Types
#### `TabTemplate`
```python
@dataclass
class TabTemplate:
    tab: str
    initSteps: Optional[List[BaseStep]] = None     # Steps executed once before pagination
    perPageSteps: Optional[List[BaseStep]] = None  # Steps executed for each page
    steps: Optional[List[BaseStep]] = None         # Single steps array
    pagination: Optional[PaginationConfig] = None
```
#### `BaseStep`
```python
@dataclass
class BaseStep:
    id: str
    description: Optional[str] = None
    object_type: Optional[SelectorType] = None  # 'id' | 'class' | 'tag' | 'xpath'
    object: Optional[str] = None
    action: Literal[
        "navigate", "input", "click", "data", "scroll",
        "eventBaseDownload", "foreach", "open", "savePDF",
        "printToPDF", "downloadPDF", "downloadFile"
    ] = "navigate"
    value: Optional[str] = None
    key: Optional[str] = None
    data_type: Optional[DataType] = None  # 'text' | 'html' | 'value' | 'default' | 'attribute'
    wait: Optional[int] = None
    terminateonerror: Optional[bool] = None
    subSteps: Optional[List["BaseStep"]] = None
    autoScroll: Optional[bool] = None
```
#### `RunOptions`
```python
@dataclass
class RunOptions:
    browser: Optional[dict] = None  # Playwright launch options
    onResult: Optional[Callable] = None
```
## Step Actions
### Navigate
Navigate to a URL.
```python
BaseStep(
    id="go_to_page",
    action="navigate",
    value="https://example.com"
)
```
### Input
Fill form fields.
```python
BaseStep(
    id="search",
    action="input",
    object_type="id",
    object="search-box",
    value="search term"
)
```
### Click
Click on elements.
```python
BaseStep(
    id="submit",
    action="click",
    object_type="class",
    object="submit-button"
)
```
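Any step can pause after its action via the `wait` field. A sketch assuming `wait` is in milliseconds, consistent with `wait=2000` in the pagination examples below:
```python
BaseStep(
    id="submit_and_wait",
    action="click",
    object_type="class",
    object="submit-button",
    wait=1500  # assumed milliseconds, matching NextButtonConfig usage
)
```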
### Data Extraction
Extract data from elements.
```python
BaseStep(
    id="get_title",
    action="data",
    object_type="tag",
    object="h1",
    key="title",
    data_type="text"
)
```
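Besides `text`, the `DataType` literal on `BaseStep` also permits `html` and `value`. A sketch extracting an element's markup and a form field's current value:
```python
BaseStep(
    id="get_description_html",
    action="data",
    object_type="class",
    object="description",
    key="description_html",
    data_type="html"   # raw HTML rather than text content
),
BaseStep(
    id="get_search_value",
    action="data",
    object_type="id",
    object="search-box",
    key="search_value",
    data_type="value"  # current value of an input element
)
```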
### ForEach Loop
Process multiple elements.
```python
BaseStep(
    id="process_items",
    action="foreach",
    object_type="class",
    object="item",
    subSteps=[
        BaseStep(
            id="get_item_title",
            action="data",
            object_type="tag",
            object="h2",
            key="title",
            data_type="text"
        )
    ]
)
```
### File Operations
#### Event-Based Download
```python
BaseStep(
    id="download_file",
    action="eventBaseDownload",
    object_type="class",
    object="download-link",
    value="./downloads/file.pdf",
    key="downloaded_file"
)
```
#### Download PDF/File
```python
BaseStep(
    id="download_pdf",
    action="downloadPDF",
    object_type="class",
    object="pdf-link",
    value="./output/document.pdf",
    key="pdf_file"
)
```
#### Save PDF
```python
BaseStep(
    id="save_pdf",
    action="savePDF",
    value="./output/page.pdf",
    key="pdf_file"
)
```
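`printToPDF` also appears in the action literal but is not documented above. A hypothetical sketch assuming it takes the same `value`/`key` parameters as `savePDF`; check the executor source before relying on it:
```python
BaseStep(
    id="print_pdf",
    action="printToPDF",           # assumed to mirror savePDF's parameters
    value="./output/printed.pdf",  # hypothetical output path
    key="printed_file"
)
```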
## Pagination
### Next Button Pagination
```python
PaginationConfig(
    strategy="next",
    nextButton=NextButtonConfig(
        object_type="class",
        object="next-page",
        wait=2000
    ),
    maxPages=10
)
```
### Scroll Pagination
```python
PaginationConfig(
    strategy="scroll",
    scroll=ScrollConfig(
        offset=800,
        delay=1500
    ),
    maxPages=5
)
```
### Pagination Strategies
#### paginationFirst
Paginate first, then collect data from each page:
```python
TabTemplate(
    tab="news",
    initSteps=[...],
    perPageSteps=[...],  # Collect data from each page
    pagination=PaginationConfig(
        strategy="next",
        nextButton=NextButtonConfig(...),
        paginationFirst=True  # Go to next page before collecting
    )
)
```
#### paginateAllFirst
Paginate through all pages first, then collect all data at once:
```python
TabTemplate(
    tab="articles",
    initSteps=[...],
    perPageSteps=[...],  # Collect all data after all pagination
    pagination=PaginationConfig(
        strategy="next",
        nextButton=NextButtonConfig(...),
        paginateAllFirst=True  # Load all pages first
    )
)
```
## Advanced Features
### Proxy Support
```python
from stepwright import run_scraper, RunOptions

results = await run_scraper(templates, RunOptions(
    browser={
        "proxy": {
            "server": "http://proxy-server:8080",
            "username": "user",
            "password": "pass"
        }
    }
))
```
### Custom Browser Options
```python
results = await run_scraper(templates, RunOptions(
    browser={
        "headless": False,
        "slow_mo": 1000,
        "args": ["--no-sandbox", "--disable-setuid-sandbox"]
    }
))
```
### Streaming Results
```python
async def process_result(result, index):
    print(f"Result {index}: {result}")
    # Process each result immediately (e.g., save to a database)
    await save_to_database(result)

await run_scraper_with_callback(
    templates,
    process_result,
    RunOptions(browser={"headless": True})
)
```
### Data Placeholders
Use collected data in subsequent steps:
```python
BaseStep(
    id="get_title",
    action="data",
    object_type="id",
    object="page-title",
    key="page_title",
    data_type="text"
),
BaseStep(
    id="save_with_title",
    action="savePDF",
    value="./output/{{page_title}}.pdf",  # Uses collected page_title
    key="pdf_file"
)
```
### Index Placeholders
Use the loop index inside `foreach` sub-steps:
```python
BaseStep(
    id="process_items",
    action="foreach",
    object_type="class",
    object="item",
    subSteps=[
        BaseStep(
            id="save_item",
            action="savePDF",
            value="./output/item_{{i}}.pdf"  # i = 0, 1, 2, ...
            # or: value="./output/item_{{i_plus1}}.pdf"  # i_plus1 = 1, 2, 3, ...
        )
    ]
)
```
## Error Handling
Steps can be configured to terminate on error:
```python
BaseStep(
    id="critical_step",
    action="click",
    object_type="id",
    object="important-button",
    terminateonerror=True  # Stop execution if this fails
)
```
Without `terminateonerror=True`, errors are logged but execution continues.
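Combining both behaviors, a sketch where an optional banner dismissal may fail silently while a critical click must succeed:
```python
BaseStep(
    id="dismiss_banner",
    action="click",
    object_type="class",
    object="cookie-banner-close"
    # no terminateonerror: failures here are logged and execution continues
),
BaseStep(
    id="login",
    action="click",
    object_type="id",
    object="login-button",
    terminateonerror=True  # abort the run if this click fails
)
```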
## Complete Example
```python
import asyncio
from stepwright import (
    run_scraper,
    TabTemplate,
    BaseStep,
    PaginationConfig,
    NextButtonConfig,
    RunOptions
)

async def main():
    templates = [
        TabTemplate(
            tab="news_scraper",
            initSteps=[
                BaseStep(
                    id="navigate",
                    action="navigate",
                    value="https://news-site.com"
                ),
                BaseStep(
                    id="search",
                    action="input",
                    object_type="id",
                    object="search-box",
                    value="technology"
                )
            ],
            perPageSteps=[
                BaseStep(
                    id="collect_articles",
                    action="foreach",
                    object_type="class",
                    object="article",
                    subSteps=[
                        BaseStep(
                            id="get_title",
                            action="data",
                            object_type="tag",
                            object="h2",
                            key="title",
                            data_type="text"
                        ),
                        BaseStep(
                            id="get_content",
                            action="data",
                            object_type="tag",
                            object="p",
                            key="content",
                            data_type="text"
                        ),
                        BaseStep(
                            id="get_link",
                            action="data",
                            object_type="tag",
                            object="a",
                            key="link",
                            data_type="value"
                        )
                    ]
                )
            ],
            pagination=PaginationConfig(
                strategy="next",
                nextButton=NextButtonConfig(
                    object_type="id",
                    object="next-page",
                    wait=2000
                ),
                maxPages=5
            )
        )
    ]

    # Run scraper
    results = await run_scraper(templates, RunOptions(
        browser={"headless": True}
    ))

    # Process results (guard against missing content before slicing)
    for i, article in enumerate(results):
        print(f"\nArticle {i + 1}:")
        print(f"Title: {article.get('title')}")
        print(f"Content: {(article.get('content') or '')[:100]}...")
        print(f"Link: {article.get('link')}")

if __name__ == "__main__":
    asyncio.run(main())
```
## Development
### Setup
```bash
# Clone repository
git clone https://github.com/lablnet/stepwright.git
cd stepwright

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Install Playwright browsers
playwright install chromium
```
### Running Tests
```bash
# Run all tests
pytest

# Run with verbose output
pytest -v

# Run specific test file
pytest tests/test_scraper.py

# Run specific test class
pytest tests/test_scraper.py::TestGetBrowser

# Run specific test
pytest tests/test_scraper.py::TestGetBrowser::test_create_browser_instance

# Run with coverage
pytest --cov=src --cov-report=html

# Run integration tests only
pytest tests/test_integration.py
```
### Project Structure
```
stepwright/
├── src/
│   ├── __init__.py
│   ├── step_types.py        # Type definitions and dataclasses
│   ├── helpers.py           # Utility functions
│   ├── executor.py          # Core step execution logic
│   ├── parser.py            # Public API (run_scraper)
│   ├── scraper.py           # Low-level browser automation
│   └── scraper_parser.py    # Backward compatibility
├── tests/
│   ├── __init__.py
│   ├── conftest.py          # Pytest configuration
│   ├── test_page.html       # Test HTML page
│   ├── test_scraper.py      # Core scraper tests
│   ├── test_parser.py       # Parser function tests
│   └── test_integration.py  # Integration tests
├── pyproject.toml           # Package configuration
├── setup.py                 # Setup script
├── pytest.ini               # Pytest configuration
├── README.md                # This file
└── README_TESTS.md          # Detailed test documentation
```
### Code Quality
```bash
# Format code with black
black src/ tests/

# Lint with flake8
flake8 src/ tests/

# Type checking with mypy
mypy src/
```
## Module Organization
The codebase follows separation of concerns:
- **step_types.py**: All type definitions (BaseStep, TabTemplate, etc.)
- **helpers.py**: Utility functions (placeholder replacement, locator creation)
- **executor.py**: Core execution logic (execute steps, handle pagination)
- **parser.py**: Public API (run_scraper, run_scraper_with_callback)
- **scraper.py**: Low-level Playwright wrapper (navigate, click, get_data)
- **scraper_parser.py**: Backward compatibility wrapper
You can import from the main module or specific submodules:
```python
# From main module (recommended)
from stepwright import run_scraper, TabTemplate, BaseStep

# From specific modules
from stepwright.step_types import TabTemplate, BaseStep
from stepwright.parser import run_scraper
from stepwright.helpers import replace_data_placeholders
```
## Testing
See [README_TESTS.md](README_TESTS.md) for detailed testing documentation.
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests for new functionality
5. Ensure all tests pass (`pytest`)
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request
## License
MIT License - see [LICENSE](LICENSE) file for details.
## Support
- 🐛 Issues: [GitHub Issues](https://github.com/lablnet/stepwright/issues)
- 📖 Documentation: [README.md](README.md) and [README_TESTS.md](README_TESTS.md)
- 💬 Discussions: [GitHub Discussions](https://github.com/lablnet/stepwright/discussions)
## Acknowledgments
- Built with [Playwright](https://playwright.dev/)
- Inspired by declarative web scraping patterns
- Original TypeScript version: [framework-Island/stepwright](https://github.com/framework-Island/stepwright)
## Author
Muhammad Umer Farooq ([@lablnet](https://github.com/lablnet))