par-scrape

Name: par-scrape
Version: 0.4.5
Summary: A versatile web scraping tool with options for Selenium or Playwright, featuring OpenAI-powered data extraction and formatting.
Upload time: 2024-09-24 22:54:34
Requires Python: >=3.11
License: MIT License, Copyright (c) 2024 Paul Robello
Keywords: data extraction, openai, playwright, selenium, web scraping
# PAR Scrape

[![PyPI](https://img.shields.io/pypi/v/par-scrape)](https://pypi.org/project/par-scrape/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/par-scrape.svg)](https://pypi.org/project/par-scrape/)  
![Runs on Linux | MacOS | Windows](https://img.shields.io/badge/runs%20on-Linux%20%7C%20MacOS%20%7C%20Windows-blue)
![Arch x86-64 | ARM | AppleSilicon](https://img.shields.io/badge/arch-x86--64%20%7C%20ARM%20%7C%20AppleSilicon-blue)  
![PyPI - License](https://img.shields.io/pypi/l/par-scrape)

PAR Scrape is a versatile web scraping tool with options for Selenium or Playwright, featuring AI-powered data extraction and formatting.

[!["Buy Me A Coffee"](https://www.buymeacoffee.com/assets/img/custom_images/orange_img.png)](https://buymeacoffee.com/probello3)

## Screenshots
![PAR Scrape Screenshot](https://raw.githubusercontent.com/paulrobello/par_scrape/main/Screenshot.png)

## Features

- Web scraping using Selenium or Playwright
- AI-powered data extraction and formatting
- Supports multiple output formats (JSON, Excel, CSV, Markdown)
- Customizable field extraction
- Token usage and cost estimation

## Known Issues
- Selenium silent mode on Windows still shows a websocket-related message; there is no simple way to suppress it.
- Providers other than OpenAI are hit-and-miss, depending on the provider, model, and data being extracted.

## Installation

To install PAR Scrape, make sure you have Python 3.11 or higher and [uv](https://pypi.org/project/uv/) installed.
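
If you don't already have uv, one way to install it is from PyPI like any other Python tool (pipx shown here; plain `pip install uv` also works):

```bash
# One-time setup: install the uv package manager and confirm it runs
pipx install uv
uv --version
```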

### Installation From Source

Follow these steps:

1. Clone the repository:
   ```bash
   git clone https://github.com/paulrobello/par_scrape.git
   cd par_scrape
   ```

2. Install the package dependencies using uv:
   ```bash
   uv sync
   ```
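3. (Optional) Verify the install by running the CLI through uv; `--version` prints the version and exits:
   ```bash
   uv run par_scrape --version
   ```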
### Installation From PyPI

To install PAR Scrape from PyPI, run one of the following commands:

```bash
uv tool install par-scrape
```

```bash
pipx install par-scrape
```
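
To pick up new releases later, the matching upgrade commands should work (use whichever installer you chose above):

```bash
uv tool upgrade par-scrape
```

```bash
pipx upgrade par-scrape
```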
### Playwright Installation
To use Playwright as the scraper, you must install it and its browser binaries using the following commands:

```bash
uv tool install playwright
playwright install chromium
```
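
You can confirm the Playwright CLI is available afterwards:

```bash
playwright --version
```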

## Usage

To use PAR Scrape, run it from the command line with various options; basic examples are shown below.
Before running, make sure the API key for your chosen AI provider is set in your environment.
The environment variable names for the supported providers are as follows:
- OpenAI: `OPENAI_API_KEY`
- Anthropic: `ANTHROPIC_API_KEY`
- Groq: `GROQ_API_KEY`
- Google: `GOOGLE_API_KEY`
- Ollama: `Not needed`

You can also store your keys in the file `~/.par-scrape.env` as follows:
```
OPENAI_API_KEY=your_api_key
ANTHROPIC_API_KEY=your_api_key
GROQ_API_KEY=your_api_key
GOOGLE_API_KEY=your_api_key
```
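
For a one-off run you can also export the key just for the current shell session (the value shown is a placeholder):

```bash
# Export the OpenAI key for this shell only, then scrape with the default fields
export OPENAI_API_KEY=your_api_key
par_scrape --url "https://openai.com/api/pricing/" -f "Model" -f "Pricing Input" -f "Pricing Output"
```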

### Running from source
```bash
uv run par_scrape --url "https://openai.com/api/pricing/" --fields "Model" --fields "Pricing Input" --fields "Pricing Output" --scraper selenium --model gpt-4o-mini --display-output md
```

### Running if installed from PyPI
```bash
par_scrape --url "https://openai.com/api/pricing/" --fields "Title" --fields "Number of Points" --fields "Creator" --fields "Time Posted" --fields "Number of Comments" --scraper selenium --model gpt-4o-mini --display-output md
```

### Options

- `--url`, `-u`: The URL to scrape or path to a local file (default: "https://openai.com/api/pricing/")
- `--fields`, `-f`: Fields to extract from the webpage (default: ["Model", "Pricing Input", "Pricing Output"])
- `--scraper`, `-s`: Scraper to use: 'selenium' or 'playwright' (default: "playwright")
- `--headless`, `-h`: Run the browser in headless mode (default: False)
- `--wait-type`, `-w`: Method to use for page content load waiting [none|pause|sleep|idle|selector|text] (default: sleep).
- `--wait-selector`, `-i`: Selector or text to use for page content load waiting.
- `--sleep-time`, `-t`: Time to sleep (in seconds) before scrolling and closing browser (default: 5)
- `--ai-provider`, `-a`: AI provider to use for processing (default: "OpenAI")
- `--model`, `-m`: AI model to use for processing. If not specified, a default model will be used based on the provider.
- `--display-output`, `-d`: Display output in terminal (md, csv, or json)
- `--output-folder`, `-o`: Specify the location of the output folder (default: "./output")
- `--silent`, `-q`: Run in silent mode, suppressing output (default: False)
- `--run-name`, `-n`: Specify a name for this run
- `--version`, `-v`: Show the version and exit
- `--pricing`: Enable pricing summary display (default: False)
- `--cleanup`, `-c`: How to handle cleanup of output folder (choices: none, before, after, both) (default: none)
- `--extraction-prompt`, `-e`: Path to alternate extraction prompt file
- `--ai-base-url`, `-b`: Override the base URL for the AI provider.
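
For example, `--wait-type selector` pairs with `--wait-selector` to hold off scraping until a specific element is present; the CSS selector below is only illustrative:

```bash
# Wait for a <table> element to appear before extracting the fields
par_scrape --url "https://openai.com/api/pricing/" -f "Model" -f "Pricing Input" -f "Pricing Output" -w selector -i "table"
```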

### Examples

1. Basic usage with default options:
```bash
par_scrape --url "https://openai.com/api/pricing/" -f "Model" -f "Pricing Input" -f "Pricing Output" --pricing -w text -i gpt-4o
```
2. Using Playwright and displaying JSON output:
```bash
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --scraper playwright -d json --pricing -w text -i gpt-4o
```
3. Specifying a custom model and output folder:
```bash
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --model gpt-4 --output-folder ./custom_output --pricing -w text -i gpt-4o
```
4. Running in silent mode with a custom run name:
```bash
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --silent --run-name my_custom_run --pricing -w text -i gpt-4o
```
5. Using the cleanup option to remove the output folder after scraping:
```bash
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --cleanup after --pricing
```
6. Using the pause wait type to wait for user input before scrolling:
```bash
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --wait-type pause --pricing
```
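
Because everything is driven from the command line, runs are easy to script. A minimal batch-scraping sketch follows; the URLs, fields, and run names are purely illustrative:

```bash
#!/usr/bin/env bash
# Scrape several pages with the same fields, giving each URL its own named run
set -euo pipefail

urls=(
  "https://openai.com/api/pricing/"
  "https://example.com/pricing/"
)

for url in "${urls[@]}"; do
  # Derive a filesystem-friendly run name from the URL
  run_name=$(echo "$url" | tr -c 'a-zA-Z0-9' '_')
  par_scrape --url "$url" \
    -f "Title" -f "Description" -f "Price" \
    --run-name "$run_name" \
    --output-folder ./output \
    --display-output md
done
```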

## What's New

- Version 0.4.5:
  - Added the `--wait-type` option, which lets you specify the type of wait to use: pause, sleep, idle, text, or selector.
  - Removed the `--pause` option; it is superseded by `--wait-type`.
  - Playwright scraping now honors headless mode.
  - Playwright is now the default scraper as it is much faster.
- Version 0.4.4:
  - Better Playwright scraping.
- Version 0.4.3:
  - Added option to override the base URL for the AI provider.
- Version 0.4.2:
  - The `--url` parameter can now point to a local `rawData_*.md` file, making it easier to test different models without re-fetching the data.
  - Added the ability to specify a file containing an alternate extraction prompt.
  - Tweaked the extraction prompt to work with Groq and Anthropic. Google still does not work.
  - Removed the need for `~/.par-scrape-config.json`.
- Version 0.4.1:
  - Minor bug fixes for pricing summary.
  - Default model for Google changed to `gemini-1.5-pro-exp-0827`, which is free and usually works well.
- Version 0.4.0:
  - Added support for Anthropic, Google, Groq, and Ollama. (Not well tested with any providers other than OpenAI)
  - Added a flag for displaying the pricing summary. Defaults to False.
  - Added pricing data for Anthropic.
  - Better error handling for LLM calls.
  - Updated the cleanup flag to handle both before and after cleanup. Removed the --remove-output-folder flag.
- Version 0.3.1:
  - Added pause and sleep-time options to control browser and scraping delays.
  - Defaulted headless mode to False so you can interact with the browser.
- Version 0.3.0:
  - Fixed location of config.json file.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Author

Paul Robello - probello@gmail.com

            
