| Field | Value |
| --- | --- |
| Name | web-scraper-client |
| Version | 0.1.9 |
| Summary | Client library for https://github.com/snackbeard/web-scraper-api |
| Author | Snackbeard |
| Maintainer | None |
| Home page | None |
| Docs URL | None |
| Requires Python | None |
| Keywords | None |
| Requirements | No requirements were recorded. |
| Upload time | 2025-07-10 13:26:31 |
| License | MIT (full text below) |

MIT License
Copyright (c) [year] [fullname]
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
### Goal
This project aims to simplify web scraping with Selenium by providing an easy-to-use client library for accessing
a self-hosted instance of **Selenium Grid**.
---
Standard usage:
~~~python
import logging
import time

from selenium import webdriver
from selenium.common import TimeoutException, JavascriptException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.wait import WebDriverWait

chrome_options: Options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36")
chrome_options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})

service: Service = Service('/path/to/chromedriver')
driver: webdriver.Chrome = webdriver.Chrome(service=service, options=chrome_options)

driver.get('https://my-url.com')

try:
    # accept cookies
    cookie_button = WebDriverWait(driver, 10).until(
        expected_conditions.element_to_be_clickable((
            By.CSS_SELECTOR,
            'button.sc-aXZVg.sc-lcIPJg.fkTzLw.jlhbaU.acceptAll'
        ))
    )
    driver.execute_script('arguments[0].click();', cookie_button)
except TimeoutException:
    logging.info('element not present')

try:
    # wait for page to load
    WebDriverWait(driver, 10).until(
        expected_conditions.presence_of_element_located((By.CSS_SELECTOR, 'div.sc-iwOjIX.cPJSFQ.events-list'))
    )

    time.sleep(1)

    # scroll down a bit
    element_to_scroll = driver.find_element(By.CSS_SELECTOR, 'div.sc-iwOjIX.cPJSFQ.events-list')
    driver.execute_script('arguments[0].scrollIntoView({block: "end"});', element_to_scroll)

    time.sleep(1)

    element_to_scroll = driver.find_element(By.CSS_SELECTOR, 'div.sc-PXPPG.hIImXk')
    driver.execute_script('arguments[0].scrollIntoView({block: "start"});', element_to_scroll)
except (TimeoutException, JavascriptException):
    logging.info('error')

page_source = driver.page_source
~~~
---
With the client library:
~~~python
from client.enums.api_instruction_block_type import ApiInstructionBlockType
from client.enums.api_instruction_content_type import ApiInstructionContentType
from client.enums.api_instruction_element_type import ApiInstructionElementType
from client.enums.api_instruction_identificator_type import ApiInstructionIdentificatorType
from client.models.driver_options import DriverOptions
from client.webscraper_instruction_builder import WebScraperInstructionBuilder
options = DriverOptions(user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36', options=[
    '--headless',
    '--disable-gpu',
    '--no-sandbox',
    '--disable-dev-shm-usage'
])

page_source = WebScraperInstructionBuilder(url='api-url', api_key='api-key')\
    .wait_for(seconds=10,
              wait_for=ApiInstructionElementType.ELEMENT_CLICKALBE,
              by=ApiInstructionIdentificatorType.CSS_SELECTOR,
              element_id='button.sc-aXZVg.sc-lcIPJg.fkTzLw.jlhbaU.acceptAll',
              ignore_error=True)\
    .click(ignore_error=True)\
    .wait_for(seconds=10,
              wait_for=ApiInstructionElementType.ELEMENT_PRESENCE,
              by=ApiInstructionIdentificatorType.CSS_SELECTOR,
              element_id='div.sc-iwOjIX.cPJSFQ.events-list')\
    .wait(seconds=1)\
    .scroll(ApiInstructionBlockType.END)\
    .wait(seconds=1)\
    .find(by=ApiInstructionIdentificatorType.CSS_SELECTOR, element_id='div.sc-PXPPG.hIImXk')\
    .scroll(ApiInstructionBlockType.START)\
    .get(page_url='page-to-scrape-url', options=options, content=ApiInstructionContentType.PAGE_SOURCE)
~~~
### Client Library
The client library makes it easy to access the API by providing a builder class.
**Example:**
~~~python
options = DriverOptions(
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
    options=['--headless', '--disable-gpu', '--no-sandbox', '--disable-dev-shm-usage']
)

page_source = WebScraperInstructionBuilder(
        url='https://my-domain.com',
        api_key='c1f24ee0-1a77-4719-a33b-408069dfc15f')\
    .wait_for(seconds=5,
              by=ApiInstructionIdentificatorType.CSS_SELECTOR,
              wait_for=ApiInstructionElementType.ELEMENT_CLICKALBE,
              element_id='button.sc-aXZVg.sc-lcIPJg.bjommA.jlhbaU.acceptAll',
              ignore_error=True)\
    .click(ignore_error=True)\
    .wait_for(seconds=10,
              by=ApiInstructionIdentificatorType.CSS_SELECTOR,
              wait_for=ApiInstructionElementType.ELEMENT_PRESENCE,
              element_id='div.category-event-items')\
    .scroll(block=ApiInstructionBlockType.END)\
    .get(page_url='https://www.page-to-scrape.com', options=options)
~~~
In the first _.wait_for_ instruction, a cookie dialog overlaps the content we want
to scrape, so we wait for the _.acceptAll_ button to become clickable and then click it.
The dialog does not reappear once cookies have been accepted, so scraping the site a second
time would fail to find the button; _ignore_error_ is therefore set to true in both
instructions. We then wait for a list to be present and scroll to its end so that all of its content is loaded.

The following actions are currently supported:
- **wait** - simply waits
    - _seconds_ - seconds to wait
- **wait_for** - wait for an element
    - _seconds_ - seconds to wait
    - _by_ - find it either by css selector or by element id
    - _wait_for_ - wait until the element is either present or clickable
    - _element_id_ - id/selector of the element
    - _ignore_error_ - if an element was not found, ignore it and continue with the next instruction
- **find** - find an element
    - _by_ - find it either by css selector or by element id
    - _element_id_ - id/selector of the element
    - _ignore_error_ - ...
- **click** - click an element after finding it or waiting for it
    - _ignore_error_ - ...
- **scroll** - scroll to an element after finding it or waiting for it
    - _block_ - scroll to either end or start
    - _ignore_error_ - ...
- **get**
    - _page_url_ - webpage to scrape
    - _content_ - page_source (html) or xhr (json)
    - _xhr_name_ - document name (has to be provided if content is xhr; see the sketch after this list)
    - _options_ - chromedriver options
> These instructions should be able to handle most use-cases
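For example, pulling an XHR document instead of the rendered HTML could look roughly like the sketch below. This is an illustrative guess based on the parameters listed above: the enum member for XHR content (assumed here to be `ApiInstructionContentType.XHR`) and the `xhr_name` value (`'events.json'`) are assumptions rather than names confirmed by the library, and `options` is the `DriverOptions` object from the earlier example.

~~~python
# Hedged sketch: fetch an XHR response (json) instead of the page source.
# ApiInstructionContentType.XHR and 'events.json' are assumed/hypothetical.
xhr_json = WebScraperInstructionBuilder(url='api-url', api_key='api-key')\
    .wait_for(seconds=10,
              wait_for=ApiInstructionElementType.ELEMENT_PRESENCE,
              by=ApiInstructionIdentificatorType.CSS_SELECTOR,
              element_id='div.sc-iwOjIX.cPJSFQ.events-list')\
    .get(page_url='page-to-scrape-url',
         options=options,
         content=ApiInstructionContentType.XHR,  # assumed member name for xhr content
         xhr_name='events.json')  # hypothetical name of the XHR document to capture
~~~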
---
### API
The API uses **FastAPI** and **Selenium Grid**: it takes a list of instructions and calls the
corresponding Selenium functions to control the browser. Selenium Grid has no authentication
by default, so the API provides a simple api-key check. Only the API is exposed; Selenium Grid
itself is not.
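As a rough illustration of that api-key check (not the actual web-scraper-api code; the `/scrape` path and handler below are hypothetical), a FastAPI dependency comparing the request's `X-API-KEY` header against the configured key might look like this:

~~~python
# Illustrative sketch only -- not the web-scraper-api implementation.
import os

from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()
EXPECTED_KEY = os.environ.get('X-API-KEY', '')  # mirrors the compose variable below


def verify_api_key(x_api_key: str = Header(...)) -> None:
    # FastAPI maps the x_api_key parameter to the X-API-Key request header
    if x_api_key != EXPECTED_KEY:
        raise HTTPException(status_code=401, detail='invalid api key')


@app.post('/scrape', dependencies=[Depends(verify_api_key)])
def scrape(instructions: dict) -> dict:
    # A real endpoint would translate the instruction list into Selenium calls
    # against the remote Grid and return the requested content.
    return {'content': None}
~~~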
Docker Compose:
~~~yaml
version: "3.5"

services:
  webscraper-api:
    image: webscraper-api
    container_name: webscraper-api
    environment:
      HOST: 0.0.0.0
      PORT: 8081
      X-API-KEY: c1f24ee0-1a77-4719-a33b-408069dfc15f
      LOG_LEVEL: INFO
      WEBDRIVER_REMOTE_HOST: http://selenium-chrome:4444/wd/hub
    ports:
      - "8081:8081"
    networks:
      - scraping-network
    depends_on:
      - selenium-chrome

  selenium-chrome:
    image: selenium/standalone-chrome
    container_name: selenium-chrome
    shm_size: 2g
    networks:
      - scraping-network
    # ports:
    #   - "4444:4444"
    #   - "7900:7900"

networks:
  scraping-network:
    driver: bridge
~~~
The above configuration only provides one node. To deploy multiple nodes, check the Selenium docs: https://github.com/SeleniumHQ/docker-selenium
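For reference, the `WEBDRIVER_REMOTE_HOST` value above is where the API reaches the Grid. A minimal sketch of driving that Grid container from another service on `scraping-network`, using the standard Selenium Remote WebDriver (not code from this project):

~~~python
# Minimal sketch: connect to the Grid container over the compose network.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')

driver = webdriver.Remote(
    command_executor='http://selenium-chrome:4444/wd/hub',  # value of WEBDRIVER_REMOTE_HOST
    options=chrome_options,
)
driver.get('https://my-url.com')
page_source = driver.page_source
driver.quit()
~~~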
---
### Future Improvements
1. Support proxies to avoid IP bans (see the sketch below)
2. More actions and/or more details per action
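One plausible way to wire in proxies later (an assumption, not a committed design) would be to pass Chrome's standard `--proxy-server` flag through the existing `DriverOptions`:

~~~python
# Assumption/sketch only: proxy support is not implemented yet. This just shows
# how DriverOptions could carry Chrome's standard --proxy-server flag.
options = DriverOptions(
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
    options=['--headless', '--proxy-server=http://my-proxy:3128']  # hypothetical proxy address
)
~~~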