webscapy 1.6.5

- Name: webscapy
- Version: 1.6.5
- Summary: Selenium built for scraping instead of testing
- Author: Rahul Raj
- Upload time: 2023-06-06 10:19:21
- Requirements: none recorded
# Webscapy: Selenium Configured for Webscraping

## Introduction

Webscapy is a Python package that extends the capabilities of the Selenium framework, originally designed for web testing, to perform web scraping tasks. It provides a convenient and easy-to-use interface for automating browser interactions, navigating through web pages, and extracting data from websites. By combining the power of Selenium with the flexibility of web scraping, Webscapy enables you to extract structured data from dynamic websites efficiently.

## Features

1. <b>Automated Browser Interaction:</b> Webscapy enables you to automate browser actions, such as clicking buttons, filling forms, scrolling, and navigating between web pages. With a user-friendly interface, you can easily simulate human-like interactions with the target website.

2. <b>Undetected Mode:</b> Webscapy includes built-in mechanisms to bypass anti-bot measures, including Cloudflare protection. It provides an undetected mode that reduces the chances of detection and allows for seamless scraping even from websites with strict security measures.

   |                                         Undetected Mode (Off)                                          |                                          Undetected Mode (On)                                          |
   | :----------------------------------------------------------------------------------------------------: | :----------------------------------------------------------------------------------------------------: |
   | ![image](https://github.com/dusklight00/webscapy/assets/71203637/d8325500-3793-4f26-b7dd-15e5da7ee100) | ![image](https://github.com/dusklight00/webscapy/assets/71203637/7344470a-6924-4556-a72e-a27638e410bd) |

3. <b>Headless Browsers:</b> Webscapy supports headless browser operations, allowing you to scrape websites without displaying the browser window. This feature is useful for running scraping tasks in the background or on headless servers.

4. <b>Element Load Waiting:</b> The package offers flexible options for waiting until specific elements are loaded on the web page. You can wait for elements to appear, disappear, or become interactable before performing further actions. This ensures that your scraping script synchronizes with the dynamic behavior of websites.

5. <b>Execute JavaScript Code:</b> Webscapy allows you to execute custom JavaScript code within the browser. This feature enables you to interact with JavaScript-based functionalities on web pages, manipulate the DOM, or extract data that is not easily accessible through traditional scraping techniques.

6. <b>Connect with Remote Browsers:</b> Webscapy provides a simplified way to connect to remote browsers with a single line of code. This lets you distribute your scraping tasks to remote nodes or cloud-based Selenium Grid infrastructure. By specifying the remote URL, you can easily connect to a remote browser and leverage its capabilities for efficient scraping.

## Installation

You can install Webscapy using pip, the Python package manager. Open your command-line interface and execute the following command:

```shell
pip install webscapy
```

## Getting Started

The following are the ways to create a driver:

1. Simple Driver (headless)

```python
from webscapy import Webscapy

driver = Webscapy()

driver.get("https://google.com")
```

2. Turn off headless

```python
from webscapy import Webscapy

driver = Webscapy(headless=False)

driver.get("https://google.com")
```

3. Make the driver undetectable

```python
from webscapy import Webscapy

driver = Webscapy(headless=False, undetectable=True)

driver.get("https://google.com")
```

4. Connect to a remote browser

```python
from webscapy import Webscapy

REMOTE_URL = "..."
driver = Webscapy(remote_url=REMOTE_URL)

driver.get("https://google.com")
```

## Element Interaction

The following are the ways to interact with a DOM element:

1. Wait for the element to load

```python
driver.load_wait(type, selector)
```

2. Load the element

```python
element = driver.load_element(type, selector)
```

3. Load all matching instances of the selector (returns a list)

```python
elements = driver.load_elements(type, selector)

# Example
elements = driver.load_elements("tag-name", "p")

# Output:
# [elem1, elem2, elem3, ...]
```

4. Wait for and load the element

```python
element = driver.wait_load_element(type, selector)
```

5. Interact with / click the element

```python
element = driver.load_element(type, selector)
element.click()
```
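The calls above can be combined into a short wait-then-click flow. This is a hypothetical sketch; the selector type and value are illustrative (they match the submit button in the sample HTML of the next section), and the driver calls assume the API shown above:

```python
# Illustrative selector: a submit button named "continue".
selector_type, selector = "name", "continue"

# Wait until the button is present, load it, then click it
# (uncomment with a live driver):
# driver.load_wait(selector_type, selector)
# button = driver.wait_load_element(selector_type, selector)
# button.click()
```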

## Different Types of Selectors

Take the following sample HTML code as an example:

```html
<html>
  <body>
    <h1>Welcome</h1>
    <p>Site content goes here.</p>
    <form id="loginForm">
      <input name="username" type="text" />
      <input name="password" type="password" />
      <input name="continue" type="submit" value="Login" />
      <input name="continue" type="button" value="Clear" />
    </form>
    <p class="content">Site content goes here.</p>
    <a href="continue.html">Continue</a>
    <a href="cancel.html">Cancel</a>
  </body>
</html>
```

The following are the different selector types:

|       Type        |         Example         |
| :---------------: | :---------------------: |
|        id         |       `loginForm`       |
|       name        | `username` / `password` |
|       xpath       |  `/html/body/form[1]`   |
|     link-text     |       `Continue`        |
| partial-link-text |         `Conti`         |
|     tag-name      |          `h1`           |
|    class-name     |        `content`        |
|   css-selector    |       `p.content`       |

The following are some usage examples:

```python
content = driver.wait_load_element("css-selector", "p.content")
content = driver.wait_load_element("class-name", "content")
content = driver.wait_load_element("tag-name", "p")
```

## Execute JavaScript Code

You can execute arbitrary JavaScript code on the site using the following method:

```python
code = "..."
driver.execute_script(code)
```
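For instance, a snippet can be built to pull data straight out of the DOM. This is a hypothetical example, assuming `execute_script` returns the value of a JavaScript `return` expression, as Selenium's `execute_script` does:

```python
# Hypothetical helper: build JavaScript that collects every link URL on the page.
def collect_links_js():
    return (
        "return Array.from(document.querySelectorAll('a'))"
        ".map(a => a.href);"
    )

code = collect_links_js()
# links = driver.execute_script(code)  # expected: a list of href strings
```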

## Network Activity Data

You can retrieve network activity data after giving the page time to load, for example by waiting with `time.sleep(...)`:

```python
network_data = driver.get_network_data()

print(network_data)
```

## Cookie Handling

You can manage cookies using the following methods:

1. Add a single cookie

```python
cookie = {
   "name": "cookie1",
   "value": "value1"
}
driver.add_cookie(cookie)
```

2. Get a single cookie

```python
driver.get_cookie("cookie1")
```

3. Delete a single cookie

```python
driver.delete_cookie("cookie1")
```

4. Import cookies from JSON

```python
driver.load_cookie_json("cookie.json")
```
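The layout of `cookie.json` is not documented here; a reasonable guess, mirroring the dicts accepted by `add_cookie`, is a list of name/value records. The sketch below writes such a file with the standard library (the file format is an assumption, not documented webscapy behavior):

```python
import json
import os
import tempfile

# Assumed cookie.json layout: a list of records shaped like the
# dicts passed to driver.add_cookie above.
cookies = [
    {"name": "cookie1", "value": "value1"},
    {"name": "session", "value": "abc123"},
]

path = os.path.join(tempfile.mkdtemp(), "cookie.json")
with open(path, "w") as f:
    json.dump(cookies, f, indent=2)

# driver.load_cookie_json(path)  # would import every cookie in the file
```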

## Close the driver

Always close the driver after use to free resources and avoid memory leaks:

```python
driver.close()
```
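Since the driver exposes `close()`, `contextlib.closing` can guarantee cleanup even when scraping raises an exception. A minimal sketch, assuming `Webscapy()` behaves as shown in the earlier examples:

```python
from contextlib import closing

# closing() calls driver.close() automatically when the block exits,
# even if an exception is raised mid-scrape (uncomment with webscapy installed):
#
# with closing(Webscapy()) as driver:
#     driver.get("https://google.com")
#     ...  # scrape here
```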

            
