sufsd


- Name: sufsd
- Version: 0.8
- Download: https://files.pythonhosted.org/packages/87/cd/9ad2c817f649c6c9b49c1c8d84b51cb7226f707fd928ee85561b053e1fab/sufsd-0.8.tar.gz
- Home page: https://github.com/Triram-2/sufsd
- Summary: When parsing different sites, you almost always have to copy+paste some functions; this module was created to make such code easier. It includes the most commonly used functions when parsing. In the future it will be very actively replenished.
- Upload time: 2024-12-20 15:04:03
- Maintainer: None
- Docs URL: None
- Author: Twir
- Requires Python: >=3.8
- License: None
- Keywords: utils, selenium_driverless
- Requirements: No requirements were recorded.

# SUFSD (Standard Utilities For Selenium_Driverless)



## What is this?

When scraping different sites, you almost always end up copying and pasting the same helper functions; this module was created to make such code easier to write. It includes the functions most commonly used when parsing. More utilities will be added actively in the future.



## Dependencies



- Python >= 3.8
- Google-Chrome installed (Chromium not tested)



## Usage

```python
import asyncio
import os
import base64
import logging

from sufsd import init_browser
from sufsd import init_logging
from sufsd import go_to_url
from sufsd import scroll_page
from sufsd import parse_element
from sufsd import By

LINK = 'https://pypi.org/project/sufsd'
# Absolute path, so the log and the PDF land next to this script.
PATH_TO_DIR = os.path.dirname(os.path.abspath(__file__))

async def main():
    await init_logging(to_console=True, filename=f'{PATH_TO_DIR}/logs.log')
    browser = None  # defined up front so `finally` is safe if init_browser() fails
    try:
        browser = await init_browser(
            proxy=None,  # no proxy; pass 'ip:port' to use one
            headless=False,
            maximize_window=True)

        await go_to_url(browser, LINK)

        # only_nums=True keeps only digits, ',' and '.' from the heading text.
        version = await parse_element(
            browser, By.XPATH, '/html/body/main/div[1]/div/div[1]/h1', only_nums=True)
        logging.info(f'Current version: {version}')

        await scroll_page(browser)

        logging.info(f'Page title: {await browser.title}.')

        # The result is base64-encoded, so it is decoded before writing.
        bytes_for_pdf = await browser.print_page()

        with open(f'{PATH_TO_DIR}/sufsd.pdf', 'wb') as file:
            file.write(base64.b64decode(bytes_for_pdf))

        logging.info('Created file sufsd.pdf.')

    except Exception as error:
        logging.error(f'ERROR: {error}')

    finally:
        if browser is not None:
            await browser.quit()
            logging.info('The browser was closed.')


if __name__ == '__main__':
    asyncio.run(main())
```



## Utils implemented so far

`init_browser(proxy = None, headless = True, maximize_window = False, no_sandbox = False) #async`

Initializes the browser, with human-like delays and logging.
     

**Parameters:**    

- proxy (`str`)  -  Proxy in the format `ip:port` or `user:password@ip:port`.

- headless (`bool`)  -  Turns headless mode on/off.

- maximize_window (`bool`)  -  Turns window maximizing on/off.

- no_sandbox (`bool`)  -  Set to `True` when running on a server.

**Return type:**   `class selenium_driverless.webdriver.Chrome`

------

`go_to_url(browser, url) #async`

Reliably navigates to the link (lags or proxy speed limits should not prevent it from reaching the site), with human-like delays and logging.

**Parameters:**    

- browser (`Chrome`)  -  Browser selenium_driverless.webdriver.Chrome.

- url (`str`)  -  Link to site.

**Return type:**    `None`
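
The idea of getting to a page despite lags can be illustrated with a retry loop that pauses a random, human-like interval between attempts. This is only a sketch under assumptions: `retry_with_delay` and `flaky_navigation` are hypothetical names, and nothing here is taken from the actual `go_to_url` implementation.

```python
import asyncio
import random


async def retry_with_delay(action, attempts=5, delay_range=(0.5, 1.5)):
    """Hypothetical helper: retry an async action with a random pause between tries."""
    last_error = None
    for _ in range(attempts):
        try:
            return await action()
        except Exception as error:  # e.g. a timeout caused by a slow proxy
            last_error = error
            await asyncio.sleep(random.uniform(*delay_range))
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


async def demo_retry():
    """Simulate a navigation that fails twice before succeeding."""
    calls = {"count": 0}

    async def flaky_navigation():
        calls["count"] += 1
        if calls["count"] < 3:
            raise TimeoutError("page did not load in time")
        return "loaded"

    result = await retry_with_delay(flaky_navigation, delay_range=(0.0, 0.0))
    return result, calls["count"]
```

In real use, `action` would wrap the browser navigation call, and the delay range would stay non-zero.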

------

`click(browser, by, value, ID = None) #async`

Clicks the button found by `value`. If `ID` is specified, it uses `find_elements(by, value)` and clicks the element at index `ID` in the result.

**Parameters:**

- browser (`Chrome`)  -  Browser selenium_driverless.webdriver.Chrome.

- by (`str`)  -  One of the `By` locators.

- value (`str`)  -  The actual query to find by.

- ID (`int`)  -  Index of the WebElement to use if several match.

**Return type:**    `None`

------

`init_logging(to_console = True, filename = f'{os.path.dirname(__file__)}logs.log') #async`

Enables logging.

**Parameters:**

- to_console (`bool`)  -  Turns logging to the console on/off.
- filename (`str | bool`)  -  File to write logs to. Pass `False` to turn off logging to a file.

**Return type:**  `None`
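
For readers unfamiliar with this kind of switch, the console/file toggle can be sketched with the standard `logging` module alone. `setup_logging` is a hypothetical, synchronous stand-in written for illustration; it is not the code behind `init_logging`.

```python
import logging


def setup_logging(to_console=True, filename=False):
    """Hypothetical helper: attach console and/or file handlers to the root logger."""
    handlers = []
    if to_console:
        handlers.append(logging.StreamHandler())
    if filename:  # False (or an empty string) disables file logging
        handlers.append(logging.FileHandler(filename, encoding="utf-8"))
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        handlers=handlers,
        force=True,  # replace handlers left over from earlier configuration
    )
    return handlers
```

Calling `setup_logging(to_console=True, filename=False)` would log to the console only.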

------

`auth(browser, url, path_to_cookies, sleep = random.uniform(0.5, 1.5)) #async`

Opens `url`, adds the cookies from `path_to_cookies`, and then re-enters the site, with logging.

**Parameters:**    

- browser (`Chrome`)  -  Browser selenium_driverless.webdriver.Chrome.

- url (`str`)  -  Link to site.

- path_to_cookies (`str`)  -  Path to file with cookies.

- sleep (`float | int`)  -  Delay after adding cookies before re-entering the site

**Return type:**    `None`

------

`save_cookie(browser, path, close_browser = False) #async`

Saves the browser cookies to the file located at `path`, with logging. If `close_browser` is set, the browser is closed afterwards.

**Parameters:**    

- browser (`Chrome`)  -  Browser selenium_driverless.webdriver.Chrome.

- path (`str`)  - Path to file.

- close_browser (`bool`)  -  If True then closes the browser.

**Return type:**    `None`
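
The on-disk cookie format is not specified in this README. Purely as an illustration of a save/load round trip, the sketch below assumes a JSON file holding a list of cookie dicts; `dump_cookies`, `load_cookies`, and the sample cookie are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def dump_cookies(cookies, path):
    """Hypothetical: write a list of cookie dicts to `path` as JSON."""
    Path(path).write_text(json.dumps(cookies, indent=2), encoding="utf-8")


def load_cookies(path):
    """Hypothetical: read the cookie list back from `path`."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def demo_cookie_round_trip():
    """Save a sample cookie to a temporary file and load it again."""
    cookies = [{"name": "sessionid", "value": "abc123", "domain": "example.com"}]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "cookies.json"
        dump_cookies(cookies, path)
        return load_cookies(path)
```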

------

`scroll_page(browser, by = 'class_name', value = None, sleep = random.uniform(12, 15)) #async`

Scrolls the page all the way down, pressing the "Load more" button located by `by` and `value` (by class name by default) and allowing for site lag, with logging.

**Parameters:**

- browser (`Chrome`)  -  Browser selenium_driverless.webdriver.Chrome.

- by (`str`)  -  One of the `By` locators, used to find the "Load more" button.

- value (`str`)  -  The actual query used to find the "Load more" button.

- sleep (`float | int`)  -  The delay between "Load more" button presses.

**Return type:**    `None`
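
The press-until-exhausted behaviour described above can be sketched without a browser. Everything in this block is an assumption made for illustration: `press_until_gone` and its callbacks are hypothetical, and the real `scroll_page` may work differently.

```python
import asyncio


async def press_until_gone(find_button, click, sleep=0.0, max_presses=100):
    """Hypothetical loop: press the button that loads more content until it disappears."""
    presses = 0
    while presses < max_presses:
        button = await find_button()
        if button is None:  # nothing left to load
            break
        await click(button)
        presses += 1
        await asyncio.sleep(sleep)  # give a lagging site time to respond
    return presses


async def demo_scroll():
    """Simulate a page that offers three more batches of content."""
    remaining = {"batches": 3}

    async def find_button():
        return "button" if remaining["batches"] > 0 else None

    async def click(button):
        remaining["batches"] -= 1

    return await press_until_gone(find_button, click)
```

The `max_presses` cap is a design choice of the sketch: it keeps the loop from running forever on a page whose button never goes away.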

------

`parse_element(browser_or_WebElement, by, value, ID = None, no_clean = False, full_clean = False, only_nums = False) #async`

Searches for a WebElement by `value`, takes its text, and cleans it with `strip()`. If `ID` is specified, it uses `find_elements(by, value)` and takes the element at index `ID`. If `no_clean` is set, `strip()` is not applied. If `only_nums` is set, only the numbers from the WebElement's text are returned. If `full_clean` is set, line breaks are removed and runs of extra spaces are collapsed into one.

**Parameters:**

- browser_or_WebElement (`Chrome | WebElement`)  -  Browser or WebElement inside which the target WebElement is searched for.

- by (`str`)  -  One of the `By` locators.

- value (`str`)  -  The actual query to find by.

- ID (`int`)  -  Index of the WebElement to use if several match.

- no_clean (`bool`)  -  `True` to skip `strip()`.

- full_clean (`bool`)  -  `True` to remove line breaks and collapse extra spaces into one.

- only_nums (`bool`)  -  `True` to return only numbers, ',' and '.' from the WebElement's text.

**Return type:**    `str`

------

`clean_text(text, full_clean = False, only_nums = False) #async`

Cleans the given text.

**Parameters:**

- text (`str`)  -  Text to clean.

- full_clean (`bool`)  -  `True` to remove all line breaks.

- only_nums (`bool`)  -  `True` to return only numbers, ',' and '.' from the text.

**Return type:**    `str`
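
To make the cleaning rules concrete, here is a minimal sketch written from the descriptions above. `clean_text_sketch` is a hypothetical, synchronous re-implementation; the order in which the flags are applied is an assumption, and the library's own function may behave differently.

```python
import re


def clean_text_sketch(text, full_clean=False, only_nums=False):
    """Hypothetical re-implementation of the cleaning rules described above."""
    if only_nums:
        # Keep digits plus the ',' and '.' separators; drop everything else.
        return re.sub(r"[^0-9,.]", "", text)
    if full_clean:
        # Collapse line breaks and runs of whitespace into single spaces.
        return re.sub(r"\s+", " ", text).strip()
    return text.strip()
```

For example, `clean_text_sketch("sufsd 0.8", only_nums=True)` yields `"0.8"`, which is the kind of result the Usage example expects from `only_nums=True`.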

------

`change_proxy(browser, proxy, refresh = False) #async`

Changes the browser's proxy to the given proxy. If `refresh` is set, the browser returns to the page afterwards.

**Parameters:**

- browser (`Chrome`)  -  Browser selenium_driverless.webdriver.Chrome.

- proxy (`str`)  -  Proxy in the format 'ip:port' or 'user:password@ip:port'.

- refresh (`bool`)  -  Turns refreshing the site after changing the proxy on/off.

**Return type:**    `None`
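
The two proxy formats listed above can be told apart with plain string handling. The parser below is a sketch for illustration only; `ProxyParts` and `parse_proxy` are hypothetical names, not part of sufsd.

```python
from typing import NamedTuple, Optional


class ProxyParts(NamedTuple):
    host: str
    port: int
    user: Optional[str] = None
    password: Optional[str] = None


def parse_proxy(proxy: str) -> ProxyParts:
    """Hypothetical parser for 'ip:port' or 'user:password@ip:port'."""
    # Split off optional credentials at the last '@'.
    credentials, _, address = proxy.rpartition("@")
    host, _, port = address.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"unexpected proxy format: {proxy!r}")
    if not credentials:
        return ProxyParts(host, int(port))
    user, _, password = credentials.partition(":")
    return ProxyParts(host, int(port), user, password)
```

Validating the string before handing it to the browser makes a malformed proxy fail early with a clear message.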

------

`parse_pages(browser, by_for_num_pages, value_for_num_pages, by_for_next_page, value_for_next_page, func_for_every_page, args_for_func_for_every_page, ID_for_value_for_num_pages = None, ID_for_value_for_next_page = None, add_func_for_first_page = None, args_for_funs_for_first_page = None, skip_pages = None) #async`

Walks through all pages of the site by clicking the "Next page" button (`browser.find_element(by_for_next_page, value_for_next_page)`), calling the asynchronous function `func_for_every_page(args_for_func_for_every_page)` on each page. If `add_func_for_first_page` is specified, it is called with its arguments on the first page beforehand. If `skip_pages` is specified, the browser skips that many pages at the end.

**Parameters:**

- browser (`Chrome`)  -  Browser selenium_driverless.webdriver.Chrome.

- by_for_num_pages (`str`)  -  One of the `By` locators, used to find the place where the number of pages is shown.

- value_for_num_pages (`str`)  -  The actual query used to find the place where the number of pages is shown.

- by_for_next_page (`str`)  -  One of the `By` locators, used to find the "Next page" button.

- value_for_next_page (`str`)  -  The actual query used to find the "Next page" button.

- func_for_every_page (`func`)  -  The function that will be called on each page.

- args_for_func_for_every_page (`list`)  -  Arguments for `func_for_every_page`.

- ID_for_value_for_num_pages (`int`)  -  Index of the WebElement with the number of pages, if several match.

- ID_for_value_for_next_page (`int`)  -  Index of the WebElement with the "Next page" button, if several match.

- add_func_for_first_page (`func`)  -  The function that will be called beforehand on the first page.

- args_for_func_for_first_page (`list`)  -  Arguments for `add_func_for_first_page`.

- skip_pages (`int`)  -  The number of pages to skip at the end.

**Return type:**    `None`
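
The control flow described above (an optional first-page hook, a per-page callback, and skipping trailing pages) can be sketched without a browser. This is a simplified illustration: `walk_pages` and `go_next` are hypothetical, the page count is passed in directly instead of being read from the site, and unpacking the argument list with `*args` is an assumption about how the callbacks are invoked.

```python
import asyncio


async def walk_pages(num_pages, go_next, func_for_every_page, args,
                     add_func_for_first_page=None, args_for_first_page=None,
                     skip_pages=None):
    """Hypothetical pagination loop; `go_next` stands in for clicking "Next page"."""
    last_page = num_pages - (skip_pages or 0)
    for page in range(1, last_page + 1):
        if page == 1 and add_func_for_first_page is not None:
            await add_func_for_first_page(*(args_for_first_page or []))
        await func_for_every_page(*args)
        if page < last_page:  # no click after the final processed page
            await go_next()


async def demo_pages():
    """Walk a fake five-page site while skipping the last two pages."""
    visited = []
    state = {"page": 1}

    async def go_next():
        state["page"] += 1

    async def record(label):
        visited.append((label, state["page"]))

    await walk_pages(5, go_next, record, ["item"], skip_pages=2)
    return visited
```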

------

### **By Element Locator**

Set of supported locator strategies. Supported aliases are listed on the right; they are case-insensitive.

- `ID='id'`
- `NAME='name'`
- `XPATH='xpath'`
- `TAG_NAME='tag name'`  Also: `tag_name`, `tag`
- `CLASS_NAME='class name'`  Also: `class_name`, `class`
- `CSS_SELECTOR='css selector'`  Also: `css`
- `CSS='css selector'`  Also: `css`
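
Case-insensitive alias handling of this kind can be sketched as a lookup table. The mapping below is copied from the list above, while `normalize_locator` itself is a hypothetical helper and not necessarily how sufsd resolves aliases.

```python
# Canonical strategy strings and their aliases, taken from the list above.
_ALIASES = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "tag name": "tag name",
    "tag_name": "tag name",
    "tag": "tag name",
    "class name": "class name",
    "class_name": "class name",
    "class": "class name",
    "css selector": "css selector",
    "css": "css selector",
}


def normalize_locator(by: str) -> str:
    """Hypothetical helper: map a case-insensitive alias to its canonical string."""
    try:
        return _ALIASES[by.strip().lower()]
    except KeyError:
        raise ValueError(f"unsupported locator: {by!r}") from None
```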



## Author

Developer: https://t.me/VHdpcj

            
