| Field | Value |
| --- | --- |
| Name | sufsd |
| Version | 0.7 |
| Home page | https://github.com/Triram-2/sufsd |
| Summary | When parsing different sites, you almost always end up copy-pasting the same helper functions; this module was created to simplify such code. It includes the functions most commonly used when parsing sites and will be actively extended. |
| Upload time | 2024-11-24 10:01:22 |
| Author | Twir |
| Requires Python | >=3.8 |
| License | None |
| Keywords | utils, selenium_driverless |
# SUFSD (Standard Utilities For Selenium_Driverless)
## What is this?
When parsing different sites, you almost always end up copy-pasting the same helper functions; this module was created to simplify such code. It includes the functions most commonly used when parsing sites and will be actively extended.
## Dependencies
- Python >= 3.8
- Google Chrome installed (Chromium not tested)
## Usage
```python
import asyncio
import os
import base64
import logging

from sufsd import init_browser
from sufsd import init_logging
from sufsd import go_to_url
from sufsd import scroll_page
from sufsd import parse_element
from sufsd import By

LINK = 'https://pypi.org/project/sufsd'
PATH_TO_DIR = os.path.dirname(__file__)

async def main():
    await init_logging(to_console=True, filename=f'{PATH_TO_DIR}/logs.log')
    try:
        browser = await init_browser(
            proxy=None,
            headless=False,
            maximize_window=True)

        await go_to_url(browser, LINK)

        logging.info(f'Current version: {await parse_element(browser, By.XPATH, "/html/body/main/div[1]/div/div[1]/h1", only_nums=True)}')

        await scroll_page(browser)

        logging.info(f'Page title: {await browser.title}.')

        # print_page() returns the page as base64-encoded PDF data.
        bytes_for_pdf = await browser.print_page()
        with open(f'{PATH_TO_DIR}/sufsd.pdf', 'wb') as file:
            file.write(base64.b64decode(bytes_for_pdf))

        logging.info('Created file sufsd.pdf.')

    except Exception as error:
        logging.error(f'ERROR: {error}')

    finally:
        await browser.quit()
        logging.info('The browser was closed.')


if __name__ == '__main__':
    asyncio.run(main())
```
## Utils implemented so far
`init_browser(proxy = None, headless = True, maximize_window = False, no_sandbox = False) #async`
Initializes the browser with human-like delays and logging.
**Parameters:**
- proxy (`str`) - Proxy in the format `ip:port` or `user:password@ip:port`.
- headless (`bool`) - Headless mode on/off.
- maximize_window (`bool`) - Window maximization on/off.
- no_sandbox (`bool`) - Set to `True` when running on a server.
**Return type:** `selenium_driverless.webdriver.Chrome`
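For instance, a minimal sketch of starting a headless browser on a server; the proxy address is a placeholder:

```python
import asyncio
from sufsd import init_browser

async def demo():
    # The proxy below is a hypothetical address, shown only for the format.
    browser = await init_browser(
        proxy='user:password@127.0.0.1:8080',
        headless=True,
        no_sandbox=True)  # True because this sketch targets a server
    try:
        print(await browser.title)
    finally:
        await browser.quit()

asyncio.run(demo())
```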
------
`go_to_url(browser, url) #async`
Reliably navigates to the given URL (lag or proxy speed limits cannot prevent reaching the site), with human-like delays and logging.
**Parameters:**
- browser (`Chrome`) - Browser selenium_driverless.webdriver.Chrome.
- url (`str`) - Link to site.
**Return type:** `None`
------
`click(browser, by, value, ID = None) #async`
Clicks the element found by `value`. If `ID` is specified, `find_elements(by, value)` is used and the element is taken as `found_elements[ID]`.
**Parameters:**
- browser (`Chrome`) - Browser selenium_driverless.webdriver.Chrome.
- by (`str`) - One of the locators at `By`.
- value (`str`) - The actual query to find by.
- ID (`int`) - Index of the matching WebElement, if there are several of them.
**Return type:** `None`
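A short sketch of the `ID` semantics, assuming a page with several elements of the same (hypothetical) class:

```python
import asyncio
from sufsd import init_browser, go_to_url, click, By

async def demo():
    browser = await init_browser()
    try:
        await go_to_url(browser, 'https://example.com')  # placeholder URL
        # Clicks the second match, i.e. find_elements(by, value)[1];
        # 'load-more' is a hypothetical class name.
        await click(browser, By.CLASS_NAME, 'load-more', ID=1)
    finally:
        await browser.quit()

asyncio.run(demo())
```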
------
`init_logging(to_console = True, filename = f'{os.path.dirname(__file__)}logs.log') #async`
Enables logging.
**Parameters:**
- to_console (`bool`) - Logging to the console on/off.
- filename (`str | bool`) - Path of the log file. Pass `filename=False` to disable logging to a file.
**Return type:** `None`
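For example, console-only logging (a sketch; per the description above, `filename=False` disables the log file):

```python
import asyncio
import logging
from sufsd import init_logging

async def demo():
    await init_logging(to_console=True, filename=False)
    logging.info('This goes to the console only.')

asyncio.run(demo())
```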
------
`auth(browser, url, path_to_cookies, sleep = random.uniform(0.5, 1.5)) #async`
Goes to `url` and re-enters the site with the cookies stored at `path_to_cookies`, keeping logs. (A combined sketch with `save_cookie` follows the next section.)
**Parameters:**
- browser (`Chrome`) - Browser selenium_driverless.webdriver.Chrome.
- url (`str`) - Link to site.
- path_to_cookies (`str`) - Path to file with cookies.
- sleep (`float | int`) - Delay after adding the cookies, before re-entering the site.
**Return type:** `None`
------
`save_cookie(browser, path, close_browser = False) #async`
Saves the browser cookies to the file at `path`; if `close_browser` is `True`, the browser is closed afterwards. Logs are kept.
**Parameters:**
- browser (`Chrome`) - Browser selenium_driverless.webdriver.Chrome.
- path (`str`) - Path to file.
- close_browser (`bool`) - If True then closes the browser.
**Return type:** `None`
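A sketch of the round trip between `save_cookie` and `auth`; the URLs and the cookie path are placeholders. The first session logs in and saves its cookies, a later session reuses them:

```python
import asyncio
from sufsd import init_browser, go_to_url, auth, save_cookie

COOKIES = 'cookies.data'  # hypothetical path for the cookie file

async def first_session():
    browser = await init_browser()
    await go_to_url(browser, 'https://example.com/login')  # placeholder URL
    # ... perform the login here, e.g. with click() ...
    await save_cookie(browser, COOKIES, close_browser=True)

async def later_session():
    browser = await init_browser()
    try:
        # Re-enter the site with the previously saved cookies.
        await auth(browser, 'https://example.com', COOKIES)
    finally:
        await browser.quit()

asyncio.run(first_session())
asyncio.run(later_session())
```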
------
`scroll_page(browser, by = 'class_name', value = None, sleep = random.uniform(12, 15)) #async`
Scrolls the page all the way down, pressing the "Upload more" button (located via `by`/`value`, by class name by default) and allowing for site lag, keeping logs.
**Parameters:**
- browser(`Chrome`) - Browser selenium_driverless.webdriver.Chrome.
- by (`str`) - One of the locators at `By` for button "Upload more".
- value (`str`) - The actual query to find the button "Upload more" by.
- sleep (`float | int`) - The delay between "Upload more" button presses.
**Return type:** `None`
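A sketch with a hypothetical "Upload more" button located by class name:

```python
import asyncio
from sufsd import init_browser, go_to_url, scroll_page, By

async def demo():
    browser = await init_browser()
    try:
        await go_to_url(browser, 'https://example.com/feed')  # placeholder URL
        # Scroll to the bottom, pressing the (hypothetical) 'upload-more'
        # button whenever it appears.
        await scroll_page(browser, by=By.CLASS_NAME, value='upload-more')
    finally:
        await browser.quit()

asyncio.run(demo())
```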
------
`parse_element(browser_or_WebElement, by, value, ID = None, no_clean = False, full_clean = False, only_nums = False) #async`
Finds a WebElement by `value`, takes its text, and cleans it with `strip()`. If `ID` is specified, `find_elements(by, value)` is used and the element is taken as `found_elements[ID]`. With `no_clean`, `strip()` is skipped. With `only_nums`, only numbers (plus ',' and '.') are returned from the element's text. With `full_clean`, line breaks and runs of spaces are collapsed into single spaces.
**Parameters:**
- browser_or_WebElement (`Chrome | WebElement`) - Browser or WebElement where the subsequent WebElement will be searched.
- by (`str`) - One of the locators at `By`.
- value (`str`) - The actual query to find by.
- ID (`int`) - Index of the matching WebElement, if there are several of them.
- no_clean (`bool`) - `True` to skip `strip()`.
- full_clean (`bool`) - `True` to collapse line breaks and extra spaces into single spaces.
- only_nums (`bool`) - `True` to return only numbers, ',' and '.' from the element's text.
**Return Type:** `str`
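For example, extracting a price as digits only; the URL and XPath are placeholders:

```python
import asyncio
from sufsd import init_browser, go_to_url, parse_element, By

async def demo():
    browser = await init_browser()
    try:
        await go_to_url(browser, 'https://example.com/item')  # placeholder URL
        # only_nums keeps just the digits plus ',' and '.' (per the docs above).
        price = await parse_element(browser, By.XPATH,
                                    '//span[@class="price"]',  # hypothetical XPath
                                    only_nums=True)
        print(price)
    finally:
        await browser.quit()

asyncio.run(demo())
```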
------
`clean_text(text, full_clean = False, only_nums = False) #async`
Cleans the given text.
**Parameters:**
- text (`str`) - Text to clean.
- full_clean (`bool`) - `True` to collapse line breaks and extra spaces into single spaces.
- only_nums (`bool`) - `True` to return only numbers, ',' and '.' from the text.
**Return type:** `str`
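A small sketch; note that `clean_text` is asynchronous, per the `#async` mark above:

```python
import asyncio
from sufsd import clean_text

async def demo():
    raw = '  Total:\n 1 234,56  '
    # full_clean: collapse line breaks and extra spaces (per the docs above).
    print(await clean_text(raw, full_clean=True))
    # only_nums: keep only digits plus ',' and '.' (per the docs above).
    print(await clean_text(raw, only_nums=True))

asyncio.run(demo())
```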
------
`change_proxy(browser, proxy, refresh = False) #async`
Switches the browser to the given proxy. If `refresh` is `True`, the current page is reloaded.
**Parameters:**
- browser (`Chrome`) - Browser selenium_driverless.webdriver.Chrome.
- proxy (`str`) - Proxy in the format `ip:port` or `user:password@ip:port`.
- refresh (`bool`) - On/off refresh to site after changing the proxy.
**Return type:** `None`
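A sketch of rotating the proxy mid-session; both addresses are placeholders:

```python
import asyncio
from sufsd import init_browser, go_to_url, change_proxy

async def demo():
    browser = await init_browser(proxy='user:password@10.0.0.1:8080')  # placeholder
    try:
        await go_to_url(browser, 'https://example.com')  # placeholder URL
        # Switch to another proxy and reload the current page.
        await change_proxy(browser, 'user:password@10.0.0.2:8080', refresh=True)
    finally:
        await browser.quit()

asyncio.run(demo())
```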
------
`parse_pages(browser, by_for_num_pages, value_for_num_pages, by_for_next_page, value_for_next_page, func_for_every_page, args_for_func_for_every_page, ID_for_value_for_num_pages = None, ID_for_value_for_next_page = None, add_func_for_first_page = None, args_for_funs_for_first_page = None, skip_pages = None) #async`
Walks through all pages of the site by clicking the "Next page" button (`browser.find_element(by_for_next_page, value_for_next_page)`), calling the asynchronous function `func_for_every_page(args_for_func_for_every_page)` on each page. If `add_func_for_first_page` is specified, it is called first on the first page with `args_for_funs_for_first_page`. If `skip_pages` is specified, the last `skip_pages` pages are skipped. A sketch follows the parameter list.
**Parameters:**
- browser (`Chrome`) - Browser selenium_driverless.webdriver.Chrome.
- by_for_num_pages (`str`) - One of the locators at `By` for the element that holds the number of pages.
- value_for_num_pages (`str`) - The actual query to find the element that holds the number of pages.
- by_for_next_page (`str`) - One of the locators at `By` for the "Next page" button.
- value_for_next_page (`str`) - The actual query to find the "Next page" button by.
- func_for_every_page (`func`) - The asynchronous function to run on each page.
- args_for_func_for_every_page (`list`) - Arguments for `func_for_every_page`.
- ID_for_value_for_num_pages (`int`) - Index of the WebElement with the number of pages, if there are several of them.
- ID_for_value_for_next_page (`int`) - Index of the WebElement with the "Next page" button, if there are several of them.
- add_func_for_first_page (`func`) - The function to run first on the first page.
- args_for_funs_for_first_page (`list`) - Arguments for `add_func_for_first_page`.
- skip_pages (`int`) - The number of pages to skip at the end.
**Return type:** `None`
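A sketch under the signature above; the locators and the per-page function are hypothetical, and it assumes `args_for_func_for_every_page` is passed to `func_for_every_page` as its argument list:

```python
import asyncio
from sufsd import init_browser, go_to_url, parse_pages, parse_element, By

async def scrape_page(browser):
    # Hypothetical per-page work: log the page heading.
    print(await parse_element(browser, By.TAG_NAME, 'h1'))

async def demo():
    browser = await init_browser()
    try:
        await go_to_url(browser, 'https://example.com/catalog')  # placeholder URL
        await parse_pages(
            browser,
            By.CLASS_NAME, 'page-count',  # hypothetical element holding the number of pages
            By.CLASS_NAME, 'next-page',   # hypothetical "Next page" button
            scrape_page, [browser],       # run on every page
            skip_pages=1)                 # skip the last page
    finally:
        await browser.quit()

asyncio.run(demo())
```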
------
### **By Element Locator**
Set of supported locator strategies. Supported aliases are listed on the right; they are case-insensitive.
- `ID='id'`
- `NAME='name'`
- `XPATH='xpath'`
- `TAG_NAME='tag name'` Also:`tag_name` , `tag`
- `CLASS_NAME='class name'` Also:`class_name` , `class`
- `CSS_SELECTOR='css selector'` Also:`css`
- `CSS='css selector'` Also:`css`
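Since the aliases are plain case-insensitive strings, a constant and its alias should be interchangeable (a sketch with a placeholder URL):

```python
import asyncio
from sufsd import init_browser, go_to_url, parse_element, By

async def demo():
    browser = await init_browser()
    try:
        await go_to_url(browser, 'https://example.com')  # placeholder URL
        # Per the alias table, these two calls should be equivalent.
        via_constant = await parse_element(browser, By.TAG_NAME, 'h1')
        via_alias = await parse_element(browser, 'tag', 'h1')
        print(via_constant == via_alias)
    finally:
        await browser.quit()

asyncio.run(demo())
```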
## Author
Developer: https://t.me/VHdpcj