# Introduction
Generic Crawler SDK is a web crawling framework used to extract structured data from the web. It can be used for a wide range of purposes, from data mining for intelligent analytics to monitoring competitor pricing.
![](docs/images/generic-crawler-logomsu.png)
## Architecture
Generic Crawler SDK follows a client-server model: crawler actions defined on the client side are sent over a REST API to the crawler service, which handles the incoming requests and returns the data extracted by the crawler engine.
![](docs/architecture/generic-crawler-arch-fordocs.png)
## Requirements
* Works on Linux and Windows; any environment where Python can be installed is suitable for this SDK.
* Python 3.8 or later is required. The official download page is [https://www.python.org/downloads/](https://www.python.org/downloads/).
* The SDK communicates with the crawler service, so the following must be whitelisted on the network:
  * service endpoint URL
  * outbound port 443 (HTTPS)
## Installation
### Installing via pip
The SDK is distributed as a pip package, so it can be installed easily with pip. Creating a virtual environment first is recommended.
![](docs/images/install.png)
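For example, on a typical setup the installation might look like the following; the package name matches the PyPI distribution, while the virtual environment name is just an example:

```bash
# create and activate a virtual environment (optional but recommended)
python -m venv .venv
source .venv/bin/activate        # on Windows: .venv\Scripts\activate

# install the SDK from PyPI
pip install generic-crawler-sdk
```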
### Setting up user specific variables
Once the package is installed, a `.env` file needs to be configured in order to provide the access token and endpoint URL to the SDK. Create a `.env` file in the project root directory and add the user-specific variables as below:
![](docs/images/dotenv-file.png)
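As a sketch only, the `.env` file might look like the following; the variable names below are placeholders, so use the exact keys shown in the screenshot above:

```ini
# hypothetical variable names -- use the exact keys shown in the screenshot above
GENERIC_CRAWLER_ENDPOINT_URL=https://crawler-service.example.com
GENERIC_CRAWLER_ACCESS_TOKEN=<your-access-token>
```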
# Usage
## Config
Config is an object whose main function is to load the user-specific variables from the dotenv file and provide them to other objects, such as GenericCrawler. Currently there are two user-specific variables: the service endpoint URL and the access token.
![](docs/images/config-object.png)
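A minimal usage sketch is shown below; the import path and constructor signature are assumptions, so check the screenshot above for the exact form:

```python
from generic_crawler import Config  # hypothetical import path

# Config reads the endpoint URL and access token from the .env file
config = Config()
```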
## Action Reader
ActionReader is an object whose main function is to read and load action files and validate the structural correctness of their format. If the user has written an action that includes an unimplemented attribute or is missing a required one, it raises an exception.
![](docs/images/actionreader-schema-validation.png)
The ActionReader object has one attribute: **action**. The loaded and validated action file is converted into a dict and assigned to this attribute.
![](docs/images/actionreader-reader-action.png)
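A sketch of loading and validating an action file; the import path and constructor argument shown here are assumptions:

```python
from generic_crawler import ActionReader  # hypothetical import path

# reads the YAML action file and validates it against the expected schema;
# an invalid or incomplete action raises an exception
reader = ActionReader("actions_seller_page.yml")

# the validated action file is available as a plain dict
print(reader.action)
```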
## Generic Crawler
The main function of the GenericCrawler object is to send requests to the remote crawler service with a payload containing the actions loaded by ActionReader. During instantiation, the GenericCrawler object checks the health status of the remote crawler service endpoint. The object is created only if the service is up and ready.
![](docs/images/genericcrawler.png)
The instantiated crawler object has two attributes: **endpoint** & **is_alive**.
![](docs/images/genericcrawler-attributes.png)
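A sketch of instantiating the crawler, assuming the constructor takes the Config object (the exact import path and signature may differ):

```python
from generic_crawler import GenericCrawler  # hypothetical import path

# instantiation performs a health check against the remote service;
# the object is created only if the service is up and ready
crawler = GenericCrawler(config)

print(crawler.endpoint)  # the remote crawler service URL
print(crawler.is_alive)  # True if the health check succeeded
```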
It has a single method, retrieve(), which is called with the **action** attribute of ActionReader as its argument. Once called, the request is sent to the crawler service and the SDK waits for a response.
![](docs/images/genericcrawler-retrieve.png)
The crawler service executes the actions defined in the user's action.yaml file and returns either the data extracted from the targets or the exception details if an error occurs during crawling.
![](docs/images/genericcrawler-retrieve-result.png)
The retrieve method of the GenericCrawler object returns the parsed extracted data and the response object. The response object is returned only for debugging purposes and can therefore be ignored. The extracted data is converted into a Python dictionary.
![](docs/images/generic-crawler-data.png)
The keys in the dictionary are named after the targets in the user's action.yaml file.
![](docs/images/target-keyvalue-dummy.png)
Successfully crawled data can be further processed & stored by the user.
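For example, a hedged sketch of calling retrieve() and consuming the returned dictionary; the key names depend entirely on the targets defined in the action file, and the return order shown here is an assumption:

```python
# send the loaded action to the crawler service and wait for the response;
# retrieve() returns the parsed data plus the raw response (debugging only)
data, response = crawler.retrieve(reader.action)

# keys are named after the targets in the action.yaml file, e.g. a target
# called "seller_name" would appear as data["seller_name"]
for target_name, value in data.items():
    print(target_name, value)
```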
## Action Components
Actions are YAML-formatted files in which browser interactions are defined, and they consist of two components: **steps & targets**. Action files should also include name and URL info:
![](docs/images/sample-action-1.png)
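Purely as an illustration, an action file roughly follows the shape below; apart from the name/URL info and the steps/targets split described above, the field names are assumptions, so rely on the sample screenshot for the real schema:

```yaml
# hypothetical layout -- see the sample screenshot above for the exact schema
name: sample-action
url: https://www.example.com
steps:
  - type: do-nothing        # at least one step is always required
targets:
  - name: page_title
    selector: //h1
    text: true
```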
### Steps
Steps point to elements and describe the specific actions on them that are required in order to reach the target element(s).
#### do-nothing
Literally does nothing. Because the generic crawler always requires at least one step to execute, use this action if no step is needed to extract the target.
![](docs/images/step-do-nothing.png)
#### wait-for
Waits for the given duration.
![](docs/images/step-wait-for.png)
#### click
Performs a mouse click on the given element selector.
![](docs/images/step-click.png)
#### write
Writes the given string into the element matched by the selector. When "wait" is true, the step waits for the element's presence & visibility before executing (see step [wait-for](#wait-for)).
![](docs/images/step-write.png)
#### mouse-hover
Move mouse (virtually) over the given selector.
![](docs/images/step-mouse-hover.png)
#### scroll
Scrolls the page in the given direction (up/down). The repetition attribute enables scrolling multiple times for pages with infinite scroll.
![](docs/images/step-scroll.png)
#### hit-enter
Sends the 'enter' keyboard event to the page.
![](docs/images/step-hit-enter.png)
#### iterate-until
Retrieves the given parent element and starts iterating over its child elements. Iteration continues until the given condition applies; the condition is a string search and its match. Once the looked-up child element is found, the given custom action (e.g. click, write, etc.) is executed on it.
![](docs/images/step-iterate-until.png)
#### retrieve-sitemap
Some pages provide their entire sitemap in XML format without any GUI component. This action enables crawling of sitemap data. The depth attribute defines how far the recursive crawling should progress.
![](docs/images/step-retrieve-sitemap.png)
#### popup-check
Waits for popups after page load and dismisses the given popup window if it exists.
![](docs/images/step-popup-check.png)
### Targets
Targets are pointers, defined with XPath/CSS selectors, to the elements containing the data to be extracted from pages. A crawl action can have multiple targets. Currently available target types are text, nontext, URLs, and values of custom attributes.
#### text
Extracts the text of the element as the user sees it on the page.
![](docs/images/target-text.png)
#### nontext
Extracts a non-text attribute from the element. Currently "image_url" is supported and available.
![](docs/images/target-nontext.png)
#### extract-urls
Extracts URLs from the href attribute of the given element selector. Used with a boolean value.
![](docs/images/target-extract-urls.png)
#### attribute
Extracts the value of any given attribute from the element selector. The return type depends on the value of the extracted attribute: if the attribute has multiple values, a list of values is returned; otherwise a single string value is returned.
![](docs/images/target-attribute.png)
#### anchored-iteration
This type of target includes a parent selector and its child selector(s). The child selectors consist of an Anchor and a Target, which are retrieved as sub-selectors of the parent(s). Iteration occurs over the anchor selector: on each iteration the given anchor action is executed as a mini-step so that the target values become available, and the target values are extracted for each target element of that anchor. The anchor values themselves are also extracted. Finally, the service returns a dictionary of the extracted anchor values and the target values of each anchor belonging to the parent selector.
![](docs/images/target-anchored-iteration.png)
## Error Handling
The crawler service tries to catch as many different error types as possible while executing crawler actions. Any error caused by a missing or mismatched selector is returned to the developer using the SDK. The developer is expected to handle the crawler's response in their Python script, whether it contains successfully extracted data or an error message with the exception details. For unexpected or unclear error messages you can contact "TEAM-AI@turkcell.com.tr" for further investigation. If the error is browser-driver related, the exact error detail text is reflected as it is returned from the driver (e.g. ERR_NAME_NOT_RESOLVED, caused by trying to navigate to a non-existing URL).
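A hedged sketch of handling either outcome in the calling script; the exact shape of the error payload is not documented here, so the check below is illustrative rather than the SDK's exact contract:

```python
data, response = crawler.retrieve(reader.action)

# the service returns either the extracted data or an error message with
# exception details; the error check below is an assumption
if isinstance(data, dict) and "error" not in data:
    handle_extracted_data(data)   # hypothetical helper: your own post-processing / storage
else:
    print("Crawl failed, details:", data)
    print("Raw response for debugging:", response)
```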
### Selector Error:
When the given selector is not found:
![](docs/images/error-message-selector.png)
### URL Error:
When a non-existing URL is crawled:
![](docs/images/error-message-nonexist-url.png)
### Connection Error:
Due to security concerns, the Generic Crawler Service lives in an environment where only specific page categories are whitelisted. Some pages might be classified as malicious or dangerous under Turkcell's security policies and therefore excluded from the whitelist. If the connection is dropped or refused by the firewall policy rules, the SDK returns a connection error.
![](docs/images/error-message-connection.png)
# Use Case Examples
We provide some ready-to-use use case examples. They are heavily commented so that the reader gains a clear understanding of how to implement crawler bots with this crawler framework SDK.
For each crawler use case implemented with this SDK, we write a Python script file and one or more action files in YAML format. There can be as many distinct action files as needed, one for each browser interaction required to extract the data.
## Example (1) - Crawling the seller info from an ecommerce marketplace site
In this use case, where we need to crawl and extract seller information from an ecommerce marketplace site, the files are as below:
- **crawl_seller_page.py** ; crawler logic
- **actions_seller_page.yml** ; defined interactions, as described in the [steps & targets](#steps) sections above.
![](docs/images/example-action-seller-yaml.png)
![](docs/images/example-action-seller-py.png)
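For orientation, crawl_seller_page.py would roughly combine the pieces described in the Usage section as sketched below, under the same assumptions about import paths and signatures as above; the output path is illustrative:

```python
import json

from generic_crawler import Config, ActionReader, GenericCrawler  # hypothetical import path

config = Config()                                  # endpoint URL + access token from .env
reader = ActionReader("actions_seller_page.yml")   # load & validate the action file
crawler = GenericCrawler(config)                   # health-checks the remote service

data, response = crawler.retrieve(reader.action)   # execute the crawl remotely

# store the extracted seller info for further processing (output path is illustrative)
with open("seller_info.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
```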
## Example (2) - Pagination of tariff details on a telecom operator site
Pagination is an important aspect of data extraction from web pages. Some pages require clicking a "Next" button or using another method to display the full list of items. Here we crawl the partially displayed tariff details from each page.
Note: the .../li[**last()**] target selector used below retrieves the last item from the list of pagination-related elements. You can consult the official XPath documentation for usage details of the last() function: [https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/last](https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/last)
- **action_pagination.yml** ; defined interactions, as described in the [steps & targets](#steps) sections above.
- **crawl_products_using_pagination.py** ; crawler logic
![](docs/images/example-action-pagination-yaml.png)
![](docs/images/example-crawl-products-using-pagination-py.png)
# Contact
We would like to hear about any feature requests, bug reports, issues, or questions regarding this crawler framework SDK and its documentation, which you are currently reading. Please feel free to contact us at any time.
**TEAM-SENSAI** - [team-sensai@turkcell.entp.tgc](mailto:TEAM-SENSAI@turkcell.entp.tgc)