ciag-robot


Name: ciag-robot
Version: 0.3.0
Home page: https://github.com/OpenCIAg/py-robot
Summary: Python Library to Build Web Robots
Upload time: 2021-01-07 17:33:11
Author: Éttore Leandro Tognoli
License: Apache License 2.0
Keywords: robot, web crawler
Requirements: No requirements were recorded.

# Python Web Robot Builder

[![Build Status](https://travis-ci.org/OpenCIAg/py-robot.svg?branch=master)](https://travis-ci.org/OpenCIAg/py-robot)
[![PyPI version](https://badge.fury.io/py/ciag-robot.svg)](https://badge.fury.io/py/ciag-robot)
[![Maintainability](https://api.codeclimate.com/v1/badges/4116e2ba99ce56e1397e/maintainability)](https://codeclimate.com/github/OpenCIAg/py-robot/maintainability)
[![Test Coverage](https://api.codeclimate.com/v1/badges/4116e2ba99ce56e1397e/test_coverage)](https://codeclimate.com/github/OpenCIAg/py-robot/test_coverage)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg)](CODE_OF_CONDUCT)



py-robot aims to simplify the code of web crawlers and improve their performance.

## Install

```sh
pip install ciag-robot
```
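
To confirm that the package is installed, one option is to query the installed distribution metadata. This is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_version` is an assumption made for illustration and is not part of py-robot.

```python
from importlib import metadata


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "ciag-robot" is the distribution name used in the pip command above.
print(installed_version("ciag-robot"))
```

The call prints a version string when the distribution is present and `None` otherwise.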


## Intro

Below is a simple crawler that fetches a page and then, for each item of interest, fetches another page.
Because it does not use asynchronous requests, each request starts only after the previous one has finished.

```py
# examples/iot_eetimes.py

import json

import requests
from lxml import html
from pyquery import PyQuery as pq

# Fetch and parse the index page.
page = requests.get('https://iot.eetimes.com/')
dom = pq(html.fromstring(page.content.decode()))

result = []
for link in dom.find('.theiaStickySidebar ul li'):
    news = {
        'category': pq(link).find('span').text(),
        'url': pq(link).find('a[href]').attr('href'),
    }
    # Blocking request: the next item is not processed until this one returns.
    news_page = requests.get(news['url'])
    news_dom = pq(news_page.content.decode())
    news['body'] = news_dom.find('p').text()
    news['title'] = news_dom.find('h1.post-title').text()
    result.append(news)

print(json.dumps(result, indent=4))
```

The same crawler can be rewritten with py-robot as follows:


```py
# examples/iot_eetimes2.py

import json
import logging

from robot import Robot
# Note: this star import shadows the builtins `dict` and `any`
# with the collector helpers of the same name.
from robot.collector.shortcut import *

logging.basicConfig(level=logging.DEBUG)

collector = pipe(
    const('https://iot.eetimes.com/'),
    get(),
    css('.theiaStickySidebar ul li'),
    foreach(dict(
        pipe(
            css('a[href]'), attr('href'), any(),
            get(),
            dict(
                body=pipe(css('p'), as_text()),
                title=pipe(css('h1.post-title'), as_text()),
            )
        ),
        category=pipe(css('span'), as_text()),
        url=pipe(css('a[href]'), attr('href'), any(), url())
    ))
)

with Robot() as robot:
    result = robot.sync_run(collector)
print(json.dumps(result, indent=4))
```

The requests are now asynchronous: the per-item requests are started at the same time instead of one after another, which improves the performance of the crawler.
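
To illustrate why starting the per-item requests together helps, here is a minimal sketch that uses only the standard library `asyncio` module. It is not py-robot's implementation: `fake_fetch` is a hypothetical stand-in that sleeps instead of performing an HTTP request, and the URLs are placeholders.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for an HTTP GET: it only sleeps to simulate latency."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl_sequential(urls):
    # One request at a time, like the requests-based example.
    return [await fake_fetch(u) for u in urls]


async def crawl_concurrent(urls):
    # Start every request first, then wait for all of them together.
    return list(await asyncio.gather(*(fake_fetch(u) for u in urls)))


def timed(crawl, urls):
    """Run a crawl coroutine function and return (pages, elapsed seconds)."""
    start = time.perf_counter()
    pages = asyncio.run(crawl(urls))
    return pages, time.perf_counter() - start


urls = [f"https://example.invalid/item/{i}" for i in range(5)]
seq_pages, seq_time = timed(crawl_sequential, urls)
con_pages, con_time = timed(crawl_concurrent, urls)
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

With five simulated requests, the sequential run waits for each delay in turn, while the concurrent run waits roughly as long as the slowest single request. Both return the pages in the same order.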
            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/OpenCIAg/py-robot",
    "name": "ciag-robot",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "Robot,Web Crawler",
    "author": "\u00c9ttore Leandro Tognoli",
    "author_email": "ettore.tognoli@ciag.org.br",
    "download_url": "https://files.pythonhosted.org/packages/ff/5a/dcb5447349cae8b7620922d3644f2fad3503491ece13c2068f3a31b9f3e6/ciag-robot-0.3.0.tar.gz",
    "platform": "",
    "bugtrack_url": null,
    "license": "Apache License 2.0",
    "summary": "Python Library to Build Web Robots",
    "version": "0.3.0",
    "project_urls": {
        "Download": "https://github.com/OpenCIAg/py-robot/tree/0.3.0/",
        "Homepage": "https://github.com/OpenCIAg/py-robot"
    },
    "split_keywords": [
        "robot",
        "web crawler"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ff5adcb5447349cae8b7620922d3644f2fad3503491ece13c2068f3a31b9f3e6",
                "md5": "6b6a8d0a64a86867f2e3a41950248002",
                "sha256": "62af8f20c97d5da09117fc367ba8ecf620d9a33a70f21123fa9abf0b94663c0a"
            },
            "downloads": -1,
            "filename": "ciag-robot-0.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "6b6a8d0a64a86867f2e3a41950248002",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 14192,
            "upload_time": "2021-01-07T17:33:11",
            "upload_time_iso_8601": "2021-01-07T17:33:11.334751Z",
            "url": "https://files.pythonhosted.org/packages/ff/5a/dcb5447349cae8b7620922d3644f2fad3503491ece13c2068f3a31b9f3e6/ciag-robot-0.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2021-01-07 17:33:11",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "OpenCIAg",
    "github_project": "py-robot",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": false,
    "lcname": "ciag-robot"
}
        