# TurboCrawler
# What is it?
It is a micro-framework that you can use to build your crawlers easily, focused on being fast, extremely
customizable, extensible, and easy to use, giving you the power to control the crawler's behavior.
It provides ways to schedule requests, parse your data asynchronously, and extract redirect links from an HTML page.
# Install
```sh
pip install turbocrawler
```
# Code Example
```python
from pprint import pprint
import requests
from parsel import Selector
from turbocrawler import Crawler, CrawlerRequest, CrawlerResponse, CrawlerRunner, ExecutionInfo, ExtractRule


class QuotesToScrapeCrawler(Crawler):
    crawler_name = "QuotesToScrape"
    allowed_domains = ['quotes.toscrape']
    # Links matching this regex are automatically scheduled in the CrawlerQueue
    regex_extract_rules = [ExtractRule(r'https://quotes.toscrape.com/page/[0-9]')]
    time_between_requests = 1
    session: requests.Session

    @classmethod
    def start_crawler(cls) -> None:
        # Create the HTTP session reused by every request
        cls.session = requests.session()

    @classmethod
    def crawler_first_request(cls) -> CrawlerResponse | None:
        # Manually schedule an extra page, then make the first request
        cls.crawler_queue.add(CrawlerRequest(url="https://quotes.toscrape.com/page/9/"))
        response = cls.session.get(url="https://quotes.toscrape.com/page/1/")
        return CrawlerResponse(url=response.url,
                               body=response.text,
                               status_code=response.status_code)

    @classmethod
    def process_request(cls, crawler_request: CrawlerRequest) -> CrawlerResponse:
        # Execute every scheduled request and wrap the result in a CrawlerResponse
        response = cls.session.get(crawler_request.url)
        return CrawlerResponse(url=response.url,
                               body=response.text,
                               status_code=response.status_code)

    @classmethod
    def parse(cls, crawler_request: CrawlerRequest, crawler_response: CrawlerResponse) -> None:
        # Extract the quote, author, and tags from each quote block on the page
        selector = Selector(crawler_response.body)
        for quote in selector.css('div[class="quote"]'):
            data = {"quote": quote.css('span:nth-child(1)::text').get()[1:-1],
                    "author": quote.css('span:nth-child(2)>small::text').get(),
                    "tags_list": quote.css('div[class="tags"]>a::text').getall()}
            pprint(data)

    @classmethod
    def stop_crawler(cls, execution_info: ExecutionInfo) -> None:
        # Release the HTTP session when the crawl finishes
        cls.session.close()


CrawlerRunner(crawler=QuotesToScrapeCrawler).run()
```
# Understanding the turbocrawler:
## Crawler
### Attributes
- `crawler_name` the name of your crawler; this info is used by `CrawledQueue`
- `allowed_domains` list containing all domains that the crawler may add to the `CrawlerQueue`
- `regex_extract_rules` list of `ExtractRule` objects; the regexes given here are used to extract all redirect links
  (e.g. `href="/users"`) from the HTML page you return in `CrawlerResponse.body` (see the sketch after this list).
  If you leave this list empty, the automatic population of the `CrawlerQueue` from every `CrawlerResponse.body` is disabled
- `time_between_requests` time that each request waits before being executed
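
A minimal sketch of these attributes for a hypothetical blog site (the domain, regexes, and class name are illustrative only, and the required methods are omitted for brevity):

```python
from turbocrawler import Crawler, ExtractRule


class BlogCrawler(Crawler):
    crawler_name = "BlogCrawler"          # identifies this crawler in the CrawledQueue
    allowed_domains = ['blog.example']    # only links to these domains may be scheduled
    # Every link in CrawlerResponse.body matching one of these regexes is
    # automatically added to the CrawlerQueue
    regex_extract_rules = [ExtractRule(r'https://blog.example.com/posts/[0-9]+'),
                           ExtractRule(r'https://blog.example.com/authors/[a-z-]+')]
    time_between_requests = 2             # delay applied between requests
    # crawler_first_request, process_request, parse, ... omitted for brevity
```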
### Methods
#### `start_crawler`
Should be used to start a session, webdriver, etc...
#### `crawler_first_request`
Should be used to make the first request to a site, normally the login.
It can also be used to schedule the first pages to crawl (see the sketch below).
Two possible returns:
- return `CrawlerResponse`: the response is sent to the `parse` method and the follow rule is applied (**OBS-1**)
- return `None`: no response is sent to the `parse` method
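
A minimal sketch of a first request that only seeds the queue, reusing `cls.crawler_queue.add` from the example above (the URLs come from the same demo site; the other methods are omitted for brevity):

```python
from turbocrawler import Crawler, CrawlerRequest, CrawlerResponse


class SeedOnlyCrawler(Crawler):
    crawler_name = "SeedOnly"

    @classmethod
    def crawler_first_request(cls) -> CrawlerResponse | None:
        # Schedule the first pages to crawl; they will be fetched later by process_request
        for page in range(1, 4):
            cls.crawler_queue.add(CrawlerRequest(url=f"https://quotes.toscrape.com/page/{page}/"))
        # Returning None: nothing is sent to the parse method for this step
        return None
    # process_request, parse, ... omitted for brevity
```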
#### `process_request`
This method receives every request scheduled in the `CrawlerQueue`,
whether it was added manually with `CrawlerQueue.add` or scheduled automatically through `regex_extract_rules`.
Implement all of your request logic here: cookies, headers, proxies, retries, etc.
The method receives a `CrawlerRequest` and must return a `CrawlerResponse`.
The follow rule is applied (**OBS-1**).
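
As an illustration, a `process_request` with custom headers and a naive retry might look like the sketch below; the header value, timeout, and retry count are arbitrary, and everything else follows the example at the top:

```python
import requests

from turbocrawler import Crawler, CrawlerRequest, CrawlerResponse


class RetryingCrawler(Crawler):
    crawler_name = "RetryingCrawler"
    session: requests.Session

    @classmethod
    def start_crawler(cls) -> None:
        cls.session = requests.session()

    @classmethod
    def process_request(cls, crawler_request: CrawlerRequest) -> CrawlerResponse:
        headers = {"User-Agent": "my-crawler/0.1"}  # arbitrary example header
        # Naive retry: try up to 3 times before giving up on transient network errors
        for attempt in range(3):
            try:
                response = cls.session.get(crawler_request.url, headers=headers, timeout=10)
                break
            except requests.RequestException:
                if attempt == 2:
                    raise
        return CrawlerResponse(url=response.url,
                               body=response.text,
                               status_code=response.status_code)
    # crawler_first_request, parse, ... omitted for brevity
```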
#### `process_respose`
This method receives every response produced by `process_request`.
Here you can implement any logic you need, such as scheduling extra requests,
validating the response, or remaking the request.
Implementing this method is not mandatory.
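
Since this method is optional, its exact signature is not shown above; the sketch below assumes it receives the `CrawlerRequest` and `CrawlerResponse` (mirroring `parse`) and returns nothing. Both assumptions should be checked against the package source:

```python
from turbocrawler import Crawler, CrawlerRequest, CrawlerResponse


class ValidatingCrawler(Crawler):
    crawler_name = "ValidatingCrawler"

    @classmethod
    def process_respose(cls, crawler_request: CrawlerRequest, crawler_response: CrawlerResponse) -> None:
        # Assumed signature: mirrors parse(crawler_request, crawler_response)
        # Validate the response before it reaches parse
        if crawler_response.status_code != 200:
            print(f"unexpected status {crawler_response.status_code} for {crawler_request.url}")
    # remaining methods omitted for brevity
```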
#### `parse`
This method receives every `CrawlerResponse` coming from
`crawler_first_request`, `process_request` or `process_respose`.
Here you can parse your response,
extracting the target fields from the HTML and dumping the data, to a database for example.
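
For example, a `parse` that stores the quotes from the demo site into SQLite could look like this sketch; the database file, table name, and schema are arbitrary choices for illustration:

```python
import sqlite3

from parsel import Selector

from turbocrawler import Crawler, CrawlerRequest, CrawlerResponse


class QuotesToDatabaseCrawler(Crawler):
    crawler_name = "QuotesToDatabase"

    @classmethod
    def start_crawler(cls) -> None:
        # Arbitrary local database and schema, used only for this sketch
        cls.db = sqlite3.connect("quotes.db")
        cls.db.execute("CREATE TABLE IF NOT EXISTS quotes (quote TEXT, author TEXT)")

    @classmethod
    def parse(cls, crawler_request: CrawlerRequest, crawler_response: CrawlerResponse) -> None:
        selector = Selector(crawler_response.body)
        for quote in selector.css('div[class="quote"]'):
            cls.db.execute("INSERT INTO quotes VALUES (?, ?)",
                           (quote.css('span:nth-child(1)::text').get(),
                            quote.css('span:nth-child(2)>small::text').get()))
        cls.db.commit()
    # crawler_first_request, process_request, stop_crawler, ... omitted for brevity
```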
#### `stop_crawler`
Should be used to close a session, webdriver, etc...
OBS:
1. If `regex_extract_rules` is filled, the redirect links matching the rules are scheduled
   in the `CrawlerQueue`; if `regex_extract_rules` is not filled, no requests are scheduled automatically.
### Order of calls
1. `start_crawler`
2. `crawler_first_request`
3. Start the loop, executing the methods sequentially: `process_request` -> `process_response` -> `parse`.
   The loop only stops when the `CrawlerQueue` is empty.
4. `stop_crawler`
---
## CrawlerRunner
Responsible for running the Crawler: it calls the methods in order,
automatically schedules your requests, and handles the queues.
It uses by default:
- `FIFOMemoryCrawlerQueue` for `CrawlerQueue`
- `MemoryCrawledQueue` for `CrawledQueue`

But you can change them, using the built-in queues
in `turbocrawler.queues` or creating your own queues.
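
A hedged sketch of swapping the default queues: the import path follows the `turbocrawler.queues` module mentioned above, but the exact class locations and the `CrawlerRunner` keyword arguments are assumptions, so check the package source for the real signature. `QuotesToScrapeCrawler` is the class from the example at the top.

```python
from turbocrawler import CrawlerRunner
from turbocrawler.queues import FIFOMemoryCrawlerQueue, MemoryCrawledQueue  # assumed import location

# Assumption: CrawlerRunner accepts the queues as constructor arguments;
# the keyword names below are illustrative only.
runner = CrawlerRunner(crawler=QuotesToScrapeCrawler,
                       crawler_queue=FIFOMemoryCrawlerQueue,
                       crawled_queue=MemoryCrawledQueue)
runner.run()
```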
---
## CrawlerQueue
`CrawlerQueue` is where your `CrawlerRequest` objects are stored
until they are removed to be processed by `process_request`.
---
## CrawledQueue
`CrawledQueue` is where the URLs of every processed `CrawlerRequest` are stored.
It prevents requesting the same URL twice, but this behavior can be changed.