new-frontera


| Field | Value |
| --- | --- |
| Name | new-frontera |
| Version | 0.9.0 |
| Home page | https://github.com/ZeroCool940711/new-frontera |
| Summary | A scalable frontier for web crawlers |
| Upload time | 2024-01-28 04:00:54 |
| Maintainer | Alejandro Gil |
| Docs URL | None |
| Author | new_frontera developers |
| Requires Python | (not specified) |
| License | BSD |
| Keywords | crawler, frontier, scrapy, web, requests, new_frontera |
| Requirements | No requirements were recorded. |

# new_frontera

[![pypi](https://img.shields.io/pypi/v/new_frontera)](https://pypi.org/project/new_frontera/)
[![python versions](https://img.shields.io/pypi/pyversions/new_frontera.svg)](https://pypi.org/project/new_frontera/)
[![Build Status](https://app.travis-ci.com/ZeroCool940711/new-new_frontera.svg?branch=master)](https://app.travis-ci.com/ZeroCool940711/new-new_frontera)
[![codecov](https://codecov.io/gh/scrapinghub/new_frontera/branch/master/graph/badge.svg)](https://codecov.io/gh/scrapinghub/new_frontera)

## Overview

new_frontera is a web crawling framework consisting of a [crawl frontier](http://nlp.stanford.edu/IR-book/html/htmledition/the-url-frontier-1.html) and distribution/scaling primitives, which together make it possible to build a large-scale online web crawler.

new_frontera takes care of the logic and policies to follow during the crawl. It stores and prioritizes links extracted by
the crawler to decide which pages to visit next, and it can do so in a distributed manner.
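
To make the idea of "storing and prioritizing links" concrete, here is a minimal sketch of a frontier in plain Python. It is not new_frontera's API: the `SimpleFrontier` class, its `add` and `next_batch` methods, and the scores are hypothetical names chosen for illustration, and the in-memory heap stands in for a real storage backend.

```python
import heapq
import itertools


class SimpleFrontier:
    """Toy in-memory crawl frontier (illustrative only, not new_frontera's API)."""

    def __init__(self):
        self._heap = []                    # entries: (-score, insertion order, url)
        self._seen = set()                 # URLs that were already scheduled once
        self._counter = itertools.count()  # tie-breaker that keeps insertion order

    def add(self, url, score=0.0):
        """Schedule a URL once; return True if it was new, False if a duplicate."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the highest first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))
        return True

    def next_batch(self, max_requests=2):
        """Pop up to max_requests URLs, highest score first."""
        batch = []
        while self._heap and len(batch) < max_requests:
            _, _, url = heapq.heappop(self._heap)
            batch.append(url)
        return batch


frontier = SimpleFrontier()
frontier.add("https://example.com/", score=1.0)
frontier.add("https://example.com/about", score=0.5)
frontier.add("https://example.com/news", score=0.8)
frontier.add("https://example.com/", score=0.9)  # duplicate, ignored

first_batch = frontier.next_batch(max_requests=2)
print(first_batch)  # ['https://example.com/', 'https://example.com/news']
```

The crawler would fetch each batch, extract links from the responses, and feed them back through `add`, which is the loop a frontier exists to drive.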

## Main features

- Online operation: small request batches, with parsing done right after fetch.
- Pluggable backend architecture: low-level backend access logic is separated from crawling strategy.
- Two run modes: single process and distributed.
- Built-in SQLAlchemy, Redis and HBase backends.
- Built-in Apache Kafka and ZeroMQ message buses.
- Built-in crawling strategies: breadth-first, depth-first, Discovery (with support for robots.txt and sitemaps).
- Battle tested: our biggest deployment is 60 spiders/strategy workers delivering 50-60M documents daily for 45 days without downtime.
- Transparent data flow, allowing custom components to be integrated easily using Kafka.
- Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box).
- Optional use of Scrapy for fetching and parsing.
- 3-clause BSD license, allowing use in any commercial product.
- Python 3 support.
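
As a rough illustration of why separating backend access from crawling strategy is useful, the sketch below runs breadth-first and depth-first crawls over the same storage object, changing only which end of the queue is read. The `InMemoryQueue` class, the `crawl_order` function, and the `LINKS` graph are hypothetical and do not reflect new_frontera's actual classes.

```python
from collections import deque


class InMemoryQueue:
    """Hypothetical storage backend: it only holds and releases URLs."""

    def __init__(self):
        self._items = deque()

    def push(self, url):
        self._items.append(url)

    def pop_oldest(self):
        return self._items.popleft()

    def pop_newest(self):
        return self._items.pop()

    def __len__(self):
        return len(self._items)


def crawl_order(graph, seed, strategy):
    """Return the visit order over a link graph for the given strategy name."""
    if strategy not in ("breadth-first", "depth-first"):
        raise ValueError(f"unknown strategy: {strategy}")
    queue = InMemoryQueue()
    queue.push(seed)
    seen = {seed}
    order = []
    while len(queue):
        # The strategy decides which URL to take; the backend just stores them.
        url = queue.pop_oldest() if strategy == "breadth-first" else queue.pop_newest()
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.push(link)
    return order


# Tiny stand-in for links extracted from fetched pages.
LINKS = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/b/1"]}

bfs = crawl_order(LINKS, "/", "breadth-first")
dfs = crawl_order(LINKS, "/", "depth-first")
print(bfs)  # ['/', '/a', '/b', '/a/1', '/b/1']
print(dfs)  # ['/', '/b', '/b/1', '/a', '/a/1']
```

Swapping `InMemoryQueue` for a database-backed class with the same three methods would leave `crawl_order` untouched, which is the point of keeping the two concerns apart.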

## Installation

Development version:

```bash
$ pip install git+https://github.com/ZeroCool940711/new_frontera.git
```

Or from PyPI:

```bash
$ pip install new-frontera
```
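
After installing, one way to confirm which release ended up in the environment is to query the installed distribution metadata with the standard library. This is a small sketch: the helper name `installed_version` is made up for this example, and it assumes Python 3.8 or newer, where `importlib.metadata` is available.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed release (for example 0.9.0), or None if the
# package is missing from the current environment.
print(installed_version("new-frontera"))
```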

## Documentation

- [Main documentation at RTD](http://frontera.readthedocs.org/)
- [EuroPython 2015 slides](http://www.slideshare.net/sixtyone/fronteraopen-source-large-scale-web-crawling-framework)
- [BigDataSpain 2015 slides](https://speakerdeck.com/scrapinghub/frontera-open-source-large-scale-web-crawling-framework)

## Community

Join our Google group at https://groups.google.com/a/scrapinghub.com/forum/#!forum/frontera or check GitHub issues and 
pull requests.

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/ZeroCool940711/new-frontera",
    "name": "new-frontera",
    "maintainer": "Alejandro Gil",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "crawler,frontier,scrapy,web,requests,new_frontera",
    "author": "new_frontera developers",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/66/57/5731e4d6fe79f265ea113c457f4c75c53b717d863d26de5b48a5cb1391f3/new_frontera-0.9.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD",
    "summary": "A scalable frontier for web crawlers",
    "version": "0.9.0",
    "project_urls": {
        "Homepage": "https://github.com/ZeroCool940711/new-frontera"
    },
    "split_keywords": [
        "crawler",
        "frontier",
        "scrapy",
        "web",
        "requests",
        "new_frontera"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ab72aed55c3d88901f8bb7b83ca83daf73ecc9336b1a0561ec1e89e33eeaf9ff",
                "md5": "cd0b20a217561ba6b0bc48b6a14a929a",
                "sha256": "6a6c1dd1196cf0fab235ecbc058a91a2c344231c2500f5c98d37713df01cc4a2"
            },
            "downloads": -1,
            "filename": "new_frontera-0.9.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "cd0b20a217561ba6b0bc48b6a14a929a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 125159,
            "upload_time": "2024-01-28T04:00:52",
            "upload_time_iso_8601": "2024-01-28T04:00:52.565513Z",
            "url": "https://files.pythonhosted.org/packages/ab/72/aed55c3d88901f8bb7b83ca83daf73ecc9336b1a0561ec1e89e33eeaf9ff/new_frontera-0.9.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "66575731e4d6fe79f265ea113c457f4c75c53b717d863d26de5b48a5cb1391f3",
                "md5": "aad9dc99a77d5b6f4d84564150f8930d",
                "sha256": "36fbbfa932c2799463abd2f51b9296410c08f879044000c78d65c9efaeda731e"
            },
            "downloads": -1,
            "filename": "new_frontera-0.9.0.tar.gz",
            "has_sig": false,
            "md5_digest": "aad9dc99a77d5b6f4d84564150f8930d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 128044,
            "upload_time": "2024-01-28T04:00:54",
            "upload_time_iso_8601": "2024-01-28T04:00:54.226777Z",
            "url": "https://files.pythonhosted.org/packages/66/57/5731e4d6fe79f265ea113c457f4c75c53b717d863d26de5b48a5cb1391f3/new_frontera-0.9.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-01-28 04:00:54",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ZeroCool940711",
    "github_project": "new-frontera",
    "travis_ci": true,
    "coveralls": true,
    "github_actions": true,
    "requirements": [],
    "tox": true,
    "lcname": "new-frontera"
}
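
The `digests` entries above can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library. The helper names `sha256_of` and `matches_digest` are hypothetical, and the local path is an assumption about where the sdist was saved. The expected hash is the `sha256` value listed for `new_frontera-0.9.0.tar.gz`.

```python
import hashlib
import os


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare the file's SHA-256 with an expected hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.lower()


# sha256 of the sdist, copied from the metadata above.
EXPECTED_SDIST_SHA256 = "36fbbfa932c2799463abd2f51b9296410c08f879044000c78d65c9efaeda731e"
# Assumption: the archive was downloaded into the current directory.
SDIST_PATH = "new_frontera-0.9.0.tar.gz"

if os.path.exists(SDIST_PATH):
    print("digest ok:", matches_digest(SDIST_PATH, EXPECTED_SDIST_SHA256))
else:
    print("archive not found, nothing to verify")
```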
        