CobWeb-lnx

Name: CobWeb-lnx
Version: 1.1.0
Summary: CobWeb is a Python library for web scraping. The library consists of two classes: Spider and Scraper.
Author: Gonçalo Marques (_lnx/lnxdread) <gmgoncalo7@gmail.com>
Homepage: https://github.com/GoncaloMark/Amara-CobWeb
Upload time: 2023-05-21 12:25:07
Requires Python: >=3.7
Keywords: data, crawler, scraper
Requirements: No requirements were recorded.
# CobWeb

CobWeb is a Python library for web scraping. The library consists of two classes: Spider and Scraper.

## Spider

The Spider class is used to crawl a website and identify internal and external links. It has the following methods:

    __init__(self, url, max_hops = 0): Initializes a Spider object with the given URL and maximum number of links to follow from the initial URL.
    _getLinks(self): Crawls the website and identifies internal and external links.
    _showLinks(self): Returns the set of internal and external URLs found during crawling.
    __str__(self): Returns a string representation of the Spider object.
    __repr__(self): Returns a string representation of the Spider object.
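The internal/external split that `_getLinks` performs can be illustrated with the standard library alone. The sketch below is not CobWeb's actual implementation (which is not shown here); it is a hypothetical version of the idea: collect `href` values, resolve them against the base URL, and classify them by network location.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def classify_links(base_url, html):
    """Split the links found in `html` into internal and external sets,
    judged by whether they share the base URL's network location."""
    parser = LinkCollector()
    parser.feed(html)
    base_netloc = urlparse(base_url).netloc
    internal, external = set(), set()
    for href in parser.links:
        absolute = urljoin(base_url, href)  # resolve relative links
        if urlparse(absolute).netloc == base_netloc:
            internal.add(absolute)
        else:
            external.add(absolute)
    return internal, external
```

A real spider would fetch each page over HTTP and repeat this classification on every newly discovered internal link.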

## Scraper

The Scraper class extends the functionality of the Spider class by scraping HTML content from web pages based on user-defined parameters. It has the following methods:

    __init__(self, config): Initializes a Scraper object with the given configuration parameters.
    run(self): A public method to scrape HTML content from web pages based on user-defined parameters.
    __str__(self): Returns a string representation of the Scraper object.
    __repr__(self): Returns a string representation of the Scraper object.
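To illustrate what tag-based extraction involves (a sketch of the general technique, not CobWeb's actual code), a minimal scraper for a `tags` list can be built on the standard library's `HTMLParser`:

```python
from html.parser import HTMLParser

class TagTextScraper(HTMLParser):
    """Collect the text content of a chosen set of tags,
    grouped per tag name (mirroring a tags-based config)."""
    def __init__(self, tags):
        super().__init__()
        self.wanted = set(tags)
        self.results = {t: [] for t in tags}
        self._current = None  # tag we are currently inside, if wanted

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current and data.strip():
            self.results[self._current].append(data.strip())

scraper = TagTextScraper(["h1", "p"])
scraper.feed("<h1>Title</h1><p>First paragraph</p>")
```

Matching by class, attribute, or CSS selector (as the Config section below lists) follows the same pattern with extra filtering on the start-tag attributes.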

## Installation

You can install CobWeb using pip (the package is published on PyPI as CobWeb-lnx):

```bash
pip install CobWeb-lnx
```

## Config

Config is either a dictionary in memory or a YAML file, which you can build by installing YAMLbuilder or by using the provided template.
Example of a complete object (values were elided in the original; `...` marks a value to fill in):

```python
config = {
    "url": ...,        # starting URL
    "hops": ...,       # crawl depth; 0 scrapes only the starting page
    "tags": ...,       # HTML tags to extract, e.g. ["h1", "p"]
    "classes": ...,
    "attrs": ...,
    "attrV": ...,
    "IDv": ...,
    "selectors": ...,
}
```
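Since setting `hops` to 0 (or omitting it) means single-page scraping, a config can leave most keys out. A small hypothetical helper (not part of CobWeb) that applies such defaults might look like:

```python
def normalize_config(config):
    """Fill in defaults for a CobWeb-style config dict.
    Key names follow the template above; only 'url' is required."""
    if "url" not in config:
        raise ValueError("config must include a 'url'")
    defaults = {"hops": 0, "tags": [], "classes": [], "attrs": [],
                "attrV": [], "IDv": [], "selectors": []}
    # caller-supplied keys override the defaults
    return {**defaults, **config}
```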

Example of a YAML file (if you choose this route, call the config_parser function and pass it a valid path):

```yaml
IDv:
attrV: []
attrs: []
classes: []
hops: 
selectors: []
tags:
    - 
    - 
url: 
```

## Example Usage

```python
from CobWeb import Spider, Scraper

# Create a Spider object with a URL and a maximum number of hops
spider = Spider("https://example.com", max_hops=10)

# Get the internal and external links
links = spider.run()
print(links)

# Create a Scraper object with a configuration dictionary.
# "hops" defines how deep it will scrape: the Scraper uses the Spider
# internally to gather more links and scrape those pages too. To scrape
# a single page only, set "hops" to 0 or omit it.
config = {
    "url": "https://example.com",
    "hops": 2,
    "tags": ["h1", "p"],
}
scraper = Scraper(config)

# Scrape HTML content from web pages based on the user-defined parameters
results = scraper.run()

# Print the results: a dictionary of scraped content, keyed by the
# elements, attributes, etc. provided in the config
print(results)
```
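The role of `hops` can be illustrated without any network access. The sketch below is a hypothetical helper, not part of CobWeb: it walks a pre-built link graph breadth-first, following at most `max_hops` links from the start page, which is the bound a `hops` setting places on a crawl.

```python
from collections import deque

def crawl_with_hops(start, neighbors, max_hops):
    """Breadth-first traversal limited to `max_hops` link-follows from
    `start`. `neighbors` maps a URL to the URLs linked from its page
    (a stand-in for fetching and parsing a real page). With max_hops=0
    only the start page is visited."""
    seen = {start}
    frontier = deque([(start, 0)])
    visited = []
    while frontier:
        url, depth = frontier.popleft()
        visited.append(url)
        if depth < max_hops:
            for nxt in neighbors.get(url, []):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return visited
```

A real crawler would replace the `neighbors` lookup with an HTTP fetch plus link extraction, but the depth bookkeeping is the same.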

            
