webgraze


- Name: webgraze
- Version: 1.1.2
- Summary: WebScraping library that scrapes & gathers data from multiple sources on the internet
- Upload time: 2024-09-03 18:17:58
- Author: shivendra
- License: MIT
- Keywords: webscraping, scraping, webscraping library, web scraping, python webscraping, beautifulsoup, selenium
            
# web-graze



## Introduction

This repository contains a collection of scripts for scraping content from sources such as YouTube, Wikipedia, Britannica, Unsplash, Freesound, and Pexels. It can download video captions from YouTube, scrape Wikipedia and Britannica articles, fetch images from Unsplash and Pexels, and download audio from Freesound.



## Table of Contents

- [Installation](#installation)

- [Usage](#usage)

  - [Queries](#1-queries)

  - [YouTube Scraper](#2-youtube-scraper)

  - [Wikipedia Scraper](#3-wikipedia-scraper)

  - [Unsplash Scraper](#4-unsplash-scraper)

  - [Britannica Scraper](#5-britannica-scraper)

  - [Freesound Scraper](#6-freesound-scraper)

  - [Pexels Scraper](#7-pexels-scraper)

- [Configuration](#configuration)

- [Logging](#logging)

- [Contribution](#contribution)

- [License](#license)



## Installation



1. **Clone the repository:**

   ```sh

   git clone https://github.com/shivendrra/web-graze.git

   cd web-graze

   ```



2. **Create and activate a virtual environment:**

   ```sh

   python -m venv venv

   source venv/bin/activate   # On Windows: venv\Scripts\activate

   ```



3. **Install the required packages:**

   ```sh

   pip install -r requirements.txt

   ```



## Usage



For working examples, see [run.py](run.py), which contains an example for each type of scraper.



### 1. Queries



The library ships with predefined topics, keywords, search queries, and channel IDs that you can load and pass to the respective scrapers.



#### Channel Ids



```python

from webgraze.queries import Queries



queries = Queries(category="channel")

```



#### Search Queries



```python

from webgraze.queries import Queries



queries = Queries(category="search")

```



#### Image Topics



```python

from webgraze.queries import Queries



queries = Queries(category="images")

```



### 2. YouTube Scraper



The YouTube scraper fetches video captions from a list of channels.



#### Configuration

- Add your YouTube API key to a `.env` file:

  ```env

  yt_key=YOUR_API_KEY

  ```



- Create a `channelIds.json` file with the list of channel IDs:

  ```json

  [

    "UC_x5XG1OV2P6uZZ5FSM9Ttw",

    "UCJ0-OtVpF0wOKEqT2Z1HEtA"

  ]

  ```
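
The README does not show how `channelIds.json` is read, so the following is only a minimal sketch, assuming you load the file yourself and pass the resulting list as `channel_ids`. `load_channel_ids` is a hypothetical helper, not part of webgraze.

```python
import json
from pathlib import Path


def load_channel_ids(path):
    """Return the channel IDs stored as a JSON array of strings in `path`."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list) or not all(isinstance(item, str) for item in data):
        raise ValueError(f"{path} must contain a JSON array of strings")
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(data))
```

Validating the file up front gives a clear error for a malformed list before any API quota is spent.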



#### Running the Scraper



```python

import os

from dotenv import load_dotenv
from webgraze import Youtube
from webgraze.queries import Queries

# Load the API key from the .env file.
load_dotenv()

# Run relative to this script so that '../transcripts' resolves predictably.
current_directory = os.path.dirname(os.path.abspath(__file__))
os.chdir(current_directory)

api_key = os.getenv('yt_key')

# Load the channel IDs that ship with the library.
queries = Queries(category="channel")

youtube = Youtube(api_key=api_key, filepath='../transcripts', max_results=50)
youtube(channel_ids=queries(), videoUrls=True)

```



### 3. Wikipedia Scraper



The Wikipedia scraper generates target URLs from provided queries, fetches the complete web page, and writes it to a file.



#### Running the Scraper



```python

from webgraze import Wikipedia

from webgraze.queries import Queries



queries = Queries(category="search")

wiki = Wikipedia(filepath='../data.txt', metrics=True)



wiki(queries=queries(), extra_urls=True)

```
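
How webgraze turns a query into a target URL is not shown here. As a rough illustration of the idea, the sketch below builds a Wikipedia article URL from a query string using only the standard library. `wikipedia_url` and its `lang` parameter are hypothetical names, and the library's own rules may differ.

```python
from urllib.parse import quote


def wikipedia_url(query, lang="en"):
    """Build an article URL by turning spaces into underscores and percent-encoding."""
    title = query.strip().replace(" ", "_")
    # quote() escapes characters such as "+" that are not safe in a URL path.
    return f"https://{lang}.wikipedia.org/wiki/{quote(title)}"


urls = [wikipedia_url(q) for q in ["web scraping", "C++"]]
```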



### 4. Unsplash Scraper



The Unsplash scraper fetches images for the given topics and saves them in a separate folder per topic.



#### Configuration

- Define your search queries like this:

  ```python

  search_queries = ["topic1", "topic2", "topic3"]

  ```



#### Running the Scraper



```python

from webgraze import Unsplash

from webgraze.queries import Queries



topics = Queries("images")



image = Unsplash(directory='../images', metrics=True)

image(topics=topics())

```



#### Output

```shell

Downloading 'american football' images:

Downloading : 100%|██████████████████████████| 176/176 [00:30<00:00,  5.72it/s]



Downloading 'indian festivals' images:

Downloading : 100%|██████████████████████████| 121/121 [00:30<00:00,  7.29it/s]

```
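
The per-topic folder layout described above can be sketched as follows. This is an illustration under assumptions, not the library's code: `topic_directory` is a hypothetical helper, and the naming rule (lowercase, underscores) is chosen only for the example.

```python
import re
from pathlib import Path


def topic_directory(base_dir, topic):
    """Create and return a per-topic folder such as `images/american_football`."""
    # Replace anything that is not a letter, digit, or hyphen with an underscore.
    safe_name = re.sub(r"[^A-Za-z0-9-]+", "_", topic.strip().lower()).strip("_")
    path = Path(base_dir) / safe_name
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Normalising the topic before using it as a folder name avoids spaces and characters that some file systems reject.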



### 5. Britannica Scraper



The Britannica scraper generates target URLs from provided queries, fetches the complete web page, and writes it to a file.



#### Running the Scraper



```python

from webgraze import Britannica

from webgraze.queries import Queries



queries = Queries(category="search")

scraper = Britannica(filepath='../data.txt', metrics=True)



scraper(queries=queries())

```



### 6. Freesound Scraper



Downloads and saves audio files from [freesound.org](https://freesound.org/) using its official API. Files are saved in a separate directory for each topic.



#### Running the Scraper



```python

import os

from dotenv import load_dotenv
from webgraze import Freesound

# Run relative to this script so that "audios" is created next to it.
current_directory = os.path.dirname(os.path.abspath(__file__))
os.chdir(current_directory)

# Load the API key from the .env file.
load_dotenv()
API_KEY = os.getenv("freesound_key")

sound = Freesound(api_key=API_KEY, download_dir="audios", metrics=True)
sound(topics=["clicks", "background", "nature"])

```



#### Output



```shell

Downloading 'clicks' audio files:

Response status code: 200

Downloading 'clicks' audio files: 100%|██████████████████████████████| 10/10 [00:20<00:00,  2.01s/it] 



Downloading 'background' audio files:

Response status code: 200

Downloading 'background' audio files: 100%|██████████████████████████████| 10/10 [00:53<00:00,  5.37s/it] 



Downloading 'nature' audio files:

Response status code: 200

Downloading 'nature' audio files: 100%|██████████████████████████████| 10/10 [01:57<00:00, 11.78s/it] 



Freesound Scraper Metrics:



-------------------------------------------

Total topics fetched: 3

Total audio files downloaded: 30

Total time taken: 3.26 minutes

-------------------------------------------

```
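
The metrics summary at the end of that output can be reproduced with a small counter. The sketch below is a hypothetical stand-in for whatever `metrics=True` does internally: `ScrapeMetrics` and its methods are illustrative names, and only the three printed lines are taken from the output above.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ScrapeMetrics:
    """Counts topics and files, and measures elapsed time from creation."""
    topics: int = 0
    files: int = 0
    started: float = field(default_factory=time.monotonic)

    def record_topic(self, files_downloaded):
        """Register one finished topic and the number of files it produced."""
        self.topics += 1
        self.files += files_downloaded

    def summary(self, now=None):
        """Return the summary text; `now` can be injected for testing."""
        end = time.monotonic() if now is None else now
        minutes = (end - self.started) / 60
        return (
            f"Total topics fetched: {self.topics}\n"
            f"Total audio files downloaded: {self.files}\n"
            f"Total time taken: {minutes:.2f} minutes"
        )
```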



### 7. Pexels Scraper



Scrapes and downloads pictures from [pexels.com](https://www.pexels.com/) and saves them in a separate directory for each topic.



#### Running the Scraper



```python

from webgraze import Pexels

from webgraze.queries import Queries



queries = Queries("images")

scraper = Pexels(directory="./images", metrics=True)

scraper(topics=queries())

```



#### Output

```shell

Downloading 'american football' images:

Downloading: 100%|████████████████████████████████| 24/24 [00:03<00:00,  7.73it/s]



Downloading 'india' images:

Downloading: 100%|████████████████████████████████| 27/27 [00:04<00:00,  5.99it/s]



Downloading 'europe' images:

Downloading: 100%|████████████████████████████████| 24/24 [00:06<00:00,  3.55it/s]

```



## Configuration



- **API Keys and other secrets:** Ensure that your API keys and other sensitive data are stored securely and not hard-coded into your scripts.



- **Search Queries:** The search queries for Wikipedia and Britannica scrapers are defined in `queries.py`.
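
One way to keep keys out of the source is to read them from the environment and fail early when one is missing. This is a minimal sketch: `require_env` is a hypothetical helper, and the variable names `yt_key` and `freesound_key` are the ones used in the examples above.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Typical use, after load_dotenv() has populated the environment:
# api_key = require_env("yt_key")
```

Failing at startup is easier to diagnose than passing `None` to a scraper and getting an authentication error later.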



## Logging



Each scraper logs errors to its own `.log` file. Check that file for detailed error messages and troubleshooting information.
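
How webgraze configures its loggers is not documented here. As an illustration of the per-scraper log file idea, the sketch below uses the standard `logging` module. The helper name, the logger naming scheme, and the message format are all assumptions.

```python
import logging
from pathlib import Path


def get_scraper_logger(name, log_dir="."):
    """Return a logger that appends ERROR-level records to `<log_dir>/<name>.log`."""
    logger = logging.getLogger(f"scraper.{name}")
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        Path(log_dir).mkdir(parents=True, exist_ok=True)
        handler = logging.FileHandler(Path(log_dir) / f"{name}.log", encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.ERROR)
    return logger
```

With this layout, a failed request in one scraper ends up in that scraper's file only, which keeps troubleshooting output separated by source.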



## Contribution

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.



See [CONTRIBUTING.md](https://github.com/shivendrra/web-graze/blob/main/CONTRIBUTING.md) for more details.



## License



This project is licensed under the MIT License.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "webgraze",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "webscraping, scraping, webscraping library, web scraping, python webscraping, beautifulsoup, selenium",
    "author": "shivendra",
    "author_email": "shivharsh44@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/0b/85/7cad637bf9fe03ae973a8316a6dbaed84a140baeb5b37c74c6d7d1ee8951/webgraze-1.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "WebScraping library that scrapes & gathers data from multiple sources on the internet",
    "version": "1.1.2",
    "project_urls": null,
    "split_keywords": [
        "webscraping",
        " scraping",
        " webscraping library",
        " web scraping",
        " python webscraping",
        " beautifulsoup",
        " selenium"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fc589279731e0e61be630411a4f4a20ae1fbb011c9511649b0c8ba92c36e7688",
                "md5": "2c58b13b06e981182d64af08f3b801ce",
                "sha256": "b2afd8f91969927556b9ac7f4f29a48e8ca2affe3a371e1e983a28eae8da92ea"
            },
            "downloads": -1,
            "filename": "webgraze-1.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2c58b13b06e981182d64af08f3b801ce",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 17190,
            "upload_time": "2024-09-03T18:17:56",
            "upload_time_iso_8601": "2024-09-03T18:17:56.409673Z",
            "url": "https://files.pythonhosted.org/packages/fc/58/9279731e0e61be630411a4f4a20ae1fbb011c9511649b0c8ba92c36e7688/webgraze-1.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0b857cad637bf9fe03ae973a8316a6dbaed84a140baeb5b37c74c6d7d1ee8951",
                "md5": "843970abd022c22a71276e704e1334b7",
                "sha256": "b86e6dc2a8f030eb8c22a5b4d4377a895003e3bf1c581866542e4ac2dc04e209"
            },
            "downloads": -1,
            "filename": "webgraze-1.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "843970abd022c22a71276e704e1334b7",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 14559,
            "upload_time": "2024-09-03T18:17:58",
            "upload_time_iso_8601": "2024-09-03T18:17:58.297475Z",
            "url": "https://files.pythonhosted.org/packages/0b/85/7cad637bf9fe03ae973a8316a6dbaed84a140baeb5b37c74c6d7d1ee8951/webgraze-1.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-03 18:17:58",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "webgraze"
}
        