# vidcrawler
Crawls major video sites such as YouTube, Rumble, Bitchute, and Brighteon for video content and outputs a JSON feed of all the videos that were found.
## Platform Unit Tests
[![MacOS_Tests](https://github.com/zackees/vidcrawler/actions/workflows/test_macos.yml/badge.svg)](https://github.com/zackees/vidcrawler/actions/workflows/test_macos.yml)
[![Win_Tests](https://github.com/zackees/vidcrawler/actions/workflows/test_win.yml/badge.svg)](https://github.com/zackees/vidcrawler/actions/workflows/test_win.yml)
[![Ubuntu_Tests](https://github.com/zackees/vidcrawler/actions/workflows/test_ubuntu.yml/badge.svg)](https://github.com/zackees/vidcrawler/actions/workflows/test_ubuntu.yml)
## Scraper Tests
[![Scaper_Youtube](https://github.com/zackees/vidcrawler/actions/workflows/test_youtube.yml/badge.svg)](https://github.com/zackees/vidcrawler/actions/workflows/test_youtube.yml)
[![Actions Status](https://github.com/zackees/vidcrawler/workflows/Scaper_Rumble/badge.svg)](https://github.com/zackees/vidcrawler/actions/workflows/test_rumble.yml)
[![Scaper_Brighteon](https://github.com/zackees/vidcrawler/actions/workflows/test_brighteon.yml/badge.svg)](https://github.com/zackees/vidcrawler/actions/workflows/test_brighteon.yml)
[![Actions Status](https://github.com/zackees/vidcrawler/workflows/Scraper_Gabtv/badge.svg)](https://github.com/zackees/vidcrawler/actions/workflows/test_gabtv.yml)
[![Actions Status](https://github.com/zackees/vidcrawler/workflows/Scraper_Spotify/badge.svg)](https://github.com/zackees/vidcrawler/actions/workflows/test_spotify.yml)
Note that Bitchute rejects requests from the GitHub runner's IP, so this test fails with a 403 Forbidden.
[![Actions Status](https://github.com/zackees/vidcrawler/workflows/Scaper_Bitchute/badge.svg)](https://github.com/zackees/vidcrawler/actions/workflows/test_bitchute.yml)
## API
#### Command line
`vidcrawler --input_crawl_json "fetch_list.json" --output_json "out_list.json"`
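
For scripted runs, the same command can be launched from Python. The following is a minimal sketch, not part of the package: `build_command` and `run_crawl` are hypothetical helper names, it assumes the `vidcrawler` executable is on the `PATH`, and it assumes the output file holds a JSON array like the example output in this README. Only the two flags shown above are used.

```python
import json
import shutil
import subprocess
from pathlib import Path


def build_command(input_json, output_json):
    """Build the argv list for the documented vidcrawler command line."""
    return [
        "vidcrawler",
        "--input_crawl_json", input_json,
        "--output_json", output_json,
    ]


def run_crawl(crawl_list, input_json="fetch_list.json", output_json="out_list.json"):
    """Write the crawl list to disk, run the CLI, and load the resulting feed.

    Hypothetical wrapper: assumes the CLI writes a JSON array to output_json.
    """
    if shutil.which("vidcrawler") is None:
        raise RuntimeError("vidcrawler is not installed or not on PATH")
    Path(input_json).write_text(json.dumps(crawl_list, indent=2), encoding="utf-8")
    # check=True raises CalledProcessError if the crawler exits with an error.
    subprocess.run(build_command(input_json, output_json), check=True)
    return json.loads(Path(output_json).read_text(encoding="utf-8"))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces.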
#### Python
```python
import json

from vidcrawler import crawl_video_sites

crawl_list = [
    [
        "Computing Forever",  # Can be whatever you want.
        "bitchute",  # Must be "youtube", "rumble", "bitchute" (and others).
        "hybm74uihjkf",  # The channel id on the service.
    ]
]
output = crawl_video_sites(crawl_list)
print(json.dumps(output))
```
"source" and "channel_id" are used to generate the video-platform-specific urls to fetch data. The "channel name"
is echoed back in the generated JSON feeds but does not affect the fetching process in any way.
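
To keep the three positions from being mixed up, the entries can be built from named fields first. This is only a sketch: `CrawlTarget` and `to_crawl_list` are hypothetical names that are not part of vidcrawler, and the result is the same list-of-lists shape shown above.

```python
from typing import List, NamedTuple


class CrawlTarget(NamedTuple):
    """Hypothetical helper type; not provided by vidcrawler."""

    channel_name: str  # free-form label, echoed back in the feed
    source: str        # platform key, e.g. "youtube", "rumble", "bitchute"
    channel_id: str    # the channel id on that platform


def to_crawl_list(targets: List[CrawlTarget]) -> List[List[str]]:
    """Convert named targets into the [name, source, channel_id] triples."""
    return [[t.channel_name, t.source, t.channel_id] for t in targets]


targets = [CrawlTarget("Computing Forever", "bitchute", "hybm74uihjkf")]
crawl_list = to_crawl_list(targets)
print(crawl_list)  # [['Computing Forever', 'bitchute', 'hybm74uihjkf']]
```

The resulting `crawl_list` can then be passed to `crawl_video_sites` as in the example above.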
## Testing
Install vidcrawler; the `vidcrawler_test` command then becomes available.
```bash
> pip install vidcrawler
> vidcrawler_test
```
# youtube-pull-channel
This new command pulls down a channel and all of its files as mp3s. Great for transcribing and putting into an LLM.
#### Example input `fetch_list.json`
```json
[
    [
        "Health Ranger Report",
        "brighteon",
        "hrreport"
    ],
    [
        "Sydney Watson",
        "youtube",
        "UCSFy-1JrpZf0tFlRZfo-Rvw"
    ],
    [
        "Computing Forever",
        "bitchute",
        "hybm74uihjkf"
    ],
    [
        "ThePeteSantilliShow",
        "rumble",
        "ThePeteSantilliShow"
    ],
    [
        "Macroaggressions",
        "odysee",
        "Macroaggressions"
    ]
]
```
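
Before starting a long crawl it can help to check that the input file has the expected shape. The sketch below assumes only what the example shows (a JSON array of three-string entries); `load_fetch_list` is a hypothetical helper, and it deliberately does not check the source name against a list of supported platforms, since that list is not spelled out here.

```python
import json


def load_fetch_list(text):
    """Parse fetch-list JSON and check each entry is a triple of non-empty strings."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        raise ValueError("top level must be a JSON array")
    for i, entry in enumerate(entries):
        is_triple = isinstance(entry, list) and len(entry) == 3
        if not is_triple or not all(isinstance(x, str) and x for x in entry):
            raise ValueError(
                f"entry {i} must be [channel_name, source, channel_id]"
            )
    return entries


sample = '[["Sydney Watson", "youtube", "UCSFy-1JrpZf0tFlRZfo-Rvw"]]'
print(load_fetch_list(sample))
```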
#### Example Output:
```json
[
    {
        "channel_name": "ThePeteSantilliShow",
        "title": "The damage this caused is now being totaled up",
        "date_published": "2022-05-17T05:00:11+00:00",
        "date_lastupdated": "2022-05-17T05:17:18.540084",
        "channel_url": "https://www.youtube.com/channel/UCXIJgqnII2ZOINSWNOGFThA",
        "source": "youtube.com",
        "url": "https://www.youtube.com/watch?v=bwqBudCzDrQ",
        "duration": 254,
        "description": "",
        "img_src": "https://i3.ytimg.com/vi/bwqBudCzDrQ/hqdefault.jpg",
        "iframe_src": "https://youtube.com/embed/bwqBudCzDrQ",
        "views": 1429
    },
    {
        "channel_name": "ThePeteSantilliShow",
        "title": "..."
    }
]
```
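
A feed like the one above can be post-processed with the standard library alone. The sketch below sorts entries by `date_published` and formats `duration`; it assumes `duration` is in seconds (the README does not state the unit) and needs Python 3.7+ for `datetime.fromisoformat`. The helper names are hypothetical.

```python
import json
from datetime import datetime

# Trimmed copy of the example output above; the second entry is partial.
feed = json.loads("""
[
  {
    "channel_name": "ThePeteSantilliShow",
    "title": "The damage this caused is now being totaled up",
    "date_published": "2022-05-17T05:00:11+00:00",
    "url": "https://www.youtube.com/watch?v=bwqBudCzDrQ",
    "duration": 254
  },
  {"channel_name": "ThePeteSantilliShow", "title": "..."}
]
""")


def newest_first(videos):
    """Return entries that carry a publish date and URL, newest first."""
    complete = [v for v in videos if "date_published" in v and "url" in v]
    return sorted(
        complete,
        key=lambda v: datetime.fromisoformat(v["date_published"]),
        reverse=True,
    )


def format_duration(seconds):
    """Render a duration as M:SS (assumes the field is in seconds)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


for video in newest_first(feed):
    print(video["date_published"], format_duration(video["duration"]), video["title"])
```

Entries missing `date_published` or `url` are skipped rather than raising, which keeps the loop robust if a scraper returns partial records.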
# Releases
* 1.0.39: More pinned deps problems fixed.
* 1.0.38: One of the scrapers has a pinned dependency, install it with [full]
* 1.0.37: Misc fixes.
* 1.0.36: Fixed youtube, rumble and brighteon parsers. Bitchute is still broken and now has rate limits.
* 1.0.35: Added `update_yt_dlp()` to allow the client to update the downloader.
* 1.0.34: Upgraded `open-webdriver` to version `1.5.0` to avoid `yt-dlp` urllib incompatibility.
* 1.0.28: `youtube_pull` now takes in `--channel-name` and `--output`, like the other pullers.
* 1.0.27: Fixed polluting path space from multiple added static-ffmpeg
* 1.0.24: Added `rumble-pull-channel`
* 1.0.21: Misc fixes
* 1.0.16: Make the library downloading more robust.
* 1.0.15: Improve cleaning filepaths for brighteon_bot
* 1.0.13: New `brighteon-pull-channel` command
* 1.0.11: Improves `youtube-pull-channel`
* 1.0.10: Adds `youtube-pull-channel` which pulls all files down as mp3s for a channel.
* 1.0.9: Fixes crawler for rumble and minor fixes + linting fixes.
* 1.0.8: Readme correction.
* 1.0.7: Fixes Odysee scraper by including image/webp thumbnail format.
* 1.0.4: Fixes local_now() to be local timezone aware.
* 1.0.3: Bump
* 1.0.2: Updates testing
* 1.0.1: improves command line
* 1.0.0: Initial release
Raw data
{
"_id": null,
"home_page": "https://github.com/zackees/vidcrawler",
"name": "vidcrawler",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.6.0",
"maintainer_email": null,
"keywords": null,
"author": "Zach Vorhies",
"author_email": "dont@email.me",
"download_url": "https://files.pythonhosted.org/packages/80/a2/88fc4e4e107cbf01170a00a161ec3974b77b07ac8d27d2af8813179b0bae/vidcrawler-1.0.39.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Video Crawler",
"version": "1.0.39",
"project_urls": {
"Homepage": "https://github.com/zackees/vidcrawler"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "adf5534291f1786f0ca5e46296be55d15c9477890065479aaf3d8dc5d29d5758",
"md5": "67248c8cf26816434cf059d5fa340791",
"sha256": "4f2aa97bda48a5b66d9dff48a61a99f49e01bddd6e770d922af9c7e8c6cd8924"
},
"downloads": -1,
"filename": "vidcrawler-1.0.39-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "67248c8cf26816434cf059d5fa340791",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": ">=3.6.0",
"size": 60279,
"upload_time": "2024-08-27T01:56:03",
"upload_time_iso_8601": "2024-08-27T01:56:03.275108Z",
"url": "https://files.pythonhosted.org/packages/ad/f5/534291f1786f0ca5e46296be55d15c9477890065479aaf3d8dc5d29d5758/vidcrawler-1.0.39-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "80a288fc4e4e107cbf01170a00a161ec3974b77b07ac8d27d2af8813179b0bae",
"md5": "a40a2158048fecfa298a85fc5d2c817d",
"sha256": "f16b41c3f45803c7ca0106aa5f523e73ab85d1a14d897935223f95329d805bc3"
},
"downloads": -1,
"filename": "vidcrawler-1.0.39.tar.gz",
"has_sig": false,
"md5_digest": "a40a2158048fecfa298a85fc5d2c817d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6.0",
"size": 46600,
"upload_time": "2024-08-27T01:56:04",
"upload_time_iso_8601": "2024-08-27T01:56:04.698894Z",
"url": "https://files.pythonhosted.org/packages/80/a2/88fc4e4e107cbf01170a00a161ec3974b77b07ac8d27d2af8813179b0bae/vidcrawler-1.0.39.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-27 01:56:04",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "zackees",
"github_project": "vidcrawler",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"tox": true,
"lcname": "vidcrawler"
}