# Reddit Multimodal Crawler [![Downloads](https://pepy.tech/badge/reddit-multimodal-crawler)](https://pepy.tech/project/reddit-multimodal-crawler)
This is a wrapper around the `PRAW` package that scrapes multimedia content from Reddit and saves it as `csv`, `json`, `tsv`, or `sql` files.
This repository helps you scrape various subreddits and returns their multimedia attributes.
You can `pip install` it to integrate with another application, or use it as a command-line tool.
- PyPI Link: https://pypi.org/project/reddit-multimodal-crawler/
```bash
pip install reddit-multimodal-crawler
```
## How to use the repository?
Before running the code, register with the Reddit API and create an application to obtain a `client_id` and `client_secret`, and choose a `user_agent` string of the form `<APP_NAME> <VERSION> by /u/<REDDIT_USERNAME>`. Then pass these in as arguments.
The easiest way to get started, though, is `pip install reddit-multimodal-crawler`.
## Functionalities
This package helps you scrape multiple subreddits just like `PRAW`, but it also returns and saves datasets of the results. It can scrape both posts and comments.
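The crawler reads the list of subreddits from a plain-text file. The sample code below splits the file contents on whitespace, so one subreddit name per line (or space-separated names) works; the names here are only illustrative:

```text
MachineLearning
datascience
wallpapers
```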
### Sample Code
```python
import argparse

import nltk

from reddit_multimodal_crawler.crawler import Crawler

# The crawler relies on NLTK's VADER sentiment lexicon; download it once up front.
nltk.download("vader_lexicon")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--subreddit_file_path",
        help="Path to the file that lists the subreddits to scrape.",
        type=str,
    )
    parser.add_argument(
        "--limit", help="Maximum number of articles to scrape.", type=int
    )
    parser.add_argument(
        "--client_id", help="The Client ID provided by Reddit.", type=str
    )
    parser.add_argument(
        "--client_secret", help="The Client Secret provided by Reddit.", type=str
    )
    parser.add_argument(
        "--user_agent",
        help="The User Agent in the form <APP_NAME> <VERSION> by /u/<REDDIT_USERNAME>",
        type=str,
    )
    parser.add_argument(
        "--posts", action="store_true", help="Whether to scrape the posts."
    )
    parser.add_argument(
        "--comments",
        action="store_true",
        help="Whether to scrape the comments of the top posts of each subreddit.",
    )

    args = parser.parse_args()

    r = Crawler(
        client_id=args.client_id,
        client_secret=args.client_secret,
        user_agent=args.user_agent,
    )

    # The subreddit file is a whitespace-separated list of subreddit names.
    with open(args.subreddit_file_path, "r") as f:
        subreddit_list = f.read().split()
    print(subreddit_list)

    if args.posts:
        r.get_posts(subreddit_names=subreddit_list, sort_by="top", limit=args.limit)

    if args.comments:
        r.get_comments(subreddit_names=subreddit_list, sort_by="top", limit=args.limit)
```
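Assuming the script above is saved as `scrape.py` (the filename is arbitrary), an invocation that scrapes both posts and comments looks like this, with placeholder credentials:

```bash
python scrape.py \
    --subreddit_file_path subreddits.txt \
    --limit 100 \
    --client_id YOUR_CLIENT_ID \
    --client_secret YOUR_CLIENT_SECRET \
    --user_agent "my_app v1.0 by /u/your_username" \
    --posts \
    --comments
```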