reddit-multimodal-crawler


Name: reddit-multimodal-crawler
Version: 1.3.2
Home page: https://github.com/aneesh-aparajit/reddit-crawler
Summary: A scraper which will scrape out multimedia data from reddit.
Upload time: 2022-12-31 03:30:17
Author: Aneesh Aparajit G
Keywords: web-scraping, webscraper, reddit, multimodal, datascience
Requirements: praw, pandas, numpy, wheel, requests, tqdm, bcrypt, nltk, argparse
            # Reddit Multimodal Crawler [![Downloads](https://static.pepy.tech/badge/reddit-multimodal-crawler)](https://pepy.tech/project/reddit-multimodal-crawler)

This is a wrapper around the `PRAW` package that scrapes multimedia content from Reddit and saves it as `csv`, `json`, `tsv`, or `sql` files.

This repository helps you scrape various subreddits and returns their multimedia attributes.

You can `pip install` it to integrate with another application, or use it as a command-line application.

- PyPI Link: https://pypi.org/project/reddit-multimodal-crawler/

```bash
pip install reddit-multimodal-crawler
```
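
Once installed, the package can be imported directly into your own code. A quick sanity check (the module path below matches the sample code later in this README):

```python
# Verify the installation by importing the crawler class used throughout this README.
from reddit_multimodal_crawler.crawler import Crawler

print(Crawler)  # should print the class without raising ImportError
```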

## How to use the repository?

Before running the code, register an application with the Reddit API to obtain a `client_id` and `client_secret`, and choose a `user_agent` string. Then pass these in as arguments.
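
Since this package wraps `PRAW`, these are the same credentials that `praw.Reddit` itself accepts. A minimal sketch of setting them up, assuming (purely for illustration) that they are stored in environment variables:

```python
import os

import praw

# Hypothetical setup: the credentials are read from environment variables rather
# than hard-coded; the variable names here are an assumption of this example.
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="my-app 1.0 by /u/my_username",  # <APP_NAME> <VERSION> by /u/<USERNAME>
)
print(reddit.read_only)  # True for script-type apps with no username/password
```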

That said, the easiest way to get started is `pip install reddit-multimodal-crawler`.

## Functionalities

Like `PRAW`, this package lets you scrape multiple subreddits, but it also returns and saves datasets of the results. It scrapes both posts and comments, as the sample code below shows.

### Sample Code

```python
import argparse

import nltk

from reddit_multimodal_crawler.crawler import Crawler

# The crawler uses NLTK's VADER sentiment analyzer, so fetch its lexicon first.
nltk.download("vader_lexicon")

if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--subreddit_file_path",
        help="A path to the file which contains the subreddits to scrape from.",
        type=str,
    )
    parser.add_argument(
        "--limit", help="The limit to the number of articles to scrape.", type=int
    )
    parser.add_argument(
        "--client_id", help="The client ID provided by Reddit.", type=str
    )
    parser.add_argument(
        "--client_secret", help="The client secret provided by Reddit.", type=str
    )
    parser.add_argument(
        "--user_agent",
        help="The user agent, in the form <APP_NAME> <VERSION> by /u/<REDDIT_USERNAME>.",
        type=str,
    )
    parser.add_argument(
        "--posts",
        action="store_true",
        help="Whether to scrape the posts of each subreddit.",
    )
    parser.add_argument(
        "--comments",
        action="store_true",
        help="Whether to scrape the comments of the top posts of each subreddit.",
    )

    args = parser.parse_args()

    r = Crawler(
        client_id=args.client_id,
        client_secret=args.client_secret,
        user_agent=args.user_agent,
    )

    # One subreddit name per whitespace-separated token in the file.
    with open(args.subreddit_file_path, "r") as f:
        subreddit_list = f.read().split()

    print(subreddit_list)

    if args.posts:
        r.get_posts(subreddit_names=subreddit_list, sort_by="top", limit=args.limit)

    if args.comments:
        r.get_comments(subreddit_names=subreddit_list, sort_by="top", limit=args.limit)
```
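
Saved as, say, `scrape.py`, the script could then be run like this (the script name and the `subreddits.txt` file are hypothetical placeholders):

```bash
python scrape.py \
    --subreddit_file_path subreddits.txt \
    --limit 100 \
    --client_id "<CLIENT_ID>" \
    --client_secret "<CLIENT_SECRET>" \
    --user_agent "my-app 1.0 by /u/my_username" \
    --posts \
    --comments
```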


Raw data

            {
    "_id": null,
    "home_page": "https://github.com/aneesh-aparajit/reddit-crawler",
    "name": "reddit-multimodal-crawler",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "web-scraping,webscraper,reddit,multimodal,datascience",
    "author": "Aneesh Aparajit G",
    "author_email": "aneeshaparajit.g2002@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/54/1b/552761ce29265fc0f53fed692d5414d9138c5ba5f9dcd9f9d9001cca9649/reddit_multimodal_crawler-1.3.2.tar.gz",
    "platform": null,
    "description": "# Reddit Multimodal Crawler [![Downloads](https://static.pepy.tech/badge/reddit-multimodal-crawler)](https://pepy.tech/project/reddit-multimodal-crawler)\n\nThis is a wrapper to the `PRAW` package to scrape content from image in the form of `csv`, `json`, `tsv`, `sql` files.\n\nThis repository will help you scrape various subreddits, and will return to you multi-media attributes.\n\nYou can pip install this to integrate with some other application, or use it as an commandline application.\n\n- PyPI Link:  https://pypi.org/project/reddit-multimodal-crawler/\n\n```commandLine\npip install reddit-multimodal-crawler\n```\n\n## How to use the repository?\n\nBefore running the code, you should have registered with the Reddit API and create a sample project to run the code and obtain the `client_id`, `client_secret` and make a `user_agent`. Then pass them in the arguements.\n\nAlthough, the easier way is to use the `pip install reddit-multimodal-crawler`.\n\n## Functionalities\n\nThis will help you scrape multiple subreddits just like `PRAW` but, will also return and save datasets for the same. Will scrape the posts and the comments as well.\n\n### Sample Code\n\n```python\nimport nltk\nfrom reddit_multimodal_crawler.crawler import Crawler\nimport argparse\n\nnltk.download(\"vader_lexicon\")\n\nif __name__ == \"__main__\":\n\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\n        \"--subreddit_file_path\",\n        \"A path to the file which contains the subreddits to scrape from.\",\n        type=str,\n    )\n    parser.add_argument(\n        \"--limit\", \"The limit to number of articles to scrape.\", type=int\n    )\n    parser.add_argument(\"--client_id\", \"The Client ID provided by Reddit.\", type=str)\n    parser.add_argument(\n        \"--client_secret\", \"The Secret ID provided by the Reddit.\", type=str\n    )\n    parser.add_argument(\n        \"--user_agent\",\n        \"The User Agent in the form of <APP_NAME> <VERSION> by /u/<REDDIT_USERNAME>\",\n        type=str,\n    )\n    parser.add_argument(\n        \"--posts\", \"A boolean variable to parse through the posts or not.\", type=bool\n    )\n    parser.add_argument(\n        \"--comments\",\n        \"A boolean variable to parse through the comments of the top posts of subreddit\",\n        type=bool,\n    )\n\n    args = parser.parse_args()\n\n    client_id = args[\"client_id\"]\n    client_secret = args[\"client_secret\"]\n    user_agent = args[\"user_agent\"]\n    file_path = args[\"subreddit_file_path\"]\n    limit = args[\"limit\"]\n\n    r = Crawler(client_id=client_id, client_secret=client_secret, user_agent=user_agent)\n\n    subreddit_list = open(file_path, \"r\").readlines().split()\n\n    print(subreddit_list)\n\n    if args[\"posts\"]:\n        r.get_posts(subreddit_names=subreddit_list, sort_by=\"top\", limit=limit)\n\n    if args[\"comments\"]:\n        r.get_comments(subreddit_names=subreddit_list, sort_by=\"top\", limit=limit)\n\n```\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "A scraper which will scrape out multimedia data from reddit.",
    "version": "1.3.2",
    "split_keywords": [
        "web-scraping",
        "webscraper",
        "reddit",
        "multimodal",
        "datascience"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "94bd45fdee3610ced26e9b182adcb0cb",
                "sha256": "e627454c015000e79b10b20d3057bd4156e8e7af88dea11f4814487b50a7ce10"
            },
            "downloads": -1,
            "filename": "reddit_multimodal_crawler-1.3.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "94bd45fdee3610ced26e9b182adcb0cb",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 5368,
            "upload_time": "2022-12-31T03:30:15",
            "upload_time_iso_8601": "2022-12-31T03:30:15.299414Z",
            "url": "https://files.pythonhosted.org/packages/de/8f/650efcb4ce8a2f2b0779e311af9a798c056dac0b73daf4f530c73fb1d4af/reddit_multimodal_crawler-1.3.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "48f3b2a959fd22ec25c66a9b488dfec2",
                "sha256": "96485a1a0aa7c111fbbe8165e97eeef80e7f8715c4eaa639c419d9311ea22882"
            },
            "downloads": -1,
            "filename": "reddit_multimodal_crawler-1.3.2.tar.gz",
            "has_sig": false,
            "md5_digest": "48f3b2a959fd22ec25c66a9b488dfec2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 4910,
            "upload_time": "2022-12-31T03:30:17",
            "upload_time_iso_8601": "2022-12-31T03:30:17.353718Z",
            "url": "https://files.pythonhosted.org/packages/54/1b/552761ce29265fc0f53fed692d5414d9138c5ba5f9dcd9f9d9001cca9649/reddit_multimodal_crawler-1.3.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-31 03:30:17",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "aneesh-aparajit",
    "github_project": "reddit-crawler",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "praw",
            "specs": []
        },
        {
            "name": "pandas",
            "specs": []
        },
        {
            "name": "numpy",
            "specs": []
        },
        {
            "name": "wheel",
            "specs": []
        },
        {
            "name": "requests",
            "specs": []
        },
        {
            "name": "tqdm",
            "specs": []
        },
        {
            "name": "bcrypt",
            "specs": []
        },
        {
            "name": "nltk",
            "specs": []
        },
        {
            "name": "argparse",
            "specs": []
        }
    ],
    "lcname": "reddit-multimodal-crawler"
}