# Reddit to Markdown Scraper (rd2md)
This Python script uses PRAW (Python Reddit API Wrapper) to scrape hot posts from a specified subreddit and save them to a formatted Markdown file. It also downloads and saves any images associated with those posts.
## Features
- Scrapes hot posts from a specified subreddit
- Filters posts based on score and whether they're stickied
- Downloads and saves images from posts
- Formats post content, including comments, into a Markdown file
- Handles both text posts and image posts
- Can be used as a standalone script or imported as a module
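The score-and-sticky filter listed above could look roughly like the sketch below. It is not taken from rd2md's source: the `Post` dataclass, the `is_interesting` helper, and the `min_score` threshold of 10 are hypothetical and only illustrate the idea. The `score` and `stickied` field names mirror the attributes PRAW exposes on submissions.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical stand-in for a PRAW submission (only the fields used here)."""
    title: str
    score: int
    stickied: bool


def is_interesting(post: Post, min_score: int = 10) -> bool:
    """Keep posts that are not stickied and meet the (assumed) score threshold."""
    return not post.stickied and post.score >= min_score


posts = [
    Post("Weekly megathread", score=250, stickied=True),
    Post("New model release", score=120, stickied=False),
    Post("Low-effort question", score=2, stickied=False),
]

# Only the non-stickied post above the threshold survives the filter
kept = [p.title for p in posts if is_interesting(p)]
print(kept)
```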
## Prerequisites
- Python 3.12.3 or higher
- pip (Python package installer)
## Installation
You can install rd2md directly from PyPI:
```
pip install rd-to-md
```
## Setup
To use this script, you need to create a Reddit application to get the necessary credentials:
1. Log in to your Reddit account
2. Go to https://www.reddit.com/prefs/apps
3. Scroll down and click "create another app..."
4. Fill out the form:
   - Choose a name for your application
   - Select "script" as the app type
   - For "redirect uri", use http://localhost:8080
   - Add a description (optional)
5. Click "create app"
After creating the app, note down the following:
- client_id: The string under "personal use script"
- client_secret: The string next to "secret"
## Usage
### As a Command-Line Tool
After installation, you can run rd2md from the command line:
```
rd2md --client_id=YOUR_CLIENT_ID --client_secret=YOUR_CLIENT_SECRET [options]
```
Options:
- `--client_id`: Your Reddit API client ID (required if not set as an environment variable)
- `--client_secret`: Your Reddit API client secret (required if not set as an environment variable)
- `--user_agent`: User agent for Reddit API (default: "praw_bot")
- `--subreddit`: Subreddit to scrape (default: "LocalLLaMA")
- `--limit`: Number of posts to scrape (default: 3)
Example:
```
rd2md --client_id=YOUR_CLIENT_ID --client_secret=YOUR_CLIENT_SECRET --subreddit=ProgrammingHumor --limit=10
```
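For readers who want to see how the documented options and defaults fit together, here is a minimal `argparse` sketch. It is an assumption about how such a command line might be wired, not rd2md's actual implementation; `build_parser` is a hypothetical name, and only the flag names and defaults come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(prog="rd2md")
    parser.add_argument("--client_id", default=None)
    parser.add_argument("--client_secret", default=None)
    parser.add_argument("--user_agent", default="praw_bot")
    parser.add_argument("--subreddit", default="LocalLLaMA")
    parser.add_argument("--limit", type=int, default=3)
    return parser


# Parse the same options used in the example command (credentials omitted)
args = build_parser().parse_args(["--subreddit=ProgrammingHumor", "--limit=10"])
print(args.subreddit, args.limit, args.user_agent)
```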
### As an Importable Module
You can also use rd2md as a module in your Python code:
```python
from rd2md import rd2md

# Scrape and save posts
filename, list_contents, list_images = rd2md(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    subreddit_name="ProgrammingHumor",
    limit=10
)
```
### Using Environment Variables
Instead of passing the client ID and secret as arguments, you can set them as environment variables:
```
export REDDIT_CLIENT_ID=your_client_id
export REDDIT_CLIENT_SECRET=your_client_secret
```
Then you can run the script without these arguments:
```
rd2md --subreddit=ProgrammingHumor --limit=10
```
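The fallback from arguments to environment variables can be sketched as follows. This is an illustration under assumptions, not rd2md's code: `resolve_credential` is a hypothetical helper, and the rule that an explicit argument takes precedence over the environment is assumed rather than documented. The example passes a plain dictionary in place of `os.environ` so it runs without touching the real environment.

```python
import os


def resolve_credential(explicit, env_name, environ=None):
    """Return the explicit value if given, else fall back to the environment."""
    environ = os.environ if environ is None else environ
    value = explicit or environ.get(env_name)
    if not value:
        raise ValueError(f"Missing credential: pass it explicitly or set {env_name}")
    return value


# Stand-in for os.environ with the two documented variable names
fake_env = {
    "REDDIT_CLIENT_ID": "id_from_env",
    "REDDIT_CLIENT_SECRET": "secret_from_env",
}

client_id = resolve_credential(None, "REDDIT_CLIENT_ID", fake_env)
client_secret = resolve_credential("secret_from_flag", "REDDIT_CLIENT_SECRET", fake_env)
print(client_id, client_secret)
```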
## Output
The script creates a new directory named `{subreddit}_posts_{date}` in the current working directory. This directory contains:
- A Markdown file named `{current_time}.md` with the scraped post content
- An `images` subdirectory containing any downloaded images
![Example output](https://raw.githubusercontent.com/JosefAlbers/rd2md/main/assets/example_output.png)
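The layout described above can be expressed with `pathlib`. The sketch below only computes the paths and does not create anything on disk. The `output_paths` helper is hypothetical, and the `%Y%m%d` and `%H%M%S` formats are assumptions, since the README does not specify how `{date}` and `{current_time}` are rendered.

```python
from datetime import datetime
from pathlib import Path


def output_paths(subreddit: str, now: datetime, root: Path = Path(".")):
    """Build the output directory, Markdown file, and images directory paths."""
    out_dir = root / f"{subreddit}_posts_{now:%Y%m%d}"  # assumed date format
    md_file = out_dir / f"{now:%H%M%S}.md"              # assumed time format
    images_dir = out_dir / "images"
    return out_dir, md_file, images_dir


out_dir, md_file, images_dir = output_paths(
    "LocalLLaMA", datetime(2024, 7, 28, 5, 44, 42)
)
print(md_file.as_posix())
print(images_dir.as_posix())
```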
## LLM Integration
```python
# https://github.com/JosefAlbers/Phi-3-Vision-MLX
import json
from pathlib import Path

from phi_3_vision_mlx import generate
from rd2md import rd2md

# Scrape a single thread by URL; contents and images are indexed in parallel
filename, contents, images = rd2md(post_url='https://www.reddit.com/r/LocalLLaMA/comments/1e7pdig/this_sums_up_my_experience_with_all_llm/')
prompt = 'Write an executive summary of above (max 200 words). The article should capture the diverse range of opinions and key points discussed in the thread, presenting a balanced view of the topic without quoting specific users or comments directly. Focus on organizing the information cohesively, highlighting major arguments, counterarguments, and any emerging consensus or unresolved issues within the community.'
# Append the summarization instruction to each scraped post
prompts = [f'{s}\n\n{prompt}' for s in contents]
# Run the model once per post, pairing each prompt with that post's images
results = [generate(p, img, max_tokens=512, blind_model=False, quantize_model=False, verbose=False) for p, img in zip(prompts, images)]
# Save prompts, image paths, and model outputs next to the Markdown file
with open(Path(filename).with_suffix('.json'), 'w') as f:
    json.dump({'prompts': prompts, 'images': images, 'results': results}, f, indent=4)
```
<details><summary>Click to expand output</summary><pre>
The discussion revolves around the use of LLM (Language Learning Models) in software development, particularly focusing on the abstraction provided by Langchain. Opinions are divided, with some advocating for the convenience and ease of use Langchain offers, while others criticize it for being overly complex and hindering customization and understanding. The thread suggests that while Langchain may simplify certain tasks, it can also lead to a lack of control and flexibility, potentially resulting in technical debt. There is a consensus that understanding the underlying mechanisms is crucial, and that abstractions should be used judiciously to avoid over-complication. The thread also touches on the broader topic of software development practices, emphasizing the importance of clear, maintainable code and the potential pitfalls of over-reliance on abstractions. The community appears to be grappling with the balance between leveraging existing tools and maintaining the ability to adapt and innovate independently.<|end|>
</pre></details><br>
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
Raw data
{
"_id": null,
"home_page": "https://github.com/JosefAlbers/rd2md",
"name": "rd-to-md",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.12.3",
"maintainer_email": null,
"keywords": null,
"author": "Josef Albers",
"author_email": "albersj66@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/09/68/a2dc524338e4cd233c5e827b930b9293cf068d842a110c8a93ff70f0be97/rd_to_md-0.0.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Scrape reddit posts into a single markdown file",
"version": "0.0.3",
"project_urls": {
"Homepage": "https://github.com/JosefAlbers/rd2md"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "7439afbba46c4b97bb352a3813653e6dd7fed90e824d7965e5c6d77b5044f50b",
"md5": "2dedb67a78e1d0a427af299b33273881",
"sha256": "be3e319c82b20e35a6ad88cbd30c8995d76cb1989037b145df617e36a215f0b3"
},
"downloads": -1,
"filename": "rd_to_md-0.0.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "2dedb67a78e1d0a427af299b33273881",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.12.3",
"size": 6075,
"upload_time": "2024-07-28T05:44:42",
"upload_time_iso_8601": "2024-07-28T05:44:42.976615Z",
"url": "https://files.pythonhosted.org/packages/74/39/afbba46c4b97bb352a3813653e6dd7fed90e824d7965e5c6d77b5044f50b/rd_to_md-0.0.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "0968a2dc524338e4cd233c5e827b930b9293cf068d842a110c8a93ff70f0be97",
"md5": "0b6864a37a17d3c8697faa818c08c80c",
"sha256": "3dd73f6d408ec02d43e3013113dd86976273ca56a8c579c52297d6a2dfc5ac96"
},
"downloads": -1,
"filename": "rd_to_md-0.0.3.tar.gz",
"has_sig": false,
"md5_digest": "0b6864a37a17d3c8697faa818c08c80c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.12.3",
"size": 5868,
"upload_time": "2024-07-28T05:44:44",
"upload_time_iso_8601": "2024-07-28T05:44:44.377014Z",
"url": "https://files.pythonhosted.org/packages/09/68/a2dc524338e4cd233c5e827b930b9293cf068d842a110c8a93ff70f0be97/rd_to_md-0.0.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-07-28 05:44:44",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "JosefAlbers",
"github_project": "rd2md",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "rd-to-md"
}