# news-watch
news-watch is a Python package that allows you to scrape news articles from various Indonesian news websites based on specific keywords and date ranges.
## Installation
You can install news-watch via pip:
```bash
pip install news-watch
```
## Usage
To run the scraper from the command line:
```bash
newswatch -k <keywords> -sd <start_date> [-s <scrapers>] [-v]
```
### Command-Line Arguments
- `--keywords`, `-k`: Required. A comma-separated list of keywords to scrape (e.g., `-k "ojk,bank,npl"`).
- `--start_date`, `-sd`: Required. The start date for scraping in YYYY-MM-DD format (e.g., `-sd 2023-01-01`).
- `--scrapers`, `-s`: Optional. A comma-separated list of scrapers to use (e.g., `-s "kompas,viva"`). If not provided, all scrapers are used.
- `--verbose`, `-v`: Optional. Increase verbosity level (e.g., `-v`, `-vv`, `-vvv`).
### Examples
Scrape articles related to "ihsg" from October 28, 2024:
```bash
newswatch -k ihsg -sd 2024-10-28
```
Scrape articles for multiple keywords and increase verbosity:
```bash
newswatch -k "ihsg,bank,keuangan" -sd 2024-10-28 -vv
```
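If you want to launch the CLI from a Python script, the documented flags can be assembled into an argument list. The following is a minimal sketch: `build_newswatch_command` is a hypothetical helper, not part of news-watch, and it relies only on the flags described above.

```python
from datetime import date


def build_newswatch_command(keywords, start_date, scrapers=None, verbosity=0):
    """Assemble an argv list for the newswatch CLI (hypothetical helper)."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    cmd = [
        "newswatch",
        "-k", ",".join(keywords),        # comma-separated keywords
        "-sd", start_date.isoformat(),   # YYYY-MM-DD
    ]
    if scrapers:
        # Omitting -s lets the CLI fall back to all scrapers.
        cmd += ["-s", ",".join(scrapers)]
    if verbosity > 0:
        # -v, -vv, -vvv, ...
        cmd.append("-" + "v" * verbosity)
    return cmd


cmd = build_newswatch_command(["ihsg", "bank"], date(2024, 10, 28), verbosity=2)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the package is installed; passing a list rather than a single string avoids shell-quoting problems with keywords.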
## Output
The scraped articles are saved as a CSV file in the current working directory, named in the format `news-watch-YYYYMMDD_HH.csv`.
The CSV file contains the following fields:
- `title`
- `publish_date`
- `author`
- `content`
- `keyword`
- `category`
- `source`
- `link`
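The fields above can be read back with the standard library. This is a rough sketch rather than anything shipped with news-watch: `load_articles` and `count_by_source` are hypothetical helper names, and it assumes the file is UTF-8 encoded with a header row matching the field names listed here.

```python
import csv
from collections import Counter

# Field names as listed in this README.
EXPECTED_FIELDS = [
    "title", "publish_date", "author", "content",
    "keyword", "category", "source", "link",
]


def load_articles(path):
    """Read a news-watch CSV into a list of dicts, checking the header."""
    # Assumption: UTF-8 encoding and a header row.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = set(EXPECTED_FIELDS) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return list(reader)


def count_by_source(rows):
    """Count how many articles came from each source."""
    return Counter(row["source"] for row in rows)
```

For example, `count_by_source(load_articles("news-watch-20241028_10.csv"))` would give a per-site article count, where the filename is only an illustration of the naming pattern.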
## Supported Websites
- Bisnis Indonesia
- CNBC Indonesia
- Detik
- Kompas
- Kontan

  > Note: Running this on the cloud currently leads to errors due to Cloudflare restrictions.
  >
  > Limitation: The scraper can process a maximum of 50 pages.

- Viva
## Contributing
Contributions are welcome! If you'd like to add support for more websites or improve the existing code, please open an issue or submit a pull request.
### Running Tests
To run the test suite:
```bash
pytest tests/
```
## License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
Raw data
{
"_id": null,
"home_page": "https://github.com/okkymabruri/news-watch",
"name": "news-watch",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": null,
"author": "Okky Mabruri",
"author_email": "okkymbrur@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/c4/e8/ca7c54b7c82ff360ff8c731d44b73f5b1624baf9c9104c6cf68121e0c6cd/news_watch-0.1.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A scraper for Indonesian news websites.",
"version": "0.1.5",
"project_urls": {
"Homepage": "https://github.com/okkymabruri/news-watch"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "485892e3696146fe28b3c4a156fd01133ea8f171d270d2a6595116194db09009",
"md5": "942a503a4a1ccb21f960aab013cdf052",
"sha256": "9969cd04ea7e4e2263b363384a52eda1eb3996d023b78c49b4ad67ca16cb1f62"
},
"downloads": -1,
"filename": "news_watch-0.1.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "942a503a4a1ccb21f960aab013cdf052",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 27293,
"upload_time": "2024-11-06T03:19:41",
"upload_time_iso_8601": "2024-11-06T03:19:41.602264Z",
"url": "https://files.pythonhosted.org/packages/48/58/92e3696146fe28b3c4a156fd01133ea8f171d270d2a6595116194db09009/news_watch-0.1.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "c4e8ca7c54b7c82ff360ff8c731d44b73f5b1624baf9c9104c6cf68121e0c6cd",
"md5": "2b9f3c514f5c4fbc54caaecd081dd622",
"sha256": "fce8456f4bc50fc6ca38adc89cfe19f2a80451a234c6b24b6a9daaca5609997a"
},
"downloads": -1,
"filename": "news_watch-0.1.5.tar.gz",
"has_sig": false,
"md5_digest": "2b9f3c514f5c4fbc54caaecd081dd622",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 22978,
"upload_time": "2024-11-06T03:19:43",
"upload_time_iso_8601": "2024-11-06T03:19:43.494434Z",
"url": "https://files.pythonhosted.org/packages/c4/e8/ca7c54b7c82ff360ff8c731d44b73f5b1624baf9c9104c6cf68121e0c6cd/news_watch-0.1.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-06 03:19:43",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "okkymabruri",
"github_project": "news-watch",
"github_not_found": true,
"lcname": "news-watch"
}