news-watch


- Name: news-watch
- Version: 0.1.5
- Home page: https://github.com/okkymabruri/news-watch
- Summary: A scraper for Indonesian news websites.
- Upload time: 2024-11-06 03:19:43
- Author: Okky Mabruri
- Maintainer: None
- Requires Python: >=3.10
- License: None
- Requirements: No requirements were recorded.

# news-watch

news-watch is a Python package that allows you to scrape news articles from various Indonesian news websites based on specific keywords and date ranges.


## Installation

You can install news-watch via pip (the installed command is named `newswatch`):

```bash
pip install news-watch
```
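
After installing, it can be useful to confirm that the interpreter meets the `>=3.10` requirement from the package metadata and that the distribution is visible to Python. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `python_is_supported` are made up for this illustration and are not part of news-watch.

```python
import sys
from importlib import metadata


def installed_version(dist_name="news-watch"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def python_is_supported(version_info=sys.version_info):
    """news-watch declares requires_python >=3.10 in its metadata."""
    return tuple(version_info[:2]) >= (3, 10)


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("news-watch version:", installed_version())
```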

## Usage

To run the scraper from the command line:

```bash
newswatch -k <keywords> -sd <start_date> [-s <scrapers>] [-v]
```

### Command-Line Arguments

- `--keywords`, `-k`: Required. A comma-separated list of keywords to search for (e.g., `-k "ojk,bank,npl"`).
- `--start_date`, `-sd`: Required. The start date for scraping in YYYY-MM-DD format (e.g., `-sd 2023-01-01`).
- `--scrapers`, `-s`: Optional. A comma-separated list of scrapers to use (e.g., `-s "kompas,viva"`). If omitted, all scrapers are used.
- `--verbose`, `-v`: Optional. Increase the verbosity level (e.g., `-v`, `-vv`, `-vvv`).
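
If you want to call the command from a Python script, the documented flags can be assembled into an argument list and passed to `subprocess`. This is only a sketch: `build_newswatch_command` is a hypothetical helper written for this example, not a function provided by news-watch, and it relies solely on the flags listed above.

```python
import subprocess
from datetime import datetime


def build_newswatch_command(keywords, start_date, scrapers=None, verbosity=0):
    """Build the argv list for the documented `newswatch` CLI (hypothetical helper)."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    # Validate and normalize the date to YYYY-MM-DD; raises ValueError otherwise.
    date = datetime.strptime(start_date, "%Y-%m-%d").strftime("%Y-%m-%d")
    cmd = ["newswatch", "-k", ",".join(keywords), "-sd", date]
    if scrapers:
        # Omitting -s lets the tool use all scrapers, as documented.
        cmd += ["-s", ",".join(scrapers)]
    if verbosity > 0:
        # -v, -vv, -vvv ... one "v" per verbosity level.
        cmd.append("-" + "v" * verbosity)
    return cmd


if __name__ == "__main__":
    cmd = build_newswatch_command(
        ["ihsg", "bank"], "2024-10-28", scrapers=["kompas", "viva"], verbosity=2
    )
    print(cmd)
    # Uncomment to run the scraper (requires news-watch to be installed):
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues with the comma-separated keyword string.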



### Examples

Scrape articles related to "ihsg" from October 28, 2024:

```bash
newswatch -k ihsg -sd 2024-10-28
```

Scrape articles for multiple keywords and increase verbosity:

```bash
newswatch -k "ihsg,bank,keuangan" -sd 2024-10-28 -vv
```

## Output

The scraped articles are saved as a CSV file in the current working directory with the format `news-watch-YYYYMMDD_HH.csv`.

The CSV file contains the following fields:

- `title`
- `publish_date`
- `author`
- `content`
- `keyword`
- `category`
- `source`
- `link`
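
The output file can be loaded with Python's standard `csv` module. The sketch below assumes the file is UTF-8 encoded, comma-delimited, and starts with a header row containing the fields listed above; none of this is confirmed by the package documentation. The helpers `latest_output` and `load_articles` are hypothetical names for this example.

```python
import csv
import glob
import os

# Field names as listed in the Output section.
EXPECTED_FIELDS = [
    "title", "publish_date", "author", "content",
    "keyword", "category", "source", "link",
]


def latest_output(directory="."):
    """Return the newest news-watch-*.csv in `directory`, or None.

    The YYYYMMDD_HH suffix sorts lexicographically in chronological order.
    """
    paths = sorted(glob.glob(os.path.join(directory, "news-watch-*.csv")))
    return paths[-1] if paths else None


def load_articles(path):
    """Read the CSV into a list of dicts, checking the expected columns exist."""
    # Assumption: UTF-8 encoding and a header row.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = set(EXPECTED_FIELDS) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return list(reader)


if __name__ == "__main__":
    path = latest_output()
    if path is None:
        print("No news-watch CSV found in the current directory.")
    else:
        articles = load_articles(path)
        print(f"{len(articles)} articles loaded from {path}")
```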

## Supported Websites

- Bisnis Indonesia
- CNBC Indonesia
- Detik
- Kompas
- Kontan

    > Note: Running this on the cloud currently leads to errors due to Cloudflare restrictions.
    >
    > Limitation: The scraper can process a maximum of 50 pages.

- Viva

## Contributing

Contributions are welcome! If you'd like to add support for more websites or improve the existing code, please open an issue or submit a pull request.

### Running Tests

To run the test suite:

```bash
pytest tests/
```

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/okkymabruri/news-watch",
    "name": "news-watch",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": "Okky Mabruri",
    "author_email": "okkymbrur@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/c4/e8/ca7c54b7c82ff360ff8c731d44b73f5b1624baf9c9104c6cf68121e0c6cd/news_watch-0.1.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A scraper for Indonesian news websites.",
    "version": "0.1.5",
    "project_urls": {
        "Homepage": "https://github.com/okkymabruri/news-watch"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "485892e3696146fe28b3c4a156fd01133ea8f171d270d2a6595116194db09009",
                "md5": "942a503a4a1ccb21f960aab013cdf052",
                "sha256": "9969cd04ea7e4e2263b363384a52eda1eb3996d023b78c49b4ad67ca16cb1f62"
            },
            "downloads": -1,
            "filename": "news_watch-0.1.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "942a503a4a1ccb21f960aab013cdf052",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 27293,
            "upload_time": "2024-11-06T03:19:41",
            "upload_time_iso_8601": "2024-11-06T03:19:41.602264Z",
            "url": "https://files.pythonhosted.org/packages/48/58/92e3696146fe28b3c4a156fd01133ea8f171d270d2a6595116194db09009/news_watch-0.1.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c4e8ca7c54b7c82ff360ff8c731d44b73f5b1624baf9c9104c6cf68121e0c6cd",
                "md5": "2b9f3c514f5c4fbc54caaecd081dd622",
                "sha256": "fce8456f4bc50fc6ca38adc89cfe19f2a80451a234c6b24b6a9daaca5609997a"
            },
            "downloads": -1,
            "filename": "news_watch-0.1.5.tar.gz",
            "has_sig": false,
            "md5_digest": "2b9f3c514f5c4fbc54caaecd081dd622",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 22978,
            "upload_time": "2024-11-06T03:19:43",
            "upload_time_iso_8601": "2024-11-06T03:19:43.494434Z",
            "url": "https://files.pythonhosted.org/packages/c4/e8/ca7c54b7c82ff360ff8c731d44b73f5b1624baf9c9104c6cf68121e0c6cd/news_watch-0.1.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-06 03:19:43",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "okkymabruri",
    "github_project": "news-watch",
    "github_not_found": true,
    "lcname": "news-watch"
}
        