
**minet** is a webmining command line tool & library for python (>= 3.8) that can be used to collect and extract data from a large variety of web sources such as raw webpages, Facebook, YouTube, Twitter, Media Cloud etc.

It adopts a very simple approach to various webmining problems by letting you perform a wide array of tasks from the comfort of the command line. No database needed: raw CSV files should be sufficient to do most of the work.

In addition, **minet** also exposes its high-level programmatic interface as a python library so you remain free to use its utilities to better suit your use cases.

**minet** is developed by [médialab SciencesPo](https://github.com/medialab/) research engineers and is the consolidation of more than a decade of webmining practices targeted at social sciences.

As such, it has been designed to be:

1. **low-tech**, as it requires minimal resources such as memory, CPUs or hard drive space and should be able to work on any low-cost PC.
2. **fault-tolerant**, as it is able to recover when the network is bad and retry HTTP calls when suitable. What's more, most minet commands can be resumed if aborted and are designed to run for a long time (think days or months) without leaking memory.
3. **unix-compliant**, as it can be piped easily and knows how to work with the usual streams.
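
To make the "resumable" idea above concrete, here is a minimal sketch of how a CSV-based job might skip work that a partial report already covers. It is not minet's actual implementation: the function names, the `url` and `result` columns and the sample data are all hypothetical and only illustrate the general pattern.

```python
import csv
import io


def read_done_keys(report_file, key_column):
    """Collect the keys already present in a partial report."""
    return {row[key_column] for row in csv.DictReader(report_file)}


def resume(input_file, done_keys, key_column, work):
    """Yield result rows for input rows that are not in the partial report."""
    for row in csv.DictReader(input_file):
        if row[key_column] in done_keys:
            continue  # already handled before the interruption
        yield {key_column: row[key_column], "result": work(row[key_column])}


# Hypothetical data: three urls to process, one of them already done.
urls = io.StringIO("url\nhttps://a.example\nhttps://b.example\nhttps://c.example\n")
partial_report = io.StringIO("url,result\nhttps://a.example,17\n")

done = read_done_keys(partial_report, "url")
# `len` stands in for the real (expensive) work performed on each url.
remaining = list(resume(urls, done, "url", len))
```

Because the report itself records what has been done, no database is needed to pick up where an aborted run left off.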
**Shortcuts**: [Command line documentation](./docs/cli.md), [Python library documentation](./docs/lib.md).

_How to cite?_

**minet** is published on [Zenodo](https://zenodo.org/) as [10.5281/zenodo.4564399](http://doi.org/10.5281/zenodo.4564399).

You can cite it thusly:

> Guillaume Plique, Pauline Breteau, Jules Farjas, Héloïse Théro, Jean Descamps, Amélie Pellé, Laura Miguel, César Pichon, & Kelly Christensen. (2019, October 14). Minet, a webmining CLI tool & library for python. Zenodo. http://doi.org/10.5281/zenodo.4564399
## Whirlwind tour
```bash
# Downloading a large number of urls as fast as possible
minet fetch url -i urls.csv > report.csv
# Extracting raw text from the downloaded HTML files
minet extract -i report.csv -I downloaded > extracted.csv
# Scraping the urls found in the downloaded HTML files
minet scrape urls -i report.csv -I downloaded > scraped_urls.csv
# Parsing & normalizing the scraped urls
minet url-parse scraped_url -i scraped_urls.csv > parsed_urls.csv
# Scraping data from Twitter
minet twitter scrape tweets "from:medialab_ScPo" > tweets.csv
# Printing a command's help
minet twitter scrape -h
# Searching videos on YouTube
minet youtube search -k "MY-YT-API-KEY" "médialab" > videos.csv
```
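
Since every command above writes plain CSV, the resulting files can be inspected with nothing more than the python standard library. The sketch below tallies HTTP statuses in a report; the `http_status` column name and the sample rows are assumptions made for illustration, so check the header of your own report before relying on them.

```python
import csv
import io
from collections import Counter


def count_statuses(report_file, status_column="http_status"):
    """Count how many rows of a CSV report share each status value."""
    reader = csv.DictReader(report_file)
    return Counter(row[status_column] for row in reader)


# Hypothetical report content standing in for a real report.csv file.
sample_report = io.StringIO(
    "url,http_status\n"
    "https://a.example,200\n"
    "https://b.example,404\n"
    "https://c.example,200\n"
)

counts = count_statuses(sample_report)
for status, total in counts.most_common():
    print(status, total)
```

With a real file, `open("report.csv", newline="")` would replace the in-memory sample.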
## Summary
- [What it does](#what-it-does)
- [Documented use cases](#documented-use-cases)
- [Features (from a technical standpoint)](#features-from-a-technical-standpoint)
- [Installation](#installation)
- [Upgrading](#upgrading)
- [Uninstallation](#uninstallation)
- [Documentation](#documentation)
- [Contributing](#contributing)
## What it does
Minet can single-handedly:
- Extract URLs from a text file (or a table)
- Parse URLs (get useful information, including Facebook- and YouTube-specific details)
- Join two CSV files by matching the columns containing URLs
- From a list of URLs, resolve their redirections
  - ...and check their HTTP status
  - ...and download the HTML
  - ...and extract hyperlinks
  - ...and extract the text content and other metadata (title...)
  - ...and scrape structured data (using a declarative language to define your heuristics)
- Crawl (using a declarative language to define a browsing behavior, and what to harvest)
- Mine or search:
  - _[Bluesky](https://bsky.app/)_ (requires a free user account)
  - _[Mediacloud](https://mediacloud.org/)_ (requires free API access)
  - _[Twitter](https://twitter.com)_ (requires free API access)
  - _[Wikipedia](https://www.wikipedia.org)_
  - _[Youtube](https://www.youtube.com/)_ (requires free API access)
- Scrape (without requiring special access, often just a user account):
  - _[Instagram](https://www.instagram.com/)_
  - _[Reddit](https://www.reddit.com/)_
  - _[Telegram](https://telegram.org/)_
  - _[TikTok](https://www.tiktok.com)_
  - _[Twitter](https://twitter.com)_
  - _[Google Drive](https://drive.google.com)_ (spreadsheets etc.)
- Grab & dump cookies from your browser
- Dump _[Hyphe](https://hyphe.medialab.sciences-po.fr/)_ data
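
To give an intuition of what "joining two CSV files by matching the columns containing URLs" involves, here is a deliberately simplified sketch using only the standard library. It is not minet's algorithm: the `normalize_url` and `join_by_url` helpers, the column names and the normalization rules (drop scheme, `www.`, fragment and trailing slash, then compare for equality) are assumptions chosen to keep the example short.

```python
import csv
import io
from urllib.parse import urlsplit


def normalize_url(url):
    """Rough normalization: drop scheme, 'www.', fragment and trailing slash."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def join_by_url(left_rows, right_rows, left_col, right_col):
    """Yield merged rows whose normalized urls are equal."""
    index = {}
    for row in right_rows:
        index.setdefault(normalize_url(row[right_col]), []).append(row)
    for row in left_rows:
        for match in index.get(normalize_url(row[left_col]), []):
            yield {**match, **row}


# Hypothetical tables whose urls differ only superficially.
left = csv.DictReader(io.StringIO("url,label\nhttps://www.example.com/page/,seed\n"))
right = csv.DictReader(
    io.StringIO(
        "link,title\n"
        "http://example.com/page#top,Example page\n"
        "http://other.example/,Other\n"
    )
)

joined = list(join_by_url(left, right, "url", "link"))
```

Real-world URL matching needs many more heuristics than this, which is precisely why a dedicated command is useful.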
## Documented use cases
- [Fetching a large amount of urls](./docs/cookbook/fetch.md)
- [Joining 2 CSV files by urls](./docs/cookbook/url_join.md)
- [Using minet from a Jupyter notebook](./docs/cookbook/notebooks/Minet%20in%20a%20Jupyter%20notebook.ipynb) (_very useful to experiment with the tool or teach students_)
- [Downloading images associated with a given hashtag on Twitter](./docs/cookbook/twitter_images.md)
- [Scraping DSL Tutorial](./docs/cookbook/scraping_dsl.md)
## Features (from a technical standpoint)
- Multithreaded, memory-efficient fetching from the web.
- Multithreaded, scalable crawling.
- Multiprocessed raw text content extraction from HTML pages.
- Multiprocessed scraping from HTML pages.
- URL-related heuristics utilities such as extraction, normalization and matching.
- Data collection from various APIs such as [YouTube](https://www.youtube.com/).
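
As an illustration of the multithreaded fetching pattern listed above, the following sketch uses `concurrent.futures` from the standard library. It is only an approximation of the idea, not how minet is implemented: `fetch_all` and `fake_fetch` are hypothetical names, and the fake fetcher replaces real HTTP calls so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, threads=4):
    """Yield (url, body, error) tuples as each download finishes."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                yield url, future.result(), None
            except Exception as error:  # keep going when one url fails
                yield url, None, error


def fake_fetch(url):
    """Stand-in for an HTTP call, so that no network access is required."""
    if "broken" in url:
        raise ConnectionError("simulated network failure")
    return "<html>" + url + "</html>"


results = {
    url: (body, error)
    for url, body, error in fetch_all(
        ["https://a.example", "https://broken.example"], fake_fetch
    )
}
```

Note that this sketch submits every url up front, which would not be memory-efficient for very large lists; a production tool has to bound the number of pending tasks.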
## Installation
**minet** can be installed as a standalone CLI tool (currently only on macOS >= 10.14, Ubuntu & similar) by running the following command in your terminal:
```shell
curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/install.sh | bash
```
Don't trust us enough to pipe the result of an HTTP request into `bash`? We wouldn't either, so feel free to read the installation script [here](./scripts/install.sh) and run it on your end if you prefer.

On Ubuntu & similar you might need to install `curl` and `unzip` before running the installation script if you don't already have them:
```shell
sudo apt-get install curl unzip
```
Otherwise, **minet** can be installed directly as a python CLI tool and library using pip:
```shell
pip install minet
```
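
If you want to confirm from python that the pip installation succeeded, a small check like the one below can help. It relies only on `importlib.metadata` from the standard library (python >= 3.8); the `installed_version` helper is a hypothetical name introduced for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


minet_version = installed_version("minet")

if minet_version is None:
    print("minet is not installed in this environment")
else:
    print("minet", minet_version)
```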
Finally, if you want to install the standalone binaries by yourself (even for Windows), you can find them in each release [here](https://github.com/medialab/minet/releases).
## Upgrading
To upgrade the standalone version, simply run the install script once again:
```shell
curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/install.sh | bash
```
To upgrade the python version you can use pip thusly:
```shell
pip install -U minet
```
## Uninstallation
To uninstall the standalone version:
```shell
curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/uninstall.sh | bash
```
To uninstall the python version:
```shell
pip uninstall minet
```
## Documentation
- [minet as a command line tool](./docs/cli.md)
- [minet as a python library](./docs/lib.md)
## Contributing
To contribute to **minet**, see the [contribution guide](./CONTRIBUTING.md).
## Raw data
{
"_id": null,
"home_page": "http://github.com/medialab/minet",
"name": "minet",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "webmining",
"author": "Guillaume Plique, Pauline Breteau, Jules Farjas, H\u00e9lo\u00efse Th\u00e9ro, Jean Descamps, Am\u00e9lie Pell\u00e9, Laura Miguel, C\u00e9sar Pichon, Kelly Christensen",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/91/28/fd20eaf1fdd9b4c8cf675f0b57dc2adaf18ba364e4492ea2ed3bc5abd8b3/minet-4.0.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "GPL-3.0",
"summary": "A webmining CLI tool & library for python.",
"version": "4.0.0",
"project_urls": {
"Homepage": "http://github.com/medialab/minet"
},
"split_keywords": [
"webmining"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "ae7888a76d5b1da5806f2205412ab2ad57b40a221db95bd3c8f237f36d840dab",
"md5": "49cf3806ffbefce57f8dde4bc40c11b2",
"sha256": "762b561e5889de8d452a5c15b5a71d61f6e4c73cf46737d5053fc16288c21277"
},
"downloads": -1,
"filename": "minet-4.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "49cf3806ffbefce57f8dde4bc40c11b2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 300606,
"upload_time": "2025-02-19T15:23:59",
"upload_time_iso_8601": "2025-02-19T15:23:59.704276Z",
"url": "https://files.pythonhosted.org/packages/ae/78/88a76d5b1da5806f2205412ab2ad57b40a221db95bd3c8f237f36d840dab/minet-4.0.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "9128fd20eaf1fdd9b4c8cf675f0b57dc2adaf18ba364e4492ea2ed3bc5abd8b3",
"md5": "7f862d12b9031de040b120e82ed8533c",
"sha256": "dfb29d4383eaeb5590ea74ee161ac61e4eacb3a45b6787096bd7a21992abf2c1"
},
"downloads": -1,
"filename": "minet-4.0.0.tar.gz",
"has_sig": false,
"md5_digest": "7f862d12b9031de040b120e82ed8533c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 226239,
"upload_time": "2025-02-19T15:24:01",
"upload_time_iso_8601": "2025-02-19T15:24:01.493605Z",
"url": "https://files.pythonhosted.org/packages/91/28/fd20eaf1fdd9b4c8cf675f0b57dc2adaf18ba364e4492ea2ed3bc5abd8b3/minet-4.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-19 15:24:01",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "medialab",
"github_project": "minet",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "ipywidgets",
"specs": []
},
{
"name": "jupyterlab",
"specs": []
},
{
"name": "PyInstaller",
"specs": [
[
"==",
"6.12.0"
]
]
},
{
"name": "pytest",
"specs": [
[
"==",
"7.2.1"
]
]
},
{
"name": "ruff",
"specs": []
},
{
"name": "twine",
"specs": []
},
{
"name": "wheel",
"specs": []
},
{
"name": "about-time",
"specs": [
[
"==",
"4.2.1"
]
]
},
{
"name": "beautifulsoup4",
"specs": [
[
"==",
"4.12.3"
]
]
},
{
"name": "browser-cookie3",
"specs": [
[
"==",
"0.19.1"
]
]
},
{
"name": "casanova",
"specs": [
[
"==",
"2.0.1"
]
]
},
{
"name": "charset-normalizer",
"specs": [
[
"==",
"3.4.1"
]
]
},
{
"name": "dateparser",
"specs": [
[
"==",
"1.1.6"
]
]
},
{
"name": "ebbe",
"specs": [
[
"==",
"1.13.2"
]
]
},
{
"name": "json5",
"specs": [
[
"==",
"0.9.11"
]
]
},
{
"name": "libipld",
"specs": [
[
"==",
"3.0.1"
]
]
},
{
"name": "lxml",
"specs": [
[
"==",
"4.9.2"
]
]
},
{
"name": "lxml",
"specs": [
[
">=",
"5.3.0"
]
]
},
{
"name": "nanoid",
"specs": [
[
"==",
"2.0.0"
]
]
},
{
"name": "playwright",
"specs": [
[
"==",
"1.46.0"
]
]
},
{
"name": "playwright-stealth",
"specs": [
[
"==",
"1.0.6"
]
]
},
{
"name": "pyyaml",
"specs": [
[
"==",
"6.0.1"
]
]
},
{
"name": "quenouille",
"specs": [
[
"==",
"1.9.1"
]
]
},
{
"name": "rich",
"specs": [
[
"==",
"13.8.0"
]
]
},
{
"name": "rich-argparse",
"specs": [
[
"==",
"1.5.2"
]
]
},
{
"name": "soupsieve",
"specs": [
[
"<",
"3"
],
[
">=",
"2.1"
]
]
},
{
"name": "tenacity",
"specs": [
[
"==",
"8.2.1"
]
]
},
{
"name": "trafilatura",
"specs": [
[
"==",
"2.0.0"
]
]
},
{
"name": "typing_extensions",
"specs": [
[
">=",
"4.3"
]
]
},
{
"name": "twitwi",
"specs": [
[
"==",
"0.19.2"
]
]
},
{
"name": "ural",
"specs": [
[
"==",
"1.4.0"
]
]
},
{
"name": "urllib3",
"specs": [
[
"==",
"1.26.16"
]
]
},
{
"name": "websockets",
"specs": [
[
"==",
"13.1"
]
]
}
],
"lcname": "minet"
}