# PIMMI: Python IMage MIning
PIMMI is a software tool that performs visual mining in a corpus of images. Its main objective is to find all copies,
total or partial, in large volumes of images and to group them together. Our initial goal is to study the reuse
of images on social networks (typically, our first use case is the propagation of memes on Twitter). However, we believe
that its use can be much wider and that it can easily be adapted to other studies. The main features of PIMMI
are therefore:
- ability to process large image corpora, up to several million files
- robustness to some modifications of the images that are typical of their reuse on social networks (cropping, zooming,
  composition, addition of text, ...)
- flexibility to adapt to different use cases (mainly the nature and volume of the image corpora)
PIMMI currently focuses only on visual mining and therefore does not manage the metadata associated with images:
these are specific to each study and are therefore outside its scope. Thus, a study using PIMMI
will generally be broken down into several steps:
1. constitution of a corpus of images (jpg and/or png files) and their metadata
2. choice of the PIMMI parameters according to the characteristics of the corpus
3. indexing the images with PIMMI and obtaining clusters of reused images
4. exploitation of the clusters by combining them with the descriptive metadata of the images
PIMMI relies on existing technologies and integrates them into a simple data pipeline:
1. It uses well-established local image descriptors (Scale-Invariant Feature Transform: SIFT) to represent images
   as sets of keypoints, combined with geometric consistency verification (the [OpenCV](https://opencv.org/)
   implementation is used for both).
2. To scale to large volumes of images, it relies on a well-known vector indexing library,
   [FAISS](https://github.com/facebookresearch/faiss), which provides some of the most efficient
   similarity-search implementations, to query the database of keypoints.
3. Similar images are grouped together using standard community detection algorithms on the similarity graph.
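The sketch below illustrates this general approach (SIFT keypoints, a vector index over the descriptors, then grouping matched images on a similarity graph) with off-the-shelf OpenCV, FAISS and networkx calls. It is only a simplified illustration of the idea, not PIMMI's actual implementation: the file names are hypothetical, an exact `IndexFlatL2` stands in for PIMMI's configurable FAISS indexes, and geometric consistency verification is omitted.

```python
# Rough sketch of the SIFT + FAISS + graph-clustering idea (not PIMMI's code).
import cv2
import faiss
import networkx as nx
import numpy as np

paths = ["img_a.jpg", "img_b.jpg", "img_c.jpg"]   # hypothetical example files
sift = cv2.SIFT_create(nfeatures=1000)

descriptors, owner = [], []
for i, path in enumerate(paths):
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(image, None)  # one 128-d descriptor per keypoint
    if desc is not None:
        descriptors.append(desc.astype(np.float32))
        owner.extend([i] * len(desc))

keypoints = np.vstack(descriptors)
owner = np.array(owner)

index = faiss.IndexFlatL2(128)   # exact search; PIMMI lets you choose other FAISS index types
index.add(keypoints)

# Link two images whenever a keypoint of one is a near neighbour of a keypoint of the other
# (geometric consistency verification is omitted here for brevity).
_, neighbours = index.search(keypoints, 5)
graph = nx.Graph()
graph.add_nodes_from(range(len(paths)))
for query_id, row in enumerate(neighbours):
    for neighbour_id in row:
        if owner[query_id] != owner[neighbour_id]:
            graph.add_edge(owner[query_id], owner[neighbour_id])

# Groups of (partially) copied images: here simply the connected components.
clusters = [sorted(component) for component in nx.connected_components(graph)]
print(clusters)
```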
PIMMI is a multithreaded Python library that can be used through a command line interface.
A rudimentary web interface to visualize the results is also provided ([Pimmi-ui](https://github.com/nrv/pimmi-ui)),
but more as an example than for intensive use.
The development of this software is still in progress: we warmly welcome beta-testers, feedback,
proposals for new features and even pull requests!
## Authors
- [Béatrice Mazoyer](https://bmaz.github.io/)
- [Nicolas Hervé](http://herve.name)
## Installation
Pimmi requires Python 3.8 or later. If Python 3 is not installed on your computer, we recommend installing the distribution provided by Miniconda: https://docs.conda.io/projects/miniconda/en/latest/#quick-command-line-install
We recommend installing Pimmi in a virtual environment. The installation scenarios below provide instructions for installing Pimmi with [conda](#install-with-conda) (if you have Miniconda or Anaconda installed), with [venv](#install-with-venv) or with [pyenv-virtualenv](#install-with-pyenv-virtualenv). If you are using another virtual environment management system, simply create a new environment, activate it and run:
```bash
pip install pimmi
```
### Install with conda
```bash
conda create --name pimmi-env python=3
conda activate pimmi-env
pip install -U pip
pip install pimmi
```
### Install with venv
```bash
python3 -m venv /tmp/pimmi-env
source /tmp/pimmi-env/bin/activate
pip install -U pip
pip install pimmi
```
### Install with pyenv-virtualenv
```bash
pyenv virtualenv 3.8.0 pimmi-env
pyenv activate pimmi-env
pip install -U pip
pip install pimmi
```
## Demo
```bash
# --- Play with the demo dataset 1
# Download the demo dataset; it will be stored in the demo_dataset folder.
# You can choose between small_dataset and dataset1:
# small_dataset contains 10 images, dataset1 contains 1000 images and takes about 2 minutes to download.
pimmi download_demo dataset1
# Create a default index structure and fill it with the demo dataset. A directory named my_index will be created;
# it will contain the 2 files of the pimmi index: index.faiss and index.meta
pimmi fill demo_dataset/dataset1 my_index
# Query the same dataset against this index; the results will be stored in
# result_query.csv
pimmi query demo_dataset/dataset1 my_index -o result_query.csv
# Post-process the mining results in order to visualize them
pimmi clusters my_index result_query.csv clusters.json
# You can also play with the configuration parameters. First, generate a default configuration file
pimmi create-config my_pimmi_conf.yml
# Then simply use this configuration file to relaunch the mining steps (erasing the previous
# data without prompting)
pimmi fill --erase --force --config-path my_pimmi_conf.yml demo_dataset/dataset1 dataset1
pimmi query --config-path my_pimmi_conf.yml demo_dataset/dataset1 dataset1
pimmi clusters --config-path my_pimmi_conf.yml dataset1
```
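Once the demo has run, you can take a quick, schema-agnostic look at the clustering output. The sketch below only loads `clusters.json` and prints a small sample of it; it makes no assumption about the exact structure of the file, and falls back to reading it line by line in case it is newline-delimited JSON.

```python
# Peek at the output of `pimmi clusters` without assuming a particular schema.
import json

def load_json_or_jsonl(path):
    """Load a file that is either a single JSON document or newline-delimited JSON."""
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return [json.loads(line) for line in text.splitlines() if line.strip()]

clusters = load_json_or_jsonl("clusters.json")
print(type(clusters).__name__)

sample = list(clusters.items())[:3] if isinstance(clusters, dict) else clusters[:3]
for item in sample:
    print(str(item)[:120])
```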
## Test on the Copydays dataset
Unfortunately, the data files and the dataset description are no longer available online. You can retrieve them from the Web Archive: the [data files](http://web.archive.org/web/20181015092553if_/http://pascal.inrialpes.fr/data/holidays/) and the [dataset description](https://web.archive.org/web/20170325224315/https://lear.inrialpes.fr/people/jegou/data.php).
### Download the dataset
Download the following four gzipped archives: copydays_crop.tar.gz, copydays_jpeg.tar.gz, copydays_original.tar.gz and copydays_strong.tar.gz.
Create a project structure and extract all the downloaded archives into the same images directory, as shown below.
```
images
└───copydays_crop
└───original
└───jpegqual
└───copydays_strong
```
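As a convenience, the sketch below extracts the four downloaded archives into `images/`. It assumes the `.tar.gz` files sit in the current directory; the sub-directory names you end up with depend on how each archive is packaged, so check them against the layout above.

```python
# Extract the downloaded Copydays archives into the images/ directory.
import tarfile
from pathlib import Path

archives = [
    "copydays_crop.tar.gz",
    "copydays_jpeg.tar.gz",
    "copydays_original.tar.gz",
    "copydays_strong.tar.gz",
]

target = Path("images")
target.mkdir(exist_ok=True)

for name in archives:
    with tarfile.open(name, "r:gz") as tar:
        tar.extractall(target)  # directory names inside depend on the archive contents
    print(f"extracted {name} into {target}/")
```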
### Clone the repository
The script to evaluate your results is not included in the command line interface, so you should clone this repository to access it. It is located at `scripts/copydays_groundtruth.py`.
```bash
git clone https://github.com/nrv/pimmi.git
```
### Commands to reproduce the results
```bash
pimmi --sift-nfeatures 1000 --index-type IVF1024,Flat fill images/ my_index_folder
pimmi --query-sift-knn 1000 --query-dist-ratio-threshold 0.8 --index-type IVF1024,Flat query images my_index_folder -o result_query.csv
pimmi --index-type IVF1024,Flat --algo components clusters my_index_folder result_query.csv -o clusters.csv
# Run the script to create the groundtruth file
python scripts/copydays_groundtruth.py images/ clusters.csv
# Compare the results to the groundtruth
pimmi eval groundtruth.csv --query-column image_status
```
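The `--index-type` value is a FAISS index factory string: `IVF1024,Flat` means an inverted-file index with 1024 coarse clusters over uncompressed (flat) vectors. As a rough illustration of what such a string builds in raw FAISS (random vectors stand in for real SIFT descriptors; this is not PIMMI's internal code):

```python
# What the "IVF1024,Flat" factory string builds in FAISS (illustration only).
import faiss
import numpy as np

d = 128                                                   # SIFT descriptor dimensionality
vectors = np.random.random((50000, d)).astype("float32")  # stand-in for real descriptors

index = faiss.index_factory(d, "IVF1024,Flat")
index.train(vectors)        # IVF indexes must be trained before vectors are added
index.add(vectors)

index.nprobe = 8            # number of inverted lists visited at query time
distances, neighbours = index.search(vectors[:5], 10)
print(neighbours.shape)     # (5, 10)
```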
### Results
```
cluster precision: 0.98650288140734
cluster recall: 0.7441110375823754
cluster f1: 0.7838840961245362
query average precision: 0.839968152866242
```
### Play with the parameters
You can then play with the different parameters and re-evaluate the results. If you want to loop over several parameters to optimize your settings, you may have a look at `scripts/eval_copydays.sh`.
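If you prefer a quick ad-hoc sweep instead, here is a sketch that loops over a few `--sift-nfeatures` values and replays the commands above. The parameter values, index folder names and output file names are arbitrary choices for illustration; the evaluation output is simply printed for each run.

```python
# Hypothetical sweep over --sift-nfeatures, replaying the Copydays commands above.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

for nfeatures in ("500", "1000", "2000"):          # arbitrary values to compare
    index_dir = f"index_nf{nfeatures}"             # one index folder per setting
    query_csv = f"result_query_nf{nfeatures}.csv"
    clusters_csv = f"clusters_nf{nfeatures}.csv"

    run(["pimmi", "--sift-nfeatures", nfeatures, "--index-type", "IVF1024,Flat",
         "fill", "images/", index_dir])
    run(["pimmi", "--query-sift-knn", "1000", "--query-dist-ratio-threshold", "0.8",
         "--index-type", "IVF1024,Flat", "query", "images", index_dir, "-o", query_csv])
    run(["pimmi", "--index-type", "IVF1024,Flat", "--algo", "components",
         "clusters", index_dir, query_csv, "-o", clusters_csv])
    run(["python", "scripts/copydays_groundtruth.py", "images/", clusters_csv])
    run(["pimmi", "eval", "groundtruth.csv", "--query-column", "image_status"])
```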
## Troubleshooting
### Error while installing faiss-cpu on macOS > 12
```
error: command '/usr/local/bin/swig' failed with exit code 1
```
Installing pimmi requires the faiss-cpu package. However, on macOS > 12 this package cannot be installed with pip (see https://github.com/facebookresearch/faiss/issues/2868).
To fix this issue, please follow these steps:
Install Miniconda:
https://docs.conda.io/projects/miniconda/en/latest/#quick-command-line-install
Create and activate a virtual environment:
```
conda create --name testenv1
conda activate testenv1
```
In this virtual environment, install faiss-cpu:
```
conda install -c pytorch faiss-cpu
```
And then you should be able to install pimmi:
```
pip install pimmi
```
### I have another error
Please submit an issue [here](https://github.com/nrv/pimmi/issues).
## Contribute
Pull requests are welcome! Please find below the instructions to install a development version.
### Install from source
```bash
python3 -m venv /tmp/pimmi-env
source /tmp/pimmi-env/bin/activate
pip install -U pip
git clone git@github.com:nrv/pimmi.git
cd pimmi
pip install -r requirements.txt
pip install -e .
```
### Linting and tests
To lint the code and run the unit tests, you can use the following commands:
```bash
# Only linter
make lint
# Only unit tests
make test
# Both
make
```