# PIMMI: Python IMage MIning
PIMMI is a software package that performs visual mining in a corpus of images. Its main objective is to find all copies,
total or partial, in large volumes of images and to group them together. Our initial goal is to study the reuse
of images on social networks (typically, our first use case is the propagation of memes on Twitter). However, we believe
its use can be much wider and that it can easily be adapted to other studies. The main features of PIMMI
are therefore:
- the ability to process large image corpora, up to several million files
- robustness to the modifications images typically undergo when reused on social networks (cropping, zooming,
composition, text overlay, ...)
- enough flexibility to adapt to different use cases (mainly the nature and volume of the image corpora)
PIMMI currently focuses solely on visual mining and therefore does not manage metadata related to the images.
Such metadata is specific to each study and thus outside our scope. A study using PIMMI
will therefore generally be broken down into several steps:
1. building a corpus of images (jpg and/or png files) and their metadata
2. choosing PIMMI parameters according to the characteristics of the corpus
3. indexing the images with PIMMI and obtaining clusters of reused images
4. exploiting the clusters by combining them with the descriptive metadata of the images
PIMMI relies on existing technologies and integrates them into a simple data pipeline:
1. It uses well-established local image descriptors (Scale-Invariant Feature Transform: SIFT) to represent images
as sets of keypoints, together with geometric consistency verification (the [OpenCV](https://opencv.org/) implementation
is used for both).
2. To scale to large volumes of images, it relies on a well-known vector indexing library,
[FAISS](https://github.com/facebookresearch/faiss), which provides some of the most efficient algorithm
implementations for querying the database of keypoints.
3. Similar images are grouped together using standard community detection algorithms on the graph of similarities.
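To make the last step concrete, here is a minimal, self-contained sketch of grouping images by connected components on a similarity graph, using a small union-find structure. The file names and edges are invented for illustration; PIMMI's actual implementation and data structures differ.

```python
# Sketch of step 3: cluster images by connected components on a
# similarity graph. Edges would normally come from matched SIFT
# keypoints that passed geometric verification; here they are
# hypothetical, hard-coded examples.

def find(parent, x):
    # Path-compressing find for the union-find structure.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster(edges):
    """Group nodes connected by similarity edges into clusters."""
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb
    groups = {}
    for node in parent:
        groups.setdefault(find(parent, node), set()).add(node)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical matches: meme.jpg was cropped and then re-posted with text.
edges = [("meme.jpg", "meme_crop.jpg"),
         ("meme_crop.jpg", "meme_text.jpg"),
         ("photo_a.jpg", "photo_b.jpg")]
print(cluster(edges))
# → [['meme.jpg', 'meme_crop.jpg', 'meme_text.jpg'],
#    ['photo_a.jpg', 'photo_b.jpg']]
```

Transitively connected images end up in the same cluster even when two of them share no direct match, which is exactly what makes partial copies traceable through chains of reuse.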
PIMMI is a Python library that can be used through a command-line interface, and it is multithreaded.
A rudimentary web interface to visualize the results is also provided ([Pimmi-ui](https://github.com/nrv/pimmi-ui)),
but more as an example than for intensive use.
The development of this software is still in progress: we warmly welcome beta-testers, feedback,
proposals for new features and even pull requests!
### Initial authors
- [Béatrice Mazoyer](https://bmaz.github.io/)
- [Nicolas Hervé](http://herve.name)
## Install with pyenv and pip
```bash
pyenv virtualenv 3.8.0 pimmi-env
pyenv activate pimmi-env
pip install -U pip
pip install pimmi
```
## Install from source
```bash
python3 -m venv /tmp/pimmi-env
source /tmp/pimmi-env/bin/activate
pip install -U pip
git clone git@github.com:nrv/pimmi.git
cd pimmi
pip install -r requirements.txt
pip install -e .
```
## Demo
```bash
# --- Play with the demo dataset 1
# Create a default index structure and fill it with the demo dataset. An 'index' directory will be created;
# it will contain the 2 files of the pimmi index: dataset1.IVF1024,Flat.faiss and
# dataset1.IVF1024,Flat.meta
pimmi fill demo_dataset/dataset1 dataset1
# Query the same dataset against this index; the results will be stored in
# index/dataset1.IVF1024,Flat.mining_000000.csv
pimmi query demo_dataset/dataset1 dataset1
# Post-process the mining results in order to visualize them
pimmi clusters dataset1
# You can also play with the configuration parameters. First, generate a default configuration file
pimmi create-config my_pimmi_conf.yml
# Then simply use this configuration file to relaunch the mining steps (erasing the
# previous data without prompting)
pimmi fill --erase --force --config-path my_pimmi_conf.yml demo_dataset/dataset1 dataset1
pimmi query --config-path my_pimmi_conf.yml demo_dataset/dataset1 dataset1
pimmi clusters --config-path my_pimmi_conf.yml dataset1
```
## Test on the Copydays dataset
You can find the dataset explanations [here](https://lear.inrialpes.fr/~jegou/data.php#copydays). Unfortunately, the data files are no longer available from that page, but you can get them from the [web archive](http://web.archive.org/web/20181015092553if_/http://pascal.inrialpes.fr/data/holidays/).
Create a project structure and uncompress all the files in the same images directory.
```
copydays
└───index
└───images
└───crop
│ └───crops
│ └───10
│ └───15
│ └───20
│ └───30
│ └───40
│ └───50
│ └───60
│ └───70
│ └───80
└───original
└───jpeg
│ └───jpegqual
│ └───3
│ └───5
│ └───8
│ └───10
│ └───15
│ └───20
│ └───30
│ └───50
│ └───75
└───strong
```
You can then play with the different parameters and evaluate the results. If you want to loop over several parameters to optimize your settings, have a look at `eval_copydays.sh`.
```bash
cd scripts
mkdir index
pimmi --sift-nfeatures 1000 --index-type IVF1024,Flat fill /path/to/copydays/images/ copydays
pimmi --query-sift-knn 1000 --query-dist-ratio-threshold 0.8 --index-type IVF1024,Flat query /path/to/copydays/images/ copydays
pimmi --index-type IVF1024,Flat --algo components clusters copydays
python copydays_groundtruth.py /path/to/copydays/images/ index/copydays.IVF1024,Flat.mining.clusters.csv
pimmi eval index/copydays.IVF1024,Flat.mining.groundtruth.csv --query-column image_status
```
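The `--query-dist-ratio-threshold 0.8` option above corresponds to the classic ratio test on descriptor distances: a keypoint match is kept only if its nearest neighbour is clearly closer than the second nearest. A minimal stdlib sketch of the idea, using invented 2-D vectors (real SIFT descriptors are 128-D, and PIMMI's exact matching logic may differ):

```python
import math

def ratio_test(query, candidates, threshold=0.8):
    """Keep a keypoint match only if its best candidate is clearly
    closer than the second best (the classic distance-ratio test)."""
    dists = sorted(math.dist(query, c) for c in candidates)
    if len(dists) < 2:
        return True  # no second neighbour to compare against
    return dists[0] < threshold * dists[1]

# Invented 2-D "descriptors" for illustration only.
query = (0.0, 0.0)
ambiguous = [(1.0, 0.0), (1.1, 0.0)]   # two near-identical neighbours
distinct  = [(1.0, 0.0), (5.0, 0.0)]   # one clear best match

print(ratio_test(query, ambiguous))  # → False (match rejected)
print(ratio_test(query, distinct))   # → True  (match kept)
```

Lowering the threshold keeps only the most unambiguous matches (higher precision, fewer matches); raising it keeps more matches at the cost of noise, which is why it is a natural parameter to sweep when tuning on Copydays.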
```
cluster precision: 0.9924645454677958
cluster recall: 0.7406974660159374
cluster f1: 0.7856752626502786
query average precision: 0.8459113427266295
```
Happy hacking!