Vizuka


* Name: Vizuka
* Version: 0.30.1
* Summary: Represent your high-dimensional data in a 2D space and play with it
* Upload time: 2017-10-20 16:40:05
* Author: Sofian Medbouhi
* License: GPL V3

Data visualization
==================

This is a collection of tools to represent and navigate high-dimensional data.
 * t-SNE is the default algorithm used to construct the 2D space.
 * The module should be agnostic of the data provided.
 * It ships with MNIST for quick testing.

Usage
-----
### How to install?
```sh
$ pip install vizuka
```
or clone the repo :)

### How to run?

```sh
$ vizuka
# For a quick working example, run:
$ vizuka --mnist
# Equivalent to copying your data and running "vizuka --image:images --version _MNIST_example"
$ vizuka --show-required-files
# Shows the format of the files you need to launch a data viz
```
You can visualize the human-readable data stored in data/set/raw\_data\_VERSION.npz:

```sh
$ vizuka -s price:logdensity -s name:wordcloud
# vizuka --feature-to-show raw_variable_name:{wordcloud|counter|density|logdensity|images}
```

Vizuka uses your existing 2D projection if you already have one. Otherwise the projection is computed the first time you launch it (this does not apply to the MNIST toy example).
You can force a PCA reduction prior to t-SNE:
```sh
$ vizuka --reduce --use_pca 0.99
# Use PCA to reduce the dimension while keeping 99% of the explained variance, then run t-SNE
```
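
To make the 0.99 threshold concrete, here is a minimal NumPy sketch of what "keep 99% of explained variance" means. It is not Vizuka's own code: the function name `pca_keep_variance` and the toy data are hypothetical, and the t-SNE step that would follow is left out.

```python
import numpy as np


def pca_keep_variance(x, ratio=0.99):
    """Project x onto the fewest principal components explaining `ratio` of the variance."""
    centered = x - x.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()
    # Smallest number of components whose cumulative share reaches `ratio`.
    n_components = int(np.searchsorted(cumulative, ratio) + 1)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
# Toy data: 200 samples in 50 dimensions that mostly live in 3 dimensions, plus small noise.
latent = rng.normal(size=(200, 3))
x = latent @ rng.normal(size=(3, 50)) + 1e-3 * rng.normal(size=(200, 50))

reduced = pca_keep_variance(x, ratio=0.99)
print(x.shape, "->", reduced.shape)
```

The reduced array is what a t-SNE implementation would then receive, which is usually much cheaper than running t-SNE on the full-dimensional input.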

By default it looks for the data in \_\_package\_\_/data/, but you can point it to your own directory with the __--path__ argument.

* Note that if you are actually working with big data you should **uncomment MulticoreTSNE** in vizuka/dimension\_reduction/tSNE.py, otherwise t-SNE may crash with a segfault. Installation instructions can be found in requirements/requirements.apt

How to add a specific tool to this visualization / how to contribute
----------------
Add your plugins in vizuka/plugins/.
You can define your own heatmaps, clustering engines and cluster viewers (cf. 2nd image). Read plugins/heatmap/How\_to\_add\_heatmap.py for documentation.

Because your code lives in the plugins/ directory, it will not interfere with the original code! If it works, do not hesitate to open a PR.

What will I get?
-----------------

A tool to draw clusters, inspect the distribution inside them, and zoom in.
Example with the MNIST toy dataset (vizuka --mnist): (**for a real-life example please scroll down**)

![alt zoomview](docs/main_view.png)

![alt clusterview](docs/cluster_view.png)


### How to use?
Navigate inside the 2D space and look at the data by selecting it in the main window (the big one). Data is grouped by cluster, and you can select each cluster individually (left click).

The main window represents all the data in 2D space. Blue points are correctly predicted, red points are incorrectly predicted, and green points belong to the special class (by default the label 0).

Below are three subplots:
* a summary of the data inside the selected buckets (see navigation)
* a heatmap of the red/blue/green representation
* a heatmap of the cross-entropy between each bucket's empirical distribution and the global empirical distribution.
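
As an illustration of the quantity behind the last heatmap, here is a minimal sketch of a cross-entropy between a bucket's class distribution and the global one. The helper names and the toy labels are hypothetical, and the direction H(bucket, global) is an assumption: the text above does not say which distribution plays which role in Vizuka.

```python
import numpy as np


def class_distribution(labels, n_classes):
    """Empirical distribution of integer class labels."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()


def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i); eps avoids log(0)."""
    return float(-np.sum(p * np.log(q + eps)))


# Toy labels: class 0 is common globally, class 2 is rare.
global_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])
typical_bucket = global_labels.copy()   # same mix as the whole dataset
rare_bucket = np.array([2, 2, 2])       # only the rare class

q = class_distribution(global_labels, 3)
h_typical = cross_entropy(class_distribution(typical_bucket, 3), q)
h_rare = cross_entropy(class_distribution(rare_bucket, 3), q)
print(h_typical, h_rare)
```

Under this assumption, a bucket dominated by a globally rare class gets a higher value than a bucket that mirrors the global mix, so it stands out on the heatmap.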

Data viz navigation:
* left click selects a bucket of data
* right click resets all in-memory buckets

Other options:
* filter by predictions or by real class.
* detect mouse event: if unchecked, clusters will not be selected on click (useful for zooming)
* clusterize with an algorithm: Dummy is a simple grid, KMeans is the recommended choice, DBSCAN is experimental.
* export x: export the raw inputs you selected to an output.csv
* cluster borders: draw borders between clusters based on the Bhattacharyya similarity measure, or just draw them all
* force the number of clusters (mostly useful for KMeans)
* choose a different set of predictions to display
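
For readers unfamiliar with the Bhattacharyya measure mentioned in the cluster borders option, here is a minimal sketch of the Bhattacharyya coefficient between two class distributions. The cluster distributions, the `threshold` value and the rule "draw a border when similarity is below the threshold" are all hypothetical; the text above does not specify how Vizuka turns the similarity into borders.

```python
import numpy as np


def bhattacharyya_coefficient(p, q):
    """BC(p, q) = sum_i sqrt(p_i * q_i): 1 for identical distributions, 0 for disjoint ones."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(np.sqrt(p * q)))


# Hypothetical class distributions of three neighbouring clusters.
cluster_a = [0.7, 0.2, 0.1]
cluster_b = [0.6, 0.3, 0.1]
cluster_c = [0.0, 0.1, 0.9]

similar = bhattacharyya_coefficient(cluster_a, cluster_b)
different = bhattacharyya_coefficient(cluster_a, cluster_c)

threshold = 0.9  # hypothetical cut-off, not a Vizuka default
draw_border_ab = similar < threshold      # clusters look alike: no border
draw_border_ac = different < threshold    # clusters differ: border
print(similar, different, draw_border_ab, draw_border_ac)
```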

What does it need to run?
-----------------------------------

```sh
$ vizuka --show-required-files

VERSION: string that identifies your dataset (default is vizuka --version MNIST_example)
Vizuka needs the following files :

	 + data/set/preprocessed_inputs_VERSION.npz
	 ------------------------------------------
		 x:	 preprocessed inputs
		 y:	 outputs to be predicted
		 NB:	 this is the only mandatory file, the following is highly recommended:


	 + data/models/predict_VERSION.npz -> optional but recommended
	 -------------------------------------------------------------
		 pred:	 predictions returned by your algorithm
		 NB:	 should be same formatting as in preprocessed_inputs_VERSION["y"])


	 + raw_data.npz -> optional
	 --------------------------
		 x:		 array of inputs BEFORE preprocessing
					 probably human-readbable, thus useful for vizualization
		 columns:	 the name of the columns variable in x
		 NB:	 this file is used if you run vizuka with
			    --feature-name-to-display COLUMN_NAME:PLOTTER COLUMN_NAME2:PLOTTER2 or
			    --feature-name-to-filter COLUMN_NAME1 COLUMN_NAME2 (see help for details)


	 + reduced/2Dembedding_PARAMS_VERSION.npz -> reaaaally optional
	 --------------------------------------------------------------
		 x2D:	 projections of the preprocessed inputs x in a 2D space
		 NB:	 this set is automatically generated with tSNE but you can specify your own

```
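
The files listed above are ordinary NumPy archives, so they can be produced with `numpy.savez`. The sketch below writes the mandatory file and the recommended predictions file using the key names from the listing (`x`, `y`, `pred`). The VERSION string, the toy arrays and the temporary root directory are hypothetical, and placing `set/` and `models/` under a `data/` folder simply mirrors the paths shown above.

```python
import os
import tempfile

import numpy as np

version = "my_dataset"  # hypothetical VERSION string
data_dir = os.path.join(tempfile.mkdtemp(), "data")  # stand-in for your data directory
os.makedirs(os.path.join(data_dir, "set"))
os.makedirs(os.path.join(data_dir, "models"))

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 20))       # preprocessed inputs
y = rng.integers(0, 3, size=100)     # outputs to be predicted
pred = y.copy()
pred[:10] = (pred[:10] + 1) % 3      # fake predictions with 10 mistakes

inputs_path = os.path.join(data_dir, "set", f"preprocessed_inputs_{version}.npz")
predict_path = os.path.join(data_dir, "models", f"predict_{version}.npz")
np.savez(inputs_path, x=x, y=y)      # mandatory file
np.savez(predict_path, pred=pred)    # optional but recommended

# Read the archive back to check the keys.
with np.load(inputs_path) as archive:
    loaded_keys = sorted(archive.files)
print(loaded_keys)
```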

Typical use case
------------------

You have your preprocessed data? Cool, this is the only mandatory file you need. Place it at *data/set/preprocessed_inputs_VERSION.npz*, VERSION being a string that identifies this dataset. It must contain at least the key 'x', holding the vectors you learn from. If you also have the correct outputs that your algorithm tries to predict, place them under the key 'y'. If you have your own predictions, store them under the key 'pred' in *data/models/ALGONAMEpredict_VERSION.npz* (*predict_VERSION.npz* is the one loaded by default). With both, the data viz will be much more useful!

Optionally you can add a *raw_data_VERSION.npz* file containing the raw, non-preprocessed data. The vectors should be stored under the key "originals" and the names of the human-readable features under the key "columns".
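
A minimal sketch of such a raw data file follows. It uses the key "originals" as described in this paragraph; note that the `--show-required-files` output lists the key `x` for the same file, so check which one your version expects. The column names reuse the `price` and `name` features from the earlier command example, and the VERSION string, the toy rows and storing every value as a string are assumptions.

```python
import os
import tempfile

import numpy as np

version = "my_dataset"  # hypothetical VERSION string

# One row per sample, one column per human-readable feature.
columns = np.array(["price", "name"])
originals = np.array([
    ["12.5", "red apple"],
    ["3.0", "green pear"],
    ["7.2", "blue plum"],
])

path = os.path.join(tempfile.mkdtemp(), f"raw_data_{version}.npz")
np.savez(path, originals=originals, columns=columns)

# Read one feature back by its column name.
with np.load(path) as raw:
    price_index = list(raw["columns"]).index("price")
    price_column = raw["originals"][:, price_index]
print(price_column)
```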

Now you may want to launch Vizuka! First, specify the parameters that fit your needs in config.py. Then have a coffee. Or two. Or three. Vizuka is busy reducing the dimension.

...

Congratulations! You can now display your 2D data and browse your embedded space. Maybe you want to look for a specific cluster. Explore the data with the graph options, zoom in and out, and use the provided filters to find an interesting area.

When you are satisfied, enable "detect mouse event" to be able to select clusters. Selecting small rectangular tiles one by one is quite inefficient, so you may want to *Clusterize* using KMeans or DBSCAN.
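
To picture the rectangular tiles of the default grid, here is a minimal sketch that assigns each 2D point to a tile of a regular grid. It only illustrates the idea of a grid clusterizer and is not Vizuka's implementation; the function name `grid_bucket_ids`, the `resolution` parameter and the toy points are hypothetical.

```python
import numpy as np


def grid_bucket_ids(x2d, resolution):
    """Assign each 2D point to a tile of a resolution x resolution grid."""
    mins = x2d.min(axis=0)
    spans = x2d.max(axis=0) - mins
    spans[spans == 0] = 1.0  # avoid division by zero on a degenerate axis
    # Scale coordinates to [0, resolution], then clip so the maximum falls in the last tile.
    cells = np.floor((x2d - mins) / spans * resolution).astype(int)
    cells = np.clip(cells, 0, resolution - 1)
    # Flatten the (column, row) pair into a single tile identifier.
    return cells[:, 0] * resolution + cells[:, 1]


x2d = np.array([[0.0, 0.0], [0.1, 0.1], [0.9, 0.9], [1.0, 1.0], [0.0, 1.0]])
ids = grid_bucket_ids(x2d, resolution=2)
print(ids)
```

Every tile is the same size regardless of where the data actually concentrates, which is why a data-driven algorithm such as KMeans tends to give more meaningful selections.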

Great, now you can select whole clusters of data at once. But what's in there? Click on the *export* button to find out in a nicely formatted csv (assuming you provided "raw" data).

Finished your session but still want to dive into the clusters later? Just select *Save clusterization* to save your session.


Default parameters
------------------

See config.py

Real life example
=================

![alt zoomview](docs/zoom_view.png)
![alt clusterview](docs/cluster_view-mana.png)
            

Raw data

            {
    "maintainer": "", 
    "docs_url": null, 
    "requires_python": "", 
    "maintainer_email": "", 
    "cheesecake_code_kwalitee_id": null, 
    "keywords": "", 
    "upload_time": "2017-10-20 16:40:05", 
    "author": "Sofian Medbouhi", 
    "home_page": "", 
    "download_url": "https://pypi.python.org/packages/62/f6/254e73e9c9d6e58b9e07c770401898ef755a8475e434b3889dd0c598f9aa/Vizuka-0.30.1.tar.gz", 
    "platform": "", 
    "version": "0.30.1", 
    "cheesecake_documentation_id": null, 
    "lcname": "vizuka", 
    "bugtrack_url": null, 
    "github": false, 
    "name": "Vizuka", 
    "license": "GPL V3", 
    "summary": "Represents your high-dimensional datas in a 2D space and play with it", 
    "split_keywords": [], 
    "author_email": "sof.m.sk@free.fr", 
    "urls": [
        {
            "has_sig": false, 
            "upload_time": "2017-10-20T16:40:05", 
            "comment_text": "", 
            "python_version": "source", 
            "url": "https://pypi.python.org/packages/62/f6/254e73e9c9d6e58b9e07c770401898ef755a8475e434b3889dd0c598f9aa/Vizuka-0.30.1.tar.gz", 
            "md5_digest": "0fdc26bc01ac9230e6c2cc96f13d58ac", 
            "downloads": 0, 
            "filename": "Vizuka-0.30.1.tar.gz", 
            "packagetype": "sdist", 
            "path": "62/f6/254e73e9c9d6e58b9e07c770401898ef755a8475e434b3889dd0c598f9aa/Vizuka-0.30.1.tar.gz", 
            "size": 1117434
        }
    ], 
    "_id": null, 
    "cheesecake_installability_id": null
}