targetviz


| Field | Value |
| --- | --- |
| Name | targetviz |
| Version | 0.0.1 |
| home_page | None |
| Summary | Package for generating automatic reports of variable analyses related to a target variable |
| upload_time | 2024-09-04 08:23:12 |
| maintainer | None |
| docs_url | None |
| author | Luis de Quadras |
| requires_python | >=3.8 |
| license | MIT License Copyright (c) 2024 dequadras Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
| keywords | targetviz, data analysis, data visualization, data science |
| VCS | https://github.com/dequadras/targetviz |
| bugtrack_url | None |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# TargetViz
This library generates HTML reports that relate a target variable to each of the other variables (the predictors) in a dataset.

## Installation
You can install TargetViz using pip:

```bash
pip install targetviz
```


## Example
To generate an HTML report, call the `targetviz_report` function with the dataset and the name of the target variable.
```py
import pandas as pd
from sklearn import datasets

import targetviz

# Load an example dataset from scikit-learn
sklearn_dataset = datasets.fetch_california_housing()
df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
df['price'] = pd.Series(sklearn_dataset.target)

# Generate HTML report in the current directory
targetviz.targetviz_report(df, target='price', output_dir='./', name_file_out="cal_housing.html", pct_outliers=.05)
```

The function `targetviz.targetviz_report` writes an HTML file to `output_dir` (here, the working directory).


# Mini Tutorial
Let's walk through an example with a real-life dataset: the California housing dataset from sklearn.
This dataset contains information on house prices in California from the 1990s along with other features.

Let's start by generating the report. For that we will use the code in the [example above](#example).

Notice that we have added outlier removal through `pct_outliers`. If we set it to 5%, the report removes the top 2.5% and the bottom 2.5% of rows when analysing each column (e.g. rows where the income column is very high or very low). This typically makes the plots easier to read.
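
As a rough illustration of what symmetric trimming means, here is a minimal sketch in plain Python. The helper `trim_outliers` and the toy rows are hypothetical and are not part of targetviz; the library's own implementation may differ in details such as how ties and rounding are handled.

```python
def trim_outliers(rows, column, pct_outliers=0.05):
    """Drop the rows whose `column` value lies in the lowest or highest
    pct_outliers / 2 share of the data (illustrative helper only)."""
    # Sorting makes the extreme rows sit at both ends of the list.
    ordered = sorted(rows, key=lambda row: row[column])
    n_cut = int(len(ordered) * pct_outliers / 2)  # rows dropped at each end
    return ordered[n_cut:len(ordered) - n_cut]


# Toy data: 200 rows with income 1.0 .. 200.0
rows = [{"income": float(i), "price": 2.0 * i} for i in range(1, 201)]
trimmed = trim_outliers(rows, "income", pct_outliers=0.05)

print(len(rows), len(trimmed))                       # 200 190
print(trimmed[0]["income"], trimmed[-1]["income"])   # 6.0 195.0
```

With 200 rows and `pct_outliers=0.05`, five rows are dropped from each end. Note that this sketch returns the rows sorted by the trimmed column, which is fine for plotting but not necessarily what you want elsewhere.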

The generated report is [here](./samples/cal_housing.html). Let's take a look.

First we have a quick univariate analysis of the target variable (median house price, in units of 100k USD). In this case the target is continuous, so this is a regression problem, but the report also accepts binary or categorical targets.
From the histogram we can see that the values range from 0 to 5. The spike at 5 suggests the values may have been clipped; without clipping we would probably have expected a long tail on the right. The mode is at about 1.5 and the mean is close to 2.

Next to the histogram we have the table with some statistics that can help us understand the variable.

![Target analysis for the California housing dataset](./img/cal_housing_target_analysis.png)


Now let's look at the first predictor, the `MedInc` column (median income). Note that columns are sorted by how much of the target they can explain, so the first variable is likely the most powerful.

We have the same univariate plot and table as for the target.

We also have three more plots that help explain how this variable relates to the target variable.
On the left is a scatterplot that gives a first view. Correlation can be hard to judge from a scatterplot, so there is another plot on the top right: we split the predictor variable `MedInc` into quantiles and plot the average of the target (house price) for each quantile. Here we see that, on average, places with higher income tend to have much more expensive houses.
![MedInc analysis for the California housing dataset](img/cal_housing_medinc_analysis.png)
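
The quantile plot described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, assuming equal-count bins; `quantile_means` is a hypothetical helper, not a targetviz function, and the library may bin differently (with pandas you would typically reach for `pandas.qcut`).

```python
from statistics import fmean


def quantile_means(x, y, n_bins=10):
    """Split the observations into n_bins equal-count groups by x and
    return the mean of y in each group, from lowest x to highest x."""
    order = sorted(range(len(x)), key=lambda i: x[i])  # indices sorted by x
    means = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        members = order[start:stop]
        if members:  # skip empty groups when there are fewer rows than bins
            means.append(fmean([y[i] for i in members]))
    return means


# Toy data: x is a shuffled 0..19 and y grows linearly with x.
x = [(7 * i) % 20 for i in range(20)]
y = [2 * v for v in x]

print(quantile_means(x, y, n_bins=4))  # [4.0, 14.0, 24.0, 34.0]
```

Plotting these means against the bin index gives a much cleaner picture of the trend than the raw scatterplot, at the cost of hiding the spread within each bin.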

We can dig deeper and we will see different relationships. When we get to house age, we see something interesting: older houses tend to be more expensive. That doesn't seem to make economic sense: under the same conditions, a new house is typically more desirable, since it is in better condition and tends to have fewer issues.
The catch is "under the same conditions", which clearly does not hold in the real world. Older houses probably tend to be in city centers, where square footage is quite expensive, hence the counterintuitive result.
The reason I'm bringing this up is that we need to be careful about jumping to conclusions and think critically about the results we see.
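
To make the point concrete, here is a small sketch with made-up numbers (they are not taken from the California housing data). Within each hypothetical location, older houses are cheaper, yet the pooled data shows older houses as more expensive, because the older houses sit in the pricier location.

```python
from statistics import fmean


def covariance(x, y):
    """Population covariance; its sign gives the direction of the relation."""
    mean_x, mean_y = fmean(x), fmean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)


# Invented example: two locations, ages in years, prices in 100k USD.
suburb_age, suburb_price = [5, 10, 15, 20], [2.0, 1.9, 1.8, 1.7]
center_age, center_price = [30, 35, 40, 45], [4.0, 3.9, 3.8, 3.7]

within_suburb = covariance(suburb_age, suburb_price)
within_center = covariance(center_age, center_price)
pooled = covariance(suburb_age + center_age, suburb_price + center_price)

print(within_suburb < 0, within_center < 0)  # True True: newer is pricier locally
print(pooled > 0)                            # True: older looks pricier overall
```

A single-variable view like the one in the report cannot separate these two effects, so treat each plot as a prompt for further investigation rather than as a causal claim.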

Some other important things to know:

The file `config.yaml` contains some default configuration values. These can be overridden when calling `targetviz_report`. For example, the default for outlier removal is 0%, but in our call we set it to 5% using `pct_outliers=0.05`.

To calculate the explained variance, we first divide the explanatory variable into quantiles and then calculate the sum of squares between groups:

![equation](https://latex.codecogs.com/gif.latex?%5Ctext%7BSSB%7D%20%3D%20%5Csum_%7Bj%3D1%7D%5E%7Bk%7D%20n_j%20%28%5Cbar%7By%7D_j%20-%20%5Cbar%7By%7D%29%5E2)


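Here k is the number of quantile groups, n_j is the number of rows in group j, ȳ_j is the mean of the target in group j, and ȳ is the overall mean of the target. We then divide this number by the total sum of squares to get the percentage of variance explained.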
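
The calculation above can be sketched as follows. This is a minimal illustration in plain Python, assuming equal-count quantile groups; `explained_variance` is a hypothetical helper and not necessarily how targetviz implements it (for instance, the number of groups used by the library is not stated here).

```python
from statistics import fmean


def explained_variance(x, y, n_bins=10):
    """Return SSB / SST, where the groups are equal-count quantile bins of x."""
    order = sorted(range(len(x)), key=lambda i: x[i])  # indices sorted by x
    grand_mean = fmean(y)
    sst = sum((v - grand_mean) ** 2 for v in y)  # total sum of squares
    ssb = 0.0
    for b in range(n_bins):
        members = order[b * len(order) // n_bins:(b + 1) * len(order) // n_bins]
        if not members:
            continue
        group_mean = fmean([y[i] for i in members])
        # n_j * (mean of group j - overall mean)^2
        ssb += len(members) * (group_mean - grand_mean) ** 2
    return ssb / sst if sst > 0 else 0.0


# Toy data: x is a shuffled 0..19 and y = 2x, split into 4 groups.
x = [(7 * i) % 20 for i in range(20)]
y = [2 * v for v in x]

ratio = explained_variance(x, y, n_bins=4)
print(round(ratio, 3))  # 0.94
```

Even for a perfectly linear relation the ratio stays below 1, because averaging within each bin discards the variation inside the bin; more bins bring it closer to 1.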


## Another example
We can also generate the same report for the Titanic dataset ([see result](./samples/titanic.html)).


```py
import pandas as pd
import targetviz

# Load the Titanic dataset from a CSV file directly from the web
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic = pd.read_csv(url)

# Generate HTML report in the current directory
targetviz.targetviz_report(titanic, target='Survived', output_dir='./', name_file_out="titanic.html", pct_outliers=.05)
```

            

        