pyEDAkit

Name: pyEDAkit
Version: 0.1.4
Summary: A toolkit for exploratory data analysis in Python
Upload time: 2025-01-18 19:01:09
Requires Python: >=3.7
License: MIT License, Copyright (c) 2024 Marco Schivo
Keywords: data analysis, EDA, machine learning
Requirements: matplotlib, MiniSom, networkx, numpy, pandas, scikit-learn, scipy, seaborn
# Exploratory Data Analysis (EDA) Python Toolkit
The project is sponsored by Malmö Universitet, developed by Eng. Marco Schivo and Eng. Alberto Biscalchin under the supervision of Associate Professor Yuanji Cheng, and is released under the MIT License. It is open source and available for anyone to use and contribute to.

Internal course Code reference: MA661E

## Installation Instructions

To install the `pyEDAkit` package, follow the steps below:

1. **Prerequisites**:  
   Ensure you have Python 3.7 or higher installed. You can download it from [python.org](https://www.python.org/).

2. **Install from PyPI**:  
   Use the following command to install `pyEDAkit`:

   ```bash
   pip install pyEDAkit
   ```

3. **Verify Installation**:  
   After installation, verify it by running:

   ```bash
   python -c "import pyEDAkit; print('pyEDAkit installed successfully!')"
   ```

4. **Optional (Upgrade)**:  
   To upgrade to the latest version:

   ```bash
   pip install --upgrade pyEDAkit
   ```

You're now ready to use `pyEDAkit`.

## Warnings
Please note that this project is still under development and is not yet ready for production use. The code is subject to change, and new features will be added over time. We welcome contributions and feedback from the community to improve the toolkit.

## Overview
This repository implements MATLAB-style functions in Python for various data analysis, clustering, dimensionality reduction, and graph algorithms. These functions leverage popular Python libraries such as `numpy`, `scipy`, `matplotlib`, `networkx`, and `scikit-learn`, while maintaining a familiar MATLAB-like syntax and behavior.

---

## Features

### 1. **Clustering and Dimensionality Reduction**
- **Hierarchical Clustering** (`linkage`, `cluster`):
  - MATLAB-style hierarchical clustering using `scipy.cluster.hierarchy`.
  - Supports `single`, `complete`, `average`, `ward`, and other linkage methods.
  - MATLAB-like `cluster` function for cutting hierarchical clusters based on distance or cluster count.

- **K-Means Clustering** (`kmeans`):
  - MATLAB-style K-Means implementation with support for:
    - Initialization methods: `k-means++`, random, or user-specified.
    - Metrics: `sqeuclidean` distance.
    - Number of replicates and maximum iterations.
  - Outputs cluster assignments, centroids, and within-cluster sum of distances.

- **Silhouette Analysis** (`silhouette`):
  - Computes silhouette values to evaluate clustering quality.
  - Supports various distance metrics, including `euclidean`, `manhattan`, `cosine`, and `minkowski`.
  - Generates detailed silhouette plots for cluster visualization.

- **Silhouette Criterion Evaluation** (`SilhouetteEvaluation` class):
  - Evaluates clustering solutions for different cluster counts (`k`) using the silhouette criterion.
  - Identifies the optimal number of clusters (`OptimalK`) based on silhouette scores.
  - Handles missing data and supports weighted or unweighted silhouette averages.
  - Visualizes silhouette criterion scores vs. the number of clusters.

- **PCA and SVD**:
  - `PCA`: Computes Principal Components and visualizes scree plots and scatter matrices.
  - `SVD`: Performs Singular Value Decomposition with visualization of singular values.

- **Non-Negative Matrix Factorization (NMF)**:
  - Reduces data dimensionality while ensuring non-negativity constraints.

- **Factor Analysis (FA)**:
  - Estimates latent factors using sklearn's `FactorAnalysis`.

- **Linear Discriminant Analysis (LDA)**:
  - Reduces dimensionality while maximizing class separability.

- **Random Projection**:
  - Performs Gaussian Random Projection to reduce dimensionality.

---

### 2. **Graph Algorithms**
- **Minimum Spanning Tree (MST)** (`minspantree`):
  - MATLAB-style wrapper using `networkx`.
  - Supports both Prim's (`dense`) and Kruskal's (`sparse`) algorithms.
  - Option to extract a spanning tree for a specific connected component or spanning forest.

---

### 3. **Intrinsic Dimensionality Estimation**
- **Packing Numbers**:
  - Computes intrinsic dimensionality using packing arguments.

- **Maximum Likelihood Estimation (MLE)**:
  - Estimates intrinsic dimensionality using a k-Nearest Neighbor approach.

- **Correlation Dimension**:
  - Estimates intrinsic dimensionality via correlation methods.

- **Pettis Method**:
  - Computes intrinsic dimensionality using Pettis et al.'s algorithm.

---

### 4. **Normalization**
- **Standardization (`with_std_dev`)**:
  - Standardizes data with zero-mean and unit variance.
  
- **Min-Max Normalization (`min_max_norm`)**:
  - Rescales data to a range between 0 and 1.

- **Sphering (`sphering`)**:
  - Whitens data, decorrelating variables and setting variance to 1.

---

### 5. **Clustering Quality Metrics**
- **Cophenetic Correlation** (`cophenet`):
  - Computes the cophenetic correlation coefficient to measure how well a dendrogram preserves the original pairwise distances.

- **Silhouette Evaluation and Visualization**:
  - `silhouette`: Computes silhouette values and plots silhouette scores for each cluster.
  - `SilhouetteEvaluation`: Evaluates and visualizes the silhouette criterion for determining the optimal number of clusters.

---
# Examples
!IMPORTANT: The examples are not finished yet; they are a draft of what we are going to implement.
The import statements are placeholders and need to be replaced with the actual module name that will be available through PyPI soon.
## Clustering
### **`Linkage` Function**

This example demonstrates the usage of the `linkage` function for hierarchical clustering. The `linkage` function builds a hierarchical cluster tree (also known as a dendrogram) using various linkage methods. We show two use cases: clustering a large dataset into groups and visualizing the hierarchy using a dendrogram.


#### Python Code:
```python
import numpy as np
import matplotlib.pyplot as plt
from pyEDAkit.clustering import linkage, cluster
from scipy.cluster.hierarchy import dendrogram
from scipy.spatial.distance import squareform
from mpl_toolkits.mplot3d import Axes3D

def test_linkage():
    # Step 1: Randomly generate sample data with 20,000 observations
    np.random.seed(0)  # For reproducibility
    X = np.random.rand(20000, 3)

    # Step 2: Create a hierarchical cluster tree using the ward linkage method
    Z = linkage(X, method='ward')

    # Step 3: Cluster the data into a maximum of four groups
    max_clusters = 4
    cluster_labels = cluster(Z, 'MaxClust', max_clusters, criterion='maxclust')

    # Step 4: Plot the result in 3D
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')

    scatter = ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=cluster_labels, cmap='viridis', s=10)
    ax.set_xlabel('X-axis')
    ax.set_ylabel('Y-axis')
    ax.set_zlabel('Z-axis')

    plt.title('3D Scatter Plot of Hierarchical Clustering')
    plt.colorbar(scatter, ax=ax, label='Cluster Label')
    plt.show()

    # Step 5: Define the dissimilarity matrix
    X = np.array([
        [0, 1, 2, 3],
        [1, 0, 4, 5],
        [2, 4, 0, 6],
        [3, 5, 6, 0]
    ])

    # Step 6: Convert the dissimilarity matrix to vector form using squareform
    y = squareform(X)

    # Step 7: Create a hierarchical cluster tree using the 'complete' method
    Z = linkage(y, method='complete')

    # Step 8: Print the resulting linkage matrix
    print("Linkage matrix (Z):")
    print(Z)

    # Step 9: Plot the dendrogram
    plt.figure(figsize=(10, 6))
    dendrogram(
        Z,
        labels=[1, 2, 3, 4],  # Use MATLAB-style indices for the leaf nodes
        leaf_font_size=10      # Adjust font size for clarity
    )
    plt.title('Dendrogram')
    plt.xlabel('Leaf Nodes')
    plt.ylabel('Linkage Distance')
    plt.show()

test_linkage()
```

### Visualizations

#### **3D Scatter Plot of Hierarchical Clustering**
This plot visualizes the clusters formed by hierarchical clustering on a randomly generated dataset of 20,000 observations. The data points are colored by their cluster labels (maximum of 4 clusters).

![3D Scatter Plot](https://github.com/zkivo/pyEDAkit/raw/main/examples/hierarchical_clustering_scatter_3d.png)


#### **Dendrogram**
The dendrogram represents the hierarchical clustering of a small dataset, built from a dissimilarity matrix. The `complete` linkage method is used to compute the hierarchical structure, and the result is visualized as a dendrogram.

![Dendrogram](https://github.com/zkivo/pyEDAkit/raw/main/examples/dendrogram.png)

### **`Cluster` Function**

The `cluster` function is a MATLAB-style wrapper for SciPy's `fcluster` function, allowing flexible and intuitive hierarchical clustering. This example demonstrates its usage with various clustering criteria, such as distance thresholds, inconsistent measures, and a fixed number of clusters. Additionally, it supports multiple cutoffs to produce a matrix of cluster assignments.


#### Example:

```python
import numpy as np
from pyEDAkit.clustering import cluster, linkage 

def test_cluster():
    # Generate sample data
    X = np.random.rand(10, 3)

    # Compute linkage matrix
    Z = linkage(X, method='ward')

    # 1) Cut off by distance = 0.7
    T_distance = cluster(Z, 'Cutoff', 0.7, 'Criterion', 'distance')

    # 2) Cut off by inconsistent measure
    T_inconsist = cluster(Z, 'Cutoff', 1.5)

    # 3) Force a maximum of 3 clusters
    T_maxclust = cluster(Z, 'MaxClust', 3)

    # 4) Multiple cutoffs -> T is an m-by-l matrix
    T_multi = cluster(Z, 'Cutoff', [0.7, 1.0, 1.5], 'Criterion', 'distance')
    print(T_multi.shape)  # (10, 3)

test_cluster()
```

#### Output:

This example showcases the flexibility of the `cluster` function. Below is the output from the final step, where multiple cutoffs are used:

```bash
(10, 3)
```

### Key Points:

1. **Cut off by Distance**: Creates clusters by specifying a distance threshold. For example:
   ```python
   T_distance = cluster(Z, 'Cutoff', 0.7, 'Criterion', 'distance')
   ```

2. **Cut off by Inconsistent Measure**: Uses the default 'inconsistent' criterion for clustering:
   ```python
   T_inconsist = cluster(Z, 'Cutoff', 1.5)
   ```

3. **Force a Maximum Number of Clusters**: Ensures the data is divided into a fixed number of clusters:
   ```python
   T_maxclust = cluster(Z, 'MaxClust', 3)
   ```

4. **Multiple Cutoffs**: Produces a matrix where each column corresponds to cluster assignments for a specific cutoff:
   ```python
   T_multi = cluster(Z, 'Cutoff', [0.7, 1.0, 1.5], 'Criterion', 'distance')
   ```

The flexibility of `cluster` makes it an ideal choice for hierarchical clustering tasks requiring MATLAB-like functionality in Python.

---

### **`K-means` Clustering**

The `kmeans` function, imported from the `pyEDAkit.clustering` module, provides a MATLAB-style implementation of the K-means algorithm, allowing intuitive and flexible clustering with support for optional parameters such as the number of replicates and maximum iterations.


#### Example:

```python
import numpy as np
import pandas as pd
from pyEDAkit.clustering import kmeans
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.metrics import accuracy_score

def test_kmeans():
    # Load Iris dataset
    iris_path = "../datasets/iris_dataset.csv"
    iris_data = pd.read_csv(iris_path)

    # Use only petal_length and petal_width features (2D data)
    X = iris_data.iloc[:, [2, 3]].values  # Columns for petal_length and petal_width
    y_true = iris_data.iloc[:, -1].values  # True labels (species)

    # Map the target labels to numeric values for true labels
    label_mapping = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    y_numeric = np.array([label_mapping[label] for label in y_true])

    # Step 1: Apply k-means clustering using your custom function
    k = 3  # Number of clusters
    idx, C, sumd, D = kmeans(X, k, 'Distance', 'sqeuclidean', 'Replicates', 5, 'MaxIter', 300)

    # Step 2: Create a 2D grid for the feature space
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05),
                         np.arange(y_min, y_max, 0.05))

    # Combine the grid into a list of points
    grid_points = np.c_[xx.ravel(), yy.ravel()]  # Flatten the grid into points for classification

    # Define the function to assign grid points to clusters
    def assign_to_clusters(grid_points, centroids):
        """
        Assign grid points to the nearest cluster based on centroids.
        """
        distances = np.linalg.norm(grid_points[:, None] - centroids, axis=2)  # Euclidean distance
        return np.argmin(distances, axis=1)  # Assign to the nearest centroid

    # Assign each grid point to a cluster
    Z = assign_to_clusters(grid_points, C)
    Z = Z.reshape(xx.shape)  # Reshape to match the grid dimensions

    # Step 4: Plot the results
    plt.figure(figsize=(10, 6))

    # Plot cluster regions
    cmap = ListedColormap(['#FFCCCC', '#CCFFCC', '#CCCCFF'])  # Colors for regions
    plt.contourf(xx, yy, Z, cmap=cmap, alpha=0.5)

    # Plot data points
    plt.scatter(X[:, 0], X[:, 1], c=idx - 1, cmap='viridis', edgecolor='k', s=50, label='Data')

    # Plot centroids
    plt.scatter(C[:, 0], C[:, 1], c='red', s=200, marker='X', label='Centroids')

    # Add labels and title
    plt.title("K-means Clustering with Correctly Colored Areas")
    plt.xlabel("Petal Length (cm)")
    plt.ylabel("Petal Width (cm)")
    plt.legend(loc='best')
    plt.show()

    # Step 5: Print results
    print("Cluster assignments (idx):")
    print(idx)
    print("\nCentroids (C):")
    print(C)
    print("\nWithin-cluster sum of distances (sumd):")
    print(sumd)



test_kmeans()
```

#### Centroids displacement plot
![K-Means Example](https://github.com/zkivo/pyEDAkit/raw/main/examples/k-means.png)

#### Bash Output:

```bash
Cluster assignments (idx):
[2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
 2. 2. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 3. 1. 3. 3. 3. 3. 3. 1. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 1. 1. 1. 1. 1. 1. 3. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 3.
 1. 1. 1. 1. 1. 1. 3. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 3. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]

Centroids (C):
[[5.59583333 2.0375    ]
 [1.464      0.244     ]
 [4.26923077 1.34230769]]

Within-cluster sum of distances (sumd):
[16.29166667  2.0384     13.05769231]
```

#### Key Points:

1. **Basic Clustering**:
   - Cluster assignment, centroids, within-cluster sum of distances, and point-to-centroid distances are calculated.
   
2. **Iris Dataset Example**:
   - Used the Iris dataset to cluster the data and compare with true labels for accuracy.
   
3. **Visualization**:
   - The two petal features (petal length and width) give a natural 2D view, so no further dimensionality reduction is needed for the plot.
   - Colored areas represent cluster boundaries, and centroids are marked with red `X`. 

4. **Accuracy**:
   - Clustering accuracy can be computed by mapping each cluster to its closest true label (a minimal sketch follows this list).
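
The example imports `accuracy_score` and builds `y_numeric` but does not use them. A minimal sketch of how such an accuracy check could be wired up (the helper `clustering_accuracy` is hypothetical, not part of `pyEDAkit`) is:

```python
import numpy as np
from sklearn.metrics import accuracy_score

def clustering_accuracy(idx, y_true):
    """Map each cluster to its most frequent true label, then score."""
    idx = np.asarray(idx).astype(int)
    y_true = np.asarray(y_true).astype(int)
    mapped = np.empty_like(y_true)
    for c in np.unique(idx):
        mask = (idx == c)
        # Majority vote: the true label most common inside cluster c
        mapped[mask] = np.bincount(y_true[mask]).argmax()
    return accuracy_score(y_true, mapped)

# With the variables from test_kmeans():
# print("Accuracy:", clustering_accuracy(idx, y_numeric))
```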

---

### **`minspantree` - Minimum Spanning Tree**

The `minspantree` function computes the Minimum Spanning Tree (MST) of a given graph. It supports both Prim's and Kruskal's algorithms and can return the MST for a specific component (`Type='tree'`) or the entire graph (`Type='forest'`).

#### Example:

```python
import networkx as nx
import matplotlib.pyplot as plt
from pyEDAkit.clustering import minspantree

def test_minspantree():
    # Create a graph with weighted edges
    G = nx.Graph()
    G.add_weighted_edges_from([
        (1, 2, 2.0),
        (2, 3, 1.5),
        (2, 4, 3.0),
        (1, 5, 4.0),
        (3, 5, 2.5),
        (4, 5, 1.0),
        (5, 6, 2.0)
    ])

    # Visualize the original graph
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 3, 1)
    pos = nx.spring_layout(G, seed=42)  # For consistent layout
    nx.draw(G, pos, with_labels=True, node_color='lightblue', edge_color='gray', node_size=1000, font_size=10)
    labels = nx.get_edge_attributes(G, 'weight')
    nx.draw_networkx_edge_labels(G, pos, edge_labels=labels)
    plt.title("Original Graph")

    # 1) Compute MST using Prim's algorithm (default method)
    T_prim, pred_prim = minspantree(G)  # Assuming minspantree implements Prim's by default

    # Visualize MST with Prim's algorithm
    plt.subplot(1, 3, 2)
    nx.draw(T_prim, pos, with_labels=True, node_color='lightgreen', edge_color='blue', node_size=1000, font_size=10)
    labels = nx.get_edge_attributes(T_prim, 'weight')
    nx.draw_networkx_edge_labels(T_prim, pos, edge_labels=labels)
    plt.title("MST (Prim's Algorithm)")

    # 2) Compute MST using Kruskal's algorithm (Method='sparse')
    T_kruskal, pred_kruskal = minspantree(G, 'Method', 'sparse', 'Root', 2, 'Type', 'forest')

    # Visualize MST with Kruskal's algorithm
    plt.subplot(1, 3, 3)
    nx.draw(T_kruskal, pos, with_labels=True, node_color='lightcoral', edge_color='purple', node_size=1000, font_size=10)
    labels = nx.get_edge_attributes(T_kruskal, 'weight')
    nx.draw_networkx_edge_labels(T_kruskal, pos, edge_labels=labels)
    plt.title("MST (Kruskal's Algorithm)")

    plt.tight_layout()
    plt.show()

    # Print details of the MSTs
    print("\n--- MST using Prim's Algorithm ---")
    print("Edges of T (Prim):", list(T_prim.edges(data=True)))
    print("Predecessors (Prim):", pred_prim)

    print("\n--- MST using Kruskal's Algorithm ---")
    print("Edges of T (Kruskal):", list(T_kruskal.edges(data=True)))
    print("Predecessors (Kruskal):", pred_kruskal)

test_minspantree()
```


#### Visualization

- **Original Graph**: Displays the graph with all its nodes and edges, labeled with weights.
- **MST (Prim's Algorithm)**: Highlights the MST computed using Prim's algorithm, rooted at node 1.
- **MST (Kruskal's Algorithm)**: Displays the MST computed using Kruskal's algorithm, including a spanning forest for all components.

![Minimum Spanning Tree](https://github.com/zkivo/pyEDAkit/raw/main/examples/minspantree.png)



#### Bash Output

```bash
--- MST using Prim's Algorithm ---
Edges of T (Prim): [(1, 2, {'weight': 2.0}), (2, 3, {'weight': 1.5}), (3, 5, {'weight': 2.5}), (4, 5, {'weight': 1.0}), (5, 6, {'weight': 2.0})]
Predecessors (Prim): {1: 0, 2: 1, 3: 2, 4: 5, 5: 3, 6: 5}

--- MST using Kruskal's Algorithm ---
Edges of T (Kruskal): [(1, 2, {'weight': 2.0}), (2, 3, {'weight': 1.5}), (3, 5, {'weight': 2.5}), (4, 5, {'weight': 1.0}), (5, 6, {'weight': 2.0})]
Predecessors (Kruskal): {1: 2, 2: 0, 3: 2, 4: 5, 5: 3, 6: 5}

Process finished with exit code 0
```

#### Key Points

1. **Prim's Algorithm**:
   - Grows the MST from a specific root node.
   - Produces a single tree for the connected component containing the root.

2. **Kruskal's Algorithm**:
   - Builds the MST by adding edges with the smallest weights.
   - Can generate a spanning forest if the graph is disconnected.

3. **Visualization**:
   - Easily compares the original graph with the MSTs generated by different algorithms.

4. **Customizability**:
   - Supports options like specifying the root node and generating a forest for disconnected graphs.
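
Because the overview describes `minspantree` as a MATLAB-style wrapper around `networkx`, its output can be sanity-checked against `networkx.minimum_spanning_tree` directly. The sketch below computes reference MSTs for the same graph; it does not call `minspantree` itself:

```python
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    (1, 2, 2.0), (2, 3, 1.5), (2, 4, 3.0), (1, 5, 4.0),
    (3, 5, 2.5), (4, 5, 1.0), (5, 6, 2.0),
])

# Reference MSTs computed directly with networkx
T_prim = nx.minimum_spanning_tree(G, algorithm="prim")
T_kruskal = nx.minimum_spanning_tree(G, algorithm="kruskal")

def total_weight(T):
    return sum(w for _, _, w in T.edges(data="weight"))

print("Prim total weight:   ", total_weight(T_prim))     # 9.0
print("Kruskal total weight:", total_weight(T_kruskal))  # 9.0
```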
---


### **`cophenet` - Cophenetic Correlation Coefficient**

The `cophenet` function computes the cophenetic correlation coefficient, a measure of how well the hierarchical clustering structure reflects the pairwise distances among the data points. It also provides the cophenetic distances for the linkage matrix.


#### Enhanced Example:

```python
import numpy as np
from pyEDAkit.IntrinsicDimensionality import pdist
from pyEDAkit.clustering import linkage, cophenet
import matplotlib.pyplot as plt

def test_cophenet():
    # Step 1: Generate sample data
    np.random.seed(42)
    X = np.vstack([
        np.random.rand(5, 2),       # Cluster 1
        np.random.rand(5, 2) + 5   # Cluster 2 (shifted by +5)
    ])

    # Step 2: Compute pairwise distances
    Y = pdist(X)

    # Step 3: Perform hierarchical clustering with average linkage
    Z = linkage(Y, method='average')

    # Step 4: Compute cophenetic correlation coefficient
    c, d = cophenet(Z, Y)

    # Step 5: Display results
    print("Expected Result: Pairwise distances")
    print("Y (first 10 values):", Y[:10])

    print("\nActual Result: Cophenetic distances")
    print("d (first 10 values):", d[:10])

    print("\nCophenetic correlation coefficient:", c)

    # Step 6: Visualize clustering with dendrogram
    plt.figure(figsize=(10, 5))
    plt.title("Dendrogram")
    plt.xlabel("Data Points")
    plt.ylabel("Distance")
    from scipy.cluster.hierarchy import dendrogram
    dendrogram(Z, leaf_rotation=90., leaf_font_size=10.)
    plt.show()

# Run the example
test_cophenet()
```

#### **Explanation**:

1. **Input Data**:
   - A synthetic dataset `X` is created with two distinct clusters: one centered around `(0, 0)` and another around `(5, 5)`.

2. **Pairwise Distances**:
   - `pdist` computes the Euclidean distances between all pairs of points in `X`. These are the **expected pairwise distances**.

3. **Hierarchical Clustering**:
   - The `linkage` function is used with the `'average'` method to compute the hierarchical clustering.

4. **Cophenetic Correlation Coefficient**:
   - The `cophenet` function computes:
     - The **cophenetic correlation coefficient**: A measure of how well the clustering represents the original pairwise distances. It ranges from `-1` to `1`, with values closer to `1` indicating better clustering fidelity.
     - The **cophenetic distances**: These are the distances implied by the hierarchical clustering.

5. **Comparison**:
   - The example prints the **expected pairwise distances** (`Y`) and the **actual cophenetic distances** (`d`), along with the cophenetic correlation coefficient (`c`).

6. **Visualization**:
   - The dendrogram visualizes the hierarchical clustering.


#### **Expected Output**:

```bash
Expected Result: Pairwise distances
Y (first 10 values): [0.5017136  0.82421549 0.32755369 0.33198071 6.83944824 6.92460439
 6.40512716 6.72486612 6.66463903 0.72642889]

Actual Result: Cophenetic distances
d (first 10 values): [0.5310849  0.7441755  0.32755369 0.5310849  6.90984673 6.90984673
 6.90984673 6.90984673 6.90984673 0.7441755 ]

Cophenetic correlation coefficient: 0.9966209146535578

Process finished with exit code 0
```
![Cophenetic Correlation](https://github.com/zkivo/pyEDAkit/raw/main/examples/Cophenetic_corr.png)

#### **Key Takeaways**:

- The **cophenetic correlation coefficient** (`c ≈ 0.997`) indicates that the hierarchical clustering structure is a very good representation of the original pairwise distances.
- The **cophenetic distances** (`d`) closely match the original distances in `Y`.
- The dendrogram visually confirms the clustering structure, with two distinct groups evident.
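
For an independent cross-check, the same quantities can be computed with SciPy alone (the overview notes that the clustering utilities build on `scipy.cluster.hierarchy`); a minimal sketch:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet

np.random.seed(42)
X = np.vstack([np.random.rand(5, 2), np.random.rand(5, 2) + 5])

Y = pdist(X)                       # condensed pairwise distances
Z = linkage(Y, method='average')   # hierarchical clustering
c, d = cophenet(Z, Y)              # coefficient and cophenetic distances

print("Cophenetic correlation coefficient:", c)  # close to 1 for well-separated clusters
```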

---

### **`silhouette` - Evaluating Clustering Quality**

The `silhouette` function computes silhouette scores for clustering and provides an optional visualization of the silhouette plot. It measures how similar an object is to its own cluster compared to other clusters, with scores ranging from `-1` to `1`. Higher scores indicate better-defined clusters.


#### **Example Usage**

```python
import numpy as np
from pyEDAkit.clustering import silhouette

def test_silhouette():
    # Step 1: Generate synthetic data
    X = np.random.rand(20, 2)  # 20 data points with 2 features
    labels = np.random.randint(0, 3, size=20)  # Assign random labels for 3 clusters

    # Step 2: Silhouette analysis using default Euclidean distance with visualization
    s, fig = silhouette(X, labels)  # Default parameters
    print("Silhouette values:\n", s)

    # Step 3: Silhouette analysis using Minkowski distance (p=3) without visualization
    s2, _ = silhouette(X, labels, Distance='minkowski', DistParameter={'p': 3}, do_plot=False)
    print("Silhouette values with Minkowski distance, p=3:\n", s2)

# Run the example
test_silhouette()
```

#### **Explanation**

1. **Input Data**:
   - `X`: The dataset with shape `(n_samples, n_features)`. In this example, it consists of 20 randomly generated data points in 2D space.
   - `labels`: Cluster assignments for each data point. Random labels are generated for three clusters.

2. **Silhouette Analysis**:
   - The silhouette score is computed for each data point as (a numerical sketch of this computation follows the list below):
   
        $s(i)=\frac{b(i)-a(i)}{\max(a(i),b(i))}$

     where:
     - $a(i)$: Average intra-cluster distance (distance to other points in the same cluster).
     - $b(i)$: Average nearest-cluster distance (distance to points in the nearest other cluster).

3. **Default Case**:
   - By default, the function computes the silhouette scores using the **Euclidean distance** and visualizes the silhouette plot.

4. **Custom Distance**:
   - In the second case, the function uses the **Minkowski distance** with \( p = 3 \) (specified using the `DistParameter` argument). No plot is produced (`do_plot=False`).

5. **Output**:
   - `s`: An array of silhouette scores for each data point.
   - `fig`: A matplotlib figure object showing the silhouette plot (if visualization is enabled).
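
To make the formula in step 2 concrete, here is a minimal, self-contained sketch that computes $a(i)$, $b(i)$ and $s(i)$ with plain NumPy and compares the result with `sklearn.metrics.silhouette_samples`; it does not call `pyEDAkit.silhouette` itself:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
X = rng.random((20, 2))
labels = rng.integers(0, 3, size=20)

# Pairwise Euclidean distances
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

s = np.zeros(len(X))
for i in range(len(X)):
    same = (labels == labels[i])
    same[i] = False
    if not same.any():          # singleton cluster -> silhouette is 0 by convention
        continue
    a = D[i, same].mean()       # mean intra-cluster distance a(i)
    b = min(D[i, labels == c].mean()
            for c in np.unique(labels) if c != labels[i])  # nearest-cluster distance b(i)
    s[i] = (b - a) / max(a, b)

print(np.allclose(s, silhouette_samples(X, labels)))  # True
```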



#### **Expected Output**

For the given random data, the output will include:

1. **Silhouette Scores (Default Distance)**:
   ```Bash
    Silhouette values:
    [-0.20917609 -0.11501692 -0.09117026  0.07603928 -0.49128607 -0.34661769
    -0.20534282 -0.02257866 -0.15532452 -0.36940948 -0.26504272  0.32133183
    -0.08229476  0.30387826 -0.09858198 -0.26869494 -0.59434334  0.08777368
    0.11872492  0.06386749]
   ```

2. **Silhouette Scores (Minkowski Distance, \( p=3 \))**:
   ```Bash
    Silhouette values with Minkowski distance, p=3:
     [-0.19668877 -0.14506527 -0.06741624  0.07712329 -0.48447927 -0.32749879
     -0.1640281  -0.04252481 -0.14681262 -0.39123349 -0.26663026  0.3170098
     -0.09092208  0.30229521 -0.06645392 -0.27165836 -0.5954229   0.07757885
      0.11677314  0.06568025]
   ```

3. **Visualization**:
   - The silhouette plot displays the silhouette scores for each cluster, helping to evaluate the compactness and separation of clusters. The cluster with the largest average silhouette score is better defined.


#### **Silhouette Plot Visualization**

The silhouette plot provides:
- **Bars**: Represent silhouette scores for individual points, grouped by clusters.
- **Dashed Line**: Represents the average silhouette score for each cluster.

This plot is useful to evaluate the clustering quality visually.



#### **Notes**
- The silhouette score for a single point:
  - **Close to 1**: Well-clustered.
  - **Close to 0**: Overlapping clusters.
  - **Negative**: Misclassified point.
- Custom distance metrics can be specified via the `Distance` parameter (e.g., Minkowski, Manhattan, etc.).


---
### Silhouette Evaluation Example

The **Silhouette Evaluation** example demonstrates how to use the `SilhouetteEvaluation` class to determine the optimal number of clusters (k) for a given dataset using the silhouette criterion. The class evaluates various cluster solutions and identifies the best one based on the silhouette measure.


#### Example Code:

```python
from pyEDAkit.clustering import SilhouetteEvaluation
import numpy as np
from numpy.random import default_rng

def test_eval_silhouette():
    rng = default_rng(42)
    n = 200
    X1 = rng.multivariate_normal(mean=[2, 2], cov=[[0.9, -0.0255], [-0.0255, 0.9]], size=n)
    X2 = rng.multivariate_normal(mean=[5, 5], cov=[[0.5, 0], [0, 0.3]], size=n)
    X3 = rng.multivariate_normal(mean=[-2, -2], cov=[[1, 0], [0, 0.9]], size=n)
    X = np.vstack([X1, X2, X3])

    # Evaluate the silhouette for k=1..6
    evaluation = SilhouetteEvaluation(X,
                                      clusteringFunction='kmeans',
                                      KList=[1, 2, 3, 4, 5, 6],
                                      Distance='sqEuclidean',
                                      ClusterPriors='empirical')

    print("Inspected K:         ", evaluation.InspectedK)
    print("Criterion Values:    ", evaluation.CriterionValues)
    print("OptimalK:            ", evaluation.OptimalK)

    # Optional: plot the silhouette criterion
    evaluation.plot()

    # The best cluster solution's assignment:
    best_clusters = evaluation.OptimalY
    print("Shape of best cluster assignment:", best_clusters.shape)
    print("Some cluster labels:", np.unique(best_clusters[~np.isnan(best_clusters)]))

test_eval_silhouette()
```

### **Output and Plot**

- **Inspected K**: The list of cluster numbers (k) evaluated.
- **Criterion Values**: Silhouette scores for each k. Higher values indicate better cluster separation.
- **Optimal K**: The k with the highest silhouette score.

#### Example Output:
```bash
Inspected K:          [1 2 3 4 5 6]
Criterion Values:     [nan 0.65458239 0.67860193 0.54828688 0.45514697 0.33797165]
OptimalK:             3
Shape of best cluster assignment: (600,)
Some cluster labels: [1. 2. 3.]
```

#### Plot:
The silhouette values vs. the number of clusters are visualized in the plot below:

![Silhouette Criterion Evaluation](https://github.com/zkivo/pyEDAkit/raw/main/examples/silhouette_eval.png)


### **How It Works**

1. **Cluster Evaluation**:  
   The `SilhouetteEvaluation` class evaluates clusters for different k (e.g., 1 to 6) using the silhouette criterion:
   - The silhouette score is computed for each data point as:
   
        $s(i)=\frac{b(i)-a(i)}{\max(a(i),b(i))}$

     where:
     - $a(i)$: Average intra-cluster distance (distance to other points in the same cluster).
     - $b(i)$: Average nearest-cluster distance (distance to points in the nearest other cluster).

2. **Optimal Number of Clusters**:  
   - The silhouette score is calculated for each k.
   - The `OptimalK` property identifies the k with the maximum score.

3. **Cluster Priors** (see the sketch after this list):  
   - `'empirical'`: Weighted by cluster sizes.
   - `'equal'`: Equal weight for all clusters.

4. **Visualization**:  
   The `plot()` method shows the silhouette values for each k, aiding in identifying the best clustering solution.
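
As an illustration of the two prior options, the sketch below averages per-point silhouette values `s` under each weighting. This reflects the common reading of `'empirical'` vs. `'equal'` priors and is an assumption about the convention, not code taken from `SilhouetteEvaluation`:

```python
import numpy as np

def criterion(s, labels, priors="empirical"):
    """Aggregate per-point silhouette values into one criterion value."""
    s, labels = np.asarray(s, float), np.asarray(labels)
    if priors == "empirical":
        # Weighted by cluster sizes: simply the mean over all points
        return s.mean()
    # 'equal': every cluster contributes equally, regardless of its size
    return np.mean([s[labels == c].mean() for c in np.unique(labels)])

s = np.array([0.8, 0.7, 0.6, 0.1])
labels = np.array([0, 0, 0, 1])
print(criterion(s, labels, "empirical"))  # 0.55
print(criterion(s, labels, "equal"))      # 0.4 -> (0.7 + 0.1) / 2
```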

This example demonstrates the importance of silhouette analysis for optimal clustering and provides an easy-to-follow implementation of the process.

---

## Dimensionality Reduction Examples

This section provides examples demonstrating how to use various dimensionality reduction techniques and their visualizations with the `pyEDAkit` library. The dataset used in all examples is the Iris dataset, which includes features (`sepal_length`, `sepal_width`, `petal_length`, and `petal_width`) and classes representing three flower species (`Iris-setosa`, `Iris-versicolor`, and `Iris-virginica`).

#### Dataset Preparation

```python
from pyEDAkit import linear as eda_lin
import pandas as pd
import numpy as np

# Make sure to import the dataset from your local directory
df = pd.read_csv("../datasets/iris/iris.data")
df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

# Extract features and target labels
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].to_numpy()
y = df[['class']].to_numpy()[:,0]
y = np.where(y == 'Iris-setosa', 0, y)
y = np.where(y == 'Iris-versicolor', 1, y)
y = np.where(y == 'Iris-virginica', 2, y)
y = y.astype(int)
y_names = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
```

---

### 1. **Principal Component Analysis (PCA)**

Principal Component Analysis reduces the dataset’s dimensionality by projecting it onto a lower-dimensional subspace while retaining as much variance as possible.

```python
eda_lin.PCA(X, d=3, plot=True)
```
#### Output (plot=True)
![PCA Example](https://github.com/zkivo/pyEDAkit/raw/main/examples/Iris_PCA.png)

- **Explanation**:
  - The data is reduced to 3 principal components.
  - The plot visualizes the variance explained by each component.
  - Useful for understanding which components capture the most variance.
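
For readers who want to see the underlying computation, a minimal cross-check with scikit-learn (using `sklearn.datasets.load_iris` as a stand-in for the CSV loading above; the exact return values of `eda_lin.PCA` are not assumed here) could look like this:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 150 x 4 feature matrix
pca = PCA(n_components=3).fit(X)

scores = pca.transform(X)                 # data projected onto the first 3 PCs
print(scores.shape)                       # (150, 3)
print(pca.explained_variance_ratio_)      # share of variance per component
```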

---

### 2. **Singular Value Decomposition (SVD)**

SVD decomposes the matrix without explicitly calculating the covariance matrix. It provides an efficient way to find the principal components.

```python
eda_lin.SVD(X, plot=True)
```
#### Output (plot=True)
![SVD Example](https://github.com/zkivo/pyEDAkit/raw/main/examples/Iris_SVD.png)


- **Explanation**:
  - Displays the singular values, which indicate the importance of each dimension.
  - Useful for identifying the rank and energy captured by the decomposition.

---

### 3. **Non-negative Matrix Factorization (NMF)**

NMF decomposes a non-negative matrix into two smaller non-negative matrices. Unlike SVD, it enforces non-negativity on the factors, which makes it well suited to datasets where negative values do not make sense.

```python
eda_lin.NMF(X, d=4, plot=True)
```
#### Output (plot=True)
![NMF Example](https://github.com/zkivo/pyEDAkit/raw/main/examples/Iris_NMF.png)


- **Explanation**:
  - Decomposes the data into 4 components.
  - The plot shows the contribution of each component to the original dataset.

---

### 4. **Factor Analysis (FA)**

Factor Analysis reduces dimensionality by relating each original variable to a smaller set of latent factors, with a small additive error term accounting for the variability the factors do not capture.

```python
eda_lin.FA(X, d=3, plot=True)
```

#### Output (plot=True)
![FA Example](https://github.com/zkivo/pyEDAkit/raw/main/examples/Iris_FA.png)

- **Explanation**:
  - Reduces the data to 3 factors.
  - Shows how the factors contribute to explaining the dataset's variance.
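
Since the feature list above states that FA is backed by scikit-learn's `FactorAnalysis`, an equivalent standalone sketch of the decomposition is:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X = load_iris().data
fa = FactorAnalysis(n_components=3, random_state=0).fit(X)

Z = fa.transform(X)          # latent factor scores, shape (150, 3)
print(Z.shape)
print(fa.components_)        # loadings of each original variable on the factors
print(fa.noise_variance_)    # per-variable error term
```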

---

### 5. **Linear Discriminant Analysis (LDA)**

LDA projects the data onto a line that maximizes class separability, making it a supervised dimensionality reduction method.

```python
eda_lin.LDA(X, y, plot=True)
```

#### Output (plot=True)
![LDA Example](https://github.com/zkivo/pyEDAkit/raw/main/examples/Iris_LDA.png)

- **Explanation**:
  - Projects the data onto a single line for maximum class separation.
  - Visualization highlights how well the classes are separated along the discriminant axis.

---

### 6. **Random Projection**

Random Projection projects the data points onto a random subspace while approximately preserving the distances between points. It is computationally efficient and effective in practice.

```python
eda_lin.RandProj(X, d=3, plot=True)
```
#### Output (plot=True)
![Random Projection Example](https://github.com/zkivo/pyEDAkit/raw/main/examples/Iris_RandProj.png)

- **Explanation**:
  - Projects the data into a random 3-dimensional subspace.
  - The plot visualizes the points in the new subspace, demonstrating that the relative distances between points are preserved.
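
The distance-preservation claim can be checked with scikit-learn's `GaussianRandomProjection` directly (a rough sanity check, independent of `pyEDAkit`); a synthetic high-dimensional matrix is used here because the effect is easier to see than with the four Iris features:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))           # synthetic high-dimensional data

proj = GaussianRandomProjection(n_components=20, random_state=0)
X_low = proj.fit_transform(X)

# Ratio of projected to original pairwise distances; close to 1 on average
ratios = pdist(X_low) / pdist(X)
print(ratios.mean(), ratios.std())
```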

---

### Notes:

- **Visualization**: Each method generates a plot when `plot=True`, helping users visually interpret the results of the dimensionality reduction.
- **Flexibility**: The `d` parameter in most functions controls the number of reduced dimensions.
- **Interactivity**: The methods allow exploration of how different dimensionality reduction techniques work with the same dataset.

---

### Examples: Intrinsic Dimensionality Estimation with Code and Visualizations

This section provides examples of intrinsic dimensionality estimation for various datasets. The examples include:

1. **3D Scene Analysis**
2. **1D Helix**
3. **3D Helix**

Each example includes code snippets, results, and visualizations.



#### **1. 3D Scene Analysis**

In this example, we generate a synthetic 3D scene with varying intrinsic dimensions. We estimate the intrinsic dimensionality for each point using the `MLE` method and visualize the results.

**Code**:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
import pandas as pd
from pyEDAkit.IntrinsicDimensionality import MLE
from generate_data import generate_scene

print('--------- Scene ---------')
X = generate_scene(plot=False)

# Initialize nearest neighbors
nbrs = NearestNeighbors(n_neighbors=100, algorithm='auto').fit(X)
_, indices = nbrs.kneighbors(X)

# Calculate intrinsic dimension for each point
intrinsic_dimensions = []
for idx in range(len(X)):
    neighbors = X[indices[idx]]
    dim = MLE(neighbors)
    dim = np.round(dim)
    intrinsic_dimensions.append(dim)

# Replace values > 3 with 4
intrinsic_dimensions = np.array(intrinsic_dimensions)
intrinsic_dimensions[intrinsic_dimensions > 3] = 4

# Calculate percentage of each intrinsic dimension
unique, counts = np.unique(intrinsic_dimensions, return_counts=True)
percentages = (counts / len(intrinsic_dimensions)) * 100
percentage_table = pd.DataFrame({
    'Intrinsic Dimension': unique,
    'Count': counts,
    'Percentage (%)': percentages
})

# Display the percentage table
print("Intrinsic Dimension Percentage Table:")
print(percentage_table)

# Plot the 3D graph with intrinsic dimensions as colors
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=intrinsic_dimensions, cmap='viridis', s=10)
ax.set_title('3D Scatter Plot Colored by Local Intrinsic Dimension (k=100)')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
colorbar = fig.colorbar(scatter, ax=ax, label='Intrinsic Dimension')
plt.show()
```

**Results**:
- **Percentage Table**:
  | Intrinsic Dimension | Count | Percentage (%) |
  |---------------------|-------|----------------|
  | 1.0                 | 3912  | 65.20          |
  | 2.0                 | 1071  | 17.85          |
  | 3.0                 | 1017  | 16.95          |

**Visualization**:

![3D Scatter Plot with Intrinsic Dimensions](https://github.com/zkivo/pyEDAkit/raw/main/examples/Intrinsic_Dim_Scene_1.png)



#### **2. 1D Helix**

This example demonstrates intrinsic dimensionality estimation for a 1D helix dataset using multiple methods.

**Code**:

```python
from pyEDAkit.IntrinsicDimensionality import id_pettis, corr_dim, MLE, packing_numbers
from generate_data import generate_1D_helix

print('--------- 1D helix ---------')
X = generate_1D_helix(500, plot=True)

idhat = id_pettis(X)
print("Pettis:", idhat)

idhat = corr_dim(X)
print("CorrDim:", idhat)

idhat = MLE(X)
print("MLE:", idhat)

idhat = packing_numbers(X)
print("PackingNumbers:", idhat)
```

**Results**:
| Method           | Estimated Intrinsic Dimension |
|------------------|-------------------------------|
| Pettis           | 1.12                          |
| CorrDim          | 1.05                          |
| MLE              | 1.02                          |
| PackingNumbers   | 0.97                          |

**Visualization**:

![1D Helix](https://github.com/zkivo/pyEDAkit/raw/main/examples/helix.png)



#### **3. 3D Helix**

Finally, we estimate intrinsic dimensionality for a 3D helix dataset.

**Code**:

```python
from pyEDAkit.IntrinsicDimensionality import id_pettis, corr_dim, MLE, packing_numbers
from generate_data import generate_3D_helix

print('--------- 3D helix ---------')
X, _ = generate_3D_helix(2000, 0.05, plot=True)

idhat = id_pettis(X)
print("Pettis:", idhat)

idhat = corr_dim(X)
print("CorrDim:", idhat)

idhat = MLE(X)
print("MLE:", idhat)

idhat = packing_numbers(X)
print("PackingNumbers:", idhat)
```

**Results**:
| Method           | Estimated Intrinsic Dimension |
|------------------|-------------------------------|
| Pettis           | 2.97                          |
| CorrDim          | 1.92                          |
| MLE              | 2.24                          |
| PackingNumbers   | 1.45                          |

**Visualization**:

![3D Helix](https://github.com/zkivo/pyEDAkit/raw/main/examples/3d_helix.png)

These examples illustrate the application of various intrinsic dimensionality estimation methods to datasets with different geometries. The visualizations help validate the results by showing dimensionality estimates in their natural geometric contexts.
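
For reference, the MLE estimator used above is commonly attributed to Levina and Bickel: with distances $T_1(x) \le \dots \le T_k(x)$ from $x$ to its $k$ nearest neighbours, the local estimate is $\hat m_k(x) = \left[\frac{1}{k-1}\sum_{j=1}^{k-1}\log\frac{T_k(x)}{T_j(x)}\right]^{-1}$, averaged over all points. The sketch below implements that formula with NumPy and scikit-learn as an assumption about what `MLE` computes, not as its actual code:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_intrinsic_dim(X, k=10):
    """Levina-Bickel MLE of intrinsic dimension, averaged over all points."""
    # k + 1 neighbours because the nearest neighbour of each point is itself
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    T = dists[:, 1:]                       # drop the zero self-distance
    logs = np.log(T[:, -1:] / T[:, :-1])   # log(T_k / T_j), j = 1..k-1
    m_local = (k - 1) / logs.sum(axis=1)   # per-point estimate
    return m_local.mean()

# 1D helix embedded in 3D: the estimate should be close to 1
t = np.linspace(0, 4 * np.pi, 500)
helix = np.column_stack([np.cos(t), np.sin(t), 0.1 * t])
print(mle_intrinsic_dim(helix))
```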

---

### Normalization Example

This example demonstrates the usage of the normalization functions implemented in **pyEDAkit**, comparing the results with standard libraries like `scikit-learn`. These functions include z-score normalization (with and without mean centering), min-max normalization, and sphering. Visualizations and outputs illustrate their effects on the Iris dataset.

#### Dataset Overview
We use the Iris dataset, focusing on `sepal_length` and `petal_length` features for visualization. The dataset contains three classes: `Iris-setosa`, `Iris-versicolor`, and `Iris-virginica`.

#### Original Data
The original data is plotted to show the unnormalized feature values.

![Original Data](https://github.com/zkivo/pyEDAkit/raw/main/examples/original_data.png)


#### Z-Scores with Zero Mean
Z-score normalization scales data to have a standard deviation of 1 and a mean of 0.

- **Implementation Comparison**: Results are identical to `scikit-learn`'s `StandardScaler` (with default settings).

**Output**:
```plaintext
Is z_scores_zero_mean allclose to sklearn: True
std:  [1. 1.] 
mean:  [ 2.38437160e-16 -9.53748639e-17]
```

![Z-Scores with Zero Mean](https://github.com/zkivo/pyEDAkit/raw/main/examples/z-scores_mean_0.png)


#### Z-Scores Without Zero Mean
Z-score normalization scales data to have a standard deviation of 1 but does not subtract the mean.

- **Implementation Comparison**: Matches `scikit-learn`'s `StandardScaler` (with `with_mean=False`).

**Output**:
```plaintext
Is z_scores_not_zero_mean allclose to sklearn: True
std:  [1. 1.] 
mean:  [7.08193195 2.15226003]
```

![Z-Scores Without Zero Mean](https://github.com/zkivo/pyEDAkit/raw/main/examples/z-scores_not_mean_0.png)


#### Min-Max Normalization
Min-max normalization scales data to fit within the range [0, 1].

- **Implementation Comparison**: Results are identical to `scikit-learn`'s `MinMaxScaler`.

**Output**:
```plaintext
Is min-max norm allclose to sklearn: True
std:  [0.22939135 0.29724345] 
mean:  [0.43008949 0.47025367]
```

![Min-Max Normalization](https://github.com/zkivo/pyEDAkit/raw/main/examples/min-max_norm.png)
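
The computation behind this comparison is simply $(x - \min)/(\max - \min)$ per column; a minimal sketch (plain NumPy against `MinMaxScaler`, not the `pyEDAkit` function) is:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[4.9, 1.4], [7.0, 4.7], [6.3, 6.0]])

# Column-wise min-max scaling to [0, 1]
Z_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
Z_sklearn = MinMaxScaler().fit_transform(X)

print(np.allclose(Z_manual, Z_sklearn))  # True
```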

#### Sphering
Sphering, also known as whitening, removes correlations between features and scales them to have unit variance.

- **Implementation Comparison**: Similar to `scikit-learn`'s `PCA(whiten=True)` but differs due to possible rotations or reflections. These differences are expected.

**Output**:
```plaintext
Is sphering allclose to sklearn: False
std:  [0.99663865 0.99663865] 
mean:  [-5.42444538e-16  3.05497611e-17]
Rotation-tolerant match for sphering vs PCA whiten:  True
```
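
One way such a rotation-tolerant check can be carried out is sketched below: both whitened datasets should have (near-)identity covariance, and the least-squares map from one to the other should be orthogonal. ZCA-style whitening is used as a stand-in for `eda_std.sphering`, so this is an assumption about the method rather than the library's code:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=300)

# ZCA-style sphering: Z = (X - mean) @ C^(-1/2)
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)
vals, vecs = np.linalg.eigh(C)
Z_sphere = Xc @ vecs @ np.diag(vals ** -0.5) @ vecs.T

Z_pca = PCA(whiten=True).fit_transform(X)

# Both are whitened ...
print(np.allclose(np.cov(Z_sphere, rowvar=False), np.eye(2), atol=1e-2))
# ... and differ only by an orthogonal map R (rotation/reflection)
R, *_ = np.linalg.lstsq(Z_sphere, Z_pca, rcond=None)
print(np.allclose(R @ R.T, np.eye(2), atol=1e-2))
```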

**Visualizations**:
- Sphering (pyEDAkit): 

  ![Sphering (pyEDAkit)](https://github.com/zkivo/pyEDAkit/raw/main/examples/Sphering(pyEDAkit).png)

- Sphering (PCA Whiten):

  ![Sphering (PCA Whiten)](https://github.com/zkivo/pyEDAkit/raw/main/examples/Sphering(PCA_whiten).png)


#### Code
Below is the complete code used in this example:
```python
from pyEDAkit import standardization as eda_std
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Scatter plot function
def scatter_plot(x, y, targets, title='Title', class_names=None):
    plt.figure()
    unique_labels = np.unique(targets)
    for lbl in unique_labels:
        mask = (targets == lbl)
        label_str = class_names[lbl] if class_names else f"Class {lbl}"
        plt.scatter(x[mask], y[mask], s=10, label=label_str)
    plt.axhline(0, color='gray', linestyle='--')
    plt.axvline(0, color='gray', linestyle='--')
    plt.title(title)
    plt.legend()
    plt.draw()

# Load Iris dataset
df = pd.read_csv("../datasets/iris/iris.data")
df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
sp_df = df[['sepal_length', 'petal_length']].to_numpy()

y = df['class'].to_numpy()
y = np.where(y == 'Iris-setosa', 0, y)
y = np.where(y == 'Iris-versicolor', 1, y)
y = np.where(y == 'Iris-virginica', 2, y)
y = y.astype(int)
y_names = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Normalization techniques and visualization
scatter_plot(df['sepal_length'], df['petal_length'], y, title='Original data', class_names=y_names)

# Z-scores with mean 0
z_scores_zero_mean = eda_std.with_std_dev(sp_df, zero_mean=True)
scatter_plot(z_scores_zero_mean[:, 0], z_scores_zero_mean[:, 1], y, title='z-scores with mean 0', class_names=y_names)

# Z-scores without mean 0
z_scores_not_zero_mean = eda_std.with_std_dev(sp_df, zero_mean=False)
scatter_plot(z_scores_not_zero_mean[:, 0], z_scores_not_zero_mean[:, 1], y, title='z-scores with NOT mean 0', class_names=y_names)

# Min-max normalization
Z_minmax = eda_std.min_max_norm(sp_df)
scatter_plot(Z_minmax[:, 0], Z_minmax[:, 1], y, title='min-max normalization', class_names=y_names)

# Sphering
Z_sphere = eda_std.sphering(sp_df)
scatter_plot(Z_sphere[:, 0], Z_sphere[:, 1], y, title='Sphering (pyEDAkit)', class_names=y_names)

pca = PCA(whiten=True)
pca_data = pca.fit_transform(sp_df)
scatter_plot(pca_data[:, 0], pca_data[:, 1], y, title='Sphering (PCA whiten)', class_names=y_names)
```




---
## Dependencies
This repository requires the following Python libraries:
- `numpy>=1.21.0`
- `scipy>=1.7.0`
- `matplotlib>=3.4.0`
- `seaborn>=0.11.0`
- `scikit-learn>=0.24.0`
- `networkx>=2.5`
- `pandas>=1.3.0`


---

## Contributing
Feel free to fork this repository, report issues, or contribute by adding new MATLAB-style functions or improving existing ones. 

---

## License
This project is licensed under the MIT License. See `LICENSE` for more details.


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "pyEDAkit",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "data analysis, EDA, machine learning",
    "author": null,
    "author_email": "Alberto Biscalchin <biscalchin.mau.se@gmail.com>, Marco Schivo <marcoschivo1@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/f6/49/200063563637d5fb9c71f8ae8208a388cb4d6f937ac42c3db051eb65e234/pyedakit-0.1.4.tar.gz",
    "platform": null,
    "description": "# Exploratory Data Analysis (EDA) Python Toolkit\r\nThe project is sponsored by Malm\u00f6 Universitet developed by Eng. Marco Schivo and Eng. Alberto Biscalchin under the supervision of Associete Professor Yuanji Cheng and is released under the MIT License. It is open source and available for anyone to use and contribute to.\r\n\r\nInternal course Code reference: MA661E\r\n\r\n## Installation Instructions\r\n\r\nTo install the `pyEDAkit` package, follow the steps below:\r\n\r\n1. **Prerequisites**:  \r\n   Ensure you have Python 3.7 or higher installed. You can download it from [python.org](https://www.python.org/).\r\n\r\n2. **Install from PyPI**:  \r\n   Use the following command to install `pyEDAkit`:\r\n\r\n   ```bash\r\n   pip install pyEDAkit\r\n   ```\r\n\r\n3. **Verify Installation**:  \r\n   After installation, verify it by running:\r\n\r\n   ```bash\r\n   python -c \"import pyEDAkit; print('pyEDAkit installed successfully!')\"\r\n   ```\r\n\r\n4. **Optional (Upgrade)**:  \r\n   To upgrade to the latest version:\r\n\r\n   ```bash\r\n   pip install --upgrade pyEDAkit\r\n   ```\r\n\r\nYou're now ready to use `pyEDAkit`.\r\n\r\n## Warnings\r\nPlease note that this project is still under development and is not yet ready for production use. The code is subject to change, and new features will be added over time. We welcome contributions and feedback from the community to improve the toolkit.\r\n\r\n## Overview\r\nThis repository implements MATLAB-style functions in Python for various data analysis, clustering, dimensionality reduction, and graph algorithms. These functions leverage popular Python libraries such as `numpy`, `scipy`, `matplotlib`, `networkx`, and `scikit-learn`, while maintaining a familiar MATLAB-like syntax and behavior.\r\n\r\n---\r\n\r\n## Features\r\n\r\n### 1. **Clustering and Dimensionality Reduction**\r\n- **Hierarchical Clustering** (`linkage`, `cluster`):\r\n  - MATLAB-style hierarchical clustering using `scipy.cluster.hierarchy`.\r\n  - Supports `single`, `complete`, `average`, `ward`, and other linkage methods.\r\n  - MATLAB-like `cluster` function for cutting hierarchical clusters based on distance or cluster count.\r\n\r\n- **K-Means Clustering** (`kmeans`):\r\n  - MATLAB-style K-Means implementation with support for:\r\n    - Initialization methods: `k-means++`, random, or user-specified.\r\n    - Metrics: `sqeuclidean` distance.\r\n    - Number of replicates and maximum iterations.\r\n  - Outputs cluster assignments, centroids, and within-cluster sum of distances.\r\n\r\n- **Silhouette Analysis** (`silhouette`):\r\n  - Computes silhouette values to evaluate clustering quality.\r\n  - Supports various distance metrics, including `euclidean`, `manhattan`, `cosine`, and `minkowski`.\r\n  - Generates detailed silhouette plots for cluster visualization.\r\n\r\n- **Silhouette Criterion Evaluation** (`SilhouetteEvaluation` class):\r\n  - Evaluates clustering solutions for different cluster counts (`k`) using the silhouette criterion.\r\n  - Identifies the optimal number of clusters (`OptimalK`) based on silhouette scores.\r\n  - Handles missing data and supports weighted or unweighted silhouette averages.\r\n  - Visualizes silhouette criterion scores vs. 
the number of clusters.\r\n\r\n- **PCA and SVD**:\r\n  - `PCA`: Computes Principal Components and visualizes scree plots and scatter matrices.\r\n  - `SVD`: Performs Singular Value Decomposition with visualization of singular values.\r\n\r\n- **Non-Negative Matrix Factorization (NMF)**:\r\n  - Reduces data dimensionality while ensuring non-negativity constraints.\r\n\r\n- **Factor Analysis (FA)**:\r\n  - Estimates latent factors using sklearn's `FactorAnalysis`.\r\n\r\n- **Linear Discriminant Analysis (LDA)**:\r\n  - Reduces dimensionality while maximizing class separability.\r\n\r\n- **Random Projection**:\r\n  - Performs Gaussian Random Projection to reduce dimensionality.\r\n\r\n---\r\n\r\n### 2. **Graph Algorithms**\r\n- **Minimum Spanning Tree (MST)** (`minspantree`):\r\n  - MATLAB-style wrapper using `networkx`.\r\n  - Supports both Prim's (`dense`) and Kruskal's (`sparse`) algorithms.\r\n  - Option to extract a spanning tree for a specific connected component or spanning forest.\r\n\r\n---\r\n\r\n### 3. **Intrinsic Dimensionality Estimation**\r\n- **Packing Numbers**:\r\n  - Computes intrinsic dimensionality using packing arguments.\r\n\r\n- **Maximum Likelihood Estimation (MLE)**:\r\n  - Estimates intrinsic dimensionality using a k-Nearest Neighbor approach.\r\n\r\n- **Correlation Dimension**:\r\n  - Estimates intrinsic dimensionality via correlation methods.\r\n\r\n- **Pettis Method**:\r\n  - Computes intrinsic dimensionality using Pettis et al.'s algorithm.\r\n\r\n---\r\n\r\n### 4. **Normalization**\r\n- **Standardization (`with_std_dev`)**:\r\n  - Standardizes data with zero-mean and unit variance.\r\n  \r\n- **Min-Max Normalization (`min_max_norm`)**:\r\n  - Rescales data to a range between 0 and 1.\r\n\r\n- **Sphering (`sphering`)**:\r\n  - Whitens data, decorrelating variables and setting variance to 1.\r\n\r\n---\r\n\r\n### 5. **Clustering Quality Metrics**\r\n- **Cophenetic Correlation** (`cophenet`):\r\n  - Computes the cophenetic correlation coefficient to measure how well a dendrogram preserves the original pairwise distances.\r\n\r\n- **Silhouette Evaluation and Visualization**:\r\n  - `silhouette`: Computes silhouette values and plots silhouette scores for each cluster.\r\n  - `SilhouetteEvaluation`: Evaluates and visualizes the silhouette criterion for determining the optimal number of clusters.\r\n\r\n---\r\n# Examples\r\n!IMPORTANT: The examples are not finished yet, they are just a draft of what we are going to implement.\r\nThe import statements are placeholders and need to be replaced with the actual module name that will be available through PiPy soon.\r\n## Clustering\r\n### **`Linkage` Function**\r\n\r\nThis example demonstrates the usage of the `linkage` function for hierarchical clustering. The `linkage` function builds a hierarchical cluster tree (also known as a dendrogram) using various linkage methods. 
We show two use cases: clustering a large dataset into groups and visualizing the hierarchy using a dendrogram.\r\n\r\n\r\n#### Python Code:\r\n```python\r\nimport numpy as np\r\nimport matplotlib.pyplot as plt\r\nfrom pyEDAkit.clustering import linkage, cluster\r\nfrom scipy.cluster.hierarchy import dendrogram\r\nfrom scipy.spatial.distance import squareform\r\nfrom mpl_toolkits.mplot3d import Axes3D\r\n\r\ndef test_linkage():\r\n    # Step 1: Randomly generate sample data with 20,000 observations\r\n    np.random.seed(0)  # For reproducibility\r\n    X = np.random.rand(20000, 3)\r\n\r\n    # Step 2: Create a hierarchical cluster tree using the ward linkage method\r\n    Z = linkage(X, method='ward')\r\n\r\n    # Step 3: Cluster the data into a maximum of four groups\r\n    max_clusters = 4\r\n    cluster_labels = cluster(Z, 'MaxClust', max_clusters, criterion='maxclust')\r\n\r\n    # Step 4: Plot the result in 3D\r\n    fig = plt.figure(figsize=(10, 8))\r\n    ax = fig.add_subplot(111, projection='3d')\r\n\r\n    scatter = ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=cluster_labels, cmap='viridis', s=10)\r\n    ax.set_xlabel('X-axis')\r\n    ax.set_ylabel('Y-axis')\r\n    ax.set_zlabel('Z-axis')\r\n\r\n    plt.title('3D Scatter Plot of Hierarchical Clustering')\r\n    plt.colorbar(scatter, ax=ax, label='Cluster Label')\r\n    plt.show()\r\n\r\n    # Step 5: Define the dissimilarity matrix\r\n    X = np.array([\r\n        [0, 1, 2, 3],\r\n        [1, 0, 4, 5],\r\n        [2, 4, 0, 6],\r\n        [3, 5, 6, 0]\r\n    ])\r\n\r\n    # Step 6: Convert the dissimilarity matrix to vector form using squareform\r\n    y = squareform(X)\r\n\r\n    # Step 7: Create a hierarchical cluster tree using the 'complete' method\r\n    Z = linkage(y, method='complete')\r\n\r\n    # Step 8: Print the resulting linkage matrix\r\n    print(\"Linkage matrix (Z):\")\r\n    print(Z)\r\n\r\n    # Step 9: Plot the dendrogram\r\n    plt.figure(figsize=(10, 6))\r\n    dendrogram(\r\n        Z,\r\n        labels=[1, 2, 3, 4],  # Use MATLAB-style indices for the leaf nodes\r\n        leaf_font_size=10      # Adjust font size for clarity\r\n    )\r\n    plt.title('Dendrogram')\r\n    plt.xlabel('Leaf Nodes')\r\n    plt.ylabel('Linkage Distance')\r\n    plt.show()\r\n\r\ntest_linkage()\r\n```\r\n\r\n### Visualizations\r\n\r\n#### **3D Scatter Plot of Hierarchical Clustering**\r\nThis plot visualizes the clusters formed by hierarchical clustering on a randomly generated dataset of 20,000 observations. The data points are colored by their cluster labels (maximum of 4 clusters).\r\n\r\n![3D Scatter Plot](https://github.com/zkivo/pyEDAkit/raw/main/examples/hierarchical_clustering_scatter_3d.png)\r\n\r\n\r\n#### **Dendrogram**\r\nThe dendrogram represents the hierarchical clustering of a small dataset, built from a dissimilarity matrix. The `complete` linkage method is used to compute the hierarchical structure, and the result is visualized as a dendrogram.\r\n\r\n![Dendrogram](https://github.com/zkivo/pyEDAkit/raw/main/examples/dendrogram.png)\r\n\r\n### **`Cluster` Function**\r\n\r\nThe `cluster` function is a MATLAB-style wrapper for SciPy's `fcluster` function, allowing flexible and intuitive hierarchical clustering. This example demonstrates its usage with various clustering criteria, such as distance thresholds, inconsistent measures, and a fixed number of clusters. 
### **`Cluster` Function**

The `cluster` function is a MATLAB-style wrapper for SciPy's `fcluster` function, allowing flexible and intuitive hierarchical clustering. This example demonstrates its usage with various clustering criteria, such as distance thresholds, inconsistent measures, and a fixed number of clusters. Additionally, it supports multiple cutoffs to produce a matrix of cluster assignments.

#### Example:

```python
import numpy as np
from pyEDAkit.clustering import cluster, linkage

def test_cluster():
    # Generate sample data
    X = np.random.rand(10, 3)

    # Compute linkage matrix
    Z = linkage(X, method='ward')

    # 1) Cut off by distance = 0.7
    T_distance = cluster(Z, 'Cutoff', 0.7, 'Criterion', 'distance')

    # 2) Cut off by inconsistent measure
    T_inconsist = cluster(Z, 'Cutoff', 1.5)

    # 3) Force a maximum of 3 clusters
    T_maxclust = cluster(Z, 'MaxClust', 3)

    # 4) Multiple cutoffs -> T is an m-by-l matrix
    T_multi = cluster(Z, 'Cutoff', [0.7, 1.0, 1.5], 'Criterion', 'distance')
    print(T_multi.shape)  # (10, 3)

test_cluster()
```

#### Output:

This example showcases the flexibility of the `cluster` function. Below is the output from the final step, where multiple cutoffs are used:

```bash
(10, 3)
```

### Key Points:

1. **Cut off by Distance**: Creates clusters by specifying a distance threshold. For example:
   ```python
   T_distance = cluster(Z, 'Cutoff', 0.7, 'Criterion', 'distance')
   ```

2. **Cut off by Inconsistent Measure**: Uses the default 'inconsistent' criterion for clustering:
   ```python
   T_inconsist = cluster(Z, 'Cutoff', 1.5)
   ```

3. **Force a Maximum Number of Clusters**: Ensures the data is divided into a fixed number of clusters:
   ```python
   T_maxclust = cluster(Z, 'MaxClust', 3)
   ```

4. **Multiple Cutoffs**: Produces a matrix where each column corresponds to cluster assignments for a specific cutoff:
   ```python
   T_multi = cluster(Z, 'Cutoff', [0.7, 1.0, 1.5], 'Criterion', 'distance')
   ```

The flexibility of `cluster` makes it an ideal choice for hierarchical clustering tasks requiring MATLAB-like functionality in Python.

---

### **`K-means` Clustering**

The `kmeans` function, imported from the `pyEDAkit.clustering` module, provides a MATLAB-style implementation of the K-means algorithm, allowing intuitive and flexible clustering with support for optional parameters such as the number of replicates and maximum iterations.

#### Example:

```python
import numpy as np
import pandas as pd
from pyEDAkit.clustering import kmeans
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.metrics import accuracy_score

def test_kmeans():
    # Load Iris dataset
    iris_path = "../datasets/iris_dataset.csv"
    iris_data = pd.read_csv(iris_path)

    # Use only petal_length and petal_width features (2D data)
    X = iris_data.iloc[:, [2, 3]].values  # Columns for petal_length and petal_width
    y_true = iris_data.iloc[:, -1].values  # True labels (species)

    # Map the target labels to numeric values for true labels
    label_mapping = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    y_numeric = np.array([label_mapping[label] for label in y_true])

    # Step 1: Apply k-means clustering using pyEDAkit's kmeans function
    k = 3  # Number of clusters
    idx, C, sumd, D = kmeans(X, k, 'Distance', 'sqeuclidean', 'Replicates', 5, 'MaxIter', 300)

    # Step 2: Create a 2D grid for the feature space
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05),
                         np.arange(y_min, y_max, 0.05))

    # Combine the grid into a list of points
    grid_points = np.c_[xx.ravel(), yy.ravel()]  # Flatten the grid into points for classification

    # Define the function to assign grid points to clusters
    def assign_to_clusters(grid_points, centroids):
        """
        Assign grid points to the nearest cluster based on centroids.
        """
        distances = np.linalg.norm(grid_points[:, None] - centroids, axis=2)  # Euclidean distance
        return np.argmin(distances, axis=1)  # Assign to the nearest centroid

    # Step 3: Assign each grid point to a cluster
    Z = assign_to_clusters(grid_points, C)
    Z = Z.reshape(xx.shape)  # Reshape to match the grid dimensions

    # Step 4: Plot the results
    plt.figure(figsize=(10, 6))

    # Plot cluster regions
    cmap = ListedColormap(['#FFCCCC', '#CCFFCC', '#CCCCFF'])  # Colors for regions
    plt.contourf(xx, yy, Z, cmap=cmap, alpha=0.5)

    # Plot data points
    plt.scatter(X[:, 0], X[:, 1], c=idx - 1, cmap='viridis', edgecolor='k', s=50, label='Data')

    # Plot centroids
    plt.scatter(C[:, 0], C[:, 1], c='red', s=200, marker='X', label='Centroids')

    # Add labels and title
    plt.title("K-means Clustering with Correctly Colored Areas")
    plt.xlabel("Petal Length (cm)")
    plt.ylabel("Petal Width (cm)")
    plt.legend(loc='best')
    plt.show()

    # Step 5: Print results
    print("Cluster assignments (idx):")
    print(idx)
    print("\nCentroids (C):")
    print(C)
    print("\nWithin-cluster sum of distances (sumd):")
    print(sumd)

test_kmeans()
```

#### Centroids displacement plot
![K-Means Example](https://github.com/zkivo/pyEDAkit/raw/main/examples/k-means.png)

#### Bash Output:

```bash
Cluster assignments (idx):
[2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
 2. 2. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 3. 1. 3. 3. 3. 3. 3. 1. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 1. 1. 1. 1. 1. 1. 3. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 3.
 1. 1. 1. 1. 1. 1. 3. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 3. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]

Centroids (C):
[[5.59583333 2.0375    ]
 [1.464      0.244     ]
 [4.26923077 1.34230769]]

Within-cluster sum of distances (sumd):
[16.29166667  2.0384     13.05769231]
```

#### Key Points:

1. **Basic Clustering**:
   - `kmeans` returns the cluster assignments, centroids, within-cluster sums of distances, and point-to-centroid distances.

2. **Iris Dataset Example**:
   - The Iris dataset is clustered on two features (petal length and petal width); the true species labels are loaded so the clusters can be compared against them.

3. **Visualization**:
   - The two petal features are plotted directly, so no further dimensionality reduction is needed. Colored areas represent the cluster regions, and centroids are marked with a red `X`.

4. **Accuracy**:
   - Clustering accuracy can be computed by mapping each cluster to the closest true label, as sketched below.

---
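The accuracy computation mentioned in point 4 is not part of the example above. A minimal sketch of one way to do it, reusing `idx` and `y_numeric` from `test_kmeans` (the majority-label mapping below is an assumption for illustration, not a pyEDAkit API):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def cluster_accuracy(cluster_labels, true_labels):
    """Map each cluster to its most frequent true label, then score the relabelled assignment."""
    cluster_labels = np.asarray(cluster_labels).astype(int)
    true_labels = np.asarray(true_labels).astype(int)
    mapped = np.empty_like(true_labels)
    for c in np.unique(cluster_labels):
        mask = cluster_labels == c
        mapped[mask] = np.bincount(true_labels[mask]).argmax()  # majority true label in cluster c
    return accuracy_score(true_labels, mapped)

# Example usage inside test_kmeans(), after the call to kmeans():
# print("Accuracy:", cluster_accuracy(idx, y_numeric))
```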
### **`minspantree` - Minimum Spanning Tree**

The `minspantree` function computes the Minimum Spanning Tree (MST) of a given graph. It supports both Prim's and Kruskal's algorithms and can return the MST for a specific component (`Type='tree'`) or the entire graph (`Type='forest'`).

#### Example:

```python
import networkx as nx
import matplotlib.pyplot as plt
from pyEDAkit.clustering import minspantree

def test_minspantree():
    # Create a graph with weighted edges
    G = nx.Graph()
    G.add_weighted_edges_from([
        (1, 2, 2.0),
        (2, 3, 1.5),
        (2, 4, 3.0),
        (1, 5, 4.0),
        (3, 5, 2.5),
        (4, 5, 1.0),
        (5, 6, 2.0)
    ])

    # Visualize the original graph
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 3, 1)
    pos = nx.spring_layout(G, seed=42)  # For consistent layout
    nx.draw(G, pos, with_labels=True, node_color='lightblue', edge_color='gray', node_size=1000, font_size=10)
    labels = nx.get_edge_attributes(G, 'weight')
    nx.draw_networkx_edge_labels(G, pos, edge_labels=labels)
    plt.title("Original Graph")

    # 1) Compute MST using Prim's algorithm (default method)
    T_prim, pred_prim = minspantree(G)  # Assuming minspantree implements Prim's by default

    # Visualize MST with Prim's algorithm
    plt.subplot(1, 3, 2)
    nx.draw(T_prim, pos, with_labels=True, node_color='lightgreen', edge_color='blue', node_size=1000, font_size=10)
    labels = nx.get_edge_attributes(T_prim, 'weight')
    nx.draw_networkx_edge_labels(T_prim, pos, edge_labels=labels)
    plt.title("MST (Prim's Algorithm)")

    # 2) Compute MST using Kruskal's algorithm (Method='sparse')
    T_kruskal, pred_kruskal = minspantree(G, 'Method', 'sparse', 'Root', 2, 'Type', 'forest')

    # Visualize MST with Kruskal's algorithm
    plt.subplot(1, 3, 3)
    nx.draw(T_kruskal, pos, with_labels=True, node_color='lightcoral', edge_color='purple', node_size=1000, font_size=10)
    labels = nx.get_edge_attributes(T_kruskal, 'weight')
    nx.draw_networkx_edge_labels(T_kruskal, pos, edge_labels=labels)
    plt.title("MST (Kruskal's Algorithm)")

    plt.tight_layout()
    plt.show()

    # Print details of the MSTs
    print("\n--- MST using Prim's Algorithm ---")
    print("Edges of T (Prim):", list(T_prim.edges(data=True)))
    print("Predecessors (Prim):", pred_prim)

    print("\n--- MST using Kruskal's Algorithm ---")
    print("Edges of T (Kruskal):", list(T_kruskal.edges(data=True)))
    print("Predecessors (Kruskal):", pred_kruskal)

test_minspantree()
```

#### Visualization

- **Original Graph**: Displays the graph with all its nodes and edges, labeled with weights.
- **MST (Prim's Algorithm)**: Highlights the MST computed using Prim's algorithm, rooted at node 1.
- **MST (Kruskal's Algorithm)**: Displays the MST computed using Kruskal's algorithm, including a spanning forest for all components.

![Minimum Spanning Tree](https://github.com/zkivo/pyEDAkit/raw/main/examples/minspantree.png)

#### Bash Output

```bash
--- MST using Prim's Algorithm ---
Edges of T (Prim): [(1, 2, {'weight': 2.0}), (2, 3, {'weight': 1.5}), (3, 5, {'weight': 2.5}), (4, 5, {'weight': 1.0}), (5, 6, {'weight': 2.0})]
Predecessors (Prim): {1: 0, 2: 1, 3: 2, 4: 5, 5: 3, 6: 5}

--- MST using Kruskal's Algorithm ---
Edges of T (Kruskal): [(1, 2, {'weight': 2.0}), (2, 3, {'weight': 1.5}), (3, 5, {'weight': 2.5}), (4, 5, {'weight': 1.0}), (5, 6, {'weight': 2.0})]
Predecessors (Kruskal): {1: 2, 2: 0, 3: 2, 4: 5, 5: 3, 6: 5}

Process finished with exit code 0
```

#### Key Points

1. **Prim's Algorithm**:
   - Grows the MST from a specific root node.
   - Produces a single tree for the connected component containing the root.

2. **Kruskal's Algorithm**:
   - Builds the MST by adding edges with the smallest weights.
   - Can generate a spanning forest if the graph is disconnected.

3. **Visualization**:
   - Easily compares the original graph with the MSTs generated by different algorithms.

4. **Customizability**:
   - Supports options like specifying the root node and generating a forest for disconnected graphs.
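For reference, the underlying `networkx` calls look roughly like this; a sketch of what a wrapper over `networkx` can delegate to, not the exact pyEDAkit internals:

```python
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([(1, 2, 2.0), (2, 3, 1.5), (2, 4, 3.0),
                           (1, 5, 4.0), (3, 5, 2.5), (4, 5, 1.0), (5, 6, 2.0)])

# Prim grows a tree from a starting node; Kruskal sorts edges globally.
T_prim = nx.minimum_spanning_tree(G, algorithm="prim")
T_kruskal = nx.minimum_spanning_tree(G, algorithm="kruskal")

# For this connected graph both algorithms return a tree of total weight 9.0.
print(sorted(T_prim.edges(data="weight")))
print(sorted(T_kruskal.edges(data="weight")))
```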
---

### **`cophenet` - Cophenetic Correlation Coefficient**

The `cophenet` function computes the cophenetic correlation coefficient, a measure of how well the hierarchical clustering structure reflects the pairwise distances among the data points. It also provides the cophenetic distances for the linkage matrix.

#### Enhanced Example:

```python
import numpy as np
from pyEDAkit.IntrinsicDimensionality import pdist
from pyEDAkit.clustering import linkage, cophenet
import matplotlib.pyplot as plt

def test_cophenet():
    # Step 1: Generate sample data
    np.random.seed(42)
    X = np.vstack([
        np.random.rand(5, 2),       # Cluster 1
        np.random.rand(5, 2) + 5   # Cluster 2 (shifted by +5)
    ])

    # Step 2: Compute pairwise distances
    Y = pdist(X)

    # Step 3: Perform hierarchical clustering with average linkage
    Z = linkage(Y, method='average')

    # Step 4: Compute cophenetic correlation coefficient
    c, d = cophenet(Z, Y)

    # Step 5: Display results
    print("Expected Result: Pairwise distances")
    print("Y (first 10 values):", Y[:10])

    print("\nActual Result: Cophenetic distances")
    print("d (first 10 values):", d[:10])

    print("\nCophenetic correlation coefficient:", c)

    # Step 6: Visualize clustering with dendrogram
    plt.figure(figsize=(10, 5))
    plt.title("Dendrogram")
    plt.xlabel("Data Points")
    plt.ylabel("Distance")
    from scipy.cluster.hierarchy import dendrogram
    dendrogram(Z, leaf_rotation=90., leaf_font_size=10.)
    plt.show()

# Run the example
test_cophenet()
```

#### **Explanation**:

1. **Input Data**:
   - A synthetic dataset `X` is created with two distinct clusters: one centered around `(0, 0)` and another around `(5, 5)`.

2. **Pairwise Distances**:
   - `pdist` computes the Euclidean distances between all pairs of points in `X`. These are the **expected pairwise distances**.

3. **Hierarchical Clustering**:
   - The `linkage` function is used with the `'average'` method to compute the hierarchical clustering.

4. **Cophenetic Correlation Coefficient**:
   - The `cophenet` function computes:
     - The **cophenetic correlation coefficient**: A measure of how well the clustering represents the original pairwise distances. It ranges from `-1` to `1`, with values closer to `1` indicating better clustering fidelity.
     - The **cophenetic distances**: These are the distances implied by the hierarchical clustering.

5. **Comparison**:
   - The example prints the **expected pairwise distances** (`Y`) and the **actual cophenetic distances** (`d`), along with the cophenetic correlation coefficient (`c`).

6. **Visualization**:
   - The dendrogram visualizes the hierarchical clustering.
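Concretely, the cophenetic correlation coefficient is conventionally defined as the linear (Pearson) correlation between the original distances $Y_{ij}$ and the cophenetic distances $d_{ij}$, where $d_{ij}$ is the linkage height at which points $i$ and $j$ are first merged:

$c = \frac{\sum_{i<j}\,(Y_{ij}-\bar{Y})(d_{ij}-\bar{d})}{\sqrt{\sum_{i<j}(Y_{ij}-\bar{Y})^2 \,\sum_{i<j}(d_{ij}-\bar{d})^2}}$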
#### **Expected Output**:

```bash
Expected Result: Pairwise distances
Y (first 10 values): [0.5017136  0.82421549 0.32755369 0.33198071 6.83944824 6.92460439
 6.40512716 6.72486612 6.66463903 0.72642889]

Actual Result: Cophenetic distances
d (first 10 values): [0.5310849  0.7441755  0.32755369 0.5310849  6.90984673 6.90984673
 6.90984673 6.90984673 6.90984673 0.7441755 ]

Cophenetic correlation coefficient: 0.9966209146535578

Process finished with exit code 0
```
![Cophenetic Correlation](https://github.com/zkivo/pyEDAkit/raw/main/examples/Cophenetic_corr.png)

#### **Key Takeaways**:

- The **cophenetic correlation coefficient** (`c ≈ 0.997`) indicates that the hierarchical clustering structure is a very good representation of the original pairwise distances.
- The **cophenetic distances** (`d`) closely match the original distances in `Y`.
- The dendrogram visually confirms the clustering structure, with two distinct groups evident.

---

### **`silhouette` - Evaluating Clustering Quality**

The `silhouette` function computes silhouette scores for clustering and provides an optional visualization of the silhouette plot. It measures how similar an object is to its own cluster compared to other clusters, with scores ranging from `-1` to `1`. Higher scores indicate better-defined clusters.

#### **Example Usage**

```python
import numpy as np
from pyEDAkit.clustering import silhouette

def test_silhouette():
    # Step 1: Generate synthetic data
    X = np.random.rand(20, 2)  # 20 data points with 2 features
    labels = np.random.randint(0, 3, size=20)  # Assign random labels for 3 clusters

    # Step 2: Silhouette analysis using default Euclidean distance with visualization
    s, fig = silhouette(X, labels)  # Default parameters
    print("Silhouette values:\n", s)

    # Step 3: Silhouette analysis using Minkowski distance (p=3) without visualization
    s2, _ = silhouette(X, labels, Distance='minkowski', DistParameter={'p': 3}, do_plot=False)
    print("Silhouette values with Minkowski distance, p=3:\n", s2)

# Run the example
test_silhouette()
```

#### **Explanation**

1. **Input Data**:
   - `X`: The dataset with shape `(n_samples, n_features)`. In this example, it consists of 20 randomly generated data points in 2D space.
   - `labels`: Cluster assignments for each data point. Random labels are generated for three clusters.

2. **Silhouette Analysis**:
   - The silhouette score is computed for each data point as:

        $s(i)=\frac{b(i)-a(i)}{\max(a(i),b(i))}$

     where:
     - $a(i)$: Average intra-cluster distance (distance to other points in the same cluster).
     - $b(i)$: Average nearest-cluster distance (distance to points in the nearest other cluster).

3. **Default Case**:
   - By default, the function computes the silhouette scores using the **Euclidean distance** and visualizes the silhouette plot.
4. **Custom Distance**:
   - In the second case, the function uses the **Minkowski distance** with $p = 3$ (specified using the `DistParameter` argument). No plot is produced (`do_plot=False`).

5. **Output**:
   - `s`: An array of silhouette scores for each data point.
   - `fig`: A matplotlib figure object showing the silhouette plot (if visualization is enabled).

#### **Expected Output**

For the given random data, the output will include:

1. **Silhouette Scores (Default Distance)**:
   ```bash
    Silhouette values:
    [-0.20917609 -0.11501692 -0.09117026  0.07603928 -0.49128607 -0.34661769
    -0.20534282 -0.02257866 -0.15532452 -0.36940948 -0.26504272  0.32133183
    -0.08229476  0.30387826 -0.09858198 -0.26869494 -0.59434334  0.08777368
    0.11872492  0.06386749]
   ```

2. **Silhouette Scores (Minkowski Distance, $p=3$)**:
   ```bash
    Silhouette values with Minkowski distance, p=3:
     [-0.19668877 -0.14506527 -0.06741624  0.07712329 -0.48447927 -0.32749879
     -0.1640281  -0.04252481 -0.14681262 -0.39123349 -0.26663026  0.3170098
     -0.09092208  0.30229521 -0.06645392 -0.27165836 -0.5954229   0.07757885
      0.11677314  0.06568025]
   ```

3. **Visualization**:
   - The silhouette plot displays the silhouette scores for each cluster, helping to evaluate the compactness and separation of clusters. The cluster with the largest average silhouette score is the best defined.

#### **Silhouette Plot Visualization**

The silhouette plot provides:
- **Bars**: Represent silhouette scores for individual points, grouped by clusters.
- **Dashed Line**: Represents the average silhouette score for each cluster.

This plot is useful to evaluate the clustering quality visually.

#### **Notes**
- The silhouette score for a single point:
  - **Close to 1**: Well-clustered.
  - **Close to 0**: Overlapping clusters.
  - **Negative**: Misclassified point.
- Custom distance metrics can be specified via the `Distance` parameter (e.g., Minkowski, Manhattan, etc.).
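As a sanity check on the formula above, silhouette values can also be computed by hand with NumPy. The sketch below is a minimal, plain-Euclidean equivalent of scikit-learn's `silhouette_samples`, shown only to make $a(i)$ and $b(i)$ concrete; it is not the pyEDAkit implementation:

```python
import numpy as np

def silhouette_values(X, labels):
    """Compute s(i) = (b(i) - a(i)) / max(a(i), b(i)) with Euclidean distances."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # full pairwise distance matrix
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = (labels == labels[i])
        same[i] = False
        if not same.any():            # singleton cluster: silhouette is 0 by convention
            continue
        a = D[i, same].mean()         # mean distance to the rest of its own cluster
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```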
---
### Silhouette Evaluation Example

The **Silhouette Evaluation** example demonstrates how to use the `SilhouetteEvaluation` class to determine the optimal number of clusters (k) for a given dataset using the silhouette criterion. The class evaluates various cluster solutions and identifies the best one based on the silhouette measure.

#### Example Code:

```python
from pyEDAkit.clustering import SilhouetteEvaluation
import numpy as np
from numpy.random import default_rng

def test_eval_silhouette():
    rng = default_rng(42)
    n = 200
    X1 = rng.multivariate_normal(mean=[2, 2], cov=[[0.9, -0.0255], [-0.0255, 0.9]], size=n)
    X2 = rng.multivariate_normal(mean=[5, 5], cov=[[0.5, 0], [0, 0.3]], size=n)
    X3 = rng.multivariate_normal(mean=[-2, -2], cov=[[1, 0], [0, 0.9]], size=n)
    X = np.vstack([X1, X2, X3])

    # Evaluate the silhouette for k=1..6
    evaluation = SilhouetteEvaluation(X,
                                      clusteringFunction='kmeans',
                                      KList=[1, 2, 3, 4, 5, 6],
                                      Distance='sqEuclidean',
                                      ClusterPriors='empirical')

    print("Inspected K:         ", evaluation.InspectedK)
    print("Criterion Values:    ", evaluation.CriterionValues)
    print("OptimalK:            ", evaluation.OptimalK)

    # Optional: plot the silhouette criterion
    evaluation.plot()

    # The best cluster solution's assignment:
    best_clusters = evaluation.OptimalY
    print("Shape of best cluster assignment:", best_clusters.shape)
    print("Some cluster labels:", np.unique(best_clusters[~np.isnan(best_clusters)]))

test_eval_silhouette()
```

### **Output and Plot**

- **Inspected K**: The list of cluster numbers (k) evaluated.
- **Criterion Values**: Silhouette scores for each k. Higher values indicate better cluster separation.
- **Optimal K**: The k with the highest silhouette score.

#### Example Output:
```bash
Inspected K:          [1 2 3 4 5 6]
Criterion Values:     [nan 0.65458239 0.67860193 0.54828688 0.45514697 0.33797165]
OptimalK:             3
Shape of best cluster assignment: (600,)
Some cluster labels: [1. 2. 3.]
```

#### Plot:
The silhouette values vs. the number of clusters are visualized in the plot below:

![Silhouette Criterion Evaluation](https://github.com/zkivo/pyEDAkit/raw/main/examples/silhouette_eval.png)

### **How It Works**

1. **Cluster Evaluation**:
   The `SilhouetteEvaluation` class evaluates clusters for different k (e.g., 1 to 6) using the silhouette criterion:
   - The silhouette score is computed for each data point as:

        $s(i)=\frac{b(i)-a(i)}{\max(a(i),b(i))}$

     where:
     - $a(i)$: Average intra-cluster distance (distance to other points in the same cluster).
     - $b(i)$: Average nearest-cluster distance (distance to points in the nearest other cluster).

2. **Optimal Number of Clusters**:
   - The silhouette score is calculated for each k.
   - The `OptimalK` property identifies the k with the maximum score.

3. **Cluster Priors**:
   - `'empirical'`: Weighted by cluster sizes.
   - `'equal'`: Equal weight for all clusters.
4. **Visualization**:
   The `plot()` method shows the silhouette values for each k, aiding in identifying the best clustering solution.

This example demonstrates the importance of silhouette analysis for choosing the number of clusters and provides an easy-to-follow implementation of the process.

---

## Dimensionality Reduction Examples

This section provides examples demonstrating how to use various dimensionality reduction techniques and their visualizations with the `pyEDAkit` library. The dataset used in all examples is the Iris dataset, which includes features (`sepal_length`, `sepal_width`, `petal_length`, and `petal_width`) and classes representing three flower species (`Iris-setosa`, `Iris-versicolor`, and `Iris-virginica`).

#### Dataset Preparation

```python
from pyEDAkit import linear as eda_lin
import pandas as pd
import numpy as np

# Make sure to import the dataset from your local directory
df = pd.read_csv("../datasets/iris/iris.data")
df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

# Extract features and target labels
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].to_numpy()
y = df[['class']].to_numpy()[:,0]
y = np.where(y == 'Iris-setosa', 0, y)
y = np.where(y == 'Iris-versicolor', 1, y)
y = np.where(y == 'Iris-virginica', 2, y)
y = y.astype(int)
y_names = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
```

---

### 1. **Principal Component Analysis (PCA)**

Principal Component Analysis reduces the dataset's dimensionality by projecting it onto a lower-dimensional subspace while retaining as much variance as possible.

```python
eda_lin.PCA(X, d=3, plot=True)
```
#### Output (plot=True)
![PCA Example](https://github.com/zkivo/pyEDAkit/raw/main/examples/Iris_PCA.png)

- **Explanation**:
  - The data is reduced to 3 principal components.
  - The plot visualizes the variance explained by each component.
  - Useful for understanding which components capture the most variance.

---

### 2. **Singular Value Decomposition (SVD)**

SVD decomposes the matrix without explicitly calculating the covariance matrix. It provides an efficient way to find the principal components.

```python
eda_lin.SVD(X, plot=True)
```
#### Output (plot=True)
![SVD Example](https://github.com/zkivo/pyEDAkit/raw/main/examples/Iris_SVD.png)

- **Explanation**:
  - Displays the singular values, which indicate the importance of each dimension.
  - Useful for identifying the rank and energy captured by the decomposition.

---

### 3. **Non-negative Matrix Factorization (NMF)**

NMF decomposes a non-negative matrix into two smaller non-negative matrices. Unlike SVD, it enforces non-negativity, which makes it well suited to datasets where negative values do not make sense.

```python
eda_lin.NMF(X, d=4, plot=True)
```
#### Output (plot=True)
![NMF Example](https://github.com/zkivo/pyEDAkit/raw/main/examples/Iris_NMF.png)

- **Explanation**:
  - Decomposes the data into 4 components.
  - The plot shows the contribution of each component to the original dataset.

---
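For orientation, the three one-liners above roughly correspond to the following scikit-learn/NumPy calls. This is a reference sketch, not the pyEDAkit implementation; `X` is the feature matrix from the dataset preparation step:

```python
import numpy as np
from sklearn import decomposition

# PCA: explained variance of the top 3 principal components
pca = decomposition.PCA(n_components=3).fit(X)
print("explained variance ratio:", pca.explained_variance_ratio_)

# SVD: singular values of the mean-centered data (the spectrum PCA is built on)
U, S, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
print("singular values:", S)

# NMF: factor the non-negative matrix X into W @ H with 4 components
nmf = decomposition.NMF(n_components=4, init='nndsvda', max_iter=500)
W = nmf.fit_transform(X)
H = nmf.components_
```

---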
### 4. **Factor Analysis (FA)**

Factor Analysis reduces dimensionality by relating each original variable to a smaller set of latent factors plus a small error term, which allows some flexibility in the fit.

```python
eda_lin.FA(X, d=3, plot=True)
```

#### Output (plot=True)
![FA Example](https://github.com/zkivo/pyEDAkit/raw/main/examples/Iris_FA.png)

- **Explanation**:
  - Reduces the data to 3 factors.
  - Shows how the factors contribute to explaining the dataset's variance.

---

### 5. **Linear Discriminant Analysis (LDA)**

LDA projects the data onto a low-dimensional subspace (at most one dimension fewer than the number of classes) that maximizes class separability, making it a supervised dimensionality reduction method.

```python
eda_lin.LDA(X, y, plot=True)
```

#### Output (plot=True)
![LDA Example](https://github.com/zkivo/pyEDAkit/raw/main/examples/Iris_LDA.png)

- **Explanation**:
  - Projects the data onto the discriminant axes that maximize class separation.
  - Visualization highlights how well the classes are separated along the discriminant axis.

---

### 6. **Random Projection**

Random Projection projects the data points into a random subspace while approximately preserving the pairwise distances between points. It is computationally efficient and effective.

```python
eda_lin.RandProj(X, d=3, plot=True)
```
#### Output (plot=True)
![Random Projection Example](https://github.com/zkivo/pyEDAkit/raw/main/examples/Iris_RandProj.png)

- **Explanation**:
  - Projects the data into a random 3-dimensional subspace.
  - The plot visualizes the points in the new subspace, demonstrating that the relative distances between points are approximately preserved.

---

### Notes:

- **Visualization**: Each method generates a plot when `plot=True`, helping users visually interpret the results of the dimensionality reduction.
- **Flexibility**: The `d` parameter in most functions controls the number of reduced dimensions.
- **Interactivity**: The methods allow exploration of how different dimensionality reduction techniques work with the same dataset.

---

### Examples: Intrinsic Dimensionality Estimation with Code and Visualizations

This section provides examples of intrinsic dimensionality estimation for various datasets. The examples include:

1. **3D Scene Analysis**
2. **1D Helix**
3. **3D Helix**

Each example includes code snippets, results, and visualizations.
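The `MLE` estimator used below is presumably the Levina-Bickel maximum-likelihood estimator; stated here for reference, it estimates the local dimension at a point $x$ from the ratios of its nearest-neighbor distances $T_j(x)$ (distance from $x$ to its $j$-th nearest neighbor):

$\hat{m}_k(x) = \left[ \frac{1}{k-1} \sum_{j=1}^{k-1} \log \frac{T_k(x)}{T_j(x)} \right]^{-1}$

The per-point estimates can then be averaged over the dataset or, as in the scene example below, rounded and inspected locally.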
#### **1. 3D Scene Analysis**

In this example, we generate a synthetic 3D scene with varying intrinsic dimensions. We estimate the intrinsic dimensionality for each point using the `MLE` method and visualize the results.

**Code**:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
import pandas as pd
from pyEDAkit.IntrinsicDimensionality import MLE
from generate_data import generate_scene

print('--------- Scene ---------')
X = generate_scene(plot=False)

# Initialize nearest neighbors
nbrs = NearestNeighbors(n_neighbors=100, algorithm='auto').fit(X)
_, indices = nbrs.kneighbors(X)

# Calculate intrinsic dimension for each point
intrinsic_dimensions = []
for idx in range(len(X)):
    neighbors = X[indices[idx]]
    dim = MLE(neighbors)
    dim = np.round(dim)
    intrinsic_dimensions.append(dim)

# Replace values > 3 with 4
intrinsic_dimensions = np.array(intrinsic_dimensions)
intrinsic_dimensions[intrinsic_dimensions > 3] = 4

# Calculate percentage of each intrinsic dimension
unique, counts = np.unique(intrinsic_dimensions, return_counts=True)
percentages = (counts / len(intrinsic_dimensions)) * 100
percentage_table = pd.DataFrame({
    'Intrinsic Dimension': unique,
    'Count': counts,
    'Percentage (%)': percentages
})

# Display the percentage table
print("Intrinsic Dimension Percentage Table:")
print(percentage_table)

# Plot the 3D graph with intrinsic dimensions as colors
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=intrinsic_dimensions, cmap='viridis', s=10)
ax.set_title('3D Scatter Plot Colored by Local Intrinsic Dimension (k=100)')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
colorbar = fig.colorbar(scatter, ax=ax, label='Intrinsic Dimension')
plt.show()
```

**Results**:
- **Percentage Table**:

  | Intrinsic Dimension | Count | Percentage (%) |
  |---------------------|-------|----------------|
  | 1.0                 | 3912  | 65.20          |
  | 2.0                 | 1071  | 17.85          |
  | 3.0                 | 1017  | 16.95          |

**Visualization**:

![3D Scatter Plot with Intrinsic Dimensions](https://github.com/zkivo/pyEDAkit/raw/main/examples/Intrinsic_Dim_Scene_1.png)

#### **2. 1D Helix**

This example demonstrates intrinsic dimensionality estimation for a 1D helix dataset using multiple methods.

**Code**:

```python
from pyEDAkit.IntrinsicDimensionality import id_pettis, corr_dim, MLE, packing_numbers
from generate_data import generate_1D_helix

print('--------- 1D helix ---------')
X = generate_1D_helix(500, plot=True)

idhat = id_pettis(X)
print("Pettis:", idhat)

idhat = corr_dim(X)
print("CorrDim:", idhat)

idhat = MLE(X)
print("MLE:", idhat)

idhat = packing_numbers(X)
print("PackingNumbers:", idhat)
```

**Results**:

| Method           | Estimated Intrinsic Dimension |
|------------------|-------------------------------|
| Pettis           | 1.12                          |
| CorrDim          | 1.05                          |
| MLE              | 1.02                          |
| PackingNumbers   | 0.97                          |

**Visualization**:

![1D Helix](https://github.com/zkivo/pyEDAkit/raw/main/examples/helix.png)
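Note that `generate_data` is a local helper script used by the examples rather than part of the installed package. A hypothetical stand-in for `generate_1D_helix` (the exact parametrization and the `plot` option of the original script may differ) could look like:

```python
import numpy as np

def generate_1D_helix(n_points, noise=0.0, seed=0):
    """A curve embedded in 3D: intrinsically 1-dimensional despite having 3 coordinates."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 4 * np.pi, n_points)
    X = np.column_stack([np.cos(t), np.sin(t), 0.5 * t])
    return X + noise * rng.standard_normal(X.shape)
```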
#### **3. 3D Helix**

Finally, we estimate intrinsic dimensionality for a 3D helix dataset.

**Code**:

```python
from pyEDAkit.IntrinsicDimensionality import id_pettis, corr_dim, MLE, packing_numbers
from generate_data import generate_3D_helix

print('--------- 3D helix ---------')
X, _ = generate_3D_helix(2000, 0.05, plot=True)

idhat = id_pettis(X)
print("Pettis:", idhat)

idhat = corr_dim(X)
print("CorrDim:", idhat)

idhat = MLE(X)
print("MLE:", idhat)

idhat = packing_numbers(X)
print("PackingNumbers:", idhat)
```

**Results**:

| Method           | Estimated Intrinsic Dimension |
|------------------|-------------------------------|
| Pettis           | 2.97                          |
| CorrDim          | 1.92                          |
| MLE              | 2.24                          |
| PackingNumbers   | 1.45                          |

**Visualization**:

![3D Helix](https://github.com/zkivo/pyEDAkit/raw/main/examples/3d_helix.png)

These examples illustrate the application of various intrinsic dimensionality estimation methods to datasets with different geometries. The visualizations help validate the results by showing dimensionality estimates in their natural geometric contexts.

---

### Normalization Example

This example demonstrates the usage of the normalization functions implemented in **pyEDAkit**, comparing the results with standard libraries like `scikit-learn`. These functions include z-score normalization (with and without mean centering), min-max normalization, and sphering. Visualizations and outputs illustrate their effects on the Iris dataset.

#### Dataset Overview
We use the Iris dataset, focusing on the `sepal_length` and `petal_length` features for visualization. The dataset contains three classes: `Iris-setosa`, `Iris-versicolor`, and `Iris-virginica`.

#### Original Data
The original data is plotted to show the unnormalized feature values.

![Original Data](https://github.com/zkivo/pyEDAkit/raw/main/examples/original_data.png)

#### Z-Scores with Zero Mean
Z-score normalization scales data to have a standard deviation of 1 and a mean of 0.

- **Implementation Comparison**: Results are identical to `scikit-learn`'s `StandardScaler` (with default settings).

**Output**:
```plaintext
Is z_scores_zero_mean allclose to sklearn: True
std:  [1. 1.]
mean:  [ 2.38437160e-16 -9.53748639e-17]
```

![Z-Scores with Zero Mean](https://github.com/zkivo/pyEDAkit/raw/main/examples/z-scores_mean_0.png)

#### Z-Scores Without Zero Mean
Z-score normalization scales data to have a standard deviation of 1 but does not subtract the mean.

- **Implementation Comparison**: Matches `scikit-learn`'s `StandardScaler` (with `with_mean=False`).

**Output**:
```plaintext
Is z_scores_not_zero_mean allclose to sklearn: True
std:  [1. 1.]
mean:  [7.08193195 2.15226003]
```

![Z-Scores Without Zero Mean](https://github.com/zkivo/pyEDAkit/raw/main/examples/z-scores_not_mean_0.png)

#### Min-Max Normalization
Min-max normalization scales data to fit within the range [0, 1].

- **Implementation Comparison**: Results are identical to `scikit-learn`'s `MinMaxScaler`.

**Output**:
```plaintext
Is min-max norm allclose to sklearn: True
std:  [0.22939135 0.29724345]
mean:  [0.43008949 0.47025367]
```

![Min-Max Normalization](https://github.com/zkivo/pyEDAkit/raw/main/examples/min-max_norm.png)

#### Sphering
Sphering, also known as whitening, removes correlations between features and scales them to have unit variance.

- **Implementation Comparison**: Similar to `scikit-learn`'s `PCA(whiten=True)` but differs due to possible rotations or reflections. These differences are expected.

**Output**:
```plaintext
Is sphering allclose to sklearn: False
std:  [0.99663865 0.99663865]
mean:  [-5.42444538e-16  3.05497611e-17]
Rotation-tolerant match for sphering vs PCA whiten:  True
```

**Visualizations**:
- Sphering (pyEDAkit):

  ![Sphering (pyEDAkit)](https://github.com/zkivo/pyEDAkit/raw/main/examples/Sphering(pyEDAkit).png)

- Sphering (PCA Whiten):

  ![Sphering (PCA Whiten)](https://github.com/zkivo/pyEDAkit/raw/main/examples/Sphering(PCA_whiten).png)

#### Code
Below is the complete code used in this example:
```python
from pyEDAkit import standardization as eda_std
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Scatter plot function
def scatter_plot(x, y, targets, title='Title', class_names=None):
    plt.figure()
    unique_labels = np.unique(targets)
    for lbl in unique_labels:
        mask = (targets == lbl)
        label_str = class_names[lbl] if class_names else f"Class {lbl}"
        plt.scatter(x[mask], y[mask], s=10, label=label_str)
    plt.axhline(0, color='gray', linestyle='--')
    plt.axvline(0, color='gray', linestyle='--')
    plt.title(title)
    plt.legend()
    plt.draw()

# Load Iris dataset
df = pd.read_csv("../datasets/iris/iris.data")
df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
sp_df = df[['sepal_length', 'petal_length']].to_numpy()

y = df['class'].to_numpy()
y = np.where(y == 'Iris-setosa', 0, y)
y = np.where(y == 'Iris-versicolor', 1, y)
y = np.where(y == 'Iris-virginica', 2, y)
y = y.astype(int)
y_names = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Normalization techniques and visualization
scatter_plot(df['sepal_length'], df['petal_length'], y, title='Original data', class_names=y_names)

# Z-scores with mean 0
z_scores_zero_mean = eda_std.with_std_dev(sp_df, zero_mean=True)
scatter_plot(z_scores_zero_mean[:, 0], z_scores_zero_mean[:, 1], y, title='z-scores with mean 0', class_names=y_names)

# Z-scores without mean 0
z_scores_not_zero_mean = eda_std.with_std_dev(sp_df, zero_mean=False)
scatter_plot(z_scores_not_zero_mean[:, 0], z_scores_not_zero_mean[:, 1], y, title='z-scores with NOT mean 0', class_names=y_names)

# Min-max normalization
Z_minmax = eda_std.min_max_norm(sp_df)
scatter_plot(Z_minmax[:, 0], Z_minmax[:, 1], y, title='min-max normalization', class_names=y_names)

# Sphering
Z_sphere = eda_std.sphering(sp_df)
scatter_plot(Z_sphere[:, 0], Z_sphere[:, 1], y, title='Sphering (pyEDAkit)', class_names=y_names)

pca = PCA(whiten=True)
pca_data = pca.fit_transform(sp_df)
scatter_plot(pca_data[:, 0], pca_data[:, 1], y, title='Sphering (PCA whiten)', class_names=y_names)

# Display all figures (scatter_plot only calls plt.draw())
plt.show()
```
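For completeness, the rotation ambiguity mentioned above is easy to see if sphering is written out as a covariance whitening transform. The NumPy sketch below illustrates the idea and is not pyEDAkit's exact implementation:

```python
import numpy as np

def sphere(X):
    """ZCA-style whitening: output has zero mean and (near) identity covariance."""
    Xc = X - X.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    W = eigvec @ np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T   # whitening matrix
    return Xc @ W

# Any rotation or reflection applied afterwards (e.g. the one implied by
# PCA(whiten=True)) leaves the identity covariance unchanged, which is why
# the two sphering plots differ only by a rotation/reflection.
```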
---
## Dependencies
This repository requires the following Python libraries:
- `numpy>=1.21.0`
- `scipy>=1.7.0`
- `matplotlib>=3.4.0`
- `seaborn>=0.11.0`
- `scikit-learn>=0.24.0`
- `networkx>=2.5`
- `pandas>=1.3.0`

---

## Contributing
Feel free to fork this repository, report issues, or contribute by adding new MATLAB-style functions or improving existing ones.

---

## License
This project is licensed under the MIT License. See `LICENSE` for more details.
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2024 Marco Schivo  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "A toolkit for exploratory data analysis in Python",
    "version": "0.1.4",
    "project_urls": {
        "homepage": "https://github.com/zkivo/pyEDAkit",
        "repository": "https://github.com/zkivo/pyEDAkit"
    },
    "split_keywords": [
        "data analysis",
        " eda",
        " machine learning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d2d8ae9760b6758370a7b1654481bcb2b11dcc4617f5e836545d78b2d1872877",
                "md5": "437e5556faac611a2bb8ad787afa1cd0",
                "sha256": "61d99f5afd97bda7f6051040bc9d6d8ac1bdb5a7fec1c8327397d391ce70d73e"
            },
            "downloads": -1,
            "filename": "pyEDAkit-0.1.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "437e5556faac611a2bb8ad787afa1cd0",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 36232,
            "upload_time": "2025-01-18T19:01:05",
            "upload_time_iso_8601": "2025-01-18T19:01:05.387560Z",
            "url": "https://files.pythonhosted.org/packages/d2/d8/ae9760b6758370a7b1654481bcb2b11dcc4617f5e836545d78b2d1872877/pyEDAkit-0.1.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f649200063563637d5fb9c71f8ae8208a388cb4d6f937ac42c3db051eb65e234",
                "md5": "ec2b5b2efa6fcdcbfea09ea6e84623a5",
                "sha256": "3708742c83bd91eb99cb9220edb510f4942887e1d1615e245cd86c19e0865c7a"
            },
            "downloads": -1,
            "filename": "pyedakit-0.1.4.tar.gz",
            "has_sig": false,
            "md5_digest": "ec2b5b2efa6fcdcbfea09ea6e84623a5",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 68192,
            "upload_time": "2025-01-18T19:01:09",
            "upload_time_iso_8601": "2025-01-18T19:01:09.622817Z",
            "url": "https://files.pythonhosted.org/packages/f6/49/200063563637d5fb9c71f8ae8208a388cb4d6f937ac42c3db051eb65e234/pyedakit-0.1.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-18 19:01:09",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "zkivo",
    "github_project": "pyEDAkit",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "matplotlib",
            "specs": [
                [
                    "==",
                    "3.8.4"
                ]
            ]
        },
        {
            "name": "MiniSom",
            "specs": [
                [
                    "==",
                    "2.3.3"
                ]
            ]
        },
        {
            "name": "networkx",
            "specs": [
                [
                    "==",
                    "3.3"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "==",
                    "2.2.1"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "==",
                    "2.2.3"
                ]
            ]
        },
        {
            "name": "scikit_learn",
            "specs": [
                [
                    "==",
                    "1.6.1"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    "==",
                    "1.15.1"
                ]
            ]
        },
        {
            "name": "seaborn",
            "specs": [
                [
                    "==",
                    "0.13.2"
                ]
            ]
        }
    ],
    "lcname": "pyedakit"
}
        