# Multivariate Analysis (MVA) Pipeline for Pharmaceutical Manufacturing
This repository contains a sophisticated data analysis pipeline designed to improve yield and reduce anomalies in pharmaceutical manufacturing processes. By leveraging multivariate analysis (MVA), machine learning, and statistical techniques, this pipeline provides deep insights into complex production data, enabling proactive quality control and process optimization.
## 1. The Challenge: Complexity in Pharmaceutical Manufacturing
Pharmaceutical manufacturing is a highly complex and regulated process. It involves numerous stages, each with a multitude of parameters that can influence the final product's quality and yield. Key challenges include:
- **High-Dimensional Data**: A single manufacturing batch can generate thousands of data points, including sensor readings, material measurements, and quality control checks. Analyzing this high-dimensional data using traditional univariate methods (looking at one variable at a time) is often ineffective.
- **Interacting Variables**: Process parameters are rarely independent. A change in one variable (e.g., temperature) can have cascading effects on others (e.g., pressure, reaction rate). These interactions are often non-linear and difficult to detect.
- **Anomaly Detection**: Deviations from the optimal process, or anomalies, can lead to batch failures, reduced yield, and significant financial losses. These anomalies are often subtle and hidden within the process's natural variability.
- **Root Cause Analysis**: When an anomaly or low-yield batch occurs, identifying the root cause is critical but challenging. It requires sifting through vast amounts of data to pinpoint the specific combination of factors responsible for the deviation.
## 2. Our Solution: A Multivariate Approach
This pipeline addresses these challenges by adopting a multivariate approach, which considers all process variables simultaneously. This holistic view allows us to model the relationships between variables and understand the process as an integrated system.
### Core Concepts
#### a. The "Golden Batch"
The "Golden Batch" concept is central to our approach. It refers to an idealized manufacturing run that represents the optimal process conditions, leading to the desired product quality and yield. While a single perfect batch may not exist, we can define a "Golden Profile" or a statistical envelope of normal operating conditions based on historical data from successful batches.
Our pipeline uses data from high-quality batches to learn this Golden Profile. All subsequent batches are then compared against this profile to assess their performance.
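A minimal sketch of what such a Golden Profile envelope could look like, assuming a table of numeric process variables from known-good batches (the column names, synthetic data, and the ±3σ bounds are illustrative assumptions, not the pipeline's actual implementation):

```python
import numpy as np
import pandas as pd

# Illustrative only: 'good_batches' stands in for historical high-quality batches;
# columns are process variables, rows are batches.
rng = np.random.default_rng(0)
good_batches = pd.DataFrame(rng.normal(size=(100, 3)),
                            columns=["temperature", "pressure", "ph"])

# Golden Profile as a per-variable envelope (mean ± 3 standard deviations).
profile = pd.DataFrame({"mean": good_batches.mean(), "std": good_batches.std()})
profile["lower"] = profile["mean"] - 3 * profile["std"]
profile["upper"] = profile["mean"] + 3 * profile["std"]

# A new batch is compared against the envelope, variable by variable.
new_batch = pd.Series({"temperature": 0.2, "pressure": 5.0, "ph": -0.1})
violations = (new_batch < profile["lower"]) | (new_batch > profile["upper"])
print("Out-of-profile variables:", list(violations[violations].index))
```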
#### b. Dimensionality Reduction with PCA
Principal Component Analysis (PCA) is a powerful technique for dimensionality reduction. In a high-dimensional space of process variables, PCA identifies the principal components: the underlying dimensions that capture the most variance in the data.
- **Why we use it**: By projecting the data onto a smaller number of principal components, we can visualize and analyze the process more effectively. This reduces noise and reveals the underlying structure of the data. In our pipeline, we use a supervised version of PCA where the principal components are selected based on their correlation with the final product yield. This ensures that we focus on the process variability that is most impactful to the outcome.
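The following sketch illustrates the idea of ranking components by their correlation with yield. It uses synthetic data and standard scikit-learn PCA; the variable names and the selection rule (keep the components most correlated with yield) are illustrative assumptions, not the package's internal code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: batches x process variables, y: final yield per batch (synthetic stand-ins).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = 2.0 * X[:, 0] + rng.normal(size=200)

# Ordinary PCA first, then rank components by |correlation| with yield.
scores = PCA(n_components=10).fit_transform(StandardScaler().fit_transform(X))
corr = np.array([abs(np.corrcoef(scores[:, i], y)[0, 1]) for i in range(scores.shape[1])])
selected = np.argsort(corr)[::-1][:3]  # keep the components most related to yield
print("Yield-relevant components:", selected, "correlations:", corr[selected].round(2))
```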
#### c. Anomaly Detection with Isolation Forests
An isolation forest is a modern, effective algorithm for detecting anomalies. It works by randomly partitioning the data until each data point is isolated from the others.
- **Why we use it**: Anomalies are "few and different," which means they are more susceptible to isolation. Therefore, they will be isolated in fewer steps than normal data points. The "anomaly score" is based on the average path length required to isolate a data point across many random trees. This method is computationally efficient and works well with high-dimensional data, making it ideal for our use case.
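A minimal sketch of isolation-forest scoring with scikit-learn, using synthetic data in place of the real batch matrix (in the actual pipeline these scores come from the analyze step, not from this snippet):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic batch matrix: mostly normal batches plus a few "few and different" ones.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(195, 20)),   # normal batches
               rng.normal(6, 1, size=(5, 20))])    # anomalous batches

iforest = IsolationForest(n_estimators=200, contamination="auto", random_state=0).fit(X)
scores = -iforest.score_samples(X)  # higher score = shorter average path = more anomalous
print("Most anomalous rows:", np.argsort(scores)[::-1][:5])
```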
#### d. Root Cause Analysis (RCA)
When a batch is flagged as an anomaly or exhibits low yield, we need to understand why. Our root cause analysis module uses machine learning models to identify the key features (process variables) that contributed to the deviation.
- **Why we use it**: By analyzing the feature importance scores from models trained to distinguish between good and bad outcomes, we can pinpoint the specific variables that are most likely responsible for the problem. This provides actionable insights for process engineers to investigate and correct.
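A simplified sketch of this idea: label batches as good or bad by a yield threshold, train a classifier, and read off feature importances. The model choice, threshold, and column names below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical setup: synthetic process parameters and a yield driven mostly by param_3.
rng = np.random.default_rng(2)
features = pd.DataFrame(rng.normal(size=(300, 8)),
                        columns=[f"param_{i}" for i in range(8)])
yield_pct = 90 + 3 * features["param_3"] + rng.normal(0, 1, size=300)
labels = (yield_pct >= 90).astype(int)  # 1 = good outcome, 0 = bad outcome

model = RandomForestClassifier(n_estimators=300, random_state=0).fit(features, labels)
importance = pd.Series(model.feature_importances_, index=features.columns)
print(importance.sort_values(ascending=False).head(3))  # likely candidates for root cause
```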
#### e. Synthetic Data Augmentation
To train robust machine learning models, a large and diverse dataset is often required. In manufacturing, data for certain conditions (especially anomalous ones) may be scarce.
- **Why we use it**: We use synthetic data generation techniques to augment our dataset. By creating new, realistic data points, including plausible anomalies, we can improve the performance and robustness of our anomaly detection and root cause analysis models. This ensures that our models are not "surprised" by novel process conditions.
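One simple augmentation strategy, shown here only as an assumption about how such augmentation might work (not necessarily the technique used internally): jitter good batches with small Gaussian noise and shift a few copies far outside the normal range to create plausible anomalies.

```python
import numpy as np

def augment(X: np.ndarray, n_copies: int = 2, noise: float = 0.05,
            anomaly_fraction: float = 0.05, seed: int = 0) -> np.ndarray:
    """Illustrative augmentation: noisy copies of X plus a few synthetic anomalies."""
    rng = np.random.default_rng(seed)
    std = X.std(axis=0, keepdims=True)
    jittered = [X + rng.normal(0, noise, X.shape) * std for _ in range(n_copies)]
    n_anom = max(1, int(anomaly_fraction * len(X)))
    shifts = 6 * std * rng.choice([-1, 1], (n_anom, X.shape[1]))
    anomalies = X[rng.choice(len(X), n_anom)] + shifts
    return np.vstack([X, *jittered, anomalies])

X_aug = augment(np.random.default_rng(1).normal(size=(50, 5)))
print(X_aug.shape)  # original rows + noisy copies + synthetic anomalies
```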
## 3. The Pipeline at a Glance
The pipeline is structured as a series of modular steps:
1. **Data Extraction**: Raw data from various sources (e.g., database tables from a LIMS or MES) is extracted.
2. **Data Building**: The raw data is transformed and merged into a single "wide" matrix, where each row represents a batch and each column represents a process parameter or measurement.
3. **Data Augmentation**: The batch matrix is augmented with synthetic data to create a more robust dataset for analysis.
4. **Analysis**:
- **Anomaly Detection**: Every batch is scored for its deviation from the "Golden Profile."
- **Supervised PCA**: The relationship between process variability and yield is modeled.
- **Root Cause Analysis**: The key drivers of low yield are identified.
- **Unified Importance**: The results from PCA and RCA are combined to provide a single, unified ranking of the most critical process parameters.
By following this structured, multivariate approach, this pipeline provides a powerful tool for understanding, monitoring, and optimizing complex pharmaceutical manufacturing processes.
## Overview
This package provides a complete analytical toolkit for pharmaceutical manufacturing data, featuring:
- **Anomaly Detection**: Multi-algorithm ensemble for identifying problematic batches
- **Yield Optimization**: PCA and SHAP-based feature importance analysis
- **Root Cause Analysis**: Machine learning-driven insights into yield drivers
- **Tool Interface**: Clean API for external LLM agents and applications
## Installation
### For External Use (Recommended)
Install directly from Git:
```bash
pip install git+https://github.com/your-org/mva-pipeline.git
```
Or clone and install in development mode:
```bash
git clone https://github.com/your-org/mva-pipeline.git
cd mva-pipeline
pip install -e .
```
### Optional Dependencies
Install with LLM integration support:
```bash
pip install "mva-pipeline[llm] @ git+https://github.com/your-org/mva-pipeline.git"
```
Install with development tools:
```bash
pip install -e ".[dev]"
```
## Quick Start - Using the Tools API
### Running the Pipeline
```python
from mva_pipeline import run_pipeline
# Run complete pipeline with caching
result = run_pipeline()
if result['cache_hit']:
    print("🚀 Cache hit! Analytics skipped")
    print(f"Runtime: {result['runtime_seconds']:.1f}s")
else:
    print("🔄 Data changed, running full analytics...")
    print(f"Runtime: {result['runtime_seconds']:.1f}s")
    print(f"Updated: {result['updated']}")

# Access artifacts
print("Available artifacts:")
for name, path in result['artifacts'].items():
    print(f"  {name}: {path}")
```
### Basic Usage
```python
from mva_pipeline.tools import get_tool_specs, get_pipeline_status
# Check what analyses are available
status = get_pipeline_status()
print(f"Available tools: {status['available_tools']}")
# Get all tool specifications for LLM function calling
tools = get_tool_specs()
for tool in tools[:3]:
    print(f"• {tool['name']}: {tool['description']}")
```
### Anomaly Analysis
```python
from mva_pipeline import get_top_anomalies, explain_batch, get_anomaly_statistics
# Get top anomalous batches
anomalies = get_top_anomalies(n=5)
print(f"Top anomaly: Batch {anomalies[0]['doc_id']} (score: {anomalies[0]['score_if']:.2f})")
# Detailed analysis of specific batch
details = explain_batch(doc_id=470)
print(f"Batch 470 anomaly status: {details['anomaly']}")
# Overall statistics
stats = get_anomaly_statistics()
print(f"Anomaly rate: {stats['anomaly_rate']:.1%}")
```
### Yield Driver Analysis
```python
from mva_pipeline import get_top_yield_drivers, get_feature_scores
# Top process parameters affecting yield
drivers = get_top_yield_drivers(n=10)
print(f"Top yield driver: {drivers[0]['feature']} (score: {drivers[0]['unified_score']:.3f})")
# Detailed feature analysis
feature_analysis = get_feature_scores("public.bprpoc_temperature__value_r0")
print(f"PCA score: {feature_analysis['pca_score']:.3f}")
print(f"SHAP score: {feature_analysis['shap_score']:.3f}")
```
### Batch Comparison
```python
from mva_pipeline import compare_batches, find_similar_batches
# Compare specific batches
comparison = compare_batches(doc_ids=[100, 200, 300])
yields = [b['yield'] for b in comparison['batch_comparison']]
print(f"Yield range: {min(yields):.1f} - {max(yields):.1f}")
# Find similar batches
similar = find_similar_batches(doc_id=100, n_similar=5, method="yield")
print(f"Found {len(similar)} similar batches")
```
## 🔧 Command Line Interface
Run the complete analytics pipeline:
```bash
# Extract data from database
mva-pipeline extract
# Build batch matrix
mva-pipeline build
# Run complete analysis
mva-pipeline analyze
# NEW: Run complete pipeline with intelligent caching
mva-pipeline pipeline --verbose
```
### Caching Pipeline
The MVA pipeline now includes intelligent caching that automatically detects when your data has changed and only re-runs analytics when necessary:
```bash
# Run pipeline with caching (recommended)
python -m mva_pipeline.cli pipeline --verbose
# Force rebuild ignoring cache
python -m mva_pipeline.cli pipeline --force
# Skip database extraction (use existing raw data)
python -m mva_pipeline.cli pipeline --skip-extraction
# Use custom raw data directory
python -m mva_pipeline.cli pipeline --raw-dir /path/to/data
```
### How Caching Works
1. **Fingerprinting**: The system computes a SHA1 fingerprint of all Parquet files in your raw data directory based on filename, modification time, and file size.
2. **Cache Check**: Before running expensive analytics, it compares the current fingerprint with the last known fingerprint.
3. **Smart Decisions**:
   - **Cache Hit**: If fingerprints match and all artifacts exist → Fast exit (seconds)
   - **Cache Miss**: If data changed → Full analytics pipeline (minutes)
4. **State Storage**: Fingerprints are stored in Redis (if available), with a fallback to the `.mva_state.json` file.
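The ingredients named above (filename, modification time, file size, SHA1 over the Parquet files) suggest a fingerprint like the sketch below. The helper name and the `data/raw` directory are illustrative assumptions, not the package's actual internals.

```python
import hashlib
from pathlib import Path

def fingerprint_raw_dir(raw_dir: str) -> str:
    """Illustrative fingerprint: SHA1 over (name, mtime, size) of every Parquet file."""
    h = hashlib.sha1()
    for path in sorted(Path(raw_dir).glob("*.parquet")):
        stat = path.stat()
        h.update(f"{path.name}|{stat.st_mtime_ns}|{stat.st_size}".encode())
    return h.hexdigest()

# If this digest matches the stored one and all artifacts exist, analytics can be skipped.
print(fingerprint_raw_dir("data/raw"))  # hypothetical raw-data directory
```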
### Environment Variables
Configure caching behavior with environment variables:
```bash
# Redis URL for state storage (optional)
export MVA_STATE_REDIS_URL="redis://localhost:6379/0"
# State storage type: redis or file
export MVA_STATE_STORE="redis"
# Custom state file location
export MVA_STATE_FILE="/path/to/custom_state.json"
```
### Configuration
Add caching configuration to your `config.yaml`:
```yaml
# State management for pipeline caching
state_store: redis # Options: redis, file
state_file: ".mva_state.json" # Fallback file location
```
## 📋 Feature Mapping: From Technical Names to Business Insights
The MVA pipeline automatically converts technical statistical feature names into meaningful business concepts for improved user experience. This ensures that business users can understand the analysis results without needing deep technical knowledge.
### How It Works
**The Challenge**: Machine learning models work with statistical aggregations like `public.bprpoc_temp_records__temperature_max` or `public.atrs_test_details__results_std`, which are confusing for business users.
**The Solution**: A smart mapping layer that converts technical features to business concepts while preserving model performance.
### Mapping Philosophy
Our feature mapping focuses on **business insights** rather than just renaming statistical terms:
- **Document Context**: Include data source (ATRS, RMI, BPR) to show where data comes from
- **Business Relevance**: Explain why the measurement matters for manufacturing processes
- **Statistical Meaning**: Convert technical aggregations to business understanding
### Example Mappings
| Technical Feature | Business Concept | Why This Matters |
|-------------------|------------------|------------------|
| `temperature_max` | "Process Temperature - Peak Values" | High temperature peaks can affect product quality |
| `results_std` | "Quality Control Testing - Process Consistency" | High variation indicates inconsistent process control |
| `quantity_issued_min` | "Material Issuance - Minimum Levels" | Low material levels may indicate supply issues |
| `net_wt_mean` | "Net Weight Management - Typical Levels" | Average weights show overall process control |
### Statistical Aggregation Guide
- `_min` → "Minimum Levels" (potential shortage indicators)
- `_max` → "Peak Values" (potential excess or spike indicators)
- `_mean` → "Typical Levels" (normal operating conditions)
- `_std` → "Process Consistency" (high std = inconsistent process)
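The sketch below shows how such a mapping layer might translate a technical feature name into a business concept using the suffix guide above. The dictionaries and function name are assumptions for illustration; the package's real mapping tables are internal.

```python
# Illustrative mapping layer; dictionaries here are hypothetical, not the package's tables.
SUFFIX_MAP = {
    "_min": "Minimum Levels", "_max": "Peak Values",
    "_mean": "Typical Levels", "_std": "Process Consistency",
}
CONCEPT_MAP = {"temperature": "Process Temperature", "results": "Quality Control Testing"}

def to_business_concept(feature: str) -> str:
    base = feature.split("__")[-1]                                  # e.g. "temperature_max"
    suffix = next((s for s in SUFFIX_MAP if base.endswith(s)), "")
    stem = base[: -len(suffix)] if suffix else base
    concept = CONCEPT_MAP.get(stem, stem.replace("_", " ").title())
    return f"{concept} - {SUFFIX_MAP[suffix]}" if suffix else concept

print(to_business_concept("public.bprpoc_temp_records__temperature_max"))
# -> "Process Temperature - Peak Values"
```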
### Dual Output System
The pipeline generates two versions of results:
1. **User-Friendly**: Business concepts for tools API and external users
2. **Technical**: Original feature names preserved for internal processing
```python
# User-friendly output (default)
drivers = get_top_yield_drivers(n=5)
print(drivers[0]['business_concept']) # "Process Temperature - Peak Values"
# Technical details still available in CSV files
# outputs/unified_importance_technical.csv contains original feature names
```
### Implementation Benefits
- **Preserved Performance**: All statistical features remain in the model
- **Business Clarity**: Users get actionable insights they can understand
- **Backward Compatibility**: Technical versions available for advanced analysis
- **Consistent Mapping**: Same business concepts across all analysis modules
## LLM Integration
The package is designed for seamless integration with LLM agents:
### OpenAI Function Calling
```python
import openai
from mva_pipeline.tools import get_tool_specs
# Get tool specifications
tools = get_tool_specs()
# Use with OpenAI
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What are the top 3 anomalous batches?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["parameters"]
        }
    } for tool in tools],
    tool_choice="auto"
)
```
### LangChain Integration
```python
from langchain_core.tools import Tool
from mva_pipeline.tools import get_tool_specs
# Convert to LangChain tools
tools = get_tool_specs()
langchain_tools = [
    Tool(
        name=tool["name"],
        description=tool["description"],
        func=tool["function"]
    )
    for tool in tools
]
```
## 📁 Project Structure
```
mva-pipeline/
├── mva_pipeline/        # Main package
│   ├── tools.py         # Main tools API (16 functions)
│   ├── analysis/        # Core analytics modules
│   ├── db/              # Database utilities
│   └── cli.py           # Command line interface
├── outputs/             # Analysis results (not included in package)
├── setup.py             # Package configuration
├── pyproject.toml       # Modern Python packaging
└── requirements.txt     # Dependencies
```
## Analysis Pipeline
The package follows a structured analytics workflow:
1. **Extract** - Pull data from manufacturing databases
2. **Build** - Create wide batch matrix with feature engineering
3. **Analyze** - Run anomaly detection, PCA, and SHAP analysis
4. **Tools** - Access results via clean API interface
## Available Tools (16 total)
### Anomaly Detection (4 tools)
- `get_top_anomalies()` - Highest scoring anomalous batches
- `explain_batch()` - Detailed anomaly profile for specific batch
- `filter_anomalies_by_doc_ids()` - Bulk anomaly analysis
- `get_anomaly_statistics()` - Overall detection statistics
### Yield Analysis (3 tools)
- `get_top_yield_drivers()` - Most critical process parameters
- `get_feature_scores()` - Individual feature importance scores
- `compare_feature_importance_methods()` - Method comparison analysis
### Advanced Analytics (6 tools)
- `get_pca_summary()` - Principal component analysis overview
- `get_batch_pca_scores()` - Batch positions in PCA space
- `get_batch_shap_explanation()` - Feature-level yield impact
- `get_global_shap_patterns()` - Global feature effect patterns
- `compare_batches()` - Multi-batch comparison
- `find_similar_batches()` - Similarity-based batch discovery
### Utilities (3 tools)
- `list_available_features()` - Available process parameters
- `get_pipeline_status()` - Analysis completion status
- `get_tool_specs()` - Tool specifications for LLM integration