# HybridVectorizer
**Unified embedding for tabular, text, and multimodal data with powerful similarity search.**
HybridVectorizer automatically handles mixed data types (numerical, categorical, text, dates) and creates high-quality vector representations for similarity search, recommendation systems, and machine learning pipelines.
## 🚀 Quick Start
```python
import pandas as pd
from hybrid_vectorizer import HybridVectorizer
# Your data with mixed types
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'description': [
        'AI machine learning platform for enterprises',
        'Data analytics and business intelligence',
        'Computer vision technology for robots',
        'Natural language processing chatbots',
        'Predictive analytics for healthcare'
    ],
    'category': ['AI', 'Analytics', 'Vision', 'NLP', 'Healthcare'],
    'funding': [1000000, 2500000, 800000, 1200000, 1800000],
    'employees': [50, 150, 30, 80, 200]
})
# Initialize and fit
hv = HybridVectorizer(index_column='id')
vectors = hv.fit_transform(df)
# Search for similar items
query = {
    'description': 'artificial intelligence startup',
    'category': 'AI',
    'employees': 100
}
results = hv.similarity_search(query, top_n=3)
print(results[['description', 'similarity']])
```
## 📦 Installation
```bash
pip install hybrid-vectorizer
```
**Requirements:**
- Python 3.8+
- pandas, numpy, scikit-learn
- sentence-transformers, torch
## ✨ Key Features
### 🔄 **Automatic Data Type Handling**
- **Numerical**: Auto-normalized with MinMaxScaler
- **Categorical**: One-hot or frequency encoding (smart threshold)
- **Text**: SentenceTransformer embeddings
- **Dates**: Extract features or ignore
- **Mixed**: Handles missing values, inf, NaN gracefully
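The numerical and categorical encodings listed above can be illustrated with a small dependency-free sketch. It is not the library's implementation: the helper names `minmax_scale` and `encode_categorical` are hypothetical, and the rule "one-hot at or below the threshold, frequency encoding above it" is an assumption based on the description of `onehot_threshold`.

```python
from collections import Counter


def minmax_scale(values):
    """Rescale numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def encode_categorical(values, onehot_threshold=10):
    """One-hot encode low-cardinality columns, frequency-encode the rest.

    Assumption: the switch happens when the number of distinct
    categories exceeds onehot_threshold.
    """
    categories = sorted(set(values))
    if len(categories) <= onehot_threshold:
        # One column per category, 1.0 where the row matches.
        return [[1.0 if v == c else 0.0 for c in categories] for v in values]
    # Single column holding the relative frequency of each row's category.
    counts = Counter(values)
    return [[counts[v] / len(values)] for v in values]


funding = [1000000, 2500000, 800000, 1200000, 1800000]
scaled = minmax_scale(funding)
onehot = encode_categorical(['AI', 'Analytics', 'AI'])
freq = encode_categorical(['a', 'b', 'a', 'c'], onehot_threshold=2)
```

With the `funding` column from the Quick Start, the largest value maps to 1.0 and the smallest to 0.0, so columns with very different ranges contribute on the same scale.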
### 🎯 **Powerful Similarity Search**
- **Late Fusion**: Combines modalities with configurable weights
- **Block-level Control**: Weight text vs. numerical vs. categorical separately
- **Explanation**: See which features drive similarity
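One way to picture late fusion with block weights is a weighted average of per-block cosine similarities. The README does not give the exact formula HybridVectorizer uses, so the sketch below is an assumption, and `cosine` and `late_fusion` are hypothetical helpers. Returning the per-block contributions alongside the score shows how an explanation of "which features drive similarity" could be derived.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def late_fusion(query_blocks, item_blocks, block_weights=None):
    """Combine per-block similarities into one score.

    Only blocks present in the query are compared; weights default to 1.0
    and are normalized so the result is a weighted average.
    """
    block_weights = block_weights or {}
    weighted = {}
    total_weight = 0.0
    for name, query_vec in query_blocks.items():
        weight = float(block_weights.get(name, 1.0))
        weighted[name] = weight * cosine(query_vec, item_blocks[name])
        total_weight += weight
    if total_weight == 0.0:
        return 0.0, {}
    contributions = {name: value / total_weight for name, value in weighted.items()}
    return sum(contributions.values()), contributions


query_blocks = {'text': [1.0, 0.0], 'numerical': [1.0, 0.0]}
item_blocks = {
    'text': [1.0, 0.0],
    'numerical': [0.0, 1.0],
    'categorical': [1.0, 0.0],
}
score, contributions = late_fusion(
    query_blocks, item_blocks, {'text': 3, 'numerical': 1}
)
```

In this toy case the text block matches perfectly and the numerical block not at all, so with weights 3 and 1 the fused score is 0.75, all of it coming from text.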
### 🛠️ **Production Ready**
- **Memory Efficient**: Optimized for large datasets
- **GPU Support**: Automatic GPU detection for text encoding
- **Persistence**: Save/load trained models
- **Error Handling**: Informative custom exceptions
## 💡 Usage Examples
### Basic Usage
```python
# Fit and transform
hv = HybridVectorizer()
vectors = hv.fit_transform(df)
# Simple query
results = hv.similarity_search({'description': 'machine learning'})
```
### Advanced Configuration
```python
hv = HybridVectorizer(
    column_encodings={'description': 'text', 'category': 'categorical'},
    ignore_columns=['id', 'created_at'],
    index_column='id',
    onehot_threshold=15,
    text_batch_size=64
)
```
### Weighted Search
```python
# Emphasize text over numerical features
results = hv.similarity_search(
    query,
    block_weights={'text': 3, 'categorical': 2, 'numerical': 1}
)
```
### Text-Only Search
```python
results = hv.similarity_search(
    'AI startup',
    text_column='description'
)
```
## 🔧 Configuration Options
| Parameter | Description | Default |
|-----------|-------------|---------|
| `column_encodings` | Manual type overrides | `{}` |
| `ignore_columns` | Skip these columns | `[]` |
| `index_column` | ID column (preserved in results) | `None` |
| `onehot_threshold` | Max categories for one-hot encoding | `10` |
| `default_text_model` | SentenceTransformer model | `'all-MiniLM-L6-v2'` |
| `text_batch_size` | Batch size for text encoding | `128` |
## 📊 Data Type Detection
HybridVectorizer automatically detects:
- **Numerical**: `int64`, `float64`, etc. → MinMax normalization
- **Categorical**: `object` with ≤10 unique values → One-hot encoding
- **Text**: `object` with >10 unique values → SentenceTransformer embeddings
- **Dates**: `datetime64` → Extract year/month/day or ignore
Override with `column_encodings={'col': 'text'}` if needed.
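The detection rules above, together with a manual override, can be sketched on plain Python lists. This is an illustration only: `detect_column_type` and `resolve_types` are hypothetical names, the cutoff of 10 is taken from the rules listed above, and the real library works on pandas dtypes rather than Python value types.

```python
from datetime import date, datetime


def detect_column_type(values, max_categories=10):
    """Guess an encoding for one column from its non-missing values."""
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, (date, datetime)) for v in present):
        return 'date'
    if present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    ):
        return 'numerical'
    # Remaining columns: few distinct values means categorical, many means text.
    if len(set(present)) <= max_categories:
        return 'categorical'
    return 'text'


def resolve_types(columns, column_encodings=None):
    """Detect every column, then apply manual overrides on top."""
    detected = {name: detect_column_type(vals) for name, vals in columns.items()}
    detected.update(column_encodings or {})
    return detected


columns = {
    'employees': [50, 150, 30, 80, 200],
    'category': ['AI', 'Analytics', 'Vision', 'NLP', 'Healthcare'],
    'description': ['AI platform', 'Data analytics', 'Computer vision',
                    'NLP chatbots', 'Predictive analytics'],
}
auto_types = resolve_types(columns)
fixed_types = resolve_types(columns, {'description': 'text'})
```

With only five rows, the `description` column has fewer than ten distinct values and is guessed as categorical under this rule, which is exactly the kind of case the `column_encodings` override addresses.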
## 🎛️ Advanced Features
### Model Persistence
```python
# Save trained model
hv.save('my_vectorizer.pkl')
# Load later
hv2 = HybridVectorizer.load('my_vectorizer.pkl')
results = hv2.similarity_search(query)
```
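The `.pkl` extension suggests pickle-based persistence, although the README does not confirm the format. As a rough sketch under that assumption, the round trip below saves and restores a small dictionary of fitted state with the standard library; the `state` contents and the `scale` helper are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical fitted state: the min/max learned for one numerical column.
state = {'column': 'employees', 'lo': 30, 'hi': 200}

path = os.path.join(tempfile.mkdtemp(), 'my_state.pkl')

# Save the fitted state to disk.
with open(path, 'wb') as handle:
    pickle.dump(state, handle)

# Load it back later, possibly in another process.
with open(path, 'rb') as handle:
    restored = pickle.load(handle)


def scale(value, fitted):
    """Apply the stored min-max scaling to a new value."""
    return (value - fitted['lo']) / (fitted['hi'] - fitted['lo'])


scaled_query = scale(100, restored)
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from trusted sources.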
### Encoding Report
```python
# See how each column was processed
report = hv.get_encoding_report()
print(report)
```
### External Vector Database
```python
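import faiss
import numpy as np
# Use FAISS for faster search; FAISS expects contiguous float32 arrays
vectors32 = np.ascontiguousarray(vectors, dtype='float32')
index = faiss.IndexFlatIP(vectors32.shape[1])
index.add(vectors32)
hv.set_vector_db(index)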
```
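`IndexFlatIP` ranks by inner product, which equals cosine similarity only when the vectors are L2-normalized. Whether HybridVectorizer normalizes its output is not stated here, so the following dependency-free sketch simply shows the relationship; `l2_normalize`, `inner_product`, and `top_n` are hypothetical helpers, not FAISS or library calls.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_n(query, vectors, n=3):
    """Brute-force inner-product search over normalized vectors."""
    q = l2_normalize(query)
    scored = [
        (index, inner_product(q, l2_normalize(vec)))
        for index, vec in enumerate(vectors)
    ]
    # Highest score first; ties keep their original order.
    scored.sort(key=lambda pair: -pair[1])
    return scored[:n]


stored = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = top_n([2.0, 2.0], stored, n=2)
best_index, best_score = hits[0]
```

The query `[2, 2]` points in the same direction as `[3, 3]`, so after normalization that row scores 1.0 regardless of the difference in magnitude.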
## 🚨 Error Handling
```python
from hybrid_vectorizer import HybridVectorizerError, ModelNotFittedError
try:
    results = hv.similarity_search(query)
except ModelNotFittedError:
    print("Call fit_transform() first!")
except HybridVectorizerError as e:
    print(f"HybridVectorizer error: {e}")
```
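The order of the `except` clauses matters if `ModelNotFittedError` derives from `HybridVectorizerError`, which the example suggests but the README does not state. The sketch below uses local stand-in classes with the same names (they are not imported from the library) and a hypothetical `TinyVectorizer` to show why the more specific exception is caught first.

```python
class HybridVectorizerError(Exception):
    """Stand-in base class for all library errors (assumed hierarchy)."""


class ModelNotFittedError(HybridVectorizerError):
    """Stand-in for the error raised when searching before fitting."""


class TinyVectorizer:
    """Hypothetical minimal model that only tracks whether it was fitted."""

    def __init__(self):
        self._fitted = False

    def fit(self, rows):
        self._fitted = True
        return self

    def similarity_search(self, query):
        if not self._fitted:
            raise ModelNotFittedError('call fit() before similarity_search()')
        return []


def safe_search(model, query):
    try:
        return model.similarity_search(query)
    except ModelNotFittedError:
        # Specific subclass first; otherwise the base clause would catch it.
        return 'not fitted'
    except HybridVectorizerError as exc:
        return f'error: {exc}'


unfitted_result = safe_search(TinyVectorizer(), {'description': 'AI'})
fitted_result = safe_search(TinyVectorizer().fit([]), {'description': 'AI'})
```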
## 📈 Performance
Typical performance on modern hardware:
| Dataset Size | Fit Time | Search Time | Memory |
|--------------|----------|-------------|--------|
| 1K rows | <1s | <1ms | ~50MB |
| 10K rows | <10s | <10ms | ~200MB |
| 100K rows | <2min | <100ms | ~1GB |
*With mixed data types including text columns*
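As a rough sanity check on the memory column, the size of the text-embedding matrix alone can be estimated from the row count. The sketch assumes one text column, float32 storage, and the 384-dimensional output of `all-MiniLM-L6-v2`; the helper `embedding_bytes` is hypothetical, and the table's figures presumably also include the model and the other feature blocks.

```python
def embedding_bytes(n_rows, dim=384, bytes_per_value=4):
    """Bytes needed to hold an n_rows x dim matrix of fixed-size values."""
    return n_rows * dim * bytes_per_value


for rows in (1_000, 10_000, 100_000):
    megabytes = embedding_bytes(rows) / 1_000_000
    print(f'{rows:>7} rows: {megabytes:.1f} MB of text embeddings')

largest = embedding_bytes(100_000)
```

For 100K rows this gives about 154 MB for the embeddings of a single text column.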
## 🛠️ Development
```bash
# Clone repository
git clone https://github.com/hariharaprabhu/hybrid-vectorizer
cd hybrid-vectorizer
# Install in development mode
pip install -e .
# Run tests
python tests/test_basic.py
```
## 📄 License
MIT License - see LICENSE file for details.
## 📞 Support
- **Issues**: [GitHub Issues](https://github.com/hariharaprabhu/hybrid-vectorizer/issues)
- **Documentation**: See this README and docstrings
- **Questions**: Open an issue for questions or feature requests
---
**HybridVectorizer** - Making multimodal similarity search simple and powerful. 🚀
Raw data
{
"_id": null,
"home_page": null,
"name": "hybrid-vectorizer",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "machine-learning, embeddings, similarity-search, multimodal, vectorization",
"author": null,
"author_email": "Hari Narayanan <hari.dataprojects@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/27/8a/c5e7303c56b11a79ef9a44ed5d062dc9b6c56dc0913476b02f217894807b/hybrid_vectorizer-0.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "Unified embedding for tabular, text, and multimodal data",
"version": "0.1.0",
"project_urls": {
"Documentation": "https://github.com/hariharaprabhu/hybrid-vectorizer#readme",
"Homepage": "https://github.com/hariharaprabhu/hybrid-vectorizer",
"Issues": "https://github.com/hariharaprabhu/hybrid-vectorizer/issues",
"Repository": "https://github.com/hariharaprabhu/hybrid-vectorizer"
},
"split_keywords": [
"machine-learning",
" embeddings",
" similarity-search",
" multimodal",
" vectorization"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "67fbd07dc5860d2c591a05c82a2d4d3a43fdbfef1281ab48ced39a3ffef5bee9",
"md5": "67a57a118ea871437f79ceb97efba4ef",
"sha256": "aa8b7e83f0108b817793f73d469f7d9b88337b33a803ba7d63914e0f0abb5955"
},
"downloads": -1,
"filename": "hybrid_vectorizer-0.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "67a57a118ea871437f79ceb97efba4ef",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 18372,
"upload_time": "2025-08-12T04:41:46",
"upload_time_iso_8601": "2025-08-12T04:41:46.843939Z",
"url": "https://files.pythonhosted.org/packages/67/fb/d07dc5860d2c591a05c82a2d4d3a43fdbfef1281ab48ced39a3ffef5bee9/hybrid_vectorizer-0.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "278ac5e7303c56b11a79ef9a44ed5d062dc9b6c56dc0913476b02f217894807b",
"md5": "89e274451f71cf23cdc09abd8c1882f4",
"sha256": "447e5b663fcf4f22efd5a8f982a854e55137d181dec3b676790e53f1953da094"
},
"downloads": -1,
"filename": "hybrid_vectorizer-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "89e274451f71cf23cdc09abd8c1882f4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 22088,
"upload_time": "2025-08-12T04:41:48",
"upload_time_iso_8601": "2025-08-12T04:41:48.355979Z",
"url": "https://files.pythonhosted.org/packages/27/8a/c5e7303c56b11a79ef9a44ed5d062dc9b6c56dc0913476b02f217894807b/hybrid_vectorizer-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-12 04:41:48",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "hariharaprabhu",
"github_project": "hybrid-vectorizer#readme",
"github_not_found": true,
"lcname": "hybrid-vectorizer"
}