hybrid-vectorizer


| Field | Value |
|-------|-------|
| Name | hybrid-vectorizer |
| Version | 0.1.0 |
| Summary | Unified embedding for tabular, text, and multimodal data |
| Upload time | 2025-08-12 04:41:48 |
| Requires Python | >=3.8 |
| License | Apache-2.0 |
| Keywords | machine-learning, embeddings, similarity-search, multimodal, vectorization |

# HybridVectorizer

**Unified embedding for tabular, text, and multimodal data with powerful similarity search.**

HybridVectorizer automatically handles mixed data types (numerical, categorical, text, dates) and creates high-quality vector representations for similarity search, recommendation systems, and machine learning pipelines.

## 🚀 Quick Start

```python
import pandas as pd
from hybrid_vectorizer import HybridVectorizer

# Your data with mixed types
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'description': [
        'AI machine learning platform for enterprises',
        'Data analytics and business intelligence',
        'Computer vision technology for robots', 
        'Natural language processing chatbots',
        'Predictive analytics for healthcare'
    ],
    'category': ['AI', 'Analytics', 'Vision', 'NLP', 'Healthcare'],
    'funding': [1000000, 2500000, 800000, 1200000, 1800000],
    'employees': [50, 150, 30, 80, 200]
})

# Initialize and fit
hv = HybridVectorizer(index_column='id')
vectors = hv.fit_transform(df)

# Search for similar items
query = {
    'description': 'artificial intelligence startup',
    'category': 'AI',
    'employees': 100
}

results = hv.similarity_search(query, top_n=3)
print(results[['description', 'similarity']])
```

## 📦 Installation

```bash
pip install hybrid-vectorizer
```

**Requirements:**
- Python 3.8+
- pandas, numpy, scikit-learn
- sentence-transformers, torch

## ✨ Key Features

### 🔄 **Automatic Data Type Handling**
- **Numerical**: Auto-normalized with MinMaxScaler
- **Categorical**: One-hot or frequency encoding (smart threshold)
- **Text**: SentenceTransformer embeddings
- **Dates**: Extract features or ignore
- **Mixed**: Handles missing values, inf, NaN gracefully
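
The encodings listed above can be sketched in plain Python. This is a minimal illustration, not HybridVectorizer's implementation: the helper names `min_max_scale` and `encode_categorical` are hypothetical, non-finite values are mapped to 0.0 as one possible reading of "gracefully", and "frequency encoding" is assumed to mean relative frequency.

```python
import math
from collections import Counter


def min_max_scale(values):
    """Scale numbers to [0, 1]; None, NaN and inf are mapped to 0.0 (assumption)."""
    finite = [v for v in values if v is not None and math.isfinite(v)]
    if not finite:
        return [0.0 for _ in values]
    lo, hi = min(finite), max(finite)
    span = hi - lo
    scaled = []
    for v in values:
        if v is None or not math.isfinite(v) or span == 0:
            scaled.append(0.0)
        else:
            scaled.append((v - lo) / span)
    return scaled


def encode_categorical(values, onehot_threshold=10):
    """One-hot for few categories, relative frequency otherwise (assumption)."""
    categories = sorted(set(values))
    if len(categories) <= onehot_threshold:
        return [[1.0 if v == c else 0.0 for c in categories] for v in values]
    counts = Counter(values)
    return [[counts[v] / len(values)] for v in values]


# A numerical column with one missing value
funding = [1000000, 2500000, 800000, float("nan"), 1800000]
scaled = min_max_scale(funding)

# Two categories: below the threshold, so one-hot
onehot = encode_categorical(["AI", "Analytics", "AI"])

# Three categories with a threshold of 2: falls back to frequency encoding
freq = encode_categorical(["a", "b", "a", "c"], onehot_threshold=2)
```

One-hot encoding adds one dimension per category, so a threshold keeps high-cardinality columns from dominating the vector width; the frequency fallback keeps such a column at a single dimension.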

### 🎯 **Powerful Similarity Search**
- **Late Fusion**: Combines modalities with configurable weights
- **Block-level Control**: Weight text vs. numerical vs. categorical separately
- **Explanation**: See which features drive similarity
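
Late fusion with block weights can be pictured as scoring each block separately and then combining the scores. The sketch below is an assumption about the general technique, not the library's code: it uses cosine similarity per block and a weighted average, and the names `cosine` and `late_fusion_score` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def late_fusion_score(query_blocks, item_blocks, block_weights=None):
    """Weighted average of per-block similarities over blocks present in both."""
    weights = block_weights or {}
    total = 0.0
    weight_sum = 0.0
    for name, query_vec in query_blocks.items():
        if name not in item_blocks:
            continue  # the query or the item has no data for this block
        w = weights.get(name, 1.0)
        total += w * cosine(query_vec, item_blocks[name])
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


# The text blocks are orthogonal (similarity 0), the categorical blocks match (similarity 1)
query = {"text": [1.0, 0.0], "categorical": [1.0, 0.0]}
item = {"text": [0.0, 1.0], "categorical": [1.0, 0.0]}

equal = late_fusion_score(query, item)
text_heavy = late_fusion_score(query, item, {"text": 3, "categorical": 1})
```

With equal weights the two blocks average out; raising the text weight pulls the combined score toward the (poor) text match, which is the effect the `block_weights` argument is meant to expose.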

### 🛠️ **Production Ready**
- **Memory Efficient**: Optimized for large datasets
- **GPU Support**: Automatic GPU detection for text encoding
- **Persistence**: Save/load trained models
- **Error Handling**: Informative custom exceptions

## 💡 Usage Examples

### Basic Usage
```python
# Fit and transform
hv = HybridVectorizer()
vectors = hv.fit_transform(df)

# Simple query
results = hv.similarity_search({'description': 'machine learning'})
```

### Advanced Configuration
```python
hv = HybridVectorizer(
    column_encodings={'description': 'text', 'category': 'categorical'},
    ignore_columns=['id', 'created_at'],
    index_column='id',
    onehot_threshold=15,
    text_batch_size=64
)
```

### Weighted Search
```python
# Emphasize text over numerical features
results = hv.similarity_search(
    query,
    block_weights={'text': 3, 'categorical': 2, 'numerical': 1}
)
```

### Text-Only Search
```python
results = hv.similarity_search(
    'AI startup', 
    text_column='description'
)
```

## 🔧 Configuration Options

| Parameter | Description | Default |
|-----------|-------------|---------|
| `column_encodings` | Manual type overrides | `{}` |
| `ignore_columns` | Skip these columns | `[]` |
| `index_column` | ID column (preserved in results) | `None` |
| `onehot_threshold` | Max categories for one-hot encoding | `10` |
| `default_text_model` | SentenceTransformer model | `'all-MiniLM-L6-v2'` |
| `text_batch_size` | Batch size for text encoding | `128` |

## 📊 Data Type Detection

HybridVectorizer automatically detects:

- **Numerical**: `int64`, `float64`, etc. → MinMax normalization
- **Categorical**: `object` with ≤10 unique values → One-hot encoding
- **Text**: `object` with >10 unique values → SentenceTransformer embeddings
- **Dates**: `datetime64` → Extract year/month/day or ignore

Override with `column_encodings={'col': 'text'}` if needed.
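
The detection rules above can be written out as a small plain-Python sketch. It only mirrors the rules as stated in this README and is not the library's detection code; `detect_column_type`, `detect_types`, and the `founded` column are hypothetical.

```python
from datetime import date


def detect_column_type(values, onehot_threshold=10):
    """Classify one column using the rules listed above (illustrative only)."""
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, date) for v in present):
        return "date"
    if present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    ):
        return "numerical"
    # Everything else is treated like a pandas `object` column.
    n_unique = len({str(v) for v in present})
    return "categorical" if n_unique <= onehot_threshold else "text"


def detect_types(columns, column_encodings=None, onehot_threshold=10):
    """Apply manual overrides first, then fall back to detection."""
    overrides = column_encodings or {}
    return {
        name: overrides.get(name) or detect_column_type(values, onehot_threshold)
        for name, values in columns.items()
    }


columns = {
    "description": [
        "AI machine learning platform for enterprises",
        "Data analytics and business intelligence",
        "Computer vision technology for robots",
        "Natural language processing chatbots",
        "Predictive analytics for healthcare",
    ],
    "funding": [1000000, 2500000, 800000, 1200000, 1800000],
    "founded": [date(2019, 1, 1), date(2020, 6, 15), None, date(2021, 3, 9)],
}

auto = detect_types(columns)
manual = detect_types(columns, column_encodings={"description": "text"})
```

Under the rule as stated, a free-text column in a small table (five unique descriptions here) falls below the threshold and would be classified as categorical, which is the kind of case where an explicit `column_encodings` override is useful.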

## 🎛️ Advanced Features

### Model Persistence
```python
# Save trained model
hv.save('my_vectorizer.pkl')

# Load later
hv2 = HybridVectorizer.load('my_vectorizer.pkl')
results = hv2.similarity_search(query)
```

### Encoding Report
```python
# See how each column was processed
report = hv.get_encoding_report()
print(report)
```

### External Vector Database
```python
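import faiss
import numpy as np

# FAISS expects a C-contiguous float32 array
vectors = np.ascontiguousarray(vectors, dtype=np.float32)

# Use FAISS for faster search (exact inner-product index)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)
hv.set_vector_db(index)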
```
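
`IndexFlatIP` ranks by raw inner product, which matches cosine similarity only when the vectors have unit length. This README does not say whether the vectors returned by `fit_transform` are already normalized, so the plain-Python sketch below (helper names are hypothetical) only illustrates how the two rankings can differ.

```python
import math


def l2_normalize(vec):
    """Return vec scaled to unit length (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 0.0]
long_vec = [10.0, 10.0]  # 45 degrees away from the query, large magnitude
near_vec = [2.0, 0.1]    # almost the same direction, small magnitude
candidates = [long_vec, near_vec]

# Raw inner product rewards magnitude
raw_winner = max(candidates, key=lambda v: inner_product(query, v))

# After normalization, inner product equals cosine similarity
unit_query = l2_normalize(query)
unit_winner = max(
    candidates, key=lambda v: inner_product(unit_query, l2_normalize(v))
)
```

With FAISS itself, `faiss.normalize_L2` normalizes a float32 array in place; if cosine ranking is wanted, it would be applied to both the indexed vectors and the query vectors before `index.add` and `index.search`.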

## 🚨 Error Handling

```python
from hybrid_vectorizer import HybridVectorizerError, ModelNotFittedError

try:
    results = hv.similarity_search(query)
except ModelNotFittedError:
    print("Call fit_transform() first!")
except HybridVectorizerError as e:
    print(f"HybridVectorizer error: {e}")
```
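
The order of the `except` clauses matters if the specific exception derives from the general one. The sketch below uses stand-in classes (`ToyVectorizerError`, `ToyNotFittedError`, `ToyVectorizer`, all hypothetical) and assumes, without confirmation from the package, that `ModelNotFittedError` is a subclass of `HybridVectorizerError`.

```python
class ToyVectorizerError(Exception):
    """Stand-in for a library-wide base exception."""


class ToyNotFittedError(ToyVectorizerError):
    """Stand-in for the 'model not fitted' case."""


class ToyVectorizer:
    def __init__(self):
        self._rows = None

    def fit(self, rows):
        self._rows = list(rows)
        return self

    def search(self, query):
        if self._rows is None:
            raise ToyNotFittedError("call fit() before search()")
        if not isinstance(query, str):
            raise ToyVectorizerError("query must be a string")
        return [row for row in self._rows if query in row]


def safe_search(model, query):
    """Catch the specific error first, then the general one."""
    try:
        return model.search(query)
    except ToyNotFittedError:
        return "not fitted"
    except ToyVectorizerError as exc:
        return f"error: {exc}"


unfitted = safe_search(ToyVectorizer(), "ai")
fitted = safe_search(ToyVectorizer().fit(["ai platform", "analytics"]), "ai")
bad_query = safe_search(ToyVectorizer().fit(["ai platform"]), 42)
```

If the general clause came first, it would also catch the not-fitted case and the more helpful message would never be reached.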

## 📈 Performance

Approximate performance (hardware and configuration not specified):

| Dataset Size | Fit Time | Search Time | Memory |
|--------------|----------|-------------|--------|
| 1K rows | <1s | <1ms | ~50MB |
| 10K rows | <10s | <10ms | ~200MB |
| 100K rows | <2min | <100ms | ~1GB |

*With mixed data types including text columns*

## 🛠️ Development

```bash
# Clone repository
git clone https://github.com/hariharaprabhu/hybrid-vectorizer
cd hybrid-vectorizer

# Install in development mode
pip install -e .

# Run tests
python tests/test_basic.py
```

## 📄 License

Apache-2.0 (as declared in the package metadata) - see the LICENSE file for details.

## 📞 Support

- **Issues**: [GitHub Issues](https://github.com/hariharaprabhu/hybrid-vectorizer/issues)
- **Documentation**: See this README and docstrings
- **Questions**: Open an issue for questions or feature requests

---

**HybridVectorizer** - Making multimodal similarity search simple and powerful. 🚀

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "hybrid-vectorizer",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "machine-learning, embeddings, similarity-search, multimodal, vectorization",
    "author": null,
    "author_email": "Hari Narayanan <hari.dataprojects@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/27/8a/c5e7303c56b11a79ef9a44ed5d062dc9b6c56dc0913476b02f217894807b/hybrid_vectorizer-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Unified embedding for tabular, text, and multimodal data",
    "version": "0.1.0",
    "project_urls": {
        "Documentation": "https://github.com/hariharaprabhu/hybrid-vectorizer#readme",
        "Homepage": "https://github.com/hariharaprabhu/hybrid-vectorizer",
        "Issues": "https://github.com/hariharaprabhu/hybrid-vectorizer/issues",
        "Repository": "https://github.com/hariharaprabhu/hybrid-vectorizer"
    },
    "split_keywords": [
        "machine-learning",
        " embeddings",
        " similarity-search",
        " multimodal",
        " vectorization"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "67fbd07dc5860d2c591a05c82a2d4d3a43fdbfef1281ab48ced39a3ffef5bee9",
                "md5": "67a57a118ea871437f79ceb97efba4ef",
                "sha256": "aa8b7e83f0108b817793f73d469f7d9b88337b33a803ba7d63914e0f0abb5955"
            },
            "downloads": -1,
            "filename": "hybrid_vectorizer-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "67a57a118ea871437f79ceb97efba4ef",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 18372,
            "upload_time": "2025-08-12T04:41:46",
            "upload_time_iso_8601": "2025-08-12T04:41:46.843939Z",
            "url": "https://files.pythonhosted.org/packages/67/fb/d07dc5860d2c591a05c82a2d4d3a43fdbfef1281ab48ced39a3ffef5bee9/hybrid_vectorizer-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "278ac5e7303c56b11a79ef9a44ed5d062dc9b6c56dc0913476b02f217894807b",
                "md5": "89e274451f71cf23cdc09abd8c1882f4",
                "sha256": "447e5b663fcf4f22efd5a8f982a854e55137d181dec3b676790e53f1953da094"
            },
            "downloads": -1,
            "filename": "hybrid_vectorizer-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "89e274451f71cf23cdc09abd8c1882f4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 22088,
            "upload_time": "2025-08-12T04:41:48",
            "upload_time_iso_8601": "2025-08-12T04:41:48.355979Z",
            "url": "https://files.pythonhosted.org/packages/27/8a/c5e7303c56b11a79ef9a44ed5d062dc9b6c56dc0913476b02f217894807b/hybrid_vectorizer-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-12 04:41:48",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "hariharaprabhu",
    "github_project": "hybrid-vectorizer#readme",
    "github_not_found": true,
    "lcname": "hybrid-vectorizer"
}
        