# Row2Vec
[License: MIT](https://opensource.org/licenses/MIT)
[Python 3.10+](https://www.python.org/downloads/)
**Row2Vec** is a Python library for generating low-dimensional vector embeddings from tabular datasets. It combines deep learning (autoencoders) with classical methods (PCA, t-SNE, UMAP) to produce dense representations of your data, suitable for visualization, feature engineering, and exploring your data's structure.
## Features
### 🎯 Multiple Embedding Methods
- **Neural (Autoencoder)**: Deep learning approach for complex, non-linear patterns
- **Target-based**: Learn embeddings for categorical columns and their relationships
- **PCA**: Fast, linear dimensionality reduction with interpretable components
- **t-SNE**: Excellent for 2D/3D visualization and cluster discovery
- **UMAP**: Balanced preservation of local and global structure
### 🧠 Intelligent Preprocessing
- **Adaptive Missing Value Imputation**: Automatically analyzes patterns and applies optimal strategies
- **Pattern-Aware Analysis**: Detects problematic missing patterns with configurable strategies
- **Automated Feature Engineering**: Handles scaling, encoding, and preprocessing seamlessly
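As a rough illustration of what scaling and encoding involve for a mixed-type table, here is a minimal sketch that uses scikit-learn directly. It is not Row2Vec's internal pipeline: the toy column names and the choice of `StandardScaler` and `OneHotEncoder` are assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy table with one numeric and one categorical column.
df = pd.DataFrame({
    "age": [23, 35, 47, 52],
    "country": ["SE", "BR", "SE", "IT"],
})

numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Scale numeric columns to zero mean and unit variance; one-hot encode the rest.
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

features = preprocessor.fit_transform(df)
print(features.shape)
```

The resulting matrix has one scaled column for `age` plus one indicator column per distinct `country` value, which is the kind of all-numeric input an embedding model needs.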
### 🚀 Advanced Features
- **Neural Architecture Search (NAS)**: Automatically discovers optimal network architectures
- **Multi-layer Networks**: Support for deep architectures with dropout and regularization
- **Model Serialization**: Save and load models with full preprocessing pipelines
- **Command-Line Interface**: Complete CLI for batch processing and automation
### 🔧 Production Ready
- **Comprehensive Testing**: 163+ test functions across 17 test files
- **Type Safety**: Complete MyPy annotations
- **Modern Build System**: Uses `pyproject.toml` with hatchling backend
- **Documentation**: Interactive Jupyter Book with executable examples
## Installation
```bash
pip install row2vec
```
## Quick Start
```python
import pandas as pd
from row2vec import learn_embedding, generate_synthetic_data
# Generate a synthetic dataset (replace with your own DataFrame)
df = generate_synthetic_data(num_records=1000)
# Generate neural embeddings for each row
embeddings = learn_embedding(
    df,
    mode="unsupervised",
    embedding_dim=5
)
print(f"Embeddings shape: {embeddings.shape}")
print(embeddings.head())
# Learn categorical embeddings
country_embeddings = learn_embedding(
    df,
    mode="target",
    reference_column="Country",
    embedding_dim=3
)
print(f"Country embeddings: {country_embeddings}")
# Compare with classical methods
pca_embeddings = learn_embedding(df, mode="pca", embedding_dim=5)
tsne_embeddings = learn_embedding(df, mode="tsne", embedding_dim=2)
```
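The returned embeddings can feed downstream tools. The sketch below clusters row embeddings with scikit-learn's `KMeans`. It assumes the result behaves like a numeric pandas DataFrame (as the `.shape` and `.head()` calls above suggest) and substitutes random numbers for real embeddings, so it runs without training a model; the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in for the output of learn_embedding(df, mode="unsupervised", embedding_dim=5);
# the column names here are hypothetical.
rng = np.random.default_rng(seed=0)
embeddings = pd.DataFrame(
    rng.normal(size=(1000, 5)),
    columns=[f"dim_{i}" for i in range(5)],
)

# Group rows into four clusters based on their embedding vectors.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings.to_numpy())

# Attach the cluster label to each row for inspection.
clustered = embeddings.assign(cluster=labels)
print(clustered["cluster"].value_counts().sort_index())
```

With real embeddings, the cluster labels can be joined back to the original table to inspect which rows the model considers similar.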
## Command Line Interface
```bash
# Quick embeddings
row2vec annotate --input data.csv --output embeddings.csv --mode unsupervised --dim 5
# Train and save model
row2vec train --input data.csv --output model.py --mode unsupervised --dim 10 --epochs 50
# Use saved model
row2vec predict --input new_data.csv --model model.py --output predictions.csv
# Target-based embeddings
row2vec annotate --input data.csv --output categories.csv --mode target --target-col Category --dim 3
```
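For batch jobs, the same commands can be assembled from Python. The sketch below only builds the argument lists for `row2vec annotate`, using the flags shown above (`--input`, `--output`, `--mode`, `--dim`); the helper name, file names, and output naming scheme are hypothetical.

```python
from pathlib import Path


def build_annotate_command(
    input_csv: Path,
    output_dir: Path,
    mode: str = "unsupervised",
    dim: int = 5,
) -> list[str]:
    """Build the argument list for one `row2vec annotate` call."""
    output_csv = output_dir / f"{input_csv.stem}_embeddings.csv"
    return [
        "row2vec", "annotate",
        "--input", str(input_csv),
        "--output", str(output_csv),
        "--mode", mode,
        "--dim", str(dim),
    ]


# Hypothetical input files; in practice these might come from Path("data").glob("*.csv").
inputs = [Path("data/customers.csv"), Path("data/orders.csv")]
commands = [build_annotate_command(path, Path("embeddings")) for path in inputs]

for cmd in commands:
    # Print each command; to execute it, pass the list to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

Passing the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.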
## Advanced Usage
### Neural Architecture Search
```python
from row2vec import (
    ArchitectureSearchConfig,
    EmbeddingConfig,
    NeuralConfig,
    learn_embedding,
    search_architecture,
)
# `df` is the DataFrame created in the Quick Start example
# Configure architecture search
config = ArchitectureSearchConfig(
    method='random',
    max_layers=3,
    width_options=[64, 128, 256],
    max_trials=20
)
base_config = EmbeddingConfig(
    mode="unsupervised",
    embedding_dim=8,
    neural=NeuralConfig(max_epochs=50)
)
# Find the best architecture among the sampled candidates
best_arch, results = search_architecture(df, base_config, config)
print(f"Best architecture: {best_arch}")
# Train with the best settings found
optimal_embeddings = learn_embedding(
    df,
    mode="unsupervised",
    embedding_dim=8,
    hidden_units=best_arch.get('hidden_units', [128]),
    max_epochs=100
)
```
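To make the idea behind `method='random'` concrete, the following is a conceptual sketch of random architecture search: sample a depth and layer widths, score each candidate, and keep the best. It is not Row2Vec's implementation. `sample_architecture`, `random_search`, and the toy `score` function are hypothetical; a real search would train a model for each candidate and compare validation losses.

```python
import random


def sample_architecture(rng, max_layers, width_options):
    """Draw a random depth and a width for each hidden layer."""
    n_layers = rng.randint(1, max_layers)
    return [rng.choice(width_options) for _ in range(n_layers)]


def random_search(score, max_layers=3, width_options=(64, 128, 256), max_trials=20, seed=0):
    """Return the lowest-scoring architecture among `max_trials` random samples."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("inf")
    for _ in range(max_trials):
        arch = sample_architecture(rng, max_layers, width_options)
        current = score(arch)
        if current < best_score:
            best_arch, best_score = arch, current
    return best_arch, best_score


def score(arch):
    # Toy stand-in for a validation loss: prefers a total width near 256 and fewer layers.
    return abs(sum(arch) - 256) + 10 * len(arch)


best_arch, best_score = random_search(score)
print(best_arch, best_score)
```

Random search covers only a sample of the space, so `max_trials` trades search time against the chance of missing a better configuration.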
### Missing Value Imputation
```python
from row2vec import ImputationConfig, AdaptiveImputer, MissingPatternAnalyzer
# Analyze missing patterns
analyzer = MissingPatternAnalyzer(ImputationConfig())
analysis = analyzer.analyze(df)
print(f"Missing patterns: {analysis['recommendations']}")
# Apply adaptive imputation
imputer = AdaptiveImputer(ImputationConfig(
    numeric_strategy='knn',
    categorical_strategy='mode',
    knn_neighbors=10
))
df_clean = imputer.fit_transform(df)
```
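For readers unfamiliar with the two strategies named above, here is a minimal sketch of KNN imputation for numeric columns and mode imputation for categorical ones, written with pandas and scikit-learn. It shows the general techniques only and makes no claim about how `AdaptiveImputer` works internally; the toy table and column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical table with gaps in numeric and categorical columns.
df = pd.DataFrame({
    "height": [1.60, 1.75, np.nan, 1.82, 1.68],
    "weight": [55.0, 72.0, 80.0, np.nan, 61.0],
    "country": ["SE", None, "SE", "BR", "SE"],
})

numeric_cols = ["height", "weight"]
categorical_cols = ["country"]

# KNN: fill each missing number from the rows most similar on the observed columns.
knn = KNNImputer(n_neighbors=2)
df[numeric_cols] = knn.fit_transform(df[numeric_cols])

# Mode: fill each missing category with the column's most frequent value.
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```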
## Documentation
### Online Documentation
- **[Installation Guide](https://evotext.github.io/row2vec/installation.html)**: Detailed setup instructions
- **[Quick Start Tutorial](https://evotext.github.io/row2vec/quickstart.html)**: Get up and running in 5 minutes
- **[API Reference](https://evotext.github.io/row2vec/api_reference.html)**: Complete function documentation
- **[Example Gallery](https://evotext.github.io/row2vec/)**: Real-world use cases and tutorials
- **[Advanced Features](https://evotext.github.io/row2vec/advanced_features.html)**: Neural architecture search, imputation strategies
### Local Documentation
- **[User Guide](docs/USER_GUIDE.md)**: Comprehensive guide with mathematical background, detailed examples, and best practices
- **[LLM Documentation](docs/LLM_DOCUMENTATION.md)**: Practical guide for LLM coding agents integrating Row2Vec
- **[API Reference](docs/API_REFERENCE.md)**: Complete function and class reference
- **[Tutorials](docs/)**: Executable Python tutorials (Nhandu format) - run `make docs` to build HTML
## Why Row2Vec?
| Method | Row2Vec Advantage | Alternative |
|--------|-------------------|-------------|
| **Manual Neural Networks** | Automated preprocessing, simple API | 200+ lines of boilerplate |
| **sklearn PCA** | Integrated preprocessing, multiple methods | Limited to linear reduction |
| **sklearn t-SNE/UMAP** | Unified interface, consistent preprocessing | Manual pipeline setup |
| **Custom Embeddings** | Production-ready with serialization | Significant development time |
## Contributing
We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.
## Citation
If you use Row2Vec in your research, please cite:
```bibtex
@software{tresoldi_row2vec,
  author = {Tresoldi, Tiago},
  title = {Row2Vec: Neural and Classical Embeddings for Tabular Data},
  url = {https://github.com/evotext/row2vec},
  version = {0.1.0}
}
```
## Acknowledgments
This library was originally developed as part of the **"Cultural Evolution of Texts"** project, led by Michael Dunn at the Department of Linguistics and Philology, Uppsala University. The project investigates the application of evolutionary models to textual data and cultural transmission patterns.
## Authors
**Tiago Tresoldi**
*Affiliate Researcher, Department of Linguistics and Philology*
*Uppsala University*
*GitHub: [@tresoldi](https://github.com/tresoldi)*
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
Raw data
{
"_id": null,
"home_page": null,
"name": "row2vec",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": "Tiago Tresoldi <tiago@tresoldi.org>",
"keywords": "embeddings, machine-learning, deep-learning, dimensionality-reduction, autoencoder, tabular-data, representation-learning, tensorflow, keras, pca, tsne, umap",
"author": null,
"author_email": "Tiago Tresoldi <tiago@tresoldi.org>",
"download_url": "https://files.pythonhosted.org/packages/84/78/562ca6d78a5134b766e3a4ad25f1b2dc5b24a4fba86410b4811ed9c87194/row2vec-0.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A Python library for easily generating low-dimensional vector embeddings from any tabular dataset.",
"version": "0.1.0",
"project_urls": {
"Changelog": "https://github.com/evotext/row2vec/blob/main/CHANGELOG.md",
"Documentation": "https://github.com/evotext/row2vec#readme",
"Homepage": "https://github.com/evotext/row2vec",
"Issues": "https://github.com/evotext/row2vec/issues",
"Repository": "https://github.com/evotext/row2vec"
},
"split_keywords": [
"embeddings",
" machine-learning",
" deep-learning",
" dimensionality-reduction",
" autoencoder",
" tabular-data",
" representation-learning",
" tensorflow",
" keras",
" pca",
" tsne",
" umap"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "698080947bed71add35a5887a59f66c56cf56ed29f2e828de111ac34e79266a1",
"md5": "32c5f4299a278203b84fe8b92596ea2b",
"sha256": "9777d3ca8b3cb07793813c06eee5bbd23aeccfde9da3b1cb7c0eacc4496f1892"
},
"downloads": -1,
"filename": "row2vec-0.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "32c5f4299a278203b84fe8b92596ea2b",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 79544,
"upload_time": "2025-10-13T08:26:07",
"upload_time_iso_8601": "2025-10-13T08:26:07.663616Z",
"url": "https://files.pythonhosted.org/packages/69/80/80947bed71add35a5887a59f66c56cf56ed29f2e828de111ac34e79266a1/row2vec-0.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "8478562ca6d78a5134b766e3a4ad25f1b2dc5b24a4fba86410b4811ed9c87194",
"md5": "558316079c676de804bc5989eb60f272",
"sha256": "2f193a742d37742f2a9cd6b626a2d2102b20efea97fc4b7b1c02e5a454f5445f"
},
"downloads": -1,
"filename": "row2vec-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "558316079c676de804bc5989eb60f272",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 106806,
"upload_time": "2025-10-13T08:26:09",
"upload_time_iso_8601": "2025-10-13T08:26:09.275969Z",
"url": "https://files.pythonhosted.org/packages/84/78/562ca6d78a5134b766e3a4ad25f1b2dc5b24a4fba86410b4811ed9c87194/row2vec-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-13 08:26:09",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "evotext",
"github_project": "row2vec",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "numpy",
"specs": [
[
">=",
"1.21.0"
]
]
},
{
"name": "pandas",
"specs": [
[
">=",
"1.5.3"
]
]
},
{
"name": "scikit-learn",
"specs": [
[
">=",
"1.0.0"
]
]
},
{
"name": "tensorflow",
"specs": [
[
">=",
"2.8.0"
]
]
},
{
"name": "jinja2",
"specs": [
[
">=",
"3.1.2"
]
]
},
{
"name": "psutil",
"specs": [
[
">=",
"5.8.0"
]
]
},
{
"name": "umap-learn",
"specs": [
[
">=",
"0.5.0"
]
]
},
{
"name": "click",
"specs": [
[
">=",
"8.0.0"
]
]
},
{
"name": "rich",
"specs": [
[
">=",
"12.0.0"
]
]
},
{
"name": "pyyaml",
"specs": [
[
">=",
"6.0.0"
]
]
}
],
"lcname": "row2vec"
}