# lsdembed - Physics-Inspired Text Embedding Library
Lagrangian Semantic Dynamics Embeddings (LSDEmbed) is a text embedding library that uses physics-inspired algorithms to create high-quality semantic embeddings. By modeling text tokens as particles in a physical system with forces of attraction and repulsion, lsdembed captures nuanced semantic relationships that traditional methods might miss.
<h3 align="center">
<code>pip install lsdembed</code>
</h3>
## Features
- **Physics-Inspired Embeddings**: Uses particle physics simulation to model semantic relationships
- **High Performance**: Optimized C++ backend with OpenMP parallelization
- **Flexible Parameters**: Customizable physics parameters for different use cases
- **Memory Efficient**: Spatial hashing and memory-aware processing
- **Easy Persistence**: Save and load trained models with compression support
- **Multiple Export Formats**: Export embeddings as NPZ, CSV, or JSON
## Installation
### Prerequisites
- Python 3.8 or higher
- C++ compiler with C++17 support
- OpenMP (optional, for parallel processing)
### Install from Source
```bash
git clone https://github.com/datasciritwik/lsdembed.git
cd lsdembed
pip install -e .
```
### Dependencies
The library requires:
- `numpy>=1.21.0`
- `scipy>=1.7.0`
- `pandas>=1.3.0`
- `pybind11>=2.6.0` (for building)
## Quick Start
```python
from lsdembed import LSdembed
# Initialize with default parameters
model = LSdembed()
# Fit on your text corpus
texts = [
    "The quick brown fox jumps over the lazy dog",
    "Machine learning is a subset of artificial intelligence",
    "Natural language processing helps computers understand text",
]

model.fit(texts, chunk_size=500)

# Search for similar content
results = model.search("artificial intelligence", top_k=3)

for text, score in results:
    print(f"Score: {score:.4f}")
    print(f"Text: {text}")
    print("---")
```
## Advanced Usage
### Custom Parameters
```python
# Configure physics parameters for your use case
params = {
    'd': 256,         # Embedding dimension
    'alpha': 1.5,     # Repulsion strength
    'beta': 0.3,      # Attraction strength
    'gamma': 0.2,     # Damping coefficient
    'r_cutoff': 2.5,  # Force cutoff radius
    'dt': 0.05,       # Integration time step
    'scale': 0.1,     # Initial position scale
    'seed': 42,       # Random seed for reproducibility
}

model = LSdembed(params)
```
### Model Persistence
```python
# Save trained model
model.save_model('my_model.pkl', compress=True)
# Load model later
loaded_model = LSdembed.from_pretrained('my_model.pkl.gz')
# Or load into existing instance
model = LSdembed()
model.load_model('my_model.pkl.gz')
```
### Export Embeddings
```python
# Export in different formats
model.export_embeddings('embeddings.npz', format='npz')
model.export_embeddings('embeddings.csv', format='csv')
model.export_embeddings('embeddings.json', format='json')
```
### Model Information
```python
# Get detailed model information
info = model.get_model_info()
print(f"Status: {info['status']}")
print(f"Chunks: {info['num_chunks']}")
print(f"Dimension: {info['embedding_dimension']}")
print(f"Memory usage: {info['memory_usage_mb']['total_approximate']:.2f} MB")
```
## Physics Parameters Guide
Understanding the physics parameters helps you tune the model for your specific use case:
- **`d` (dimension)**: Higher dimensions capture more nuanced relationships but require more memory
- **`alpha` (repulsion)**: Controls how strongly dissimilar tokens repel each other
- **`beta` (attraction)**: Controls sequential attraction between adjacent tokens
- **`gamma` (damping)**: Higher values lead to faster convergence but may reduce quality
- **`r_cutoff`**: Limits interaction range, affecting both performance and quality
- **`dt`**: Smaller values improve stability but increase computation time
- **`scale`**: Initial randomization scale, affects convergence behavior
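To make these parameters concrete, here is a hypothetical NumPy sketch of a single integration step, assuming a plausible reading of the dynamics described above (short-range repulsion scaled by `alpha`, sequential attraction scaled by `beta`, velocity damping via `gamma`). This is not the library's actual C++ kernel, which uses spatial hashing instead of the O(n²) pairwise loop shown here:

```python
import numpy as np

def step(x, v, alpha=1.5, beta=0.5, gamma=0.2, r_cutoff=3.0, dt=0.05):
    """One damped-dynamics step for token particles x with velocities v."""
    f = np.zeros_like(x)
    # Pairwise repulsion within the cutoff radius.
    diff = x[:, None, :] - x[None, :, :]            # (n, n, d) displacement vectors
    dist = np.linalg.norm(diff, axis=-1)            # (n, n) pairwise distances
    np.fill_diagonal(dist, np.inf)                  # no self-interaction
    mask = dist < r_cutoff
    f += alpha * np.sum(
        np.where(mask[..., None], diff / dist[..., None] ** 2, 0.0), axis=1
    )
    # Sequential attraction pulls adjacent tokens toward each other.
    f[:-1] += beta * (x[1:] - x[:-1])
    f[1:] += beta * (x[:-1] - x[1:])
    # Damped semi-implicit Euler update; larger gamma settles faster.
    v = (1 - gamma * dt) * v + dt * f
    x = x + dt * v
    return x, v

rng = np.random.default_rng(42)
x = rng.normal(scale=0.1, size=(5, 8))  # 5 tokens, d = 8
v = np.zeros_like(x)
x, v = step(x, v)
```

The trade-offs in the list above fall out directly: a smaller `dt` means more, smaller steps (stability at the cost of compute), while `r_cutoff` bounds how many pairs contribute to the repulsion sum.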
### Recommended Configurations
**High Precision (slower, better quality):**
```python
params = {'d': 512, 'alpha': 2.0, 'beta': 0.8, 'r_cutoff': 4.0, 'dt': 0.02}
```
**Fast Inference (faster, good quality):**
```python
params = {'d': 128, 'alpha': 1.0, 'beta': 0.3, 'r_cutoff': 2.0, 'dt': 0.1}
```
**Balanced (recommended starting point):**
```python
params = {'d': 256, 'alpha': 1.5, 'beta': 0.5, 'r_cutoff': 3.0, 'dt': 0.05}
```
## API Reference
### LSdembed Class
#### `__init__(params=None)`
Initialize the lsdembed model with optional parameters.
#### `fit(texts, chunk_size=1000)`
Fit the model on a corpus of texts.
- `texts`: List of strings to train on
- `chunk_size`: Maximum characters per chunk
#### `embed_query(query)`
Embed a single query string.
- Returns: numpy array of embedding
#### `search(query, top_k=5)`
Search for similar chunks.
- Returns: List of (text, score) tuples
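As an illustration of the `(text, score)` ranking, the snippet below sketches similarity search with plain cosine scoring. The exact scoring used by `search` is not documented here, so treat this as an assumed, simplified model:

```python
import numpy as np

def top_k_cosine(query_vec, chunk_vecs, chunks, top_k=5):
    """Rank stored chunk embeddings by cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                                   # cosine similarity per chunk
    order = np.argsort(scores)[::-1][:top_k]         # highest scores first
    return [(chunks[i], float(scores[i])) for i in order]

chunks = ["alpha", "beta", "gamma"]
vecs = np.eye(3)                                     # toy orthogonal embeddings
results = top_k_cosine(np.array([1.0, 0.0, 0.0]), vecs, chunks, top_k=2)
# "alpha" ranks first with score 1.0
```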
#### `save_model(filepath, compress=True)`
Save the fitted model to disk.
#### `load_model(filepath)`
Load a saved model from disk.
#### `export_embeddings(filepath, format='npz')`
Export embeddings in specified format ('npz', 'csv', 'json').
#### `get_model_info()`
Get comprehensive model information.
#### `update_params(**kwargs)`
Update model parameters after initialization.
### TextProcessor Class
#### `tokenize(text)`
Tokenize text using the configured pattern.
#### `chunk_texts(texts, max_chars=300)`
Split texts into chunks.
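A minimal character-budget chunker, for intuition only (the library's `TextProcessor` may split differently, e.g. on sentence boundaries):

```python
def chunk_texts(texts, max_chars=300):
    """Greedily pack whole words into chunks of at most max_chars characters."""
    chunks = []
    for text in texts:
        current = ""
        for word in text.split():
            candidate = f"{current} {word}".strip()
            if current and len(candidate) > max_chars:
                chunks.append(current)   # budget exceeded: flush current chunk
                current = word
            else:
                current = candidate
        if current:
            chunks.append(current)       # flush the trailing chunk
    return chunks

parts = chunk_texts(["a b c d"], max_chars=3)  # → ["a b", "c d"]
```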
#### `calculate_idf(chunks)`
Calculate IDF scores for tokens.
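A sketch of a smoothed IDF, assuming the standard formula `idf(t) = log(N / (1 + df(t)))` where `df(t)` counts the chunks containing token `t`; the library's exact formula and tokenization are not documented here:

```python
import math

def calculate_idf(chunks):
    """Compute smoothed IDF scores from a list of text chunks."""
    n = len(chunks)
    df = {}
    for chunk in chunks:
        for token in set(chunk.lower().split()):  # count each token once per chunk
            df[token] = df.get(token, 0) + 1
    return {t: math.log(n / (1 + c)) for t, c in df.items()}

idf = calculate_idf(["the cat sat", "the dog ran", "a cat ran"])
# "the" appears in 2 of 3 chunks, so its IDF is lower than "dog"'s
```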
## Performance Tips
1. **Memory Management**: For large corpora, use smaller embedding dimensions or process in batches
2. **Parallel Processing**: Ensure OpenMP is available for best performance
3. **Parameter Tuning**: Start with balanced parameters and adjust based on your data
4. **Chunk Size**: Optimal chunk size depends on your text length and domain
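For tip 1, a rough way to budget memory before fitting, assuming float64 storage (8 bytes per value) for the chunk embedding matrix alone; actual usage also includes token data and index structures:

```python
def embedding_memory_mb(num_chunks, d, bytes_per_value=8):
    """Approximate size in MB of a (num_chunks, d) embedding matrix."""
    return num_chunks * d * bytes_per_value / (1024 ** 2)

embedding_memory_mb(100_000, 256)  # ≈ 195 MB
```

Halving `d` halves this footprint, which is why smaller dimensions are the first lever for large corpora.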
## Examples
See the `examples/` directory for complete examples:
- `basic_usage.py`: Simple usage example
- `model_persistence.py`: Saving and loading models
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Citation
If you use lsdembed in your research, please cite:
```bibtex
@software{lsdembed,
  title={lsdembed: Physics-Inspired Text Embedding Library},
  author={Ritwik Singh},
  year={2025},
  url={https://github.com/datasciritwik/lsdembed}
}
```
## Changelog
### Version 0.1.0
- Initial release
- Physics-inspired embedding algorithm
- C++ backend with Python bindings
- Model persistence and export functionality