lsdembed

- Name: lsdembed
- Version: 0.2.0
- Summary: Physics-inspired text embedding library
- Upload time: 2025-07-30 09:45:44
- Requires Python: >=3.8
- Keywords: text-embeddings, nlp, physics, text-processing, machine-learning
- Requirements: build, iniconfig, numpy, packaging, pandas, pluggy, pybind11, Pygments, pyproject_hooks, pytest, python-dateutil, pytz, scipy, setuptools, six, tzdata, wheel
# lsdembed - Physics-Inspired Text Embedding Library

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Lagrangian Semantic Dynamics Embeddings (LSDEmbed) is a text embedding library that uses physics-inspired algorithms to create high-quality semantic embeddings. By modeling text tokens as particles in a physical system subject to attractive and repulsive forces, lsdembed captures nuanced semantic relationships that traditional methods may miss.

<h3 align="center">
  <code>pip install lsdembed</code>
</h3>

## Features

- **Physics-Inspired Embeddings**: Uses particle physics simulation to model semantic relationships
- **High Performance**: Optimized C++ backend with OpenMP parallelization
- **Flexible Parameters**: Customizable physics parameters for different use cases
- **Memory Efficient**: Spatial hashing and memory-aware processing
- **Easy Persistence**: Save and load trained models with compression support
- **Multiple Export Formats**: Export embeddings as NPZ, CSV, or JSON

## Installation

### Prerequisites

- Python 3.8 or higher
- C++ compiler with C++17 support
- OpenMP (optional, for parallel processing)

### Install from Source

```bash
git clone https://github.com/datasciritwik/lsdembed.git
cd lsdembed
pip install -e .
```

### Dependencies

The library requires:
- `numpy>=1.21.0`
- `scipy>=1.7.0`
- `pandas>=1.3.0`
- `pybind11>=2.6.0` (for building)

## Quick Start

```python
from lsdembed import LSdembed

# Initialize with default parameters
model = LSdembed()

# Fit on your text corpus
texts = [
    "The quick brown fox jumps over the lazy dog",
    "Machine learning is a subset of artificial intelligence",
    "Natural language processing helps computers understand text"
]

model.fit(texts, chunk_size=500)

# Search for similar content
results = model.search("artificial intelligence", top_k=3)

for text, score in results:
    print(f"Score: {score:.4f}")
    print(f"Text: {text}")
    print("---")
```

## Advanced Usage

### Custom Parameters

```python
# Configure physics parameters for your use case
params = {
    'd': 256,           # Embedding dimension
    'alpha': 1.5,       # Repulsion strength
    'beta': 0.3,        # Attraction strength
    'gamma': 0.2,       # Damping coefficient
    'r_cutoff': 2.5,    # Force cutoff radius
    'dt': 0.05,         # Integration time step
    'scale': 0.1,       # Initial position scale
    'seed': 42          # Random seed for reproducibility
}

model = LSdembed(params)
```

### Model Persistence

```python
# Save trained model
model.save_model('my_model.pkl', compress=True)

# Load model later
loaded_model = LSdembed.from_pretrained('my_model.pkl.gz')

# Or load into an existing instance
model = LSdembed()
model.load_model('my_model.pkl.gz')
```

### Export Embeddings

```python
# Export in different formats
model.export_embeddings('embeddings.npz', format='npz')
model.export_embeddings('embeddings.csv', format='csv')
model.export_embeddings('embeddings.json', format='json')
```

### Model Information

```python
# Get detailed model information
info = model.get_model_info()
print(f"Status: {info['status']}")
print(f"Chunks: {info['num_chunks']}")
print(f"Dimension: {info['embedding_dimension']}")
print(f"Memory usage: {info['memory_usage_mb']['total_approximate']:.2f} MB")
```

## Physics Parameters Guide

Understanding the physics parameters helps you tune the model for your specific use case:

- **`d` (dimension)**: Higher dimensions capture more nuanced relationships but require more memory
- **`alpha` (repulsion)**: Controls how strongly dissimilar tokens repel each other
- **`beta` (attraction)**: Controls sequential attraction between adjacent tokens
- **`gamma` (damping)**: Higher values lead to faster convergence but may reduce quality
- **`r_cutoff`**: Limits interaction range, affecting both performance and quality
- **`dt`**: Smaller values improve stability but increase computation time
- **`scale`**: Initial randomization scale, affects convergence behavior
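For intuition, the dynamics these parameters govern can be sketched in a few lines of NumPy. This toy loop is illustrative only, not the library's optimized C++ implementation; the actual force laws and integrator may differ:

```python
import numpy as np

rng = np.random.default_rng(42)
d, n = 8, 5                              # embedding dimension, token count
scale, alpha, beta, gamma = 0.1, 1.5, 0.3, 0.2
r_cutoff, dt = 2.5, 0.05

pos = rng.normal(scale=scale, size=(n, d))   # initial token positions
vel = np.zeros((n, d))

for _ in range(100):
    forces = np.zeros((n, d))
    # Pairwise repulsion (alpha) within the cutoff radius
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            delta = pos[i] - pos[j]
            r = np.linalg.norm(delta) + 1e-9
            if r < r_cutoff:
                forces[i] += alpha * delta / r**3
    # Spring-like sequential attraction (beta) between adjacent tokens
    for i in range(n - 1):
        delta = pos[i + 1] - pos[i]
        forces[i] += beta * delta
        forces[i + 1] -= beta * delta
    # Damped explicit integration step (gamma, dt)
    vel = (1 - gamma * dt) * vel + dt * forces
    pos = pos + dt * vel

print(pos.shape)   # final positions double as toy embeddings
```

Raising `alpha` or lowering `r_cutoff` in this sketch spreads tokens apart faster, mirroring how the real parameters trade cohesion against separation.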

### Recommended Configurations

**High Precision (slower, better quality):**
```python
params = {'d': 512, 'alpha': 2.0, 'beta': 0.8, 'r_cutoff': 4.0, 'dt': 0.02}
```

**Fast Inference (faster, good quality):**
```python
params = {'d': 128, 'alpha': 1.0, 'beta': 0.3, 'r_cutoff': 2.0, 'dt': 0.1}
```

**Balanced (recommended starting point):**
```python
params = {'d': 256, 'alpha': 1.5, 'beta': 0.5, 'r_cutoff': 3.0, 'dt': 0.05}
```

## API Reference

### LSdembed Class

#### `__init__(params=None)`
Initialize the model with optional physics parameters.

#### `fit(texts, chunk_size=1000)`
Fit the model on a corpus of texts.
- `texts`: List of strings to train on
- `chunk_size`: Maximum characters per chunk

#### `embed_query(query)`
Embed a single query string.
- Returns: numpy array of embedding

#### `search(query, top_k=5)`
Search for similar chunks.
- Returns: List of (text, score) tuples

#### `save_model(filepath, compress=True)`
Save the fitted model to disk.

#### `load_model(filepath)`
Load a saved model from disk.

#### `export_embeddings(filepath, format='npz')`
Export embeddings in specified format ('npz', 'csv', 'json').

#### `get_model_info()`
Get comprehensive model information.

#### `update_params(**kwargs)`
Update model parameters after initialization.

### TextProcessor Class

#### `tokenize(text)`
Tokenize text using the configured pattern.

#### `chunk_texts(texts, max_chars=300)`
Split texts into chunks.

#### `calculate_idf(chunks)`
Calculate IDF scores for tokens.
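As a sketch of what `calculate_idf` computes, here is the standard smoothed IDF formula in plain Python; the exact smoothing and tokenization TextProcessor uses may differ:

```python
import math
from collections import Counter

def calculate_idf(chunks):
    """idf(t) = log(N / (1 + df(t))), where df(t) counts chunks containing t."""
    n = len(chunks)
    df = Counter()
    for chunk in chunks:
        df.update(set(chunk.lower().split()))   # count each token once per chunk
    return {tok: math.log(n / (1 + count)) for tok, count in df.items()}

chunks = ["the cat sat", "the dog ran", "a cat ran"]
idf = calculate_idf(chunks)
# "the" appears in 2 of 3 chunks, so its IDF is lower than "dog" (1 of 3)
print(idf["the"] < idf["dog"])
```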

## Performance Tips

1. **Memory Management**: For large corpora, use smaller embedding dimensions or process in batches
2. **Parallel Processing**: Ensure OpenMP is available for best performance
3. **Parameter Tuning**: Start with balanced parameters and adjust based on your data
4. **Chunk Size**: Optimal chunk size depends on your text length and domain
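Tip 1 can be implemented with a small batching helper (hypothetical, not part of the lsdembed API):

```python
def batched(texts, batch_size=1000):
    """Yield consecutive slices of `texts` of at most `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

corpus = [f"document {i}" for i in range(2500)]
sizes = [len(batch) for batch in batched(corpus, batch_size=1000)]
print(sizes)  # → [1000, 1000, 500]
```

Each batch can then be fit or searched independently, keeping peak memory proportional to the batch size rather than the full corpus.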

## Examples

See the `examples/` directory for complete examples:
- `basic_usage.py`: Simple usage example
- `model_persistence.py`: Saving and loading models

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Citation

If you use lsdembed in your research, please cite:

```bibtex
@software{lsdembed,
  title={lsdembed: Physics-Inspired Text Embedding Library},
  author={Ritwik Singh},
  year={2025},
  url={https://github.com/datasciritwik/lsdembed}
}
```

## Changelog

### Version 0.1.0
- Initial release
- Physics-inspired embedding algorithm
- C++ backend with Python bindings
- Model persistence and export functionality

            
