liuembeddings

Name: liuembeddings
Version: 1.1.0
Home page: https://github.com/himanshuclub88/liuembeddings
Summary: TensorFlow-based embeddings with a ChromaDB vector store for semantic search; easy embedding and search integration that cuts API costs by running locally. Lightweight and easy to learn.
Upload time: 2025-10-29 11:18:22
Author: Himanshu Singh
Requires Python: >=3.8
Keywords: embeddings, semantic-search, tensorflow, chromadb, nlp
Requirements: tensorflow, tensorflow-hub, chromadb, numpy
            
# LiuEmbedding = embedding + storage
**Save your money on expensive embedding models**

# LiuEmbedding

LiuEmbedding is a lightweight semantic search framework that combines **embedding generation** with **vector storage**. Built on TensorFlow embeddings and ChromaDB vector storage, it provides a unified solution for small to medium projects requiring efficient embedding, storage, and retrieval operations.

## 🚀 Why Choose LiuEmbedding?

### ⚡ Easy to Use - No Additional Setup Required
- **Minimal setup** - Get started in minutes, not hours
- **Zero configuration** - Works out of the box with sensible defaults
- **Automatic dependency management** - No manual ChromaDB or TensorFlow setup needed

### 💰 Cost-Effective Alternative
- **No expensive API calls** - Use open-source TensorFlow models instead of paid embedding services
- **Self-hosted solution** - Avoid recurring costs associated with cloud-based embedding APIs


## Core Architecture

### Embedding Layer
- **TensorFlow-based embedding generation** with consistent interface
- Model information exposure for debugging and observability
- Support for various pre-trained models and custom implementations

### Storage Layer  
- **ChromaDB-backed vector storage** with persistent HNSW indexing
- Metadata filtering for efficient similarity search and organization
- Comprehensive CRUD operations and batch ingestion capabilities

## Key Features

- **Unified API**: Single interface for both embedding generation and vector storage
- **Production Ready**: Integrated logging, validation, and error handling
- **Batch Operations**: Efficient batch ingestion and export to JSON for data portability
- **Text Processing**: Chunking with overlap and document packing for optimal retrieval
- **Lightweight**: Minimal dependencies while maintaining full functionality


LiuEmbeddings is a lightweight framework for semantic search built around TensorFlow-based embeddings and a ChromaDB-backed vector store. It targets small to medium projects that need fast embedding, storage, and retrieval with clear CRUD and batch APIs plus robust logging and validation out of the box.


### Installation

- Install the latest published version with `pip install liuembeddings`, or install from source with `pip install .` inside the cloned repository root.
- Python 3.8+ is required; a recent Python version is recommended for compatibility and performance.

```bash
pip install liuembeddings
```


### Quick start

- Initialize an embedder and vector store, then add documents and run a similarity search to retrieve relevant results.

```python
from liuembeddings import LiuEmbeddings, LiuVectorStore

embedder = LiuEmbeddings(model_name="USE")
store = LiuVectorStore(embedder, collection_name="my_docs")

store.add_texts([
    "Python is a programming language",
    "JavaScript is for web development"
])

results, documents = store.similarity_search(
    "What is Python?", n_results=2
)

print(documents)
```

The example shows a minimal flow: initialize, add, and search to get back matching texts quickly.


### Quick start: split_text

The following example mirrors the structure from example_usage.py and demonstrates chunking a long text, adding to the vector store, asking a question, and iterating on results while using the documented return shapes.

```python
from liuembeddings import LiuEmbeddings, LiuVectorStore, split_text

# Initialize
embedder = LiuEmbeddings(model_name="USE")
store = LiuVectorStore(embedder, collection_name="ml_knowledge")

# Long text
long_text = """
Machine learning is a powerful and rapidly growing method of data analysis...
Feature engineering is crucial for model performance...
"""

# Chunk and add
chunks = split_text(long_text, chunk_size=400, chunk_overlap=50)
store.add_texts(chunks)

# Ask a question
raw, docs = store.query("What techniques improve model accuracy?", n_results=2)

# Show the matched chunks
for i, d in enumerate(docs, 1):
    print(f"Answer {i}: {d[:250]}...")
```


### One‑liner semantic search

- Use the vector store search method to combine chunking, ingestion, and querying in a single call for rapid prototyping.

```python
from liuembeddings import LiuEmbeddings, LiuVectorStore

embedder = LiuEmbeddings()
store = LiuVectorStore(embedder, collection_name="my_docs")

long_doc = "Machine learning is a subset of AI. Deep learning uses neural networks."

# Ingest chunks and then search
store.search(
    text_document=long_doc,
    chunk_size=250,
    chunk_overlap=100
)

raw, docs = store.search(
    query="What is machine learning?",
    n_results=2
)

for d in docs:
    print(d)
```

This mirrors the end‑to‑end pattern shown in the examples and uses the `text_document` parameter name from the vector store's `search` method.


### Minimalistic Quick Start

- `fastquery` initializes the embedder and vector store internally, so there is no need to define them yourself, though you can still change the model and collection.[^1]
- Refer to the end of the page for detailed `fastquery` documentation: [Go to Fastquery](#fastquery)


```python
from liuembeddings import fastquery

# Simple use: embed and search in 3 lines
text = "New York is the largest city in the United States. Washington D.C. is the capital. California is a state."


fastquery(text_document=text)

raw, results = fastquery(
    query="Capital of USA?",
    n_results=2
)

for chunk in results:
    print(chunk)
```


## config.py

- Use this to change default values for the entire app, or pass values manually in individual function calls.

```python
from liuembeddings import LiuConfig as l

l.DEFAULT_BATCH_SIZE=32
l.DEFAULT_CHUNK_SIZE=2000
l.DEFAULT_COLLECTION_NAME='test_collection'

```
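
The config values act as global fallbacks; a parameter passed at call time wins. A minimal sketch (reusing the `store` from the quick start above, and assuming `query` falls back to `LiuConfig.DEFAULT_N_RESULTS` as documented in the Search Settings table below):

```python
from liuembeddings import LiuConfig

# Global fallback: used whenever n_results is not passed explicitly
LiuConfig.DEFAULT_N_RESULTS = 3

raw, docs = store.query("What is Python?")               # uses the config default (3)
raw, docs = store.query("What is Python?", n_results=5)  # per-call value overrides the default
```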


### Vector store API


---

# 🧠 `LiuVectorStore` — Semantic Vector Storage with ChromaDB

`LiuVectorStore` is a high-level wrapper around **ChromaDB** that integrates seamlessly with **TensorFlow-based embeddings** (`LiuEmbeddings`).
It provides a complete CRUD + semantic search interface with optional batch ingestion, persistence, and metadata filtering.

---

## ⚙️ Initialization

```python
from liuembeddings import LiuEmbeddings
from liuembeddings import LiuVectorStore

# Create embeddings and initialize the vector store
embedder = LiuEmbeddings()
store = LiuVectorStore(embedder, collection_name="my_collection")
```

✅ Automatically:

* Ensures the persistence path exists
* Creates or opens a ChromaDB collection
* Logs document count and health
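
Since persistence defaults to `LiuConfig.DEFAULT_CHROMA_PATH` (see the Configuration section below), a hedged sketch for pointing storage at a custom directory is to set that attribute before constructing the store:

```python
from liuembeddings import LiuConfig, LiuEmbeddings, LiuVectorStore

# Assumption: the store reads DEFAULT_CHROMA_PATH at initialization time
LiuConfig.DEFAULT_CHROMA_PATH = "./my_chroma_data"

embedder = LiuEmbeddings()
store = LiuVectorStore(embedder, collection_name="my_collection")
```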

---

## 📥 Adding Data

### `add_texts(texts, metadatas=None, ids=None)`

Adds one or more text documents to the vector store.

```python
texts = [
    "Python is a high-level programming language.",
    "Apache Spark is a distributed data processing framework."
]
store.add_texts(texts)
```

**Features**

* Auto-generates unique IDs if not provided
* Accepts optional metadata for each document
* Performs validation on inputs

```python
store.add_texts(
    texts=["Document A", "Document B"],
    metadatas=[{"topic": "A"}, {"topic": "B"}],
    ids=["docA", "docB"]
)
```

---

### `add_texts_batch(texts, batch_size=None, metadatas=None, ids=None)`

Adds large datasets in batches to manage memory efficiently.

```python
# Example: Add 20 texts in batches of 5
data = [f"Sample document {i}" for i in range(20)]
store.add_texts_batch(data, batch_size=5)
```

🧩 Automatically splits your dataset and logs batch progress.

---

## 🔍 Querying & Search

### `query(query_text, n_results=None) -> (raw_results, documents)`

Performs **semantic search** and returns both raw and simplified results.

```python
raw, docs = store.query("What is Spark?")
print(docs)
```

The `documents` element is a plain list of matching texts:

```python
['Apache Spark is a distributed data processing framework.']
```

---

### `similarity_search(query_text, n_results=None, with_score=None)`

Returns the most similar documents with similarity scores.

```python

# With similarity scores
raw, results = store.similarity_search("Python language", with_score=True)
for r in results:
    print(r["id"], r["similarity_score"], r["document"])
```

📈 When called with `with_score=True`, each result in the returned list includes:

```python
{
    "id": "...",
    "document": "...",
    "metadata": {...},
    "similarity_score": 0.93
}
```
## Cleaning raw output using `clean`
-----

```python
from liuembeddings import LiuEmbeddings, LiuVectorStore, split_text, clean

# Initialize
embedder = LiuEmbeddings(model_name="USE")
store = LiuVectorStore(embedder, collection_name="ml_knowledge")

# Long text
long_text = """
Machine learning is a powerful and rapidly growing method of data analysis...
Feature engineering is crucial for model performance...
"""

# Chunk and add
chunks = split_text(long_text, chunk_size=400, chunk_overlap=50)
store.add_texts(chunks)

# Ask a question
raw, docs = store.query("What techniques improve model accuracy?", n_results=2)

# Show the matched chunks
for i, d in enumerate(docs, 1):
    print(f"Answer {i}: {d[:250]}...")

#cleaning raw data:
clean_output = clean(raw)

print('CleanOutput')
for i in clean_output:
    print(i)
```
OUTPUT
```
Answer 1: machine learning is a powerful and rapidly growing method of data analysis... feature engineering is crucial for model performance......
CleanOutput
{'id': 'doc_0_1761401259744_1b0cf4', 'document': 'machine learning is a powerful and rapidly growing method of data analysis... feature engineering is crucial for model performance...', 'metadata': {'source': 'ml_knowledge'}, 'distance': 0.7753744125366211}   
```

---

## 🧾 Document Management

### `search_by_id(doc_id) -> dict | None`

Fetch a document and metadata by its unique ID.

```python
result = store.search_by_id("docA")
print(result)
```

Returns:

```python
{
  "id": "docA",
  "document": "Document A",
  "metadata": {"topic": "A"}
}
```

---

### `search_by_metadata(metadata_filter) -> list[dict]`

Find documents matching a specific metadata filter.

```python
docs = store.search_by_metadata({"topic": "B"})
```

Returns a list of `{id, document, metadata}` objects.

---

### `get_all() -> list[dict]`

Retrieve **all** documents and metadata from the collection.

```python
all_docs = store.get_all()
```

Useful for exporting or debugging.
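
For a quick manual export (independent of the built-in `save()` shown below), the returned list of `{id, document, metadata}` dicts can be dumped with the standard library; a minimal sketch:

```python
import json

all_docs = store.get_all()
with open("collection_dump.json", "w", encoding="utf-8") as f:
    json.dump(all_docs, f, indent=2)
print(f"Exported {len(all_docs)} documents")
```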

---

### `update_by_id(doc_id, new_text, new_metadata=None)`

Replace a document’s text (and optionally metadata).

```python
store.update_by_id("docA", "Updated Document A", {"topic": "Updated"})
```

✔️ Preserves the same ID — ideal for maintaining references.

---

### `delete_by_id(doc_id)`

Remove a specific document by ID.

```python
store.delete_by_id("docB")
```

---

## 📊 Collection Info & Utilities

### `count_documents() -> int`

Get the total number of stored documents.

```python
print("Document count:", store.count_documents())
```

---

### `save(path)`

Export all documents and metadata to a `.json` file.

```python
store.save("backup_my_collection.json")
```

---

### `info` (property)

Quick collection overview:

```python
print(store.info)
```

Output example:

```python
{
    "name": "my_collection",
    "document_count": 42,
    "embedding_model": "TensorFlow Universal Sentence Encoder"
}
```

---

## 🪄 One-Call Convenience Search

### `search(query=None, text_document=None, chunk_size=None, chunk_overlap=None, n_results=None)`

End-to-end helper that:

1. Splits long documents into chunks
2. Embeds and stores them
3. Performs similarity search in one call
4. The same call can add only, search only, or add and search together, depending on whether `text_document` and/or `query` are supplied

```python

document = """
my name is himanshu i am a data engineer working in tcs india pvt ltd.
i have experience in spark,hadoop,python,sql,azure,aws,tableau,power bi etc.
i love to work on data and build data pipelines and dashboards.
python is a high-level programming language known for its simplicity but it is not simple :).
""" 


#adding only
vector_store.search(
    text_document=document,
    chunk_size=250,
    chunk_overlap=100,
)


print("\nPerforming another semantic search using liu_search... \n only searching")
#searching only
raw,ans = vector_store.search(
    query="What himanshu does for a living?",
    n_results=1
)

```
Both **Adding and Searching**  Together
> ⚠️ **Note:** Do not run the add step multiple times; re-ingesting the same text leads to data duplication.

```python

document = """
my name is himanshu i am a data engineer working in tcs india pvt ltd.
i have experience in spark,hadoop,python,sql,azure,aws,tableau,power bi etc.
i love to work on data and build data pipelines and dashboards.
python is a high-level programming language known for its simplicity but it is not simple :).
""" 


# adding and searching together
vector_store.search(
    query="What himanshu does for a living?",
    text_document=document,
    chunk_size=250,
    chunk_overlap=100,
    n_results=1
)
```
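
One way to guard against accidental re-ingestion is to gate the add step on the collection being empty, using `count_documents()` (documented above); a sketch assuming a fresh collection per document set:

```python
# Only ingest if the collection is still empty
if vector_store.count_documents() == 0:
    vector_store.search(text_document=document, chunk_size=250, chunk_overlap=100)

raw, ans = vector_store.search(query="What himanshu does for a living?", n_results=1)
```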

---

## 🧩 Example Workflow

```python
# Create embeddings
embedder = LiuEmbeddings()

# Initialize vector store
store = LiuVectorStore(embedder, collection_name="knowledge_base")

# Add documents
texts = ["AI is transforming industries.", "Data engineering powers analytics."]
store.add_texts(texts)

# Perform search
_, docs = store.similarity_search("What is AI?")
print("Search results:", docs)

# Get collection info
print(store.info)
```

---

## 🧱 Design Overview

| Feature                | Description                                                 |
| ---------------------- | ----------------------------------------------------------- |
| **Persistence**        | Uses `chromadb.PersistentClient` for disk-based collections |
| **Batch Support**      | Handles large ingestions efficiently                        |
| **CRUD Operations**    | Add, Update, Delete, Retrieve                               |
| **Semantic Search**    | Embedding-based similarity using `LiuEmbeddings`            |
| **Metadata Filtering** | Query subsets via structured filters                        |
| **Export**             | JSON serialization for backups or migration                 |

---

### Worked example: solve a retrieval question

This worked example repeats the split_text flow from the quick start: chunk a long text, add it to the vector store, ask a question, and iterate over the results using the documented return shapes.

```python
from liuembeddings import LiuEmbeddings, LiuVectorStore, split_text

# Initialize
embedder = LiuEmbeddings(model_name="USE")
store = LiuVectorStore(embedder, collection_name="ml_knowledge")

# Long text
long_text = """
Machine learning is a powerful and rapidly growing method of data analysis...
Feature engineering is crucial for model performance...
"""

# Chunk and add
chunks = split_text(long_text, chunk_size=400, chunk_overlap=50)
store.add_texts(chunks)

# Ask a question
raw, docs = store.query("What techniques improve model accuracy?", n_results=2)

# Show the matched chunks
for i, d in enumerate(docs, 1):
    print(f"Answer {i}: {d[:250]}...")
```

This demonstrates unpacking the tuple returned by `query` and keeping only the second element, the list of matching documents.

### CRUD and scored search example

- Retrieve scored results for CRUD, update one by id, read it back, and list the total count, following the patterns used in the example.

```python
# Scored search for a targeted update
# (assumption: with_score=True returns the scored list of dicts described above)
raw, results = store.similarity_search(
    "What techniques improve model accuracy?",
    n_results=1,
    with_score=True,
)
first = results[0]

store.update_by_id(first["id"], "Machine learning drives innovation and efficiency")

# Verify by id
found = store.search_by_id(first["id"])
print(found["document"])

# Count all
print("Total documents:", store.count_documents())
```

The similarity_search shape is a list of dicts, which enables direct access to id, document, metadata, and similarity_score for downstream operations as used in the example.

### Batch ingestion and metadata filtering

- Use add_texts_batch to process large inputs, assign consistent metadata for later filtering, and fetch the subset with search_by_metadata for targeted work.

```python
# Prepare 100 documents with metadata
texts = [f"Document {i+1}: This is sample text for document {i+1}." for i in range(100)]
metas = [{"source": "Batch of 5"} for _ in range(100)]

# Ingest in batches of 10
store.add_texts_batch(texts, batch_size=10, metadatas=metas)

# Filter by metadata
subset = store.search_by_metadata({"source": "Batch of 5"})
for x in subset[:3]:
    print(x["id"], x["document"][:60], "...")
```

This mirrors the example’s approach to batched add and subsequent metadata filtering to isolate a logical group of documents.

### Text utilities

- Use `split_text` to chunk long content with overlap so context is preserved across chunk boundaries, and `clean_text` (if included in your setup) to normalize inputs prior to embedding, as shown in the README's text utilities section and example file; a short sketch follows.
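
A sketch of both utilities; the `clean_text` portion is hedged, since it may not be present in every installation:

```python
from liuembeddings import split_text

long_text = "Embeddings map text to vectors. " * 40  # stand-in long document

# Overlapping chunks preserve context across boundaries
chunks = split_text(long_text, chunk_size=200, chunk_overlap=40)
print(f"{len(chunks)} chunks; first chunk: {chunks[0][:60]}...")

# clean_text is optional: if your installed version exposes it, normalizing
# inputs before embedding might look like this (assumed signature: str -> str):
# from liuembeddings import clean_text
# chunks = [clean_text(c) for c in chunks]
```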



# Fastquery

LiuEmbeddings includes an advanced utility function for rapid prototyping and streamlined semantic search, `fastquery`. This section provides full documentation and usage examples for `fastquery` and clarifies key usage expectations such as embedding model consistency and API behavior. All major methods, including `fastquery`, are documented with code blocks and concise explanations.

***

## 🚀 Quick Embedding (`fastquery`)

- Always specify a collection name; do not rely on the default collection.
- Stick with the default embedding model, or pick a single model and use it for everything in a collection.

**`fastquery`** provides the fastest workflow for embedding and semantic search. It is designed for scenarios where users need to process a document and execute queries immediately—no manual collection or embedding model setup required.

**Key Features:**

- **Default Model:** Uses the `"USE"` Universal Sentence Encoder by default for embeddings.
- **Model Consistency:** The embedding model is fixed _per vector store instance_. Once texts are embedded with a given model, you cannot switch models for the same collection.
- **Single-call API:** Combines text chunking, embedding, storage, and querying in one function.
- **Minimal Setup:** No need to initialize LiuEmbeddings or LiuVectorStore directly—simply provide your text and query.

***

### Function Documentation

```python
fastquery(
        query: str = None,
        text_document: str = None,
        chunk_size: int = None,      # default from config file
        chunk_overlap: int = None,   # default from config file
        n_results: int = None,       # default from config file
        with_score: float = None,    # default from config file
        collection_name: str = fastquery.collection_name,
        model_name: str = fastquery.model_name
) -> list:
        """
        One-line semantic search function.

        Note: use the same model for embedding and searching.
        By default it uses the "USE" model.

        Combines chunking, embedding, storage, and search in a single call.
        Perfect for quick prototyping and small applications.

        Args:
            text_document: Long text or document to search within
            query: Query string to search for
            chunk_size: Size of text chunks (default: from config)
            chunk_overlap: Overlap between chunks (default: from config)
            n_results: Number of results to return (default: from config)
            with_score: return similarity_search-style scored results
            collection_name: Name for the vector store collection (default: from config)
            model_name: Embedding model to use (default: "USE")

        Returns:
            List of most similar chunks from the document

        Raises:
            ValueError: If inputs are invalid
            RuntimeError: If the operation fails
        """
```


***

### ⚡ Quickstart Example

- The fastquery utility provides a minimal setup for embedding and querying text within your vector database.
- It automatically handles model loading, text chunking, and search retrieval in just a few lines of code.


```python
from liuembeddings import fastquery

# Simple use: embed and search in 3 lines
text = "New York is the largest city in the United States. Washington D.C. is the capital. California is a state."

fastquery.collection_name="minimal_collection"

fastquery(text_document=text)

raw, results = fastquery(
    query="Capital of USA?",
    n_results=2
)

for chunk in results:
    print(chunk)
```

**🔹 Using Class Variables**
- You can configure `fastquery` globally before calling it.
- These class variables act as persistent defaults until they are changed or overridden.

You can customize `fastquery` behavior in **three ways**:

| Method                            | Description                                                                          | Recommended Use                        |
| :-------------------------------- | :----------------------------------------------------------------------------------- | :------------------------------------- |
| **Class Variables**               | Set once and apply globally for all future calls.                                    | ✅ *Easy and Recommended*              |
| **Function Parameters**           | Define per call — overrides both class and global defaults.                          | Use for temporary or dynamic settings. |
| **Global Defaults (`LiuConfig`)** | Automatically used when neither class variables nor function parameters are defined. | Used as fallback configuration.        |

```python
from liuembeddings import fastquery
fastquery.collection_name='liu-collection'  
fastquery.model_name='USE'
```
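
A sketch of the precedence order (per-call parameter > class variable > `LiuConfig` default), assuming `fastquery` resolves settings in that order as the table describes:

```python
from liuembeddings import fastquery, LiuConfig

LiuConfig.DEFAULT_COLLECTION_NAME = "fallback_collection"  # global default (lowest priority)
fastquery.collection_name = "liu-collection"               # class variable beats the global default

# A per-call parameter beats both, for this one call only
raw, results = fastquery(
    query="Capital of USA?",
    collection_name="one_off_collection"
)
```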

***

### Using `fastquery` with Custom Settings

```python
# Custom chunk size and overlap

fastquery(
    text_document="Deep learning uses neural networks. Machine learning is a subset of AI.",
    chunk_size=80,
    chunk_overlap=15,
)


raw, results = fastquery(
    query="What is machine learning?",
    n_results=1
)
print("Best answer:", results[0])
```


### Scores and Metadata

**Adding Documents to a Collection** adds a document to a named collection for later querying.

```python
from liuembeddings import fastquery

document = """
Luna loves exploring the night sky. Every weekend, she sets up her telescope on the rooftop to watch distant galaxies.
Her favorite constellation is Orion, and she can identify it even without a telescope.
Last month, she discovered a small comet passing near Jupiter and recorded its movement in her astronomy journal.
"""

# Add document to collection "story_collection"
fastquery(
    text_document=document,
    n_results=5,       
    collection_name="story_collection"
)
```

> ⚠️ **Note:** `n_results` specifies the maximum number of similar results to retrieve when querying.

---

**Querying the Collection**

When `with_score=0.4` (the default), `fastquery` returns:
* `raw`: the raw output from the database or retrieval engine.
* `ans`: a **list of matching documents**.

```python
raw, ans = fastquery(
    query="What celestial object did Luna discover?",
    collection_name="story_collection"
)

for item in ans:
    print(f"Answer: {item}")
```

**Example Output:**

```
Answer: Last month, she discovered a small comet passing near Jupiter and recorded its movement in her astronomy journal.
```
---

When `with_score=0.5`, `fastquery` returns:
* `raw`: the raw retrieval output.
* `ans`: a **list of dictionaries**, each containing:

  * `id` – the document ID in the collection
  * `document` – the text content
  * `metadata` – metadata associated with the document
  * `similarity_score` – similarity between the query and the document

```python
raw, ans = fastquery(
    query="What celestial object did Luna discover?",
    with_score=0.5,
    collection_name="story_collection"
)

for item in ans:
    print(f"id: {item['id']}")
    print(f"Document: {item['document']}")
    print(f"Metadata: {item['metadata']}")
    print(f"Similarity score: {item['similarity_score']}")
```

**Example Output:**

```
id: doc_1_1761378265744
Document: Last month, she discovered a small comet passing near Jupiter and recorded its movement in her astronomy journal.
Metadata: {'source': 'story_collection'}
Similarity score: 0.41
```

---

- You can filter results by similarity score to get only the most relevant documents:

```python
for item in ans:
    if item['similarity_score'] < 0.5:
        print(f"Answer: {item['document']}")
```
**Example Output:**
```
Answer: Last month, she discovered a small comet passing near Jupiter and recorded its movement in her astronomy journal.
```

> This allows you to exclude low-relevance documents from your results.

---

## Summary

* **Adding documents:** `fastquery(text_document, collection_name)`
* **Querying documents:** `fastquery(query, collection_name)`
* **Optional similarity scores:** Use `with_score` to get IDs, metadata, and similarity values.
* **Filtering:** You can filter results by similarity score for more precise retrieval.

This function is particularly useful for **quick semantic search**, **QA over text collections**, and **vector database integrations**.

---


***

### Notes on Model and Collection Management

- **Model Switching:** Once a vector store or collection is created with an embedding model, you cannot switch to another model for embedding/search in that collection. If you need to use a new model (e.g., USE/USEL), create a new collection:

```python
fastquery(
    text_document="...", 
    query="...", 
    model_name="USE",
    collection_name="my_new_collection"
)
```

Attempting to switch models within the same collection will result in an error.

***

## Summary Table: Quick Embedding API

| Function | Purpose | Returns | Model Switching | Use Case |
| :-- | :-- | :-- | :-- | :-- |
| fastquery | Rapid embed & search | Chunks/results | Not allowed | Quick prototyping, temporary |
| LiuEmbeddings | Manual embed model | Embedding vectors | At instantiation | Advanced/custom workflows |
| LiuVectorStore | Full CRUD/search | Document batches | At initialization | Persistent/high-volume apps |


***

## Complete Example: End-to-End Embedding and Immediate Query

```python
from liuembeddings import fastquery

long_doc = """
The solar system includes the Sun and the objects that orbit it, such as planets,
asteroids, and comets. Planets like Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus,
and Neptune revolve around the Sun.
"""

# Store and query in the same call.
# Use with CAUTION: re-ingesting the same data causes duplication. Add once, query multiple times.
raw, results = fastquery(
    text_document=long_doc,
    query="Which planets orbit the sun?",
    n_results=3
)

for answer in results:
    print(answer)
```


## Embedding

- You can generate raw embeddings for visualization and other purposes.
- Two embedding models are available:
-   `USE`: 512 dimensions (default)
-   `USEL`: 512 dimensions

```python
# Single query embedding
from liuembeddings import LiuEmbeddings

print("\nInitializing embedding model")
embedder = LiuEmbeddings(model_name="USE") # USE AND USEL

query_embedding = embedder.embed_query("What is machine learning?")

print(f"Query embedding dimension: {len(query_embedding)}")
print(f"First 5 embedded values in vector: {query_embedding[:5]}")

>>> Query embedding dimension: 512
>>> First 5 embedded values in vector: [-0.004198556765913963, -0.07223273068666458, -0.06091027706861496, -0.007246586959809065, -0.022054186090826988]

```

```python
# Multiple documents
from liuembeddings import LiuEmbeddings

embedder = LiuEmbeddings()  # default: "USE"

documents = [
    "quickly bring the cash",
    "rush and get the money",
    "this boy love potato"
]

doc_embeddings = embedder.embed_documents(documents)
print(f"Embedded {len(doc_embeddings)} documents")
for i in doc_embeddings:
    print(f"Embedded:{i}")

>>>Embedded 3 documents
>>>Embedded:[-0.0444050170481205, -0.059026677161455154, 0.012156504206359386, 0.035481732338666916, 0.0641937330365181, 0.01327..,...
>>>Embedded:[0.045886047184467316, -0.07462672889232635, 0.07747738808393478, 0.00464465469121933, 0.07081839442253113, 0.01971..,...
>>>Embedded:[0.05816115066409111, 0.02540922723710537, 0.0019424951169639826, 0.029804585501551628, -0.03550824150443077, -0.05927..,...
```
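
Since the raw vectors are plain lists of floats, downstream math is straightforward. For example, a cosine-similarity sketch with NumPy (already a dependency), comparing the first two documents embedded above:

```python
import numpy as np

a = np.array(doc_embeddings[0])
b = np.array(doc_embeddings[1])

# Cosine similarity: dot product of the L2-normalized vectors
cos_sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity between doc 0 and doc 1: {cos_sim:.3f}")
```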



---

## External Embeddings Model

You can add a new embedding model by modifying `LiuConfig.AVAILABLE_MODELS`. While you can use any embedding model of your choice, it is recommended to use the predefined models like **USE** or **USEL** for compatibility.

### Adding an External Embedding Model

1. Ensure the model is compatible with TensorFlow Hub.
2. Provide the model URL, embedding dimension, and a custom name.

```python
from liuembeddings import LiuEmbeddings, LiuVectorStore, LiuConfig

# Add a custom external embedding model
LiuConfig.AVAILABLE_MODELS['NNLM'] = {
    'url': "https://tfhub.dev/google/nnlm-en-dim50/2",
    'dimension': 50,
    'name': 'NNLM (custom name of your choice)'
}

# Initialize the custom embedder
custom_embedder = LiuEmbeddings('NNLM')


custom_vector = LiuVectorStore(
    embedding_model=custom_embedder,
    collection_name="knowledge-NNLM"
)

# Multiple documents
documents = [
    "all boy's lovewin prize",
    "all boy's love money",
    "all boy's love protein",
    "lorem ipsum lorem ipsum",
    "loremipsum lorem iplorem"
]

custom_vector.add_texts(documents)

raw, docs = custom_vector.search('what all the boys love')

print("answer:")
for i in docs:
    print(i)

>>> answer:
    all boy's love money
    all boy's love protein
    all boy's lovewin prize

print(raw)
```

> ⚠️ **Note:** Make sure the embedding model is compatible with the hub and that the dimensions match your configuration.

---

***

## Final Tips

- **Use `fastquery` for fast, disposable vector stores and quick searches.**
- **Switch models only by creating new collections—existing data uses a single embedding model.**
- For larger or persistent applications, use the full LiuEmbeddings and LiuVectorStore APIs documented above for manual control, persistence, batch processing, and advanced CRUD.

***
---









## 🧩 Configuration

This module defines global configuration settings for the **LiuEmbeddings Framework**, which manages embedding models, chunking, vector storage, and search behavior across all components.

---

### ⚙️ Class: `LiuConfig`

The `LiuConfig` class holds configurable parameters that control how embeddings, vector databases, and search processes behave.
You can modify or extend these settings as needed for custom use cases.

---

### 🔹 Default Embedding Model Settings

| Attribute         | Description                               | Default                                                   |
| ----------------- | ----------------------------------------- | --------------------------------------------------------- |
| `EMBEDDING_MODEL` | Default TensorFlow Hub URL for embeddings | `"https://tfhub.dev/google/universal-sentence-encoder/4"` |
| `MODEL_DIMENSION` | Embedding vector dimension                | `512`                                                     |

---

### 🔹 Available Models

You can select from predefined embedding models or extend them with custom ones.

| Key      | URL                                                                                                                        | Dimension | Name                       |
| -------- | -------------------------------------------------------------------------------------------------------------------------- | --------- | -------------------------- |
| **USE**  | [https://tfhub.dev/google/universal-sentence-encoder/4](https://tfhub.dev/google/universal-sentence-encoder/4)             | 512       | Universal Sentence Encoder |
| **USEL** | [https://tfhub.dev/google/universal-sentence-encoder-large/5](https://tfhub.dev/google/universal-sentence-encoder-large/5) | 512       | Universal Sentence Encoder Large |


### 🔹 The example below shows how to add a custom model

```python
from liuembeddings import LiuConfig

# Add a new model
LiuConfig.AVAILABLE_MODELS["NNLM"] = {
    "url": "https://tfhub.dev/google/nnlm-en-dim50/2",
    "dimension": 50,
    "name": "NNLM"
}
```

---

### 🔹 Chunking Settings

| Attribute               | Description                           | Default |
| ----------------------- | ------------------------------------- | ------- |
| `DEFAULT_CHUNK_SIZE`    | Number of characters per text chunk   | `1000`  |
| `DEFAULT_CHUNK_OVERLAP` | Overlapping characters between chunks | `200`   |

---

### 🔹 Vector Store Settings

| Attribute                 | Description                                                     | Default                |
| ------------------------- | --------------------------------------------------------------- | ---------------------- |
| `DEFAULT_CHROMA_PATH`     | Directory where ChromaDB stores data                            | `"./chroma_db"`        |
| `DEFAULT_COLLECTION_NAME` | Default vector collection name                                  | `"default_collection"` |
| `DISTANCE_METRIC`         | Similarity metric used for vector search (`cosine`, `l2`, `ip`) | `"cosine"`             |

---

### 🔹 Search Settings

| Attribute           | Description                           | Default |
| ------------------- | ------------------------------------- | ------- |
| `DEFAULT_N_RESULTS` | Default number of documents to return | `3`     |
| `MAX_N_RESULTS`     | Maximum allowed results per query     | `100`   |

---

### 🔹 Batch Processing

| Attribute            | Description                                        | Default |
| -------------------- | -------------------------------------------------- | ------- |
| `DEFAULT_BATCH_SIZE` | Default batch size for bulk embedding or insertion | `100`   |

---

### 🔹 Model Caching

| Attribute            | Description                             | Default |
| -------------------- | --------------------------------------- | ------- |
| `ENABLE_MODEL_CACHE` | Enable caching for faster model loading | `True`  |

---

### 🔹 Logging

| Attribute    | Description                                             | Default                                                  |
| ------------ | ------------------------------------------------------- | -------------------------------------------------------- |
| `LOG_LEVEL`  | Logging verbosity (`DEBUG`, `INFO`, `WARNING`, `ERROR`) | `"INFO"`                                                 |
| `LOG_FORMAT` | Format string for logging messages                      | `"%(asctime)s - %(name)s - %(levelname)s - %(message)s"` |

Example:

```python
import logging
logging.basicConfig(level=LiuConfig.LOG_LEVEL, format=LiuConfig.LOG_FORMAT)
```

---

## 🧠 How to Modify Configuration Variables

You can easily override or customize configuration values **without editing the source file**.
Simply assign new values to the class attributes before initializing components.

```python
from liuembeddings import LiuConfig

LiuConfig.DEFAULT_CHROMA_PATH = "./custom_chroma"
LiuConfig.DEFAULT_COLLECTION_NAME = "medical_articles"

print(LiuConfig.DEFAULT_COLLECTION_NAME)
# Output: medical_articles
```

---

```python
LiuConfig.LOG_LEVEL = "DEBUG"
LiuConfig.DEFAULT_N_RESULTS = 10
```

---

## 🧾 Summary

* `LiuConfig` centralizes all framework-level settings.
* You can **modify variables at runtime** to adapt to new projects or datasets.
* Encourages a clean and flexible configuration approach for reproducible experiments.

---


### Requirements and project structure

- Requirements include Python 3.8+, TensorFlow 2.8+, ChromaDB 0.3+, and NumPy 1.20+, and the repository layout includes embeddings.py, vectorstore.py, utils, config, logger modules, tests, and packaging files as shown in the current README.


### Notes on return shapes and usage patterns

- query and similarity_search return a tuple of (raw_results, documents) when with_score is not set; access the second element to iterate over just the text matches.
- similarity_search with with_score=True returns a list of dicts, so each id can be fed into update_by_id and search_by_id for targeted modification and retrieval, as illustrated in the example.
- search composes splitting, ingestion, and similarity search and returns the same shape as similarity_search in the default mode, enabling quick prototyping without wiring multiple calls in your application code; a short sketch of both unpacking patterns follows.
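
Assuming the `store` from the earlier examples:

```python
# Default mode: (raw_results, documents); iterate the second element
raw, docs = store.query("What is Python?", n_results=2)
for d in docs:
    print(d)

# Scored mode (assumption: the same (raw, results) tuple, with results
# as the list of dicts described above)
raw, scored = store.similarity_search("What is Python?", n_results=2, with_score=True)
for r in scored:
    print(r["id"], r["similarity_score"])
```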


### Existing examples

- The repository examples cover basic embedding, text processing, CRUD, batch operations, one‑liner search, and error handling, and the updated examples above align with those flows while clarifying parameter names and return shapes.

### Contributing

- Fork the repository, create a feature branch, commit changes, push, and open a pull request following the guidelines already present in the README to keep contributions consistent and reviewable.


### License and citation

- The project is MIT‑licensed, and a BibTeX entry is provided in the README if you cite LiuEmbeddings in academic work, keeping attribution straightforward and standardized.


### Changelog and roadmap

- The initial release includes core embedding, vector store, utilities, tests, and documentation, and the roadmap lists future enhancements like additional embedding models, REST, Docker, and advanced filtering to guide community contributions.


[^1]: example_usage.py, example.mimimal.py

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/himanshuclub88/liuembeddings",
    "name": "liuembeddings",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "embeddings semantic-search tensorflow chromadb nlp",
    "author": "Himanshu Singh",
    "author_email": "Himanshuclub88@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/47/3a/df39771d1c663deffb29ef3fa22d2b03aa941923a5e78910d88cc346387e/liuembeddings-1.1.0.tar.gz",
    "platform": null,
    "description": "\r\n# LiuEmbedding = embedding + storage\r\n**Save your money on expensive embedding models**\r\n\r\n# LiuEmbedding\r\n\r\nLiuEmbedding is a lightweight semantic search framework that combines **embedding generation** with **vector storage**. Built on TensorFlow embeddings and ChromaDB vector storage, it provides a unified solution for small to medium projects requiring efficient embedding, storage, and retrieval operations.\r\n\r\n## \ud83d\ude80 Why Choose LiuEmbedding?\r\n\r\n### \u26a1 Easy to Use - No Additional Setup Required\r\n- **Minimal setup** - Get started in minutes, not hours\r\n- **Zero configuration** - Works out of the box with sensible defaults\r\n- **Automatic dependency management** - No manual ChromaDB or TensorFlow setup needed\r\n\r\n### \ud83d\udcb0 Cost-Effective Alternative\r\n- **No expensive API calls** - Use open-source TensorFlow models instead of paid embedding services\r\n- **Self-hosted solution** - Avoid recurring costs associated with cloud-based embedding APIs\r\n\r\n\r\n## Core Architecture\r\n\r\n### Embedding Layer\r\n- **TensorFlow-based embedding generation** with consistent interface\r\n- Model information exposure for debugging and observability\r\n- Support for various pre-trained models and custom implementations\r\n\r\n### Storage Layer  \r\n- **ChromaDB-backed vector storage** with persistent HNSW indexing\r\n- Metadata filtering for efficient similarity search and organization\r\n- Comprehensive CRUD operations and batch ingestion capabilities\r\n\r\n## Key Features\r\n\r\n- **Unified API**: Single interface for both embedding generation and vector storage\r\n- **Production Ready**: Integrated logging, validation, and error handling\r\n- **Batch Operations**: Efficient batch ingestion and export to JSON for data portability\r\n- **Text Processing**: Chunking with overlap and document packing for optimal retrieval\r\n- **Lightweight**: Minimal dependencies while maintaining full functionality\r\n\r\n\r\nLiuEmbeddings is a lightweight framework for semantic search built around TensorFlow-based embeddings and a ChromaDB-backed vector store. It targets small to medium projects that need fast embedding, storage, and retrieval with clear CRUD and batch APIs plus robust logging and validation out of the box.\r\n\r\n\r\n### Installation\r\n\r\n- pip install liuembeddings for the latest published package version, or install from source using pip install liuembeddings . 
inside the cloned repository root.\r\n- Python 3.8+ and recent to ensure compatibility and performance.\r\n\r\n```python\r\n    pip install liuembeddings\r\n```\r\n\r\n\r\n### Quick start\r\n\r\n- Initialize an embedder and vector store, then add documents and run a similarity search to retrieve relevant results.\r\n\r\n```python\r\nfrom liuembeddings import LiuEmbeddings, LiuVectorStore\r\n\r\nembedder = LiuEmbeddings(model_name=\"USE\")\r\nstore = LiuVectorStore(embedder, collection_name=\"my_docs\")\r\n\r\nstore.add_texts([\r\n    \"Python is a programming language\",\r\n    \"JavaScript is for web development\"\r\n])\r\n\r\nresults, documents = store.similarity_search(\r\n    \"What is Python?\", n_results=2\r\n)\r\n\r\nprint(documents)\r\n```\r\n\r\nThe example shows a minimal flow: initialize, add, and search to get back matching texts quickly.\r\n\r\n\r\n### Quick start: split_text\r\n\r\nThe following example mirrors the structure from example_usage.py and demonstrates chunking a long text, adding to the vector store, asking a question, and iterating on results while using the documented return shapes.\r\n\r\n```python\r\nfrom liuembeddings import LiuEmbeddings, LiuVectorStore, split_text\r\n\r\n# Initialize\r\nembedder = LiuEmbeddings(model_name=\"USE\")\r\nstore = LiuVectorStore(embedder, collection_name=\"ml_knowledge\")\r\n\r\n# Long text\r\nlong_text = \"\"\"\r\nMachine learning is a powerful and rapidly growing method of data analysis...\r\nFeature engineering is crucial for model performance...\r\n\"\"\"\r\n\r\n# Chunk and add\r\nchunks = split_text(long_text, chunk_size=400, chunk_overlap=50)\r\nstore.add_texts(chunks)\r\n\r\n# Ask a question\r\nraw, docs = store.query(\"What techniques improve model accuracy?\", n_results=2)\r\n\r\n# Show the matched chunks\r\nfor i, d in enumerate(docs, 1):\r\n    print(f\"Answer {i}: {d[:250]}...\")\r\n```\r\n\r\n\r\n### One\u2011liner semantic search\r\n\r\n- Use the vector store search method to combine chunking, ingestion, and querying in a single call for rapid prototyping.\r\n\r\n```python\r\nfrom liuembeddings import LiuEmbeddings, LiuVectorStore\r\n\r\nembedder = LiuEmbeddings()\r\nstore = LiuVectorStore(embedder, collection_name=\"my_docs\")\r\n\r\nlong_doc = \"Machine learning is a subset of AI. Deep learning uses neural networks.\"\r\n\r\n# Ingest chunks and then search\r\nstore.search(\r\n    text_document=long_doc,\r\n    chunk_size=250,\r\n    chunk_overlap=100\r\n)\r\n\r\nraw, docs = store.search(\r\n    query=\"What is machine learning?\",\r\n    n_results=2\r\n)\r\n\r\nfor d in docs:\r\n    print(d)\r\n```\r\n\r\nThis mirrors the end\u2011to\u2011end pattern shown in the examples and uses the text_document parameter name used by the vector store\u2019s search method implementation and examples.\r\n\r\n\r\n### Minimilistic Quick Start\r\n\r\n- Initialize an embedder and vector store intenally not need to define but we can change model and collection.[^1]\r\n- REFER end of the page to learn about fastquery in detailed [Go to Fastquery](#fastquery)\r\n\r\n\r\n```python\r\nfrom liuembeddings import fastquery\r\n\r\n# Simple use: embed and search in 3 lines\r\ntext = \"New York is the largest city in the United States. Washington D.C. is the capital. 
California is a state.\"\r\n\r\n\r\nfastquery(text_document=text,)\r\n\r\nraw,results = fastquery(\r\n    query=\"Capital of USA?\",\r\n    n_results=2   \r\n)\r\n\r\nfor chunk in results:\r\n    print(chunk)\r\n```\r\n\r\n\r\n## config.py\r\n\r\n- use this to change deafult variable for entire app or you can give it manully during function calls\r\n\r\n```python\r\nfrom liuembeddings import LiuConfig as l\r\n\r\nl.DEFAULT_BATCH_SIZE=32\r\nl.DEFAULT_CHUNK_SIZE=2000\r\nl.DEFAULT_COLLECTION_NAME='test_collection'\r\n\r\n```\r\n\r\n\r\n### Vector store API\r\n\r\nHere\u2019s a **professional and detailed `README.md`** section for your `liuembeddings/vectorstore.py` module \u2014 written in GitHub-style Markdown with clean code formatting, proper examples, and explanations of all vector store methods:\r\n\r\n---\r\n\r\n# \ud83e\udde0 `LiuVectorStore` \u2014 Semantic Vector Storage with ChromaDB\r\n\r\n`LiuVectorStore` is a high-level wrapper around **ChromaDB** that integrates seamlessly with **TensorFlow-based embeddings** (`LiuEmbeddings`).\r\nIt provides a complete CRUD + semantic search interface with optional batch ingestion, persistence, and metadata filtering.\r\n\r\n---\r\n\r\n## \u2699\ufe0f Initialization\r\n\r\n```python\r\nfrom liuembeddings import LiuEmbeddings\r\nfrom liuembeddings import LiuVectorStore\r\n\r\n# Create embeddings and initialize the vector store\r\nembedder = LiuEmbeddings()\r\nstore = LiuVectorStore(embedder, collection_name=\"my_collection\")\r\n```\r\n\r\n\u2705 Automatically:\r\n\r\n* Ensures the persistence path exists\r\n* Creates or opens a ChromaDB collection\r\n* Logs document count and health\r\n\r\n---\r\n\r\n## \ud83d\udce5 Adding Data\r\n\r\n### `add_texts(texts, metadatas=None, ids=None)`\r\n\r\nAdds one or more text documents to the vector store.\r\n\r\n```python\r\ntexts = [\r\n    \"Python is a high-level programming language.\",\r\n    \"Apache Spark is a distributed data processing framework.\"\r\n]\r\nstore.add_texts(texts)\r\n```\r\n\r\n**Features**\r\n\r\n* Auto-generates unique IDs if not provided\r\n* Accepts optional metadata for each document\r\n* Performs validation on inputs\r\n\r\n```python\r\nstore.add_texts(\r\n    texts=[\"Document A\", \"Document B\"],\r\n    metadatas=[{\"topic\": \"A\"}, {\"topic\": \"B\"}],\r\n    ids=[\"docA\", \"docB\"]\r\n)\r\n```\r\n\r\n---\r\n\r\n### `add_texts_batch(texts, batch_size=None, metadatas=None, ids=None)`\r\n\r\nAdds large datasets in batches to manage memory efficiently.\r\n\r\n```python\r\n# Example: Add 20 texts in batches of 5\r\ndata = [f\"Sample document {i}\" for i in range(20)]\r\nstore.add_texts_batch(data, batch_size=5)\r\n```\r\n\r\n\ud83e\udde9 Automatically splits your dataset and logs batch progress.\r\n\r\n---\r\n\r\n## \ud83d\udd0d Querying & Search\r\n\r\n### `query(query_text, n_results=None) -> (raw_results, documents)`\r\n\r\nPerforms **semantic search** and returns both raw and simplified results.\r\n\r\n```python\r\nraw, docs = store.query(\"What is Spark?\")\r\nprint(docs)\r\n```\r\n\r\nReturns:\r\n\r\n```python\r\n(['Apache Spark is a distributed data processing framework.'])\r\n```\r\n\r\n---\r\n\r\n### `similarity_search(query_text, n_results=None, with_score=None)`\r\n\r\nReturns the most similar documents with similarity scores.\r\n\r\n```python\r\n\r\n# With similarity scores\r\nraw,results, _ = store.similarity_search(\"Python language\")\r\nfor r in results:\r\n    print(r[\"id\"], r[\"similarity_score\"], r[\"document\"])\r\n```\r\n\r\n\ud83d\udcc8 When 
`.similarity_search`, each result includes:\r\n\r\n```python\r\n{\r\n    \"id\": \"...\",\r\n    \"document\": \"...\",\r\n    \"metadata\": {...},\r\n    \"similarity_score\": 0.93\r\n}\r\n```\r\n## Cleaning raw output using `clean`\r\n-----\r\n\r\n```python\r\nfrom liuembeddings import LiuEmbeddings, LiuVectorStore, split_text, clean\r\n\r\n# Initialize\r\nembedder = LiuEmbeddings(model_name=\"USE\")\r\nstore = LiuVectorStore(embedder, collection_name=\"ml_knowledge\")\r\n\r\n# Long text\r\nlong_text = \"\"\"\r\nMachine learning is a powerful and rapidly growing method of data analysis...\r\nFeature engineering is crucial for model performance...\r\n\"\"\"\r\n\r\n# Chunk and add\r\nchunks = split_text(long_text, chunk_size=400, chunk_overlap=50)\r\nstore.add_texts(chunks)\r\n\r\n# Ask a question\r\nraw, docs = store.query(\"What techniques improve model accuracy?\", n_results=2)\r\n\r\n# Show the matched chunks\r\nfor i, d in enumerate(docs, 1):\r\n    print(f\"Answer {i}: {d[:250]}...\")\r\n\r\n#cleaning raw data:\r\nclean_output = clean(raw)\r\n\r\nprint('CleanOutput')\r\nfor i in clean_output:\r\n    print(i)\r\n```\r\nOUTPUT\r\n```\r\nAnswer 1: machine learning is a powerful and rapidly growing method of data analysis... feature engineering is crucial for model performance......\r\nCleanOutput\r\n{'id': 'doc_0_1761401259744_1b0cf4', 'document': 'machine learning is a powerful and rapidly growing method of data analysis... feature engineering is crucial for model performance...', 'metadata': {'source': 'ml_knowledge'}, 'distance': 0.7753744125366211}   \r\n```\r\n\r\n-----\r\n\r\n---\r\n\r\n## \ud83e\uddfe Document Management\r\n\r\n### `search_by_id(doc_id) -> dict | None`\r\n\r\nFetch a document and metadata by its unique ID.\r\n\r\n```python\r\nresult = store.search_by_id(\"docA\")\r\nprint(result)\r\n```\r\n\r\nReturns:\r\n\r\n```python\r\n{\r\n  \"id\": \"docA\",\r\n  \"document\": \"Document A\",\r\n  \"metadata\": {\"topic\": \"A\"}\r\n}\r\n```\r\n\r\n---\r\n\r\n### `search_by_metadata(metadata_filter) -> list[dict]`\r\n\r\nFind documents matching a specific metadata filter.\r\n\r\n```python\r\ndocs = store.search_by_metadata({\"topic\": \"B\"})\r\n```\r\n\r\nReturns a list of `{id, document, metadata}` objects.\r\n\r\n---\r\n\r\n### `get_all() -> list[dict]`\r\n\r\nRetrieve **all** documents and metadata from the collection.\r\n\r\n```python\r\nall_docs = store.get_all()\r\n```\r\n\r\nUseful for exporting or debugging.\r\n\r\n---\r\n\r\n### `update_by_id(doc_id, new_text, new_metadata=None)`\r\n\r\nReplace a document\u2019s text (and optionally metadata).\r\n\r\n```python\r\nstore.update_by_id(\"docA\", \"Updated Document A\", {\"topic\": \"Updated\"})\r\n```\r\n\r\n\u2714\ufe0f Preserves the same ID \u2014 ideal for maintaining references.\r\n\r\n---\r\n\r\n### `delete_by_id(doc_id)`\r\n\r\nRemove a specific document by ID.\r\n\r\n```python\r\nstore.delete_by_id(\"docB\")\r\n```\r\n\r\n---\r\n\r\n## \ud83d\udcca Collection Info & Utilities\r\n\r\n### `count_documents() -> int`\r\n\r\nGet the total number of stored documents.\r\n\r\n```python\r\nprint(\"Document count:\", store.count_documents())\r\n```\r\n\r\n---\r\n\r\n### `save(path)`\r\n\r\nExport all documents and metadata to a `.json` file.\r\n\r\n```python\r\nstore.save(\"backup_my_collection.json\")\r\n```\r\n\r\n---\r\n\r\n### `info` (property)\r\n\r\nQuick collection overview:\r\n\r\n```python\r\nprint(store.info)\r\n```\r\n\r\nOutput example:\r\n\r\n```python\r\n{\r\n    \"name\": \"my_collection\",\r\n    
\"document_count\": 42,\r\n    \"embedding_model\": \"TensorFlow Universal Sentence Encoder\"\r\n}\r\n```\r\n\r\n---\r\n\r\n## \ud83e\ude84 One-Call Convenience Search\r\n\r\n### `search(query=None, text_document=None, chunk_size=None, chunk_overlap=None, n_results=None)`\r\n\r\nEnd-to-end helper that:\r\n\r\n1. Splits long documents into chunks\r\n2. Embeds and stores them\r\n3. Performs similarity search in one call\r\n4. same function can do `**Adding and Searching**` based on query and text_document or `**Adding and Searching**` can be done together\r\n\r\n```python\r\n\r\ndocument = \"\"\"\r\nmy name is himanshu i am a data engineer working in tcs india pvt ltd.\r\ni have experience in spark,hadoop,python,sql,azure,aws,tableau,power bi etc.\r\ni love to work on data and build data pipelines and dashboards.\r\npython is a high-level programming language known for its simplicity but it is not simple :).\r\n\"\"\" \r\n\r\n\r\n#adding only\r\nvector_store.search(\r\n    text_document=document,\r\n    chunk_size=250,\r\n    chunk_overlap=100,\r\n)\r\n\r\n\r\nprint(\"\\nPerforming another semantic search using liu_search... \\n only searching\")\r\n#searching only\r\nraw,ans = vector_store.search(\r\n    query=\"What himanshu does for a living?\",\r\n    n_results=1\r\n)\r\n\r\n```\r\nBoth **Adding and Searching**  Together\r\n> \u26a0\ufe0f **Note:** don't run add query multiple time leading to data duplication.\r\n\r\n```python\r\n\r\ndocument = \"\"\"\r\nmy name is himanshu i am a data engineer working in tcs india pvt ltd.\r\ni have experience in spark,hadoop,python,sql,azure,aws,tableau,power bi etc.\r\ni love to work on data and build data pipelines and dashboards.\r\npython is a high-level programming language known for its simplicity but it is not simple :).\r\n\"\"\" \r\n\r\n\r\n#adding only\r\nvector_store.search(\r\n    query=\"What himanshu does for a living?\",\r\n    text_document=document,\r\n    chunk_size=250,\r\n    chunk_overlap=100,\r\n    n_results=1\r\n)\r\n```\r\n\r\n---\r\n\r\n## \ud83e\udde9 Example Workflow\r\n\r\n```python\r\n# Create embeddings\r\nembedder = LiuEmbeddings()\r\n\r\n# Initialize vector store\r\nstore = LiuVectorStore(embedder, collection_name=\"knowledge_base\")\r\n\r\n# Add documents\r\ntexts = [\"AI is transforming industries.\", \"Data engineering powers analytics.\"]\r\nstore.add_texts(texts)\r\n\r\n# Perform search\r\n_, docs = store.similarity_search(\"What is AI?\")\r\nprint(\"Search results:\", docs)\r\n\r\n# Get collection info\r\nprint(store.info)\r\n```\r\n\r\n---\r\n\r\n## \ud83e\uddf1 Design Overview\r\n\r\n| Feature                | Description                                                 |\r\n| ---------------------- | ----------------------------------------------------------- |\r\n| **Persistence**        | Uses `chromadb.PersistentClient` for disk-based collections |\r\n| **Batch Support**      | Handles large ingestions efficiently                        |\r\n| **CRUD Operations**    | Add, Update, Delete, Retrieve                               |\r\n| **Semantic Search**    | Embedding-based similarity using `LiuEmbeddings`            |\r\n| **Metadata Filtering** | Query subsets via structured filters                        |\r\n| **Export**             | JSON serialization for backups or migration                 |\r\n\r\n---\r\n\r\n### Worked example: solve a retrieval question\r\n\r\nThe following example mirrors the structure from example_usage.py and demonstrates chunking a long text, adding to the vector store, asking a 
question, and iterating on results while using the documented return shapes.\r\n\r\n```python\r\nfrom liuembeddings import LiuEmbeddings, LiuVectorStore, split_text\r\n\r\n# Initialize\r\nembedder = LiuEmbeddings(model_name=\"USE\")\r\nstore = LiuVectorStore(embedder, collection_name=\"ml_knowledge\")\r\n\r\n# Long text\r\nlong_text = \"\"\"\r\nMachine learning is a powerful and rapidly growing method of data analysis...\r\nFeature engineering is crucial for model performance...\r\n\"\"\"\r\n\r\n# Chunk and add\r\nchunks = split_text(long_text, chunk_size=400, chunk_overlap=50)\r\nstore.add_texts(chunks)\r\n\r\n# Ask a question\r\nraw, docs = store.query(\"What techniques improve model accuracy?\", n_results=2)\r\n\r\n# Show the matched chunks\r\nfor i, d in enumerate(docs, 1):\r\n    print(f\"Answer {i}: {d[:250]}...\")\r\n```\r\n\r\nThis demonstrates selecting only the documents from the returned tuple after calling query, exactly as shown in the example where ans = ans is used to extract the document list.\r\n\r\n### CRUD and scored search example\r\n\r\n- Retrieve scored results for CRUD, update one by id, read it back, and list the total count, following the patterns used in the example.\r\n\r\n```python\r\n# Scored search for a targeted update\r\nraw,first = store.similarity_search(\r\n    \"What techniques improve model accuracy?\",\r\n    n_results=1,\r\n)\r\n\r\nstore.update_by_id(first[\"id\"], \"Machine learning drives innovation and efficiency\")\r\n\r\n# Verify by id\r\nfound = store.search_by_id(first[\"id\"])\r\nprint(found[\"document\"])\r\n\r\n# Count all\r\nprint(\"Total documents:\", store.count_documents())\r\n```\r\n\r\nThe similarity_search shape is a list of dicts, which enables direct access to id, document, metadata, and similarity_score for downstream operations as used in the example.\r\n\r\n### Batch ingestion and metadata filtering\r\n\r\n- Use add_texts_batch to process large inputs, assign consistent metadata for later filtering, and fetch the subset with search_by_metadata for targeted work.\r\n\r\n```python\r\n# Prepare 100 documents with metadata\r\ntexts = [f\"Document {i+1}: This is sample text for document {i+1}.\" for i in range(100)]\r\nmetas = [{\"source\": \"Batch of 5\"} for _ in range(100)]\r\n\r\n# Ingest in batches of 10\r\nstore.add_texts_batch(texts, batch_size=10, metadatas=metas)\r\n\r\n# Filter by metadata\r\nsubset = store.search_by_metadata({\"source\": \"Batch of 5\"})\r\nfor x in subset[:3]:\r\n    print(x[\"id\"], x[\"document\"][:60], \"...\")\r\n```\r\n\r\nThis mirrors the example\u2019s approach to batched add and subsequent metadata filtering to isolate a logical group of documents.\r\n\r\n### Text utilities\r\n\r\n- Use split_text for chunking long content with overlap to preserve context across chunk boundaries, and clean_text if included in your setup to normalize inputs prior to embedding as shown in the README\u2019s text utilities section and example file.\r\n\r\n\r\n\r\n# Fastquery\r\n\r\nLiuEmbeddings includes an advanced utility function for rapid prototyping and streamlined semantic search, `fastquery`. This section adds full documentation and usage examples for `fastquery`, and clarifies key usage expectations such as embedding model consistency and API behaviors. 
# Fastquery

LiuEmbeddings includes an advanced utility for rapid prototyping and streamlined semantic search: `fastquery`. This section documents `fastquery` with code blocks and concise explanations, and clarifies key usage expectations such as embedding-model consistency and API behavior.

***

## 🚀 Quick Embedding (`fastquery`)

- Always set a collection name explicitly; do not rely on the default collection.
- Stick with the default embedding model, or use one model consistently everywhere.

**`fastquery`** provides the fastest workflow for embedding and semantic search. It is designed for scenarios where users need to process a document and execute queries immediately, with no manual collection or embedding model setup required.

**Key Features:**

- **Default Model:** Uses the `"USE"` Universal Sentence Encoder by default for embeddings.
- **Model Consistency:** The embedding model is fixed _per vector store instance_. Once texts are embedded with a given model, you cannot switch models for the same collection.
- **Single-call API:** Combines text chunking, embedding, storage, and querying in one function.
- **Minimal Setup:** No need to initialize LiuEmbeddings or LiuVectorStore directly; simply provide your text and query.

***

### Function Documentation

```python
fastquery(
    query: str = None,
    text_document: str = None,
    chunk_size: int = None,      # default from config file
    chunk_overlap: int = None,   # default from config file
    n_results: int = None,       # default from config file
    with_score: float = None,    # default from config file
    collection_name: str = fastquery.collection_name,
    model_name: str = fastquery.model_name,
) -> list:
    """
    One-line semantic search function.

    Note: use the same model for embedding and searching.
    By default it uses the "USE" model.

    Combines chunking, embedding, storage, and search in a single call.
    Perfect for quick prototyping and small applications.

    Args:
        text_document: Long text or document to search within
        query: Query string to search for
        chunk_size: Size of text chunks (default: from config)
        chunk_overlap: Overlap between chunks (default: from config)
        n_results: Number of results to return (default: from config)
        with_score: Return scored results, as similarity_search does
        collection_name: Name for the vector store collection (default: from config)
        model_name: Embedding model to use (default: "USE")

    Returns:
        List of the most similar chunks from the document

    Raises:
        ValueError: If inputs are invalid
        RuntimeError: If the operation fails
    """
```

***
### ⚡ Quickstart Example

- The `fastquery` utility provides a minimal setup for embedding and querying text in your vector database.
- It automatically handles model loading, text chunking, and search retrieval in just a few lines of code.

```python
from liuembeddings import fastquery

# Simple use: embed and search in a few lines
text = "New York is the largest city in the United States. Washington D.C. is the capital. California is a state."

fastquery.collection_name = "minimal_collection"

fastquery(text_document=text)

raw, results = fastquery(
    query="Capital of USA?",
    n_results=2
)

for chunk in results:
    print(chunk)
```

**🔹 Using Class Variables**

You can configure **`fastquery`** globally **before calling it**. These class variables act as **persistent defaults** until they are changed or overridden.

You can customize `fastquery` behavior in **three ways**:

| Method                            | Description                                                                           | Recommended Use                        |
| :-------------------------------- | :------------------------------------------------------------------------------------ | :------------------------------------- |
| **Class Variables**               | Set once and apply globally for all future calls.                                     | ✅ *Easy and recommended*              |
| **Function Parameters**           | Defined per call; overrides both class variables and global defaults.                 | Use for temporary or dynamic settings. |
| **Global Defaults (`LiuConfig`)** | Automatically used when neither class variables nor function parameters are defined.  | Used as fallback configuration.        |

```python
from liuembeddings import fastquery

fastquery.collection_name = 'liu-collection'
fastquery.model_name = 'USE'
```
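The class-variable method is shown above, and per-call parameters appear in the next section; for the third method, a minimal sketch of setting the `LiuConfig` fallbacks (these attribute names are documented in the Configuration section later in this README):

```python
from liuembeddings import LiuConfig

# Global fallbacks: used when neither fastquery class variables
# nor per-call parameters are provided.
LiuConfig.DEFAULT_CHUNK_SIZE = 500
LiuConfig.DEFAULT_CHUNK_OVERLAP = 100
LiuConfig.DEFAULT_N_RESULTS = 5
```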
***

### Using `fastquery` with Custom Settings

```python
# Custom chunk size and overlap
fastquery(
    text_document="Deep learning uses neural networks. Machine learning is a subset of AI.",
    chunk_size=80,
    chunk_overlap=15,
)

raw, results = fastquery(
    query="What is machine learning?",
    n_results=1
)
print("Best answer:", results[0])
```

### Scores and Metadata

**Adding Documents to a Collection** stores a document in a named collection for later querying.

```python
from liuembeddings import fastquery

document = """
Luna loves exploring the night sky. Every weekend, she sets up her telescope on the rooftop to watch distant galaxies.
Her favorite constellation is Orion, and she can identify it even without a telescope.
Last month, she discovered a small comet passing near Jupiter and recorded its movement in her astronomy journal.
"""

# Add the document to the collection "story_collection"
fastquery(
    text_document=document,
    n_results=5,
    collection_name="story_collection"
)
```

> ⚠️ **Note:** `n_results` specifies the maximum number of similar results to retrieve when querying.

---

**Querying the Collection**

When `with_score=0.4` (the default), `fastquery` returns:

* `raw`: the raw output from the database or retrieval engine.
* `ans`: a **list of matching documents**.

```python
raw, ans = fastquery(
    query="What celestial object did Luna discover?",
    collection_name="story_collection"
)

for item in ans:
    print(f"Answer: {item}")
```

**Example Output:**

```
Answer: Last month, she discovered a small comet passing near Jupiter and recorded its movement in her astronomy journal.
```
---

When `with_score=0.5` is passed explicitly, `fastquery` returns:

* `raw`: the raw retrieval output.
* `ans`: a **list of dictionaries**, each containing:

  * `id`: the document ID in the collection
  * `document`: the text content
  * `metadata`: metadata associated with the document
  * `similarity_score`: similarity between the query and the document

```python
raw, ans = fastquery(
    query="What celestial object did Luna discover?",
    with_score=0.5,
    collection_name="story_collection"
)

for item in ans:
    print(f"id: {item['id']}")
    print(f"Document: {item['document']}")
    print(f"Metadata: {item['metadata']}")
    print(f"Similarity score: {item['similarity_score']}")
```

**Example Output:**

```
id: doc_1_1761378265744
Document: Last month, she discovered a small comet passing near Jupiter and recorded its movement in her astronomy journal.
Metadata: {'source': 'story_collection'}
Similarity score: 0.41
```

---

- You can filter results by similarity score to keep only the most relevant documents (note that in this example the filter keeps *low* scores, matching the 0.41 shown above):

```python
for item in ans:
    if item['similarity_score'] < 0.5:
        print(f"Answer: {item['document']}")
```
**Example Output:**
```
Answer: Last month, she discovered a small comet passing near Jupiter and recorded its movement in her astronomy journal.
```

> This allows you to exclude low-relevance documents from your results.

---

## Summary

* **Adding documents:** `fastquery(text_document, collection_name)`
* **Querying documents:** `fastquery(query, collection_name)`
* **Optional similarity scores:** Use `with_score` to get IDs, metadata, and similarity values.
* **Filtering:** You can filter results by similarity score for more precise retrieval.

This function is particularly useful for **quick semantic search**, **QA over text collections**, and **vector database integrations**.
***

### Notes on Model and Collection Management

- **Model Switching:** Once a vector store or collection is created with an embedding model, you cannot switch to another model for embedding or search in that collection. If you need a different model (e.g., USE vs. USEL), create a new collection:

```python
fastquery(
    text_document="...",
    query="...",
    model_name="USE",
    collection_name="my_new_collection"
)
```

Attempting to switch models within the same collection will result in an error, as sketched below.
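A minimal sketch of that failure mode; the exact exception type is an assumption based on the `ValueError`/`RuntimeError` behavior documented above and may differ in practice:

```python
from liuembeddings import fastquery

# Create a collection with the default "USE" model.
fastquery(text_document="some text", model_name="USE",
          collection_name="fixed_model_collection")

# Reusing the same collection with a different model should fail.
try:
    fastquery(query="anything", model_name="USEL",
              collection_name="fixed_model_collection")
except (ValueError, RuntimeError) as err:
    print("Model switch rejected:", err)
```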
***

## Summary Table: Quick Embedding API

| Function | Purpose | Returns | Model Switching | Use Case |
| :-- | :-- | :-- | :-- | :-- |
| `fastquery` | Rapid embed & search | Chunks/results | Not allowed | Quick prototyping, temporary stores |
| `LiuEmbeddings` | Manual embedding model | Embedding vectors | At instantiation | Advanced/custom workflows |
| `LiuVectorStore` | Full CRUD/search | Document batches | At initialization | Persistent/high-volume apps |

***

## Complete Example: End-to-End Embedding and Immediate Query

```python
from liuembeddings import fastquery

long_doc = """
The solar system includes the Sun and the objects that orbit it, such as planets,
asteroids, and comets. Planets like Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus,
and Neptune revolve around the Sun.
"""

# Storing and querying in the same call.
# Use with CAUTION: re-ingesting the same data leads to duplication.
# Add once, then query as many times as you like.
results = fastquery(
    text_document=long_doc,
    query="Which planets orbit the sun?",
    n_results=3
)

for answer in results:
    print(answer)
```

## Embedding

- You can generate embeddings directly, e.g. for visualization or other downstream purposes.
- Two embedding models are available:
  - `USE`: 512 dimensions (default)
  - `USEL`: 512 dimensions

```python
# Single query embedding
from liuembeddings import LiuEmbeddings

print("\nInitializing embedding model")
embedder = LiuEmbeddings(model_name="USE")  # "USE" or "USEL"

query_embedding = embedder.embed_query("What is machine learning?")

print(f"Query embedding dimension: {len(query_embedding)}")
print(f"First 5 embedded values in vector: {query_embedding[:5]}")

>>> Query embedding dimension: 512
>>> First 5 embedded values in vector: [-0.004198556765913963, -0.07223273068666458, -0.06091027706861496, -0.007246586959809065, -0.022054186090826988]
```

```python
# Multiple documents
from liuembeddings import LiuEmbeddings

embedder = LiuEmbeddings()  # defaults to "USE"

documents = [
    "quickly bring the cash",
    "rush and get the money",
    "this boy love potato"
]

doc_embeddings = embedder.embed_documents(documents)
print(f"Embedded {len(doc_embeddings)} documents")
for vec in doc_embeddings:
    print(f"Embedded: {vec}")

>>> Embedded 3 documents
>>> Embedded: [-0.0444050170481205, -0.059026677161455154, 0.012156504206359386, 0.035481732338666916, 0.0641937330365181, 0.01327...]
>>> Embedded: [0.045886047184467316, -0.07462672889232635, 0.07747738808393478, 0.00464465469121933, 0.07081839442253113, 0.01971...]
>>> Embedded: [0.05816115066409111, 0.02540922723710537, 0.0019424951169639826, 0.029804585501551628, -0.03550824150443077, -0.05927...]
```

---

## External Embeddings Model

You can add a new embedding model by modifying `LiuConfig.AVAILABLE_MODELS`. While you can use any embedding model of your choice, the predefined **USE** or **USEL** models are recommended for compatibility.

### Adding an External Embedding Model

1. Ensure the model is compatible with TensorFlow Hub.
2. Provide the model URL, embedding dimension, and a custom name.

```python
from liuembeddings import LiuEmbeddings, LiuVectorStore, LiuConfig

# Add a custom external embedding model
LiuConfig.AVAILABLE_MODELS['NNLM'] = {
    'url': "https://tfhub.dev/google/nnlm-en-dim50/2",
    'dimension': 50,
    'name': 'NNLM (custom name of your choice)'
}

# Initialize the custom embedder
custom_embedder = LiuEmbeddings('NNLM')

custom_vector = LiuVectorStore(
    embedding_model=custom_embedder,
    collection_name="knowledge-NNLM"
)

# Multiple documents
documents = [
    "all boy's lovewin prize",
    "all boy's love money",
    "all boy's love protein",
    "lorem ipsum lorem ipsum",
    "loremipsum lorem iplorem"
]

custom_vector.add_texts(documents)

raw, docs = custom_vector.search('what all the boys love')

print("answer:")
for d in docs:
    print(d)

>>> answer:
    all boy's love money
    all boy's love protein
    all boy's lovewin prize

print(raw)
```

> ⚠️ **Note:** Make sure the embedding model is compatible with TensorFlow Hub and that the dimension matches your configuration.
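Continuing the example above, a quick sanity check for that note; it embeds a probe string and compares the vector length against the registered dimension:

```python
# (continues the NNLM example: custom_embedder and LiuConfig in scope)
# Embed one probe string and verify the dimension registered in
# LiuConfig.AVAILABLE_MODELS matches what the model actually returns.
probe = custom_embedder.embed_query("dimension check")
expected = LiuConfig.AVAILABLE_MODELS['NNLM']['dimension']
assert len(probe) == expected, f"got {len(probe)}, expected {expected}"
```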
---

## Final Tips

- **Use `fastquery` for fast, disposable vector stores and quick searches.**
- **Switch models only by creating new collections; existing data uses a single embedding model.**
- For larger or persistent applications, use the full `LiuEmbeddings` and `LiuVectorStore` APIs documented above for manual control, persistence, batch processing, and advanced CRUD.

---

## 🧩 Configuration

This module defines global configuration settings for the **LiuEmbeddings Framework**, which manages embedding models, chunking, vector storage, and search behavior across all components.

---

### ⚙️ Class: `LiuConfig`

The `LiuConfig` class holds configurable parameters that control how embeddings, vector databases, and search processes behave. You can modify or extend these settings as needed for custom use cases.

---

### 🔹 Default Embedding Model Settings

| Attribute         | Description                               | Default                                                   |
| ----------------- | ----------------------------------------- | --------------------------------------------------------- |
| `EMBEDDING_MODEL` | Default TensorFlow Hub URL for embeddings | `"https://tfhub.dev/google/universal-sentence-encoder/4"` |
| `MODEL_DIMENSION` | Embedding vector dimension                | `512`                                                     |

---

### 🔹 Available Models

You can select from predefined embedding models or extend them with custom ones.

| Key      | URL                                                                                                                        | Dimension | Name                             |
| -------- | -------------------------------------------------------------------------------------------------------------------------- | --------- | -------------------------------- |
| **USE**  | [https://tfhub.dev/google/universal-sentence-encoder/4](https://tfhub.dev/google/universal-sentence-encoder/4)             | 512       | Universal Sentence Encoder       |
| **USEL** | [https://tfhub.dev/google/universal-sentence-encoder-large/5](https://tfhub.dev/google/universal-sentence-encoder-large/5) | 512       | Universal Sentence Encoder Large |

### 🔹 Example: adding a custom model

```python
from liuembeddings import LiuConfig

# Add a new model
LiuConfig.AVAILABLE_MODELS["NNLM"] = {
    "url": "https://tfhub.dev/google/nnlm-en-dim50/2",
    "dimension": 50,
    "name": "NNLM"
}
```

---

### 🔹 Chunking Settings

| Attribute               | Description                           | Default |
| ----------------------- | ------------------------------------- | ------- |
| `DEFAULT_CHUNK_SIZE`    | Number of characters per text chunk   | `1000`  |
| `DEFAULT_CHUNK_OVERLAP` | Overlapping characters between chunks | `200`   |

---

### 🔹 Vector Store Settings

| Attribute                 | Description                                                     | Default                |
| ------------------------- | --------------------------------------------------------------- | ---------------------- |
| `DEFAULT_CHROMA_PATH`     | Directory where ChromaDB stores data                            | `"./chroma_db"`        |
| `DEFAULT_COLLECTION_NAME` | Default vector collection name                                  | `"default_collection"` |
| `DISTANCE_METRIC`         | Similarity metric used for vector search (`cosine`, `l2`, `ip`) | `"cosine"`             |

---

### 🔹 Search Settings

| Attribute           | Description                           | Default |
| ------------------- | ------------------------------------- | ------- |
| `DEFAULT_N_RESULTS` | Default number of documents to return | `3`     |
| `MAX_N_RESULTS`     | Maximum allowed results per query     | `100`   |

---

### 🔹 Batch Processing

| Attribute            | Description                                        | Default |
| -------------------- | -------------------------------------------------- | ------- |
| `DEFAULT_BATCH_SIZE` | Default batch size for bulk embedding or insertion | `100`   |

---

### 🔹 Model Caching

| Attribute            | Description                             | Default |
| -------------------- | --------------------------------------- | ------- |
| `ENABLE_MODEL_CACHE` | Enable caching for faster model loading | `True`  |
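Like the other settings, the cache flag can be overridden at runtime; a minimal sketch (what exactly is cached is internal to the library, so treat the comment as an assumption):

```python
from liuembeddings import LiuConfig

# Disable model caching before creating any embedders, e.g. if you
# want each LiuEmbeddings instance to load the TF Hub model fresh.
LiuConfig.ENABLE_MODEL_CACHE = False
```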
---

### 🔹 Logging

| Attribute    | Description                                             | Default                                                  |
| ------------ | ------------------------------------------------------- | -------------------------------------------------------- |
| `LOG_LEVEL`  | Logging verbosity (`DEBUG`, `INFO`, `WARNING`, `ERROR`) | `"INFO"`                                                 |
| `LOG_FORMAT` | Format string for logging messages                      | `"%(asctime)s - %(name)s - %(levelname)s - %(message)s"` |

Example:

```python
import logging

from liuembeddings import LiuConfig

logging.basicConfig(level=LiuConfig.LOG_LEVEL, format=LiuConfig.LOG_FORMAT)
```

---

## 🧠 How to Modify Configuration Variables

You can easily override or customize configuration values **without editing the source file**. Simply assign new values to the class attributes before initializing components.

```python
from liuembeddings import LiuConfig

LiuConfig.DEFAULT_CHROMA_PATH = "./custom_chroma"
LiuConfig.DEFAULT_COLLECTION_NAME = "medical_articles"

print(LiuConfig.DEFAULT_COLLECTION_NAME)
# Output: medical_articles
```

---

```python
LiuConfig.LOG_LEVEL = "DEBUG"
LiuConfig.DEFAULT_N_RESULTS = 10
```

---

## 🧾 Summary

* `LiuConfig` centralizes all framework-level settings.
* You can **modify variables at runtime** to adapt to new projects or datasets.
* Encourages a clean and flexible configuration approach for reproducible experiments.

---

### Requirements and project structure

- Requirements include Python 3.8+, TensorFlow 2.8+, ChromaDB 1.2+ (per the packaged requirements), and NumPy 1.20+; the repository layout includes embeddings.py, vectorstore.py, utils, config, and logger modules, tests, and packaging files as shown in the current README.

### Notes on return shapes and usage patterns

- `query` and `similarity_search` return a tuple of `(raw_results, documents)` when `with_score` is off; iterate over the second element to work with just the text matches, as demonstrated in the worked example above.
- `similarity_search` with `with_score=True` returns a list of dicts whose `id` can be fed into `update_by_id` and `search_by_id` for targeted modification and retrieval, as illustrated in the CRUD example.
- `search` composes splitting, ingestion, and similarity search, and returns the same shape as `similarity_search` in the default mode, enabling quick prototyping without wiring multiple calls in your application code.

### Existing examples

- The repository examples cover basic embedding, text processing, CRUD, batch operations, one-liner search, and error handling; the examples above align with those flows while clarifying parameter names and return shapes.

### Contributing

- Fork the repository, create a feature branch, commit changes, push, and open a pull request following the guidelines in the README to keep contributions consistent and reviewable.

### License and citation

- The project is MIT-licensed, and a BibTeX entry is provided in the README for citing LiuEmbeddings in academic work, keeping attribution straightforward and standardized.

### Changelog and roadmap

- The initial release includes core embedding, the vector store, utilities, tests, and documentation; the roadmap lists future enhancements such as additional embedding models, a REST API, Docker support, and advanced filtering to guide community contributions.
",
    "bugtrack_url": null,
    "license": null,
    "summary": "TensorFlow-based embeddings with ChromaDB vector store for semantic search easy embeddings and search integration. decrease api charge as run on local light weight and easy to learn",
    "version": "1.1.0",
    "project_urls": {
        "Bug Reports": "https://github.com/himanshuclub88/liuembeddings/issues",
        "Documentation": "https://himanshuclub88.github.io/liuembeddings/",
        "Homepage": "https://github.com/himanshuclub88/liuembeddings",
        "Source": "https://github.com/himanshuclub88/liuembeddings"
    },
    "split_keywords": [
        "embeddings",
        "semantic-search",
        "tensorflow",
        "chromadb",
        "nlp"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "8e8729468392228aaede6c0f3798cb0be670966613e52eb1abe9bcbd103fd4ec",
                "md5": "c4ebceb67e373fc28335b1ff312186d6",
                "sha256": "b799f882e1106050ebf96c789f23a74a33193d506974873c2d634aedd40817e5"
            },
            "downloads": -1,
            "filename": "liuembeddings-1.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c4ebceb67e373fc28335b1ff312186d6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 30158,
            "upload_time": "2025-10-29T11:18:20",
            "upload_time_iso_8601": "2025-10-29T11:18:20.709681Z",
            "url": "https://files.pythonhosted.org/packages/8e/87/29468392228aaede6c0f3798cb0be670966613e52eb1abe9bcbd103fd4ec/liuembeddings-1.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "473adf39771d1c663deffb29ef3fa22d2b03aa941923a5e78910d88cc346387e",
                "md5": "b6414490d2c7a6d117d86d572793d454",
                "sha256": "09a8b2fb7b57ed88af02cdfe6bdb7f4451321066b0926f1f56345574a9b46c35"
            },
            "downloads": -1,
            "filename": "liuembeddings-1.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "b6414490d2c7a6d117d86d572793d454",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 52160,
            "upload_time": "2025-10-29T11:18:22",
            "upload_time_iso_8601": "2025-10-29T11:18:22.172523Z",
            "url": "https://files.pythonhosted.org/packages/47/3a/df39771d1c663deffb29ef3fa22d2b03aa941923a5e78910d88cc346387e/liuembeddings-1.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-29 11:18:22",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "himanshuclub88",
    "github_project": "liuembeddings",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "tensorflow",
            "specs": [
                [
                    ">=",
                    "2.8.0"
                ]
            ]
        },
        {
            "name": "tensorflow-hub",
            "specs": [
                [
                    ">=",
                    "0.12.0"
                ]
            ]
        },
        {
            "name": "chromadb",
            "specs": [
                [
                    ">=",
                    "1.2.1"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.20.0"
                ]
            ]
        }
    ],
    "lcname": "liuembeddings"
}
        