# SQLRooms RAG
A Python package for preparing and querying vector embeddings stored in DuckDB for RAG (Retrieval Augmented Generation) applications.
## Overview
This tool follows the approach outlined in [Developing a RAG Knowledge Base with DuckDB](https://motherduck.com/blog/search-using-duckdb-part-2/) to:
1. Load markdown files from a specified directory
2. Split them into chunks (default 512 tokens)
3. Generate vector embeddings using HuggingFace models
4. Store the embeddings in a DuckDB database for efficient retrieval
## Installation
### From PyPI (when published)
```bash
pip install sqlrooms-rag
```
### From source with uv
This project uses [uv](https://github.com/astral-sh/uv) for development.
```bash
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install from source
cd python/rag-embedding
uv sync
```
### Dependencies
The package includes:
- llama-index (core RAG framework)
- llama-index-embeddings-huggingface (HuggingFace embeddings)
- llama-index-vector-stores-duckdb (DuckDB vector store)
- sentence-transformers (embedding models)
- torch (ML framework)
- duckdb (database)
## Usage
### Basic Usage
Process markdown files from a directory and create a DuckDB knowledge base:
```bash
uv run prepare-embeddings /path/to/docs -o generated-embeddings/knowledge_base.duckdb
```
Or use the Python API:
```python
from sqlrooms_rag import prepare_embeddings
prepare_embeddings(
    input_dir="/path/to/docs",
    output_db="generated-embeddings/knowledge_base.duckdb",
    chunk_size=512,
    embed_model_name="BAAI/bge-small-en-v1.5",
    embed_dim=384,
)
```
### Examples
#### Process documentation files
```bash
# Process all .md files in the docs directory
uv run prepare-embeddings ../../docs -o generated-embeddings/sqlrooms_docs.duckdb
```
#### Use custom chunk size
```bash
# Use smaller chunks for more granular retrieval
uv run prepare-embeddings docs -o generated-embeddings/kb.duckdb --chunk-size 256
```
#### Use a different embedding model
```bash
# Use all-MiniLM-L6-v2 (dimension: 384)
uv run prepare-embeddings docs -o generated-embeddings/kb.duckdb \
  --model "sentence-transformers/all-MiniLM-L6-v2" \
  --embed-dim 384
```
### Command-Line Options
```
positional arguments:
  input_dir             Directory containing markdown (.md) files to process

options:
  -h, --help            Show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output DuckDB database file path (default: knowledge_base.duckdb)
  --chunk-size CHUNK_SIZE
                        Max token size for text chunks (default: 512)
  --model EMBED_MODEL_NAME
                        HuggingFace embedding model name (default: BAAI/bge-small-en-v1.5)
  --embed-dim EMBED_DIM
                        Embedding dimension size (default: 384 for bge-small-en-v1.5)
  --no-markdown-chunking
                        Disable markdown-aware chunking (use size-based instead)
  -q, --quiet           Suppress progress messages
```
## How It Works
1. **Document Loading**: The tool recursively scans the input directory for `.md` files
2. **Embedding Model**: Downloads and initializes the HuggingFace embedding model (cached locally after first run)
3. **Smart Chunking**: By default, splits documents by markdown headers (`##`, `###`) to preserve section context. Section titles are stored in metadata for better retrieval. Falls back to size-based chunking for large sections.
4. **Embedding Generation**: Generates vector embeddings for each chunk
5. **Storage**: Stores embeddings in DuckDB with metadata (including section titles) for efficient retrieval
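The flow above can be sketched in plain Python. Here `fake_embed` is an illustrative stand-in for the real HuggingFace model, and the splitting is deliberately naive (the actual package uses markdown-aware chunking, described below):

```python
from pathlib import Path

def fake_embed(text: str, dim: int = 384) -> list[float]:
    # Stand-in for the HuggingFace model; a real embedder returns a learned vector.
    return [float(ord(c) % 7) for c in text[:dim]] + [0.0] * max(0, dim - len(text))

def load_and_chunk(docs_dir: str, chunk_size: int = 512) -> list[dict]:
    chunks = []
    for md_file in Path(docs_dir).rglob("*.md"):          # 1. recursive scan for .md files
        text = md_file.read_text(encoding="utf-8")
        for section in text.split("\n## "):               # 3. naive header-based split
            for i in range(0, len(section), chunk_size):  #    size-based fallback
                chunks.append({"file": str(md_file), "text": section[i:i + chunk_size]})
    return chunks

def prepare(docs_dir: str) -> list[dict]:
    rows = load_and_chunk(docs_dir)
    for row in rows:
        row["embedding"] = fake_embed(row["text"])        # 4. embed each chunk
    return rows                                           # 5. rows would be inserted into DuckDB
```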
### Chunking Strategy
**Markdown-Aware Chunking** (default):
- ✅ Splits by markdown headers (`##`, `###`, etc.)
- ✅ Preserves section context and hierarchy
- ✅ Stores section titles in metadata (`Header_1`, `Header_2`, etc.)
- ✅ Produces semantically coherent chunks
**Size-Based Chunking** (with `--no-markdown-chunking`):
- Simple token-based splitting
- May break sections mid-content
- Use only if your docs lack clear structure
See [CHUNKING.md](./CHUNKING.md) for detailed comparison and best practices.
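A simplified sketch of the markdown-aware strategy (not the package's actual implementation): split on headers and carry the current section titles along as `Header_N` metadata.

```python
import re

def split_markdown(text: str) -> list[dict]:
    """Split on markdown headers, attaching section titles as metadata."""
    chunks, meta, buf = [], {}, []

    def flush():
        body = "\n".join(buf).strip()
        if body:
            chunks.append({"text": body, "metadata": dict(meta)})
        buf.clear()

    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()
            level = len(m.group(1))
            meta[f"Header_{level}"] = m.group(2)
            # A shallower header starts a new branch: drop deeper titles
            for k in [k for k in meta if int(k.split("_")[1]) > level]:
                del meta[k]
        else:
            buf.append(line)
    flush()
    return chunks
```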
## Output
The tool creates a DuckDB database file (`.duckdb`) that contains:
- Document chunks (text split by markdown sections)
- Vector embeddings (384-dimensional by default)
- Metadata including:
- File paths
- Section titles (`Header_1`, `Header_2`, etc.)
- Document structure information
This database can be used with llama-index's query engine or any RAG application that supports DuckDB vector stores.
## Using the Generated Database
### Python API
You can use the package programmatically:
```python
from sqlrooms_rag import prepare_embeddings
# Create embeddings
index = prepare_embeddings(
    input_dir="../../docs",
    output_db="generated-embeddings/my_docs.duckdb",
)
```
### Query Examples
See `examples/example_query.py` for complete working examples. Here's a quick snippet:
```python
from llama_index.core import VectorStoreIndex, StorageContext, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.duckdb import DuckDBVectorStore
# Load the embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.embed_model = embed_model
# Connect to the existing database
vector_store = DuckDBVectorStore(
    database_name="knowledge_base",
    persist_dir="./",
    embed_dim=384,
)
# Load the index
index = VectorStoreIndex.from_vector_store(vector_store)
# Create retriever and search
retriever = index.as_retriever(similarity_top_k=3)
results = retriever.retrieve("Your question here")
for result in results:
    print(f"Score: {result.score:.4f}")
    print(f"Text: {result.text[:200]}...")
```
### Running the Examples
**Prepare DuckDB documentation embeddings:**
```bash
# Using Python script (recommended)
uv run python examples/prepare_duckdb_docs.py
# Or using bash script
chmod +x examples/prepare_duckdb_docs.sh
./examples/prepare_duckdb_docs.sh
# Custom paths
uv run python examples/prepare_duckdb_docs.py \
  --docs-dir ./my-docs \
  --output ./embeddings/duckdb.duckdb
```
**Query embeddings using llama-index:**
```bash
uv run python examples/example_query.py
```
**Query embeddings using DuckDB directly:**
```bash
# Run predefined queries
uv run python examples/query_duckdb_direct.py
# Query with your own question
uv run python examples/query_duckdb_direct.py "Your question here"
```
See [QUERYING.md](./QUERYING.md) for detailed documentation on querying the database directly with SQL.
## Visualization
Generate 2D UMAP embeddings for visualization:
```bash
# Install visualization dependencies
uv pip install -e ".[viz]"
# Generate UMAP visualization
uv run generate-umap-embeddings generated-embeddings/duckdb_docs.duckdb
# Output: generated-embeddings/duckdb_docs_umap.parquet
```
The output includes two Parquet files:
**Main file (`*_umap.parquet`):**
- `node_id` - Unique node identifier (e.g., "node_0001")
- `title` - Document title (from markdown frontmatter)
- `fileName` - File name extracted from metadata (e.g., "window_functions")
- `file_path` - Full file path (e.g., "/path/to/docs/window_functions.md")
- `text` - Full document text
- `x`, `y` - UMAP coordinates for 2D plotting
- `topic` - Automatically detected topic/cluster name (e.g., "Window Functions / Aggregate / SQL")
- `outdegree` - Number of documents this document links TO
- `indegree` - Number of documents linking TO this document
**Links file (`*_umap_links.parquet`):**
- `source_id` - Source node ID
- `target_id` - Target node ID
**Features:**
- **Topic Detection:** Automatically clusters documents and assigns descriptive topic names using TF-IDF keyword extraction. Disable with `--no-topics`.
- **Link Extraction:** Parses markdown links to build a chunk-level graph. Source chunks keep individual outdegree values; target documents expand to all chunks. Disable with `--no-links`.
See [VISUALIZATION_GUIDE.md](./VISUALIZATION_GUIDE.md) for complete visualization examples and usage details.
See [CHUNKING.md](./CHUNKING.md) for information about markdown-aware chunking.
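The `outdegree`/`indegree` columns correspond to simple counts over the links file; with hypothetical `(source_id, target_id)` pairs standing in for `*_umap_links.parquet` rows:

```python
from collections import Counter

# Hypothetical link rows (source_id, target_id)
links = [
    ("node_0001", "node_0002"),
    ("node_0001", "node_0003"),
    ("node_0002", "node_0003"),
]

outdegree = Counter(src for src, _ in links)  # documents a node links TO
indegree = Counter(dst for _, dst in links)   # documents linking TO a node
```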
## Package Structure
```
sqlrooms-rag/
├── sqlrooms_rag/                  # Main package (installed)
│   ├── __init__.py                # Public API
│   ├── prepare.py                 # Core embedding preparation
│   └── cli.py                     # Command-line interface
├── examples/                      # Example scripts (not installed)
│   ├── prepare_duckdb_docs.py     # Download & prepare DuckDB docs
│   ├── prepare_duckdb_docs.sh     # Bash version of the above
│   ├── test_duckdb_docs_query.py  # Test DuckDB docs queries
│   ├── example_query.py           # Query using llama-index
│   └── query_duckdb_direct.py     # Direct DuckDB queries
├── scripts/                       # Documentation for utility scripts
├── generated-embeddings/          # Output directory
├── pyproject.toml                 # Package configuration
└── README.md
```
## Example: DuckDB Documentation
The package includes a ready-to-use script for preparing DuckDB documentation embeddings:
```bash
# Download DuckDB docs and create embeddings
cd python/rag
uv run python examples/prepare_duckdb_docs.py
```
This will:
1. Download the latest DuckDB documentation from GitHub (600+ markdown files)
2. Process all markdown files
3. Generate embeddings using BAAI/bge-small-en-v1.5
4. Create `generated-embeddings/duckdb_docs.duckdb`
Test the embeddings:
```bash
# Run interactive test queries
uv run python examples/test_duckdb_docs_query.py
# Or test a specific query
uv run python examples/test_duckdb_docs_query.py "What is a window function?"
```
Then use it in your SQLRooms app:
```typescript
import {createRagSlice} from '@sqlrooms/rag';
const store = createRoomStore({
  slices: [
    createDuckDbSlice(),
    createRagSlice({
      embeddingsDatabases: [
        {
          databaseFilePath: './embeddings/duckdb_docs.duckdb',
          databaseName: 'duckdb_docs',
        },
      ],
    }),
  ],
});
// Search DuckDB documentation
const results = await store.getState().rag.queryEmbeddings(embedding);
```
## Supported Models
The tool works with any HuggingFace sentence-transformer model. Popular choices:
| Model | Dimension | Max Tokens | Description |
| --------------------------------------- | --------- | ---------- | --------------------- |
| BAAI/bge-small-en-v1.5 | 384 | 512 | Default, good balance |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | 256 | Fast, lightweight |
| BAAI/bge-base-en-v1.5 | 768 | 512 | Better accuracy |
| sentence-transformers/all-mpnet-base-v2 | 768 | 384 | High quality |
## Notes
- The embedding model is downloaded and cached on first run (~100-500MB depending on model)
- Processing time depends on the number and size of documents
- The generated DuckDB file can be reused and updated with additional documents
- Ensure the `--embed-dim` matches your chosen model's output dimension
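A small guard against a dimension mismatch, using the values from the models table above (a convenience sketch, not part of the package):

```python
# Output dimensions for the models listed above
MODEL_DIMS = {
    "BAAI/bge-small-en-v1.5": 384,
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "BAAI/bge-base-en-v1.5": 768,
    "sentence-transformers/all-mpnet-base-v2": 768,
}

def check_embed_dim(model_name: str, embed_dim: int) -> None:
    """Raise if --embed-dim disagrees with a known model's output dimension."""
    expected = MODEL_DIMS.get(model_name)
    if expected is not None and expected != embed_dim:
        raise ValueError(
            f"{model_name} outputs {expected}-d vectors, got --embed-dim {embed_dim}"
        )
```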
## Requirements
- Python >=3.10
- 2-4GB RAM (depending on model and document size)
- ~500MB-2GB disk space for models and generated database
## Troubleshooting
### Out of Memory
If you run out of memory with large document sets, try:
- Using a smaller embedding model
- Processing documents in batches
- Reducing chunk size
### Slow Processing
- First run downloads the embedding model (one-time operation)
- Subsequent runs use the cached model
- Consider using a smaller/faster model for large document sets
## License
Part of the SQLRooms project.