semware

Name	semware
Version	0.1.0
home_page	None
Summary	Semantic search API server using vector databases and ML embeddings
upload_time	2025-09-10 03:14:47
maintainer	None
docs_url	None
author	SemWare Team
requires_python	>=3.11
license	MIT
keywords	api embeddings machine-learning semantic-search vector-database
requirements	No requirements were recorded.

# SemWare 🚀

[![Tests](https://img.shields.io/badge/tests-46%20passed-brightgreen)](https://github.com/semware/semware)
[![Coverage](https://img.shields.io/badge/coverage-79%25-yellow)](https://github.com/semware/semware)
[![Python](https://img.shields.io/badge/python-3.11%2B-blue)](https://python.org)
[![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)

A high-performance semantic search API server built with modern Python technologies. SemWare provides REST APIs for vector-based document storage, embedding generation, and similarity search using state-of-the-art machine learning models.

## ✨ Features

- **🚄 High Performance**: Built on FastAPI with automatic async/await support
- **🧠 Smart Embeddings**: Supports multiple embedding models (all-MiniLM-L6-v2, EmbeddingGemma-300M)
- **🔍 Advanced Search**: Similarity threshold and top-k search with sub-second response times
- **🛡️ Secure**: API key authentication with Bearer token support
- **📊 Vector Storage**: Powered by LanceDB for efficient vector operations
- **🔧 Developer Friendly**: Comprehensive OpenAPI docs, type hints, and test coverage
- **📈 Scalable**: Handles documents of any length with intelligent text batching
- **🏗️ Production Ready**: Comprehensive logging, error handling, and monitoring

## 🏛️ Architecture

SemWare follows a clean architecture pattern with separate layers:

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   FastAPI       │    │   Services      │    │   Storage       │
│   REST APIs     │───▶│   Business      │───▶│   LanceDB       │
│   (Routes)      │    │   Logic         │    │   Vector DB     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                       ┌─────────────────┐
                       │   ML Models     │
                       │   Embeddings    │
                       │   (HuggingFace) │
                       └─────────────────┘
```

**Core Components:**
- **Table Management**: Create custom schemas for different document types
- **Data Operations**: CRUD operations with automatic embedding generation  
- **Semantic Search**: Vector similarity search with configurable parameters
- **Text Processing**: Smart tokenization and batching for long documents

## 🚀 Quick Start

### Installation

**Using uv (Recommended):**
```bash
git clone https://github.com/your-org/semware.git
cd semware
uv sync --native-tls
```

**Using pip:**
```bash
git clone https://github.com/your-org/semware.git
cd semware
pip install -e .
```

### Configuration

Create a `.env` file:
```bash
# Required
API_KEY=your-super-secret-api-key-here

# Optional (with defaults)
DEBUG=false
DB_PATH=./data
HOST=0.0.0.0
PORT=8000
LOG_LEVEL=INFO
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2
EMBEDDING_DIMENSION=384
MAX_TOKENS_PER_BATCH=2000
```
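As a rough illustration of how these variables and their defaults fit together, the sketch below reads them from a mapping such as `os.environ`. The `Settings` dataclass and the `load_settings` helper are hypothetical names chosen for this example; they are not taken from SemWare's `config.py`, which may work differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container mirroring the variables in the .env example."""

    api_key: str
    debug: bool = False
    db_path: str = "./data"
    host: str = "0.0.0.0"
    port: int = 8000
    log_level: str = "INFO"
    embedding_model_name: str = "all-MiniLM-L6-v2"
    embedding_dimension: int = 384
    max_tokens_per_batch: int = 2000


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, applying the documented defaults."""
    if not env.get("API_KEY"):
        # API_KEY is the only variable the README marks as required.
        raise ValueError("API_KEY is required")
    return Settings(
        api_key=env["API_KEY"],
        debug=env.get("DEBUG", "false").lower() == "true",
        db_path=env.get("DB_PATH", "./data"),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        log_level=env.get("LOG_LEVEL", "INFO"),
        embedding_model_name=env.get("EMBEDDING_MODEL_NAME", "all-MiniLM-L6-v2"),
        embedding_dimension=int(env.get("EMBEDDING_DIMENSION", "384")),
        max_tokens_per_batch=int(env.get("MAX_TOKENS_PER_BATCH", "2000")),
    )
```

Passing a plain dict instead of `os.environ` makes it easy to check which defaults apply when only `API_KEY` is set.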

### Start the Server

**Simple Command (Recommended):**
```bash
# Start with default settings from .env
semware

# Start with custom options
semware --debug --port 8080
semware --workers 4 --host 127.0.0.1
semware --reload  # Development mode with auto-reload
```

**Alternative Methods:**
```bash
# Using uv directly
uv run --native-tls semware

# Using Python module
uv run --native-tls python -m semware.main

# Using uvicorn directly
uv run --native-tls uvicorn semware.main:app --host 0.0.0.0 --port 8000 --workers 4
```

The server will be available at `http://localhost:8000`. Interactive API documentation is served at `/docs` when debug mode is enabled (`--debug` or `DEBUG=true`).

### CLI Options

The `semware` command supports these options:

```bash
semware --help                  # Show help message
semware --version               # Show version
semware --debug                 # Enable debug mode & API docs
semware --reload                # Development mode with auto-reload
semware --host 127.0.0.1        # Bind to specific host
semware --port 8080             # Use custom port
semware --workers 4             # Number of worker processes
semware --log-level DEBUG       # Set logging level
```

## 📚 API Reference

### Authentication

All endpoints require authentication using one of:
- **Header**: `X-API-Key: your-api-key`
- **Bearer Token**: `Authorization: Bearer your-api-key`
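
The two header styles can be produced from Python as in the minimal sketch below, which uses only the standard library. The `auth_headers` helper is a hypothetical name for this example, and the request is built but not sent.

```python
import urllib.request


def auth_headers(api_key: str, scheme: str = "x-api-key") -> dict[str, str]:
    """Return the header for one of the two documented authentication styles."""
    if scheme == "x-api-key":
        return {"X-API-Key": api_key}
    if scheme == "bearer":
        return {"Authorization": f"Bearer {api_key}"}
    raise ValueError(f"unknown scheme: {scheme}")


# Build (but do not send) a request to the documented GET /tables endpoint.
request = urllib.request.Request(
    "http://localhost:8000/tables",
    headers=auth_headers("your-api-key", scheme="bearer"),
    method="GET",
)
```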

### 🗂️ Table Management

#### Create Table
Create a new table with custom schema.

```http
POST /tables
Content-Type: application/json
X-API-Key: your-api-key

{
  "schema": {
    "name": "research_papers",
    "columns": {
      "id": "string",
      "title": "string", 
      "abstract": "string",
      "authors": "string",
      "year": "int",
      "doi": "string"
    },
    "id_column": "id",
    "embedding_column": "abstract"
  }
}
```

**Response (201):**
```json
{
  "message": "Table 'research_papers' created successfully",
  "table_name": "research_papers"
}
```
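
For callers working from Python, the request above might be assembled as follows. This is a sketch under assumptions: `BASE_URL` assumes a local server on the default port, `build_create_table_request` is a hypothetical helper, and the client-side check that `id_column` and `embedding_column` appear in `columns` is a convenience added here, not documented server behavior.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local server address


def build_create_table_request(api_key, name, columns, id_column, embedding_column):
    """Build (without sending) the POST /tables request shown above."""
    # Client-side sanity check (an assumption of this sketch).
    if id_column not in columns or embedding_column not in columns:
        raise ValueError("id_column and embedding_column must be declared in columns")
    body = {
        "schema": {
            "name": name,
            "columns": columns,
            "id_column": id_column,
            "embedding_column": embedding_column,
        }
    }
    return urllib.request.Request(
        f"{BASE_URL}/tables",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


request = build_create_table_request(
    api_key="your-api-key",
    name="research_papers",
    columns={"id": "string", "title": "string", "abstract": "string", "year": "int"},
    id_column="id",
    embedding_column="abstract",
)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```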

#### List Tables
Get all available tables.

```http
GET /tables
X-API-Key: your-api-key
```

**Response (200):**
```json
{
  "tables": ["research_papers", "product_docs", "customer_support"],
  "count": 3
}
```

#### Get Table Info
Get detailed information about a specific table.

```http
GET /tables/research_papers
X-API-Key: your-api-key
```

**Response (200):**
```json
{
  "table_name": "research_papers",
  "schema": {
    "name": "research_papers",
    "columns": {
      "id": "string",
      "title": "string",
      "abstract": "string",
      "authors": "string", 
      "year": "int",
      "doi": "string"
    },
    "id_column": "id",
    "embedding_column": "abstract"
  },
  "record_count": 1547,
  "created_at": "2024-01-15T10:30:00Z"
}
```

#### Delete Table
Delete a table and all its data.

```http
DELETE /tables/research_papers
X-API-Key: your-api-key
```

**Response (200):**
```json
{
  "message": "Table 'research_papers' deleted successfully",
  "table_name": "research_papers"
}
```

### 📄 Data Operations

#### Insert/Update Documents
Insert new documents or update existing ones. Embeddings are generated automatically.

```http
POST /tables/research_papers/data
Content-Type: application/json
X-API-Key: your-api-key

{
  "records": [
    {
      "data": {
        "id": "paper_001",
        "title": "Attention Is All You Need",
        "abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...",
        "authors": "Ashish Vaswani, Noam Shazeer, Niki Parmar",
        "year": 2017,
        "doi": "10.48550/arXiv.1706.03762"
      }
    },
    {
      "data": {
        "id": "paper_002", 
        "title": "BERT: Pre-training of Deep Bidirectional Transformers",
        "abstract": "We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations...",
        "authors": "Jacob Devlin, Ming-Wei Chang, Kenton Lee",
        "year": 2018,
        "doi": "10.48550/arXiv.1810.04805"
      }
    }
  ]
}
```

**Response (201):**
```json
{
  "message": "Successfully processed 2 records",
  "inserted_count": 2,
  "updated_count": 0,
  "processing_time_ms": 1247.3
}
```
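
The request body wraps every document in a `{"data": ...}` object inside a `records` list. A small helper such as the hypothetical `build_insert_payload` below can produce that envelope from plain dictionaries. Rejecting duplicate IDs within one batch is an assumption of this sketch, not something the API is documented to require.

```python
import json


def build_insert_payload(rows, id_column="id"):
    """Wrap plain dicts in the {"records": [{"data": ...}]} envelope shown above."""
    seen = set()
    records = []
    for row in rows:
        record_id = row[id_column]  # raises KeyError if the ID column is missing
        if record_id in seen:
            raise ValueError(f"duplicate id in batch: {record_id}")
        seen.add(record_id)
        records.append({"data": dict(row)})
    return {"records": records}


payload = build_insert_payload(
    [
        {"id": "paper_001", "title": "Attention Is All You Need", "year": 2017},
        {"id": "paper_002", "title": "BERT: Pre-training of Deep Bidirectional Transformers", "year": 2018},
    ]
)
body = json.dumps(payload).encode("utf-8")  # bytes suitable for an HTTP POST body
```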

#### Get Document
Retrieve a specific document by ID.

```http
GET /tables/research_papers/data/paper_001
X-API-Key: your-api-key
```

**Response (200):**
```json
{
  "table_name": "research_papers",
  "record_id": "paper_001",
  "data": {
    "id": "paper_001",
    "title": "Attention Is All You Need",
    "abstract": "The dominant sequence transduction models are based on complex recurrent...",
    "authors": "Ashish Vaswani, Noam Shazeer, Niki Parmar",
    "year": 2017,
    "doi": "10.48550/arXiv.1706.03762"
  }
}
```

#### Delete Document
Remove a document from the table.

```http
DELETE /tables/research_papers/data/paper_001
X-API-Key: your-api-key
```

**Response (200):**
```json
{
  "message": "Record 'paper_001' deleted successfully",
  "table_name": "research_papers",
  "deleted_id": "paper_001"
}
```

### 🔍 Search Operations

#### Similarity Search
Find documents whose similarity to the query is above a threshold, returning at most `limit` results.

```http
POST /tables/research_papers/search/similarity
Content-Type: application/json
X-API-Key: your-api-key

{
  "query": "transformer neural network attention mechanism",
  "threshold": 0.7,
  "limit": 10
}
```

**Response (200):**
```json
{
  "query": "transformer neural network attention mechanism",
  "results": [
    {
      "id": "paper_001",
      "data": {
        "id": "paper_001",
        "title": "Attention Is All You Need",
        "abstract": "The dominant sequence transduction models...",
        "authors": "Ashish Vaswani, Noam Shazeer, Niki Parmar",
        "year": 2017,
        "doi": "10.48550/arXiv.1706.03762"
      },
      "similarity_score": 0.89
    },
    {
      "id": "paper_002",
      "data": {
        "id": "paper_002",
        "title": "BERT: Pre-training of Deep Bidirectional Transformers", 
        "abstract": "We introduce a new language representation model...",
        "authors": "Jacob Devlin, Ming-Wei Chang, Kenton Lee",
        "year": 2018,
        "doi": "10.48550/arXiv.1810.04805"
      },
      "similarity_score": 0.76
    }
  ],
  "total_results": 2,
  "search_time_ms": 23.4,
  "threshold": 0.7
}
```

#### Top-K Search  
Find the K most similar documents.

```http
POST /tables/research_papers/search/top-k
Content-Type: application/json
X-API-Key: your-api-key

{
  "query": "natural language processing BERT",
  "k": 5
}
```

**Response (200), showing the first two of the five results:**
```json
{
  "query": "natural language processing BERT",
  "results": [
    {
      "id": "paper_002",
      "data": {
        "id": "paper_002",
        "title": "BERT: Pre-training of Deep Bidirectional Transformers",
        "abstract": "We introduce a new language representation model...",
        "authors": "Jacob Devlin, Ming-Wei Chang, Kenton Lee", 
        "year": 2018,
        "doi": "10.48550/arXiv.1810.04805"
      },
      "similarity_score": 0.94
    },
    {
      "id": "paper_001", 
      "data": {
        "id": "paper_001",
        "title": "Attention Is All You Need",
        "abstract": "The dominant sequence transduction models...",
        "authors": "Ashish Vaswani, Noam Shazeer, Niki Parmar",
        "year": 2017,
        "doi": "10.48550/arXiv.1706.03762"
      },
      "similarity_score": 0.81
    }
  ],
  "total_results": 5,
  "search_time_ms": 31.7,
  "k": 5
}
```
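
To make the difference between the two search modes concrete, the sketch below ranks a few toy vectors by cosine similarity and then applies either a threshold filter or a top-k cut. It illustrates the general technique only and is not SemWare's implementation. The function names are hypothetical, and treating the threshold as a strict "greater than" comparison is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query, vectors):
    """Return (id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def similarity_search(query, vectors, threshold, limit):
    """Keep results scoring above `threshold`, capped at `limit`."""
    return [(doc_id, score) for doc_id, score in rank(query, vectors) if score > threshold][:limit]


def top_k_search(query, vectors, k):
    """Keep the k best results regardless of their scores."""
    return rank(query, vectors)[:k]


vectors = {"a": [1.0, 0.0], "b": [0.6, 0.8], "c": [0.0, 1.0]}
query = [1.0, 0.0]

above_threshold = similarity_search(query, vectors, threshold=0.7, limit=10)
best_two = top_k_search(query, vectors, k=2)
```

With these toy vectors only `a` clears the 0.7 threshold, while the top-2 search returns `a` and `b` even though `b` scores lower than the threshold. A threshold search can therefore return fewer results than `limit`, whereas top-k returns `k` results whenever the table holds at least `k` documents.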

### ❤️ Health Check

```http
GET /health
```

**Response (200):**
```json
{
  "status": "healthy",
  "app_name": "SemWare",
  "version": "0.1.0", 
  "timestamp": "2024-01-15T14:30:25.123456"
}
```
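
A deployment script might poll this endpoint until the server reports `healthy`. The sketch below is one way to do that with the standard library. `fetch_health` and `wait_until_healthy` are hypothetical names, and the base URL assumes the default host and port.

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://localhost:8000"):
    """Call GET /health and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
        return json.load(response)


def wait_until_healthy(fetch=fetch_health, attempts=10, delay_seconds=1.0):
    """Poll until the status field is "healthy"; return True on success."""
    for attempt in range(attempts):
        try:
            if fetch().get("status") == "healthy":
                return True
        except OSError:
            # urllib's URLError is an OSError; the server is probably still starting.
            pass
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

Taking `fetch` as a parameter keeps the retry loop testable without a running server.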

## 🧠 Embedding Process

SemWare converts the text in each record's embedding column into a single vector in four steps:

### 1. **Text Tokenization**
- Long texts are split into chunks that fit the per-batch token limit
- Token counts come from `tiktoken` with the `cl100k_base` encoding
- Default batch size: 2000 tokens, configurable via `MAX_TOKENS_PER_BATCH`

### 2. **Batch Processing**
- Each text chunk is processed through the embedding model
- Supports multiple embedding models via Hugging Face transformers
- Automatic GPU acceleration when available

### 3. **Embedding Aggregation**
- When a document spans multiple batches, the batch embeddings are averaged element-wise (average pooling)
- The result is one vector that reflects the entire document rather than only its first chunk
- With the default all-MiniLM-L6-v2 model, that vector has 384 dimensions

### 4. **Normalization & Storage**
- Final embeddings are L2 normalized for consistent similarity scoring
- Stored efficiently in LanceDB with optimized vector indexing
- Enables sub-second search across millions of documents
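
The four steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than SemWare's code: whitespace splitting stands in for `tiktoken`, `toy_embed` is a placeholder that is not a real embedding model, and all function names are hypothetical.

```python
import math


def split_into_chunks(tokens, max_tokens):
    """Step 1: cut the token list into consecutive chunks of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def toy_embed(chunk, dimension=4):
    """Step 2 placeholder: count token lengths per bucket. NOT a real model."""
    vector = [0.0] * dimension
    for token in chunk:
        vector[len(token) % dimension] += 1.0
    return vector


def average_pool(vectors):
    """Step 3: element-wise mean of the per-chunk embeddings."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def l2_normalize(vector):
    """Step 4: scale to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def embed_document(text, max_tokens=2000, embed=toy_embed):
    tokens = text.split()  # whitespace split stands in for tiktoken here
    chunks = split_into_chunks(tokens, max_tokens)
    pooled = average_pool([embed(chunk) for chunk in chunks])
    return l2_normalize(pooled)


vector = embed_document("a bb ccc dddd eeeee", max_tokens=2)
```

Because the stored vectors have unit length, the dot product of two of them equals their cosine similarity, which is what keeps similarity scores comparable across documents of different lengths.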

## 🛠️ Development

### Running Tests
```bash
# Run all tests with coverage
uv run --native-tls pytest --cov=src --cov-report=html

# Run specific test file
uv run --native-tls pytest tests/test_api/test_search.py -v

# Run with debug output
uv run --native-tls pytest -s --log-cli-level=DEBUG
```

### Code Quality
```bash
# Format code
uv run --native-tls ruff format src/ tests/

# Lint and fix issues
uv run --native-tls ruff check src/ tests/ --fix

# Type checking
uv run --native-tls mypy src/
```

### API Documentation
Start the server with `DEBUG=true` in your `.env` and visit:
- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
- **OpenAPI JSON**: http://localhost:8000/openapi.json

## 📁 Project Structure

```
SemWare/
├── src/semware/
│   ├── api/                    # FastAPI route handlers
│   │   ├── __init__.py
│   │   ├── auth.py            # Authentication middleware
│   │   ├── data.py            # Data CRUD operations
│   │   ├── search.py          # Search endpoints  
│   │   └── tables.py          # Table management
│   ├── models/                 # Pydantic data models
│   │   ├── __init__.py
│   │   ├── requests.py        # Request/response models
│   │   └── schemas.py         # Core data schemas
│   ├── services/              # Business logic services
│   │   ├── __init__.py
│   │   ├── embedding.py       # ML embedding generation
│   │   ├── search.py          # Search orchestration
│   │   └── vectordb.py        # Vector database operations
│   ├── utils/                 # Utility functions
│   │   ├── __init__.py
│   │   ├── logging.py         # Logging configuration
│   │   └── tokenizer.py       # Text tokenization
│   ├── config.py              # Configuration management
│   └── main.py                # FastAPI application factory
├── tests/                     # Comprehensive test suite
│   ├── conftest.py           # Test configuration & fixtures
│   ├── test_api/             # API endpoint tests
│   ├── test_services/        # Service layer tests
│   └── test_utils/           # Utility function tests
├── pyproject.toml            # Project configuration
├── .env.example             # Environment template
└── README.md               # This file
```

## ⚙️ Configuration Reference

| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `API_KEY` | Authentication key for all endpoints | - | ✅ |
| `DEBUG` | Enable debug mode and API docs | `false` | ❌ |
| `DB_PATH` | Database storage directory | `./data` | ❌ |
| `HOST` | Server bind address | `0.0.0.0` | ❌ |
| `PORT` | Server port | `8000` | ❌ |
| `LOG_LEVEL` | Logging level (DEBUG/INFO/WARNING/ERROR) | `INFO` | ❌ |
| `LOG_FILE` | Log file path (optional) | - | ❌ |
| `EMBEDDING_MODEL_NAME` | Hugging Face model name | `all-MiniLM-L6-v2` | ❌ |
| `EMBEDDING_DIMENSION` | Embedding vector dimensions | `384` | ❌ |
| `MAX_TOKENS_PER_BATCH` | Max tokens per embedding batch | `2000` | ❌ |
| `WORKERS` | Number of server workers | `1` | ❌ |

## 🚢 Deployment

### Docker
```dockerfile
FROM python:3.11-slim

WORKDIR /app
COPY . .

RUN pip install uv
RUN uv sync --native-tls

EXPOSE 8000
CMD ["uv", "run", "--native-tls", "uvicorn", "semware.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

### Production Considerations
- Use multiple workers: `--workers 4`
- Enable access logs: `--access-log`
- Set up reverse proxy (nginx) for HTTPS termination
- Configure log rotation and monitoring
- Use a dedicated vector storage solution for large scale

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Make your changes and add tests
4. Run the test suite: `uv run --native-tls pytest`
5. Submit a pull request

## 📊 Performance

**Benchmarks** (on Apple M2 Pro, 16GB RAM):
- **Embedding Generation**: ~200ms per batch (2000 tokens)
- **Document Insertion**: ~500ms per document (including embedding)
- **Vector Search**: <50ms for similarity search across 10K documents
- **Throughput**: ~100 requests/second with 4 workers

## 🐛 Troubleshooting

### Common Issues

**Authentication Errors**
```bash
# Ensure API key is set correctly
export API_KEY=your-secret-key
# Or check your .env file
```

**Model Download Issues**
```bash
# Clear Hugging Face cache
rm -rf ~/.cache/huggingface/
# Restart with debug logging
DEBUG=true uv run --native-tls python -m semware.main
```

**Database Permissions**
```bash
# Ensure write permissions to data directory
chmod 755 ./data
```

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- **FastAPI** for the excellent async web framework
- **LanceDB** for high-performance vector storage
- **Hugging Face** for the transformer models and ecosystem
- **Pydantic** for robust data validation
- **The Python Community** for the amazing open-source ecosystem

---

<p align="center">
  <strong>Built with ❤️ by the SemWare team</strong>
</p>

<p align="center">
  <a href="https://github.com/semware/semware/issues">Report Bug</a> •
  <a href="https://github.com/semware/semware/discussions">Discussions</a> •
  <a href="https://github.com/semware/semware/wiki">Wiki</a>
</p>

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "semware",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": null,
    "keywords": "api, embeddings, machine-learning, semantic-search, vector-database",
    "author": "SemWare Team",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/8a/02/c6394e787bf1bb03319e324567ac107c6550550fc6c0febdffc0aa6bb7a2/semware-0.1.0.tar.gz",
    "platform": null,
    "description": "# SemWare \ud83d\ude80\n\n[![Tests](https://img.shields.io/badge/tests-46%20passed-brightgreen)](https://github.com/semware/semware)\n[![Coverage](https://img.shields.io/badge/coverage-79%25-yellow)](https://github.com/semware/semware)\n[![Python](https://img.shields.io/badge/python-3.11%2B-blue)](https://python.org)\n[![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)\n\nA high-performance semantic search API server built with modern Python technologies. SemWare provides REST APIs for vector-based document storage, embedding generation, and similarity search using state-of-the-art machine learning models.\n\n## \u2728 Features\n\n- **\ud83d\ude84 High Performance**: Built on FastAPI with automatic async/await support\n- **\ud83e\udde0 Smart Embeddings**: Supports multiple embedding models (all-MiniLM-L6-v2, EmbeddingGemma-300M)\n- **\ud83d\udd0d Advanced Search**: Similarity threshold and top-k search with sub-second response times\n- **\ud83d\udee1\ufe0f Secure**: API key authentication with Bearer token support\n- **\ud83d\udcca Vector Storage**: Powered by LanceDB for efficient vector operations\n- **\ud83d\udd27 Developer Friendly**: Comprehensive OpenAPI docs, type hints, and test coverage\n- **\ud83d\udcc8 Scalable**: Handles documents of any length with intelligent text batching\n- **\ud83c\udfd7\ufe0f Production Ready**: Comprehensive logging, error handling, and monitoring\n\n## \ud83c\udfdb\ufe0f Architecture\n\nSemWare follows a clean architecture pattern with separate layers:\n\n```\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510    \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510    \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502   FastAPI       \u2502    \u2502   Services      \u2502    \u2502   Storage      
 \u2502\n\u2502   REST APIs     \u2502\u2500\u2500\u2500\u25b6\u2502   Business      \u2502\u2500\u2500\u2500\u25b6\u2502   LanceDB       \u2502\n\u2502   (Routes)      \u2502    \u2502   Logic         \u2502    \u2502   Vector DB     \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518    \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518    \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                                \u2502\n                       \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n                       \u2502   ML Models     \u2502\n                       \u2502   Embeddings    \u2502\n                       \u2502   (HuggingFace) \u2502\n                       \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n```\n\n**Core Components:**\n- **Table Management**: Create custom schemas for different document types\n- **Data Operations**: CRUD operations with automatic embedding generation  \n- **Semantic Search**: Vector similarity search with configurable parameters\n- **Text Processing**: Smart tokenization and batching for long documents\n\n## \ud83d\ude80 Quick Start\n\n### Installation\n\n**Using uv (Recommended):**\n```bash\ngit clone https://github.com/your-org/semware.git\ncd SemWare\nuv sync --native-tls\n```\n\n**Using pip:**\n```bash\ngit clone https://github.com/your-org/semware.git\ncd SemWare\npip install -e .\n```\n\n### Configuration\n\nCreate a `.env` file:\n```bash\n# Required\nAPI_KEY=your-super-secret-api-key-here\n\n# Optional (with 
defaults)\nDEBUG=false\nDB_PATH=./data\nHOST=0.0.0.0\nPORT=8000\nLOG_LEVEL=INFO\nEMBEDDING_MODEL_NAME=all-MiniLM-L6-v2\nEMBEDDING_DIMENSION=384\nMAX_TOKENS_PER_BATCH=2000\n```\n\n### Start the Server\n\n**Simple Command (Recommended):**\n```bash\n# Start with default settings from .env\nsemware\n\n# Start with custom options\nsemware --debug --port 8080\nsemware --workers 4 --host 127.0.0.1\nsemware --reload  # Development mode with auto-reload\n```\n\n**Alternative Methods:**\n```bash\n# Using uv directly\nuv run --native-tls semware\n\n# Using Python module\nuv run --native-tls python -m semware.main\n\n# Using uvicorn directly\nuv run --native-tls uvicorn semware.main:app --host 0.0.0.0 --port 8000 --workers 4\n```\n\nThe server will be available at `http://localhost:8000` with automatic API documentation at `/docs`.\n\n### CLI Options\n\nThe `semware` command supports these options:\n\n```bash\nsemware --help                   Show help message\nsemware --version               Show version\nsemware --debug                 Enable debug mode & API docs\nsemware --reload                Development mode with auto-reload\nsemware --host 127.0.0.1       Bind to specific host\nsemware --port 8080             Use custom port\nsemware --workers 4             Number of worker processes\nsemware --log-level DEBUG       Set logging level\n```\n\n## \ud83d\udcda API Reference\n\n### Authentication\n\nAll endpoints require authentication using one of:\n- **Header**: `X-API-Key: your-api-key`\n- **Bearer Token**: `Authorization: Bearer your-api-key`\n\n### \ud83d\uddc2\ufe0f Table Management\n\n#### Create Table\nCreate a new table with custom schema.\n\n```http\nPOST /tables\nContent-Type: application/json\nX-API-Key: your-api-key\n\n{\n  \"schema\": {\n    \"name\": \"research_papers\",\n    \"columns\": {\n      \"id\": \"string\",\n      \"title\": \"string\", \n      \"abstract\": \"string\",\n      \"authors\": \"string\",\n      \"year\": \"int\",\n      \"doi\": 
\"string\"\n    },\n    \"id_column\": \"id\",\n    \"embedding_column\": \"abstract\"\n  }\n}\n```\n\n**Response (201):**\n```json\n{\n  \"message\": \"Table 'research_papers' created successfully\",\n  \"table_name\": \"research_papers\"\n}\n```\n\n#### List Tables\nGet all available tables.\n\n```http\nGET /tables\nX-API-Key: your-api-key\n```\n\n**Response (200):**\n```json\n{\n  \"tables\": [\"research_papers\", \"product_docs\", \"customer_support\"],\n  \"count\": 3\n}\n```\n\n#### Get Table Info\nGet detailed information about a specific table.\n\n```http\nGET /tables/research_papers\nX-API-Key: your-api-key\n```\n\n**Response (200):**\n```json\n{\n  \"table_name\": \"research_papers\",\n  \"schema\": {\n    \"name\": \"research_papers\",\n    \"columns\": {\n      \"id\": \"string\",\n      \"title\": \"string\",\n      \"abstract\": \"string\",\n      \"authors\": \"string\", \n      \"year\": \"int\",\n      \"doi\": \"string\"\n    },\n    \"id_column\": \"id\",\n    \"embedding_column\": \"abstract\"\n  },\n  \"record_count\": 1547,\n  \"created_at\": \"2024-01-15T10:30:00Z\"\n}\n```\n\n#### Delete Table\nDelete a table and all its data.\n\n```http\nDELETE /tables/research_papers\nX-API-Key: your-api-key\n```\n\n**Response (200):**\n```json\n{\n  \"message\": \"Table 'research_papers' deleted successfully\",\n  \"table_name\": \"research_papers\"\n}\n```\n\n### \ud83d\udcc4 Data Operations\n\n#### Insert/Update Documents\nInsert new documents or update existing ones. 
Embeddings are generated automatically.\n\n```http\nPOST /tables/research_papers/data\nContent-Type: application/json\nX-API-Key: your-api-key\n\n{\n  \"records\": [\n    {\n      \"data\": {\n        \"id\": \"paper_001\",\n        \"title\": \"Attention Is All You Need\",\n        \"abstract\": \"The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...\",\n        \"authors\": \"Ashish Vaswani, Noam Shazeer, Niki Parmar\",\n        \"year\": 2017,\n        \"doi\": \"10.48550/arXiv.1706.03762\"\n      }\n    },\n    {\n      \"data\": {\n        \"id\": \"paper_002\", \n        \"title\": \"BERT: Pre-training of Deep Bidirectional Transformers\",\n        \"abstract\": \"We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations...\",\n        \"authors\": \"Jacob Devlin, Ming-Wei Chang, Kenton Lee\",\n        \"year\": 2018,\n        \"doi\": \"10.48550/arXiv.1810.04805\"\n      }\n    }\n  ]\n}\n```\n\n**Response (201):**\n```json\n{\n  \"message\": \"Successfully processed 2 records\",\n  \"inserted_count\": 2,\n  \"updated_count\": 0,\n  \"processing_time_ms\": 1247.3\n}\n```\n\n#### Get Document\nRetrieve a specific document by ID.\n\n```http\nGET /tables/research_papers/data/paper_001\nX-API-Key: your-api-key\n```\n\n**Response (200):**\n```json\n{\n  \"table_name\": \"research_papers\",\n  \"record_id\": \"paper_001\",\n  \"data\": {\n    \"id\": \"paper_001\",\n    \"title\": \"Attention Is All You Need\",\n    \"abstract\": \"The dominant sequence transduction models are based on complex recurrent...\",\n    \"authors\": \"Ashish Vaswani, Noam Shazeer, Niki Parmar\",\n    \"year\": 2017,\n    \"doi\": \"10.48550/arXiv.1706.03762\"\n  }\n}\n```\n\n#### Delete Document\nRemove a document from the table.\n\n```http\nDELETE /tables/research_papers/data/paper_001\nX-API-Key: your-api-key\n```\n\n**Response (200):**\n```json\n{\n  \"message\": \"Record 
'paper_001' deleted successfully\",\n  \"table_name\": \"research_papers\",\n  \"deleted_id\": \"paper_001\"\n}\n```\n\n### \ud83d\udd0d Search Operations\n\n#### Similarity Search\nFind all documents with similarity above a threshold.\n\n```http\nPOST /tables/research_papers/search/similarity\nContent-Type: application/json\nX-API-Key: your-api-key\n\n{\n  \"query\": \"transformer neural network attention mechanism\",\n  \"threshold\": 0.7,\n  \"limit\": 10\n}\n```\n\n**Response (200):**\n```json\n{\n  \"query\": \"transformer neural network attention mechanism\",\n  \"results\": [\n    {\n      \"id\": \"paper_001\",\n      \"data\": {\n        \"id\": \"paper_001\",\n        \"title\": \"Attention Is All You Need\",\n        \"abstract\": \"The dominant sequence transduction models...\",\n        \"authors\": \"Ashish Vaswani, Noam Shazeer, Niki Parmar\",\n        \"year\": 2017,\n        \"doi\": \"10.48550/arXiv.1706.03762\"\n      },\n      \"similarity_score\": 0.89\n    },\n    {\n      \"id\": \"paper_002\",\n      \"data\": {\n        \"id\": \"paper_002\",\n        \"title\": \"BERT: Pre-training of Deep Bidirectional Transformers\", \n        \"abstract\": \"We introduce a new language representation model...\",\n        \"authors\": \"Jacob Devlin, Ming-Wei Chang, Kenton Lee\",\n        \"year\": 2018,\n        \"doi\": \"10.48550/arXiv.1810.04805\"\n      },\n      \"similarity_score\": 0.76\n    }\n  ],\n  \"total_results\": 2,\n  \"search_time_ms\": 23.4,\n  \"threshold\": 0.7\n}\n```\n\n#### Top-K Search  \nFind the K most similar documents.\n\n```http\nPOST /tables/research_papers/search/top-k\nContent-Type: application/json\nX-API-Key: your-api-key\n\n{\n  \"query\": \"natural language processing BERT\",\n  \"k\": 5\n}\n```\n\n**Response (200):**\n```json\n{\n  \"query\": \"natural language processing BERT\",\n  \"results\": [\n    {\n      \"id\": \"paper_002\",\n      \"data\": {\n        \"id\": \"paper_002\",\n        \"title\": \"BERT: 
Pre-training of Deep Bidirectional Transformers\",\n        \"abstract\": \"We introduce a new language representation model...\",\n        \"authors\": \"Jacob Devlin, Ming-Wei Chang, Kenton Lee\", \n        \"year\": 2018,\n        \"doi\": \"10.48550/arXiv.1810.04805\"\n      },\n      \"similarity_score\": 0.94\n    },\n    {\n      \"id\": \"paper_001\", \n      \"data\": {\n        \"id\": \"paper_001\",\n        \"title\": \"Attention Is All You Need\",\n        \"abstract\": \"The dominant sequence transduction models...\",\n        \"authors\": \"Ashish Vaswani, Noam Shazeer, Niki Parmar\",\n        \"year\": 2017,\n        \"doi\": \"10.48550/arXiv.1706.03762\"\n      },\n      \"similarity_score\": 0.81\n    }\n  ],\n  \"total_results\": 5,\n  \"search_time_ms\": 31.7,\n  \"k\": 5\n}\n```\n\n### \u2764\ufe0f Health Check\n\n```http\nGET /health\n```\n\n**Response (200):**\n```json\n{\n  \"status\": \"healthy\",\n  \"app_name\": \"SemWare\",\n  \"version\": \"0.1.0\", \n  \"timestamp\": \"2024-01-15T14:30:25.123456\"\n}\n```\n\n## \ud83e\udde0 Embedding Process\n\nSemWare uses advanced text processing for optimal semantic understanding:\n\n### 1. **Text Tokenization**\n- Long texts are intelligently split into manageable chunks\n- Uses `tiktoken` with `cl100k_base` encoding for precise token counting\n- Default batch size: 2000 tokens with configurable limits\n\n### 2. **Batch Processing**\n- Each text chunk is processed through the embedding model\n- Supports multiple embedding models via Hugging Face transformers\n- Automatic GPU acceleration when available\n\n### 3. **Embedding Aggregation**\n- Multiple batch embeddings are combined using average pooling\n- Preserves semantic meaning across the entire document\n- Results in high-quality 384-dimensional vectors (MiniLM)\n\n### 4. 
**Normalization & Storage**\n- Final embeddings are L2 normalized for consistent similarity scoring\n- Stored efficiently in LanceDB with optimized vector indexing\n- Enables sub-second search across millions of documents\n\n## \ud83d\udee0\ufe0f Development\n\n### Running Tests\n```bash\n# Run all tests with coverage\nuv run --native-tls pytest --cov=src --cov-report=html\n\n# Run specific test file\nuv run --native-tls pytest tests/test_api/test_search.py -v\n\n# Run with debug output\nuv run --native-tls pytest -s --log-cli-level=DEBUG\n```\n\n### Code Quality\n```bash\n# Format code\nuv run --native-tls ruff format src/ tests/\n\n# Lint and fix issues\nuv run --native-tls ruff check src/ tests/ --fix\n\n# Type checking\nuv run --native-tls mypy src/\n```\n\n### API Documentation\nStart the server with `DEBUG=true` in your `.env` and visit:\n- **Swagger UI**: http://localhost:8000/docs\n- **ReDoc**: http://localhost:8000/redoc\n- **OpenAPI JSON**: http://localhost:8000/openapi.json\n\n## \ud83d\udcc1 Project Structure\n\n```\nSemWare/\n\u251c\u2500\u2500 src/semware/\n\u2502   \u251c\u2500\u2500 api/                    # FastAPI route handlers\n\u2502   \u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2502   \u251c\u2500\u2500 auth.py            # Authentication middleware\n\u2502   \u2502   \u251c\u2500\u2500 data.py            # Data CRUD operations\n\u2502   \u2502   \u251c\u2500\u2500 search.py          # Search endpoints  \n\u2502   \u2502   \u2514\u2500\u2500 tables.py          # Table management\n\u2502   \u251c\u2500\u2500 models/                 # Pydantic data models\n\u2502   \u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2502   \u251c\u2500\u2500 requests.py        # Request/response models\n\u2502   \u2502   \u2514\u2500\u2500 schemas.py         # Core data schemas\n\u2502   \u251c\u2500\u2500 services/              # Business logic services\n\u2502   \u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2502   \u251c\u2500\u2500 embedding.py  
     # ML embedding generation\n\u2502   \u2502   \u251c\u2500\u2500 search.py          # Search orchestration\n\u2502   \u2502   \u2514\u2500\u2500 vectordb.py        # Vector database operations\n\u2502   \u251c\u2500\u2500 utils/                 # Utility functions\n\u2502   \u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2502   \u251c\u2500\u2500 logging.py         # Logging configuration\n\u2502   \u2502   \u2514\u2500\u2500 tokenizer.py       # Text tokenization\n\u2502   \u251c\u2500\u2500 config.py              # Configuration management\n\u2502   \u2514\u2500\u2500 main.py                # FastAPI application factory\n\u251c\u2500\u2500 tests/                     # Comprehensive test suite\n\u2502   \u251c\u2500\u2500 conftest.py           # Test configuration & fixtures\n\u2502   \u251c\u2500\u2500 test_api/             # API endpoint tests\n\u2502   \u251c\u2500\u2500 test_services/        # Service layer tests\n\u2502   \u2514\u2500\u2500 test_utils/           # Utility function tests\n\u251c\u2500\u2500 pyproject.toml            # Project configuration\n\u251c\u2500\u2500 .env.example             # Environment template\n\u2514\u2500\u2500 README.md               # This file\n```\n\n## \u2699\ufe0f Configuration Reference\n\n| Variable | Description | Default | Required |\n|----------|-------------|---------|----------|\n| `API_KEY` | Authentication key for all endpoints | - | \u2705 |\n| `DEBUG` | Enable debug mode and API docs | `false` | \u274c |\n| `DB_PATH` | Database storage directory | `./data` | \u274c |\n| `HOST` | Server bind address | `0.0.0.0` | \u274c |\n| `PORT` | Server port | `8000` | \u274c |\n| `LOG_LEVEL` | Logging level (DEBUG/INFO/WARNING/ERROR) | `INFO` | \u274c |\n| `LOG_FILE` | Log file path (optional) | - | \u274c |\n| `EMBEDDING_MODEL_NAME` | Hugging Face model name | `all-MiniLM-L6-v2` | \u274c |\n| `EMBEDDING_DIMENSION` | Embedding vector dimensions | `384` | \u274c |\n| `MAX_TOKENS_PER_BATCH` | Max tokens per embedding 
batch | `2000` | \u274c |\n| `WORKERS` | Number of server workers | `1` | \u274c |\n\n## \ud83d\udea2 Deployment\n\n### Docker\n```dockerfile\nFROM python:3.11-slim\n\nWORKDIR /app\nCOPY . .\n\nRUN pip install uv\nRUN uv sync --native-tls\n\nEXPOSE 8000\nCMD [\"uv\", \"run\", \"--native-tls\", \"uvicorn\", \"semware.main:app\", \"--host\", \"0.0.0.0\", \"--port\", \"8000\"]\n```\n\n### Production Considerations\n- Use multiple workers: `--workers 4`\n- Enable access logs: `--access-log`\n- Set up reverse proxy (nginx) for HTTPS termination\n- Configure log rotation and monitoring\n- Use a dedicated vector storage solution for large scale\n\n## \ud83e\udd1d Contributing\n\nWe welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.\n\n1. Fork the repository\n2. Create a feature branch: `git checkout -b feature/amazing-feature`\n3. Make your changes and add tests\n4. Run the test suite: `uv run --native-tls pytest`\n5. Submit a pull request\n\n## \ud83d\udcca Performance\n\n**Benchmarks** (on Apple M2 Pro, 16GB RAM):\n- **Embedding Generation**: ~200ms per batch (2000 tokens)\n- **Document Insertion**: ~500ms per document (including embedding)\n- **Vector Search**: <50ms for similarity search across 10K documents\n- **Throughput**: ~100 requests/second with 4 workers\n\n## \ud83d\udc1b Troubleshooting\n\n### Common Issues\n\n**Authentication Errors**\n```bash\n# Ensure API key is set correctly\nexport API_KEY=your-secret-key\n# Or check your .env file\n```\n\n**Model Download Issues**\n```bash\n# Clear Hugging Face cache\nrm -rf ~/.cache/huggingface/\n# Restart with debug logging\nDEBUG=true uv run --native-tls python -m semware.main\n```\n\n**Database Permissions**\n```bash\n# Ensure write permissions to data directory\nchmod 755 ./data\n```\n\n## \ud83d\udcc4 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## \ud83d\ude4f Acknowledgments\n\n- **FastAPI** for the excellent 
async web framework\n- **LanceDB** for high-performance vector storage\n- **Hugging Face** for the transformer models and ecosystem\n- **Pydantic** for robust data validation\n- **The Python Community** for the amazing open-source ecosystem\n\n---\n\n<p align=\"center\">\n  <strong>Built with \u2764\ufe0f by the SemWare team</strong>\n</p>\n\n<p align=\"center\">\n  <a href=\"https://github.com/semware/semware/issues\">Report Bug</a> \u2022\n  <a href=\"https://github.com/semware/semware/discussions\">Discussions</a> \u2022\n  <a href=\"https://github.com/semware/semware/wiki\">Wiki</a>\n</p>",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Semantic search API server using vector databases and ML embeddings",
    "version": "0.1.0",
    "project_urls": null,
    "split_keywords": [
        "api",
        " embeddings",
        " machine-learning",
        " semantic-search",
        " vector-database"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "4439860f21c0724ca08ffa9dc258a3d92db7529497e0755a8ea846c3f848b53a",
                "md5": "b686de875291ac586b5c238ff11981d7",
                "sha256": "9322ccfa2c6b5f1bd808dda6f3dcb5a1a82115c482d2e02b28653e688374de67"
            },
            "downloads": -1,
            "filename": "semware-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b686de875291ac586b5c238ff11981d7",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.11",
            "size": 27974,
            "upload_time": "2025-09-10T03:14:45",
            "upload_time_iso_8601": "2025-09-10T03:14:45.112356Z",
            "url": "https://files.pythonhosted.org/packages/44/39/860f21c0724ca08ffa9dc258a3d92db7529497e0755a8ea846c3f848b53a/semware-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "8a02c6394e787bf1bb03319e324567ac107c6550550fc6c0febdffc0aa6bb7a2",
                "md5": "a87b340ee795e67d0cd7917c505d1d6d",
                "sha256": "8ce0a48cb8a9395fc2a35c863f7141a371dd0e11f1cf2cf54ef615b02dbe0f36"
            },
            "downloads": -1,
            "filename": "semware-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "a87b340ee795e67d0cd7917c505d1d6d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 221366,
            "upload_time": "2025-09-10T03:14:47",
            "upload_time_iso_8601": "2025-09-10T03:14:47.037358Z",
            "url": "https://files.pythonhosted.org/packages/8a/02/c6394e787bf1bb03319e324567ac107c6550550fc6c0febdffc0aa6bb7a2/semware-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-10 03:14:47",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "semware"
}
