# Milvus Ingest - High-Performance Data Ingestion Tool

🚀 **Ultra-fast data ingestion tool for Milvus vector databases** - Built for large-scale data generation and ingestion with vectorized operations, parallel processing, and optimized file I/O. Generate and ingest millions of rows in minutes with automatic file partitioning and intelligent memory management.

## ⚡ Performance Highlights

- **🏎️ 10,000-100,000+ rows/sec** - Vectorized NumPy operations for maximum speed
- **📈 Large-scale optimized** - Designed for datasets >100K rows with intelligent batching  
- **🔥 Smart file partitioning** - Automatic splitting (256MB chunks, 1M rows/file)
- **💾 Memory efficient** - Streaming generation prevents memory exhaustion
- **⚡ Direct PyArrow I/O** - Optimized Parquet writing with Snappy compression
- **🔄 Parallel processing** - Multi-core CPU utilization with configurable workers

## ✨ Key Features

- 🎯 **Ready-to-use schemas** - Pre-built schemas for e-commerce, documents, images, users, news, videos, and more
- 📚 **Schema management** - Add, organize, and reuse custom schemas with metadata
- 🚀 **High-performance generation** - Vectorized operations optimized for large datasets
- 🔧 **Complete Milvus support** - All field types including vectors, arrays, JSON, and primitive types
- ✅ **Smart validation** - Pydantic-based validation with detailed error messages and suggestions
- 📊 **High-performance formats** - Parquet (fastest I/O), JSON (structured data)
- 🌱 **Reproducible results** - Seed support for consistent data generation
- 🎨 **Rich customization** - Field constraints, nullable fields, auto-generated IDs
- 🔍 **Schema exploration** - Validation, help commands, and schema details
- 🏠 **Unified interface** - Use custom and built-in schemas interchangeably

## Installation

```bash
# Install from source (recommended for development)
git clone https://github.com/zilliz/milvus-ingest.git
cd milvus-ingest
pdm install  # Installs with development dependencies

# For production use only
pdm install --prod

# After installation, the CLI tool is available as:
milvus-ingest --help
```
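
**Note:** The package is also published on PyPI, so a plain `pip install milvus-ingest` should work if you only need the CLI and Python API without a development setup.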

## 🚀 Quick Start

### 1. Use Built-in Schemas (Recommended)

Get started instantly with pre-built schemas optimized for large-scale generation:

```bash
# List all available built-in schemas
milvus-ingest schema list

# Generate data using a built-in schema (high-performance by default)
milvus-ingest generate --builtin simple --rows 100000 --preview

# Generate large e-commerce dataset with automatic file partitioning
milvus-ingest generate --builtin ecommerce --rows 2500000 --out products/
```

**Available Built-in Schemas:**
| Schema | Description | Use Cases |
|--------|-------------|-----------|
| `simple` | Basic example with common field types | Learning, testing |
| `ecommerce` | Product catalog with search embeddings | Online stores, recommendations |
| `documents` | Document search with semantic embeddings | Knowledge bases, document search |
| `images` | Image gallery with visual similarity | Media platforms, image search |
| `users` | User profiles with behavioral embeddings | User analytics, personalization |
| `videos` | Video library with multimodal embeddings | Video platforms, content discovery |
| `news` | News articles with sentiment analysis | News aggregation, content analysis |
| `audio_transcripts` | Audio transcription with FP16 embeddings | Speech-to-text search, podcasts |
| `ai_conversations` | AI chat history with BF16 embeddings | Chatbot analytics, conversation search |
| `face_recognition` | Facial recognition with binary vectors | Security systems, identity verification |
| `ecommerce_partitioned` | Partitioned e-commerce schema | Scalable product catalogs |
| `cardinality_demo` | Schema demonstrating cardinality features | Testing cardinality constraints |

### 2. Create Custom Schemas

Define your own collection structure with JSON or YAML:

```json
{
  "collection_name": "my_collection",
  "fields": [
    {
      "name": "id",
      "type": "Int64",
      "is_primary": true
    },
    {
      "name": "title",
      "type": "VarChar",
      "max_length": 256
    },
    {
      "name": "embedding",
      "type": "FloatVector",
      "dim": 128
    }
  ]
}
```

```bash
# Generate large dataset from custom schema with high-performance mode
milvus-ingest generate --schema my_schema.json --rows 1000000 --format parquet --preview
```

**Note:** Output is always a directory containing data files (in the specified format) and a `meta.json` file with collection metadata.
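
For downstream processing, that output directory is easy to consume from Python. A minimal sketch, assuming the layout described above (`meta.json` plus one or more Parquet files; exact keys inside `meta.json` beyond the collection name are not guaranteed here):

```python
# Minimal sketch: load a generated output directory for inspection.
# Assumes pandas with pyarrow installed; "products" is a hypothetical
# --out directory from the commands above.
import json
from pathlib import Path

import pandas as pd

out_dir = Path("products")
meta = json.loads((out_dir / "meta.json").read_text())
print("collection:", meta.get("collection_name"))

frames = [pd.read_parquet(p) for p in sorted(out_dir.glob("*.parquet"))]
df = pd.concat(frames, ignore_index=True)
print(f"loaded {len(df)} rows from {len(frames)} file(s)")
```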

### 3. Schema Management

Store and organize your schemas for reuse:

```bash
# Add a custom schema to your library
milvus-ingest schema add my_products product_schema.json

# List all schemas (built-in + custom)
milvus-ingest schema list

# Use your custom schema like a built-in one (optimized for large datasets)
milvus-ingest generate --builtin my_products --rows 500000

# Show detailed schema information
milvus-ingest schema show my_products
```

### 4. Python API

```python
from milvus_ingest.generator import generate_mock_data
from milvus_ingest.schema_manager import get_schema_manager
import json
import os
import tempfile

# Use the schema manager to work with schemas
manager = get_schema_manager()

# List all available schemas
all_schemas = manager.list_all_schemas()
print("Available schemas:", list(all_schemas.keys()))

# Load any schema (built-in or custom)
schema = manager.get_schema("ecommerce")  # Built-in
# schema = manager.get_schema("my_products")  # Custom

# Generate data from schema file
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
    json.dump(schema, f, indent=2)
    f.flush()
    
    # Generate data (returns dict with DataFrame and metadata)
    result = generate_mock_data(f.name, rows=10000, seed=42, output_format="dict")
    df = result["data"]
    metadata = result["metadata"]

os.unlink(f.name)  # clean up the temporary schema file (delete=False above)

print(df.head())
print(f"Generated {len(df)} rows for collection: {metadata['collection_name']}")

# Add a custom schema programmatically
custom_schema = {
    "collection_name": "my_collection",
    "fields": [
        {"name": "id", "type": "Int64", "is_primary": True},
        {"name": "text", "type": "VarChar", "max_length": 100},
        {"name": "vector", "type": "FloatVector", "dim": 256}
    ]
}

manager.add_schema("my_custom", custom_schema, "Custom schema", ["testing"])
print("Added custom schema!")
```

## 📋 Schema Reference

### Supported Field Types

| Type | Description | Required Parameters | Optional Parameters |
|------|-------------|-------------------|-------------------|
| **Numeric Types** | | | |
| `Int8`, `Int16`, `Int32`, `Int64` | Integer types | - | `min`, `max` |
| `Float`, `Double` | Floating point | - | `min`, `max` |
| `Bool` | Boolean values | - | - |
| **Text Types** | | | |
| `VarChar`, `String` | Variable length string | `max_length` | - |
| `JSON` | JSON objects | - | - |
| **Vector Types** | | | |
| `FloatVector` | 32-bit float vectors | `dim` | - |
| `BinaryVector` | Binary vectors | `dim` | - |
| `Float16Vector` | 16-bit float vectors | `dim` | - |
| `BFloat16Vector` | Brain float vectors | `dim` | - |
| `SparseFloatVector` | Sparse float vectors | `dim` | - |
| **Complex Types** | | | |
| `Array` | Array of elements | `element_type`, `max_capacity` | `max_length` (for string elements) |

### Field Properties

| Property | Description | Applicable Types |
|----------|-------------|------------------|
| `is_primary` | Mark field as primary key (exactly one required) | All types |
| `auto_id` | Auto-generate ID values | Int64 primary keys only |
| `nullable` | Allow null values (10% probability) | All types |
| `min`, `max` | Value constraints | Numeric types |
| `max_length` | String/element length limit | String and Array types |
| `dim` | Vector dimension (1-32768) | Vector types |
| `element_type` | Array element type | Array type |
| `max_capacity` | Array capacity (1-4096) | Array type |
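
Schemas are checked with Pydantic-based validation (see Key Features). As a rough illustration of the constraints in the tables above — not the tool's actual models, which live in `milvus_ingest.models` — a simplified sketch might look like:

```python
# Illustrative only: a simplified Pydantic model in the spirit of the
# schema validation described above; the real models are richer.
from pydantic import BaseModel, Field, model_validator

class FieldSpec(BaseModel):
    name: str
    type: str
    is_primary: bool = False
    nullable: bool = False
    dim: int | None = Field(default=None, ge=1, le=32768)          # vector types
    max_length: int | None = None                                  # string types
    max_capacity: int | None = Field(default=None, ge=1, le=4096)  # arrays

    @model_validator(mode="after")
    def require_vector_dim(self) -> "FieldSpec":
        if self.type.endswith("Vector") and self.dim is None:
            raise ValueError(f"field '{self.name}': vector types require 'dim'")
        return self

class CollectionSpec(BaseModel):
    collection_name: str
    fields: list[FieldSpec]
```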

### Complete Example

```yaml
collection_name: "advanced_catalog"
fields:
  # Primary key with auto-generated IDs
  - name: "id"
    type: "Int64"
    is_primary: true
    auto_id: true
  
  # Text fields with constraints
  - name: "title"
    type: "VarChar"
    max_length: 200
  
  - name: "description"
    type: "VarChar"
    max_length: 1000
    nullable: true
  
  # Numeric fields with ranges
  - name: "price"
    type: "Float"
    min: 0.01
    max: 9999.99
  
  - name: "rating"
    type: "Int8"
    min: 1
    max: 5
  
  # Vector for semantic search
  - name: "embedding"
    type: "FloatVector"
    dim: 768
  
  # Array of tags
  - name: "tags"
    type: "Array"
    element_type: "VarChar"
    max_capacity: 10
    max_length: 50
  
  # Structured metadata
  - name: "metadata"
    type: "JSON"
    nullable: true
  
  # Boolean flags
  - name: "in_stock"
    type: "Bool"
```

## 📚 CLI Reference

### Command Structure

The CLI uses a clean grouped structure:

```bash
# Main command groups
milvus-ingest generate [options]  # Data generation
milvus-ingest schema [command]    # Schema management
milvus-ingest clean [options]     # Utility commands
```

### Data Generation Options

| Option | Description | Example |
|--------|-------------|---------|
| `--schema PATH` | Generate from custom schema file | `milvus-ingest generate --schema my_schema.json` |
| `--builtin SCHEMA_ID` | Use built-in or managed schema | `milvus-ingest generate --builtin ecommerce` |
| `--rows INTEGER` | Number of rows to generate | `milvus-ingest generate --rows 5000` |
| `--format FORMAT` | Output format (parquet, json) | `milvus-ingest generate --format json` |
| `--out DIRECTORY` | Output directory path | `milvus-ingest generate --out my_data/` |
| `--preview` | Show first 5 rows | `milvus-ingest generate --preview` |
| `--seed INTEGER` | Random seed for reproducibility | `milvus-ingest generate --seed 42` |
| `--validate-only` | Validate schema without generating | `milvus-ingest generate --validate-only` |
| `--no-progress` | Disable progress bar display | `milvus-ingest generate --no-progress` |
| `--batch-size INTEGER` | Batch size for memory efficiency (default: 50000) | `milvus-ingest generate --batch-size 100000` |
| `--max-file-size INTEGER` | Maximum size per file in MB (default: 256) | `milvus-ingest generate --max-file-size 100` |
| `--max-rows-per-file INTEGER` | Maximum rows per file (default: 1000000) | `milvus-ingest generate --max-rows-per-file 500000` |
| `--force` | Force overwrite output directory | `milvus-ingest generate --force` |

### Schema Management Commands

| Command | Description | Example |
|---------|-------------|---------|
| `schema list` | List all schemas (built-in + custom) | `milvus-ingest schema list` |
| `schema show SCHEMA_ID` | Show schema details | `milvus-ingest schema show ecommerce` |
| `schema add SCHEMA_ID FILE` | Add custom schema | `milvus-ingest schema add products schema.json` |
| `schema remove SCHEMA_ID` | Remove custom schema | `milvus-ingest schema remove products` |
| `schema help` | Show schema format help | `milvus-ingest schema help` |

### Utility Commands

| Command | Description | Example |
|---------|-------------|---------|
| `clean` | Clean up generated output files | `milvus-ingest clean --yes` |
| `--help` | Show help message | `milvus-ingest --help` |

### Common Usage Patterns

```bash
# Quick start with built-in schema (high-performance by default)
milvus-ingest generate --builtin simple --rows 100000 --preview

# Generate massive datasets with automatic file partitioning 
milvus-ingest generate --builtin ecommerce --rows 5000000 --format parquet --out products/

# Test custom schema validation
milvus-ingest generate --schema my_schema.json --validate-only

# Reproducible large-scale data generation
milvus-ingest generate --builtin users --rows 2000000 --seed 42 --out users/

# Control file partitioning (smaller files for easier handling)
milvus-ingest generate --builtin ecommerce --rows 5000000 --max-file-size 128 --max-rows-per-file 500000

# Schema management workflow
milvus-ingest schema list
milvus-ingest schema show ecommerce
milvus-ingest schema add my_ecommerce ecommerce_base.json

# Clean up generated output files
milvus-ingest clean --yes
```

## 🔗 Milvus Integration

### Direct Insert to Milvus

Insert generated data directly into Milvus with automatic collection creation:

```bash
# Generate data first
milvus-ingest generate --builtin ecommerce --rows 100000 --out products/

# Insert to local Milvus (default: localhost:19530)
milvus-ingest to-milvus insert ./products/

# Insert to remote Milvus with authentication
milvus-ingest to-milvus insert ./products/ \
    --uri http://192.168.1.100:19530 \
    --token your-api-token \
    --db-name custom_db

# Insert with custom settings
milvus-ingest to-milvus insert ./products/ \
    --collection-name product_catalog \
    --batch-size 5000 \
    --drop-if-exists
```

**Direct Insert Features:**
- ✅ Automatic collection creation from metadata
- ✅ Smart index creation based on vector dimensions
- ✅ Progress tracking with batch processing
- ✅ Support for authentication and custom databases
- ✅ Connection testing before import
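
For reference, a rough pymilvus equivalent of the insert path — assuming a schema-compatible collection already exists and using hypothetical file and collection names; the CLI additionally handles collection creation, index setup, and progress tracking:

```python
# Rough pymilvus sketch of the direct-insert path (illustrative names).
import pandas as pd
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # token="..." for auth

df = pd.read_parquet("products/data.parquet")  # hypothetical data file
rows = df.to_dict(orient="records")  # vector columns may need list() conversion

batch_size = 5000
for i in range(0, len(rows), batch_size):
    client.insert(collection_name="product_catalog", data=rows[i : i + batch_size])
```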

### Bulk Import from S3/MinIO

For very large datasets, use bulk import with pre-uploaded files:

```bash
# First, upload to S3/MinIO
milvus-ingest upload ./products/ s3://bucket/data/ \
    --endpoint-url http://minio:9000 \
    --access-key-id minioadmin \
    --secret-access-key minioadmin

# Then bulk import to Milvus
milvus-ingest to-milvus import product_catalog s3://bucket/data/file1.parquet

# Import multiple files
milvus-ingest to-milvus import product_catalog \
    s3://bucket/data/file1.parquet \
    s3://bucket/data/file2.parquet

# Import all files from directory
milvus-ingest to-milvus import product_catalog ./products/

# Import and wait for completion
milvus-ingest to-milvus import product_catalog ./products/ \
    --wait \
    --timeout 300
```

**Bulk Import Features:**
- ⚡ High-performance import for millions of rows
- 📁 Support for single/multiple files or directories
- ⏳ Asynchronous operation with job tracking
- 🔄 Wait for completion with timeout support
- 📊 Import job status monitoring
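
Under the hood this corresponds to Milvus's bulk-insert API. A hedged pymilvus sketch — note that file paths are object keys relative to the storage bucket Milvus itself is configured with, and the collection must already exist:

```python
# Hedged sketch of the Milvus bulk-insert API via pymilvus (illustrative
# names; requires a pre-existing collection and pre-uploaded files).
from pymilvus import connections, utility

connections.connect(uri="http://localhost:19530")
task_id = utility.do_bulk_insert(
    collection_name="product_catalog",
    files=["data/file1.parquet"],  # object key inside Milvus's bucket
)
state = utility.get_bulk_insert_state(task_id)
print("task", task_id, "state:", state.state_name)
```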

### S3/MinIO Upload

Upload generated data to S3-compatible storage:

```bash
# Upload to AWS S3 (using default credentials)
milvus-ingest upload ./output s3://my-bucket/data/

# Upload to MinIO with custom endpoint
milvus-ingest upload ./output s3://my-bucket/data/ \
    --endpoint-url http://localhost:9000 \
    --access-key-id minioadmin \
    --secret-access-key minioadmin

# Upload with environment variables
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret
milvus-ingest upload ./output s3://my-bucket/data/

# Disable SSL verification for local MinIO
milvus-ingest upload ./output s3://my-bucket/data/ \
    --endpoint-url http://localhost:9000 \
    --no-verify-ssl
```
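
The upload step is plain S3 traffic, so a boto3 equivalent is straightforward — a minimal sketch assuming the same MinIO endpoint and default credentials as above:

```python
# Minimal boto3 equivalent of the upload step (assumes MinIO at
# localhost:9000 with the minioadmin credentials shown above).
from pathlib import Path

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)
for path in Path("output").iterdir():
    if path.is_file():
        s3.upload_file(str(path), "my-bucket", f"data/{path.name}")
```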

### Complete Workflow Example

```bash
# 1. Generate large dataset
milvus-ingest generate --builtin ecommerce --rows 5000000 --out products/

# 2. Option A: Direct insert (for smaller datasets)
milvus-ingest to-milvus insert ./products/ \
    --uri http://milvus:19530 \
    --collection-name ecommerce_products

# 2. Option B: Bulk import (for very large datasets)
# First upload to MinIO
milvus-ingest upload ./products/ s3://milvus-data/products/ \
    --endpoint-url http://minio:9000

# Then bulk import
milvus-ingest to-milvus import ecommerce_products \
    s3://milvus-data/products/ \
    --wait
```

### Import Method Comparison

| Method | Best For | Speed | Max Size | Features |
|--------|----------|-------|----------|----------|
| **Direct Insert** | <1M rows | Moderate | Limited by memory | Automatic collection creation, progress bar |
| **Bulk Import** | >1M rows | Very Fast | 16GB per file | Async operation, job tracking |

**Important Notes:**
- Files must be uploaded to S3/MinIO before bulk import
- Maximum 1024 files per import request
- Each file should not exceed 16GB
- Collection must exist for bulk import (create with direct insert first if needed)

## 🛠️ Development

This project uses PDM for dependency management and follows modern Python development practices.

### Setup Development Environment

```bash
# Clone and setup
git clone https://github.com/zilliz/milvus-ingest.git
cd milvus-ingest
pdm install  # Install development dependencies
```

### Development Workflow

```bash
# Code formatting and linting
pdm run ruff format src tests    # Format code
pdm run ruff check src tests     # Check linting
pdm run mypy src                 # Type checking

# Testing
pdm run pytest                           # Run all tests
pdm run pytest --cov=src --cov-report=html  # With coverage
pdm run pytest tests/test_generator.py   # Specific test file

# Combined quality checks
make lint test                   # Run linting and tests together
```

### Project Structure

```
src/milvus_ingest/
├── cli.py              # Click-based CLI interface
├── generator.py        # Core data generation logic  
├── optimized_writer.py # High-performance vectorized data generation
├── models.py           # Pydantic schema validation models
├── schema_manager.py   # Schema management system
├── builtin_schemas.py  # Built-in schema definitions and metadata
├── rich_display.py     # Rich terminal formatting and UI
├── logging_config.py   # Loguru-based structured logging
├── exceptions.py       # Custom exception classes
├── uploader.py         # S3/MinIO upload functionality
├── milvus_inserter.py  # Direct Milvus insertion
├── milvus_importer.py  # Bulk import from S3/MinIO
└── schemas/            # Built-in schema JSON files (12 schemas)
    ├── simple.json
    ├── ecommerce.json
    ├── documents.json
    ├── images.json
    ├── users.json
    ├── videos.json
    ├── news.json
    ├── audio_transcripts.json
    ├── ai_conversations.json
    ├── face_recognition.json
    ├── ecommerce_partitioned.json
    └── cardinality_demo.json
```

## 📊 Performance Benchmarks

The high-performance engine delivers exceptional speed for large-scale data generation:

| Dataset Size | Time | Throughput | Memory Usage | File Output |
|-------------|------|------------|--------------|-------------|
| 100K rows | ~13s | 7,500 rows/sec | <1GB | Single file |
| 1M rows | ~87s | 11,500 rows/sec | <2GB | Single file |
| 2.5M rows | ~217s | 11,500 rows/sec | <3GB | 5 files (auto-partitioned) |
| 10M rows | ~870s | 11,500 rows/sec | <4GB | 10 files (auto-partitioned) |

**Key Performance Features:**
- **Vectorized Operations**: NumPy-based batch processing for maximum CPU efficiency
- **Smart Memory Management**: Streaming generation prevents memory exhaustion
- **Automatic File Partitioning**: Files split at 256MB/1M rows for optimal handling
- **Optimized I/O**: Direct PyArrow integration with Snappy compression
- **Parallel Processing**: Multi-core utilization for vector generation and normalization
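
To make the vectorized-operations point concrete, the core pattern is generating and normalizing whole batches of vectors with NumPy instead of per-row Python loops — an illustrative sketch, not the tool's actual code:

```python
# Illustrative vectorized pattern: build and L2-normalize 50K
# 768-dim float vectors in two array operations, no Python loops.
import numpy as np

rng = np.random.default_rng(42)
vecs = rng.standard_normal((50_000, 768), dtype=np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
```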

**Recommended Settings for Large Datasets:**
- Use `--format parquet` for fastest I/O (default)
- Batch size 50K-100K rows for optimal memory/speed balance
- Enable automatic file partitioning for datasets >1M rows

## 🤝 Contributing

We welcome contributions! Please follow these steps:

1. **Fork** the repository
2. **Create** a feature branch: `git checkout -b feature/amazing-feature`
3. **Make** your changes with tests
4. **Ensure** quality checks pass: `make lint test`
5. **Commit** changes: `git commit -m 'Add amazing feature'`
6. **Push** to branch: `git push origin feature/amazing-feature`
7. **Open** a Pull Request

### Contribution Guidelines

- Add tests for new functionality
- Update documentation for API changes
- Follow existing code style (ruff + mypy)
- Include helpful error messages for user-facing features

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Built for the [Milvus](https://milvus.io/) vector database ecosystem
- Optimized with [NumPy](https://numpy.org/) vectorized operations for maximum performance
- Uses [PyArrow](https://arrow.apache.org/docs/python/) for efficient Parquet I/O
- Powered by [Pandas](https://pandas.pydata.org/) and [Faker](https://faker.readthedocs.io/) for realistic data generation
            
