s3vectors-embed-cli

Name	s3vectors-embed-cli JSON
Version	0.1.1 JSON
	download
home_page	https://github.com/awslabs/s3vectors-embed-cli
Summary	Standalone CLI for S3 Vector operations with Bedrock embeddings
upload_time	2025-07-24 01:28:27
maintainer	Vaibhav Sabharwal
docs_url	None
author	Vaibhav Sabharwal
requires_python	>=3.8
license	Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
keywords	aws s3 vectors embeddings bedrock cli machine-learning ai
VCS
bugtrack_url
requirements	boto3 botocore click rich pydantic
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            # Amazon S3 Vectors Embed CLI

Amazon S3 Vectors Embed CLI is a standalone command-line tool that simplifies the process of working with vector embeddings in S3 Vectors. You can create vector embeddings for your data using Amazon Bedrock and store and query them in your S3 vector index using single commands. 

**Amazon S3 Vectors Embed CLI is in preview release and is subject to change.**

## Supported Commands

**s3vectors-embed put**: Embed text, file content, or S3 objects and store them as vectors in an S3 vector index.
You can create and ingest vector embeddings into an S3 vector index using a single put command. You specify the data input you want to create an embedding for, an Amazon Bedrock embeddings model ID, your S3 vector bucket name, and S3 vector index name. The command supports several input formats including text data, a local text or image file, an S3 image or text object or prefix. The command generates embeddings using the dimensions configured in your S3 vector index properties. If you are ingesting embeddings for several objects in an S3 prefix or local file path, it automatically uses batch processes to maximize throughput. 

**Note**: Each file is processed as a single embedding. Document chunking is not currently supported. 

**s3vectors-embed query**: Embed a query input and search for similar vectors in an S3 vector index.
You can perform similarity queries for vector embeddings in your S3 vector index using a single query command. You specify your query input, an Amazon Bedrock embeddings model ID, the vector bucket name, and vector index name. The command accepts several types of query inputs like a text string, an image file, or a single S3 text or image object. The command generates embeddings for your query using the input embeddings model and then performs a similarity search to find the most relevant matches. You can control the number of results returned, apply metadata filters to narrow your search, and choose whether to include similarity distance in the results for comprehensive analysis.


## Installation and Configuration
### Prerequisites
- Python 3.8 or higher
- To execute the CLI, you will need AWS credentials configured. 
- Update your AWS account with appropriate permissions to use Amazon Bedrock and S3 Vectors
- Access to an Amazon Bedrock embedding model
- Create an Amazon S3 vector bucket and vector index to store your embeddings

### Quick Install (Recommended)
```bash
pip install s3vectors-embed-cli
```

### Development Install
```bash
# Clone the repository
git clone https://github.com/awslabs/s3vectors-embed-cli
cd s3vectors-embed-cli

# Install in development mode
pip install -e .
```

**Note**: All dependencies are automatically installed when you install the package via pip.


### Quick Start

#### **Put Examples**

1. **Embed text and store them as vectors in your S3 vector index:**
```bash
s3vectors-embed put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 \
  --text-value "Hello, world!"
```

2. **Process local text files:**
```bash
s3vectors-embed put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 \
  --text "./documents/sample.txt"
```

3. **Process image files using a local file path:**
```bash
s3vectors-embed put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-image-v1 \
  --image "./images/photo.jpg"
```

4. **Process files from a local file path using wildcard characters:**
```bash
s3vectors-embed put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 \
  --text "./documents/*.txt"
```

5. **Process files from an S3 general purpose bucket using wildcard characters:**
```bash
s3vectors-embed put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 \
  --text "s3://bucket/path/*"
```

6. **Add metadata alongside your vectors:**
```bash
s3vectors-embed put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 \
  --text "s3://my-bucket/sample.txt"
  --metadata '{"category": "technology", "version": "1.0"}'
```

#### **Query Examples**

1. **Query with no filters:**
```bash
s3vectors-embed query \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 \
  --query-input "query text" \
  --k 10
```

2. **Query using a local text file as input:**
```bash
s3vectors-embed query \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 \
  --query-input "./query.txt" \
  --k 5 \
  --output table
```

3. **Query using an S3 text file as input:**
```bash
s3vectors-embed query \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 \
  --query-input "s3://my-bucket/image.jpeg" \
  --k 3 
```

4. **Query with metadata filters:**
```bash
s3vectors-embed query \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 \
  --query-input "query text" \
  --filter '{"category": {"$eq": "technology"}}' \
  --k 10 \
  --return-metadata
```

5. **Query with multiple metadata filters (AND):**
```bash
s3vectors-embed query \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 \
  --query-input "query text" \
  --filter '{"$and": [{"category": "technology"}, {"version": "1.0"}]}' \
  --k 10 \
  --return-metadata
```

6. **Query with multiple metadata filters (OR):**
```bash
s3vectors-embed query \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 \
  --query-input "query text" \
  --filter '{"$or": [{"category": "docs"}, {"category": "guides"}]}' \
  --k 5
```

7. **Query with metadata filters (comparison operators):**
```bash
s3vectors-embed query \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 \
  --query-input "query text" \
  --filter '{"$and": [{"category": "tech"}, {"version": {"$gte": "1.0"}}]}' \
  --k 10
```


### Command Parameters

#### Global Options
- `--debug`: Enable debug mode with detailed logging for troubleshooting
- `--profile`: AWS profile name to use from ~/.aws/credentials
- `--region`: AWS region name (overrides session/config defaults)

#### Put Command Parameters
Required:
- `--vector-bucket-name`: Name of the S3 vector bucket 
- `--index-name`: Name of the vector index in your vector index to store the vector embeddings
- `--model-id`: Bedrock model ID to use for generating embeddings (e.g., amazon.titan-embed-text-v2:0)

Input Options (one required):
- `--text-value`: Direct text input to embed
- `--text`: Text input - supports multiple input types:
  - **Local file**: `./document.txt`
  - **Local files with wildcard characters**: `./data/*.txt`, `~/docs/*.md`
  - **S3 object**: `s3://bucket/path/file.txt`
  - **S3 path with wildcard characters**: `s3://bucket/path/*` (prefix-based, not extension-based)
- `--image`: Image input - supports multiple input types:
  - **Local file**: `./document.jpg`
  - **Local wildcard**: `./data/*.jpg`
  - **S3 object**: `s3://bucket/path/file.jpg`
  - **S3 path with wildcard characters**: `s3://bucket/path/*` (prefix-based, not extension-based)

Optional:
- `--key`: Uniquely identifies each vector in the vector index (default: auto-generated UUID)
- `--metadata`: Additional metadata associated with the vector; provided as JSON string
- `--bucket-owner`: AWS account ID for cross-account S3 access
- `--output`: Output format (json or table, default: json)

#### Query Command Parameters
Required:
- `--vector-bucket-name`: Name of the S3 vector bucket
- `--index-name`: Name of the vector index 
- `--model-id`: Bedrock model ID to use for generating embeddings (e.g., amazon.titan-embed-text-v2:0)
- `--query-input`: Query text or file path (local file or S3 URI)

Optional:
- `--k`: Number of results to return (default: 5)
- `--filter`: Filter expression for metadata-based filtering (JSON format with AWS S3 Vectors API operators)
- `--return-metadata`: Include metadata in results (default: true)
- `--return-distance`: Include similarity distance
- `--output`: Output format (table or json, default: json)
- `--region`: AWS region name

Example with all optional parameters:
```bash
s3vectors-embed query --vector-bucket-name my-bucket --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 --query-input "search query" \
  --k 10 --filter '{"$and": [{"category": "tech"}, {"version": {"$gte": "1.0"}}]}' --return-metadata \
  --return-distance --output table --region us-west-2
```

### Model Compatibility
| Model | Type | Dimensions | Use Case |
|-------|------|------------|----------|
| `amazon.titan-embed-text-v2:0` | Text | 1024, 512, 256 | Modern text embedding |
| `amazon.titan-embed-text-v1` | Text | 1536 | Legacy text embedding |
| `amazon.titan-embed-image-v1` | Multimodal (Text + Image) | 1024, 384, 256 | Text and image embedding |
| `cohere.embed-english-v3` | Multimodal (Text or Image) | 1024 | Advanced English text or image embedding |
| `cohere.embed-multilingual-v3` | Multimodal (Text or Image) | 1024 | Multilingual text or image embedding |

## Metadata Filtering

### **Supported Operators**

#### **Comparison Operators**
- `$eq`: Equal to
- `$ne`: Not equal to  
- `$gt`: Greater than
- `$gte`: Greater than or equal to
- `$lt`: Less than
- `$lte`: Less than or equal to
- `$in`: Value in array
- `$nin`: Value not in array

#### **Logical Operators**
- `$and`: Logical AND (all conditions must be true)
- `$or`: Logical OR (at least one condition must be true)
- `$not`: Logical NOT (condition must be false)

### **Filter Examples**

#### **Single Condition Filters**
```bash
# Exact match
--filter '{"category": {"$eq": "documentation"}}'

# Not equal
--filter '{"status": {"$ne": "archived"}}'

# Greater than or equal
--filter '{"version": {"$gte": "2.0"}}'

# Value in list
--filter '{"category": {"$in": ["docs", "guides", "tutorials"]}}'
```

#### **Multiple Condition Filters**
```bash
# AND condition (all must be true)
--filter '{"$and": [{"category": "tech"}, {"version": "1.0"}]}'

# OR condition (at least one must be true)  
--filter '{"$or": [{"category": "docs"}, {"category": "guides"}]}'

# Complex nested conditions
--filter '{"$and": [{"category": "tech"}, {"$or": [{"version": "1.0"}, {"version": "2.0"}]}]}'

# NOT condition
--filter '{"$not": {"category": {"$eq": "archived"}}}'
```

#### **Advanced Filter Examples**
```bash
# Multiple AND conditions with comparison operators
--filter '{"$and": [{"category": "documentation"}, {"version": {"$gte": "1.0"}}, {"status": {"$ne": "draft"}}]}'

# OR with nested AND conditions
--filter '{"$or": [{"$and": [{"category": "docs"}, {"version": "1.0"}]}, {"$and": [{"category": "guides"}, {"version": "2.0"}]}]}'

# Using $in with multiple values
--filter '{"$and": [{"category": {"$in": ["docs", "guides"]}}, {"language": {"$eq": "en"}}]}'
```

### **Important Notes**

1. **JSON Format**: Filters must be valid JSON strings
2. **Quotes**: Use single quotes around the entire filter and double quotes inside JSON
3. **Case Sensitivity**: String comparisons are case-sensitive
3. **Data Types**: Ensure filter values match the data types in your metadata

## Metadata

The Amazon S3 Vectors Embed CLI automatically adds standard metadata fields to help track and manage your vector embeddings. Understanding these fields is important for filtering and troubleshooting your vector data.

### Standard Metadata Fields

The CLI automatically adds the following metadata fields to every vector:

#### `S3VECTORS-EMBED-SRC-CONTENT`
- **Purpose**: Stores the original text content. Configure this field as *nonFilterableMetadataKeys* while creating S3 vector index to store large text.
- **Behavior**:
  - **Direct text input** (`--text-value`): Contains the actual text content
  - **Text files**: Contains the full text content of the file
  - **Image files**: N/A (images don't have textual content to store) 

**Examples**:
```bash
# Direct text - stores the actual text
--text-value "Hello world" 
# Metadata: {"S3VECTORS-EMBED-SRC-CONTENT": "Hello world"}

# Text file - stores file content
--text document.txt
# Metadata: {"S3VECTORS-EMBED-SRC-CONTENT": "Contents of document.txt..."}

# Image file - no SOURCE_CONTENT field added
--image photo.jpg
# Metadata: {}
```

#### `S3VECTORS-EMBED-SRC-LOCATION`
- **Purpose**: Tracks the original file location
- **Behavior**:
  - **Text files**: Contains the file path or S3 URI
  - **Image files**: Contains the file path or S3 URI
  - **Direct text**: Not added (no file involved)

**Examples**:
```bash
# Local text file
--text /path/to/document.txt
# Metadata: {
#   "S3VECTORS-EMBED-SRC-CONTENT": "File contents...",
#   "S3VECTORS-EMBED-SRC-LOCATION": "file:///path/to/document.txt"
# }

# S3 text file
--text s3://my-bucket/docs/file.txt
# Metadata: {
#   "S3VECTORS-EMBED-SRC-CONTENT": "File contents...",
#   "S3VECTORS-EMBED-SRC-LOCATION": "s3://my-bucket/docs/file.txt"
# }

# Image file (local or S3)
--image /path/to/photo.jpg
# Metadata: {
#   "S3VECTORS-EMBED-SRC-LOCATION": "file:///path/to/photo.jpg"
# }

--image s3://my-bucket/images/photo.jpg
# Metadata: {
#   "S3VECTORS-EMBED-SRC-LOCATION": "s3://my-bucket/images/photo.jpg"
# }
```

### Additional Metadata

You can add your own metadata using the `--metadata` parameter with JSON format:

```bash
s3vectors-embed put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 \
  --text-value "Sample text" \
  --metadata '{"category": "documentation", "version": "1.0", "author": "team-a"}'
```

**Result**: Your metadata is merged with the two standard metadata fields:
```json
{
  "S3VECTORS-EMBED-SRC-CONTENT": "Sample text",
  "category": "documentation",
  "version": "1.0", 
  "author": "team-a"
}
```

## Output Formats

The CLI provides a simple output by default with an optional debug mode for more detailed information like progress information.

### Simple Output (Default)

The CLI provides a simple output without progress indicators:

```bash
# PUT output
s3vectors-embed put --vector-bucket-name my-bucket --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 --text-value "Hello"
```
**Output:**
```
{
  "key": "abc-123-def-456",
  "bucket": "my-bucket",
  "index": "my-index",
  "model": "amazon.titan-embed-text-v2:0",
  "contentType": "text",
  "embeddingDimensions": 1024,
  "metadata": {
    "S3VECTORS-EMBED-SRC-CONTENT": "Hello"
  }
}
```

### Debug option

Use `--debug` for comprehensive operational details:

```bash
# Debug mode provides detailed logging
s3vectors-embed --debug put --vector-bucket-name my-bucket --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 --text-value "Hello"
```

The CLI supports two output formats for query results:

### JSON Format (Default)
- **Machine-readable**: Perfect for programmatic processing
- **Complete data**: Shows full metadata content without truncation
- **Structured**: Easy to parse and integrate with other tools

```bash
# Uses JSON by default
s3vectors-embed query --vector-bucket-name my-bucket --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 --query-input "search text"

# Explicit JSON format (same as default)
s3vectors-embed query --vector-bucket-name my-bucket --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 --query-input "search text" --output json
```

**JSON Output Example:**
```json
{
  "results": [
    {
      "Key": "abc123-def456-ghi789",
      "distance": 0.2345,
      "metadata": {
        "S3VECTORS-EMBED-SRC-CONTENT": "Complete text content without any truncation...",
        "S3VECTORS-EMBED-SRC-LOCATION": "s3://bucket/path/file.txt",
        "category": "documentation",
        "author": "team-a"
      }
    }
  ],
  "summary": {
    "queryType": "text",
    "model": "amazon.titan-embed-text-v2:0",
    "index": "my-index",
    "resultsFound": 1,
    "queryDimensions": 1024
  }
}
```

### Table Format
- **Human-readable**: Easy to read and analyze visually
- **Complete data**: Shows full metadata content without truncation
- **Formatted**: Clean tabular display with proper alignment

```bash
# Explicit table format
s3vectors-embed query --vector-bucket-name my-bucket --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 --query-input "search text" --output table
```

## Wildcard Character Support

The CLI supports powerful wildcard characters in the input path for processing multiple files efficiently:

### **Local Filesystem Patterns (NEW)**

- **Basic wildcards**: `./data/*.txt` - all .txt files in data directory
- **Home directory**: `~/documents/*.md` - all .md files in user's documents
- **Recursive patterns**: `./docs/**/*.txt` - all .txt files recursively
- **Multiple extensions**: `./files/*.{txt,md,json}` - multiple file types
- **Question mark**: `./file?.txt` - single character wildcard

**Examples:**
```bash
# Process all text files in current directory
s3vectors-embed put --vector-bucket-name bucket --index-name idx \
  --model-id amazon.titan-embed-text-v2:0 --text "./*.txt"

# Process all markdown files in home directory
s3vectors-embed put --vector-bucket-name bucket --index-name idx \
  --model-id amazon.titan-embed-text-v2:0 --text "~/notes/*.md"

# Process files with pattern matching
s3vectors-embed put --vector-bucket-name bucket --index-name idx \
  --model-id amazon.titan-embed-text-v2:0 --text "./doc?.txt"
```


**Important**: S3 wildcards work with prefixes, not file extensions. Use `s3://bucket/path/*` not `s3://bucket/path/*.ext`

**Examples:**
```bash
# Process all files under an S3 prefix
s3vectors-embed put --vector-bucket-name bucket --index-name idx \
  --model-id amazon.titan-embed-text-v2:0 --text "s3://bucket/path1/*"

```

### **Important Differences: Local vs S3 Wildcards**

**Local Filesystem Wildcards:**
- ✅ Support file extensions: `./data/*.txt`, `./docs/*.json`
- ✅ Support complex patterns: `./files/*.{txt,md}`, `./doc?.txt`
- ✅ Support recursive patterns: `./docs/**/*.md`

**S3 Wildcards:**
- ✅ Support prefix patterns: `s3://bucket/docs/*`, `s3://bucket/2024/reports/*`
- ❌ **Do NOT support extension filtering**: `s3://bucket/path/*.json` won't work
- ❌ **Do NOT support complex patterns**: Use prefix-based organization instead 

**Best Practices:**
- **For S3**: Organize files by prefix/path structure: `s3://bucket/json-files/*`
- **For Local**: Use full wildcard capabilities: `./data/*.{json,txt}`

### **Pattern Processing Features**

- **Batch Processing**: Large file sets automatically batched 
- **Parallel Processing**: Configurable workers for concurrent processing
- **Error Handling**: Individual file failures don't stop batch processing and do not fail the whole batch.
- **Progress Tracking**: Clear reporting of processed vs failed files
- **File Type Filtering**: CLI automatically filters supported file types after pattern expansion

## Batch Processing

The CLI supports efficient batch processing for multiple files using both local and S3 wildcard characters in the input path

### **Batch Processing Features**

- **Automatic batching**: Large datasets are automatically split into batches of 500 vectors
- **Parallel processing**: Configurable worker threads for concurrent file processing
- **Error resilience**: Individual file failures don't stop batch processing
- **Performance optimization**: Efficient memory usage and API call batching

### Batch Processing Examples

**Local files batch processing (NEW):**
```bash
# Process all local text files
s3vectors-embed put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 \
  --text "./documents/*.txt" \
  --metadata '{"source": "local_batch", "category": "documents"}' \
  --max-workers 4

# Process files from multiple directories
s3vectors-embed put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 \
  --text "~/data/**/*.md" \
  --max-workers 2
```

**S3 files batch processing:**
```bash
# Text files batch processing
s3vectors-embed put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 \
  --text "s3://bucket/text/*" \
  --metadata '{"category": "documents", "batch": "2024-01"}' \
  --max-workers 4

# Image files batch processing
s3vectors-embed put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id amazon.titan-embed-image-v1 \
  --image "s3://bucket/images/*" \
  --metadata '{"category": "images", "source": "batch_upload"}' \
  --max-workers 2
```

### **Batch Processing Output**

```bash
# Example output for local wildcard processing
Processing chunk 1...
Found 94 supported files in chunk 1
Batch stored successfully. Total processed: 94

Batch processing completed!
   Total files found: 94
   Successfully processed: 94
   Failed: 0
```

### Troubleshooting

#### Use Debug Mode for Troubleshooting

For troubleshooting, first enable debug mode to get detailed information in the output:

```bash
# Add --debug to any command for detailed logging
s3vectors-embed --debug put --vector-bucket-name my-bucket --index-name my-index \
  --model-id amazon.titan-embed-text-v2:0 --text-value "test"
```

Debug mode provides:
- **API request/response details**: See exact payloads sent to Bedrock and S3 Vectors
- **Performance timing**: Identify slow operations
- **Configuration validation**: Verify AWS settings and service initialization
- **Error context**: Detailed error messages with full context

#### Troubleshooting Issues

1. **AWS Credentials Not Found**
```bash
# Error: Unable to locate credentials
# Solution: Configure AWS credentials
aws configure
# Or set environment variables:
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret

# Debug with credentials issue:
s3vectors-embed --debug put ... 
# Will show: "BedrockService initialization failed" with details
```

2. **Vector index Not Found**
```bash
# Error: ResourceNotFoundException: Vector index not found
# Solution: Ensure the vector index exists and you have correct permissions
aws s3 ls s3vectors://your-bucket

# Debug output will show:
# S3 Vectors ClientError: ResourceNotFoundException...
```

3. **Model Access Issues**
```bash
# Error: AccessDeniedException: Unable to access Bedrock model
# Solution: Verify Bedrock model access and permissions
aws bedrock list-foundation-models

# Debug output will show:
# Bedrock ClientError: AccessDeniedException...
# Request body: {...} (shows what was attempted)
```

4. **Performance Issues**
```bash
# Use debug mode to identify bottlenecks:
s3vectors-embed --debug put ...

# Debug output shows timing:
#  Bedrock API call completed in 2.45 seconds (slow)
#  S3 Vectors put_vectors completed in 0.15 seconds (normal)
```

5. **Service Unavailable Errors**
```bash
# Error: ServiceUnavailableException
# Debug output provides context:
# S3 Vectors ClientError: ServiceUnavailableException when calling PutVectors
# API parameters: {"vectorBucketName": "...", "indexName": "..."}
```

## Repository Structure
```
s3vectors-embed-cli/
├── s3vectors/                    # Main package directory
│   ├── cli.py                    # Main CLI entry point
│   ├── commands/                 # Command implementations
│   │   ├── embed_put.py         # Vector embedding and storage
│   │   └── embed_query.py       # Vector similarity search
│   ├── core/                    # Core functionality
│   │   ├── batch_processor.py   # Batch processing implementation
│   │   └── services.py         # Bedrock and S3Vector services
│   └── utils/                   # Utility functions
│       └── config.py           # AWS configuration management
├── setup.py                    # Package installation configuration
├── pyproject.toml              # Modern Python packaging configuration
├── requirements.txt            # Python dependencies
├── LICENSE                     # Apache 2.0 license
```

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/awslabs/s3vectors-embed-cli",
    "name": "s3vectors-embed-cli",
    "maintainer": "Vaibhav Sabharwal",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "Vaibhav Sabharwal <vsabhar@amazon.com>",
    "keywords": "aws, s3, vectors, embeddings, bedrock, cli, machine-learning, ai",
    "author": "Vaibhav Sabharwal",
    "author_email": "Vaibhav Sabharwal <vsabhar@amazon.com>",
    "download_url": "https://files.pythonhosted.org/packages/98/54/df0d1664f18b98e3a37e1c115870abc2a95180248c92dddfdd60f498d3d5/s3vectors_embed_cli-0.1.1.tar.gz",
    "platform": null,
    "description": "# Amazon S3 Vectors Embed CLI\n\nAmazon S3 Vectors Embed CLI is a standalone command-line tool that simplifies the process of working with vector embeddings in S3 Vectors. You can create vector embeddings for your data using Amazon Bedrock and store and query them in your S3 vector index using single commands. \n\n**Amazon S3 Vectors Embed CLI is in preview release and is subject to change.**\n\n## Supported Commands\n\n**s3vectors-embed put**: Embed text, file content, or S3 objects and store them as vectors in an S3 vector index.\nYou can create and ingest vector embeddings into an S3 vector index using a single put command. You specify the data input you want to create an embedding for, an Amazon Bedrock embeddings model ID, your S3 vector bucket name, and S3 vector index name. The command supports several input formats including text data, a local text or image file, an S3 image or text object or prefix. The command generates embeddings using the dimensions configured in your S3 vector index properties. If you are ingesting embeddings for several objects in an S3 prefix or local file path, it automatically uses batch processes to maximize throughput. \n\n**Note**: Each file is processed as a single embedding. Document chunking is not currently supported. \n\n**s3vectors-embed query**: Embed a query input and search for similar vectors in an S3 vector index.\nYou can perform similarity queries for vector embeddings in your S3 vector index using a single query command. You specify your query input, an Amazon Bedrock embeddings model ID, the vector bucket name, and vector index name. The command accepts several types of query inputs like a text string, an image file, or a single S3 text or image object. The command generates embeddings for your query using the input embeddings model and then performs a similarity search to find the most relevant matches. You can control the number of results returned, apply metadata filters to narrow your search, and choose whether to include similarity distance in the results for comprehensive analysis.\n\n\n## Installation and Configuration\n### Prerequisites\n- Python 3.8 or higher\n- To execute the CLI, you will need AWS credentials configured. \n- Update your AWS account with appropriate permissions to use Amazon Bedrock and S3 Vectors\n- Access to an Amazon Bedrock embedding model\n- Create an Amazon S3 vector bucket and vector index to store your embeddings\n\n### Quick Install (Recommended)\n```bash\npip install s3vectors-embed-cli\n```\n\n### Development Install\n```bash\n# Clone the repository\ngit clone https://github.com/awslabs/s3vectors-embed-cli\ncd s3vectors-embed-cli\n\n# Install in development mode\npip install -e .\n```\n\n**Note**: All dependencies are automatically installed when you install the package via pip.\n\n\n### Quick Start\n\n#### **Put Examples**\n\n1. **Embed text and store them as vectors in your S3 vector index:**\n```bash\ns3vectors-embed put \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 \\\n  --text-value \"Hello, world!\"\n```\n\n2. **Process local text files:**\n```bash\ns3vectors-embed put \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 \\\n  --text \"./documents/sample.txt\"\n```\n\n3. **Process image files using a local file path:**\n```bash\ns3vectors-embed put \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-image-v1 \\\n  --image \"./images/photo.jpg\"\n```\n\n4. **Process files from a local file path using wildcard characters:**\n```bash\ns3vectors-embed put \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 \\\n  --text \"./documents/*.txt\"\n```\n\n5. **Process files from an S3 general purpose bucket using wildcard characters:**\n```bash\ns3vectors-embed put \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 \\\n  --text \"s3://bucket/path/*\"\n```\n\n6. **Add metadata alongside your vectors:**\n```bash\ns3vectors-embed put \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 \\\n  --text \"s3://my-bucket/sample.txt\"\n  --metadata '{\"category\": \"technology\", \"version\": \"1.0\"}'\n```\n\n#### **Query Examples**\n\n1. **Query with no filters:**\n```bash\ns3vectors-embed query \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 \\\n  --query-input \"query text\" \\\n  --k 10\n```\n\n2. **Query using a local text file as input:**\n```bash\ns3vectors-embed query \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 \\\n  --query-input \"./query.txt\" \\\n  --k 5 \\\n  --output table\n```\n\n3. **Query using an S3 text file as input:**\n```bash\ns3vectors-embed query \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 \\\n  --query-input \"s3://my-bucket/image.jpeg\" \\\n  --k 3 \n```\n\n4. **Query with metadata filters:**\n```bash\ns3vectors-embed query \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 \\\n  --query-input \"query text\" \\\n  --filter '{\"category\": {\"$eq\": \"technology\"}}' \\\n  --k 10 \\\n  --return-metadata\n```\n\n5. **Query with multiple metadata filters (AND):**\n```bash\ns3vectors-embed query \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 \\\n  --query-input \"query text\" \\\n  --filter '{\"$and\": [{\"category\": \"technology\"}, {\"version\": \"1.0\"}]}' \\\n  --k 10 \\\n  --return-metadata\n```\n\n6. **Query with multiple metadata filters (OR):**\n```bash\ns3vectors-embed query \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 \\\n  --query-input \"query text\" \\\n  --filter '{\"$or\": [{\"category\": \"docs\"}, {\"category\": \"guides\"}]}' \\\n  --k 5\n```\n\n7. **Query with metadata filters (comparison operators):**\n```bash\ns3vectors-embed query \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 \\\n  --query-input \"query text\" \\\n  --filter '{\"$and\": [{\"category\": \"tech\"}, {\"version\": {\"$gte\": \"1.0\"}}]}' \\\n  --k 10\n```\n\n\n### Command Parameters\n\n#### Global Options\n- `--debug`: Enable debug mode with detailed logging for troubleshooting\n- `--profile`: AWS profile name to use from ~/.aws/credentials\n- `--region`: AWS region name (overrides session/config defaults)\n\n#### Put Command Parameters\nRequired:\n- `--vector-bucket-name`: Name of the S3 vector bucket \n- `--index-name`: Name of the vector index in your vector index to store the vector embeddings\n- `--model-id`: Bedrock model ID to use for generating embeddings (e.g., amazon.titan-embed-text-v2:0)\n\nInput Options (one required):\n- `--text-value`: Direct text input to embed\n- `--text`: Text input - supports multiple input types:\n  - **Local file**: `./document.txt`\n  - **Local files with wildcard characters**: `./data/*.txt`, `~/docs/*.md`\n  - **S3 object**: `s3://bucket/path/file.txt`\n  - **S3 path with wildcard characters**: `s3://bucket/path/*` (prefix-based, not extension-based)\n- `--image`: Image input - supports multiple input types:\n  - **Local file**: `./document.jpg`\n  - **Local wildcard**: `./data/*.jpg`\n  - **S3 object**: `s3://bucket/path/file.jpg`\n  - **S3 path with wildcard characters**: `s3://bucket/path/*` (prefix-based, not extension-based)\n\nOptional:\n- `--key`: Uniquely identifies each vector in the vector index (default: auto-generated UUID)\n- `--metadata`: Additional metadata associated with the vector; provided as JSON string\n- `--bucket-owner`: AWS account ID for cross-account S3 access\n- `--output`: Output format (json or table, default: json)\n\n#### Query Command Parameters\nRequired:\n- `--vector-bucket-name`: Name of the S3 vector bucket\n- `--index-name`: Name of the vector index \n- `--model-id`: Bedrock model ID to use for generating embeddings (e.g., amazon.titan-embed-text-v2:0)\n- `--query-input`: Query text or file path (local file or S3 URI)\n\nOptional:\n- `--k`: Number of results to return (default: 5)\n- `--filter`: Filter expression for metadata-based filtering (JSON format with AWS S3 Vectors API operators)\n- `--return-metadata`: Include metadata in results (default: true)\n- `--return-distance`: Include similarity distance\n- `--output`: Output format (table or json, default: json)\n- `--region`: AWS region name\n\nExample with all optional parameters:\n```bash\ns3vectors-embed query --vector-bucket-name my-bucket --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 --query-input \"search query\" \\\n  --k 10 --filter '{\"$and\": [{\"category\": \"tech\"}, {\"version\": {\"$gte\": \"1.0\"}}]}' --return-metadata \\\n  --return-distance --output table --region us-west-2\n```\n\n### Model Compatibility\n| Model | Type | Dimensions | Use Case |\n|-------|------|------------|----------|\n| `amazon.titan-embed-text-v2:0` | Text | 1024, 512, 256 | Modern text embedding |\n| `amazon.titan-embed-text-v1` | Text | 1536 | Legacy text embedding |\n| `amazon.titan-embed-image-v1` | Multimodal (Text + Image) | 1024, 384, 256 | Text and image embedding |\n| `cohere.embed-english-v3` | Multimodal (Text or Image) | 1024 | Advanced English text or image embedding |\n| `cohere.embed-multilingual-v3` | Multimodal (Text or Image) | 1024 | Multilingual text or image embedding |\n\n## Metadata Filtering\n\n### **Supported Operators**\n\n#### **Comparison Operators**\n- `$eq`: Equal to\n- `$ne`: Not equal to  \n- `$gt`: Greater than\n- `$gte`: Greater than or equal to\n- `$lt`: Less than\n- `$lte`: Less than or equal to\n- `$in`: Value in array\n- `$nin`: Value not in array\n\n#### **Logical Operators**\n- `$and`: Logical AND (all conditions must be true)\n- `$or`: Logical OR (at least one condition must be true)\n- `$not`: Logical NOT (condition must be false)\n\n### **Filter Examples**\n\n#### **Single Condition Filters**\n```bash\n# Exact match\n--filter '{\"category\": {\"$eq\": \"documentation\"}}'\n\n# Not equal\n--filter '{\"status\": {\"$ne\": \"archived\"}}'\n\n# Greater than or equal\n--filter '{\"version\": {\"$gte\": \"2.0\"}}'\n\n# Value in list\n--filter '{\"category\": {\"$in\": [\"docs\", \"guides\", \"tutorials\"]}}'\n```\n\n#### **Multiple Condition Filters**\n```bash\n# AND condition (all must be true)\n--filter '{\"$and\": [{\"category\": \"tech\"}, {\"version\": \"1.0\"}]}'\n\n# OR condition (at least one must be true)  \n--filter '{\"$or\": [{\"category\": \"docs\"}, {\"category\": \"guides\"}]}'\n\n# Complex nested conditions\n--filter '{\"$and\": [{\"category\": \"tech\"}, {\"$or\": [{\"version\": \"1.0\"}, {\"version\": \"2.0\"}]}]}'\n\n# NOT condition\n--filter '{\"$not\": {\"category\": {\"$eq\": \"archived\"}}}'\n```\n\n#### **Advanced Filter Examples**\n```bash\n# Multiple AND conditions with comparison operators\n--filter '{\"$and\": [{\"category\": \"documentation\"}, {\"version\": {\"$gte\": \"1.0\"}}, {\"status\": {\"$ne\": \"draft\"}}]}'\n\n# OR with nested AND conditions\n--filter '{\"$or\": [{\"$and\": [{\"category\": \"docs\"}, {\"version\": \"1.0\"}]}, {\"$and\": [{\"category\": \"guides\"}, {\"version\": \"2.0\"}]}]}'\n\n# Using $in with multiple values\n--filter '{\"$and\": [{\"category\": {\"$in\": [\"docs\", \"guides\"]}}, {\"language\": {\"$eq\": \"en\"}}]}'\n```\n\n### **Important Notes**\n\n1. **JSON Format**: Filters must be valid JSON strings\n2. **Quotes**: Use single quotes around the entire filter and double quotes inside JSON\n3. **Case Sensitivity**: String comparisons are case-sensitive\n3. **Data Types**: Ensure filter values match the data types in your metadata\n\n## Metadata\n\nThe Amazon S3 Vectors Embed CLI automatically adds standard metadata fields to help track and manage your vector embeddings. Understanding these fields is important for filtering and troubleshooting your vector data.\n\n### Standard Metadata Fields\n\nThe CLI automatically adds the following metadata fields to every vector:\n\n#### `S3VECTORS-EMBED-SRC-CONTENT`\n- **Purpose**: Stores the original text content. Configure this field as *nonFilterableMetadataKeys* while creating S3 vector index to store large text.\n- **Behavior**:\n  - **Direct text input** (`--text-value`): Contains the actual text content\n  - **Text files**: Contains the full text content of the file\n  - **Image files**: N/A (images don't have textual content to store) \n\n**Examples**:\n```bash\n# Direct text - stores the actual text\n--text-value \"Hello world\" \n# Metadata: {\"S3VECTORS-EMBED-SRC-CONTENT\": \"Hello world\"}\n\n# Text file - stores file content\n--text document.txt\n# Metadata: {\"S3VECTORS-EMBED-SRC-CONTENT\": \"Contents of document.txt...\"}\n\n# Image file - no SOURCE_CONTENT field added\n--image photo.jpg\n# Metadata: {}\n```\n\n#### `S3VECTORS-EMBED-SRC-LOCATION`\n- **Purpose**: Tracks the original file location\n- **Behavior**:\n  - **Text files**: Contains the file path or S3 URI\n  - **Image files**: Contains the file path or S3 URI\n  - **Direct text**: Not added (no file involved)\n\n**Examples**:\n```bash\n# Local text file\n--text /path/to/document.txt\n# Metadata: {\n#   \"S3VECTORS-EMBED-SRC-CONTENT\": \"File contents...\",\n#   \"S3VECTORS-EMBED-SRC-LOCATION\": \"file:///path/to/document.txt\"\n# }\n\n# S3 text file\n--text s3://my-bucket/docs/file.txt\n# Metadata: {\n#   \"S3VECTORS-EMBED-SRC-CONTENT\": \"File contents...\",\n#   \"S3VECTORS-EMBED-SRC-LOCATION\": \"s3://my-bucket/docs/file.txt\"\n# }\n\n# Image file (local or S3)\n--image /path/to/photo.jpg\n# Metadata: {\n#   \"S3VECTORS-EMBED-SRC-LOCATION\": \"file:///path/to/photo.jpg\"\n# }\n\n--image s3://my-bucket/images/photo.jpg\n# Metadata: {\n#   \"S3VECTORS-EMBED-SRC-LOCATION\": \"s3://my-bucket/images/photo.jpg\"\n# }\n```\n\n### Additional Metadata\n\nYou can add your own metadata using the `--metadata` parameter with JSON format:\n\n```bash\ns3vectors-embed put \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 \\\n  --text-value \"Sample text\" \\\n  --metadata '{\"category\": \"documentation\", \"version\": \"1.0\", \"author\": \"team-a\"}'\n```\n\n**Result**: Your metadata is merged with the two standard metadata fields:\n```json\n{\n  \"S3VECTORS-EMBED-SRC-CONTENT\": \"Sample text\",\n  \"category\": \"documentation\",\n  \"version\": \"1.0\", \n  \"author\": \"team-a\"\n}\n```\n\n## Output Formats\n\nThe CLI provides a simple output by default with an optional debug mode for more detailed information like progress information.\n\n### Simple Output (Default)\n\nThe CLI provides a simple output without progress indicators:\n\n```bash\n# PUT output\ns3vectors-embed put --vector-bucket-name my-bucket --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 --text-value \"Hello\"\n```\n**Output:**\n```\n{\n  \"key\": \"abc-123-def-456\",\n  \"bucket\": \"my-bucket\",\n  \"index\": \"my-index\",\n  \"model\": \"amazon.titan-embed-text-v2:0\",\n  \"contentType\": \"text\",\n  \"embeddingDimensions\": 1024,\n  \"metadata\": {\n    \"S3VECTORS-EMBED-SRC-CONTENT\": \"Hello\"\n  }\n}\n```\n\n### Debug option\n\nUse `--debug` for comprehensive operational details:\n\n```bash\n# Debug mode provides detailed logging\ns3vectors-embed --debug put --vector-bucket-name my-bucket --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 --text-value \"Hello\"\n```\n\nThe CLI supports two output formats for query results:\n\n### JSON Format (Default)\n- **Machine-readable**: Perfect for programmatic processing\n- **Complete data**: Shows full metadata content without truncation\n- **Structured**: Easy to parse and integrate with other tools\n\n```bash\n# Uses JSON by default\ns3vectors-embed query --vector-bucket-name my-bucket --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 --query-input \"search text\"\n\n# Explicit JSON format (same as default)\ns3vectors-embed query --vector-bucket-name my-bucket --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 --query-input \"search text\" --output json\n```\n\n**JSON Output Example:**\n```json\n{\n  \"results\": [\n    {\n      \"Key\": \"abc123-def456-ghi789\",\n      \"distance\": 0.2345,\n      \"metadata\": {\n        \"S3VECTORS-EMBED-SRC-CONTENT\": \"Complete text content without any truncation...\",\n        \"S3VECTORS-EMBED-SRC-LOCATION\": \"s3://bucket/path/file.txt\",\n        \"category\": \"documentation\",\n        \"author\": \"team-a\"\n      }\n    }\n  ],\n  \"summary\": {\n    \"queryType\": \"text\",\n    \"model\": \"amazon.titan-embed-text-v2:0\",\n    \"index\": \"my-index\",\n    \"resultsFound\": 1,\n    \"queryDimensions\": 1024\n  }\n}\n```\n\n### Table Format\n- **Human-readable**: Easy to read and analyze visually\n- **Complete data**: Shows full metadata content without truncation\n- **Formatted**: Clean tabular display with proper alignment\n\n```bash\n# Explicit table format\ns3vectors-embed query --vector-bucket-name my-bucket --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 --query-input \"search text\" --output table\n```\n\n## Wildcard Character Support\n\nThe CLI supports powerful wildcard characters in the input path for processing multiple files efficiently:\n\n### **Local Filesystem Patterns (NEW)**\n\n- **Basic wildcards**: `./data/*.txt` - all .txt files in data directory\n- **Home directory**: `~/documents/*.md` - all .md files in user's documents\n- **Recursive patterns**: `./docs/**/*.txt` - all .txt files recursively\n- **Multiple extensions**: `./files/*.{txt,md,json}` - multiple file types\n- **Question mark**: `./file?.txt` - single character wildcard\n\n**Examples:**\n```bash\n# Process all text files in current directory\ns3vectors-embed put --vector-bucket-name bucket --index-name idx \\\n  --model-id amazon.titan-embed-text-v2:0 --text \"./*.txt\"\n\n# Process all markdown files in home directory\ns3vectors-embed put --vector-bucket-name bucket --index-name idx \\\n  --model-id amazon.titan-embed-text-v2:0 --text \"~/notes/*.md\"\n\n# Process files with pattern matching\ns3vectors-embed put --vector-bucket-name bucket --index-name idx \\\n  --model-id amazon.titan-embed-text-v2:0 --text \"./doc?.txt\"\n```\n\n\n**Important**: S3 wildcards work with prefixes, not file extensions. Use `s3://bucket/path/*` not `s3://bucket/path/*.ext`\n\n**Examples:**\n```bash\n# Process all files under an S3 prefix\ns3vectors-embed put --vector-bucket-name bucket --index-name idx \\\n  --model-id amazon.titan-embed-text-v2:0 --text \"s3://bucket/path1/*\"\n\n```\n\n### **Important Differences: Local vs S3 Wildcards**\n\n**Local Filesystem Wildcards:**\n- \u2705 Support file extensions: `./data/*.txt`, `./docs/*.json`\n- \u2705 Support complex patterns: `./files/*.{txt,md}`, `./doc?.txt`\n- \u2705 Support recursive patterns: `./docs/**/*.md`\n\n**S3 Wildcards:**\n- \u2705 Support prefix patterns: `s3://bucket/docs/*`, `s3://bucket/2024/reports/*`\n- \u274c **Do NOT support extension filtering**: `s3://bucket/path/*.json` won't work\n- \u274c **Do NOT support complex patterns**: Use prefix-based organization instead \n\n**Best Practices:**\n- **For S3**: Organize files by prefix/path structure: `s3://bucket/json-files/*`\n- **For Local**: Use full wildcard capabilities: `./data/*.{json,txt}`\n\n### **Pattern Processing Features**\n\n- **Batch Processing**: Large file sets automatically batched \n- **Parallel Processing**: Configurable workers for concurrent processing\n- **Error Handling**: Individual file failures don't stop batch processing and do not fail the whole batch.\n- **Progress Tracking**: Clear reporting of processed vs failed files\n- **File Type Filtering**: CLI automatically filters supported file types after pattern expansion\n\n## Batch Processing\n\nThe CLI supports efficient batch processing for multiple files using both local and S3 wildcard characters in the input path\n\n### **Batch Processing Features**\n\n- **Automatic batching**: Large datasets are automatically split into batches of 500 vectors\n- **Parallel processing**: Configurable worker threads for concurrent file processing\n- **Error resilience**: Individual file failures don't stop batch processing\n- **Performance optimization**: Efficient memory usage and API call batching\n\n### Batch Processing Examples\n\n**Local files batch processing (NEW):**\n```bash\n# Process all local text files\ns3vectors-embed put \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 \\\n  --text \"./documents/*.txt\" \\\n  --metadata '{\"source\": \"local_batch\", \"category\": \"documents\"}' \\\n  --max-workers 4\n\n# Process files from multiple directories\ns3vectors-embed put \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 \\\n  --text \"~/data/**/*.md\" \\\n  --max-workers 2\n```\n\n**S3 files batch processing:**\n```bash\n# Text files batch processing\ns3vectors-embed put \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 \\\n  --text \"s3://bucket/text/*\" \\\n  --metadata '{\"category\": \"documents\", \"batch\": \"2024-01\"}' \\\n  --max-workers 4\n\n# Image files batch processing\ns3vectors-embed put \\\n  --vector-bucket-name my-bucket \\\n  --index-name my-index \\\n  --model-id amazon.titan-embed-image-v1 \\\n  --image \"s3://bucket/images/*\" \\\n  --metadata '{\"category\": \"images\", \"source\": \"batch_upload\"}' \\\n  --max-workers 2\n```\n\n### **Batch Processing Output**\n\n```bash\n# Example output for local wildcard processing\nProcessing chunk 1...\nFound 94 supported files in chunk 1\nBatch stored successfully. Total processed: 94\n\nBatch processing completed!\n   Total files found: 94\n   Successfully processed: 94\n   Failed: 0\n```\n\n### Troubleshooting\n\n#### Use Debug Mode for Troubleshooting\n\nFor troubleshooting, first enable debug mode to get detailed information in the output:\n\n```bash\n# Add --debug to any command for detailed logging\ns3vectors-embed --debug put --vector-bucket-name my-bucket --index-name my-index \\\n  --model-id amazon.titan-embed-text-v2:0 --text-value \"test\"\n```\n\nDebug mode provides:\n- **API request/response details**: See exact payloads sent to Bedrock and S3 Vectors\n- **Performance timing**: Identify slow operations\n- **Configuration validation**: Verify AWS settings and service initialization\n- **Error context**: Detailed error messages with full context\n\n#### Troubleshooting Issues\n\n1. **AWS Credentials Not Found**\n```bash\n# Error: Unable to locate credentials\n# Solution: Configure AWS credentials\naws configure\n# Or set environment variables:\nexport AWS_ACCESS_KEY_ID=your-key\nexport AWS_SECRET_ACCESS_KEY=your-secret\n\n# Debug with credentials issue:\ns3vectors-embed --debug put ... \n# Will show: \"BedrockService initialization failed\" with details\n```\n\n2. **Vector index Not Found**\n```bash\n# Error: ResourceNotFoundException: Vector index not found\n# Solution: Ensure the vector index exists and you have correct permissions\naws s3 ls s3vectors://your-bucket\n\n# Debug output will show:\n# S3 Vectors ClientError: ResourceNotFoundException...\n```\n\n3. **Model Access Issues**\n```bash\n# Error: AccessDeniedException: Unable to access Bedrock model\n# Solution: Verify Bedrock model access and permissions\naws bedrock list-foundation-models\n\n# Debug output will show:\n# Bedrock ClientError: AccessDeniedException...\n# Request body: {...} (shows what was attempted)\n```\n\n4. **Performance Issues**\n```bash\n# Use debug mode to identify bottlenecks:\ns3vectors-embed --debug put ...\n\n# Debug output shows timing:\n#  Bedrock API call completed in 2.45 seconds (slow)\n#  S3 Vectors put_vectors completed in 0.15 seconds (normal)\n```\n\n5. **Service Unavailable Errors**\n```bash\n# Error: ServiceUnavailableException\n# Debug output provides context:\n# S3 Vectors ClientError: ServiceUnavailableException when calling PutVectors\n# API parameters: {\"vectorBucketName\": \"...\", \"indexName\": \"...\"}\n```\n\n## Repository Structure\n```\ns3vectors-embed-cli/\n\u251c\u2500\u2500 s3vectors/                    # Main package directory\n\u2502   \u251c\u2500\u2500 cli.py                    # Main CLI entry point\n\u2502   \u251c\u2500\u2500 commands/                 # Command implementations\n\u2502   \u2502   \u251c\u2500\u2500 embed_put.py         # Vector embedding and storage\n\u2502   \u2502   \u2514\u2500\u2500 embed_query.py       # Vector similarity search\n\u2502   \u251c\u2500\u2500 core/                    # Core functionality\n\u2502   \u2502   \u251c\u2500\u2500 batch_processor.py   # Batch processing implementation\n\u2502   \u2502   \u2514\u2500\u2500 services.py         # Bedrock and S3Vector services\n\u2502   \u2514\u2500\u2500 utils/                   # Utility functions\n\u2502       \u2514\u2500\u2500 config.py           # AWS configuration management\n\u251c\u2500\u2500 setup.py                    # Package installation configuration\n\u251c\u2500\u2500 pyproject.toml              # Modern Python packaging configuration\n\u251c\u2500\u2500 requirements.txt            # Python dependencies\n\u251c\u2500\u2500 LICENSE                     # Apache 2.0 license\n```\n",
    "bugtrack_url": null,
    "license": "Apache License\n                                   Version 2.0, January 2004\n                                http://www.apache.org/licenses/\n        \n           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n        \n           1. Definitions.\n        \n              \"License\" shall mean the terms and conditions for use, reproduction,\n              and distribution as defined by Sections 1 through 9 of this document.\n        \n              \"Licensor\" shall mean the copyright owner or entity authorized by\n              the copyright owner that is granting the License.\n        \n              \"Legal Entity\" shall mean the union of the acting entity and all\n              other entities that control, are controlled by, or are under common\n              control with that entity. For the purposes of this definition,\n              \"control\" means (i) the power, direct or indirect, to cause the\n              direction or management of such entity, whether by contract or\n              otherwise, or (ii) ownership of fifty percent (50%) or more of the\n              outstanding shares, or (iii) beneficial ownership of such entity.\n        \n              \"You\" (or \"Your\") shall mean an individual or Legal Entity\n              exercising permissions granted by this License.\n        \n              \"Source\" form shall mean the preferred form for making modifications,\n              including but not limited to software source code, documentation\n              source, and configuration files.\n        \n              \"Object\" form shall mean any form resulting from mechanical\n              transformation or translation of a Source form, including but\n              not limited to compiled object code, generated documentation,\n              and conversions to other media types.\n        \n              \"Work\" shall mean the work of authorship, whether in Source or\n              Object form, made available under the License, as indicated by a\n              copyright notice that is included in or attached to the work\n              (an example is provided in the Appendix below).\n        \n              \"Derivative Works\" shall mean any work, whether in Source or Object\n              form, that is based on (or derived from) the Work and for which the\n              editorial revisions, annotations, elaborations, or other modifications\n              represent, as a whole, an original work of authorship. For the purposes\n              of this License, Derivative Works shall not include works that remain\n              separable from, or merely link (or bind by name) to the interfaces of,\n              the Work and Derivative Works thereof.\n        \n              \"Contribution\" shall mean any work of authorship, including\n              the original version of the Work and any modifications or additions\n              to that Work or Derivative Works thereof, that is intentionally\n              submitted to Licensor for inclusion in the Work by the copyright owner\n              or by an individual or Legal Entity authorized to submit on behalf of\n              the copyright owner. For the purposes of this definition, \"submitted\"\n              means any form of electronic, verbal, or written communication sent\n              to the Licensor or its representatives, including but not limited to\n              communication on electronic mailing lists, source code control systems,\n              and issue tracking systems that are managed by, or on behalf of, the\n              Licensor for the purpose of discussing and improving the Work, but\n              excluding communication that is conspicuously marked or otherwise\n              designated in writing by the copyright owner as \"Not a Contribution.\"\n        \n              \"Contributor\" shall mean Licensor and any individual or Legal Entity\n              on behalf of whom a Contribution has been received by Licensor and\n              subsequently incorporated within the Work.\n        \n           2. Grant of Copyright License. Subject to the terms and conditions of\n              this License, each Contributor hereby grants to You a perpetual,\n              worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n              copyright license to reproduce, prepare Derivative Works of,\n              publicly display, publicly perform, sublicense, and distribute the\n              Work and such Derivative Works in Source or Object form.\n        \n           3. Grant of Patent License. Subject to the terms and conditions of\n              this License, each Contributor hereby grants to You a perpetual,\n              worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n              (except as stated in this section) patent license to make, have made,\n              use, offer to sell, sell, import, and otherwise transfer the Work,\n              where such license applies only to those patent claims licensable\n              by such Contributor that are necessarily infringed by their\n              Contribution(s) alone or by combination of their Contribution(s)\n              with the Work to which such Contribution(s) was submitted. If You\n              institute patent litigation against any entity (including a\n              cross-claim or counterclaim in a lawsuit) alleging that the Work\n              or a Contribution incorporated within the Work constitutes direct\n              or contributory patent infringement, then any patent licenses\n              granted to You under this License for that Work shall terminate\n              as of the date such litigation is filed.\n        \n           4. Redistribution. You may reproduce and distribute copies of the\n              Work or Derivative Works thereof in any medium, with or without\n              modifications, and in Source or Object form, provided that You\n              meet the following conditions:\n        \n              (a) You must give any other recipients of the Work or\n                  Derivative Works a copy of this License; and\n        \n              (b) You must cause any modified files to carry prominent notices\n                  stating that You changed the files; and\n        \n              (c) You must retain, in the Source form of any Derivative Works\n                  that You distribute, all copyright, patent, trademark, and\n                  attribution notices from the Source form of the Work,\n                  excluding those notices that do not pertain to any part of\n                  the Derivative Works; and\n        \n              (d) If the Work includes a \"NOTICE\" text file as part of its\n                  distribution, then any Derivative Works that You distribute must\n                  include a readable copy of the attribution notices contained\n                  within such NOTICE file, excluding those notices that do not\n                  pertain to any part of the Derivative Works, in at least one\n                  of the following places: within a NOTICE text file distributed\n                  as part of the Derivative Works; within the Source form or\n                  documentation, if provided along with the Derivative Works; or,\n                  within a display generated by the Derivative Works, if and\n                  wherever such third-party notices normally appear. The contents\n                  of the NOTICE file are for informational purposes only and\n                  do not modify the License. You may add Your own attribution\n                  notices within Derivative Works that You distribute, alongside\n                  or as an addendum to the NOTICE text from the Work, provided\n                  that such additional attribution notices cannot be construed\n                  as modifying the License.\n        \n              You may add Your own copyright statement to Your modifications and\n              may provide additional or different license terms and conditions\n              for use, reproduction, or distribution of Your modifications, or\n              for any such Derivative Works as a whole, provided Your use,\n              reproduction, and distribution of the Work otherwise complies with\n              the conditions stated in this License.\n        \n           5. Submission of Contributions. Unless You explicitly state otherwise,\n              any Contribution intentionally submitted for inclusion in the Work\n              by You to the Licensor shall be under the terms and conditions of\n              this License, without any additional terms or conditions.\n              Notwithstanding the above, nothing herein shall supersede or modify\n              the terms of any separate license agreement you may have executed\n              with Licensor regarding such Contributions.\n        \n           6. Trademarks. This License does not grant permission to use the trade\n              names, trademarks, service marks, or product names of the Licensor,\n              except as required for reasonable and customary use in describing the\n              origin of the Work and reproducing the content of the NOTICE file.\n        \n           7. Disclaimer of Warranty. Unless required by applicable law or\n              agreed to in writing, Licensor provides the Work (and each\n              Contributor provides its Contributions) on an \"AS IS\" BASIS,\n              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n              implied, including, without limitation, any warranties or conditions\n              of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n              PARTICULAR PURPOSE. You are solely responsible for determining the\n              appropriateness of using or redistributing the Work and assume any\n              risks associated with Your exercise of permissions under this License.\n        \n           8. Limitation of Liability. In no event and under no legal theory,\n              whether in tort (including negligence), contract, or otherwise,\n              unless required by applicable law (such as deliberate and grossly\n              negligent acts) or agreed to in writing, shall any Contributor be\n              liable to You for damages, including any direct, indirect, special,\n              incidental, or consequential damages of any character arising as a\n              result of this License or out of the use or inability to use the\n              Work (including but not limited to damages for loss of goodwill,\n              work stoppage, computer failure or malfunction, or any and all\n              other commercial damages or losses), even if such Contributor\n              has been advised of the possibility of such damages.\n        \n           9. Accepting Warranty or Additional Liability. While redistributing\n              the Work or Derivative Works thereof, You may choose to offer,\n              and charge a fee for, acceptance of support, warranty, indemnity,\n              or other liability obligations and/or rights consistent with this\n              License. However, in accepting such obligations, You may act only\n              on Your own behalf and on Your sole responsibility, not on behalf\n              of any other Contributor, and only if You agree to indemnify,\n              defend, and hold each Contributor harmless for any liability\n              incurred by, or claims asserted against, such Contributor by reason\n              of your accepting any such warranty or additional liability.\n        ",
    "summary": "Standalone CLI for S3 Vector operations with Bedrock embeddings",
    "version": "0.1.1",
    "project_urls": {
        "Documentation": "https://github.com/awslabs/s3vectors-embed-cli#readme",
        "Homepage": "https://github.com/awslabs/s3vectors-embed-cli",
        "Issues": "https://github.com/awslabs/s3vectors-embed-cli/issues",
        "Repository": "https://github.com/awslabs/s3vectors-embed-cli",
        "Source": "https://github.com/awslabs/s3vectors-embed-cli"
    },
    "split_keywords": [
        "aws",
        " s3",
        " vectors",
        " embeddings",
        " bedrock",
        " cli",
        " machine-learning",
        " ai"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "76fb10afaf042cf62d62ffc29b20b249c947c27e62a2b675ef8fda969fb6877e",
                "md5": "b59108218cd59b9d45d900cd1128346f",
                "sha256": "877875dd89b7b9780033fe5f93c66dcea48f3f84b90a7099d28d8c53181fa256"
            },
            "downloads": -1,
            "filename": "s3vectors_embed_cli-0.1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b59108218cd59b9d45d900cd1128346f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 39472,
            "upload_time": "2025-07-24T01:28:26",
            "upload_time_iso_8601": "2025-07-24T01:28:26.411640Z",
            "url": "https://files.pythonhosted.org/packages/76/fb/10afaf042cf62d62ffc29b20b249c947c27e62a2b675ef8fda969fb6877e/s3vectors_embed_cli-0.1.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "9854df0d1664f18b98e3a37e1c115870abc2a95180248c92dddfdd60f498d3d5",
                "md5": "c56a90f2fa8acf439f25d82730bb492b",
                "sha256": "75a11a019989020c1b062a94bc078fa6e623c6b8ed5e1e1527365557cce63c3c"
            },
            "downloads": -1,
            "filename": "s3vectors_embed_cli-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "c56a90f2fa8acf439f25d82730bb492b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 33028,
            "upload_time": "2025-07-24T01:28:27",
            "upload_time_iso_8601": "2025-07-24T01:28:27.619017Z",
            "url": "https://files.pythonhosted.org/packages/98/54/df0d1664f18b98e3a37e1c115870abc2a95180248c92dddfdd60f498d3d5/s3vectors_embed_cli-0.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-24 01:28:27",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "awslabs",
    "github_project": "s3vectors-embed-cli",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "boto3",
            "specs": [
                [
                    ">=",
                    "1.39.5"
                ]
            ]
        },
        {
            "name": "botocore",
            "specs": [
                [
                    ">=",
                    "1.39.5"
                ]
            ]
        },
        {
            "name": "click",
            "specs": [
                [
                    ">=",
                    "8.0.0"
                ]
            ]
        },
        {
            "name": "rich",
            "specs": [
                [
                    ">=",
                    "12.0.0"
                ]
            ]
        },
        {
            "name": "pydantic",
            "specs": [
                [
                    ">=",
                    "1.10.0"
                ]
            ]
        }
    ],
    "lcname": "s3vectors-embed-cli"
}

Vaibhav Sabharwal