pyrtex

Name: pyrtex
Version: 0.2.1
Summary: A Python library for batch text extraction and processing using Google Cloud Vertex AI
Upload time: 2025-08-31 14:08:52
Requires Python: >=3.9
License: MIT
Keywords: ai, vertex-ai, google-cloud, text-extraction, batch-processing, gemini, pydantic
Requirements: pydantic>=2.0.0, pydantic-settings>=2.0.0, jinja2>=3.0.0, google-cloud-aiplatform>=1.40.0, google-cloud-storage>=2.10.0, google-cloud-bigquery>=3.11.0
# PyRTex

[![CI](https://github.com/CaptainTrojan/pyrtex/actions/workflows/ci.yml/badge.svg)](https://github.com/CaptainTrojan/pyrtex/actions/workflows/ci.yml)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)

A simple Python library for batch text extraction and processing using Google Cloud Vertex AI.

PyRTex makes it easy to process multiple documents, images, or text snippets with Gemini models and get back structured, type-safe results using Pydantic models.

## ✨ Features

- **🚀 Simple API**: Just 3 steps - configure, submit, get results
- **📦 Batch Processing**: Process multiple inputs efficiently
- **🔒 Type Safety**: Pydantic models for structured output
- **🎨 Flexible Templates**: Jinja2 templates for prompt engineering
- **☁️ GCP Integration**: Seamless Vertex AI and BigQuery integration
- **🧪 Testing Mode**: Simulate without GCP costs
- **🔄 Async-Friendly**: Serialize job state & reconnect later (multi-process)

## 📦 Installation

Install from PyPI (recommended):
```bash
pip install pyrtex
```

Or install from source:
```bash
git clone https://github.com/CaptainTrojan/pyrtex.git
cd pyrtex
pip install -e .
```

For development:
```bash
pip install -e .[dev]
```

## 🚀 Quick Start

```python
from pydantic import BaseModel
from pyrtex import Job

# Define your data structures
class TextInput(BaseModel):
    content: str

class Analysis(BaseModel):
    summary: str
    sentiment: str
    key_points: list[str]

# Create a job
job = Job(
    model="gemini-2.0-flash-lite-001",
    output_schema=Analysis,
    prompt_template="Analyze this text: {{ content }}",
    simulation_mode=True  # Set to False for real processing
)

# Add your data
job.add_request("doc1", TextInput(content="Your text here"))
job.add_request("doc2", TextInput(content="Another document"))

# Process and get results
for result in job.submit().wait().results():
    if result.was_successful:
        print(f"Summary: {result.output.summary}")
        print(f"Sentiment: {result.output.sentiment}")
    else:
        print(f"Error: {result.error}")
```

Check out the [examples](https://github.com/CaptainTrojan/pyrtex/tree/main/examples) directory for diverse, concrete usage examples.
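Because prompt templates are rendered with Jinja2, a template can reference any field of the input model by name. The sketch below illustrates the rendering semantics using `jinja2` directly with a plain dict standing in for an input model; the exact step pyrtex performs internally may differ.

```python
from jinja2 import Template

# Hypothetical input record; with pyrtex this would be a Pydantic input model.
record = {"product": "Widget", "content": "Works great"}

# The template may reference any field of the input by name.
template = Template("Analyze this review of {{ product }}: {{ content }}")
prompt = template.render(**record)
print(prompt)  # Analyze this review of Widget: Works great
```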

## 📋 Core Workflow

PyRTex uses a simple 3-step workflow:

### 1. Configure & Add Data
```python
job = Job(model="gemini-2.0-flash-lite-001", ...)
job.add_request("key1", YourModel(data="value1"))
job.add_request("key2", YourModel(data="value2"))
```

### 2. Submit & Wait  
Calls can be chained for synchronous processing:
```python
job.submit().wait()
```
Or called separately, so you can do other work between submitting and the blocking wait:
```python
job.submit()
# do other work
job.wait()
```

### 2b. Asynchronous / Multi-Process Pattern
You can avoid blocking entirely by serializing job state after submission and reconnecting later (different process / machine / scheduled task):
```python
# Process A (submitter)
job.submit()
state_json = job.serialize()
# persist state_json somewhere durable (DB, GCS, S3, queue message)

# Process B (poller / checker)
re_job = Job.reconnect_from_state(state_json)
if not re_job.is_done:
    # re_job.status can be inspected, then exit / reschedule
    print("Still running:", re_job.status)

# Process C (collector) – run after poller detects completion
re_job = Job.reconnect_from_state(state_json)
if re_job.is_done:
    for r in re_job.results():
        if r.was_successful:
            print(r.request_key, r.output)
```
Why serialize? The serialized state contains:
- Vertex AI batch job resource name
- InfrastructureConfig (project/location/bucket/dataset)
- Session ID (for tracing)
- Instance map (request key ↔ internal instance id ↔ output schema type)

This allows precise result parsing without retaining the original Job object in memory.

See `examples/09_async_reconnect.py` for a CLI demonstration (start / status / results commands).
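The serialized state just needs to land somewhere that outlives the submitting process. A minimal file-based sketch, where `job_state.json` and the stand-in payload are hypothetical (in practice you would persist to a DB, GCS, or a queue message as noted above):

```python
import json
import pathlib

STATE_PATH = pathlib.Path("job_state.json")  # hypothetical location

def save_state(state_json: str) -> None:
    # In process A, after job.submit(): save_state(job.serialize())
    STATE_PATH.write_text(state_json)

def load_state() -> str:
    # In process B/C: Job.reconnect_from_state(load_state())
    return STATE_PATH.read_text()

# Stand-in for job.serialize() output; the real state holds the fields listed above.
save_state(json.dumps({"job_resource_name": "projects/p/locations/us-central1/batchPredictionJobs/123"}))
restored = load_state()
print(json.loads(restored)["job_resource_name"])
```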

### 3. Get Results
```python
for result in job.results():
    if result.was_successful:
        print(result.output)   # typed instance of your output_schema
    else:
        print(result.error)    # failure reason for this request
```

## ๐Ÿ” Authentication

PyRTex supports multiple authentication methods for Google Cloud Platform. Choose the method that best fits your deployment environment:

### Method 1: Service Account JSON String (Recommended for Production)

Perfect for serverless deployments (AWS Lambda, Google Cloud Functions, etc.):

```python
import json
import os
from pyrtex.config import InfrastructureConfig

# Set via environment variable (keeps the key out of source code)
os.environ["PYRTEX_SERVICE_ACCOUNT_KEY_JSON"] = json.dumps({
    "type": "service_account",
    "project_id": "your-project-id",
    "private_key_id": "key-id",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "your-service@your-project.iam.gserviceaccount.com",
    "client_id": "123456789",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/your-service%40your-project.iam.gserviceaccount.com"
})

# PyRTex will automatically detect and use the JSON string
job = Job(
    model="gemini-2.0-flash-lite-001",
    output_schema=YourSchema,
    prompt_template="Your prompt"
)
```

Or configure directly:
```python
config = InfrastructureConfig(
    service_account_key_json=json.dumps(service_account_dict)
)
job = Job(..., config=config)
```

### Method 2: Service Account File Path

For traditional server deployments:

```python
import os

# Set via environment variable
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"

# PyRTex will automatically detect and use the file
job = Job(
    model="gemini-2.0-flash-lite-001",
    output_schema=YourSchema,
    prompt_template="Your prompt"
)
```

Or configure directly:
```python
config = InfrastructureConfig(
    service_account_key_path="/path/to/service-account.json"
)
job = Job(..., config=config)
```

### Method 3: Application Default Credentials (Development)

For local development and testing:

```bash
# One-time setup
gcloud auth application-default login
```

```python
# No additional configuration needed
job = Job(
    model="gemini-2.0-flash-lite-001",
    output_schema=YourSchema,
    prompt_template="Your prompt"
)
```

### Authentication Priority

PyRTex uses the following priority order:
1. **Service Account JSON string** (`PYRTEX_SERVICE_ACCOUNT_KEY_JSON` or `service_account_key_json`)
2. **Service Account file** (`GOOGLE_APPLICATION_CREDENTIALS` or `service_account_key_path`)  
3. **Application Default Credentials** (gcloud login)
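The priority order can be sketched as a small resolution function. This is illustrative only, not pyrtex's actual internals:

```python
import os

def resolve_credential_source(config: dict) -> str:
    # Mirrors the documented priority: JSON string, then file path, then ADC.
    if os.environ.get("PYRTEX_SERVICE_ACCOUNT_KEY_JSON") or config.get("service_account_key_json"):
        return "service_account_json"
    if os.environ.get("GOOGLE_APPLICATION_CREDENTIALS") or config.get("service_account_key_path"):
        return "service_account_file"
    return "application_default_credentials"

print(resolve_credential_source({"service_account_key_path": "/path/to/key.json"}))
```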

### Required GCP Permissions

When creating a service account for PyRTex, assign these IAM roles:

**Required Roles:**
- **`roles/aiplatform.user`** - Vertex AI access for batch processing
- **`roles/storage.objectAdmin`** - GCS bucket read/write access
- **`roles/bigquery.dataEditor`** - BigQuery dataset read/write access
- **`roles/bigquery.jobUser`** - BigQuery job execution

**Alternative (More Restrictive):**
If you prefer granular permissions, create a custom role with:
```json
{
  "permissions": [
    "aiplatform.batchPredictionJobs.create",
    "aiplatform.batchPredictionJobs.get",
    "aiplatform.batchPredictionJobs.list",
    "aiplatform.models.predict",
    "storage.objects.create",
    "storage.objects.delete", 
    "storage.objects.get",
    "storage.objects.list",
    "bigquery.datasets.create",
    "bigquery.tables.create",
    "bigquery.tables.get",
    "bigquery.tables.getData",
    "bigquery.tables.updateData",
    "bigquery.jobs.create"
  ]
}
```

**Setup via gcloud CLI:**
```bash
# Create service account
gcloud iam service-accounts create pyrtex-service \
    --display-name="PyRTex Service Account"

# Assign roles
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:pyrtex-service@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/aiplatform.user"

gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:pyrtex-service@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"

gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:pyrtex-service@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataEditor"

gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:pyrtex-service@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/bigquery.jobUser"

# Create and download key
gcloud iam service-accounts keys create pyrtex-key.json \
    --iam-account=pyrtex-service@YOUR_PROJECT_ID.iam.gserviceaccount.com
```

## โš™๏ธ Configuration

### InfrastructureConfig

Configure GCP resources and authentication:

```python
from pyrtex.config import InfrastructureConfig

config = InfrastructureConfig(
    # Required (or set via GOOGLE_PROJECT_ID / PYRTEX_PROJECT_ID env vars)
    project_id="your-gcp-project-id",                    # GCP Project ID
    
    # Authentication (optional - detected automatically)
    service_account_key_json='{"type": "service_account", ...}',  # JSON string
    service_account_key_path="/path/to/service-account.json",     # File path
    
    # GCP Resources (optional - sensible defaults)
    location="us-central1",                              # Vertex AI location
    gcs_bucket_name="pyrtex-assets-your-project",        # GCS bucket for files
    bq_dataset_id="pyrtex_results",                      # BigQuery dataset
    
    # Data Retention (optional)
    gcs_file_retention_days=1,                           # GCS file cleanup (1-365)
    bq_table_retention_days=1                            # BigQuery table cleanup (1-365)
)

job = Job(..., config=config)
```

**Environment Variables:**
- `GOOGLE_PROJECT_ID` or `PYRTEX_PROJECT_ID` → `project_id`
- `GOOGLE_LOCATION` → `location`
- `PYRTEX_GCS_BUCKET_NAME` → `gcs_bucket_name`
- `PYRTEX_BQ_DATASET_ID` → `bq_dataset_id`
- `PYRTEX_SERVICE_ACCOUNT_KEY_JSON` → `service_account_key_json`
- `GOOGLE_APPLICATION_CREDENTIALS` → `service_account_key_path`
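As a standalone illustration of these mappings (InfrastructureConfig, built on pydantic-settings, performs equivalent lookups internally; this sketch is not its actual implementation, and the values are hypothetical):

```python
import os

# Hypothetical values for demonstration.
os.environ["GOOGLE_PROJECT_ID"] = "demo-project"
os.environ["GOOGLE_LOCATION"] = "europe-west4"

resolved = {
    "project_id": os.environ.get("GOOGLE_PROJECT_ID") or os.environ.get("PYRTEX_PROJECT_ID"),
    "location": os.environ.get("GOOGLE_LOCATION", "us-central1"),
    "gcs_bucket_name": os.environ.get("PYRTEX_GCS_BUCKET_NAME"),  # None -> library default
}
print(resolved)
```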

### GenerationConfig

Fine-tune Gemini model behavior:

```python
from pyrtex.config import GenerationConfig

generation_config = GenerationConfig(
    temperature=0.7,        # Creativity level (0.0-2.0)
    max_output_tokens=2048, # Maximum response length (1-8192)
    top_p=0.95,            # Nucleus sampling (0.0-1.0)
    top_k=40               # Top-k sampling (1-40)
)

job = Job(
    model="gemini-2.0-flash-lite-001",
    output_schema=YourSchema,
    prompt_template="Your prompt",
    generation_config=generation_config
)
```

**Parameters:**
- **`temperature`** (0.0-2.0): Controls randomness. Lower = more focused, higher = more creative
- **`max_output_tokens`** (1-8192): Maximum tokens in response. Adjust based on expected output length
- **`top_p`** (0.0-1.0): Nucleus sampling. Consider tokens with cumulative probability up to top_p
- **`top_k`** (1-40): Top-k sampling. Consider only the k most likely tokens
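To build intuition for how top-k and top-p interact, here is a toy filter over a fixed token distribution. This is conceptual only, not how Vertex AI implements sampling:

```python
def nucleus_filter(probs: dict, top_k: int, top_p: float) -> list:
    # Keep the k most likely tokens, then the smallest prefix of those
    # whose cumulative probability reaches top_p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, total = [], 0.0
    for token, p in ranked:
        kept.append(token)
        total += p
        if total >= top_p:
            break
    return kept

probs = {"cat": 0.5, "dog": 0.3, "fox": 0.15, "owl": 0.05}
print(nucleus_filter(probs, top_k=3, top_p=0.8))  # ['cat', 'dog']
```

Lower `top_k`/`top_p` shrink the candidate set (more deterministic output); raising them admits lower-probability tokens (more diverse output).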

**Quick Configs:**
```python
# Conservative (factual, consistent)
GenerationConfig(temperature=0.1, top_p=0.8, top_k=10)

# Balanced (default)
GenerationConfig(temperature=0.0, max_output_tokens=2048)

# Creative (diverse, experimental)  
GenerationConfig(temperature=1.2, top_p=0.95, top_k=40)
```

## 🎯 Usage Patterns

### Synchronous (Chained)
```python
for r in job.submit().wait().results():
    ...
```

### Non-Blocking (Explicit Wait)
```python
job.submit()
# do other tasks
job.wait()
for r in job.results():
    ...
```

### Async / Distributed (Serialize + Reconnect)
```python
# Submit phase
a_job = Job(...)
a_job.add_request("k1", Input(...))
a_job.add_request("k2", Input(...))
a_job.submit()
state_json = a_job.serialize()
# persist state_json externally

# Later (polling)
re_job = Job.reconnect_from_state(state_json)
if not re_job.is_done:
    print("Still running")

# Or even waiting (blocking)
re_job.wait()

# Later (collection)
re_job = Job.reconnect_from_state(state_json)
if re_job.is_done:
    for r in re_job.results():
        ...
```

## 📚 Examples

The `examples/` directory contains complete working examples:

```bash
cd examples

# Generate sample files
python generate_sample_data.py

# Extract contact info from business cards
python 01_simple_text_extraction.py

# Parse product catalogs  
python 02_pdf_product_parsing.py

# Describe image content
python 03_image_description.py
```

### Example Use Cases

- **📇 Business Cards**: Extract contact information
- **📄 Documents**: Process PDFs, images (PNG, JPEG)
- **🛍️ Product Catalogs**: Parse pricing and inventory
- **🧾 Invoices**: Extract structured financial data
- **📊 Batch Processing**: Handle multiple files efficiently

## 🧪 Development

### Running Tests

```bash
# All tests (mocked, safe)
./test_runner.sh

# Specific test types
./test_runner.sh --unit
./test_runner.sh --integration
./test_runner.sh --flake

# Real GCP tests (costs money!)
./test_runner.sh --real --project-id your-project-id
```

Windows users:
```cmd
test_runner.bat --unit
test_runner.bat --flake
```

### Code Quality

- **flake8**: Linting
- **black**: Code formatting  
- **isort**: Import sorting
- **pytest**: Testing with coverage

## ๐Ÿค Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run tests: `./test_runner.sh`
5. Submit a pull request

## 📄 License

MIT License - see [LICENSE](LICENSE) for details.

## 🆘 Support

- **Issues**: [GitHub Issues](https://github.com/CaptainTrojan/pyrtex/issues)
- **Examples**: Check the `examples/` directory
- **Testing**: Use `simulation_mode=True` for development

### Common Issues

**"Project was not passed and could not be determined from the environment"**
- Solution: Set `GOOGLE_PROJECT_ID` environment variable or use `simulation_mode=True`

**"Failed to initialize GCP clients"**  
- Solution: Run `gcloud auth application-default login` or use simulation mode

**Examples not working**
- Solution: Run `python generate_sample_data.py` first to create sample files

            
