pytemporal

Name: pytemporal
Version: 1.3.11
Summary: High-performance bitemporal timeseries update processor
Requires Python: >=3.9
Upload time: 2025-09-01 19:00:48
# PyTemporal Library

A high-performance Rust library with Python bindings for processing bitemporal timeseries data. Optimized for financial services and applications requiring immutable audit trails with both business and system time dimensions.

## Features

- **High Performance**: 500k records processed in ~885ms with adaptive parallelization
- **Optimized Python Wrapper**: Batch consolidation reduces conversion overhead to <0.1 seconds
- **Zero-Copy Processing**: Apache Arrow columnar data format for efficient memory usage
- **Parallel Processing**: Rayon-based parallelization with adaptive thresholds
- **Conflation**: Automatic merging of adjacent segments with identical values to reduce storage
- **Full State Mode**: Complete state replacement with tombstone records for audit trails
- **Flexible Schema**: Dynamic ID and value column configuration
- **Python Integration**: High-level DataFrame API with seamless PyO3 bindings
- **Modular Architecture**: Clean separation of concerns with dedicated modules
- **Performance Monitoring**: Integrated flamegraph generation and GitHub Pages benchmark reports

## Installation

Build from source (requires Rust):

```bash
git clone <your-repository-url>
cd pytemporal
uv run maturin develop --release
```

## Quick Start

### High-Level DataFrame API (Recommended)

```python
import pandas as pd
from pytemporal import BitemporalTimeseriesProcessor

# Initialize processor
processor = BitemporalTimeseriesProcessor(
    id_columns=['id', 'field'],
    value_columns=['mv', 'price']
)

# Current state
current_state = pd.DataFrame({
    'id': [1234, 1234],
    'field': ['test', 'fielda'], 
    'mv': [300, 400],
    'price': [400, 500],
    'effective_from': pd.to_datetime(['2020-01-01', '2020-01-01']),
    'effective_to': pd.to_datetime(['2021-01-01', '2021-01-01']),
    'as_of_from': pd.to_datetime(['2025-01-01', '2025-01-01']),
    'as_of_to': pd.to_datetime(['2262-04-11', '2262-04-11'])
})

# Updates
updates = pd.DataFrame({
    'id': [1234],
    'field': ['test'],
    'mv': [400], 
    'price': [300],
    'effective_from': pd.to_datetime(['2020-06-01']),
    'effective_to': pd.to_datetime(['2020-09-01']),
    'as_of_from': pd.to_datetime(['2025-07-27']),
    'as_of_to': pd.to_datetime(['2262-04-11'])
})

# Process updates (delta mode - incremental updates)
rows_to_expire, rows_to_insert = processor.compute_changes(
    current_state, 
    updates,
    update_mode='delta'
)

print(f"Records to expire: {len(rows_to_expire)}")
print(f"Records to insert: {len(rows_to_insert)}")

# Process updates (full_state mode - complete state replacement)
rows_to_expire, rows_to_insert = processor.compute_changes(
    current_state, 
    updates,
    update_mode='full_state'
)
```

### Low-Level Arrow API

```python
import pandas as pd
from pytemporal import compute_changes
import pyarrow as pa

# Convert pandas DataFrames to Arrow RecordBatches
current_batch = pa.RecordBatch.from_pandas(current_state)
updates_batch = pa.RecordBatch.from_pandas(updates)

# Direct Arrow processing
expire_indices, insert_batches, expired_batches = compute_changes(
    current_batch,
    updates_batch,
    id_columns=['id', 'field'],
    value_columns=['mv', 'price'],
    system_date='2025-07-27',
    update_mode='delta'
)
```

## Algorithm Explanation with Examples

### Bitemporal Model

Each record tracks two time dimensions:
- **Effective Time** (`effective_from`, `effective_to`): When the data is valid in the real world
- **As-Of Time** (`as_of_from`, `as_of_to`): When the data was known to the system

Both dimensions are stored with microsecond timestamp precision (Arrow `TimestampMicrosecond`).
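
For concreteness, here is one record in this layout as a one-row pandas DataFrame, using the column names from the Quick Start above (`2262-04-11` is the date of `pd.Timestamp.max`, used throughout the examples as the open-ended "max" sentinel):

```python
import pandas as pd

# One bitemporal record: valid in the real world for all of 2020
# (effective time) and known to the system since 2025-01-01 (as-of time).
record = pd.DataFrame({
    'id': [1234],
    'field': ['test'],
    'mv': [300],
    'price': [400],
    'effective_from': pd.to_datetime(['2020-01-01']),
    'effective_to':   pd.to_datetime(['2021-01-01']),
    'as_of_from':     pd.to_datetime(['2025-01-01']),
    'as_of_to':       pd.to_datetime(['2262-04-11']),   # open-ended "max"
})
```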

### Core Algorithm: Timeline Processing

The algorithm processes updates by creating a timeline of events and determining what should be active at each point in time.

#### Example 1: Simple Overwrite

**Current State:**
```
ID=123, effective: [2020-01-01, 2021-01-01], as_of: [2025-01-01, max], mv=100
```

**Update:**
```  
ID=123, effective: [2020-06-01, 2020-09-01], as_of: [2025-07-27, max], mv=200
```

**Timeline Processing:**

1. **Create Events:**
   - 2020-01-01: Current starts (mv=100)
   - 2020-06-01: Update starts (mv=200) 
   - 2020-09-01: Update ends
   - 2021-01-01: Current ends

2. **Process Timeline:**
   - [2020-01-01, 2020-06-01): Current active → emit mv=100
   - [2020-06-01, 2020-09-01): Update active → emit mv=200  
   - [2020-09-01, 2021-01-01): Current active → emit mv=100

3. **Result:**
   - **Expire:** Original record (index 0)
   - **Insert:** Three new records covering the split timeline

**Visual Representation:**
```
Before:
Current |=======mv=100========|
        2020-01-01      2021-01-01

Update       |==mv=200==|
             2020-06-01  2020-09-01

After:
New     |=100=|=mv=200=|=100=|
        2020   2020     2020  2021
        01-01  06-01    09-01 01-01
```
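
The splitting step for this single-overlap case can be sketched in plain Python. This is illustrative only, not the Rust implementation, and it assumes the update lies entirely inside the current segment; the Rust core generalises it to a full event timeline per ID group.

```python
from datetime import date

def split_segment(current, update):
    """Split one current segment around one fully-contained update.

    current/update: dicts with 'from', 'to' (half-open interval) and 'mv'.
    """
    pieces = []
    if current['from'] < update['from']:    # left remainder keeps the old value
        pieces.append({'from': current['from'], 'to': update['from'], 'mv': current['mv']})
    pieces.append(dict(update))             # the update segment itself
    if update['to'] < current['to']:        # right remainder keeps the old value
        pieces.append({'from': update['to'], 'to': current['to'], 'mv': current['mv']})
    return pieces

current = {'from': date(2020, 1, 1), 'to': date(2021, 1, 1), 'mv': 100}
update  = {'from': date(2020, 6, 1), 'to': date(2020, 9, 1), 'mv': 200}
split_segment(current, update)
# -> three segments: mv=100, mv=200, mv=100, as in the diagram above
```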

#### Example 2: Conflation (Adjacent Identical Values)

**Current State:**
```
ID=123, effective: [2020-01-01, 2020-06-01], as_of: [2025-01-01, max], mv=100
ID=123, effective: [2020-06-01, 2021-01-01], as_of: [2025-01-01, max], mv=100  
```

**Update:**
```
ID=123, effective: [2020-03-01, 2020-04-01], as_of: [2025-07-27, max], mv=100
```

Since the update has the same value (mv=100) as the current state, the algorithm detects this as a **no-change scenario** and skips processing entirely.
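
The no-change check relies on comparing value columns; the Rust core compares SHA256 hashes of the value columns. A rough Python equivalent might look like the following (the exact serialisation the library hashes may differ):

```python
import hashlib

def value_hash(row: dict, value_columns: list) -> str:
    # Join the value columns into one string and hash it; the library's
    # actual serialisation may differ, but the idea is the same.
    payload = '|'.join(str(row[col]) for col in value_columns)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()

current = {'mv': 100}
update  = {'mv': 100}
if value_hash(current, ['mv']) == value_hash(update, ['mv']):
    print('values unchanged -> no expire/insert emitted')
```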

#### Example 3: Complex Multi-Update

**Current State:**
```
ID=123, effective: [2020-01-01, 2021-01-01], as_of: [2025-01-01, max], mv=100
```

**Updates:**
```
ID=123, effective: [2020-03-01, 2020-06-01], as_of: [2025-07-27, max], mv=200
ID=123, effective: [2020-09-01, 2020-12-01], as_of: [2025-07-27, max], mv=300
```

**Timeline Processing:**

1. **Events:** 2020-01-01 (current start), 2020-03-01 (update1 start), 2020-06-01 (update1 end), 2020-09-01 (update2 start), 2020-12-01 (update2 end), 2021-01-01 (current end)

2. **Result:**
   - [2020-01-01, 2020-03-01): mv=100 (current)
   - [2020-03-01, 2020-06-01): mv=200 (update1)  
   - [2020-06-01, 2020-09-01): mv=100 (current)
   - [2020-09-01, 2020-12-01): mv=300 (update2)
   - [2020-12-01, 2021-01-01): mv=100 (current)
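
This scenario maps directly onto the high-level API. The snippet below reproduces it with a single `id` column and a single `mv` value column, a simplified schema chosen just for this example:

```python
import pandas as pd
from pytemporal import BitemporalTimeseriesProcessor

processor = BitemporalTimeseriesProcessor(id_columns=['id'], value_columns=['mv'])

current_state = pd.DataFrame({
    'id': [123], 'mv': [100],
    'effective_from': pd.to_datetime(['2020-01-01']),
    'effective_to':   pd.to_datetime(['2021-01-01']),
    'as_of_from':     pd.to_datetime(['2025-01-01']),
    'as_of_to':       pd.to_datetime(['2262-04-11']),
})
updates = pd.DataFrame({
    'id': [123, 123], 'mv': [200, 300],
    'effective_from': pd.to_datetime(['2020-03-01', '2020-09-01']),
    'effective_to':   pd.to_datetime(['2020-06-01', '2020-12-01']),
    'as_of_from':     pd.to_datetime(['2025-07-27', '2025-07-27']),
    'as_of_to':       pd.to_datetime(['2262-04-11', '2262-04-11']),
})

rows_to_expire, rows_to_insert = processor.compute_changes(
    current_state, updates, update_mode='delta')
# Expect one expired row and five inserted rows, matching the intervals above.
```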

### Post-Processing Conflation

After timeline processing, the algorithm merges adjacent segments with identical value hashes:

**Before Conflation:**
```
|--mv=100--|--mv=100--|--mv=200--|--mv=100--|--mv=100--|
```

**After Conflation:**
```
|--------mv=100--------|--mv=200--|--------mv=100--------|
```

This significantly reduces database row count while preserving temporal accuracy.
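
The merge rule itself is simple. A minimal Python sketch, run per ID group on the segments emitted by timeline processing, might look like this:

```python
def conflate(segments):
    """Merge adjacent segments with identical value hashes.

    segments: list of dicts with 'from', 'to' and 'hash', sorted by 'from'.
    """
    merged = []
    for seg in segments:
        prev = merged[-1] if merged else None
        if prev and prev['hash'] == seg['hash'] and prev['to'] == seg['from']:
            prev['to'] = seg['to']          # extend the previous segment
        else:
            merged.append(dict(seg))
    return merged
```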

### Update Modes

The algorithm supports two distinct update modes that determine how updates interact with existing current state:

#### 1. Delta Mode (default)
- **Behavior**: Only provided records are treated as updates
- **Existing State**: Preserved where not temporally overlapped by updates
- **Use Case**: Incremental updates where you only want to modify specific time periods
- **Processing**: Uses timeline-based algorithm to handle temporal overlaps

#### 2. Full State Mode
- **Behavior**: Provided records represent the complete desired state for each ID group
- **Existing State**: Only preserved if identical records exist in the updates (same values and temporal ranges)
- **Value Comparison**: Records are compared using SHA256 hashes of value columns
- **Processing Rules**:
  - **Unchanged Records**: If an update has the same values and temporal range as the current state, nothing is expired or inserted
  - **Changed Records**: If values differ, expire old record and insert new record
  - **New Records**: Insert records that don't exist in current state
  - **Deleted Records**: Records not present in updates are expired AND get tombstone records created
- **Tombstone Records**: When records are deleted (exist in current state but not in updates):
  - The original record is expired with its original `as_of_from`
  - A tombstone record is created with the same ID and values but `effective_to = system_date`
  - This maintains a complete audit trail showing exactly when data became inactive

#### Full State Mode Example

**Current State:**
```
ID=1, mv=100, price=250, effective=[2020-01-01, INFINITY]
ID=2, mv=300, price=400, effective=[2020-01-01, INFINITY]  
ID=3, mv=500, price=600, effective=[2020-01-01, INFINITY] 
```

**Updates (Full State):**
```
ID=1, mv=150, price=250, effective=[2020-01-01, 2020-02-01]  # Values changed
ID=2, mv=300, price=400, effective=[2020-01-01, INFINITY]   # Values identical  
# Note: ID=3 is not in updates = deleted
```

**Result:**
- **Expire**: ID=1 (values changed), ID=3 (deleted)
- **Insert**: 
  - ID=1 (new values with updated effective_to)
  - ID=3 (tombstone: same values but effective_to=system_date)
- **No Action**: ID=2 (identical values, no change needed)

**Tombstone Example:**
```
Original:  ID=3, mv=500, price=600, effective=[2020-01-01, INFINITY]
Tombstone: ID=3, mv=500, price=600, effective=[2020-01-01, 2025-08-30]  # Truncated to system_date
```
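
The classification rules above can be sketched per ID group in plain Python. This sketch uses direct equality instead of the library's SHA256 value hashes and a simplified record shape, purely to illustrate the four cases:

```python
def full_state_plan(current, updates, system_date):
    """current/updates: {record_id: {'values': (...), 'effective': (from, to)}}."""
    expire, insert = [], []
    for rid, cur in current.items():
        upd = updates.get(rid)
        if upd is None:                            # deleted -> expire + tombstone
            expire.append(rid)
            insert.append({'values': cur['values'],
                           'effective': (cur['effective'][0], system_date)})
        elif upd != cur:                           # changed -> expire + insert new
            expire.append(rid)
            insert.append(upd)
        # identical values and range -> no action
    for rid, upd in updates.items():
        if rid not in current:                     # brand new -> insert only
            insert.append(upd)
    return expire, insert

plan = full_state_plan(
    current={'ID3': {'values': (500, 600), 'effective': ('2020-01-01', 'INFINITY')}},
    updates={},
    system_date='2025-08-30',
)
# -> expire ['ID3'] and insert a tombstone with effective_to = '2025-08-30'
```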

### Timezone Handling

The library seamlessly handles timezone-aware timestamps from databases like PostgreSQL:

```python
# PostgreSQL timestamptz columns (timezone-aware)
current_state = pd.DataFrame({
    'id': [1],
    'value': ['test'],
    'effective_from': pd.Timestamp("2024-01-01", tz='UTC'),  # timezone-aware
    'effective_to': pd.Timestamp("2099-12-31", tz='UTC'),
    # ... other columns
})

# Client updates (often timezone-naive)  
updates = pd.DataFrame({
    'id': [1],
    'value': ['updated'],
    'effective_from': pd.Timestamp("2024-01-02"),  # timezone-naive
    'effective_to': pd.Timestamp("2099-12-31"),
    # ... other columns
})

# Automatically handles timezone normalization
rows_to_expire, rows_to_insert = processor.compute_changes(current_state, updates)
```

**Key Features:**
- **Automatic Schema Normalization**: Mixed timezone-aware and timezone-naive timestamps are automatically reconciled
- **PostgreSQL Compatible**: Seamless integration with `timestamptz` columns
- **Timezone Preservation**: Original timezone information is preserved throughout the processing pipeline
- **Arrow Schema Compatibility**: Ensures consistent `timestamp[us, tz=UTC]` schemas for Rust processing
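
For readers who prefer to pre-normalise their own frames, a manual equivalent of this normalisation might look like the sketch below. The library does this automatically; treating naive timestamps as UTC is an assumption made here for illustration.

```python
import pandas as pd

def to_utc(series: pd.Series) -> pd.Series:
    series = pd.to_datetime(series)
    if series.dt.tz is None:
        return series.dt.tz_localize('UTC')   # naive -> treat as UTC
    return series.dt.tz_convert('UTC')        # aware -> convert to UTC

# Apply to the effective columns (and the as_of columns, when present).
for col in ['effective_from', 'effective_to']:
    updates[col] = to_utc(updates[col])
```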

### Parallelization Strategy

The algorithm uses adaptive parallelization:
- **Serial Processing**: Small datasets (<50 ID groups AND <10k records) 
- **Parallel Processing**: Large datasets using Rayon for CPU-bound operations
- **ID Group Independence**: Each ID group processes independently, enabling perfect parallelization
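
Expressed as a predicate, the documented thresholds look like this (the actual constants live in the Rust core; shown here only for illustration):

```python
def should_parallelize(num_id_groups: int, total_records: int) -> bool:
    # >50 ID groups OR >10k total records triggers Rayon-parallel processing.
    return num_id_groups > 50 or total_records > 10_000

should_parallelize(10, 5_000)     # False -> serial path
should_parallelize(200, 5_000)    # True  -> parallel across ID groups
```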

## Performance

### Rust Core Performance
Benchmarked on modern hardware:

- **500k records**: ~885ms processing time
- **Adaptive Parallelization**: Automatically uses multiple threads for large datasets  
- **Parallel Thresholds**: >50 ID groups OR >10k total records triggers parallel processing
- **Conflation Efficiency**: Significant row reduction for datasets with temporal continuity

### Python Wrapper Performance (Optimized)
Advanced batch consolidation eliminates conversion bottlenecks:

- **Batch Consolidation**: Individual record batches consolidated into 10k-row batches
- **Conversion Overhead**: <0.1 seconds for large datasets (was 30+ seconds)
- **Overhead Ratio**: Python-side processing is typically only 10–20% of the Rust processing time
- **Zero-Copy Efficiency**: Near-optimal Arrow/pandas conversion performance

**Example Performance (50k records):**
```
Rust processing:     0.5s
Python conversion:   0.05s  
Total time:          0.55s
Overhead ratio:      0.1x (10% overhead)
```

**Before Optimization:**
- 90,213 single-row batches → 45+ seconds conversion time
- Massive per-batch overhead in arro3 → pandas conversion

**After Optimization:**  
- 10 consolidated 10k-row batches → 0.05 seconds conversion time
- Efficient bulk conversion with minimal overhead
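
A pyarrow sketch of the same consolidation idea, for readers working with the low-level API themselves (the wrapper's consolidation happens in the Rust `conflation` module, not in Python):

```python
import pyarrow as pa

def consolidate(batches, rows_per_batch=10_000):
    """Combine many small RecordBatches into ~10k-row batches before to_pandas()."""
    table = pa.Table.from_batches(batches).combine_chunks()
    return table.to_batches(max_chunksize=rows_per_batch)
```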

## Testing

Run the test suites:

```bash
# Rust tests
cargo test

# Python tests  
uv run python -m pytest tests/test_bitemporal.py -v

# Benchmarks
cargo bench
```

## Development

### Requirements
- Rust (latest stable)
- Python 3.9+ with development headers
- uv for Python dependency management

### Python Version Compatibility
This project uses PyO3 v0.21.2, which officially supports Python up to 3.12. If you're using Python 3.13+:

**For Rust tests:**
```bash
export PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1
cargo test
```

**For Python bindings (recommended):**
```bash
uv run maturin develop  # Uses venv Python 3.12 automatically
```

### Setup
```bash
# Install Rust dependencies and run tests
export PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1  # If using Python 3.13+
cargo test

# Build Python extension (development mode)
uv run maturin develop

# Run Python tests  
uv run python -m pytest tests/test_bitemporal_manual.py -v
```

### Permanent Fix for Python 3.13+
Add to your shell profile (~/.bashrc or ~/.zshrc):
```bash
export PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1
```

### Project Structure

**Modular Architecture** (main file reduced from 1,085 lines to under 300):

- `src/lib.rs` - Main processing function and Python bindings (280 lines)
- `src/types.rs` - Core data structures and constants (90 lines)
- `src/overlap.rs` - Overlap detection and record categorization (70 lines) 
- `src/timeline.rs` - Timeline event processing algorithm (250 lines)
- `src/conflation.rs` - Record conflation and batch consolidation (280 lines)
- `src/batch_utils.rs` - Arrow RecordBatch utilities with hash functions (490 lines)
- `tests/integration_tests.rs` - Rust integration tests (5 test scenarios)
- `tests/test_bitemporal_manual.py` - Python test suite (22 test scenarios)
- `benches/bitemporal_benchmarks.rs` - Performance benchmarks
- `CLAUDE.md` - Project context and development notes

### Key Commands

```bash
# Build release version
cargo build --release

# Run benchmarks with HTML reports
cargo bench

# Build Python wheel  
uv run maturin build --release

# Development install
uv run maturin develop
```

### Module Responsibilities

1. **`types.rs`** - Data structures (`BitemporalRecord`, `ChangeSet`, `UpdateMode`) and type conversions
2. **`overlap.rs`** - Determines which records overlap in time and need timeline processing vs direct insertion
3. **`timeline.rs`** - Core algorithm that processes overlapping records through event timeline
4. **`conflation.rs`** - Post-processes results to merge adjacent segments and consolidate batches for optimal performance
5. **`batch_utils.rs`** - Arrow utilities for RecordBatch creation, timestamp handling, and SHA256 hash computation

## Dependencies

- **arrow** (53.4) - Columnar data processing
- **pyo3** (0.21) - Python bindings  
- **chrono** (0.4) - Date/time handling
- **sha2** (0.10) - SHA256 hashing for client-compatible hex digests
- **rayon** (1.8) - Parallel processing
- **criterion** (0.5) - Benchmarking framework

## Architecture

### Rust Core
- Zero-copy Arrow array processing
- Parallel execution with Rayon
- Hash-based change detection with SHA256 (client-compatible hex digests)
- Post-processing conflation for optimal storage
- Batch consolidation for efficient Python conversion
- Modular design with clear separation of concerns

### Python Interface
- High-level DataFrame API (`BitemporalTimeseriesProcessor`)
- Low-level Arrow API (`compute_changes`)
- Optimized batch conversion with minimal overhead
- PyO3 bindings for seamless integration
- Compatible with pandas DataFrames via efficient conversion

## Performance Monitoring

This project includes comprehensive performance monitoring with flamegraph analysis:

### 📊 Release Performance Reports

View performance metrics and flamegraphs for each release at:
**[Release Benchmarks](https://your-username.github.io/pytemporal/)**

Each version tag automatically generates comprehensive performance documentation with flamegraphs, creating a historical record of performance evolution across releases.

### 🔥 Generating Flamegraphs Locally

```bash
# Generate flamegraphs for key benchmarks
cargo bench --bench bitemporal_benchmarks medium_dataset -- --profile-time 5
cargo bench --bench bitemporal_benchmarks conflation_effectiveness -- --profile-time 5 
cargo bench --bench bitemporal_benchmarks "scaling_by_dataset_size/records/500000" -- --profile-time 5

# Add flamegraph links to HTML reports  
python3 scripts/add_flamegraphs_to_html.py

# View reports locally
python3 -m http.server 8000 --directory target/criterion
# Then visit: http://localhost:8000/report/
```

### 📈 Performance Expectations

| Dataset Size | Processing Time | Flamegraph Available |
|--------------|----------------|---------------------|
| Small (5 records) | ~30-35 µs | ❌ |
| Medium (100 records) | ~165-170 µs | ✅ |
| Large (500k records) | ~900-950 ms | ✅ |
| Conflation test | ~28 µs | ✅ |

### 🎯 Key Optimization Areas (from Flamegraph Analysis)

- **`process_id_timeline`**: Core algorithm logic
- **Rayon parallelization**: Thread management overhead
- **Arrow operations**: Columnar data processing
- **SHA256 hashing**: Value fingerprinting for conflation

See `docs/benchmark-publishing.md` for complete setup details.

## Contributing

1. Check `CLAUDE.md` for project context and conventions
2. Run tests before submitting changes
3. Follow existing code style and patterns
4. Update benchmarks for performance-related changes
5. Use flamegraphs to validate performance improvements
6. Maintain modular architecture when adding features

## License

MIT License - see LICENSE file for details.

## Built With

- [Apache Arrow](https://arrow.apache.org/) - Columnar data format
- [PyO3](https://pyo3.rs/) - Rust-Python bindings  
- [Rayon](https://github.com/rayon-rs/rayon) - Data parallelism
- [Criterion](https://github.com/bheisler/criterion.rs) - Benchmarking
- [SHA256](https://en.wikipedia.org/wiki/SHA-2) - Cryptographic hashing algorithm (client-compatible)

            
