| Field | Value |
|-------|-------|
| Name | pytemporal |
| Version | 1.3.11 |
| Summary | High-performance bitemporal timeseries update processor |
| Requires Python | >=3.9 |
| Upload time | 2025-09-01 19:00:48 |
# PyTemporal Library
A high-performance Rust library with Python bindings for processing bitemporal timeseries data. Optimized for financial services and applications requiring immutable audit trails with both business and system time dimensions.
## Features
- **High Performance**: 500k records processed in ~885ms with adaptive parallelization
- **Optimized Python Wrapper**: Batch consolidation reduces conversion overhead to <0.1 seconds
- **Zero-Copy Processing**: Apache Arrow columnar data format for efficient memory usage
- **Parallel Processing**: Rayon-based parallelization with adaptive thresholds
- **Conflation**: Automatic merging of adjacent segments with identical values to reduce storage
- **Full State Mode**: Complete state replacement with tombstone records for audit trails
- **Flexible Schema**: Dynamic ID and value column configuration
- **Python Integration**: High-level DataFrame API with seamless PyO3 bindings
- **Modular Architecture**: Clean separation of concerns with dedicated modules
- **Performance Monitoring**: Integrated flamegraph generation and GitHub Pages benchmark reports
## Installation
Build from source (requires Rust):
```bash
git clone <your-repository-url>
cd pytemporal
uv run maturin develop --release
```
## Quick Start
### High-Level DataFrame API (Recommended)
```python
import pandas as pd
from pytemporal import BitemporalTimeseriesProcessor
# Initialize processor
processor = BitemporalTimeseriesProcessor(
    id_columns=['id', 'field'],
    value_columns=['mv', 'price']
)

# Current state
current_state = pd.DataFrame({
    'id': [1234, 1234],
    'field': ['test', 'fielda'],
    'mv': [300, 400],
    'price': [400, 500],
    'effective_from': pd.to_datetime(['2020-01-01', '2020-01-01']),
    'effective_to': pd.to_datetime(['2021-01-01', '2021-01-01']),
    'as_of_from': pd.to_datetime(['2025-01-01', '2025-01-01']),
    'as_of_to': pd.to_datetime(['2262-04-11', '2262-04-11'])
})

# Updates
updates = pd.DataFrame({
    'id': [1234],
    'field': ['test'],
    'mv': [400],
    'price': [300],
    'effective_from': pd.to_datetime(['2020-06-01']),
    'effective_to': pd.to_datetime(['2020-09-01']),
    'as_of_from': pd.to_datetime(['2025-07-27']),
    'as_of_to': pd.to_datetime(['2262-04-11'])
})

# Process updates (delta mode - incremental updates)
rows_to_expire, rows_to_insert = processor.compute_changes(
    current_state,
    updates,
    update_mode='delta'
)

print(f"Records to expire: {len(rows_to_expire)}")
print(f"Records to insert: {len(rows_to_insert)}")

# Process updates (full_state mode - complete state replacement)
rows_to_expire, rows_to_insert = processor.compute_changes(
    current_state,
    updates,
    update_mode='full_state'
)
```
### Low-Level Arrow API
```python
import pandas as pd
from pytemporal import compute_changes
import pyarrow as pa
# Convert pandas DataFrames to Arrow RecordBatches
current_batch = pa.RecordBatch.from_pandas(current_state)
updates_batch = pa.RecordBatch.from_pandas(updates)

# Direct Arrow processing
expire_indices, insert_batches, expired_batches = compute_changes(
    current_batch,
    updates_batch,
    id_columns=['id', 'field'],
    value_columns=['mv', 'price'],
    system_date='2025-07-27',
    update_mode='delta'
)
```
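The low-level call returns Arrow data rather than pandas DataFrames. A hedged sketch of post-processing the results, assuming `insert_batches` and `expired_batches` come back as pyarrow-compatible RecordBatches (depending on the build they may be arro3 batches, per the performance notes further down, and would need converting first):

```python
import pyarrow as pa

# Hypothetical post-processing of the low-level results above:
#   expire_indices   -> row positions in current_batch to expire
#   insert_batches   -> list of RecordBatches holding the new rows
#   expired_batches  -> list of RecordBatches holding the superseded rows
if insert_batches:
    rows_to_insert = pa.Table.from_batches(insert_batches).to_pandas()
if expired_batches:
    rows_expired = pa.Table.from_batches(expired_batches).to_pandas()
print(f"Rows to expire (indices): {list(expire_indices)}")
```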
## Algorithm Explanation with Examples
### Bitemporal Model
Each record tracks two time dimensions:
- **Effective Time** (`effective_from`, `effective_to`): When the data is valid in the real world
- **As-Of Time** (`as_of_from`, `as_of_to`): When the data was known to the system
Both dimensions are stored with microsecond precision (Arrow `TimestampMicrosecond`).
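For illustration, a single bitemporal row therefore carries four timestamp columns alongside its ID and value columns (a minimal sketch reusing the column names and values from the Quick Start):

```python
import pandas as pd

# One bitemporal record: a business-validity window plus a system-knowledge window.
record = pd.DataFrame({
    'id': [1234], 'field': ['test'],                  # ID columns
    'mv': [300], 'price': [400],                      # value columns
    'effective_from': [pd.Timestamp('2020-01-01')],   # valid in the real world from...
    'effective_to':   [pd.Timestamp('2021-01-01')],   # ...until (exclusive)
    'as_of_from':     [pd.Timestamp('2025-01-01')],   # known to the system from...
    'as_of_to':       [pd.Timestamp('2262-04-11')],   # ...still current (max timestamp)
})
```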
### Core Algorithm: Timeline Processing
The algorithm processes updates by creating a timeline of events and determining what should be active at each point in time.
#### Example 1: Simple Overwrite
**Current State:**
```
ID=123, effective: [2020-01-01, 2021-01-01], as_of: [2025-01-01, max], mv=100
```
**Update:**
```
ID=123, effective: [2020-06-01, 2020-09-01], as_of: [2025-07-27, max], mv=200
```
**Timeline Processing:**
1. **Create Events:**
   - 2020-01-01: Current starts (mv=100)
   - 2020-06-01: Update starts (mv=200)
   - 2020-09-01: Update ends
   - 2021-01-01: Current ends
2. **Process Timeline:**
   - [2020-01-01, 2020-06-01): Current active → emit mv=100
   - [2020-06-01, 2020-09-01): Update active → emit mv=200
   - [2020-09-01, 2021-01-01): Current active → emit mv=100
3. **Result:**
   - **Expire:** Original record (index 0)
   - **Insert:** Three new records covering the split timeline
**Visual Representation:**
```
Before:
Current   |========== mv=100 ==========|
          2020-01-01            2021-01-01

Update               |= mv=200 =|
                     2020-06-01 2020-09-01

After:
New       |==100==|= mv=200 =|==100==|
          2020-    2020-      2020-   2021-
          01-01    06-01      09-01   01-01
```
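The same scenario can be run through the high-level API. A sketch reproducing Example 1 with a single ID and value column (per the walkthrough above, the expected outcome is one expired record and three replacement segments):

```python
import pandas as pd
from pytemporal import BitemporalTimeseriesProcessor

MAX_TS = pd.Timestamp('2262-04-11')
processor = BitemporalTimeseriesProcessor(id_columns=['id'], value_columns=['mv'])

current = pd.DataFrame({
    'id': [123], 'mv': [100],
    'effective_from': [pd.Timestamp('2020-01-01')], 'effective_to': [pd.Timestamp('2021-01-01')],
    'as_of_from': [pd.Timestamp('2025-01-01')], 'as_of_to': [MAX_TS],
})
update = pd.DataFrame({
    'id': [123], 'mv': [200],
    'effective_from': [pd.Timestamp('2020-06-01')], 'effective_to': [pd.Timestamp('2020-09-01')],
    'as_of_from': [pd.Timestamp('2025-07-27')], 'as_of_to': [MAX_TS],
})

rows_to_expire, rows_to_insert = processor.compute_changes(current, update, update_mode='delta')
print(len(rows_to_expire), len(rows_to_insert))  # expected per the walkthrough: 1 and 3
```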
#### Example 2: Conflation (Adjacent Identical Values)
**Current State:**
```
ID=123, effective: [2020-01-01, 2020-06-01], as_of: [2025-01-01, max], mv=100
ID=123, effective: [2020-06-01, 2021-01-01], as_of: [2025-01-01, max], mv=100
```
**Update:**
```
ID=123, effective: [2020-03-01, 2020-04-01], as_of: [2025-07-27, max], mv=100
```
Since the update has the same value (mv=100) as the current state, the algorithm detects this as a **no-change scenario** and skips processing entirely.
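A sketch of the same check through the high-level API, reusing the setup from the Example 1 sketch; since the update carries an identical value, both returned DataFrames should come back empty:

```python
# Update overlapping the current state but with an identical value (mv=100).
noop_update = update.assign(mv=100,
                            effective_from=pd.Timestamp('2020-03-01'),
                            effective_to=pd.Timestamp('2020-04-01'))

rows_to_expire, rows_to_insert = processor.compute_changes(current, noop_update, update_mode='delta')
print(len(rows_to_expire), len(rows_to_insert))  # expected per the walkthrough: 0 and 0
```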
#### Example 3: Complex Multi-Update
**Current State:**
```
ID=123, effective: [2020-01-01, 2021-01-01], as_of: [2025-01-01, max], mv=100
```
**Updates:**
```
ID=123, effective: [2020-03-01, 2020-06-01], as_of: [2025-07-27, max], mv=200
ID=123, effective: [2020-09-01, 2020-12-01], as_of: [2025-07-27, max], mv=300
```
**Timeline Processing:**
1. **Events:** 2020-01-01 (current start), 2020-03-01 (update1 start), 2020-06-01 (update1 end), 2020-09-01 (update2 start), 2020-12-01 (update2 end), 2021-01-01 (current end)
2. **Result:**
   - [2020-01-01, 2020-03-01): mv=100 (current)
   - [2020-03-01, 2020-06-01): mv=200 (update1)
   - [2020-06-01, 2020-09-01): mv=100 (current)
   - [2020-09-01, 2020-12-01): mv=300 (update2)
   - [2020-12-01, 2021-01-01): mv=100 (current)
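Example 3 differs from Example 1 mainly in that both updates are submitted in a single call. A sketch, again reusing the Example 1 setup, with the five segments listed above as the expected outcome:

```python
# Two non-adjacent updates for the same ID, submitted together.
updates = pd.DataFrame({
    'id': [123, 123], 'mv': [200, 300],
    'effective_from': pd.to_datetime(['2020-03-01', '2020-09-01']),
    'effective_to':   pd.to_datetime(['2020-06-01', '2020-12-01']),
    'as_of_from':     pd.to_datetime(['2025-07-27', '2025-07-27']),
    'as_of_to':       [MAX_TS, MAX_TS],
})

rows_to_expire, rows_to_insert = processor.compute_changes(current, updates, update_mode='delta')
print(len(rows_to_insert))  # expected per the walkthrough: 5 segments
```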
### Post-Processing Conflation
After timeline processing, the algorithm merges adjacent segments with identical value hashes:
**Before Conflation:**
```
|--mv=100--|--mv=100--|--mv=200--|--mv=100--|--mv=100--|
```
**After Conflation:**
```
|--------mv=100--------|--mv=200--|--------mv=100--------|
```
This significantly reduces database row count while preserving temporal accuracy.
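The conflation step lives in the Rust core (`src/conflation.rs`). Purely as an illustration of the idea, a pure-Python sketch of merging adjacent segments whose values are identical:

```python
def conflate(segments):
    """Merge adjacent (effective_from, effective_to, value) segments with equal values.

    Illustrative only -- the library performs this step in Rust using value hashes.
    """
    merged = []
    for start, end, value in sorted(segments):
        if merged and merged[-1][2] == value and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end, value)   # extend the previous segment
        else:
            merged.append((start, end, value))
    return merged

segments = [('2020-01', '2020-03', 100), ('2020-03', '2020-06', 100),
            ('2020-06', '2020-09', 200), ('2020-09', '2020-12', 100),
            ('2020-12', '2021-01', 100)]
print(conflate(segments))
# [('2020-01', '2020-06', 100), ('2020-06', '2020-09', 200), ('2020-09', '2021-01', 100)]
```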
### Update Modes
The algorithm supports two distinct update modes that determine how updates interact with existing current state:
#### 1. Delta Mode (default)
- **Behavior**: Only provided records are treated as updates
- **Existing State**: Preserved where not temporally overlapped by updates
- **Use Case**: Incremental updates where you only want to modify specific time periods
- **Processing**: Uses timeline-based algorithm to handle temporal overlaps
#### 2. Full State Mode
- **Behavior**: Provided records represent the complete desired state for each ID group
- **Existing State**: Only preserved if identical records exist in the updates (same values and temporal ranges)
- **Value Comparison**: Records are compared using SHA256 hashes of value columns
- **Processing Rules**:
  - **Unchanged Records**: If an update has the same values and temporal range as the current state, neither expire nor insert
  - **Changed Records**: If values differ, expire the old record and insert the new record
  - **New Records**: Insert records that don't exist in the current state
  - **Deleted Records**: Records not present in the updates are expired AND get tombstone records created
- **Tombstone Records**: When records are deleted (exist in current state but not in updates):
  - The original record is expired with its original `as_of_from`
  - A tombstone record is created with the same ID and values but `effective_to = system_date`
  - This maintains a complete audit trail showing exactly when data became inactive
#### Full State Mode Example
**Current State:**
```
ID=1, mv=100, price=250, effective=[2020-01-01, INFINITY]
ID=2, mv=300, price=400, effective=[2020-01-01, INFINITY]
ID=3, mv=500, price=600, effective=[2020-01-01, INFINITY]
```
**Updates (Full State):**
```
ID=1, mv=150, price=250, effective=[2020-01-01, 2020-02-01] # Values changed
ID=2, mv=300, price=400, effective=[2020-01-01, INFINITY] # Values identical
# Note: ID=3 is not in updates = deleted
```
**Result:**
- **Expire**: ID=1 (values changed), ID=3 (deleted)
- **Insert**:
  - ID=1 (new values with updated effective_to)
  - ID=3 (tombstone: same values but effective_to=system_date)
- **No Action**: ID=2 (identical values, no change needed)
**Tombstone Example:**
```
Original: ID=3, mv=500, price=600, effective=[2020-01-01, INFINITY]
Tombstone: ID=3, mv=500, price=600, effective=[2020-01-01, 2025-08-30] # Truncated to system_date
```
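A sketch of the same full-state scenario through the high-level API (column names as in the Quick Start; the wrapper presumably defaults the system date to the run date, so the tombstone's `effective_to` would reflect that rather than the 2025-08-30 shown above):

```python
import pandas as pd
from pytemporal import BitemporalTimeseriesProcessor

MAX_TS = pd.Timestamp('2262-04-11')

def rows(ids, mvs, prices, eff_to):
    """Helper to build bitemporal rows sharing the same effective_from/as_of window."""
    n = len(ids)
    return pd.DataFrame({
        'id': ids, 'mv': mvs, 'price': prices,
        'effective_from': [pd.Timestamp('2020-01-01')] * n,
        'effective_to': eff_to,
        'as_of_from': [pd.Timestamp('2025-01-01')] * n,
        'as_of_to': [MAX_TS] * n,
    })

current = rows([1, 2, 3], [100, 300, 500], [250, 400, 600], [MAX_TS] * 3)
updates = rows([1, 2], [150, 300], [250, 400], [pd.Timestamp('2020-02-01'), MAX_TS])  # ID=3 omitted

processor = BitemporalTimeseriesProcessor(id_columns=['id'], value_columns=['mv', 'price'])
rows_to_expire, rows_to_insert = processor.compute_changes(current, updates, update_mode='full_state')
# Expected per the walkthrough: expire ID=1 and ID=3; insert a new ID=1 plus an ID=3 tombstone.
```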
### Timezone Handling
The library seamlessly handles timezone-aware timestamps from databases like PostgreSQL:
```python
# PostgreSQL timestamptz columns (timezone-aware)
current_state = pd.DataFrame({
    'id': [1],
    'value': ['test'],
    'effective_from': pd.Timestamp("2024-01-01", tz='UTC'),  # timezone-aware
    'effective_to': pd.Timestamp("2099-12-31", tz='UTC'),
    # ... other columns
})

# Client updates (often timezone-naive)
updates = pd.DataFrame({
    'id': [1],
    'value': ['updated'],
    'effective_from': pd.Timestamp("2024-01-02"),  # timezone-naive
    'effective_to': pd.Timestamp("2099-12-31"),
    # ... other columns
})

# Automatically handles timezone normalization
rows_to_expire, rows_to_insert = processor.compute_changes(current_state, updates)
```
**Key Features:**
- **Automatic Schema Normalization**: Mixed timezone-aware and timezone-naive timestamps are automatically reconciled
- **PostgreSQL Compatible**: Seamless integration with `timestamptz` columns
- **Timezone Preservation**: Original timezone information is preserved throughout the processing pipeline
- **Arrow Schema Compatibility**: Ensures consistent `timestamp[us, tz=UTC]` schemas for Rust processing
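The normalization itself happens inside the wrapper; for reference, a sketch of what it amounts to in pandas terms (localizing naive timestamp columns to UTC so both inputs share a `timestamp[us, tz=UTC]` schema), applied here to the `updates` frame from the example above:

```python
import pandas as pd

def to_utc(series: pd.Series) -> pd.Series:
    """Localize naive timestamps to UTC, or convert aware ones -- illustrative only."""
    if series.dt.tz is None:
        return series.dt.tz_localize('UTC')
    return series.dt.tz_convert('UTC')

updates['effective_from'] = to_utc(updates['effective_from'])
updates['effective_to'] = to_utc(updates['effective_to'])
```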
### Parallelization Strategy
The algorithm uses adaptive parallelization:
- **Serial Processing**: Small datasets (<50 ID groups AND <10k records)
- **Parallel Processing**: Large datasets using Rayon for CPU-bound operations
- **ID Group Independence**: Each ID group processes independently, enabling perfect parallelization
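Illustratively, the decision rule described above boils down to something like the following (a Python sketch; the actual threshold check lives in the Rust core):

```python
PARALLEL_ID_GROUPS = 50        # thresholds as described in this README
PARALLEL_TOTAL_RECORDS = 10_000

def should_parallelize(num_id_groups: int, num_records: int) -> bool:
    """Parallel if >50 ID groups OR >10k total records; otherwise serial."""
    return num_id_groups > PARALLEL_ID_GROUPS or num_records > PARALLEL_TOTAL_RECORDS

print(should_parallelize(10, 2_000))    # False -> serial
print(should_parallelize(200, 5_000))   # True  -> parallel (many ID groups)
```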
## Performance
### Rust Core Performance
Benchmarked on modern hardware:
- **500k records**: ~885ms processing time
- **Adaptive Parallelization**: Automatically uses multiple threads for large datasets
- **Parallel Thresholds**: >50 ID groups OR >10k total records triggers parallel processing
- **Conflation Efficiency**: Significant row reduction for datasets with temporal continuity
### Python Wrapper Performance (Optimized)
Advanced batch consolidation eliminates conversion bottlenecks:
- **Batch Consolidation**: Individual record batches consolidated into 10k-row batches
- **Conversion Overhead**: <0.1 seconds for large datasets (was 30+ seconds)
- **Overhead Ratio**: Python processing is only 20% of Rust processing time
- **Zero-Copy Efficiency**: Near-optimal Arrow/pandas conversion performance
**Example Performance (50k records):**
```
Rust processing: 0.5s
Python conversion: 0.05s
Total time: 0.55s
Overhead ratio: 0.1x (10% overhead)
```
**Before Optimization:**
- 90,213 single-row batches → 45+ seconds conversion time
- Massive per-batch overhead in arro3 → pandas conversion
**After Optimization:**
- 10 consolidated 10k-row batches → 0.05 seconds conversion time
- Efficient bulk conversion with minimal overhead
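The consolidation idea can be illustrated with plain pyarrow: gather the many small batches into a Table and re-chunk to a target batch size (a sketch of the concept, not the wrapper's actual code):

```python
import pyarrow as pa

def consolidate(batches, target_rows=10_000):
    """Combine many small RecordBatches into ~target_rows-sized batches (illustrative)."""
    table = pa.Table.from_batches(batches).combine_chunks()
    return table.to_batches(max_chunksize=target_rows)

# e.g. thousands of single-row batches -> a handful of 10k-row batches
small = [pa.RecordBatch.from_pydict({'id': [i], 'mv': [float(i)]}) for i in range(25_000)]
big = consolidate(small)
print(len(small), '->', len(big))  # 25000 -> 3
```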
## Testing
Run the test suites:
```bash
# Rust tests
cargo test
# Python tests
uv run python -m pytest tests/test_bitemporal.py -v
# Benchmarks
cargo bench
```
## Development
### Requirements
- Rust (latest stable)
- Python 3.9+ with development headers (the package metadata declares `requires_python >=3.9`)
- uv for Python dependency management
### Python Version Compatibility
This project uses PyO3 v0.21.2, which officially supports Python up to 3.12. If you're using Python 3.13+:
**For Rust tests:**
```bash
export PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1
cargo test
```
**For Python bindings (recommended):**
```bash
uv run maturin develop # Uses venv Python 3.12 automatically
```
### Setup
```bash
# Install Rust dependencies and run tests
export PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1 # If using Python 3.13+
cargo test
# Build Python extension (development mode)
uv run maturin develop
# Run Python tests
uv run python -m pytest tests/test_bitemporal_manual.py -v
```
### Permanent Fix for Python 3.13+
Add to your shell profile (~/.bashrc or ~/.zshrc):
```bash
export PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1
```
### Project Structure
**Modular Architecture** (main file reduced from 1,085 lines to roughly 280):
- `src/lib.rs` - Main processing function and Python bindings (280 lines)
- `src/types.rs` - Core data structures and constants (90 lines)
- `src/overlap.rs` - Overlap detection and record categorization (70 lines)
- `src/timeline.rs` - Timeline event processing algorithm (250 lines)
- `src/conflation.rs` - Record conflation and batch consolidation (280 lines)
- `src/batch_utils.rs` - Arrow RecordBatch utilities with hash functions (490 lines)
- `tests/integration_tests.rs` - Rust integration tests (5 test scenarios)
- `tests/test_bitemporal_manual.py` - Python test suite (22 test scenarios)
- `benches/bitemporal_benchmarks.rs` - Performance benchmarks
- `CLAUDE.md` - Project context and development notes
### Key Commands
```bash
# Build release version
cargo build --release
# Run benchmarks with HTML reports
cargo bench
# Build Python wheel
uv run maturin build --release
# Development install
uv run maturin develop
```
### Module Responsibilities
1. **`types.rs`** - Data structures (`BitemporalRecord`, `ChangeSet`, `UpdateMode`) and type conversions
2. **`overlap.rs`** - Determines which records overlap in time and need timeline processing vs direct insertion
3. **`timeline.rs`** - Core algorithm that processes overlapping records through event timeline
4. **`conflation.rs`** - Post-processes results to merge adjacent segments and consolidate batches for optimal performance
5. **`batch_utils.rs`** - Arrow utilities for RecordBatch creation, timestamp handling, and SHA256 hash computation
## Dependencies
- **arrow** (53.4) - Columnar data processing
- **pyo3** (0.21) - Python bindings
- **chrono** (0.4) - Date/time handling
- **sha2** (0.10) - SHA256 hashing for client-compatible hex digests
- **rayon** (1.8) - Parallel processing
- **criterion** (0.5) - Benchmarking framework
## Architecture
### Rust Core
- Zero-copy Arrow array processing
- Parallel execution with Rayon
- Hash-based change detection with SHA256 (client-compatible hex digests)
- Post-processing conflation for optimal storage
- Batch consolidation for efficient Python conversion
- Modular design with clear separation of concerns
### Python Interface
- High-level DataFrame API (`BitemporalTimeseriesProcessor`)
- Low-level Arrow API (`compute_changes`)
- Optimized batch conversion with minimal overhead
- PyO3 bindings for seamless integration
- Compatible with pandas DataFrames via efficient conversion
## Performance Monitoring
This project includes comprehensive performance monitoring with flamegraph analysis:
### 📊 Release Performance Reports
View performance metrics and flamegraphs for each release at:
**[Release Benchmarks](https://your-username.github.io/pytemporal/)**
Each version tag automatically generates comprehensive performance documentation with flamegraphs, creating a historical record of performance evolution across releases.
### 🔥 Generating Flamegraphs Locally
```bash
# Generate flamegraphs for key benchmarks
cargo bench --bench bitemporal_benchmarks medium_dataset -- --profile-time 5
cargo bench --bench bitemporal_benchmarks conflation_effectiveness -- --profile-time 5
cargo bench --bench bitemporal_benchmarks "scaling_by_dataset_size/records/500000" -- --profile-time 5
# Add flamegraph links to HTML reports
python3 scripts/add_flamegraphs_to_html.py
# View reports locally
python3 -m http.server 8000 --directory target/criterion
# Then visit: http://localhost:8000/report/
```
### 📈 Performance Expectations
| Dataset Size | Processing Time | Flamegraph Available |
|--------------|----------------|---------------------|
| Small (5 records) | ~30-35 µs | ❌ |
| Medium (100 records) | ~165-170 µs | ✅ |
| Large (500k records) | ~900-950 ms | ✅ |
| Conflation test | ~28 µs | ✅ |
### 🎯 Key Optimization Areas (from Flamegraph Analysis)
- **`process_id_timeline`**: Core algorithm logic
- **Rayon parallelization**: Thread management overhead
- **Arrow operations**: Columnar data processing
- **SHA256 hashing**: Value fingerprinting for conflation
See `docs/benchmark-publishing.md` for complete setup details.
## Contributing
1. Check `CLAUDE.md` for project context and conventions
2. Run tests before submitting changes
3. Follow existing code style and patterns
4. Update benchmarks for performance-related changes
5. Use flamegraphs to validate performance improvements
6. Maintain modular architecture when adding features
## License
MIT License - see LICENSE file for details.
## Built With
- [Apache Arrow](https://arrow.apache.org/) - Columnar data format
- [PyO3](https://pyo3.rs/) - Rust-Python bindings
- [Rayon](https://github.com/rayon-rs/rayon) - Data parallelism
- [Criterion](https://github.com/bheisler/criterion.rs) - Benchmarking
- [SHA256](https://en.wikipedia.org/wiki/SHA-2) - Cryptographic hashing algorithm (client-compatible)