exness-data-preprocess

Name: exness-data-preprocess
Version: 0.5.0
Summary: Professional Exness forex tick data preprocessing with optimal compression (Parquet Zstd-22) and DuckDB OHLC generation. Provides efficient storage (9% smaller than ZIP) with lossless precision and direct queryability.
Upload time: 2025-10-21 03:32:40
Requires Python: >=3.9
License: MIT
Keywords: backtesting, compression, data-engineering, duckdb, efficient-storage, eurusd, exness, financial-data, forex, forex-data, lossless, microstructure, ohlc, pandas, parquet, preprocessing, pyarrow, quantitative-finance, queryable, tick-data, time-series, zstd
# Exness Data Preprocess v2.0.0

[![PyPI version](https://img.shields.io/pypi/v/exness-data-preprocess.svg)](https://pypi.org/project/exness-data-preprocess/)
[![Python versions](https://img.shields.io/pypi/pyversions/exness-data-preprocess.svg)](https://pypi.org/project/exness-data-preprocess/)
[![License](https://img.shields.io/pypi/l/exness-data-preprocess.svg)](https://github.com/terrylica/exness-data-preprocess/blob/main/LICENSE)
[![CI](https://github.com/terrylica/exness-data-preprocess/workflows/CI/badge.svg)](https://github.com/terrylica/exness-data-preprocess/actions)
[![Downloads](https://img.shields.io/pypi/dm/exness-data-preprocess.svg)](https://pypi.org/project/exness-data-preprocess/)
[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)

Professional forex tick data preprocessing with unified single-file DuckDB storage. Provides incremental updates, dual-variant storage (Raw_Spread + Standard), and a Phase7 30-column OHLC schema (v1.6.0) with trading-hour detection for 10 global exchange sessions and sub-15ms query performance.

## Features

- **Unified Single-File Architecture**: One DuckDB file per instrument (eurusd.duckdb)
- **Incremental Updates**: Automatic gap detection; downloads only the missing months
- **Dual-Variant Storage**: Raw_Spread (primary) + Standard (reference) in same database
- **Phase7 OHLC Schema**: 30-column bars (v1.6.0) with dual spreads, tick counts, normalized metrics, and 10 global exchange sessions with trading hour detection
- **High Performance**: Incremental OHLC generation (7.3x speedup), vectorized session detection (2.2x speedup), SQL gap detection with complete coverage
- **Fast Queries**: Date range queries with sub-15ms performance
- **On-Demand Resampling**: Any timeframe (5m, 1h, 1d) resampled in <15ms
- **PRIMARY KEY Constraints**: Prevents duplicate data during incremental updates
- **Simple API**: Clean Python API for all operations

## Installation

```bash
# From PyPI (when published)
pip install exness-data-preprocess

# From source
git clone https://github.com/Eon-Labs/exness-data-preprocess.git
cd exness-data-preprocess
pip install -e .

# Using uv (recommended)
uv pip install exness-data-preprocess
```

## Quick Start

### Python API

```python
import exness_data_preprocess as edp

# Initialize processor
processor = edp.ExnessDataProcessor(base_dir="~/eon/exness-data")

# Download 3 years of EURUSD data (automatic gap detection)
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
    delete_zip=True,
)

print(f"Months added:  {result['months_added']}")
print(f"Raw ticks:     {result['raw_ticks_added']:,}")
print(f"Standard ticks: {result['standard_ticks_added']:,}")
print(f"OHLC bars:     {result['ohlc_bars']:,}")
print(f"Database size: {result['duckdb_size_mb']:.2f} MB")

# Query 1-minute OHLC bars for January 2024
df_1m = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1m",
    start_date="2024-01-01",
    end_date="2024-01-31",
)
print(df_1m.head())

# Query raw tick data for September 2024
df_ticks = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
)
print(f"Ticks: {len(df_ticks):,}")
```

## Architecture v2.0.0

### Data Flow

```
Exness Public Repository (monthly ZIPs, both variants)
           ↓
    Automatic Gap Detection
           ↓
Download Only Missing Months (Raw_Spread + Standard)
           ↓
DuckDB Single-File Storage (PRIMARY KEY prevents duplicates)
           ↓
Phase7 30-Column OHLC Generation (v1.6.0 - dual spreads, tick counts, normalized metrics, 10 global exchange sessions with trading hour detection)
           ↓
Query Interface (date ranges, SQL filters, on-demand resampling)
```

### Storage Format

**Single File Per Instrument**: `~/eon/exness-data/eurusd.duckdb`

**Schema**:

- `raw_spread_ticks` table: Timestamp (PK), Bid, Ask
- `standard_ticks` table: Timestamp (PK), Bid, Ask
- `ohlc_1m` table: Phase7 30-column schema (v1.6.0)
- `metadata` table: Coverage tracking

**Phase7 30-Column OHLC (v1.6.0)**:

- **Column Definitions**: See [`schema.py`](src/exness_data_preprocess/schema.py) - Single source of truth
- **Comprehensive Reference**: See [`DATABASE_SCHEMA.md`](docs/DATABASE_SCHEMA.md) - Query examples and usage patterns
- **Key Features**: BID-only OHLC with dual spreads (Raw_Spread + Standard), normalized spread metrics, and 10 global exchange sessions with trading hour detection (XNYS, XLON, XSWX, XFRA, XTSE, XNZE, XTKS, XASX, XHKG, XSES)

### Directory Structure

**Default Location**: `~/eon/exness-data/` (outside project workspace)

```
~/eon/exness-data/
├── eurusd.duckdb      # Single file for all EURUSD data
├── gbpusd.duckdb      # Single file for all GBPUSD data
├── xauusd.duckdb      # Single file for all XAUUSD data
└── temp/
    └── (temporary ZIP files)
```

**Why Single-File Per Instrument?**

- **Unified Storage**: All years in one database
- **Incremental Updates**: Automatic gap detection; downloads only the missing months
- **No Duplicates**: PRIMARY KEY constraints prevent duplicate data
- **Fast Queries**: Date range queries with sub-15ms performance
- **Scalability**: Multi-year data in ~2 GB per instrument (3 years)

## Usage Examples

### Example 1: Initial Download and Incremental Updates

```python
import exness_data_preprocess as edp

processor = edp.ExnessDataProcessor()

# Initial download (3-year history)
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
    delete_zip=True,
)

# Run again - only downloads new months since last update
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
)
print(f"Months added: {result['months_added']} (0 if up to date)")
```

### Example 2: Check Data Coverage

```python
coverage = processor.get_data_coverage("EURUSD")

print(f"Database exists: {coverage['database_exists']}")
print(f"Raw_Spread ticks: {coverage['raw_spread_ticks']:,}")
print(f"Standard ticks:  {coverage['standard_ticks']:,}")
print(f"OHLC bars:       {coverage['ohlc_bars']:,}")
print(f"Date range:      {coverage['earliest_date']} to {coverage['latest_date']}")
print(f"Days covered:    {coverage['date_range_days']}")
print(f"Database size:   {coverage['duckdb_size_mb']:.2f} MB")
```

### Example 3: Query OHLC with Date Ranges

```python
# Query 1-minute bars for January 2024
df_1m = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1m",
    start_date="2024-01-01",
    end_date="2024-01-31",
)

# Query 1-hour bars for Q1 2024 (resampled on-demand)
df_1h = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1h",
    start_date="2024-01-01",
    end_date="2024-03-31",
)

# Query daily bars for entire 2024
df_1d = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1d",
    start_date="2024-01-01",
    end_date="2024-12-31",
)

print(f"1m bars: {len(df_1m):,}")
print(f"1h bars: {len(df_1h):,}")
print(f"1d bars: {len(df_1d):,}")
```

### Example 4: Query Ticks with Date Ranges

```python
# Query Raw_Spread ticks for September 2024
df_raw = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
)

print(f"Raw_Spread ticks: {len(df_raw):,}")
print(f"Columns: {list(df_raw.columns)}")

# Calculate spread statistics
df_raw['Spread'] = df_raw['Ask'] - df_raw['Bid']
print(f"Mean spread: {df_raw['Spread'].mean() * 10000:.4f} pips")
print(f"Zero-spreads: {((df_raw['Spread'] == 0).sum() / len(df_raw) * 100):.2f}%")
```

### Example 5: Query with SQL Filters

```python
# Query only zero-spread ticks
df_zero = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-01",
    filter_sql="Bid = Ask",
)
print(f"Zero-spread ticks: {len(df_zero):,}")

# Query high-price ticks
df_high = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
    filter_sql="Bid > 1.11",
)
print(f"High-price ticks: {len(df_high):,}")
```

### Example 6: Process Multiple Instruments

```python
processor = edp.ExnessDataProcessor()

# Process multiple pairs
pairs = ["EURUSD", "GBPUSD", "XAUUSD"]

for pair in pairs:
    print(f"Processing {pair}...")
    result = processor.update_data(
        pair=pair,
        start_date="2023-01-01",
        delete_zip=True,
    )
    print(f"  Months added: {result['months_added']}")
    print(f"  Database size: {result['duckdb_size_mb']:.2f} MB")
```

### Example 7: Parallel Processing

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_instrument(pair, start_date):
    processor = edp.ExnessDataProcessor()
    return processor.update_data(pair=pair, start_date=start_date, delete_zip=True)

instruments = [
    ("EURUSD", "2023-01-01"),
    ("GBPUSD", "2023-01-01"),
    ("XAUUSD", "2023-01-01"),
    ("USDJPY", "2023-01-01"),
]

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {
        executor.submit(process_instrument, pair, start_date): pair
        for pair, start_date in instruments
    }

    for future in as_completed(futures):
        pair = futures[future]
        result = future.result()
        print(f"{pair}: {result['months_added']} months added")
```

## Development

### Setup

```bash
# Clone repository
git clone https://github.com/Eon-Labs/exness-data-preprocess.git
cd exness-data-preprocess

# Install with development dependencies (using uv)
uv sync --dev

# Or with pip
pip install -e ".[dev]"
```

### Testing

```bash
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=exness_data_preprocess --cov-report=html

# Run specific test
uv run pytest tests/test_processor.py -v
```

### Code Quality

```bash
# Format code
uv run ruff format .

# Lint
uv run ruff check --fix .

# Type checking
uv run mypy src/
```

### Building

```bash
# Build package
uv build

# Test installation locally
uv tool install --editable .
```

## Data Source

Data is sourced from Exness's public tick data repository:

- **URL**: https://ticks.ex2archive.com/
- **Format**: Monthly ZIP files with CSV tick data
- **Variants**: Raw_Spread (zero-spreads) + Standard (market spreads)
- **Content**: Timestamp, Bid, Ask prices for major forex pairs
- **Quality**: Institutional ECN/STP data with microsecond precision

## Technical Specifications

### Database Size (3-Year History, EURUSD)

| Metric           | Value                    |
| ---------------- | ------------------------ |
| Raw_Spread ticks | ~18.6M                   |
| Standard ticks   | ~19.6M                   |
| OHLC bars (1m)   | ~413K                    |
| Database size    | ~2.08 GB                 |
| Date range       | 2022-01-01 to 2025-01-10 |

### Query Performance

| Operation                  | Time  |
| -------------------------- | ----- |
| Query 880K ticks (1 month) | <15ms |
| Query 1m OHLC (1 month)    | <10ms |
| Resample to 1h (1 month)   | <15ms |
| Resample to 1d (1 year)    | <20ms |

### Architecture Benefits

| Feature                    | Benefit                                            |
| -------------------------- | -------------------------------------------------- |
| Single file per instrument | Unified storage, no file fragmentation             |
| PRIMARY KEY constraints    | Prevents duplicates during incremental updates     |
| Automatic gap detection    | Download only missing months                       |
| Dual-variant storage       | Raw_Spread + Standard in same database             |
| Phase7 OHLC schema         | Dual spreads + dual tick counts                    |
| Date range queries         | Efficient filtering without loading entire dataset |
| On-demand resampling       | Any timeframe in <15ms                             |
| SQL filter support         | Direct SQL WHERE clauses on ticks                  |

### Performance Optimizations (v0.5.0)

**Incremental OHLC Generation** - 7.3x speedup for updates:
- Full regeneration: 8.05s (303K bars, 7 months)
- Incremental update: 1.10s (43K new bars, 1 month)
- Implementation: Optional date-range parameters for partial regeneration
- Validation: [`docs/validation/SPIKE_TEST_RESULTS_PHASE1_2025-10-18.md`](docs/validation/SPIKE_TEST_RESULTS_PHASE1_2025-10-18.md)
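The core of the idea can be sketched in a few lines: restrict bar (re)generation to the timestamps inside the update window, so a one-month update never touches bars for months already stored. The function below is a dependency-free illustration, not the package API; column semantics follow the BID-only OHLC described above.

```python
# Hypothetical sketch of incremental OHLC generation: only minutes inside
# [start, end) are rebuilt, which is why a one-month update is far cheaper
# than regenerating the full history. Not the package's actual API.

def build_ohlc(ticks, start, end):
    """ticks: list of (timestamp_str, bid); returns bars keyed by minute."""
    bars = {}
    for ts, bid in ticks:
        if not (start <= ts < end):
            continue  # skip ticks outside the update window
        minute = ts[:16]  # 'YYYY-MM-DD HH:MM'
        if minute not in bars:
            bars[minute] = {"open": bid, "high": bid, "low": bid, "close": bid}
        else:
            b = bars[minute]
            b["high"] = max(b["high"], bid)
            b["low"] = min(b["low"], bid)
            b["close"] = bid
    return bars

ticks = [
    ("2024-06-03 10:00:00.100", 1.0841),
    ("2024-06-03 10:00:30.500", 1.0845),
    ("2024-07-01 09:00:00.250", 1.0712),
]
# Incremental pass touches July only; June bars stay as previously stored.
july = build_ohlc(ticks, "2024-07-01", "2024-08-01")
print(sorted(july))  # ['2024-07-01 09:00']
```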

**Vectorized Session Detection** - 2.2x speedup for trading hour detection:
- Current approach: 5.99s (302K bars, 10 exchanges)
- Vectorized approach: 2.69s (302K bars, 10 exchanges)
- Combined Phase 1+2: ~16x total speedup (8.05s → 0.50s)
- Implementation: Pre-compute trading minutes, vectorized `.isin()` lookup
- SSoT: [`docs/PHASE2_SESSION_VECTORIZATION_PLAN.yaml`](docs/PHASE2_SESSION_VECTORIZATION_PLAN.yaml)
- Validation: [`docs/validation/SPIKE_TEST_RESULTS_PHASE2_2025-10-18.md`](docs/validation/SPIKE_TEST_RESULTS_PHASE2_2025-10-18.md)
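The vectorized approach can be illustrated with pandas: pre-compute the set of trading minutes for an exchange once, then flag every bar in a single `.isin()` membership test instead of checking each timestamp in a Python loop. The session hours below are illustrative, not the package's actual exchange calendars.

```python
import pandas as pd

# Sketch of vectorized session detection: pre-compute trading minutes,
# then one .isin() call flags all bars at once. Session hours here are
# illustrative, not the package's exchange-calendar data.
bars = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2024-06-03 13:30", "2024-06-03 19:59", "2024-06-03 21:00",
    ])
})

# Pre-computed trading minutes for a hypothetical 13:30-20:00 UTC session.
session_minutes = pd.date_range("2024-06-03 13:30", "2024-06-03 19:59", freq="min")

# One vectorized membership test per exchange, not one check per bar.
bars["session_xnys"] = bars["Timestamp"].isin(session_minutes).astype(int)
print(bars["session_xnys"].tolist())  # [1, 1, 0]
```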

**SQL Gap Detection** - Complete coverage with 46% code reduction:
- Bug fix: Python approach missed internal gaps (41 detected vs 42 actual)
- SQL EXCEPT operator detects ALL gaps (before + within + after existing data)
- Code reduced from 62 lines to 34 lines (46% reduction)
- SSoT: [`docs/PHASE3_SQL_GAP_DETECTION_PLAN.yaml`](docs/PHASE3_SQL_GAP_DETECTION_PLAN.yaml)
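The EXCEPT technique is plain SQL: the set of expected months minus the set of covered months is exactly the set of gaps, wherever they fall. The sketch below uses stdlib `sqlite3` as a stand-in for DuckDB so it runs anywhere; the table names are illustrative.

```python
import sqlite3

# Illustration of SQL EXCEPT gap detection (sqlite3 standing in for
# DuckDB): expected months EXCEPT covered months yields every missing
# month -- before, inside, and after the existing coverage.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE expected(month TEXT);
    CREATE TABLE covered(month TEXT);
    INSERT INTO expected VALUES ('2024-01'),('2024-02'),('2024-03'),('2024-04');
    INSERT INTO covered  VALUES ('2024-01'),('2024-03');
""")
gaps = [row[0] for row in con.execute("""
    SELECT month FROM expected
    EXCEPT
    SELECT month FROM covered
    ORDER BY month
""")]
print(gaps)  # ['2024-02', '2024-04'] -- catches the internal gap too
```

Note how `2024-02` (an internal gap) is caught, the case the earlier Python-only approach missed.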

**Release Notes**: See [`CHANGELOG.md`](CHANGELOG.md) for complete v0.5.0 details

## API Reference

### ExnessDataProcessor

```python
processor = edp.ExnessDataProcessor(base_dir="~/eon/exness-data")
```

**Methods**:

- `update_data(pair, start_date, force_redownload=False, delete_zip=True)` - Update database with latest data
- `query_ohlc(pair, timeframe, start_date=None, end_date=None)` - Query OHLC bars
- `query_ticks(pair, variant, start_date=None, end_date=None, filter_sql=None)` - Query tick data
- `get_data_coverage(pair)` - Get coverage information

**Parameters**:

- `pair` (str): Currency pair (e.g., "EURUSD", "GBPUSD", "XAUUSD")
- `timeframe` (str): OHLC timeframe ("1m", "5m", "15m", "1h", "4h", "1d")
- `variant` (str): Tick variant ("raw_spread" or "standard")
- `start_date` (str): Start date in "YYYY-MM-DD" format
- `end_date` (str): End date in "YYYY-MM-DD" format
- `filter_sql` (str): SQL WHERE clause (e.g., "Bid > 1.11 AND Ask < 1.12")

## Migration from v1.0.0

**v1.0.0 (Legacy)**:

- Monthly DuckDB files: `eurusd_ohlc_2024_08.duckdb`
- Parquet tick storage: `eurusd_ticks_2024_08.parquet`
- Functions: `process_month()`, `process_date_range()`, `analyze_ticks()`

**v2.0.0 (Current)**:

- Single DuckDB file: `eurusd.duckdb`
- No Parquet files (everything in DuckDB)
- Unified API: `processor.update_data()`, `processor.query_ohlc()`, `processor.query_ticks()`

**Migration Steps**:

1. Run `processor.update_data(pair, start_date)` to create the new unified database
2. Delete old monthly files: `rm eurusd_ohlc_2024_*.duckdb eurusd_ticks_2024_*.parquet`
3. Update code to use new API methods

## License

MIT License - see LICENSE file for details.

## Authors

- Terry Li <terry@eonlabs.com>
- Eon Labs

## Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

## Acknowledgments

- Exness for providing high-quality public tick data
- DuckDB for embedded OLAP capabilities with sub-15ms query performance

## Additional Documentation

**[📚 Complete Documentation Hub](DOCUMENTATION.md)** - Organized guide from beginner to advanced (72+ documents)

- **Basic Usage Examples**: See `examples/basic_usage.py`
- **Batch Processing**: See `examples/batch_processing.py`
- **Architecture Details**: See `docs/UNIFIED_DUCKDB_PLAN_v2.md`
- **Unit Tests**: See `tests/` directory

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "exness-data-preprocess",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "Terry Li <terry@eonlabs.com>",
    "keywords": "backtesting, compression, data-engineering, duckdb, efficient-storage, eurusd, exness, financial-data, forex, forex-data, lossless, microstructure, ohlc, pandas, parquet, preprocessing, pyarrow, quantitative-finance, queryable, tick-data, time-series, zstd",
    "author": null,
    "author_email": "Eon Labs <terry@eonlabs.com>",
    "download_url": "https://files.pythonhosted.org/packages/06/35/c42990506acd4649cef7acb96bb716225909190a930f4ea9b2588d35d187/exness_data_preprocess-0.5.0.tar.gz",
    "platform": null,
    "description": "# Exness Data Preprocess v2.0.0\n\n[![PyPI version](https://img.shields.io/pypi/v/exness-data-preprocess.svg)](https://pypi.org/project/exness-data-preprocess/)\n[![Python versions](https://img.shields.io/pypi/pyversions/exness-data-preprocess.svg)](https://pypi.org/project/exness-data-preprocess/)\n[![License](https://img.shields.io/pypi/l/exness-data-preprocess.svg)](https://github.com/terrylica/exness-data-preprocess/blob/main/LICENSE)\n[![CI](https://github.com/terrylica/exness-data-preprocess/workflows/CI/badge.svg)](https://github.com/terrylica/exness-data-preprocess/actions)\n[![Downloads](https://img.shields.io/pypi/dm/exness-data-preprocess.svg)](https://pypi.org/project/exness-data-preprocess/)\n[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)\n\nProfessional forex tick data preprocessing with unified single-file DuckDB storage. Provides incremental updates, dual-variant storage (Raw_Spread + Standard), and Phase7 30-column OHLC schema (v1.6.0) with 10 global exchange sessions (trading hour detection) and sub-15ms query performance.\n\n## Features\n\n- **Unified Single-File Architecture**: One DuckDB file per instrument (eurusd.duckdb)\n- **Incremental Updates**: Automatic gap detection and download only missing months\n- **Dual-Variant Storage**: Raw_Spread (primary) + Standard (reference) in same database\n- **Phase7 OHLC Schema**: 30-column bars (v1.6.0) with dual spreads, tick counts, normalized metrics, and 10 global exchange sessions with trading hour detection\n- **High Performance**: Incremental OHLC generation (7.3x speedup), vectorized session detection (2.2x speedup), SQL gap detection with complete coverage\n- **Fast Queries**: Date range queries with sub-15ms performance\n- **On-Demand Resampling**: Any timeframe (5m, 1h, 1d) resampled in <15ms\n- **PRIMARY KEY Constraints**: Prevents duplicate data during incremental updates\n- **Simple API**: Clean Python 
API for all operations\n\n## Installation\n\n```bash\n# From PyPI (when published)\npip install exness-data-preprocess\n\n# From source\ngit clone https://github.com/Eon-Labs/exness-data-preprocess.git\ncd exness-data-preprocess\npip install -e .\n\n# Using uv (recommended)\nuv pip install exness-data-preprocess\n```\n\n## Quick Start\n\n### Python API\n\n```python\nimport exness_data_preprocess as edp\n\n# Initialize processor\nprocessor = edp.ExnessDataProcessor(base_dir=\"~/eon/exness-data\")\n\n# Download 3 years of EURUSD data (automatic gap detection)\nresult = processor.update_data(\n    pair=\"EURUSD\",\n    start_date=\"2022-01-01\",\n    delete_zip=True,\n)\n\nprint(f\"Months added:  {result['months_added']}\")\nprint(f\"Raw ticks:     {result['raw_ticks_added']:,}\")\nprint(f\"Standard ticks: {result['standard_ticks_added']:,}\")\nprint(f\"OHLC bars:     {result['ohlc_bars']:,}\")\nprint(f\"Database size: {result['duckdb_size_mb']:.2f} MB\")\n\n# Query 1-minute OHLC bars for January 2024\ndf_1m = processor.query_ohlc(\n    pair=\"EURUSD\",\n    timeframe=\"1m\",\n    start_date=\"2024-01-01\",\n    end_date=\"2024-01-31\",\n)\nprint(df_1m.head())\n\n# Query raw tick data for September 2024\ndf_ticks = processor.query_ticks(\n    pair=\"EURUSD\",\n    variant=\"raw_spread\",\n    start_date=\"2024-09-01\",\n    end_date=\"2024-09-30\",\n)\nprint(f\"Ticks: {len(df_ticks):,}\")\n```\n\n## Architecture v2.0.0\n\n### Data Flow\n\n```\nExness Public Repository (monthly ZIPs, both variants)\n           \u2193\n    Automatic Gap Detection\n           \u2193\nDownload Only Missing Months (Raw_Spread + Standard)\n           \u2193\nDuckDB Single-File Storage (PRIMARY KEY prevents duplicates)\n           \u2193\nPhase7 30-Column OHLC Generation (v1.6.0 - dual spreads, tick counts, normalized metrics, 10 global exchange sessions with trading hour detection)\n           \u2193\nQuery Interface (date ranges, SQL filters, on-demand resampling)\n```\n\n### Storage 
Format\n\n**Single File Per Instrument**: `~/eon/exness-data/eurusd.duckdb`\n\n**Schema**:\n\n- `raw_spread_ticks` table: Timestamp (PK), Bid, Ask\n- `standard_ticks` table: Timestamp (PK), Bid, Ask\n- `ohlc_1m` table: Phase7 30-column schema (v1.6.0)\n- `metadata` table: Coverage tracking\n\n**Phase7 30-Column OHLC (v1.6.0)**:\n\n- **Column Definitions**: See [`schema.py`](src/exness_data_preprocess/schema.py) - Single source of truth\n- **Comprehensive Reference**: See [`DATABASE_SCHEMA.md`](docs/DATABASE_SCHEMA.md) - Query examples and usage patterns\n- **Key Features**: BID-only OHLC with dual spreads (Raw_Spread + Standard), normalized spread metrics, and 10 global exchange sessions with trading hour detection (XNYS, XLON, XSWX, XFRA, XTSE, XNZE, XTKS, XASX, XHKG, XSES)\n\n### Directory Structure\n\n**Default Location**: `~/eon/exness-data/` (outside project workspace)\n\n```\n~/eon/exness-data/\n\u251c\u2500\u2500 eurusd.duckdb      # Single file for all EURUSD data\n\u251c\u2500\u2500 gbpusd.duckdb      # Single file for all GBPUSD data\n\u251c\u2500\u2500 xauusd.duckdb      # Single file for all XAUUSD data\n\u2514\u2500\u2500 temp/\n    \u2514\u2500\u2500 (temporary ZIP files)\n```\n\n**Why Single-File Per Instrument?**\n\n- **Unified Storage**: All years in one database\n- **Incremental Updates**: Automatic gap detection and download only missing months\n- **No Duplicates**: PRIMARY KEY constraints prevent duplicate data\n- **Fast Queries**: Date range queries with sub-15ms performance\n- **Scalability**: Multi-year data in ~2 GB per instrument (3 years)\n\n## Usage Examples\n\n### Example 1: Initial Download and Incremental Updates\n\n```python\nimport exness_data_preprocess as edp\n\nprocessor = edp.ExnessDataProcessor()\n\n# Initial download (3-year history)\nresult = processor.update_data(\n    pair=\"EURUSD\",\n    start_date=\"2022-01-01\",\n    delete_zip=True,\n)\n\n# Run again - only downloads new months since last update\nresult = 
processor.update_data(\n    pair=\"EURUSD\",\n    start_date=\"2022-01-01\",\n)\nprint(f\"Months added: {result['months_added']} (0 if up to date)\")\n```\n\n### Example 2: Check Data Coverage\n\n```python\ncoverage = processor.get_data_coverage(\"EURUSD\")\n\nprint(f\"Database exists: {coverage['database_exists']}\")\nprint(f\"Raw_Spread ticks: {coverage['raw_spread_ticks']:,}\")\nprint(f\"Standard ticks:  {coverage['standard_ticks']:,}\")\nprint(f\"OHLC bars:       {coverage['ohlc_bars']:,}\")\nprint(f\"Date range:      {coverage['earliest_date']} to {coverage['latest_date']}\")\nprint(f\"Days covered:    {coverage['date_range_days']}\")\nprint(f\"Database size:   {coverage['duckdb_size_mb']:.2f} MB\")\n```\n\n### Example 3: Query OHLC with Date Ranges\n\n```python\n# Query 1-minute bars for January 2024\ndf_1m = processor.query_ohlc(\n    pair=\"EURUSD\",\n    timeframe=\"1m\",\n    start_date=\"2024-01-01\",\n    end_date=\"2024-01-31\",\n)\n\n# Query 1-hour bars for Q1 2024 (resampled on-demand)\ndf_1h = processor.query_ohlc(\n    pair=\"EURUSD\",\n    timeframe=\"1h\",\n    start_date=\"2024-01-01\",\n    end_date=\"2024-03-31\",\n)\n\n# Query daily bars for entire 2024\ndf_1d = processor.query_ohlc(\n    pair=\"EURUSD\",\n    timeframe=\"1d\",\n    start_date=\"2024-01-01\",\n    end_date=\"2024-12-31\",\n)\n\nprint(f\"1m bars: {len(df_1m):,}\")\nprint(f\"1h bars: {len(df_1h):,}\")\nprint(f\"1d bars: {len(df_1d):,}\")\n```\n\n### Example 4: Query Ticks with Date Ranges\n\n```python\n# Query Raw_Spread ticks for September 2024\ndf_raw = processor.query_ticks(\n    pair=\"EURUSD\",\n    variant=\"raw_spread\",\n    start_date=\"2024-09-01\",\n    end_date=\"2024-09-30\",\n)\n\nprint(f\"Raw_Spread ticks: {len(df_raw):,}\")\nprint(f\"Columns: {list(df_raw.columns)}\")\n\n# Calculate spread statistics\ndf_raw['Spread'] = df_raw['Ask'] - df_raw['Bid']\nprint(f\"Mean spread: {df_raw['Spread'].mean() * 10000:.4f} pips\")\nprint(f\"Zero-spreads: {((df_raw['Spread'] 
== 0).sum() / len(df_raw) * 100):.2f}%\")\n```\n\n### Example 5: Query with SQL Filters\n\n```python\n# Query only zero-spread ticks\ndf_zero = processor.query_ticks(\n    pair=\"EURUSD\",\n    variant=\"raw_spread\",\n    start_date=\"2024-09-01\",\n    end_date=\"2024-09-01\",\n    filter_sql=\"Bid = Ask\",\n)\nprint(f\"Zero-spread ticks: {len(df_zero):,}\")\n\n# Query high-price ticks\ndf_high = processor.query_ticks(\n    pair=\"EURUSD\",\n    variant=\"raw_spread\",\n    start_date=\"2024-09-01\",\n    end_date=\"2024-09-30\",\n    filter_sql=\"Bid > 1.11\",\n)\nprint(f\"High-price ticks: {len(df_high):,}\")\n```\n\n### Example 6: Process Multiple Instruments\n\n```python\nprocessor = edp.ExnessDataProcessor()\n\n# Process multiple pairs\npairs = [\"EURUSD\", \"GBPUSD\", \"XAUUSD\"]\n\nfor pair in pairs:\n    print(f\"Processing {pair}...\")\n    result = processor.update_data(\n        pair=pair,\n        start_date=\"2023-01-01\",\n        delete_zip=True,\n    )\n    print(f\"  Months added: {result['months_added']}\")\n    print(f\"  Database size: {result['duckdb_size_mb']:.2f} MB\")\n```\n\n### Example 7: Parallel Processing\n\n```python\nfrom concurrent.futures import ThreadPoolExecutor, as_completed\n\ndef process_instrument(pair, start_date):\n    processor = edp.ExnessDataProcessor()\n    return processor.update_data(pair=pair, start_date=start_date, delete_zip=True)\n\ninstruments = [\n    (\"EURUSD\", \"2023-01-01\"),\n    (\"GBPUSD\", \"2023-01-01\"),\n    (\"XAUUSD\", \"2023-01-01\"),\n    (\"USDJPY\", \"2023-01-01\"),\n]\n\nwith ThreadPoolExecutor(max_workers=4) as executor:\n    futures = {\n        executor.submit(process_instrument, pair, start_date): pair\n        for pair, start_date in instruments\n    }\n\n    for future in as_completed(futures):\n        pair = futures[future]\n        result = future.result()\n        print(f\"{pair}: {result['months_added']} months added\")\n```\n\n## Development\n\n### Setup\n\n```bash\n# Clone 
repository\ngit clone https://github.com/Eon-Labs/exness-data-preprocess.git\ncd exness-data-preprocess\n\n# Install with development dependencies (using uv)\nuv sync --dev\n\n# Or with pip\npip install -e \".[dev]\"\n```\n\n### Testing\n\n```bash\n# Run all tests\nuv run pytest\n\n# Run with coverage\nuv run pytest --cov=exness_data_preprocess --cov-report=html\n\n# Run specific test\nuv run pytest tests/test_processor.py -v\n```\n\n### Code Quality\n\n```bash\n# Format code\nuv run ruff format .\n\n# Lint\nuv run ruff check --fix .\n\n# Type checking\nuv run mypy src/\n```\n\n### Building\n\n```bash\n# Build package\nuv build\n\n# Test installation locally\nuv tool install --editable .\n```\n\n## Data Source\n\nData is sourced from Exness's public tick data repository:\n\n- **URL**: https://ticks.ex2archive.com/\n- **Format**: Monthly ZIP files with CSV tick data\n- **Variants**: Raw_Spread (zero-spreads) + Standard (market spreads)\n- **Content**: Timestamp, Bid, Ask prices for major forex pairs\n- **Quality**: Institutional ECN/STP data with microsecond precision\n\n## Technical Specifications\n\n### Database Size (3-Year History, EURUSD)\n\n| Metric           | Value                    |\n| ---------------- | ------------------------ |\n| Raw_Spread ticks | ~18.6M                   |\n| Standard ticks   | ~19.6M                   |\n| OHLC bars (1m)   | ~413K                    |\n| Database size    | ~2.08 GB                 |\n| Date range       | 2022-01-01 to 2025-01-10 |\n\n### Query Performance\n\n| Operation                  | Time  |\n| -------------------------- | ----- |\n| Query 880K ticks (1 month) | <15ms |\n| Query 1m OHLC (1 month)    | <10ms |\n| Resample to 1h (1 month)   | <15ms |\n| Resample to 1d (1 year)    | <20ms |\n\n### Architecture Benefits\n\n| Feature                    | Benefit                                            |\n| -------------------------- | -------------------------------------------------- |\n| Single file per 
instrument | Unified storage, no file fragmentation             |\n| PRIMARY KEY constraints    | Prevents duplicates during incremental updates     |\n| Automatic gap detection    | Download only missing months                       |\n| Dual-variant storage       | Raw_Spread + Standard in same database             |\n| Phase7 OHLC schema         | Dual spreads + dual tick counts                    |\n| Date range queries         | Efficient filtering without loading entire dataset |\n| On-demand resampling       | Any timeframe in <15ms                             |\n| SQL filter support         | Direct SQL WHERE clauses on ticks                  |\n\n### Performance Optimizations (v0.5.0)\n\n**Incremental OHLC Generation** - 7.3x speedup for updates:\n- Full regeneration: 8.05s (303K bars, 7 months)\n- Incremental update: 1.10s (43K new bars, 1 month)\n- Implementation: Optional date-range parameters for partial regeneration\n- Validation: [`docs/validation/SPIKE_TEST_RESULTS_PHASE1_2025-10-18.md`](docs/validation/SPIKE_TEST_RESULTS_PHASE1_2025-10-18.md)\n\n**Vectorized Session Detection** - 2.2x speedup for trading hour detection:\n- Current approach: 5.99s (302K bars, 10 exchanges)\n- Vectorized approach: 2.69s (302K bars, 10 exchanges)\n- Combined Phase 1+2: ~16x total speedup (8.05s \u2192 0.50s)\n- Implementation: Pre-compute trading minutes, vectorized `.isin()` lookup\n- SSoT: [`docs/PHASE2_SESSION_VECTORIZATION_PLAN.yaml`](docs/PHASE2_SESSION_VECTORIZATION_PLAN.yaml)\n- Validation: [`docs/validation/SPIKE_TEST_RESULTS_PHASE2_2025-10-18.md`](docs/validation/SPIKE_TEST_RESULTS_PHASE2_2025-10-18.md)\n\n**SQL Gap Detection** - Complete coverage with 46% code reduction:\n- Bug fix: Python approach missed internal gaps (41 detected vs 42 actual)\n- SQL EXCEPT operator detects ALL gaps (before + within + after existing data)\n- Code reduced from 62 lines to 34 lines (46% reduction)\n- SSoT: 
[`docs/PHASE3_SQL_GAP_DETECTION_PLAN.yaml`](docs/PHASE3_SQL_GAP_DETECTION_PLAN.yaml)\n\n**Release Notes**: See [`CHANGELOG.md`](CHANGELOG.md) for complete v0.5.0 details\n\n## API Reference\n\n### ExnessDataProcessor\n\n```python\nprocessor = edp.ExnessDataProcessor(base_dir=\"~/eon/exness-data\")\n```\n\n**Methods**:\n\n- `update_data(pair, start_date, force_redownload=False, delete_zip=True)` - Update database with latest data\n- `query_ohlc(pair, timeframe, start_date=None, end_date=None)` - Query OHLC bars\n- `query_ticks(pair, variant, start_date=None, end_date=None, filter_sql=None)` - Query tick data\n- `get_data_coverage(pair)` - Get coverage information\n\n**Parameters**:\n\n- `pair` (str): Currency pair (e.g., \"EURUSD\", \"GBPUSD\", \"XAUUSD\")\n- `timeframe` (str): OHLC timeframe (\"1m\", \"5m\", \"15m\", \"1h\", \"4h\", \"1d\")\n- `variant` (str): Tick variant (\"raw_spread\" or \"standard\")\n- `start_date` (str): Start date in \"YYYY-MM-DD\" format\n- `end_date` (str): End date in \"YYYY-MM-DD\" format\n- `filter_sql` (str): SQL WHERE clause (e.g., \"Bid > 1.11 AND Ask < 1.12\")\n\n## Migration from v1.0.0\n\n**v1.0.0 (Legacy)**:\n\n- Monthly DuckDB files: `eurusd_ohlc_2024_08.duckdb`\n- Parquet tick storage: `eurusd_ticks_2024_08.parquet`\n- Functions: `process_month()`, `process_date_range()`, `analyze_ticks()`\n\n**v2.0.0 (Current)**:\n\n- Single DuckDB file: `eurusd.duckdb`\n- No Parquet files (everything in DuckDB)\n- Unified API: `processor.update_data()`, `processor.query_ohlc()`, `processor.query_ticks()`\n\n**Migration Steps**:\n\n1. Run `processor.update_data(pair, start_date)` to create new unified database\n2. Delete old monthly files: `rm eurusd_ohlc_2024_*.duckdb eurusd_ticks_2024_*.parquet`\n3. Update code to use new API methods\n\n## License\n\nMIT License - see LICENSE file for details.\n\n## Authors\n\n- Terry Li <terry@eonlabs.com>\n- Eon Labs\n\n## Contributing\n\nContributions are welcome! 
Please see CONTRIBUTING.md for guidelines.

## Acknowledgments

- Exness for providing high-quality public tick data
- DuckDB for embedded OLAP capabilities with sub-15ms query performance

## Additional Documentation

**[📚 Complete Documentation Hub](DOCUMENTATION.md)** - Organized guide from beginner to advanced (72+ documents)

- **Basic Usage Examples**: See `examples/basic_usage.py`
- **Batch Processing**: See `examples/batch_processing.py`
- **Architecture Details**: See `docs/UNIFIED_DUCKDB_PLAN_v2.md`
- **Unit Tests**: See the `tests/` directory
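The incremental OHLC generation described in the Performance Optimizations section hinges on optional date-range parameters: only the new month's bars are recomputed. A minimal sketch of that idea with pandas (the `Bid`/`Ask` column names follow the tick schema in the API reference; `ohlc_for_range` is an illustrative helper, not the package's API):

```python
import pandas as pd

def ohlc_for_range(ticks: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    """Resample mid-price ticks to 1-minute OHLC, restricted to [start, end).

    Limiting resampling to the newly downloaded range is what makes the
    incremental update cheap: previously generated bars are never recomputed.
    """
    window = ticks.loc[(ticks.index >= start) & (ticks.index < end)]
    mid = (window["Bid"] + window["Ask"]) / 2
    return mid.resample("1min").ohlc().dropna()

# Two months of synthetic ticks; only September is (re)generated.
idx = pd.date_range("2024-08-01", "2024-09-30 23:59", freq="30s")
ticks = pd.DataFrame({"Bid": 1.10, "Ask": 1.1002}, index=idx)
new_bars = ohlc_for_range(ticks, "2024-09-01", "2024-10-01")
```

The August bars are left untouched; in the package this corresponds to appending only the new month's rows to the existing OHLC table.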
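The vectorized session detection ("pre-compute trading minutes, vectorized `.isin()` lookup") can be sketched as follows. The session table below is made up, uses minutes-of-day in UTC, and ignores weekends, holidays, and DST, all of which a real implementation must handle:

```python
import pandas as pd

# Hypothetical session table: (open, close) as minutes-of-day, UTC.
SESSIONS = {"NYSE": (14 * 60 + 30, 21 * 60), "LSE": (8 * 60, 16 * 60 + 30)}

def session_flags(index: pd.DatetimeIndex) -> pd.DataFrame:
    """Flag, per bar, whether each exchange session is open.

    Pre-computing the open minutes-of-day and testing membership with the
    vectorized .isin() replaces a per-bar Python loop over exchanges.
    """
    minute_of_day = pd.Series(index.hour * 60 + index.minute, index=index)
    flags = {}
    for name, (open_m, close_m) in SESSIONS.items():
        open_minutes = range(open_m, close_m)  # pre-computed trading minutes
        flags[name] = minute_of_day.isin(open_minutes)
    return pd.DataFrame(flags)

bars = pd.date_range("2024-09-02 13:00", periods=240, freq="1min")
flags = session_flags(bars)
```

One `.isin()` call per exchange scans all bars at once, which is where the reported 2.2x speedup over per-bar checks comes from.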
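The `EXCEPT`-based gap detection fits in a few lines of SQL. The sketch below uses SQLite for portability (the package runs the equivalent query in DuckDB) with a made-up coverage table containing an internal hole, the case the old first-to-last Python scan missed:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE coverage (month TEXT PRIMARY KEY)")
# Existing data has an internal hole (2024-04) as well as a trailing gap.
con.executemany(
    "INSERT INTO coverage VALUES (?)",
    [("2024-01",), ("2024-02",), ("2024-03",), ("2024-05",)],
)
con.execute("CREATE TABLE expected (month TEXT PRIMARY KEY)")
con.executemany(
    "INSERT INTO expected VALUES (?)",
    [("2024-0%d" % m,) for m in range(1, 8)],  # 2024-01 .. 2024-07
)

# EXCEPT returns every expected month absent from coverage -- internal
# holes included, not just months after the last stored one.
gaps = [row[0] for row in con.execute(
    "SELECT month FROM expected EXCEPT SELECT month FROM coverage ORDER BY month"
)]
print(gaps)  # ['2024-04', '2024-06', '2024-07']
```

The set-difference semantics of `EXCEPT` are what guarantee complete coverage (before, within, and after existing data) with far less code than an explicit scan.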
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Professional Exness forex tick data preprocessing with optimal compression (Parquet Zstd-22) and DuckDB OHLC generation. Provides efficient storage (9% smaller than ZIP) with lossless precision and direct queryability.",
    "version": "0.5.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/terrylica/exness-data-preprocess/issues",
        "Changelog": "https://github.com/terrylica/exness-data-preprocess/blob/main/CHANGELOG.md",
        "Documentation": "https://github.com/terrylica/exness-data-preprocess#readme",
        "Homepage": "https://github.com/terrylica/exness-data-preprocess",
        "Repository": "https://github.com/terrylica/exness-data-preprocess.git",
        "Source Code": "https://github.com/terrylica/exness-data-preprocess"
    },
    "split_keywords": [
        "backtesting",
        " compression",
        " data-engineering",
        " duckdb",
        " efficient-storage",
        " eurusd",
        " exness",
        " financial-data",
        " forex",
        " forex-data",
        " lossless",
        " microstructure",
        " ohlc",
        " pandas",
        " parquet",
        " preprocessing",
        " pyarrow",
        " quantitative-finance",
        " queryable",
        " tick-data",
        " time-series",
        " zstd"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "3a38e1e7841231b33f0389b863f8b77307a93b2b14d9e2f2133fa05d143ef46f",
                "md5": "73f79fb35115f29bfc3a58df5891588b",
                "sha256": "53aa348e64551f66d07ff83e221221a489305d0ced3d07d003e0a8c582351116"
            },
            "downloads": -1,
            "filename": "exness_data_preprocess-0.5.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "73f79fb35115f29bfc3a58df5891588b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 42973,
            "upload_time": "2025-10-21T03:32:38",
            "upload_time_iso_8601": "2025-10-21T03:32:38.662258Z",
            "url": "https://files.pythonhosted.org/packages/3a/38/e1e7841231b33f0389b863f8b77307a93b2b14d9e2f2133fa05d143ef46f/exness_data_preprocess-0.5.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "0635c42990506acd4649cef7acb96bb716225909190a930f4ea9b2588d35d187",
                "md5": "3028acd3e32ff203896c29568d88d5e1",
                "sha256": "e5fab02a876cebbd72f149b6c1ddd5c0778a7d6fbb507320f9e0fd3ce351c69c"
            },
            "downloads": -1,
            "filename": "exness_data_preprocess-0.5.0.tar.gz",
            "has_sig": false,
            "md5_digest": "3028acd3e32ff203896c29568d88d5e1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 1567943,
            "upload_time": "2025-10-21T03:32:40",
            "upload_time_iso_8601": "2025-10-21T03:32:40.729251Z",
            "url": "https://files.pythonhosted.org/packages/06/35/c42990506acd4649cef7acb96bb716225909190a930f4ea9b2588d35d187/exness_data_preprocess-0.5.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-21 03:32:40",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "terrylica",
    "github_project": "exness-data-preprocess",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "exness-data-preprocess"
}
        