# Exness Data Preprocess v2.0.0
Professional forex tick data preprocessing with unified single-file DuckDB storage. Provides incremental updates, dual-variant storage (Raw_Spread + Standard), and a Phase7 30-column OHLC schema (v1.6.0) with trading-hour detection across 10 global exchange sessions, all with sub-15ms query performance.
## Features
- **Unified Single-File Architecture**: One DuckDB file per instrument (eurusd.duckdb)
- **Incremental Updates**: Automatic gap detection; only missing months are downloaded
- **Dual-Variant Storage**: Raw_Spread (primary) + Standard (reference) in same database
- **Phase7 OHLC Schema**: 30-column bars (v1.6.0) with dual spreads, tick counts, normalized metrics, and 10 global exchange sessions with trading hour detection
- **High Performance**: Incremental OHLC generation (7.3x speedup), vectorized session detection (2.2x speedup), SQL gap detection with complete coverage
- **Fast Queries**: Date range queries with sub-15ms performance
- **On-Demand Resampling**: Any timeframe (5m, 1h, 1d) resampled in <15ms
- **PRIMARY KEY Constraints**: Prevent duplicate data during incremental updates
- **Simple API**: Clean Python API for all operations
## Installation
```bash
# From PyPI (when published)
pip install exness-data-preprocess
# From source
git clone https://github.com/Eon-Labs/exness-data-preprocess.git
cd exness-data-preprocess
pip install -e .
# Using uv (recommended)
uv pip install exness-data-preprocess
```
## Quick Start
### Python API
```python
import exness_data_preprocess as edp
# Initialize processor
processor = edp.ExnessDataProcessor(base_dir="~/eon/exness-data")
# Download 3 years of EURUSD data (automatic gap detection)
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
    delete_zip=True,
)
print(f"Months added: {result['months_added']}")
print(f"Raw ticks: {result['raw_ticks_added']:,}")
print(f"Standard ticks: {result['standard_ticks_added']:,}")
print(f"OHLC bars: {result['ohlc_bars']:,}")
print(f"Database size: {result['duckdb_size_mb']:.2f} MB")
# Query 1-minute OHLC bars for January 2024
df_1m = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1m",
    start_date="2024-01-01",
    end_date="2024-01-31",
)
print(df_1m.head())
# Query raw tick data for September 2024
df_ticks = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
)
print(f"Ticks: {len(df_ticks):,}")
```
## Architecture v2.0.0
### Data Flow
```
Exness Public Repository (monthly ZIPs, both variants)
↓
Automatic Gap Detection
↓
Download Only Missing Months (Raw_Spread + Standard)
↓
DuckDB Single-File Storage (PRIMARY KEY prevents duplicates)
↓
Phase7 30-Column OHLC Generation (v1.6.0 - dual spreads, tick counts, normalized metrics, 10 global exchange sessions with trading hour detection)
↓
Query Interface (date ranges, SQL filters, on-demand resampling)
```
### Storage Format
**Single File Per Instrument**: `~/eon/exness-data/eurusd.duckdb`
**Schema**:
- `raw_spread_ticks` table: Timestamp (PK), Bid, Ask
- `standard_ticks` table: Timestamp (PK), Bid, Ask
- `ohlc_1m` table: Phase7 30-column schema (v1.6.0)
- `metadata` table: Coverage tracking
**Phase7 30-Column OHLC (v1.6.0)**:
- **Column Definitions**: See [`schema.py`](src/exness_data_preprocess/schema.py) - Single source of truth
- **Comprehensive Reference**: See [`DATABASE_SCHEMA.md`](docs/DATABASE_SCHEMA.md) - Query examples and usage patterns
- **Key Features**: BID-only OHLC with dual spreads (Raw_Spread + Standard), normalized spread metrics, and 10 global exchange sessions with trading hour detection (XNYS, XLON, XSWX, XFRA, XTSE, XNZE, XTKS, XASX, XHKG, XSES)
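The per-bar session flags make it straightforward to slice OHLC bars by trading hours. A minimal pandas sketch of the idea, using a toy frame and assumed column names (e.g. `xlon_session`; see `schema.py` for the authoritative names):

```python
import pandas as pd

# Toy 1-minute OHLC frame standing in for the ohlc_1m table.
# The session flag names ("xlon_session", "xnys_session") are assumptions
# for illustration -- schema.py is the single source of truth.
bars = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(
            ["2024-01-02 08:59", "2024-01-02 09:00", "2024-01-02 14:30"]
        ),
        "Close": [1.0945, 1.0947, 1.0952],
        "xlon_session": [False, True, True],   # London trading hours
        "xnys_session": [False, False, True],  # New York trading hours
    }
)

# Bars where the London and New York sessions overlap
overlap = bars[bars["xlon_session"] & bars["xnys_session"]]
print(len(overlap))  # 1
```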
### Directory Structure
**Default Location**: `~/eon/exness-data/` (outside project workspace)
```
~/eon/exness-data/
├── eurusd.duckdb      # Single file for all EURUSD data
├── gbpusd.duckdb      # Single file for all GBPUSD data
├── xauusd.duckdb      # Single file for all XAUUSD data
└── temp/
    └── (temporary ZIP files)
```
**Why Single-File Per Instrument?**
- **Unified Storage**: All years in one database
- **Incremental Updates**: Automatic gap detection; only missing months are downloaded
- **No Duplicates**: PRIMARY KEY constraints prevent duplicate data
- **Fast Queries**: Date range queries with sub-15ms performance
- **Scalability**: Multi-year data in ~2 GB per instrument (3 years)
## Usage Examples
### Example 1: Initial Download and Incremental Updates
```python
import exness_data_preprocess as edp
processor = edp.ExnessDataProcessor()
# Initial download (3-year history)
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
    delete_zip=True,
)
# Run again - only downloads new months since last update
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
)
print(f"Months added: {result['months_added']} (0 if up to date)")
```
### Example 2: Check Data Coverage
```python
coverage = processor.get_data_coverage("EURUSD")
print(f"Database exists: {coverage['database_exists']}")
print(f"Raw_Spread ticks: {coverage['raw_spread_ticks']:,}")
print(f"Standard ticks: {coverage['standard_ticks']:,}")
print(f"OHLC bars: {coverage['ohlc_bars']:,}")
print(f"Date range: {coverage['earliest_date']} to {coverage['latest_date']}")
print(f"Days covered: {coverage['date_range_days']}")
print(f"Database size: {coverage['duckdb_size_mb']:.2f} MB")
```
### Example 3: Query OHLC with Date Ranges
```python
# Query 1-minute bars for January 2024
df_1m = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1m",
    start_date="2024-01-01",
    end_date="2024-01-31",
)
# Query 1-hour bars for Q1 2024 (resampled on-demand)
df_1h = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1h",
    start_date="2024-01-01",
    end_date="2024-03-31",
)
# Query daily bars for entire 2024
df_1d = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1d",
    start_date="2024-01-01",
    end_date="2024-12-31",
)
print(f"1m bars: {len(df_1m):,}")
print(f"1h bars: {len(df_1h):,}")
print(f"1d bars: {len(df_1d):,}")
```
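On-demand resampling from stored 1-minute bars follows the standard OHLC aggregation shape. A pandas sketch of the equivalent logic (not the package's actual implementation, which runs inside `query_ohlc()`):

```python
import pandas as pd

# Two hours of synthetic 1-minute bars standing in for the ohlc_1m table
idx = pd.date_range("2024-01-01 00:00", periods=120, freq="min")
ohlc_1m = pd.DataFrame(
    {"Open": 1.10, "High": 1.11, "Low": 1.09, "Close": 1.105, "Volume": 100},
    index=idx,
)

# Standard OHLC aggregation: first open, max high, min low, last close
ohlc_1h = ohlc_1m.resample("1h").agg(
    {"Open": "first", "High": "max", "Low": "min", "Close": "last", "Volume": "sum"}
)
print(len(ohlc_1h))  # 120 minutes -> 2 hourly bars
```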
### Example 4: Query Ticks with Date Ranges
```python
# Query Raw_Spread ticks for September 2024
df_raw = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
)
print(f"Raw_Spread ticks: {len(df_raw):,}")
print(f"Columns: {list(df_raw.columns)}")
# Calculate spread statistics
df_raw['Spread'] = df_raw['Ask'] - df_raw['Bid']
print(f"Mean spread: {df_raw['Spread'].mean() * 10000:.4f} pips")
print(f"Zero-spreads: {((df_raw['Spread'] == 0).sum() / len(df_raw) * 100):.2f}%")
```
### Example 5: Query with SQL Filters
```python
# Query only zero-spread ticks
df_zero = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-01",
    filter_sql="Bid = Ask",
)
print(f"Zero-spread ticks: {len(df_zero):,}")
# Query high-price ticks
df_high = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
    filter_sql="Bid > 1.11",
)
print(f"High-price ticks: {len(df_high):,}")
```
### Example 6: Process Multiple Instruments
```python
processor = edp.ExnessDataProcessor()
# Process multiple pairs
pairs = ["EURUSD", "GBPUSD", "XAUUSD"]
for pair in pairs:
    print(f"Processing {pair}...")
    result = processor.update_data(
        pair=pair,
        start_date="2023-01-01",
        delete_zip=True,
    )
    print(f"  Months added: {result['months_added']}")
    print(f"  Database size: {result['duckdb_size_mb']:.2f} MB")
```
### Example 7: Parallel Processing
```python
import exness_data_preprocess as edp
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_instrument(pair, start_date):
    processor = edp.ExnessDataProcessor()
    return processor.update_data(pair=pair, start_date=start_date, delete_zip=True)

instruments = [
    ("EURUSD", "2023-01-01"),
    ("GBPUSD", "2023-01-01"),
    ("XAUUSD", "2023-01-01"),
    ("USDJPY", "2023-01-01"),
]

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {
        executor.submit(process_instrument, pair, start_date): pair
        for pair, start_date in instruments
    }

    for future in as_completed(futures):
        pair = futures[future]
        result = future.result()
        print(f"{pair}: {result['months_added']} months added")
```
## Development
### Setup
```bash
# Clone repository
git clone https://github.com/Eon-Labs/exness-data-preprocess.git
cd exness-data-preprocess
# Install with development dependencies (using uv)
uv sync --dev
# Or with pip
pip install -e ".[dev]"
```
### Testing
```bash
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=exness_data_preprocess --cov-report=html
# Run specific test
uv run pytest tests/test_processor.py -v
```
### Code Quality
```bash
# Format code
uv run ruff format .
# Lint
uv run ruff check --fix .
# Type checking
uv run mypy src/
```
### Building
```bash
# Build package
uv build
# Test installation locally
uv tool install --editable .
```
## Data Source
Data is sourced from Exness's public tick data repository:
- **URL**: https://ticks.ex2archive.com/
- **Format**: Monthly ZIP files with CSV tick data
- **Variants**: Raw_Spread (zero-spreads) + Standard (market spreads)
- **Content**: Timestamp, Bid, Ask prices for major forex pairs
- **Quality**: Institutional ECN/STP data with microsecond precision
## Technical Specifications
### Database Size (3-Year History, EURUSD)
| Metric | Value |
| ---------------- | ------------------------ |
| Raw_Spread ticks | ~18.6M |
| Standard ticks | ~19.6M |
| OHLC bars (1m) | ~413K |
| Database size | ~2.08 GB |
| Date range | 2022-01-01 to 2025-01-10 |
### Query Performance
| Operation | Time |
| -------------------------- | ----- |
| Query 880K ticks (1 month) | <15ms |
| Query 1m OHLC (1 month) | <10ms |
| Resample to 1h (1 month) | <15ms |
| Resample to 1d (1 year) | <20ms |
### Architecture Benefits
| Feature | Benefit |
| -------------------------- | -------------------------------------------------- |
| Single file per instrument | Unified storage, no file fragmentation |
| PRIMARY KEY constraints | Prevents duplicates during incremental updates |
| Automatic gap detection | Download only missing months |
| Dual-variant storage | Raw_Spread + Standard in same database |
| Phase7 OHLC schema | Dual spreads + dual tick counts |
| Date range queries | Efficient filtering without loading entire dataset |
| On-demand resampling | Any timeframe in <15ms |
| SQL filter support | Direct SQL WHERE clauses on ticks |
### Performance Optimizations (v0.5.0)
**Incremental OHLC Generation** - 7.3x speedup for updates:
- Full regeneration: 8.05s (303K bars, 7 months)
- Incremental update: 1.10s (43K new bars, 1 month)
- Implementation: Optional date-range parameters for partial regeneration
- Validation: [`docs/validation/SPIKE_TEST_RESULTS_PHASE1_2025-10-18.md`](docs/validation/SPIKE_TEST_RESULTS_PHASE1_2025-10-18.md)
**Vectorized Session Detection** - 2.2x speedup for trading hour detection:
- Current approach: 5.99s (302K bars, 10 exchanges)
- Vectorized approach: 2.69s (302K bars, 10 exchanges)
- Combined Phase 1+2: ~16x total speedup (8.05s → 0.50s)
- Implementation: Pre-compute trading minutes, vectorized `.isin()` lookup
- SSoT: [`docs/PHASE2_SESSION_VECTORIZATION_PLAN.yaml`](docs/PHASE2_SESSION_VECTORIZATION_PLAN.yaml)
- Validation: [`docs/validation/SPIKE_TEST_RESULTS_PHASE2_2025-10-18.md`](docs/validation/SPIKE_TEST_RESULTS_PHASE2_2025-10-18.md)
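The vectorized approach can be sketched in a few lines: build the set of trading minutes for an exchange once, then flag every bar with a single `.isin()` lookup instead of a per-bar loop. The exchange hours below are simplified stand-ins (no holiday calendar), not the package's actual session logic:

```python
import pandas as pd

# Five 1-minute bar timestamps around the (simplified) London open
bars = pd.DataFrame(
    {"Timestamp": pd.date_range("2024-01-02 07:58", periods=5, freq="min")}
)

# Precompute trading minutes once; assume London trades 08:00-16:30
# on 2024-01-02 for illustration (real logic uses exchange calendars)
trading_minutes = pd.date_range("2024-01-02 08:00", "2024-01-02 16:30", freq="min")

# One vectorized membership test flags all bars at once
bars["xlon_session"] = bars["Timestamp"].isin(trading_minutes)
print(bars["xlon_session"].tolist())  # [False, False, True, True, True]
```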
**SQL Gap Detection** - Complete coverage with 46% code reduction:
- Bug fix: Python approach missed internal gaps (41 detected vs 42 actual)
- SQL EXCEPT operator detects ALL gaps (before + within + after existing data)
- Code reduced from 62 lines to 34 lines (46% reduction)
- SSoT: [`docs/PHASE3_SQL_GAP_DETECTION_PLAN.yaml`](docs/PHASE3_SQL_GAP_DETECTION_PLAN.yaml)
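The EXCEPT-based gap detection can be illustrated with a small self-contained example. The package uses DuckDB; sqlite3 is used here only so the sketch runs anywhere, and the table and column names are illustrative, not the package's actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Months already stored in the database
con.execute("CREATE TABLE coverage (month TEXT PRIMARY KEY)")
con.executemany(
    "INSERT INTO coverage VALUES (?)",
    [("2024-01",), ("2024-02",), ("2024-04",)],  # 2024-03 is an internal gap
)

# Every month the requested range should contain
con.execute("CREATE TABLE expected (month TEXT)")
con.executemany(
    "INSERT INTO expected VALUES (?)",
    [(f"2024-{m:02d}",) for m in range(1, 5)],
)

# EXCEPT returns every expected month absent from coverage, catching
# gaps before, within, and after the stored range in one query
gaps = [row[0] for row in con.execute(
    "SELECT month FROM expected EXCEPT SELECT month FROM coverage"
)]
print(gaps)  # ['2024-03']
```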
**Release Notes**: See [`CHANGELOG.md`](CHANGELOG.md) for complete v0.5.0 details
## API Reference
### ExnessDataProcessor
```python
processor = edp.ExnessDataProcessor(base_dir="~/eon/exness-data")
```
**Methods**:
- `update_data(pair, start_date, force_redownload=False, delete_zip=True)` - Update database with latest data
- `query_ohlc(pair, timeframe, start_date=None, end_date=None)` - Query OHLC bars
- `query_ticks(pair, variant, start_date=None, end_date=None, filter_sql=None)` - Query tick data
- `get_data_coverage(pair)` - Get coverage information
**Parameters**:
- `pair` (str): Currency pair (e.g., "EURUSD", "GBPUSD", "XAUUSD")
- `timeframe` (str): OHLC timeframe ("1m", "5m", "15m", "1h", "4h", "1d")
- `variant` (str): Tick variant ("raw_spread" or "standard")
- `start_date` (str): Start date in "YYYY-MM-DD" format
- `end_date` (str): End date in "YYYY-MM-DD" format
- `filter_sql` (str): SQL WHERE clause (e.g., "Bid > 1.11 AND Ask < 1.12")
## Migration from v1.0.0
**v1.0.0 (Legacy)**:
- Monthly DuckDB files: `eurusd_ohlc_2024_08.duckdb`
- Parquet tick storage: `eurusd_ticks_2024_08.parquet`
- Functions: `process_month()`, `process_date_range()`, `analyze_ticks()`
**v2.0.0 (Current)**:
- Single DuckDB file: `eurusd.duckdb`
- No Parquet files (everything in DuckDB)
- Unified API: `processor.update_data()`, `processor.query_ohlc()`, `processor.query_ticks()`
**Migration Steps**:
1. Run `processor.update_data(pair, start_date)` to create new unified database
2. Delete old monthly files: `rm eurusd_ohlc_2024_*.duckdb eurusd_ticks_2024_*.parquet`
3. Update code to use new API methods
## License
MIT License - see LICENSE file for details.
## Authors
- Terry Li <terry@eonlabs.com>
- Eon Labs
## Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
## Acknowledgments
- Exness for providing high-quality public tick data
- DuckDB for embedded OLAP capabilities with sub-15ms query performance
## Additional Documentation
**[📚 Complete Documentation Hub](DOCUMENTATION.md)** - Organized guide from beginner to advanced (72+ documents)
- **Basic Usage Examples**: See `examples/basic_usage.py`
- **Batch Processing**: See `examples/batch_processing.py`
- **Architecture Details**: See `docs/UNIFIED_DUCKDB_PLAN_v2.md`
- **Unit Tests**: See `tests/` directory