config-driven-data-ingestion


- **Name**: config-driven-data-ingestion
- **Version**: 1.0.2
- **Summary**: A comprehensive, config-driven data ingestion library for Python
- **Upload time**: 2025-07-17 10:22:25
- **Requires Python**: >=3.10
- **License**: MIT
- **Keywords**: data-ingestion, etl, csv, json, yaml, sqlalchemy, database, batch-processing

# Config-Driven Data Loading Framework

A framework that eliminates repetitive data loading code by using configuration files. Instead of writing custom code for each data source, you describe the source once in configuration and reuse the same pipeline everywhere.

## **Framework Architecture**

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   DATA SOURCES  │    │   ORCHESTRATOR   │    │  Database       │
│                 │    │                  │    │    Schema       │
│ • CSV Files     │────│ Config Reader    │────│                 │
│ • Excel Files   │    │ Data Processor   │    │ • api_data      │
│ • REST APIs     │    │ Column Mapper    │    │ • market_trends │
│ • JSON Files    │    │ Database Writer  │    │ • risk_metrics  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```

## **Core Components**

- **DataSourceFactory**: Creates appropriate loaders based on configuration type  
- **DataOrchestrator**: Manages the entire pipeline, including error handling and retry logic (implemented once and shared by all sources)  
- **DataProcessor**: Handles transformations and column mapping  
- **DatabaseWriter**: Executes batch writes to the target database schema tables  
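
The factory idea in the list above can be sketched as a small registry that maps a configuration `type` value to a loader. This is an illustrative sketch only, not the library's actual API: the names `register_loader`, `create_loader`, `load_csv`, `load_json`, and the inline `text` key (standing in for reading `file_path`) are all assumptions made for the example.

```python
import csv
import io
import json
from typing import Callable, Iterator

# Hypothetical loader signature: takes the "source" section of a config entry
# and yields one dict per record.
Loader = Callable[[dict], Iterator[dict]]

_LOADERS: dict[str, Loader] = {}


def register_loader(source_type: str) -> Callable[[Loader], Loader]:
    """Register a loader function under a config `type` value."""
    def decorator(func: Loader) -> Loader:
        _LOADERS[source_type] = func
        return func
    return decorator


@register_loader("csv")
def load_csv(source: dict) -> Iterator[dict]:
    # "text" is an illustrative stand-in for opening source["file_path"].
    reader = csv.DictReader(io.StringIO(source["text"]),
                            delimiter=source.get("delimiter", ","))
    yield from reader


@register_loader("json")
def load_json(source: dict) -> Iterator[dict]:
    # Assumes the JSON document is a top-level array of objects.
    yield from json.loads(source["text"])


def create_loader(source_type: str) -> Loader:
    """Return the loader for a config type, failing loudly on unknown types."""
    try:
        return _LOADERS[source_type]
    except KeyError:
        raise ValueError(f"Unsupported data source type: {source_type!r}") from None


csv_rows = list(create_loader("csv")({"text": "date,rate\n2025-01-02,1.08\n"}))
json_rows = list(create_loader("json")({"text": '[{"id": 1, "name": "Core"}]'}))
```

With a registry like this, supporting a new source type means registering one more function; the code that calls `create_loader` does not change.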

---

## **Sample Configuration Files**

### **Data Sources Configuration (data-sources.yml)**
```yaml
# Data Sources Configuration for Database Schema
data_sources:
  market_data_csv:
    type: "csv"
    source:
      file_path: "/data/market/daily_rates.csv"
      delimiter: ","
      header: true
      encoding: "UTF-8"
    target:
      schema: "MarketData"
      table: "market_trends"
      batch_size: 500
    column_mapping:
      - {source: "date", target: "trade_date"}
      - {source: "currency_pair", target: "currency"}
      - {source: "rate", target: "exchange_rate"}
      - {source: "volume", target: "trading_volume"}
  risk_metrics_excel:
    type: "excel"
    source:
      file_path: "/data/risk/monthly_risk.xlsx"
      sheet_name: "RiskData"
      skip_rows: 1
    target:
      schema: "RiskMetrics"
      table: "risk_metrics"
      batch_size: 200
    column_mapping:
      - {source: "Portfolio ID", target: "portfolio_id"}
      - {source: "VaR 95%", target: "var_95"}
      - {source: "Expected Shortfall", target: "expected_shortfall"}
      - {source: "Liquidity Score", target: "liquidity_score"}
    validation:
      required_columns: ["Portfolio ID", "VaR 95%"]
      data_quality_checks: true
  rest_api_data:
    type: "rest_api"
    source:
      url: "https://api.provider.com/v1/liq"
      method: "GET"
      headers:
        Authorization: "Bearer ${API_TOKEN}"
        Content-Type: "application/json"
      timeout: 30
      retry_attempts: 3
    target:
      schema: "APIData"
      table: "forecast_data"
      batch_size: 1000
    column_mapping:
      - {source: "id", target: "id"}
      - {source: "assetClass", target: "asset_class"}
      - {source: "predictedLiquidity", target: "predicted_liquidity"}
      - {source: "confidenceLevel", target: "confidence_level"}
      - {source: "forecastDate", target: "forecast_date"}
  config_json:
    type: "json"
    source:
      file_path: "/config/portfolio_settings.json"
      json_path: "$.portfolios[*]"
    target:
      schema: "ConfigData"
      table: "portfolio_config"
      batch_size: 100
    column_mapping:
      - {source: "id", target: "portfolio_id"}
      - {source: "name", target: "portfolio_name"}
      - {source: "riskProfile", target: "risk_profile"}
      - {source: "liquidityThreshold", target: "liquidity_threshold"}

# Global Settings
global_settings:
  error_handling:
    continue_on_error: true
    error_threshold: 10
    notification_email: "dev-team@company.com"
  data_quality:
    enable_validation: true
    null_value_handling: "skip"
    duplicate_handling: "ignore"
  performance:
    connection_pool_size: 10
    query_timeout: 300
    memory_limit: "2GB"
```
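
To make the `column_mapping`, `validation.required_columns`, and `batch_size` settings concrete, here is a minimal sketch of how a pipeline might apply them to records that have already been loaded. The function names (`apply_column_mapping`, `missing_required`, `batched`) are hypothetical, the config entry is written as a plain Python dict rather than parsed from YAML, and skipping invalid records is an assumed behavior, not something the configuration above guarantees.

```python
from itertools import islice
from typing import Iterable, Iterator


def apply_column_mapping(record: dict, mapping: list[dict]) -> dict:
    """Rename source keys to target column names; unmapped keys are dropped."""
    return {m["target"]: record[m["source"]] for m in mapping if m["source"] in record}


def missing_required(record: dict, required: list[str]) -> list[str]:
    """Return the required columns that are absent, None, or empty strings."""
    return [c for c in required if record.get(c) in (None, "")]


def batched(rows: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most `batch_size` rows, preserving order."""
    it = iter(rows)
    while batch := list(islice(it, batch_size)):
        yield batch


# Illustrative config entry, shaped like one item under `data_sources`.
config_entry = {
    "target": {"batch_size": 2},
    "column_mapping": [
        {"source": "Portfolio ID", "target": "portfolio_id"},
        {"source": "VaR 95%", "target": "var_95"},
    ],
    "validation": {"required_columns": ["Portfolio ID", "VaR 95%"]},
}
records = [
    {"Portfolio ID": "P1", "VaR 95%": 0.12, "Notes": "ignored"},
    {"Portfolio ID": "P2", "VaR 95%": None},  # fails validation, skipped
    {"Portfolio ID": "P3", "VaR 95%": 0.30},
    {"Portfolio ID": "P4", "VaR 95%": 0.05},
]

required = config_entry["validation"]["required_columns"]
valid = [r for r in records if not missing_required(r, required)]
mapped = [apply_column_mapping(r, config_entry["column_mapping"]) for r in valid]
batches = list(batched(mapped, config_entry["target"]["batch_size"]))
```

Each batch is a list of dicts keyed by target column names, which is the shape a batch insert into the target table would typically consume.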

## **High Level Architecture Diagram**

![Architecture](zz_image_dump/system-arch.png)

## **Class Diagram**
![Class](zz_image_dump/system-class.png)

## **Sequence Diagram**
![Sequence](zz_image_dump/system-sequence.png)

## **Key Developer Benefits**

- **Code Reusability**: Write once, configure many times, with no duplicated data loading logic
- **Maintenance Reduction**: A single codebase handles all data sources through configuration
- **Easy Onboarding**: New data sources are added via YAML files, not code changes
- **Error Handling**: Built-in retry logic and comprehensive error reporting
- **Performance**: Batch processing and connection pooling optimize database operations
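
As an illustration of the retry behavior mentioned above (and of a setting such as `retry_attempts: 3`), the following sketch retries a failing operation with exponential backoff. The README does not describe the library's actual retry policy, so the helper name `with_retries`, the backoff schedule, and the injectable `sleep` parameter are all assumptions made for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(operation: Callable[[], T], attempts: int = 3,
                 base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation`, retrying on any exception with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


# Simulated flaky source: fails twice, then succeeds.
calls = []


def flaky() -> str:
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays: list[float] = []
result = with_retries(flaky, attempts=3, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; production code would leave it at the default `time.sleep`.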

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "config-driven-data-ingestion",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "data-ingestion, ETL, csv, json, yaml, sqlalchemy, database, batch-processing",
    "author": null,
    "author_email": "Sathwick <sathwick@outlook.in>",
    "download_url": "https://files.pythonhosted.org/packages/37/0f/ecd4c3a4e7f4dc05ecbf589f53ffb7827fabaedbc789ae34884655d10a86/config_driven_data_ingestion-1.0.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A comprehensive, config-driven data ingestion library for Python",
    "version": "1.0.2",
    "project_urls": {
        "Bug Tracker": "https://github.com/sathwickreddyy/python_projects/issues",
        "Homepage": "https://github.com/sathwickreddyy/python_projects/tree/main/config_driven_loading"
    },
    "split_keywords": [
        "data-ingestion",
        " etl",
        " csv",
        " json",
        " yaml",
        " sqlalchemy",
        " database",
        " batch-processing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "f40ce01be64e362eea02e22493d9933ecb50e462a489542527e2c070a9802850",
                "md5": "fc0ad9f20bf6c42c07aa3351d1e12ed6",
                "sha256": "47972c4cd82996a78e19d1115de202dba57ed0c6a5989a2ccdcd29df08a53061"
            },
            "downloads": -1,
            "filename": "config_driven_data_ingestion-1.0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "fc0ad9f20bf6c42c07aa3351d1e12ed6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 28896,
            "upload_time": "2025-07-17T10:22:23",
            "upload_time_iso_8601": "2025-07-17T10:22:23.630521Z",
            "url": "https://files.pythonhosted.org/packages/f4/0c/e01be64e362eea02e22493d9933ecb50e462a489542527e2c070a9802850/config_driven_data_ingestion-1.0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "370fecd4c3a4e7f4dc05ecbf589f53ffb7827fabaedbc789ae34884655d10a86",
                "md5": "3eea9bff3185aba279cdfa507a5d692c",
                "sha256": "f08cd627bdba81e28e121fa35fcba2667bb567005f429b1ce20129130f216982"
            },
            "downloads": -1,
            "filename": "config_driven_data_ingestion-1.0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "3eea9bff3185aba279cdfa507a5d692c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 25138,
            "upload_time": "2025-07-17T10:22:25",
            "upload_time_iso_8601": "2025-07-17T10:22:25.061512Z",
            "url": "https://files.pythonhosted.org/packages/37/0f/ecd4c3a4e7f4dc05ecbf589f53ffb7827fabaedbc789ae34884655d10a86/config_driven_data_ingestion-1.0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-17 10:22:25",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "sathwickreddyy",
    "github_project": "python_projects",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "config-driven-data-ingestion"
}
        