splurge-data-profiler

Name: splurge-data-profiler
Version: 0.1.2 (PyPI)
Summary: A data profiling tool for delimited and database sources.
Author / Maintainer: Jim Schilling
Homepage: https://github.com/jim-schilling/splurge-data-profiler
Requires Python: >=3.10
License: MIT
Uploaded: 2025-07-11 10:27:59
Keywords: data-profiling, csv, tsv, dsv, data-lake, sqlite, type-inference, data-analysis, etl, data-processing
# Splurge Data Profiler

A powerful data profiling tool for delimited and database sources that automatically infers data types and creates an optimized data lake (a SQLite database).

## Features

- **DSV File Support**: Profile CSV, TSV, and other delimiter-separated value files
- **Automatic Type Inference**: Intelligently detect data types using adaptive sampling
- **Data Lake Creation**: Generate SQLite databases with optimized schemas
- **Inferred Tables**: Create tables with both original and type-cast columns
- **Flexible Configuration**: JSON-based configuration for customization
- **Command Line Interface**: Easy-to-use CLI for batch processing

## Installation

```bash
pip install splurge-data-profiler
```

## Quick Start

1. **Create a configuration file**:
```bash
python -m splurge_data_profiler create-config examples/example_config.json
```

2. **Profile your data**:
```bash
python -m splurge_data_profiler profile examples/example_data.csv examples/example_config.json
```

## CLI Usage

### Profile Command

Profile a DSV file and create a data lake:

```bash
python -m splurge_data_profiler profile <dsv_file> <config_file> [options]
```

**Options:**
- `--verbose`: Enable verbose output

**Examples:**
```bash
# Basic profiling
python -m splurge_data_profiler profile examples/example_data.csv examples/example_config.json

# Verbose output
python -m splurge_data_profiler profile examples/example_data.csv examples/example_config.json --verbose
```

### Create Config Command

Generate a sample configuration file:

```bash
python -m splurge_data_profiler create-config <output_file>
```

**Example:**
```bash
python -m splurge_data_profiler create-config examples/example_config.json
```

## Configuration File

The configuration file is a JSON file that specifies how to process your DSV file:

```json
{
  "data_lake_path": "./data_lake",
  "dsv": {
    "delimiter": ",",
    "strip": true,
    "bookend": "\"",
    "bookend_strip": true,
    "encoding": "utf-8",
    "skip_header_rows": 0,
    "skip_footer_rows": 0,
    "header_rows": 1,
    "skip_empty_rows": true
  }
}
```

### Configuration Options

#### Required Fields
- `data_lake_path`: Directory where the SQLite database will be created

#### DSV Configuration (`dsv` object)
- `delimiter`: Character used to separate values (default: `","`)
- `strip`: Whether to strip whitespace from values (default: `true`)
- `bookend`: Character used to quote values (default: `"\""`)
- `bookend_strip`: Whether to strip bookend characters (default: `true`)
- `encoding`: File encoding (default: `"utf-8"`)
- `skip_header_rows`: Number of header rows to skip (default: `0`)
- `skip_footer_rows`: Number of footer rows to skip (default: `0`)
- `header_rows`: Number of header rows (default: `1`)
- `skip_empty_rows`: Whether to skip empty rows (default: `true`)

**Note**: The profiler always uses adaptive sampling and always creates an inferred table.
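The defaults above can be applied on top of a user-supplied config with a few lines of standard-library code. This is an illustrative sketch, not part of the package API; the `load_config` helper and `DSV_DEFAULTS` table are names invented here:

```python
import json

# Documented defaults for the "dsv" section (illustrative helper,
# not part of splurge-data-profiler itself).
DSV_DEFAULTS = {
    "delimiter": ",",
    "strip": True,
    "bookend": "\"",
    "bookend_strip": True,
    "encoding": "utf-8",
    "skip_header_rows": 0,
    "skip_footer_rows": 0,
    "header_rows": 1,
    "skip_empty_rows": True,
}


def load_config(path: str) -> tuple[str, dict]:
    """Load a profiler config file, enforcing the required field and
    filling in documented defaults for any omitted dsv options."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    if "data_lake_path" not in config:
        raise ValueError("config is missing required field 'data_lake_path'")
    dsv = {**DSV_DEFAULTS, **config.get("dsv", {})}
    return config["data_lake_path"], dsv
```

Because unspecified keys fall back to the defaults, a minimal config only needs `data_lake_path` and whatever dsv options differ from the defaults (e.g. a tab delimiter for TSV files).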

## Examples

### Example 1: Basic Profiling

1. Create a configuration:
```bash
python -m splurge_data_profiler create-config examples/example_config.json
```

2. Profile your data:
```bash
python -m splurge_data_profiler profile examples/example_data.csv examples/example_config.json
```

Output:
```
=== PROFILING RESULTS ===
id: INTEGER
name: TEXT
age: INTEGER
salary: FLOAT
is_active: BOOLEAN
hire_date: DATE
last_login: DATETIME

Profiling completed successfully!
```

**Note**: Datetime values should be in ISO 8601 format (YYYY-MM-DDTHH:MM:SS) for proper type inference.

### Example 2: With Inferred Table

```bash
python -m splurge_data_profiler profile examples/example_data.csv examples/example_config.json
```

This creates an additional table with:
- Original columns (preserving text values)
- Cast columns with inferred data types

### Example 3: Verbose Output

```bash
python -m splurge_data_profiler profile examples/example_data.csv examples/example_config.json --verbose
```

## Data Types

The profiler can infer the following data types:

- **TEXT**: String values
- **INTEGER**: Whole numbers
- **FLOAT**: Decimal numbers
- **BOOLEAN**: True/false values
- **DATE**: Date values (YYYY-MM-DD)
- **TIME**: Time values (HH:MM:SS)
- **DATETIME**: Date and time values (ISO 8601 format: YYYY-MM-DDTHH:MM:SS)
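A minimal sketch of how per-value detection over these seven types could work, using only the standard library. This is illustrative; the profiler's actual inference rules (and its handling of nulls, locales, and mixed columns) may differ:

```python
from datetime import date, datetime, time


def infer_type(value: str) -> str:
    """Classify a single string value into one of the profiler's
    documented types, checking the most specific formats first."""
    if value.lower() in ("true", "false"):
        return "BOOLEAN"
    try:
        int(value)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(value)
        return "FLOAT"
    except ValueError:
        pass
    # ISO 8601 temporal formats: DATE (YYYY-MM-DD), TIME (HH:MM:SS),
    # DATETIME (YYYY-MM-DDTHH:MM:SS).
    for parser, name in ((date.fromisoformat, "DATE"),
                         (time.fromisoformat, "TIME"),
                         (datetime.fromisoformat, "DATETIME")):
        try:
            parser(value)
            return name
        except ValueError:
            pass
    return "TEXT"
```

In a real profiler the per-value results for a sampled column would then be reduced to a single column type (e.g. the narrowest type that every sampled value satisfies).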

## Adaptive Sampling

The profiler always uses adaptive sampling, selecting a sample fraction based on dataset size (rules as of v0.1.1; see the Changelog):

- Datasets < 5K rows: 100% sample
- Datasets < 10K rows: 80% sample
- Datasets < 25K rows: 60% sample
- Datasets < 100K rows: 40% sample
- Datasets < 500K rows: 20% sample
- Datasets >= 500K rows: 10% sample
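The 0.1.1 changelog describes these thresholds as class-level rules built from a dataclass. The sketch below mirrors that shape; the names `SamplingRule`, `RULES`, and `adaptive_sample_size` are invented here for illustration (the package's own classmethod is `calculate_adaptive_sample_size`):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingRule:
    max_rows: float  # exclusive upper bound of row counts this rule covers
    factor: float    # fraction of rows to sample


# Thresholds as documented in the 0.1.1 changelog (illustrative only).
RULES = (
    SamplingRule(5_000, 1.00),
    SamplingRule(10_000, 0.80),
    SamplingRule(25_000, 0.60),
    SamplingRule(100_000, 0.40),
    SamplingRule(500_000, 0.20),
    SamplingRule(float("inf"), 0.10),
)


def adaptive_sample_size(total_rows: int) -> int:
    """Return the number of rows to sample for a dataset of the given size."""
    for rule in RULES:
        if total_rows < rule.max_rows:
            return int(total_rows * rule.factor)
    return total_rows  # unreachable: the final rule covers all sizes
```

Keeping the thresholds in a single ordered rule table (rather than a chain of magic numbers) is what makes the policy easy to test and adjust.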

## Programmatic Usage

You can also use the profiler programmatically:

```python
from splurge_data_profiler.data_lake import DataLakeFactory
from splurge_data_profiler.profiler import Profiler
from splurge_data_profiler.source import DsvSource

# Create DSV source
dsv_source = DsvSource(
    file_path="examples/example_data.csv",
    delimiter=",",
    encoding="utf-8"
)

# Create data lake
data_lake = DataLakeFactory.from_dsv_source(
    dsv_source=dsv_source,
    data_lake_path="./data_lake"
)

# Create profiler and run profiling
profiler = Profiler(data_lake=data_lake)
profiler.profile()

# Get results
for column in profiler.profiled_columns:
    print(f"{column.name}: {column.inferred_type}")
```
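After profiling, the data lake is an ordinary SQLite file, so you can inspect the tables it produced (including the inferred table with its original and cast columns) with the standard `sqlite3` module. The exact table names the profiler generates are not specified here, so this sketch discovers them dynamically:

```python
import sqlite3


def list_tables_and_columns(db_path: str) -> dict[str, list[str]]:
    """Map each table in a SQLite database to its column names."""
    schema: dict[str, list[str]] = {}
    with sqlite3.connect(db_path) as conn:
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
        for (table,) in tables:
            # PRAGMA table_info returns one row per column; field 1 is the name.
            cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [col[1] for col in cols]
    return schema
```

Pointing this at the SQLite file under `data_lake_path` shows both the original text-preserving columns and the type-cast columns side by side.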

## Requirements

- Python 3.10+
- SQLAlchemy >= 2.0.37
- splurge-tools == 0.2.4

## License

MIT License


## Changelog

### [0.1.1] 2025-07-10

- **Refactored adaptive sampling logic**: Sampling thresholds and factors are now defined as class-level rules using a dataclass, improving maintainability and clarity.
- **Public classmethod for adaptive sample size**: `calculate_adaptive_sample_size` is now a public classmethod, replacing the previous private method and magic numbers.
- **Test suite updated**: All tests now use the new classmethod for adaptive sample size, ensuring consistency and eliminating magic numbers.
- **Sampling rules updated**: New adaptive sampling rules:
    - < 5K rows: 100%
    - < 10K rows: 80%
    - < 25K rows: 60%
    - < 100K rows: 40%
    - < 500K rows: 20%
    - >= 500K rows: 10%
- **General code quality improvements**: Improved type annotations, error handling, and code organization per project standards.
- **Enhanced test coverage and reliability**: Test logic and assertions now reflect the updated adaptive sampling strategy.
- (See previous notes for performance test and runner improvements.)

### [0.1.0] 2025-07-06

- **Initial release** of Splurge Data Profiler
- **CLI implementation** with `profile` and `create-config` commands
- **DSV file support** for CSV, TSV, and other delimiter-separated value files
- **Automatic type inference** using adaptive sampling strategy
- **Data lake creation** with SQLite database generation
- **Inferred table creation** with both original and type-cast columns
- **JSON configuration** for DSV parsing options
- **ISO 8601 datetime support** for proper type inference
- **Adaptive sampling** based on dataset size (100% for <25K rows, 50% for 25K-50K, 25% for 50K-100K, 20% for 100K-500K, 10% for >500K)
- **Simplified workflow** - always profiles and always creates inferred tables

            
