# Splurge Data Profiler
A powerful data profiling tool for delimited and database sources that automatically infers data types and creates an optimized data lake (a SQLite database).
## Features
- **DSV File Support**: Profile CSV, TSV, and other delimiter-separated value files
- **Automatic Type Inference**: Intelligently detect data types using adaptive sampling
- **Data Lake Creation**: Generate SQLite databases with optimized schemas
- **Inferred Tables**: Create tables with both original and type-cast columns
- **Flexible Configuration**: JSON-based configuration for customization
- **Command Line Interface**: Easy-to-use CLI for batch processing
## Installation
```bash
pip install splurge-data-profiler
```
## Quick Start
1. **Create a configuration file**:
```bash
python -m splurge_data_profiler create-config examples/example_config.json
```
2. **Profile your data**:
```bash
python -m splurge_data_profiler profile examples/example_data.csv examples/example_config.json
```
## CLI Usage
### Profile Command
Profile a DSV file and create a data lake:
```bash
python -m splurge_data_profiler profile <dsv_file> <config_file> [options]
```
**Options:**
- `--verbose`: Enable verbose output
**Examples:**
```bash
# Basic profiling
python -m splurge_data_profiler profile examples/example_data.csv examples/example_config.json
# Verbose output
python -m splurge_data_profiler profile examples/example_data.csv examples/example_config.json --verbose
```
### Create Config Command
Generate a sample configuration file:
```bash
python -m splurge_data_profiler create-config <output_file>
```
**Example:**
```bash
python -m splurge_data_profiler create-config examples/example_config.json
```
## Configuration File
The configuration file is a JSON file that specifies how to process your DSV file:
```json
{
"data_lake_path": "./data_lake",
"dsv": {
"delimiter": ",",
"strip": true,
"bookend": "\"",
"bookend_strip": true,
"encoding": "utf-8",
"skip_header_rows": 0,
"skip_footer_rows": 0,
"header_rows": 1,
"skip_empty_rows": true
}
}
```
### Configuration Options
#### Required Fields
- `data_lake_path`: Directory where the SQLite database will be created
#### DSV Configuration (`dsv` object)
- `delimiter`: Character used to separate values (default: `","`)
- `strip`: Whether to strip whitespace from values (default: `true`)
- `bookend`: Character used to quote values (default: `"\""`)
- `bookend_strip`: Whether to strip bookend characters (default: `true`)
- `encoding`: File encoding (default: `"utf-8"`)
- `skip_header_rows`: Number of header rows to skip (default: `0`)
- `skip_footer_rows`: Number of footer rows to skip (default: `0`)
- `header_rows`: Number of header rows (default: `1`)
- `skip_empty_rows`: Whether to skip empty rows (default: `true`)
**Note**: The profiler always uses adaptive sampling and always creates an inferred table.
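Since every `dsv` option has a documented default, a configuration presumably only needs to override the settings that differ. For example, a tab-separated file might use a configuration like this (the path is illustrative, and the assumption that omitted keys fall back to their defaults follows from the option list above):

```json
{
    "data_lake_path": "./data_lake",
    "dsv": {
        "delimiter": "\t",
        "bookend": null
    }
}
```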
## Examples
### Example 1: Basic Profiling
1. Create a configuration:
```bash
python -m splurge_data_profiler create-config examples/example_config.json
```
2. Profile your data:
```bash
python -m splurge_data_profiler profile examples/example_data.csv examples/example_config.json
```
Output:
```
=== PROFILING RESULTS ===
id: INTEGER
name: TEXT
age: INTEGER
salary: FLOAT
is_active: BOOLEAN
hire_date: DATE
last_login: DATETIME
Profiling completed successfully!
```
**Note**: Datetime values should be in ISO 8601 format (YYYY-MM-DDTHH:MM:SS) for proper type inference.
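A quick way to check whether a value is in a parseable ISO 8601 layout is the standard library's `datetime.fromisoformat`:

```python
from datetime import datetime

# ISO 8601 values parse cleanly; other layouts such as "01/15/2024 9:30 AM"
# raise ValueError and would be profiled as TEXT instead of DATETIME.
value = "2024-01-15T09:30:00"
parsed = datetime.fromisoformat(value)
print(parsed)  # -> 2024-01-15 09:30:00
```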
### Example 2: With Inferred Table
```bash
python -m splurge_data_profiler profile examples/example_data.csv examples/example_config.json
```
This creates an additional table with:
- Original columns (preserving text values)
- Cast columns with inferred data types
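A minimal sketch of that shape, using an in-memory SQLite database (the table name and the `_cast` column suffix are illustrative assumptions, not the profiler's actual schema):

```python
import sqlite3

# Sketch only: each source column keeps its original text value alongside
# a companion column cast to the inferred type.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE example_inferred (
        age TEXT,          -- original value as read from the DSV file
        age_cast INTEGER   -- same value cast to the inferred type
    )
""")
con.execute("INSERT INTO example_inferred VALUES ('42', 42)")
row = con.execute("SELECT age, age_cast FROM example_inferred").fetchone()
print(row)  # -> ('42', 42)
con.close()
```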
### Example 3: Verbose Output
```bash
python -m splurge_data_profiler profile examples/example_data.csv examples/example_config.json --verbose
```
## Data Types
The profiler can infer the following data types:
- **TEXT**: String values
- **INTEGER**: Whole numbers
- **FLOAT**: Decimal numbers
- **BOOLEAN**: True/false values
- **DATE**: Date values (YYYY-MM-DD)
- **TIME**: Time values (HH:MM:SS)
- **DATETIME**: Date and time values (ISO 8601 format: YYYY-MM-DDTHH:MM:SS)
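One plausible way to think about inference over these types (an illustrative cascade only, not the library's actual algorithm) is to try the most specific interpretation first and fall back to TEXT:

```python
from datetime import date, datetime, time

def infer_type(value: str) -> str:
    # Illustrative sketch: try each parser in order of specificity.
    v = value.strip()
    if v.lower() in ("true", "false"):
        return "BOOLEAN"
    for name, parse in (
        ("INTEGER", int),
        ("FLOAT", float),
        ("TIME", time.fromisoformat),
        ("DATETIME", datetime.fromisoformat) if "T" in v
        else ("DATE", date.fromisoformat),
    ):
        try:
            parse(v)
            return name
        except ValueError:
            continue
    return "TEXT"

print(infer_type("42"))                   # -> INTEGER
print(infer_type("2024-01-15"))           # -> DATE
print(infer_type("2024-01-15T09:30:00"))  # -> DATETIME
print(infer_type("hello"))                # -> TEXT
```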
## Adaptive Sampling
When no sample size is specified, the profiler uses adaptive sampling:
- Datasets < 5K rows: 100% sample
- Datasets < 10K rows: 80% sample
- Datasets < 25K rows: 60% sample
- Datasets < 100K rows: 40% sample
- Datasets < 500K rows: 20% sample
- Datasets >= 500K rows: 10% sample
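A sketch of these threshold rules as a standalone function (the library exposes them via the public `calculate_adaptive_sample_size` classmethod introduced in 0.1.1, whose exact signature may differ from this illustration):

```python
# Thresholds and factors per the 0.1.1 changelog rules.
SAMPLING_RULES = [
    (5_000, 1.00),
    (10_000, 0.80),
    (25_000, 0.60),
    (100_000, 0.40),
    (500_000, 0.20),
]

def adaptive_sample_size(row_count: int) -> int:
    """Return the number of rows to sample for a dataset of row_count rows."""
    for threshold, factor in SAMPLING_RULES:
        if row_count < threshold:
            return int(row_count * factor)
    return int(row_count * 0.10)  # >= 500K rows

print(adaptive_sample_size(4_000))      # -> 4000 (100%)
print(adaptive_sample_size(50_000))     # -> 20000 (40%)
print(adaptive_sample_size(1_000_000))  # -> 100000 (10%)
```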
## Programmatic Usage
You can also use the profiler programmatically:
```python
from splurge_data_profiler.data_lake import DataLakeFactory
from splurge_data_profiler.profiler import Profiler
from splurge_data_profiler.source import DsvSource
# Create DSV source
dsv_source = DsvSource(
file_path="examples/example_data.csv",
delimiter=",",
encoding="utf-8"
)
# Create data lake
data_lake = DataLakeFactory.from_dsv_source(
dsv_source=dsv_source,
data_lake_path="./data_lake"
)
# Create profiler and run profiling
profiler = Profiler(data_lake=data_lake)
profiler.profile()
# Get results
for column in profiler.profiled_columns:
print(f"{column.name}: {column.inferred_type}")
```
## Requirements
- Python 3.10+
- SQLAlchemy >= 2.0.37
- splurge-tools == 0.2.4
## License
MIT License
## Changelog
### [0.1.1] 2025-07-10
- **Refactored adaptive sampling logic**: Sampling thresholds and factors are now defined as class-level rules using a dataclass, improving maintainability and clarity.
- **Public classmethod for adaptive sample size**: `calculate_adaptive_sample_size` is now a public classmethod, replacing the previous private method and magic numbers.
- **Test suite updated**: All tests now use the new classmethod for adaptive sample size, ensuring consistency and eliminating magic numbers.
- **Sampling rules updated**: New adaptive sampling rules:
- < 5K rows: 100%
- < 10K rows: 80%
- < 25K rows: 60%
- < 100K rows: 40%
- < 500K rows: 20%
- >= 500K rows: 10%
- **General code quality improvements**: Improved type annotations, error handling, and code organization per project standards.
- **Enhanced test coverage and reliability**: Test logic and assertions now reflect the updated adaptive sampling strategy.
- (See previous notes for performance test and runner improvements.)
### [0.1.0] 2025-07-06
- **Initial release** of Splurge Data Profiler
- **CLI implementation** with `profile` and `create-config` commands
- **DSV file support** for CSV, TSV, and other delimiter-separated value files
- **Automatic type inference** using adaptive sampling strategy
- **Data lake creation** with SQLite database generation
- **Inferred table creation** with both original and type-cast columns
- **JSON configuration** for DSV parsing options
- **ISO 8601 datetime support** for proper type inference
- **Adaptive sampling** based on dataset size (100% for <25K rows, 50% for 25K-50K, 25% for 50K-100K, 20% for 100K-500K, 10% for >500K)
- **Simplified workflow** - always profiles and always creates inferred tables