splurge-tools

Name: splurge-tools
Version: 0.2.6
Summary: Python tools for data type handling and validation
Author: Jim Schilling
Maintainer: Jim Schilling
Requires Python: >=3.10
Keywords: data, validation, text-processing, utilities, type-handling
Upload time: 2025-07-12 20:27:52
# splurge-tools

A Python package providing comprehensive tools for data type handling, validation, text processing, and streaming data analysis.

## Description

splurge-tools is a collection of Python utilities focused on:
- **Data type handling and validation** with comprehensive type inference and conversion
- **Text file processing and manipulation** with streaming support for large files
- **String tokenization and parsing** with delimited value support
- **Text case transformations** and normalization
- **Delimiter-separated value (DSV) parsing** with streaming capabilities
- **Tabular data models** for both in-memory and streaming datasets
- **Typed tabular data models** with schema validation
- **Data validation and transformation** utilities
- **Random data generation** for testing and development
- **Memory-efficient streaming** for large datasets that don't fit in RAM
- **Python 3.10+ compatibility** with full type annotations

## Installation

```bash
pip install splurge-tools
```

## Features

### Core Data Processing
- **`type_helper.py`**: Comprehensive type validation, conversion, and inference utilities with support for strings, numbers, dates, times, booleans, and collections
- **`dsv_helper.py`**: Delimiter-separated value (DSV) parsing with streaming support, column profiling, and data analysis
- **`tabular_data_model.py`**: In-memory data model for tabular datasets with multi-row header support
- **`typed_tabular_data_model.py`**: Type-safe data model with schema validation and type enforcement
- **`streaming_tabular_data_model.py`**: Memory-efficient streaming data model for large datasets (>100MB)
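
For datasets that fit in memory, the in-memory model can wrap output from `DsvHelper.parse`. A minimal sketch, assuming the constructor accepts parsed rows and that a `column_names` accessor exists (`header_rows` and `row()` are referenced in the changelog below):

```python
from splurge_tools.dsv_helper import DsvHelper
from splurge_tools.tabular_data_model import TabularDataModel

# Parse a delimited file and wrap the parsed rows in the in-memory model.
data = DsvHelper.parse("data.csv", delimiter=",")
model = TabularDataModel(data, header_rows=1)

print(model.column_names)  # assumed accessor: merged header row(s)
print(model.row(0))        # first data row
```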

### Text Processing
- **`text_file_helper.py`**: Text file processing with streaming support, header/footer skipping, and memory-efficient operations
- **`string_tokenizer.py`**: String parsing and tokenization utilities with delimited value support
- **`case_helper.py`**: Text case transformation utilities (camelCase, snake_case, kebab-case, etc.)
- **`text_normalizer.py`**: Text normalization and cleaning utilities
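
A hedged sketch of the case transformations; the `CaseHelper` class and the method names below are assumptions for illustration, since this README only states which cases are supported:

```python
from splurge_tools.case_helper import CaseHelper

# Hypothetical method names; only the supported cases (camelCase,
# snake_case, kebab-case) are stated in this README.
print(CaseHelper.to_snake("userFirstName"))    # -> "user_first_name"
print(CaseHelper.to_camel("user_first_name"))  # -> "userFirstName"
print(CaseHelper.to_kebab("UserFirstName"))    # -> "user-first-name"
```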

### Data Utilities
- **`data_validator.py`**: Data validation framework with custom validation rules
- **`data_transformer.py`**: Data transformation utilities for converting between formats
- **`random_helper.py`**: Random data generation for testing, including realistic test data

### Key Capabilities
- **Streaming Support**: Process datasets larger than available RAM with configurable chunk sizes
- **Type Inference**: Automatic detection of data types including dates, times, numbers, and booleans
- **Multi-row Headers**: Support for complex header structures with automatic merging
- **Memory Efficiency**: Streaming models use minimal memory regardless of dataset size
- **Type Safety**: Full type annotations and validation throughout the codebase
- **Error Handling**: Comprehensive error handling with meaningful error messages
- **Performance**: Optimized for large datasets with efficient algorithms and data structures

## Examples

### Streaming Large Datasets

```python
from splurge_tools.dsv_helper import DsvHelper
from splurge_tools.streaming_tabular_data_model import StreamingTabularDataModel

# Process a large CSV file without loading it into memory
stream = DsvHelper.parse_stream("large_dataset.csv", delimiter=",")
model = StreamingTabularDataModel(stream, header_rows=1, chunk_size=1000)

# Iterate through data efficiently
for row in model:
    # Process each row
    print(row)

# Or get rows as dictionaries
for row_dict in model.iter_rows():
    print(row_dict["column_name"])
```

### Type Inference and Validation

```python
from splurge_tools.type_helper import String, DataType

# Infer data types
data_type = String.infer_type("2023-12-25")  # DataType.DATE
data_type = String.infer_type("123.45")      # DataType.FLOAT
data_type = String.infer_type("true")        # DataType.BOOLEAN

# Convert values with validation
date_val = String.to_date("2023-12-25")
float_val = String.to_float("123.45", default=0.0)
bool_val = String.to_bool("true")
```

### DSV Parsing and Profiling

```python
from splurge_tools.dsv_helper import DsvHelper

# Parse and profile columns
data = DsvHelper.parse("data.csv", delimiter=",")
profile = DsvHelper.profile_columns(data)

# Get column information
for col_name, col_info in profile.items():
    print(f"{col_name}: {col_info['datatype']} ({col_info['count']} values)")
```

## Development

### Requirements

- Python 3.10 or higher
- setuptools
- wheel

### Setup

1. Clone the repository:
```bash
git clone https://github.com/jim-schilling/splurge-tools.git
cd splurge-tools
```

2. Create and activate a virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

3. Install development dependencies:
```bash
pip install -e ".[dev]"
```

### Testing

Run tests using pytest:
```bash
python -m pytest tests/
```

### Code Quality

The project uses several tools to maintain code quality:

- **Black**: Code formatting
- **isort**: Import sorting
- **flake8**: Linting
- **mypy**: Type checking
- **pytest**: Testing with coverage

Run all quality checks:
```bash
black .
isort .
flake8 splurge_tools/ tests/ --max-line-length=120
mypy splurge_tools/
python -m pytest tests/ --cov=splurge_tools
```

### Build

Build distribution:
```bash
python -m build
```

## Changelog

### [0.2.6] - 2025-07-12

#### Added
- **Incremental Type Checking Optimization**: Added performance optimization to `profile_values()` function in `type_helper.py` that uses weighted incremental checks at 25%, 50%, and 75% of data processing to short-circuit early when a definitive type can be determined. This provides significant performance improvements for large datasets (>10,000 items) while maintaining accuracy.
- **Early Mixed Type Detection**: Enhanced early termination logic to immediately return `MIXED` type when both numeric/temporal types and string types are detected, avoiding unnecessary processing.
- **Configurable Optimization**: Added `use_incremental_typecheck` parameter (default: `True`) to control whether incremental checking is used, allowing users to disable optimization if needed.
- **Performance Benchmarking**: Added comprehensive performance benchmark script (`examples/profile_values_performance_benchmark.py`) demonstrating 2-3x performance improvements for large datasets.
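
A hedged sketch of the new flag, assuming `profile_values` is importable from `splurge_tools.type_helper` and returns a `DataType`:

```python
from splurge_tools.type_helper import DataType, profile_values

values = [str(i) for i in range(50_000)]  # >10,000 items, so the optimization applies

# Incremental checking is on by default; disable it to force a full scan.
assert profile_values(values) is DataType.INTEGER
assert profile_values(values, use_incremental_typecheck=False) is DataType.INTEGER
```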

#### Changed
- **Performance Threshold**: Incremental type checking is automatically disabled for datasets of 10,000 items or fewer to avoid overhead on small datasets.
- **Documentation Updates**: Updated docstrings in `type_helper.py` to accurately reflect the simplified implementation.
- **Test Structure**: Updated unittest test classes to properly inherit from `unittest.TestCase` for improved test organization and consistency.

#### Removed
- **Unused Imports**: Removed unused `os` import from `type_helper.py` to improve code cleanliness.


### [0.2.5] - 2025-07-10

#### Changed
- **Test Organization**: Reorganized test files to improve clarity and maintainability by separating core functionality tests from complex/integration tests. Split the following test files:
  - `test_dsv_helper.py`: Kept core parsing tests; moved file I/O and streaming tests to `test_dsv_helper_file_stream.py`
  - `test_streaming_tabular_data_model.py`: Kept core streaming model tests; moved complex scenarios and edge cases to `test_streaming_tabular_data_model_complex.py`
  - `test_text_file_helper.py`: Kept core text file operations; moved streaming tests to `test_text_file_helper_streaming.py`
- **Import Cleanup**: Removed unused import statements from all test files to improve code quality and maintainability:
  - Removed unused `DataType` import from `test_dsv_helper.py`
  - Removed unused `Iterator` imports from streaming tabular data model test files
- **String Class Refactoring**: Migrated method-level constants to class-level constants in `type_helper.py` String class for improved performance and maintainability:
  - Moved date/time/datetime pattern lists to class-level constants (`_DATE_PATTERNS`, `_TIME_PATTERNS`, `_DATETIME_PATTERNS`)
  - Moved regex patterns to class-level constants (`_FLOAT_REGEX`, `_INTEGER_REGEX`, `_DATE_YYYY_MM_DD_REGEX`, etc.)
  - This eliminates repeated pattern compilation on each method call and improves code organization

#### Fixed
- **Test Expectations**: Fixed test failures related to incorrect expectations for `profile_columns` method keys (`datatype` instead of `type` and no `count` key) and adjusted error message regex in streaming tabular data model tests.
- **String Class Regex Patterns**: Fixed regex patterns in `type_helper.py` String class for datetime parsing. Updated `_DATETIME_YYYY_MM_DD_REGEX` and `_DATETIME_MM_DD_YYYY_REGEX` patterns to properly handle microseconds with `[.]?\d+` instead of the incorrect `[.]?\d{5}` pattern.

#### Testing
- **Maintained Coverage**: All 167 tests continue to pass with 96% code coverage after reorganization and cleanup.
- **Improved Maintainability**: Test organization now provides clearer separation between core functionality and complex scenarios, enabling selective test execution and better code organization.

### [0.2.4] - 2025-07-05

#### Fixed
- **profile_values Edge Case**: Fixed edge case in `profile_values` function where collections of all-digit strings that could be interpreted as different types (DATE, TIME, DATETIME, INTEGER) were being classified as MIXED instead of INTEGER. The function now prioritizes INTEGER type when all values are all-digit strings (with optional +/- signs) and there's a mix of DATE, TIME, DATETIME, and INTEGER interpretations.
- **profile_values Iterator Safety**: Fixed an issue where the `profile_values` function would fail when given a non-reusable iterator (e.g., a generator). The function now uses a two-pass approach that materializes the values into a list so the special-case logic can run, ensuring correctness with generators.

### [0.2.3] - 2025-07-05

#### Changed
- **API Simplification**: Removed the `multi_row_headers` parameter from `TabularDataModel`, `StreamingTabularDataModel`, `TypedTabularDataModel`, and `DsvHelper.profile_columns`. Multi-row header merging is now controlled solely by the `header_rows` parameter, which specifies how many rows to merge for column names. This change simplifies the API and eliminates redundant parameters.
- **StreamingTabularDataModel API Refinement**: Streamlined the `StreamingTabularDataModel` API to focus on streaming functionality by removing random access methods (`row()`, `row_as_list()`, `row_as_tuple()`, `cell_value()`) and column analysis methods (`column_values()`, `column_type()`). This creates a cleaner, more consistent streaming paradigm.
- **Tests and Examples Updated**: All tests and example scripts have been updated to use only the `header_rows` parameter for multi-row header merging. Any usage of `multi_row_headers` has been removed.
- **StringTokenizer Tests Refactored**: Consolidated and removed redundant tests in `test_string_tokenizer.py` for improved maintainability and clarity. Test coverage and edge case handling remain comprehensive.

#### Added
- **StreamingTabularDataModel**: New streaming tabular data model for large datasets that don't fit in memory. Works with streams from `DsvHelper.parse_stream` to process data without loading the entire dataset into memory. Features include:
  - Memory-efficient streaming processing with configurable chunk sizes (minimum 100 rows)
  - Support for multi-row headers with automatic merging
  - Multiple iteration methods (as lists, dictionaries, tuples)
  - Empty row skipping and uneven row handling
  - Comprehensive error handling and validation
  - Dynamic column expansion during iteration
  - Row padding for uneven data
- **Comprehensive Test Coverage**: Added extensive test suite for `StreamingTabularDataModel` with 26 test methods covering:
  - Basic functionality with and without headers
  - Multi-row header processing
  - Buffer operations and memory management
  - Iteration methods (direct, dict, tuple)
  - Error handling for invalid parameters and columns
  - Edge cases (empty files, large datasets, uneven rows, empty headers)
  - Header validation and initialization
  - Chunk processing and buffer size limits
  - Dynamic column expansion and row padding
- **Streaming Data Example**: Added comprehensive example demonstrating `StreamingTabularDataModel` usage, including memory usage comparison with traditional loading methods.

#### Fixed
- **Header Processing**: Fixed header processing logic in all data models (`StreamingTabularDataModel`, `TabularDataModel`, `TypedTabularDataModel`) to properly handle empty headers by filling them with `column_<index>` names. Headers like `"Name,,City"` now correctly become `["Name", "column_1", "City"]`.
- **DSV Parsing**: Fixed `StringTokenizer.parse` to preserve empty fields instead of filtering them out. This ensures that `"Name,,City"` is parsed as `["Name", "", "City"]` instead of `["Name", "City"]`, maintaining data integrity (see the sketch after this list).
- **Row Padding and Dynamic Column Expansion**: Fixed row padding logic in `StreamingTabularDataModel` to properly handle uneven rows and dynamically expand columns during iteration.
- **File Handling**: Fixed file permission errors in tests by ensuring proper cleanup of temporary files and stream exhaustion.
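
A small sketch of the corrected `StringTokenizer.parse` behavior described above (the `delimiter` keyword name is an assumption):

```python
from splurge_tools.string_tokenizer import StringTokenizer

# Empty fields are preserved rather than dropped.
tokens = StringTokenizer.parse("Name,,City", delimiter=",")
assert tokens == ["Name", "", "City"]
```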

#### Performance
- **Memory Efficiency**: `StreamingTabularDataModel` provides significant memory savings for large datasets by processing data in configurable chunks rather than loading entire files into memory.
- **Streaming Processing**: Enables processing of datasets larger than available RAM through efficient streaming and buffer management.

#### Testing
- **94% Test Coverage**: Achieved 94% test coverage for `StreamingTabularDataModel` with comprehensive edge case testing.
- **Error Condition Testing**: Added thorough testing of error conditions including invalid parameters and missing columns.
- **Integration Testing**: Tests cover integration with `DsvHelper.parse_stream` and various data formats.
- **StringTokenizer Tests Updated**: Updated `StringTokenizer` tests to reflect the new behavior of preserving empty fields.

### [0.2.2] - 2025-07-04

#### Added
- **TextFileHelper.load_as_stream**: Added new method for memory-efficient streaming of large text files with configurable chunk sizes. Supports header/footer row skipping and uses optimized deque-based sliding window for footer handling.
- **TextFileHelper.preview skip_header_rows parameter**: Added `skip_header_rows` parameter to the `preview()` method, allowing users to skip header rows when previewing file contents.
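
A hedged sketch of the two additions; parameter names other than `skip_header_rows`, and whether the stream yields single lines or chunks of lines, are assumptions:

```python
from splurge_tools.text_file_helper import TextFileHelper

# Peek at the file without its header row.
head = TextFileHelper.preview("large_log.txt", skip_header_rows=1)

# Stream the rest of the file without loading it all into memory.
for item in TextFileHelper.load_as_stream("large_log.txt", skip_header_rows=1):
    ...  # process each streamed line (or chunk of lines)
```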

#### Performance
- **TextFileHelper Footer Buffer Optimization**: Replaced list-based footer buffer with `collections.deque` in `load_as_stream()` method, improving performance from O(n) to O(1) for footer row operations.
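
The sliding-window idea is generic; this is a minimal sketch of the technique, not the library's code:

```python
from collections import deque
from typing import Iterator, TextIO


def stream_without_footer(fh: TextIO, footer_rows: int) -> Iterator[str]:
    """Yield each line except the final `footer_rows` lines.

    A deque acts as a sliding window over the tail of the file, so each
    line is appended and popped in O(1) instead of shifting a list.
    """
    window: deque[str] = deque()
    for line in fh:
        window.append(line)
        if len(window) > footer_rows:
            yield window.popleft()
```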

#### Fixed
- **TabularDataModel No-Header Scenarios**: Fixed issue where column names were empty when `header_rows=0`. Column names are now properly generated as `["column_0", "column_1", "column_2"]` when no headers are provided.
- **TabularDataModel Row Access**: Fixed `IndexError` in the `row()` method when accessing uneven data rows. Added proper padding logic to ensure row data has enough columns before access.
- **TabularDataModel Data Normalization**: Improved consistency between column count and column names by ensuring column names always match the actual column count, regardless of header configuration.

### [0.2.1] - 2025-07-03

#### Added
- **DsvHelper.profile_columns**: Added `DsvHelper.profile_columns`, a new method that generates a simple data profile from parsed DSV data, inferring column names and datatypes.
- **Test Coverage**: Added comprehensive test cases for `DsvHelper.profile_columns` and improved validation of DSV parsing logic, including edge cases for all supported datatypes.

### [0.2.0] - 2025-07-02

#### Breaking Changes
- **Method Signature Standardization**: All method signatures across the codebase have been updated to require default parameters to be named (e.g., `def myfunc(value: str, *, trim: bool = True)`). This enforces keyword-only arguments for all default values, improving clarity and consistency. This is a breaking change and may require updates to any code that calls these methods positionally for defaulted parameters.
- All method signatures now use explicit type annotations and follow PEP8 and project-specific conventions for parameter ordering and naming.
- Some methods may have reordered parameters or stricter type requirements as part of this standardization.
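
A minimal illustration of the new calling convention (the function body here is only for demonstration):

```python
def myfunc(value: str, *, trim: bool = True) -> str:
    """Standardized signature: defaulted parameters are keyword-only."""
    return value.strip() if trim else value


myfunc("  text  ", trim=False)   # OK: defaulted parameter passed by keyword
# myfunc("  text  ", False)      # TypeError: `trim` can no longer be positional
```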

#### Fixed
- **Regex Pattern Bug**: Fixed a regex pattern bug in the `String` class in `type_helper.py` where `??` should have been `?`.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Author

Jim Schilling

            
