datagrunt

Name: datagrunt
Version: 2.1.1
Summary: Read CSV files and convert to other file formats easily
Upload time: 2025-09-13 18:22:40
Requires Python: >=3.10
License: MIT License
Keywords: csv, data, duckdb, polars, pyarrow, xlsx, delimiter, ai, gemini
Requirements: none recorded
# Welcome To Datagrunt

Datagrunt is a Python library designed to simplify the way you work with CSV files. It provides a streamlined approach to reading, processing, and transforming your data into various formats, making data manipulation efficient and intuitive.

## Why Datagrunt?

Born out of real-world frustration, Datagrunt eliminates the need for repetitive coding when handling CSV files. Whether you're a data analyst, data engineer, or data scientist, Datagrunt empowers you to focus on insights, not tedious data wrangling.

### What Datagrunt Is Not
Datagrunt is not an extension of or a replacement for DuckDB, Polars, or PyArrow, nor is it a comprehensive data processing solution. Instead, it's designed to simplify the way you work with CSV files and to help solve the pain point of inferring delimiters when a file structure is unknown. Datagrunt provides an easy way to convert CSV files to dataframes and export them to various formats. One of Datagrunt's value propositions is its relative simplicity and ease of use.

## Key Features

- **Intelligent Delimiter Inference:** Datagrunt automatically detects and applies the correct delimiter for your CSV files (see the sketch after this list).
- **Path Object Support:** Full support for both string paths and `pathlib.Path` objects for modern, cross-platform file handling.
- **Multiple Processing Engines:** Choose from three powerful engines ([DuckDB](https://duckdb.org), [Polars](https://pola.rs), and [PyArrow](https://arrow.apache.org/docs/python/)) to handle your data processing needs.
- **Flexible Data Transformation:** Easily convert your processed CSV data into various formats including CSV, Excel, JSON, JSONL, and Parquet.
- **AI-Powered Schema Analysis:** Use Google's Gemini models to automatically generate detailed schema reports for your CSV files, including data types, column classifications, and data quality checks.
- **Pythonic API:** Enjoy a clean and intuitive API that integrates seamlessly into your existing Python workflows.
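
A quick way to see the delimiter inference at work: the minimal sketch below writes a pipe-delimited file and lets `CSVReader` (introduced in the Quick Start further down) pick up the delimiter on its own. No delimiter is specified anywhere.

```python
from pathlib import Path
from datagrunt import CSVReader

# A minimal sketch: no delimiter is passed anywhere below.
sample = Path('pipes.csv')
sample.write_text('name|city\nAda|London\nGrace|Arlington\n')

reader = CSVReader(sample)  # engine defaults to 'polars'
reader.get_sample()         # preview is parsed with the inferred '|' delimiter
```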

### Powertools Under The Hood
| Tool | Description |
|-------------------|----------------------------|
| [DuckDB](https://duckdb.org)| Fast in-process analytical database with excellent SQL support |
| [Polars](https://pola.rs) | Multi-threaded DataFrame library written in Rust, optimized for performance |
| [PyArrow](https://arrow.apache.org/docs/python/) | Python bindings for Apache Arrow with efficient columnar data processing |
| [Google Gemini](https://deepmind.google/technologies/gemini/) | A powerful family of generative AI models for schema analysis |

## Installation

We recommend [UV](https://docs.astral.sh/uv/), but you can get started with Datagrunt in seconds using either UV or pip.

Get started with UV:

```bash
uv pip install datagrunt
```

Get started with pip:

```bash
pip install datagrunt
```
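
If you manage your dependencies with a UV project, `uv add` also records Datagrunt in your `pyproject.toml`:

```bash
uv add datagrunt
```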

## Quick Start

### Reading CSV Files with Multiple Engine Options

```python
from datagrunt import CSVReader
from pathlib import Path

# Load your CSV file with different engines
# Accepts both string paths and Path objects
csv_file = 'electric_vehicle_population_data.csv'
csv_path = Path('electric_vehicle_population_data.csv')

# Choose your engine: 'polars' (default), 'duckdb', or 'pyarrow'
reader_polars = CSVReader(csv_file, engine='polars')    # String path - fast DataFrame ops
reader_duckdb = CSVReader(csv_path, engine='duckdb')    # Path object - best for SQL queries
reader_pyarrow = CSVReader(csv_file, engine='pyarrow')  # Arrow ecosystem integration

# Get a sample of the data
reader_duckdb.get_sample()
```

### DuckDB Integration for Performant SQL Queries

```python
from datagrunt import CSVReader

# Set up DuckDB engine for SQL capabilities
dg = CSVReader('electric_vehicle_population_data.csv', engine='duckdb')

# Construct your SQL query using the auto-generated table name
query = f"""
WITH core AS (
    SELECT
        City AS city,
        "VIN (1-10)" AS vin
    FROM {dg.db_table}
)
SELECT
    city,
    COUNT(vin) AS vehicle_count
FROM core
GROUP BY 1
ORDER BY 2 DESC
"""

# Execute the query and get results as a Polars DataFrame
df = dg.query_data(query).pl()
print(df)
```
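
Because `.pl()` above is DuckDB's own Polars materializer, `query_data` appears to hand back a standard DuckDB result; if so, the other DuckDB conversions should apply as well. A sketch under that assumption:

```python
# Assumption: query_data returns a standard DuckDB result, as the .pl() call
# above suggests, so DuckDB's other materializers should be available too.
result = dg.query_data(query)

pandas_df = result.df()       # pandas DataFrame
arrow_table = result.arrow()  # PyArrow Table
rows = result.fetchall()      # list of plain Python tuples
```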

### Exporting Data to Multiple Formats

```python
from datagrunt import CSVWriter
from pathlib import Path

# Create writer with your preferred engine (accepts both strings and Path objects)
input_file = Path('input.csv')
writer = CSVWriter(input_file, engine='duckdb')  # Default for exports

# Export to various formats
writer.write_csv('output.csv')          # Clean CSV export
writer.write_excel('output.xlsx')       # Excel workbook
writer.write_json('output.json')        # JSON format
writer.write_parquet('output.parquet')  # Parquet for analytics

# Use PyArrow engine for optimized Parquet exports
writer_arrow = CSVWriter('input.csv', engine='pyarrow')  # String path also works
writer_arrow.write_parquet('optimized.parquet')  # Native Arrow Parquet
```
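
If you need several formats from one source file, a small loop over the writer methods shown above keeps the output names consistent. A sketch using only the methods demonstrated in this section, with output names derived via `pathlib`:

```python
from pathlib import Path
from datagrunt import CSVWriter

source = Path('input.csv')
writer = CSVWriter(source, engine='duckdb')

# Map target suffixes to the writer methods demonstrated above,
# deriving each output name from the input file.
exports = {
    '.xlsx': writer.write_excel,
    '.json': writer.write_json,
    '.parquet': writer.write_parquet,
}
for suffix, write in exports.items():
    write(str(source.with_suffix(suffix)))  # input.xlsx, input.json, input.parquet
```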

### AI-Powered Schema Analysis

```python
from datagrunt import CSVSchemaReportAIGenerated
from pathlib import Path
import os

# Generate detailed schema reports with AI (accepts both strings and Path objects)
api_key = os.environ.get("GEMINI_API_KEY")
data_file = Path('your_data.csv')

schema_analyzer = CSVSchemaReportAIGenerated(
    filepath=data_file,  # Path object works seamlessly
    engine='google',
    api_key=api_key
)

# Get comprehensive schema analysis
report = schema_analyzer.generate_csv_schema_report(
    model='gemini-2.5-flash',
    return_json=True
)

print(report)  # Detailed JSON schema with data types, classifications, and more
```
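
The report can then be persisted like any other JSON payload. A minimal sketch, assuming `return_json=True` yields either a JSON string or an already-parsed dict:

```python
import json
from pathlib import Path

# Hedge: handle both a JSON string and a parsed dict before writing.
payload = report if isinstance(report, str) else json.dumps(report, indent=2)
Path('schema_report.json').write_text(payload)
```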

## Engine Comparison

| Feature | Polars | DuckDB | PyArrow |
|---------|--------|--------|---------|
| **Best for** | DataFrame operations | SQL queries & analytics | Arrow ecosystem integration |
| **Performance** | Fast in-memory processing | Excellent for large datasets | Optimized columnar operations |
| **Default for** | CSVReader | CSVWriter | - |
| **Export Quality** | Good | Excellent (especially JSON) | Native Parquet support |
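
The table above suggests a simple rule of thumb for choosing an engine. One way to encode it, as a sketch using only the documented `engine` parameter (the helper itself is hypothetical):

```python
from datagrunt import CSVReader

# Hypothetical helper encoding the comparison table above.
ENGINE_FOR_TASK = {
    'dataframe': 'polars',   # fast in-memory DataFrame work
    'sql': 'duckdb',         # ad hoc SQL over large files
    'arrow': 'pyarrow',      # handing data to the Arrow ecosystem
}

def reader_for(path, task='dataframe'):
    """Build a CSVReader with the engine the table above suggests for a task."""
    return CSVReader(path, engine=ENGINE_FOR_TASK[task])

sql_reader = reader_for('electric_vehicle_population_data.csv', task='sql')
```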

## Primary Classes

- **`CSVReader`**: Read and process CSV files with intelligent delimiter detection
- **`CSVWriter`**: Export CSV data to multiple formats (CSV, Excel, JSON, Parquet)
- **`CSVSchemaReportAIGenerated`**: Generate AI-powered schema analysis reports
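
Putting the classes together, a typical inspect-query-export pass might look like the sketch below, which uses only calls demonstrated earlier in this README (the AI schema class needs an API key, so it is omitted here):

```python
from datagrunt import CSVReader, CSVWriter

# Inspect the file, run a DuckDB query, then export - all calls shown earlier.
reader = CSVReader('input.csv', engine='duckdb')
reader.get_sample()

# Query result as a Polars DataFrame (separate from the export below).
df = reader.query_data(f'SELECT * FROM {reader.db_table} LIMIT 100').pl()

# CSVWriter exports the source CSV itself, as in the export section above.
writer = CSVWriter('input.csv', engine='duckdb')
writer.write_parquet('input.parquet')
```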

## Full Documentation

For complete documentation, detailed examples, and advanced usage patterns, see:
📖 **[Complete Documentation](https://pmgraham.github.io/datagrunt)**

## License

This project is licensed under the [MIT License](https://opensource.org/license/mit).

## Acknowledgements

A HUGE thank you to the open source community and the creators of [DuckDB](https://duckdb.org), [Polars](https://pola.rs), and [PyArrow](https://arrow.apache.org/docs/python/) for their fantastic libraries that power Datagrunt.

## Source Repository

[https://github.com/pmgraham/datagrunt](https://github.com/pmgraham/datagrunt)

            
