gridgulp

Name: gridgulp
Version: 0.3.4
Summary: Simplified intelligent spreadsheet ingestion framework with automatic table detection
Author: GridGulp Contributors
Homepage: https://github.com/Ganymede-Bio/gridgulp
Requires Python: >=3.10
License: MIT
Keywords: csv, data-extraction, excel, spreadsheet, table-detection
Upload time: 2025-07-30 05:53:54

# GridGulp

[![PyPI version](https://badge.fury.io/py/gridgulp.svg)](https://pypi.org/project/gridgulp/)
[![Python Versions](https://img.shields.io/pypi/pyversions/gridgulp.svg)](https://pypi.org/project/gridgulp/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Documentation](https://img.shields.io/badge/docs-GitHub%20Pages-blue)](https://ganymede-bio.github.io/gridgulp/)

Automatically detect and extract tables from Excel, CSV, and text files.

## What is GridGulp?

GridGulp finds tables in your spreadsheets, even when

- there are multiple tables on one sheet
- tables don't start at cell A1
- the file extension doesn't match the actual file format
- the text encoding isn't obvious

**Supported formats:** `.xlsx`, `.xls`, `.xlsm`, `.csv`, `.tsv`, `.txt`

## Installation

```bash
pip install gridgulp
```

## Quick Start

### Trying GridGulp Out

To quickly try GridGulp, clone the repo, place some example spreadsheets in the `examples/` directory, and run

```bash
python scripts/test_example_files.py
```

You will see output like the following, showing the detected table ranges:

šŸ“ tests/manual
----------------------------------------------------------------------------------------------------
āœ“ sample.xlsx                              | Tables: 1  | Time: 1.099s | Size: 122.6KB | Method: magika
  šŸ“„ Sheet: Sheet
     └─ A1:CV203        | 203Ɨ100 | Conf: 70%


### Table Ranges vs DataFrames

GridGulp provides two ways to work with detected tables:

1. **Table Ranges** - JSON metadata about where tables are located (e.g., "A1:E100")
   - Fast and memory-efficient
   - Well-suited for use as an agent tool, e.g. mapping table locations or visualizing spreadsheet structure
   - No actual data is loaded into memory

2. **DataFrames** - The actual data extracted from those ranges as pandas DataFrames
   - Contains the full data with proper types
   - Ready for analysis, transformation, or export

Choose based on your needs:
- Use **ranges only** when you just need to know where tables are and want to pass those locations to other tasks - for example, a downstream process that infers a table's purpose from its contents
- Use **DataFrames** when you need to analyze or transform the actual data

### Getting Table Ranges Only

```python
from gridgulp import GridGulp

# Detect tables in a file (lightweight - just finds locations)
porter = GridGulp()
result = await porter.detect_tables("sales_report.xlsx")

# Process results - no data loaded yet, just locations
for sheet in result.sheets:
    print(f"{sheet.name}: {len(sheet.tables)} tables found")
    for table in sheet.tables:
        print(f"  - Table at {table.range.excel_range}")
        print(f"    Size: {table.shape[0]} rows Ɨ {table.shape[1]} columns")
        print(f"    Confidence: {table.confidence:.1%}")
```
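
The example above uses `await` at the top level, which works in Jupyter/IPython or inside an existing async application. In a plain Python script, a minimal sketch is to wrap the call with `asyncio.run`:

```python
import asyncio

from gridgulp import GridGulp


async def main() -> None:
    porter = GridGulp()
    result = await porter.detect_tables("sales_report.xlsx")
    for sheet in result.sheets:
        print(f"{sheet.name}: {len(sheet.tables)} tables found")


# asyncio.run drives the coroutine to completion in a regular script
asyncio.run(main())
```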

### Getting DataFrames

To extract the actual data as pandas DataFrames:

```python
from gridgulp import GridGulp
from gridgulp.extractors import DataFrameExtractor
from gridgulp.readers import get_reader

# Step 1: Detect table locations
porter = GridGulp()
result = await porter.detect_tables("sales_report.xlsx")

# Step 2: Read the file data
reader = get_reader("sales_report.xlsx")
file_data = reader.read_sync()

# Step 3: Extract DataFrames from detected ranges
extractor = DataFrameExtractor()
for sheet_result in result.sheets:
    sheet_data = next(s for s in file_data.sheets if s.name == sheet_result.name)

    for table in sheet_result.tables:
        df, metadata, quality = extractor.extract_dataframe(sheet_data, table.range)
        if df is not None:
            print(f"\nšŸ“Š Extracted table from {table.range.excel_range}")
            print(f"   Shape: {df.shape} | Quality: {quality:.1%}")
            print(f"   Headers: {', '.join(df.columns[:5])}{'...' if len(df.columns) > 5 else ''}")
            # Now you have a pandas DataFrame to work with
            print(df.head())
```
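
Once extracted, the tables are ordinary pandas DataFrames, so the usual tooling applies. As one illustration (the output directory name is just an example), each detected table could be written to its own CSV file, continuing from the variables above:

```python
from pathlib import Path

out_dir = Path("extracted_tables")  # illustrative output directory
out_dir.mkdir(exist_ok=True)

for sheet_result in result.sheets:
    sheet_data = next(s for s in file_data.sheets if s.name == sheet_result.name)
    for idx, table in enumerate(sheet_result.tables):
        df, metadata, quality = extractor.extract_dataframe(sheet_data, table.range)
        if df is not None:
            # File name combines sheet name, table index, and Excel range, e.g. Sheet1_0_A1_E100.csv
            safe_range = table.range.excel_range.replace(":", "_")
            df.to_csv(out_dir / f"{sheet_result.name}_{idx}_{safe_range}.csv", index=False)
```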

### Processing Multiple Files

GridGulp makes it easy to process entire directories of spreadsheets:

```python
from gridgulp import GridGulp

# Create GridGulp instance
gg = GridGulp()

# Process all spreadsheets in a directory (recursively by default)
results = await gg.detect_tables_in_directory("~/data")

# Summary statistics
total_tables = sum(r.total_tables for r in results.values())
print(f"Found {total_tables} tables across {len(results)} files")

# Process each file's results
for file_path, result in results.items():
    if result.total_tables > 0:
        print(f"\n{file_path.name}:")
        for sheet in result.sheets:
            for table in sheet.tables:
                print(f"  - {sheet.name}: {table.range.excel_range}")
```
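
For a flat overview of everything that was detected, the same attributes can be collected into a summary DataFrame; a small sketch (the column names here are illustrative):

```python
import pandas as pd

rows = []
for file_path, result in results.items():
    for sheet in result.sheets:
        for table in sheet.tables:
            rows.append(
                {
                    "file": file_path.name,
                    "sheet": sheet.name,
                    "range": table.range.excel_range,
                    "confidence": table.confidence,
                }
            )

# One row per detected table, sorted with the most confident detections first
summary = pd.DataFrame(rows)
print(summary.sort_values("confidence", ascending=False).head(10))
```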

#### Directory Processing Options

```python
# Process only Excel files
results = await gg.detect_tables_in_directory(
    "~/reports",
    patterns=["*.xlsx", "*.xls"]
)

# Non-recursive (single directory only)
results = await gg.detect_tables_in_directory(
    "~/data",
    recursive=False
)

# With progress tracking for large directories
def show_progress(current, total):
    print(f"Processing file {current} of {total}...")

results = await gg.detect_tables_in_directory(
    "~/large_dataset",
    progress_callback=show_progress
)

# Sync version for Jupyter notebooks
results = gg.detect_tables_in_directory_sync("~/data")
```
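
Detection results can also be filtered before doing any extraction work, for example by confidence score (the 0.8 threshold below is illustrative, not a library default):

```python
# Keep only tables reported with at least 80% confidence
confident_tables = {
    file_path: [
        (sheet.name, table.range.excel_range)
        for sheet in result.sheets
        for table in sheet.tables
        if table.confidence >= 0.8
    ]
    for file_path, result in results.items()
}

for file_path, tables in confident_tables.items():
    if tables:
        print(f"{file_path.name}: {tables}")
```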

#### Extracting DataFrames from Multiple Files

```python
from gridgulp.extractors import DataFrameExtractor
from gridgulp.readers import get_reader

# Process directory and extract all tables as DataFrames
gg = GridGulp()
extractor = DataFrameExtractor()

results = gg.detect_tables_in_directory_sync("~/sales_data")

all_dataframes = []
for file_path, detection_result in results.items():
    if detection_result.total_tables > 0:
        # Read the file
        reader = get_reader(file_path)
        file_data = reader.read_sync()

        # Extract each table
        for sheet_result in detection_result.sheets:
            sheet_data = next(s for s in file_data.sheets if s.name == sheet_result.name)

            for table in sheet_result.tables:
                df, metadata, quality = extractor.extract_dataframe(sheet_data, table.range)
                if df is not None:
                    # Add source info to the dataframe
                    df['_source_file'] = file_path.name
                    df['_source_sheet'] = sheet_result.name
                    all_dataframes.append(df)

print(f"Extracted {len(all_dataframes)} tables from {len(results)} files")

# Combine all tables if they have the same structure
import pandas as pd
if all_dataframes:
    combined_df = pd.concat(all_dataframes, ignore_index=True)
    print(f"Combined dataset: {combined_df.shape}")
```
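
The `pd.concat` call above assumes every extracted table has the same columns. When a directory mixes layouts, one approach (a sketch, not part of GridGulp) is to group tables by their column signature and concatenate within each group:

```python
from collections import defaultdict

import pandas as pd

groups: dict[tuple, list[pd.DataFrame]] = defaultdict(list)
for df in all_dataframes:
    # Ignore the provenance columns added above when comparing table structures
    signature = tuple(col for col in df.columns if not col.startswith("_source"))
    groups[signature].append(df)

for signature, frames in groups.items():
    combined = pd.concat(frames, ignore_index=True)
    print(f"{len(frames)} table(s) with columns {list(signature)}: {combined.shape}")
```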

## Key Features

- **Automatic Detection** - Finds all tables with sensible defaults
- **Fully Configurable** - Customize detection thresholds and behavior
- **Smart Headers** - Detects single and multi-row headers automatically
- **Multiple Tables** - Handles sheets with multiple separate tables
- **Quality Scoring** - Confidence scores for each detected table
- **Fast** - Processes 1M+ cells/second for simple tables, 100K+ cells/second for complex tables

## Documentation

- [Full Usage Guide](docs/USAGE_GUIDE.md) - Detailed examples and configuration
- [API Reference](docs/API_REFERENCE.md) - Complete API documentation
- [Architecture](docs/ARCHITECTURE.md) - How GridGulp works internally
- [Testing Guide](docs/TESTING_GUIDE.md) - Test spreadsheets in bulk with the unified test script

## License

MIT License - see [LICENSE](LICENSE) file.

