# GridGulp
[PyPI](https://pypi.org/project/gridgulp/)
[License: MIT](https://opensource.org/licenses/MIT)
[Documentation](https://ganymede-bio.github.io/gridgulp/)
Automatically detect and extract tables from Excel, CSV, and text files.
## What is GridGulp?
GridGulp finds tables in your spreadsheets, even when:

- there are multiple tables on one sheet
- tables don't start at cell A1
- the file extension doesn't match the actual file type
- the file encoding isn't obvious
**Supported formats:** `.xlsx`, `.xls`, `.xlsm`, `.csv`, `.tsv`, `.txt`
## Installation
```bash
pip install gridgulp
```
## Quick Start
### Trying GridGulp Out
To quickly try GridGulp on some spreadsheets, clone the repo, place example spreadsheets in the `examples/` directory, and run:
```bash
python scripts/test_example_files.py
```
You will see output like the following, summarizing the identified table ranges:

📁 tests/manual
----------------------------------------------------------------------------------------------------
✓ sample.xlsx | Tables: 1 | Time: 1.099s | Size: 122.6KB | Method: magika
  📄 Sheet: Sheet
    └─ A1:CV203 | 203×100 | Conf: 70%
### Table Ranges vs DataFrames
GridGulp provides two ways to work with detected tables:
1. **Table Ranges** - JSON metadata about where tables are located (e.g., "A1:E100")
- Fast and memory-efficient
- Perfect for agent use as tools - mapping table locations or visualizing spreadsheet structure
- No actual data is loaded into memory
2. **DataFrames** - The actual data extracted from those ranges as pandas DataFrames
- Contains the full data with proper types
- Ready for analysis, transformation, or export
Choose based on your needs:

- Use **ranges only** when you just need to know where tables are and want to hand those locations off to other tasks - for example, a downstream process that infers a table's purpose or intent from its contents
- Use **DataFrames** when you need to analyze or transform the actual data
### Getting Table Ranges Only
```python
from gridgulp import GridGulp
# Detect tables in a file (lightweight - just finds locations)
porter = GridGulp()
result = await porter.detect_tables("sales_report.xlsx")
# Process results - no data loaded yet, just locations
for sheet in result.sheets:
    print(f"{sheet.name}: {len(sheet.tables)} tables found")
    for table in sheet.tables:
        print(f"  - Table at {table.range.excel_range}")
        print(f"    Size: {table.shape[0]} rows × {table.shape[1]} columns")
        print(f"    Confidence: {table.confidence:.1%}")
```
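The examples in this README use `await` directly, which works in Jupyter or inside an existing async application. In a plain script, one option is to wrap the call with the standard-library `asyncio.run`; a minimal sketch (the file name is illustrative):

```python
import asyncio

from gridgulp import GridGulp


async def main() -> None:
    gg = GridGulp()
    # Detect table locations only; no cell data is loaded here.
    result = await gg.detect_tables("sales_report.xlsx")  # illustrative file name
    for sheet in result.sheets:
        print(f"{sheet.name}: {len(sheet.tables)} tables found")


if __name__ == "__main__":
    asyncio.run(main())
```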
### Getting DataFrames
To extract the actual data as pandas DataFrames:
```python
from gridgulp import GridGulp
from gridgulp.extractors import DataFrameExtractor
from gridgulp.readers import get_reader
# Step 1: Detect table locations
porter = GridGulp()
result = await porter.detect_tables("sales_report.xlsx")
# Step 2: Read the file data
reader = get_reader("sales_report.xlsx")
file_data = reader.read_sync()
# Step 3: Extract DataFrames from detected ranges
extractor = DataFrameExtractor()
for sheet_result in result.sheets:
    sheet_data = next(s for s in file_data.sheets if s.name == sheet_result.name)

    for table in sheet_result.tables:
        df, metadata, quality = extractor.extract_dataframe(sheet_data, table.range)
        if df is not None:
            print(f"\n📊 Extracted table from {table.range.excel_range}")
            print(f"   Shape: {df.shape} | Quality: {quality:.1%}")
            print(f"   Headers: {', '.join(df.columns[:5])}{'...' if len(df.columns) > 5 else ''}")
            # Now you have a pandas DataFrame to work with
            print(df.head())
```
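Once a table is extracted, standard pandas tooling applies. As one hedged illustration continuing the loop above, each DataFrame could be written to its own CSV file (the file-naming scheme is purely illustrative, not a GridGulp feature):

```python
# Continuing inside the `if df is not None:` branch above:
# derive a file name from the sheet name and cell range, then write the table out.
safe_range = table.range.excel_range.replace(":", "-")
out_path = f"{sheet_result.name}_{safe_range}.csv"
df.to_csv(out_path, index=False)
print(f"Saved {out_path}")
```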
### Processing Multiple Files
GridGulp makes it easy to process entire directories of spreadsheets:
```python
from gridgulp import GridGulp
# Create GridGulp instance
gg = GridGulp()
# Process all spreadsheets in a directory (recursively by default)
results = await gg.detect_tables_in_directory("~/data")
# Summary statistics
total_tables = sum(r.total_tables for r in results.values())
print(f"Found {total_tables} tables across {len(results)} files")
# Process each file's results
for file_path, result in results.items():
    if result.total_tables > 0:
        print(f"\n{file_path.name}:")
        for sheet in result.sheets:
            for table in sheet.tables:
                print(f"  - {sheet.name}: {table.range.excel_range}")
```
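Each detected table carries a confidence score (shown in the single-file example above), so low-confidence hits can be filtered out before further processing. A small sketch reusing the `results` dict from above; the 0.8 threshold is an arbitrary choice for illustration:

```python
MIN_CONFIDENCE = 0.8  # arbitrary cutoff for this example

for file_path, result in results.items():
    for sheet in result.sheets:
        # Keep only tables whose detection confidence clears the threshold.
        confident = [t for t in sheet.tables if t.confidence >= MIN_CONFIDENCE]
        if confident:
            ranges = ", ".join(t.range.excel_range for t in confident)
            print(f"{file_path.name} / {sheet.name}: {ranges}")
```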
#### Directory Processing Options
```python
# Process only Excel files
results = await gg.detect_tables_in_directory(
    "~/reports",
    patterns=["*.xlsx", "*.xls"]
)

# Non-recursive (single directory only)
results = await gg.detect_tables_in_directory(
    "~/data",
    recursive=False
)

# With progress tracking for large directories
def show_progress(current, total):
    print(f"Processing file {current} of {total}...")

results = await gg.detect_tables_in_directory(
    "~/large_dataset",
    progress_callback=show_progress
)
# Sync version for Jupyter notebooks
results = gg.detect_tables_in_directory_sync("~/data")
```
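Because detection results are just range metadata, a whole directory scan can be summarized as JSON without loading any cell data. A minimal sketch using only the attributes shown above (the output layout here is one possible shape, not a GridGulp API):

```python
import json

# Build a {file path: [table range records]} map from the detection results.
summary = {
    str(file_path): [
        {
            "sheet": sheet.name,
            "range": table.range.excel_range,
            "confidence": round(table.confidence, 3),
        }
        for sheet in result.sheets
        for table in sheet.tables
    ]
    for file_path, result in results.items()
}

with open("table_map.json", "w") as fh:
    json.dump(summary, fh, indent=2)
```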
#### Extracting DataFrames from Multiple Files
```python
from gridgulp.extractors import DataFrameExtractor
from gridgulp.readers import get_reader
# Process directory and extract all tables as DataFrames
gg = GridGulp()
extractor = DataFrameExtractor()
results = gg.detect_tables_in_directory_sync("~/sales_data")
all_dataframes = []
for file_path, detection_result in results.items():
    if detection_result.total_tables > 0:
        # Read the file
        reader = get_reader(file_path)
        file_data = reader.read_sync()

        # Extract each table
        for sheet_result in detection_result.sheets:
            sheet_data = next(s for s in file_data.sheets if s.name == sheet_result.name)

            for table in sheet_result.tables:
                df, metadata, quality = extractor.extract_dataframe(sheet_data, table.range)
                if df is not None:
                    # Add source info to the dataframe
                    df['_source_file'] = file_path.name
                    df['_source_sheet'] = sheet_result.name
                    all_dataframes.append(df)
print(f"Extracted {len(all_dataframes)} tables from {len(results)} files")
# Combine all tables if they have the same structure
import pandas as pd
if all_dataframes:
    combined_df = pd.concat(all_dataframes, ignore_index=True)
    print(f"Combined dataset: {combined_df.shape}")
```
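`pd.concat` only produces a meaningful combined table when the frames share the same columns. One cautious approach is to group the extracted tables by column signature first; a sketch assuming the `all_dataframes` list built above:

```python
from collections import defaultdict

import pandas as pd

# Group extracted tables by their exact column names so that only
# like-structured tables get concatenated together.
groups: dict[tuple, list[pd.DataFrame]] = defaultdict(list)
for df in all_dataframes:
    groups[tuple(df.columns)].append(df)

for columns, frames in groups.items():
    combined = pd.concat(frames, ignore_index=True)
    print(f"{len(frames)} table(s) with columns {list(columns)}: {combined.shape}")
```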
## Key Features
- **Automatic Detection** - Finds all tables with sensible defaults
- **Fully Configurable** - Customize detection thresholds and behavior
- **Smart Headers** - Detects single and multi-row headers automatically
- **Multiple Tables** - Handles sheets with multiple separate tables
- **Quality Scoring** - Confidence scores for each detected table
- **Fast** - Processes 1M+ cells/second for simple tables, 100K+ cells/second for complex tables
## Documentation
- [Full Usage Guide](docs/USAGE_GUIDE.md) - Detailed examples and configuration
- [API Reference](docs/API_REFERENCE.md) - Complete API documentation
- [Architecture](docs/ARCHITECTURE.md) - How GridGulp works internally
- [Testing Guide](docs/TESTING_GUIDE.md) - Test spreadsheets in bulk with the unified test script
## License
MIT License - see [LICENSE](LICENSE) file.