# SheetWise
A Python package for encoding spreadsheets for Large Language Models, implementing the SpreadsheetLLM research framework.
## Overview
SheetWise implements the key components of Microsoft Research's SpreadsheetLLM paper, which describes how to encode spreadsheets efficiently for use with Large Language Models. The package provides:
- **SheetCompressor**: Efficient encoding framework with three compression modules
- **Chain of Spreadsheet**: Multi-step reasoning approach for spreadsheet analysis
- **Vanilla Encoding**: Traditional cell-by-cell encoding methods
- **Token Optimization**: Significant reduction in token usage for LLM processing
## Key Features
- **Intelligent Compression**: Up to 96% reduction in token usage while preserving semantic information
- **Auto-Configuration**: Automatically optimizes compression settings based on spreadsheet characteristics
- **Multi-LLM Support**: Provider-specific formats for ChatGPT, Claude, Gemini
- **Multi-Table Support**: Handles complex spreadsheets with multiple tables and regions
- **Structural Analysis**: Identifies and preserves important structural elements
- **LLM-Ready Output**: Generates optimized text for direct use with ChatGPT, Claude, etc.
- **Format-Aware**: Preserves data type and formatting information
- **Enhanced Algorithms**: Improved range detection and contiguous cell grouping
- **Easy Integration**: Simple API for immediate use
## Installation
### Using pip
```bash
pip install sheetwise
```
### Using Poetry
```bash
poetry add sheetwise
```
### Development Installation
```bash
git clone https://github.com/Khushiyant/sheetwise.git
cd sheetwise
poetry install
```
## Quick Start
### Basic Usage
```python
import pandas as pd
from sheetwise import SpreadsheetLLM
# Initialize the framework
sllm = SpreadsheetLLM()
# Load your spreadsheet
df = pd.read_excel("your_spreadsheet.xlsx")
# Compress and encode for LLM use
llm_ready_text = sllm.compress_and_encode_for_llm(df)
# Copy and paste this text directly into ChatGPT/Claude
print(llm_ready_text)
```
### Advanced Usage
```python
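import pandas as pd
from sheetwise import SpreadsheetLLM, SheetCompressor
# Load the spreadsheet to work with (same file as in Basic Usage)
df = pd.read_excel("your_spreadsheet.xlsx")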
# Auto-configuration (NEW!)
sllm = SpreadsheetLLM(enable_logging=True)
auto_compressed = sllm.compress_with_auto_config(df) # Automatically optimizes settings
# LLM Provider-specific formats (NEW!)
compressed = sllm.compress_spreadsheet(df)
chatgpt_format = sllm.encode_for_llm_provider(compressed, "chatgpt")
claude_format = sllm.encode_for_llm_provider(compressed, "claude")
gemini_format = sllm.encode_for_llm_provider(compressed, "gemini")
# Manual configuration
compressor = SheetCompressor(
    k=2,  # Structural anchor neighborhood size
    use_extraction=True,
    use_translation=True,
    use_aggregation=True,
)
# Compress the spreadsheet
compressed_result = compressor.compress(df)
print(f"Compression ratio: {compressed_result['compression_ratio']:.1f}x")
print(f"Compressed shape: {compressed_result['compressed_df'].shape}")
# Or use with SpreadsheetLLM for full pipeline
sllm = SpreadsheetLLM(compression_params={
    'k': 2,
    'use_extraction': True,
    'use_translation': True,
    'use_aggregation': True,
})
# Get detailed statistics
stats = sllm.get_encoding_stats(df)
print(f"Token reduction: {stats['token_reduction_ratio']:.1f}x")
# Process QA queries
result = sllm.process_qa_query(df, "What was the total revenue in 2023?")
```
### Command Line Interface
```bash
# Compress a spreadsheet file
sheetwise input.xlsx -o output.txt --stats
# Auto-configure compression (NEW!)
sheetwise input.xlsx --auto-config --verbose
# Run demo with sample data
sheetwise --demo --auto-config
# Run demo with vanilla encoding
sheetwise --demo --vanilla --stats
# Run demo with JSON output format
sheetwise --demo --format json
# Run demo with auto-config and JSON output
sheetwise --demo --auto-config --format json --verbose
# Use vanilla encoding instead of compression
sheetwise input.xlsx --vanilla
# Output in JSON format
sheetwise input.xlsx --format json
```
## Core Components
### 1. SheetCompressor
The main compression framework with three modules:
- **Structural Anchor Extraction**: Identifies and preserves structurally important rows/columns
- **Inverted Index Translation**: Creates efficient value-to-location mappings
- **Data Format Aggregation**: Groups cells by data type and format
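
The last two modules can be pictured with a small standalone sketch. It is not SheetWise's implementation: the function names (`build_inverted_index`, `aggregate_by_type`, `column_letter`, `classify`), the list-of-lists grid, and the coarse type labels are all assumptions made for illustration.

```python
from collections import defaultdict


def column_letter(index):
    """Convert a zero-based column index to a spreadsheet letter (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def classify(value):
    """Assign a coarse, hypothetical type label to a cell value."""
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    return "text"


def build_inverted_index(grid):
    """Map each non-empty value to the addresses of the cells that hold it."""
    index = defaultdict(list)
    for row_number, row in enumerate(grid, start=1):
        for col_number, value in enumerate(row):
            if value is None or value == "":
                continue  # empty cells are left out of the index entirely
            index[value].append(f"{column_letter(col_number)}{row_number}")
    return dict(index)


def aggregate_by_type(grid):
    """Group the addresses of non-empty cells by their coarse type label."""
    groups = defaultdict(list)
    for row_number, row in enumerate(grid, start=1):
        for col_number, value in enumerate(row):
            if value is None or value == "":
                continue
            groups[classify(value)].append(f"{column_letter(col_number)}{row_number}")
    return dict(groups)


sample_grid = [
    ["Region", "Q1", "Q2"],
    ["North", 100, 100],
    ["South", None, 100],
]
print(build_inverted_index(sample_grid))
print(aggregate_by_type(sample_grid))
```

In this sketch a repeated value such as `100` is stored once with three addresses, and the empty cell never appears, which is where the token savings come from.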
### 2. Chain of Spreadsheet
Multi-step reasoning approach:
1. **Table Identification**: Automatically detects table regions
2. **Compression**: Applies SheetCompressor to reduce size
3. **Query Processing**: Identifies relevant regions for specific queries
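
Steps 1 and 3 can be sketched in plain Python as below. This is an illustration under simplifying assumptions, not the package's algorithm: tables are assumed to be separated by fully empty rows, relevance is scored by naive keyword overlap, and `find_table_regions`, `score_region`, and `pick_region` are hypothetical names.

```python
def is_empty(value):
    """Treat None and blank strings as empty cells."""
    return value is None or str(value).strip() == ""


def find_table_regions(grid):
    """Return (start_row, end_row) pairs, zero-based and inclusive, for runs of non-empty rows."""
    regions = []
    start = None
    for i, row in enumerate(grid):
        row_has_data = any(not is_empty(v) for v in row)
        if row_has_data and start is None:
            start = i
        elif not row_has_data and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(grid) - 1))
    return regions


def score_region(grid, region, query):
    """Count how many distinct query words also appear in the region's cells."""
    query_words = {w.strip("?.,!").lower() for w in query.split()}
    query_words.discard("")
    start, end = region
    cell_words = set()
    for row in grid[start:end + 1]:
        for value in row:
            if not is_empty(value):
                cell_words.update(str(value).lower().split())
    return len(query_words & cell_words)


def pick_region(grid, query):
    """Return the best-scoring region, or None when the grid holds no data."""
    regions = find_table_regions(grid)
    if not regions:
        return None
    return max(regions, key=lambda region: score_region(grid, region, query))


sample_grid = [
    ["Revenue", "2022", "2023"],
    ["Product A", 120, 150],
    [None, None, None],
    ["Headcount", "2022", "2023"],
    ["Engineering", 40, 55],
]
print(find_table_regions(sample_grid))
print(pick_region(sample_grid, "What was the total revenue in 2023?"))
```

Only the selected region would then need to be compressed and sent to the model, rather than the whole sheet.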
### 3. Vanilla Encoder
Traditional encoding methods for comparison and compatibility.
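
For comparison, cell-by-cell encoding can be pictured as follows. This is a minimal sketch rather than SheetWise's `encode_vanilla`: the `ADDRESS,value` layout, the `|` separator, and the helper names are assumptions for illustration.

```python
def column_letter(index):
    """Convert a zero-based column index to a spreadsheet letter (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def encode_vanilla_sketch(grid):
    """Emit every cell, empty ones included, as 'ADDRESS,value' with one line per row."""
    lines = []
    for row_number, row in enumerate(grid, start=1):
        cells = []
        for col_number, value in enumerate(row):
            text = "" if value is None else str(value)
            cells.append(f"{column_letter(col_number)}{row_number},{text}")
        lines.append("|".join(cells))
    return "\n".join(lines)


sample_grid = [
    ["Item", "Price"],
    ["Pen", 1.5],
    [None, None],
]
print(encode_vanilla_sketch(sample_grid))
```

Every cell costs tokens here, even the empty ones in the last row, which is the overhead the compression modules aim to remove.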
## Examples
### Working with Financial Data
```python
from sheetwise import SpreadsheetLLM
from sheetwise.utils import create_realistic_spreadsheet
# Create sample financial spreadsheet
df = create_realistic_spreadsheet()
sllm = SpreadsheetLLM()
# Analyze the data
stats = sllm.get_encoding_stats(df)
print(f"Original size: {stats['original_shape']}")
print(f"Sparsity: {stats['sparsity_percentage']:.1f}% empty cells")
print(f"Compression: {stats['compression_ratio']:.1f}x smaller")
# Generate LLM-ready output
encoded = sllm.compress_and_encode_for_llm(df)
print("\nReady for LLM:")
print(encoded[:300] + "...")
```
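
The sparsity figure printed above is reported as a percentage of empty cells. A rough standalone equivalent is shown below; the function name and the rule that `None` and `""` count as empty are assumptions, and the package may define emptiness differently.

```python
def sparsity_percentage(grid):
    """Return the share of empty cells in a list-of-lists grid, as a percentage."""
    total = sum(len(row) for row in grid)
    if total == 0:
        return 0.0  # an empty grid has no cells to count
    empty = sum(1 for row in grid for value in row if value is None or value == "")
    return 100.0 * empty / total


sample_grid = [
    ["Account", "Jan", "Feb", None],
    ["Sales", 10, None, None],
    [None, None, "", 5],
]
print(f"Sparsity: {sparsity_percentage(sample_grid):.1f}% empty cells")
```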
### Custom Compression Pipeline
```python
from sheetwise import SheetCompressor
from sheetwise.utils import create_realistic_spreadsheet
# Sample data to compress (same helper as in the financial example)
df = create_realistic_spreadsheet()
# Compare different compression strategies
configs = [
    {"name": "Extraction Only", "use_translation": False, "use_aggregation": False},
    {"name": "Translation Only", "use_extraction": False, "use_aggregation": False},
    {"name": "All Modules", "use_extraction": True, "use_translation": True, "use_aggregation": True},
]
for config in configs:
    # Pass every key except the display name through as a keyword argument
    params = {key: value for key, value in config.items() if key != "name"}
    compressor = SheetCompressor(**params)
    result = compressor.compress(df)
    print(f"{config['name']}: {result['compression_ratio']:.1f}x compression")
```
## Performance
Compared with vanilla encoding, the SpreadsheetLLM approach uses far fewer tokens and copes better with sparse, multi-table sheets:
| Metric | Vanilla | SpreadsheetLLM | Improvement |
|--------|---------|----------------|-------------|
| Token Count | ~25,000 | ~1,200 | **96% reduction** |
| Sparsity Handling | Poor | Excellent | **Removes empty regions** |
| Multi-Table Support | Limited | Native | **Preserves structure** |
| Format Preservation | Basic | Advanced | **Type-aware grouping** |
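
The percentage reduction in the table and the "x" ratios printed by the examples are two views of the same measurement. The sketch below shows how they relate; the function names are hypothetical and the inputs are the rounded token counts from the table.

```python
def reduction_percent(original_tokens, compressed_tokens):
    """Percentage of tokens saved relative to the original encoding."""
    return 100.0 * (1 - compressed_tokens / original_tokens)


def compression_ratio(original_tokens, compressed_tokens):
    """How many times smaller the compressed encoding is."""
    return original_tokens / compressed_tokens


# Rounded token counts taken from the table above
original, compressed = 25_000, 1_200
print(f"Reduction: {reduction_percent(original, compressed):.1f}%")
print(f"Ratio: {compression_ratio(original, compressed):.1f}x")

# A 25x ratio corresponds to a 96% reduction, since 1 - 1/25 = 0.96
print(f"25x ratio: {reduction_percent(25, 1):.0f}% reduction")
```

With the rounded counts, the result is about 95% (roughly 21x), slightly under the headline 96%, which corresponds to a 25x ratio.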
## API Reference
### SpreadsheetLLM Class
The main interface for the framework.
#### Methods
- `compress_and_encode_for_llm(df)`: One-step compression and encoding
- `compress_spreadsheet(df)`: Apply compression pipeline
- `encode_vanilla(df)`: Traditional encoding
- `get_encoding_stats(df)`: Detailed compression statistics
- `process_qa_query(df, query)`: Chain of Spreadsheet reasoning
- `load_from_file(filepath)`: Load spreadsheet from file
### SheetCompressor Class
Core compression framework.
#### Parameters
- `k`: Structural anchor neighborhood size (default: 4)
- `use_extraction`: Enable structural extraction (default: True)
- `use_translation`: Enable inverted index translation (default: True)
- `use_aggregation`: Enable format aggregation (default: True)
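
To give a feel for what the neighborhood size `k` controls, here is a simplified, rows-only sketch. It assumes an anchor is any row whose pattern of empty, numeric, and text cells differs from the row above it; that heuristic and the names `cell_kind`, `find_anchor_rows`, and `rows_to_keep` are illustrative and not taken from the package.

```python
def cell_kind(value):
    """Label a cell as empty, number, or text."""
    if value is None or value == "":
        return "empty"
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "number"
    return "text"


def find_anchor_rows(grid):
    """Return indices of rows whose cell-kind pattern differs from the previous row."""
    anchors = []
    previous = None
    for i, row in enumerate(grid):
        pattern = tuple(cell_kind(v) for v in row)
        if pattern != previous:
            anchors.append(i)
        previous = pattern
    return anchors


def rows_to_keep(grid, k):
    """Keep every row that lies within k rows of an anchor row."""
    keep = set()
    for anchor in find_anchor_rows(grid):
        for i in range(max(0, anchor - k), min(len(grid), anchor + k + 1)):
            keep.add(i)
    return sorted(keep)


# One header row, ten uniform data rows, and a trailing empty row
sample_grid = [["Name", "Value"]] + [[f"item{i}", i] for i in range(1, 11)] + [[None, None]]
print(find_anchor_rows(sample_grid))
print(rows_to_keep(sample_grid, k=1))
```

A smaller `k` discards more of the uniform rows between anchors, while a larger `k` keeps more context around each boundary.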
## Contributing
We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.
### Development Setup
```bash
# Clone the repository
git clone https://github.com/Khushiyant/sheetwise.git
cd sheetwise
# Install development dependencies
poetry install
# Run tests
poetry run pytest
# Run linting
poetry run black src tests
poetry run isort src tests
poetry run flake8 src tests
```
### Running Tests
```bash
# Run all tests
poetry run pytest
# Run with coverage
poetry run pytest --cov=src
# Run specific test file
poetry run pytest tests/test_core.py
```
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Citation
If you use SpreadsheetLLM in your research, please cite:
```bibtex
@article{spreadsheetllm2024,
title={SpreadsheetLLM: Encoding Spreadsheets for Large Language Models},
author={Microsoft Research Team},
journal={arXiv preprint},
year={2024}
}
```
## Support
- [Documentation](https://sheetwise.readthedocs.io)
- [Issue Tracker](https://github.com/Khushiyant/sheetwise/issues)
- [Discussions](https://github.com/Khushiyant/sheetwise/discussions)
Raw data
{
"_id": null,
"home_page": "https://github.com/Khushiyant/sheetwise",
"name": "sheetwise",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0.0,>=3.8.1",
"maintainer_email": null,
"keywords": "spreadsheet, llm, encoding, compression, data-processing",
"author": "Khushiyant Chauhan",
"author_email": "khushiyant@example.com",
"download_url": "https://files.pythonhosted.org/packages/63/cd/1e0fbe1713a55a95edc81c046a61190e725a770e46a86b90e5d07472a451/sheetwise-1.2.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A Python package for encoding spreadsheets for Large Language Models, implementing the SpreadsheetLLM research framework",
"version": "1.2.0",
"project_urls": {
"Documentation": "https://khushiyant.github.io/sheetwise",
"Homepage": "https://github.com/Khushiyant/sheetwise",
"Repository": "https://github.com/Khushiyant/sheetwise"
},
"split_keywords": [
"spreadsheet",
" llm",
" encoding",
" compression",
" data-processing"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "70ac3277fec297d0c2295a69386838d8db3c1116405ea528b1d46f39111d77ef",
"md5": "bfdcd969085364ca82316394c3a34627",
"sha256": "5a3c886a00db45249d17f057816a5cfe2ce67488d8e269dc8e90f146937f0a05"
},
"downloads": -1,
"filename": "sheetwise-1.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "bfdcd969085364ca82316394c3a34627",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0.0,>=3.8.1",
"size": 21779,
"upload_time": "2025-07-29T22:30:36",
"upload_time_iso_8601": "2025-07-29T22:30:36.092538Z",
"url": "https://files.pythonhosted.org/packages/70/ac/3277fec297d0c2295a69386838d8db3c1116405ea528b1d46f39111d77ef/sheetwise-1.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "63cd1e0fbe1713a55a95edc81c046a61190e725a770e46a86b90e5d07472a451",
"md5": "f439f18e8cb75893790adc2887a22c90",
"sha256": "5e4f6ad92d3585eaba20b159ce222d513115c796545e0622bcc67334aa7599c5"
},
"downloads": -1,
"filename": "sheetwise-1.2.0.tar.gz",
"has_sig": false,
"md5_digest": "f439f18e8cb75893790adc2887a22c90",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0.0,>=3.8.1",
"size": 20335,
"upload_time": "2025-07-29T22:30:37",
"upload_time_iso_8601": "2025-07-29T22:30:37.175029Z",
"url": "https://files.pythonhosted.org/packages/63/cd/1e0fbe1713a55a95edc81c046a61190e725a770e46a86b90e5d07472a451/sheetwise-1.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-29 22:30:37",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Khushiyant",
"github_project": "sheetwise",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "sheetwise"
}