# Deus LLM Token Stats Guru
Advanced LLM token analysis and statistics toolkit for comprehensive document processing and multi-format text
extraction.
[Python 3.10+](https://www.python.org/downloads/)
[MIT License](https://opensource.org/licenses/MIT)
## Features
- **📊 Multi-Format Document Processing**: Supports 25+ file formats across office documents, data files, and text formats
- **🎯 Accurate Token Counting**: Uses OpenAI's tiktoken library for precise token counts
- **🔄 Recursive Processing**: Automatically discovers and processes all supported files in directories
- **🤖 Multiple Encoding Models**: Supports different OpenAI models (gpt-4, gpt-3.5-turbo, etc.)
- **⚡ Dual CLI Interface**: Two convenient command-line tools (`deus-llm-token-guru` and `llm-token-stats`)
- **📈 Comprehensive Analytics**: Detailed statistics including file sizes, row counts, and processing times
- **💾 JSON Export**: Results can be exported to JSON format for further analysis
- **🛡️ Type Safety**: Full type hints and modern Python features
- **🔧 Extensible Architecture**: Modular processor-based design for easy format additions
- **🏢 Office Suite Support**: Full Microsoft Office and OpenDocument format compatibility
- **🌐 Web Document Support**: HTML, RTF, and other web-based document formats
- **📝 Developer-Friendly**: Supports source code files and configuration formats
## Installation
```bash
pip install deus-llm-token-stats-guru
```
### Optional Dependencies
For full format support, install with specific extras:
```bash
# Microsoft Office formats (Word, Excel, PowerPoint)
pip install deus-llm-token-stats-guru[office]
# PDF processing with multiple extraction methods
pip install deus-llm-token-stats-guru[pdf]
# Excel files (.xlsx, .xls)
pip install deus-llm-token-stats-guru[excel]
# Word documents (.docx, .doc)
pip install deus-llm-token-stats-guru[docx]
# PowerPoint presentations (.pptx, .ppt)
pip install deus-llm-token-stats-guru[powerpoint]
# OpenDocument formats (LibreOffice/OpenOffice)
pip install deus-llm-token-stats-guru[opendocument]
# RTF (Rich Text Format)
pip install deus-llm-token-stats-guru[rtf]
# HTML processing
pip install deus-llm-token-stats-guru[html]
# Install all optional dependencies
pip install deus-llm-token-stats-guru[all]
```
## Supported File Formats
The package supports **25+ file extensions** across **10 specialized processors**:
### 📊 Data & Spreadsheet Files
- **CSV/TSV**: `.csv`, `.tsv` - Comma/tab-separated values with robust parsing
- **Excel**: `.xlsx`, `.xls` - Microsoft Excel workbooks (all sheets)
### 📝 Document Files
- **Microsoft Word**: `.docx`, `.doc` - Word documents with full text extraction
- **PDF**: `.pdf` - Portable Document Format with multiple extraction engines
- **RTF**: `.rtf` - Rich Text Format (compatible with Google Docs exports)
### 📊 Presentation Files
- **PowerPoint**: `.pptx`, `.ppt` - Microsoft PowerPoint presentations
### 🏢 OpenDocument Formats
- **OpenDocument**: `.odt`, `.ods`, `.odp`, `.odg`, `.odf` - LibreOffice/OpenOffice formats
### 🌐 Web & Markup Files
- **HTML**: `.html`, `.htm`, `.xhtml` - Web documents with tag parsing
- **JSON**: `.json`, `.jsonl`, `.ndjson` - JavaScript Object Notation
### 📄 Text & Source Code Files
- **Text**: `.txt`, `.text`, `.log`
- **Markdown**: `.md`, `.markdown`, `.mdown`, `.mkd`
- **Documentation**: `.rst` (reStructuredText)
- **Programming Languages**: `.py`, `.js`, `.c`, `.cpp`, `.h`, `.hpp`, `.java`, `.cs`, `.php`, `.rb`, `.go`, `.rs`
- **Web Technologies**: `.html`, `.css`, `.xml`
- **Configuration**: `.yml`, `.yaml`, `.toml`, `.ini`, `.cfg`
- **Scripts**: `.sh`, `.bat`, `.ps1`
### 🚀 Processing Features by Format
| Format Category | Extensions | Key Features |
|-------------------|-----------------------------|-----------------------------------------------------------|
| **CSV/Data** | `.csv`, `.tsv` | Multi-strategy parsing, malformed file handling |
| **Office Docs** | `.docx`, `.xlsx`, `.pptx` | Full text extraction, multi-sheet/slide support |
| **Legacy Office** | `.doc`, `.xls`, `.ppt` | Backward compatibility with older formats |
| **PDF** | `.pdf` | Multiple extraction engines (PyMuPDF, PyPDF2, pdfplumber) |
| **OpenDocument** | `.odt`, `.ods`, `.odp` | Native LibreOffice format support |
| **Web/Markup** | `.html`, `.xml`, `.json` | Tag parsing, structure preservation |
| **Source Code** | `.py`, `.js`, `.java`, etc. | Syntax-aware text extraction |
| **Config Files** | `.yml`, `.toml`, `.ini` | Configuration format parsing |
## Quick Start
### Command Line Usage
```bash
# Count tokens in all supported files in current directory
deus-llm-token-guru .
# Alternative command (same functionality)
llm-token-stats ./data
# Process specific directory with all file types
deus-llm-token-guru /path/to/documents --model gpt-3.5-turbo
# Save comprehensive results to JSON file
llm-token-stats /path/to/documents --output analysis.json
# Enable debug logging to see processing details
deus-llm-token-guru /path/to/documents --debug --log-file debug.log
# Quiet mode (suppress progress output)
llm-token-stats ./data --quiet
# Process mixed file types recursively
deus-llm-token-guru ./project_docs --model gpt-4
```
### Python API Usage
```python
from deus_llm_token_stats_guru import DocumentProcessor
from pathlib import Path
# Initialize document processor
processor = DocumentProcessor(encoding_model="gpt-4")
# Count tokens in a single file (any supported format)
result = processor.count_tokens_in_file(Path("document.pdf"))
print(f"Total tokens: {result['total_tokens']:,}")
print(f"File type: {result['file_type']}")
# Process entire directory (all supported file types)
summary = processor.count_tokens_in_directory(Path("./documents"))
print(f"Processed {summary['total_files']} files")
print(f"Total tokens: {summary['total_tokens']:,}")
print(f"File types found: {set(r['file_type'] for r in summary['file_results'])}")
# Backward compatibility - CSV-specific processing
from deus_llm_token_stats_guru import CSVTokenCounter
csv_counter = CSVTokenCounter(encoding_model="gpt-4")
csv_result = csv_counter.count_tokens_in_csv(Path("data.csv"))
```
## Available Commands
The package provides two CLI commands:
- **`deus-llm-token-guru`**: Main command name
- **`llm-token-stats`**: Alternative shorter command
Both commands have identical functionality.
## API Reference
### DocumentProcessor
Main class for counting tokens across all supported file formats.
#### Methods
- `__init__(encoding_model: str = "gpt-4")`: Initialize with specific encoding model
- `count_tokens_in_text(text: str) -> int`: Count tokens in a text string
- `count_tokens_in_file(file_path: Path) -> CountResult`: Count tokens in any supported file format
- `count_tokens_in_directory(directory: Path) -> CountSummary`: Process all supported files in directory recursively
- `get_supported_extensions() -> Set[str]`: Get all supported file extensions
- `get_processor_for_file(file_path: Path) -> BaseFileProcessor`: Get appropriate processor for file
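Directory processing boils down to matching each file's suffix against the supported extension set. A minimal standard-library sketch of that discovery step (the `find_supported_files` helper below is illustrative, not the package's actual implementation):

```python
from pathlib import Path
from typing import Iterator, Set

def find_supported_files(directory: Path, extensions: Set[str]) -> Iterator[Path]:
    """Yield every file under `directory` whose suffix is in the supported set."""
    for path in sorted(directory.rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            yield path

# Example: restrict a scan to CSV and Markdown files
# files = list(find_supported_files(Path("./documents"), {".csv", ".md"}))
```

Suffixes are lower-cased before comparison so `REPORT.CSV` and `report.csv` are treated alike.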
### CSVTokenCounter (Backward Compatibility)
Legacy class maintained for backward compatibility. Inherits from DocumentProcessor.
#### Methods
- `count_tokens_in_csv(file_path: Path) -> CountResult`: Count tokens in CSV file (alias for count_tokens_in_file)
#### Type Definitions
```python
from typing import List, Set, TypedDict


class CountResult(TypedDict):
    file_path: str
    total_tokens: int
    row_count: int       # For structured data (CSV, Excel)
    column_count: int    # For structured data (CSV, Excel)
    encoding_model: str
    file_size_bytes: int
    file_type: str       # NEW: Processor type (CSV, PDF, DOCX, etc.)
    processor_name: str  # NEW: Human-readable processor name


class CountSummary(TypedDict):
    total_files: int
    total_tokens: int
    total_rows: int
    total_file_size_bytes: int
    encoding_model: str
    file_results: List[CountResult]
    processing_time_seconds: float
    supported_extensions: Set[str]  # NEW: All supported file extensions
```
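The summary fields are plain aggregates of the per-file results. A simplified sketch of that roll-up using only the standard library (the `summarize` helper is hypothetical; timing and extension tracking are omitted):

```python
from typing import List

def summarize(file_results: List[dict], encoding_model: str) -> dict:
    """Aggregate per-file CountResult-style dicts into CountSummary-style totals."""
    return {
        "total_files": len(file_results),
        "total_tokens": sum(r["total_tokens"] for r in file_results),
        "total_rows": sum(r["row_count"] for r in file_results),
        "total_file_size_bytes": sum(r["file_size_bytes"] for r in file_results),
        "encoding_model": encoding_model,
        "file_results": file_results,
    }
```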
## Examples
### Multi-Format File Processing
```python
from deus_llm_token_stats_guru import DocumentProcessor
from pathlib import Path
# Initialize processor
processor = DocumentProcessor(encoding_model="gpt-4")
# Process different file types
files_to_process = [
    "report.pdf",           # PDF document
    "data.xlsx",            # Excel spreadsheet
    "presentation.pptx",    # PowerPoint presentation
    "article.docx",         # Word document
    "config.json",          # JSON data
    "readme.md",            # Markdown text
    "analysis.csv"          # CSV data
]

for file_path in files_to_process:
    if Path(file_path).exists():
        result = processor.count_tokens_in_file(Path(file_path))

        print(f"File: {file_path}")
        print(f"Type: {result['file_type']}")
        print(f"Tokens: {result['total_tokens']:,}")
        print(f"Size: {result['file_size_bytes']:,} bytes")
        print("---")
# Get all supported extensions
extensions = processor.get_supported_extensions()
print(f"Supported extensions ({len(extensions)}): {', '.join(sorted(extensions))}")
```
### Directory Processing with Different Models
```python
from deus_llm_token_stats_guru import DocumentProcessor
from pathlib import Path
models = ["gpt-4", "gpt-3.5-turbo"]
directory = Path("./mixed_documents")
for model in models:
    processor = DocumentProcessor(encoding_model=model)
    summary = processor.count_tokens_in_directory(directory)

    # Analyze results by file type
    file_types = {}
    for result in summary['file_results']:
        file_type = result['file_type']
        if file_type not in file_types:
            file_types[file_type] = {'count': 0, 'tokens': 0}
        file_types[file_type]['count'] += 1
        file_types[file_type]['tokens'] += result['total_tokens']

    print(f"Model: {model}")
    print(f"Total files: {summary['total_files']}")
    print(f"Total tokens: {summary['total_tokens']:,}")
    print(f"Processing time: {summary['processing_time_seconds']:.2f}s")
    print("File types processed:")
    for file_type, stats in file_types.items():
        print(f"  {file_type}: {stats['count']} files, {stats['tokens']:,} tokens")
    print()
```
### Cost Estimation for Multi-Format Documents
```python
from pathlib import Path

from deus_llm_token_stats_guru import DocumentProcessor

processor = DocumentProcessor(encoding_model="gpt-4")
summary = processor.count_tokens_in_directory(Path("./document_library"))
# Estimate OpenAI API costs (example rates)
gpt4_cost_per_1k = 0.03 # USD per 1K tokens (input)
gpt4_output_cost_per_1k = 0.06 # USD per 1K tokens (output)
# Calculate costs for different scenarios
input_cost = (summary['total_tokens'] / 1000) * gpt4_cost_per_1k
processing_cost = input_cost * 1.2 # Assume 20% output tokens
print(f"Document analysis for {summary['total_files']} files:")
print(f"Total tokens: {summary['total_tokens']:,}")
print(f"Estimated GPT-4 input cost: ${input_cost:.2f}")
print(f"Estimated processing cost: ${processing_cost:.2f}")
print(f"Cost per document: ${processing_cost / summary['total_files']:.3f}")
# Break down by file type
file_type_stats = {}
for result in summary['file_results']:
    file_type = result['file_type']
    if file_type not in file_type_stats:
        file_type_stats[file_type] = {'tokens': 0, 'files': 0}
    file_type_stats[file_type]['tokens'] += result['total_tokens']
    file_type_stats[file_type]['files'] += 1

print("\nCost breakdown by file type:")
for file_type, stats in sorted(file_type_stats.items()):
    type_cost = (stats['tokens'] / 1000) * gpt4_cost_per_1k * 1.2
    print(f"  {file_type}: {stats['files']} files, ${type_cost:.2f}")
```
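The rough pricing model above (all tokens billed at the input rate, plus a flat 20% overhead for output) can be wrapped in a small reusable helper. The rates are illustrative examples, not current OpenAI pricing:

```python
def estimate_cost(total_tokens: int,
                  rate_per_1k: float = 0.03,
                  output_overhead: float = 0.2) -> float:
    """Mirror the rough model above: input cost plus a flat overhead for output."""
    input_cost = total_tokens / 1000 * rate_per_1k
    return input_cost * (1 + output_overhead)

# e.g. estimate_cost(45230) for a 45,230-token corpus
```

Swapping in a different `rate_per_1k` lets the same helper cover other models or updated price sheets.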
### CLI Output Format
The CLI tools output JSON with comprehensive multi-format support:
```json
{
  "summary": {
    "total_files": 8,
    "total_tokens": 45230,
    "total_rows": 150,
    "total_file_size_mb": 12.7,
    "encoding_model": "gpt-4",
    "processing_time_seconds": 2.45,
    "supported_extensions": [
      ".csv",
      ".pdf",
      ".docx",
      ".xlsx",
      "..."
    ]
  },
  "file_details": [
    {
      "file_path": "/path/to/report.pdf",
      "file_type": "PDF",
      "processor_name": "PDF",
      "tokens": 8540,
      "rows": 0,
      "columns": 0,
      "size_mb": 2.1
    },
    {
      "file_path": "/path/to/data.xlsx",
      "file_type": "XLSX",
      "processor_name": "Excel",
      "tokens": 3200,
      "rows": 245,
      "columns": 8,
      "size_mb": 0.9
    },
    {
      "file_path": "/path/to/presentation.pptx",
      "file_type": "PPTX",
      "processor_name": "PowerPoint",
      "tokens": 1850,
      "rows": 0,
      "columns": 0,
      "size_mb": 4.2
    },
    {
      "file_path": "/path/to/analysis.csv",
      "file_type": "CSV",
      "processor_name": "CSV",
      "tokens": 5140,
      "rows": 50,
      "columns": 4,
      "size_mb": 0.7
    }
  ]
}
```
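Because the report is plain JSON, it can be post-processed with the standard library alone. For example, ranking files by token count (field names as in the sample above; the data is inlined here for illustration):

```python
import json

report = json.loads("""{
  "summary": {"total_files": 2, "total_tokens": 11740},
  "file_details": [
    {"file_path": "/path/to/report.pdf", "tokens": 8540},
    {"file_path": "/path/to/data.xlsx", "tokens": 3200}
  ]
}""")

# Sort file entries from most to fewest tokens
ranked = sorted(report["file_details"], key=lambda f: f["tokens"], reverse=True)
for f in ranked:
    print(f"{f['tokens']:>7,}  {f['file_path']}")
```

In practice you would replace the inline string with `json.load(open("analysis.json"))` on a file produced by `--output`.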
## Environment Setup
### 1. Create Virtual Environment
```bash
python -m venv .venv
```
### 2. Activate Virtual Environment
```bash
# Linux/macOS
source .venv/bin/activate
# Windows
.venv\Scripts\activate
```
### 3. Install Package
```bash
pip install --upgrade pip
pip install deus-llm-token-stats-guru
```
## Development
### Local Installation
```bash
git clone https://github.com/yourusername/deus-llm-token-stats-guru.git
cd deus-llm-token-stats-guru
pip install -e .
```
### Running Tests
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=term-missing
# Run specific test file
pytest tests/unit/test_core.py
```
### Code Quality
```bash
# Format code
ruff format src/ tests/
# Lint code
ruff check src/ tests/
# Type checking
mypy src/
```
### Building Package
```bash
# Build package (requires the 'build' package: pip install build)
python -m build
# Check package
twine check dist/*
# Test installation
pip install dist/*.whl
```
## Supported Models
The package supports all OpenAI tiktoken encoding models:
- `gpt-4` (default)
- `gpt-3.5-turbo`
- `text-davinci-003`
- `text-davinci-002`
- `code-davinci-002`
- Custom encodings via tiktoken
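tiktoken resolves these model names to underlying encodings (gpt-4 and gpt-3.5-turbo use `cl100k_base`; the davinci-era models use `p50k_base`). A dependency-free sketch of that lookup, usable as a fallback table when tiktoken is unavailable (`encoding_name_for` is a hypothetical helper, not part of this package):

```python
# Assumed mapping, mirroring tiktoken's model-to-encoding table for the models listed above.
MODEL_TO_ENCODING = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-davinci-003": "p50k_base",
    "text-davinci-002": "p50k_base",
    "code-davinci-002": "p50k_base",
}

def encoding_name_for(model: str) -> str:
    """Return the tiktoken encoding name for a known model."""
    try:
        return MODEL_TO_ENCODING[model]
    except KeyError:
        raise ValueError(f"Unknown model: {model!r}") from None
```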
## Performance
- Processes roughly 1,000 CSV rows per second on typical hardware
- Memory usage scales with input file size
- Handles structured files with millions of rows
- Recursive directory scanning with progress tracking
## Error Handling
The package includes comprehensive error handling:
- `FileProcessingError`: Issues reading or processing input files
- `EncodingError`: Problems with tiktoken encoding
- `ConfigurationError`: Invalid configuration or paths
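These names suggest a family of related errors; grouping them under a common base class makes blanket handling straightforward. A hypothetical sketch (the hierarchy is assumed, not taken from the package source):

```python
class TokenStatsError(Exception):
    """Hypothetical common base for the package's errors."""

class FileProcessingError(TokenStatsError):
    """Raised when a file cannot be read or parsed."""

class EncodingError(TokenStatsError):
    """Raised when tiktoken encoding fails."""

class ConfigurationError(TokenStatsError):
    """Raised for invalid configuration or paths."""

# One except clause covers the whole family:
try:
    raise FileProcessingError("corrupt PDF")
except TokenStatsError as exc:
    print(f"Skipping file: {exc}")
```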
## Use Cases
### 📊 Enterprise Document Analysis
```bash
# Analyze mixed office documents and data files
deus-llm-token-guru ./corporate_docs --model gpt-4 --output enterprise_analysis.json
```
### 💰 LLM Cost Planning
```bash
# Estimate API costs for document processing workflows
llm-token-stats ./knowledge_base --model gpt-3.5-turbo --output cost_planning.json
```
### 🔄 Batch Multi-Format Processing
```python
# Process multiple directories with different document types
from pathlib import Path

from deus_llm_token_stats_guru import DocumentProcessor

processor = DocumentProcessor()
directories = [
    "./legal_docs",      # PDFs, Word docs
    "./data_exports",    # CSV, Excel files
    "./presentations",   # PowerPoint files
    "./source_code",     # Various code files
    "./configs"          # JSON, YAML configs
]

total_tokens = 0
for dir_path in directories:
    summary = processor.count_tokens_in_directory(Path(dir_path))
    total_tokens += summary['total_tokens']

    print(f"\n📁 {dir_path}:")
    print(f"  Files: {summary['total_files']}")
    print(f"  Tokens: {summary['total_tokens']:,}")

    # Show file type distribution
    file_types = {}
    for result in summary['file_results']:
        file_type = result['file_type']
        file_types[file_type] = file_types.get(file_type, 0) + 1

    for file_type, count in file_types.items():
        print(f"  {file_type}: {count} files")

print(f"\n🎯 Total tokens across all directories: {total_tokens:,}")
```
### 🏢 Office Suite Integration
```python
# Specialized office document processing
from pathlib import Path

from deus_llm_token_stats_guru import DocumentProcessor

processor = DocumentProcessor()

# Process typical office workflow files
office_files = [
    "quarterly_report.docx",    # Word report
    "budget_analysis.xlsx",     # Excel spreadsheet
    "board_presentation.pptx",  # PowerPoint deck
    "meeting_notes.pdf",        # PDF minutes
    "project_data.csv",         # Data export
    "specifications.rtf"        # RTF document
]

for file_path in office_files:
    result = processor.count_tokens_in_file(Path(file_path))
    print(f"📄 {file_path} ({result['file_type']}): {result['total_tokens']:,} tokens")
```
### 🌐 Web Content Analysis
```bash
# Process web-exported documents (HTML, RTF from Google Docs)
deus-llm-token-guru ./web_exports --model gpt-4 --output web_content_analysis.json
```
### 👩‍💻 Developer Workflow Integration
```python
# Analyze documentation and code repositories
from pathlib import Path

from deus_llm_token_stats_guru import DocumentProcessor

processor = DocumentProcessor()

# Process development artifacts
dev_summary = processor.count_tokens_in_directory(Path("./project"))
# Filter results by category
docs = [r for r in dev_summary['file_results'] if r['file_type'] in ['Text', 'JSON']]
code = [r for r in dev_summary['file_results'] if r['file_path'].endswith(('.py', '.js', '.java'))]
print(f"📝 Documentation: {sum(r['total_tokens'] for r in docs):,} tokens")
print(f"💻 Source code: {sum(r['total_tokens'] for r in code):,} tokens")
```
## Future Roadmap
- 🔮 **Additional Format Support**: Binary formats (images with OCR, audio transcripts)
- 🤖 **LLM Provider Integration**: Support for Anthropic Claude, Google Gemini, local models
- 📊 **Advanced Analytics**: Token distribution analysis, content similarity metrics
- 🌐 **Web Interface**: Browser-based document analysis dashboard
- ⚡ **Performance Optimization**: Parallel processing, streaming for large files
- 🔗 **API Integration**: REST API for service integration
- 📱 **Cloud Storage Support**: Direct S3, Google Drive, SharePoint integration
## Contributing
1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make changes with tests
4. Run quality checks: `ruff check && mypy src/ && pytest`
5. Submit a pull request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Changelog
### v0.3.3 (Current)
- 👤 **Author Update**: Changed author to "deus-global" and email to "sean@deus.com.tw"
### v0.3.2
- 📝 **Documentation Update**: Comprehensive README.md with all 25+ supported file formats
- 🏷️ **Format Categorization**: Organized file formats by category with detailed descriptions
- 📖 **Enhanced Examples**: Multi-format processing examples and use cases
- 📋 **Updated API Documentation**: DocumentProcessor examples with backward compatibility
- 📄 **CLI Output Examples**: Updated JSON output format with file_type and processor_name
### v0.3.1
- 📢 **Multi-Format Document Support**: 25+ file extensions across 10 processors
- 🏢 **Office Suite Integration**: Full Microsoft Office (.docx, .xlsx, .pptx) and legacy format support
- 📄 **OpenDocument Support**: LibreOffice/OpenOffice formats (.odt, .ods, .odp, .odg, .odf)
- 📑 **Enhanced Document Processing**: PDF (multi-engine), RTF, HTML, JSON support
- 💻 **Developer-Friendly**: Source code files, configuration formats, Markdown
- 🛡️ **Robust CSV Processing**: Multi-strategy parsing with malformed file handling
- 📦 **PEP 625 Compliance**: Fixed PyPI package naming for proper distribution
- 🔧 **Optional Dependencies**: Granular installation options for specific format groups
- ⚡ **Processor Architecture**: Extensible, modular design for easy format additions
- 🔄 **Backward Compatibility**: Maintained CSVTokenCounter API for existing users
### v0.2.1
- 🔄 **Package Refactoring**: Transitioned from CSV-only to multi-format architecture
- 🏗️ **Processor System**: Implemented BaseFileProcessor with specialized processors
- 📊 **Enhanced Metadata**: Added file_type and processor_name to results
- 🛠️ **PyPI Compatibility**: Resolved license metadata conflicts
### v0.1.0
- 🚀 **Initial Release**: Core CSV token counting functionality
- ⚡ **Dual CLI Interface**: `deus-llm-token-guru` and `llm-token-stats` commands
- 🤖 **Multi-Model Support**: Various OpenAI encoding models (gpt-4, gpt-3.5-turbo)
- 📊 **Comprehensive Output**: Detailed statistics with JSON export
- 🛡️ **Type Safety**: Full Python type hints and modern features
- 🧪 **Testing**: Complete test suite with coverage
documentation and code repositories\nfrom deus_llm_token_stats_guru import DocumentProcessor\n\nprocessor = DocumentProcessor()\n\n# Process development artifacts\ndev_summary = processor.count_tokens_in_directory(\"./project\")\n\n# Filter results by category\ndocs = [r for r in dev_summary['file_results'] if r['file_type'] in ['Text', 'JSON']]\ncode = [r for r in dev_summary['file_results'] if r['file_path'].endswith(('.py', '.js', '.java'))]\n\nprint(f\"\ud83d\udcda Documentation: {sum(r['total_tokens'] for r in docs):,} tokens\")\nprint(f\"\ud83d\udcbb Source code: {sum(r['total_tokens'] for r in code):,} tokens\")\n```\n\n## Future Roadmap\n\n- \ud83d\udd2e **Additional Format Support**: Binary formats (images with OCR, audio transcripts)\n- \ud83e\udd16 **LLM Provider Integration**: Support for Anthropic Claude, Google Gemini, local models\n- \ud83d\udcca **Advanced Analytics**: Token distribution analysis, content similarity metrics\n- \ud83c\udf10 **Web Interface**: Browser-based document analysis dashboard\n- \u26a1 **Performance Optimization**: Parallel processing, streaming for large files\n- \ud83d\udd17 **API Integration**: REST API for service integration\n- \ud83d\udcf1 **Cloud Storage Support**: Direct S3, Google Drive, SharePoint integration\n\n## Contributing\n\n1. Fork the repository\n2. Create a feature branch: `git checkout -b feature-name`\n3. Make changes with tests\n4. Run quality checks: `ruff check && mypy src/ && pytest`\n5. 
Submit a pull request\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Changelog\n\n### v0.3.3 (Current)\n\n- \ud83d\udc64 **Author Update**: Changed author to \"deus-global\" and email to \"sean@deus.com.tw\"\n\n### v0.3.2\n\n- \ud83d\udcda **Documentation Update**: Comprehensive README.md with all 25+ supported file formats\n- \ud83c\udff7\ufe0f **Format Categorization**: Organized file formats by category with detailed descriptions\n- \ud83d\udcd6 **Enhanced Examples**: Multi-format processing examples and use cases\n- \ud83d\ude80 **Updated API Documentation**: DocumentProcessor examples with backward compatibility\n- \ud83d\udcca **CLI Output Examples**: Updated JSON output format with file_type and processor_name\n\n### v0.3.1\n\n- \ud83c\udfe2 **Multi-Format Document Support**: 25+ file extensions across 10 processors\n- \ud83d\udcca **Office Suite Integration**: Full Microsoft Office (.docx, .xlsx, .pptx) and legacy format support\n- \ud83c\udf10 **OpenDocument Support**: LibreOffice/OpenOffice formats (.odt, .ods, .odp, .odg, .odf)\n- \ud83d\udcc4 **Enhanced Document Processing**: PDF (multi-engine), RTF, HTML, JSON support\n- \ud83d\udcbb **Developer-Friendly**: Source code files, configuration formats, Markdown\n- \ud83d\udee1\ufe0f **Robust CSV Processing**: Multi-strategy parsing with malformed file handling\n- \ud83d\udce6 **PEP 625 Compliance**: Fixed PyPI package naming for proper distribution\n- \ud83d\udd27 **Optional Dependencies**: Granular installation options for specific format groups\n- \u26a1 **Processor Architecture**: Extensible, modular design for easy format additions\n- \ud83d\udd04 **Backward Compatibility**: Maintained CSVTokenCounter API for existing users\n\n### v0.2.1\n\n- \ud83d\udd04 **Package Refactoring**: Transitioned from CSV-only to multi-format architecture\n- \ud83c\udfd7\ufe0f **Processor System**: Implemented BaseFileProcessor with specialized 
processors\n- \ud83d\udcca **Enhanced Metadata**: Added file_type and processor_name to results\n- \ud83d\udee0\ufe0f **PyPI Compatibility**: Resolved license metadata conflicts\n\n### v0.1.0\n\n- \ud83d\ude80 **Initial Release**: Core CSV token counting functionality\n- \u26a1 **Dual CLI Interface**: `deus-llm-token-guru` and `llm-token-stats` commands\n- \ud83e\udd16 **Multi-Model Support**: Various OpenAI encoding models (gpt-4, gpt-3.5-turbo)\n- \ud83d\udcca **Comprehensive Output**: Detailed statistics with JSON export\n- \ud83d\udee1\ufe0f **Type Safety**: Full Python type hints and modern features\n- \ud83e\uddea **Testing**: Complete test suite with coverage\n\n",
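As a worked example, the cost arithmetic from the Cost Estimation section can be factored into a small standalone helper. This is an illustrative sketch: `estimate_cost` is a hypothetical function, not part of the package API, and the per-1K rates and 20% output-token ratio are the example values used in this README, not current OpenAI pricing.

```python
def estimate_cost(
    total_tokens: int,
    input_rate_per_1k: float = 0.03,   # example rate from this README, not live pricing
    output_rate_per_1k: float = 0.06,  # example rate from this README, not live pricing
    output_ratio: float = 0.2,         # assume output tokens ~= 20% of input tokens
) -> float:
    """Estimate API cost in USD for a batch of input tokens."""
    input_cost = (total_tokens / 1000) * input_rate_per_1k
    output_cost = (total_tokens * output_ratio / 1000) * output_rate_per_1k
    return input_cost + output_cost


if __name__ == "__main__":
    # 45,230 tokens is the example total from the CLI output shown earlier
    print(f"${estimate_cost(45_230):.2f}")  # → $1.90
```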
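Several of the examples above build a per-file-type tally with an ad-hoc dict. The same grouping can be written more compactly with `collections.defaultdict`; `summarize_by_type` below is a hypothetical helper shown for illustration, operating on `file_results`-style dictionaries like those the examples assume.

```python
from collections import defaultdict


def summarize_by_type(file_results):
    """Group per-file results by file_type, counting files and summing tokens."""
    stats = defaultdict(lambda: {"count": 0, "tokens": 0})
    for result in file_results:
        bucket = stats[result["file_type"]]
        bucket["count"] += 1
        bucket["tokens"] += result["total_tokens"]
    return dict(stats)


results = [
    {"file_type": "PDF", "total_tokens": 8540},
    {"file_type": "CSV", "total_tokens": 5140},
    {"file_type": "CSV", "total_tokens": 1000},
]
print(summarize_by_type(results))
# → {'PDF': {'count': 1, 'tokens': 8540}, 'CSV': {'count': 2, 'tokens': 6140}}
```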
"bugtrack_url": null,
"license": null,
"summary": "Advanced LLM token analysis and statistics toolkit for various data formats",
"version": "0.3.3",
"project_urls": {
"Bug Tracker": "https://github.com/yourusername/deus-llm-token-stats-guru/issues",
"Homepage": "https://github.com/yourusername/deus-llm-token-stats-guru",
"Repository": "https://github.com/yourusername/deus-llm-token-stats-guru.git"
},
"split_keywords": [
"llm",
"tokens",
"tiktoken",
"csv",
"analysis",
"statistics"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "e6e18373a0f89ee36f475106b0f41cdab019701e0acf3275e6dee830076196dd",
"md5": "9df7c3da958958289685e332d273dd0a",
"sha256": "29000f76994a17fb39e066a2dd8c9f9aeb3ff9b4c721fad838acb5701b0f970d"
},
"downloads": -1,
"filename": "deus_llm_token_stats_guru-0.3.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "9df7c3da958958289685e332d273dd0a",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 42073,
"upload_time": "2025-08-30T14:02:38",
"upload_time_iso_8601": "2025-08-30T14:02:38.720594Z",
"url": "https://files.pythonhosted.org/packages/e6/e1/8373a0f89ee36f475106b0f41cdab019701e0acf3275e6dee830076196dd/deus_llm_token_stats_guru-0.3.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "998c04cbfa02063259318ec1f05f21eebdc2b819873c623cada5853c9b0afcbe",
"md5": "bd83f048e64f5a53e1886c709f0ce77a",
"sha256": "18aaf1a29c78aff95de77fb8bc34931cf209ffa2e0ea216830be1f8cf2ef8236"
},
"downloads": -1,
"filename": "deus_llm_token_stats_guru-0.3.3.tar.gz",
"has_sig": false,
"md5_digest": "bd83f048e64f5a53e1886c709f0ce77a",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 36652,
"upload_time": "2025-08-30T14:02:40",
"upload_time_iso_8601": "2025-08-30T14:02:40.707502Z",
"url": "https://files.pythonhosted.org/packages/99/8c/04cbfa02063259318ec1f05f21eebdc2b819873c623cada5853c9b0afcbe/deus_llm_token_stats_guru-0.3.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-30 14:02:40",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "yourusername",
"github_project": "deus-llm-token-stats-guru",
"github_not_found": true,
"lcname": "deus-llm-token-stats-guru"
}