# PyForge CLI
<div align="center">
<img src="assets/icon_pyforge_forge.svg" alt="PyForge CLI" width="128" height="128">
</div>
<div align="center">
<strong>A powerful command-line tool for data format conversion and manipulation, built with Python and Click.</strong>
</div>
## 📖 Documentation
**[📚 Complete Documentation](https://py-forge-cli.github.io/PyForge-CLI/)** | **[🚀 Quick Start](https://py-forge-cli.github.io/PyForge-CLI/getting-started/quick-start)** | **[📦 Installation Guide](https://py-forge-cli.github.io/PyForge-CLI/getting-started/installation)** | **[🔧 CLI Reference](https://py-forge-cli.github.io/PyForge-CLI/reference/cli-reference)**
<div align="center">
<a href="https://pypi.org/project/pyforge-cli/">
<img src="https://img.shields.io/pypi/v/pyforge-cli.svg" alt="PyPI version">
</a>
<a href="https://pypi.org/project/pyforge-cli/">
<img src="https://img.shields.io/pypi/pyversions/pyforge-cli.svg" alt="Python versions">
</a>
<a href="https://github.com/Py-Forge-Cli/PyForge-CLI/blob/main/LICENSE">
<img src="https://img.shields.io/github/license/Py-Forge-Cli/PyForge-CLI.svg" alt="License">
</a>
<a href="https://github.com/Py-Forge-Cli/PyForge-CLI/actions">
<img src="https://github.com/Py-Forge-Cli/PyForge-CLI/workflows/CI/badge.svg" alt="CI Status">
</a>
</div>
## Features
- **PDF to Text Conversion**: Extract text from PDF documents with advanced options
- **Excel to Parquet Conversion**: Convert Excel files (.xlsx) to Parquet format with multi-sheet support
- **XML to Parquet Conversion**: Intelligent XML flattening with automatic structure detection and configurable strategies
- **Database File Conversion**: Convert Microsoft Access (.mdb/.accdb) and DBF files to Parquet; SQL Server MDF conversion is planned (see Roadmap)
- **MDF Tools Installer**: Automated setup of Docker and SQL Server Express for MDF file processing
- **CSV to Parquet Conversion**: Smart delimiter detection and encoding handling
- **Rich CLI Interface**: Beautiful terminal output with progress bars and tables
- **Intelligent Processing**: Automatic encoding detection, structure analysis, and column matching
- **Extensible Architecture**: Plugin-based system for adding new format converters
- **Metadata Extraction**: Get detailed information about your files
- **Cross-platform**: Works on Windows, macOS, and Linux
## Installation
### Stable Version (Recommended)
```bash
pip install pyforge-cli
```
### Development Version
To test the latest features and bug fixes before they're officially released:
```bash
# Install from PyPI Test (latest development version)
pip install -i https://test.pypi.org/simple/ pyforge-cli
# Or with dependency fallback
pip install --index-url https://test.pypi.org/simple/ \
    --extra-index-url https://pypi.org/simple/ \
    pyforge-cli
```
> **Note**: Development versions follow the pattern `X.Y.Z.devN+gCOMMIT` and are automatically deployed from the main branch.
### From Source
```bash
git clone https://github.com/Py-Forge-Cli/PyForge-CLI.git
cd PyForge-CLI
pip install -e .
```
### Development Installation
```bash
git clone https://github.com/Py-Forge-Cli/PyForge-CLI.git
cd PyForge-CLI
pip install -e ".[dev,test]"
```
### System Dependencies
For MDB/Access file support on non-Windows systems:
```bash
# Ubuntu/Debian
sudo apt-get install mdbtools
# macOS
brew install mdbtools
```
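Before converting Access files on Linux or macOS, it can help to confirm that the mdbtools binaries are actually on your `PATH`. The sketch below is a generic check, not part of PyForge itself: `missing_tools` is a hypothetical helper name, and which mdbtools programs PyForge invokes internally is an assumption (`mdb-tables` and `mdb-export` are two of the programs that ship with mdbtools).

```bash
# Hypothetical helper: print the name of every tool that is NOT on PATH.
# Prints nothing when all requested tools are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For example, `missing_tools mdb-tables mdb-export` prints nothing when both programs are installed and lists the missing ones otherwise.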
### MDF File Processing Setup
For SQL Server MDF file conversion, install the MDF Tools:
```bash
# Interactive setup wizard
pyforge install mdf-tools
# Check installation status
pyforge mdf-tools status
```
This installs Docker Desktop and SQL Server Express automatically. See [MDF Tools Documentation](https://py-forge-cli.github.io/PyForge-CLI/converters/mdf-tools-installer) for details.
## Quick Start
### Convert PDF to Text
```bash
# Convert entire PDF
pyforge convert document.pdf
# Convert to specific output file
pyforge convert document.pdf output.txt
# Convert specific page range
pyforge convert document.pdf --pages "1-5"
# Include page metadata
pyforge convert document.pdf --metadata
```
### Convert Excel to Parquet
```bash
# Convert Excel file to Parquet
pyforge convert data.xlsx
# Convert with specific compression
pyforge convert data.xlsx --compression gzip
# Convert specific sheets only
pyforge convert data.xlsx --sheets "Sheet1,Sheet3"
```
### Convert Database Files
```bash
# Convert Access database to Parquet
pyforge convert database.mdb
# Convert DBF file with encoding detection
pyforge convert data.dbf
# Convert with custom output directory
pyforge convert database.accdb output_folder/
```
### Convert CSV Files
```bash
# Convert CSV with automatic delimiter and encoding detection
pyforge convert data.csv
# Convert TSV (tab-separated) file
pyforge convert data.tsv
# Convert with compression
pyforge convert sales_data.csv --compression gzip
# Convert delimited text file
pyforge convert export.txt
```
### Convert XML Files
```bash
# Convert XML with automatic structure detection
pyforge convert data.xml
# Convert with aggressive flattening for analytics
pyforge convert catalog.xml --flatten-strategy aggressive
# Handle arrays by expanding to multiple rows
pyforge convert orders.xml --array-handling expand
# Strip namespaces for cleaner column names
pyforge convert api_response.xml --namespace-handling strip
# Preview schema before conversion
pyforge convert complex.xml --preview-schema
```
### Get File Information
```bash
# Display file metadata as table
pyforge info document.pdf
# Get Excel file information
pyforge info spreadsheet.xlsx
# Output metadata as JSON
pyforge info database.mdb --format json
```
### List Supported Formats
```bash
pyforge formats
```
### Validate Files
```bash
pyforge validate document.pdf
pyforge validate data.xlsx
```
### MDF File Processing
```bash
# Step 1: Install MDF processing tools (one-time setup)
pyforge install mdf-tools
# Step 2: Check installation status
pyforge mdf-tools status
# Step 3: Start SQL Server (if not running)
pyforge mdf-tools start
# Step 4: Convert MDF files (when MDF converter is available)
# pyforge convert database.mdf --format parquet
# Container management
pyforge mdf-tools stop # Stop SQL Server
pyforge mdf-tools restart # Restart SQL Server
pyforge mdf-tools logs # View SQL Server logs
pyforge mdf-tools test # Test connectivity
pyforge mdf-tools config # Show configuration
pyforge mdf-tools uninstall # Remove everything
```
## Usage Examples
### Basic PDF Conversion
```bash
# Convert PDF to text (creates report.txt in same directory)
pyforge convert report.pdf
# Convert with custom output path
pyforge convert report.pdf /path/to/output.txt
# Convert with verbose output
pyforge convert report.pdf --verbose
# Force overwrite existing file
pyforge convert report.pdf output.txt --force
```
### Advanced PDF Options
```bash
# Convert pages 1-10
pyforge convert document.pdf --pages "1-10"
# Convert from page 5 to end
pyforge convert document.pdf --pages "5-"
# Convert up to page 10
pyforge convert document.pdf --pages "-10"
# Include page markers in output
pyforge convert document.pdf --metadata
```
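The open-ended forms `"5-"` and `"-10"` can be read as a range with one bound filled in by default. The sketch below illustrates that reading only; it is not PyForge's parser. `resolve_pages` is a hypothetical name, and it assumes pages are numbered from 1 and that the end page is inclusive.

```bash
# Hypothetical helper: resolve a --pages style range against a page count.
# Usage: resolve_pages RANGE TOTAL   (prints "START END")
# Assumes 1-based page numbers and an inclusive end page.
resolve_pages() {
    range=$1
    total=$2
    case $range in
        *-*) ;;                                   # must contain a dash
        *) echo "invalid range: $range" >&2; return 1 ;;
    esac
    start=${range%%-*}    # text before the first dash (may be empty)
    end=${range#*-}       # text after the first dash (may be empty)
    [ -n "$start" ] || start=1        # "-10" starts at the first page
    [ -n "$end" ] || end=$total       # "5-" runs to the last page
    echo "$start $end"
}
```

With a 20-page document, `resolve_pages "5-" 20` yields `5 20` and `resolve_pages "-10" 20` yields `1 10`.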
### Excel Conversion Examples
```bash
# Convert Excel with all sheets
pyforge convert sales_data.xlsx
# Interactive mode - prompts for sheet selection
pyforge convert multi_sheet.xlsx --interactive
# Convert sheets with matching columns into single file
pyforge convert monthly_reports.xlsx --merge-sheets
# Generate summary report
pyforge convert data.xlsx --summary
```
### Database Conversion Examples
```bash
# Convert Access database (all tables)
pyforge convert company.mdb
# Convert with progress tracking
pyforge convert large_database.accdb --verbose
# Convert DBF with specific encoding
pyforge convert legacy.dbf --encoding cp1252
# Batch convert all DBF files in directory
for file in *.dbf; do pyforge convert "$file"; done
```
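The one-line loop above keeps going silently when a conversion fails, and it passes the literal pattern `*.dbf` to `pyforge` when no file matches. A slightly more defensive variant might look like the sketch below. `batch_convert` and the `PYFORGE` override variable are hypothetical names introduced here for illustration; the sketch also assumes `pyforge convert` exits with a non-zero status on failure.

```bash
# Hypothetical wrapper around the batch loop.
# PYFORGE overrides the command, e.g. PYFORGE=echo for a dry run.
batch_convert() {
    failed=0
    for file in "$@"; do
        # An unmatched glob arrives as the literal pattern; skip it.
        [ -e "$file" ] || continue
        if ! ${PYFORGE:-pyforge} convert "$file"; then
            echo "failed: $file" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every conversion succeeded.
    [ "$failed" -eq 0 ]
}
```

Call it as `batch_convert *.dbf`, or preview the commands first with `PYFORGE=echo batch_convert *.dbf`.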
### CSV Conversion Examples
```bash
# Convert CSV with automatic detection
pyforge convert sales_data.csv
# Convert international CSV with auto-encoding detection
pyforge convert european_data.csv --verbose
# Convert semicolon-delimited CSV (European format)
pyforge convert data_semicolon.csv
# Convert tab-separated file with compression
pyforge convert data.tsv --compression gzip
# Batch convert multiple CSV files
for file in *.csv; do pyforge convert "$file" --compression snappy; done
```
### XML Conversion Examples
```bash
# Convert XML with automatic structure detection
pyforge convert catalog.xml
# Convert with aggressive flattening for data analysis
pyforge convert api_response.xml --flatten-strategy aggressive
# Handle arrays as concatenated strings
pyforge convert orders.xml --array-handling concatenate
# Strip namespaces for cleaner output
pyforge convert soap_response.xml --namespace-handling strip
# Preview structure before conversion
pyforge convert complex_structure.xml --preview-schema
# Convert compressed XML files
pyforge convert data.xml.gz --verbose
# Batch convert XML files with specific strategy
for file in *.xml; do pyforge convert "$file" --flatten-strategy moderate; done
```
### File Information
```bash
# Show file metadata
pyforge info document.pdf
# Excel file details (sheets, row counts)
pyforge info spreadsheet.xlsx
# Database file information (tables, record counts)
pyforge info database.mdb
# Export metadata as JSON
pyforge info document.pdf --format json > metadata.json
```
## Supported Formats
| Input Format | Output Formats | Status |
|-------------|----------------|---------|
| PDF (.pdf) | Text (.txt) | ✅ Available |
| Excel (.xlsx) | Parquet (.parquet) | ✅ Available |
| XML (.xml, .xml.gz, .xml.bz2) | Parquet (.parquet) | ✅ Available |
| Access (.mdb/.accdb) | Parquet (.parquet) | ✅ Available |
| DBF (.dbf) | Parquet (.parquet) | ✅ Available |
| CSV (.csv, .tsv, .txt) | Parquet (.parquet) | ✅ Available |
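
In a script, the table above can be expressed as a lookup from file name to output format. This is only a sketch of the table, not something PyForge provides: `output_format_for` is a hypothetical helper, and matching is done on lowercase extensions only.

```bash
# Hypothetical helper: map an input file name to the output format
# listed in the Supported Formats table. Returns 1 for anything else.
output_format_for() {
    case $1 in
        *.pdf)
            echo "txt" ;;
        *.xlsx|*.xml|*.xml.gz|*.xml.bz2|*.mdb|*.accdb|*.dbf|*.csv|*.tsv|*.txt)
            echo "parquet" ;;
        *)
            echo "unsupported"
            return 1 ;;
    esac
}
```

For example, `output_format_for report.pdf` prints `txt` and `output_format_for data.xml.gz` prints `parquet`. `pyforge formats` remains the authoritative list.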
## Development
### Setting Up Development Environment
```bash
# Clone the repository
git clone https://github.com/Py-Forge-Cli/PyForge-CLI.git
cd PyForge-CLI
# Set up development environment
pip install -e ".[dev,test]"
# Run tests
pytest
# Format code
black src tests
# Run linting
ruff check src tests
```
### Development Commands
```bash
# Testing
pytest # Run tests
pytest --cov=pyforge_cli # Run tests with coverage
# Code Quality
black src tests # Format code
ruff check src tests # Run linting
mypy src # Type checking
# Building
python -m build # Build distribution packages
twine upload dist/* # Publish to PyPI
```
### Project Structure
```text
PyForge-CLI/
├── src/pyforge_cli/            # Main package
│   ├── __init__.py
│   ├── main.py                 # CLI entry point
│   ├── converters/             # Format converters
│   │   ├── __init__.py
│   │   ├── base.py             # Base converter class
│   │   ├── pdf_converter.py    # PDF to text conversion
│   │   ├── excel_converter.py  # Excel to Parquet conversion
│   │   ├── mdb_converter.py    # MDB/ACCDB to Parquet conversion
│   │   └── dbf_converter.py    # DBF to Parquet conversion
│   ├── plugins/                # Plugin system
│   │   └── loader.py           # Plugin loading
│   └── utils/                  # Utilities
│       ├── file_utils.py       # File operations
│       └── cli_utils.py        # CLI helpers
├── docs/                       # Documentation source
├── tests/                      # Test files
├── pyproject.toml              # Project configuration
└── README.md                   # This file
```
## Requirements
- Python 3.8+
- PyMuPDF (for PDF processing)
- Click (for CLI interface)
- Rich (for beautiful terminal output)
- Pandas & PyArrow (for data processing and Parquet support)
- pandas-access (for MDB file support)
- dbfread (for DBF file support)
- openpyxl (for Excel file support)
- chardet (for encoding detection)
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Run tests and linting (`pytest && ruff check src tests`)
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Roadmap
### Version 0.2.0 - Database & Spreadsheet Support (Completed)
- ✅ **Excel to Parquet Conversion**
  - Multi-sheet support with intelligent detection
  - Interactive sheet selection mode
  - Column matching for combined output
  - Progress tracking and summary reports
- ✅ **MDB/ACCDB to Parquet Conversion**
  - Microsoft Access database support (.mdb, .accdb)
  - Automatic table discovery
  - Cross-platform compatibility (Windows/Linux/macOS)
  - Excel summary reports with sample data
- ✅ **DBF to Parquet Conversion**
  - Automatic encoding detection
  - Support for various DBF formats
  - Robust error handling for corrupted files
### Version 0.3.0 - Enhanced Features (Released)
- ✅ **XML to Parquet Conversion**
  - Automatic XML structure detection and analysis
  - Intelligent flattening strategies (conservative, moderate, aggressive)
  - Array handling modes (expand, concatenate, json_string)
  - Namespace processing (preserve, strip, prefix)
  - Schema preview before conversion
  - Support for compressed XML files (.xml.gz, .xml.bz2)
- ✅ **CSV to Parquet conversion** with auto-detection (string-based)
- [ ] CSV schema inference and native type conversion
- [ ] JSON processing and flattening
- [ ] Data validation and cleaning options
- [ ] Batch processing with pattern matching
- [ ] Configuration file support
- [ ] REST API wrapper for notebook integration
- [ ] Data type preservation options (beyond string conversion)
### Version 0.4.0 - MDF Tools Infrastructure (Completed)
- ✅ **MDF Tools Installer**
  - Automated Docker Desktop installation (macOS/Windows/Linux)
  - SQL Server Express 2019 container setup
  - Interactive setup wizard with non-interactive mode
  - Cross-platform compatibility with system package managers
- ✅ **Container Management Commands**
  - Lifecycle commands: status, start, stop, restart, logs, test, config, uninstall
  - Rich terminal UI with status tables and progress tracking
  - Configuration management with JSON persistence
  - Health monitoring and connectivity testing
- ✅ **Infrastructure Foundation**
  - Complete Docker and SQL Server environment for MDF processing
  - Foundation ready for MDF file conversion implementation
  - Comprehensive documentation with live terminal examples
  - ASCII architecture diagrams and troubleshooting guides
### Version 0.5.0 - MDF File Conversion (Planned)
- [ ] **MDF to Parquet Converter** (Issue #13)
  - SQL Server MDF file attachment and processing
  - 6-stage conversion process (matching MDB converter)
  - String-only data conversion for Phase 1 consistency
  - Batch processing with progress tracking
  - Excel summary reports with conversion statistics
### Version 0.5.1 - Test Datasets Collection (Planned)
- [ ] **Sample Datasets Installation** (Issue #15)
  - CLI command: `pyforge install sample-datasets`
  - Curated test datasets for all formats (XML, MDF, DBF, MDB, CSV, PDF)
  - Multiple size categories: small (<1MB), medium (1-100MB), large (100MB-3GB)
  - Direct download from online sources with metadata
  - Organized directory structure for testing scenarios
### Version 0.6.0 - Advanced Features (Future)
- [ ] SQL query support for database files
- [ ] Data transformation pipelines
- [ ] Cloud storage integration (S3, Azure Blob)
- [ ] Incremental/delta conversions
- [ ] Custom plugin development SDK
## Support
If you encounter any issues or have questions:
1. Check the [📚 Complete Documentation](https://py-forge-cli.github.io/PyForge-CLI/)
2. Search [existing issues](https://github.com/Py-Forge-Cli/PyForge-CLI/issues)
3. Create a [new issue](https://github.com/Py-Forge-Cli/PyForge-CLI/issues/new)
4. Join the [discussion](https://github.com/Py-Forge-Cli/PyForge-CLI/discussions)
## Changelog
### 1.0.8 (Current Release)
- ✅ **Complete Testing Infrastructure Overhaul**: Fixed 13 major issues across infrastructure and notebooks
- ✅ **Sample Datasets Installation**: Fixed with intelligent fallback versioning system
- ✅ **Missing Dependencies**: Added PyMuPDF, chardet, requests to resolve import errors
- ✅ **Convert Command Fix**: Resolved TypeError in ConverterRegistry API
- ✅ **Comprehensive Testing Framework**: Created systematic testing with 402 lines of test code
- ✅ **Notebook Organization**: Restructured with proper unit/integration/functional hierarchy
- ✅ **Cross-Environment Support**: Both local and Databricks notebooks fully functional
- ✅ **Enhanced Error Handling**: Smart file selection, directory creation, PDF skip logic
- ✅ **Developer Documentation**: Complete guides and deployment documentation
### 0.4.0
- ✅ **MDF Tools Installer**: Complete automated Docker Desktop and SQL Server Express 2019 setup
- ✅ **Cross-Platform Installation**: System package managers (Homebrew, Winget, apt/yum)
- ✅ **Container Management**: 9 lifecycle commands for SQL Server operations
- ✅ **Interactive Setup Wizard**: User-guided installation with smart detection
- ✅ **Rich Terminal UI**: Beautiful status tables, progress bars, and error handling
- ✅ **Critical Docker Fix**: Proper system package manager installation (not pip)
- ✅ **Infrastructure Foundation**: Complete environment for MDF file processing
- ✅ **Comprehensive Documentation**: Live terminal examples and ASCII diagrams
### 0.3.0
- ✅ **XML to Parquet Converter**: Complete implementation with intelligent flattening
- ✅ **Automatic Structure Detection**: Analyzes XML hierarchy and array patterns
- ✅ **Flexible Flattening Strategies**: Conservative, moderate, and aggressive options
- ✅ **Advanced Array Handling**: Expand, concatenate, or JSON string modes
- ✅ **Namespace Support**: Configurable namespace processing
- ✅ **Schema Preview**: Optional structure preview before conversion
- ✅ **Comprehensive Documentation**: User guide and quick reference
- ✅ **Compressed XML Support**: Handles .xml.gz and .xml.bz2 files
### 0.2.1
- ✅ **Complete Documentation Site**: Comprehensive GitHub Pages documentation
- ✅ **Fixed CI/CD**: GitHub Actions workflow for automated PyPI publishing
- ✅ **Improved Distribution**: API token authentication and automation
- ✅ **Better Navigation**: Fixed broken links and improved project structure
### 0.2.0
- ✅ Excel to Parquet conversion with multi-sheet support
- ✅ MDB/ACCDB to Parquet conversion with cross-platform support
- ✅ DBF to Parquet conversion with encoding detection
- ✅ Interactive mode for Excel sheet selection
- ✅ Automatic table discovery for database files
- ✅ Progress tracking with rich terminal UI
- ✅ Excel summary reports for batch conversions
- ✅ Robust error handling and recovery
### 0.1.0 (Initial Release)
- PDF to text conversion
- CLI interface with Click
- Rich terminal output
- File metadata extraction
- Page range support
- Development tooling setup
## 🚀 Development & Deployment
### Automated Versioning
PyForge CLI uses automated versioning with setuptools-scm:
- **Development versions**: `1.0.x.devN` - Auto-deployed to [PyPI Test](https://test.pypi.org/project/pyforge-cli/) on every commit to main
- **Release versions**: `1.0.x` - Deployed to [PyPI](https://pypi.org/project/pyforge-cli/) when Git tags are created
- **Version increment**: Development versions auto-increment on each commit (dev1 → dev2 → dev3...)
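
When scripting around these version strings, for instance to decide which index to install from, a simple pattern check is often enough. The sketch below is a loose shell-glob classifier, not a full version parser; `version_kind` is a hypothetical helper name.

```bash
# Hypothetical helper: classify a version string as "dev" or "release".
# Uses loose glob patterns, so it does not validate the string strictly.
version_kind() {
    case $1 in
        *.dev[0-9]*)            echo "dev" ;;      # e.g. 1.0.8.dev5
        [0-9]*.[0-9]*.[0-9]*)   echo "release" ;;  # e.g. 1.0.8
        *)                      echo "unknown" ;;
    esac
}
```

For example, `version_kind 1.0.8.dev5` prints `dev` and `version_kind 1.0.8` prints `release`.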
### Installation from Test Repository
```bash
# Install latest development version
pip install -i https://test.pypi.org/simple/ pyforge-cli
# Install specific development version
pip install -i https://test.pypi.org/simple/ pyforge-cli==1.0.8.dev5
```
### CI/CD Pipeline
- **Trigger**: Every push to `main` branch automatically builds and deploys to PyPI Test
- **Release**: Create a Git tag to trigger deployment to PyPI Production
- **Testing**: Use `pyforge-cli` from test.pypi.org for validation before release
### Recent Updates (v1.0.8)
**Complete Testing Infrastructure Overhaul** - Fixed 13 major issues across PyForge CLI infrastructure and testing notebooks:
- ✅ **Infrastructure Fixes**: Sample datasets installation, missing dependencies, convert command API
- ✅ **Testing Framework**: Comprehensive testing suite with 402 lines of test code
- ✅ **Notebook Support**: Full local and Databricks notebook functionality
- ✅ **Error Handling**: Smart file selection, directory creation, PDF skip logic
- ✅ **Documentation**: Complete developer guides and deployment processes
Raw data
{
"_id": null,
"home_page": null,
"name": "pyforge-cli",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "cli, data, conversion, pdf, csv, parquet, excel, database, mdb, dbf, synthetic-data",
"author": null,
"author_email": "Santosh Dandey <dd.santosh@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/db/70/8098593e06efc63eacc57823a32b42ce6f0816bbffd82442a23a92368f62/pyforge_cli-1.0.9.tar.gz",
"platform": null,
"description": "# PyForge CLI\n\n<div align=\"center\">\n <img src=\"assets/icon_pyforge_forge.svg\" alt=\"PyForge CLI\" width=\"128\" height=\"128\">\n</div>\n\n<div align=\"center\">\n <strong>A powerful command-line tool for data format conversion and manipulation, built with Python and Click.</strong>\n</div>\n\n## \ud83d\udcd6 Documentation\n\n**[\ud83d\udcda Complete Documentation](https://py-forge-cli.github.io/PyForge-CLI/)** | **[\ud83d\ude80 Quick Start](https://py-forge-cli.github.io/PyForge-CLI/getting-started/quick-start)** | **[\ud83d\udce6 Installation Guide](https://py-forge-cli.github.io/PyForge-CLI/getting-started/installation)** | **[\ud83d\udd27 CLI Reference](https://py-forge-cli.github.io/PyForge-CLI/reference/cli-reference)**\n\n<div align=\"center\">\n <a href=\"https://pypi.org/project/pyforge-cli/\">\n <img src=\"https://img.shields.io/pypi/v/pyforge-cli.svg\" alt=\"PyPI version\">\n </a>\n <a href=\"https://pypi.org/project/pyforge-cli/\">\n <img src=\"https://img.shields.io/pypi/pyversions/pyforge-cli.svg\" alt=\"Python versions\">\n </a>\n <a href=\"https://github.com/Py-Forge-Cli/PyForge-CLI/blob/main/LICENSE\">\n <img src=\"https://img.shields.io/github/license/Py-Forge-Cli/PyForge-CLI.svg\" alt=\"License\">\n </a>\n <a href=\"https://github.com/Py-Forge-Cli/PyForge-CLI/actions\">\n <img src=\"https://github.com/Py-Forge-Cli/PyForge-CLI/workflows/CI/badge.svg\" alt=\"CI Status\">\n </a>\n</div>\n\n## Features\n\n- **PDF to Text Conversion**: Extract text from PDF documents with advanced options\n- **Excel to Parquet Conversion**: Convert Excel files (.xlsx) to Parquet format with multi-sheet support\n- **XML to Parquet Conversion**: Intelligent XML flattening with automatic structure detection and configurable strategies\n- **Database File Conversion**: Convert Microsoft Access (.mdb/.accdb), DBF, and SQL Server MDF files to Parquet\n- **MDF Tools Installer**: Automated setup of Docker and SQL Server Express for MDF file processing\n- 
**CSV to Parquet Conversion**: Smart delimiter detection and encoding handling\n- **Rich CLI Interface**: Beautiful terminal output with progress bars and tables\n- **Intelligent Processing**: Automatic encoding detection, structure analysis, and column matching\n- **Extensible Architecture**: Plugin-based system for adding new format converters\n- **Metadata Extraction**: Get detailed information about your files\n- **Cross-platform**: Works on Windows, macOS, and Linux\n\n## Installation\n\n### Stable Version (Recommended)\n\n```bash\npip install pyforge-cli\n```\n\n### Development Version\n\nTo test the latest features and bug fixes before they're officially released:\n\n```bash\n# Install from PyPI Test (latest development version)\npip install -i https://test.pypi.org/simple/ pyforge-cli\n\n# Or with dependency fallback\npip install --index-url https://test.pypi.org/simple/ \\\n --extra-index-url https://pypi.org/simple/ \\\n pyforge-cli\n```\n\n> **Note**: Development versions follow the pattern `X.Y.Z.devN+gCOMMIT` and are automatically deployed from the main branch.\n\n### From Source\n\n```bash\ngit clone https://github.com/Py-Forge-Cli/PyForge-CLI.git\ncd PyForge-CLI\npip install -e .\n```\n\n### Development Installation\n\n```bash\ngit clone https://github.com/Py-Forge-Cli/PyForge-CLI.git\ncd PyForge-CLI\npip install -e \".[dev,test]\"\n```\n\n### System Dependencies\n\nFor MDB/Access file support on non-Windows systems:\n\n```bash\n# Ubuntu/Debian\nsudo apt-get install mdbtools\n\n# macOS\nbrew install mdbtools\n```\n\n### MDF File Processing Setup\n\nFor SQL Server MDF file conversion, install the MDF Tools:\n\n```bash\n# Interactive setup wizard\npyforge install mdf-tools\n\n# Check installation status\npyforge mdf-tools status\n```\n\nThis installs Docker Desktop and SQL Server Express automatically. 
See [MDF Tools Documentation](https://py-forge-cli.github.io/PyForge-CLI/converters/mdf-tools-installer) for details.\n\n## Quick Start\n\n### Convert PDF to Text\n\n```bash\n# Convert entire PDF\npyforge convert document.pdf\n\n# Convert to specific output file\npyforge convert document.pdf output.txt\n\n# Convert specific page range\npyforge convert document.pdf --pages \"1-5\"\n\n# Include page metadata\npyforge convert document.pdf --metadata\n```\n\n### Convert Excel to Parquet\n\n```bash\n# Convert Excel file to Parquet\npyforge convert data.xlsx\n\n# Convert with specific compression\npyforge convert data.xlsx --compression gzip\n\n# Convert specific sheets only\npyforge convert data.xlsx --sheets \"Sheet1,Sheet3\"\n```\n\n### Convert Database Files\n\n```bash\n# Convert Access database to Parquet\npyforge convert database.mdb\n\n# Convert DBF file with encoding detection\npyforge convert data.dbf\n\n# Convert with custom output directory\npyforge convert database.accdb output_folder/\n```\n\n### Convert CSV Files\n\n```bash\n# Convert CSV with automatic delimiter and encoding detection\npyforge convert data.csv\n\n# Convert TSV (tab-separated) file\npyforge convert data.tsv\n\n# Convert with compression\npyforge convert sales_data.csv --compression gzip\n\n# Convert delimited text file\npyforge convert export.txt\n```\n\n### Convert XML Files\n\n```bash\n# Convert XML with automatic structure detection\npyforge convert data.xml\n\n# Convert with aggressive flattening for analytics\npyforge convert catalog.xml --flatten-strategy aggressive\n\n# Handle arrays by expanding to multiple rows\npyforge convert orders.xml --array-handling expand\n\n# Strip namespaces for cleaner column names\npyforge convert api_response.xml --namespace-handling strip\n\n# Preview schema before conversion\npyforge convert complex.xml --preview-schema\n```\n\n### Get File Information\n\n```bash\n# Display file metadata as table\npyforge info document.pdf\n\n# Get Excel file 
information\npyforge info spreadsheet.xlsx\n\n# Output metadata as JSON\npyforge info database.mdb --format json\n```\n\n### List Supported Formats\n\n```bash\npyforge formats\n```\n\n### Validate Files\n\n```bash\npyforge validate document.pdf\npyforge validate data.xlsx\n```\n\n### MDF File Processing\n\n```bash\n# Step 1: Install MDF processing tools (one-time setup)\npyforge install mdf-tools\n\n# Step 2: Check installation status\npyforge mdf-tools status\n\n# Step 3: Start SQL Server (if not running)\npyforge mdf-tools start\n\n# Step 4: Convert MDF files (when MDF converter is available)\n# pyforge convert database.mdf --format parquet\n\n# Container management\npyforge mdf-tools stop # Stop SQL Server\npyforge mdf-tools restart # Restart SQL Server\npyforge mdf-tools logs # View SQL Server logs\npyforge mdf-tools test # Test connectivity\npyforge mdf-tools config # Show configuration\npyforge mdf-tools uninstall # Remove everything\n```\n\n## Usage Examples\n\n### Basic PDF Conversion\n\n```bash\n# Convert PDF to text (creates report.txt in same directory)\npyforge convert report.pdf\n\n# Convert with custom output path\npyforge convert report.pdf /path/to/output.txt\n\n# Convert with verbose output\npyforge convert report.pdf --verbose\n\n# Force overwrite existing file\npyforge convert report.pdf output.txt --force\n```\n\n### Advanced PDF Options\n\n```bash\n# Convert pages 1-10\npyforge convert document.pdf --pages \"1-10\"\n\n# Convert from page 5 to end\npyforge convert document.pdf --pages \"5-\"\n\n# Convert up to page 10\npyforge convert document.pdf --pages \"-10\"\n\n# Include page markers in output\npyforge convert document.pdf --metadata\n```\n\n### Excel Conversion Examples\n\n```bash\n# Convert Excel with all sheets\npyforge convert sales_data.xlsx\n\n# Interactive mode - prompts for sheet selection\npyforge convert multi_sheet.xlsx --interactive\n\n# Convert sheets with matching columns into single file\npyforge convert monthly_reports.xlsx 
--merge-sheets\n\n# Generate summary report\npyforge convert data.xlsx --summary\n```\n\n### Database Conversion Examples\n\n```bash\n# Convert Access database (all tables)\npyforge convert company.mdb\n\n# Convert with progress tracking\npyforge convert large_database.accdb --verbose\n\n# Convert DBF with specific encoding\npyforge convert legacy.dbf --encoding cp1252\n\n# Batch convert all DBF files in directory\nfor file in *.dbf; do pyforge convert \"$file\"; done\n```\n\n### CSV Conversion Examples\n\n```bash\n# Convert CSV with automatic detection\npyforge convert sales_data.csv\n\n# Convert international CSV with auto-encoding detection\npyforge convert european_data.csv --verbose\n\n# Convert semicolon-delimited CSV (European format)\npyforge convert data_semicolon.csv\n\n# Convert tab-separated file with compression\npyforge convert data.tsv --compression gzip\n\n# Batch convert multiple CSV files\nfor file in *.csv; do pyforge convert \"$file\" --compression snappy; done\n```\n\n### XML Conversion Examples\n\n```bash\n# Convert XML with automatic structure detection\npyforge convert catalog.xml\n\n# Convert with aggressive flattening for data analysis\npyforge convert api_response.xml --flatten-strategy aggressive\n\n# Handle arrays as concatenated strings\npyforge convert orders.xml --array-handling concatenate\n\n# Strip namespaces for cleaner output\npyforge convert soap_response.xml --namespace-handling strip\n\n# Preview structure before conversion\npyforge convert complex_structure.xml --preview-schema\n\n# Convert compressed XML files\npyforge convert data.xml.gz --verbose\n\n# Batch convert XML files with specific strategy\nfor file in *.xml; do pyforge convert \"$file\" --flatten-strategy moderate; done\n```\n\n### File Information\n\n```bash\n# Show file metadata\npyforge info document.pdf\n\n# Excel file details (sheets, row counts)\npyforge info spreadsheet.xlsx\n\n# Database file information (tables, record counts)\npyforge info 
database.mdb\n\n# Export metadata as JSON\npyforge info document.pdf --format json > metadata.json\n```\n\n## Supported Formats\n\n| Input Format | Output Formats | Status |\n|-------------|----------------|---------|\n| PDF (.pdf) | Text (.txt) | \u2705 Available |\n| Excel (.xlsx) | Parquet (.parquet) | \u2705 Available |\n| XML (.xml, .xml.gz, .xml.bz2) | Parquet (.parquet) | \u2705 Available |\n| Access (.mdb/.accdb) | Parquet (.parquet) | \u2705 Available |\n| DBF (.dbf) | Parquet (.parquet) | \u2705 Available |\n| CSV (.csv, .tsv, .txt) | Parquet (.parquet) | \u2705 Available |\n\n## Development\n\n### Setting Up Development Environment\n\n```bash\n# Clone the repository\ngit clone https://github.com/Py-Forge-Cli/PyForge-CLI.git\ncd PyForge-CLI\n\n# Set up development environment\npip install -e \".[dev,test]\"\n\n# Run tests\npytest\n\n# Format code\nblack src tests\n\n# Run linting\nruff check src tests\n```\n\n### Development Commands\n\n```bash\n# Testing\npytest # Run tests\npytest --cov=pyforge_cli # Run tests with coverage\n\n# Code Quality\nblack src tests # Format code\nruff check src tests # Run linting\nmypy src # Type checking\n\n# Building\npython -m build # Build distribution packages\ntwine upload dist/* # Publish to PyPI\n```\n\n### Project Structure\n\n```text\nPyForge-CLI/\n\u251c\u2500\u2500 src/pyforge_cli/ # Main package\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u251c\u2500\u2500 main.py # CLI entry point\n\u2502   \u251c\u2500\u2500 converters/ # Format converters\n\u2502   \u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2502   \u251c\u2500\u2500 base.py # Base converter class\n\u2502   \u2502   \u251c\u2500\u2500 pdf_converter.py # PDF to text conversion\n\u2502   \u2502   \u251c\u2500\u2500 excel_converter.py # Excel to Parquet conversion\n\u2502   \u2502   \u251c\u2500\u2500 mdb_converter.py # MDB/ACCDB to Parquet conversion\n\u2502   \u2502   \u2514\u2500\u2500 dbf_converter.py # DBF to Parquet conversion\n\u2502   \u251c\u2500\u2500 plugins/ # Plugin system\n\u2502   \u2502   \u2514\u2500\u2500 loader.py # Plugin loading\n\u2502   \u2514\u2500\u2500 utils/ # Utilities\n\u2502       \u251c\u2500\u2500 file_utils.py # File operations\n\u2502       \u2514\u2500\u2500 cli_utils.py # CLI helpers\n\u251c\u2500\u2500 docs/ # Documentation source\n\u251c\u2500\u2500 tests/ # Test files\n\u251c\u2500\u2500 pyproject.toml # Project configuration\n\u2514\u2500\u2500 README.md # This file\n```\n\n## Requirements\n\n- Python 3.8+\n- PyMuPDF (for PDF processing)\n- Click (for CLI interface)\n- Rich (for beautiful terminal output)\n- Pandas & PyArrow (for data processing and Parquet support)\n- pandas-access (for MDB file support)\n- dbfread (for DBF file support)\n- openpyxl (for Excel file support)\n- chardet (for encoding detection)\n\n## Contributing\n\n1. Fork the repository\n2. Create a feature branch (`git checkout -b feature/amazing-feature`)\n3. Make your changes\n4. Run tests and linting (`pytest && ruff check src tests`)\n5. Commit your changes (`git commit -m 'Add amazing feature'`)\n6. Push to the branch (`git push origin feature/amazing-feature`)\n7. 
Open a Pull Request\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Roadmap\n\n### Version 0.2.0 - Database & Spreadsheet Support (Completed)\n- \u2705 **Excel to Parquet Conversion**\n  - Multi-sheet support with intelligent detection\n  - Interactive sheet selection mode\n  - Column matching for combined output\n  - Progress tracking and summary reports\n- \u2705 **MDB/ACCDB to Parquet Conversion**\n  - Microsoft Access database support (.mdb, .accdb)\n  - Automatic table discovery\n  - Cross-platform compatibility (Windows/Linux/macOS)\n  - Excel summary reports with sample data\n- \u2705 **DBF to Parquet Conversion**\n  - Automatic encoding detection\n  - Support for various DBF formats\n  - Robust error handling for corrupted files\n\n### Version 0.3.0 - Enhanced Features (Released)\n- \u2705 **XML to Parquet Conversion**\n  - Automatic XML structure detection and analysis\n  - Intelligent flattening strategies (conservative, moderate, aggressive)\n  - Array handling modes (expand, concatenate, json_string)\n  - Namespace processing (preserve, strip, prefix)\n  - Schema preview before conversion\n  - Support for compressed XML files (.xml.gz, .xml.bz2)\n- \u2705 **CSV to Parquet conversion** with auto-detection (string-based)\n- [ ] CSV schema inference and native type conversion\n- [ ] JSON processing and flattening\n- [ ] Data validation and cleaning options\n- [ ] Batch processing with pattern matching\n- [ ] Configuration file support\n- [ ] REST API wrapper for notebook integration\n- [ ] Data type preservation options (beyond string conversion)\n\n### Version 0.4.0 - MDF Tools Infrastructure (Completed)\n- \u2705 **MDF Tools Installer**\n  - Automated Docker Desktop installation (macOS/Windows/Linux)\n  - SQL Server Express 2019 container setup\n  - Interactive setup wizard with non-interactive mode\n  - Cross-platform compatibility with system package managers\n- \u2705 **Container Management Commands**\n  - 9 lifecycle commands: status, start, stop, restart, logs, test, config, uninstall\n  - Rich terminal UI with status tables and progress tracking\n  - Configuration management with JSON persistence\n  - Health monitoring and connectivity testing\n- \u2705 **Infrastructure Foundation**\n  - Complete Docker and SQL Server environment for MDF processing\n  - Foundation ready for MDF file conversion implementation\n  - Comprehensive documentation with live terminal examples\n  - ASCII architecture diagrams and troubleshooting guides\n\n### Version 0.5.0 - MDF File Conversion (Planned)\n- [ ] **MDF to Parquet Converter** (Issue #13)\n  - SQL Server MDF file attachment and processing\n  - 6-stage conversion process (matching MDB converter)\n  - String-only data conversion for Phase 1 consistency\n  - Batch processing with progress tracking\n  - Excel summary reports with conversion statistics\n\n### Version 0.5.1 - Test Datasets Collection (Planned)\n- [ ] **Sample Datasets Installation** (Issue #15)\n  - CLI command: `pyforge install sample-datasets`\n  - Curated test datasets for all formats (XML, MDF, DBF, MDB, CSV, PDF)\n  - Multiple size categories: small (<1MB), medium (1-100MB), large (100MB-3GB)\n  - Direct download from online sources with metadata\n  - Organized directory structure for testing scenarios\n\n### Version 0.6.0 - Advanced Features (Future)\n- [ ] SQL query support for database files\n- [ ] Data transformation pipelines\n- [ ] Cloud storage integration (S3, Azure Blob)\n- [ ] Incremental/delta conversions\n- [ ] Custom plugin development SDK\n\n## Support\n\nIf you encounter any issues or have questions:\n\n1. Check the [\ud83d\udcda Complete Documentation](https://py-forge-cli.github.io/PyForge-CLI/)\n2. Search [existing issues](https://github.com/Py-Forge-Cli/PyForge-CLI/issues)\n3. Create a [new issue](https://github.com/Py-Forge-Cli/PyForge-CLI/issues/new)\n4. 
Join the [discussion](https://github.com/Py-Forge-Cli/PyForge-CLI/discussions)\n\n## Changelog\n\n### 1.0.8 (Current Release)\n\n- \u2705 **Complete Testing Infrastructure Overhaul**: Fixed 13 major issues across infrastructure and notebooks\n- \u2705 **Sample Datasets Installation**: Fixed with intelligent fallback versioning system\n- \u2705 **Missing Dependencies**: Added PyMuPDF, chardet, requests to resolve import errors\n- \u2705 **Convert Command Fix**: Resolved TypeError in ConverterRegistry API\n- \u2705 **Comprehensive Testing Framework**: Created systematic testing with 402 lines of test code\n- \u2705 **Notebook Organization**: Restructured with proper unit/integration/functional hierarchy\n- \u2705 **Cross-Environment Support**: Both local and Databricks notebooks fully functional\n- \u2705 **Enhanced Error Handling**: Smart file selection, directory creation, PDF skip logic\n- \u2705 **Developer Documentation**: Complete guides and deployment documentation\n\n### 0.4.0\n\n- \u2705 **MDF Tools Installer**: Complete automated Docker Desktop and SQL Server Express 2019 setup\n- \u2705 **Cross-Platform Installation**: System package managers (Homebrew, Winget, apt/yum)\n- \u2705 **Container Management**: 9 lifecycle commands for SQL Server operations\n- \u2705 **Interactive Setup Wizard**: User-guided installation with smart detection\n- \u2705 **Rich Terminal UI**: Beautiful status tables, progress bars, and error handling\n- \u2705 **Critical Docker Fix**: Proper system package manager installation (not pip)\n- \u2705 **Infrastructure Foundation**: Complete environment for MDF file processing\n- \u2705 **Comprehensive Documentation**: Live terminal examples and ASCII diagrams\n\n### 0.3.0\n\n- \u2705 **XML to Parquet Converter**: Complete implementation with intelligent flattening\n- \u2705 **Automatic Structure Detection**: Analyzes XML hierarchy and array patterns\n- \u2705 **Flexible Flattening Strategies**: Conservative, moderate, and aggressive 
options\n- \u2705 **Advanced Array Handling**: Expand, concatenate, or JSON string modes\n- \u2705 **Namespace Support**: Configurable namespace processing\n- \u2705 **Schema Preview**: Optional structure preview before conversion\n- \u2705 **Comprehensive Documentation**: User guide and quick reference\n- \u2705 **Compressed XML Support**: Handles .xml.gz and .xml.bz2 files\n\n### 0.2.1\n\n- \u2705 **Complete Documentation Site**: Comprehensive GitHub Pages documentation\n- \u2705 **Fixed CI/CD**: GitHub Actions workflow for automated PyPI publishing \n- \u2705 **Improved Distribution**: API token authentication and automation\n- \u2705 **Better Navigation**: Fixed broken links and improved project structure\n\n### 0.2.0\n\n- \u2705 Excel to Parquet conversion with multi-sheet support\n- \u2705 MDB/ACCDB to Parquet conversion with cross-platform support\n- \u2705 DBF to Parquet conversion with encoding detection\n- \u2705 Interactive mode for Excel sheet selection\n- \u2705 Automatic table discovery for database files\n- \u2705 Progress tracking with rich terminal UI\n- \u2705 Excel summary reports for batch conversions\n- \u2705 Robust error handling and recovery\n\n### 0.1.0 (Initial Release)\n\n- PDF to text conversion\n- CLI interface with Click\n- Rich terminal output\n- File metadata extraction\n- Page range support\n- Development tooling setup\n\n## \ud83d\ude80 Development & Deployment\n\n### Automated Versioning\n\nPyForge CLI uses automated versioning with setuptools-scm:\n\n- **Development versions**: `1.0.x.devN` - Auto-deployed to [PyPI Test](https://test.pypi.org/project/pyforge-cli/) on every commit to main\n- **Release versions**: `1.0.x` - Deployed to [PyPI](https://pypi.org/project/pyforge-cli/) when Git tags are created\n- **Version increment**: Development versions auto-increment on each commit (dev1 \u2192 dev2 \u2192 dev3...)\n\n### Installation from Test Repository\n\n```bash\n# Install latest development version\npip install -i 
https://test.pypi.org/simple/ pyforge-cli\n\n# Install specific development version \npip install -i https://test.pypi.org/simple/ pyforge-cli==1.0.8.dev5\n```\n\n### CI/CD Pipeline\n\n- **Trigger**: Every push to `main` branch automatically builds and deploys to PyPI Test\n- **Release**: Create a Git tag to trigger deployment to PyPI Production\n- **Testing**: Use `pyforge-cli` from test.pypi.org for validation before release\n\n### Recent Updates (v1.0.8)\n\n**Complete Testing Infrastructure Overhaul** - Fixed 13 major issues across PyForge CLI infrastructure and testing notebooks:\n\n- \u2705 **Infrastructure Fixes**: Sample datasets installation, missing dependencies, convert command API\n- \u2705 **Testing Framework**: Comprehensive testing suite with 402 lines of test code \n- \u2705 **Notebook Support**: Full local and Databricks notebook functionality\n- \u2705 **Error Handling**: Smart file selection, directory creation, PDF skip logic\n- \u2705 **Documentation**: Complete developer guides and deployment processes\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "A powerful CLI tool for data format conversion and synthetic data generation",
"version": "1.0.9",
"project_urls": {
"Changelog": "https://github.com/Py-Forge-Cli/PyForge-CLI/blob/main/CHANGELOG.md",
"Documentation": "https://github.com/Py-Forge-Cli/PyForge-CLI/blob/main/docs",
"Homepage": "https://github.com/Py-Forge-Cli/PyForge-CLI",
"Issues": "https://github.com/Py-Forge-Cli/PyForge-CLI/issues",
"Repository": "https://github.com/Py-Forge-Cli/PyForge-CLI"
},
"split_keywords": [
"cli",
" data",
" conversion",
" pdf",
" csv",
" parquet",
" excel",
" database",
" mdb",
" dbf",
" synthetic-data"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "9ec3ca49b4b410c7a828d795c0e36289b5114671ff2efcd12e80243e8225e063",
"md5": "7dbee674e9e92e4284dc633c4b5ee6ee",
"sha256": "057e69b7203c8ed609a6c17b16cc4df70f32fea25e81df206d18bb5ff7786f4d"
},
"downloads": -1,
"filename": "pyforge_cli-1.0.9-py3-none-any.whl",
"has_sig": false,
"md5_digest": "7dbee674e9e92e4284dc633c4b5ee6ee",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 6777237,
"upload_time": "2025-07-11T04:38:01",
"upload_time_iso_8601": "2025-07-11T04:38:01.133610Z",
"url": "https://files.pythonhosted.org/packages/9e/c3/ca49b4b410c7a828d795c0e36289b5114671ff2efcd12e80243e8225e063/pyforge_cli-1.0.9-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "db708098593e06efc63eacc57823a32b42ce6f0816bbffd82442a23a92368f62",
"md5": "afc1f5a6459b68647df5fba192e5ae95",
"sha256": "b05224398454c7c3901ddddf604ae4f3d6e809c32aa31270dd18d94921408425"
},
"downloads": -1,
"filename": "pyforge_cli-1.0.9.tar.gz",
"has_sig": false,
"md5_digest": "afc1f5a6459b68647df5fba192e5ae95",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 7318447,
"upload_time": "2025-07-11T04:38:06",
"upload_time_iso_8601": "2025-07-11T04:38:06.458979Z",
"url": "https://files.pythonhosted.org/packages/db/70/8098593e06efc63eacc57823a32b42ce6f0816bbffd82442a23a92368f62/pyforge_cli-1.0.9.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-11 04:38:06",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Py-Forge-Cli",
"github_project": "PyForge-CLI",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "pyforge-cli"
}