pyforge-cli

| Field | Value |
|-------|-------|
| Name | pyforge-cli |
| Version | 1.0.9 |
| Summary | A powerful CLI tool for data format conversion and synthetic data generation |
| upload_time | 2025-07-11 04:38:06 |
| requires_python | >=3.8 |
| license | MIT |
| keywords | cli, data, conversion, pdf, csv, parquet, excel, database, mdb, dbf, synthetic-data |

# PyForge CLI

<div align="center">
  <img src="assets/icon_pyforge_forge.svg" alt="PyForge CLI" width="128" height="128">
</div>

<div align="center">
  <strong>A powerful command-line tool for data format conversion and manipulation, built with Python and Click.</strong>
</div>

## 📖 Documentation

**[📚 Complete Documentation](https://py-forge-cli.github.io/PyForge-CLI/)** | **[🚀 Quick Start](https://py-forge-cli.github.io/PyForge-CLI/getting-started/quick-start)** | **[📦 Installation Guide](https://py-forge-cli.github.io/PyForge-CLI/getting-started/installation)** | **[🔧 CLI Reference](https://py-forge-cli.github.io/PyForge-CLI/reference/cli-reference)**

<div align="center">
  <a href="https://pypi.org/project/pyforge-cli/">
    <img src="https://img.shields.io/pypi/v/pyforge-cli.svg" alt="PyPI version">
  </a>
  <a href="https://pypi.org/project/pyforge-cli/">
    <img src="https://img.shields.io/pypi/pyversions/pyforge-cli.svg" alt="Python versions">
  </a>
  <a href="https://github.com/Py-Forge-Cli/PyForge-CLI/blob/main/LICENSE">
    <img src="https://img.shields.io/github/license/Py-Forge-Cli/PyForge-CLI.svg" alt="License">
  </a>
  <a href="https://github.com/Py-Forge-Cli/PyForge-CLI/actions">
    <img src="https://github.com/Py-Forge-Cli/PyForge-CLI/workflows/CI/badge.svg" alt="CI Status">
  </a>
</div>

## Features

- **PDF to Text Conversion**: Extract text from PDF documents with advanced options
- **Excel to Parquet Conversion**: Convert Excel files (.xlsx) to Parquet format with multi-sheet support
- **XML to Parquet Conversion**: Intelligent XML flattening with automatic structure detection and configurable strategies
- **Database File Conversion**: Convert Microsoft Access (.mdb/.accdb) and DBF files to Parquet; SQL Server MDF conversion is planned (see Roadmap)
- **MDF Tools Installer**: Automated setup of Docker and SQL Server Express for MDF file processing
- **CSV to Parquet Conversion**: Smart delimiter detection and encoding handling
- **Rich CLI Interface**: Beautiful terminal output with progress bars and tables
- **Intelligent Processing**: Automatic encoding detection, structure analysis, and column matching
- **Extensible Architecture**: Plugin-based system for adding new format converters
- **Metadata Extraction**: Get detailed information about your files
- **Cross-platform**: Works on Windows, macOS, and Linux

## Installation

### Stable Version (Recommended)

```bash
pip install pyforge-cli
```

### Development Version

To test the latest features and bug fixes before they're officially released:

```bash
# Install from PyPI Test (latest development version)
pip install -i https://test.pypi.org/simple/ pyforge-cli

# Or with dependency fallback
pip install --index-url https://test.pypi.org/simple/ \
           --extra-index-url https://pypi.org/simple/ \
           pyforge-cli
```

> **Note**: Development versions published to TestPyPI follow the pattern `X.Y.Z.devN` and are automatically deployed from the main branch. Builds made locally from source may additionally carry a `+gCOMMIT` suffix.

### From Source

```bash
git clone https://github.com/Py-Forge-Cli/PyForge-CLI.git
cd PyForge-CLI
pip install -e .
```

### Development Installation

```bash
git clone https://github.com/Py-Forge-Cli/PyForge-CLI.git
cd PyForge-CLI
pip install -e ".[dev,test]"
```

### System Dependencies

For MDB/Access file support on non-Windows systems:

```bash
# Ubuntu/Debian
sudo apt-get install mdbtools

# macOS
brew install mdbtools
```

### MDF File Processing Setup

For SQL Server MDF file conversion, install the MDF Tools:

```bash
# Interactive setup wizard
pyforge install mdf-tools

# Check installation status
pyforge mdf-tools status
```

This installs Docker Desktop and SQL Server Express automatically. See [MDF Tools Documentation](https://py-forge-cli.github.io/PyForge-CLI/converters/mdf-tools-installer) for details.

## Quick Start

### Convert PDF to Text

```bash
# Convert entire PDF
pyforge convert document.pdf

# Convert to specific output file
pyforge convert document.pdf output.txt

# Convert specific page range
pyforge convert document.pdf --pages "1-5"

# Include page metadata
pyforge convert document.pdf --metadata
```

### Convert Excel to Parquet

```bash
# Convert Excel file to Parquet
pyforge convert data.xlsx

# Convert with specific compression
pyforge convert data.xlsx --compression gzip

# Convert specific sheets only
pyforge convert data.xlsx --sheets "Sheet1,Sheet3"
```

### Convert Database Files

```bash
# Convert Access database to Parquet
pyforge convert database.mdb

# Convert DBF file with encoding detection
pyforge convert data.dbf

# Convert with custom output directory
pyforge convert database.accdb output_folder/
```
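Why the encoding matters for DBF files: character fields are stored as fixed-width, space-padded bytes, so the same bytes decode to different text under different code pages. The snippet below is a minimal illustration using only the Python standard library; `decode_field` is a hypothetical helper, not part of PyForge.

```python
def decode_field(raw: bytes, encoding: str) -> str:
    """Decode a fixed-width character field and drop its padding."""
    return raw.decode(encoding).rstrip(" \x00")


# The same stored bytes, read with two different code pages.
raw_field = b"Caf\xe9 M\xfcnchen   "

for encoding in ("cp1252", "cp437"):
    print(f"{encoding}: {decode_field(raw_field, encoding)!r}")
```

If the detected encoding produces garbled accented characters, passing an explicit `--encoding` value is the usual fix.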

### Convert CSV Files

```bash
# Convert CSV with automatic delimiter and encoding detection
pyforge convert data.csv

# Convert TSV (tab-separated) file
pyforge convert data.tsv

# Convert with compression
pyforge convert sales_data.csv --compression gzip

# Convert delimited text file
pyforge convert export.txt
```
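For readers curious how delimiter detection can work in principle, here is a small sketch built on the standard library's `csv.Sniffer`. It is an illustration only: PyForge's own detection logic may differ, and `detect_delimiter` and its comma fallback are assumptions made for this example.

```python
import csv


def detect_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the delimiter of a delimited-text sample with csv.Sniffer."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Assumption for this sketch: fall back to a comma when unsure.
        return ","


semicolon_sample = "name;city;amount\nAnna;Berlin;10\nBen;Paris;20\n"
tab_sample = "name\tcity\tamount\nAnna\tBerlin\t10\nBen\tParis\t20\n"

print(repr(detect_delimiter(semicolon_sample)))
print(repr(detect_delimiter(tab_sample)))
```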

### Convert XML Files

```bash
# Convert XML with automatic structure detection
pyforge convert data.xml

# Convert with aggressive flattening for analytics
pyforge convert catalog.xml --flatten-strategy aggressive

# Handle arrays by expanding to multiple rows
pyforge convert orders.xml --array-handling expand

# Strip namespaces for cleaner column names
pyforge convert api_response.xml --namespace-handling strip

# Preview schema before conversion
pyforge convert complex.xml --preview-schema
```
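The difference between expanding and concatenating arrays is easiest to see on a tiny document. The sketch below uses `xml.etree.ElementTree` to mimic the two modes on a hard-coded `<orders>` sample; the function name, the column names, and the `;` separator are assumptions for illustration, not PyForge's actual output schema.

```python
import xml.etree.ElementTree as ET

ORDERS_XML = """
<orders>
  <order id="1"><customer>Anna</customer><item>pen</item><item>ink</item></order>
  <order id="2"><customer>Ben</customer><item>paper</item></order>
</orders>
"""


def flatten_orders(xml_text, array_handling="expand"):
    """Flatten <order> records into rows of strings.

    'expand' emits one row per repeated <item>;
    'concatenate' joins repeated values into one delimited string.
    """
    rows = []
    for order in ET.fromstring(xml_text).iter("order"):
        base = {"order_id": order.get("id"), "customer": order.findtext("customer")}
        items = [item.text for item in order.findall("item")]
        if array_handling == "expand":
            rows.extend({**base, "item": item} for item in items)
        elif array_handling == "concatenate":
            rows.append({**base, "item": ";".join(items)})
        else:
            raise ValueError(f"unsupported array_handling: {array_handling!r}")
    return rows


for mode in ("expand", "concatenate"):
    print(mode, flatten_orders(ORDERS_XML, mode))
```

Expanding keeps every value in its own row at the cost of repeating the parent columns; concatenating keeps one row per record but leaves the values packed in a string.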

### Get File Information

```bash
# Display file metadata as table
pyforge info document.pdf

# Get Excel file information
pyforge info spreadsheet.xlsx

# Output metadata as JSON
pyforge info database.mdb --format json
```

### List Supported Formats

```bash
pyforge formats
```

### Validate Files

```bash
pyforge validate document.pdf
pyforge validate data.xlsx
```

### MDF File Processing

```bash
# Step 1: Install MDF processing tools (one-time setup)
pyforge install mdf-tools

# Step 2: Check installation status
pyforge mdf-tools status

# Step 3: Start SQL Server (if not running)
pyforge mdf-tools start

# Step 4: Convert MDF files (when MDF converter is available)
# pyforge convert database.mdf --format parquet

# Container management
pyforge mdf-tools stop      # Stop SQL Server
pyforge mdf-tools restart   # Restart SQL Server
pyforge mdf-tools logs      # View SQL Server logs
pyforge mdf-tools test      # Test connectivity
pyforge mdf-tools config    # Show configuration
pyforge mdf-tools uninstall # Remove everything
```

## Usage Examples

### Basic PDF Conversion

```bash
# Convert PDF to text (creates report.txt in same directory)
pyforge convert report.pdf

# Convert with custom output path
pyforge convert report.pdf /path/to/output.txt

# Convert with verbose output
pyforge convert report.pdf --verbose

# Force overwrite existing file
pyforge convert report.pdf output.txt --force
```

### Advanced PDF Options

```bash
# Convert pages 1-10
pyforge convert document.pdf --pages "1-10"

# Convert from page 5 to end
pyforge convert document.pdf --pages "5-"

# Convert up to page 10
pyforge convert document.pdf --pages "-10"

# Include page markers in output
pyforge convert document.pdf --metadata
```
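The three range forms above (`"1-10"`, `"5-"`, `"-10"`) can be read as 1-based, inclusive page ranges with an optional open end. The following sketch shows one way to interpret them; `parse_page_range` is a hypothetical helper, and clamping the end to the document length is an assumption rather than documented PyForge behavior.

```python
def parse_page_range(spec: str, total_pages: int) -> range:
    """Turn '1-10', '5-' or '-10' into a 1-based inclusive page range."""
    start_text, sep, end_text = spec.strip().partition("-")
    if not sep:
        raise ValueError(f"expected a range such as '1-10', got {spec!r}")
    start = int(start_text) if start_text else 1          # '-10' starts at page 1
    end = int(end_text) if end_text else total_pages      # '5-' runs to the end
    if not 1 <= start <= end:
        raise ValueError(f"invalid page range: {spec!r}")
    # Assumption: an end past the last page is clamped, not rejected.
    return range(start, min(end, total_pages) + 1)


for spec in ("1-10", "5-", "-10"):
    print(spec, list(parse_page_range(spec, total_pages=8)))
```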

### Excel Conversion Examples

```bash
# Convert Excel with all sheets
pyforge convert sales_data.xlsx

# Interactive mode - prompts for sheet selection
pyforge convert multi_sheet.xlsx --interactive

# Convert sheets with matching columns into single file
pyforge convert monthly_reports.xlsx --merge-sheets

# Generate summary report
pyforge convert data.xlsx --summary
```

### Database Conversion Examples

```bash
# Convert Access database (all tables)
pyforge convert company.mdb

# Convert with progress tracking
pyforge convert large_database.accdb --verbose

# Convert DBF with specific encoding
pyforge convert legacy.dbf --encoding cp1252

# Batch convert all DBF files in directory
for file in *.dbf; do pyforge convert "$file"; done
```
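On systems without a POSIX shell, the batch loop can be written in Python instead. This is a sketch under two assumptions: `pyforge` is on the `PATH`, and it exits with a non-zero status when a conversion fails. `build_commands` and `convert_all` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_commands(directory, pattern="*.dbf", extra_args=()):
    """Return one `pyforge convert` argument list per matching file."""
    return [
        ["pyforge", "convert", str(path), *extra_args]
        for path in sorted(Path(directory).glob(pattern))
    ]


def convert_all(directory, pattern="*.dbf", extra_args=()):
    """Run each conversion in turn and return the files that failed."""
    failed = []
    for command in build_commands(directory, pattern, extra_args):
        # Assumption: a non-zero exit status signals a failed conversion.
        if subprocess.run(command).returncode != 0:
            failed.append(command[2])
    return failed
```

For example, `convert_all(".", "*.dbf", ["--encoding", "cp1252"])` would run the DBF conversion shown above for every matching file and report the ones that did not succeed.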

### CSV Conversion Examples

```bash
# Convert CSV with automatic detection
pyforge convert sales_data.csv

# Convert international CSV with auto-encoding detection
pyforge convert european_data.csv --verbose

# Convert semicolon-delimited CSV (European format)
pyforge convert data_semicolon.csv

# Convert tab-separated file with compression
pyforge convert data.tsv --compression gzip

# Batch convert multiple CSV files
for file in *.csv; do pyforge convert "$file" --compression snappy; done
```

### XML Conversion Examples

```bash
# Convert XML with automatic structure detection
pyforge convert catalog.xml

# Convert with aggressive flattening for data analysis
pyforge convert api_response.xml --flatten-strategy aggressive

# Handle arrays as concatenated strings
pyforge convert orders.xml --array-handling concatenate

# Strip namespaces for cleaner output
pyforge convert soap_response.xml --namespace-handling strip

# Preview structure before conversion
pyforge convert complex_structure.xml --preview-schema

# Convert compressed XML files
pyforge convert data.xml.gz --verbose

# Batch convert XML files with specific strategy
for file in *.xml; do pyforge convert "$file" --flatten-strategy moderate; done
```
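To check what a compressed, namespaced XML file contains before converting it, the standard library is enough. The sketch below opens `.xml`, `.xml.gz`, or `.xml.bz2` files and lists element names with namespaces removed. It illustrates the idea behind namespace stripping and is not PyForge's implementation; all function names are hypothetical.

```python
import bz2
import gzip
import xml.etree.ElementTree as ET
from pathlib import Path


def open_xml(path):
    """Open plain, gzip- or bzip2-compressed XML as a binary file object."""
    suffix = Path(path).suffix.lower()
    if suffix == ".gz":
        return gzip.open(path, "rb")
    if suffix == ".bz2":
        return bz2.open(path, "rb")
    return open(path, "rb")


def strip_namespace(tag: str) -> str:
    """'{http://example.com/ns}price' -> 'price'."""
    return tag.rsplit("}", 1)[-1]


def element_names(path):
    """List element names in document order, without namespaces."""
    with open_xml(path) as handle:
        root = ET.parse(handle).getroot()
    return [strip_namespace(element.tag) for element in root.iter()]
```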

### File Information

```bash
# Show file metadata
pyforge info document.pdf

# Excel file details (sheets, row counts)
pyforge info spreadsheet.xlsx

# Database file information (tables, record counts)
pyforge info database.mdb

# Export metadata as JSON
pyforge info document.pdf --format json > metadata.json
```

## Supported Formats

| Input Format | Output Formats | Status |
|-------------|----------------|---------|
| PDF (.pdf)  | Text (.txt)    | ✅ Available |
| Excel (.xlsx) | Parquet (.parquet) | ✅ Available |
| XML (.xml, .xml.gz, .xml.bz2) | Parquet (.parquet) | ✅ Available |
| Access (.mdb/.accdb) | Parquet (.parquet) | ✅ Available |
| DBF (.dbf)  | Parquet (.parquet) | ✅ Available |
| CSV (.csv, .tsv, .txt) | Parquet (.parquet) | ✅ Available |
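
The table can also be expressed as a lookup, which is handy in scripts that need to predict the output type before calling `pyforge convert`. This is a sketch that mirrors the table only; `OUTPUT_FORMATS` and `output_format` are hypothetical names, and `pyforge formats` remains the authoritative list.

```python
from pathlib import Path

# Input suffix -> output format, copied from the table above.
OUTPUT_FORMATS = {
    ".pdf": "txt",
    ".xlsx": "parquet",
    ".xml": "parquet",
    ".xml.gz": "parquet",
    ".xml.bz2": "parquet",
    ".mdb": "parquet",
    ".accdb": "parquet",
    ".dbf": "parquet",
    ".csv": "parquet",
    ".tsv": "parquet",
    ".txt": "parquet",
}


def output_format(path) -> str:
    """Return the output format listed for a file's suffix."""
    name = Path(path).name.lower()
    # Check longer suffixes first so 'data.xml.gz' matches '.xml.gz'.
    for suffix in sorted(OUTPUT_FORMATS, key=len, reverse=True):
        if name.endswith(suffix):
            return OUTPUT_FORMATS[suffix]
    raise ValueError(f"unsupported input format: {path}")


print(output_format("report.pdf"))
print(output_format("data.xml.gz"))
```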

## Development

### Setting Up Development Environment

```bash
# Clone the repository
git clone https://github.com/Py-Forge-Cli/PyForge-CLI.git
cd PyForge-CLI

# Set up development environment
pip install -e ".[dev,test]"

# Run tests
pytest

# Format code
black src tests

# Run linting
ruff check src tests
```

### Development Commands

```bash
# Testing
pytest                                    # Run tests
pytest --cov=pyforge_cli                 # Run tests with coverage

# Code Quality
black src tests                          # Format code
ruff check src tests                     # Run linting
mypy src                                 # Type checking

# Building
python -m build                          # Build distribution packages
twine upload dist/*                      # Publish to PyPI
```

### Project Structure

```text
PyForge-CLI/
├── src/pyforge_cli/            # Main package
│   ├── __init__.py
│   ├── main.py                 # CLI entry point
│   ├── converters/             # Format converters
│   │   ├── __init__.py
│   │   ├── base.py            # Base converter class
│   │   ├── pdf_converter.py   # PDF to text conversion
│   │   ├── excel_converter.py # Excel to Parquet conversion
│   │   ├── mdb_converter.py   # MDB/ACCDB to Parquet conversion
│   │   └── dbf_converter.py   # DBF to Parquet conversion
│   ├── plugins/               # Plugin system
│   │   └── loader.py          # Plugin loading
│   └── utils/                 # Utilities
│       ├── file_utils.py      # File operations
│       └── cli_utils.py       # CLI helpers
├── docs/                      # Documentation source
├── tests/                     # Test files
├── pyproject.toml            # Project configuration
└── README.md                 # This file
```
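
To give a feel for how a plugin-style layout such as `converters/base.py` plus a loader can fit together, here is a generic registry sketch. Every name in it (`BaseConverter`, `Registry`, `UppercaseTextConverter`) is hypothetical and chosen for illustration; consult the source under `src/pyforge_cli/` for the real classes and their signatures.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseConverter(ABC):
    """Hypothetical converter interface, not PyForge's actual base class."""

    input_suffixes = ()

    @abstractmethod
    def convert(self, source: Path, target: Path) -> None:
        """Read `source` and write the converted result to `target`."""


class Registry:
    """Maps input suffixes to converter classes."""

    def __init__(self):
        self._by_suffix = {}

    def register(self, converter_cls):
        for suffix in converter_cls.input_suffixes:
            self._by_suffix[suffix.lower()] = converter_cls
        return converter_cls  # returning the class allows decorator use

    def for_file(self, source):
        suffix = Path(source).suffix.lower()
        converter_cls = self._by_suffix.get(suffix)
        if converter_cls is None:
            raise ValueError(f"no converter registered for {suffix!r}")
        return converter_cls()


registry = Registry()


@registry.register
class UppercaseTextConverter(BaseConverter):
    """Toy converter that exists only to exercise the registry."""

    input_suffixes = (".txt",)

    def convert(self, source, target):
        text = Path(source).read_text(encoding="utf-8")
        Path(target).write_text(text.upper(), encoding="utf-8")
```

With this shape, adding a format means writing one class and registering it; the command-line layer only ever asks the registry for a converter.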

## Requirements

- Python 3.8+
- PyMuPDF (for PDF processing)
- Click (for CLI interface)
- Rich (for beautiful terminal output)
- Pandas & PyArrow (for data processing and Parquet support)
- pandas-access (for MDB file support)
- dbfread (for DBF file support)
- openpyxl (for Excel file support)
- chardet (for encoding detection)

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Run tests and linting (`pytest && ruff check src tests`)
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Roadmap

### Version 0.2.0 - Database & Spreadsheet Support (Completed)
- ✅ **Excel to Parquet Conversion**
  - Multi-sheet support with intelligent detection
  - Interactive sheet selection mode
  - Column matching for combined output
  - Progress tracking and summary reports
- ✅ **MDB/ACCDB to Parquet Conversion**
  - Microsoft Access database support (.mdb, .accdb)
  - Automatic table discovery
  - Cross-platform compatibility (Windows/Linux/macOS)
  - Excel summary reports with sample data
- ✅ **DBF to Parquet Conversion**
  - Automatic encoding detection
  - Support for various DBF formats
  - Robust error handling for corrupted files

### Version 0.3.0 - Enhanced Features (Released)
- ✅ **XML to Parquet Conversion**
  - Automatic XML structure detection and analysis
  - Intelligent flattening strategies (conservative, moderate, aggressive)
  - Array handling modes (expand, concatenate, json_string)
  - Namespace processing (preserve, strip, prefix)
  - Schema preview before conversion
  - Support for compressed XML files (.xml.gz, .xml.bz2)
- ✅ **CSV to Parquet conversion** with auto-detection (string-based)
- [ ] CSV schema inference and native type conversion
- [ ] JSON processing and flattening
- [ ] Data validation and cleaning options
- [ ] Batch processing with pattern matching
- [ ] Configuration file support
- [ ] REST API wrapper for notebook integration
- [ ] Data type preservation options (beyond string conversion)

### Version 0.4.0 - MDF Tools Infrastructure (Completed)
- ✅ **MDF Tools Installer**
  - Automated Docker Desktop installation (macOS/Windows/Linux)
  - SQL Server Express 2019 container setup
  - Interactive setup wizard with non-interactive mode
  - Cross-platform compatibility with system package managers
- ✅ **Container Management Commands**
  - 9 lifecycle commands: install, status, start, stop, restart, logs, test, config, uninstall
  - Rich terminal UI with status tables and progress tracking
  - Configuration management with JSON persistence
  - Health monitoring and connectivity testing
- ✅ **Infrastructure Foundation**
  - Complete Docker and SQL Server environment for MDF processing
  - Foundation ready for MDF file conversion implementation
  - Comprehensive documentation with live terminal examples
  - ASCII architecture diagrams and troubleshooting guides

### Version 0.5.0 - MDF File Conversion (Planned)
- [ ] **MDF to Parquet Converter** (Issue #13)
  - SQL Server MDF file attachment and processing
  - 6-stage conversion process (matching MDB converter)
  - String-only data conversion for Phase 1 consistency
  - Batch processing with progress tracking
  - Excel summary reports with conversion statistics

### Version 0.5.1 - Test Datasets Collection (Planned)
- [ ] **Sample Datasets Installation** (Issue #15)
  - CLI command: `pyforge install sample-datasets`
  - Curated test datasets for all formats (XML, MDF, DBF, MDB, CSV, PDF)
  - Multiple size categories: small (<1MB), medium (1-100MB), large (100MB-3GB)
  - Direct download from online sources with metadata
  - Organized directory structure for testing scenarios

### Version 0.6.0 - Advanced Features (Future)
- [ ] SQL query support for database files
- [ ] Data transformation pipelines
- [ ] Cloud storage integration (S3, Azure Blob)
- [ ] Incremental/delta conversions
- [ ] Custom plugin development SDK

## Support

If you encounter any issues or have questions:

1. Check the [📚 Complete Documentation](https://py-forge-cli.github.io/PyForge-CLI/)
2. Search [existing issues](https://github.com/Py-Forge-Cli/PyForge-CLI/issues)
3. Create a [new issue](https://github.com/Py-Forge-Cli/PyForge-CLI/issues/new)
4. Join the [discussion](https://github.com/Py-Forge-Cli/PyForge-CLI/discussions)

## Changelog

### 1.0.8

- ✅ **Complete Testing Infrastructure Overhaul**: Fixed 13 major issues across infrastructure and notebooks
- ✅ **Sample Datasets Installation**: Fixed with intelligent fallback versioning system
- ✅ **Missing Dependencies**: Added PyMuPDF, chardet, requests to resolve import errors
- ✅ **Convert Command Fix**: Resolved TypeError in ConverterRegistry API
- ✅ **Comprehensive Testing Framework**: Created systematic testing with 402 lines of test code
- ✅ **Notebook Organization**: Restructured with proper unit/integration/functional hierarchy
- ✅ **Cross-Environment Support**: Both local and Databricks notebooks fully functional
- ✅ **Enhanced Error Handling**: Smart file selection, directory creation, PDF skip logic
- ✅ **Developer Documentation**: Complete guides and deployment documentation

### 0.4.0

- ✅ **MDF Tools Installer**: Complete automated Docker Desktop and SQL Server Express 2019 setup
- ✅ **Cross-Platform Installation**: System package managers (Homebrew, Winget, apt/yum)
- ✅ **Container Management**: 9 lifecycle commands for SQL Server operations
- ✅ **Interactive Setup Wizard**: User-guided installation with smart detection
- ✅ **Rich Terminal UI**: Beautiful status tables, progress bars, and error handling
- ✅ **Critical Docker Fix**: Proper system package manager installation (not pip)
- ✅ **Infrastructure Foundation**: Complete environment for MDF file processing
- ✅ **Comprehensive Documentation**: Live terminal examples and ASCII diagrams

### 0.3.0

- ✅ **XML to Parquet Converter**: Complete implementation with intelligent flattening
- ✅ **Automatic Structure Detection**: Analyzes XML hierarchy and array patterns
- ✅ **Flexible Flattening Strategies**: Conservative, moderate, and aggressive options
- ✅ **Advanced Array Handling**: Expand, concatenate, or JSON string modes
- ✅ **Namespace Support**: Configurable namespace processing
- ✅ **Schema Preview**: Optional structure preview before conversion
- ✅ **Comprehensive Documentation**: User guide and quick reference
- ✅ **Compressed XML Support**: Handles .xml.gz and .xml.bz2 files

### 0.2.1

- ✅ **Complete Documentation Site**: Comprehensive GitHub Pages documentation
- ✅ **Fixed CI/CD**: GitHub Actions workflow for automated PyPI publishing  
- ✅ **Improved Distribution**: API token authentication and automation
- ✅ **Better Navigation**: Fixed broken links and improved project structure

### 0.2.0

- ✅ Excel to Parquet conversion with multi-sheet support
- ✅ MDB/ACCDB to Parquet conversion with cross-platform support
- ✅ DBF to Parquet conversion with encoding detection
- ✅ Interactive mode for Excel sheet selection
- ✅ Automatic table discovery for database files
- ✅ Progress tracking with rich terminal UI
- ✅ Excel summary reports for batch conversions
- ✅ Robust error handling and recovery

### 0.1.0 (Initial Release)

- PDF to text conversion
- CLI interface with Click
- Rich terminal output
- File metadata extraction
- Page range support
- Development tooling setup

## 🚀 Development & Deployment

### Automated Versioning

PyForge CLI uses automated versioning with setuptools-scm:

- **Development versions**: `1.0.x.devN` - Auto-deployed to [PyPI Test](https://test.pypi.org/project/pyforge-cli/) on every commit to main
- **Release versions**: `1.0.x` - Deployed to [PyPI](https://pypi.org/project/pyforge-cli/) when Git tags are created
- **Version increment**: Development versions auto-increment on each commit (dev1 → dev2 → dev3...)
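
When scripting against both indexes it helps to know how these version strings order: under PEP 440, a `.devN` build sorts before the final release with the same number. The sketch below handles only the two shapes named above (`X.Y.Z` and `X.Y.Z.devN`); `classify` and `sort_key` are hypothetical helpers, and for anything more general the `packaging` library is the better tool.

```python
import re

VERSION_PATTERN = re.compile(r"^(?P<release>\d+\.\d+\.\d+)(?:\.dev(?P<dev>\d+))?$")


def classify(version: str):
    """Return (release_tuple, dev_number_or_None) for 'X.Y.Z[.devN]'."""
    match = VERSION_PATTERN.match(version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    release = tuple(int(part) for part in match["release"].split("."))
    dev = int(match["dev"]) if match["dev"] is not None else None
    return release, dev


def sort_key(version: str):
    """Order dev builds before the final release of the same number."""
    release, dev = classify(version)
    return (release, 0, dev) if dev is not None else (release, 1, 0)


versions = ["1.0.8", "1.0.8.dev5", "1.0.8.dev12", "1.0.7"]
print(sorted(versions, key=sort_key))
```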

### Installation from Test Repository

```bash
# Install latest development version
pip install -i https://test.pypi.org/simple/ pyforge-cli

# Install specific development version  
pip install -i https://test.pypi.org/simple/ pyforge-cli==1.0.8.dev5
```

### CI/CD Pipeline

- **Trigger**: Every push to `main` branch automatically builds and deploys to PyPI Test
- **Release**: Create a Git tag to trigger deployment to PyPI Production
- **Testing**: Use `pyforge-cli` from test.pypi.org for validation before release

### Recent Updates (v1.0.8)

**Complete Testing Infrastructure Overhaul** - Fixed 13 major issues across PyForge CLI infrastructure and testing notebooks:

- ✅ **Infrastructure Fixes**: Sample datasets installation, missing dependencies, convert command API
- ✅ **Testing Framework**: Comprehensive testing suite with 402 lines of test code  
- ✅ **Notebook Support**: Full local and Databricks notebook functionality
- ✅ **Error Handling**: Smart file selection, directory creation, PDF skip logic
- ✅ **Documentation**: Complete developer guides and deployment processes

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "pyforge-cli",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "cli, data, conversion, pdf, csv, parquet, excel, database, mdb, dbf, synthetic-data",
    "author": null,
    "author_email": "Santosh Dandey <dd.santosh@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/db/70/8098593e06efc63eacc57823a32b42ce6f0816bbffd82442a23a92368f62/pyforge_cli-1.0.9.tar.gz",
    "platform": null
}
Management Commands**\n  - 9 lifecycle commands: status, start, stop, restart, logs, test, config, uninstall\n  - Rich terminal UI with status tables and progress tracking\n  - Configuration management with JSON persistence\n  - Health monitoring and connectivity testing\n- \u2705 **Infrastructure Foundation**\n  - Complete Docker and SQL Server environment for MDF processing\n  - Foundation ready for MDF file conversion implementation\n  - Comprehensive documentation with live terminal examples\n  - ASCII architecture diagrams and troubleshooting guides\n\n### Version 0.5.0 - MDF File Conversion (Planned)\n- [ ] **MDF to Parquet Converter** (Issue #13)\n  - SQL Server MDF file attachment and processing\n  - 6-stage conversion process (matching MDB converter)\n  - String-only data conversion for Phase 1 consistency\n  - Batch processing with progress tracking\n  - Excel summary reports with conversion statistics\n\n### Version 0.5.1 - Test Datasets Collection (Planned)\n- [ ] **Sample Datasets Installation** (Issue #15)\n  - CLI command: `pyforge install sample-datasets`\n  - Curated test datasets for all formats (XML, MDF, DBF, MDB, CSV, PDF)\n  - Multiple size categories: small (<1MB), medium (1-100MB), large (100MB-3GB)\n  - Direct download from online sources with metadata\n  - Organized directory structure for testing scenarios\n\n### Version 0.6.0 - Advanced Features (Future)\n- [ ] SQL query support for database files\n- [ ] Data transformation pipelines\n- [ ] Cloud storage integration (S3, Azure Blob)\n- [ ] Incremental/delta conversions\n- [ ] Custom plugin development SDK\n\n## Support\n\nIf you encounter any issues or have questions:\n\n1. Check the [\ud83d\udcda Complete Documentation](https://py-forge-cli.github.io/PyForge-CLI/)\n2. Search [existing issues](https://github.com/Py-Forge-Cli/PyForge-CLI/issues)\n3. Create a [new issue](https://github.com/Py-Forge-Cli/PyForge-CLI/issues/new)\n4. 
Join the [discussion](https://github.com/Py-Forge-Cli/PyForge-CLI/discussions)\n\n## Changelog\n\n### 1.0.8 (Current Release)\n\n- \u2705 **Complete Testing Infrastructure Overhaul**: Fixed 13 major issues across infrastructure and notebooks\n- \u2705 **Sample Datasets Installation**: Fixed with intelligent fallback versioning system\n- \u2705 **Missing Dependencies**: Added PyMuPDF, chardet, requests to resolve import errors\n- \u2705 **Convert Command Fix**: Resolved TypeError in ConverterRegistry API\n- \u2705 **Comprehensive Testing Framework**: Created systematic testing with 402 lines of test code\n- \u2705 **Notebook Organization**: Restructured with proper unit/integration/functional hierarchy\n- \u2705 **Cross-Environment Support**: Both local and Databricks notebooks fully functional\n- \u2705 **Enhanced Error Handling**: Smart file selection, directory creation, PDF skip logic\n- \u2705 **Developer Documentation**: Complete guides and deployment documentation\n\n### 0.4.0\n\n- \u2705 **MDF Tools Installer**: Complete automated Docker Desktop and SQL Server Express 2019 setup\n- \u2705 **Cross-Platform Installation**: System package managers (Homebrew, Winget, apt/yum)\n- \u2705 **Container Management**: 9 lifecycle commands for SQL Server operations\n- \u2705 **Interactive Setup Wizard**: User-guided installation with smart detection\n- \u2705 **Rich Terminal UI**: Beautiful status tables, progress bars, and error handling\n- \u2705 **Critical Docker Fix**: Proper system package manager installation (not pip)\n- \u2705 **Infrastructure Foundation**: Complete environment for MDF file processing\n- \u2705 **Comprehensive Documentation**: Live terminal examples and ASCII diagrams\n\n### 0.3.0\n\n- \u2705 **XML to Parquet Converter**: Complete implementation with intelligent flattening\n- \u2705 **Automatic Structure Detection**: Analyzes XML hierarchy and array patterns\n- \u2705 **Flexible Flattening Strategies**: Conservative, moderate, and aggressive 
options\n- \u2705 **Advanced Array Handling**: Expand, concatenate, or JSON string modes\n- \u2705 **Namespace Support**: Configurable namespace processing\n- \u2705 **Schema Preview**: Optional structure preview before conversion\n- \u2705 **Comprehensive Documentation**: User guide and quick reference\n- \u2705 **Compressed XML Support**: Handles .xml.gz and .xml.bz2 files\n\n### 0.2.1\n\n- \u2705 **Complete Documentation Site**: Comprehensive GitHub Pages documentation\n- \u2705 **Fixed CI/CD**: GitHub Actions workflow for automated PyPI publishing  \n- \u2705 **Improved Distribution**: API token authentication and automation\n- \u2705 **Better Navigation**: Fixed broken links and improved project structure\n\n### 0.2.0\n\n- \u2705 Excel to Parquet conversion with multi-sheet support\n- \u2705 MDB/ACCDB to Parquet conversion with cross-platform support\n- \u2705 DBF to Parquet conversion with encoding detection\n- \u2705 Interactive mode for Excel sheet selection\n- \u2705 Automatic table discovery for database files\n- \u2705 Progress tracking with rich terminal UI\n- \u2705 Excel summary reports for batch conversions\n- \u2705 Robust error handling and recovery\n\n### 0.1.0 (Initial Release)\n\n- PDF to text conversion\n- CLI interface with Click\n- Rich terminal output\n- File metadata extraction\n- Page range support\n- Development tooling setup\n\n## \ud83d\ude80 Development & Deployment\n\n### Automated Versioning\n\nPyForge CLI uses automated versioning with setuptools-scm:\n\n- **Development versions**: `1.0.x.devN` - Auto-deployed to [PyPI Test](https://test.pypi.org/project/pyforge-cli/) on every commit to main\n- **Release versions**: `1.0.x` - Deployed to [PyPI](https://pypi.org/project/pyforge-cli/) when Git tags are created\n- **Version increment**: Development versions auto-increment on each commit (dev1 \u2192 dev2 \u2192 dev3...)\n\n### Installation from Test Repository\n\n```bash\n# Install latest development version\npip install -i 
https://test.pypi.org/simple/ pyforge-cli\n\n# Install specific development version  \npip install -i https://test.pypi.org/simple/ pyforge-cli==1.0.8.dev5\n```\n\n### CI/CD Pipeline\n\n- **Trigger**: Every push to `main` branch automatically builds and deploys to PyPI Test\n- **Release**: Create a Git tag to trigger deployment to PyPI Production\n- **Testing**: Use `pyforge-cli` from test.pypi.org for validation before release\n\n### Recent Updates (v1.0.8)\n\n**Complete Testing Infrastructure Overhaul** - Fixed 13 major issues across PyForge CLI infrastructure and testing notebooks:\n\n- \u2705 **Infrastructure Fixes**: Sample datasets installation, missing dependencies, convert command API\n- \u2705 **Testing Framework**: Comprehensive testing suite with 402 lines of test code  \n- \u2705 **Notebook Support**: Full local and Databricks notebook functionality\n- \u2705 **Error Handling**: Smart file selection, directory creation, PDF skip logic\n- \u2705 **Documentation**: Complete developer guides and deployment processes\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A powerful CLI tool for data format conversion and synthetic data generation",
    "version": "1.0.9",
    "project_urls": {
        "Changelog": "https://github.com/Py-Forge-Cli/PyForge-CLI/blob/main/CHANGELOG.md",
        "Documentation": "https://github.com/Py-Forge-Cli/PyForge-CLI/blob/main/docs",
        "Homepage": "https://github.com/Py-Forge-Cli/PyForge-CLI",
        "Issues": "https://github.com/Py-Forge-Cli/PyForge-CLI/issues",
        "Repository": "https://github.com/Py-Forge-Cli/PyForge-CLI"
    },
    "split_keywords": [
        "cli",
        " data",
        " conversion",
        " pdf",
        " csv",
        " parquet",
        " excel",
        " database",
        " mdb",
        " dbf",
        " synthetic-data"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "9ec3ca49b4b410c7a828d795c0e36289b5114671ff2efcd12e80243e8225e063",
                "md5": "7dbee674e9e92e4284dc633c4b5ee6ee",
                "sha256": "057e69b7203c8ed609a6c17b16cc4df70f32fea25e81df206d18bb5ff7786f4d"
            },
            "downloads": -1,
            "filename": "pyforge_cli-1.0.9-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7dbee674e9e92e4284dc633c4b5ee6ee",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 6777237,
            "upload_time": "2025-07-11T04:38:01",
            "upload_time_iso_8601": "2025-07-11T04:38:01.133610Z",
            "url": "https://files.pythonhosted.org/packages/9e/c3/ca49b4b410c7a828d795c0e36289b5114671ff2efcd12e80243e8225e063/pyforge_cli-1.0.9-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "db708098593e06efc63eacc57823a32b42ce6f0816bbffd82442a23a92368f62",
                "md5": "afc1f5a6459b68647df5fba192e5ae95",
                "sha256": "b05224398454c7c3901ddddf604ae4f3d6e809c32aa31270dd18d94921408425"
            },
            "downloads": -1,
            "filename": "pyforge_cli-1.0.9.tar.gz",
            "has_sig": false,
            "md5_digest": "afc1f5a6459b68647df5fba192e5ae95",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 7318447,
            "upload_time": "2025-07-11T04:38:06",
            "upload_time_iso_8601": "2025-07-11T04:38:06.458979Z",
            "url": "https://files.pythonhosted.org/packages/db/70/8098593e06efc63eacc57823a32b42ce6f0816bbffd82442a23a92368f62/pyforge_cli-1.0.9.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-11 04:38:06",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Py-Forge-Cli",
    "github_project": "PyForge-CLI",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "pyforge-cli"
}
