llmbuilder


- **Name**: llmbuilder
- **Version**: 1.0.2
- **Summary**: Complete LLM Training and Deployment Pipeline with CLI
- **Upload time**: 2025-09-02 11:26:19
- **Requires Python**: >=3.8
- **License**: MIT
- **Author email**: Qubase <contact@qubase.in>
- **Maintainer email**: Qubase Team <support@qubase.in>
- **Keywords**: llm, training, fine-tuning, machine-learning, ai, transformers, cli, pytorch, huggingface, gpt, llama, deployment, inference

# LLMBuilder

Created by Qubase

LLMBuilder is a production-ready pipeline for training and fine-tuning Large Language Models from scratch. It covers enhanced document ingestion, intelligent deduplication, model training, automated GGUF conversion, and comprehensive testing, and it is optimized for both CPU and GPU environments.

## Table of Contents

- [Key Features](#key-features)
- [System Requirements](#system-requirements)
- [Installation](#installation)
- [Quick Start](#quick-start)
  - [1. Prepare Your Data](#1-prepare-your-data)
  - [2. Run the Pipeline](#2-run-the-pipeline)
  - [3. Run Specific Stages](#3-run-specific-stages)
- [Project Structure](#project-structure)
- [Fine-tuning](#fine-tuning)
- [Text Generation](#text-generation)
- [Configuration](#configuration)
- [Advanced Usage](#advanced-usage)
  - [CPU Optimization](#cpu-optimization)
  - [Data Processing](#data-processing)
  - [Training API](#training-api)
- [Monitoring](#monitoring-training)
- [Performance Optimization](#performance-optimization)
- [Troubleshooting](#troubleshooting)
- [Model Architecture](#model-architecture)
- [Pre-trained Models](#pre-trained-models)
- [License](#license)
- [Contributing](#contributing)

## Key Features <a name="key-features"></a>

### 🚀 **Enhanced Data Processing**
- **Multi-Format Document Ingestion**: HTML, EPUB, PDF (with OCR), Markdown, DOCX, TXT
- **Intelligent Deduplication**: Hash-based exact + embedding-based semantic duplicate removal
- **OCR Support**: Automatic fallback for scanned PDFs using Tesseract
- **Advanced Text Cleaning**: BeautifulSoup HTML processing, metadata extraction

### 🧠 **Advanced Training Pipeline**
- **End-to-End Workflow**: From raw documents to production-ready models
- **Multiple Tokenizer Options**: HuggingFace Tokenizers + SentencePiece CLI integration
- **CPU/GPU Optimization**: Efficient multi-threaded training with mixed precision
- **Modern GPT Architecture**: Transformer implementation with latest optimizations

### 📦 **Production-Ready Export**
- **Automated GGUF Conversion**: Multiple quantization levels (f16, q8_0, q4_0)
- **Quality Validation**: Comprehensive model validation and quality scoring
- **Batch Processing**: Parallel conversion with error recovery
- **llama.cpp Compatibility**: Direct integration with inference engines

### 🔧 **Developer Experience**
- **Comprehensive Testing**: Automated test suite with pytest integration
- **Robust Error Handling**: Detailed logging and recovery mechanisms
- **Modular Architecture**: Clean, maintainable, extensible codebase
- **Cross-Platform**: Windows PowerShell + Linux/macOS Bash scripts

## System Requirements <a name="system-requirements"></a>

### Minimum Requirements
- **Python**: 3.8 or higher
- **RAM**: 4GB minimum (8GB+ recommended for large datasets)
- **Storage**: 5GB+ free disk space
- **OS**: Windows 10+, Linux, or macOS

### Additional Dependencies
- **Tesseract OCR**: For PDF OCR processing (see [INSTALL_TESSERACT.md](INSTALL_TESSERACT.md))
- **Git**: For repository management
- **Optional**: CUDA-compatible GPU for accelerated training

## Installation <a name="installation"></a>

1. Clone the repository:
   ```bash
   git clone <repository-url>
   cd LLMBuilder
   ```

2. Create and activate a virtual environment:
   ```bash
   # Linux/macOS
   python -m venv venv
   source venv/bin/activate
   
   # Windows
   python -m venv venv
   .\venv\Scripts\activate
   ```

3. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

4. Install Tesseract OCR (for PDF processing):
   ```bash
   # Ubuntu/Debian
   sudo apt-get install tesseract-ocr
   
   # macOS
   brew install tesseract
   
   # Windows - see INSTALL_TESSERACT.md for detailed instructions
   ```

5. Verify installation:
   ```bash
   python -c "import torch; print('PyTorch:', torch.__version__)"
   tesseract --version
   ```

## System Preparation

### System Requirements Check

Before starting, ensure your system meets the requirements:

```bash
# Linux/macOS
free -h      # Check available memory
df -h        # Check disk space
nproc        # Check CPU cores

# Windows
# Use Task Manager → Performance → Memory/Disk
# Check CPU cores in System Information
```
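
The shell commands above are Linux/macOS only. As a cross-platform alternative, the following minimal sketch reports CPU cores and free disk space using only the Python standard library. The function name `system_summary` is hypothetical and not part of LLMBuilder, and available RAM is not covered because the standard library has no portable call for it.

```python
import os
import shutil


def system_summary(path="."):
    """Return CPU core count and disk space (GiB) for the drive holding `path`."""
    usage = shutil.disk_usage(path)
    return {
        "cpu_cores": os.cpu_count() or 1,
        "disk_free_gib": round(usage.free / 1024**3, 1),
        "disk_total_gib": round(usage.total / 1024**3, 1),
    }


if __name__ == "__main__":
    info = system_summary()
    print(f"CPU cores: {info['cpu_cores']}")
    print(f"Free disk: {info['disk_free_gib']} GiB of {info['disk_total_gib']} GiB")
    if info["disk_free_gib"] < 5:
        print("Warning: less than the 5GB of free space listed under Minimum Requirements")
```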

### Recommended Workflow

1. **Start with a small dataset** (100MB) to test the pipeline
2. **Monitor system resources** during initial runs
3. **Use checkpoints** - training progress is saved automatically
4. **Check logs** in `logs/training.log` for any issues

### **🔍 Real-time Monitoring:**
```bash
# Linux/Mac: Monitor system resources
htop
# Windows: Use Task Manager or Resource Monitor
```

## Getting Started

For detailed instructions, see the [**📖 Complete Usage Guide (USAGE.md)**](USAGE.md) which includes:
- **Step-by-step walkthroughs** with example outputs
- **Advanced configuration options** for all components
- **Troubleshooting guide** with common solutions
- **Performance optimization** tips
- **Platform-specific commands** for Windows/Linux/macOS
- **Integration examples** with other tools

## Project Structure <a name="project-structure"></a>

```
LLMBuilder/
├── data/                   # Data directories
│   ├── raw/               # Raw input files (all formats)
│   ├── cleaned/           # Processed text files
│   ├── deduped/           # Deduplicated content
│   ├── tokens/            # Tokenized datasets
│   ├── finetune/          # Fine-tuning datasets
│   ├── ingest.py          # Enhanced document ingester
│   ├── dedup.py           # Deduplication system
│   ├── download_data.py   # Script to download datasets
│   ├── SOURCES.md         # Data sources documentation
│   └── README_INGESTION.md # Ingestion documentation
│
├── debug_scripts/         # Debugging utilities
│   ├── debug_loader.py    # Data loading debugger
│   ├── debug_training.py  # Training process debugger
│   └── debug_timestamps.py # Timing analysis
│
├── eval/                  # Model evaluation
│   └── eval.py           # Evaluation scripts
│
├── exports/               # Output directories
│   ├── checkpoints/      # Training checkpoints
│   ├── gguf/             # GGUF model exports
│   ├── onnx/             # ONNX model exports
│   └── tokenizer/        # Saved tokenizer files
│
├── finetune/             # Fine-tuning scripts
│   ├── finetune.py      # Fine-tuning implementation
│   └── __init__.py      # Package initialization
│
├── logs/                 # Training and evaluation logs
│
├── model/                # Model architecture
│   └── gpt_model.py     # GPT model implementation
│
├── scripts/              # Enhanced processing scripts
│   ├── run_ingestion.py  # Document ingestion CLI
│   ├── enhanced_preprocess.py # Advanced preprocessing
│   ├── train_sentencepiece.py # SentencePiece training
│   └── test_deduplication.py # Deduplication testing
│
├── tests/                # Comprehensive test suite
│   ├── test_ingestion.py # Document ingestion tests
│   ├── test_deduplication.py # Deduplication tests
│   ├── test_conversion_pipeline.py # GGUF conversion tests
│   ├── test_tokenizer_trainer.py # Tokenizer tests
│   └── ... (many more test files)
│
├── tools/                # Utility scripts
│   ├── analyze_data.ps1  # PowerShell data analysis
│   ├── analyze_data.sh   # Bash data analysis
│   ├── download_hf_model.py # HuggingFace model downloader
│   ├── export_gguf.py    # Enhanced GGUF export utility
│   ├── conversion_pipeline.py # Automated GGUF conversion
│   └── quantization_manager.py # Advanced quantization
│
├── training/             # Training pipeline
│   ├── dataset.py       # Dataset handling
│   ├── preprocess.py    # Data preprocessing
│   ├── quantization.py  # Model quantization
│   ├── train.py         # Main training script
│   ├── train_tokenizer.py # Enhanced tokenizer training
│   └── utils.py         # Training utilities
│
├── .gitignore           # Git ignore rules
├── config.json          # Main configuration
├── config_cpu_small.json # Small CPU config
├── config_gpu.json      # GPU configuration
├── inference.py         # Inference script
├── quantize_model.py    # Model quantization
├── README.md           # This file
├── PIPELINE_UPDATES.md  # Recent updates summary
├── INSTALL_TESSERACT.md # OCR installation guide
├── requirements.txt    # Python dependencies
├── run.ps1            # Enhanced PowerShell runner
└── run.sh             # Enhanced Bash runner
```

## Quick Start <a name="quick-start"></a>

### 1. Prepare Your Data <a name="1-prepare-your-data"></a>

#### Enhanced Document Support
Place your documents in `data/raw/`. The system now supports:
- **Text files** (.txt)
- **PDF files** (.pdf) - with automatic OCR for scanned documents
- **Word documents** (.docx)
- **Web content** (.html)
- **E-books** (.epub)
- **Markdown** (.md)

#### Option 1: Download Sample Data
```bash
# Download sample corpus
python data/download_data.py --corpus

# Or download specific topics
python data/download_data.py --topic literature --count 5
python data/download_data.py --topic technology --count 3
```

Available topics: literature, science, technology, business, health, education

#### Option 2: Use Your Own Data
Simply place your documents in `data/raw/` - the enhanced ingestion pipeline will automatically:
- Detect file formats
- Extract text with appropriate methods
- Handle OCR for scanned PDFs
- Clean and normalize content

### 2. Run the Pipeline <a name="2-run-the-pipeline"></a>

#### Linux/macOS:
```bash
chmod +x run.sh
./run.sh
```

#### Windows:
```batch
run.bat
```

Or using PowerShell (`run.ps1` is the only Windows runner listed under Project Structure):
```powershell
.\run.ps1
```

### 3. Run Specific Stages <a name="3-run-specific-stages"></a>

The enhanced pipeline includes new stages for better data processing:

```bash
# NEW: Enhanced document ingestion
./run.sh ingest

# NEW: Intelligent deduplication  
./run.sh dedup

# Traditional preprocessing (optional)
./run.sh preprocess

# Train tokenizer
./run.sh tokenizer

# Train model
./run.sh train

# Evaluate model
./run.sh eval

# Fine-tune existing model
./run.sh finetune

# Interactive text generation
./run.sh inference

# NEW: Convert to GGUF format
./run.sh gguf

# NEW: Run comprehensive tests
./run.sh test
```

#### Windows PowerShell Examples:
```powershell
# Enhanced document processing
.\run.ps1 -Stage ingest

# Run deduplication
.\run.ps1 -Stage dedup

# Complete pipeline
.\run.ps1 -Stage all

# Convert to GGUF
.\run.ps1 -Stage gguf
```

## Enhanced Pipeline Stages

### 🔄 **Document Ingestion** (`ingest`)
Advanced document processing with multi-format support:

```bash
# Process all supported formats with OCR
./run.sh ingest

# With custom options
python scripts/run_ingestion.py \
  --input data/raw \
  --output data/cleaned \
  --ocr-lang eng fra deu \
  --max-size 50 \
  --recursive
```

**Features:**
- **HTML Processing**: BeautifulSoup-based cleaning, removes scripts/styles
- **EPUB Support**: Full e-book text extraction with metadata
- **PDF with OCR**: Automatic fallback to Tesseract for scanned documents
- **Markdown Processing**: Advanced parsing with table/code block support
- **Progress Tracking**: Real-time processing statistics

### 🔍 **Intelligent Deduplication** (`dedup`)
Remove exact and near-duplicate content to improve training quality:

```bash
# Run both hash and embedding deduplication
./run.sh dedup

# Custom similarity threshold
python data/dedup.py \
  --input-dir data/cleaned \
  --output-dir data/deduped \
  --similarity-threshold 0.85
```

**Methods:**
- **Hash-based**: Exact duplicate detection with text normalization
- **Embedding-based**: Semantic similarity using sentence-transformers
- **Quality Preservation**: Keeps highest quality version of duplicates
- **Statistics**: Detailed reporting of removed content
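
The hash-based method can be sketched as follows. This is a minimal illustration of exact-duplicate detection with text normalization, not the code in `data/dedup.py`: the function names and the specific normalization (lowercasing and whitespace collapsing) are assumptions, and the embedding-based step is not shown.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup_exact(docs):
    """Keep the first document seen for each normalized-content hash."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

Hashing the normalized text rather than the raw text means documents that differ only in case or spacing count as duplicates, while memory use stays proportional to the number of unique documents.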

### 📦 **GGUF Conversion** (`gguf`)
Automated conversion to GGUF format for production deployment:

```bash
# Convert with multiple quantization levels
./run.sh gguf

# Custom quantization options
python tools/conversion_pipeline.py \
  exports/checkpoints/best_model.pt \
  exports/gguf \
  --quantization f16 q8_0 q4_0 q4_1 \
  --tokenizer exports/tokenizer
```

**Features:**
- **Multiple Quantization**: f16, q8_0, q4_0, q4_1, q5_0, q5_1
- **Quality Validation**: Automatic validation and quality scoring
- **Batch Processing**: Parallel conversion with error recovery
- **Metadata Preservation**: Complete model metadata in GGUF format
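
To give a feel for what a quantization level such as `q8_0` does, here is a simplified sketch of 8-bit block quantization: weights are split into blocks of 32 values, and each block stores one scale plus 32 small integers. This is an illustration under assumptions, not the project's converter. The real GGUF format stores the scale as a 16-bit float and packs the data in binary, and the function names here are hypothetical.

```python
BLOCK_SIZE = 32  # q8_0-style quantization groups weights into blocks of 32 values


def quantize_q8_0(values):
    """Quantize a list of floats into (scale, int8-range list) blocks."""
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        amax = max(abs(v) for v in block)
        scale = amax / 127 if amax > 0 else 0.0
        quants = [round(v / scale) if scale else 0 for v in block]
        blocks.append((scale, quants))
    return blocks


def dequantize_q8_0(blocks):
    """Reconstruct approximate floats from (scale, quants) blocks."""
    return [scale * q for scale, quants in blocks for q in quants]
```

The reconstruction error per value is at most half the block's scale, which is why lower-bit levels such as `q4_0` trade more accuracy for smaller files.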

### 🧪 **Comprehensive Testing** (`test`)
Automated test suite for quality assurance:

```bash
# Run all tests
./run.sh test

# Run specific test categories
python -m pytest tests/test_ingestion.py -v
python -m pytest tests/test_deduplication.py -v
python -m pytest tests/test_conversion_pipeline.py -v
```

## Fine-tuning <a name="fine-tuning"></a>

To fine-tune the model on your own data:

1. Place your training files in `data/finetune/`
2. The system will automatically use the latest checkpoint
3. Run the fine-tuning script:
   ```bash
   python finetune/finetune.py \
     --config config.json \
     --pretrained-model exports/checkpoints/latest.pt \
     --train-data data/finetune/ \
     --tokenizer-dir exports/tokenizer/
   ```
4. Fine-tuned models save to `exports/checkpoints/finetuned/`

### Fine-tuning Configuration

You can customize fine-tuning by modifying these parameters:

```yaml
finetune:
  learning_rate: 0.0001    # Lower than training LR
  batch_size: 4           # Adjust based on GPU memory
  num_epochs: 3           # Number of fine-tuning epochs
  warmup_steps: 100       # Learning rate warmup steps
```
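
The effect of `warmup_steps` can be sketched with a linear warmup schedule. This is a minimal illustration that assumes a linear ramp followed by a constant rate. The schedule actually used by `finetune/finetune.py` may differ (for example, it may decay after warmup), and `warmup_lr` is a hypothetical name.

```python
def warmup_lr(step, base_lr=0.0001, warmup_steps=100):
    """Linearly ramp the learning rate up to base_lr, then hold it constant."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps


if __name__ == "__main__":
    for step in (0, 49, 99, 500):
        print(step, warmup_lr(step))
```

Starting from a small learning rate reduces the risk of large, destabilizing updates during the first steps of fine-tuning.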

### Monitoring Fine-tuning

Monitor the fine-tuning process with:
```bash
tensorboard --logdir=exports/logs/finetune/
```

## Text Generation <a name="text-generation"></a>

Run interactive text generation:

```bash
python inference.py --interactive
```

Options:
- `--temperature`: Controls randomness (0.0-1.0); lower values make output more deterministic
- `--top_k`: Limit to top-k predictions
- `--top_p`: Nucleus sampling threshold

## Configuration <a name="configuration"></a>

This project includes multiple configuration files optimized for different hardware setups. Choose the one that best matches your environment:

### Available Configurations

1. **config.json** - Balanced configuration for standard CPUs
   - Moderate model size
   - Good balance between speed and quality
   - Works well on most modern laptops/desktops

2. **config_gpu.json** - Optimized for GPU training
   - Larger model capacity
   - Mixed precision training
   - Gradient accumulation
   - Best for NVIDIA GPUs with 8GB+ VRAM

3. **config_cpu_small.json** - For very limited CPUs
   - Minimal memory footprint
   - Smaller model size
   - Reduced sequence length
   - Ideal for testing or low-resource environments

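### Configuration Options

The listings below use YAML-style notation with comments for readability. The shipped config files (`config.json` and its variants) are JSON, which does not allow comments.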

#### Model Architecture
```yaml
model:
  vocab_size: 16000      # Vocabulary size
  embedding_dim: 384     # Size of token embeddings
  num_layers: 6          # Number of transformer layers
  num_heads: 6           # Number of attention heads
  hidden_dim: 1536       # Size of feedforward layers
  max_seq_length: 256    # Maximum sequence length
  dropout: 0.1           # Dropout rate
  use_bias: true         # Use bias in linear layers
  tie_weights: true      # Tie input/output embeddings
```

#### Training Settings
```yaml
training:
  batch_size: 8          # Training batch size
  learning_rate: 0.0002  # Learning rate
  weight_decay: 0.01     # Weight decay for regularization
  num_epochs: 10         # Number of training epochs
  warmup_steps: 1000     # Warmup steps for learning rate
  gradient_clip_norm: 1.0 # Gradient clipping
  save_every: 1000       # Save checkpoint every N steps
  eval_every: 500        # Evaluate every N steps
  log_every: 10          # Log metrics every N steps
  num_workers: 4         # Data loading workers
  pin_memory: true       # Pin memory for faster transfer
  prefetch_factor: 2      # Batches to prefetch
  use_mixed_precision: false # Enable mixed precision
```

#### Device Configuration
```yaml
device:
  use_cuda: false        # Use CUDA if available
  cuda_device: 0         # CUDA device index
  use_mps: false         # Use MPS on Apple Silicon
  cpu_threads: 0         # Number of CPU threads (0 = all)
  enable_mkldnn: true    # Enable MKLDNN acceleration
  mixed_precision: false # Global mixed precision flag
```
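
A device section like the one above could be resolved as in the following sketch. It is an assumption about how the flags might be interpreted, not the project's code: `resolve_device` is a hypothetical helper, and the availability flags are passed in as plain booleans so the example runs without PyTorch. With PyTorch installed they would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
import os


def resolve_device(device_cfg, cuda_available=False, mps_available=False):
    """Return (device_name, cpu_threads) for a device config dictionary."""
    if device_cfg.get("use_cuda") and cuda_available:
        name = f"cuda:{device_cfg.get('cuda_device', 0)}"
    elif device_cfg.get("use_mps") and mps_available:
        name = "mps"
    else:
        name = "cpu"  # fall back to CPU when no accelerator is requested or present
    # cpu_threads of 0 means "use all available cores"
    threads = device_cfg.get("cpu_threads", 0) or (os.cpu_count() or 1)
    return name, threads
```

Falling back to `"cpu"` when a requested accelerator is missing lets the same config file run unchanged on machines without a GPU.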

### Choosing the Right Configuration

1. **For GPU Training**: Use `config_gpu.json`
   ```bash
   python training/train.py --config config_gpu.json
   ```

2. **For Standard CPU Training**: Use `config.json`
   ```bash
   python training/train.py --config config.json
   ```

3. **For Low-End CPUs**: Use `config_cpu_small.json`
   ```bash
   python training/train.py --config config_cpu_small.json
   ```

### Custom Configuration

1. Copy an existing config file:
   ```bash
   cp config.json my_config.json
   ```

2. Edit the parameters as needed
3. Use your custom config:
   ```bash
   python training/train.py --config my_config.json
   ```
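
Instead of editing the copy by hand, the same steps can be scripted. The sketch below assumes the JSON file is organized into nested sections such as `model` and `training`, as in the listings above. The helper names `derive_config` and `write_custom_config` are hypothetical.

```python
import copy
import json


def derive_config(base, overrides):
    """Return a copy of base with nested override values applied."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = derive_config(merged[key], value)
        else:
            merged[key] = value
    return merged


def write_custom_config(src, dst, overrides):
    """Load src, apply overrides, and write the result to dst as JSON."""
    with open(src, encoding="utf-8") as fh:
        base = json.load(fh)
    with open(dst, "w", encoding="utf-8") as fh:
        json.dump(derive_config(base, overrides), fh, indent=2)
```

For example, `write_custom_config("config.json", "my_config.json", {"training": {"batch_size": 4}})` would produce a config that differs from the base only in its batch size, under the assumption that `batch_size` lives in a `training` section.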

### Important Notes
- Larger `batch_size` and `max_seq_length` require more memory
- `num_workers` should be ≤ number of CPU cores
- Enable `mixed_precision` for GPUs with Tensor Cores (Volta, Turing, Ampere, etc.)
- For small GPUs, reduce `batch_size` and raise `gradient_accumulation_steps` to keep the effective batch size
- For very small CPUs, reduce `num_layers`, `embedding_dim`, and `hidden_dim`
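
The trade-off between `batch_size` and gradient accumulation can be illustrated with a little arithmetic. The sketch below is generic and uses hypothetical helper names. It shows how many small batches are needed to reach a target batch size, and that averaging equal-sized micro-batch gradients gives the same result as one large batch.

```python
def accumulation_steps(target_batch, micro_batch):
    """Number of micro-batches needed to cover at least target_batch samples."""
    if micro_batch <= 0:
        raise ValueError("micro_batch must be positive")
    return -(-target_batch // micro_batch)  # ceiling division


def accumulated_gradient(per_sample_grads, micro_batch):
    """Average micro-batch mean gradients, as gradient accumulation does."""
    steps = accumulation_steps(len(per_sample_grads), micro_batch)
    total = 0.0
    for s in range(steps):
        chunk = per_sample_grads[s * micro_batch:(s + 1) * micro_batch]
        total += (sum(chunk) / len(chunk)) / steps  # scale each step by 1/steps
    return total
```

With a target batch of 32 and a micro-batch of 8, four accumulation steps are needed. Peak memory is governed by the micro-batch, while the optimizer sees the gradient of the full 32 samples.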

## Debugging <a name="debugging"></a>

The project includes several debugging scripts in the `debug_scripts/` directory to help diagnose issues:

### Available Debug Scripts

1. **debug_loader.py**
   - Tests and profiles the data loading pipeline
   - Helps identify bottlenecks in data loading
   - Usage:
     ```bash
     python debug_scripts/debug_loader.py --config config.json
     ```

2. **debug_training.py**
   - Runs a minimal training loop with extensive logging
   - Verifies model can complete a forward/backward pass
   - Usage:
     ```bash
     python debug_scripts/debug_training.py --config config.json --max-steps 10
     ```

3. **debug_timestamps.py**
   - Profiles different components of the training loop
   - Helps identify slow operations
   - Usage:
     ```bash
     python debug_scripts/debug_timestamps.py --config config.json
     ```

### Debugging Tips

1. **Reduced Test Case**
   - Use a tiny dataset with `--max-samples 10`
   - Set `num_workers=0` to simplify data loading
   - Reduce `batch_size` and `max_seq_length`

2. **Common Issues**
   - **CUDA Out of Memory**: Reduce `batch_size` or model dimensions
   - **Slow Training**: Check data loading with `debug_loader.py`
   - **NaN/Inf Losses**: Try gradient clipping and lower learning rate

3. **Verbose Logging**
   ```bash
   python training/train.py --config config.json --log-level DEBUG
   ```

4. **Memory Profiling**
   ```bash
   python -m memory_profiler training/train.py --config config.json
   ```

## Advanced Usage <a name="advanced-usage"></a>

### CPU Optimization <a name="cpu-optimization"></a>

Optimize for CPU training with:
- Multi-threading
- Memory efficiency
- Gradient accumulation
- MKLDNN acceleration

### Data Processing <a name="data-processing"></a>

Example custom preprocessing:

```python
from training.preprocess import DataPreprocessor

processor = DataPreprocessor(
    min_length=100,       # Min text length
    max_length=500000,    # Max text length
    remove_urls=True,     # Clean URLs
    remove_emails=True,   # Clean emails
    normalize_whitespace=True
)
```
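
What options like these typically do can be sketched with the standard library. This is a standalone illustration and not the implementation of `DataPreprocessor`: the function `clean_text` and its regular expressions are assumptions, and only the parameter names mirror the example above.

```python
import re

URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")


def clean_text(text, min_length=100, max_length=500000,
               remove_urls=True, remove_emails=True, normalize_whitespace=True):
    """Return cleaned text, or None if it falls outside the length bounds."""
    if remove_urls:
        text = URL_RE.sub("", text)
    if remove_emails:
        text = EMAIL_RE.sub("", text)
    if normalize_whitespace:
        text = re.sub(r"\s+", " ", text).strip()
    if not (min_length <= len(text) <= max_length):
        return None  # too short or too long to be useful for training
    return text
```

Applying the length check after cleaning means a document that consists mostly of links is filtered out rather than kept as a near-empty sample.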

### Training API <a name="training-api"></a>

```python
from training.train import Trainer

# Initialize trainer with JSON config
trainer = Trainer(config_path="config.json")

# Start training
trainer.train()

# Example with custom settings
custom_trainer = Trainer(
    config_path="config.json",
    train_data_dir="data/processed/train",
    val_data_dir="data/processed/val",
    output_dir="exports/models/custom_run"
)
custom_trainer.train()
```

**Configuration Options**:
- `config_path`: Path to JSON config file (e.g., `config.json`)
- `train_data_dir`: Directory containing training data (overrides config)
- `val_data_dir`: Directory containing validation data (overrides config)
- `output_dir`: Directory to save checkpoints and logs (overrides config)

## Training Monitoring <a name="monitoring-training"></a>

### Logs
- Console: Real-time progress
- File: `logs/training.log`
- Metrics: `logs/training_history.json`

### Checkpoints
- `checkpoint_epoch_N.pt`: Regular saves
- `best_model.pt`: Best validation score
- `latest.pt`: Most recent checkpoint

## Performance Optimization <a name="performance-optimization"></a>

### CPU Training
- Batch size: 8-32 (adjust for RAM)
- Use all CPU cores
- Enable gradient accumulation
- Try mixed precision if available

### Memory Management
- Reduce `max_seq_length` (128-256)
- Decrease `batch_size`
- Use smaller model dimensions

### Speed Improvements
- Increase `batch_size` (if RAM allows)
- Use larger `max_seq_length` for more context per step
- Multiple data files improve shuffling

## Troubleshooting <a name="troubleshooting"></a>

### Common Issues

1. **Out of Memory**
   - Reduce `batch_size` in your config file (e.g., `config.json`)
   - Decrease `max_seq_length` or model size
   - Close other applications

2. **No Training Data**
   - Check `data/raw/` directory
   - Supported formats: .txt, .md, .pdf, .docx, .html, .epub
   - Verify file permissions

3. **Slow Training**
   - Optimize CPU thread count
   - Reduce model size
   - Monitor system resources

4. **Import Errors**
   ```bash
   pip install -r requirements.txt
   python --version  # Requires 3.8+
   ```

Check `logs/` for detailed error messages.

## Model Architecture <a name="model-architecture"></a>

GPT-style transformer with:
- Multi-head self-attention
- GELU activation
- Pre-norm layer normalization
- Learned positional embeddings
- Weight-tied embeddings

### Default Specs
- Parameters: ~50M
- Layers: 12
- Heads: 12
- Embedding: 768D
- Context: 512 tokens
- Vocabulary: 16K BPE
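
A rough parameter count for a GPT-style model can be estimated from the configuration values. The sketch below assumes a standard layout (token and learned positional embeddings, four attention projections, a two-layer feedforward block, two LayerNorms per block, and a final LayerNorm). The actual `model/gpt_model.py` may differ in details, and `estimate_parameters` is a hypothetical helper.

```python
def estimate_parameters(vocab_size, embedding_dim, num_layers, hidden_dim,
                        max_seq_length, use_bias=True, tie_weights=True):
    """Estimate trainable parameters for a standard GPT-style transformer."""
    d = embedding_dim
    bias = 1 if use_bias else 0
    embeddings = vocab_size * d + max_seq_length * d      # token + positional
    attention = 4 * d * d + bias * 4 * d                  # Q, K, V, output projections
    feedforward = 2 * d * hidden_dim + bias * (hidden_dim + d)
    layer_norms = 2 * 2 * d                               # two LayerNorms per block
    per_layer = attention + feedforward + layer_norms
    final_norm = 2 * d
    output_head = 0 if tie_weights else vocab_size * d    # tied head adds nothing
    return embeddings + num_layers * per_layer + final_norm + output_head


if __name__ == "__main__":
    sample = estimate_parameters(16000, 384, 6, 1536, 256)
    print(f"Sample config from the Configuration section: {sample / 1e6:.1f}M")
```

Under these assumptions the sample values from the Configuration section (6 layers, 384-dimensional embeddings) come to about 16.9M parameters. The specs listed above, with an assumed feedforward size of 3072, come to roughly 98M, so the "~50M" figure presumably reflects a different layout or configuration.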

## Recent Updates <a name="recent-updates"></a>

### ✨ **Latest Features** (See [PIPELINE_UPDATES.md](PIPELINE_UPDATES.md))
- **Enhanced Document Ingestion**: Multi-format support with OCR
- **Intelligent Deduplication**: Hash + embedding-based duplicate removal
- **Automated GGUF Conversion**: Production-ready model export
- **Comprehensive Testing**: Full test suite with pytest
- **Cross-platform Scripts**: Enhanced PowerShell and Bash runners

### 🚀 **Future Enhancements**
- **Distributed Training**: Multi-GPU and multi-node support
- **Web Interface**: Real-time monitoring dashboard
- **More Architectures**: LLaMA, BERT, and custom models
- **Cloud Integration**: AWS/GCP/Azure deployment
- **Advanced Optimizations**: Dynamic quantization, pruning

## Pre-trained Models <a name="pre-trained-models"></a>

Download models from HuggingFace:

```bash
python tools/download_hf_model.py \
  --model Qwen/Qwen2.5-Coder-0.5B \
  --output-dir ./models/Qwen2.5-Coder-0.5B
```

## License <a name="license"></a>

MIT Licensed. See [LICENSE](LICENSE) for details.

## Contributing <a name="contributing"></a>

Contributions welcome! Please submit PRs or open issues.

## Quick Reference

### 🚀 **One-Command Setup**
```bash
# Complete pipeline with enhanced features
./run.sh all          # Linux/macOS
.\run.ps1 -Stage all  # Windows PowerShell
```

### 📋 **Essential Commands**
```bash
# Enhanced document processing
./run.sh ingest       # Process HTML, PDF, EPUB, etc.
./run.sh dedup        # Remove duplicates intelligently
./run.sh train        # Train your model
./run.sh gguf         # Convert to GGUF format
./run.sh test         # Run comprehensive tests
```

### 📚 **Documentation**
- **[USAGE.md](USAGE.md)** - Complete usage guide with examples
- **[PIPELINE_UPDATES.md](PIPELINE_UPDATES.md)** - Recent feature updates
- **[INSTALL_TESSERACT.md](INSTALL_TESSERACT.md)** - OCR setup guide
- **[data/README_INGESTION.md](data/README_INGESTION.md)** - Document ingestion details

### 🆘 **Need Help?**
1. Check the [Usage Guide](USAGE.md) for detailed examples
2. Review logs in `logs/` directory
3. Run tests: `./run.sh test`
4. Open an issue on the repository

---

**Get started** by adding your documents to `data/raw/` and running:

```bash
./run.sh all  # Complete enhanced pipeline
``` 

            

Prepare Your Data <a name=\"1-prepare-your-data\"></a>\r\n\r\n#### Enhanced Document Support\r\nPlace your documents in `data/raw/`. The system supports:\r\n- **Text files** (.txt)\r\n- **PDF files** (.pdf) - with automatic OCR for scanned documents\r\n- **Word documents** (.docx)\r\n- **Web content** (.html)\r\n- **E-books** (.epub)\r\n- **Markdown** (.md)\r\n\r\n#### Option 1: Download Sample Data\r\n```bash\r\n# Download sample corpus\r\npython data/download_data.py --corpus\r\n\r\n# Or download specific topics\r\npython data/download_data.py --topic literature --count 5\r\npython data/download_data.py --topic technology --count 3\r\n```\r\n\r\nAvailable topics: literature, science, technology, business, health, education\r\n\r\n#### Option 2: Use Your Own Data\r\nPlace your documents in `data/raw/`. The ingestion pipeline will automatically:\r\n- Detect file formats\r\n- Extract text with appropriate methods\r\n- Handle OCR for scanned PDFs\r\n- Clean and normalize content\r\n\r\n### 2. Run the Pipeline <a name=\"2-run-the-pipeline\"></a>\r\n\r\n#### Linux/macOS:\r\n```bash\r\nchmod +x run.sh\r\n./run.sh\r\n```\r\n\r\n#### Windows:\r\n```batch\r\nrun.bat\r\n```\r\n\r\nOr using PowerShell:\r\n```powershell\r\n.\\run.ps1\r\n```\r\n\r\n### 3. 
Run Specific Stages <a name=\"3-run-specific-stages\"></a>\r\n\r\nThe enhanced pipeline includes new stages for better data processing:\r\n\r\n```bash\r\n# NEW: Enhanced document ingestion\r\n./run.sh ingest\r\n\r\n# NEW: Intelligent deduplication  \r\n./run.sh dedup\r\n\r\n# Traditional preprocessing (optional)\r\n./run.sh preprocess\r\n\r\n# Train tokenizer\r\n./run.sh tokenizer\r\n\r\n# Train model\r\n./run.sh train\r\n\r\n# Evaluate model\r\n./run.sh eval\r\n\r\n# Fine-tune existing model\r\n./run.sh finetune\r\n\r\n# Interactive text generation\r\n./run.sh inference\r\n\r\n# NEW: Convert to GGUF format\r\n./run.sh gguf\r\n\r\n# NEW: Run comprehensive tests\r\n./run.sh test\r\n```\r\n\r\n#### Windows PowerShell Examples:\r\n```powershell\r\n# Enhanced document processing\r\n.\\run.ps1 -Stage ingest\r\n\r\n# Run deduplication\r\n.\\run.ps1 -Stage dedup\r\n\r\n# Complete pipeline\r\n.\\run.ps1 -Stage all\r\n\r\n# Convert to GGUF\r\n.\\run.ps1 -Stage gguf\r\n```\r\n\r\n## Enhanced Pipeline Stages\r\n\r\n### \ud83d\udd04 **Document Ingestion** (`ingest`)\r\nAdvanced document processing with multi-format support:\r\n\r\n```bash\r\n# Process all supported formats with OCR\r\n./run.sh ingest\r\n\r\n# With custom options\r\npython scripts/run_ingestion.py \\\r\n  --input data/raw \\\r\n  --output data/cleaned \\\r\n  --ocr-lang eng fra deu \\\r\n  --max-size 50 \\\r\n  --recursive\r\n```\r\n\r\n**Features:**\r\n- **HTML Processing**: BeautifulSoup-based cleaning, removes scripts/styles\r\n- **EPUB Support**: Full e-book text extraction with metadata\r\n- **PDF with OCR**: Automatic fallback to Tesseract for scanned documents\r\n- **Markdown Processing**: Advanced parsing with table/code block support\r\n- **Progress Tracking**: Real-time processing statistics\r\n\r\n### \ud83d\udd0d **Intelligent Deduplication** (`dedup`)\r\nRemove exact and near-duplicate content to improve training quality:\r\n\r\n```bash\r\n# Run both hash and embedding deduplication\r\n./run.sh 
dedup\r\n\r\n# Custom similarity threshold\r\npython data/dedup.py \\\r\n  --input-dir data/cleaned \\\r\n  --output-dir data/deduped \\\r\n  --similarity-threshold 0.85\r\n```\r\n\r\n**Methods:**\r\n- **Hash-based**: Exact duplicate detection with text normalization\r\n- **Embedding-based**: Semantic similarity using sentence-transformers\r\n- **Quality Preservation**: Keeps highest quality version of duplicates\r\n- **Statistics**: Detailed reporting of removed content\r\n\r\n### \ud83d\udce6 **GGUF Conversion** (`gguf`)\r\nAutomated conversion to GGUF format for production deployment:\r\n\r\n```bash\r\n# Convert with multiple quantization levels\r\n./run.sh gguf\r\n\r\n# Custom quantization options\r\npython tools/conversion_pipeline.py \\\r\n  exports/checkpoints/best_model.pt \\\r\n  exports/gguf \\\r\n  --quantization f16 q8_0 q4_0 q4_1 \\\r\n  --tokenizer exports/tokenizer\r\n```\r\n\r\n**Features:**\r\n- **Multiple Quantization**: f16, q8_0, q4_0, q4_1, q5_0, q5_1\r\n- **Quality Validation**: Automatic validation and quality scoring\r\n- **Batch Processing**: Parallel conversion with error recovery\r\n- **Metadata Preservation**: Complete model metadata in GGUF format\r\n\r\n### \ud83e\uddea **Comprehensive Testing** (`test`)\r\nAutomated test suite for quality assurance:\r\n\r\n```bash\r\n# Run all tests\r\n./run.sh test\r\n\r\n# Run specific test categories\r\npython -m pytest tests/test_ingestion.py -v\r\npython -m pytest tests/test_deduplication.py -v\r\npython -m pytest tests/test_conversion_pipeline.py -v\r\n```\r\n\r\n## Fine-tuning <a name=\"fine-tuning\"></a>\r\n\r\nTo fine-tune the model on your own data:\r\n\r\n1. Place your training files in `data/finetune/`\r\n2. The system will automatically use the latest checkpoint\r\n3. 
Run the fine-tuning script:\r\n   ```bash\r\n   python finetune/finetune.py \\\r\n     --config config.json \\\r\n     --pretrained-model exports/checkpoints/latest.pt \\\r\n     --train-data data/finetune/ \\\r\n     --tokenizer-dir exports/tokenizer/\r\n   ```\r\n4. Fine-tuned models save to `exports/checkpoints/finetuned/`\r\n\r\n### Fine-tuning Configuration\r\n\r\nYou can customize fine-tuning by modifying these parameters:\r\n\r\n```yaml\r\nfinetune:\r\n  learning_rate: 0.0001    # Lower than training LR\r\n  batch_size: 4           # Adjust based on GPU memory\r\n  num_epochs: 3           # Number of fine-tuning epochs\r\n  warmup_steps: 100       # Learning rate warmup steps\r\n```\r\n\r\n### Monitoring Fine-tuning\r\n\r\nMonitor the fine-tuning process with:\r\n```bash\r\ntensorboard --logdir=exports/logs/finetune/\r\n```\r\n\r\n## Text Generation <a name=\"text-generation\"></a>\r\n\r\nRun interactive text generation:\r\n\r\n```bash\r\npython inference.py --interactive\r\n```\r\n\r\nOptions:\r\n- `--temperature`: Controls randomness (0.0-1.0)\r\n- `--top_k`: Limit to top-k predictions\r\n- `--top_p`: Nucleus sampling threshold\r\n\r\n## Configuration <a name=\"configuration\"></a>\r\n\r\nThis project includes multiple configuration files optimized for different hardware setups. Choose the one that best matches your environment:\r\n\r\n### Available Configurations\r\n\r\n1. **config.json** - Balanced configuration for standard CPUs\r\n   - Moderate model size\r\n   - Good balance between speed and quality\r\n   - Works well on most modern laptops/desktops\r\n\r\n2. **config_gpu.json** - Optimized for GPU training\r\n   - Larger model capacity\r\n   - Mixed precision training\r\n   - Gradient accumulation\r\n   - Best for NVIDIA GPUs with 8GB+ VRAM\r\n\r\n3. 
**config_cpu_small.json** - For very limited CPUs\r\n   - Minimal memory footprint\r\n   - Smaller model size\r\n   - Reduced sequence length\r\n   - Ideal for testing or low-resource environments\r\n\r\n### Configuration Options\r\n\r\n#### Model Architecture\r\n```yaml\r\nmodel:\r\n  vocab_size: 16000      # Vocabulary size\r\n  embedding_dim: 384     # Size of token embeddings\r\n  num_layers: 6          # Number of transformer layers\r\n  num_heads: 6           # Number of attention heads\r\n  hidden_dim: 1536       # Size of feedforward layers\r\n  max_seq_length: 256    # Maximum sequence length\r\n  dropout: 0.1           # Dropout rate\r\n  use_bias: true         # Use bias in linear layers\r\n  tie_weights: true      # Tie input/output embeddings\r\n```\r\n\r\n#### Training Settings\r\n```yaml\r\ntraining:\r\n  batch_size: 8          # Training batch size\r\n  learning_rate: 0.0002  # Learning rate\r\n  weight_decay: 0.01     # Weight decay for regularization\r\n  num_epochs: 10         # Number of training epochs\r\n  warmup_steps: 1000     # Warmup steps for learning rate\r\n  gradient_clip_norm: 1.0 # Gradient clipping\r\n  save_every: 1000       # Save checkpoint every N steps\r\n  eval_every: 500        # Evaluate every N steps\r\n  log_every: 10          # Log metrics every N steps\r\n  num_workers: 4         # Data loading workers\r\n  pin_memory: true       # Pin memory for faster transfer\r\n  prefetch_factor: 2      # Batches to prefetch\r\n  use_mixed_precision: false # Enable mixed precision\r\n```\r\n\r\n#### Device Configuration\r\n```yaml\r\ndevice:\r\n  use_cuda: false        # Use CUDA if available\r\n  cuda_device: 0         # CUDA device index\r\n  use_mps: false         # Use MPS on Apple Silicon\r\n  cpu_threads: 0         # Number of CPU threads (0 = all)\r\n  enable_mkldnn: true    # Enable MKLDNN acceleration\r\n  mixed_precision: false # Global mixed precision flag\r\n```\r\n\r\n### Choosing the Right Configuration\r\n\r\n1. 
**For GPU Training**: Use `config_gpu.json`\r\n   ```bash\r\n   python training/train.py --config config_gpu.json\r\n   ```\r\n\r\n2. **For Standard CPU Training**: Use `config.json`\r\n   ```bash\r\n   python training/train.py --config config.json\r\n   ```\r\n\r\n3. **For Low-End CPUs**: Use `config_cpu_small.json`\r\n   ```bash\r\n   python training/train.py --config config_cpu_small.json\r\n   ```\r\n\r\n### Custom Configuration\r\n\r\n1. Copy an existing config file:\r\n   ```bash\r\n   cp config.json my_config.json\r\n   ```\r\n\r\n2. Edit the parameters as needed\r\n3. Use your custom config:\r\n   ```bash\r\n   python training/train.py --config my_config.json\r\n   ```\r\n\r\n### Important Notes\r\n- Larger `batch_size` and `max_seq_length` require more memory\r\n- `num_workers` should be \u2264 number of CPU cores\r\n- Enable `mixed_precision` for GPUs with Tensor Cores (Volta, Turing, Ampere, etc.)\r\n- For small GPUs, reduce `batch_size` and enable `gradient_accumulation_steps`\r\n- For very small CPUs, reduce `num_layers`, `embedding_dim`, and `hidden_dim`\r\n\r\n## Debugging <a name=\"debugging\"></a>\r\n\r\nThe project includes several debugging scripts in the `debug_scripts/` directory to help diagnose issues:\r\n\r\n### Available Debug Scripts\r\n\r\n1. **debug_loader.py**\r\n   - Tests and profiles the data loading pipeline\r\n   - Helps identify bottlenecks in data loading\r\n   - Usage:\r\n     ```bash\r\n     python debug_scripts/debug_loader.py --config config.json\r\n     ```\r\n\r\n2. **debug_training.py**\r\n   - Runs a minimal training loop with extensive logging\r\n   - Verifies model can complete a forward/backward pass\r\n   - Usage:\r\n     ```bash\r\n     python debug_scripts/debug_training.py --config config.json --max-steps 10\r\n     ```\r\n\r\n3. 
**debug_timestamps.py**\r\n   - Profiles different components of the training loop\r\n   - Helps identify slow operations\r\n   - Usage:\r\n     ```bash\r\n     python debug_scripts/debug_timestamps.py --config config.json\r\n     ```\r\n\r\n### Debugging Tips\r\n\r\n1. **Reduced Test Case**\r\n   - Use a tiny dataset with `--max-samples 10`\r\n   - Set `num_workers=0` to simplify data loading\r\n   - Reduce `batch_size` and `max_seq_length`\r\n\r\n2. **Common Issues**\r\n   - **CUDA Out of Memory**: Reduce `batch_size` or model dimensions\r\n   - **Slow Training**: Check data loading with `debug_loader.py`\r\n   - **NaN/Inf Losses**: Try gradient clipping and lower learning rate\r\n\r\n3. **Verbose Logging**\r\n   ```bash\r\n   python training/train.py --config config.json --log-level DEBUG\r\n   ```\r\n\r\n4. **Memory Profiling**\r\n   ```bash\r\n   python -m memory_profiler training/train.py --config config.json\r\n   ```\r\n\r\n## Advanced Usage <a name=\"advanced-usage\"></a>\r\n\r\n### CPU Optimization <a name=\"cpu-optimization\"></a>\r\n\r\nOptimize for CPU training with:\r\n- Multi-threading\r\n- Memory efficiency\r\n- Gradient accumulation\r\n- MKLDNN acceleration\r\n\r\n### Data Processing <a name=\"data-processing\"></a>\r\n\r\nExample custom preprocessing:\r\n\r\n```python\r\nfrom training.preprocess import DataPreprocessor\r\n\r\nprocessor = DataPreprocessor(\r\n    min_length=100,       # Min text length\r\n    max_length=500000,    # Max text length\r\n    remove_urls=True,     # Clean URLs\r\n    remove_emails=True,   # Clean emails\r\n    normalize_whitespace=True\r\n)\r\n```\r\n\r\n### Training API <a name=\"training-api\"></a>\r\n\r\n```python\r\nfrom training.train import Trainer\r\n\r\n# Initialize trainer with JSON config\r\ntrainer = Trainer(config_path=\"config.json\")\r\n\r\n# Start training\r\ntrainer.train()\r\n\r\n# Example with custom settings\r\ncustom_trainer = Trainer(\r\n    config_path=\"config.json\",\r\n    
train_data_dir=\"data/processed/train\",\r\n    val_data_dir=\"data/processed/val\",\r\n    output_dir=\"exports/models/custom_run\"\r\n)\r\ncustom_trainer.train()\r\n```\r\n\r\n**Configuration Options**:\r\n- `config_path`: Path to JSON config file (e.g., `config.json`)\r\n- `train_data_dir`: Directory containing training data (overrides config)\r\n- `val_data_dir`: Directory containing validation data (overrides config)\r\n- `output_dir`: Directory to save checkpoints and logs (overrides config)\r\n\r\n## Training Monitoring <a name=\"monitoring-training\"></a>\r\n\r\n### Logs\r\n- Console: Real-time progress\r\n- File: `logs/training.log`\r\n- Metrics: `logs/training_history.json`\r\n\r\n### Checkpoints\r\n- `checkpoint_epoch_N.pt`: Regular saves\r\n- `best_model.pt`: Best validation score\r\n- `latest.pt`: Most recent checkpoint\r\n\r\n## Performance Optimization <a name=\"performance-optimization\"></a>\r\n\r\n### CPU Training\r\n- Batch size: 8-32 (adjust for RAM)\r\n- Use all CPU cores\r\n- Enable gradient accumulation\r\n- Try mixed precision if available\r\n\r\n### Memory Management\r\n- Reduce `block_size` (128-256)\r\n- Decrease `batch_size`\r\n- Use smaller model dimensions\r\n\r\n### Speed Improvements\r\n- Increase `batch_size` (if RAM allows)\r\n- Use larger `block_size` for context\r\n- Multiple data files improve shuffling\r\n\r\n## Troubleshooting <a name=\"troubleshooting\"></a>\r\n\r\n### Common Issues\r\n\r\n1. **Out of Memory**\r\n   - Reduce `batch_size` in your config file (e.g., `config.json`)\r\n   - Decrease `block_size` or model size\r\n   - Close other applications\r\n\r\n2. **No Training Data**\r\n   - Check `data/raw/` directory\r\n   - Supported formats: .txt, .md, .pdf, .docx, .html, .epub\r\n   - Verify file permissions\r\n\r\n3. **Slow Training**\r\n   - Optimize CPU thread count\r\n   - Reduce model size\r\n   - Monitor system resources\r\n\r\n4. 
**Import Errors**\r\n   ```bash\r\n   pip install -r requirements.txt\r\n   python --version  # Requires 3.8+\r\n   ```\r\n\r\nCheck `logs/` for detailed error messages.\r\n\r\n## Model Architecture <a name=\"model-architecture\"></a>\r\n\r\nGPT-style transformer with:\r\n- Multi-head self-attention\r\n- GELU activation\r\n- Pre-norm layer normalization\r\n- Learned positional embeddings\r\n- Weight-tied embeddings\r\n\r\n### Default Specs\r\n- Parameters: ~50M\r\n- Layers: 12\r\n- Heads: 12\r\n- Embedding: 768D\r\n- Context: 512 tokens\r\n- Vocabulary: 16K BPE\r\n\r\n## Recent Updates <a name=\"recent-updates\"></a>\r\n\r\n### \u2728 **Latest Features** (See [PIPELINE_UPDATES.md](PIPELINE_UPDATES.md))\r\n- **Enhanced Document Ingestion**: Multi-format support with OCR\r\n- **Intelligent Deduplication**: Hash + embedding-based duplicate removal\r\n- **Automated GGUF Conversion**: Production-ready model export\r\n- **Comprehensive Testing**: Full test suite with pytest\r\n- **Cross-platform Scripts**: Enhanced PowerShell and Bash runners\r\n\r\n### \ud83d\ude80 **Future Enhancements**\r\n- **Distributed Training**: Multi-GPU and multi-node support\r\n- **Web Interface**: Real-time monitoring dashboard\r\n- **More Architectures**: LLaMA, BERT, and custom models\r\n- **Cloud Integration**: AWS/GCP/Azure deployment\r\n- **Advanced Optimizations**: Dynamic quantization, pruning\r\n\r\n## Pre-trained Models <a name=\"pre-trained-models\"></a>\r\n\r\nDownload models from HuggingFace:\r\n\r\n```bash\r\npython tools/download_hf_model.py \\\r\n  --model Qwen/Qwen2.5-Coder-0.5B \\\r\n  --output-dir ./models/Qwen2.5-Coder-0.5B\r\n```\r\n\r\n## License <a name=\"license\"></a>\r\n\r\nMIT Licensed. See [LICENSE](LICENSE) for details.\r\n\r\n## Contributing <a name=\"contributing\"></a>\r\n\r\nContributions welcome! 
Please submit PRs or open issues.\r\n\r\n## Quick Reference\r\n\r\n### \ud83d\ude80 **One-Command Setup**\r\n```bash\r\n# Complete pipeline with enhanced features\r\n./run.sh all          # Linux/macOS\r\n.\\run.ps1 -Stage all  # Windows PowerShell\r\n```\r\n\r\n### \ud83d\udccb **Essential Commands**\r\n```bash\r\n# Enhanced document processing\r\n./run.sh ingest       # Process HTML, PDF, EPUB, etc.\r\n./run.sh dedup        # Remove duplicates intelligently\r\n./run.sh train        # Train your model\r\n./run.sh gguf         # Convert to GGUF format\r\n./run.sh test         # Run comprehensive tests\r\n```\r\n\r\n### \ud83d\udcda **Documentation**\r\n- **[USAGE.md](USAGE.md)** - Complete usage guide with examples\r\n- **[PIPELINE_UPDATES.md](PIPELINE_UPDATES.md)** - Recent feature updates\r\n- **[INSTALL_TESSERACT.md](INSTALL_TESSERACT.md)** - OCR setup guide\r\n- **[data/README_INGESTION.md](data/README_INGESTION.md)** - Document ingestion details\r\n\r\n### \ud83c\udd98 **Need Help?**\r\n1. Check the [Usage Guide](USAGE.md) for detailed examples\r\n2. Review logs in `logs/` directory\r\n3. Run tests: `./run.sh test`\r\n4. Open an issue on the repository\r\n\r\n---\r\n\r\n**Get started** by adding your documents to `data/raw/` and running:\r\n\r\n```bash\r\n./run.sh all  # Complete enhanced pipeline\r\n``` \r\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Complete LLM Training and Deployment Pipeline with CLI",
    "version": "1.0.2",
    "project_urls": {
        "Changelog": "https://github.com/qubase/llmbuilder/blob/main/CHANGELOG.md",
        "Documentation": "https://llmbuilder.readthedocs.io",
        "Homepage": "https://github.com/qubase/llmbuilder",
        "Issues": "https://github.com/qubase/llmbuilder/issues",
        "Repository": "https://github.com/qubase/llmbuilder"
    },
    "split_keywords": [
        "llm",
        " training",
        " fine-tuning",
        " machine-learning",
        " ai",
        " transformers",
        " cli",
        " pytorch",
        " huggingface",
        " gpt",
        " llama",
        " deployment",
        " inference"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "30fb3295e09ce4f2387bc427ef358d2602dd8cb5df8abb189f1dff55d6995a0e",
                "md5": "e015a9caa7afaca5db6c0a6b01000296",
                "sha256": "67bc24c4078a39ce3adf3a10a8b5c6b44eb3c8a121f74bf2b88f8f1832579c93"
            },
            "downloads": -1,
            "filename": "llmbuilder-1.0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e015a9caa7afaca5db6c0a6b01000296",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 341907,
            "upload_time": "2025-09-02T11:26:17",
            "upload_time_iso_8601": "2025-09-02T11:26:17.095539Z",
            "url": "https://files.pythonhosted.org/packages/30/fb/3295e09ce4f2387bc427ef358d2602dd8cb5df8abb189f1dff55d6995a0e/llmbuilder-1.0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7ed200f4c327bb868ee2933836db6a2f87f2e8df5d7f495d767bdb8b267bd2fa",
                "md5": "4eec42930778f4a74dd40e93e7c5d631",
                "sha256": "27760a7644d575d1c50d12d7747b5c4ba4ce9c72a93c1eff9171320b920264d4"
            },
            "downloads": -1,
            "filename": "llmbuilder-1.0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "4eec42930778f4a74dd40e93e7c5d631",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 300593,
            "upload_time": "2025-09-02T11:26:19",
            "upload_time_iso_8601": "2025-09-02T11:26:19.266823Z",
            "url": "https://files.pythonhosted.org/packages/7e/d2/00f4c327bb868ee2933836db6a2f87f2e8df5d7f495d767bdb8b267bd2fa/llmbuilder-1.0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-02 11:26:19",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "qubase",
    "github_project": "llmbuilder",
    "github_not_found": true,
    "lcname": "llmbuilder"
}
        