# LLMBuilder
Created by Qubase
A comprehensive, production-ready implementation for training and fine-tuning Large Language Models from scratch. This project provides an advanced pipeline with enhanced document ingestion, intelligent deduplication, model training, automated GGUF conversion, and comprehensive testing - all optimized for both CPU and GPU environments.
## Table of Contents
- [Key Features](#key-features)
- [System Requirements](#system-requirements)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [1. Prepare Your Data](#1-prepare-your-data)
- [2. Run the Pipeline](#2-run-the-pipeline)
- [3. Run Specific Stages](#3-run-specific-stages)
- [Project Structure](#project-structure)
- [Fine-tuning](#fine-tuning)
- [Text Generation](#text-generation)
- [Configuration](#configuration)
- [Advanced Usage](#advanced-usage)
- [CPU Optimization](#cpu-optimization)
- [Data Processing](#data-processing)
- [Training API](#training-api)
- [Monitoring](#monitoring-training)
- [Performance Optimization](#performance-optimization)
- [Troubleshooting](#troubleshooting)
- [Model Architecture](#model-architecture)
- [Pre-trained Models](#pre-trained-models)
- [License](#license)
- [Contributing](#contributing)
## Key Features <a name="key-features"></a>
### 🚀 **Enhanced Data Processing**
- **Multi-Format Document Ingestion**: HTML, EPUB, PDF (with OCR), Markdown, DOCX, TXT
- **Intelligent Deduplication**: Hash-based exact + embedding-based semantic duplicate removal
- **OCR Support**: Automatic fallback for scanned PDFs using Tesseract
- **Advanced Text Cleaning**: BeautifulSoup HTML processing, metadata extraction
### 🧠 **Advanced Training Pipeline**
- **End-to-End Workflow**: From raw documents to production-ready models
- **Multiple Tokenizer Options**: HuggingFace Tokenizers + SentencePiece CLI integration
- **CPU/GPU Optimization**: Efficient multi-threaded training with mixed precision
- **Modern GPT Architecture**: Transformer implementation with latest optimizations
### 📦 **Production-Ready Export**
- **Automated GGUF Conversion**: Multiple quantization levels (f16, q8_0, q4_0)
- **Quality Validation**: Comprehensive model validation and quality scoring
- **Batch Processing**: Parallel conversion with error recovery
- **llama.cpp Compatibility**: Direct integration with inference engines
### 🔧 **Developer Experience**
- **Comprehensive Testing**: Automated test suite with pytest integration
- **Robust Error Handling**: Detailed logging and recovery mechanisms
- **Modular Architecture**: Clean, maintainable, extensible codebase
- **Cross-Platform**: Windows PowerShell + Linux/macOS Bash scripts
## System Requirements
### Minimum Requirements
- **Python**: 3.8 or higher
- **RAM**: 4GB minimum (8GB+ recommended for large datasets)
- **Storage**: 5GB+ free disk space
- **OS**: Windows 10+, Linux, or macOS
### Additional Dependencies
- **Tesseract OCR**: For PDF OCR processing (see [INSTALL_TESSERACT.md](INSTALL_TESSERACT.md))
- **Git**: For repository management
- **Optional**: CUDA-compatible GPU for accelerated training
## Installation <a name="installation"></a>
1. Clone the repository:
```bash
git clone <repository-url>
cd LLMBuilder
```
2. Create and activate virtual environment:
```bash
# Linux/macOS
python -m venv venv
source venv/bin/activate
# Windows
python -m venv venv
.\venv\Scripts\activate
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Install Tesseract OCR (for PDF processing):
```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# macOS
brew install tesseract
# Windows - see INSTALL_TESSERACT.md for detailed instructions
```
5. Verify installation:
```bash
python -c "import torch; print('PyTorch:', torch.__version__)"
tesseract --version
```
## System Preparation
### System Requirements Check
Before starting, ensure your system meets the requirements:
```bash
# Linux/macOS
free -h # Check available memory
df -h # Check disk space
nproc # Check CPU cores
# Windows
# Use Task Manager → Performance → Memory/Disk
# Check CPU cores in System Information
```
### Recommended Workflow
1. **Start with a small dataset** (~100 MB) to test the pipeline
2. **Monitor system resources** during initial runs
3. **Use checkpoints** - training progress is saved automatically
4. **Check logs** in `logs/training.log` for any issues
### **🔍 Real-time Monitoring:**
```bash
# Linux/Mac: Monitor system resources
htop
# Windows: Use Task Manager or Resource Monitor
```
## Getting Started
For detailed instructions, see the [**📖 Complete Usage Guide (USAGE.md)**](USAGE.md) which includes:
- **Step-by-step walkthroughs** with example outputs
- **Advanced configuration options** for all components
- **Troubleshooting guide** with common solutions
- **Performance optimization** tips
- **Platform-specific commands** for Windows/Linux/macOS
- **Integration examples** with other tools
## Project Structure <a name="project-structure"></a>
```
LLMBuilder/
├── data/                          # Data directories
│   ├── raw/                       # Raw input files (all formats)
│   ├── cleaned/                   # Processed text files
│   ├── deduped/                   # Deduplicated content
│   ├── tokens/                    # Tokenized datasets
│   ├── finetune/                  # Fine-tuning datasets
│   ├── ingest.py                  # Enhanced document ingester
│   ├── dedup.py                   # Deduplication system
│   ├── download_data.py           # Script to download datasets
│   ├── SOURCES.md                 # Data sources documentation
│   └── README_INGESTION.md        # Ingestion documentation
│
├── debug_scripts/                 # Debugging utilities
│   ├── debug_loader.py            # Data loading debugger
│   ├── debug_training.py          # Training process debugger
│   └── debug_timestamps.py        # Timing analysis
│
├── eval/                          # Model evaluation
│   └── eval.py                    # Evaluation scripts
│
├── exports/                       # Output directories
│   ├── checkpoints/               # Training checkpoints
│   ├── gguf/                      # GGUF model exports
│   ├── onnx/                      # ONNX model exports
│   └── tokenizer/                 # Saved tokenizer files
│
├── finetune/                      # Fine-tuning scripts
│   ├── finetune.py                # Fine-tuning implementation
│   └── __init__.py                # Package initialization
│
├── logs/                          # Training and evaluation logs
│
├── model/                         # Model architecture
│   └── gpt_model.py               # GPT model implementation
│
├── scripts/                       # Enhanced processing scripts
│   ├── run_ingestion.py           # Document ingestion CLI
│   ├── enhanced_preprocess.py     # Advanced preprocessing
│   ├── train_sentencepiece.py     # SentencePiece training
│   └── test_deduplication.py      # Deduplication testing
│
├── tests/                         # Comprehensive test suite
│   ├── test_ingestion.py          # Document ingestion tests
│   ├── test_deduplication.py      # Deduplication tests
│   ├── test_conversion_pipeline.py # GGUF conversion tests
│   ├── test_tokenizer_trainer.py  # Tokenizer tests
│   └── ... (many more test files)
│
├── tools/                         # Utility scripts
│   ├── analyze_data.ps1           # PowerShell data analysis
│   ├── analyze_data.sh            # Bash data analysis
│   ├── download_hf_model.py       # HuggingFace model downloader
│   ├── export_gguf.py             # Enhanced GGUF export utility
│   ├── conversion_pipeline.py     # Automated GGUF conversion
│   └── quantization_manager.py    # Advanced quantization
│
├── training/                      # Training pipeline
│   ├── dataset.py                 # Dataset handling
│   ├── preprocess.py              # Data preprocessing
│   ├── quantization.py            # Model quantization
│   ├── train.py                   # Main training script
│   ├── train_tokenizer.py         # Enhanced tokenizer training
│   └── utils.py                   # Training utilities
│
├── .gitignore                     # Git ignore rules
├── config.json                    # Main configuration
├── config_cpu_small.json          # Small CPU config
├── config_gpu.json                # GPU configuration
├── inference.py                   # Inference script
├── quantize_model.py              # Model quantization
├── README.md                      # This file
├── PIPELINE_UPDATES.md            # Recent updates summary
├── INSTALL_TESSERACT.md           # OCR installation guide
├── requirements.txt               # Python dependencies
├── run.ps1                        # Enhanced PowerShell runner
└── run.sh                         # Enhanced Bash runner
```
## Quick Start <a name="quick-start"></a>
### 1. Prepare Your Data <a name="1-prepare-your-data"></a>
#### Enhanced Document Support
Place your documents in `data/raw/`. The system now supports:
- **Text files** (.txt, .md)
- **PDF files** (.pdf) - with automatic OCR for scanned documents
- **Word documents** (.docx)
- **Web content** (.html)
- **E-books** (.epub)
- **Markdown** (.md)
#### Option 1: Download Sample Data
```bash
# Download sample corpus
python data/download_data.py --corpus
# Or download specific topics
python data/download_data.py --topic literature --count 5
python data/download_data.py --topic technology --count 3
```
Available topics: literature, science, technology, business, health, education
#### Option 2: Use Your Own Data
Simply place your documents in `data/raw/` - the enhanced ingestion pipeline will automatically:
- Detect file formats
- Extract text with appropriate methods
- Handle OCR for scanned PDFs
- Clean and normalize content
### 2. Run the Pipeline <a name="2-run-the-pipeline"></a>
#### Linux/macOS:
```bash
chmod +x run.sh
./run.sh
```
#### Windows:
```batch
run.bat
```
Or using PowerShell:
```powershell
.\run.ps1
```
### 3. Run Specific Stages <a name="3-run-specific-stages"></a>
The enhanced pipeline includes new stages for better data processing:
```bash
# NEW: Enhanced document ingestion
./run.sh ingest
# NEW: Intelligent deduplication
./run.sh dedup
# Traditional preprocessing (optional)
./run.sh preprocess
# Train tokenizer
./run.sh tokenizer
# Train model
./run.sh train
# Evaluate model
./run.sh eval
# Fine-tune existing model
./run.sh finetune
# Interactive text generation
./run.sh inference
# NEW: Convert to GGUF format
./run.sh gguf
# NEW: Run comprehensive tests
./run.sh test
```
#### Windows PowerShell Examples:
```powershell
# Enhanced document processing
.\run.ps1 -Stage ingest
# Run deduplication
.\run.ps1 -Stage dedup
# Complete pipeline
.\run.ps1 -Stage all
# Convert to GGUF
.\run.ps1 -Stage gguf
```
## Enhanced Pipeline Stages
### 🔄 **Document Ingestion** (`ingest`)
Advanced document processing with multi-format support:
```bash
# Process all supported formats with OCR
./run.sh ingest
# With custom options
python scripts/run_ingestion.py \
--input data/raw \
--output data/cleaned \
--ocr-lang eng fra deu \
--max-size 50 \
--recursive
```
**Features:**
- **HTML Processing**: BeautifulSoup-based cleaning, removes scripts/styles
- **EPUB Support**: Full e-book text extraction with metadata
- **PDF with OCR**: Automatic fallback to Tesseract for scanned documents
- **Markdown Processing**: Advanced parsing with table/code block support
- **Progress Tracking**: Real-time processing statistics
### 🔍 **Intelligent Deduplication** (`dedup`)
Remove exact and near-duplicate content to improve training quality:
```bash
# Run both hash and embedding deduplication
./run.sh dedup
# Custom similarity threshold
python data/dedup.py \
--input-dir data/cleaned \
--output-dir data/deduped \
--similarity-threshold 0.85
```
**Methods:**
- **Hash-based**: Exact duplicate detection with text normalization (sketched below)
- **Embedding-based**: Semantic similarity using sentence-transformers
- **Quality Preservation**: Keeps highest quality version of duplicates
- **Statistics**: Detailed reporting of removed content
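
For orientation, the hash-based pass boils down to normalizing each document and keeping only the first occurrence of each hash. The sketch below is a minimal illustration of that idea, not the actual `data/dedup.py` implementation:

```python
import hashlib
import re
from pathlib import Path

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def dedup_exact(input_dir: str, output_dir: str) -> int:
    """Copy each unique document to output_dir; return the number of duplicates dropped."""
    seen, dropped = set(), 0
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(input_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen:
            dropped += 1
            continue
        seen.add(digest)
        (out / path.name).write_text(text, encoding="utf-8")
    return dropped
```

The embedding-based pass is conceptually similar, but compares sentence-transformer vectors against the `--similarity-threshold` instead of exact hashes.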
### 📦 **GGUF Conversion** (`gguf`)
Automated conversion to GGUF format for production deployment:
```bash
# Convert with multiple quantization levels
./run.sh gguf
# Custom quantization options
python tools/conversion_pipeline.py \
exports/checkpoints/best_model.pt \
exports/gguf \
--quantization f16 q8_0 q4_0 q4_1 \
--tokenizer exports/tokenizer
```
**Features:**
- **Multiple Quantization**: f16, q8_0, q4_0, q4_1, q5_0, q5_1
- **Quality Validation**: Automatic validation and quality scoring
- **Batch Processing**: Parallel conversion with error recovery
- **Metadata Preservation**: Complete model metadata in GGUF format
### 🧪 **Comprehensive Testing** (`test`)
Automated test suite for quality assurance:
```bash
# Run all tests
./run.sh test
# Run specific test categories
python -m pytest tests/test_ingestion.py -v
python -m pytest tests/test_deduplication.py -v
python -m pytest tests/test_conversion_pipeline.py -v
```
## Fine-tuning <a name="fine-tuning"></a>
To fine-tune the model on your own data:
1. Place your training files in `data/finetune/`
2. The system will automatically use the latest checkpoint
3. Run the fine-tuning script:
```bash
python finetune/finetune.py \
--config config.json \
--pretrained-model exports/checkpoints/latest.pt \
--train-data data/finetune/ \
--tokenizer-dir exports/tokenizer/
```
4. Fine-tuned models save to `exports/checkpoints/finetuned/`
### Fine-tuning Configuration
You can customize fine-tuning by modifying these parameters:
```yaml
finetune:
learning_rate: 0.0001 # Lower than training LR
batch_size: 4 # Adjust based on GPU memory
num_epochs: 3 # Number of fine-tuning epochs
warmup_steps: 100 # Learning rate warmup steps
```
### Monitoring Fine-tuning
Monitor the fine-tuning process with:
```bash
tensorboard --logdir=exports/logs/finetune/
```
## Text Generation <a name="text-generation"></a>
Run interactive text generation:
```bash
python inference.py --interactive
```
Options (a minimal sampling sketch follows this list):
- `--temperature`: Controls randomness (0.0-1.0)
- `--top_k`: Limit to top-k predictions
- `--top_p`: Nucleus sampling threshold
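
As a rough illustration of how these knobs interact, a generation step typically divides the logits by the temperature, masks everything outside the top-k / top-p set, and samples from what remains. This is a generic sketch, independent of how `inference.py` actually implements it:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8,
                      top_k: int = 50, top_p: float = 0.9) -> int:
    """Pick one token id from a 1-D logits tensor using temperature, top-k, and nucleus filtering."""
    logits = logits / max(temperature, 1e-5)              # flatten or sharpen the distribution
    if top_k > 0:                                         # keep only the k most likely tokens
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    if top_p < 1.0:                                       # nucleus: smallest set with cumulative prob >= top_p
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cutoff = torch.cumsum(sorted_probs, dim=-1) > top_p
        cutoff[1:] = cutoff[:-1].clone()                  # always keep the most likely token
        cutoff[0] = False
        probs[sorted_idx[cutoff]] = 0.0
        probs = probs / probs.sum()
    return int(torch.multinomial(probs, num_samples=1))
```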
## Configuration <a name="configuration"></a>
This project includes multiple configuration files optimized for different hardware setups. Choose the one that best matches your environment:
### Available Configurations
1. **config.json** - Balanced configuration for standard CPUs
- Moderate model size
- Good balance between speed and quality
- Works well on most modern laptops/desktops
2. **config_gpu.json** - Optimized for GPU training
- Larger model capacity
- Mixed precision training
- Gradient accumulation
- Best for NVIDIA GPUs with 8GB+ VRAM
3. **config_cpu_small.json** - For very limited CPUs
- Minimal memory footprint
- Smaller model size
- Reduced sequence length
- Ideal for testing or low-resource environments
### Configuration Options
The snippets below use YAML-style notation for readability; the actual configuration files (`config.json` and friends) are JSON.
#### Model Architecture
```yaml
model:
vocab_size: 16000 # Vocabulary size
embedding_dim: 384 # Size of token embeddings
num_layers: 6 # Number of transformer layers
num_heads: 6 # Number of attention heads
hidden_dim: 1536 # Size of feedforward layers
max_seq_length: 256 # Maximum sequence length
dropout: 0.1 # Dropout rate
use_bias: true # Use bias in linear layers
tie_weights: true # Tie input/output embeddings
```
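
With these values you can sanity-check the model size before training. The helper below is a back-of-the-envelope estimate (embeddings plus attention and feed-forward weights, ignoring biases and layer norms), not the project's own accounting:

```python
def estimate_params(vocab_size=16000, embedding_dim=384, num_layers=6,
                    hidden_dim=1536, max_seq_length=256, tie_weights=True) -> int:
    """Rough parameter count for a GPT-style model with this config."""
    embed = vocab_size * embedding_dim + max_seq_length * embedding_dim  # token + positional embeddings
    attn = 4 * embedding_dim * embedding_dim                             # Q, K, V, output projections
    ffn = 2 * embedding_dim * hidden_dim                                 # up- and down-projection
    head = 0 if tie_weights else vocab_size * embedding_dim              # output head shared when tied
    return embed + num_layers * (attn + ffn) + head

print(f"~{estimate_params() / 1e6:.1f}M parameters")   # ~16.9M for the values shown above
```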
#### Training Settings
```yaml
training:
batch_size: 8 # Training batch size
learning_rate: 0.0002 # Learning rate
weight_decay: 0.01 # Weight decay for regularization
num_epochs: 10 # Number of training epochs
warmup_steps: 1000 # Warmup steps for learning rate
gradient_clip_norm: 1.0 # Gradient clipping
save_every: 1000 # Save checkpoint every N steps
eval_every: 500 # Evaluate every N steps
log_every: 10 # Log metrics every N steps
num_workers: 4 # Data loading workers
pin_memory: true # Pin memory for faster transfer
prefetch_factor: 2 # Batches to prefetch
use_mixed_precision: false # Enable mixed precision
```
#### Device Configuration
```yaml
device:
use_cuda: false # Use CUDA if available
cuda_device: 0 # CUDA device index
use_mps: false # Use MPS on Apple Silicon
cpu_threads: 0 # Number of CPU threads (0 = all)
enable_mkldnn: true # Enable MKLDNN acceleration
mixed_precision: false # Global mixed precision flag
```
### Choosing the Right Configuration
1. **For GPU Training**: Use `config_gpu.json`
```bash
python training/train.py --config config_gpu.json
```
2. **For Standard CPU Training**: Use `config.json`
```bash
python training/train.py --config config.json
```
3. **For Low-End CPUs**: Use `config_cpu_small.json`
```bash
python training/train.py --config config_cpu_small.json
```
### Custom Configuration
1. Copy an existing config file:
```bash
cp config.json my_config.json
```
2. Edit the parameters as needed
3. Use your custom config:
```bash
python training/train.py --config my_config.json
```
### Important Notes
- Larger `batch_size` and `max_seq_length` require more memory
- `num_workers` should be ≤ number of CPU cores
- Enable `mixed_precision` for GPUs with Tensor Cores (Volta, Turing, Ampere, etc.)
- For small GPUs, reduce `batch_size` and enable `gradient_accumulation_steps` (see the sketch after this list)
- For very small CPUs, reduce `num_layers`, `embedding_dim`, and `hidden_dim`
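
Gradient accumulation simply delays the optimizer step so several small batches contribute to one update. A minimal sketch of the idea, written against a plain PyTorch loop rather than the project's `Trainer`:

```python
import torch

def train_with_accumulation(model, dataloader, optimizer, loss_fn,
                            accumulation_steps: int = 4, clip_norm: float = 1.0) -> None:
    """One epoch with gradient accumulation: several small batches per optimizer step."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(dataloader):
        logits = model(inputs)
        loss = loss_fn(logits.view(-1, logits.size(-1)), targets.view(-1))
        (loss / accumulation_steps).backward()   # scale so summed gradients match one large batch
        if (step + 1) % accumulation_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)  # matches gradient_clip_norm
            optimizer.step()
            optimizer.zero_grad()
```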
## Debugging <a name="debugging"></a>
The project includes several debugging scripts in the `debug_scripts/` directory to help diagnose issues:
### Available Debug Scripts
1. **debug_loader.py**
- Tests and profiles the data loading pipeline
- Helps identify bottlenecks in data loading
- Usage:
```bash
python debug_scripts/debug_loader.py --config config.json
```
2. **debug_training.py**
- Runs a minimal training loop with extensive logging
- Verifies model can complete a forward/backward pass
- Usage:
```bash
python debug_scripts/debug_training.py --config config.json --max-steps 10
```
3. **debug_timestamps.py**
- Profiles different components of the training loop
- Helps identify slow operations
- Usage:
```bash
python debug_scripts/debug_timestamps.py --config config.json
```
### Debugging Tips
1. **Reduced Test Case**
- Use a tiny dataset with `--max-samples 10`
- Set `num_workers=0` to simplify data loading
- Reduce `batch_size` and `max_seq_length`
2. **Common Issues**
- **CUDA Out of Memory**: Reduce `batch_size` or model dimensions
- **Slow Training**: Check data loading with `debug_loader.py`
- **NaN/Inf Losses**: Try gradient clipping and lower learning rate
3. **Verbose Logging**
```bash
python training/train.py --config config.json --log-level DEBUG
```
4. **Memory Profiling**
```bash
python -m memory_profiler training/train.py --config config.json
```
## Advanced Usage <a name="advanced-usage"></a>
### CPU Optimization <a name="cpu-optimization"></a>
Optimize for CPU training with the following (a minimal configuration sketch follows this list):
- Multi-threading
- Memory efficiency
- Gradient accumulation
- MKLDNN acceleration
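
These settings map onto a handful of PyTorch calls. The sketch below shows an assumed mapping from the `device` section of the config; it is not the project's actual startup code:

```python
import os
import torch

def configure_cpu(cpu_threads: int = 0, enable_mkldnn: bool = True) -> None:
    """Apply CPU-oriented settings roughly matching the `device` config section."""
    threads = cpu_threads or os.cpu_count()              # 0 means "use all cores"
    torch.set_num_threads(threads)                       # intra-op parallelism
    torch.set_num_interop_threads(max(1, threads // 2))  # inter-op parallelism (call before training starts)
    torch.backends.mkldnn.enabled = enable_mkldnn        # oneDNN/MKLDNN kernels on supported CPUs

configure_cpu(cpu_threads=0, enable_mkldnn=True)
```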
### Data Processing <a name="data-processing"></a>
Example custom preprocessing:
```python
from training.preprocess import DataPreprocessor
processor = DataPreprocessor(
min_length=100, # Min text length
max_length=500000, # Max text length
remove_urls=True, # Clean URLs
remove_emails=True, # Clean emails
normalize_whitespace=True
)
```
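
The options above correspond to common text-cleaning steps. For intuition, here is a standalone sketch of what URL/email removal and whitespace normalization typically look like; it is an illustration of the options, not the `DataPreprocessor` implementation itself:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def clean_text(text: str, remove_urls: bool = True, remove_emails: bool = True,
               normalize_whitespace: bool = True) -> str:
    """Apply the same kinds of cleaning the DataPreprocessor options describe."""
    if remove_urls:
        text = URL_RE.sub(" ", text)
    if remove_emails:
        text = EMAIL_RE.sub(" ", text)
    if normalize_whitespace:
        text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_text("Contact me at user@example.com or see https://example.org  for details."))
# -> "Contact me at or see for details."
```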
### Training API <a name="training-api"></a>
```python
from training.train import Trainer
# Initialize trainer with JSON config
trainer = Trainer(config_path="config.json")
# Start training
trainer.train()
# Example with custom settings
custom_trainer = Trainer(
config_path="config.json",
train_data_dir="data/processed/train",
val_data_dir="data/processed/val",
output_dir="exports/models/custom_run"
)
custom_trainer.train()
```
**Configuration Options**:
- `config_path`: Path to JSON config file (e.g., `config.json`)
- `train_data_dir`: Directory containing training data (overrides config)
- `val_data_dir`: Directory containing validation data (overrides config)
- `output_dir`: Directory to save checkpoints and logs (overrides config)
## Training Monitoring <a name="monitoring-training"></a>
### Logs
- Console: Real-time progress
- File: `logs/training.log`
- Metrics: `logs/training_history.json` (inspection sketch below)
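
The exact schema of `training_history.json` is not documented here; assuming it stores a list of per-step metric records, a quick inspection sketch might look like this (adjust the keys to the real schema):

```python
import json
from pathlib import Path

history = json.loads(Path("logs/training_history.json").read_text())

# Assumed structure: a list of records such as {"step": ..., "loss": ...}.
if isinstance(history, list) and history:
    print(f"last logged entry: {history[-1]}")
else:
    print(json.dumps(history, indent=2)[:500])   # fall back to printing the raw structure
```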
### Checkpoints
- `checkpoint_epoch_N.pt`: Regular saves
- `best_model.pt`: Best validation score
- `latest.pt`: Most recent checkpoint
## Performance Optimization <a name="performance-optimization"></a>
### CPU Training
- Batch size: 8-32 (adjust for RAM)
- Use all CPU cores
- Enable gradient accumulation
- Try mixed precision if available
### Memory Management
- Reduce `block_size` (128-256)
- Decrease `batch_size`
- Use smaller model dimensions
### Speed Improvements
- Increase `batch_size` (if RAM allows)
- Use larger `block_size` for context
- Multiple data files improve shuffling
## Troubleshooting <a name="troubleshooting"></a>
### Common Issues
1. **Out of Memory**
- Reduce `batch_size` in your config file (e.g., `config.json`)
- Decrease `block_size` or model size
- Close other applications
2. **No Training Data**
- Check `data/raw/` directory
- Supported formats: .txt, .pdf, .docx
- Verify file permissions
3. **Slow Training**
- Optimize CPU thread count
- Reduce model size
- Monitor system resources
4. **Import Errors**
```bash
pip install -r requirements.txt
python --version # Requires 3.8+
```
Check `logs/` for detailed error messages.
## Model Architecture <a name="model-architecture"></a>
GPT-style transformer with the following components (a minimal block sketch follows this list):
- Multi-head self-attention
- GELU activation
- Pre-norm layer normalization
- Learned positional embeddings
- Weight-tied embeddings
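
A minimal pre-norm block capturing these design choices, shown for orientation only (not the code in `model/gpt_model.py`):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """One GPT-style block: LayerNorm before attention and before the GELU MLP."""
    def __init__(self, dim: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position attends only to itself and earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                      # residual around attention
        x = x + self.mlp(self.ln2(x))         # residual around the MLP
        return x
```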
### Default Specs
- Parameters: ~50M
- Layers: 12
- Heads: 12
- Embedding: 768D
- Context: 512 tokens
- Vocabulary: 16K BPE
## Recent Updates <a name="recent-updates"></a>
### ✨ **Latest Features** (See [PIPELINE_UPDATES.md](PIPELINE_UPDATES.md))
- **Enhanced Document Ingestion**: Multi-format support with OCR
- **Intelligent Deduplication**: Hash + embedding-based duplicate removal
- **Automated GGUF Conversion**: Production-ready model export
- **Comprehensive Testing**: Full test suite with pytest
- **Cross-platform Scripts**: Enhanced PowerShell and Bash runners
### 🚀 **Future Enhancements**
- **Distributed Training**: Multi-GPU and multi-node support
- **Web Interface**: Real-time monitoring dashboard
- **More Architectures**: LLaMA, BERT, and custom models
- **Cloud Integration**: AWS/GCP/Azure deployment
- **Advanced Optimizations**: Dynamic quantization, pruning
## Pre-trained Models <a name="pre-trained-models"></a>
Download models from HuggingFace:
```bash
python tools/download_hf_model.py \
--model Qwen/Qwen2.5-Coder-0.5B \
--output-dir ./models/Qwen2.5-Coder-0.5B
```
## License <a name="license"></a>
MIT Licensed. See [LICENSE](LICENSE) for details.
## Contributing <a name="contributing"></a>
Contributions welcome! Please submit PRs or open issues.
## Quick Reference
### 🚀 **One-Command Setup**
```bash
# Complete pipeline with enhanced features
./run.sh all # Linux/macOS
.\run.ps1 -Stage all # Windows PowerShell
```
### 📋 **Essential Commands**
```bash
# Enhanced document processing
./run.sh ingest # Process HTML, PDF, EPUB, etc.
./run.sh dedup # Remove duplicates intelligently
./run.sh train # Train your model
./run.sh gguf # Convert to GGUF format
./run.sh test # Run comprehensive tests
```
### 📚 **Documentation**
- **[USAGE.md](USAGE.md)** - Complete usage guide with examples
- **[PIPELINE_UPDATES.md](PIPELINE_UPDATES.md)** - Recent feature updates
- **[INSTALL_TESSERACT.md](INSTALL_TESSERACT.md)** - OCR setup guide
- **[data/README_INGESTION.md](data/README_INGESTION.md)** - Document ingestion details
### 🆘 **Need Help?**
1. Check the [Usage Guide](USAGE.md) for detailed examples
2. Review logs in `logs/` directory
3. Run tests: `./run.sh test`
4. Open an issue on the repository
---
**Get started** by adding your documents to `data/raw/` and running:
```bash
./run.sh all # Complete enhanced pipeline
```