llmbuilder


Name: llmbuilder
Version: 0.4.6
Home page: https://github.com/Qubasehq/llmbuilder-package
Summary: A comprehensive toolkit for building, training, and deploying language models
Upload time: 2025-08-14 20:16:12
Author: Qub△se
Author email: Qubase Team <team@qubase.in>
Requires Python: >=3.8
License: Apache-2.0
Keywords: machine learning, deep learning, natural language processing, transformers, language models, training, inference, tokenization, data processing

# 🤖 LLMBuilder

[![Documentation](https://img.shields.io/badge/docs-mkdocs-blue)](https://qubasehq.github.io/llmbuilder-package/)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

A comprehensive toolkit for building, training, fine-tuning, and deploying GPT-style language models with advanced data processing capabilities and CPU-friendly defaults.

## About LLMBuilder Framework

**LLMBuilder** is a production-ready framework for training and fine-tuning Large Language Models (LLMs); it is not a model itself. Designed for developers, researchers, and AI engineers, LLMBuilder provides a full pipeline to go from raw text data to deployable, optimized LLMs, all running locally on CPUs or GPUs.

### Complete Framework Structure

The full LLMBuilder framework includes:

```
LLMBuilder/
├── data/                    # Data directories
│   ├── raw/                 # Raw input files (.txt, .pdf, .docx)
│   ├── cleaned/             # Processed text files
│   ├── tokens/              # Tokenized datasets
│   ├── download_data.py     # Script to download datasets
│   └── SOURCES.md           # Data sources documentation
│
├── debug_scripts/           # Debugging utilities
│   ├── debug_loader.py      # Data loading debugger
│   ├── debug_training.py    # Training process debugger
│   └── debug_timestamps.py  # Timing analysis
│
├── eval/                    # Model evaluation
│   └── eval.py              # Evaluation scripts
│
├── exports/                 # Output directories
│   ├── checkpoints/         # Training checkpoints
│   ├── gguf/                # GGUF model exports
│   ├── onnx/                # ONNX model exports
│   └── tokenizer/           # Saved tokenizer files
│
├── finetune/                # Fine-tuning scripts
│   ├── finetune.py          # Fine-tuning implementation
│   └── __init__.py          # Package initialization
│
├── logs/                    # Training and evaluation logs
│
├── model/                   # Model architecture
│   └── gpt_model.py         # GPT model implementation
│
├── tools/                   # Utility scripts
│   ├── analyze_data.ps1     # PowerShell data analysis
│   ├── analyze_data.sh      # Bash data analysis
│   ├── download_hf_model.py # HuggingFace model downloader
│   └── export_gguf.py       # GGUF export utility
│
├── training/                # Training pipeline
│   ├── dataset.py           # Dataset handling
│   ├── preprocess.py        # Data preprocessing
│   ├── quantization.py      # Model quantization
│   ├── train.py             # Main training script
│   ├── train_tokenizer.py   # Tokenizer training
│   └── utils.py             # Training utilities
│
├── .gitignore               # Git ignore rules
├── config.json              # Main configuration
├── config_cpu_small.json    # Small CPU config
├── config_gpu.json          # GPU configuration
├── inference.py             # Inference script
├── quantize_model.py        # Model quantization
├── README.md                # Documentation
├── requirements.txt         # Python dependencies
├── run.ps1                  # PowerShell runner
└── run.sh                   # Bash runner
```

🔗 **Full Framework Repository**: [https://github.com/Qubasehq/llmbuilder](https://github.com/Qubasehq/llmbuilder)

> [!NOTE]
> **This is a separate framework.** The complete LLMBuilder framework shown above is not part of this package; it is a standalone project available at the GitHub repository linked above. This package (`llmbuilder_package`) provides the core modular toolkit, while the complete framework adds utilities, debugging tools, and production-ready scripts for end-to-end LLM development workflows.

## ✨ Key Features

### 🔄 Advanced Data Processing

- **Multi-Format Ingestion**: Process HTML, Markdown, EPUB, PDF, and text files with intelligent extraction
- **OCR Integration**: Automatic OCR fallback for scanned PDFs using Tesseract
- **Smart Deduplication**: Both exact and semantic duplicate detection with configurable similarity thresholds
- **Batch Processing**: Parallel processing with configurable worker threads and batch sizes

### 🔤 Flexible Tokenization

- **Multiple Algorithms**: BPE, SentencePiece, Unigram, and WordPiece tokenizers
- **Custom Training**: Train tokenizers on your specific datasets with advanced configuration options
- **Validation Tools**: Built-in tokenizer testing and benchmarking utilities
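
As a rough picture of what training a BPE tokenizer involves, the sketch below learns merge rules from a tiny corpus. It is a teaching example under simplifying assumptions (whitespace pre-tokenization, no special tokens, hypothetical function names) and does not reflect how LLMBuilder or SentencePiece implement training.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Return the ordered list of merges learned from a whitespace-split corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break  # every word is already a single symbol
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges
```

Each merge adds one entry to the vocabulary, which is why a larger `--vocab-size` corresponds to more merge steps.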

### ⚡ Model Conversion & Optimization

- **GGUF Pipeline**: Convert trained models to GGUF format for llama.cpp compatibility
- **Quantization Options**: Support for F32, F16, Q8_0, Q5_1, Q5_0, Q4_1, Q4_0 quantization levels
- **Batch Conversion**: Convert multiple models with different quantization levels simultaneously
- **Validation**: Automatic output validation and integrity checking
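
To give a feel for what the 8-bit levels trade away, here is a minimal sketch of symmetric per-block quantization, in the spirit of Q8_0 (one scale per block of values plus 8-bit integer codes). It is an illustration only: the function names are hypothetical, and the real GGUF formats produced by llama.cpp differ in storage layout and detail.

```python
def quantize_block(values):
    """Quantize one block to (scale, int8 codes) using the block's max magnitude."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate float values from a quantized block."""
    return [scale * c for c in codes]


def quantize(weights, block_size=32):
    """Split a flat list of weights into blocks and quantize each independently."""
    return [
        quantize_block(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]
```

Because each block carries its own scale, the rounding error per value is bounded by half of that block's scale, so blocks with small weights lose less precision than they would under a single global scale.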

### ⚙️ Configuration Management

- **Template System**: Pre-configured templates for different use cases (CPU, GPU, inference, etc.)
- **Validation**: Comprehensive configuration validation with detailed error reporting
- **Override Support**: Easy configuration customization with dot-notation overrides
- **CLI Integration**: Full command-line configuration management tools
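
Dot-notation overrides such as `model.vocab_size=24000` map naturally onto nested dictionaries. The sketch below shows one way this could work; `apply_overrides` and `parse_value` are hypothetical helpers written for illustration, not part of the LLMBuilder API.

```python
import copy
import json


def parse_value(raw):
    """Interpret the right-hand side as JSON when possible, else keep the string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def apply_overrides(config, overrides):
    """Return a copy of `config` with each 'a.b.c=value' override applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})  # create missing sections on the way
        node[leaf] = parse_value(raw)
    return result
```

Parsing values as JSON first means `32` becomes an integer and `false` a boolean, while bare words such as `lxml` stay strings.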

### 🖥️ Production-Ready CLI

- **Complete Interface**: Full command-line interface for all operations
- **Interactive Modes**: Guided setup and configuration for new users
- **Progress Tracking**: Real-time progress reporting with detailed logging
- **Batch Operations**: Support for processing multiple files and models

### 🧪 Quality Assurance

- **Extensive Testing**: 200+ unit and integration tests covering all functionality
- **Performance Tests**: Memory usage monitoring and performance benchmarking
- **CI/CD Pipeline**: Automated testing and validation on multiple platforms
- **Documentation**: Comprehensive documentation with examples and troubleshooting guides

## 🚀 Quick Start

### Installation

```bash
pip install llmbuilder
```

### Optional Dependencies

LLMBuilder works out of the box, but you can install additional dependencies for advanced features:

```bash
# Complete installation with all features
# (quotes keep shells such as zsh from interpreting the brackets)
pip install "llmbuilder[all]"

# Or install specific feature sets:
pip install "llmbuilder[pdf]"        # PDF processing with OCR
pip install "llmbuilder[epub]"       # EPUB document processing
pip install "llmbuilder[html]"       # Advanced HTML processing
pip install "llmbuilder[semantic]"   # Semantic deduplication
pip install "llmbuilder[conversion]" # GGUF model conversion

# Manual installation of optional dependencies:
pip install pymupdf pytesseract    # PDF processing with OCR
pip install ebooklib               # EPUB processing
pip install beautifulsoup4 lxml    # HTML processing
pip install sentence-transformers  # Semantic deduplication
```

### System Dependencies

**For PDF OCR Processing:**

```bash
# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
# macOS:
brew install tesseract

# Ubuntu/Debian:
sudo apt-get install tesseract-ocr tesseract-ocr-eng

# Verify installation:
tesseract --version
```

**For GGUF Model Conversion:**

```bash
# Option 1: Install llama.cpp (recommended)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Option 2: Python package (alternative)
pip install llama-cpp-python

# Verify installation:
llmbuilder convert gguf --help
```

### 5-Minute Complete Pipeline

```python
import llmbuilder as lb

# 1. Create configuration from template
from llmbuilder.config.manager import create_config_from_template
config = create_config_from_template("basic_config", {
    "model": {"vocab_size": 16000},
    "data": {
        "ingestion": {"enable_ocr": True, "supported_formats": ["pdf", "html", "txt"]},
        "deduplication": {"similarity_threshold": 0.85}
    }
})

# 2. Process multi-format documents with advanced features
from llmbuilder.data.ingest import IngestionPipeline
pipeline = IngestionPipeline(config.data.ingestion)
pipeline.process_directory("./raw_data", "./processed_data")
print(f"Processed {pipeline.get_stats()['files_processed']} files")

# 3. Advanced deduplication (exact + semantic)
from llmbuilder.data.dedup import DeduplicationPipeline
dedup = DeduplicationPipeline(config.data.deduplication)
stats = dedup.process_file("./processed_data/combined.txt", "./clean_data/deduplicated.txt")
print(f"Removed {stats['duplicates_removed']} duplicates")

# 4. Train custom tokenizer with validation
from llmbuilder.tokenizer import TokenizerTrainer
trainer = TokenizerTrainer(config.tokenizer_training)
results = trainer.train("./clean_data/deduplicated.txt", "./tokenizers")
print(f"Trained tokenizer with {results['vocab_size']} tokens")

# 5. Build and train model
model = lb.build_model(config.model)
from llmbuilder.data import TextDataset
dataset = TextDataset("./clean_data/deduplicated.txt", block_size=config.model.max_seq_length)
training_results = lb.train_model(model, dataset, config.training)
print(f"Training completed. Final loss: {training_results['final_loss']:.4f}")

# 6. Convert to multiple GGUF formats
from llmbuilder.tools.convert_to_gguf import GGUFConverter
converter = GGUFConverter()
quantization_levels = ["Q8_0", "Q4_0", "Q4_1"]
for quant in quantization_levels:
    result = converter.convert_model(
        "./checkpoints/model.pt",
        f"./exports/model_{quant.lower()}.gguf",
        quant
    )
    if result.success:
        print(f"✅ {quant}: {result.file_size_bytes / (1024*1024):.1f} MB")

# 7. Generate text with different sampling strategies
text_creative = lb.generate_text(
    model_path="./checkpoints/model.pt",
    tokenizer_path="./tokenizers",
    prompt="The future of AI is",
    max_new_tokens=100,
    temperature=0.8,  # More creative
    top_p=0.9
)

text_focused = lb.generate_text(
    model_path="./checkpoints/model.pt",
    tokenizer_path="./tokenizers",
    prompt="The future of AI is",
    max_new_tokens=100,
    temperature=0.3,  # More focused
    top_k=40
)

print("Creative:", text_creative)
print("Focused:", text_focused)
```
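
The `temperature`, `top_k`, and `top_p` arguments in step 7 control how the next token is drawn from the model's output distribution. The following is a minimal, framework-free sketch of that idea; `sample_next_token` is a hypothetical helper and is not how `lb.generate_text` is necessarily implemented.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample a token index from raw logits; temperature must be > 0."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens

    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break  # smallest prefix whose mass reaches top_p
        ranked = kept

    weights = [probs[i] for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]
```

With `top_k=1` the function always returns the most likely token, which is equivalent to greedy decoding.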

## 📚 Documentation

**Complete documentation is available at: [https://qubasehq.github.io/llmbuilder-package/](https://qubasehq.github.io/llmbuilder-package/)**

The documentation includes:

- 📖 **Getting Started Guide** - From installation to your first model
- 🎯 **User Guides** - Comprehensive guides for all features
- 🖥️ **CLI Reference** - Complete command-line interface documentation
- 🐍 **Python API** - Full API reference with examples
- 📋 **Examples** - Working code examples for common tasks
- ❓ **FAQ** - Answers to frequently asked questions

## CLI Quickstart

### Getting Started

```bash
# Show help and available commands
llmbuilder --help

# Interactive welcome guide for new users
llmbuilder welcome

# Show package information and credits
llmbuilder info
```

### Data Processing Pipeline

```bash
# Multi-format document ingestion with OCR
llmbuilder data load -i ./documents -o ./processed.txt --format all --clean --enable-ocr

# Advanced deduplication (exact + semantic)
llmbuilder data deduplicate -i ./processed.txt -o ./clean.txt --method both --threshold 0.85

# Train custom tokenizer with validation
llmbuilder data tokenizer -i ./clean.txt -o ./tokenizers --algorithm sentencepiece --vocab-size 16000
```

### Configuration Management

```bash
# List available configuration templates
llmbuilder config templates

# Create configuration from template with overrides
llmbuilder config from-template advanced_processing_config -o my_config.json \
  --override model.vocab_size=24000 \
  --override training.batch_size=32

# Validate configuration with detailed reporting
llmbuilder config validate my_config.json --detailed

# Show comprehensive configuration summary
llmbuilder config summary my_config.json
```

### Model Training & Operations

```bash
# Train model with configuration file
llmbuilder train model --config my_config.json --data ./clean.txt --tokenizer ./tokenizers --output ./checkpoints

# Interactive text generation setup
llmbuilder generate text --setup

# Generate text with custom parameters
llmbuilder generate text -m ./checkpoints/model.pt -t ./tokenizers -p "Hello world" --temperature 0.8 --max-tokens 100
```

### GGUF Model Conversion

```bash
# Convert single model with validation
llmbuilder convert gguf ./checkpoints/model.pt -o ./exports/model.gguf -q Q8_0 --validate

# Convert with all quantization levels
llmbuilder convert gguf ./checkpoints/model.pt -o ./exports/model.gguf --all-quantizations

# Batch convert multiple models
llmbuilder convert batch -i ./models -o ./exports -q Q8_0 Q4_0 Q4_1 --pattern "*.pt"
```

### Advanced Operations

```bash
# Process large datasets with custom settings
llmbuilder data load -i ./large_docs -o ./processed.txt --batch-size 200 --workers 8 --format pdf,html

# Semantic deduplication with GPU acceleration
llmbuilder data deduplicate -i ./dataset.txt -o ./clean.txt --method semantic --use-gpu --batch-size 2000

# Train tokenizer with advanced options
llmbuilder data tokenizer -i ./corpus.txt -o ./tokenizers \
  --algorithm sentencepiece \
  --vocab-size 32000 \
  --character-coverage 0.9998 \
  --special-tokens "<pad>,<unk>,<s>,</s>,<mask>"
```

## Python API Quickstart

```python
import llmbuilder as lb

# Load a preset config and build a model
cfg = lb.load_config(preset="cpu_small")
model = lb.build_model(cfg.model)

# Train (example; see examples/train_tiny.py for a runnable script)
from llmbuilder.data import TextDataset
dataset = TextDataset("./data/clean.txt", block_size=cfg.model.max_seq_length)
results = lb.train_model(model, dataset, cfg.training)

# Generate text
text = lb.generate_text(
    model_path="./checkpoints/model.pt",
    tokenizer_path="./tokenizers",
    prompt="Hello world",
    max_new_tokens=50,
)
print(text)
```
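
The `block_size` argument determines how much context each training example contains. As a rough mental model (an assumption for illustration, not a description of `TextDataset` internals), a tokenized corpus can be cut into fixed-size windows whose targets are the inputs shifted by one token:

```python
def make_blocks(token_ids, block_size):
    """Yield (input, target) pairs; each target is its input shifted by one token."""
    # Stop early enough that every chunk has block_size + 1 tokens available.
    for start in range(0, len(token_ids) - block_size, block_size):
        chunk = token_ids[start:start + block_size + 1]
        yield chunk[:-1], chunk[1:]


if __name__ == "__main__":
    for inputs, targets in make_blocks(list(range(10)), block_size=4):
        print(inputs, "->", targets)
```

A corpus shorter than `block_size + 1` tokens yields no examples at all, which is one reason small datasets call for a small block size.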

### Full Example Script: `docs/train_model.py`

```python
"""
Example: Train a small GPT model on cybersecurity text files using LLMBuilder.

Usage:
  python docs/train_model.py --data_dir ./Model_Test --output_dir ./Model_Test/output \
      --prompt "Cybersecurity is important because" --epochs 5

If --data_dir is omitted, it defaults to the directory containing this script.
If --output_dir is omitted, it defaults to <data_dir>/output.

This script uses small-friendly settings (block_size=64, batch_size=1) so it
works on tiny datasets. It trains, saves checkpoints, and performs a sample
text generation from the latest/best checkpoint.
"""
from __future__ import annotations
import argparse
from pathlib import Path
import llmbuilder


def main():
    parser = argparse.ArgumentParser(description="Train and generate with LLMBuilder on small text datasets.")
    parser.add_argument("--data_dir", type=str, default=None, help="Directory with .txt files (default: folder of this script)")
    parser.add_argument("--output_dir", type=str, default=None, help="Where to save outputs (default: <data_dir>/output)")
    parser.add_argument("--epochs", type=int, default=5, help="Number of training epochs")
    parser.add_argument("--batch_size", type=int, default=1, help="Training batch size (small data friendly)")
    parser.add_argument("--block_size", type=int, default=64, help="Context window size for training")
    parser.add_argument("--embed_dim", type=int, default=256, help="Model embedding dimension")
    parser.add_argument("--layers", type=int, default=4, help="Number of transformer layers")
    parser.add_argument("--heads", type=int, default=8, help="Number of attention heads")
    parser.add_argument("--lr", type=float, default=6e-4, help="Learning rate")
    parser.add_argument("--prompt", type=str, default="Cybersecurity is important because", help="Prompt for sample generation")
    parser.add_argument("--max_new_tokens", type=int, default=80, help="Tokens to generate")
    parser.add_argument("--temperature", type=float, default=0.8, help="Sampling temperature")
    parser.add_argument("--top_p", type=float, default=0.9, help="Nucleus sampling top_p")
    args = parser.parse_args()

    # Resolve paths
    if args.data_dir is None:
        data_dir = Path(__file__).parent
    else:
        data_dir = Path(args.data_dir)
    output_dir = Path(args.output_dir) if args.output_dir else (data_dir / "output")
    output_dir.mkdir(parents=True, exist_ok=True)

    print(f"Data directory: {data_dir}")
    print(f"Output directory: {output_dir}")

    # Configs mapped to llmbuilder expected keys
    config = {
        # tokenizer/dataset convenience
        "vocab_size": 8000,
        "block_size": int(args.block_size),
        # training config -> llmbuilder.config.TrainingConfig
        "training": {
            "batch_size": int(args.batch_size),
            "learning_rate": float(args.lr),
            "num_epochs": int(args.epochs),
            "max_grad_norm": 1.0,
            "save_every": 1,
            "log_every": 10,
        },
        # model config -> llmbuilder.config.ModelConfig
        "model": {
            "embedding_dim": int(args.embed_dim),
            "num_layers": int(args.layers),
            "num_heads": int(args.heads),
            "max_seq_length": int(args.block_size),
            "dropout": 0.1,
        },
    }

    print("Starting LLMBuilder training pipeline...")
    pipeline = llmbuilder.train(
        data_path=str(data_dir),
        output_dir=str(output_dir),
        config=config,
        clean=False,
    )

    # Generation
    best_ckpt = output_dir / "checkpoints" / "best_checkpoint.pt"
    latest_ckpt = output_dir / "checkpoints" / "latest_checkpoint.pt"
    model_ckpt = best_ckpt if best_ckpt.exists() else latest_ckpt
    tokenizer_dir = output_dir / "tokenizer"

    if model_ckpt.exists() and tokenizer_dir.exists():
        print("\nGenerating sample text with trained model...")
        text = llmbuilder.generate_text(
            model_path=str(model_ckpt),
            tokenizer_path=str(tokenizer_dir),
            prompt=args.prompt,
            max_new_tokens=int(args.max_new_tokens),
            temperature=float(args.temperature),
            top_p=float(args.top_p),
        )
        print("\nSample generation:\n" + text)
    else:
        print("\nSkipping generation because artifacts were not found.")


if __name__ == "__main__":
    main()
```

## More

- Examples: see the `examples/` folder
  - `examples/generate_text.py`
  - `examples/train_tiny.py`
- Migration from older scripts: see `MIGRATION.md`

## For Developers and Advanced Users

- Python API quickstart:

  ```python
  import llmbuilder as lb
  cfg = lb.load_config(preset="cpu_small")
  model = lb.build_model(cfg.model)
  from llmbuilder.data import TextDataset
  dataset = TextDataset("./data/clean.txt", block_size=cfg.model.max_seq_length)
  results = lb.train_model(model, dataset, cfg.training)
  text = lb.generate_text(
      model_path="./checkpoints/model.pt",
      tokenizer_path="./tokenizers",
      prompt="Hello",
      max_new_tokens=64,
      temperature=0.8,
      top_k=50,
      top_p=0.9,
  )
  print(text)
  ```

- Config presets and legacy keys:
  - Use `lb.load_config(preset="cpu_small")` or `path="config.yaml"`.
  - Legacy flat keys like `n_layer`, `n_head`, `n_embd` are accepted and mapped internally.
- Useful CLI flags:
  - Training: `--epochs`, `--batch-size`, `--lr`, `--eval-interval`, `--save-interval` (see `llmbuilder train model --help`).
  - Generation: `--max-new-tokens`, `--temperature`, `--top-k`, `--top-p`, `--device` (see `llmbuilder generate text --help`).
- Environment knobs:
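  - Enable slow tests: `set RUN_SLOW=1` (Windows cmd) or `export RUN_SLOW=1` (macOS/Linux)
  - Enable performance tests: `set RUN_PERF=1` (Windows cmd) or `export RUN_PERF=1` (macOS/Linux)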
- Performance tips:
  - Prefer CPU wheels for broad compatibility; use smaller seq length and batch size.
  - Checkpoints are saved under `checkpoints/`; consider periodic eval to monitor perplexity.
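
Perplexity is the exponential of the mean per-token cross-entropy loss (in nats), so it can be computed directly from evaluation losses. The helper below is a small sketch with a hypothetical name; it assumes you already have a list of per-token or per-batch mean losses.

```python
import math


def perplexity(losses):
    """Convert cross-entropy losses (natural log) into perplexity."""
    if not losses:
        raise ValueError("need at least one loss value")
    mean_loss = sum(losses) / len(losses)
    return math.exp(mean_loss)
```

A perplexity that falls on training data while rising on held-out data is a common sign of overfitting.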

## 🔧 Troubleshooting

### Installation Issues

**Missing Optional Dependencies**

```bash
# Check what's installed
python -c "import llmbuilder; print('✅ LLMBuilder installed')"

# Install missing dependencies
pip install pymupdf pytesseract ebooklib beautifulsoup4 lxml sentence-transformers

# Verify specific features
python -c "import pytesseract; print('✅ OCR available')"
python -c "import sentence_transformers; print('✅ Semantic deduplication available')"
```

**System Dependencies**

```bash
# Tesseract OCR (for PDF processing)
# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
# macOS: brew install tesseract
# Ubuntu: sudo apt-get install tesseract-ocr tesseract-ocr-eng

# Verify Tesseract installation
tesseract --version
python -c "import pytesseract; print(pytesseract.get_tesseract_version())"
```

### Processing Issues

**PDF Processing Problems**

```bash
# Enable debug logging
export LLMBUILDER_LOG_LEVEL=DEBUG

# Common fixes:
# 1. Install language packs: sudo apt-get install tesseract-ocr-eng
# 2. Set the tessdata directory: export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
# 3. Lower OCR threshold: --ocr-threshold 0.3
```

**Memory Issues with Large Datasets**

```bash
# Use configuration to optimize memory usage
llmbuilder config from-template cpu_optimized_config -o memory_config.json \
  --override data.ingestion.batch_size=50 \
  --override data.deduplication.batch_size=500 \
  --override data.deduplication.use_gpu_for_embeddings=false

# Process in smaller chunks
llmbuilder data load -i large_dataset/ -o processed.txt --batch-size 25 --workers 2
```
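
Smaller batch sizes help because only one batch of documents needs to be held in memory at a time. The generator below sketches that pattern in plain Python; `batched` is a hypothetical helper for illustration and says nothing about how LLMBuilder schedules work internally.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch


if __name__ == "__main__":
    for batch in batched(range(5), 2):
        print(batch)
```

Because the input is consumed lazily, peak memory grows with `batch_size` rather than with the total number of files.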

**Semantic Deduplication Performance**

```bash
# GPU issues - disable GPU acceleration
llmbuilder data deduplicate -i dataset.txt -o clean.txt --method semantic --no-gpu

# Slow processing - increase batch size
llmbuilder data deduplicate -i dataset.txt -o clean.txt --method semantic --batch-size 2000

# Memory issues - reduce embedding cache
llmbuilder config from-template basic_config -o config.json \
  --override data.deduplication.embedding_cache_size=5000
```

### GGUF Conversion Issues

**Missing llama.cpp**

```bash
# Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Add to PATH or specify location
export PATH=$PATH:/path/to/llama.cpp

# Alternative: Use Python package
pip install llama-cpp-python

# Test conversion
llmbuilder convert gguf --help
```

**Conversion Failures**

```bash
# Re-run the conversion with verbose output to see where it fails
llmbuilder convert gguf model.pt -o test.gguf --verbose

# Try different quantization levels
llmbuilder convert gguf model.pt -o test.gguf -q F16  # Less compression
llmbuilder convert gguf model.pt -o test.gguf -q Q8_0 # Balanced

# Increase timeout for large models
llmbuilder config from-template basic_config -o config.json \
  --override gguf_conversion.conversion_timeout=7200
```

### Configuration Issues

**Validation Errors**

```bash
# Validate configuration with detailed output
llmbuilder config validate my_config.json --detailed

# Common fixes:
# 1. Vocab size mismatch - ensure model.vocab_size matches tokenizer_training.vocab_size
# 2. Sequence length issues - ensure data.max_length <= model.max_seq_length
# 3. Invalid quantization level - use: F32, F16, Q8_0, Q5_1, Q5_0, Q4_1, Q4_0
```
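
The three common fixes above amount to simple cross-field checks. The sketch below shows how such checks could look on a plain dictionary; it is not LLMBuilder's validator, and the `gguf_conversion.quantization_level` key name is an assumption made for the example.

```python
VALID_QUANTIZATION_LEVELS = {"F32", "F16", "Q8_0", "Q5_1", "Q5_0", "Q4_1", "Q4_0"}


def validate_config(config):
    """Return a list of problems found; an empty list means no issues detected."""
    errors = []
    model = config.get("model", {})
    tokenizer = config.get("tokenizer_training", {})
    data = config.get("data", {})
    gguf = config.get("gguf_conversion", {})

    # 1. Vocabulary sizes must agree between model and tokenizer.
    model_vocab = model.get("vocab_size")
    tok_vocab = tokenizer.get("vocab_size")
    if None not in (model_vocab, tok_vocab) and model_vocab != tok_vocab:
        errors.append(
            f"model.vocab_size ({model_vocab}) != "
            f"tokenizer_training.vocab_size ({tok_vocab})"
        )

    # 2. Data sequences must fit in the model's context window.
    max_length = data.get("max_length")
    max_seq = model.get("max_seq_length")
    if None not in (max_length, max_seq) and max_length > max_seq:
        errors.append(
            f"data.max_length ({max_length}) > model.max_seq_length ({max_seq})"
        )

    # 3. Quantization level must be a known name (key name is hypothetical).
    level = gguf.get("quantization_level")
    if level is not None and level not in VALID_QUANTIZATION_LEVELS:
        errors.append(f"unknown quantization level: {level}")

    return errors
```

Collecting all problems into a list, rather than raising on the first one, lets a single run report every issue at once.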

**Template Issues**

```bash
# List available templates
llmbuilder config templates

# Create from working template
llmbuilder config from-template basic_config -o working_config.json

# Validate before use
llmbuilder config validate working_config.json
```

### Performance Optimization

**Speed Up Processing**

```bash
# Use more workers for I/O-bound tasks
llmbuilder data load -i docs/ -o processed.txt --workers 8

# Enable GPU for semantic operations
llmbuilder data deduplicate -i dataset.txt -o clean.txt --use-gpu --batch-size 2000

# Use faster HTML parser
llmbuilder config from-template basic_config -o config.json \
  --override data.ingestion.html_parser=lxml
```

**Reduce Memory Usage**

```bash
# Smaller batch sizes
llmbuilder data load -i docs/ -o processed.txt --batch-size 25

# Disable semantic deduplication for large datasets
llmbuilder data deduplicate -i dataset.txt -o clean.txt --method exact

# Use CPU-optimized configuration
llmbuilder config from-template cpu_optimized_config -o config.json
```

### Debug Mode

**Enable Verbose Logging**

```bash
# Set environment variable
export LLMBUILDER_LOG_LEVEL=DEBUG

# Or use CLI flag
llmbuilder data load -i docs/ -o processed.txt --verbose

# Check processing statistics
llmbuilder data load -i docs/ -o processed.txt --stats
```
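
If you want your own scripts to honor the same environment variable, a pattern like the following works with the standard `logging` module. This is a sketch under assumptions: the logger name `llmbuilder_demo` and the fallback behavior are illustrative choices, not a description of LLMBuilder's internal logging setup.

```python
import logging
import os


def configure_logging(env_var="LLMBUILDER_LOG_LEVEL", default="INFO"):
    """Create a logger whose level is read from an environment variable."""
    level_name = os.environ.get(env_var, default).upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO  # fall back when the name is not a real level
    logger = logging.getLogger("llmbuilder_demo")  # hypothetical logger name
    logger.setLevel(level)
    return logger
```

Unrecognized values fall back to `INFO` instead of raising, so a typo in the variable does not stop a long-running job.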

### Getting Help

- 📖 **Documentation**: [https://qubasehq.github.io/llmbuilder-package/](https://qubasehq.github.io/llmbuilder-package/)
- 🐛 **Issues**: [GitHub Issues](https://github.com/Qubasehq/llmbuilder-package/issues)
- 💬 **Discussions**: [GitHub Discussions](https://github.com/Qubasehq/llmbuilder-package/discussions)

## Testing (developers)

- Fast tests: `python -m pytest -q tests`
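- Slow tests: `set RUN_SLOW=1 && python -m pytest -q tests` (Windows cmd) or `RUN_SLOW=1 python -m pytest -q tests` (macOS/Linux)
- Performance tests: `set RUN_PERF=1 && python -m pytest -q tests\performance` (Windows cmd) or `RUN_PERF=1 python -m pytest -q tests/performance` (macOS/Linux)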

## License

Apache-2.0.

            

**Validation**: Automatic output validation and integrity checking\r\n\r\n### \u2699\ufe0f Configuration Management\r\n\r\n- **Template System**: Pre-configured templates for different use cases (CPU, GPU, inference, etc.)\r\n- **Validation**: Comprehensive configuration validation with detailed error reporting\r\n- **Override Support**: Easy configuration customization with dot-notation overrides\r\n- **CLI Integration**: Full command-line configuration management tools\r\n\r\n### \ud83d\udda5\ufe0f Production-Ready CLI\r\n\r\n- **Complete Interface**: Full command-line interface for all operations\r\n- **Interactive Modes**: Guided setup and configuration for new users\r\n- **Progress Tracking**: Real-time progress reporting with detailed logging\r\n- **Batch Operations**: Support for processing multiple files and models\r\n\r\n### \ud83e\uddea Quality Assurance\r\n\r\n- **Extensive Testing**: 200+ unit and integration tests covering all functionality\r\n- **Performance Tests**: Memory usage monitoring and performance benchmarking\r\n- **CI/CD Pipeline**: Automated testing and validation on multiple platforms\r\n- **Documentation**: Comprehensive documentation with examples and troubleshooting guides\r\n\r\n## \ud83d\ude80 Quick Start\r\n\r\n### Installation\r\n\r\n```bash\r\npip install llmbuilder\r\n```\r\n\r\n### Optional Dependencies\r\n\r\nLLMBuilder works out of the box, but you can install additional dependencies for advanced features:\r\n\r\n```bash\r\n# Complete installation with all features\r\npip install llmbuilder[all]\r\n\r\n# Or install specific feature sets:\r\npip install llmbuilder[pdf]        # PDF processing with OCR\r\npip install llmbuilder[epub]       # EPUB document processing\r\npip install llmbuilder[html]       # Advanced HTML processing\r\npip install llmbuilder[semantic]   # Semantic deduplication\r\npip install llmbuilder[conversion] # GGUF model conversion\r\n\r\n# Manual installation of optional dependencies:\r\npip install pymupdf 
pytesseract    # PDF processing with OCR\r\npip install ebooklib               # EPUB processing\r\npip install beautifulsoup4 lxml    # HTML processing\r\npip install sentence-transformers  # Semantic deduplication\r\n```\r\n\r\n### System Dependencies\r\n\r\n**For PDF OCR Processing:**\r\n\r\n```bash\r\n# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki\r\n# macOS:\r\nbrew install tesseract\r\n\r\n# Ubuntu/Debian:\r\nsudo apt-get install tesseract-ocr tesseract-ocr-eng\r\n\r\n# Verify installation:\r\ntesseract --version\r\n```\r\n\r\n**For GGUF Model Conversion:**\r\n\r\n```bash\r\n# Option 1: Install llama.cpp (recommended)\r\ngit clone https://github.com/ggerganov/llama.cpp\r\ncd llama.cpp && make\r\n\r\n# Option 2: Python package (alternative)\r\npip install llama-cpp-python\r\n\r\n# Verify installation:\r\nllmbuilder convert gguf --help\r\n```\r\n\r\n### 5-Minute Complete Pipeline\r\n\r\n```python\r\nimport llmbuilder as lb\r\n\r\n# 1. Create configuration from template\r\nfrom llmbuilder.config.manager import create_config_from_template\r\nconfig = create_config_from_template(\"basic_config\", {\r\n    \"model\": {\"vocab_size\": 16000},\r\n    \"data\": {\r\n        \"ingestion\": {\"enable_ocr\": True, \"supported_formats\": [\"pdf\", \"html\", \"txt\"]},\r\n        \"deduplication\": {\"similarity_threshold\": 0.85}\r\n    }\r\n})\r\n\r\n# 2. Process multi-format documents with advanced features\r\nfrom llmbuilder.data.ingest import IngestionPipeline\r\npipeline = IngestionPipeline(config.data.ingestion)\r\npipeline.process_directory(\"./raw_data\", \"./processed_data\")\r\nprint(f\"Processed {pipeline.get_stats()['files_processed']} files\")\r\n\r\n# 3. 
Advanced deduplication (exact + semantic)\r\nfrom llmbuilder.data.dedup import DeduplicationPipeline\r\ndedup = DeduplicationPipeline(config.data.deduplication)\r\nstats = dedup.process_file(\"./processed_data/combined.txt\", \"./clean_data/deduplicated.txt\")\r\nprint(f\"Removed {stats['duplicates_removed']} duplicates\")\r\n\r\n# 4. Train custom tokenizer with validation\r\nfrom llmbuilder.tokenizer import TokenizerTrainer\r\ntrainer = TokenizerTrainer(config.tokenizer_training)\r\nresults = trainer.train(\"./clean_data/deduplicated.txt\", \"./tokenizers\")\r\nprint(f\"Trained tokenizer with {results['vocab_size']} tokens\")\r\n\r\n# 5. Build and train model\r\nmodel = lb.build_model(config.model)\r\nfrom llmbuilder.data import TextDataset\r\ndataset = TextDataset(\"./clean_data/deduplicated.txt\", block_size=config.model.max_seq_length)\r\ntraining_results = lb.train_model(model, dataset, config.training)\r\nprint(f\"Training completed. Final loss: {training_results['final_loss']:.4f}\")\r\n\r\n# 6. Convert to multiple GGUF formats\r\nfrom llmbuilder.tools.convert_to_gguf import GGUFConverter\r\nconverter = GGUFConverter()\r\nquantization_levels = [\"Q8_0\", \"Q4_0\", \"Q4_1\"]\r\nfor quant in quantization_levels:\r\n    result = converter.convert_model(\r\n        \"./checkpoints/model.pt\",\r\n        f\"./exports/model_{quant.lower()}.gguf\",\r\n        quant\r\n    )\r\n    if result.success:\r\n        print(f\"\u2705 {quant}: {result.file_size_bytes / (1024*1024):.1f} MB\")\r\n\r\n# 7. 
Generate text with different sampling strategies\r\ntext_creative = lb.generate_text(\r\n    model_path=\"./checkpoints/model.pt\",\r\n    tokenizer_path=\"./tokenizers\",\r\n    prompt=\"The future of AI is\",\r\n    max_new_tokens=100,\r\n    temperature=0.8,  # More creative\r\n    top_p=0.9\r\n)\r\n\r\ntext_focused = lb.generate_text(\r\n    model_path=\"./checkpoints/model.pt\",\r\n    tokenizer_path=\"./tokenizers\",\r\n    prompt=\"The future of AI is\",\r\n    max_new_tokens=100,\r\n    temperature=0.3,  # More focused\r\n    top_k=40\r\n)\r\n\r\nprint(\"Creative:\", text_creative)\r\nprint(\"Focused:\", text_focused)\r\n```\r\n\r\n## \ud83d\udcda Documentation\r\n\r\n**Complete documentation is available at: [https://qubasehq.github.io/llmbuilder-package/](https://qubasehq.github.io/llmbuilder-package/)**\r\n\r\nThe documentation includes:\r\n\r\n- \ud83d\udcd6 **Getting Started Guide** - From installation to your first model\r\n- \ud83c\udfaf **User Guides** - Comprehensive guides for all features\r\n- \ud83d\udda5\ufe0f **CLI Reference** - Complete command-line interface documentation\r\n- \ud83d\udc0d **Python API** - Full API reference with examples\r\n- \ud83d\udccb **Examples** - Working code examples for common tasks\r\n- \u2753 **FAQ** - Answers to frequently asked questions\r\n\r\n## CLI Quickstart\r\n\r\n### Getting Started\r\n\r\n```bash\r\n# Show help and available commands\r\nllmbuilder --help\r\n\r\n# Interactive welcome guide for new users\r\nllmbuilder welcome\r\n\r\n# Show package information and credits\r\nllmbuilder info\r\n```\r\n\r\n### Data Processing Pipeline\r\n\r\n```bash\r\n# Multi-format document ingestion with OCR\r\nllmbuilder data load -i ./documents -o ./processed.txt --format all --clean --enable-ocr\r\n\r\n# Advanced deduplication (exact + semantic)\r\nllmbuilder data deduplicate -i ./processed.txt -o ./clean.txt --method both --threshold 0.85\r\n\r\n# Train custom tokenizer with validation\r\nllmbuilder data tokenizer -i 
./clean.txt -o ./tokenizers --algorithm sentencepiece --vocab-size 16000\r\n```\r\n\r\n### Configuration Management\r\n\r\n```bash\r\n# List available configuration templates\r\nllmbuilder config templates\r\n\r\n# Create configuration from template with overrides\r\nllmbuilder config from-template advanced_processing_config -o my_config.json \\\r\n  --override model.vocab_size=24000 \\\r\n  --override training.batch_size=32\r\n\r\n# Validate configuration with detailed reporting\r\nllmbuilder config validate my_config.json --detailed\r\n\r\n# Show comprehensive configuration summary\r\nllmbuilder config summary my_config.json\r\n```\r\n\r\n### Model Training & Operations\r\n\r\n```bash\r\n# Train model with configuration file\r\nllmbuilder train model --config my_config.json --data ./clean.txt --tokenizer ./tokenizers --output ./checkpoints\r\n\r\n# Interactive text generation setup\r\nllmbuilder generate text --setup\r\n\r\n# Generate text with custom parameters\r\nllmbuilder generate text -m ./checkpoints/model.pt -t ./tokenizers -p \"Hello world\" --temperature 0.8 --max-tokens 100\r\n```\r\n\r\n### GGUF Model Conversion\r\n\r\n```bash\r\n# Convert single model with validation\r\nllmbuilder convert gguf ./checkpoints/model.pt -o ./exports/model.gguf -q Q8_0 --validate\r\n\r\n# Convert with all quantization levels\r\nllmbuilder convert gguf ./checkpoints/model.pt -o ./exports/model.gguf --all-quantizations\r\n\r\n# Batch convert multiple models\r\nllmbuilder convert batch -i ./models -o ./exports -q Q8_0 Q4_0 Q4_1 --pattern \"*.pt\"\r\n```\r\n\r\n### Advanced Operations\r\n\r\n```bash\r\n# Process large datasets with custom settings\r\nllmbuilder data load -i ./large_docs -o ./processed.txt --batch-size 200 --workers 8 --format pdf,html\r\n\r\n# Semantic deduplication with GPU acceleration\r\nllmbuilder data deduplicate -i ./dataset.txt -o ./clean.txt --method semantic --use-gpu --batch-size 2000\r\n\r\n# Train tokenizer with advanced options\r\nllmbuilder data 
tokenizer -i ./corpus.txt -o ./tokenizers \\\r\n  --algorithm sentencepiece \\\r\n  --vocab-size 32000 \\\r\n  --character-coverage 0.9998 \\\r\n  --special-tokens \"<pad>,<unk>,<s>,</s>,<mask>\"\r\n```\r\n\r\n## Python API Quickstart\r\n\r\n```python\r\nimport llmbuilder as lb\r\n\r\n# Load a preset config and build a model\r\ncfg = lb.load_config(preset=\"cpu_small\")\r\nmodel = lb.build_model(cfg.model)\r\n\r\n# Train (example; see examples/train_tiny.py for a runnable script)\r\nfrom llmbuilder.data import TextDataset\r\ndataset = TextDataset(\"./data/clean.txt\", block_size=cfg.model.max_seq_length)\r\nresults = lb.train_model(model, dataset, cfg.training)\r\n\r\n# Generate text\r\ntext = lb.generate_text(\r\n    model_path=\"./checkpoints/model.pt\",\r\n    tokenizer_path=\"./tokenizers\",\r\n    prompt=\"Hello world\",\r\n    max_new_tokens=50,\r\n)\r\nprint(text)\r\n```\r\n\r\n### Full Example Script: `docs/train_model.py`\r\n\r\n```python\r\n\"\"\"\r\nExample: Train a small GPT model on cybersecurity text files using LLMBuilder.\r\n\r\nUsage:\r\n  python docs/train_model.py --data_dir ./Model_Test --output_dir ./Model_Test/output \\\r\n      --prompt \"Cybersecurity is important because\" --epochs 5\r\n\r\nIf --data_dir is omitted, it defaults to the directory containing this script.\r\nIf --output_dir is omitted, it defaults to <data_dir>/output.\r\n\r\nThis script uses small-friendly settings (block_size=64, batch_size=1) so it\r\nworks on tiny datasets. 
It trains, saves checkpoints, and performs a sample\r\ntext generation from the latest/best checkpoint.\r\n\"\"\"\r\nfrom __future__ import annotations\r\nimport argparse\r\nfrom pathlib import Path\r\nimport llmbuilder\r\n\r\n\r\ndef main():\r\n    parser = argparse.ArgumentParser(description=\"Train and generate with LLMBuilder on small text datasets.\")\r\n    parser.add_argument(\"--data_dir\", type=str, default=None, help=\"Directory with .txt files (default: folder of this script)\")\r\n    parser.add_argument(\"--output_dir\", type=str, default=None, help=\"Where to save outputs (default: <data_dir>/output)\")\r\n    parser.add_argument(\"--epochs\", type=int, default=5, help=\"Number of training epochs\")\r\n    parser.add_argument(\"--batch_size\", type=int, default=1, help=\"Training batch size (small data friendly)\")\r\n    parser.add_argument(\"--block_size\", type=int, default=64, help=\"Context window size for training\")\r\n    parser.add_argument(\"--embed_dim\", type=int, default=256, help=\"Model embedding dimension\")\r\n    parser.add_argument(\"--layers\", type=int, default=4, help=\"Number of transformer layers\")\r\n    parser.add_argument(\"--heads\", type=int, default=8, help=\"Number of attention heads\")\r\n    parser.add_argument(\"--lr\", type=float, default=6e-4, help=\"Learning rate\")\r\n    parser.add_argument(\"--prompt\", type=str, default=\"Cybersecurity is important because\", help=\"Prompt for sample generation\")\r\n    parser.add_argument(\"--max_new_tokens\", type=int, default=80, help=\"Tokens to generate\")\r\n    parser.add_argument(\"--temperature\", type=float, default=0.8, help=\"Sampling temperature\")\r\n    parser.add_argument(\"--top_p\", type=float, default=0.9, help=\"Nucleus sampling top_p\")\r\n    args = parser.parse_args()\r\n\r\n    # Resolve paths\r\n    if args.data_dir is None:\r\n        data_dir = Path(__file__).parent\r\n    else:\r\n        data_dir = Path(args.data_dir)\r\n    output_dir = 
Path(args.output_dir) if args.output_dir else (data_dir / \"output\")\r\n    output_dir.mkdir(parents=True, exist_ok=True)\r\n\r\n    print(f\"Data directory: {data_dir}\")\r\n    print(f\"Output directory: {output_dir}\")\r\n\r\n    # Configs mapped to llmbuilder expected keys\r\n    config = {\r\n        # tokenizer/dataset convenience\r\n        \"vocab_size\": 8000,\r\n        \"block_size\": int(args.block_size),\r\n        # training config -> llmbuilder.config.TrainingConfig\r\n        \"training\": {\r\n            \"batch_size\": int(args.batch_size),\r\n            \"learning_rate\": float(args.lr),\r\n            \"num_epochs\": int(args.epochs),\r\n            \"max_grad_norm\": 1.0,\r\n            \"save_every\": 1,\r\n            \"log_every\": 10,\r\n        },\r\n        # model config -> llmbuilder.config.ModelConfig\r\n        \"model\": {\r\n            \"embedding_dim\": int(args.embed_dim),\r\n            \"num_layers\": int(args.layers),\r\n            \"num_heads\": int(args.heads),\r\n            \"max_seq_length\": int(args.block_size),\r\n            \"dropout\": 0.1,\r\n        },\r\n    }\r\n\r\n    print(\"Starting LLMBuilder training pipeline...\")\r\n    pipeline = llmbuilder.train(\r\n        data_path=str(data_dir),\r\n        output_dir=str(output_dir),\r\n        config=config,\r\n        clean=False,\r\n    )\r\n\r\n    # Generation\r\n    best_ckpt = output_dir / \"checkpoints\" / \"best_checkpoint.pt\"\r\n    latest_ckpt = output_dir / \"checkpoints\" / \"latest_checkpoint.pt\"\r\n    model_ckpt = best_ckpt if best_ckpt.exists() else latest_ckpt\r\n    tokenizer_dir = output_dir / \"tokenizer\"\r\n\r\n    if model_ckpt.exists() and tokenizer_dir.exists():\r\n        print(\"\\nGenerating sample text with trained model...\")\r\n        text = llmbuilder.generate_text(\r\n            model_path=str(model_ckpt),\r\n            tokenizer_path=str(tokenizer_dir),\r\n            prompt=args.prompt,\r\n            
max_new_tokens=int(args.max_new_tokens),\r\n            temperature=float(args.temperature),\r\n            top_p=float(args.top_p),\r\n        )\r\n        print(\"\\nSample generation:\\n\" + text)\r\n    else:\r\n        print(\"\\nSkipping generation because artifacts were not found.\")\r\n\r\n\r\nif __name__ == \"__main__\":\r\n    main()\r\n```\r\n\r\n## More\r\n\r\n- Examples: see the `examples/` folder\r\n  - `examples/generate_text.py`\r\n  - `examples/train_tiny.py`\r\n- Migration from older scripts: see `MIGRATION.md`\r\n\r\n## For Developers and Advanced Users\r\n\r\n- Python API quickstart:\r\n\r\n  ```python\r\n  import llmbuilder as lb\r\n  cfg = lb.load_config(preset=\"cpu_small\")\r\n  model = lb.build_model(cfg.model)\r\n  from llmbuilder.data import TextDataset\r\n  dataset = TextDataset(\"./data/clean.txt\", block_size=cfg.model.max_seq_length)\r\n  results = lb.train_model(model, dataset, cfg.training)\r\n  text = lb.generate_text(\r\n      model_path=\"./checkpoints/model.pt\",\r\n      tokenizer_path=\"./tokenizers\",\r\n      prompt=\"Hello\",\r\n      max_new_tokens=64,\r\n      temperature=0.8,\r\n      top_k=50,\r\n      top_p=0.9,\r\n  )\r\n  print(text)\r\n  ```\r\n\r\n- Config presets and legacy keys:\r\n  - Use `lb.load_config(preset=\"cpu_small\")` or `path=\"config.yaml\"`.\r\n  - Legacy flat keys like `n_layer`, `n_head`, `n_embd` are accepted and mapped internally.\r\n- Useful CLI flags:\r\n  - Training: `--epochs`, `--batch-size`, `--lr`, `--eval-interval`, `--save-interval` (see `llmbuilder train model --help`).\r\n  - Generation: `--max-new-tokens`, `--temperature`, `--top-k`, `--top-p`, `--device` (see `llmbuilder generate text --help`).\r\n- Environment knobs:\r\n  - Enable slow tests: `set RUN_SLOW=1`\r\n  - Enable performance tests: `set RUN_PERF=1`\r\n- Performance tips:\r\n  - Prefer CPU wheels for broad compatibility; use smaller seq length and batch size.\r\n  - Checkpoints are saved under `checkpoints/`; consider 
periodic eval to monitor perplexity.\r\n\r\n## \ud83d\udd27 Troubleshooting\r\n\r\n### Installation Issues\r\n\r\n**Missing Optional Dependencies**\r\n\r\n```bash\r\n# Check what's installed\r\npython -c \"import llmbuilder; print('\u2705 LLMBuilder installed')\"\r\n\r\n# Install missing dependencies\r\npip install pymupdf pytesseract ebooklib beautifulsoup4 lxml sentence-transformers\r\n\r\n# Verify specific features\r\npython -c \"import pytesseract; print('\u2705 OCR available')\"\r\npython -c \"import sentence_transformers; print('\u2705 Semantic deduplication available')\"\r\n```\r\n\r\n**System Dependencies**\r\n\r\n```bash\r\n# Tesseract OCR (for PDF processing)\r\n# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki\r\n# macOS: brew install tesseract\r\n# Ubuntu: sudo apt-get install tesseract-ocr tesseract-ocr-eng\r\n\r\n# Verify Tesseract installation\r\ntesseract --version\r\npython -c \"import pytesseract; pytesseract.get_tesseract_version()\"\r\n```\r\n\r\n### Processing Issues\r\n\r\n**PDF Processing Problems**\r\n\r\n```bash\r\n# Enable debug logging\r\nexport LLMBUILDER_LOG_LEVEL=DEBUG\r\n\r\n# Common fixes:\r\n# 1. Install language packs: sudo apt-get install tesseract-ocr-eng\r\n# 2. Set Tesseract path: export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata\r\n# 3. 
Lower OCR threshold: --ocr-threshold 0.3\r\n```\r\n\r\n**Memory Issues with Large Datasets**\r\n\r\n```bash\r\n# Use configuration to optimize memory usage\r\nllmbuilder config from-template cpu_optimized_config -o memory_config.json \\\r\n  --override data.ingestion.batch_size=50 \\\r\n  --override data.deduplication.batch_size=500 \\\r\n  --override data.deduplication.use_gpu_for_embeddings=false\r\n\r\n# Process in smaller chunks\r\nllmbuilder data load -i large_dataset/ -o processed.txt --batch-size 25 --workers 2\r\n```\r\n\r\n**Semantic Deduplication Performance**\r\n\r\n```bash\r\n# GPU issues - disable GPU acceleration\r\nllmbuilder data deduplicate -i dataset.txt -o clean.txt --method semantic --no-gpu\r\n\r\n# Slow processing - increase batch size\r\nllmbuilder data deduplicate -i dataset.txt -o clean.txt --method semantic --batch-size 2000\r\n\r\n# Memory issues - reduce embedding cache\r\nllmbuilder config from-template basic_config -o config.json \\\r\n  --override data.deduplication.embedding_cache_size=5000\r\n```\r\n\r\n### GGUF Conversion Issues\r\n\r\n**Missing llama.cpp**\r\n\r\n```bash\r\n# Install llama.cpp\r\ngit clone https://github.com/ggerganov/llama.cpp\r\ncd llama.cpp && make\r\n\r\n# Add to PATH or specify location\r\nexport PATH=$PATH:/path/to/llama.cpp\r\n\r\n# Alternative: Use Python package\r\npip install llama-cpp-python\r\n\r\n# Test conversion\r\nllmbuilder convert gguf --help\r\n```\r\n\r\n**Conversion Failures**\r\n\r\n```bash\r\n# Check available conversion scripts\r\nllmbuilder convert gguf model.pt -o test.gguf --verbose\r\n\r\n# Try different quantization levels\r\nllmbuilder convert gguf model.pt -o test.gguf -q F16  # Less compression\r\nllmbuilder convert gguf model.pt -o test.gguf -q Q8_0 # Balanced\r\n\r\n# Increase timeout for large models\r\nllmbuilder config from-template basic_config -o config.json \\\r\n  --override gguf_conversion.conversion_timeout=7200\r\n```\r\n\r\n### Configuration Issues\r\n\r\n**Validation 
Errors**\r\n\r\n```bash\r\n# Validate configuration with detailed output\r\nllmbuilder config validate my_config.json --detailed\r\n\r\n# Common fixes:\r\n# 1. Vocab size mismatch - ensure model.vocab_size matches tokenizer_training.vocab_size\r\n# 2. Sequence length issues - ensure data.max_length <= model.max_seq_length\r\n# 3. Invalid quantization level - use: F32, F16, Q8_0, Q5_1, Q5_0, Q4_1, Q4_0\r\n```\r\n\r\n**Template Issues**\r\n\r\n```bash\r\n# List available templates\r\nllmbuilder config templates\r\n\r\n# Create from working template\r\nllmbuilder config from-template basic_config -o working_config.json\r\n\r\n# Validate before use\r\nllmbuilder config validate working_config.json\r\n```\r\n\r\n### Performance Optimization\r\n\r\n**Speed Up Processing**\r\n\r\n```bash\r\n# Use more workers for I/O bound tasks\r\nllmbuilder data load -i docs/ -o processed.txt --workers 8\r\n\r\n# Enable GPU for semantic operations\r\nllmbuilder data deduplicate -i dataset.txt -o clean.txt --use-gpu --batch-size 2000\r\n\r\n# Use faster HTML parser\r\nllmbuilder config from-template basic_config -o config.json \\\r\n  --override data.ingestion.html_parser=lxml\r\n```\r\n\r\n**Reduce Memory Usage**\r\n\r\n```bash\r\n# Smaller batch sizes\r\nllmbuilder data load -i docs/ -o processed.txt --batch-size 25\r\n\r\n# Disable semantic deduplication for large datasets\r\nllmbuilder data deduplicate -i dataset.txt -o clean.txt --method exact\r\n\r\n# Use CPU-optimized configuration\r\nllmbuilder config from-template cpu_optimized_config -o config.json\r\n```\r\n\r\n### Debug Mode\r\n\r\n**Enable Verbose Logging**\r\n\r\n```bash\r\n# Set environment variable\r\nexport LLMBUILDER_LOG_LEVEL=DEBUG\r\n\r\n# Or use CLI flag\r\nllmbuilder data load -i docs/ -o processed.txt --verbose\r\n\r\n# Check processing statistics\r\nllmbuilder data load -i docs/ -o processed.txt --stats\r\n```\r\n\r\n### Getting Help\r\n\r\n- \ud83d\udcd6 **Documentation**: 
[https://qubasehq.github.io/llmbuilder-package/](https://qubasehq.github.io/llmbuilder-package/)\r\n- \ud83d\udc1b **Issues**: [GitHub Issues](https://github.com/Qubasehq/llmbuilder-package/issues)\r\n- \ud83d\udcac **Discussions**: [GitHub Discussions](https://github.com/Qubasehq/llmbuilder-package/discussions)\r\n\r\n## Testing (developers)\r\n\r\n- Fast tests: `python -m pytest -q tests`\r\n- Slow tests (Windows `cmd`): `set RUN_SLOW=1 && python -m pytest -q tests`; on POSIX shells: `RUN_SLOW=1 python -m pytest -q tests`\r\n- Performance tests (Windows `cmd`): `set RUN_PERF=1 && python -m pytest -q tests\\performance`; on POSIX shells: `RUN_PERF=1 python -m pytest -q tests/performance`\r\n\r\n## License\r\n\r\nApache-2.0.\r\n",
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "A comprehensive toolkit for building, training, and deploying language models",
    "version": "0.4.6",
    "project_urls": {
        "Changelog": "https://github.com/Qubasehq/llmbuilder-package/blob/main/CHANGELOG.md",
        "Documentation": "https://qubasehq.github.io/llmbuilder-package/",
        "Homepage": "https://github.com/Qubasehq/llmbuilder-package",
        "Issues": "https://github.com/Qubasehq/llmbuilder-package/issues",
        "Repository": "https://github.com/Qubasehq/llmbuilder-package.git"
    },
    "split_keywords": [
        "machine learning",
        "deep learning",
        "natural language processing",
        "transformers",
        "language models",
        "training",
        "inference",
        "tokenization",
        "data processing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7584c63da9dce8d2e0c4b5c92553664b11f2ecf5aa3e6f20c4e75030dfac065f",
                "md5": "f8d49b485be71732ad9e839bfa1f9899",
                "sha256": "4d31cbda5c7bb873c055f362cd4a691e3e6221424f96ade455be94cdf599fc59"
            },
            "downloads": -1,
            "filename": "llmbuilder-0.4.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f8d49b485be71732ad9e839bfa1f9899",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 160855,
            "upload_time": "2025-08-14T20:16:10",
            "upload_time_iso_8601": "2025-08-14T20:16:10.176857Z",
            "url": "https://files.pythonhosted.org/packages/75/84/c63da9dce8d2e0c4b5c92553664b11f2ecf5aa3e6f20c4e75030dfac065f/llmbuilder-0.4.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "196288fee15c132cedff5403a7a5b67e33f091b4966b401da97b195f39c13a7a",
                "md5": "67559da99f151d21a3a3cce115c0acbd",
                "sha256": "3324c306e207db079f438cf8535ecda1da8882394a7f9b54c71f70ed7f729b32"
            },
            "downloads": -1,
            "filename": "llmbuilder-0.4.6.tar.gz",
            "has_sig": false,
            "md5_digest": "67559da99f151d21a3a3cce115c0acbd",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 943423,
            "upload_time": "2025-08-14T20:16:12",
            "upload_time_iso_8601": "2025-08-14T20:16:12.129809Z",
            "url": "https://files.pythonhosted.org/packages/19/62/88fee15c132cedff5403a7a5b67e33f091b4966b401da97b195f39c13a7a/llmbuilder-0.4.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-14 20:16:12",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Qubasehq",
    "github_project": "llmbuilder-package",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "llmbuilder"
}
        