# 🤖 LLMBuilder
[Documentation](https://qubasehq.github.io/llmbuilder-package/)
[Python 3.8+](https://www.python.org/downloads/)
[License](https://opensource.org/licenses/MIT)
A comprehensive toolkit for building, training, fine-tuning, and deploying GPT-style language models with advanced data processing capabilities and CPU-friendly defaults.
## About LLMBuilder Framework
**LLMBuilder** is a production-ready framework for training and fine-tuning Large Language Models (LLMs), not a model itself. Designed for developers, researchers, and AI engineers, LLMBuilder provides a full pipeline to go from raw text data to deployable, optimized LLMs, all running locally on CPUs or GPUs.
### Complete Framework Structure
The full LLMBuilder framework includes:
```
LLMBuilder/
├── data/                      # Data directories
│   ├── raw/                   # Raw input files (.txt, .pdf, .docx)
│   ├── cleaned/               # Processed text files
│   ├── tokens/                # Tokenized datasets
│   ├── download_data.py       # Script to download datasets
│   └── SOURCES.md             # Data sources documentation
│
├── debug_scripts/             # Debugging utilities
│   ├── debug_loader.py        # Data loading debugger
│   ├── debug_training.py      # Training process debugger
│   └── debug_timestamps.py    # Timing analysis
│
├── eval/                      # Model evaluation
│   └── eval.py                # Evaluation scripts
│
├── exports/                   # Output directories
│   ├── checkpoints/           # Training checkpoints
│   ├── gguf/                  # GGUF model exports
│   ├── onnx/                  # ONNX model exports
│   └── tokenizer/             # Saved tokenizer files
│
├── finetune/                  # Fine-tuning scripts
│   ├── finetune.py            # Fine-tuning implementation
│   └── __init__.py            # Package initialization
│
├── logs/                      # Training and evaluation logs
│
├── model/                     # Model architecture
│   └── gpt_model.py           # GPT model implementation
│
├── tools/                     # Utility scripts
│   ├── analyze_data.ps1       # PowerShell data analysis
│   ├── analyze_data.sh        # Bash data analysis
│   ├── download_hf_model.py   # HuggingFace model downloader
│   └── export_gguf.py         # GGUF export utility
│
├── training/                  # Training pipeline
│   ├── dataset.py             # Dataset handling
│   ├── preprocess.py          # Data preprocessing
│   ├── quantization.py        # Model quantization
│   ├── train.py               # Main training script
│   ├── train_tokenizer.py     # Tokenizer training
│   └── utils.py               # Training utilities
│
├── .gitignore                 # Git ignore rules
├── config.json                # Main configuration
├── config_cpu_small.json      # Small CPU config
├── config_gpu.json            # GPU configuration
├── inference.py               # Inference script
├── quantize_model.py          # Model quantization
├── README.md                  # Documentation
├── requirements.txt           # Python dependencies
├── run.ps1                    # PowerShell runner
└── run.sh                     # Bash runner
```
🔗 **Full Framework Repository**: [https://github.com/Qubasehq/llmbuilder](https://github.com/Qubasehq/llmbuilder)
> [!NOTE]
> **This is a separate framework** - The complete LLMBuilder framework shown above is **not related to this package**. It's a standalone, comprehensive framework available at the GitHub repository. This package (`llmbuilder_package`) provides the core modular toolkit, while the complete framework offers additional utilities, debugging tools, and production-ready scripts for comprehensive LLM development workflows.
## ✨ Key Features
### 🔄 Advanced Data Processing
- **Multi-Format Ingestion**: Process HTML, Markdown, EPUB, PDF, and text files with intelligent extraction
- **OCR Integration**: Automatic OCR fallback for scanned PDFs using Tesseract
- **Smart Deduplication**: Both exact and semantic duplicate detection with configurable similarity thresholds
- **Batch Processing**: Parallel processing with configurable worker threads and batch sizes
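
To make the deduplication bullet above concrete, here is a minimal, stdlib-only sketch of exact (hash-based) and near-duplicate (similarity-threshold) filtering. It is illustrative only and does not use LLMBuilder's `DeduplicationPipeline` API; real semantic deduplication compares embeddings rather than character-level ratios.

```python
import hashlib
from difflib import SequenceMatcher

def normalize(line: str) -> str:
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(line.lower().split())

def dedup(lines, similarity_threshold=0.85):
    seen = set()   # hashes of normalized lines -> exact-duplicate filter
    kept = []
    for line in lines:
        key = hashlib.sha256(normalize(line).encode("utf-8")).hexdigest()
        if key in seen:
            continue  # exact duplicate
        # Near-duplicate check against kept lines. This is O(n^2) pairwise;
        # semantic pipelines use embeddings + nearest-neighbour search instead.
        if any(SequenceMatcher(None, normalize(line), normalize(k)).ratio()
               >= similarity_threshold for k in kept):
            continue
        seen.add(key)
        kept.append(line)
    return kept

print(dedup(["Hello world.", "hello   world.", "Hello, world!", "Goodbye."]))
# -> ['Hello world.', 'Goodbye.']
```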
### 🔤 Flexible Tokenization
- **Multiple Algorithms**: BPE, SentencePiece, Unigram, and WordPiece tokenizers
- **Custom Training**: Train tokenizers on your specific datasets with advanced configuration options
- **Validation Tools**: Built-in tokenizer testing and benchmarking utilities
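
For intuition on what tokenizer training involves, the sketch below trains a small BPE tokenizer with the Hugging Face `tokenizers` library (used here purely for illustration; LLMBuilder's `TokenizerTrainer` wraps equivalent machinery behind its own configuration, and the training file path is assumed to exist).

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build an untrained BPE tokenizer with a whitespace pre-tokenizer.
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

# Learn 16k merges/vocab entries from a corpus file.
trainer = BpeTrainer(vocab_size=16000, special_tokens=["<pad>", "<unk>", "<s>", "</s>"])
tokenizer.train(["./clean_data/deduplicated.txt"], trainer)

tokenizer.save("./tokenizers/bpe.json")
print(tokenizer.encode("The future of AI is").tokens)
```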
### ⚡ Model Conversion & Optimization
- **GGUF Pipeline**: Convert trained models to GGUF format for llama.cpp compatibility
- **Quantization Options**: Support for F32, F16, Q8_0, Q5_1, Q5_0, Q4_1, Q4_0 quantization levels
- **Batch Conversion**: Convert multiple models with different quantization levels simultaneously
- **Validation**: Automatic output validation and integrity checking
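
For a sense of what the quantization levels listed above trade away, here is a rough numpy sketch in the spirit of Q8_0: each block of 32 weights shares one float scale and stores 8-bit integer codes. The on-disk GGUF formats are more involved; this only illustrates the core idea.

```python
import numpy as np

def q8_0_quantize(weights: np.ndarray, block: int = 32):
    # Assumes the weight count is a multiple of the block size.
    w = weights.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one fp scale per block
    scale[scale == 0] = 1.0                               # avoid divide-by-zero
    q = np.round(w / scale).astype(np.int8)               # 8-bit integer codes
    return q, scale

def q8_0_dequantize(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = q8_0_quantize(w)
err = np.abs(w - q8_0_dequantize(q, s)).max()
print(f"max reconstruction error: {err:.5f}")  # small relative to the weight scale
```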
### ⚙️ Configuration Management
- **Template System**: Pre-configured templates for different use cases (CPU, GPU, inference, etc.)
- **Validation**: Comprehensive configuration validation with detailed error reporting
- **Override Support**: Easy configuration customization with dot-notation overrides
- **CLI Integration**: Full command-line configuration management tools
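
As an illustration of dot-notation overrides, the hypothetical helper below applies `model.vocab_size=24000`-style strings to a nested config dict. This is not LLMBuilder's internal implementation, just a sketch of the mechanism.

```python
import json

def apply_override(config: dict, dotted_key: str, raw_value: str) -> None:
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})     # walk/create intermediate dicts
    try:
        node[leaf] = json.loads(raw_value)  # parse numbers/booleans
    except json.JSONDecodeError:
        node[leaf] = raw_value              # fall back to a plain string

config = {"model": {"vocab_size": 16000}, "training": {}}
apply_override(config, "model.vocab_size", "24000")
apply_override(config, "training.batch_size", "32")
print(config)  # {'model': {'vocab_size': 24000}, 'training': {'batch_size': 32}}
```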
### 🖥️ Production-Ready CLI
- **Complete Interface**: Full command-line interface for all operations
- **Interactive Modes**: Guided setup and configuration for new users
- **Progress Tracking**: Real-time progress reporting with detailed logging
- **Batch Operations**: Support for processing multiple files and models
### 🧪 Quality Assurance
- **Extensive Testing**: 200+ unit and integration tests covering all functionality
- **Performance Tests**: Memory usage monitoring and performance benchmarking
- **CI/CD Pipeline**: Automated testing and validation on multiple platforms
- **Documentation**: Comprehensive documentation with examples and troubleshooting guides
## 🚀 Quick Start
### Installation
```bash
pip install llmbuilder
```
### Optional Dependencies
LLMBuilder works out of the box, but you can install additional dependencies for advanced features:
```bash
# Complete installation with all features
pip install llmbuilder[all]

# Or install specific feature sets:
pip install llmbuilder[pdf]        # PDF processing with OCR
pip install llmbuilder[epub]       # EPUB document processing
pip install llmbuilder[html]       # Advanced HTML processing
pip install llmbuilder[semantic]   # Semantic deduplication
pip install llmbuilder[conversion] # GGUF model conversion

# Manual installation of optional dependencies:
pip install pymupdf pytesseract    # PDF processing with OCR
pip install ebooklib               # EPUB processing
pip install beautifulsoup4 lxml    # HTML processing
pip install sentence-transformers  # Semantic deduplication
```
### System Dependencies
**For PDF OCR Processing:**
```bash
# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki

# macOS:
brew install tesseract

# Ubuntu/Debian:
sudo apt-get install tesseract-ocr tesseract-ocr-eng

# Verify installation:
tesseract --version
```
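
Once Tesseract is installed, you can sanity-check the OCR-fallback idea end to end with PyMuPDF and pytesseract (assumes `pip install pymupdf pytesseract pillow` and any scanned PDF you have on hand; LLMBuilder's ingestion pipeline performs this fallback automatically, so this is only a standalone sketch).

```python
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

doc = fitz.open("scanned.pdf")       # path to a scanned PDF (assumption)
page = doc[0]
text = page.get_text().strip()
if not text:                         # no embedded text layer -> OCR fallback
    pix = page.get_pixmap(dpi=300)   # render the page to an image
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    text = pytesseract.image_to_string(img, lang="eng")
print(text[:500])
```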
**For GGUF Model Conversion:**
```bash
# Option 1: Install llama.cpp (recommended)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Option 2: Python package (alternative)
pip install llama-cpp-python

# Verify installation:
llmbuilder convert gguf --help
```
### 5-Minute Complete Pipeline
```python
import llmbuilder as lb

# 1. Create configuration from template
from llmbuilder.config.manager import create_config_from_template

config = create_config_from_template("basic_config", {
    "model": {"vocab_size": 16000},
    "data": {
        "ingestion": {"enable_ocr": True, "supported_formats": ["pdf", "html", "txt"]},
        "deduplication": {"similarity_threshold": 0.85}
    }
})

# 2. Process multi-format documents with advanced features
from llmbuilder.data.ingest import IngestionPipeline

pipeline = IngestionPipeline(config.data.ingestion)
pipeline.process_directory("./raw_data", "./processed_data")
print(f"Processed {pipeline.get_stats()['files_processed']} files")

# 3. Advanced deduplication (exact + semantic)
from llmbuilder.data.dedup import DeduplicationPipeline

dedup = DeduplicationPipeline(config.data.deduplication)
stats = dedup.process_file("./processed_data/combined.txt", "./clean_data/deduplicated.txt")
print(f"Removed {stats['duplicates_removed']} duplicates")

# 4. Train custom tokenizer with validation
from llmbuilder.tokenizer import TokenizerTrainer

trainer = TokenizerTrainer(config.tokenizer_training)
results = trainer.train("./clean_data/deduplicated.txt", "./tokenizers")
print(f"Trained tokenizer with {results['vocab_size']} tokens")

# 5. Build and train model
model = lb.build_model(config.model)

from llmbuilder.data import TextDataset

dataset = TextDataset("./clean_data/deduplicated.txt", block_size=config.model.max_seq_length)
training_results = lb.train_model(model, dataset, config.training)
print(f"Training completed. Final loss: {training_results['final_loss']:.4f}")

# 6. Convert to multiple GGUF formats
from llmbuilder.tools.convert_to_gguf import GGUFConverter

converter = GGUFConverter()
quantization_levels = ["Q8_0", "Q4_0", "Q4_1"]
for quant in quantization_levels:
    result = converter.convert_model(
        "./checkpoints/model.pt",
        f"./exports/model_{quant.lower()}.gguf",
        quant
    )
    if result.success:
        print(f"✅ {quant}: {result.file_size_bytes / (1024*1024):.1f} MB")

# 7. Generate text with different sampling strategies
text_creative = lb.generate_text(
    model_path="./checkpoints/model.pt",
    tokenizer_path="./tokenizers",
    prompt="The future of AI is",
    max_new_tokens=100,
    temperature=0.8,  # More creative
    top_p=0.9
)

text_focused = lb.generate_text(
    model_path="./checkpoints/model.pt",
    tokenizer_path="./tokenizers",
    prompt="The future of AI is",
    max_new_tokens=100,
    temperature=0.3,  # More focused
    top_k=40
)

print("Creative:", text_creative)
print("Focused:", text_focused)
```
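
The `temperature`, `top_k`, and `top_p` knobs in step 7 reshape the model's next-token distribution before sampling. The self-contained numpy sketch below shows the standard mechanics (illustrative only, not LLMBuilder's implementation):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    rng = rng or np.random.default_rng()
    # Temperature: <1 sharpens the distribution, >1 flattens it.
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    if top_k is not None:
        # Keep only the k highest logits; mask the rest out.
        cutoff = np.sort(logits)[-min(top_k, len(logits))]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_p is not None:
        # Nucleus sampling: keep the smallest set of tokens whose
        # cumulative probability reaches top_p, then renormalize.
        order = np.argsort(-probs)
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask / mask.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5, -1.0]  # toy 4-token vocabulary
print(sample_next_token(logits, temperature=0.3, top_k=40))   # focused
print(sample_next_token(logits, temperature=0.8, top_p=0.9))  # creative
```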
## 📚 Documentation
**Complete documentation is available at: [https://qubasehq.github.io/llmbuilder-package/](https://qubasehq.github.io/llmbuilder-package/)**
The documentation includes:
- 📖 **Getting Started Guide** - From installation to your first model
- 🎯 **User Guides** - Comprehensive guides for all features
- 🖥️ **CLI Reference** - Complete command-line interface documentation
- 🐍 **Python API** - Full API reference with examples
- 📋 **Examples** - Working code examples for common tasks
- ❓ **FAQ** - Answers to frequently asked questions
## CLI Quickstart
### Getting Started
```bash
# Show help and available commands
llmbuilder --help

# Interactive welcome guide for new users
llmbuilder welcome

# Show package information and credits
llmbuilder info
```
### Data Processing Pipeline
```bash
# Multi-format document ingestion with OCR
llmbuilder data load -i ./documents -o ./processed.txt --format all --clean --enable-ocr

# Advanced deduplication (exact + semantic)
llmbuilder data deduplicate -i ./processed.txt -o ./clean.txt --method both --threshold 0.85

# Train custom tokenizer with validation
llmbuilder data tokenizer -i ./clean.txt -o ./tokenizers --algorithm sentencepiece --vocab-size 16000
```
### Configuration Management
```bash
# List available configuration templates
llmbuilder config templates

# Create configuration from template with overrides
llmbuilder config from-template advanced_processing_config -o my_config.json \
    --override model.vocab_size=24000 \
    --override training.batch_size=32

# Validate configuration with detailed reporting
llmbuilder config validate my_config.json --detailed

# Show comprehensive configuration summary
llmbuilder config summary my_config.json
```
### Model Training & Operations
```bash
# Train model with configuration file
llmbuilder train model --config my_config.json --data ./clean.txt --tokenizer ./tokenizers --output ./checkpoints

# Interactive text generation setup
llmbuilder generate text --setup

# Generate text with custom parameters
llmbuilder generate text -m ./checkpoints/model.pt -t ./tokenizers -p "Hello world" --temperature 0.8 --max-tokens 100
```
### GGUF Model Conversion
```bash
# Convert single model with validation
llmbuilder convert gguf ./checkpoints/model.pt -o ./exports/model.gguf -q Q8_0 --validate

# Convert with all quantization levels
llmbuilder convert gguf ./checkpoints/model.pt -o ./exports/model.gguf --all-quantizations

# Batch convert multiple models
llmbuilder convert batch -i ./models -o ./exports -q Q8_0 Q4_0 Q4_1 --pattern "*.pt"
```
### Advanced Operations
```bash
# Process large datasets with custom settings
llmbuilder data load -i ./large_docs -o ./processed.txt --batch-size 200 --workers 8 --format pdf,html

# Semantic deduplication with GPU acceleration
llmbuilder data deduplicate -i ./dataset.txt -o ./clean.txt --method semantic --use-gpu --batch-size 2000

# Train tokenizer with advanced options
llmbuilder data tokenizer -i ./corpus.txt -o ./tokenizers \
    --algorithm sentencepiece \
    --vocab-size 32000 \
    --character-coverage 0.9998 \
    --special-tokens "<pad>,<unk>,<s>,</s>,<mask>"
```
## Python API Quickstart
```python
import llmbuilder as lb

# Load a preset config and build a model
cfg = lb.load_config(preset="cpu_small")
model = lb.build_model(cfg.model)

# Train (example; see examples/train_tiny.py for a runnable script)
from llmbuilder.data import TextDataset
dataset = TextDataset("./data/clean.txt", block_size=cfg.model.max_seq_length)
results = lb.train_model(model, dataset, cfg.training)

# Generate text
text = lb.generate_text(
    model_path="./checkpoints/model.pt",
    tokenizer_path="./tokenizers",
    prompt="Hello world",
    max_new_tokens=50,
)
print(text)
```
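
The `block_size` argument controls how the dataset windows token ids into fixed-length training examples, with targets shifted one position right. Here is a minimal PyTorch sketch of that pattern; LLMBuilder's `TextDataset` may differ in details.

```python
import torch
from torch.utils.data import Dataset

class BlockDataset(Dataset):
    def __init__(self, token_ids, block_size):
        self.ids = torch.tensor(token_ids, dtype=torch.long)
        self.block_size = block_size

    def __len__(self):
        return max(0, len(self.ids) - self.block_size)

    def __getitem__(self, i):
        x = self.ids[i : i + self.block_size]          # input context
        y = self.ids[i + 1 : i + 1 + self.block_size]  # next-token targets
        return x, y

ds = BlockDataset(list(range(100)), block_size=8)
x, y = ds[0]
print(x.tolist())  # [0, 1, ..., 7]
print(y.tolist())  # [1, 2, ..., 8]
```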
### Full Example Script: `docs/train_model.py`
```python
"""
Example: Train a small GPT model on cybersecurity text files using LLMBuilder.
Usage:
python docs/train_model.py --data_dir ./Model_Test --output_dir ./Model_Test/output \
--prompt "Cybersecurity is important because" --epochs 5
If --data_dir is omitted, it defaults to the directory containing this script.
If --output_dir is omitted, it defaults to <data_dir>/output.
This script uses small-friendly settings (block_size=64, batch_size=1) so it
works on tiny datasets. It trains, saves checkpoints, and performs a sample
text generation from the latest/best checkpoint.
"""
from __future__ import annotations
import argparse
from pathlib import Path
import llmbuilder
def main():
parser = argparse.ArgumentParser(description="Train and generate with LLMBuilder on small text datasets.")
parser.add_argument("--data_dir", type=str, default=None, help="Directory with .txt files (default: folder of this script)")
parser.add_argument("--output_dir", type=str, default=None, help="Where to save outputs (default: <data_dir>/output)")
parser.add_argument("--epochs", type=int, default=5, help="Number of training epochs")
parser.add_argument("--batch_size", type=int, default=1, help="Training batch size (small data friendly)")
parser.add_argument("--block_size", type=int, default=64, help="Context window size for training")
parser.add_argument("--embed_dim", type=int, default=256, help="Model embedding dimension")
parser.add_argument("--layers", type=int, default=4, help="Number of transformer layers")
parser.add_argument("--heads", type=int, default=8, help="Number of attention heads")
parser.add_argument("--lr", type=float, default=6e-4, help="Learning rate")
parser.add_argument("--prompt", type=str, default="Cybersecurity is important because", help="Prompt for sample generation")
parser.add_argument("--max_new_tokens", type=int, default=80, help="Tokens to generate")
parser.add_argument("--temperature", type=float, default=0.8, help="Sampling temperature")
parser.add_argument("--top_p", type=float, default=0.9, help="Nucleus sampling top_p")
args = parser.parse_args()
# Resolve paths
if args.data_dir is None:
data_dir = Path(__file__).parent
else:
data_dir = Path(args.data_dir)
output_dir = Path(args.output_dir) if args.output_dir else (data_dir / "output")
output_dir.mkdir(parents=True, exist_ok=True)
print(f"Data directory: {data_dir}")
print(f"Output directory: {output_dir}")
# Configs mapped to llmbuilder expected keys
config = {
# tokenizer/dataset convenience
"vocab_size": 8000,
"block_size": int(args.block_size),
# training config -> llmbuilder.config.TrainingConfig
"training": {
"batch_size": int(args.batch_size),
"learning_rate": float(args.lr),
"num_epochs": int(args.epochs),
"max_grad_norm": 1.0,
"save_every": 1,
"log_every": 10,
},
# model config -> llmbuilder.config.ModelConfig
"model": {
"embedding_dim": int(args.embed_dim),
"num_layers": int(args.layers),
"num_heads": int(args.heads),
"max_seq_length": int(args.block_size),
"dropout": 0.1,
},
}
print("Starting LLMBuilder training pipeline...")
pipeline = llmbuilder.train(
data_path=str(data_dir),
output_dir=str(output_dir),
config=config,
clean=False,
)
# Generation
best_ckpt = output_dir / "checkpoints" / "best_checkpoint.pt"
latest_ckpt = output_dir / "checkpoints" / "latest_checkpoint.pt"
model_ckpt = best_ckpt if best_ckpt.exists() else latest_ckpt
tokenizer_dir = output_dir / "tokenizer"
if model_ckpt.exists() and tokenizer_dir.exists():
print("\nGenerating sample text with trained model...")
text = llmbuilder.generate_text(
model_path=str(model_ckpt),
tokenizer_path=str(tokenizer_dir),
prompt=args.prompt,
max_new_tokens=int(args.max_new_tokens),
temperature=float(args.temperature),
top_p=float(args.top_p),
)
print("\nSample generation:\n" + text)
else:
print("\nSkipping generation because artifacts were not found.")
if __name__ == "__main__":
main()
```
## More
- Examples: see the `examples/` folder
  - `examples/generate_text.py`
  - `examples/train_tiny.py`
- Migration from older scripts: see `MIGRATION.md`
## For Developers and Advanced Users
- Python API quickstart:

  ```python
  import llmbuilder as lb

  cfg = lb.load_config(preset="cpu_small")
  model = lb.build_model(cfg.model)

  from llmbuilder.data import TextDataset
  dataset = TextDataset("./data/clean.txt", block_size=cfg.model.max_seq_length)
  results = lb.train_model(model, dataset, cfg.training)

  text = lb.generate_text(
      model_path="./checkpoints/model.pt",
      tokenizer_path="./tokenizers",
      prompt="Hello",
      max_new_tokens=64,
      temperature=0.8,
      top_k=50,
      top_p=0.9,
  )
  print(text)
  ```
- Config presets and legacy keys:
  - Use `lb.load_config(preset="cpu_small")` or `path="config.yaml"`.
  - Legacy flat keys like `n_layer`, `n_head`, `n_embd` are accepted and mapped internally (a minimal sketch of this mapping appears after this list).
- Useful CLI flags:
  - Training: `--epochs`, `--batch-size`, `--lr`, `--eval-interval`, `--save-interval` (see `llmbuilder train model --help`).
  - Generation: `--max-new-tokens`, `--temperature`, `--top-k`, `--top-p`, `--device` (see `llmbuilder generate text --help`).
- Environment knobs (Windows `cmd` syntax shown):
  - Enable slow tests: `set RUN_SLOW=1`
  - Enable performance tests: `set RUN_PERF=1`
- Performance tips:
  - Prefer CPU wheels for broad compatibility; use smaller sequence lengths and batch sizes.
  - Checkpoints are saved under `checkpoints/`; consider periodic evaluation to monitor perplexity.
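
The legacy-key mapping mentioned above amounts to renaming flat keys onto the nested config schema. A minimal, hypothetical sketch (the package performs this internally):

```python
LEGACY_KEYS = {
    "n_layer": "num_layers",
    "n_head": "num_heads",
    "n_embd": "embedding_dim",
}

def normalize_model_config(raw: dict) -> dict:
    cfg = dict(raw)
    for old, new in LEGACY_KEYS.items():
        if old in cfg and new not in cfg:
            cfg[new] = cfg.pop(old)  # prefer the modern key, keep values intact
    return cfg

print(normalize_model_config({"n_layer": 4, "n_head": 8, "n_embd": 256}))
# {'num_layers': 4, 'num_heads': 8, 'embedding_dim': 256}
```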
## 🔧 Troubleshooting
### Installation Issues
**Missing Optional Dependencies**
```bash
# Check what's installed
python -c "import llmbuilder; print('✅ LLMBuilder installed')"

# Install missing dependencies
pip install pymupdf pytesseract ebooklib beautifulsoup4 lxml sentence-transformers

# Verify specific features
python -c "import pytesseract; print('✅ OCR available')"
python -c "import sentence_transformers; print('✅ Semantic deduplication available')"
```
**System Dependencies**
```bash
# Tesseract OCR (for PDF processing)
# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
# macOS: brew install tesseract
# Ubuntu: sudo apt-get install tesseract-ocr tesseract-ocr-eng

# Verify Tesseract installation
tesseract --version
python -c "import pytesseract; pytesseract.get_tesseract_version()"
```
### Processing Issues
**PDF Processing Problems**
```bash
# Enable debug logging
export LLMBUILDER_LOG_LEVEL=DEBUG

# Common fixes:
# 1. Install language packs: sudo apt-get install tesseract-ocr-eng
# 2. Set Tesseract path: export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
# 3. Lower OCR threshold: --ocr-threshold 0.3
```
**Memory Issues with Large Datasets**
```bash
# Use configuration to optimize memory usage
llmbuilder config from-template cpu_optimized_config -o memory_config.json \
    --override data.ingestion.batch_size=50 \
    --override data.deduplication.batch_size=500 \
    --override data.deduplication.use_gpu_for_embeddings=false

# Process in smaller chunks
llmbuilder data load -i large_dataset/ -o processed.txt --batch-size 25 --workers 2
```
**Semantic Deduplication Performance**
```bash
# GPU issues - disable GPU acceleration
llmbuilder data deduplicate -i dataset.txt -o clean.txt --method semantic --no-gpu

# Slow processing - increase batch size
llmbuilder data deduplicate -i dataset.txt -o clean.txt --method semantic --batch-size 2000

# Memory issues - reduce embedding cache
llmbuilder config from-template basic_config -o config.json \
    --override data.deduplication.embedding_cache_size=5000
```
### GGUF Conversion Issues
**Missing llama.cpp**
```bash
# Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Add to PATH or specify location
export PATH=$PATH:/path/to/llama.cpp

# Alternative: Use Python package
pip install llama-cpp-python

# Test conversion
llmbuilder convert gguf --help
```
**Conversion Failures**
```bash
# Check available conversion scripts
llmbuilder convert gguf model.pt -o test.gguf --verbose

# Try different quantization levels
llmbuilder convert gguf model.pt -o test.gguf -q F16   # Less compression
llmbuilder convert gguf model.pt -o test.gguf -q Q8_0  # Balanced

# Increase timeout for large models
llmbuilder config from-template basic_config -o config.json \
    --override gguf_conversion.conversion_timeout=7200
```
### Configuration Issues
**Validation Errors**
```bash
# Validate configuration with detailed output
llmbuilder config validate my_config.json --detailed

# Common fixes:
# 1. Vocab size mismatch - ensure model.vocab_size matches tokenizer_training.vocab_size
# 2. Sequence length issues - ensure data.max_length <= model.max_seq_length
# 3. Invalid quantization level - use: F32, F16, Q8_0, Q5_1, Q5_0, Q4_1, Q4_0
```
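
The first two checks are easy to reproduce standalone as a quick pre-flight (illustrative; `llmbuilder config validate` remains the authoritative check, and the key names below mirror the rules above):

```python
import json
from pathlib import Path

cfg = json.loads(Path("my_config.json").read_text())

model_vocab = cfg.get("model", {}).get("vocab_size")
tok_vocab = cfg.get("tokenizer_training", {}).get("vocab_size")
if None not in (model_vocab, tok_vocab) and model_vocab != tok_vocab:
    print(f"vocab size mismatch: model={model_vocab}, tokenizer={tok_vocab}")

max_len = cfg.get("data", {}).get("max_length")
seq_len = cfg.get("model", {}).get("max_seq_length")
if None not in (max_len, seq_len) and max_len > seq_len:
    print(f"data.max_length ({max_len}) exceeds model.max_seq_length ({seq_len})")
```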
**Template Issues**
```bash
# List available templates
llmbuilder config templates

# Create from working template
llmbuilder config from-template basic_config -o working_config.json

# Validate before use
llmbuilder config validate working_config.json
```
### Performance Optimization
**Speed Up Processing**
```bash
# Use more workers for I/O bound tasks
llmbuilder data load -i docs/ -o processed.txt --workers 8

# Enable GPU for semantic operations
llmbuilder data deduplicate -i dataset.txt -o clean.txt --use-gpu --batch-size 2000

# Use faster HTML parser
llmbuilder config from-template basic_config -o config.json \
    --override data.ingestion.html_parser=lxml
```
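
The parser override matters because BeautifulSoup delegates parsing to a backend: `html.parser` is pure Python and always available, while `lxml` is a C extension and typically much faster on large documents. A small comparison sketch (assumes `pip install beautifulsoup4 lxml`):

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1><p>Some <b>text</b>.</p></body></html>"

# Same extraction, different backend; output is identical for clean HTML,
# but speed and tolerance of malformed markup differ between parsers.
for parser in ("html.parser", "lxml"):
    soup = BeautifulSoup(html, parser)
    print(parser, "->", soup.get_text(" ", strip=True))
```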
**Reduce Memory Usage**
```bash
# Smaller batch sizes
llmbuilder data load -i docs/ -o processed.txt --batch-size 25

# Disable semantic deduplication for large datasets
llmbuilder data deduplicate -i dataset.txt -o clean.txt --method exact

# Use CPU-optimized configuration
llmbuilder config from-template cpu_optimized_config -o config.json
```
### Debug Mode
**Enable Verbose Logging**
```bash
# Set environment variable
export LLMBUILDER_LOG_LEVEL=DEBUG

# Or use CLI flag
llmbuilder data load -i docs/ -o processed.txt --verbose

# Check processing statistics
llmbuilder data load -i docs/ -o processed.txt --stats
```
### Getting Help
- 📖 **Documentation**: [https://qubasehq.github.io/llmbuilder-package/](https://qubasehq.github.io/llmbuilder-package/)
- 🐛 **Issues**: [GitHub Issues](https://github.com/Qubasehq/llmbuilder-package/issues)
- 💬 **Discussions**: [GitHub Discussions](https://github.com/Qubasehq/llmbuilder-package/discussions)
## Testing (developers)
- Fast tests: `python -m pytest -q tests`
- Slow tests (Windows `cmd`): `set RUN_SLOW=1 && python -m pytest -q tests`
- Performance tests (Windows `cmd`): `set RUN_PERF=1 && python -m pytest -q tests\performance`
## License
Apache-2.0. See the repository for the full license text.