Taking a Break from Shokken
Not every day is spent writing Kotlin and wrestling with subagents. Sometimes you need to explore other creative outlets. Today, I’m diving into something completely different: training LoRAs using Ostris’s AI toolkit for image and video generation models.
When I first opened this tool, the wall of configuration options was overwhelming. Every toggle, every number field seemed important, but I had no idea what any of them actually did. So I decided to document every single parameter - what it does, why it matters, and what values actually work in practice.
What We’re Actually Training
Before diving into configurations, let’s clarify what we’re doing. We’re not training entire base models - that would require data-center-level resources. Instead, we’re training LoRAs (Low-Rank Adaptations), which are essentially lightweight adapters that modify a base model’s behavior. Think of it as teaching an existing model a new concept without rewriting the entire thing.
The key insight: you’re training the transformer component, not the text encoder. This distinction matters for VRAM optimization and quality settings.
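To make the LoRA idea concrete, here’s a minimal sketch of what a LoRA layer does, in PyTorch. This is illustrative only - not the toolkit’s actual implementation - but it shows why a LoRA is tiny compared to the model it adapts:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update:
    y = Wx + (alpha / r) * B(A(x)). Only A and B train, which is why
    a LoRA is a few hundred MB next to a multi-GB base model."""
    def __init__(self, base: nn.Linear, rank: int = 32, alpha: float = 32.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)        # base weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B
        nn.init.zeros_(self.up.weight)         # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```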
Nerdy Details
Job Configuration
GPU Selection: The toolkit only supports single-GPU training. No distributed training across multiple GPUs. If you have multiple GPUs, you can pause training and switch between them based on availability, but you’re always limited to one at a time.
Trigger Words: This is critical and often misunderstood. The trigger word is what activates your LoRA during inference. It must be unique - not something already in the base model’s vocabulary. If you use “person” as your trigger word and the base model already knows “person”, you’ll get conflicting signals during inference. The model won’t know if you want the base concept or your trained concept. Use something distinctive like “ohwx” or “sks” - nonsense tokens that definitely don’t exist in the base model.
Model Architecture and Quantization
Quantization Strategy:
- Text Encoder: Always use float8. Since we’re not training it, there’s no gradient or optimizer-state pressure on VRAM here, so you can take the highest-quality quantization option without any downside.
- Transformer: This is where VRAM constraints bite. Float8 is ideal, but if you can’t fit it:
- 6-bit is the practical lower limit for quality
- 4-bit causes noticeable degradation
- 3-bit with ARA (Accuracy Recovery Adapters) is your last resort
The ARA adapter is fascinating - it’s a specialty LoRA trained by Ostris himself, based on research showing that precision lost to aggressive quantization can be partially recovered. It’s not perfect (3-bit + ARA will never match float8), but it makes otherwise impossible training feasible on consumer GPUs.
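The reason quantization dominates this conversation is simple arithmetic. For a transformer in Qwen Image’s 20B-parameter class, weight storage alone looks like this (rough numbers; real usage is higher once you add activations, gradients, and the LoRA itself):

```python
# Back-of-the-envelope weight footprint for a ~20B-parameter transformer.
params = 20e9
for name, bits in [("bf16", 16), ("float8", 8), ("6-bit", 6),
                   ("4-bit", 4), ("3-bit", 3)]:
    print(f"{name:>6}: ~{params * bits / 8 / 1e9:.1f} GB of weights")
# float8: ~20 GB -- already brushing the ceiling of a 24 GB card
# 3-bit:  ~7.5 GB -- feasible, but you need ARA to claw quality back
```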
Target Type and Linear Rank
LoRA vs LoKr: Stick with LoRA for compatibility. LoKr (low-rank Kronecker product) is theoretically more efficient but has limited tooling support. ComfyUI recently added LoKr support, but why risk compatibility issues for marginal gains?
Linear Rank (The Complexity Dial):
- Higher rank = more parameters = more nuanced adaptation
- Also means more VRAM usage and higher overfitting risk
- Community wisdom:
- 32 for most character LoRAs
- 16 for Qwen Image specifically (it’s a 20B parameter model, doesn’t need as much adaptation complexity)
- Higher ranks risk capturing unwanted details (backgrounds, lighting, camera characteristics)
The overfitting problem is real: train with rank 128 on studio photos, and your LoRA might memorize the studio lighting, background, and lens distortion along with the person’s face. You want the person, not the photography setup.
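Rank translates directly into trainable parameter count: each adapted linear layer of width d gains two matrices, A (d×r) and B (r×d). A quick estimate - the dimensions and layer count below are illustrative stand-ins, not Qwen Image’s real shapes:

```python
def lora_params(rank: int, d: int = 3072, n_layers: int = 120) -> int:
    # Two matrices per adapted layer: A is (d x rank), B is (rank x d).
    return n_layers * 2 * rank * d

for r in (16, 32, 128):
    print(f"rank {r:>3}: ~{lora_params(r) / 1e6:.0f}M trainable parameters")
# rank 128 has 8x the capacity of rank 16 -- plenty of room to memorize
# the studio lighting along with the face.
```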
Training Parameters Deep Dive
Batch Size and Gradient Accumulation: Unless you have an H100, batch size stays at 1. But we can simulate larger batch sizes through gradient accumulation. Setting gradient accumulation to 4 achieves similar quality to batch_size=4, just roughly 4x slower, since the four images are processed sequentially rather than in parallel. The optimizer sees multiple images’ worth of gradients before updating weights, improving training stability.
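Here’s what accumulation looks like in a stripped-down PyTorch loop - a toy sketch, not the toolkit’s code (the learning rate and weight decay defaults it uses are covered next):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)   # stand-in for the transformer being adapted
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
data = [(torch.randn(1, 8), torch.randn(1, 1)) for _ in range(16)]

accum_steps = 4           # simulate batch_size=4 on a batch_size=1 budget
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps  # average over the virtual batch
    loss.backward()       # gradients accumulate in .grad across iterations
    if (step + 1) % accum_steps == 0:
        optimizer.step()  # one weight update per 4 images
        optimizer.zero_grad()
```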
Learning Rate (The Speed Control): Default is 1e-4. Higher rates train faster but risk overshooting optimal weights. Think of it like this: with too high a learning rate, you might jump past the ideal weight values and oscillate around them without converging. The community generally agrees:
- 1e-4 is safe and reliable
- 2e-4 works for faster training if you monitor carefully
- 4e-4 and above is playing with fire
Weight Decay (Regularization Technique #1): Applies penalties to large weights to prevent overfitting. It dampens strong signals that might otherwise dominate the model. Default 1e-4 works for most cases - rarely needs adjustment unless you’re diagnosing specific overfitting issues.
Time Step Management (The Secret Sauce)
This is where things get sophisticated. Diffusion models work by removing noise over multiple steps:
- High noise period (early steps): Composition, overall structure
- Medium noise period (middle steps): Features, characteristics, details
- Low noise period (final steps): Textures, fine details
Time Step Type:
- Sigmoid: Bell curve distribution focusing on medium noise. Perfect for character LoRAs where you want facial features without affecting composition.
- Linear: Even distribution across all noise levels. Better for style LoRAs or special cases.
- Shift/Weighted: Undocumented, avoid these.
Time Step Bias: Further refines where training focuses:
- Balanced: No bias
- High noise: Shifts focus toward early steps (composition)
- Low noise: Shifts focus toward final steps (details)
For character LoRAs, sigmoid + balanced is the sweet spot. You want the model to learn faces during the medium noise period when features are being defined, not during composition or final texture refinement.
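In flow-matching training, “sigmoid” timestep sampling typically means a logit-normal draw: sample from a normal distribution and squash it through a sigmoid, which naturally concentrates timesteps in the medium-noise band. A sketch of the idea - the toolkit’s exact distribution and bias handling may differ:

```python
import torch

def sample_timesteps(n: int, kind: str = "sigmoid", bias: float = 0.0):
    """Timesteps in (0, 1), where 1 = pure noise. 'sigmoid' clusters
    around t=0.5 (medium noise); 'linear' is uniform. Shifting the mean
    before the sigmoid biases sampling toward high or low noise."""
    if kind == "sigmoid":
        return torch.sigmoid(torch.randn(n) + bias)   # logit-normal
    return torch.rand(n)                              # uniform

t = sample_timesteps(100_000)
in_band = ((t > 0.25) & (t < 0.75)).float().mean().item()
print(f"{in_band:.0%} of samples fall in the medium-noise band")  # ~73%
```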
Memory Optimization Techniques
Low VRAM Mode: Not magical memory creation, but intelligent offloading. Moves components not currently needed out of VRAM. Essential for consumer GPUs.
Text Encoder Optimizations:
- Unload TE: Completely removes text encoder from VRAM, ignores captions, only uses trigger word
- Cache Text Embeddings: Pre-computes caption embeddings, caches them, frees VRAM while preserving caption information
The tradeoff: cached embeddings can’t be dynamic. But unless you’re doing fancy prompt augmentation during training, caching is a pure win.
Cache Latents: Pre-converts all images to latent space representations. Trades disk/RAM for GPU compute. Useful for weak GPUs, but modern cards handle on-the-fly conversion fine.
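Both caching tricks follow the same pattern: run the expensive frozen component once, store the outputs, then evict the component from VRAM. A conceptual sketch with stand-in modules (the real VAE and text encoder are obviously not single linear layers):

```python
import torch
import torch.nn as nn

vae_encoder = nn.Linear(3 * 64 * 64, 16)   # stand-in for the real VAE
text_encoder = nn.Linear(77, 512)          # stand-in for the real TE
dataset = [(f"img_{i}.png", torch.randn(3 * 64 * 64), torch.randn(77))
           for i in range(4)]

cache = {}
with torch.no_grad():                      # frozen components, no gradients
    for path, image, caption in dataset:
        cache[path] = {
            "latent": vae_encoder(image),        # image -> latent, once
            "embedding": text_encoder(caption),  # caption -> embedding, once
        }
del vae_encoder, text_encoder              # free the VRAM for training
# The training loop now reads from `cache` instead of re-encoding.
```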
Advanced Regularization Techniques
Exponential Moving Average (EMA): Maintains a moving average of weights in memory, dampening sudden changes and reducing overfitting. The catch: requires keeping two complete copies of the model in VRAM. If you have the memory, use it (0.9995 decay is standard). Most of us don’t have that luxury.
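The EMA update itself is trivial - the cost is entirely the shadow copy of the weights. One line per parameter, roughly:

```python
import torch

@torch.no_grad()
def ema_update(ema_params, params, decay: float = 0.9995):
    """Blend the live weights into the shadow copy after each optimizer
    step. The shadow copy is what you export: it lags the live weights,
    smoothing over any single noisy update."""
    for ema_p, p in zip(ema_params, params):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)
```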
Differential Output Preservation: The nuclear option for overfitting prevention. During training, generates inference outputs and compares them against the base model, then applies dampening to overly strong signals. Computationally expensive and VRAM hungry. Theoretical benefit is high, practical use is limited by resource constraints.
Regularization Datasets: Feed the model examples of what you’re NOT training. Training a person? Include a dataset of other people to maintain the model’s ability to generate variety. Doubles your dataset preparation work and training time. Most hobbyists skip this.
Dataset Configuration
Resolution Buckets: The model creates buckets for different resolutions (512, 768, 1024, etc.). Images are automatically scaled to fit the nearest bucket. Higher resolutions = better quality but more VRAM. If you’re VRAM constrained, drop to 768 - the LoRA will still work at higher resolutions during inference, just with less captured detail.
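The bucketing itself is straightforward: scale each image so it fits the nearest bucket while preserving aspect ratio. A simplified sketch - real bucketing also snaps both sides to multiples the model accepts (e.g. 64 px):

```python
def nearest_bucket(width: int, height: int,
                   buckets=(512, 768, 1024)) -> tuple[int, int]:
    # Match the longer side to the closest bucket size, keep aspect ratio.
    target = min(buckets, key=lambda b: abs(max(width, height) - b))
    scale = target / max(width, height)
    return round(width * scale), round(height * scale)

print(nearest_bucket(4032, 3024))   # phone photo -> (1024, 768)
```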
Augmentation Options:
- X-flip: Horizontal mirroring. Dangerous for faces (asymmetry matters)
- Y-flip: Vertical flipping. Almost never useful unless training abstract concepts
- Generally avoid both unless you really understand your use case
Caption Dropout Rate: Randomly drops captions during training to prevent over-reliance on text conditioning. Only works with dynamic text embeddings (which you probably can’t use with low VRAM settings). Most users leave this at 0.
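Mechanically, caption dropout is just a coin flip per training sample (sketch):

```python
import random

def maybe_drop_caption(caption: str, dropout_rate: float = 0.05) -> str:
    # With probability dropout_rate, train this step unconditioned so the
    # model doesn't become completely dependent on text conditioning.
    return "" if random.random() < dropout_rate else caption
```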
Sampling Configuration
Sample Every: Default 250 steps. The model pauses training to generate sample images, letting you monitor progress. Essential for catching overfitting early.
Sampler Settings:
- Always use flow_match (DDPM is outdated)
- Guidance scale depends on your base model (check documentation)
- 25 sampling steps is reasonable for monitoring
- Walk seed for deterministic comparisons between checkpoints
Practical VRAM Management Strategy
For a typical 24GB consumer GPU training Qwen Image character LoRA:
- Enable Low VRAM mode
- Use cached text embeddings (not unload TE)
- Set transformer quantization to float8 if possible, 6-bit if not
- Linear rank 16-32
- Batch size 1 with gradient accumulation 4
- Skip EMA and differential output preservation
- Train at 768 or 1024 resolution
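Pulled together, that recipe looks something like this as a config sketch. The key names here are descriptive placeholders, not ai-toolkit’s exact schema - map them onto the UI’s actual fields:

```python
# Hypothetical key names; values follow the recommendations above.
config_24gb_qwen_character = {
    "low_vram": True,
    "cache_text_embeddings": True,         # keep captions, free the TE
    "unload_text_encoder": False,
    "text_encoder_quantization": "float8",
    "transformer_quantization": "float8",  # drop to 6-bit if it won't fit
    "linear_rank": 16,                     # 16-32 for Qwen Image characters
    "batch_size": 1,
    "gradient_accumulation": 4,
    "timestep_type": "sigmoid",
    "timestep_bias": "balanced",
    "learning_rate": 1e-4,
    "weight_decay": 1e-4,
    "use_ema": False,                      # no VRAM budget for a shadow copy
    "diff_output_preservation": False,
    "resolution": [768, 1024],
}
```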
Training Timeline and Checkpointing
Steps and Convergence:
- 2000 steps: Minimum for basic convergence
- 3000 steps: Conservative starting point
- 5000-10000 steps: Extended training for refinement
Checkpoint Strategy: Save every 250-500 steps, keep at least 4 checkpoints. Each checkpoint can be 200-500MB for rank 32, so plan storage accordingly. Set high step counts (10000) but monitor samples - you might find the 3500-step checkpoint better than the 7000-step one.
The Overfitting Spectrum
Overfitting isn’t binary - it’s a spectrum:
- Underfitted: Doesn’t capture the concept well
- Ideal: Captures concept while maintaining flexibility
- Slightly overfitted: Captures concept plus some unwanted consistencies
- Severely overfitted: Basically memorized training data
Multiple tools combat overfitting at different levels:
- Linear rank (structural complexity)
- Weight decay (signal dampening)
- EMA (temporal smoothing)
- Differential output preservation (comparative regularization)
- Regularization datasets (negative examples)
- Caption dropout (text conditioning variance)
Production Training Workflow
- Prepare dataset: 20-50 high-quality images for character LoRA
- Caption accurately: Describe what you want the model to learn
- Configure conservatively: Start with community defaults
- Monitor samples: Check every 250-500 steps
- Identify optimal checkpoint: Often not the final one
- Test in inference: Verify LoRA triggers correctly and maintains flexibility
Common Pitfalls and Solutions
Problem: LoRA doesn’t trigger. Solution: check trigger word uniqueness and make sure it’s actually in your inference prompt.
Problem: LoRA affects the entire image, not just the subject. Solution: reduce linear rank, switch the time step type to sigmoid, increase weight decay.
Problem: Training doesn’t fit in VRAM. Solution: degrade progressively: cache embeddings → reduce rank → lower quantization → smaller resolution → 3-bit + ARA.
Problem: Training takes forever. Solution: gradient accumulation might be set too high; reduce it to trade a little quality for speed.
Performance Metrics and Expectations
On RTX 4090 training Qwen Image character LoRA:
- ~3-5 minutes per 100 steps at rank 32, 1024px
- ~2-3 hours for full 3000-step training
- 15-20GB VRAM usage with optimizations
- 5-10GB disk space for checkpoints
The Creator’s Achievement
Ostris built this as a solo project that millions now use. It’s a masterclass in practical tool development - not perfect, but incredibly functional. The tool abstracts tremendous complexity while exposing necessary controls. The balance between accessibility and power is nearly perfect.
The Bigger Picture
This detour into AI image generation isn’t really a detour. Understanding different development ecosystems, tools, and paradigms makes you a better developer. The precision required for training configurations mirrors the precision needed for app architecture. The VRAM constraints teach resource management. The overfitting problems parallel over-engineering in code.
Plus, sometimes you just need to do something fun. Not everything has to be about shipping the next feature or fixing the next bug. Creative exploration keeps the passion alive.
Next week, back to Shokken and whatever fresh hell Gradle has prepared for us.