How to Train an Ideogram 4 LoRA

Mickael

mediapixel team

Published Jun 28, 2026. Revised Jun 28, 2026.

Training an Ideogram 4 LoRA locally is possible on a 12GB GPU, but the workflow needs to be realistic.

This article documents the low-VRAM route I use with AI Toolkit, the official Ideogram 4 FP8 model, structured JSON captions, and validation inside ComfyUI.

The goal is not to build a perfect universal training recipe. The goal is more practical: show a workflow that can actually run on hardware such as an RTX 3060 12GB, without pretending that every training setup needs a 24GB or 48GB GPU.

To keep the examples public and easy to understand, I will use two generic game-art concepts, plus a combined test:

a sci-fi military buggy LoRA;
a green armored supersoldier LoRA;
a final validation prompt that combines both concepts in one scene.

The example concepts are original training examples for this tutorial. They are not official assets and are not affiliated with any existing game franchise.

Note: AI-generated images are not a replacement for 3D work, but they can be useful during pre-production. They can help place an asset inside a believable environment, explore design variations, test different visual styles, or create blueprint-like references before modeling. Used this way, AI becomes a fast concept and iteration tool for 3D asset creation.

Hardware and settings

The reference setup for this workflow is:

GPU: NVIDIA RTX 3060 12GB
Resolution: 768 px
Batch size: 1
Gradient accumulation: 1
LoRA rank / alpha: 16 / 16
Learning rate: 5e-5
Optimizer: adamw8bit
Training dtype: bf16
Save dtype: float16
Text encoder training: disabled
Sampling during training: disabled

On this setup, a validated Ideogram 4 FP8 training run reached roughly 11 to 12 seconds per step at 768 px.

That does not mean every dataset will behave exactly the same way. But it gives a useful baseline: with the right configuration, local Ideogram 4 LoRA training is not limited to very high-end GPUs.

Tools used

The main tools are:

AI Toolkit for LoRA training;
Ideogram 4 FP8 as the base model for training;
ComfyUI for inference and validation;
Comfy-Org Ideogram 4 repackaged files for the ComfyUI side;
Qwen3-VL 8B FP8 as the text encoder in the ComfyUI workflow;
PowerShell for launching and monitoring the training job.

Useful links:

AI Toolkit: https://github.com/ostris/ai-toolkit
Ideogram 4 FP8: https://huggingface.co/ideogram-ai/ideogram-4-fp8
ComfyUI Ideogram 4 files: https://huggingface.co/Comfy-Org/Ideogram-4
ComfyUI Qwen3-VL files: https://huggingface.co/Comfy-Org/Qwen3-VL

Before using the model, check the license and access conditions on the official Hugging Face page. At the time of writing, the Ideogram 4 FP8 repository requires accepting the model gate before downloading the files.

Download the tutorial files

All companion files for this tutorial are available in the GitHub assets repository:

Download the Ideogram 4 LoRA tutorial files

The folder includes:

ai-toolkit/config/mp_scifi_buggy_aitk_500steps.yaml
The main low-VRAM AI Toolkit config used for the buggy LoRA example.
ai-toolkit/config/mp_power_armor_rank16_aitk_800steps.yaml
A stronger rank 16 power armor LoRA config, useful when the armor is the main concept.
ai-toolkit/config/mp_power_armor_rank8_aitk_800steps.yaml
A lighter rank 8 power armor LoRA config, useful when stacking it with another LoRA.
scripts/run_mp_scifi_buggy_aitk_500steps.ps1
A PowerShell launcher that runs AI Toolkit and writes a timestamped log file.
workflows/ideogram4_lora_test_workflow.json
A ComfyUI workflow for testing the trained Ideogram 4 LoRA.

The example paths are Windows paths and assume this local structure:

C:\ai-toolkit
C:\ai-toolkit\models\ideogram-4-fp8
C:\ideogram4_lora_training\mp_scifi_buggy\images
C:\ideogram4_lora_training\mp_scifi_buggy\output_aitk

If your folders are different, update the paths in the YAML config and PowerShell script before running the training.

Dataset strategy

For this tutorial, I would not start with a huge mixed dataset.

A small focused LoRA is easier to debug than a large broad one. If a LoRA is supposed to learn a vehicle, the dataset should consistently show that vehicle idea. If a LoRA is supposed to learn an armored character, the dataset should consistently show that character idea.

For the public examples, I use two triggers:

mp_scifi_buggy
mp_power_armor

The first trigger is for a rugged sci-fi off-road buggy inspired by classic game vehicle design.
The second trigger is for a futuristic armored supersoldier.

A practical starting point for a small concept LoRA is:

6 to 12 images
768 px training resolution
rank 16 / alpha 16
batch size 1
500 to 1000 steps for a focused concept test

For a broader LoRA, you can increase the dataset size and steps. But for this tutorial, small focused concepts are easier to explain and evaluate.

Recommended dataset structure

A clean dataset should use matching image and caption files.

Example:

ideogram4_lora_training/
├─ mp_scifi_buggy/
│  ├─ images/
│  │  ├─ buggy_0001.png
│  │  ├─ buggy_0001.txt
│  │  ├─ buggy_0002.png
│  │  ├─ buggy_0002.txt
│  │  └─ ...
│  ├─ validation_prompts/
│  │  └─ test_scifi_buggy.json
│  └─ output_aitk/
├─ mp_power_armor/
│  ├─ images/
│  │  ├─ armor_0001.png
│  │  ├─ armor_0001.txt
│  │  ├─ armor_0002.png
│  │  ├─ armor_0002.txt
│  │  └─ ...
│  ├─ validation_prompts/
│  │  └─ test_power_armor.json
│  └─ output_aitk/
└─ shared_validation/
   └─ test_buggy_and_armor_scene.json

Each .png has a matching .txt file with the same base name.

The .txt file contains raw JSON. It should not contain Markdown fences.
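
The pairing and raw-JSON rules above are easy to break by accident, so a quick check before training can save a failed run. The following is a minimal sketch, not part of AI Toolkit: the function name check_dataset is hypothetical, and it assumes .png images with .txt captions in one folder, as in the structure shown above.

```python
import json
from pathlib import Path


def check_dataset(image_dir: str, trigger: str) -> list[str]:
    """Return a list of human-readable problems found in a caption dataset."""
    problems = []
    for image_path in sorted(Path(image_dir).glob("*.png")):
        # Each image must have a caption with the same base name.
        caption_path = image_path.with_suffix(".txt")
        if not caption_path.exists():
            problems.append(f"{image_path.name}: missing caption file")
            continue
        text = caption_path.read_text(encoding="utf-8")
        # Captions must be raw JSON, not JSON wrapped in a Markdown fence.
        if text.lstrip().startswith("```"):
            problems.append(f"{caption_path.name}: starts with a Markdown fence")
            continue
        try:
            json.loads(text)
        except json.JSONDecodeError as err:
            problems.append(f"{caption_path.name}: invalid JSON ({err.msg})")
            continue
        # The trigger token should appear somewhere in the caption.
        if trigger not in text:
            problems.append(f"{caption_path.name}: trigger '{trigger}' not found")
    return problems
```

Calling check_dataset(r"C:\ideogram4_lora_training\mp_scifi_buggy\images", "mp_scifi_buggy") returns an empty list when every image has a valid caption that contains the trigger.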

Example caption: sci-fi buggy

In the .txt file, this should be stored as raw JSON:

{
  "high_level_description": "A detailed studio render of a futuristic sci-fi military buggy. The vehicle is an olive-green armored off-road combat buggy with a low angular body, open side cockpit, heavy black roll cage, oversized rugged tires, reinforced suspension, and a rear-mounted rotating gun turret. The design features modular armor panels, black mechanical undercarriage parts, red suspension springs, front tow hooks, compact headlights, military-style decals, and a practical all-terrain silhouette, mp_scifi_buggy.",
  "style_description": {
    "aesthetics": "sci-fi military vehicle reference, mp_scifi_buggy",
    "lighting": "soft studio lighting with realistic reflections",
    "medium": "3D render",
    "art_style": "realistic hard-surface sci-fi vehicle render"
  },
  "compositional_deconstruction": {
    "background": "A neutral gray studio backdrop with soft lighting, a color calibration chart, and small material reference spheres in the upper-left corner, keeping the focus on the sci-fi buggy model.",
    "elements": [
      {
        "type": "obj",
        "desc": "The mp_scifi_buggy is shown in a front three-quarter view, emphasizing its low angular silhouette, wide wheelbase, and armored military design."
      },
      {
        "type": "obj",
        "desc": "The olive-green armored body has sloped front panels, modular side plating, black lower trim, and subtle military-style white decals on the side panels."
      },
      {
        "type": "obj",
        "desc": "Four oversized black off-road tires with deep tread support the vehicle, giving it a rugged all-terrain stance."
      },
      {
        "type": "obj",
        "desc": "The front section includes compact rectangular headlights, black tow hooks, exposed suspension parts, and red coil springs visible beneath the armored hood."
      },
      {
        "type": "obj",
        "desc": "The cockpit is partially open, with dark frame bars, visible seats, and a protective black roll cage integrated into the vehicle body."
      },
      {
        "type": "obj",
        "desc": "A rear-mounted rotating turret sits above the cockpit area, fitted with a compact rotary machine gun and an arcing ammunition belt feed."
      },
      {
        "type": "obj",
        "desc": "The render highlights realistic hard-surface modeling with beveled armor edges, panel seams, latch details, bolts, dark glass, rubber textures, and matte metal surfaces."
      }
    ]
  }
}

JSON captions for Ideogram 4

This guide assumes that you are already familiar with the JSON prompt structure used in my previous article:

JSON Prompts for Ideogram 4 in ComfyUI

Here, the same structure is used for LoRA captions: one global description, one style block, one background description, and one or more element descriptions.

Ideogram 4 responds well to structured JSON captions. For this workflow, I use the same general structure for training captions and validation prompts:

{
  "high_level_description": "...",
  "style_description": {
    "aesthetics": "...",
    "lighting": "...",
    "medium": "...",
    "art_style": "...",
    "color_palette": ["#...", "#...", "#..."]
  },
  "compositional_deconstruction": {
    "background": "...",
    "elements": [
      {
        "type": "obj",
        "desc": "..."
      }
    ]
  }
}

For training captions, the most important rule is consistency.

The trigger token should appear naturally in the caption. It can appear in high_level_description, in an element description, or both.

For example:

mp_scifi_buggy

should always refer to the same visual concept: a rugged sci-fi off-road buggy with a clear silhouette, large wheels, military body panels, visible suspension, and open-frame construction.

The caption should not simply say:

A cool sci-fi vehicle.

That is too vague.

It should describe what the model needs to learn:

A rugged sci-fi off-road buggy with oversized tires, open-frame roll cage, military body panels, elevated suspension, front bumper, and compact battlefield vehicle proportions, mp_scifi_buggy.
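
Writing every caption by hand makes it easy to drift away from the shared structure. As a rough sketch, captions can be assembled from a small helper so that every file has the same keys and always carries the trigger. The helper name build_caption and the short example strings are illustrative assumptions, not part of any tool used in this article.

```python
import json


def build_caption(trigger, description, style, background, elements):
    """Assemble one training caption in the structure shown above."""
    return {
        # The trigger is appended to the global description for consistency.
        "high_level_description": f"{description}, {trigger}.",
        "style_description": dict(style),
        "compositional_deconstruction": {
            "background": background,
            "elements": [{"type": "obj", "desc": desc} for desc in elements],
        },
    }


caption = build_caption(
    trigger="mp_scifi_buggy",
    description="A rugged sci-fi off-road buggy with oversized tires and an open-frame roll cage",
    style={"lighting": "soft studio lighting", "medium": "3D render"},
    background="A neutral gray studio backdrop.",
    elements=["The mp_scifi_buggy is shown in a front three-quarter view."],
)

# Raw JSON text, ready to be written to the matching .txt file (no Markdown fences).
caption_text = json.dumps(caption, indent=2)
print(caption_text)
```

The descriptions still have to be written per image; the helper only keeps the layout and the trigger placement consistent.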

Installing AI Toolkit

Prerequisites

Before running the installation commands, make sure these tools are installed on Windows:

Windows Terminal or PowerShell
Used to run the commands.
Git for Windows
Required for git clone and git submodule update.
Python 3.10 or newer
Required for python -m venv venv and for running AI Toolkit.
During installation, enable Add python.exe to PATH.
NVIDIA GPU driver
Required for CUDA/PyTorch GPU acceleration.

Optional but useful if dependency installation fails:

Microsoft Visual Studio Build Tools
CMake

After installation, check that Git and Python are available:

git --version
python --version

If both commands return a version number, you can continue with the installation.

Example Windows installation:

cd C:\
git clone https://github.com/ostris/ai-toolkit.git ai-toolkit
cd C:\ai-toolkit
git submodule update --init --recursive
python -m venv venv
.\venv\Scripts\python.exe -m pip install --upgrade pip
.\venv\Scripts\pip.exe install -r requirements.txt

AI Toolkit evolves quickly. If the official README changes, follow the current official installation instructions first, then keep the low-VRAM principles from this article.

If python is not recognized, try:

py --version

If py works but python does not, Python is installed but not correctly added to the PATH.

Downloading Ideogram 4 FP8

Install the Hugging Face CLI:

cd C:\ai-toolkit
.\venv\Scripts\pip.exe install -U huggingface_hub

Then download the gated model after accepting the license conditions on Hugging Face:

.\venv\Scripts\huggingface-cli.exe download ideogram-ai/ideogram-4-fp8 --local-dir C:\ai-toolkit\models\ideogram-4-fp8

The expected local structure is:

C:\ai-toolkit\models\ideogram-4-fp8
C:\ai-toolkit\models\ideogram-4-fp8\text_encoder

In the AI Toolkit config, the model block will point to that folder:

model:
  name_or_path: "C:/ai-toolkit/models/ideogram-4-fp8"
  arch: "ideogram4"

Downloading the ComfyUI files

For inference and validation in ComfyUI, use the Comfy-Org repackaged files.

Download the files from the Comfy-Org repositories:

https://huggingface.co/Comfy-Org/Ideogram-4
https://huggingface.co/Comfy-Org/Qwen3-VL

Place the files here:

ComfyUI\models\diffusion_models\ideogram4_fp8_scaled.safetensors
ComfyUI\models\diffusion_models\ideogram4_unconditional_fp8_scaled.safetensors
ComfyUI\models\text_encoders\qwen3vl_8b_fp8_scaled.safetensors
ComfyUI\models\vae\flux2-vae.safetensors

With Stability Matrix, the equivalent paths may look like this:

I:\StabilityMatrix\Models\DiffusionModels
I:\StabilityMatrix\Models\TextEncoders
I:\StabilityMatrix\Models\VAE

Low-VRAM AI Toolkit config

Here is the essential part of the AI Toolkit config used for the mp_scifi_buggy example.

Save the YAML config file as:

C:\ai-toolkit\config\mp_scifi_buggy_aitk_500steps.yaml

You can duplicate it for another LoRA by changing:

config.name
training_folder
trigger_word
dataset folder_path
optional metadata fields such as meta.dataset and meta.trigger_token

job: extension
config:
  name: "mp_scifi_buggy_aitk"
  process:
    - type: "sd_trainer"
      training_folder: "C:/ideogram4_lora_training/mp_scifi_buggy/output_aitk"
      device: cuda:0
      trigger_word: "mp_scifi_buggy"

      network:
        type: "lora"
        linear: 16
        linear_alpha: 16

      save:
        dtype: float16
        save_every: 250
        max_step_saves_to_keep: 10
        save_format: "safetensors"
        push_to_hub: false

      datasets:
        - folder_path: "C:/ideogram4_lora_training/mp_scifi_buggy/images"
          caption_ext: "txt"
          caption_dropout_rate: 0.05
          shuffle_tokens: false
          cache_latents_to_disk: true
          cache_text_embeddings: true
          resolution: [768]

      train:
        batch_size: 1
        gradient_accumulation_steps: 1
        steps: 500
        train_unet: true
        train_text_encoder: false
        gradient_checkpointing: true
        cache_text_embeddings: true
        disable_sampling: true
        noise_scheduler: "flowmatch"
        optimizer: "adamw8bit"
        lr: 0.0001
        dtype: bf16

      model:
        name_or_path: "C:/ai-toolkit/models/ideogram-4-fp8"
        arch: "ideogram4"
        quantize: true
        quantize_te: true
        low_vram: true
        layer_offloading: true
        layer_offloading_text_encoder_percent: 95
        layer_offloading_transformer_percent: 95
        model_kwargs:
          text_encoder_path: "C:/ai-toolkit/models/ideogram-4-fp8/text_encoder"
          max_text_length: 2048
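
The duplication step described before the config (change config.name, training_folder, trigger_word, and the dataset folder_path) can be sketched on a parsed config. This is only an illustration: the dictionary below is a trimmed stand-in for the YAML, the helper derive_config is hypothetical, and the name mp_power_armor_aitk is an assumed example rather than the name used in the companion files.

```python
import copy


def derive_config(base, name, trigger, root):
    """Copy a parsed config and change only the per-LoRA fields."""
    new = copy.deepcopy(base)  # never mutate the original config
    process = new["config"]["process"][0]
    new["config"]["name"] = name
    process["trigger_word"] = trigger
    process["training_folder"] = f"{root}/output_aitk"
    process["datasets"][0]["folder_path"] = f"{root}/images"
    return new


# Trimmed stand-in for the parsed YAML above (only the fields that change).
buggy = {
    "job": "extension",
    "config": {
        "name": "mp_scifi_buggy_aitk",
        "process": [{
            "type": "sd_trainer",
            "training_folder": "C:/ideogram4_lora_training/mp_scifi_buggy/output_aitk",
            "trigger_word": "mp_scifi_buggy",
            "datasets": [{"folder_path": "C:/ideogram4_lora_training/mp_scifi_buggy/images"}],
        }],
    },
}

armor = derive_config(
    buggy,
    name="mp_power_armor_aitk",  # hypothetical run name
    trigger="mp_power_armor",
    root="C:/ideogram4_lora_training/mp_power_armor",
)
print(armor["config"]["process"][0]["training_folder"])
```

Everything else, including the network, train, and model blocks, is carried over unchanged by the deep copy.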

The most important low-VRAM settings are:

batch_size: 1
train_text_encoder: false
gradient_checkpointing: true
cache_text_embeddings: true
cache_latents_to_disk: true
quantize: true
quantize_te: true
low_vram: true
layer_offloading: true
disable_sampling: true

These settings are useful because:

batch_size: 1 keeps VRAM usage predictable.
train_text_encoder: false avoids training the large Qwen text encoder.
gradient_checkpointing: true reduces memory usage during backpropagation.
cache_text_embeddings: true avoids recomputing text embeddings after the first cache pass.
cache_latents_to_disk: true avoids repeatedly encoding images.
quantize: true and quantize_te: true reduce memory pressure.
layer_offloading: true allows parts of the model to move between CPU and GPU.
disable_sampling: true prevents validation sampling from slowing down or blocking the training run.
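
When a config has been copied and edited a few times, one of these switches can quietly get lost. A small check like the one below can catch that before a long run. It is a sketch under assumptions: it works on an already parsed process entry (a plain dictionary with the same keys as the YAML), and low_vram_issues is a hypothetical helper, not an AI Toolkit function.

```python
# Expected values taken from the list of low-VRAM settings above.
LOW_VRAM_EXPECTED = {
    "train": {
        "batch_size": 1,
        "train_text_encoder": False,
        "gradient_checkpointing": True,
        "cache_text_embeddings": True,
        "disable_sampling": True,
    },
    "model": {
        "quantize": True,
        "quantize_te": True,
        "low_vram": True,
        "layer_offloading": True,
    },
}


def low_vram_issues(process):
    """List settings in one process entry that differ from the low-VRAM values."""
    issues = []
    for section, expected in LOW_VRAM_EXPECTED.items():
        actual = process.get(section, {})
        for key, value in expected.items():
            if actual.get(key) != value:
                issues.append(f"{section}.{key}: expected {value}, found {actual.get(key)}")
    # cache_latents_to_disk is set per dataset, not in the train block.
    for index, dataset in enumerate(process.get("datasets", [])):
        if dataset.get("cache_latents_to_disk") is not True:
            issues.append(f"datasets[{index}].cache_latents_to_disk: expected True")
    return issues
```

An empty result means the entry matches the low-VRAM list; it says nothing about whether the run will actually fit in 12 GB.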

For a focused LoRA with around 6 to 12 images, 500 steps is a reasonable first test value.

A useful way to think about training length is:

steps per image = total training steps / number of training images

For example:

500 steps / 13 images  = about 38 steps per image
1000 steps / 13 images = about 77 steps per image
1250 steps / 13 images = about 96 steps per image
1500 steps / 13 images = about 115 steps per image

For a small object-focused LoRA like mp_scifi_buggy, 500 steps is a good first checkpoint, while 1000 to 1500 steps can produce a much stronger result.
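
The steps-per-image arithmetic above is simple enough to script, which helps when comparing several dataset sizes. The two helper names below are hypothetical; the numbers reproduce the 13-image example.

```python
def steps_per_image(total_steps: int, image_count: int) -> float:
    """Average number of training steps spent on each image."""
    if image_count <= 0:
        raise ValueError("image_count must be positive")
    return total_steps / image_count


def steps_for_target(image_count: int, target_steps_per_image: int) -> int:
    """Total steps needed to reach a chosen steps-per-image ratio."""
    return image_count * target_steps_per_image


for steps in (500, 1000, 1250, 1500):
    ratio = round(steps_per_image(steps, 13))
    print(f"{steps} steps / 13 images = about {ratio} steps per image")
```

Going the other way, steps_for_target(10, 80) gives 800 total steps for a 10-image dataset trained at 80 steps per image.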

For a general style or broad concept LoRA with many images, you usually do not need the same steps-per-image ratio. A broad LoRA should stay more flexible, while a small object LoRA can be trained more aggressively.

As a rough guide:

General LoRA with many images      : 15-30 steps per image
Focused object or outfit LoRA      : 60-120 steps per image
Very specific character/object LoRA: 80-150 steps per image
Quick diagnostic run               : 50-250 total steps

Training too long can cause overfitting. Common signs are:

the LoRA becomes too rigid
composition variety decreases
the same angles or poses repeat too often
small details from the dataset appear where they should not
generated images become too close to the training references

To estimate training time, first measure your average seconds per step.

On my RTX 3060 12 GB setup, this AI Toolkit + Ideogram 4 FP8 configuration runs at roughly:

11-12 seconds per step at 768 px

Then use:

training time in minutes = steps * seconds_per_step / 60
training time in hours   = steps * seconds_per_step / 3600

At 11.5 seconds/step, the approximate times are:

250 steps  = about 48 minutes
500 steps  = about 1 h 36
1000 steps = about 3 h 12
1500 steps = about 4 h 48
3000 steps = about 9 h 35

In practice, add some margin for model loading, latent caching, text embedding caching, checkpoint saving, and restarts.
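
The same formula can be wrapped in a small function to plan a run from a measured step time. This sketch only covers pure step time, so the margin for loading, caching, and saving still has to be added on top; training_time is a hypothetical helper name.

```python
def training_time(steps: int, seconds_per_step: float) -> str:
    """Format the estimated pure training time, rounded to the nearest minute."""
    total_minutes = round(steps * seconds_per_step / 60)
    hours, minutes = divmod(total_minutes, 60)
    if hours == 0:
        return f"{minutes} min"
    return f"{hours} h {minutes:02d}"


for steps in (250, 500, 1000, 1500, 3000):
    print(f"{steps} steps = about {training_time(steps, 11.5)}")
```

Replacing 11.5 with the average seconds per step measured on another machine gives the equivalent estimate for that setup.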

For this setup, training in blocks of 250 steps is comfortable: it is short enough to compare results often, but long enough to produce visible changes.

Other GPUs will not scale perfectly linearly, because this workflow also depends on CPU offloading, disk caching, PyTorch/CUDA versions, driver versions, and how much of the model stays in VRAM.

Still, for this exact low-VRAM Ideogram 4 FP8 setup, a practical rough estimate would be:

RTX 3060 12 GB       : 11-12 sec/step measured on my setup
RTX 4070 SUPER 12 GB : ~6-8 sec/step rough estimate
RTX 5070 12 GB       : ~4.5-6.5 sec/step rough estimate
RTX 5070 Ti 16 GB    : ~3.5-5.5 sec/step rough estimate

At those estimated speeds, training time would look approximately like this:

GPU                    250 steps    500 steps    1000 steps        1500 steps        3000 steps
RTX 3060 12 GB         ~48 min      ~1 h 36      ~3 h 12           ~4 h 48           ~9 h 35
RTX 4070 SUPER 12 GB   ~25-33 min   ~50-67 min   ~1 h 40-2 h 13    ~2 h 30-3 h 20    ~5 h-6 h 40
RTX 5070 12 GB         ~19-27 min   ~38-54 min   ~1 h 15-1 h 48    ~1 h 53-2 h 42    ~3 h 45-5 h 25
RTX 5070 Ti 16 GB      ~15-23 min   ~29-46 min   ~58 min-1 h 32    ~1 h 27-2 h 18    ~2 h 55-4 h 35

These are not official Ideogram 4 LoRA benchmarks. They are planning estimates.

For a real run, always measure your own average seconds per step after the first cache pass, then calculate:

training time in minutes = steps * seconds_per_step / 60
training time in hours   = steps * seconds_per_step / 3600

The first run may look slower because AI Toolkit has to cache latents and text embeddings. Once caching is done, the step time becomes more representative.

A 16 GB card, such as an RTX 5070 Ti, may also feel more comfortable than a 12 GB card because it can reduce some VRAM pressure. However, if the config still uses aggressive low-VRAM offloading, the speed gain may be smaller than the raw GPU specifications suggest.

PowerShell launcher

A launcher script makes training easier to repeat and log.

Example:

$ErrorActionPreference = "Stop"

$aiToolkitDir = "C:\ai-toolkit"
$configPath = "C:\ai-toolkit\config\mp_scifi_buggy_aitk_500steps.yaml"
$outputDir = "C:\ideogram4_lora_training\mp_scifi_buggy\output_aitk\mp_scifi_buggy_aitk"

New-Item -ItemType Directory -Force -Path $outputDir | Out-Null

$timestamp = Get-Date -Format "yyyyMMdd_HHmmss"
$logPath = Join-Path $outputDir "train_mp_scifi_buggy_${timestamp}.log"

Push-Location $aiToolkitDir
try {
    $previousErrorActionPreference = $ErrorActionPreference
    $ErrorActionPreference = "Continue"

    & .\venv\Scripts\python.exe .\run.py $configPath *> $logPath
    $exitCode = $LASTEXITCODE

    $ErrorActionPreference = $previousErrorActionPreference

    if ($exitCode -ne 0) {
        Write-Host "AI Toolkit exited with code $exitCode."
        Write-Host "Log: $logPath"
        exit $exitCode
    }

    Write-Host "Training completed successfully."
    Write-Host "Log: $logPath"
}
finally {
    Pop-Location
}

Run it from PowerShell:

cd C:\ideogram4_lora_training
.\run_mp_scifi_buggy_aitk_500steps.ps1

Monitoring the training

Follow the latest log:

Get-ChildItem C:\ideogram4_lora_training\mp_scifi_buggy\output_aitk\mp_scifi_buggy_aitk\*.log |
  Sort-Object LastWriteTime -Descending |
  Select-Object -First 1 |
  Get-Content -Wait -Tail 80

Monitor the GPU:

nvidia-smi -l 5

Healthy signs:

GPU usage is high during training steps;
loss changes but stays finite;
no NaN;
no CUDA out of memory;
.safetensors checkpoints appear at the configured intervals;
optimizer.pt appears in the output folder.

Problem signs:

loss becomes NaN;
training crashes during checkpoint saving;
CPU/CUDA mismatch errors appear;
no checkpoint is saved despite passing the save interval;
validation images are completely unrelated to the prompt, which often means an inference or model config issue rather than a dataset issue.
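
Two of these problem signs, NaN loss and CUDA out of memory, can be searched for in the log text. The sketch below is a rough filter rather than a parser: the exact log format of AI Toolkit is not assumed, the patterns are plain substring-style matches that may produce false positives, and scan_log is a hypothetical helper.

```python
import re

# Loose patterns; they match text, not a specific AI Toolkit log format.
PROBLEM_PATTERNS = {
    "NaN loss": re.compile(r"\bnan\b", re.IGNORECASE),
    "CUDA out of memory": re.compile(r"CUDA out of memory", re.IGNORECASE),
}


def scan_log(lines):
    """Return (line_number, label, text) for each line matching a problem pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in PROBLEM_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.strip()))
    return hits


# Illustrative lines only; real log output will look different.
sample = [
    "step 10 loss: 0.412",
    "step 11 loss: nan",
    "torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB",
]
for number, label, text in scan_log(sample):
    print(f"line {number}: {label}: {text}")
```

To scan a real log, pass the lines of the opened file to scan_log. Logs written by the redirection operator in Windows PowerShell 5.1 are typically UTF-16, so the file may need to be opened with encoding="utf-16" instead of UTF-8.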

Resuming a training run

AI Toolkit can resume a run if the training metadata is still present.

Keep these unchanged in the YAML file:

same training_folder;
same config.name;
same base model;
same architecture;
same rank and alpha;
same optimizer family;
same output folder.

To extend a run, increase:

steps: 750

or:

steps: 1000

Then relaunch the same job.

Do not change rank or alpha when resuming a LoRA. If you need to change the network structure, start a new run.
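
Before relaunching, the edited config can be compared with the one used for the original run. The following sketch assumes both configs are already parsed into dictionaries with the same layout as the YAML shown earlier. The helper names are hypothetical, and the optimizer comparison is stricter than "same optimizer family" because it requires an identical value.

```python
# Fields of one process entry that the list above says to keep unchanged.
MUST_MATCH = [
    ("training_folder",),
    ("model", "name_or_path"),
    ("model", "arch"),
    ("network", "linear"),
    ("network", "linear_alpha"),
    ("train", "optimizer"),  # stricter than "same optimizer family"
]


def _lookup(mapping, path):
    """Follow a tuple of keys through nested dicts; return None if a key is missing."""
    value = mapping
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def resume_blockers(old_cfg, new_cfg):
    """Return dotted names of settings that changed but should stay the same."""
    blockers = []
    if old_cfg["config"]["name"] != new_cfg["config"]["name"]:
        blockers.append("config.name")
    old_proc = old_cfg["config"]["process"][0]
    new_proc = new_cfg["config"]["process"][0]
    for path in MUST_MATCH:
        if _lookup(old_proc, path) != _lookup(new_proc, path):
            blockers.append(".".join(path))
    return blockers
```

Changing only train.steps from 500 to 750 returns an empty list, while changing network.linear from 16 to 8 returns ["network.linear"].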

Testing the LoRA in ComfyUI

The validation method should be simple and repeatable.

Use the same prompt and seed, then compare several LoRA strengths:

no LoRA;
LoRA at 0.7;
LoRA at 1.0;
LoRA at 1.5;
sometimes 1.7 or 2.0 for a very small specialized LoRA.

In ComfyUI:

load ideogram4_fp8_scaled.safetensors on the conditional model path used by the sampler;
load ideogram4_unconditional_fp8_scaled.safetensors if your workflow uses it;
load qwen3vl_8b_fp8_scaled.safetensors as the text encoder;
load flux2-vae.safetensors;
apply the LoRA on the model path used by the sampler.

The important LoRA parameter is usually:

strength_model

Since the text encoder was not trained, strength_clip can usually stay at its default value.

In my tests, a properly trained LoRA should usually produce a visible and useful effect around 1.0.

Going above 1.0 is possible, but it does not always improve the image. Higher strengths can make the concept stronger, but they can also reduce visual quality, make the output more rigid, distort details, or push the result too close to the training references.

A practical range is:

0.7-1.0 : subtle or balanced effect
1.0     : good default value for a correctly trained LoRA
1.2-1.5 : stronger concept influence
1.7-2.0 : useful only for some small specialized LoRAs, but visually riskier

For a well-trained general LoRA, I usually expect 1.0 to be enough.

For a focused object LoRA, such as mp_scifi_buggy, values between 1.0 and 1.5 may help if the object is not strong enough. For a very small or very specific LoRA, 1.7 or even 2.0 can be worth testing, but I would not treat those values as the default.

When comparing strengths, do not only check whether the concept appears. Also check whether the image still looks good:

Are the shapes cleaner or more distorted?
Does the composition still follow the prompt?
Are details becoming too repetitive?
Does the image become too close to the training dataset?
Are faces, small parts, wheels, hands, or mechanical details degrading?

The best LoRA strength is not always the strongest one. It is the value where the learned concept appears clearly while the image still remains flexible, natural, and visually usable.
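
To keep the comparison repeatable, it helps to write the test plan down before generating anything. The sketch below only builds a list of runs that share one prompt file and one seed; it does not talk to ComfyUI. The helper name, the seed value 1234, and the output naming scheme are assumptions, and a strength of 0.0 is used as a stand-in for the no-LoRA baseline.

```python
def strength_test_plan(prompt_file, seed, strengths=(0.0, 0.7, 1.0, 1.5)):
    """One run per strength, all sharing the same prompt and seed."""
    return [
        {
            "prompt_file": prompt_file,
            "seed": seed,
            "strength_model": strength,
            # Encode seed and strength in the file name for side-by-side review.
            "output_name": f"seed{seed}_lora{strength:.2f}.png",
        }
        for strength in strengths
    ]


plan = strength_test_plan("test_scifi_buggy.json", seed=1234)
for run in plan:
    print(run["output_name"], run["strength_model"])
```

For a very small specialized LoRA, passing strengths=(0.0, 1.0, 1.5, 1.7, 2.0) extends the same plan to the higher values discussed above.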

Validation prompt: sci-fi buggy

{
  "high_level_description": "A dynamic cinematic render of a rugged sci-fi off-road combat buggy racing across a dusty desert battlefield while armored soldiers fire a mounted machine gun, mp_scifi_buggy.",
  "style_description": {
    "aesthetics": "readable sci-fi military vehicle action scene, polished hard-surface rendering, production-friendly game asset design, tactical convoy energy",
    "lighting": "warm late-afternoon sunlight with dust-diffused highlights, clear readable vehicle forms, dramatic but controlled shadows",
    "medium": "digital vehicle rendering",
    "art_style": "AAA sci-fi vehicle action render with realistic hard-surface construction, oversized rugged tires, exposed suspension, protective roll cage, modular armor panels, open cockpit, rear-mounted machine gun turret, dust trails, and strong off-road motion",
    "color_palette": ["#556B3A", "#2F3437", "#8A7B5A", "#C9A15A", "#111827", "#D8C7A1"]
  },
  "compositional_deconstruction": {
    "background": "A wide rocky desert training ground with beige cliffs, dry scrub, drifting dust, distant industrial sci-fi outpost structures, and a pale sky. The environment supports the sense of speed while keeping the vehicle silhouette easy to read.",
    "elements": [
      {
        "type": "obj",
        "desc": "The mp_scifi_buggy dominates the foreground in a fast three-quarter side view, angled slightly toward the viewer. It has olive-green angular armor panels, a low armored body, oversized dusty off-road tires, exposed suspension, tubular front guard rails, an open side cockpit, black roll cage, and a practical combat utility silhouette."
      },
      {
        "type": "obj",
        "desc": "A helmeted driver sits low inside the open cockpit, gripping the steering controls while the buggy bounces over uneven terrain. The driver is partially framed by the roll cage and windshield canopy."
      },
      {
        "type": "obj",
        "desc": "A rear armored gunner stands braced behind the cockpit, operating a heavy mounted machine gun turret fixed to the rear deck. The gunner leans into the recoil while firing a bright muzzle flash toward the left side of the scene."
      },
      {
        "type": "obj",
        "desc": "A second seated soldier rides inside the side cockpit area, holding onto the roll cage and looking toward the direction of fire. The soldiers reinforce the vehicle scale without distracting from the mp_scifi_buggy."
      },
      {
        "type": "obj",
        "desc": "Dust bursts from the oversized tires, with small rocks, tire tracks, and motion haze behind the buggy to emphasize speed and aggressive off-road movement."
      }
    ]
  }
}

What to check:

Does the vehicle have a consistent buggy silhouette?
Are the large wheels present?
Is the roll cage visible?
Does the design remain a vehicle, not a tank or spaceship?
Is the body construction more stable with the LoRA than without it?

Without LoRA (left); LoRA at 250 steps, strength 1.0 (right)

LoRA at 500 steps (left) and at 1000 steps (right); the 1000-step result looks best

LoRA at 1250 steps (left) and at 1500 steps (right)

Validation prompt: armored supersoldier

{
  "high_level_description": "A polished full-body studio image of a futuristic armored supersoldier, mp_power_armor.",
  "style_description": {
    "aesthetics": "heroic sci-fi armored character, polished game-art illustration, clean standalone figure",
    "lighting": "soft studio lighting with subtle reflections across armor surfaces",
    "medium": "digital illustration",
    "art_style": "AAA game character design with a strong armored silhouette, heavy plating, clean mechanical shapes, full helmet, reflective gold visor, reinforced shoulder armor, plated chest, armored gloves, and heavy boots",
    "color_palette": ["#4B5F36", "#1F2933", "#B08D38", "#0F172A", "#9CA3AF"]
  },
  "compositional_deconstruction": {
    "background": "A plain soft light gray studio background with subtle shading.",
    "elements": [
      {
        "type": "obj",
        "desc": "A single futuristic supersoldier in green powered armor standing centered in a neutral full-body pose, with a reflective gold visor, reinforced shoulder armor, plated chest armor, armored gloves, heavy boots, and military sci-fi design language, mp_power_armor."
      },
      {
        "type": "obj",
        "desc": "The figure fills most of the frame from helmet to boots, with a clean readable silhouette against the simple background."
      }
    ]
  }
}

One of the reference images / no LoRA / 500 steps / 600 steps / 800 steps

Here, the 800-step LoRA gives a satisfying result. Note that the trigger keyword can be optional in the final prompt.

Rank choice: solo LoRA vs stackable LoRA

The LoRA rank controls how much capacity the LoRA has to learn and push a concept.

A higher rank is not always better. It depends on how you want to use the LoRA.

Rank 16: stronger solo or main-concept LoRA

Use rank 16 when the LoRA is meant to be the main subject or the main visual identity of the image.

This is useful for:

a specific vehicle
a detailed armor design
a complex character outfit
a style that should strongly dominate the image
a LoRA used mostly by itself

In your training YAML file:

network:
  type: "lora"
  linear: 16
  linear_alpha: 16

A rank 16 LoRA can learn more detail, but it can also push the model harder. When two rank 16 LoRAs are stacked, the image may become overconstrained or visually degraded.

Possible symptoms:

dirty textures
strange color contamination
watermark-like artifacts
over-sharpened details
reduced composition stability
both LoRAs fighting for the same visual space

Rank 8: lighter stackable LoRA

Use rank 8 when the LoRA is meant to support another LoRA rather than dominate the whole image.

This is useful for:

a secondary character
a reusable outfit
an accessory
a style helper
a concept that will often be stacked with another LoRA

In your training YAML file:

network:
  type: "lora"
  linear: 8
  linear_alpha: 8

A rank 8 LoRA may learn slightly less micro-detail than rank 16, but it often blends better with other LoRAs. For multi-concept scenes, this can produce a cleaner final image.

In practice:

Main object or character LoRA       -> rank 16
Secondary stackable concept LoRA    -> rank 8
General style helper LoRA           -> rank 8 or 16 depending on strength
Highly detailed solo concept LoRA   -> rank 16

Example: vehicle + armor

For a scene with a sci-fi buggy and a powered-armor soldier:

mp_scifi_buggy  : rank 16, used as the main vehicle concept
mp_power_armor  : rank 8, used as a stackable character concept

This can be cleaner than stacking:

mp_scifi_buggy rank 16 + mp_power_armor rank 16

because two strong LoRAs can over-bias the model at the same time, especially when they share similar visual language: military sci-fi, green metal, hard-surface armor, weapons, dust, and battlefield lighting.

Suggested test strengths:

buggy rank 16  @ 0.65-0.80
armor rank 8   @ 0.55-0.80

If both LoRAs are rank 16, start lower:

buggy rank 16  @ 0.45-0.65
armor rank 16  @ 0.35-0.60

The goal is not to make each LoRA as strong as possible.
The goal is to make the combined image clean, readable, and controllable.
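
The suggested ranges above can be turned into a small grid of strength pairs to test. This is a sketch of one way to organize the tests, not a tuning method: the ranges are copied from the text, while the helper names and the choice of three evenly spaced values per LoRA are assumptions.

```python
# Suggested starting ranges from the text above, keyed by (buggy_rank, armor_rank).
STACK_RANGES = {
    (16, 8): {"buggy": (0.65, 0.80), "armor": (0.55, 0.80)},
    (16, 16): {"buggy": (0.45, 0.65), "armor": (0.35, 0.60)},
}


def sweep(low, high, count=3):
    """Evenly spaced strengths from low to high, inclusive, rounded to 2 decimals."""
    if count < 2:
        return [round(low, 2)]
    step = (high - low) / (count - 1)
    return [round(low + i * step, 2) for i in range(count)]


def stack_grid(buggy_rank, armor_rank, count=3):
    """All (buggy_strength, armor_strength) pairs to test for a given rank pair."""
    ranges = STACK_RANGES[(buggy_rank, armor_rank)]
    return [
        (buggy, armor)
        for buggy in sweep(*ranges["buggy"], count)
        for armor in sweep(*ranges["armor"], count)
    ]


for buggy, armor in stack_grid(16, 16):
    print(f"buggy @ {buggy:.2f}, armor @ {armor:.2f}")
```

With the default of three values per LoRA, each rank pair gives nine combinations, which is usually enough to see where the combined image starts to degrade.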

Final validation: combining both LoRAs

For the final test, load both LoRAs and adjust their strengths. I trained both of mine at rank 16, so the lower stacked-strength ranges apply. The two LoRAs are:

mp_scifi_buggy
mp_power_armor

Start with moderate strengths:

mp_scifi_buggy (main LoRA): 0.7
mp_power_armor: 0.35

Final combined prompt:

{
    "high_level_description": "A cinematic sci-fi battlefield production still showing a rugged military buggy drifting across rocky terrain while an armored supersoldier drives through dust and debris.",
    "style_description": {
        "aesthetics": "cinematic sci-fi action scene, heroic military atmosphere, premium live-action game trailer look, dynamic motion, strong silhouette clarity",
        "lighting": "dramatic late-afternoon battlefield lighting with dusty air, warm highlights, soft atmospheric haze, and crisp reflections on armor and vehicle surfaces",
        "photo": "Shot like a high-budget sci-fi film still on a 35mm anamorphic lens, low tracking angle, sharp focus on the buggy and armored driver, subtle motion blur on dust and wheels, realistic lens contrast, slight cinematic grain, high dynamic range",
        "medium": "cinematic photograph",
        "color_palette": ["#4B5F36", "#2F3437", "#B08D38", "#111827", "#6B7280", "#D1D5DB"]
    },
    "compositional_deconstruction": {
        "background": "A wide rocky battlefield with uneven ground, scattered debris, light dust clouds, sparse alien vegetation, distant cliffs, and a dramatic cloudy sky. The scene is dusty and cinematic, with clear open space around the main vehicle silhouette.",
        "elements": [
            {
                "type": "obj",
                "bbox": [260, 90, 850, 900],
                "desc": "A rugged sci-fi military buggy dominates the lower middle of the frame in a dynamic three-quarter front view, angled from lower left toward upper right as it drifts over rough terrain. It has large off-road tires, visible suspension, angular military body panels, an open-frame roll cage, a strong front bumper, armored side panels, and a wide all-terrain combat silhouette. Dust kicks outward from the rear wheels, emphasizing speed and weight."
            },
            {
                "type": "obj",
                "bbox": [336, 378, 624, 583],
                "desc": "A futuristic armored supersoldier is seated inside the buggy in a combat driving pose, clearly visible above the open cockpit. The driver grips the controls with armored gloves and wears green powered armor with a reflective gold visor, reinforced shoulder plates, heavy chest armor, dark undersuit gaps, and a compact heroic silhouette. The helmet and upper torso are readable without covering the vehicle shape."
            },
            {
                "type": "obj",
                "bbox": [170, 547, 409, 828],
                "desc": "A rear-mounted heavy machine gun turret rises above the back deck of the buggy, attached to a sturdy rotating base behind the cockpit. It has a long barrel assembly, dark mechanical housing, protective shield-like side plates, visible mounting hardware, and a short ammunition belt feed. The turret reads as part of the vehicle, not as a separate floating weapon."
            },
            {
                "type": "obj",
                "bbox": [610, 40, 940, 980],
                "desc": "A sweeping dust trail and scattered battlefield debris curve behind the buggy, following the vehicle's diagonal motion across the rocky ground. Small stones, tire marks, and pale dust clouds add speed and cinematic energy while keeping the buggy, armored driver, and rear turret clearly readable."
            }
        ]
    }
}

Final thoughts

The biggest lesson from this workflow is simple: the dataset and captions matter as much as the training settings.

For Ideogram 4, structured JSON captions are especially useful because they let you describe the image in a way that matches how the model understands composition: global description, style, background, and elements.

On a 12GB GPU, the low-VRAM settings also matter. The workflow becomes practical when you keep the batch size small, avoid text encoder training, use FP8, enable offloading, cache what can be cached, and validate the LoRA later in ComfyUI instead of generating samples during the run.

Small specialized LoRAs are also easier to debug than large mixed datasets. A vehicle LoRA and a character LoRA can be tested separately, then combined in one final image to see whether the concepts remain stable together.

This is not the only way to train an Ideogram 4 LoRA, but it is a practical route that can run locally on modest hardware.