Concept · Rendering

Instanced Rendering

A FloraForge forest can hold thousands of trees, and every one of them is a detailed procedural mesh. Drawn naively — one command per tree — that forest would bury the CPU in overhead before the GPU broke a sweat. Instancing is the technique that collapses it all: describe the tree once, hand the GPU a list of where and how big, and let it stamp out the whole forest in a single command.

The draw-call problem

Every draw command an engine submits carries a fixed cost that has nothing to do with how much it draws. The CPU has to validate the call and record it into a command buffer, and the GPU's front end has to set up state before the first triangle moves. For one big mesh that overhead is noise. For ten thousand tiny meshes it is the workload: the CPU spends the frame dictating commands while the GPU finishes each little tree instantly and waits for the next order.

The waste is especially galling because the trees are nearly identical. Two oaks differ only in position, a turn around their trunk, and a bit of size — yet the naive approach re-submits the full mesh-drawing ceremony for each. What you want to send is the shape once and the differences many times.

One mesh, a list of differences

That is exactly what instanced rendering does. The mesh — vertices, normals, colours — is uploaded to GPU memory a single time, as a prototype. Alongside it goes an instance buffer: a tightly packed array with one small record per copy, holding only what makes that copy unique. One draw call then says "draw this mesh n times", and the GPU walks the record list itself, re-running the mesh through the vertex shader once per record.

[Figure: a prototype mesh (one vertex buffer, thousands of vertices, uploaded once) plus an instance buffer (VertexStepMode::Instance; columns: position (x, y, z), rot, scale, tint; 217 records at 48 bytes per copy) feed one draw call that puts 217 trees on screen, stamped by the GPU, one per record. The mesh crosses the bus once; only the 48-byte records are per-tree.]
The instancing bargain: a detailed prototype mesh is uploaded once, and each copy on screen costs only a small record in the instance buffer. One draw_indexed call replays the mesh once per record.
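To put rough numbers on that bargain, here is a minimal arithmetic sketch. It assumes a 36-byte vertex (three vec3<f32> attributes: position, normal, colour, with no padding) and a hypothetical 4,000-vertex prototype; neither figure is taken from the engine. The 217 trees and the 48-byte record come from the figure. The "baked" case is a hypothetical alternative in which every tree carries its own pre-transformed copy of the mesh, shown only for comparison.

```rust
/// Assumed bytes per mesh vertex: position + normal + colour,
/// each a vec3<f32>, with no padding (an assumption, not engine data).
const VERTEX_BYTES: usize = 3 * 3 * 4;

/// Bytes per instance record, as stated in the figure.
const INSTANCE_BYTES: usize = 48;

/// Hypothetical alternative: every tree stores its own transformed mesh copy.
fn baked_bytes(vertices_per_mesh: usize, trees: usize) -> usize {
    vertices_per_mesh * VERTEX_BYTES * trees
}

/// Instancing: one prototype mesh plus one small record per tree.
fn instanced_bytes(vertices_per_mesh: usize, trees: usize) -> usize {
    vertices_per_mesh * VERTEX_BYTES + trees * INSTANCE_BYTES
}

fn main() {
    let vertices = 4_000; // hypothetical prototype size
    let trees = 217; // from the figure

    println!("baked:     {} bytes", baked_bytes(vertices, trees));
    println!("instanced: {} bytes", instanced_bytes(vertices, trees));
    println!("per-tree cost with instancing: {} bytes", INSTANCE_BYTES);
}
```

Under these assumptions the per-tree records add up to 10,416 bytes, a small fraction of the 144,000-byte prototype they share.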

What an instance record holds

FloraForge's record is deliberately tiny. Here it is, verbatim — the struct that every tree, shrub, house and road sign in the world is positioned by:

src/renderer_wgpu/instancing.rs — one record per copy
#[repr(C)]
#[derive(Clone, Copy, Debug, Zeroable, Pod)]
pub struct InstanceData {
    pub position: [f32; 3],
    pub rotation_y: f32,
    pub scale: [f32; 3],
    pub tilt: f32,
    pub color: [f32; 4],
}

A position in the world, a rotation around the vertical axis, a scale, a lean angle, and an RGBA tint — 48 bytes per copy. The tilt field (which doubles as padding so the colour starts on a 16-byte boundary) leans dead snags a few degrees off vertical; living plants leave it at zero. #[repr(C)] plus the Pod derive guarantee the struct's bytes can be copied straight into a GPU buffer with no translation. The tint is how a shrub billboard gets its species' leaf colour — and how a dead plant gets its weathered grey-brown — while the scale is where the growth stage shows up: seedlings are stamped at 15% size, young plants at 50%, dead snags at 85% and shrinking, from meshes of the same species.
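The 48-byte size and the 16-byte alignment of the colour can be checked with the standard library alone. The sketch below repeats the struct without the Zeroable and Pod derives so that it compiles without external crates; the `stamp` helper and its uniform-scale parameter are hypothetical, added only to show how a growth stage could map onto the scale field.

```rust
use std::mem::{offset_of, size_of};

// Same field order as the engine's record, minus the bytemuck derives.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct InstanceData {
    pub position: [f32; 3],
    pub rotation_y: f32,
    pub scale: [f32; 3],
    pub tilt: f32,
    pub color: [f32; 4],
}

/// Hypothetical helper: build a record with a uniform scale and no lean.
fn stamp(position: [f32; 3], rotation_y: f32, uniform_scale: f32, color: [f32; 4]) -> InstanceData {
    InstanceData {
        position,
        rotation_y,
        scale: [uniform_scale; 3],
        tilt: 0.0, // living plants leave the lean at zero
        color,
    }
}

fn main() {
    // With #[repr(C)] the fields are laid out in declaration order.
    println!("size       = {}", size_of::<InstanceData>());
    println!("rotation_y @ {}", offset_of!(InstanceData, rotation_y));
    println!("scale      @ {}", offset_of!(InstanceData, scale));
    println!("tilt       @ {}", offset_of!(InstanceData, tilt));
    println!("color      @ {}", offset_of!(InstanceData, color));

    // A seedling stamped at 15% size, as described in the text.
    let seedling = stamp([41.2, 12.0, 18.9], 4.71, 0.15, [0.3, 0.6, 0.2, 1.0]);
    println!("{seedling:?}");
}
```

The colour lands at byte offset 32, which is why the single f32 before it works as padding as well as data.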

Two streams into one shader

The GPU is told about both buffers through vertex buffer layouts with different step modes: slot 0 (the mesh) advances per vertex, slot 1 (the records) advances per instance. From the shader's point of view they simply merge into one input struct — the @location numbers carry on from the mesh attributes into the instance attributes:

src/renderer_wgpu/shaders/instanced.wgsl — the merged vertex input
struct VertexInput {
    // Per-vertex (slot 0)
    @location(0) position: vec3<f32>,
    @location(1) normal: vec3<f32>,
    @location(2) vert_color: vec3<f32>,

    // Per-instance (slot 1)
    @location(3) inst_position: vec3<f32>,
    @location(4) inst_rotation_y: f32,
    @location(5) inst_scale: vec3<f32>,
    @location(6) inst_color: vec4<f32>,
};

When the GPU invokes the vertex shader for vertex 412 of copy 87, it has already paired the right mesh vertex with the right instance record. The shader's only job is to apply the record — scale, then spin around Y, then move into place:

src/renderer_wgpu/shaders/instanced.wgsl — placing one copy
let c = cos(input.inst_rotation_y);
let s = sin(input.inst_rotation_y);

// Scale, then rotate around Y, then translate
let scaled = input.position * input.inst_scale;
let rotated = vec3<f32>(
    scaled.x * c - scaled.z * s,
    scaled.y,
    scaled.x * s + scaled.z * c,
);
let world_pos = rotated + input.inst_position;

out.clip_position =
    frame.view_proj_no_translation * vec4<f32>(world_pos - frame.camera_position.xyz, 1.0);

That last line hides a subtlety worth noticing: the position is made camera-relative before projection. FloraForge's world is huge, and 32-bit floats lose precision far from the origin — subtracting the camera first keeps distant trees from jittering. The camera matrix itself arrives via the per-frame uniform described in Bind Groups.
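For readers who want to poke at the maths outside a shader, the placement steps can be mirrored on the CPU. This is a sketch, not engine code: the `Instance` struct, `place_camera_relative` and `f32_spacing` are hypothetical names, and the 100 km distance is an arbitrary illustration rather than a FloraForge world size. The second function measures the gap between neighbouring f32 values, which is the precision loss the camera-relative subtraction is guarding against.

```rust
/// The per-copy fields the placement code reads (hypothetical CPU mirror).
#[derive(Clone, Copy)]
struct Instance {
    position: [f32; 3],
    rotation_y: f32,
    scale: [f32; 3],
}

/// Scale, rotate around Y, translate, then subtract the camera position,
/// in the same order as the shader excerpt above.
fn place_camera_relative(vertex: [f32; 3], inst: Instance, camera: [f32; 3]) -> [f32; 3] {
    let (s, c) = inst.rotation_y.sin_cos();
    let scaled = [
        vertex[0] * inst.scale[0],
        vertex[1] * inst.scale[1],
        vertex[2] * inst.scale[2],
    ];
    let rotated = [
        scaled[0] * c - scaled[2] * s,
        scaled[1],
        scaled[0] * s + scaled[2] * c,
    ];
    [
        rotated[0] + inst.position[0] - camera[0],
        rotated[1] + inst.position[1] - camera[1],
        rotated[2] + inst.position[2] - camera[2],
    ]
}

/// Gap between a positive, finite `x` and the next f32 above it.
fn f32_spacing(x: f32) -> f32 {
    f32::from_bits(x.to_bits() + 1) - x
}

fn main() {
    let inst = Instance {
        position: [10.0, 0.0, 10.0],
        rotation_y: std::f32::consts::FRAC_PI_2,
        scale: [2.0, 2.0, 2.0],
    };
    let p = place_camera_relative([1.0, 2.0, 0.0], inst, [10.0, 0.0, 10.0]);
    println!("camera-relative position: {p:?}");

    // Representable f32 values thin out as magnitudes grow.
    println!("spacing near 1 m:       {} m", f32_spacing(1.0));
    println!("spacing near 100,000 m: {} m", f32_spacing(100_000.0));
}
```

At 100,000 m from the origin, neighbouring f32 values are 0.0078125 m apart, so a coordinate can only move in steps of about 8 mm; values near the camera, after the subtraction, sit in the much denser range close to zero.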

One draw per species per chunk

FloraForge has eight plant species — oak, birch, spruce, willow, acacia, palm, cattail, shrub — and each gets its own prototype mesh, built by the engine's procedural plant generator rather than loaded from an art file (houses are the exception: a single GLB model). Instance records are grouped per species, per chunk, so when a chunk streams out, its vegetation's instance buffer is simply dropped with it. The render loop is then almost embarrassingly short:

src/renderer_wgpu/instanced_pass.rs — the whole forest, drawn
// Draw each species, batching by LOD level to minimise state changes
for (i, key) in self.species_names.iter().enumerate() {
    // …partition this species' visible chunks into near and far…

    // Draw near chunks with hi-res mesh
    if let Some(mesh) = self.models.get(key) {
        pass.set_vertex_buffer(0, mesh.vertex_buffer.slice(..));
        pass.set_index_buffer(mesh.index_buffer.slice(..), wgpu::IndexFormat::Uint32);
        for inst in &near {
            pass.set_vertex_buffer(1, inst.instance_buffer.slice(..));
            pass.draw_indexed(0..mesh.index_count, 0, 0..inst.instance_count);
        }
    }

    // …far chunks repeat the loop with a low-detail LOD mesh…
}

Each draw_indexed call's final argument is the instance range — "this mesh, instance_count times". A chunk with two hundred oaks costs one call. The loop also folds in two classic companions of instancing: chunks more than 512 metres away swap to a cheaper LOD mesh of the same species (dead plants get a third bucket — a bark-only snag mesh cheap enough to serve every distance), and chunks outside the camera's view are skipped entirely by frustum culling — the subject of the next page.
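The near/far partition elided in the listing might look something like the sketch below. It is an illustration under stated assumptions: `ChunkInstances`, its `center` field and `partition_by_lod` are hypothetical names, and measuring distance from the camera to a chunk's centre is a guess at how "more than 512 metres away" is evaluated. Only the 512-metre threshold comes from the text.

```rust
/// LOD switch distance from the text.
const LOD_DISTANCE_M: f32 = 512.0;

/// Hypothetical stand-in for one species' instances in one chunk.
struct ChunkInstances {
    center: [f32; 3],
    instance_count: u32,
}

fn distance(a: [f32; 3], b: [f32; 3]) -> f32 {
    let d = [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
    (d[0] * d[0] + d[1] * d[1] + d[2] * d[2]).sqrt()
}

/// Split chunks into (near, far): far means more than 512 m from the camera.
fn partition_by_lod<'a>(
    chunks: &'a [ChunkInstances],
    camera: [f32; 3],
) -> (Vec<&'a ChunkInstances>, Vec<&'a ChunkInstances>) {
    chunks
        .iter()
        .partition(|c| distance(c.center, camera) <= LOD_DISTANCE_M)
}

fn main() {
    let chunks = [
        ChunkInstances { center: [100.0, 0.0, 0.0], instance_count: 200 },
        ChunkInstances { center: [600.0, 0.0, 0.0], instance_count: 50 },
        ChunkInstances { center: [0.0, 0.0, 512.0], instance_count: 10 },
    ];
    let (near, far) = partition_by_lod(&chunks, [0.0, 0.0, 0.0]);

    // One draw call per chunk, however many copies the chunk holds.
    let draw_calls = near.len() + far.len();
    let trees: u32 = chunks.iter().map(|c| c.instance_count).sum();
    println!("{draw_calls} draw calls for {trees} trees");
    println!("near chunks: {}, far chunks: {}", near.len(), far.len());
}
```

Each bucket would then bind its own mesh once and issue one draw per chunk, which is what keeps the call count tied to visible chunks rather than to trees.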

In the engine
Shrubs take the idea to its extreme. The detailed procedural shrub mesh has 3,500–5,300 vertices, but shrubs are small and numerous, so in the world they're drawn as a 12-vertex crossed-quad billboard whose foliage is painted procedurally in the fragment shader, with the species' leaf colour delivered through the instance record's color field. Same instancing pipeline, roughly 290–440× fewer vertices per copy.