Compute Shaders
Most shaders draw. A compute shader doesn't — it's the GPU's general-purpose mode, where you hand the card a grid of identical little tasks and it chews through thousands of them at once. FloraForge uses one to manufacture its terrain: every hill you fly over was assembled, vertex by vertex, by the compute shader on this page.
A shader without a picture
The previous page described the render pipeline: vertex shader, rasterizer, fragment shader, a conveyor belt whose whole purpose is to end in pixels. A compute shader steps off that belt entirely. There are no vertices, no triangles, no rasterizer, and no framebuffer — just a function, a grid of indices to run it over, and memory to read and write. It is the GPU admitting what it has secretly been all along: a massively parallel computer that happens to be good at graphics.
Because there's no pipeline to define the shape of the work, you define it
yourself. A compute shader declares the size of its workgroup — a
small block of threads that execute together, such as a 16×16 tile — and the
engine then dispatches some number of those workgroups. Each thread
wakes up knowing exactly one thing: its coordinates in the overall grid,
delivered as the built-in global_invocation_id. From that one
three-component index it works out which piece of the problem belongs to it, does the work, writes
the result, and exits. Two attributes wire all of this up in WGSL:
@compute marks the entry point and
@workgroup_size(16, 16) sets the tile shape.
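To make the indexing concrete, here is a small CPU-side sketch in Rust of where a thread's grid coordinates come from: per axis, the workgroup's position times the workgroup size, plus the thread's position inside its tile. The function and its array-based signature are purely illustrative and not part of FloraForge; on the GPU the value simply arrives ready-made in global_invocation_id.

```rust
/// Illustrative helper (an assumption for this sketch, not engine code):
/// rebuilds the value the GPU delivers as `global_invocation_id`.
/// Per axis: global = workgroup_id * workgroup_size + local_id.
fn global_invocation_id(
    workgroup_id: [u32; 3],
    local_id: [u32; 3],
    workgroup_size: [u32; 3],
) -> [u32; 3] {
    std::array::from_fn(|axis| {
        workgroup_id[axis] * workgroup_size[axis] + local_id[axis]
    })
}
```

With 16×16 tiles, the thread at local position (0, 5) inside workgroup (8, 0) lands on grid coordinates (128, 5), which is the last column of a 129-wide grid.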
Why the GPU wins at this
A CPU core is a scholar: deep, clever, expensive, and there are perhaps a dozen of them. A GPU is a stadium of clerks — thousands of simple arithmetic units that all want to execute the same instruction on different data. Work that decomposes into many independent, identical tasks maps onto that hardware almost perfectly, and while one batch of threads waits on a memory read, the scheduler simply runs another batch, hiding the latency that would stall a CPU. The catch is the shape of the problem: threads can't coordinate freely or branch off on long individual adventures. Give the GPU ten thousand small identical jobs and it is unbeatable; give it one big tangled job and it is helpless.
Building a terrain mesh is the textbook good case. Every vertex of the ground is computed by the same recipe from its own little patch of input data, and no vertex needs to know what any other vertex decided.
FloraForge's terrain factory
The world is streamed in chunks: 256-metre squares, each sampled on a 129×129 grid (16,641 points). The CPU side, deep in the procedural generator, produces three flat arrays per chunk: a height, a moisture value, and a river wetness value for every grid point. Turning those raw numbers into a renderable mesh (world positions, surface normals, biome colours) is the compute shader's job, and it happens entirely on the GPU.
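The chunk arithmetic is easy to check by hand, and a few lines of Rust make it explicit. This is a sketch under one stated assumption: that the first and last samples sit exactly on the chunk's edges, so 129 samples span 128 cells. The page does not state the sample spacing, so treat the 2-metre figure that falls out as derived, not quoted. The function names are hypothetical.

```rust
/// Number of sample points in a square chunk with `side` samples per edge.
fn samples_per_chunk(side: u32) -> u32 {
    side * side
}

/// Spacing between neighbouring samples. ASSUMPTION: samples sit on both
/// chunk edges, so there are `side - 1` cells along each edge.
fn cell_size(chunk_metres: f32, side: u32) -> f32 {
    chunk_metres / (side - 1) as f32
}

/// Position of grid point (x, z) in a flat, row-major per-chunk array.
/// This matches the `z * side + x` indexing used by the shader.
fn flat_index(x: u32, z: u32, side: u32) -> u32 {
    z * side + x
}
```

For a 129-sample side this gives 16,641 points per array, and the far corner of the grid sits at flat index 16,640.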
The shader's interface declares the whole arrangement. Compute shaders lean
on storage buffers: big, raw arrays the shader can index
freely, unlike the small fixed-size uniforms drawing shaders prefer (one of which still carries the chunk parameters here). Of the four storage buffers, three
are read-only inputs; the remaining one, marked read_write, is the
vertex buffer being built:
struct ChunkParams {
    origin_x: f32,
    origin_z: f32,
    cell_size: f32,
    side: u32,
};
@group(0) @binding(0) var<uniform> params: ChunkParams;
@group(0) @binding(1) var<storage, read> heights: array<f32>;
@group(0) @binding(2) var<storage, read> moisture: array<f32>;
@group(0) @binding(3) var<storage, read_write> output: array<f32>;
@group(0) @binding(4) var<storage, read> river: array<f32>;
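The bindings above imply some concrete sizes on the host side. The sketch below is an assumption about how a Rust mirror of ChunkParams might look, not the engine's actual code. It also works out how many bytes each buffer needs for a 129×129 chunk, taking the ten-floats-per-vertex layout from the shader.

```rust
use std::mem::size_of;

/// Hypothetical host-side mirror of the WGSL `ChunkParams` struct.
/// `#[repr(C)]` keeps the fields in declaration order, four bytes each.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct ChunkParams {
    origin_x: f32,
    origin_z: f32,
    cell_size: f32,
    side: u32,
}

/// Floats written per vertex by the shader (position, normal, biome, river).
const FLOATS_PER_VERTEX: u64 = 10;

/// Bytes in one input array (heights, moisture or river): one f32 per point.
fn input_buffer_bytes(side: u64) -> u64 {
    side * side * size_of::<f32>() as u64
}

/// Bytes in the output vertex buffer: ten f32 values per point.
fn output_buffer_bytes(side: u64) -> u64 {
    side * side * FLOATS_PER_VERTEX * size_of::<f32>() as u64
}
```

For one chunk that comes to 16 bytes of parameters, 66,564 bytes per input array, and 665,640 bytes of finished vertices.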
And here is the entry point itself, the function the GPU calls 20,736 times
per chunk (9×9 workgroups of 256 threads; 16,641 of those land on real grid points and the rest exit immediately as padding). Each thread converts its grid coordinates into a world position,
estimates the surface normal from its four neighbours' heights, asks a helper
called biome_blend which two biomes this point sits between (so
grass can fade smoothly into desert, rock into snow), and writes its ten
floats into the output array:
@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    let x = id.x;
    let z = id.y;
    let side = params.side;
    if (x >= side || z >= side) {
        return; // padding thread past the grid edge
    }
    let idx = z * side + x;
    let h = heights[idx];
    let m = moisture[idx];
    let world_x = params.origin_x + f32(x) * params.cell_size;
    let world_z = params.origin_z + f32(z) * params.cell_size;
    // …read the 4 neighbouring heights (clamped at chunk edges)…
    let normal = normalize(vec3<f32>(h_l - h_r, params.cell_size * 2.0, h_d - h_u));
    let biome = biome_blend(h, m);
    // Write 10 floats per vertex (position, normal, biome_data, river wetness)
    let base = idx * 10u;
    output[base + 0u] = world_x;
    output[base + 1u] = h;
    output[base + 2u] = world_z;
    // …normal.xyz, biome.xyz, river[idx]…
}
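For readers who want to poke at the maths without a GPU, here is a CPU reference for the position and normal of a single vertex, written in Rust. It is a sketch, not the shader's elided code: the neighbour naming (l = x-1, r = x+1, d = z-1, u = z+1) and the exact clamping are assumptions chosen to fit the normal formula above, and the function name is hypothetical. The biome and river floats are left out because biome_blend is not shown on this page.

```rust
/// CPU reference for one vertex: world position plus a central-difference
/// normal. ASSUMPTIONS: l/r are the x-1/x+1 neighbours, d/u are z-1/z+1,
/// and out-of-range neighbours are clamped to the chunk edge.
fn vertex_at(
    heights: &[f32],
    side: usize,
    x: usize,
    z: usize,
    origin_x: f32,
    origin_z: f32,
    cell_size: f32,
) -> ([f32; 3], [f32; 3]) {
    let at = |xx: usize, zz: usize| heights[zz * side + xx];
    let h = at(x, z);
    let h_l = at(x.saturating_sub(1), z);
    let h_r = at((x + 1).min(side - 1), z);
    let h_d = at(x, z.saturating_sub(1));
    let h_u = at(x, (z + 1).min(side - 1));

    // Same unnormalised vector as the shader, then normalised by hand.
    let n = [h_l - h_r, cell_size * 2.0, h_d - h_u];
    let len = (n[0] * n[0] + n[1] * n[1] + n[2] * n[2]).sqrt();

    let position = [
        origin_x + x as f32 * cell_size,
        h,
        origin_z + z as f32 * cell_size,
    ];
    (position, [n[0] / len, n[1] / len, n[2] / len])
}
```

On perfectly flat ground the normal comes out as straight up, (0, 1, 0). On ground that rises along x, the normal tips back toward negative x, as you would expect.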
Notice what's missing: no loop over the grid. The loop is the
dispatch. On the Rust side, the engine works out how many 16-thread-wide tiles
cover the 129-sample side (div_ceil rounds 129 ÷ 16 up to 9)
and issues a single command:
let mut pass = encoder.begin_compute_pass(&wgpu::ComputePassDescriptor {
    label: Some("terrain-gen-pass"),
    timestamp_writes: None,
});
pass.set_pipeline(&self.pipeline);
pass.set_bind_group(0, &bind_group, &[]);
let wg = (side as u32).div_ceil(16); // 129 → 9
pass.dispatch_workgroups(wg, wg, 1);
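The dispatch arithmetic is worth spelling out, because it explains why the shader needs its bounds check. The sketch below uses only the standard library; the helper names are illustrative rather than taken from the engine.

```rust
/// Threads along one edge of a workgroup, matching @workgroup_size(16, 16).
const TILE: u32 = 16;

/// Workgroups needed to cover `side` samples along one axis, rounded up.
fn workgroups_per_side(side: u32) -> u32 {
    side.div_ceil(TILE)
}

/// Total threads the GPU launches for a square dispatch of that size.
fn threads_launched(side: u32) -> u32 {
    let threads_per_side = workgroups_per_side(side) * TILE;
    threads_per_side * threads_per_side
}

/// Threads that fall outside the grid and return at the bounds check.
fn padding_threads(side: u32) -> u32 {
    threads_launched(side) - side * side
}
```

For a 129-sample side, nine workgroups per axis launch 144 × 144 = 20,736 threads, of which 4,095 are padding. Rounding down to eight workgroups would cover only 128 samples and silently drop the last row and column.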
Born where it lives
The speed of 16,641 parallel vertex computations is only half the win. The
deeper one is location. The output buffer is created with
both STORAGE and VERTEX usage flags — the very
same allocation the compute shader writes is later bound, untouched, as the
vertex buffer the render pipeline draws from. The finished mesh never crosses
the bus to main memory and back; it is born in GPU memory, right where
it will be consumed, every frame, for as long as the chunk is loaded.
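Reusing the buffer works only if the render pipeline reads the ten floats exactly as the compute shader wrote them. The sketch below shows one way the byte layout could be described, assuming the attribute order from the shader's comment (position, normal, biome data, river wetness) with no padding in between. The constant names are hypothetical, and the engine's real vertex layout is not shown on this page.

```rust
/// Floats per vertex, as written by the compute shader.
const FLOATS_PER_VERTEX: usize = 10;

/// Distance in bytes from one vertex to the next.
const STRIDE_BYTES: usize = FLOATS_PER_VERTEX * std::mem::size_of::<f32>();

// ASSUMED attribute offsets within one vertex, in bytes, tightly packed.
const POSITION_OFFSET: usize = 0; // 3 floats
const NORMAL_OFFSET: usize = 12; // 3 floats
const BIOME_OFFSET: usize = 24; // 3 floats
const RIVER_OFFSET: usize = 36; // 1 float

/// Byte position of a given attribute of a given vertex in the shared buffer.
fn attribute_byte_offset(vertex: usize, attribute_offset: usize) -> usize {
    vertex * STRIDE_BYTES + attribute_offset
}
```

Under that assumption each vertex occupies 40 bytes, and the river wetness of the very last vertex (index 16,640) starts at byte 665,636, four bytes before the end of the buffer.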
One more economy hides in plain sight: the shader writes only vertices, never triangles. Because every chunk has the identical 129×129 topology, the list of indices that stitches the grid into triangles is the same for all of them — so the engine builds one shared index buffer (98,304 indices) at startup, and every chunk in the world draws with it. Per chunk, the GPU stores only what's actually unique: the vertices.
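The shared index list is simple enough to sketch in full. This version is a minimal illustration, not the engine's code: the function name, the use of 32-bit indices, and the winding order of each triangle are all assumptions. What it does confirm is the count: 128 × 128 cells, two triangles per cell, three indices per triangle.

```rust
/// Builds the triangle indices for a `side` x `side` grid of vertices stored
/// row-major. ASSUMPTION: the winding order below is arbitrary; a real engine
/// picks whichever order matches its face-culling setup.
fn grid_indices(side: u32) -> Vec<u32> {
    let cells = side - 1;
    let mut indices = Vec::with_capacity((cells * cells * 6) as usize);
    for z in 0..cells {
        for x in 0..cells {
            let i0 = z * side + x; // this cell's first corner
            let i1 = i0 + 1; // next corner along x
            let i2 = i0 + side; // next corner along z
            let i3 = i2 + 1; // diagonal corner
            // Two triangles per grid cell.
            indices.extend_from_slice(&[i0, i2, i1, i1, i2, i3]);
        }
    }
    indices
}
```

Calling grid_indices(129) yields 98,304 indices, and because none of them depends on a chunk's heights, the same list serves every chunk.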
All of this lives in src/renderer_wgpu/terrain_compute.rs and weighs in at about 230
lines.