Compute Shaders | FloraForge Concepts

A shader without a picture

The previous page described the render pipeline: vertex shader, rasterizer, fragment shader, a conveyor belt whose whole purpose is to end in pixels. A compute shader steps off that belt entirely. There are no vertices, no triangles, no rasterizer, and no framebuffer — just a function, a grid of indices to run it over, and memory to read and write. It is the GPU admitting what it has secretly been all along: a massively parallel computer that happens to be good at graphics.

Because there's no pipeline to define the shape of the work, you define it yourself. A compute shader declares the size of its workgroup (a small block of threads that execute together, such as a 16×16 tile), and the engine then dispatches some number of those workgroups. Each thread wakes up knowing exactly one thing: its coordinates in the overall grid, delivered as the built-in global_invocation_id, a three-component vector of unsigned integers. From those coordinates it works out which piece of the problem belongs to it, does the work, writes the result, and ends. Two attributes wire all of this up in WGSL: @compute marks the entry point and @workgroup_size(16, 16) sets the tile shape.
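To make the coordinate bookkeeping concrete, here is a minimal CPU-side sketch in Rust of how a thread's global coordinates follow from its workgroup's position in the dispatch and its own position inside the tile. The function and constant names are illustrative only and are not taken from the engine; the formula is the one WGSL defines for global_invocation_id, reduced here to two dimensions.

```rust
/// Tile shape, matching @workgroup_size(16, 16) in the shader.
const WORKGROUP_SIZE: (u32, u32) = (16, 16);

/// Combine a workgroup's position in the dispatch grid with a thread's
/// position inside that workgroup. This mirrors the WGSL definition:
/// global_invocation_id = workgroup_id * workgroup_size + local_invocation_id.
fn global_invocation_id(workgroup_id: (u32, u32), local_id: (u32, u32)) -> (u32, u32) {
    (
        workgroup_id.0 * WORKGROUP_SIZE.0 + local_id.0,
        workgroup_id.1 * WORKGROUP_SIZE.1 + local_id.1,
    )
}
```

For example, the thread at local position (3, 5) in workgroup (2, 1) receives global coordinates (35, 21).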

Why the GPU wins at this

A CPU core is a scholar: deep, clever, expensive, and there are perhaps a dozen of them. A GPU is a stadium of clerks — thousands of simple arithmetic units that all want to execute the same instruction on different data. Work that decomposes into many independent, identical tasks maps onto that hardware almost perfectly, and while one batch of threads waits on a memory read, the scheduler simply runs another batch, hiding the latency that would stall a CPU. The catch is the shape of the problem: threads can't coordinate freely or branch off on long individual adventures. Give the GPU ten thousand small identical jobs and it is unbeatable; give it one big tangled job and it is helpless.

Building a terrain mesh is the textbook good case. Every vertex of the ground is computed by the same recipe from its own little patch of input data, and no vertex needs to know what any other vertex decided.

FloraForge's terrain factory

The world is streamed in chunks: 256-metre squares, each sampled on a 129×129 grid (16,641 points). The CPU side, deep in the procedural generator, produces three flat arrays per chunk, holding a height, a moisture value, and a river wetness value for every grid point. Turning those raw numbers into a renderable mesh (world positions, surface normals, biome colours) is the compute shader's job, and it happens entirely on the GPU.

One chunk, one dispatch. The 129×129 sample grid doesn't divide evenly into 16×16 tiles, so the engine rounds up to 9×9 workgroups and the shader's first act is a bounds check.

The shader's interface declares the whole arrangement. Compute shaders lean on storage buffers: big, raw arrays the shader can index freely, unlike the small fixed-size uniforms drawing shaders prefer. Four are declared here, alongside one small uniform for the chunk parameters. Three are read-only inputs (heights, moisture and river); the remaining one, marked read_write, is the vertex buffer being built:

src/renderer_wgpu/shaders/terrain_gen.wgsl — the shader's inputs and output

struct ChunkParams {
    origin_x: f32,
    origin_z: f32,
    cell_size: f32,
    side: u32,
};

@group(0) @binding(0) var<uniform> params: ChunkParams;
@group(0) @binding(1) var<storage, read> heights: array<f32>;
@group(0) @binding(2) var<storage, read> moisture: array<f32>;
@group(0) @binding(3) var<storage, read_write> output: array<f32>;
@group(0) @binding(4) var<storage, read> river: array<f32>;
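On the host side, the uniform and the storage buffers need matching sizes. The following is a rough Rust sketch of that bookkeeping, not the engine's actual code: the Rust ChunkParams mirror, the helper names, and the assumption that the 129 samples span the chunk edge to edge (giving a 2-metre cell) are all illustrative.

```rust
use std::mem::size_of;

/// Hypothetical host-side mirror of the WGSL ChunkParams struct.
/// #[repr(C)] keeps the fields in declaration order: 3 x f32 + 1 x u32 = 16 bytes.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct ChunkParams {
    origin_x: f32,
    origin_z: f32,
    cell_size: f32,
    side: u32,
}

const CHUNK_METRES: f32 = 256.0;
const SIDE: u32 = 129;
const FLOATS_PER_VERTEX: usize = 10;

/// Assumption: samples run edge to edge, so 129 points give 128 cells.
fn params_for_chunk(origin_x: f32, origin_z: f32) -> ChunkParams {
    ChunkParams {
        origin_x,
        origin_z,
        cell_size: CHUNK_METRES / (SIDE - 1) as f32,
        side: SIDE,
    }
}

/// Bytes in one input array (heights, moisture or river): one f32 per point.
fn input_buffer_bytes(side: usize) -> usize {
    side * side * size_of::<f32>()
}

/// Bytes in the output vertex buffer: ten f32 values per point.
fn output_buffer_bytes(side: usize) -> usize {
    side * side * FLOATS_PER_VERTEX * size_of::<f32>()
}
```

Under these assumptions each input array is about 65 KiB per chunk and the finished vertex buffer about 650 KiB.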

And here is the entry point itself, the function the GPU calls 20,736 times per chunk (144×144 threads, of which 16,641 pass the bounds check). Each surviving thread converts its grid coordinates into a world position, estimates the surface normal from its four neighbours' heights, asks a helper called biome_blend which two biomes this point sits between (so grass can fade smoothly into desert, rock into snow), and writes its ten floats into the output array:

src/renderer_wgpu/shaders/terrain_gen.wgsl — one thread, one vertex

@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    let x = id.x;
    let z = id.y;
    let side = params.side;

    if (x >= side || z >= side) {
        return;  // padding thread past the grid edge
    }

    let idx = z * side + x;
    let h = heights[idx];
    let m = moisture[idx];

    let world_x = params.origin_x + f32(x) * params.cell_size;
    let world_z = params.origin_z + f32(z) * params.cell_size;

    // …read the 4 neighbouring heights (clamped at chunk edges)…
    let normal = normalize(vec3<f32>(h_l - h_r, params.cell_size * 2.0, h_d - h_u));

    let biome = biome_blend(h, m);

    // Write 10 floats per vertex (position, normal, biome_data, river wetness)
    let base = idx * 10u;
    output[base + 0u] = world_x;
    output[base + 1u] = h;
    output[base + 2u] = world_z;
    // …normal.xyz, biome.xyz, river[idx]…
}
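The neighbour reads are elided in the listing above, so here is one way the normal estimate could look as a CPU reference in Rust. It is a sketch under stated assumptions, not the shader's actual code: it assumes h_l and h_r are the samples at x - 1 and x + 1, that h_d and h_u are the samples at z - 1 and z + 1, and that "clamped" means an edge point reuses its own height for the missing neighbour.

```rust
/// Estimate the unit surface normal at grid point (x, z) by central
/// differences over a row-major height array of `side * side` samples.
fn estimate_normal(heights: &[f32], side: usize, cell_size: f32, x: usize, z: usize) -> [f32; 3] {
    let at = |xi: usize, zi: usize| heights[zi * side + xi];

    // Clamp neighbour coordinates so edge points stay inside the chunk.
    let h_l = at(x.saturating_sub(1), z);
    let h_r = at((x + 1).min(side - 1), z);
    let h_d = at(x, z.saturating_sub(1)); // assumption: "d" means z - 1
    let h_u = at(x, (z + 1).min(side - 1)); // assumption: "u" means z + 1

    // Same expression as the shader: the height differences span two
    // cells, so the vertical component is 2 * cell_size.
    let n = [h_l - h_r, cell_size * 2.0, h_d - h_u];
    let len = (n[0] * n[0] + n[1] * n[1] + n[2] * n[2]).sqrt();
    [n[0] / len, n[1] / len, n[2] / len]
}
```

Flat ground yields a normal pointing straight up, and ground rising towards +x tilts the normal back towards -x. Like the shader's entry point, the function handles exactly one grid point.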

Notice what's missing: no loop over the grid. The loop is the dispatch. On the Rust side, the engine works out how many 16-thread-wide tiles cover the 129-sample side (div_ceil rounds 129 ÷ 16 up to 9) and issues a single command:

src/renderer_wgpu/terrain_compute.rs — the dispatch (trimmed)

let mut pass = encoder.begin_compute_pass(&wgpu::ComputePassDescriptor {
    label: Some("terrain-gen-pass"),
    timestamp_writes: None,
});
pass.set_pipeline(&self.pipeline);
pass.set_bind_group(0, &bind_group, &[]);
let wg = (side as u32).div_ceil(16);  // 129 → 9
pass.dispatch_workgroups(wg, wg, 1);
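The arithmetic behind that dispatch can be checked on its own. The short Rust sketch below uses illustrative helper names (they are not from the engine) to count the workgroups per side, the threads launched, and the padding threads that the shader's bounds check turns away.

```rust
/// Threads along one edge of a workgroup, matching @workgroup_size(16, 16).
const TILE: u32 = 16;

/// Workgroups needed along one side of the sample grid, rounded up.
fn workgroups_per_side(side: u32) -> u32 {
    side.div_ceil(TILE)
}

/// Total threads launched by a square dispatch covering the grid.
fn threads_launched(side: u32) -> u32 {
    let threads_per_side = workgroups_per_side(side) * TILE;
    threads_per_side * threads_per_side
}

/// Threads that fall outside the grid and return at the bounds check.
fn padding_threads(side: u32) -> u32 {
    threads_launched(side) - side * side
}
```

For side = 129 this gives 9 workgroups per side, 20,736 threads launched, and 4,095 padding threads, roughly a fifth of the dispatch.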

Born where it lives

The speed of 16,641 parallel vertex computations is only half the win. The deeper one is location. The output buffer is created with both STORAGE and VERTEX usage flags, so the very same allocation the compute shader writes is later bound, untouched, as the vertex buffer the render pipeline draws from. The finished mesh never crosses the bus to main memory and back; it is born in GPU memory, right where it will be consumed, and it is drawn from there every frame for as long as the chunk is loaded.

One more economy hides in plain sight: the shader writes only vertices, never triangles. Because every chunk has the identical 129×129 topology, the list of indices that stitches the grid into triangles is the same for all of them — so the engine builds one shared index buffer (98,304 indices) at startup, and every chunk in the world draws with it. Per chunk, the GPU stores only what's actually unique: the vertices.
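A shared index list of that size can be built with a pair of nested loops over the grid's cells. The Rust sketch below is one plausible construction, not the engine's code: the function name, the 32-bit index type, and the triangle winding order are assumptions.

```rust
/// Build the triangle index list for a `side` x `side` vertex grid stored
/// row-major (index = z * side + x). Each cell becomes two triangles.
fn grid_indices(side: u32) -> Vec<u32> {
    let cells = side - 1;
    let mut indices = Vec::with_capacity((cells * cells * 6) as usize);
    for z in 0..cells {
        for x in 0..cells {
            let i0 = z * side + x; // corner (x, z)
            let i1 = i0 + 1; // corner (x + 1, z)
            let i2 = i0 + side; // corner (x, z + 1)
            let i3 = i2 + 1; // corner (x + 1, z + 1)
            // Assumed winding; flip the order if the renderer culls the other face.
            indices.extend_from_slice(&[i0, i2, i1, i1, i2, i3]);
        }
    }
    indices
}
```

With side = 129 there are 128 × 128 = 16,384 cells, and at six indices per cell the list comes to the 98,304 indices quoted above.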

In the engine

The compute pass is recorded into the same command encoder as the frame's render passes, so generating a newly streamed chunk and drawing the world are one submission — no synchronisation dance, no readback, no copy. WebGPU guarantees the compute writes finish before the render pass reads the buffer. The whole mechanism lives in src/renderer_wgpu/terrain_compute.rs and weighs in at about 230 lines.