The Swapchain | FloraForge Concepts

You never draw to the screen

Your monitor is not a passive sheet of pixels. It is a consumer running on its own relentless clock: 60, 120, maybe 144 times a second it scans out whatever image it has been given, reading it line by line from top to bottom. It does not pause for you, and it does not care whether you are finished.

If the engine drew straight into the image the display is reading, you would watch the work happen: the sky appearing first, then terrain stamped over it, then trees, all mid-scanout. So instead the engine renders into a back buffer — an off-screen image the display never sees — while the display reads a different, already-finished front buffer. When the new frame is done, the two are swapped. The display only ever sees complete pictures.

Why whole-image swaps matter: tearing

The swap itself has a failure mode. Suppose the display is halfway through scanning out a frame when the engine swaps in a new one. The top half of the screen now shows the old frame and the bottom half the new one, joined at a visible seam. If the camera was turning, the two halves don't line up — a horizontal rip across the picture called tearing.

The fix is to time the swap to the vertical blank — the short gap after the display finishes one scan and before it starts the next. Swap only in that gap and every scanout reads exactly one frame, never a mixture. That synchronisation is what "vsync" means. The price is that the engine may have to wait for the blank, which caps the frame rate at the display's refresh rate — a trade-off we'll come back to.
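The difference between a mid-scanout swap and a swap in the vertical blank can be sketched with a toy model. Everything here is an illustrative assumption rather than FloraForge code: the function names are made up, and a "display" is just a list recording which frame each scanline was read from.

```rust
/// Which frame each scanline of one refresh ends up showing.
/// `swap_at_row` is the scanline at which the buffers are swapped;
/// `None` means the swap is held back until the vertical blank.
fn scan_out(rows: usize, old_frame: u32, new_frame: u32, swap_at_row: Option<usize>) -> Vec<u32> {
    (0..rows)
        .map(|row| match swap_at_row {
            // Rows scanned after a mid-scanout swap read the new frame.
            Some(swap_row) if row >= swap_row => new_frame,
            // Every other row reads the frame that was current when the scan began.
            _ => old_frame,
        })
        .collect()
}

/// A scanout is torn when it mixes rows from more than one frame.
fn is_torn(scanlines: &[u32]) -> bool {
    scanlines.windows(2).any(|pair| pair[0] != pair[1])
}
```

Swapping at row 4 of an 8-row scan yields four rows of the old frame followed by four of the new one, and `is_torn` reports the seam. Passing `None` keeps the whole scan on the old frame, and the new one is picked up by the next refresh.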

In a modern API the front and back buffers generalise to a small pool of two or three images, owned by the system and handed back and forth: the swapchain. At any instant each image is playing a role, and the roles rotate every frame:

Figure: one instant in a triple-buffered swapchain. Every frame the roles rotate one step: drawn → presented → on screen → recycled. The orange arrow is the only place the engine can be forced to wait, and that wait is information.
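The role rotation can be written down as a small state machine. This is a minimal sketch under assumed names (`Role`, `rotate`); a real swapchain is managed by the system and does not expose its images this way.

```rust
/// The role one swapchain image is playing at a given instant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Role {
    Drawing,   // the engine is rendering into this image
    Presented, // finished and queued for the display
    OnScreen,  // currently being scanned out
}

impl Role {
    /// The role an image takes on one frame later.
    fn next(self) -> Role {
        match self {
            Role::Drawing => Role::Presented,
            Role::Presented => Role::OnScreen,
            // The image leaving the screen is recycled as the next draw target.
            Role::OnScreen => Role::Drawing,
        }
    }
}

/// Advance all three images of a triple-buffered swapchain by one frame.
fn rotate(roles: [Role; 3]) -> [Role; 3] {
    roles.map(Role::next)
}
```

After three rotations every image is back in the role it started with, which is why three images are enough to keep the engine, the queue, and the display busy at once.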

Acquire → draw → present

From the engine's side, every frame of the game loop runs the same three-step ritual against the swapchain. First it acquires the next free image — and this call can block, because an image the display is still reading cannot be handed out. Then it draws, recording every render pass (sky, terrain, plants, water, HUD — each one a pair of shaders) into a command buffer targeting that image. Finally it presents: the image is handed back, marked finished, and queued for the display. In FloraForge's renderer the whole cycle is visible in a dozen lines:

src/app/mod.rs — one frame against the swapchain (trimmed)

fn render(&mut self) -> Result<(), SurfaceError> {
    // Benchmark timing: the acquire call blocks while the GPU catches up, so
    // its duration is the per-frame GPU-bound wait; the span from here to
    // submit is the render-thread CPU cost.
    let t_acquire = Instant::now();
    let output = self.gpu.surface.get_current_texture()?;
    let gpu_wait_ms = t_acquire.elapsed().as_secs_f32() * 1000.0;

    // …encode every render pass into a command buffer…

    self.gpu.queue.submit(Some(encoder.finish()));
    output.present();
    Ok(())
}

Note what present() does not do: it does not wait for the GPU to finish drawing. The commands were merely submitted; the GPU may still be working on this frame — and the previous one — while the CPU starts the next lap of the loop. FloraForge configures the surface with desired_maximum_frame_latency: 2, allowing up to two frames to be in flight at once. That pipelining is where the throughput comes from, and the acquire call is the valve that keeps it from running away: when the GPU falls behind, acquire simply has no free image to return, and the CPU stalls until one comes back.
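The back-pressure described above can be modelled with a counter. This is a toy sketch, not wgpu's implementation: `FramePipeline`, `try_acquire`, and `retire` are hypothetical names, and a `false` return stands in for the point where the real acquire call would block.

```rust
/// Toy model of a frame pipeline with a cap on frames in flight.
struct FramePipeline {
    max_in_flight: usize, // plays the role of desired_maximum_frame_latency
    in_flight: usize,     // frames submitted but not yet finished
}

impl FramePipeline {
    fn new(max_in_flight: usize) -> Self {
        FramePipeline { max_in_flight, in_flight: 0 }
    }

    /// CPU side: try to start a new frame.
    /// Returns `false` where a real acquire would stall the CPU.
    fn try_acquire(&mut self) -> bool {
        if self.in_flight < self.max_in_flight {
            self.in_flight += 1;
            true
        } else {
            false
        }
    }

    /// GPU side: one frame finished and its image became free again.
    fn retire(&mut self) {
        self.in_flight = self.in_flight.saturating_sub(1);
    }
}
```

With a cap of 2, the CPU can start two frames back to back; the third attempt fails until the GPU retires one. The cap trades latency for throughput: a higher value lets the CPU run further ahead, at the cost of frames that reach the screen later.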

Present modes: vsync or raw throughput

How strictly the swapchain ties itself to the display's refresh is a configurable policy called the present mode. wgpu, the WebGPU implementation FloraForge uses, exposes three main flavours:

Fifo — classic vsync. Finished frames join a queue and the display takes one per refresh; if the queue is full, acquire blocks. No tearing ever, frame rate capped at the refresh rate. It is the only mode every platform guarantees.
Mailbox — triple-buffered, uncapped. The engine renders as fast as it can into a "mailbox" slot, and each refresh the display takes whatever the newest finished frame is, discarding stale ones. No tearing, no cap — at the cost of drawing frames nobody sees.
Immediate — no synchronisation at all. New frames replace the on-screen image the moment they finish, mid-scanout if need be. Maximum throughput, tearing allowed.
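The three policies differ only in what happens to a finished frame between refreshes, which a small model makes concrete. The sketch below is an assumption-laden simplification: `Presenter`, `Mode`, and the `capacity` field are hypothetical, and the Fifo wait is modelled as `present` returning `false` rather than as a blocking acquire.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Fifo,
    Mailbox,
    Immediate,
}

/// Toy presentation engine: frames are just numbers.
struct Presenter {
    mode: Mode,
    capacity: usize,       // Fifo queue depth (hypothetical parameter)
    queue: VecDeque<u32>,  // finished frames waiting for a refresh
    on_screen: Option<u32>,
    dropped: u32,          // frames that were rendered but never shown
}

impl Presenter {
    fn new(mode: Mode, capacity: usize) -> Self {
        Presenter { mode, capacity, queue: VecDeque::new(), on_screen: None, dropped: 0 }
    }

    /// Hand over a finished frame. Returns `false` when the engine would have to wait.
    fn present(&mut self, frame: u32) -> bool {
        match self.mode {
            Mode::Fifo => {
                if self.queue.len() >= self.capacity {
                    return false; // queue full: the engine waits for a refresh
                }
                self.queue.push_back(frame);
            }
            Mode::Mailbox => {
                // The newest frame replaces any frame still waiting.
                self.dropped += self.queue.len() as u32;
                self.queue.clear();
                self.queue.push_back(frame);
            }
            Mode::Immediate => {
                // Straight to the screen, possibly mid-scanout (tearing).
                self.on_screen = Some(frame);
            }
        }
        true
    }

    /// Called once per display refresh: take the next waiting frame, if any.
    fn vblank(&mut self) {
        if let Some(frame) = self.queue.pop_front() {
            self.on_screen = Some(frame);
        }
    }
}
```

Presenting frames 1, 2, 3 before a single refresh shows the contrast: Fifo with a depth of 2 refuses the third frame and displays frame 1, Mailbox accepts all three and displays frame 3 after discarding two, and Immediate shows each frame the moment it arrives.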

FloraForge plays in Fifo — smooth, tear-free, and frugal. But its FPS benchmark has the opposite need: measuring true render throughput. Under vsync a machine that could draw 400 fps reports a flat 60 or 120, hiding any regression. So at the start of a benchmark run the engine reconfigures the surface to the best uncapped mode available:

src/renderer_wgpu/gpu_context.rs — choosing the present modes

let present_mode = if capabilities.present_modes.contains(&wgpu::PresentMode::Fifo) {
    wgpu::PresentMode::Fifo
} else {
    capabilities.present_modes[0]
};

// Prefer an uncapped present mode for benchmarking. Immediate has no
// sync at all; Mailbox is the next best (triple-buffered, uncapped).
let no_vsync_present_mode = if capabilities.present_modes.contains(&wgpu::PresentMode::Immediate) {
    wgpu::PresentMode::Immediate
} else if capabilities.present_modes.contains(&wgpu::PresentMode::Mailbox) {
    wgpu::PresentMode::Mailbox
} else {
    present_mode
};
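To see why the benchmark needs an uncapped mode, consider how vsync flattens the reported frame rate. The helper below is a hypothetical illustration, not part of the engine, and it assumes an idealised cap: it ignores the stepwise drops that can occur when a frame misses a refresh.

```rust
/// Frame rate a benchmark would report, under an idealised vsync cap.
/// `raw_fps` is what the machine could render with no synchronisation.
fn reported_fps(raw_fps: f32, refresh_hz: f32, vsync: bool) -> f32 {
    if vsync {
        // The display takes at most one frame per refresh.
        raw_fps.min(refresh_hz)
    } else {
        raw_fps
    }
}
```

On a 60 Hz display, a renderer that regresses from 400 fps to 250 fps reports 60 both before and after with vsync on. With vsync off the same regression shows up as 400 versus 250.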

The acquire stall is a profiler

Look again at the timing in render(). When the GPU is the bottleneck, the CPU finishes its work early and then sits inside get_current_texture() waiting for the GPU to release an image, so the length of that stall closely tracks the GPU's per-frame cost. When the CPU is the bottleneck, the GPU is always done first and acquire returns instantly. The reading is cleanest with vsync off, since under Fifo the stall can also include time spent waiting for the display's refresh. One cheap CPU-side timer therefore answers the most important question in optimisation: which processor should you make faster?

FloraForge records both numbers every benchmark frame — gpu_wait_ms (the acquire stall) and cpu_ms (simulation plus command encoding) — and reports their ratio as gpu_bound_ratio. The trick matters because the obvious alternative, GPU timestamp queries, is unreliable on Apple Silicon: on a tile-based deferred GPU the pass-boundary timestamps bracket only the tiling phase, not the fragment work that dominates the frame. The acquire stall works on every backend.
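One plausible way to turn the two timings into a single figure is sketched below. The text above does not spell out how gpu_bound_ratio is computed, so the formula here (stall time as a fraction of stall plus CPU time) and the 0.5 threshold are assumptions chosen for illustration, as are the `Bottleneck` and `classify` names.

```rust
/// Assumed formula: share of the frame spent stalled in acquire.
/// 0.0 means the CPU never waited; values near 1.0 mean it mostly waited.
fn gpu_bound_ratio(gpu_wait_ms: f32, cpu_ms: f32) -> f32 {
    let total = gpu_wait_ms + cpu_ms;
    if total > 0.0 {
        gpu_wait_ms / total
    } else {
        0.0 // no timing data: report "not GPU-bound" rather than divide by zero
    }
}

#[derive(Debug, PartialEq, Eq)]
enum Bottleneck {
    Gpu,
    Cpu,
}

/// Hypothetical classification with an arbitrary 0.5 cut-off.
fn classify(gpu_wait_ms: f32, cpu_ms: f32) -> Bottleneck {
    if gpu_bound_ratio(gpu_wait_ms, cpu_ms) > 0.5 {
        Bottleneck::Gpu
    } else {
        Bottleneck::Cpu
    }
}
```

Under this assumed formula, a frame with a 9.5 ms acquire stall and 0.5 ms of CPU work scores about 0.95 and classifies as GPU-bound, while a frame with no stall and 8 ms of CPU work scores 0.0.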

In the engine

In normal play FloraForge runs Fifo (vsync); benchmark mode calls enable_no_vsync() to switch the surface to Immediate or Mailbox and measure uncapped throughput. The resulting report (benchmarks/latest.json) shows a gpu_bound_ratio of roughly 0.95–0.99 on Apple Silicon — the renderer spends almost the whole frame waiting on the GPU, which is why optimisation effort (like frustum culling) targets GPU work rather than CPU work.
