Concept · Rendering
The Swapchain & Presenting
A surprising fact about real-time graphics: the engine never draws to the screen. Every frame is painted into an off-screen image, and only when it is completely finished does it get shown — all at once. The small rotation of images that makes this work is called the swapchain, and the moment of waiting for one turns out to be FloraForge's most useful profiler.
You never draw to the screen
Your monitor is not a passive sheet of pixels. It is a consumer running on its own relentless clock: 60, 120, maybe 144 times a second it scans out whatever image it has been given, reading it line by line from top to bottom. It does not pause for you, and it does not care whether you are finished.
If the engine drew straight into the image the display is reading, you would watch the work happen: the sky appearing first, then terrain stamped over it, then trees, all mid-scanout. So instead the engine renders into a back buffer — an off-screen image the display never sees — while the display reads a different, already-finished front buffer. When the new frame is done, the two are swapped. The display only ever sees complete pictures.
Why whole-image swaps matter: tearing
The swap itself has a failure mode. Suppose the display is halfway through scanning out a frame when the engine swaps in a new one. The top half of the screen now shows the old frame and the bottom half the new one, joined at a visible seam. If the camera was turning, the two halves don't line up — a horizontal rip across the picture called tearing.
The fix is to time the swap to the vertical blank — the short gap after the display finishes one scan and before it starts the next. Swap only in that gap and every scanout reads exactly one frame, never a mixture. That synchronisation is what "vsync" means. The price is that the engine may have to wait for the blank, which caps the frame rate at the display's refresh rate — a trade-off we'll come back to.
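To make the cost of waiting for the blank concrete, here is a minimal sketch of the arithmetic. It assumes the simplest case: a strict two-image swap where the engine cannot begin the next frame until the swap has happened. The function name `vsynced_fps` is hypothetical and not part of FloraForge. Under that assumption a frame that misses a blank waits for the next one, so the delivered rate snaps to the refresh rate divided by a whole number.

```rust
/// Delivered frame rate under strict double-buffered vsync (an idealised
/// model, assumed for illustration): every frame occupies a whole number of
/// refresh intervals, because a swap can only happen at a vertical blank.
fn vsynced_fps(frame_ms: f64, refresh_hz: f64) -> f64 {
    let refresh_interval_ms = 1000.0 / refresh_hz;
    // Round the render time up to the next blank; a frame takes at least one.
    let intervals = (frame_ms / refresh_interval_ms).ceil().max(1.0);
    refresh_hz / intervals
}

fn main() {
    for frame_ms in [2.5, 10.0, 20.0, 40.0] {
        println!(
            "{frame_ms:>5} ms per frame -> {:.1} fps on a 60 Hz display",
            vsynced_fps(frame_ms, 60.0)
        );
    }
}
```

With more than two images in the pool the engine can keep working while it waits, so real frame rates degrade less abruptly than this model suggests.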
In a modern API the front and back buffers generalise to a small pool of two or three images, owned by the system and handed back and forth: the swapchain. At any instant each image is playing a role, such as being scanned out, waiting in the queue, or being drawn into, and the roles rotate every frame.
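As a rough illustration of that rotation, the toy model below tracks three images through three roles. The `Role` enum and `roles_at` function are hypothetical names invented for this sketch. It assumes the engine and display advance in lockstep, one image per frame, which a real swapchain does not guarantee.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Role {
    OnScreen,  // being scanned out by the display
    Queued,    // finished, waiting for its turn on screen
    Rendering, // currently being drawn into by the engine
}

/// Roles of the three swapchain images during a given frame, assuming
/// frame `n` always renders into image `n % 3`.
fn roles_at(frame: usize) -> [Role; 3] {
    let mut roles = [Role::OnScreen; 3];
    roles[frame % 3] = Role::Rendering;
    roles[(frame + 2) % 3] = Role::Queued; // drawn during the previous frame
    roles[(frame + 1) % 3] = Role::OnScreen; // drawn two frames ago
    roles
}

fn main() {
    for frame in 0..4 {
        println!("frame {frame}: {:?}", roles_at(frame));
    }
}
```

The image that is on screen during one frame is the one the engine draws into during the next, which is why acquiring it may have to wait for the display to let go.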
Acquire → draw → present
From the engine's side, every frame of the game loop runs the same three-step ritual against the swapchain. First it acquires the next free image — and this call can block, because an image the display is still reading cannot be handed out. Then it draws, recording every render pass (sky, terrain, plants, water, HUD — each one a pair of shaders) into a command buffer targeting that image. Finally it presents: the image is handed back, marked finished, and queued for the display. In FloraForge's renderer the whole cycle is visible in a dozen lines:
fn render(&mut self) -> Result<(), SurfaceError> {
    // Benchmark timing: the acquire call blocks while the GPU catches up, so
    // its duration is the per-frame GPU-bound wait; the span from here to
    // submit is the render-thread CPU cost.
    let t_acquire = Instant::now();
    let output = self.gpu.surface.get_current_texture()?;
    let gpu_wait_ms = t_acquire.elapsed().as_secs_f32() * 1000.0;

    // …encode every render pass into a command buffer…

    self.gpu.queue.submit(Some(encoder.finish()));
    output.present();
    Ok(())
}
Note what present() does not do: it does not wait for the
GPU to finish drawing. The commands were merely submitted; the GPU may still be
working on this frame — and the previous one — while the CPU starts the next
lap of the loop. FloraForge configures the surface with
desired_maximum_frame_latency: 2, allowing up to two frames to be
in flight at once. That pipelining is where the throughput comes from, and the
acquire call is the valve that keeps it from running away: when the GPU falls
behind, acquire simply has no free image to return, and the CPU stalls until
one comes back.
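A back-of-the-envelope model shows why that pipelining pays off. The sketch below is an idealisation with hypothetical names, not FloraForge code: it assumes constant per-frame CPU and GPU costs and ignores display timing. With one frame in flight the CPU and GPU take turns, so their costs add; with two or more they overlap, and the slower processor sets the pace.

```rust
/// Steady-state time per frame in an idealised CPU/GPU pipeline
/// (assumed model: fixed costs, no vsync, no driver overhead).
fn frame_period_ms(cpu_ms: f64, gpu_ms: f64, frames_in_flight: u32) -> f64 {
    if frames_in_flight <= 1 {
        // Serialised: the CPU idles while the GPU draws, and vice versa.
        cpu_ms + gpu_ms
    } else {
        // Pipelined: both work at once; the slower one is the bottleneck.
        cpu_ms.max(gpu_ms)
    }
}

fn main() {
    let (cpu_ms, gpu_ms) = (4.0, 6.0);
    for in_flight in [1, 2] {
        let period = frame_period_ms(cpu_ms, gpu_ms, in_flight);
        println!(
            "{in_flight} frame(s) in flight: {period} ms per frame, {:.1} fps",
            1000.0 / period
        );
    }
}
```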
Present modes: vsync or raw throughput
How strictly the swapchain ties itself to the display's refresh is a configurable policy called the present mode. wgpu, the WebGPU implementation whose PresentMode type appears in the code below, exposes several, and three of them matter here:
- Fifo — classic vsync. Finished frames join a queue and the display takes one per refresh; if the queue is full, acquire blocks. No tearing ever, frame rate capped at the refresh rate. It is the only mode every platform guarantees.
- Mailbox — triple-buffered, uncapped. The engine renders as fast as it can into a "mailbox" slot, and each refresh the display takes whatever the newest finished frame is, discarding stale ones. No tearing, no cap — at the cost of drawing frames nobody sees.
- Immediate — no synchronisation at all. New frames replace the on-screen image the moment they finish, mid-scanout if need be. Maximum throughput, tearing allowed.
FloraForge plays in Fifo — smooth, tear-free, and frugal. But its FPS benchmark has the opposite need: measuring true render throughput. Under vsync a machine that could draw 400 fps reports a flat 60 or 120, hiding any regression. So at the start of a benchmark run the engine reconfigures the surface to the best uncapped mode available:
let present_mode = if capabilities.present_modes.contains(&wgpu::PresentMode::Fifo) {
    wgpu::PresentMode::Fifo
} else {
    capabilities.present_modes[0]
};

// Prefer an uncapped present mode for benchmarking. Immediate has no
// sync at all; Mailbox is the next best (triple-buffered, uncapped).
let no_vsync_present_mode = if capabilities.present_modes.contains(&wgpu::PresentMode::Immediate) {
    wgpu::PresentMode::Immediate
} else if capabilities.present_modes.contains(&wgpu::PresentMode::Mailbox) {
    wgpu::PresentMode::Mailbox
} else {
    present_mode
};
The acquire stall is a profiler
Look again at the timing in render(). When the GPU is the
bottleneck, the CPU finishes its work early and then sits inside
get_current_texture() waiting for the GPU to release an image —
so the length of that stall closely tracks the GPU's per-frame cost. When the
CPU is the bottleneck, the GPU is always done first and acquire returns
instantly. One cheap CPU-side timer therefore answers the most important
question in optimisation: which processor should you make faster?
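The claim can be checked with a small timeline simulation. What follows is a toy model under stated assumptions, not FloraForge's code: CPU and GPU costs are constant, vsync is off, the GPU processes frames in order, and acquire blocks until the frame submitted `max_in_flight` frames earlier has finished on the GPU. The function name and parameters are hypothetical.

```rust
/// Simulates `frames` iterations of acquire -> CPU work -> submit and
/// returns the acquire stall (in ms) observed on the final frame.
fn acquire_stall_ms(cpu_ms: f64, gpu_ms: f64, max_in_flight: usize, frames: usize) -> f64 {
    assert!(max_in_flight >= 1, "at least one frame must be allowed in flight");
    let mut gpu_done: Vec<f64> = Vec::with_capacity(frames); // GPU finish time per frame
    let mut cpu_clock = 0.0_f64;
    let mut last_stall = 0.0_f64;

    for i in 0..frames {
        // Acquire: wait until the image used `max_in_flight` frames ago is free.
        let mut stall = 0.0;
        if i >= max_in_flight {
            let image_free_at = gpu_done[i - max_in_flight];
            stall = (image_free_at - cpu_clock).max(0.0);
            cpu_clock = cpu_clock.max(image_free_at);
        }
        // CPU work for this frame (simulation plus command encoding), then submit.
        cpu_clock += cpu_ms;
        // The GPU starts once the commands arrive and the previous frame is done.
        let gpu_start = gpu_done.last().map_or(cpu_clock, |&prev| prev.max(cpu_clock));
        gpu_done.push(gpu_start + gpu_ms);
        last_stall = stall;
    }
    last_stall
}

fn main() {
    println!("GPU-bound (cpu 2 ms, gpu 10 ms): stall {} ms", acquire_stall_ms(2.0, 10.0, 2, 20));
    println!("CPU-bound (cpu 10 ms, gpu 2 ms): stall {} ms", acquire_stall_ms(10.0, 2.0, 2, 20));
}
```

In this model the GPU-bound stall settles at the GPU cost minus the CPU cost, so it tracks GPU time most closely when the CPU's share of the frame is small.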
FloraForge records both numbers every benchmark frame — gpu_wait_ms
(the acquire stall) and cpu_ms (simulation plus command encoding)
— and reports their ratio as gpu_bound_ratio. The trick matters
because the obvious alternative, GPU timestamp queries, is unreliable on
Apple Silicon: on a tile-based deferred GPU the pass-boundary timestamps
bracket only the tiling phase, not the fragment work that dominates the frame.
The acquire stall works on every backend.
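The text does not spell out how gpu_bound_ratio is computed. One plausible reading, and it is only an assumption, is the share of the measured frame spent stalled in acquire. The sketch below implements that reading; the function signature is hypothetical and may differ from what FloraForge actually does.

```rust
/// Fraction of the frame spent waiting in acquire.
/// ASSUMED formula: gpu_wait / (gpu_wait + cpu). Values near 1.0 suggest a
/// GPU-bound frame, values near 0.0 a CPU-bound one.
fn gpu_bound_ratio(gpu_wait_ms: f32, cpu_ms: f32) -> f32 {
    let total_ms = gpu_wait_ms + cpu_ms;
    if total_ms <= 0.0 {
        0.0 // nothing was measured; avoid dividing by zero
    } else {
        gpu_wait_ms / total_ms
    }
}

fn main() {
    // Hypothetical timings, chosen only to illustrate the two extremes.
    println!("GPU-bound frame: {:.2}", gpu_bound_ratio(19.0, 1.0));
    println!("CPU-bound frame: {:.2}", gpu_bound_ratio(0.0, 8.0));
}
```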
A benchmark run calls enable_no_vsync() to switch the surface to Immediate or
Mailbox and measure uncapped throughput. The resulting report
(benchmarks/latest.json) shows a gpu_bound_ratio of
roughly 0.95–0.99 on Apple Silicon: the renderer spends
almost the whole frame waiting on the GPU, which is why optimisation effort
(like frustum culling) targets GPU work rather than CPU work.