WebGPU Shader Interception
for Browser LLMs

I intercepted the GPU calls of a 3.8B-parameter language model running entirely in your browser, captured every GPU operation, decoded its architecture, and built a faster engine from scratch.

85 GPU shaders captured
342 dispatches per token
728 GPU buffers mapped
12,962 lines of WGSL
27-52 tok/s decode
How does an LLM run in a browser?

Large Language Models like Phi-3 generate text one token (roughly one word) at a time. For each token, the model runs 342 GPU compute operations ("dispatches") across 32 neural network layers. Each layer performs matrix multiplication, attention computation, and activation functions — all running on your GPU via WebGPU.

Normally, a framework called TVM manages these operations. TVM's runtime (compiled to WebAssembly) decides which GPU shader to run, writes the parameters, submits the work, and reads back the result — 342 times per token.

I intercepted TVM's GPU calls, captured every shader and buffer, decoded the architecture, and built my own dispatch loop that drives the GPU directly. Same shaders, same weights, same math — but without TVM's WASM overhead.
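The replacement dispatch loop is conceptually simple. Here is an illustrative sketch of the replay idea; the `Dispatch` record, `PassEncoder` interface, and recording stub are simplifications of mine (real code drives a `GPUComputePassEncoder`, whose `setBindGroup` also takes a slot index):

```typescript
// Illustrative sketch: replaying captured dispatches without TVM.
type Dispatch = {
  pipeline: string;                  // captured pipeline id
  bindGroup: string;                 // captured buffer bindings
  groups: [number, number, number];  // workgroup counts
};

interface PassEncoder {
  setPipeline(p: string): void;
  setBindGroup(g: string): void;
  dispatchWorkgroups(x: number, y: number, z: number): void;
}

function replayToken(pass: PassEncoder, dispatches: Dispatch[]): void {
  for (const d of dispatches) {
    pass.setPipeline(d.pipeline);
    pass.setBindGroup(d.bindGroup);
    pass.dispatchWorkgroups(...d.groups);
  }
}

// Recording stub to make the call sequence visible:
const calls: string[] = [];
const stub: PassEncoder = {
  setPipeline: (p) => { calls.push(`pipeline:${p}`); },
  setBindGroup: (g) => { calls.push(`bind:${g}`); },
  dispatchWorkgroups: (x, y, z) => { calls.push(`dispatch:${x},${y},${z}`); },
};
replayToken(stub, [{ pipeline: "matmul", bindGroup: "layer0", groups: [128, 1, 1] }]);
```

The real loop runs this shape 342 times per token against the captured pipelines and bind groups.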

The 342 Dispatches

Every token the model generates requires exactly 342 GPU compute dispatches. Here's what each one does:

Matmul (int4 dequant)
Attention
Norm + Residual
Activation (RoPE, SiLU)
KV Cache
Sampling

85 Captured Shaders

I intercepted every createShaderModule call during model load, capturing 85 WGSL compute shaders totaling 12,962 lines. I also wrote 3 custom fused shaders.
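The capture itself is a small monkey-patch on the device's shader-creation call. A hedged sketch: in the page the patch would target `GPUDevice.prototype.createShaderModule`, while here a stub device stands in so the pattern runs anywhere:

```typescript
// Wrap createShaderModule so every WGSL source is recorded at load time.
type ShaderDesc = { code: string; label?: string };

const capturedShaders: ShaderDesc[] = [];

function hookCreateShaderModule<T extends { createShaderModule(d: ShaderDesc): unknown }>(
  device: T,
): T {
  const original = device.createShaderModule.bind(device);
  device.createShaderModule = (desc: ShaderDesc) => {
    capturedShaders.push({ code: desc.code, label: desc.label }); // record WGSL
    return original(desc);                                        // forward unchanged
  };
  return device;
}

// Stand-in for a real GPUDevice:
const stubDevice = {
  createShaderModule(d: ShaderDesc) { return { label: d.label }; },
};
hookCreateShaderModule(stubDevice);
stubDevice.createShaderModule({ code: "@compute fn main() {}", label: "matmul" });
```

The same wrap-and-forward pattern works for buffer creation and queue writes, which is how the 728 buffers below were mapped.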

Per-Layer Pattern (10 dispatches × 32 layers)
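The per-layer pattern covers 320 of the 342 per-token dispatches. A quick accounting; attributing the remaining 22 to pre/post work (embedding lookup, final norm, logits, sampling) is my inference, not something read from the capture:

```typescript
// 342 dispatches per token vs. the 10-dispatch-per-layer pattern.
const layers = 32;
const perLayerDispatches = 10;   // observed per-layer pattern
const perTokenDispatches = 342;  // observed per-token total

const inLayers = layers * perLayerDispatches;         // dispatches inside the layer stack
const outsideLayers = perTokenDispatches - inLayers;  // pre/post work (inferred)
console.log(inLayers, outsideLayers);
```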

Custom Fused Shaders

728 GPU Buffers

Every buffer referenced in the 342 dispatches, classified by purpose:

270 weight buffers (3.8 GB)
71 activation buffers (424 KB)
37 KV cache buffers (19 MB)
350 uniform buffers (7.7 KB)

Key Findings

Submit batching (337x fewer submits) doesn't help on Apple Silicon

I accumulated all 342 command buffers and submitted them as a single GPU submit — a 337x reduction. The output was correct (coherent English, proper EOS detection). But it was 30% slower than TVM's 1:1 submit pattern.

Why? Chrome's GPU driver on Apple M2 already pipelines submits. While the CPU prepares dispatch N+1, the GPU executes dispatch N. Batching breaks this pipeline — the GPU sits idle during the entire CPU preparation phase.

TVM:     CPU: [setup1][setup2][setup3]...  ← overlaps with GPU
         GPU:    [work1][work2][work3]...

Batched: CPU: [setup1][setup2]...[setup342]  ← GPU idle
         GPU:                               [work1][work2]...[work342]
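The experiment itself reduces to where `queue.submit` is called. A sketch with a stub queue in place of `GPUQueue`; the stub only counts submits, whereas the real experiment submits actual `GPUCommandBuffer`s:

```typescript
// Counting submits under the two strategies.
type CommandBuffer = { id: number };

class StubQueue {
  submits = 0;
  buffersSeen = 0;
  submit(buffers: CommandBuffer[]): void {
    this.submits += 1;
    this.buffersSeen += buffers.length;
  }
}

function runToken(queue: StubQueue, buffers: CommandBuffer[], batched: boolean): void {
  if (batched) {
    queue.submit(buffers);                       // one big submit: GPU idles while CPU encodes
  } else {
    for (const b of buffers) queue.submit([b]);  // 1:1: CPU encoding overlaps GPU execution
  }
}

const buffers = Array.from({ length: 342 }, (_, id) => ({ id }));
const q1 = new StubQueue();
runToken(q1, buffers, false); // TVM-style submits
const q2 = new StubQueue();
runToken(q2, buffers, true);  // single batched submit
```

Same work reaches the GPU either way; only the submit cadence differs, and on this hardware the 1:1 cadence wins.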

TVM already fuses elementwise operations

TVM's compiler isn't naive. It already fuses across elementwise boundaries:

fused_dequantize + NT_matmul     — int4 dequant + matmul in one shader
fuse_add_norm_decode             — residual add + RMSNorm in one shader
fused_split_silu_multiply        — gate split + SiLU + elementwise mul

It does NOT fuse across matmul boundaries. My fused shaders (FFN+SiLU, RMSNorm+Matmul) cross this boundary.

GPU f16 parallel reduction is non-deterministic

Running the exact same TVM code twice with the same prompt produces different tokens after position 7-156 (depending on prompt length). The f16 parallel reduction in the matmul shaders accumulates partial sums in a tree pattern, so different GPU scheduling means a different summation order, different rounding, and eventually a different sampled token.

This is a hardware-level non-determinism on Apple M2. It means any replay-based approach (including mine) can only match TVM's output up to this natural divergence point.
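The underlying cause is easy to demonstrate: floating-point addition is not associative, so any reduction whose order depends on GPU scheduling can round differently run to run. A small demo using `Math.fround` (f32) as a proxy; the shaders' f16 accumulators round even more coarsely:

```typescript
// The same three values summed in a different order round differently.
const f = Math.fround;
const a = f(1e8);
const b = f(-1e8);
const c = f(1e-3);

const leftToRight = f(f(a + b) + c); // (a + b) cancels first, so c survives
const reordered = f(f(a + c) + b);   // c is absorbed into a before the cancellation

// leftToRight is ~0.001 while reordered is exactly 0
console.log(leftToRight, reordered);
```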

The memory bandwidth wall

At ~27-48 tok/s on an M2 Pro (200 GB/s memory bandwidth), the model reads ~1.8 GB of weights per token:

1.8 GB / 200 GB/s = 9 ms per token = ~111 tok/s theoretical maximum

Current performance (27-48 tok/s) represents 25-43% of the theoretical bandwidth limit. The remaining gap is from compute (attention, matmul reduction) and framework overhead. My engine reduces framework overhead but can't change the bandwidth limit.
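The arithmetic behind the ceiling, spelled out:

```typescript
// Memory-bandwidth ceiling for decode: bytes of weights read per token,
// divided by bandwidth, bounds tokens per second regardless of compute speed.
const bytesPerToken = 1.8e9; // ~1.8 GB of weights touched per token
const bandwidth = 200e9;     // M2 Pro unified memory: 200 GB/s

const secondsPerToken = bytesPerToken / bandwidth; // 0.009 s = 9 ms
const maxTokS = 1 / secondsPerToken;               // ~111 tok/s
console.log(secondsPerToken * 1e3, maxTokS);
```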

The attention uniform struct (56 bytes, 14 fields)

The most complex uniform in the model, decoded from the WGSL batch_decode_paged_kv_kernel shader:

struct PODArgs {
  B: i32,                            // offset 0:  batch size (=1)
  k_rope_pos_offset_elem_offset: i32,// offset 4:  =0
  length_info_elem_offset: i32,      // offset 8:  =0
  max_num_pages: i32,                // offset 12: =257
  nnz_pages: i32,                    // offset 16: CHANGES at page boundaries
  page_indptr_elem_offset: i32,      // offset 20: =0
  page_values_elem_offset: i32,      // offset 24: =0
  pages_elem_offset: i32,            // offset 28: =0
  q_rope_position_elem_offset: i32,  // offset 32: =0
  rope_scale: f32,                   // offset 36: =1.0
  rope_theta: f32,                   // offset 40: =10000.0
  rotary_mode: i32,                  // offset 44: =0
  sm_scale: f32,                     // offset 48: =1/sqrt(96)
  packGridDimX: u32                  // offset 52: =1
}

This struct was reverse-engineered from the captured WGSL source and verified against TVM's runtime writes. All 14 fields mapped correctly.
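Decoding such a uniform from a captured buffer is straightforward with a `DataView`. A sketch that writes and reads back the observed constants at the offsets above; the `nnz_pages` value here is illustrative, since it changes at page boundaries:

```typescript
// Round-tripping the 56-byte PODArgs layout with a DataView.
// WebGPU buffer contents are little-endian, hence the `true` flags.
const buf = new ArrayBuffer(56);
const view = new DataView(buf);

view.setInt32(0, 1, true);                    // B (batch size)
view.setInt32(12, 257, true);                 // max_num_pages
view.setInt32(16, 3, true);                   // nnz_pages (illustrative)
view.setFloat32(36, 1.0, true);               // rope_scale
view.setFloat32(40, 10000.0, true);           // rope_theta
view.setFloat32(48, 1 / Math.sqrt(96), true); // sm_scale
view.setUint32(52, 1, true);                  // packGridDimX

const podArgs = {
  B: view.getInt32(0, true),
  maxNumPages: view.getInt32(12, true),
  nnzPages: view.getInt32(16, true),
  ropeTheta: view.getFloat32(40, true),
  smScale: view.getFloat32(48, true),
};
console.log(podArgs);
```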

Try It Yourself

Chat with Phi-3-mini running on these 10 shaders — no TVM, no compiler, no server.

Open Zero-TVM Chat

Chrome or Edge only. ~2 GB model download on first load.