Large Language Models like Phi-3 generate text one token (roughly one word) at a time. For each token, the model runs 342 GPU compute operations ("dispatches") across 32 transformer layers. Each layer performs matrix multiplications, attention computation, and activation functions — all running on your GPU via WebGPU.
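In outline, that token-at-a-time generation is a simple loop. Here is a minimal sketch; `forward` stands in for the full forward pass (all 342 dispatches) and is an illustrative name, not the project's actual API:

```javascript
// Minimal sketch of autoregressive decoding: each iteration runs the
// whole model once over the sequence so far and appends one token.
function generate(forward, promptTokens, steps) {
  const tokens = [...promptTokens];
  for (let i = 0; i < steps; i++) {
    const logits = forward(tokens);                   // one full GPU pass
    const next = logits.indexOf(Math.max(...logits)); // greedy pick (no sampling)
    tokens.push(next);
  }
  return tokens;
}
```

The key point is that the whole pipeline reruns for every single token, which is why per-dispatch overhead multiplies so quickly.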
Normally, a framework called TVM manages these operations. TVM's runtime (written in WASM) decides which GPU shader to run, writes the parameters, submits the work, and reads the result — 342 times per word.
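That per-token loop boils down to a sequence of WebGPU compute dispatches. A sketch under assumptions: `dispatches` is a hypothetical pre-built list of pipeline/bind-group/grid entries, not TVM's actual internal data structure, and a real runtime would also upload inputs and read logits back:

```javascript
// Sketch: replay a fixed list of compute dispatches for one token.
function runToken(device, dispatches) {
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  for (const d of dispatches) {
    pass.setPipeline(d.pipeline);       // pick the GPU shader to run
    pass.setBindGroup(0, d.bindGroup);  // wire up weights and activations
    pass.dispatchWorkgroups(...d.grid); // launch the compute work
  }
  pass.end();
  device.queue.submit([encoder.finish()]);
}
```

These are standard WebGPU calls (`setPipeline`, `setBindGroup`, `dispatchWorkgroups`); the difference between runtimes is who decides the list and how much bookkeeping happens between entries.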
I intercepted TVM's GPU calls, captured every shader and buffer, decoded the architecture, and built my own dispatch loop that drives the GPU directly. Same shaders, same weights, same math — but without TVM's WASM overhead.
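One general way to do this kind of interception is to wrap WebGPU's prototype methods before the runtime obtains its device. This is a sketch of the technique, not the project's exact code; the capture store and what gets recorded are assumptions:

```javascript
// Sketch: monkey-patch createShaderModule so every WGSL shader the
// runtime creates is recorded, then forwarded to the real browser API.
const capturedShaders = [];
function instrument(deviceProto) {
  const original = deviceProto.createShaderModule;
  deviceProto.createShaderModule = function (descriptor) {
    capturedShaders.push(descriptor.code);   // keep the WGSL source
    return original.call(this, descriptor);  // behave exactly as before
  };
}
// In a browser this would be called as: instrument(GPUDevice.prototype)
```

The same wrapping applies to buffer creation and dispatch calls; once every shader, buffer, and dispatch is captured, the recorded sequence can be replayed directly, which is the general shape of the TVM-free loop described above.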