
PAW: Compile Once, Run Offline with a 23MB Adapter That Matches 32B Models
Published by AINave Editorial • Reviewed by Ramit
Program-as-Weights (PAW) is a new research system that treats large language models as one-time compilers rather than per-query solvers. A 4-billion-parameter compiler generates a 23-megabyte LoRA adapter from a natural-language task specification, which a frozen 600-million-parameter interpreter then runs offline with no further API access. On the FuzzyBench benchmark, PAW-compiled adapters for the 0.6B interpreter outperformed direct prompting of Qwen3-32B (73.78% vs 68.70% exact match) and exceeded full fine-tuning of the same base model by 15.4 percentage points. For AI builders, this reframes the cost and deployment model for repetitive, well-defined tasks.
What happened
Researchers from the University of Waterloo, Cornell University, and Harvard University published PAW on July 2, 2026. The system has two phases. In the compilation phase, a 4B compiler model trained on FuzzyBench (a 10-million-example dataset spanning 800+ categories) converts a natural-language task description into a 23MB LoRA adapter. In the execution phase, a frozen 600M-parameter interpreter (quantized Qwen3-0.6B, about 430MB in GGUF format) loads the adapter and processes every incoming call locally. The interpreter runs at 30 tokens per second on a MacBook M3 and uses roughly one-fiftieth the memory of a direct 32B model.
The paper reported that the 0.6B interpreter with PAW-compiled adapters achieved 73.78% exact match on FuzzyBench compared to 68.70% for direct prompting of Qwen3-32B. It also outperformed full fine-tuning of the same 0.6B base model by 15.4 percentage points and the strongest fixed LoRA baseline by 21.7 points. The researchers attribute the gain to the compiler-generated LoRA, not the model architecture alone.
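For readers unfamiliar with the metric, exact match counts a prediction as correct only when it equals the reference answer. The sketch below illustrates the idea and is not the paper's evaluation code: the whitespace stripping and the toy data are assumptions. It also works out the baseline scores implied by the reported gaps, which are derived here rather than quoted from the paper.

```python
def exact_match(predictions, references):
    """Return the percentage of predictions that equal their reference.

    Assumption: surrounding whitespace is ignored. The normalization the
    paper actually uses is not described in this article.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Hypothetical toy data, for illustration only.
preds = ["error", "ok ", "warning", "ok"]
refs = ["error", "ok", "error", "ok"]
toy_score = exact_match(preds, refs)  # 3 of the 4 pairs match

# Scores reported in the article.
PAW_SCORE = 73.78
QWEN3_32B_SCORE = 68.70
gap_vs_32b = round(PAW_SCORE - QWEN3_32B_SCORE, 2)

# Baseline scores implied by the reported gaps (derived, not quoted).
implied_full_finetune = round(PAW_SCORE - 15.4, 2)
implied_fixed_lora = round(PAW_SCORE - 21.7, 2)
```

Under these figures, the lead over direct prompting of Qwen3-32B is about 5 percentage points, much narrower than the lead over the two 0.6B baselines.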
PAW demonstrated five production use cases: event-driven log monitoring, intent-based site navigation, semantic search reranking, a tool-calling pipeline scoring 93% on a standard agentic evaluation, and a multilingual word-guessing game. A smaller GPT-2-based path also allows the system to run entirely client-side in a browser via WebAssembly.
Why AI builders should care
PAW reframes the traditional per-query inference cost model. Instead of sending each input to a large model at runtime, a developer uses the large model once at compile time to generate a function-specific adapter. Every subsequent invocation runs on a small, frozen local model. This eliminates per-token API costs for high-volume repetitive tasks, removes network latency, and enables deployment in air-gapped or edge environments.
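The compile-once trade-off can be sketched as a break-even calculation. This is a minimal illustration, not a pricing model from the paper: the function name and every dollar figure below are hypothetical, chosen only to show the arithmetic.

```python
import math


def break_even_calls(compile_cost, api_cost_per_call, local_cost_per_call=0.0):
    """Return the first call count at which compile-once is strictly cheaper.

    Returns None when a local call costs at least as much as an API call,
    because the one-time compile cost is then never recovered.
    """
    saving_per_call = api_cost_per_call - local_cost_per_call
    if saving_per_call <= 0:
        return None
    return math.floor(compile_cost / saving_per_call) + 1


# Hypothetical prices in dollars, for illustration only.
calls = break_even_calls(
    compile_cost=2.00,           # one-time cost of generating the adapter
    api_cost_per_call=0.002,     # hosted large-model cost per request
    local_cost_per_call=0.0001,  # amortized local compute per request
)
```

With these assumed prices the compile step pays for itself after roughly a thousand calls, so the approach favors high-volume tasks over occasional ones.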
Compiled artifacts are static files. Unlike a prompt sent to a hosted API, where a model update can silently change behavior, a PAW artifact produces consistent outputs over time and across software versions and hardware configurations. This matters for production teams that need deterministic behavior.
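Because the artifact is a plain file, a team can pin it by checksum and refuse to load anything that has changed. The sketch below uses only Python's standard library; the helper names are hypothetical and nothing here is part of PAW's own tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_hex):
    """Return True only if the file on disk matches the pinned digest."""
    return sha256_of(path) == expected_hex.lower()
```

A deployment script might record the digest when the adapter is compiled and call verify_artifact before each load, so any accidental swap or corruption is caught before the interpreter runs.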
However, PAW uses Alibaba's Qwen3-0.6B as its frozen interpreter backbone. Alibaba is subject to China's National Intelligence Law, which requires Chinese companies to cooperate with state intelligence requests. When running PAW locally, no data is transmitted to Alibaba's servers, but the legal obligation attaches to the company and its model development pipeline. Developers in government, defense, or highly regulated industries should evaluate whether a Qwen3-based interpreter is appropriate for their use case.
Practical implications
For production deployments, the PAW pipeline means a developer can compile a task once and then run it indefinitely on local hardware. At 23MB per function, the adapter files are small enough to swap between functions without storing a separate fine-tuned model for each one. The interpreter model never changes; only the adapter does.
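The storage saving from sharing one interpreter can be worked out from the sizes quoted above. This is a rough sketch: it assumes each separately fine-tuned model would be the same size as the 430MB quantized interpreter, which the article does not state, and the function names are hypothetical.

```python
INTERPRETER_MB = 430  # quantized Qwen3-0.6B in GGUF format, per the article
ADAPTER_MB = 23       # one compiled LoRA adapter, per the article


def shared_interpreter_mb(num_functions):
    """Disk use with one frozen interpreter plus one adapter per function."""
    return INTERPRETER_MB + num_functions * ADAPTER_MB


def separate_models_mb(num_functions):
    """Disk use with one full model copy per function.

    Assumption: each fine-tuned copy is as large as the base interpreter.
    """
    return num_functions * INTERPRETER_MB


# Example: a service exposing 20 compiled functions.
paw_total = shared_interpreter_mb(20)
fine_tuned_total = separate_models_mb(20)
```

For 20 functions this comes to 890MB with a shared interpreter versus 8,600MB for 20 separate model copies, under the stated assumption.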
PAW takes a structurally different approach from other cost-reduction techniques. Quantization reduces model precision, speculative decoding uses a smaller model to propose tokens, and mixture-of-experts activates only a fraction of parameters per token. PAW instead uses a larger model to pre-compile task-specific intelligence into a form a smaller model can consume, then removes the large model from the inference loop entirely.
The five illustrated use cases show the range: log monitoring (output triage), site navigation (custom classification), semantic search reranking (fuzzy search), tool-calling pipelines (agent preprocessing), and multilingual generation (creative tasks). These are all fuzzy functions: tasks that resist clean rule-based implementation but do not require multi-step reasoning on every call.
Caveats
Evidence for PAW's performance comes from a single research paper and its associated benchmark. FuzzyBench was designed and released by the same team that built PAW. Independent evaluation on third-party benchmarks and production workloads is necessary before the performance claims can be treated as externally validated.
PAW is designed for a specific category of tasks (fuzzy functions). It may not generalize to tasks requiring multi-step reasoning, open-ended generation, or broad world knowledge. The paper's results on FuzzyBench may not translate directly to real-world production workloads without further testing.
The use of Qwen3-0.6B as the interpreter backbone introduces data sovereignty considerations. While local inference means no data reaches Alibaba's servers, the company's legal obligations under Chinese law remain a factor when evaluating an ongoing dependency on Alibaba's model releases. Developers with strict requirements may need to evaluate whether an equivalent Western-developed model backbone can be substituted.
Sources
- Compile Once, Run Offline: New AI Method Matches 32B Models With a 23MB File
- A New Framework Compiles AI Task Logic Into Lightweight Local Models. The Idea Challenges The Assumption That Stronger AI Must Always Mean Larger Runtime Models. | IBTimes






















