Interrupt
Vol. VII, No. 3 — April 2026

The Cost of Scale:
DeepSeek's Architecture and Trade‑offs

Also inside: Benchmarking beyond the leaderboard, the ethics of open weights, and a test suite for tool‑calling accuracy.

Table of Contents

  1. The Cost of Scale
    DeepSeek's architecture and the trade‑offs of efficiency
  2. Benchmarking Beyond the Leaderboard
    Real‑world evaluation when benchmarks become games
  3. The Ethics of Open Weights
    What does "open source AI" actually mean?
  4. Tool Calling Accuracy
    A deterministic test suite for LLMs

The Cost of Scale

When DeepSeek‑V3 was released, the community noticed something unusual: it ran inference roughly 40% faster than comparable models of similar active parameter count, while maintaining competitive accuracy on MMLU and HumanEval. The architecture paper revealed a series of deliberate trade‑offs—choices that privilege inference‑time efficiency over training‑time flexibility.

The core innovation is a modified attention mechanism that reduces memory bandwidth pressure. Instead of storing full attention matrices, the model uses sliding‑window attention with a dynamic compression factor. This cuts memory usage roughly in half, but requires a more complex training schedule that initially struggled with long‑context tasks.

Efficiency isn't just about flops—it's about the decisions you don't have to make.

DeepSeek's team opted for a mixture‑of‑experts (MoE) design with 16 experts, only 2 of which are active per token. This keeps the total parameter count high (1.3T) while keeping the active compute footprint manageable. The trade‑off: expert routing introduces coordination overhead that can become a bottleneck on older hardware.
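A minimal sketch of the 2‑of‑16 routing pattern described above; the `Top2Router` class, its shapes, and the gating details are illustrative assumptions, not DeepSeek's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2Router(nn.Module):
    """Illustrative top-2-of-16 gate: each token activates only 2 experts."""
    def __init__(self, dim, num_experts=16, top_k=2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, tokens, dim) -> per-token scores over all experts
        scores = self.gate(x)
        weights, indices = scores.topk(self.top_k, dim=-1)
        # Renormalize so the two active experts' mixing weights sum to 1
        weights = F.softmax(weights, dim=-1)
        return weights, indices  # which experts to run, and how to mix them

router = Top2Router(dim=64)
w, idx = router(torch.randn(1, 8, 64))
print(w.shape, idx.shape)  # each token gets 2 weights and 2 expert ids
```

Only the two selected experts run per token, which is why total parameters can dwarf the per-token compute.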

The following snippet illustrates the key attention modification—a simplified version of the windowed attention with compression:

import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowedAttention(nn.Module):
    def __init__(self, dim, window_size, compression_ratio=0.5):
        super().__init__()
        self.dim = dim
        self.window_size = window_size
        self.compression_ratio = compression_ratio
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, T, C = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, C)
        q, k, v = qkv.unbind(2)
        # Compress k and v along the sequence axis; at the default ratio of
        # 0.5, average pooling halves their length and the attention memory.
        stride = int(1 / self.compression_ratio)
        k = F.avg_pool1d(k.transpose(1, 2), kernel_size=stride).transpose(1, 2)
        v = F.avg_pool1d(v.transpose(1, 2), kernel_size=stride).transpose(1, 2)
        # Scaled dot-product attention over the compressed keys and values
        # (window masking omitted in this simplified version).
        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)
        return self.proj(attn @ v)

Training such a model required a custom distributed training framework that could handle the uneven load across experts. DeepSeek's engineers built a scheduler that dynamically reassigns experts to GPUs during training, reducing communication overhead by 30%.
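DeepSeek's scheduler itself is not described in enough detail to reproduce, but the idea of dynamically reassigning experts can be sketched as a hypothetical greedy rebalancer that places the heaviest experts on the least‑loaded GPU:

```python
def assign_experts(loads, num_gpus):
    """Greedy rebalance (illustrative, not DeepSeek's scheduler): place the
    heaviest experts first, each on the currently least-loaded GPU."""
    gpus = [0.0] * num_gpus
    placement = {}
    for expert in sorted(loads, key=loads.get, reverse=True):
        target = min(range(num_gpus), key=gpus.__getitem__)
        placement[expert] = target
        gpus[target] += loads[expert]
    return placement

# Hypothetical per-expert token loads measured over a training window
loads = {"e0": 9.0, "e1": 7.0, "e2": 4.0, "e3": 4.0, "e4": 2.0, "e5": 2.0}
print(assign_experts(loads, num_gpus=2))
```

Rerunning this assignment whenever measured loads drift is one simple way to keep expert placement balanced without all-to-all communication on every step.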

[Figure: Compute vs. Performance Trade‑off — performance as a function of model size, from Small through XXL.]

Benchmarking Beyond the Leaderboard

Maya Rodriguez • Former OpenAI researcher • 10 min read

Leaderboards dominate the public discourse about AI models: MMLU, HumanEval, GSM8K, HellaSwag. Each provides a single number that invites comparison—and inevitably, optimization.

But what happens when a benchmark becomes a target? The phenomenon of benchmark gaming is well‑known in machine learning: models overfit to the specific distribution of the test set, sometimes by memorizing answers, sometimes by exploiting statistical artifacts.

The recent “Chatbot Arena” results illustrate the divergence between benchmark scores and human preference. In a side‑by‑side comparison, humans consistently preferred models that scored lower on MMLU but demonstrated better reasoning transparency, fewer hedging phrases, and more coherent narrative flow.

If your benchmark can be gamed, it's not measuring what you think.

We need a new generation of evaluation that is inherently resistant to gaming. One approach is dynamic benchmark generation, where test questions are synthesized on‑the‑fly from a large space of possibilities, making memorization impossible.
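One way to make the idea of dynamic generation concrete is a tiny parameterized item generator; `generate_item` and its question space are invented for illustration, not an existing benchmark:

```python
import random

def generate_item(seed):
    """Synthesize a fresh arithmetic question from a seeded space; answers
    can't be memorized because items are regenerated for every run."""
    rng = random.Random(seed)
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    op, answer = rng.choice([("plus", a + b), ("minus", a - b)])
    return f"What is {a} {op} {b}?", answer

# A fresh 3-item benchmark; a new base seed yields an entirely new test set
for seed in range(3):
    question, answer = generate_item(seed)
    print(question, "->", answer)
```

Seeding keeps each run reproducible for graders while the space of possible items stays far too large to memorize.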

Another is adversarial evaluation, where a second model attempts to find the weakest points of the system being evaluated, surfacing failures that static benchmarks would miss.

A third is real‑world task integration: evaluating models on actual user tasks (e.g., “write a cover letter for a software engineering job”) and measuring outcomes with human raters.

Each of these approaches adds complexity and cost, but they move us closer to measuring what we truly care about: how well a model performs in the wild.

[Figure: Benchmark Scores vs. Human Preference.]

“The best way to predict the future is to invent it.” — Alan Kay
“Premature optimization is the root of all evil.” — Donald Knuth
“First solve the problem. Then write the code.” — John Johnson
“Any sufficiently advanced technology is indistinguishable from magic.” — Arthur C. Clarke
“The computer was born to solve problems that did not exist before.” — Bill Gates

The Ethics of Open Weights

Jamal Washington • Open‑source advocate • 14 min read

When Llama 2 was released under a “community license” that prohibited certain uses, the term “open source” was stretched nearly to breaking point. Since then, the landscape has fragmented: we have “open‑weight” models, “open‑access” models, “open‑research” releases, and “open‑source” AI—each with different legal and practical implications.

The core tension is between openness as a development methodology and openness as a distribution license. Traditional open‑source software guarantees the right to study, modify, and redistribute the source code. With AI models, the “source code” is the architecture (often published) and the trained weights (sometimes published).

Open weights without open data is like publishing a recipe but keeping the ingredients secret.

But weights alone are insufficient. Without the training data, the training code, the hyperparameters, and the exact hardware environment, faithful reproduction is effectively impossible. This has led some researchers to argue that “open weights” is a marketing term, not a scientific one.

If we cannot audit the data that shaped the model, we cannot fully understand its biases, its limitations, or its potential for harm. Transparency must extend beyond the model to the pipeline that created it.

Licensing is another frontier. Many “open‑weight” models carry usage restrictions that violate the Open Source Definition (e.g., prohibiting commercial use, requiring attribution, forbidding certain industries). These are not open‑source licenses; they are proprietary licenses with a gratis‑weight provision.

The community needs a new taxonomy: perhaps “published weights”, “open‑research”, “source‑available”, and “truly open‑source AI” (where training data, code, and weights are all under an OSI‑approved license). Until then, we should be precise with our language: open weights are a step forward, but they are not open source.

Tool Calling Accuracy

As LLMs become agents, their ability to correctly invoke tools—Bash commands, file operations, MCP servers, Skills, and generation tasks—becomes critical. Yet most model evaluations measure only raw text generation, not the precise structured output required for tool calls.

We built a deterministic test suite that evaluates tool‑calling accuracy across five categories:

  • Bash: Executing shell commands with correct arguments and flags.
  • File operations: Reading, writing, editing files with exact path handling.
  • MCP: Using Model Context Protocol tools with proper inputs.
  • Skills: Loading and invoking Codex skills with correct parameters.
  • Generation: Producing structured outputs (JSON, XML, YAML) that match a schema.
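For the Generation category, the structural check can be sketched with only the standard library; `check_generation` and its scoring values are illustrative, not the suite's actual implementation:

```python
import json

def check_generation(raw, required):
    """Score a model's JSON output: 1.0 if it parses and every required key
    has the expected type, 0.5 if it parses but is incomplete, else 0.0."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return 0.0
    ok = all(isinstance(data.get(key), typ) for key, typ in required.items())
    return 1.0 if ok else 0.5

schema = {"tool": str, "command": str}
print(check_generation('{"tool": "bash", "command": "ls -la"}', schema))  # 1.0
print(check_generation('{"tool": "bash"}', schema))                       # 0.5
print(check_generation('not json', schema))                               # 0.0
```

A real suite would validate against a full JSON Schema; the tiered return values here just mirror the all-or-partial-credit shape of the tests.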

Deterministic tests are the only way to know if a model can follow instructions.

The test suite consists of 50 prompts, each requiring a specific tool call. The model's response is parsed and compared against a ground‑truth structured call. Partial credit is given for semantically correct but syntactically imperfect calls.
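The partial-credit comparison can be sketched as follows; `score_call` is a hypothetical scorer whose tiers follow the exact/semantic/wrong-flags scheme the suite uses, not its real code:

```python
def score_call(predicted, expected):
    """Tiered partial credit for a bash tool call: exact string match, then
    the same tokens in any order, then right program with wrong flags."""
    if predicted is None:
        return 0.0
    if predicted == expected:
        return 1.0
    # Semantic match: same program, flags, and arguments in a different order
    if sorted(predicted.split()) == sorted(expected.split()):
        return 0.8
    # Right program invoked, but with the wrong flags or arguments
    if predicted.split()[0] == expected.split()[0]:
        return 0.3
    return 0.0

print(score_call("find . -mtime -7 -name '*.md'",
                 "find . -name '*.md' -mtime -7"))  # 0.8: reordered flags
```

Token-set comparison is a crude stand-in for real semantic equivalence (it misses flags that imply each other), but it keeps the scorer fully deterministic.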

Here is an example test case for the Bash category:

// Test case: find all .md files modified in the last 7 days
const test = {
    prompt: "Find all Markdown files modified in the last week.",
    expected: {
        tool: "bash",
        command: "find . -name '*.md' -mtime -7",
        explanation: "Uses find with -name and -mtime flags."
    },
    scoring: {
        exactMatch: 1.0,
        semanticMatch: 0.8,
        wrongFlags: 0.3,
        noCall: 0.0
    }
};

Early results show that even state‑of‑the‑art models struggle with consistency: they may get a complex call right once, then fail on a simpler variant. This suggests that tool‑calling is not yet a robust capability, but rather a fragile pattern‑matching skill.

Contributors

Alex Chen
Senior ML Engineer at Cerebras. Writes about scaling laws and hardware‑software co‑design.
Maya Rodriguez
Former OpenAI researcher. Focus on evaluation, safety, and adversarial robustness.
Jamal Washington
Open‑source advocate. Maintains the Llama.cpp bindings and the Open‑Weights Initiative.
Sofia Ivanova
Data visualization specialist. Her work appears in Nature, MIT Technology Review, and arXiv.