The Cost of Scale
When DeepSeek‑V3 2‑Billion was released, the community noticed something unusual: inference ran roughly 40% faster than models of a similar parameter count, while accuracy on MMLU and HumanEval remained competitive. The architecture paper revealed a series of deliberate trade‑offs: choices that privilege inference‑time efficiency over training‑time flexibility.
The core innovation is a modified attention mechanism that reduces memory‑bandwidth pressure. Instead of storing full attention matrices, the model uses sliding‑window attention with a dynamic compression factor applied to the keys and values. This roughly halves memory usage, but it requires a more complex training schedule that initially struggled on long‑context tasks.
Efficiency isn't just about flops—it's about the decisions you don't have to make.
DeepSeek's team opted for a mixture‑of‑experts (MoE) design with 16 experts, only 2 of which are active per token. This keeps the total parameter count high (1.3T, per the paper) while keeping the active compute footprint manageable. The trade‑off: expert routing introduces a coordination overhead that can become a bottleneck on older hardware.
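Top‑2 routing of this kind can be sketched in a few lines. The following is a minimal illustration of the general technique, not DeepSeek's implementation; the layer sizes, the linear experts, and names like `Top2MoE` are our own choices for clarity:

```python
import torch
import torch.nn as nn

class Top2MoE(nn.Module):
    """Toy MoE layer: 16 experts, 2 active per token (illustrative only)."""
    def __init__(self, dim, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # per-token routing scores
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_experts)]
        )

    def forward(self, x):  # x: (tokens, dim)
        scores = self.router(x).softmax(dim=-1)          # (tokens, experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep the top 2
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # Each token's output is a weighted sum of its two chosen experts;
        # only those experts run on that token, so active compute stays low.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                 # tokens routed to e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

The per-expert masking is also where the coordination overhead mentioned above comes from: in a distributed setting, each mask turns into a cross-device scatter and gather of token activations.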
The following snippet illustrates the key attention modification—a simplified version of the windowed attention with compression:
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WindowedAttention(nn.Module):
        def __init__(self, dim, window_size, compression_ratio=0.5):
            super().__init__()
            self.dim = dim
            self.window_size = window_size
            self.compression_ratio = compression_ratio
            self.qkv = nn.Linear(dim, dim * 3)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):
            B, T, C = x.shape
            qkv = self.qkv(x).reshape(B, T, 3, C)
            q, k, v = qkv.unbind(2)  # each (B, T, C)
            # Compress k and v along the sequence axis by average pooling
            stride = int(1 / self.compression_ratio)  # e.g. 2 for ratio 0.5
            k = F.avg_pool1d(k.transpose(1, 2), kernel_size=stride).transpose(1, 2)
            v = F.avg_pool1d(v.transpose(1, 2), kernel_size=stride).transpose(1, 2)
            # Windowed attention: each query attends only to compressed
            # positions within window_size of its own position
            attn = (q @ k.transpose(-2, -1)) / (C ** 0.5)  # (B, T, T//stride)
            pos_q = torch.arange(T, device=x.device).unsqueeze(1)
            pos_k = torch.arange(k.size(1), device=x.device).unsqueeze(0) * stride
            mask = (pos_q - pos_k).abs() > self.window_size
            attn = attn.masked_fill(mask, float('-inf')).softmax(dim=-1)
            return self.proj(attn @ v)  # (B, T, C)
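The compression step can be sanity‑checked in isolation. With the default ratio of 0.5, average pooling with a kernel (and stride) of 2 halves the key/value sequence length, which is where the memory saving comes from:

```python
import torch
import torch.nn.functional as F

k = torch.randn(2, 16, 64)           # (batch, tokens, dim)
compression_ratio = 0.5
stride = int(1 / compression_ratio)  # kernel/stride of 2
# avg_pool1d pools over the last axis, so move tokens there and back
k_c = F.avg_pool1d(k.transpose(1, 2), kernel_size=stride).transpose(1, 2)
print(k_c.shape)  # torch.Size([2, 8, 64]) -- sequence length halved
```

Queries are left uncompressed, so the output keeps the full sequence length; only the attended keys and values shrink.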
Training such a model required a custom distributed training framework that could handle the uneven load across experts. DeepSeek's engineers built a scheduler that dynamically reassigns experts to GPUs during training, reducing communication overhead by 30%.
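DeepSeek's scheduler is not public in detail, but the core idea of load‑driven reassignment can be sketched as a greedy bin‑packing pass. Everything below is our own toy illustration, assuming per‑expert token counts as the load signal; a real scheduler would also weigh the cost of migrating expert weights between GPUs:

```python
def rebalance(expert_load, num_gpus):
    """Greedily assign the heaviest experts first to the least-loaded GPU,
    approximating an even per-GPU token load. Toy sketch only."""
    order = sorted(range(len(expert_load)), key=lambda e: -expert_load[e])
    gpu_load = [0] * num_gpus
    assignment = {}
    for e in order:
        g = min(range(num_gpus), key=lambda i: gpu_load[i])  # lightest GPU
        assignment[e] = g
        gpu_load[g] += expert_load[e]
    return assignment

# Two hot experts land on different GPUs, cold experts fill the gaps:
print(rebalance([100, 90, 10, 5], 2))
```

Rerunning such a pass periodically during training keeps hot experts spread across devices, which is the behavior that reduces all‑to‑all communication stalls.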