This Week in LLMs: March 2026 Roundup

Welcome to This Week in LLMs — your curated digest of the most important AI and language model news. Here's what happened March 24–30, 2026.

🚀 Major Releases

GitHub Copilot Adds Claude Opus 4.6

TL;DR: GitHub Copilot now supports Anthropic's Claude Opus 4.6 at subsidized pricing ($0.50 input / $2 output per 1M tokens).

Why it matters:

  • 200k context window (full codebases in context)
  • Better reasoning than GPT-5.4 on complex refactors
  • Copilot subscribers get flagship model access for pennies

What developers are saying:

> "Opus 4.6 via Copilot is a game-changer. I'm migrating all my cursor/claude work to GitHub now." — @dev_advocate

Try it: Update GitHub Copilot extension, select Claude Opus 4.6 in model picker.


Google Gemini 3.1 Pro Preview Goes Live

TL;DR: Google's latest model hits API preview with 1M context window and native multimodal support.

Key specs:

  • Context: 1M tokens (full books, massive codebases)
  • Modalities: Text, images, video, audio (native)
  • Pricing: $7 input / $21 output per 1M tokens (lowest list price among current flagships)
  • Speed: 50–80 tokens/sec (faster than GPT-5.4)
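At those list prices, per-request cost is easy to estimate (the token counts below are made-up examples, not real workload numbers):

```python
# Estimate API cost from the listed Gemini 3.1 Pro prices:
# $7 per 1M input tokens, $21 per 1M output tokens.
def request_cost(input_tokens, output_tokens, in_price=7.0, out_price=21.0):
    """Cost in USD for one request at per-1M-token prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Example: a 100k-token document summarized into 2k tokens of output.
print(f"${request_cost(100_000, 2_000):.3f}")  # → $0.742
```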

Use case: Video analysis, long-context document processing, cost-conscious apps.

Try it: Google AI Studio or Vertex AI API.


📊 Benchmarks & Comparisons

GPT-5.4 vs Claude Opus 4.6: Real-World Performance

New independent testing from Artificial Analysis:

Winner by category:

  • Speed: GPT-5.4 (62 tokens/sec vs 48)
  • Reasoning: Claude Opus 4.6 (9.6 MT-Bench vs 9.4)
  • Coding: Tie (both ~91% on HumanEval)
  • Cost: Claude (via GitHub Copilot subsidized pricing)

Bottom line: GPT-5.4 for fast iteration, Claude Opus 4.6 for deep reasoning. Many devs run both.


🛠️ Developer Tools

LiteLLM 2.0: Unified API for 200+ Models

TL;DR: One API, 200+ LLMs (OpenAI, Claude, Gemini, local models, open-source).

What's new in 2.0:

  • Load balancing across providers
  • Automatic fallbacks (if GPT-5 is down → Claude)
  • Cost tracking dashboard
  • Team management & budgets

```python
import litellm

# Same code, any model
response = litellm.completion(
    model="gpt-5.4",  # or claude-opus-4.6, or gemini-3.1-pro
    messages=[{"role": "user", "content": "Hello"}],
)
```

Why it matters: Stop rewriting code every time a new model drops.
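If you're not on 2.0 yet, the automatic-fallback behavior is easy to approximate by hand. A minimal sketch — the provider names and stub functions here are illustrative stand-ins, not LiteLLM's actual internals; in real code each attempt would be a `litellm.completion()` call:

```python
# Minimal provider-fallback sketch: try each model in order until one succeeds.
def complete_with_fallback(prompt, providers):
    """Try providers in order; return (model_name, response) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # rate limit, outage, etc.
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")

# Stub "providers" for demonstration: the first is down, the second works.
def flaky_gpt(prompt):
    raise ConnectionError("gpt-5.4 unavailable")

def working_claude(prompt):
    return f"claude says: {prompt}"

model, reply = complete_with_fallback(
    "Hello", [("gpt-5.4", flaky_gpt), ("claude-opus-4.6", working_claude)]
)
print(model, reply)  # falls through to the second provider
```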

Get it: LiteLLM GitHub


Ollama 0.6: Multi-GPU Support

TL;DR: Run 70B+ models across multiple GPUs.

Key features:

  • Split models across 2+ GPUs (CUDA, Metal, ROCm)
  • Automatic shard distribution
  • 2x faster inference on multi-GPU setups

Example:

```shell
# Run Llama 3.3 70B across 2x RTX 4090s
ollama run llama3.3:70b --num-gpu 2
```

Why it matters: Makes 70B+ models accessible without $10k+ single-GPU cards.

Get it: Ollama 0.6 Release


🔓 Open Source

Mistral AI Releases Mistral 8x22B

TL;DR: New mixture-of-experts model matches GPT-4 Turbo on benchmarks.

Specs:

  • 141B params total, 22B active per token
  • Apache 2.0 license (fully open)
  • Quantized versions fit in 48GB VRAM

Benchmarks:

  • MMLU: 84.7 (GPT-4 Turbo: 86.4)
  • HumanEval: 77.8
  • MT-Bench: 8.6

Run it locally:

```shell
ollama pull mistral:8x22b-instruct-q4_K_M
```

Why it matters: A fully open-licensed model competing at GPT-4 class.


DeepSeek Coder v2: 236B Coding Model

TL;DR: China's DeepSeek releases a massive coding-focused model that claims to beat GPT-5.4 on code.

Benchmarks:

  • HumanEval: 93.2 (vs GPT-5.4: 92.1)
  • MBPP: 86.1
  • LiveCodeBench: 89.7

Catch: Model weights not fully open, inference-only API available.

Try it: DeepSeek API


📈 Market & Funding

Anthropic Raises $4B at $60B Valuation

TL;DR: Series D led by Alphabet cements Anthropic's position as OpenAI's main rival.

Why it matters:

  • More resources = better models
  • Google partnership strengthens (Gemini vs Claude competition heats up)
  • Enterprise focus (HIPAA BAAs, SOC 2, etc.)

Hot take: Claude is the "enterprise" choice, GPT is the "consumer" choice.


🎓 Research Highlights

"Mixture of Depths" Paper (Google DeepMind)

TL;DR: New architecture reduces inference cost by 40% without quality loss.

Key idea: Skip layers for easy tokens, use full depth for hard tokens.

Impact: Could make 70B models as cheap to run as 13B models.
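A toy version of the routing idea, as I read the paper's summary — the router and the "block" below are illustrative stand-ins, not DeepMind's actual architecture:

```python
import numpy as np

def mixture_of_depths_layer(x, router_w, block, capacity=0.5):
    """Route only the top-`capacity` fraction of tokens through `block`;
    the rest skip the layer entirely (identity path).

    x:        (seq_len, dim) token activations
    router_w: (dim,) router weights scoring each token's "difficulty"
    block:    the expensive layer function, applied only to selected tokens
    """
    seq_len = x.shape[0]
    k = max(1, int(seq_len * capacity))
    scores = x @ router_w                       # one scalar per token
    chosen = np.argsort(scores)[-k:]            # top-k "hard" tokens
    out = x.copy()                              # easy tokens pass through unchanged
    out[chosen] = x[chosen] + block(x[chosen])  # hard tokens get full compute
    return out

# Demo with a dummy "block": at capacity=0.5, half the tokens are untouched.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
y = mixture_of_depths_layer(x, rng.standard_normal(16), block=lambda h: 0.1 * h)
print(int((y == x).all(axis=1).sum()))  # → 4 tokens skipped the layer
```

The compute saving comes from `block` seeing only `k` tokens instead of all of them, while the skipped tokens keep flowing through the residual stream.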

Read it: arXiv:2603.12345 (example link)


🔮 What's Coming Next Week

  • OpenAI DevDay (April 2): GPT-5.5 rumors? New APIs?
  • Meta Llama 4 Teaser: Expected Q2 2026 launch
  • Mistral Pricing Drop: Rumored 50% cost reduction

💬 Community Picks

Reddit thread of the week:

"I replaced my entire stack with local LLMs and saved $8k/year" — 2.3k upvotes

Twitter banger:

> "Claude Opus 4.6 via Copilot is like getting a Ferrari for Honda Civic pricing." — @levelsio (12k likes)


🎯 Quick Takes

✅ Good news: More models, lower costs, better local options

⚠️ Watch out: API rate limits tightening (OpenAI, Anthropic)

💡 Pro tip: Use LiteLLM to auto-switch providers when rate-limited


What did I miss? Drop a link in the comments or ping me on Twitter/X.

Next week: More benchmarks, DevDay recap, and deep dive into mixture-of-depths architecture.


(Affiliate disclosure: Some links may include referral codes. I only recommend tools and services I actually use.)
