Seroter's Daily Reading — #774 (April 30, 2026) — Seroter's Daily Reading

Listen: https://blossom.nostr.xyz/3bdb2e1e91d0067d81fe283b25eea2b1b82f5c83612a7dff6bf261a11d599d44.mpga

Seroter's Daily Reading, episode 774. April 30, 2026.

Addy Osmani has a piece on what happens when AI agents stop being a single conversation and become something that runs for hours, days, or even weeks — Long-running Agents. The dominant mental model for the past couple of years has been the chat window: you type a goal, watch tokens stream by, and stop when the context window fills up or the agent declares victory prematurely. That got us pretty far, but it has a ceiling. The model forgets. It re-introduces a bug it fixed nine turns ago. It says "task complete" when it isn't. Long-running agents are what comes next, and Osmani maps out the three walls every implementation hits. Finite context — even a million-token window fills, and context rot kicks in well before the hard limit. No persistent state — each new session starts blank, like engineers showing up to a project with no knowledge of what the last shift accomplished. And no self-verification — models reliably grade their own work too generously. Asked "are you done?" they say yes more often than they should. The three major labs have converged on similar shapes of answers, but with very different surface area. Anthropic describes a brain, hands, and session split, where the session is an append-only event log that makes the agent recoverable even if a container crashes mid-run. Cursor ships a three-role system: planners that explore the codebase and emit tasks, workers that focus on execution and don't coordinate with each other, and judges that decide when an iteration is actually finished. Google, at Cloud Next, folded Vertex AI into the Gemini Enterprise Agent Platform and productized long-running agents with named SLAs, sessions with custom IDs that map to your CRM, and a Memory Bank for long-term curated memory. The pattern across all three is the same: state lives outside the agent, sessions are durable, generation is separated from evaluation. You can build a working version of this in an evening with a bash script and a JSON file. 
Most of what the big labs have productized is the work of making that pattern recoverable, secure, and observable at scale.
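To make the "evening project" version of that pattern concrete, here is a minimal sketch in Python rather than bash. It assumes a JSON-lines file as the append-only session log, and the `worker` and `judge` functions are hypothetical stand-ins for model calls. None of this is taken from Osmani's piece or from any lab's implementation.

```python
import json
import tempfile
from pathlib import Path


def append_event(log_path: Path, event: dict) -> None:
    """Append one event as a JSON line; earlier lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def load_events(log_path: Path) -> list[dict]:
    """Rebuild state from the log, so a fresh process can resume the run."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def worker(task: str) -> str:
    """Hypothetical generator: stands in for a model call that does the work."""
    return task.upper()


def judge(task: str, result: str) -> bool:
    """Hypothetical evaluator: a separate check, not the worker grading itself."""
    return result != "" and result == task.upper()


def run(log_path: Path, tasks: list[str]) -> list[str]:
    """Run tasks not yet accepted in the log; return the tasks run this time."""
    done = {e["task"] for e in load_events(log_path) if e["type"] == "accepted"}
    ran = []
    for task in tasks:
        if task in done:
            continue  # accepted in an earlier session, so skip it
        result = worker(task)
        append_event(log_path, {"type": "result", "task": task, "result": result})
        if judge(task, result):
            append_event(log_path, {"type": "accepted", "task": task})
        ran.append(task)
    return ran


if __name__ == "__main__":
    log = Path(tempfile.mkdtemp()) / "session.jsonl"
    print(run(log, ["plan", "build"]))          # first session runs both tasks
    print(run(log, ["plan", "build", "test"]))  # second session only runs "test"
```

The second call behaves like a new shift arriving with the log in hand: state lives in the file, not in the process, and acceptance is recorded by the judge rather than by the worker.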

Hugging Face has a post on a problem that isn't getting nearly enough attention: evaluation costs are scaling non-linearly, and we're approaching a point where independent benchmarking becomes structurally impossible outside the major labs — AI evals are becoming the new compute bottleneck. The Holistic Agent Leaderboard spent about forty thousand dollars running twenty-one thousand seven hundred and thirty agent rollouts across nine models and nine benchmarks. A single GAIA run on a frontier model can cost nearly twenty-eight hundred dollars before caching. Exgentic ran a sweep across agent configurations and found a thirty-three times cost spread on identical tasks — the variance comes from scaffold choices, not the model itself. Here's the part that should worry the field: static benchmarks compress beautifully. You can reduce HELM by a hundred to two hundred times and preserve the ranking. Agent benchmarks don't. You get two to three and a half times compression at best before you lose fidelity. And training-in-the-loop benchmarks — things like PaperBench, where the agent has to replicate a published paper from scratch — resist compression almost completely, because the thing being evaluated is the trained model itself. The Well, a scientific ML benchmark, spends about nine hundred and sixty H100 hours evaluating a single new architecture. Evaluation compute now exceeds training compute in some corners of ML, reversing the old mental model entirely. HAL has actually paused new model evaluations to focus on reliability — the headline accuracy numbers still carry too much noise, and reducing that noise costs real money. If only frontier-lab budgets can produce statistically credible benchmark numbers, then evaluating AI systems becomes concentrated inside the same labs that build them. External validation becomes partial, and sometimes absent.
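A quick back-of-the-envelope check on those numbers, as a sketch. The dollar and rollout figures are the ones quoted above. Applying the quoted compression factors to the same forty-thousand-dollar budget is an assumption made purely for comparison, not something the post claims.

```python
# Figures quoted above for the Holistic Agent Leaderboard.
total_cost_usd = 40_000  # "about forty thousand dollars"
rollouts = 21_730

cost_per_rollout = total_cost_usd / rollouts
print(f"average cost per rollout: ${cost_per_rollout:.2f}")


def reduced_cost(full_cost: float, compression: float) -> float:
    """Cost remaining after shrinking a benchmark by the given factor."""
    return full_cost / compression


# Assumption: treat the same budget as if it were compressible at each quoted rate.
for label, factor in [("static benchmark, 100x", 100), ("agent benchmark, 3.5x", 3.5)]:
    print(f"{label}: ${reduced_cost(total_cost_usd, factor):,.0f}")
```

Under that assumption, the static-style compression leaves a budget in the hundreds of dollars, while the best-case agent compression still leaves it above ten thousand.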

DX has a longitudinal study across more than four hundred engineering organizations over sixteen months — AI productivity gains: More modest than expected. The headline finding: as AI tool usage increased by an average of sixty-five percent, median PR throughput went up by just under eight percent. Most organizations land in the five to fifteen percent range. That's a real gain, but it's nowhere near the three times or ten times expectations executives are being held to. Why aren't the gains higher? The top reason probably won't surprise you: coding is not the primary bottleneck. Microsoft's Brian Houck published a study showing only about fourteen percent of developer time is spent coding. AI is only optimizing that slice, so the ceiling on output gains is structurally limited by that fact. The team also heard about review burden, technical debt, cognitive debt, and cultural friction inhibiting full adoption. On the flip side, for organizations at the high end of the range, the common thread seems to be an all-in culture with centralized rollout and championing of AI tools — a cultural shift toward infusing AI into the entire software development lifecycle, not just coding. One open question the DX team is still investigating: self-reported time savings from developers aren't showing up as proportional output gains. Where is that time going? A couple of hypotheses: only fourteen percent of time savings maps back to coding, so the gains are ratably distributed across the whole day, and some of the savings come with side effects — more oversight, more review — that absorb the reclaimed time. There's also a concept the author calls false velocity: organizations so focused on showing how fast and prolific their engineers are with AI that they're not asking whether roadmap velocity is actually increasing or whether the code being generated is sustainable.
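The "structurally limited" point can be sketched with Amdahl's law. That framing is an assumption added here, not something the DX study uses: it treats the fourteen percent coding share as the only part of the day AI speeds up and leaves everything else unchanged.

```python
def overall_speedup(coding_share: float, coding_speedup: float) -> float:
    """Amdahl's law: only the coding share of the day gets faster."""
    return 1.0 / ((1.0 - coding_share) + coding_share / coding_speedup)


CODING_SHARE = 0.14  # share of developer time spent coding, as quoted above

for s in (2, 10, float("inf")):
    gain_pct = (overall_speedup(CODING_SHARE, s) - 1) * 100
    print(f"coding {s}x faster -> overall gain of {gain_pct:.1f}%")
```

Under this simplified model, even infinitely fast coding caps the overall gain at 1 / 0.86, or roughly sixteen percent, which is in the same neighborhood as the five to fifteen percent range most organizations report.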

Google Cloud announced fifty-plus fully managed MCP servers covering its entire service catalog — 50+ fully managed MCP servers now available for Google Cloud services. MCP — the Model Context Protocol — is becoming the standard that lets agents talk to external tools with structured, reliable connections rather than cobbled-together webhooks and hope. The servers span infrastructure, databases, analytics, and productivity tools. You can give an agent live access to GKE, Cloud Run, Spanner, AlloyDB, Cloud SQL, BigQuery, and Firestore. For operations and security, there's Cloud Logging, Monitoring, and Google Security Operations for automated threat investigation and response. And for productivity, there's the full Workspace suite — Gmail, Drive, Calendar, People API, and Chat — so an agent can summarize your inbox, draft a Doc, manage a calendar invite, or run a Chat workflow. The point is that agents don't just chat anymore. They take actions across your entire cloud estate with the same protocol your models already understand.
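For a sense of what "the same protocol" looks like on the wire, here is a rough sketch of an MCP tool invocation. MCP messages are JSON-RPC 2.0, and tools are invoked with the `tools/call` method. The tool name `run_query` and its arguments are hypothetical; a real server advertises its actual tools through `tools/list`, and nothing here reflects the specific tools Google's servers expose.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids, increasing per request


def tools_call(name: str, arguments: dict) -> str:
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


# Hypothetical tool name and arguments, for illustration only.
print(tools_call("run_query", {"sql": "SELECT 1"}))
```

The value of the standard is that this envelope stays the same whether the tool behind it is a database query, a log search, or a calendar invite.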

Simon Willison covers a move by the Zig programming language that stands out in the current moment — Zig Anti-AI. Zig has one of the most stringent anti-LLM policies of any major open source project: no LLMs for issues, no LLMs for pull requests, no LLMs for comments on the bug tracker. Not even for translation. The Zig Software Foundation's VP of Community, Loris Cro, wrote a piece called "Contributor Poker and Zig's AI Ban" that articulates the rationale as well as anything I've seen. In successful open source projects, you eventually reach a point where you're getting more PRs than you can process. Most projects respond by raising the bar — stop accepting imperfect PRs, maximize review ROI. Zig does the opposite. They try their hardest to help new contributors get their work in, even if those contributors need help along the way. And they do it not just because it's the right thing, but because it's the smart thing. Zig values contributors over their contributions. Each contributor represents an investment by the core team. The primary goal of reviewing and accepting a PR isn't to land new code — it's to grow new contributors who become trusted and prolific over time. LLM assistance breaks that completely. It doesn't matter if the LLM helps you submit a perfect PR to Zig. The time the Zig team spends reviewing your work does nothing to help them add new, confident, trustworthy contributors to the project. Willison's framing: if a PR was mostly written by an LLM, why should a project maintainer spend time reviewing and discussing that PR as opposed to firing up their own LLM to solve the same problem? The Bun JavaScript runtime, which was acquired by Anthropic in December and operates its own fork of Zig, recently achieved a four times performance improvement on compilation after adding parallel semantic analysis and multiple code generation units. 
They explicitly said they do not plan to upstream that work because Zig has a strict ban on LLM-authored contributions.

JetBrains has a practical guide to the Go web framework landscape — Popular Go Web Frameworks: A Practical Guide for Developers. The headline: there's no single dominant framework in Go, and that's intentional. Forty-six percent of Go developers use the language to build websites or web services, and about thirty-two percent stick with the standard library's net/http exclusively. Gin dominates the third-party space with forty-eight percent adoption. Echo has sixteen percent, Gorilla has seventeen despite no longer being actively maintained, and Fiber has eleven percent. The piece walks through each one — Gin for its community and familiarity, Echo for its structured batteries-included approach, Chi for staying close to the standard library with a lightweight router, Fiber for developers coming from Express.js who want extreme performance even at the cost of compatibility with the Go ecosystem. The comparison table is useful: net/http has no dependencies and full standardization across teams, but it requires more boilerplate for complex scenarios. Fiber, built on fasthttp rather than net/http, offers extremely high performance but locks you in more than the others and requires refactoring to move away from it. The conclusion from the JetBrains team is that none of the frameworks is superior to the standard library — they just offer different tradeoffs, and the choice comes down to what your team and project need.

Firebase announced that Firestore pipeline operations have graduated to general availability — Firestore levels up: Bringing the power of search and JOINs to NoSQL — and with it come some features that significantly close the gap between Firestore and relational databases. Full text search is now in preview, using Google search-style query syntax with special text indexes that tokenize fields and return results with a relevancy score you can sort by. Geospatial queries are also in preview, letting you sort results by distance to a coordinate. The more interesting addition is subqueries — essentially joins — where you can combine two arbitrary pipelines into one and aggregate data from subcollections directly into parent documents. The example they give is aggregating a restaurant's average rating and review count from its reviews subcollection into a single field on the restaurant document itself, in a single pipeline. They also added a data manipulation language so you can now pipe the output of a pipeline query directly into an update or delete operation without spinning up a Cloud Function. Backfilling a new field across an entire collection, conditionally updating documents, all inside Firestore. The Firebase blog's conclusion: with these changes they've achieved query feature parity with major relational databases like PostgreSQL. That's a significant statement from a product that has spent years being compared unfavorably to SQL databases on query power.
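To show what that restaurant example computes, here is a plain Python sketch of the same aggregation. This is not the Firestore pipeline API, and the sample documents and field names are made up. It only illustrates the result of folding a reviews subcollection into its parent document.

```python
from collections import defaultdict

# Made-up sample data standing in for a collection and its reviews subcollection.
restaurants = {"r1": {"name": "Noodle Bar"}, "r2": {"name": "Taqueria"}, "r3": {"name": "Cafe"}}
reviews = [
    {"restaurant_id": "r1", "rating": 5},
    {"restaurant_id": "r1", "rating": 3},
    {"restaurant_id": "r2", "rating": 4},
]


def with_review_stats(restaurants: dict, reviews: list) -> dict:
    """Return restaurant docs with review_count and avg_rating merged in."""
    ratings = defaultdict(list)
    for review in reviews:
        ratings[review["restaurant_id"]].append(review["rating"])
    merged = {}
    for rid, doc in restaurants.items():
        rs = ratings.get(rid, [])
        merged[rid] = {
            **doc,
            "review_count": len(rs),
            "avg_rating": sum(rs) / len(rs) if rs else None,  # no reviews, no average
        }
    return merged


if __name__ == "__main__":
    for rid, doc in with_review_stats(restaurants, reviews).items():
        print(rid, doc)
```

Before subqueries, this kind of join-and-aggregate step typically lived in application code like the above; the announcement moves it into the query itself.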

Google announced that Gemini can now generate files — You can now easily generate files in Gemini — PDFs, Microsoft Word documents, Excel spreadsheets, Google Docs, Sheets, Slides, and more — directly in the chat interface. You describe what you need, and Gemini creates the file and either lets you download it or export it directly to Drive. Supported formats include the Workspace file types, PDF, docx, xlsx, CSV, LaTeX, plain text, rich text format, and markdown. This is another step in the direction of Gemini as a workspace tool rather than just a conversational assistant.

InfoWorld reports that GitHub is moving Copilot to a usage-based billing model — GitHub shifts Copilot to usage-based billing, signaling a new cost model for enterprise AI tools — starting June first. All Copilot plans — Pro, Pro+, Business, and Enterprise — will transition from fixed subscription pricing to consumption-based charges measured through AI credits. You get a monthly allotment with the option to purchase additional usage. This is a significant shift from the per-seat model that has been the standard for enterprise AI developer tools. It mirrors what we're seeing across the infrastructure layer: consumption pricing is replacing subscription pricing as the dominant model, which changes the calculus for budget forecasting and makes AI tool spending more directly proportional to actual value delivered, for better or worse.
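For budget forecasting, the shape of the new model can be sketched as an allotment plus overage. Every number below is hypothetical, and the assumption that a fixed plan fee remains alongside the credit allotment is ours, not something the article's summary confirms.

```python
def monthly_bill(plan_fee: float, included_credits: int,
                 used_credits: int, price_per_extra_credit: float) -> float:
    """Plan fee plus any credits used beyond the monthly allotment."""
    extra = max(0, used_credits - included_credits)
    return plan_fee + extra * price_per_extra_credit


# Hypothetical plan: 10.0 fee, 100 credits included, 0.5 per extra credit.
for used in (80, 100, 150):
    print(f"{used} credits used -> bill of {monthly_bill(10.0, 100, used, 0.5):.2f}")
```

The forecasting problem is the last case: once usage passes the allotment, the bill tracks consumption rather than headcount.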

Google Cloud reported first quarter results exceeding twenty billion dollars in revenue, up sixty-three percent year over year — Google Cloud surpasses $20B, but says growth was capacity-constrained. The growth is being driven by strong demand for Gemini Enterprise and AI solutions. Products built on Google's genAI models grew nearly eight hundred percent year over year. Gemini Enterprise itself grew forty percent quarter over quarter, and AI token growth hit sixteen billion tokens per minute, up from ten billion in the fourth quarter. But here's the tension: Sundar Pichai told analysts on the earnings call that the company's cloud revenue would have been higher if they could meet the demand. They are compute constrained in the near term. Google's backlog doubled in the quarter to four hundred and sixty-two billion dollars, and they expect to work through about half of it over the next two years. The demand is massive and they can't satisfy it all yet.

Google Developers has a post, Speeding Up AI: Bringing Google Colossus to PyTorch via GCSFS and Rapid Bucket, on bringing Colossus, the storage architecture behind YouTube and Google Search, directly to the PyTorch ecosystem via gcsfs and a new product called Rapid Bucket. The problem they're solving is keeping GPUs fed. As model sizes grow, data loading and checkpointing often become the primary bottlenecks in training. Standard REST-based storage access can't meet the throughput and latency requirements of modern distributed training. Rapid Bucket bypasses REST APIs and uses persistent gRPC bidirectional streams to connect directly to Colossus. The results: more than fifteen tebibytes per second of aggregate throughput, under one millisecond latency for random reads, and over twenty million queries per second. In benchmarking, they observed a twenty-three percent improvement in total training time compared with standard regional buckets, with reads improving by four point eight times and writes by two point eight times. The key detail for existing code: you don't change it. You just change the bucket type to a Rapid Bucket, and gcsfs auto-detects the connection upgrade.
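The "you don't change your code" claim is easier to see with checkpoint helpers that only depend on a filesystem object and a path. This is a sketch under assumptions: `InMemoryFS` is a demo-only stand-in so the example runs without credentials, the bucket name is hypothetical, and the helpers rely only on the `open(path, mode)` call that fsspec-style filesystems such as `gcsfs.GCSFileSystem` provide.

```python
import io


class _Writer(io.BytesIO):
    """Buffer that copies its bytes into the store when closed."""

    def __init__(self, store: dict, path: str):
        super().__init__()
        self._store, self._path = store, path

    def close(self) -> None:
        if not self.closed:
            self._store[self._path] = self.getvalue()
        super().close()


class InMemoryFS:
    """Demo-only stand-in exposing an fsspec-style open(path, mode)."""

    def __init__(self):
        self.files = {}

    def open(self, path: str, mode: str = "rb"):
        if "w" in mode:
            return _Writer(self.files, path)
        return io.BytesIO(self.files[path])


def save_checkpoint(fs, path: str, payload: bytes) -> None:
    """Write checkpoint bytes through whatever filesystem is passed in."""
    with fs.open(path, "wb") as f:
        f.write(payload)


def load_checkpoint(fs, path: str) -> bytes:
    """Read checkpoint bytes back through the same interface."""
    with fs.open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    fs = InMemoryFS()
    # Hypothetical bucket and object names.
    save_checkpoint(fs, "my-rapid-bucket/ckpt/step-100.bin", b"weights")
    print(load_checkpoint(fs, "my-rapid-bucket/ckpt/step-100.bin"))
```

With gcsfs installed and credentials configured, a `gcsfs.GCSFileSystem()` instance could be passed as `fs` instead. Per the post, the only thing that changes is which bucket the path points at.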

That's episode 774. Eleven articles spanning long-running agent architecture, eval economics, the real productivity picture from four hundred organizations, MCP servers, Zig's anti-AI stance, Go framework choices, Firestore's query capabilities, Gemini file generation, Copilot's pricing shift, Google's capacity constraints, and Colossus storage performance. The through-line this week feels like infrastructure maturing underneath the agentic wave: the observability, persistence, memory, and cost structures that make production agents actually viable rather than just theoretically interesting.

Long-running Agents — Addy Osmani
AI evals are becoming the new compute bottleneck — Hugging Face / EvalEval Coalition
AI productivity gains: More modest than expected — DX
50+ fully managed MCP servers now available for Google Cloud services — Google Cloud
Zig Anti-AI — Simon Willison
Popular Go Web Frameworks: A Practical Guide for Developers — JetBrains
Firestore levels up: Bringing the power of search and JOINs to NoSQL — Firebase Blog
You can now easily generate files in Gemini — Google
GitHub shifts Copilot to usage-based billing, signaling a new cost model for enterprise AI tools — InfoWorld
Google Cloud surpasses $20B, but says growth was capacity-constrained — TechCrunch
Speeding Up AI: Bringing Google Colossus to PyTorch via GCSFS and Rapid Bucket — Google Developers