Seroter's Daily Reading: #778 (May 6, 2026)

Listen: https://blossom.nostr.xyz/74069152d1d06bbafc7a0697435eee22ee85aca8f0516246306c7b42dc0bfb04.mpga

Seroter's Daily Reading episode 778 for May 6th, 2026.

Today's list has a good spread: agent tooling, engineering leadership, and some sharp critiques of where the industry is heading. Let's get into it.

Starting with Addy Osmani's piece on Agent Skills, which is his attempt to make AI coding agents go through the same process a senior engineer would. The core problem he's addressing is that agents default to the shortest path to done. You ask for a feature, they write the feature. They don't ask about specs, they don't write tests first, they don't check whether the PR is reviewable. The senior parts of the job—the surfacing of assumptions, the discipline around scope, the evidence that the work is actually correct—those are invisible to the agent because the reward signal just says "task complete." His repo, which just crossed 27K stars, encodes twenty skills across six lifecycle phases: Define, Plan, Build, Verify, Review, Ship. It maps onto the SDLC that any healthy engineering org runs. What's clever is the anti-rationalization tables. Each skill includes pre-written rebuttals to the excuses an agent might use to skip the workflow—"This task is too simple to need a spec," "I'll write tests later," "Tests pass, ship it." He's basically planting counters to lies the agent hasn't told yet. That's good human engineering too. Teams that write down their own anti-rationalizations end up with fewer of them. The piece also maps the skills onto Google engineering practices—Hyrum's Law, the test pyramid, the ~100-line PR sizing with severity labels, Chesterton's Fence. None of that is new, but it's exactly the part agents skip by default. A frontier model has read "Hyrum's Law" in its training data but it doesn't apply it at 3am when it's designing your API. Skills are how you make sure it does.

Moving to a different skill entirely: Yue Zhao's "How To Be Direct And Strategic" argues that those two things don't have to be in tension. She gets a client question that illustrates the trap perfectly: someone flagged a process problem directly in a team meeting, gave all the details, and got shut down completely. The instinct from high-performers is often that if the idea is good it should speak for itself, or that managing how a message lands is manipulative. But managing the framing of a high-stakes conversation isn't deception, it's empathy. Before a difficult conversation, ask what the person already believes, what emotion the message is likely to trigger, and what needs to be said first. The client went back to her manager, started by asking what her manager was hearing about the cross-team dynamic, introduced the feedback as an addition to what the manager was already seeing, and got agreement instead of pushback. Same truth, different conversation.

From Harvard Business Review, there's a piece called "When an Executive Asks You an Unexpected Question" that complements this. The point is that when an executive asks you something off the cuff, it's not just about having the factual answer ready. You need to understand where the question is coming from: which meeting they just left, what pressure they're under, what decision they're trying to make. Fast and factual isn't enough if you miss the real question underneath.

Google Cloud put out a five-part series, "Five must-have guides to move agents into production with Gemini Enterprise Agent Platform." The first guide covers design patterns for long-running agents that maintain state for up to seven days, with checkpoint-and-resume mechanisms and delegated approval workflows. The second is about the agent governance stack, which means treating your agent fleet with the same rigor as your engineering org: Agent Identity with cryptographic badges, Agent Registry for centralized tool governance, Agent Gateway for natural language security policies, and behavioral anomaly detection. The third digs into multi-agent orchestration patterns with their Agent Development Kit, covering graph-based workflows, coordinator-specialist patterns, and secure sandboxed executors for running arbitrary code. The fourth is on the A2A and MCP interoperability standards, meaning how agents from different teams and languages can discover each other and collaborate, with Agent Cards publishing capabilities and MCP as the universal tool bridge. The fifth covers pre-built atomic agent blueprints in their Agent Garden. If you're actually running agents in production rather than just demos, this series is worth the time.
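To make the checkpoint-and-resume idea from that first guide a bit more concrete, here's a minimal sketch in plain Python. It is not the Gemini Enterprise Agent Platform API. The function name `run_with_checkpoints`, the JSON file format, and the step names are all assumptions made up for illustration. The point is only the shape of the pattern: record each finished step durably, and skip finished steps on restart.

```python
import json
import os


def run_with_checkpoints(steps, checkpoint_path):
    """Run (name, function) steps in order, skipping steps already checkpointed.

    Hypothetical helper for illustration; each step function receives the
    results of earlier steps and must return a JSON-serializable value.
    """
    state = {"completed": [], "results": {}}
    if os.path.exists(checkpoint_path):
        # Resume: load whatever an earlier run managed to finish.
        with open(checkpoint_path) as f:
            state = json.load(f)

    for name, fn in steps:
        if name in state["completed"]:
            continue  # finished in a previous run, do not redo it
        state["results"][name] = fn(state["results"])
        state["completed"].append(name)
        # Write to a temp file and rename it, so a crash mid-write
        # cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state
```

If a step raises partway through, calling the function again with the same checkpoint path picks up at the failed step rather than starting over, which is the property a multi-day agent run would need.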

Speaking of governance, there's a panel discussion from DX Annual, "Designing the AI-native engineering organization," featuring leadership from Microsoft, 1Password, and Atlassian. Some notable points. Tim from Microsoft described how the most effective teams are inverting the traditional time distribution: instead of 80% on operate and 10% on create, they're moving to spend most of their time on plan and validate, because AI has compressed create and operate. But he flagged that you shouldn't delegate validate to AI yet, and definitely not security. Nancy from 1Password said they've stopped writing full-length PRDs; teams build prototypes and put them in front of customers instead, which cut the back-and-forth significantly. But more makers writing PRs means more code reviews and reliability work downstream, so they've got a DevOps agent experiment running on real incident data to handle operate. At Atlassian, Taroon said the biggest shift isn't a reorg but how teams form: for zero-to-one work they've moved to squads of three to four people, because AI compressed the building part enough that the bottleneck is now alignment and decision-making. The panel agreed on not mandating AI usage; they track daily active use but focus on enabling it, and organic champions who show concrete wins do more than any top-down directive. On costs, everyone's treating token spend like cloud COGS now, which requires the same level of rigor. One quote from Nancy: negotiate forward-projected commitments with model providers just like you'd negotiate a cloud contract. And here's the thing that stood out: designers at Atlassian are submitting PRs, which is genuinely useful, but engineers are flagging quality issues in those contributions daily. The teams most comfortable accepting non-engineering contributions are the ones with robust test suites and deployment checks already in place. If your right-of-code processes are weak, AI-assisted contributions from anyone will cause problems. The maintainability of the code is suffering, and they've gone back to more standardized approaches and quality checks as a result.

Then Lars Faye comes in with "Agentic Coding is a Trap", and it's a useful counterweight to some of the enthusiasm in this space. His core argument is that the paradox of supervision is real: effectively using a coding agent requires the very coding skills that may atrophy from AI overuse. Anthropic's own research flagged this. LinkedIn's Director of Engineering, who oversees 50 engineers, has noticed the proliferation of coding agents and asked his team not to use them for tasks that require critical thinking or problem-solving. His quote: "To grow skills, people need to go through hardship. They need to develop the muscle to think through problems. How would someone question if AI is accurate if they don't have critical thinking?" Faye also flags the vendor lock-in angle: when Claude had an outage, posts surfaced showing teams at a standstill, their workflows already dependent on a single provider. Token costs are a moving target in a way employee costs aren't. And he makes a point about planning: some developers think better in code. Dax, the creator of OpenCode, said in an interview that when working on something new or challenging, typing out code is the process by which he figures out what he should even be doing, and that he has a tough time just writing a giant spec. The LLM fills ambiguity with assumptions, which leads to more review, more agent revisions, more tokens burned, more disconnection from what's being created. Faye's own workflow is to use LLMs to help generate specs and plans while he facilitates the implementation, staying manually engaged for twenty to a hundred percent of the coding depending on the task. He never generates more than he can review in a sitting, and he never asks an LLM to implement something he couldn't do himself. His TL;DR: use them like the ship's computer, not Data.

AWS announced that the AWS MCP Server is now generally available, giving AI agents authenticated access to all AWS services through a fixed set of tools. The call_aws tool executes any of more than fifteen thousand API operations using your existing IAM credentials. search_documentation and read_documentation retrieve current documentation at query time so the agent always works from up-to-date information. There's a run_script tool that lets the agent write Python that runs server-side in a sandboxed environment inheriting your IAM permissions but with no network access. The most significant addition is the transition from Agent SOPs to Skills: curated guidance for the tasks where agents most commonly make mistakes, contributed and maintained by AWS service teams, which keeps the tool list short and predictable and reduces hallucination. For enterprise customers, there's a clear separation between human and agent permissions, and CloudWatch metrics let you observe MCP calls separately from direct human calls. The demo shows Claude Code with Opus 4.6, which has a knowledge cutoff in May 2025, failing to know about S3 Vectors, and then succeeding once the AWS MCP Server is connected.

On code migration specifically, Google Cloud published "Pioneering AI-assisted code migration: How Google achieved 6x faster migration from TensorFlow to JAX," which details how they achieved a six to eight times speedup migrating production machine learning models from TensorFlow to JAX at YouTube scale. The key is they didn't point a single agent at the codebase and say migrate this; that failed. They built a multi-agent system with three distinct roles. A Planner agent uses deterministic compiler-based static analysis to map the entire dependency tree and sequence migration from leaf nodes upward. There's no AI judgment in sequencing, the dependency graph is the dependency graph. An Orchestrator agent chunks work to fit context windows, injects domain-specific Playbooks, and handles failure recovery, with Playbooks ranging from general repository instructions to client-specific golden examples distilled from successful manual migrations. A Coder agent keeps working until it produces compilable, verifiable code, with done defined externally by build success and mathematical equivalence tests. The validation layer verifies correctness using algorithmic gradient ascent to find the maximum error between the original TensorFlow layer and the new JAX layer: mathematical verification, not probabilistic assessment. They also run a separate LLM Judge that scores migrated code against an architectural checklist.
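The leaf-nodes-upward sequencing is just a topological sort of the dependency graph, and it's worth seeing how little code that takes. What follows is a minimal sketch using Python's standard-library `graphlib`, not Google's Planner. The module names in `DEPENDENCIES` and the `migration_order` helper are hypothetical, and a real system would build the graph from static analysis rather than a hand-written dict.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each module maps to the modules it depends on.
# The names are illustrative only and do not come from the Google post.
DEPENDENCIES = {
    "ranking_model": {"embedding_layer", "feature_utils"},
    "embedding_layer": {"tensor_ops"},
    "feature_utils": {"tensor_ops"},
    "tensor_ops": set(),
}


def migration_order(dependencies):
    """Order modules so every dependency is migrated before its dependents.

    Raises graphlib.CycleError if the graph contains a dependency cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = migration_order(DEPENDENCIES)
print(order)
```

The leaf module `tensor_ops` comes out first and `ranking_model` last. The order of the two middle modules relative to each other isn't constrained by the graph, so either could come first.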

Keith Petre noticed that Google's architecture validates something he's been arguing for—Deterministic Code in the Loop. The pattern: AI reasons, code decides. The key insight from his analysis is that the Orchestrator agent—the one making decisions about how to chunk work, which Playbook to inject, how to handle failures—is itself driven by a model. That's the highest-leverage decision in the entire system, and its governance is opaque. He calls it the control plane model problem. Every system that implements a Reasoning Plane has a model at its center making orchestration decisions that shape every downstream outcome. The enterprises that figure out how to make that governance explicit—deterministic, inspectable, auditable—are the ones that will get AI past pilot at scale. The unlock isn't making AI more autonomous. It's making the governance of AI orchestration deterministic all the way up.
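One way to picture "AI reasons, code decides" is a model that can only propose an orchestration step, with a plain function making the actual call. This is a sketch under my own assumptions, not Petre's implementation or Google's. The `Proposal` type, the `decide` function, the allowlist, and the chunk-size limit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """What the model suggests. Treated as untrusted input."""
    action: str
    chunk_size: int


# Hypothetical policy, written down as data so it can be inspected and audited.
ALLOWED_ACTIONS = {"migrate_chunk", "retry", "escalate_to_human"}
MAX_CHUNK_SIZE = 400


def decide(proposal: Proposal) -> tuple[bool, str]:
    """Deterministic gate: the same proposal always gets the same decision."""
    if proposal.action not in ALLOWED_ACTIONS:
        return False, f"action '{proposal.action}' is not on the allowlist"
    if not 1 <= proposal.chunk_size <= MAX_CHUNK_SIZE:
        return False, f"chunk_size {proposal.chunk_size} is outside 1..{MAX_CHUNK_SIZE}"
    return True, "approved"


print(decide(Proposal(action="migrate_chunk", chunk_size=200)))
print(decide(Proposal(action="delete_repo", chunk_size=10)))
```

The model still does the reasoning about what to try next, but every decision and its reason comes from code you can read, diff, and log.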

InfoWorld published practical advice on Improving AI agents through better evaluations. A few concrete points. Encode your product's values in the eval—if you're building a coding assistant you care about tests, style, security, not bulldozing through a repo; if you're building a customer support agent you care about factuality, tone, escalation, policy compliance, resolution rate. Generic helpfulness graders won't capture any of that. Make regression a release gate instead of a release report—if a change drops a regression score, don't ship it. And write the eval before the prompt. You need to be able to articulate what good looks like before you start tweaking the system. The eval captures the end, the prompt is the means.
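Here's roughly what "regression as a release gate" could look like in code. It's a minimal sketch, not anything from the InfoWorld piece. The metric names, the scores, and the `regression_failures` and `release_allowed` helpers are hypothetical, and the sketch assumes higher scores are better.

```python
def regression_failures(baseline, candidate, tolerance=0.0):
    """Return the metrics where the candidate scores below the baseline.

    Maps each failing metric to a (baseline_score, candidate_score) pair.
    """
    failures = {}
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric)
        # A missing metric counts as a failure: the gate cannot verify it.
        if new_score is None or new_score < base_score - tolerance:
            failures[metric] = (base_score, new_score)
    return failures


def release_allowed(baseline, candidate, tolerance=0.0):
    """Gate, not report: any regression blocks the release."""
    return not regression_failures(baseline, candidate, tolerance)


# Hypothetical eval scores for a coding assistant.
BASELINE = {"tests_pass_rate": 0.92, "security_checks": 0.98, "style": 0.85}
CANDIDATE = {"tests_pass_rate": 0.95, "security_checks": 0.94, "style": 0.85}

print(regression_failures(BASELINE, CANDIDATE))
print(release_allowed(BASELINE, CANDIDATE))
```

In this example the candidate improves the test pass rate but drops on security checks, so the gate blocks it. Wiring `release_allowed` into CI as a failing check is what turns the eval from a report into a gate.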

Google Cloud also shipped a significant update to their IAM capabilities specifically for agents in "What's new in IAM: Security, governance, and runtime defense." Agent Identity is now a first-class principal type distinct from human identities or service accounts, built on the open SPIFFE standard, cryptographically protected and automatically provisioned. Agent Gateway routes all agent-to-agent and agent-to-tool traffic through a centralized policy enforcement point, with Identity-Aware Proxy and Context-Aware Access for agents integrated in. Principal Access Boundary for Agent Identity is in preview, setting hard limits on the resources an agent or group of agents can access, regardless of other permissions. And Model Armor provides runtime defense against prompt injection, tool poisoning, and sensitive data leakage across Agent Gateway, Agent Runtime, Google Cloud MCP servers, LangChain, and Firebase. The new Agent Security dashboard offers agentless discovery, vulnerability scanning, runtime threat detection, and graph-based risk discovery.

That's the list for episode 778. A theme that runs through several of these is the tension between pushing agents further into the loop and maintaining human judgment and accountability. The skills atrophy argument, the control plane model problem, the governance stack for enterprise agents—these all point to the same thing: as you give AI more agency, the governance problem grows faster than the capability gains. Worth sitting with that one.
