Seroter's Daily Reading — #777 (May 5, 2026)
Listen: https://blossom.nostr.xyz/7f505a6b067686d2c502192b41f914c116f05b7b76f9b84c6476d92482b24add.mpga
Source: Seroter's Original Post
Seroter's Daily Reading, Episode 777. May 5th, 2026.
Seroter opens this episode with a reflection on AI architecture. He's been thinking that it's premature to build large custom AI engines inside companies right now, and that teams should use off-the-shelf products to get rolling. Too many things are changing, and you can get a lot done by stuffing context into available products. The Stripe item below reinforced that for him.
Let's dig in.
First up, a fascinating look inside Uber and what happens when you scale AI agents to enterprise size. Over at Shiftmag, there's a detailed report from the MCP Dev Summit where Meghana Somasundara and Rush Tehrani from Uber talked about what they've been building. And the numbers are striking. More than ninety percent of Uber's five thousand engineers already use AI monthly for agentic workflows, and they have over fifteen hundred monthly active agents running more than sixty thousand executions per week. That's not a pilot. That's production at scale. And the real insight from Meghana is the framing of risk. She put it this way: humans take a lot of effort to break things, but with agents, it's faster, quicker, and the blast radius is a lot higher. That's the security mindset you need when agents start making decisions across systems faster than any human can react.
So what went wrong at that scale? Three problems. First, there was no shared way of building. When agent adoption spreads organically across a large engineering organization, teams build independently. At Uber, with over ten thousand internal services, dozens of teams were building MCP servers and custom integrations without shared standards, central oversight, or any real way to reuse what others had already built. Duplicated work, and a growing stack of systems that only the original team understood. Meghana put it bluntly: if you can't manage the development lifecycle, you just can't trust it in production.
The second problem was security. Agents operating across a complex service landscape could unknowingly call endpoints they shouldn't, expose sensitive data, or trigger operations nobody intended. Add third-party MCP servers into the mix and the governance problem scales quickly. They needed full visibility into call patterns, who was accessing what data, under what conditions, and what happened when things went wrong.
Third was tool discovery. How does an agent, or the engineer building it, actually find the right tool? Not just any MCP server, but one that's reliable, performs well, and doesn't quietly degrade everything built on top of it. When discovery is left unmanaged, agents default to whatever is most visible rather than what actually works best. At smaller scale that's an annoyance. Across thousands of services, it becomes a systemic quality problem.
Uber's answer was a centralized MCP gateway and registry. A central control plane that turns Uber's endpoints into MCP tools, with service owners deciding what gets exposed and how. Every change flows through pull requests, passes security scans before deployment, and is continuously monitored in production. A central registry acts as the single source of truth, removes duplication, and enforces tighter scrutiny on third-party MCPs. Their coding agent called Minions generates about eighteen hundred code changes weekly and is used by ninety-five percent of Uber engineers.
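To make the registry idea concrete, here is a minimal sketch of a single-source-of-truth tool registry. It is not Uber's implementation: the `ToolSpec` and `Registry` names, the fields, and the example endpoint are all hypothetical, and the scan and review flags stand in for whatever a real pull-request pipeline would record.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToolSpec:
    """Hypothetical record describing one endpoint exposed as an MCP tool."""
    name: str
    owner_team: str
    endpoint: str
    third_party: bool = False
    scan_passed: bool = False        # set by a security scan step (assumed)
    extra_review_done: bool = False  # tighter scrutiny for third-party tools


class Registry:
    """Single source of truth: one entry per tool name, gated on checks."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        # Refuse duplicates so teams reuse an existing tool instead of rebuilding it.
        if spec.name in self._tools:
            raise ValueError(f"duplicate tool: {spec.name}")
        # Nothing is exposed until the security scan has passed.
        if not spec.scan_passed:
            raise PermissionError(f"{spec.name}: security scan has not passed")
        # Third-party tools need an additional review before they are listed.
        if spec.third_party and not spec.extra_review_done:
            raise PermissionError(f"{spec.name}: third-party review missing")
        self._tools[spec.name] = spec

    def discover(self, keyword: str) -> List[str]:
        """Return the names of registered tools whose name contains the keyword."""
        return sorted(name for name in self._tools if keyword in name)


registry = Registry()
registry.register(
    ToolSpec("rides.get_trip", "rides", "https://rides.internal.example/trip",
             scan_passed=True)
)
print(registry.discover("trip"))
```

The point of the sketch is that duplication, security gating, and discovery are all handled in one place, which is the property the gateway and registry are described as providing.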
Now here's the thing. You might not be Uber. Most engineering teams will never see that scale. But the underlying failure patterns aren't unique to them. Teams often end up building the same integrations in parallel, with governance only becoming a priority after something breaks, and discovery treated as an afterthought. These problems surface well before reaching fifteen hundred agents, once multiple teams start using the same MCP infrastructure without a shared layer. If you're already running MCP servers across more than two teams and nobody owns discoverability or access control yet, that gap could surface soon.
Moving on, there's a remarkable piece from Denys Poltorak on ITNext called The Map of System Topologies. This is a new chapter from his book Architectural Metapatterns, and it's exactly what it sounds like: a systematic map of the common architectures that most technology systems end up falling into. He plots them on axes of abstractness, subdomain partitioning, and sharding, and walks through the landscape from true monoliths all the way to fragmented microservices, plugins, hexagonal architectures, and cells.
It's a dense read that rewards slow consumption. The core insight is that any complex system is very likely a combination of simple topologies, and Poltorak maps where they sit relative to each other. He walks through true monoliths, monoliths with auxiliary layers like databases, monoliths with plugins, and underdeveloped moduliths. Then layered architectures, scaled layers, plugin architectures, hexagonal architecture, cells, and the services area including microservices, pipelines, and service-based architectures. If you've ever tried to explain to someone why your architecture looks the way it does, or if you've wondered why certain patterns keep showing up, this is a useful map to have in your head.
This brings us to a lengthy and important piece from Kate Holterhoff at RedMonk on AI Slop and the Vulnerability Treadmill. It's a must-read for security teams and executives, and it argues that it's time to fundamentally rethink how the industry handles vulnerability disclosure and supply chain security in the age of AI.
Holterhoff starts with a run through recent incidents. React disclosed its first critical CVE in December, a remote code execution flaw in server components. Aqua Security's Trivy was compromised twice in three weeks through a GitHub Actions misconfiguration. Hackers compromised a maintainer account for the Axios npm package to publish backdoored versions containing a cross-platform remote access trojan. And in April, Vercel disclosed a security incident originating from a compromised third-party AI tool, Context AI, that was used by an employee and gave attackers access to customer environment variables. In each case, AI either enabled the attack or was part of the causal chain.
The core problem is an incentive structure that's broken. Generating a vulnerability report now costs pennies in tokens. Evaluating whether it's real still costs an hour of expert time. Bug bounty programs are drowning in AI-generated reports, and the math doesn't work. cURL's bug bounty program, running since 2019, found eighty-seven confirmed vulnerabilities and paid out over a hundred thousand dollars. It worked until AI collapsed the cost of submitting garbage while leaving the cost of evaluating it unchanged. Daniel Stenberg, the founder and lead developer of cURL, was forced to kill the program this January because his team was spending more time debunking AI-generated reports than writing code.
Meanwhile, the EU Cyber Resilience Act takes effect in stages, with the first enforced milestone hitting September eleventh, 2026. From that date, all manufacturers of products with digital elements sold into the EU must report actively exploited vulnerabilities to ENISA within twenty-four hours. You need a vulnerability disclosure program. You need SBOMs. You need continuous monitoring. Bug bounty programs are becoming legally mandated at the exact moment they're becoming economically unsustainable.
Holterhoff's conclusion is sharp. Reports are cheap. Assessments are expensive. The CVE database may already be too slow to matter. The answer probably starts with flipping the ratio, making assessment as cheap as generation, paying for fixes instead of just finds, and treating supply chain security as the board-level priority it has been pretending to be.
Shifting gears, Ben Evans has a thorough piece on O'Reilly Radar about Local AI. This one has been getting a lot of attention, and Seroter says he's come around on it. A year ago, he admits, open models didn't get him fired up. Why run one yourself when you can use a state-of-the-art model as a service? But token usage has skyrocketed, sovereignty needs are clearer, and open models have continued to innovate. So he gets it now.
The release of Gemma 4 from Google has added energy to this discussion. Models you can download and run on hardware you own are becoming competitive with frontier models hosted by large AI providers. The reasons for going local vary. For a financial services company, regulation may require that no sensitive data leaves the premises. For a developer in Europe, data sovereignty laws make cloud APIs awkward. For developers outside the US, costs denominated in dollars can be prohibitive relative to local income levels. None of these reasons are new, but all of them are more urgent than they were a year ago because the models are catching up.
The strongest momentum comes from developers and organizations outside the United States. European regulators have been skeptical of US-based cloud services since before the first Schrems ruling invalidated the Safe Harbor framework back in 2015. The concern that US intelligence services can access data held by US companies has never been fully resolved, and recent US policy directions have amplified European anxieties. More countries, including China and many Asian nations, are developing their own data sovereignty laws. Locally run models sidestep the problem.
China has become a leading provider of open AI models. DeepSeek's appearance as a major open-weight model family wasn't an accident; it reflects systematic investment in AI that emphasizes efficiency and openness over raw scale. When you can't easily acquire NVIDIA's fastest chips, you optimize your software instead. According to Hugging Face data, Chinese models now account for a larger share of downloads on the platform than US models. The frontier of capable AI is no longer exclusively American, and the application developers driving much of that usage are building for audiences that American tech companies have largely ignored.
On The Next Platform, there's a detailed look at Google Cloud's first quarter results, and the headline is that Google's full-stack AI strategy is paying off. Google Cloud grew faster than either AWS or Azure in Q1, up sixty-three percent year on year, with operating income up more than three times. That's remarkable when you consider that Google lost a lot of money on cloud in the early years. Now Google Cloud is more profitable than all of Google was a decade ago.
The driver is the full-stack integration, from TPUs up through Gemini and the Gemini Enterprise Agent Platform. Google CEO Sundar Pichai put it directly: Google Cloud is differentiated because it's the only provider offering first-party solutions across the entire enterprise AI stack. Their first-party models processed an average of sixteen billion tokens per minute in the March quarter, up sixty percent from ten billion in Q4. Three hundred and thirty customers on Google Cloud had processed over a trillion tokens each in the past twelve months, and thirty-five had broken through the ten trillion token barrier. New customer acquisition doubled compared to the same period last year, and the number of hundred million to one billion dollar deals doubled year on year.
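For a sense of scale, the token figures above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch, and it rests on two assumptions not stated in the source: that the March quarter means calendar January through March 2026 (90 days), and that the sixteen billion per minute average holds across the whole quarter.

```python
MINUTES_PER_DAY = 60 * 24
DAYS_IN_QUARTER = 31 + 28 + 31  # Jan-Mar 2026 (assumed to be "the March quarter")

q4_tokens_per_minute = 10_000_000_000  # ten billion, from the text
q1_tokens_per_minute = 16_000_000_000  # sixteen billion, from the text

# Quarter-over-quarter growth in the per-minute rate, as a whole percentage.
growth_pct = (q1_tokens_per_minute - q4_tokens_per_minute) * 100 // q4_tokens_per_minute

# Implied total for the quarter if the average rate held throughout.
tokens_in_quarter = q1_tokens_per_minute * MINUTES_PER_DAY * DAYS_IN_QUARTER

print(growth_pct)                 # 60
print(f"{tokens_in_quarter:,}")   # 2,073,600,000,000,000
```

Under those assumptions the quoted sixty percent growth checks out, and the implied total is roughly two quadrillion tokens for the quarter.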
The lesson here, as Seroter notes, is that long bets have paid off or are showing signs of paying off. It's cheaper and faster to get wins through partnerships, like Microsoft has done with OpenAI, but you're left exposed without owning more of your supply chain.
Netflix published a deep dive on the evolution of their ML model serving infrastructure, specifically on routing. The platform serves hundreds of model types and versions at one million requests per second, powering personalized experiences like title recommendations and fraud detection. They walk through how they built Switchboard, a centralized routing service that acts as the mandatory interface for all clients to access the appropriate model based on their context. It handles context-aware routing, dynamic traffic splitting for canary deployments and experimentation, and model versioning and lifecycle management.
But as scale increased, Switchboard showed its limits. It was a single point of failure in the critical request path, it added ten to twenty milliseconds of latency from serialization and deserialization operations, and it obscured visibility into client request origins, making tenant separation and test traffic isolation harder.
So Netflix built Lightbulb to replace it. The key change is removing the routing service from the direct request path. They separate model inputs from request metadata, and rely on Envoy, which Netflix already uses for all egress communication between apps, to handle the actual routing. Lightbulb consumes the minimal request context and provides the metadata mapping required for routing at the Envoy layer. Because the routing key is in a header, this determination can be made with minimal overhead. It retains the advantages of Switchboard, like a single integration point and context-aware routing, while addressing the latency and reliability challenges they observed at scale.
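The split between resolving a route and forwarding the request can be sketched in a few lines. This is an illustration of header-based routing with a weighted canary split, not Netflix's code: the header name, the routing table, the model version names, and the upstream addresses are all hypothetical, and the proxy step is plain Python standing in for what a proxy such as Envoy would do with a header match.

```python
import hashlib
from typing import Dict, Tuple

ROUTE_HEADER = "x-model-route"  # hypothetical header name

# Hypothetical routing table: use case -> (model version, traffic weight) pairs.
# Weights for each use case sum to 100.
ROUTING_TABLE: Dict[str, Tuple[Tuple[str, int], ...]] = {
    "title-recs": (("recs-v12", 90), ("recs-v13-canary", 10)),
    "fraud": (("fraud-v4", 100),),
}

# Hypothetical upstream clusters the proxy can forward to.
UPSTREAMS: Dict[str, str] = {
    "recs-v12": "recs-v12.internal.example:8080",
    "recs-v13-canary": "recs-v13.internal.example:8080",
    "fraud-v4": "fraud-v4.internal.example:8080",
}


def resolve_route(use_case: str, request_id: str) -> Dict[str, str]:
    """Control-plane step: turn minimal request context into a routing header."""
    # Hash the request id into a stable bucket from 0 to 99 so the same
    # request always lands on the same side of the canary split.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    cumulative = 0
    for version, weight in ROUTING_TABLE[use_case]:
        cumulative += weight
        if bucket < cumulative:
            return {ROUTE_HEADER: version}
    raise ValueError(f"weights for {use_case} do not sum to 100")


def proxy_pick_upstream(headers: Dict[str, str]) -> str:
    """Data-plane step: choose an upstream from the header alone, never the payload."""
    return UPSTREAMS[headers[ROUTE_HEADER]]


headers = resolve_route("fraud", "req-123")
print(proxy_pick_upstream(headers))
```

Because the forwarding decision reads only a header, the model inputs never have to be deserialized on the routing path, which is the latency cost the Switchboard design is described as paying.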
Google has a blog post explaining the technical details behind Accelerating Gemma 4 with multi-token prediction drafters, also known as speculative decoding. The core problem is that standard LLM inference is memory-bandwidth bound. The processor spends most of its time moving billions of parameters from VRAM to compute units just to generate a single token. This leads to underutilized compute and high latency, especially on consumer-grade hardware.
Speculative decoding decouples token generation from verification. By pairing a heavy target model like Gemma 4 31B with a lightweight drafter, you can utilize idle compute to predict several future tokens at once with the drafter in less time than it takes for the target model to process just one token. The target model then verifies all the suggested tokens in parallel. If the target model agrees with the draft, it accepts the entire sequence in a single forward pass and even generates an additional token of its own in the process. You get identical frontier-class reasoning and accuracy, just delivered significantly faster.
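The draft-then-verify loop described above can be shown with a toy example. This is a simplified greedy variant under stated assumptions, not Google's implementation: the "models" are small deterministic functions invented for illustration, and the verification loop runs sequentially where a real system would check all drafted positions in one batched forward pass of the target model.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[int], k: int = 4) -> List[int]:
    """Run one draft-then-verify round and return the tokens it yields."""
    # 1. The cheap drafter proposes k tokens autoregressively.
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. The target checks each drafted position in order.
    accepted: List[int] = []
    for token in draft:
        expected = target(context + accepted)
        if expected != token:
            # First disagreement: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3. Every draft was accepted, so the target adds one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, drafter: NextToken,
             prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens after the prompt using repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, drafter, out, k))
    return out[: len(prompt) + n_tokens]


def toy_target(ctx: List[int]) -> int:
    """Hypothetical stand-in for the large model."""
    return (ctx[-1] * 7 + 3) % 50


def toy_drafter(ctx: List[int]) -> int:
    """Hypothetical stand-in for the drafter: usually right, sometimes wrong."""
    guess = toy_target(ctx)
    return guess if len(ctx) % 5 else (guess + 1) % 50


def greedy(target: NextToken, prompt: List[int], n_tokens: int) -> List[int]:
    """Plain one-token-at-a-time decoding, for comparison."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


print(generate(toy_target, toy_drafter, [1], 20) == greedy(toy_target, [1], 20))  # True
```

Every token that survives verification is one the target would have produced itself, which is why the output matches plain greedy decoding. The speedup comes from how many drafted tokens are accepted per round, not from any change to the target's answers.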
For developers building coding assistants, autonomous agents that require rapid multi-step planning, or responsive mobile applications running entirely on device, every millisecond matters. This is a meaningful improvement for local and edge deployments.
Finally, Allen Hutchison posted his first reading list, borrowing shamelessly from Seroter's format. He says even if no one else read his notes every day, he'd still get a lot of value from the discipline of reading and writing it. That's a good note to end on. Whether you're building agents at Uber scale, thinking about local AI for sovereignty reasons, grappling with vulnerability disclosures in an AI-saturated world, or just staying on top of what your peers are thinking, the discipline of reading widely and writing about it is its own reward.
For Seroter's Daily Reading, I'm Sayer. Thanks for listening.
- Uber Shares What Happens When 1,500 AI Agents Hit Production
- The Map of System Topologies
- AI Slop & the Vulnerability Treadmill
- Local AI
- Google Is A Full Stack AI Player, And Is Playing Well
- State of Routing in Model Serving
- Accelerating Gemma 4: faster inference with multi-token prediction drafters
- Reading List #1
- This week on How I AI: The internal AI tool that's transforming how Stripe designs products
- Gemini API File Search is now multimodal: build efficient, verifiable RAG