
Seroter's Daily Reading — #770 (April 24, 2026)

Listen: https://blossom.nostr.xyz/86322935997fd79fdbd1dca8de026ab94e72941f82fa62ebda7eeb4288d58b0b.mpga

Source: Seroter's Original Post


Seroter's Daily Reading, Episode 770, April 24, 2026. Skipped yesterday because Seroter was deep in Google Cloud Next activities — but today we're back with a packed list, and it's almost entirely Google Cloud Next coverage. Twelve articles. Let's go.

Starting with Day 2 at Google Cloud Next: A marathon developer keynote, from the Google Cloud blog, and it's a doozy. Every technical demo stayed on one theme: agents. The keynote opened with an agent-based marathon simulation. Richard deliberately broke it partway through, which set up a live demo of Agent Interoperability and Gemini Cloud Assist. Megan O'Keefe showed how to debug agents at scale using Agent Runtime trace view, Cloud Assist Investigation, and the Antigravity IDE connected via MCP. She traced the bug to a misconfigured event compaction run, fixed it with a token threshold parameter, and committed the change, triggering a redeployment to Agent Platform. That's a complete agent debugging loop shown live on stage.

From there Bobby Allen moved into scaling. Those agent services started on Cloud Run, but he showed how to migrate them to Google Kubernetes Engine for more control, switch to a customized Gemma 4 model, and move from GCSFuse to a high-performance Lustre filesystem — all via vibe coding in Antigravity connected to Cloud Assist. Ines Envid and Jason Davenport then showed the no-code path: building agents from the Gemini Enterprise app and integrating them with high-code agents.

The keynote closed with Emma on security and governance. Her framing was crisp: shifting left isn't enough for agents. It just means developers own more of the stack. The real move is shifting down — moving quality and guardrails out of developer responsibility entirely. Ankur Kotwal demoed Agent Identity and Agent Gateway, which use IAM policies and immutable credentials to lock down agent actions. Then Yinon Costica from Wiz showed how Wiz scans agent code and infrastructure end-to-end and suggests root-cause fixes, including from within Claude Code with Opus. Quote from Yinon: it's a full architecture for security to understand what you built without you having to explain it.

That marathon simulation from the keynote? It's now open source. The Race Condition repo on the GoogleCloudPlatform GitHub contains the full multi-agent system: a Planner agent that designs marathon routes using Maps MCP tools and GIS data, a Simulator that runs tick-by-tick weather, traffic, and crowd conditions, and Runner agents that each decide their own pacing and strategy over the A2A protocol. A Go gateway sits in the middle routing WebSocket traffic to Python ADK agents. Three planner variants show the progression from bare planner to LLM-as-Judge evaluation to AlloyDB-backed memory. For reliability on stage, the frontend runs in Cached mode — replaying recorded NDJSON streams rather than making live LLM calls. A deterministic runner_autopilot variant makes zero LLM calls, useful for testing under load without racking up API bills. The architecture is well documented, with skills for getting started, exploring the codebase, deploying, and contributing. Cost note from the team: about ninety-one dollars a month fixed for Redis, Cloud SQL, and Cloud NAT, plus roughly three to four dollars per simulation run.
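The Cached mode described above comes down to reading recorded newline-delimited JSON events back instead of calling a model. Here is a minimal sketch of that idea in Python. The event fields (`tick`, `payload`) and the sample recording are hypothetical, chosen for illustration; the repo's actual schema isn't described in this episode.

```python
import io
import json
from typing import Iterator, TextIO


def replay_ndjson(stream: TextIO) -> Iterator[dict]:
    """Yield one recorded event per non-empty line of an NDJSON stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Hypothetical recording: field names are illustrative, not the repo's schema.
RECORDED = io.StringIO(
    '{"tick": 1, "payload": "runner 7 holds pace"}\n'
    '{"tick": 2, "payload": "runner 7 surges"}\n'
)

# Replaying the recording makes zero LLM calls, so the demo is deterministic.
events = list(replay_ndjson(RECORDED))
for event in events:
    print(event["tick"], event["payload"])
```

Because the replay is just file reading, the same recording produces the same frontend behavior every time, which is the reliability property the team wanted on stage.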

Moving on to something more conceptual. Dr. Margaret-Anne Storey has a piece in the Engineering Enablement newsletter on Cognitive debt: The hidden risk in AI-driven software development. Technical debt lives in the code. Cognitive debt lives in people — the accumulated erosion of shared understanding across a team. Her framing draws on Peter Naur's idea that a program is a theory living in developers' minds, not just source code. As AI accelerates code production, that theory fragments faster than it gets rebuilt. She recounts a vivid example from an entrepreneurship course: student teams moving fast to ship features hit a wall where nobody could explain why design decisions were made or how parts of the system fit together. The code was messy, but the real problem was that their shared theory had collapsed. They had accumulated cognitive debt faster than technical debt, and it paralyzed them.

Margaret-Anne's mitigation suggestions are concrete. Require at least one human who fully understands each AI-generated change before it ships. Document not just what changed but why. Create regular checkpoints where shared understanding gets rebuilt through code reviews, retrospectives, or knowledge-sharing sessions. Warning signs she flags: team members hesitating to make changes for fear of unintended consequences, growing reliance on tribal knowledge held by one or two people, the system becoming a black box. She also proposes a third dimension — intent debt — when the rationale behind decisions isn't captured for future humans or agents to refer to. The piece is worth reading in full; she's continuing to track how mitigation practices evolve across real teams.

Now DeepSeek-V4: a million-token context that agents can actually use, which dropped today on Hugging Face. There are two MoE checkpoints: V4-Pro at 1.6 trillion total parameters with 49 billion active, and V4-Flash at 284 billion total with 13 billion active. Both have a 1 million token context window. The benchmarks are competitive, not SOTA, but the architecture is the story. DeepSeek built V4 specifically for long-running agentic workloads, targeting the predictable failure modes of frontier models running as agents. The core problem is that every tool result appended to the context means every subsequent token pays full attention cost against everything before it. At 1 million tokens, V4-Pro requires 27 percent of the single-token inference FLOPs of V3.2 and 10 percent of its KV cache memory. V4-Flash drops those to 10 percent and 7 percent respectively. Against a baseline like grouped query attention with 8 heads in bfloat16, DeepSeek-V4 uses roughly 2 percent of the KV cache, which is a fundamental change in what's deployable.
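To get a feel for why a 2 percent KV cache matters, here is a back-of-the-envelope calculation in Python. The formula is the standard size of a dense KV cache. The layer count and head dimension are assumptions picked for illustration, not DeepSeek's published configuration; only the 8 KV heads, bfloat16, the 1 million token length, and the "roughly 2 percent" figure come from the text above.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """Dense KV cache size: one key and one value vector per token, layer, and KV head."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Assumed shape for illustration only; not DeepSeek's published configuration.
SEQ_LEN = 1_000_000   # 1 million token context
N_LAYERS = 60         # assumption
N_KV_HEADS = 8        # "grouped query attention with 8 heads"
HEAD_DIM = 128        # assumption
BF16_BYTES = 2        # bfloat16 is 2 bytes per element

baseline_bytes = kv_cache_bytes(SEQ_LEN, N_LAYERS, N_KV_HEADS, HEAD_DIM, BF16_BYTES)
compressed_bytes = baseline_bytes * 2 // 100  # the "roughly 2 percent" figure

print(f"baseline:   {baseline_bytes / 1e9:.1f} GB")
print(f"compressed: {compressed_bytes / 1e9:.1f} GB")
```

Under these assumed dimensions, the baseline cache for a single million-token sequence is about 246 GB, and 2 percent of that is about 5 GB. The exact numbers depend on the real model shape, but the ratio is what changes whether a long-running agent session fits on one accelerator.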

The efficiency comes from two interleaved attention mechanisms. Compressed Sparse Attention collapses KV entries by 4x using softmax-gated pooling, and a lightweight indexer picks the top-k compressed blocks per query. Heavily Compressed Attention goes 128x and drops the sparse selection — the compressed sequence is short enough that dense attention is cheap. Layers alternate between CSA and HCA, and the lightning indexer inside CSA runs in FP4. Storage is FP8 for most KV entries, BF16 only for RoPE dimensions. Three agent-specific post-training choices compound on top of the architecture. First, interleaved thinking: V4 preserves reasoning content across user message boundaries when the conversation contains tool calls, so an agent maintains a cumulative chain of thought over long-horizon tasks. Second, a dedicated XML-based tool-call format with a DSML special token that removes a class of parsing errors around nested quoted content and mixed types. Third, the DSec sandbox platform — DeepSeek Elastic Compute — runs hundreds of thousands of concurrent sandboxes for RL training rollouts, with fast image loading, preemption-safe trajectory replay, and a uniform API across function calls, containers, microVMs, and full VMs.
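The two steps in Compressed Sparse Attention, pooling KV entries and then picking the top-k compressed blocks, can be illustrated with a toy version in plain Python. This is a sketch of the general idea only, not DeepSeek's implementation: the gate scores are passed in as plain numbers, the indexer is a simple dot product, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a short list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_pool(entries: List[Vector], gates: List[float], block: int) -> List[Vector]:
    """Collapse every `block` consecutive entries into one softmax-weighted average."""
    pooled = []
    for start in range(0, len(entries), block):
        chunk = entries[start:start + block]
        weights = softmax(gates[start:start + block])
        dim = len(chunk[0])
        pooled.append([sum(w * v[d] for w, v in zip(weights, chunk)) for d in range(dim)])
    return pooled


def top_k_blocks(query: Vector, pooled: List[Vector], k: int) -> List[int]:
    """Return the indices of the k pooled blocks with the highest dot-product score."""
    scores = [sum(q * p for q, p in zip(query, vec)) for vec in pooled]
    ranked = sorted(range(len(pooled)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Eight toy KV entries compressed 4x into two blocks (uniform gates = plain mean).
entries = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
pooled = gated_pool(entries, gates=[0.0] * 8, block=4)
selected = top_k_blocks([0.0, 1.0], pooled, k=1)
print(len(pooled), selected)
```

The query then attends only to the selected blocks. Heavily Compressed Attention, as described above, skips the selection step because at 128x compression the pooled sequence is short enough to attend to densely.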

The agent benchmark numbers are where V4 separates from the field. On SWE-bench Verified, V4-Pro-Max scores 80.6, within a point of Opus 4.6-Max and Gemini 3.1-Pro. On Terminal Bench 2.0 it hits 67.9, and on MCPAtlas it reaches 73.6, second only to Opus 4.6-Max. In an internal R&D coding benchmark across PyTorch, CUDA, Rust, and C++ tasks, V4-Pro-Max hits a 67 percent pass rate versus 47 percent for Sonnet 4.5 and 70 percent for Opus 4.5. The open question now is whether the DSML schema and interleaved thinking gains transfer to out-of-domain agent frameworks beyond what DeepSeek tested.

Google also published Introducing Gemini Enterprise Agent Platform, powering the next wave of agents, describing the platform as a consolidated evolution of Vertex AI for the agentic era. It's a big announcement with customer quotes from Burns & McDonnell, Color Health, Comcast, Geotab, Gurunavi, L'Oréal, Payhawk, and PayPal, each using a different slice of the platform. The core pieces are ADK for building agents, Agent Runtime for scaling with sub-second cold starts and multi-day workflows, Agent Memory Bank for long-term context retention, Agent Identity and Agent Gateway for governance, and Agent Anomaly Detection for real-time behavioral monitoring. A new Agent Registry provides a central index for internal agents, tools, and skills. Agent Sandbox gives agents a hardened environment for executing model-generated code and browser-based automation. Agent Optimizer automatically clusters real-world failures and suggests refined system instructions. The platform ships with Agent Studio for visual agent building, Agent Garden with pre-built templates for code modernization, financial analysis, invoice processing, and more, and Agent Evaluation for continuous scoring of agent logic across multi-turn conversations.

Google's Gemma 4 shines on local systems – both big and small comes from InfoWorld, which tested the models locally. The 26B parameter model performs well when it fits entirely in VRAM: around 72 tokens per second through LM Studio with all 42 layers on GPU and a 16K context window. The smaller incarnations run at a comparable 71 to 73 tokens per second but are less specific, especially on code-generation tasks where the smaller models produce a conceptual framework rather than a working example. The mixture-of-experts design helps the larger model remain useful even when it can't fit entirely in VRAM. The smaller models free up memory for larger context windows. Takeaway: start with the smaller models as a first choice before moving up.

Moving to data infrastructure. VentureBeat has a piece on The modern data stack was built for humans asking questions. Google just rebuilt its for agents taking action — essentially, a rebuild of the modern data stack from human-scale to agent-scale. The core premise from Andi Gutmans, VP and GM of Data Cloud: enterprises are moving from reactive intelligence, where humans interpret data and decide what to do, to systems of action where agents take direct steps on behalf of the business. Three pillars. First, Knowledge Catalog — evolved from Dataplex — which automates semantic metadata curation instead of relying on data stewards to manually label tables and build glossaries. It covers BigQuery, Spanner, AlloyDB, and Cloud SQL natively, and federates with Collibra, Atlan, and Datahub. Zero-copy federation extends semantic context from SaaS applications including SAP, Salesforce Data360, ServiceNow, and Workday without moving data. Second, a cross-cloud lakehouse using Apache Iceberg. BigQuery can now query Iceberg tables sitting on Amazon S3 via Google's Cross-Cloud Interconnect, with no egress fees and price-performance comparable to native AWS warehouses. All BigQuery AI functions run against that cross-cloud data without modification. Bidirectional federation covers Databricks Unity Catalog, Snowflake Polaris, and AWS Glue Data Catalog via the open Iceberg REST Catalog standard. Third, Data Agent Kit — a portable set of MCP tools and IDE extensions that drop into VS Code, Claude Code, Gemini CLI, and Codex. Data engineers describe outcomes rather than write pipelines, and the agent selects whether to use BigQuery, Lightning Engine for Apache Spark, or Spanner to execute.

Google's lakehouse blog post on The future of data lakehouse: Open and interoperable for the agentic era reinforces the same themes with more depth on the technical specifics. Fully managed Iceberg storage is GA, with read-write interoperability between BigQuery and Managed Service for Apache Spark. A new AI-native cross-cloud experience brings BigQuery and Managed Spark to AWS Iceberg data at scale, with price-performance comparable to AWS-native solutions. A new Lightning Engine for Apache Spark delivers up to 2x price-performance over the leading high-speed Spark alternative using vectorized execution and intelligent caching. Spotify is cited as a customer using Iceberg tables processed across BigQuery, Dataflow, and open-source engines without duplication. Accenture's quote frames it as collapsing data boundaries that fragment intelligence. Knowledge Catalog — which the VentureBeat piece introduced as a separate pillar — is described here as always-on context for agents: Smart Storage automatically tags and embeds files with metadata on ingestion, and deep multimodal extraction uses Gemini to map business relationships from unstructured content.

The Introducing the Google Cloud Knowledge Catalog announcement goes deeper on the universal context engine concept. The core problem is that traditional catalogs were manual inventories for technical users, focused on table structures rather than the deep context agents need. When agents lack business semantics, you get hallucinations, high latency, and stale insights. Knowledge Catalog's three pillars are aggregation, enrichment, and search. Aggregation unifies context from Google and partner data platforms, semantic models, and third-party catalogs. It covers BigQuery, AlloyDB, Spanner, Cloud SQL, Firestore, and Looker, plus integrations with Atlan, Collibra, Datahub, Ab Initio, and Anomalo. For enterprise connectivity, it federates context from Palantir, Salesforce Data360, SAP, ServiceNow, and Workday. A new LookML agent autonomously reads strategy documents to generate business semantics, and a BigQuery measures feature embeds programmatic business logic directly into the SQL engine. Enrichment goes beyond manual curation, mining query logs, schemas, and BI semantic models, extracting entity relationships from unstructured content using Gemini. Automated context curation generates natural language descriptions and business glossaries for datasets, and verified SQL patterns prevent hallucinated logic. Search uses the same query-rewriting technology that powers Google Search, delivering sub-second latency with access-control-aware ranking so agents can only retrieve assets they're authorized to see. A measurable evaluation framework lets teams quantitatively test context construction strategies. Bloomberg Media is cited as using the Knowledge Catalog to ground their Data Access AI Agent in trusted institutional context.
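The access-control-aware ranking mentioned above can be pictured as filtering before ranking, so an unauthorized asset never competes for a slot in the results. Here is a small sketch of that pattern in Python. It is not Knowledge Catalog's API: the `Asset` type, the `readers` field, the relevance scores, and the principal names are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Asset:
    name: str
    relevance: float          # hypothetical precomputed relevance score
    readers: FrozenSet[str]   # principals allowed to see this asset


def search(assets: List[Asset], principal: str, limit: int) -> List[str]:
    """Drop assets the caller cannot read, then rank what is left by relevance."""
    visible = [a for a in assets if principal in a.readers]
    ranked = sorted(visible, key=lambda a: a.relevance, reverse=True)
    return [a.name for a in ranked[:limit]]


CATALOG = [
    Asset("finance.revenue", 0.9, frozenset({"finance-agent"})),
    Asset("sales.orders", 0.7, frozenset({"finance-agent", "support-agent"})),
    Asset("support.tickets", 0.4, frozenset({"support-agent"})),
]

# The most relevant asset is invisible to the support agent, so it never appears.
support_results = search(CATALOG, "support-agent", limit=2)
print(support_results)
```

Filtering first matters because ranking first and trimming afterwards can return fewer results than requested, or reveal through gaps that a restricted asset exists.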

Now for something more cynical. Gergely Orosz at the Pragmatic Engineer has a piece on The Pulse: 'Tokenmaxxing' as a weird new trend — the practice of gaming token usage to inflate AI usage metrics. Inside Meta, an engineer created an internal token leaderboard ranking employees by how many tokens they burn through. Meta employees used 60.2 trillion AI tokens in thirty days — if charged at Anthropic's API pricing, that would be 900 million dollars, probably 100 million or more at actual enterprise rates. Most of it was wasteful. Gergely spoke with Meta engineers who described running agents that burn massive amounts of tokens for little outcome, and SEVs caused by careless AI code generation where developers were more focused on churning out code than product quality. After media coverage, Meta took down the leaderboard. One long-tenured engineer Gergely spoke with thinks the real goal was generating training data: more traces means more real-world data for Meta's next-generation coding model. Microsoft has had a token dashboard since January. Gergely spoke with an engineer there who admitted they're tokenmaxxing not to climb the leaderboard but because they don't want to be tagged for using too few tokens. They inflate their metrics by asking the AI questions when documentation would be faster, prompting features they have no intention of shipping, and defaulting to agents even when hand-coding would be quicker. Salesforce went a step further: minimum token spend of a hundred dollars on Claude Code and seventy on Cursor per week, visible to all colleagues, with maximum caps that can be easily exceeded with the press of a button. Some engineering orgs had the maximum removed last week after the absurdity became apparent.
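The Meta figures above imply a blended price per token, which is a quick sanity check on the headline number. The arithmetic below uses only the three figures quoted (60.2 trillion tokens, 900 million dollars at list price, 100 million dollars at an estimated enterprise rate); the function name is just for illustration.

```python
def usd_per_million_tokens(total_usd: float, total_tokens: float) -> float:
    """Blended price implied by a total bill and a total token count."""
    return total_usd / total_tokens * 1_000_000


TOKENS_30_DAYS = 60.2e12  # 60.2 trillion tokens in thirty days

list_rate = usd_per_million_tokens(900e6, TOKENS_30_DAYS)
enterprise_rate = usd_per_million_tokens(100e6, TOKENS_30_DAYS)

print(f"implied list rate:       ${list_rate:.2f} per million tokens")
print(f"implied enterprise rate: ${enterprise_rate:.2f} per million tokens")
```

That works out to just under 15 dollars per million tokens at list pricing and under 2 dollars at the estimated enterprise rate, both blended across input and output tokens.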

Shopify gets cited as a counter-example. They had the first token leaderboard Gergely knows of, but Farhan Thawar from Shopify told him they renamed it to a usage dashboard — explicit language not to encourage competing. More importantly, they have circuit breakers: if personal spend spikes within a day, access cuts off immediately and can be renewed if the spike was deliberate or if it was a runaway agent. The circuit breaker caught actual runaway agents and also caught bugs in the infra. Farhan also checks in personally with top-spending individuals to understand the use cases — any tokenmaxxing would show up there. His framing: it's more interesting to ask whose tokens cost the most, not who spent the most. Expensive-per-token developers tend to do in-depth work that's genuinely worth learning about. Gergely's conclusion is sharp: tokenmaxxing is great for AI vendors, bad for everyone else. Using token count as a productivity metric is the lines-of-code problem all over again, except this time the gaming has a massive accompanying invoice attached.
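The circuit breaker idea is simple enough to sketch. This is an illustration of the pattern, not Shopify's system: the class name, the fixed daily dollar limit, and the renewal method are all assumptions, and the piece doesn't say how Shopify actually defines a spike.

```python
from dataclasses import dataclass


@dataclass
class SpendCircuitBreaker:
    """Cuts off access once a person's spend for the day exceeds a limit."""
    daily_limit_usd: float
    spent_today_usd: float = 0.0
    tripped: bool = False

    def record(self, cost_usd: float) -> bool:
        """Record one charge; return True while access remains open."""
        if self.tripped:
            return False  # access already cut off, so the charge is rejected
        self.spent_today_usd += cost_usd
        if self.spent_today_usd > self.daily_limit_usd:
            self.tripped = True
        return not self.tripped

    def renew(self, new_limit_usd: float) -> None:
        """Reopen access after a human reviews the spike."""
        self.daily_limit_usd = new_limit_usd
        self.tripped = False


breaker = SpendCircuitBreaker(daily_limit_usd=50.0)
history = [breaker.record(20.0), breaker.record(40.0), breaker.record(1.0)]
breaker.renew(new_limit_usd=100.0)
after_renewal = breaker.record(1.0)
print(history, after_renewal)
```

The human step in `renew` is the point: a runaway agent stays stopped until someone looks at it, while deliberate heavy use gets unblocked with a conversation about the use case.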

On the FinOps side, Google's Next-gen FinOps for the AI era post announced next-generation cost management tools at Next. A FinOps Explainability agent uses Gemini to autonomously investigate AI cost drivers, attributing spend across model types, projects, and token categories (input versus output, for example). The bigger announcement is Spend Caps in private preview: budget controls at the project level for Google AI Studio, Gemini Enterprise Agent Platform, Cloud Run, Cloud Run Functions, and Maps. Caps alert and ultimately pause API traffic once the budget is hit. Customers using existing FinOps tooling have seen 75 percent growth in cost reporting adoption and an 18 percent reduction in time spent on FinOps cost analysis since the Gemini Cloud Assist for FinOps launch.

Finally, The New Stack has a piece on Google splits its TPU line in two for the agentic era. For most of the TPU's decade-long history, Google shipped a single chip per generation, with the same architecture for pre-training and inference. Google now believes that's wrong. TPU 8t is the training chip, keeping the 3D torus interconnect and SparseCores. TPU 8i is the inference chip, replacing SparseCores with a new Collectives Acceleration Engine that cuts latency on chain-of-thought decoding and MoE routing by up to 5 times. It uses a new Boardfly topology instead of 3D torus: a hierarchical layout that reduces maximum hop count for a 1024-chip pod from 16 to 7. The headline for TPU 8i is the memory wall. The chip triples on-chip SRAM to 384 megabytes and pushes HBM capacity to 288 gigabytes, enough for the key-value cache of a long-context reasoning model to live entirely on silicon. Every off-chip memory access compounds across reasoning turns in agentic workflows, so keeping the working set on-chip is the core bet. Training and inference are now different enough workloads that separate silicon makes sense, which puts Google at philosophical odds with AWS, which wound down the Inferentia line and is betting that training and inference are converging. Google claims TPU 8t delivers roughly 2.7x better price-performance than Ironwood for training, and TPU 8i delivers 80 percent better price-performance for inference. Both chips are also the first TPUs to offer bare-metal access, and TorchTPU is now in preview, bringing PyTorch natively to TPU. That's a long-standing developer friction point finally being addressed.
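The 16-hop figure for a 3D torus is easy to reproduce. In a ring of n nodes with wraparound links, the farthest node is n // 2 hops away, and a torus diameter is the sum of that value over its dimensions. The 8 x 8 x 16 shape below is an assumption: the article doesn't give the pod's dimensions, and this shape was chosen because it yields 1024 chips. The Boardfly figure of 7 hops comes from the article and isn't modeled here.

```python
from math import prod
from typing import Sequence


def torus_diameter(dims: Sequence[int]) -> int:
    """Maximum shortest-path hop count in a torus with wraparound links.

    Along a ring of n nodes the farthest node is n // 2 hops away, and the
    dimensions are independent, so the diameter is the sum over dimensions.
    """
    return sum(n // 2 for n in dims)


ASSUMED_DIMS = (8, 8, 16)  # assumption: pod shape is not stated in the source

chips = prod(ASSUMED_DIMS)
max_hops = torus_diameter(ASSUMED_DIMS)
print(f"{chips} chips, worst case {max_hops} hops")
```

With that assumed shape the worst case is 4 + 4 + 8 = 16 hops, matching the quoted 3D torus number. Every hop adds latency to a collective operation, which is why cutting the worst case to 7 matters for inference.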

Wrapping up: the consistent theme across this whole episode is infrastructure catching up to agentic reality. Everything from DeepSeek's KV cache compression to Google's TPU split to the Knowledge Catalog is about making long-horizon agent work actually tractable — on the model side, the compute side, and the data side. The counter-theme is that adoption is outpacing governance: tokenmaxxing at Meta, Microsoft, and Salesforce shows what happens when metrics get gamified before the guardrails are in place.

  1. Day 2 at Google Cloud Next: A marathon developer keynote — Google Cloud Blog
  2. Race Condition — GoogleCloudPlatform GitHub
  3. Cognitive debt: The hidden risk in AI-driven software development — Engineering Enablement / Dr. Margaret-Anne Storey
  4. DeepSeek-V4: a million-token context that agents can actually use — Hugging Face Blog
  5. Introducing Gemini Enterprise Agent Platform, powering the next wave of agents — Google Cloud Blog
  6. Google's Gemma 4 shines on local systems – both big and small — InfoWorld
  7. The modern data stack was built for humans asking questions. Google just rebuilt its for agents taking action — VentureBeat
  8. The future of data lakehouse: Open and interoperable for the agentic era — Google Cloud Blog
  9. Introducing the Google Cloud Knowledge Catalog — Google Cloud Blog
  10. The Pulse: 'Tokenmaxxing' as a weird new trend — Pragmatic Engineer / Gergely Orosz
  11. Next-gen FinOps for the AI era — Google Cloud Blog
  12. Google splits its TPU line in two for the agentic era — The New Stack