
We thoroughly enjoyed the charismatic Simon Eskildsen of Turbopuffer on today’s pod, and highly encourage listening in even if you’re not a database nerd:
As for today’s op-ed, it’s a quiet day. This comment from Aidan is currently living rent-free in my head:
Tyler Cowen wrote about the high-return activity of raising others’ aspirations 8 years ago, and insufficient ambition is both the biggest regret we see in others and the biggest regret I have had in my last 3 years covering and tinkering in AI. The people just on the right side of insane have pushed LLMs to their limits and benefited, whereas the pragmatic people who judged and managed LLMs as they were at the time mostly didn’t go anywhere.
So a fun question to ask an LLM, which we walk through in our upcoming Claude Cowork pod, is: how can I be more ambitious than what I’m currently doing?
AI News for 3/11/2026-3/12/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!
AI Twitter Recap
Agent Infrastructure, Harnesses, and the MCP Debate
- Harnesses are becoming the real product surface: Multiple posts converged on the view that model quality alone is no longer the bottleneck; the surrounding harness, tools, memory, and runtime matter more. @mattturck’s interview with Harrison Chase frames this explicitly around harnesses, sandboxes, filesystem access, skills, memory, and observability, while @hwchase17 emphasized that agent UI/UX is still hard and underbuilt. That same stack perspective shows up in LangChain JS’s new cross-framework `useStream` hook, Redis’s context-engineering lab, and Artificial Analysis’s Stirrup Slack integration, which adds Slack-native agents with documents, subagents, MCP, browser use, and code execution.
- MCP is not dead; it’s being normalized into production plumbing: Despite a wave of “MCP is dead” jokes (example), the more technical takes cut the other way. @omarsar0 argued MCP’s issue is mostly a harness problem, not a protocol problem, and later noted that Anthropic’s new chart feature appears to be MCP-backed (tweet). Most concretely, @GergelyOrosz pointed to Uber using MCP internally as evidence that MCP is “the life blood” of agent-service integration inside larger companies. In practice, the market signal is clear: agent platforms now treat MCP as baseline interoperability rather than novelty.
Coding Agents, Evaluation, and Dev Workflow Shifts
- The coding-agent stack is maturing from demos into measurable systems: Cursor’s new CursorBench methodology is one of the stronger eval announcements in the set, combining offline benchmarks with online request-derived metrics to score models on both intelligence and efficiency; the team argues public coding benchmarks are increasingly saturated. OpenAI quickly highlighted that GPT-5.4 leads CursorBench on correctness with efficient token usage. Separately, Code Arena reported GPT-5.4-high in the top 6 for real-world web development tasks, while WeirdML results from @htihle showed strong but inconsistent performance and unusually long generated solutions. The common pattern: coding-model comparison is shifting toward multi-axis measurement—correctness, token efficiency, interaction behavior, and real task fit.
- Agent-assisted development is bifurcating into automation-heavy flows and “stay-in-the-loop” tooling: Several practitioners pushed back on the rush to fully autonomous coding. @ThePrimeagen argued that fast inline autocomplete still often outperforms agentic workflows in preserving understanding and reducing cognitive debt. In contrast, posts from @sydneyrunkle and @corbtt showed where agents excel today: reproducing bugs from screenshots, cross-tool organizational retrieval, and automating tedious coordination. OpenAI also shipped more operational features around this mode: Codex Automations are now GA with worktree vs. branch choice, model/reasoning controls, and reusable templates, plus UI customization in the app (themes update).
- Hermes Agent is emerging as a serious open agent platform: Nous’s Hermes Agent v0.2.0 shipped an unusually dense release for a two-week sprint, including full MCP client support, an ACP server for editors, provider expansion (including GLM, Kimi, MiniMax, OpenAI OAuth), filesystem checkpoints with rollback, git worktree isolation, local browser support, and subagent transparency as summarized by @witcheer. Follow-up updates added official Claude provider support and lighter installs. Community reaction suggests real adoption, including migration away from OpenClaw (example).
Multimodal Retrieval, Embeddings, and New Interaction Surfaces
- A big week for multimodal retrieval: Google’s Gemini Embedding 2 is its first natively multimodal embedding model, mapping text, images, audio, video, and PDFs into one vector space. Posts from Weaviate and @victorialslocum highlighted practical use cases like multimodal PDF RAG, flexible output dimensions via Matryoshka Representation Learning, and native support in retrieval pipelines. The strongest competitive response came from Mixedbread’s Wholembed v3, which claims SOTA retrieval across modalities and 100+ languages, with the team and outside observers stressing late-interaction / multi-vector design as the differentiator (@bclavie, @lateinteraction).
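Matryoshka-trained embeddings are built so that a prefix of the full vector is itself a usable embedding: you keep the first `dim` components and L2-renormalize, trading recall for a cheaper index. A minimal sketch of that truncation step (the 8-dimensional vector is a stand-in; treat all specific sizes as illustrative, not any model’s documented dimensions):

```python
import math

def truncate_mrl(embedding, dim):
    """Keep the first `dim` components of a Matryoshka-trained
    embedding and L2-renormalize so cosine similarity still works."""
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

# Example: shrink a (mock) 8-dim vector to 4 dims for a cheaper index.
full = [0.5, 0.1, -0.3, 0.8, 0.02, 0.0, 0.1, -0.05]
small = truncate_mrl(full, 4)
assert len(small) == 4
assert abs(sum(x * x for x in small) - 1.0) < 1e-9  # unit norm preserved
```

The point of the MRL training objective is that this naive truncation degrades gracefully, so one stored vector can serve several index sizes.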
- The retrieval debate is crystallizing around single-vector vs. multi-vector: The most technically opinionated commentary came from @lateinteraction, arguing that new multimodal single-vector baselines like Gemini Embedding 2 were almost immediately outperformed by scaled ColBERT/ColPali-style approaches, and later stating it is “borderline irrational” to keep betting on single-vector embeddings (tweet). Even allowing for hype, the broader takeaway is important: retrieval teams are increasingly prioritizing interaction-rich indexing/scoring over one-vector simplicity, provided infra can make it practical at scale (TopK infrastructure note).
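The single-vector vs. multi-vector split comes down to the scoring function: one dot product over pooled vectors, versus ColBERT-style MaxSim, where each query token vector takes its maximum similarity over all document token vectors and the maxima are summed. A toy sketch with hand-picked 2-d vectors:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def single_vector_score(q_vec, d_vec):
    # One pooled vector per side: a single dot product per document.
    return dot(q_vec, d_vec)

def maxsim_score(q_tokens, d_tokens):
    # Late interaction: each query token picks its best-matching
    # document token, and those per-token maxima are summed.
    return sum(max(dot(q, d) for d in d_tokens) for q in q_tokens)

q = [[1.0, 0.0], [0.0, 1.0]]                 # two query token vectors
doc = [[0.9, 0.1], [0.2, 0.8], [0.0, 0.3]]   # three document token vectors
score = maxsim_score(q, doc)                 # 0.9 + 0.8 = 1.7
```

The infra cost @lateinteraction’s critics point to is visible here: MaxSim stores and scores one vector per token rather than one per document, which is why the approach only wins when the index layer makes that practical.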
- Interfaces are getting richer, not just smarter: Anthropic’s Claude can now generate interactive charts and diagrams directly in chat, a notable product step toward generative UI rather than plain text outputs. This resonated with builders already assembling similar systems via MCP (@omarsar0). In parallel, Perplexity Computer rolled out to Pro users with 20+ models, skills, and connectors, and @alexalbert__ summarized the broader product trend as “Generative UI is here.”
Model Releases, Benchmarks, and Efficiency Trends
- NVIDIA’s Nemotron 3 Super stands out as the most technically discussed model release: The release was highlighted by @rasbt as an open-weight 120B model with strong throughput and benchmarks roughly in the Qwen3.5/GPT-OSS class. The architecture drew extra attention because of its LatentMoE design; @cwolferesearch provided a useful breakdown showing how routing in a lower-dimensional latent space reduces both all-to-all communication costs and expert weight loading costs, then reinvests those savings into more experts and more active experts per token. It is one of the clearer examples in the set of architecture changes aimed at better inference economics, not just benchmark chasing.
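Based on that description, the mechanics can be sketched with toy matrices: project hidden states into a smaller latent space, route and run experts there, then project back up. Everything below (shapes, shared projections, top-k mixing) is an illustrative reconstruction of the idea, not NVIDIA’s implementation:

```python
import random

random.seed(0)
D_MODEL, D_LATENT, N_EXPERTS, TOP_K = 8, 4, 6, 2

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

down = rand_matrix(D_LATENT, D_MODEL)    # shared projection into latent space
up = rand_matrix(D_MODEL, D_LATENT)      # shared projection back to model width
router = rand_matrix(N_EXPERTS, D_LATENT)
experts = [rand_matrix(D_LATENT, D_LATENT) for _ in range(N_EXPERTS)]

def latent_moe(h):
    z = matvec(down, h)                   # D_MODEL -> D_LATENT
    logits = matvec(router, z)            # routing happens *in latent space*
    top = sorted(range(N_EXPERTS), key=lambda i: -logits[i])[:TOP_K]
    # Experts are D_LATENT x D_LATENT, so both the expert weights streamed
    # in and the activations exchanged all-to-all are latent-sized tensors,
    # not model-sized ones -- that is the claimed savings.
    mix = [0.0] * D_LATENT
    for i in top:
        out = matvec(experts[i], z)
        mix = [m + o / TOP_K for m, o in zip(mix, out)]
    return matvec(up, mix)                # D_LATENT -> D_MODEL

y = latent_moe([1.0] * D_MODEL)
```

Reinvesting the savings, per the breakdown, then means raising `N_EXPERTS` and `TOP_K` at roughly constant cost.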
- Grok 4.20 Beta looks more like a cost/speed/behavior update than a frontier leap: Artificial Analysis’s evaluation puts Grok 4.20 (reasoning) at 48 on its Intelligence Index, below current top models but with a larger 2M context window, lower pricing ($2/$6 per 1M in/out tokens), strong speed, and the best measured score so far on its non-hallucination metric. Follow-up commentary from @scaling01 and Vals broadly reinforced that story: not frontier-topping, but cheaper, faster, and potentially more usable in some production settings.
- Efficiency and architecture remain central themes: The day also included FLUX.2 klein 9B-KV, which is reported as 2x–2.5x faster for image editing with no quality drop, and Reka Edge, a 7B VLM pitched around 98ms time-to-first-token and low-latency agentic/on-device use. On the research side, tweets surfaced work on looped transformers with gated memory banks, LM head gradient bottlenecks, reasoning probes for early CoT exit, and Flash-KMeans, all consistent with a field still searching for gains in training signal quality, inference efficiency, and adaptive compute.
Applied AI: Maps, Health, Video, and Forecasting
- Google Maps is being rebuilt around Gemini as an interaction layer, not just a map layer: Google’s Maps upgrade thread describes the biggest product update in over a decade, with two notable pieces: a conversational “Ask Maps” mode over Google’s places/community graph, and Immersive Navigation with richer 3D route guidance (details). The more interesting engineering implication came from observers like @dbreunig: the future UX may not “look like a map” at all, with LLMs acting as the primary interface to geospatial knowledge.
- Healthcare copilots are moving toward longitudinal personal context: Microsoft introduced Copilot Health, a US launch that can aggregate EHR records, wearables, personal history, and lab data into a dedicated health profile. The company stressed that user data is not used to train models, and that outputs are grounded in trusted health sources with citations. Separately, Glass Health added self-serve EHR integrations for athenaOne and eClinicalWorks, showing similar movement toward AI systems that are useful only once they are deeply connected to real clinical data systems.
- Video generation APIs are getting more product-ready: OpenAI’s Sora 2-powered Video API update added custom characters/objects, 16:9 and 9:16 export, 20-second clips, continuation, and batch jobs, which is a pragmatic set of features for campaign, storyboarding, and UGC workflows rather than pure research demos.
- Groundsource is one of the stronger “AI for public-good data” announcements: Google Research’s Groundsource uses Gemini to turn 5M+ public reports into a dataset of 2.6M+ flood events across 150+ countries, enabling urban flash-flood forecasting up to 24 hours in advance. The methodological point is broader than floods: using multimodal/LLM pipelines to synthesize structured, open benchmarks from noisy public corpora could become an important pattern for under-instrumented domains.
Top tweets (by engagement)
- Claude interactive charts and diagrams: Anthropic launched interactive charts/diagrams in chat, one of the clearest examples this week of LLMs moving toward richer frontends rather than better text alone.
- Google Maps + Gemini: Google’s major Maps upgrade was by far the biggest mainstream product launch in the set, with conversational place search and immersive navigation.
- CursorBench / coding evals: Cursor’s new eval methodology for coding agents drew outsized attention because it addresses a real gap: how to evaluate coding systems on both capability and efficiency using a mix of online and offline signals.
- Perplexity Computer rollout: Perplexity Computer for Pro users signals continued appetite for “computer-use” products with broader connectors/skills rather than one-model chat.
- OpenJarvis on-device personal AI: Stanford’s OpenJarvis launch stood out as a serious local-first framework for personal AI on device, with modular infra, local retrieval, MCP tools, and efficiency-aware evaluation.
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
1. Qwen3.5 Model Performance and Benchmarks
- Qwen3.5-9B is actually quite good for agentic coding (Activity: 428): The post discusses the performance of Qwen 3.5-9B on an Nvidia GeForce RTX 3060 with 12 GB VRAM, highlighting its effectiveness for agentic coding tasks. The user compared various models, including Qwen 2.5 Coder and Unsloth quantizations of Qwen 3 Coder, noting that 1-bit quantizations were fast but unreliable for tool calls, while 2-bit quantizations were slower and unstable. The Qwen3.5-9B model, however, performed well, maintaining functionality for over an hour without issues, suggesting it is well-suited for consumer hardware with limited VRAM. The user also mentions Unsloth-Qwen3 Coder 30B UD-TQ1_0 as effective for code completion. One commenter noted that Qwen3.5-9B performs comparably to much larger models like GPT-OSS-120B, while another shared a negative experience where the model disrupted their build system, indicating variability in performance. A link to another model, OmniCoder-9B-GGUF, was also shared as a potential alternative.
- The Qwen3.5-9B model is noted for its impressive performance, benchmarking around the level of 120B-class models, which is surprising given its smaller size. This suggests that Qwen3.5-9B is highly efficient in terms of performance relative to its size, making it a strong contender in the field of agentic coding.
- Despite its strengths, Qwen3.5-9B has been reported to cause significant issues in some cases, such as completely disrupting a build system and deleting projects. This indicates that while the model can perform well, it may also have stability issues or bugs that need addressing, especially when used in complex environments like LM Studio and Claude Code on an RTX 4060.
- There is a debate around the utility of lower quantized models like Qwen3.5-9B, with some users arguing that these models can indeed perform useful tasks despite their smaller size. This challenges the notion that only larger models are capable of high-quality outputs, highlighting the potential of optimized smaller models in practical applications.
- I spent 8+ hours benchmarking every MoE backend for Qwen3.5-397B NVFP4 on 4x RTX PRO 6000 (SM120). Here’s what I found. (Activity: 349): The post details an extensive benchmarking effort on the `nvidia/Qwen3.5-397B-A17B-NVFP4` model using 4x RTX PRO 6000 GPUs, achieving a maximum of 50.5 tok/s despite claims of 130+ tok/s. The bottleneck is attributed to NVIDIA’s CUTLASS kernels, which fail on SM120 hardware due to a bug in the initialization of TMA Warp Specialized grouped GEMM tactics, as documented in CUTLASS issue #3096. The author suggests that the issue is due to a Shared Memory (SMEM) overflow caused by datacenter assumptions, with SM120 having a strict limit of 101 KiB compared to datacenter variants. The post also highlights that Multi-Token Prediction (MTP) results in a -22% performance regression due to dequantization discrepancies. The author has submitted patches to FlashInfer and vLLM to address some of these issues, but the core problem remains unresolved by NVIDIA. A commenter suggests that the author’s bug report on GitHub is too verbose and lacks clear reproduction steps, which may be why NVIDIA hasn’t addressed it. Another commenter identifies the root cause of the SM120 issue as a physical SMEM overflow and suggests that smaller tile shapes could resolve the problem. A third commenter notes that running the setup on bare metal Linux could improve performance by ~10% compared to WSL2.
- lawdawgattorney provides a detailed analysis of a bug affecting the Qwen3.5-397B NVFP4 model on RTX PRO 6000 GPUs. The issue is traced to a Shared Memory (SMEM) overflow due to datacenter assumptions, where the SM120 dies have a strict 101 KiB limit compared to datacenter dies with ~227 KiB. The auto-computation formula for pipeline stages fails to account for `alignas(1024)` padding, causing a `kErrorInternal` crash. A temporary fix involves hardcoding `StageCount<2>`, but this reduces performance to 4.8 tok/s due to memory latency issues. The proposed solution is to support smaller tile shapes to fit more pipeline stages within the SM120’s SMEM constraints.
- JockY criticizes the original bug report for being overly verbose and lacking essential details such as reproduction instructions and error logs. The report’s excessive length and inclusion of irrelevant benchmarks make it difficult for developers to address the issue. JockY suggests a more concise approach, focusing on clear reproduction steps and error logs to facilitate debugging and resolution by developers.
- AndreVallestero suggests running the benchmarks on bare metal Linux instead of WSL2, noting a potential performance improvement of approximately 10%. This implies that the virtualization layer in WSL2 might introduce overhead that affects the performance of GPU-intensive tasks like those involved in running the Qwen3.5-397B NVFP4 model.
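The failure mode described here is ultimately arithmetic: a stage-count formula that ignores `alignas(1024)` padding can pick a count that overflows SM120’s 101 KiB SMEM budget even when the unpadded math fits. A toy back-of-envelope illustration (the per-stage buffer size is invented for the example; it is not CUTLASS’s real buffer size):

```python
def max_stages(smem_budget_bytes, stage_bytes, align=1024):
    """How many pipeline stages fit if each stage buffer is
    padded up to the next `align`-byte boundary."""
    padded = ((stage_bytes + align - 1) // align) * align
    return smem_budget_bytes // padded

SM120_SMEM = 101 * 1024          # consumer-die limit (~101 KiB)
DATACENTER_SMEM = 227 * 1024     # datacenter-die limit (~227 KiB)
STAGE = 33 * 1024 + 256          # hypothetical stage needing 33 KiB + a tail

# Ignoring padding, the naive formula thinks 3 stages fit on SM120...
naive = SM120_SMEM // STAGE              # -> 3
# ...but with alignas(1024) each stage occupies 34 KiB, so only 2 do.
real = max_stages(SM120_SMEM, STAGE)     # -> 2
assert naive > real  # the gap between these is the kErrorInternal crash
```

On the datacenter budget the discrepancy never bites, which is consistent with the bug only surfacing on SM120.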
- Qwen3.5-9B Quantization Comparison (Activity: 398): The post presents a detailed quantization comparison of the Qwen3.5-9B model using various GGUF quantization methods, focusing on Kullback-Leibler Divergence (KLD) and Perplexity (PPL) as key metrics. The analysis highlights that IQ4_XS from bartowski (4.93 GiB, KLD 0.0127) is optimal for VRAM-limited scenarios without dropping below Q4, while Q4_K_S from bartowski (5.18 GiB, KLD 0.0108) performs well across multiple domains. Notably, bartowski’s Q4_K_M outperforms unsloth’s Q4_K_M (0.0087 vs 0.0222 KLD), and lmstudio’s Q4_K_M scores worse (0.0353 KLD). The post also provides a token-level divergence visualization on HuggingFace, showing divergence across four domains. The evaluation was conducted using `llama.cpp` on a system with an i3-12100F CPU, 64GB RAM, and an RTX 3060 GPU. Commenters agree with the findings, noting that Bartowski’s quantizations are perceived as more stable compared to others, such as unsloth’s. There is appreciation for the detailed analysis, which aids in making informed decisions about quantization choices.
- General_Arrival_9176 highlights a significant performance difference between Bartowski’s and Unsloth’s quantizations at the same level, with Bartowski’s Q4_K_M achieving a KL Divergence of 0.0087 compared to Unsloth’s 0.0222. This suggests a substantial impact of the quantization process and possibly the training methodology used by Unsloth, as well as the importance of choosing the right quantization approach to avoid downloading multiple versions unnecessarily.
- Shingikai emphasizes the value of using KL Divergence (KLD) over Perplexity (PPL) for evaluating quantization performance. While PPL provides an average performance metric, KLD can reveal ‘catastrophic failure’ cases where a model remains fluent but makes incorrect decisions. The discussion also notes that Bartowski’s Q4_K_M outperforms Unsloth’s, suggesting that the choice of calibration data is more critical than the quantization engine itself at the 4-bit level.
- dark-light92 and General_Arrival_9176 both note the stability and performance of Bartowski’s quantizations over others like Unsloth’s. This indicates a trend towards more reliable quantization methods, with Bartowski’s approach being favored for its stability and lower KL Divergence, which is crucial for maintaining model accuracy and performance in practical applications.
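The KLD numbers above are computed position by position: compare the full-precision model’s next-token distribution p against the quantized model’s q via KL(p‖q) = Σ p·log(p/q), then average over tokens. Unlike PPL, a single badly diverging position shows up directly. A self-contained sketch with made-up toy distributions (real runs compare full logit dumps from both models):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats between two next-token distributions;
    eps guards against log(0) on zero-probability tokens."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mean_token_kld(ref_dists, quant_dists):
    # Average per-token divergence: the headline number in the comparison.
    klds = [kl_divergence(p, q) for p, q in zip(ref_dists, quant_dists)]
    return sum(klds) / len(klds)

# Toy distributions over a 3-token vocabulary at two positions.
ref = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]       # full-precision model
quant = [[0.65, 0.22, 0.13], [0.55, 0.28, 0.17]]  # quantized model
score = mean_token_kld(ref, quant)
assert score > 0.0  # identical distributions would give exactly 0
```

This also makes Shingikai’s point concrete: a quant can keep average PPL close while a handful of positions contribute large KL spikes, exactly the fluent-but-wrong failure mode.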
- M5 Max just arrived - benchmarks incoming (Activity: 2679): The post discusses the arrival and benchmarking of the M5 Max 128GB 14” laptop, focusing on testing various AI models using the `mlx_lm` tool. The author initially faced issues with BatchGenerator, leading to a delay as they switched to a fresh Python environment and used `stream_generate` for accurate results. The benchmarks cover models like Qwen3.5-122B-A10B-4bit and gpt-oss-120b-MXFP4-Q8, with detailed performance metrics such as tokens-per-second and peak memory usage. The results highlight the M5 Max’s capability to handle large models efficiently, with peak memory usage reaching up to 92.605 GB for some tests. Commenters are eager for the benchmark results, with some expressing interest in specific models like Qwen 3.5 27b MLX 4bit and 6bit. The discussion reflects anticipation and curiosity about the M5 Max’s performance with these AI models.
- The benchmark results for the M5 Max 128GB 14” using `mlx_lm.generate` show significant performance across different models and configurations. For instance, the Qwen3.5-122B-A10B-4bit model achieves a prompt speed of 1,239.7 t/s and a generation speed of 60.6 t/s at a 16K context, with a peak memory usage of 73.8 GB. In contrast, the Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-6bit model shows lower generation speeds, with 14.9 t/s at a 32K context, but also lower peak memory usage at 30.0 GB.
- The Qwen3-Coder-Next-8bit model demonstrates impressive performance, particularly at higher contexts, achieving a prompt speed of 1,887.2 t/s and a generation speed of 68.6 t/s at a 32K context, with peak memory usage reaching 89.7 GB. This suggests that the model is optimized for handling larger contexts efficiently, although it requires substantial memory resources.
- The gpt-oss-120b-MXFP4-Q8 model stands out with its high prompt and generation speeds, especially at a 16K context where it reaches 2,710.5 t/s for prompts and 76.0 t/s for generation, with a relatively low peak memory usage of 64.9 GB. This indicates a strong balance between speed and memory efficiency, making it a competitive option for high-performance tasks.
- Llama.cpp now with a true reasoning budget! (Activity: 444): Llama.cpp has introduced a true reasoning budget feature, enhancing the previous stub implementation. The new feature uses a sampler mechanism to count tokens during reasoning, terminating when the budget is reached. Initial tests on Qwen3 9B showed a performance drop from 94% to 78% on HumanEval when enforcing a reasoning budget, but adding a `--reasoning-budget-message` improved scores to 89%. This update allows experimentation with different models and settings, potentially improving reasoning efficiency. For more technical details, see the commit. Commenters noted potential confusion between `thinking_budget_tokens` in HTTP and `--reasoning-budget` in CLI, suggesting dynamic logit_bias adjustments for better performance. Another suggestion was to enforce a minimum thinking budget by setting logit bias to negative infinity, potentially boosting scores. A user reported successful testing with Qwen3.5 35B, where a reasoning budget improved decision-making efficiency without overthinking.
- coder543 highlights a potential confusion in the API design of Llama.cpp, where the HTTP field is named `thinking_budget_tokens` but the CLI argument is `--reasoning-budget`. This discrepancy could lead to errors if users mistakenly send `reasoning_budget` or `reasoning_budget_tokens` to the API. Additionally, coder543 suggests dynamically boosting the logit bias for the end-of-think token towards the end of the reasoning budget to help the model conclude more naturally, though they note this might reduce intelligence scores.
- chris_0611 shares practical testing results with the qwen3.5 35B model in Q5, using a ‘car-wash test’ to evaluate reasoning budgets. With a reasoning budget of 0, the model fails to choose walking for a 100m distance. With an unlimited budget, it overthinks for 83 seconds but passes. A reasoning budget of 1000 tokens results in a successful test with only 18 seconds of thinking, demonstrating a balance between speed and accuracy.
- audioen proposes a method to gradually increase the likelihood of generating the `</think>` token, suggesting a linear bias increase of 0.1% per token. This approach aims to naturally conclude the model’s reasoning process by the end of a set token limit, potentially improving efficiency without abrupt interruptions.
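The ramp idea — linearly growing a positive bias on the end-of-think token as the budget is consumed, with a hard stop at the limit — is easy to prototype outside llama.cpp. A hedged sketch (the token id, ramp rate, and greedy sampler are illustrative, not llama.cpp’s actual sampler code):

```python
END_THINK = 151668  # hypothetical </think> token id -- model-specific

def sample_with_budget(logits, tokens_used, budget, ramp=0.01):
    """Greedy pick over `logits` (dict: token_id -> logit), adding a
    bias to </think> that grows linearly with tokens spent, and
    forcing the close tag once the budget is exhausted."""
    if tokens_used >= budget:
        return END_THINK                     # budget spent: force the close tag
    biased = dict(logits)
    biased[END_THINK] = biased.get(END_THINK, 0.0) + ramp * tokens_used
    return max(biased, key=biased.get)       # greedy over biased logits

# Early in the trace the model keeps thinking; near the budget, </think> wins.
logits = {1: 2.0, 2: 1.5, END_THINK: 0.5}
assert sample_with_budget(logits, 10, 1000) == 1           # bias +0.1: keep going
assert sample_with_budget(logits, 900, 1000) == END_THINK  # bias +9.0: wrap up
```

The minimum-budget variant from the thread is the mirror image: pin the `</think>` logit to negative infinity until some floor of tokens has been spent.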
- llama.cpp on $500 MacBook Neo: Prompt: 7.8 t/s / Generation: 3.9 t/s on Qwen3.5 9B Q3_K_M (Activity: 636): A user successfully compiled `llama.cpp` on a MacBook Neo with an Apple A18 Pro chip, achieving 7.8 tokens/second for prompt processing and 3.9 tokens/second for generation using the Qwen3.5 9B Q3_K_M model. The setup utilized 8 GB of unified memory and a 5-core GPU with Metal support. The model was sourced from the Hugging Face repository and occupied 4.4 GB on disk. The user provided a configuration for potentially faster performance, achieving 5 tokens/second for the 9B model and 10 tokens/second for a 4B model, linked in the post. Commenters noted that the performance is likely hindered by the 8GB RAM and potential disk swapping, suggesting that using a smaller model might improve speed. The configuration parameters `-b`, `-ub`, `-ctk`, and `-ctv` were also highlighted as unusual.
- coder543 suggests that the performance metrics indicate the system is likely swapping to disk or using compressed memory, which significantly impacts speed. They recommend testing with a smaller 4B model to observe potential non-linear speed improvements, implying that the current setup is not optimal for the 9B model.
- Technical-Earth-3254 points out that the limited 8GB RAM on the MacBook, especially when running a full operating system, is a major bottleneck affecting performance. This suggests that memory constraints are a critical factor in the observed throughput rates.
- thisguynextdoor inquires about the potential use of Apple’s Metal API for GPU acceleration, which could significantly enhance performance. They compare this to their own experience with a Gemma 3 27B model on an M1 Pro, achieving 15 t/s, indicating that hardware acceleration could be a key factor in improving throughput.
2. OmniCoder-9B and Agentic Coding
- OmniCoder-9B | 9B coding agent fine-tuned on 425K agentic trajectories (Activity: 330): OmniCoder-9B is a 9-billion parameter coding agent developed by Tesslate, fine-tuned on the Qwen3.5-9B model using a hybrid architecture of Gated Delta Networks and standard attention. It was trained on 425,000+ curated agentic coding trajectories, including data from models like Claude Opus 4.6 and GPT-5.4, focusing on real-world software engineering tasks and multi-step reasoning. The model features a 262,144-token context window, extensible to 1M+, and demonstrates strong error recovery and reasoning capabilities, such as responding to LSP diagnostics and using minimal edit diffs. It is released under the Apache 2.0 license, with fully open weights. Commenters highlight the impressive capabilities of the Qwen3.5-9B model, suggesting it competes with much larger models despite its smaller size. There is enthusiasm for testing OmniCoder-9B, with some users noting its ability to handle complex tasks efficiently.
- Uncle___Marty highlights the impressive performance of the Qwen 3.5 9B model, suggesting it rivals much larger models, such as those with 100B+ parameters, in terms of coding capabilities. The commenter emphasizes the potential of small, powerful models like Qwen 3.5 9B, advocating for their future in local applications due to their efficiency and capability despite their size.
- pilibitti shares a practical example of the model’s capabilities, noting that it successfully completed an agentic task requiring over 20 tool calls with a blank system prompt, a task that Qwen 3.5 9B failed to accomplish even with detailed prompts. This underscores the model’s efficiency and effectiveness in handling complex tasks with minimal guidance.
- PaceZealousideal6091 inquires about comparative benchmarks between OmniCoder-9B and Qwen 3.5 35B, expressing interest in understanding the performance differences. They also ask about the possibility of a larger version, such as a 35B model, indicating a demand for scaling up the model’s capabilities while maintaining its efficiency.
- I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here’s what I use instead. (Activity: 2145): The post discusses a shift from using multiple typed function calls to a single `run(command="...")` tool with Unix-style commands for AI agents, arguing that this approach aligns with the text-based nature of both Unix and LLMs. The author, a former backend lead at Manus, developed this approach while working on open-source projects like Pinix and agent-clip. The Unix philosophy of text streams and CLI commands is seen as a natural fit for LLMs, which are trained on vast amounts of CLI patterns. The post outlines a two-layer architecture to handle LLM constraints, using techniques like progressive `--help` discovery, error messages for navigation, and consistent output formats to guide the agent’s use of CLI tools. The approach is contrasted with traditional function-calling methods, highlighting the efficiency and familiarity of CLI for LLMs. One commenter noted a similar experiment using Python code eval as the sole tool for LLMs, which reportedly worked well. Another comment humorously suggested the post might be a ploy to give LLM agents full terminal access, while a third highlighted the power of natural-language-to-command-line translation.
- spaceman_ mentions an experiment where an LLM was restricted to using only Python code evaluation as a tool, which reportedly performed well. This suggests that LLMs can effectively leverage programming languages as a tool for executing complex tasks, potentially simplifying the integration of LLMs in environments where Python is a primary language.
- raucousbasilisk highlights the potential of using JIT (Just-In-Time) compilation for natural language processing tasks, specifically converting natural language to command-line utilities like `sed`, `awk`, and `regex`. This approach could harness the power of existing command-line tools for text processing, offering a flexible and efficient method for handling diverse data manipulation tasks.
- johnbbab speculates that the ultimate agent framework might resemble a shell environment, implying that the simplicity and power of shell scripting could serve as a model for developing robust agent frameworks. This perspective underscores the potential for leveraging existing, well-understood technologies in new AI-driven applications.
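The single-tool pattern in that post collapses every capability into one entry point the model already understands from pretraining. A minimal, hedged sketch of what the harness side might look like (the truncation limit, error format, and timeout are illustrative choices, not Manus’s actual implementation, and running shell commands from an agent obviously needs sandboxing in practice):

```python
import subprocess

MAX_OUTPUT = 4000  # chars returned to the model; an illustrative limit

def run(command: str, timeout: int = 30) -> str:
    """The agent's only tool: execute a shell command and return text
    the LLM can read -- stdout on success, a compact error (exit code
    plus stderr) on failure, truncated to fit the context window."""
    try:
        proc = subprocess.run(
            command, shell=True, capture_output=True,
            text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"error: command timed out after {timeout}s"
    if proc.returncode != 0:
        # Errors are navigation signals for the model, not exceptions.
        return f"exit {proc.returncode}: {proc.stderr}"[:MAX_OUTPUT]
    return proc.stdout[:MAX_OUTPUT]

# The model discovers capabilities the Unix way, e.g. run("grep --help").
print(run("echo hello from the one tool"))
```

The design choice the post argues for is visible in the return type: everything is a text stream, so composition (`|`, `--help`, exit codes) comes for free instead of being re-invented as N typed schemas.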
3. Model Releases and New Benchmarks
- Nemotron 3 Super Released (Activity: 755): NVIDIA has released the Nemotron 3 Super, a 120 billion parameter Mixture of Experts (MoE) model with 12 billion active parameters, designed for agentic reasoning. The model is fully open-source, including weights, datasets, and recipes, allowing developers to customize and deploy it on their infrastructure. It features a comprehensive data pipeline with 10 trillion curated tokens for pretraining and 40 million samples for post-training, supporting tasks like reasoning, coding, and multi-step agent tasks. The model is available on Hugging Face and supports Quantization Aware Training (QAT) with NVFP4 precision. Some users have created GGUFs for the model, requiring at least 64GB of memory, and suggest using a specific branch of llama.cpp for compatibility until an official update is available.
- Nemotron 3 Super is highlighted for its open-source nature, providing full access to weights, datasets, and recipes, allowing developers to customize and deploy the model on their infrastructure. The model’s data pipeline includes 10 trillion curated tokens for pretraining and 40 million samples for post-training, emphasizing reproducibility and agentic AI development. The RL tasks span 21 environments, generating 1.2 million rollouts, showcasing its dynamic capabilities beyond static text.
- The Nemotron 3 Super model is available in various formats, including BF16 and NVFP4, with Quantization Aware Training (QAT) options. The model’s GGUFs require at least 64GB of memory, and compatibility issues with `llama.cpp` are noted, suggesting the use of a specific branch from Unsloth for proper functionality. This highlights the model’s demanding hardware requirements and ongoing integration efforts.
- Initial performance assessments of Nemotron 3 Super indicate it scores below lighter models like Qwen3.5 in the LM Arena Text benchmark, particularly when filtered for open-source and style-control off. This suggests that despite its comprehensive dataset and open-source advantages, its performance may not yet match expectations in certain benchmarks.
- New benchmark just dropped. (Activity: 1359): The post humorously requests complete Three.js code to create a scene with Michael Jackson, Pepe the Frog, Donald Trump, and Elon Musk performing the "Thriller" dance, emphasizing high-quality rendering and cinematic effects. The comments reference various AI models like Sonnet 4.6 and Gemini, noting their strengths in lighting and animation, respectively. The discussion highlights the playful nature of the request, with models like Deepseek 3.2 and Minimax & GLM mentioned in jest for their attempts at the task. Commenters humorously debate the capabilities of different models, praising Sonnet 4.6 for lighting and Gemini for choreography, while others like Deepseek 3.2 are acknowledged for effort despite not excelling.
- Recoil42 highlights the performance of various models in the benchmark, noting that Sonnet excelled in lighting and modeling, while Gemini was impressive in choreography. They also mention Deepseek 3.2's consistent performance and humorously note that Minimax and GLM lost interest, and that Qwen seemed to be attempting a different task altogether.
- cmdr-William-Riker questions the performance of OpenAI, expressing surprise at its perceived decline. They specifically inquire about the variant of Qwen 3.5 used in the benchmark, suggesting interest in understanding the specific configurations or versions that might have impacted the results.
- Nvidia Will Spend $26 Billion to Build Open-Weight AI Models, Filings Show (Activity: 1146): Nvidia is set to invest $26 billion over the next five years to develop open-weight AI models, as per recent financial filings. This initiative is designed to strengthen Nvidia's position in the AI infrastructure market, competing with major entities like OpenAI and Anthropic. The investment will likely focus on leveraging Nvidia's hardware capabilities, such as H100 clusters, to maintain CUDA as the default inference target, and to optimize for NVFP4 precision. For more details, see the original article. Commenters frame Nvidia's strategic move as a way to maintain dominance in the AI hardware market by commoditizing AI models, thus reinforcing the 'Nvidia tax'. This approach is seen as a way to ensure CUDA remains the default inference target, leveraging their hardware's capabilities.
- Nvidia's $26 billion investment in open-weight AI models is seen as a strategic move to maintain its dominance in the AI hardware market. By developing these models, Nvidia aims to ensure that CUDA remains the default inference target, effectively locking in its ecosystem and making it harder for competitors to justify designing alternative chips. This aligns with the strategy of 'commoditizing your product's complement': Nvidia not only sells the hardware but also provides the software and models that run on it, reinforcing its market position.
- The investment in open-weight AI models is also a way for Nvidia to optimize the use of their own hardware, such as the H100 clusters. By utilizing these resources internally, Nvidia can potentially accelerate the development of advanced AI models, which in turn could drive further demand for their hardware. This move is seen as a way to justify the high costs associated with maintaining CUDA as the preferred platform for AI development, ensuring that Nvidia’s ecosystem remains the go-to choice for AI researchers and developers.
- Nvidia’s focus on optimizing for NVFP4 (Nvidia’s floating-point format) highlights their commitment to maximizing the performance of their hardware. By tailoring their AI models to leverage NVFP4, Nvidia can achieve higher efficiency and performance, which is crucial for large-scale AI training and inference tasks. This technical optimization not only enhances the capabilities of their models but also reinforces the value proposition of their hardware solutions, making them more attractive to customers who require high-performance AI solutions.
Less Technical AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo
1. Claude Code and Anthropic Developments
- Anthropic: Recursive Self Improvement Is Here. The Most Disruptive Company In The World. (Activity: 1803): Anthropic is accelerating AI development with its model, Claude, which now writes 70% to 90% of the code for future models, indicating a significant step towards recursive self-improvement. Evan Hubinger from Anthropic suggests that fully automated AI research could be imminent, potentially within a year. The release of Claude 3.7 Sonnet was delayed by 10 days due to safety concerns, highlighting the ongoing tension between rapid AI advancement and safety protocols. Dario Amodei warns of AI's potential to displace half of entry-level white-collar jobs within five years, urging transparency about these impacts. Anthropic's stance on AI deployment in military contexts and its critique of political influences on AI policy are also notable. Some commenters question the criticism of safety delays, arguing that with 90% of code generated by AI, thorough testing is essential to ensure safety. Others reflect on the historical context of AI safety concerns, referencing Sutskever's caution with GPT-2.
- Substantial-Elk4531 raises a critical point about the safety concerns in AI development, questioning the delay in model releases due to these fears. They highlight the importance of testing AI-generated code, especially when a significant portion (up to 90%) is not human-written, to ensure safety and reliability. This reflects ongoing debates in the AI community about balancing innovation with ethical considerations.
- BiasHyperion784 discusses the timeline for Anthropic's infrastructure upgrades, noting that by Q3 2027, new hardware (Rubin Ultra) will be deployed, enhancing compute capabilities. They emphasize that improvements in training time are crucial, as bespoke hardware will sharply increase processing power, suggesting a significant leap in AI capabilities once these upgrades are complete.
- Pitiful-Impression70 expresses concern over the rapid advancement of AI self-improvement, where models like Claude are increasingly responsible for writing their own code. They argue that this shift from human authorship to AI-driven development poses challenges, as humans become reviewers rather than creators, complicating the oversight process due to the lack of intuitive understanding of AI-generated decisions.
- Claude now creates interactive charts, diagrams and visualizations (Activity: 1174): Claude has introduced a new feature allowing users to create interactive charts, diagrams, and visualizations directly within conversations. These visuals are dynamically generated and can be modified through follow-up interactions, enhancing the conversational experience. The feature is currently in beta and available across all plans, including the free tier. More details can be found on the official blog. Commenters are enthusiastic about the feature's potential for educational purposes and logic-flow setups, suggesting it could replace other tools like ChatGPT for some users. There is curiosity about its integration with Claude's coding capabilities and whether it can generate links to these visuals.
- Unlikely_Ad_8060 highlights a significant shift in problem-solving with Claude’s interactive visualizations. If these visuals are generated within the reasoning loop, it allows users to iteratively explore problems by adjusting parameters, testing assumptions, and exploring edge cases. This approach aligns more closely with how engineers and analysts think, moving beyond static outputs to a dynamic, exploratory process.
- trashcanhat raises a technical question about the integration of these interactive visualizations within Claude’s code environment. The potential for Claude to post links to these visualizations could enhance usability, suggesting a seamless integration between code execution and visual feedback.
- Much-Inevitable5083 provides a link to a resource on building interactive diagram tools with Claude, indicating that there are existing use cases and possibly tutorials or documentation available for users interested in leveraging this feature. This could be valuable for developers looking to implement or understand the technical capabilities of Claude’s visualization tools.
- I delayed my product launch for months because I couldn't afford demo videos. Spent a weekend with Claude Code and Remotion. Now my reels are getting thousands of views. (Activity: 1152): The post describes how the author used Remotion, a React-based video generation tool, together with Claude Code to create demo videos for their product, bypassing the need for expensive motion designers. By leveraging Claude Skills like remotion-transitions and frontend-design, the author was able to generate SVG-based visuals and animated UI sections quickly, reducing the time to create each reel from 3 hours to under an hour. This approach resulted in thousands of views and increased interest in the product, all with $0 in production costs aside from the Claude Code subscription. Comments highlight that while the generated videos are effective for product demos, they lack the polish of professional editors. Remotion is praised for its utility in creating explainer reels, though it may not suffice for more complex motion design or character animation.
- Remotion is highlighted as a powerful tool for creating product demos and explainer reels, though it may not be suitable for more complex motion design or character animation. The combination of Remotion and Claude Code allows for efficient production of marketing videos without extensive design skills or budgets, making it ideal for startups and small businesses.
- A key insight from the discussion is that the real bottleneck in launching products is often the assumption that high production quality requires large budgets. The use of tools like Remotion and Claude Code demonstrates that with minimal resources, founders can produce effective marketing assets. This approach allows for rapid iteration and deployment, as videos can be version controlled like code, eliminating the need for extensive back-and-forth with designers.
- The Remotion and Claude Code stack is praised for enabling videos to be treated as React components, allowing for version control and easy updates. This workflow is revolutionary as it allows for quick changes and re-rendering of marketing videos, similar to software development processes, which is a novel concept for many in the industry.
- 4 months of Claude Code and honestly the hardest part isn't coding (Activity: 1448): The post discusses the challenges of using Claude Code to build a full iOS app, highlighting that while the AI can handle coding tasks effectively, design decisions and debugging with real users are more challenging. The app, consisting of 220k lines of code, faced issues like missing transactions when tested by external users, despite extensive internal testing. The author emphasizes the importance of security, using Plaid for bank connectivity and conducting a Snyk security audit to address vulnerabilities, ensuring that sensitive data is securely managed with Firestore rules and Cloud Functions. Commenters highlight the importance of security when handling sensitive information like bank details, suggesting that relying on AI for coding can lead to unexpected user behavior and security risks. There is also a noted difficulty in using Claude Code for CSS tasks, where the AI struggles with precise design requirements.
- ReddLemon highlights a critical security concern when using AI-generated code for handling sensitive information like bank details. They caution that a 'vibe coded' product could lead to security breaches and potential legal issues, emphasizing the unpredictable nature of user interactions with software.
- CyberMage256 points out a specific issue with Claude Code’s handling of CSS, where the AI fails to accurately interpret and execute instructions for making buttons identical in height. This suggests limitations in Claude’s ability to understand and implement precise design requirements.
- KILLJEFFREY mentions a ‘220k monolith,’ likely referring to a large codebase or project size, which can be challenging to manage and maintain, especially when using AI-generated code. This highlights the complexity and potential difficulties in scaling AI-assisted development.
- Two Claude Code features I slept on that completely changed how I use it: Stop Hooks + Memory files (Activity: 690): The post discusses two underutilized features of Claude Code: Stop Hooks and Memory Files. Stop Hooks allow users to automate follow-up actions after Claude completes a task, such as running a linter after code generation or auditing plans for edge cases. This feature streamlines workflows by reducing manual intervention. Memory Files address the issue of context loss in long or complex sessions by providing Claude with a persistent reference file at the start of each session, containing project structure, conventions, and decisions. This transforms Claude from a simple autocomplete tool into a more reliable collaborator. The post suggests these features are particularly beneficial for complex, multi-step tasks. Commenters highlight the utility of these features, with one suggesting the use of a /btw command for side questions and another recommending a structured approach to memory files, using daily logs and a curated summary to prevent bloat. There's also interest in chaining stop hooks for complex workflows, such as automating a sequence of tasks after editing a file.
- The use of memory files in Claude Code is highlighted as a significant enhancement, with users like 'asklee-klawde' implementing a dual-file system: daily logs for raw notes and a curated MEMORY.md for long-term context. This approach prevents memory bloat while maintaining accessibility to important information, optimizing the AI's ability to recall and apply project-specific patterns and decisions.
- Stop hooks are praised for their utility in automating repetitive tasks, such as auto-formatting and running sanity checks post-edit. Users are exploring more complex workflows by chaining stop hooks, like automating a sequence of actions after editing a file, which could include running tests and committing changes if successful. This reduces manual intervention and streamlines the development process.
- The concept of ‘Fork + Rewind’ is discussed as a method to manage multiple changes efficiently. This approach allows users to maintain the main context while handling unrelated changes separately, thus optimizing the use of context windows and improving workflow management. It is particularly useful for implementing multiple changes post-plan execution without cluttering the main context.
- Anthropic just released free official courses on MCP, Claude Code, and their API (Anthropic Academy). (Activity: 296): Anthropic has launched the "Anthropic Academy," offering free, comprehensive courses for developers working with Claude. The academy includes a 13-hour course on the Claude API, 10 hours on the Model Context Protocol (MCP), 3 hours on integrating Claude Code, and 4 hours on agent skills. These courses aim to transition users from basic prompting to advanced AI integration, providing official completion certificates. The academy is accessible here, and a detailed course breakdown is available on Mindwired AI. Commenters highlight the value of the courses in moving from basic prompting to engineering, emphasizing the importance of MCP and agent skills for real-world AI integration. They suggest pairing course knowledge with platforms like Runable or Replit Agent for practical application, and recommend using infrastructure tools like Hasura and Kong for effective backend integration.
- The release of structured courses like Claude 101 and Developer Deep-Dives on Anthropic Academy is significant for those transitioning from casual prompting to engineering. These courses emphasize the Model Context Protocol (MCP) and agent skills, indicating a shift towards integrating AI into real development workflows. This knowledge can be paired with execution platforms like Runable, Replit Agent, or V0 to build functional prototypes, enhancing the practical application of the course material.
- The advanced MCP content is valuable for designing tool layers that are maintainable over time. The courses encourage modeling tools as small, composable verbs, keeping state and policy external to the agent. A practical approach is to integrate a real backend, such as Postgres, as an MCP server during the lessons, incorporating logging, rate limits, and human approval flows to understand potential challenges early. This approach aligns well with infrastructure patterns using Hasura for typed GraphQL, Kong or Tyk for gateway policies, and DreamFactory for exposing legacy SQL as REST endpoints.
2. DeepSeek V4 and Related Speculations
... [Content truncated due to size limits]
