TTY-changelog #043

Agent cleanup debates, Anthropic open-sources Petri, Thinking Machines pushes real-time AI, TanStack worm spreads, local LLMs surge on Macs, Figure robots go autonomous.

May 16, 2026

👉 Article originally posted on TTY

Community Discussion

🧹 Cleaning up agentic junk – Lior Oren opened a week-long thread on what happens when a multi-repo team runs spec-to-merge agents with almost no human touch and the system starts collecting residue: dead code, junk specs, irrelevant quality metrics, agents.md waste, dead skills, and stale gotchas nobody curates once the original problem gets fixed. Applying the Boy Scout rule to all agents helped a bit but made PRs dirtier. The sharper framing came later in the thread: “Cleaning itself works well. The when, the scope, and the how is where I’m trying to figure out a common pattern.”

Approaches covered the full range.

Lucas DiCiocci skips auto-merge entirely, freezes code when junk accumulates, and runs intentional cleanup before the merge-conflict hell sets in.
Pierre Chapuis follows a similar manual cadence with cleanup prompts triggered by hand, never cron’d.
Hugo Venturini hunts dead code and antipatterns with a single sweep agent.
Anicet Nougaret asks the agent to produce dataflow diagrams of the messy parts, which by his account catches 99% of the rot before it ships.
Nikolay Tchakarov’s team enforces quality checklists baked into the agent loop: PRs under 7 files, reuse before adding, dedup checks, with humans still approving as the accountability anchor.
Jocelyn Fournier doubled down on guardrails: mandatory tests with minimum mocks, sharp acceptance criteria in the RFC for the agent to follow, and reviews from other agents not polluted with the implementation context.
Amine Saboni argued for Domain Driven Design as the harness: when the domain model lives clearly in specs, documentation, and the code, agents have less semantic margin to drift, and a /simplify pass or a delegated reviewer does the rest.
Jérémie Bordier sketched the maximalist version, a “dream mode” autonomous maintainer that pulls back the global picture of the project, removes dead code, updates dependencies, and rewrites the .md files.
Youssef Tharwat offered the opposite bet with Noodlbox: rather than another agent, a local static analysis engine running under a file watcher produces a constantly fresh list of action items the coding agent consumes over MCP or CLI, scoped per PR or globally, with an in-flight extension to flag stale .md files whose linked code no longer exists or has semantically drifted.
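The stale-.md check in the last item can be approximated with very little machinery. What follows is a minimal sketch, assuming relative Markdown links are the only signal of "linked code"; it is not Noodlbox's implementation, it does not attempt the semantic-drift part, and `find_stale_links` is a hypothetical helper name.

```python
import re
from pathlib import Path

# Matches inline Markdown links such as [text](path/to/file.py#anchor).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Matches targets that start with a URL scheme (https:, mailto:, ...).
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def find_stale_links(repo_root: Path) -> dict[str, list[str]]:
    """Map each .md file to the relative link targets that no longer exist."""
    stale: dict[str, list[str]] = {}
    for md_file in sorted(repo_root.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8", errors="ignore")
        for target in LINK_RE.findall(text):
            # Skip web links, mail links, and pure in-page anchors.
            if SCHEME_RE.match(target) or target.startswith("#"):
                continue
            # Drop any #fragment before checking the filesystem.
            path_part = target.split("#", 1)[0]
            if not (md_file.parent / path_part).exists():
                key = str(md_file.relative_to(repo_root))
                stale.setdefault(key, []).append(target)
    return stale
```

Run from a file watcher or a pre-commit hook, the returned mapping is the kind of "fresh list of action items" a coding agent could consume, scoped to whichever files it is given.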

The skeptics pushed back on the premise. Gawen Arab: “Isn’t junk the consequence of the high variance of LLMs? Having a high-variance LLM fix the junk of another high-variance LLM feels naively like a dead end.” For now he keeps a human in the loop and skips the auto-loop entirely.

The hardest unsolved problem turned out not to be the code at all. The real junkyard, as Lior Oren pointed out later, is the agent memory and gotcha files: a session produces a note like “db migration files stale at x execution, wait more time than usual,” the issue gets fixed somewhere downstream, and nobody cleans the shared dev memories. Amine Saboni admitted he doesn’t commit those files, treating them as personal hygiene rather than a shared standard, but at team scale that breaks: notes are pointless unless shared, and once shared they go stale unless someone owns them. Lior Oren eventually landed on a pattern worth trying: “Trigger cleaning agents with each PR but have them create a stacked PR. The main PR is not polluted, the scope of the cleaners is limited to the files touched in the main PR, and it saves the scheduling and cron infra.” He has since wired it into GH CI as skills.
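The scoping half of that stacked-PR pattern is easy to sketch. The thread does not describe the actual CI wiring, so everything below is an assumption: the helper names, the `-cleanup` branch suffix, and the file types the cleaners handle are all hypothetical. The only real tool used is `git diff --name-only` with a three-dot range, which lists files changed on the head branch since it diverged from the base.

```python
import subprocess
from pathlib import PurePosixPath


def changed_files(base: str, head: str = "HEAD") -> list[str]:
    """Files touched on `head` since it diverged from `base`."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...{head}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def cleanup_scope(
    files: list[str], suffixes: tuple[str, ...] = (".py", ".md")
) -> list[str]:
    """Restrict the cleaners to touched files of the kinds they handle."""
    return sorted({f for f in files if PurePosixPath(f).suffix in suffixes})


def stacked_branch(main_branch: str) -> str:
    """Hypothetical naming convention for the cleanup PR stacked on the main PR."""
    return f"{main_branch}-cleanup"
```

A CI job would call `cleanup_scope(changed_files("origin/main"))`, hand that list to the cleaning agents, and push their commits to `stacked_branch(...)` with the main PR's branch as the base, so the main PR's diff stays untouched.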

Audio

🗣️ Interaction models for real-time AI – Thinking Machines released a research preview of interaction models: a from-scratch architecture designed to continuously ingest audio, video, and text and to think, respond, and act in real time. The argument: today’s turn-based interfaces collapse the model’s perception into a single thread that freezes during generation, creating a bandwidth bottleneck for collaboration. The fix was a multi-stream, micro-turn design that let both model and user interject, see, and respond without waiting for the other to finish.

The multi-stream micro-turn loop replaced the wait-then-respond cycle, so the model kept perceiving and the user kept signaling while either side was mid-output.
The model was trained from scratch rather than retrofitted onto an autonomous-first base, on the bet that interactivity had to scale alongside intelligence rather than be added as a UX layer.
Thinking Machines reported state-of-the-art combined performance on intelligence and responsiveness, positioning collaborative real-time interaction as a primary capability rather than scaffolding.

Autonomous Agents

🧪 Petri auditing agent open-sourced – Anthropic open-sourced Petri (Parallel Exploration Tool for Risky Interactions), a framework that deployed an auditor agent to probe target AI systems through multi-turn conversations with simulated users and tools, then scored each transcript across multiple safety dimensions to flag situational awareness, whistleblowing, self-preservation, reward hacking, and similar concerning behaviors.

🧮 AI co-mathematician from Google – DeepMind introduced a stateful workbench where mathematicians collaborated with AI agents on ideation, literature search, computational exploration, and theorem proving. Scored 48% on FrontierMath Tier 4.

Community take w/ Enrico Piovano: “I recently tried LLMs on math problems from my applied math PhD. None could give correct proofs, even for smaller lemmas. Even powerful LLMs still struggle with less common problems and don’t generalize well in RL.”

🦘 LLMs can’t make abductive leaps – Position paper used general relativity as a case study and argued LLMs could execute deduction but lacked the abductive jump needed to formulate novel premises, proposing multimodal world models as the bridge.

Biotech, Health, and Chemistry

💊 Isomorphic skips the wet lab – Isomorphic Labs confirmed it would not be building any in-house wet lab or chemistry capability and would instead outsource all experimental work to contract research organizations (CROs), the third-party labs pharma companies hire to run experiments on commission. First clinical trials were targeted for end of 2026, with Demis Hassabis confirmed to stay as CEO through the push. The pipeline ran on Isomorphic’s drug design engine IsoDDE, which generated candidates that external CROs then validated experimentally.

Community take w/ Félix Raimundo: “No wet lab is insane to me. They clearly have too much cash. They will use CROs and pay their margins. I assume they think experiments are like compute and they view CROs as hyperscalers.”

🧬 Cellular Intelligence acquires STEM-PD – Cellular Intelligence acquired STEM-PD, an FDA-cleared Phase 2-ready allogeneic cell therapy for Parkinson’s, from Novo Nordisk, alongside a strategic investment from Novo Nordisk in the company.

Image, Video & 3D

🎨 Qwen Image 2.0 unifies generation and editing – Omni-capable image model that paired Qwen3-VL as condition encoder with a Multimodal Diffusion Transformer. Handled ultra-long text rendering up to 1K tokens for slides and posters, multilingual typography, and photorealism in one framework.

🌐 Image to world tool for Claude – A community tool turned any image into a 3D environment with individual meshes, physics, and ambient sound layers, exporting to Unreal or Blender. Built on WorldLabs and FAL under the hood.

Cyber

🚨 TanStack hit by massive supply chain attack – An attacker forked TanStack’s open-source React router project under a disguised name, opened a pull request that TanStack’s own automated build system ran, and used that single PR to steal a secret token from the build runner. With that token they published 84 malicious package versions through TanStack’s legitimate, cryptographically signed release pipeline, so the poisoned packages arrived with valid provenance. Every developer who installed one had their credentials stolen, and those credentials were silently used to publish poisoned versions of the packages they maintained, turning the campaign into a self-spreading worm across npm and PyPI. Within hours it had reached companies like Mistral AI, OpenAI, UiPath, OpenSearch, and Guardrails AI.

🔓 Mythos breaks Apple M5 in days – Researchers at Calif.io, working with Anthropic’s Mythos Preview, built the first public macOS kernel memory corruption exploit on Apple M5 silicon in five days, against an MIE (Memory Integrity Enforcement) defense that Apple had spent five years and likely billions building into hardware. MIE was Apple’s ARM MTE-based memory safety system, designed to disrupt every public exploit chain against modern iOS including the recently leaked Coruna and Darksword kits.

Mythos Preview was Anthropic’s restricted-access security-focused model, released in April 2026 to a small group of Project Glasswing partners to find and patch vulnerabilities before equivalent capabilities became broadly available.
The macOS attack path was an accidental discovery starting April 25, with the team reporting it in person at Apple Park rather than through standard submission channels.

Language Models

🌳 PageIndex for vectorless RAG – Reasoning-based RAG system that built a hierarchical tree index from long documents and used LLMs to navigate via tree search rather than vector similarity, aimed at professional documents where similarity fell short of true relevance.
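The navigation idea can be illustrated with a toy tree walk. This is a sketch of the general technique only, not PageIndex's API: the `Node` type and function names are hypothetical, and `keyword_chooser` is a deliberately crude stand-in for the LLM call that would decide which branch to descend into.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    title: str
    summary: str
    children: list["Node"] = field(default_factory=list)


# A chooser sees the query and the candidate children and returns an index.
Chooser = Callable[[str, list[Node]], int]


def keyword_chooser(query: str, options: list[Node]) -> int:
    """Stand-in for the LLM: pick the child sharing the most words with the query."""
    words = set(query.lower().split())
    scores = [
        len(words & set(f"{n.title} {n.summary}".lower().split()))
        for n in options
    ]
    return scores.index(max(scores))


def navigate(root: Node, query: str, choose: Chooser = keyword_chooser) -> list[str]:
    """Walk from the root to a leaf, returning the titles along the chosen path."""
    path, node = [root.title], root
    while node.children:
        node = node.children[choose(query, node.children)]
        path.append(node.title)
    return path
```

Swapping `keyword_chooser` for a function that prompts a model with the query and the children's titles and summaries gives the reasoning-based variant; no embeddings or similarity scores are involved at any step.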

📐 Geometry of consolidation in RAG – NeurIPS paper proved any compression of embedded items into fewer representatives carried an irreducible identity-retrieval error tied to local effective dimension. RAG compression schemes therefore had a minimum wrongness no budget could escape.

MLOps

🎙️ The end of finetuning – Latent Space podcast with Jeremy Howard, who shifted from finetuning advocate to arguing finetuning was becoming obsolete as models gained context and retrieval capabilities. Remaining reasons to finetune: compliance, tone, cost, format, and latency.

📈 Local Mac models got 5x smarter – Chart compared smartest open-weight models runnable on a 128GB MacBook Pro on the Artificial Analysis Intelligence Index v4.0: from Llama 3 70B at score 10 in May 2024 to DeepSeek V4 Flash at 47 in May 2026, a roughly 4.7x gain in two years that the chart framed as faster than Moore’s Law.
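The comparison is quick to check from the two data points quoted above. The sketch below assumes smooth exponential growth between them and takes "Moore's Law" in its common form of a doubling roughly every 24 months; an index score and a transistor count are different quantities, so this is a loose analogy rather than a like-for-like measurement.

```python
import math


def growth_ratio(start: float, end: float) -> float:
    """How many times larger the end value is than the start value."""
    return end / start


def doubling_time_months(start: float, end: float, elapsed_months: float) -> float:
    """Doubling time implied by exponential growth from start to end."""
    return elapsed_months * math.log(2) / math.log(end / start)


ratio = growth_ratio(10, 47)                  # score 10 (May 2024) to 47 (May 2026)
doubling = doubling_time_months(10, 47, 24)   # 24 months elapsed
print(f"{ratio:.1f}x overall, doubling about every {doubling:.1f} months")
```

The implied doubling time comes out a little under 11 months, against the roughly 24 months usually associated with Moore's Law.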

🍎 Local LLM picker for Apple Silicon – Tool that ranked which open models fit and ran well on specific Apple Silicon configurations, factoring unified memory, bandwidth, tokens per second, and break-even versus cloud API costs.
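The fit and speed factors can be roughed out with two common rules of thumb. This is a back-of-envelope sketch, not the tool's method: it assumes a dense model whose decode speed is bound by memory bandwidth, ignores KV cache and activation memory beyond a flat hypothetical `reserve_gb`, and uses made-up hardware numbers in the usage note.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billion * bits_per_weight / 8


def fits(
    params_billion: float,
    bits_per_weight: float,
    unified_memory_gb: float,
    reserve_gb: float = 8.0,  # hypothetical headroom for OS, apps, and KV cache
) -> bool:
    """Whether the weights fit in unified memory after reserving headroom."""
    return weights_gb(params_billion, bits_per_weight) <= unified_memory_gb - reserve_gb


def decode_tps_ceiling(
    params_billion: float, bits_per_weight: float, bandwidth_gb_s: float
) -> float:
    """Upper bound on tokens/s if every token reads all weights once."""
    return bandwidth_gb_s / weights_gb(params_billion, bits_per_weight)
```

For example, a 9B model at 4 bits per weight is about 4.5 GB, so `fits(9, 4, 24)` holds comfortably, while `decode_tps_ceiling(9, 4, 180)` gives a ceiling of 40 tokens per second on a hypothetical 180 GB/s machine; real throughput lands below the ceiling.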

💻 Running local models on M4 24GB – Practical writeup of getting Qwen 3.5-9B running at around 40 tokens per second on a MacBook Pro M4 24GB with LM Studio, including thinking-mode prompts and tool use, plus failed attempts with Qwen 3.6 Q3 and Devstral Small 24B.

Programming

📊 Coding agents benchmark released – Artificial Analysis launched a composite index for coding agents averaging pass@1 across SWE-Bench-Pro-Hard, Terminal-Bench v2, and SWE-Atlas-QnA, while also reporting cost, token usage, and execution time across agent and model combinations.
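The composite can be read as a plain average over the three benchmarks. The sketch below assumes an unweighted mean, which the summary above does not confirm, and the scores in the usage note are invented for illustration rather than taken from the index.

```python
from statistics import mean

BENCHMARKS = ("SWE-Bench-Pro-Hard", "Terminal-Bench v2", "SWE-Atlas-QnA")


def composite_index(pass_at_1: dict[str, float]) -> float:
    """Unweighted mean of pass@1 across the three benchmarks (assumed weighting)."""
    missing = [b for b in BENCHMARKS if b not in pass_at_1]
    if missing:
        raise ValueError(f"missing benchmark scores: {missing}")
    return mean(pass_at_1[b] for b in BENCHMARKS)
```

With hypothetical pass@1 values of 0.30, 0.45, and 0.60, the composite is 0.45. Requiring all three scores keeps agent and model combinations comparable, since a missing benchmark would otherwise inflate or deflate the average.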

💸 Claude Agent SDK pricing change – Starting June 15, 2026, Agent SDK and claude -p usage will no longer count toward Claude plan limits and will instead draw on a separate monthly credit set by tier: $20 for Pro, $100 for Max 5x, $200 for Max 20x.

Community take w/ Stan Girard: “Happy I switched to codex and got used to it two months ago. Really not happy with Claude in the last months.”

🤖 Chrome extension to vibe code websites – Stéphane Collot shipped a browser extension that let users describe UX improvements in plain English, with an AI agent inspecting the live page, writing JS plus CSS, and injecting it instantly. Features persisted across sessions via per-feature in-browser git repos.

Robotics, World AI

🚀 Former Qwen lead launches new lab – Lin Junyang, former technical leader of Alibaba’s Qwen, started a company valued around $2B targeting world models and embodied brains, with team members from ByteDance, Tencent, and overseas. Sequoia China and Gao Rong were among funds contacted.

🛏️ Figure robots tidy a bedroom – Two Helix-02 humanoids reset a bedroom in under two minutes, opening doors, hanging clothes, closing a book, taking out trash, and making a bed together. They ran a single learned vision-language-action policy with no shared planner. A separate livestream showed an 8-hour autonomous shift.

🚁 Aerial robot for vibration dampers – Lightweight aerial platform that landed on power transmission lines and replaced Stockbridge vibration dampers using motorized wheel arms, an impact wrench, and a damper-lifter stage, keeping human operators safely on the ground rather than working on live lines.

TTY Lunch

Each week, TTY Lunch brings together exceptional builders around the table. Today’s lineup included Clément Castellon, Gabriel Olympie (2501.ai), Jules Belveze, Julien Millet, Mathieu Kassovitz, Quentin Dubois (OSS Ventures), Stéphane Béreux (Jimini AI), and Yvann Barbot (TerraLab).

We used this almost-holiday in France 🇫🇷 to reinvent the world all at once, while trying to satisfy Mathieu’s insatiable need to understand. Topics in no particular order included evals, second-brain management, the impact and ethics of AI, the geopolitical energy crisis, data sovereignty, and even a bit of video games.

Contributors This Week

Lior Oren, Félix Raimundo (Tychobio), Jérémie Bordier (XHR), Youssef Tharwat (Noodlbox), Enrico Piovano (Goji), Stan Girard (Quivr), Amine Saboni (Pruna.ai), Pierre Chapuis (Finegrain), Gabriel Olympie (2501.ai), Julien Seveno-Piltant, Maziyar Panahi (OpenMed), Arnaud Thiercelin, Benjamin Trom, Hugo Venturini (SkipLabs), Ihab Bendidi (Recursion), Jeremie Kalfon (Pasteur), Raymond Rutjes, Stéphane Collot (Sequense), Anicet Nougaret (Ascii), Gawen Arab (Airbuds), Ivan Yamshchikov (Pleias), Jocelyn Fournier (Softizy), Karim Matrah (Contrast), Koutheir Cherni (Guepard), Lucas DiCiocci, Nikolay Tchakarov (Asteria), Quentin Dubois (OSS Ventures)

TTY Weekly
