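How to Keep an AI Agent on Process: Gates Only Keep the Books, They Don't Stop Anyone
The gates in a workflow don't actually block the AI agent at runtime; what they want is a receipt that can't be altered. What really has teeth isn't the gate, it's a record that can't be erased or denied.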
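Claude Code Got Dynamic Workflows, So I Opened Up the JS and Took a Look
Claude Code released dynamic workflows on May 28, the same day Opus 4.8 shipped. More important than the "can spin up 1000 subagents" number is that the orchestration JS is written by Claude, not run by Claude, and that is well worth thinking about.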
How Claude Code's Dynamic Workflows Run 1,000 Subagents
Claude Code's new dynamic workflows hand the orchestration plan over to a JavaScript script that Claude writes. The runtime executes it with up to 1,000 subagents — 16 concurrent — and Claude's context only sees the final cross-checked answer.
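AI Can't Even Count the r's in Strawberry. Is It Dumb?
Ask an AI how many r's are in strawberry and it used to answer wrong with total confidence. Newer models mostly get it right now, but why did it get it wrong in the first place? I talk it through with a building-block analogy, and along the way explain why that cause still hasn't really gone away.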
Why Does AI Sound So Confident When It's Wrong?
AI's most dangerous trait isn't that it's wrong sometimes. It's that its tone when wrong is identical to its tone when right. Here's my plain-language take on why, including why it won't just say 'I don't know'.
How I Use ChatGPT, Claude, and Gemini Day to Day
Not a benchmark or a verdict on which AI is best — just the small habits I picked up from keeping ChatGPT, Claude, and Gemini all open: route by task, give context first, don't expect one perfect answer, and verify the confident-sounding stuff.
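Why Does AI Sound Exactly the Same When It's Bluffing as When It's Telling the Truth?
What makes AI most misleading isn't that it gets things wrong; it's that its tone when wrong is exactly the same as its tone when right. Starting from the idea that "it keeps guessing the next most natural word", a plain-language look at why sounding sure is not the same as knowing.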
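I Keep Three AI Chat Windows Open Every Day: A Few Small Habits I've Picked Up Lately
No grand theory here, just a few small habits I picked up after using ChatGPT, Gemini, and Claude side by side for a while: send different tasks to different tools, explain things clearly before asking, and don't expect to get it right in one shot.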
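Benchmark Saturation Is Really a Verification Problem
GSM8K at 99%, MMLU just over 90, HLE already in the 40s by mid-2026. Every new "harder benchmark" looks like it is solving the problem, but the structural issue hasn't changed: we have never verified what the model learned, only measured whether it has seen the material before.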
LLM Benchmark Saturation Is a Verification Problem
GSM8K at 99%, MMLU in the 88-94% noise band, HLE already in the mid-40s by mid-2026. Each round of harder benchmarks looks like progress, but the field never solved the underlying problem: we measure correlation with a test distribution and call it capability.