LLM Neuroanatomy III: Why RYS Works — The Language-Agnostic Middle
Probing a 27B model shows its middle layers organise by meaning, not by language or format — weak evidence against the strong Sapir-Whorf hypothesis, and the reason RYS works.
In mid-2024, the HuggingFace Open LLM Leaderboard was the Colosseum for Open-Weight AI. Thousands of models were battling it out, submitted by both well-funded labs with teams of PhDs and fine-tuni...
Small AI computers are usually sold with large dreams and shitty memory buses. I have a ridiculous server that pulls a few kilowatts, but I wanted a local Hermes Agent box that cou...
A while back I did some optimisation on my Hopper system for MiniMax M2.1, and this was followed by some deeper GH200 benchmarking, where I measured the machine as a memory-shuffling ...
This article is mostly for me, as a way to record the peculiarities of my server; but it might come in handy for the ~3 other people running a home Grace-Hopper server? In a previous ...
In Part 1, I described how duplicating a block of seven middle layers in Qwen2-72B — no weight changes, no training — produced the #1 model on the HuggingFace Open LLM Leaderboard. The method, whic...
May 2016, Munich. I had just joined NanoTemper Technologies as a Bioanalytics Scientist. If you aren’t familiar with NanoTemper, they build high-end biophysical instruments. At the ti...
So you’ve built a €9,000 Grace–Hopper “desktop” (see: my previous post involving 16-million-degree GPU temperatures). Running llama.cpp benchmarks is fine, but the real test of local ...
Running large language models locally has always been a game of compromise. You either spend \$10,000+ on consumer GPUs that can barely handle 70B parameter models, or you dream about...
No one knows how big AGI needs to be. The current consensus among the scaling-pilled crowd is “trillions of parameters and a nuclear power plant.” Maybe they’re right. But I spent years dissecting ...