Episode 21

It's the Bill

June 17, 2026 · 18:05

The AI industry is discovering, with apparent surprise, that maybe you shouldn’t use more compute than you need. TechCrunch ran a piece this week on the cost-conscious turn — Brian Armstrong of Coinbase predicting eighty percent of workloads shift to ninety-nine percent cheaper models inside eighteen months, Harvey reporting a three-times inference cost reduction without quality loss, and the broader question of whether the scaling-first approach has finally hit a budget.

The Harvey quote in the article is precise about what’s changed. “The definition of quality is evolving from simply using the most powerful model for everything, to using the best model that gets the right answer most efficiently.” The panel reads this as a sentence written by someone who got an invoice. Not an evolution of quality — quality being redefined to fit the budget. Most of the savings, the article notes, come out of the pockets of the big labs as they head for IPO. The frontier labs say they’re fine with that; the volume shift is happening at the low end, which was barely profitable for them anyway. Legacy has heard that sentence before, in 1991, when IBM said it about Sun taking the low-end workstation market.

The panel’s argument lands somewhere close to four cycles of the same pattern. Mainframe MIPS optimization in the late eighties when IBM’s processor-second billing finally got somebody in accounting to pull the report. Sun E10K right-sizing when shops realized they’d bought hardware for workloads that ran fine on an Ultra 60. EC2 instance selection around 2012 when companies running m4.large for cron jobs got the bill. And now intelligent routing in 2026, where the discipline that should have been obvious from the start arrives because the invoice finally got walked into a meeting. The lesson takes ten to fifteen years to stick before someone invents a new abstraction layer and the cycle starts again. The reason it doesn’t stick is that the people who learned it last time have retired or been promoted out of the work.

Topics

  • The TechCrunch piece on the cost-conscious turn in AI deployment
  • Brian Armstrong’s eighty-percent / ninety-nine-percent prediction and Harvey’s three-times cost reduction
  • The bitter lesson and the “if you have unlimited compute” part nobody quotes — what Sutton actually wrote about versus how it’s been deployed as procurement justification
  • Four cycles of the same pattern: mainframe MIPS optimization (late eighties), Sun E10K right-sizing (client-server era), EC2 instance selection (2012), intelligent routing (2026)
  • “It’s the bill” — why cost discipline arrives when the invoice arrives, not when the strategy matures
  • What routing actually is operationally: a classifier, a model selector, an eval suite, an on-call rotation for when the cheap node lies about the answer
  • The predicted Q3 incident: cheap-model routing live in April, subtle quality degradation by late July, customer support tickets in August, engineering notices in September, two-week scramble back to Opus, October cloud bill spike, November board meeting, January AI-Operations role
  • Why “workflow knowledge of which queries go to which model” is a spreadsheet, not a moat — and why every prior cycle had the same spreadsheet under a different name
  • The IBM-Sun-Linux displacement pattern and what “ceding the low end” looks like in 1991, 2008, and 2026
  • Where the cost-optimization-as-a-category vendors come from — and why the same logos are pitching AI cost management now
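
The "what routing actually is" topic above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not anything described in the article or by the panel: the keyword rules, the model names `cheap-model` and `frontier-model`, and the 0.95 pass-rate floor are all hypothetical stand-ins for a real classifier, model selector, and eval suite.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Route:
    model: str   # hypothetical model name, not a real endpoint
    reason: str  # kept for the post-mortem


# Hypothetical rule set: task verbs the cheap model is trusted with.
SIMPLE_TASKS = ("summarize", "extract", "draft")


def classify(request: str) -> str:
    """Toy classifier: label a request 'simple', 'hard', or 'unknown'."""
    text = request.lower()
    # Assumption: long inputs and contract questions count as hard.
    if len(text.split()) > 200 or "contract" in text:
        return "hard"
    if text.startswith(SIMPLE_TASKS):
        return "simple"
    return "unknown"


def select_model(label: str) -> Route:
    """Model selector: only 'simple' goes cheap; anything unsure escalates."""
    if label == "simple":
        return Route("cheap-model", "classified simple")
    return Route("frontier-model", f"classified {label}")


def route(request: str) -> Route:
    return select_model(classify(request))


def pass_rate_by_class(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Eval suite summary: (query_class, passed) pairs -> pass rate per class."""
    totals: Dict[str, int] = {}
    passes: Dict[str, int] = {}
    for cls, ok in results:
        totals[cls] = totals.get(cls, 0) + 1
        passes[cls] = passes.get(cls, 0) + (1 if ok else 0)
    return {cls: passes[cls] / totals[cls] for cls in totals}


def classes_to_page_on(rates: Dict[str, float], floor: float = 0.95) -> List[str]:
    """On-call hook: classes whose pass rate fell below the floor."""
    return sorted(cls for cls, rate in rates.items() if rate < floor)
```

The per-class breakdown is the design choice that matters here: an aggregate pass rate can stay green while one class of queries quietly degrades, which is the failure the SRE describes in the transcript.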

Goat List Reasons referenced

  • #56 You do not want, need, or desire in any way for goats to run at a higher clock speed. And they don’t.
  • #92 Nobody will fire you for connecting diskless goats into a goat server when they think you should have purchased a massive mainframe goat to connect to a multitude of inexpensive dumb goats.
  • #13 You don’t need to buy a goat 98 to fix all the bugs in your goat 95.

Source Article

Can tech companies learn to love cheaper models? — TechCrunch, June 9, 2026. Reporting on the shift toward cost-conscious model selection in production AI deployments, with Brian Armstrong of Coinbase forecasting that eighty percent of workloads will move to ninety-nine percent cheaper models within eighteen months, Harvey’s documented three-times inference cost reduction, and the broader industry conversation about whether scaling-first deployment has met its budget constraint. The piece includes the Harvey quote on quality being redefined as the best model that gets the right answer most efficiently, and the observation that much of the savings comes out of the pockets of the large frontier labs as they approach IPO.
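
For a sense of scale, the Armstrong forecast can be turned into a blended cost figure. The sketch below is back-of-the-envelope arithmetic only, and it rests on assumptions the article does not state: that "eighty percent of workloads" means eighty percent of current spend, that total volume stays flat, and that the remaining twenty percent keeps its current price. The function name is made up for illustration.

```python
def blended_cost_fraction(shifted_share: float, discount: float) -> float:
    """Fraction of the original bill that remains after routing.

    shifted_share: share of current spend moved to the cheaper model (0..1)
    discount: how much cheaper that model is, as a fraction (0.99 = 99% cheaper)
    """
    if not (0.0 <= shifted_share <= 1.0 and 0.0 <= discount <= 1.0):
        raise ValueError("shifted_share and discount must be between 0 and 1")
    unshifted = 1.0 - shifted_share
    shifted = shifted_share * (1.0 - discount)
    return unshifted + shifted


remaining = blended_cost_fraction(0.80, 0.99)
print(f"{remaining:.3f}")        # 0.208
print(f"{1 / remaining:.1f}x")   # 4.8x
```

Under those assumptions the bill drops to roughly a fifth of its original size. Almost all of what remains is the twenty percent of spend that stayed on the expensive model.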

Panel

  • The Legacy Sysadmin
  • The Burnt-Out SRE
  • The Startup Founder
  • The Goat Farmer’s Counsel

Transcript

HOST: Welcome back to Stake and Rope, from Goat Security. Today: the AI industry is discovering, with apparent surprise, that maybe you shouldn’t use more compute than you need. TechCrunch ran a piece this week on the cost-conscious turn — Brian Armstrong of Coinbase predicting eighty percent of workloads shift to ninety-nine percent cheaper models inside eighteen months, Harvey reporting a three-times inference cost reduction without quality loss, and the broader question of whether the scaling-first approach has finally hit a budget. With me today: the Legacy Sysadmin, who has watched this exact movie before, possibly on a VAX. The Burnt-Out SRE, who I assume knows exactly which line item on the cloud bill triggered this. The Founder, who I’m sure has a take on what this means for early-stage companies. And Goat Farmer’s here too.

HOST: Legacy, start us. What does this remind you of?

LEGACY SYSADMIN: [sighs] It doesn’t remind me of anything. I lived it. Three times, at least. The first one was mainframe MIPS optimization in the late eighties. IBM was billing by the processor-second, and somebody in accounting finally pulled the report. Suddenly every COBOL shop in the country was rewriting batch jobs to fit inside an off-peak window. The second was right-sizing in the client-server era: we’d bought Sun E10Ks for workloads that ran fine on an Ultra 60. The third was EC2 instance selection around 2012. Companies running m4.large for cron jobs. Same pattern. The bill arrives, the discipline arrives.

FOUNDER: Right, but I think the framing here is actually different. This isn’t just cost pressure. This is the market maturing. We’re moving from a phase of capability discovery into a phase of capability deployment, and that requires a totally different —

SRE: It’s the bill.

FOUNDER: I mean, sure, partially, but —

SRE: It’s the bill. Somebody got the invoice. Somebody printed it out. Somebody walked into a meeting with it.

LEGACY SYSADMIN: [chuckles] That’s how it always goes.

HOST: Founder, hold on. The article quotes Armstrong saying eighty percent of workloads move to ninety-nine percent cheaper models. You’re telling me that’s about market maturity and not about people getting a bill they didn’t expect?

FOUNDER: I’m saying both can be true. Look, the bitter lesson is still correct. Scaling is still the dominant strategy at the frontier. What’s happening now is that the value chain is bifurcating. You’ve got frontier compute for IQ-max workloads, and you’ve got commodity inference for everything else. This is actually huge for startups. Massive moat opportunity. If you’re a Series A company and you can route ninety percent of your calls to a small model and reserve Opus for the hard stuff, your unit economics are completely different from someone who locked in on GPT-5.5 a year ago. This is distribution unlocked. I’ve been telling my portfolio companies —

SRE: What’s your inference spend.

FOUNDER: What?

SRE: Your portfolio companies. What’s the monthly spend that triggered the conversation.

FOUNDER: I mean, it varies, but —

SRE: It’s somewhere between forty and a hundred and twenty thousand a month and somebody on the board asked about it.

FOUNDER: [chuckles] Okay, fair, but the point is —

LEGACY SYSADMIN: That’s the same number as the EMC SAN conversation in 2004. Adjusted for inflation.

HOST: Legacy, the article has this line about the bitter lesson — the idea that compute-intensive approaches win in the long run. The Founder is defending that. Is the bitter lesson wrong, or is this something else?

LEGACY SYSADMIN: The bitter lesson isn’t wrong. It’s just incomplete. Rich Sutton wrote that essay about research progress, not about quarterly P&L statements. The lesson is that if you have unlimited compute, throwing more at the problem beats clever engineering. The part nobody quotes is the “if you have unlimited compute” part. We’ve never had unlimited compute. We’ve had subsidized compute. And subsidies end.

GOAT FARMER:

Reason number 56. You do not want, need, or desire in any way for goats to run at a higher clock speed. And they don’t.

SRE: [exhales] That’s the part that gets me. The article frames this like it’s a discovery. “Initial tests suggest that, when the system is arranged right, cheaper models could sub in without any sacrifice in quality.” That’s the sentence. That’s the actual sentence in the article. Like we just figured out you should use the smallest tool that does the job.

FOUNDER: It is a discovery, though. In context. The default for two years was throw the biggest model at it. Now we’re learning how to route.

SRE: We didn’t learn anything. The bill came.

HOST: SRE, walk me through what this looks like operationally. Harvey says they cut inference costs three times without losing quality. What does that actually mean on the ground?

SRE: It means someone built a router. You have a request coming in, you classify it, you decide which model handles it. Most requests are simple — summarize this document, extract these fields, draft a paragraph. Those go to the cheap model. The hard ones go to Opus. The router is the work. Building it, maintaining it, monitoring when the cheap model degrades, handling the cases where the classification was wrong. That’s all engineering. That’s all on-call. The post-mortem is already written — the cheap model started failing on a class of queries in the last week of the quarter, nobody caught it because the dashboards were green, customer support tickets piled up, and somebody manually flipped the router to send everything to Opus while they debugged. Cost spike, board notices, repeat.

LEGACY SYSADMIN: That’s load balancing. We called it load balancing.

SRE: It’s load balancing with extra steps. The extra step is that the cheap node sometimes lies about the answer.

FOUNDER: Or — hear me out — the extra step is that you’ve got dynamic capability allocation, which is a much more sophisticated infrastructure than anything in the previous era. This is real software engineering happening on top of the model layer. The teams that figure this out first have a structural advantage.

HOST: Founder, you’ve used the word “moat” twice. Walk me through how routing requests to a cheaper model is a moat. Anthropic and OpenAI can build the same router. Fireworks already did. What’s the moat?

FOUNDER: [chuckles] Okay, the moat isn’t the router itself. The moat is the workflow knowledge. Knowing which of your specific customer queries can be served by a small model. That’s proprietary. That’s a data advantage. You can’t just —

SRE: That’s a spreadsheet.

FOUNDER: It’s not a spreadsheet, it’s a —

SRE: It’s a spreadsheet with which endpoints went to which model and what the satisfaction scores were. I’ve seen this spreadsheet. I maintained one in 2019 for routing to different regions of AWS.

LEGACY SYSADMIN: I maintained one in 1994 for routing to different Tandem nodes.

GOAT FARMER: Had that one.

HOST: I want to pull back for a minute. Legacy, you mentioned three previous cycles — mainframes, client-server, EC2. Walk me through one of them. The shape of how it played out.

LEGACY SYSADMIN: Take the EC2 cycle. 2008 through about 2014. The pitch was elastic compute, pay for what you use, scale on demand. Everyone took the pitch literally. They didn’t tune anything. They ran m4.large for cron jobs that fit on a t2.micro. The bills were small enough nobody cared. Then the bills got bigger. Then someone, I think it was around 2014, 2015, started selling “cloud cost optimization” as a category. Spot instances. Reserved instances. Right-sizing tools. Whole companies built on the premise that you’d been overprovisioning for five years and now needed help fixing it. The total spend didn’t go down, by the way. The growth flattened. People made calmer decisions about which jobs went where. The lesson took about seven years to stick. And then it didn’t stick, because as soon as Kubernetes showed up, everyone overprovisioned again, and we ran the cycle a second time inside the same platform.

SRE: And the people who built the cost optimization tools are now building AI cost optimization tools. With the same logos. I’ve seen the pitch decks.

LEGACY SYSADMIN: Of course you have.

FOUNDER: I mean, that’s actually a good business. Picks and shovels. If you’re building inference cost management right now, you’re going to do extremely well over the next thirty-six months. I’ve got two portfolio companies in that exact space.

HOST: And neither of them are profitable yet, I assume.

FOUNDER: They’re growing. Growth is the metric.

SRE: Growth is the incident.

HOST: Let me ask the harder question. Founder, the article points out something specific — much of the savings comes out of the pockets of the big labs, just as they’re heading for IPO. If your portfolio companies route eighty percent of their traffic away from Opus and GPT-5.5, the labs lose revenue. How does that work for the people you’ve also pitched on backing OpenAI and Anthropic at their next round?

FOUNDER: [chuckles] That’s a great question. I think — look, the frontier labs are going to do fine. The IQ-max workloads are the high-value workloads. They’re capturing the top of the market, which is where the margin is. The volume shift to cheap models is happening at the low end, which was barely profitable for them anyway.

SRE: That’s not what the article says.

FOUNDER: It’s what I think the article should have said.

LEGACY SYSADMIN: That’s the same thing IBM said in 1991 about Sun taking the low-end workstation market. “It’s fine, we have the high-margin mainframe business.” Then Sun got better and ate its way up. Then Linux got better and ate Sun. The thing about ceding the low end is that the low end gets better.

GOAT FARMER:

Reason number 92. Nobody will fire you for connecting diskless goats into a goat server when they think you should have purchased a massive mainframe goat to connect to a multitude of inexpensive dumb goats.

HOST: SRE, you said earlier the post-mortem was already written. Walk me through it.

SRE: [exhales] Q3 incident. Cheap-model routing went live in April. Engineering celebrated. Cost dashboard turned green. Around late July, the cheap model started producing subtly wrong outputs on a class of legal queries — something about how it handled negation in long contracts. The router was classifying these as simple-query, sending them to GLM, and GLM was returning confident wrong answers. The eval suite didn’t catch it because the eval suite was written before the routing changes and didn’t cover the new failure mode. Customer support tickets started ramping in August. Engineering noticed in September because someone on the legal team forwarded a customer email to their VP. Two-week scramble. Everything routed back to Opus. October cloud bill is four times the previous month. Board meeting in November. Discussion of “AI cost discipline” leadership initiative for next year. Whole cycle restarts in January with new router, more sophisticated evals, and a head of AI Operations role.

FOUNDER: I mean, that’s a learning. That’s how you build a moat.

SRE: That’s how you build a department.

HOST: Let’s land the plane. Closing thoughts. Goat Farmer first.

GOAT FARMER:

Reason number 13. You don’t need to buy a goat 98 to fix all the bugs in your goat 95.

GOAT FARMER: Same goat. Costs less. Gives milk.

HOST: Founder.

FOUNDER: [chuckles] I actually think this is a really exciting moment for the industry. We’re maturing past the everything-must-be-frontier phase and into something more interesting — modular AI architectures, intelligent routing, real cost discipline. The startups that figure this out are going to be the ones building the durable companies. I’m going to write this up tonight. There’s a Substack in this for sure. Maybe an X thread first. Probably both. I think the headline is something like “The Routing Era” or “Cost-Aware AI.” I’ll workshop it. But this is the moment where the next generation of AI companies separates from the pack, and the people who figure out the unit economics now are the ones building the Stripes and Shopifys of this cycle.

HOST: SRE.

SRE: The thing the article doesn’t say — and it really doesn’t say it — is that none of this is about intelligence. None of it. The Harvey quote is “the definition of quality is evolving from simply using the most powerful model for everything, to using the best model that gets the right answer most efficiently.” That’s a sentence written by someone who got a bill. That’s not an evolution of quality. That’s quality being redefined to fit the budget. It’s going to work fine for most workloads. It’s going to fail in a specific class of cases that nobody catches until a customer notices. The post-mortem will recommend better evals and more sophisticated routing. The recommendation will be implemented partially. It will fail again in eighteen months. That’s the cycle. Nothing about this is new except the model names.

HOST: Legacy.

LEGACY SYSADMIN: [sighs] I’ve been doing this since the late seventies. I’ve watched compute get expensive, then cheap, then expensive again, four times. Every time the cheap phase ends, somebody discovers right-sizing. They publish a blog post about it. They give it a new name. In 1989 it was capacity planning. In 1998 it was application sizing. In 2007 it was server consolidation. In 2014 it was cloud cost optimization. In 2026 we’re calling it intelligent routing. The lesson takes about ten to fifteen years to stick before someone invents a new abstraction layer and we have to relearn it. The reason the lesson doesn’t stick is that the people who learned it last time have retired or been promoted out of the work. The people doing the work now have never seen a bill they couldn’t get reimbursed. They’ll see one this year. They’ll learn what we learned. And then in 2041, somebody on a podcast will explain to a younger generation that you don’t need to run every workload on a frontier model. And that person will think they discovered it.

HOST: Compute gets cheap, then expensive, then we rediscover restraint, then we forget. The bill always comes. We’ll see you next time.