ARR Is Not the Problem. The Institutional Vacuum Around It Is.

On Cluely, the AI revenue metric debate, and what economists call the "cop on the beat" question

May 03, 2026

Last month, Cluely co-founder Roy Lee admitted on X that the $7 million in annual recurring revenue he had given a TechCrunch reporter was, in his own words, “BS.” The actual figure was $5.2 million, which means the number he gave overstated reality by about 35%. The confession lit up financial Twitter for a week, anchored a Bloomberg piece by Annie Bang asking whether ARR has become “the least-trusted metric of the AI era,” and prompted the usual round of think-pieces about founder ethics.

I was quoted in that Bloomberg piece, and the framing I gave Annie — that the startup world is “a bit more of a Wild West,” with no audit requirements and no cop on the beat — has been the part most readers shared. I want to use this post to say what I didn’t have room to say in 200 words of quoted speech: this is not a story about one founder, and it is not, in any deep sense, a story about ARR. It is a story about what happens when an ecosystem builds an investment thesis around a metric with no agreed-upon definition, no enforcement mechanism, and no countervailing institution incentivized to police it.

That’s a story economists and organizational scholars actually have tools for. And the policy implications are not the ones most commentators have been reaching for.

Three structural reasons ARR is decoupling from real revenue

The naive ARR calculation is one month of subscription revenue × 12. It works when three conditions hold: subscription pricing is the dominant model, customer retention is high enough that next month resembles this month, and contract structure is reasonably uniform across customers. SaaS in roughly 2010–2020 met all three. AI in 2024–2026 meets none of them.

First, AI customers experiment. Enterprise budgets right now have unusually large discretionary lines for “AI exploration” — every CIO has been told by their board to have an AI strategy. That money flows into trials. A trial signed in March counts as ARR in March. The customer’s actual decision — does this tool earn its seat at renewal? — happens in September. By then, ARR has already been booked, reported to investors, and used to justify a markup at the next round. Net revenue retention numbers, if they were available, would tell a different story; they generally aren’t, because most AI startups are too young to have meaningful 12-month cohorts yet.

Second, pricing has shifted. A growing share of AI revenue is usage-based — tokens consumed, calls made, seats actively engaged. Darren Yee at NYU made the point well in the Bloomberg piece: you cannot take one month of subscription and multiply by twelve when most of the bill is usage. The lumpiness is structural, not transient. Companies layer nominal subscriptions on top of usage-based billing and report the combined number as ARR, but the usage portion behaves nothing like a recurring annuity.
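To make the lumpiness point concrete, here is a minimal sketch in Python. The customer, the flat subscription fee, and the monthly usage figures are all hypothetical, invented purely for illustration; only the "one month × 12" formula comes from the discussion above.

```python
def naive_arr(monthly_revenue: float) -> float:
    """Naive ARR: one month of revenue multiplied by twelve."""
    return monthly_revenue * 12


# Hypothetical bill for a single customer: a small flat subscription
# plus usage charges that swing from month to month.
subscription = 1_000.0
usage_by_month = [500.0, 4_000.0, 1_500.0, 9_000.0, 2_000.0, 3_000.0]

# Total bill per month, then the ARR you would report if you happened
# to annualize that particular month.
monthly_totals = [subscription + usage for usage in usage_by_month]
arr_estimates = [naive_arr(total) for total in monthly_totals]

lowest, highest = min(arr_estimates), max(arr_estimates)
subscription_only = naive_arr(subscription)

print(f"Naive ARR, worst month annualized: ${lowest:,.0f}")
print(f"Naive ARR, best month annualized:  ${highest:,.0f}")
print(f"ARR from the subscription alone:   ${subscription_only:,.0f}")
```

With these made-up inputs the same customer supports a reported ARR anywhere from $18,000 to $120,000 depending on which month is annualized, while the part that actually recurs on a fixed schedule is $12,000.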

Third, front-loading. A 12-month prepaid contract signed today can be reported as $X of ARR on day one, even though the customer has 11 months left to decide whether to renew. The accounting is technically defensible. The economic substance — the real-world claim about revenue stability — is materially weaker than the number suggests.

Put these three together, and the same nominal ARR figure can describe radically different underlying businesses. That’s the ambiguity Roy Lee exploited — clumsily, with a 35% lie that was easy to falsify. The more durable problem is the founders who don’t lie at all, who pick the most flattering legitimate definition each time, and whose numbers nonetheless overstate true recurring economics by 20–40%.

Why VC due diligence doesn’t close the gap

The standard answer — and the one I gave in the Bloomberg piece — is that VC and acquirer due diligence is supposed to be the cop on the beat. In principle, that is right. In practice, the incentives don’t align as cleanly as the model assumes.

Will Gornall and Ilya Strebulaev’s “Squaring Venture Capital Valuations with Reality” (Journal of Financial Economics, 2020) showed that unicorn valuations are overstated by an average of about 48% once preferred share terms are properly priced. The mechanism is what matters: VCs and founders both benefit from the headline number, and the LPs who would in principle care are not at the diligence table. ARR has the same structure. A VC who marks her portfolio to ARR, raises her next fund partly on those marks, and competes for allocation in the next hot round has limited incentive to demand that founders disclose cohort-level retention. The founder doesn’t want to disclose it. The other VCs in the round don’t want to ask. The LP, the only party with skin in the game on the truth of the number, sees the marks and not the underlying data.

This is a classic institutional-design problem. A metric is informative only if some actor in the system has both the ability and the incentive to verify it. In public markets, that role is played by auditors, the SEC, short sellers, and enforcement actions. As an independent director and Remuneration Committee chair on a Hong Kong–listed public company, I see what that machinery looks like up close — quarterly review cycles, named auditor liability, regulator inquiries that come on a predictable cadence. In private markets, the equivalent infrastructure has never been built, because for most of the venture industry’s history it didn’t need to be: funds were small, LPs were sophisticated, capital was patient. None of those conditions still hold.

This is partly an American problem

It is worth pausing to note that the convention I have been describing is largely an American one. I co-direct the Stanford Technology Ventures Program (STVP) for international entrepreneurship, and through STVP’s global programs we run field research and teaching across six continents. From that vantage point, the parochial nature of “ARR as universal yardstick” is hard to miss.

European venture markets, with more conservative LP bases and a stronger founder accounting culture, tend to push cohort-level disclosure into the diligence process earlier. Singapore family offices — which have grown into a meaningful share of the global LP pool over the past decade — increasingly include net retention reporting in fund-level terms. Chinese AI startups face the opposite pressure: their domestic disclosure regime is tightening through STAR Market and HKEX scrutiny even as Western VCs grow more permissive about ARR ambiguity. Israeli founders, who typically raise from US funds, end up triangulating between conventions, and Indian founders increasingly do the same.

None of these ecosystems has solved the problem. But the “Wild West” framing applies most squarely to American venture finance in 2026, and reform may well come from outside it. The work I have done with collaborators on how institutional environments and industrial policy shape entrepreneurial outcomes — particularly comparing the US and Chinese ecosystems — keeps returning to the same lesson: convention is local, capital is global, and when those two collide the convention usually moves first. If the largest non-US LPs continue to formalize cohort retention as a reporting term, US GPs will follow.

The case against the obvious fix

The obvious response is “audit them” — extend GAAP-style requirements down into seed and Series A. I don’t think that’s right, and I told Annie so for the piece. The cost of imposing audit machinery on a 12-person company is real. It would push out exactly the kind of high-variance experimentation that produces the small number of category-defining outcomes that matter. Work I did with Bill Miller estimating the economic impact of Stanford alumni–founded companies puts the annual revenue from that single university’s graduates on a scale comparable to the GDP of a top-ten global economy, and STVP’s global programs have reached over 200,000 students with that same entrepreneurial training across six continents. Most of the value comes from a thin tail. Choking off the experimentation at the base of the funnel to police a metric problem at the top is the wrong trade.

What would actually work is lighter and more institutional in character.

Cohort retention norms. The single highest-leverage move is for the largest LPs — public pension funds, sovereigns, university endowments — to begin asking, as a condition of allocation, that their GPs disclose net revenue retention by cohort for their portfolio companies. The mechanic is straightforward: take all customers who signed up in a given month, track what that same group is paying twelve months later, and report the ratio. Best-in-class SaaS lands at 120%+; healthy is 100–115%; below 90% means a leaky bucket regardless of what the headline ARR says. Public SaaS companies routinely disclose this on earnings calls because investors demand it. Private companies don’t, because their LPs have not yet demanded it of GPs and GPs have not yet demanded it of founders. The metric is well-defined, the data already exists in every Stripe and billing system, and it cuts directly through each of the three structural problems above. No regulation required. The change in equilibrium would happen in a quarter.
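The mechanic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reporting standard: the record layout, the customer data, and the function names are hypothetical, and the band labels simply restate the thresholds in the text (which leave the 90–100% and 115–120% ranges unnamed).

```python
from collections import defaultdict
from typing import NamedTuple


class CustomerRecord(NamedTuple):
    customer_id: str
    signup_month: str      # cohort label, e.g. "2025-03"
    mrr_at_signup: float   # monthly revenue in the signup month
    mrr_12m_later: float   # monthly revenue twelve months on (0 if churned)


def nrr_by_cohort(records):
    """Net revenue retention per signup cohort: later revenue / starting revenue."""
    start = defaultdict(float)
    later = defaultdict(float)
    for r in records:
        start[r.signup_month] += r.mrr_at_signup
        later[r.signup_month] += r.mrr_12m_later
    return {month: later[month] / start[month] for month in start if start[month] > 0}


def band(nrr: float) -> str:
    """Label a ratio using the thresholds quoted in the essay."""
    if nrr >= 1.20:
        return "best-in-class"
    if 1.00 <= nrr <= 1.15:
        return "healthy"
    if nrr < 0.90:
        return "leaky bucket"
    return "between the bands named in the text"


# Hypothetical data: one cohort loses a customer, the other expands.
records = [
    CustomerRecord("A", "2025-03", 1_000.0, 1_500.0),
    CustomerRecord("B", "2025-03", 1_000.0, 1_000.0),
    CustomerRecord("C", "2025-03", 1_000.0, 0.0),      # churned
    CustomerRecord("D", "2025-04", 2_000.0, 2_750.0),
    CustomerRecord("E", "2025-04", 1_000.0, 1_000.0),
]

result = nrr_by_cohort(records)
for month in sorted(result):
    print(f"{month}: NRR {result[month]:.0%} ({band(result[month])})")
```

In this toy data the March cohort retains roughly 83% of its starting revenue and the April cohort 125%, even though both would contribute to the same headline ARR on the day they signed. The point of the design is that churned customers stay in the denominator, so expansion from survivors cannot hide a leaky base.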

Acquirer playbook updates. The corp dev teams at the strategics doing AI acquisitions should standardize on a “true ARR” calculation that strips out trials, prorates front-loaded contracts, and discounts the usage portion. Several already do. Publishing the playbook would normalize it.
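I have not seen a published version of such a playbook, so what follows is only one way the three adjustments might be sketched in Python. Everything here is an assumption made for illustration: the contract fields, the reading of "prorate" as spreading prepaid cash evenly across the contract term, the 50% haircut on usage revenue, and the definition of the unadjusted headline figure it is compared against.

```python
from dataclasses import dataclass


@dataclass
class Contract:
    name: str
    is_trial: bool
    prepaid_total: float   # cash paid up front for the whole term
    term_months: int       # length of the prepaid term
    monthly_usage: float   # most recent month of usage-based billing


USAGE_HAIRCUT = 0.5  # assumed discount on usage revenue; not an industry figure


def headline_arr(contracts):
    """A flattering headline: trials included, prepaid cash at face value,
    and the latest month of usage multiplied by twelve."""
    return sum(c.prepaid_total + c.monthly_usage * 12 for c in contracts)


def adjusted_arr(contracts, usage_haircut=USAGE_HAIRCUT):
    """One possible 'true ARR': drop trials, spread prepaid cash over the
    contract term before annualizing, and discount the usage portion."""
    total = 0.0
    for c in contracts:
        if c.is_trial:
            continue  # strip out trials entirely
        subscription_annualized = c.prepaid_total / c.term_months * 12
        usage_annualized = c.monthly_usage * 12 * usage_haircut
        total += subscription_annualized + usage_annualized
    return total


# Hypothetical book of business.
contracts = [
    Contract("three-month trial", True, 30_000.0, 3, 0.0),
    Contract("two-year prepaid", False, 240_000.0, 24, 0.0),
    Contract("usage-heavy account", False, 12_000.0, 12, 10_000.0),
]

headline = headline_arr(contracts)
adjusted = adjusted_arr(contracts)
print(f"Headline ARR: ${headline:,.0f}")
print(f"Adjusted ARR: ${adjusted:,.0f}")
```

On these invented inputs the headline figure is $402,000 and the adjusted figure is $192,000. The specific haircut matters less than the fact that each adjustment is written down and applied the same way to every target.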

Disclosure-not-audit. Chris Sloan’s line in the Bloomberg piece — always err on the side of disclosing too much rather than too little — is the right ethical norm and is also, in expectation, the right strategic norm. Founders who disclose more get the benefit of the doubt the next time something looks off. Founders who disclose only the favorable number get re-priced harshly when the market turns, which it eventually will.

Why the ethics framing is necessary but not sufficient

Founder ethics matters, and it runs as a serious thread through STVP’s programming — from the Entrepreneurial Thought Leaders (ETL) speaker series, where founders regularly walk through the hard calls they got wrong, to the Xfund Ethics Fellows Program, the student-led cohort program built specifically around developing the personal principles entrepreneurs will lean on when the pressure to overstate is greatest. The Cluely confession will almost certainly show up as a teaching case in the next iteration of MS&E 272, the global entrepreneurship course I co-teach with Vimbayi Kajese. Students need to wrestle with these moments early, before they’re sitting in the chair Roy Lee was sitting in.

But individual ethics is the wrong layer at which to expect this problem to resolve at the system level. Even fully ethical founders pick the most flattering legitimate definition each time; the question is whether the institutions around them — VCs, LPs, acquirers, journalists, faculty — reward or penalize that picking. That is an institutional question, not a character question. We can teach character all day, and we should. We will not teach our way out of a measurement convention that every party with a seat at the table is incentivized to leave ambiguous.

The right concept is earnings quality

A sharper way to put all of this — credit to Ben Hallen, who pointed this out after the first version of this essay went up — is that what private markets are missing is a concept of earnings quality for ARR.

Earnings quality is a well-established idea in financial accounting. Two companies can report identical earnings under GAAP and have those earnings mean radically different things in terms of how durable they are, how much they reflect underlying economic activity versus accounting choices, and how confidently an investor should extrapolate from them. Public-market analysts spend a lot of time asking about earnings quality. They look at accruals, at deferred revenue, at one-time items, at the relationship between reported earnings and operating cash flow. The headline number is the start of the conversation, not the end.

ARR has no such concept attached to it. Two AI startups can report the same $5 million ARR and have wildly different ARR quality. At one, customers signed annual contracts after a six-month sales cycle and will retain at 95% next year. At the other, customers signed three-month trials in the last quarter, and 60% are likely to churn at first renewal. Same nominal number, different earnings quality, different actual business.
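To put rough numbers on the two cases above, here is a back-of-the-envelope sketch in Python. It assumes, for simplicity, that each startup's entire $5 million base behaves like the cohort described and that there is no expansion revenue or new sales; the function name is hypothetical.

```python
def expected_retained_arr(arr: float, retention_rate: float) -> float:
    """ARR expected to survive the next renewal, ignoring expansion and new sales."""
    return arr * retention_rate


headline = 5_000_000.0

# Annual contracts retaining at 95%.
annual_contract_startup = expected_retained_arr(headline, 0.95)

# Three-month trials with 60% churn at first renewal, so 40% retained.
trial_heavy_startup = expected_retained_arr(headline, 1 - 0.60)

print(f"Annual-contract startup: ${annual_contract_startup:,.0f}")
print(f"Trial-heavy startup:     ${trial_heavy_startup:,.0f}")
```

Under those simplifying assumptions, roughly $4.75 million of the first startup's ARR is still there after renewal, against roughly $2 million for the second, from an identical headline.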

What is striking, as Ben pointed out in the comment that prompted this section, is that quality of earnings analysis is already common practice in another part of the deal economy: when individuals buy small businesses. The standard small-business acquisition playbook involves a “quality of earnings” review — a financial professional digs into the underlying economics, separates durable revenue from one-time effects, and tests whether the seller’s reported numbers actually describe what the buyer is buying. The buyer pays a few thousand dollars for the analysis and treats it as table stakes. That a Main Street acquirer of a $2 million HVAC business gets a more rigorous earnings-quality review than a venture investor putting $20 million into a $5 million ARR AI startup tells you something specific about the institutional design of private markets at the higher end.

The most sophisticated venture investors and acquirers do, in practice, surface ARR quality in diligence — they ask for cohort retention data, they probe the contract structure, they discount usage-based revenue. The question is why this practice has not become standard, and why the headline ARR number continues to set the terms of debate. The answer, again, is institutional. The Main Street buyer of an HVAC company has every incentive to know what they are buying because the wrong answer ruins their year. The venture investor marking a portfolio to ARR has weaker incentives — the headline number gets them the markup, the markup gets them the next fund, and the truth of the underlying earnings quality only matters if and when the position is realized, often years later.

Cohort retention is the metric that surfaces ARR quality. So is the share of revenue that is usage-based versus subscription. So is the percentage of contracts that are prepaid annually versus monthly. None of these are exotic — they are routine in public-market disclosure for SaaS companies and they are routine in small-business acquisition diligence. They are missing from venture-stage practice almost entirely. The fix is not a new metric. It is the application of an old discipline to a new asset class.

This reframing also clarifies why the lighter-touch interventions I described above are likely to work. Cohort retention disclosure is exactly the kind of additional context that allows sophisticated investors to assess earnings quality without imposing audit overhead. It is the venture-stage equivalent of asking a public company to break out recurring versus one-time revenue. The information is cheap to produce, hard to game once standardized, and dramatically improves the signal-to-noise ratio of the headline number.

A larger point about metrics and ecosystems

Step back from ARR specifically. The deeper pattern is that entrepreneurial ecosystems develop measurement conventions during a period of relative stability, those conventions get embedded in deal terms, fund marks, press coverage, and recruiting pitches, and then the underlying business changes and the convention drifts from the thing it was meant to measure. The convention persists because too many actors are now invested in it.

This is not unique to ARR. It happened with daily active users in social media, gameable through engagement-loop design. It happened with gross merchandise value in e-commerce, gameable through subsidized transactions. It happened with monthly recurring revenue in early SaaS, gameable through one-time fees disguised as subscriptions. Each cycle, the ecosystem eventually develops a sharper metric — net revenue retention, contribution margin, organic DAU — usually after a public blowup forces the issue.

ARR is in the early innings of that correction. Cluely is the public blowup. The next 18 months will show whether the ecosystem develops the disclosure norms that would let ARR remain useful, or whether the metric becomes so degraded that sophisticated investors quietly stop using it and a new one takes its place.

Either outcome is fine. The one to avoid is the middle path — everyone keeps reporting ARR, everyone privately knows it’s unreliable, and the gap between the number and reality keeps widening until the next downturn forces the reckoning all at once.

Thanks to Annie Bang at Bloomberg for the original reporting and the conversation that prompted this longer treatment, and to Marina Temkin at TechCrunch for the original Cluely reporting that started the thread. The framing I lean on here owes a great deal to Will Gornall and Ilya Strebulaev’s work on private market valuations, which remains a solid academic anchor for thinking about this class of problem.

Chuck Eesley is a Professor of Management Science & Engineering at Stanford University and co-director (for international entrepreneurship) of the Stanford Technology Ventures Program (STVP).

