We Just Deleted 15% of Our Health Score. Here's Why Product Metrics Lie.

We shipped a five-dimension health score. A user spotted that one dimension was measuring our laziness, not theirs. So we deleted it.

Three weeks ago we shipped a feature. Last week, a user told us 15% of it was nonsense. Yesterday, we deleted that 15%. Here’s the lesson — and why I think most product metrics quietly do the same thing.

Three weeks is not a long life for a feature. Most of what we build at LocationCharter survives longer than that before it gets seriously challenged. This one didn’t. A user looked at it, paused for about ten seconds, and dismantled the math in a single sentence.

That sentence was right. We pulled out the offending dimension, redistributed its weight across the rest, dropped a database column, deleted an enum value, and shipped the change the same day. No fanfare, no rollback plan, no second-guessing. The feature is now objectively better, and we know exactly why.

This post is about what we built, what was wrong with it, and the pattern I keep seeing in product metrics across the industry — the one where a system quietly measures itself and presents the result as the customer’s reality.

What we built

A location health score. A single number between 0 and 100, with an A–F letter grade, shown as a circular widget on every location’s detail page and in the locations-list table.

The idea was simple. Customers managing Google Business Profile locations kept asking the same question in different forms: how am I doing? Not at the per-metric level — they could already see review counts, post frequency, photo categories, and so on. They wanted one number that summarized the lot. A starting point for a glance, a target for a quarter, a way to say “fix this one first” across a portfolio of locations.

So we built it with five dimensions:

Dimension    Weight
Profile      25%
Posts        20%
Reviews      25%
Photos       15%
Freshness    15%

Profile measured completeness — title, description, phone, website, primary category, hours, full address. Posts measured recency and variety. Reviews measured reply rate and average rating over the last 90 days. Photos measured category coverage and total count.
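As a rough illustration of how five sub-scores can roll up into one number, here is a minimal sketch. It assumes each dimension is scored 0 to 100 and that the total is a plain weighted average; the weights are the ones in the table above, while the function name, the dictionary keys, and the example values are hypothetical. Letter-grade cutoffs are left out because they aren't spelled out here.

```python
# Weights from the table above, in percentage points.
WEIGHTS = {
    "profile": 25,
    "posts": 20,
    "reviews": 25,
    "photos": 15,
    "freshness": 15,
}


def health_score(sub_scores: dict[str, float]) -> int:
    """Combine per-dimension scores (each 0-100) into one 0-100 number.

    Assumption: a plain weighted average, rounded to the nearest integer.
    """
    if set(sub_scores) != set(WEIGHTS):
        raise ValueError("expected exactly one sub-score per dimension")
    total = sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS) / 100
    return round(total)


# Hypothetical location: strong profile, weaker posting habits.
example = {"profile": 90, "posts": 60, "reviews": 80, "photos": 70, "freshness": 100}
print(health_score(example))  # 80
```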

The fifth one — Freshness — was different. We’ll get to it.

The widget shipped. Customers liked it. A few asked us to expose it via the API so they could pull scores into their own reporting. Internally we considered it done.

The user comment that broke it

Then a user opened a chat with us. Paraphrased only slightly:

“Stale data is already shown in the sync widget on the location card. It shouldn’t drag down the score the owner is trying to improve.”

Pause on that for a second. The “sync widget” they were referring to is a separate UI element on every location card: a small indicator that shows when our backend last synced data from the Google Business Profile API. It exists precisely so that customers know whether the numbers they’re looking at are fresh or stale. It’s a system-status surface. Honest, useful, unambiguous.

And we had also folded the same signal into the health score. Fifteen percent of the score was a dimension called Freshness, which checked whether metrics, posts, reviews, and media had been synced recently — within 3 days for metrics, within 24 hours for the others.
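To make the inputs concrete, here is a rough sketch of a check along those lines. The thresholds (3 days for metrics, 24 hours for the rest) come from the description above. Everything else is an assumption for illustration rather than a copy of the production code: the names, the example timestamps, and the idea of scoring the dimension as the share of data types synced in time. What matters is what the inputs are: every one of them is a timestamp written by the sync job, not by the customer.

```python
from datetime import datetime, timedelta, timezone

# Maximum age before a data type counts as stale (thresholds from the text).
MAX_AGE = {
    "metrics": timedelta(days=3),
    "posts": timedelta(hours=24),
    "reviews": timedelta(hours=24),
    "media": timedelta(hours=24),
}


def freshness_score(last_synced: dict[str, datetime], now: datetime) -> float:
    """Percentage of data types whose last sync is within its threshold.

    Assumption: each data type counts equally toward the dimension.
    """
    fresh = sum(
        1 for kind, limit in MAX_AGE.items() if now - last_synced[kind] <= limit
    )
    return 100 * fresh / len(MAX_AGE)


# Hypothetical Monday morning after a weekend where the posts sync lagged.
now = datetime(2024, 1, 8, 9, 0, tzinfo=timezone.utc)
last_synced = {
    "metrics": now - timedelta(days=2),
    "posts": now - timedelta(hours=50),
    "reviews": now - timedelta(hours=6),
    "media": now - timedelta(hours=6),
}
print(freshness_score(last_synced, now))  # 75.0
```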

The user had spotted, in about ten seconds, that we were reporting the same operational signal in two places — once honestly, as a status, and once dishonestly, as a deduction from a score the customer was trying to improve.

We sat with that for less than a minute.

Why they were right

Here’s the thing about Freshness that took the user ten seconds to see and took us a follow-up cup of coffee to fully accept.

Freshness measured whether our backend cron had pulled the location’s data recently. Not whether the customer had done anything. Not whether their profile was complete, or their reviews were being replied to, or their posts were fresh. It measured our reliability and presented it inside a score they were trying to move.

A customer with a flawless Google Business Profile — every field complete, photos fresh, reviews replied to, posts going out three times a week — could see their health score drop on a Monday morning because our weekend sync pipeline had lagged on their account. There was nothing they could do about it. There was nothing they should be doing about it. The score was telling them they had a problem that was, in fact, our problem.
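To put a number on that scenario, here is a small worked example. It assumes the score is a weighted average of 0 to 100 sub-scores and that a fully lagged sync takes the Freshness dimension all the way to zero; both are assumptions, and the names are hypothetical.

```python
# Original five-dimension weights, in percentage points.
WEIGHTS = {"profile": 25, "posts": 20, "reviews": 25, "photos": 15, "freshness": 15}


def weighted_score(sub_scores):
    """Weighted average of 0-100 sub-scores (assumed scoring model)."""
    return sum(WEIGHTS[dim] * sub_scores[dim] for dim in WEIGHTS) / 100


# A customer who has done everything right.
flawless = {"profile": 100, "posts": 100, "reviews": 100, "photos": 100}

friday = weighted_score({**flawless, "freshness": 100})  # sync is current
monday = weighted_score({**flawless, "freshness": 0})    # weekend sync lagged
print(friday, monday)  # 100.0 85.0
```

Nothing about the customer's profile changed between the two lines, yet the number they are asked to improve fell by 15 points.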

Freshness was measuring our laziness and presenting it as theirs.

That’s not just a weighting error. It’s a category error. A score has a contract with the user: the things in here are things you can move. Once you put something in the score that the user can’t move, the score isn’t a target anymore. It’s noise wearing the costume of a target.

What we changed

The fix wasn’t to rebalance the weights or to soften the Freshness scoring thresholds. It was to delete the dimension.

The concrete changes:

  • Four dimensions remain: Profile, Posts, Reviews, Photos. Each is now weighted at 25%, so Posts gains 5 points, Photos gains 10, and the total still sums to 100%.
  • A Liquibase migration drops the freshness_score column from the location_health_score table.
  • The FRESHNESS enum value and the STALE_DATA issue type are removed from the codebase. Any UI that previously rendered “Freshness: 75%” no longer has a code path to do so.
  • Existing cached scores recompute lazily, with no forced reset. A cached row expires when its hourly TTL lapses or at the next location sync, whichever comes first.
  • Three new unit tests lock the rebalanced math in. One of them is a regression guard that asserts no issue ever gets emitted with dimension == "FRESHNESS" or issueType == "STALE_DATA". If anyone ever tries to add it back, the test breaks first.
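The rebalanced weights and the regression guard can be sketched roughly as follows. This is an illustrative Python version, not the tests from our codebase: the Issue class, the compute_issues stand-in, the LOW_SCORE issue type, and the under-50 rule are all hypothetical. Only the 25% weights and the FRESHNESS and STALE_DATA names come from the list above.

```python
from dataclasses import dataclass

# Rebalanced weights after the deletion, in percentage points.
WEIGHTS = {"profile": 25, "posts": 25, "reviews": 25, "photos": 25}


@dataclass(frozen=True)
class Issue:
    dimension: str
    issue_type: str


def compute_issues(sub_scores):
    """Hypothetical stand-in: flag any dimension scoring under 50."""
    return [
        Issue(dim.upper(), "LOW_SCORE")
        for dim, score in sub_scores.items()
        if score < 50
    ]


def test_weights_sum_to_100_without_freshness():
    assert sum(WEIGHTS.values()) == 100
    assert "freshness" not in WEIGHTS


def test_no_freshness_issue_is_ever_emitted():
    # Deliberately poor scores so that issues are actually produced.
    issues = compute_issues({"profile": 10, "posts": 20, "reviews": 30, "photos": 40})
    assert issues, "expected at least one issue for a poorly scoring location"
    for issue in issues:
        assert issue.dimension != "FRESHNESS"
        assert issue.issue_type != "STALE_DATA"


test_weights_sum_to_100_without_freshness()
test_no_freshness_issue_is_ever_emitted()
print("ok")
```

The point of the second test is less the current behavior than the future one: it fails loudly the day someone reintroduces the dimension.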

Shipped Wednesday. No rollback. The sync widget keeps doing its honest job, and the health score is now exclusively about things the customer can change.

The bigger pattern: metrics that measure the system, not the user

The reason this post exists isn’t the freshness story. The freshness story is an afternoon of refactoring. The reason this post exists is the pattern it taught me to look for, which I now see in product metrics across the industry.

The pattern is this: systems instrument what they can see, and what they can see is themselves. Customer reality is harder to measure than internal state, so you proxy it with the nearest available signal — and the nearest signal is almost always a system signal. Over time, the proxy stops being read as a proxy. The system’s measurement of itself starts being reported to the user as if it were the user’s situation.

Three examples I’ve been chewing on since the deletion:

Uptime SLAs that include planned maintenance. A provider reports 99.95% uptime. The customer experienced four hours of downtime last Tuesday — and got the email about the planned window two weeks in advance. From the customer’s perspective, the service was down. From the provider’s, it was “available” because the window was scheduled. The metric is measuring whether the provider’s calendar event fired, not whether the customer’s request could be served.

Email open rates triggered by tracking-pixel preloads. For years, “open rate” purported to measure whether a recipient had read an email. Then Apple Mail Privacy Protection started preloading every pixel automatically, and overnight every Apple Mail user “opened” every email. The metric had always been measuring whether the email client loaded an image — which used to correlate with reads, until it didn’t. Marketers spent years optimizing for a number that quietly stopped meaning what they thought it meant.

A fitness app’s “active minutes” that counts time the app was open. Easy to write that query. Harder to write the query that measures whether the user actually moved. So the easy query gets shipped, gets surfaced in the user’s weekly summary, gets compared to last week. The user is being told how engaged the app was with them, not how engaged they were with their exercise.

None of these are malicious. None of the people who built them set out to mislead. They just chose what they could measure over what they wished they could measure — and over time, the gap between the two stopped getting interrogated.

It’s drift, not deceit. But the customer doesn’t experience the difference.

The test we now run on every score

I’m not going to claim we’ll never make this mistake again. We probably will. But we have a question now that we didn’t have three weeks ago, and we run every score through it before launch:

Does this measure something the customer can change, or something we do?

If the answer is “something we do” — even partially — that signal does not belong in the score. It belongs on a system status surface, visible and separately addressable. Sync state, queue depth, pipeline lag, API quota: all real things, all worth showing. None of them should be folded into a number the customer is trying to improve.
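One way to make that question mechanical is to tag every candidate signal with who can move it and route it accordingly. The sketch below is an assumption about how such a check could look, not something we have described shipping; the Owner enum, the Signal class, and the signal names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Owner(Enum):
    CUSTOMER = "customer"  # the customer can change it
    SYSTEM = "system"      # we change it, even if only partially


@dataclass(frozen=True)
class Signal:
    name: str
    owner: Owner


def split_signals(signals):
    """Return (score_inputs, status_surface) based on who can move each signal."""
    score_inputs = [s for s in signals if s.owner is Owner.CUSTOMER]
    status_surface = [s for s in signals if s.owner is Owner.SYSTEM]
    return score_inputs, status_surface


signals = [
    Signal("profile_completeness", Owner.CUSTOMER),
    Signal("review_reply_rate", Owner.CUSTOMER),
    Signal("sync_state", Owner.SYSTEM),
    Signal("pipeline_lag", Owner.SYSTEM),
]

score_inputs, status_surface = split_signals(signals)
print([s.name for s in score_inputs])    # ['profile_completeness', 'review_reply_rate']
print([s.name for s in status_surface])  # ['sync_state', 'pipeline_lag']
```

Because there is no "mixed" option, a signal that is even partly ours has to be tagged SYSTEM, which matches the rule above.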

The next score we ship will get this question before it ships, not after. The bar is low and the test is fast. There is no reason not to apply it.

The freshness dimension was 15% of a score that had been live for three weeks. Deleting it took an afternoon. The lesson took longer to articulate than the code took to remove — which, I suspect, is true of most lessons worth keeping.


At LocationCharter, we’re building an AI-powered command center for Google Business Profile management. AI agents queue real operations — review replies, posts, profile edits — and wait for your approval before anything goes live.

We’re two developers shipping fast, deleting fast, and writing about both. Start a free 14-day trial.
