Geometric mean vs Z-score
This week I was looped into a thread about moving action group selection from a geometric-mean-based score to a z-score-normalized score.
I didn't have the strongest understanding of either, so I wrote this brain dump while working through the change and moving the action group score calculation from geometric mean to z-score.
Some Background First
What Is a Geometric Mean?
A regular average (arithmetic mean) adds numbers up and divides by how many there are:
Regular average of 4 and 16: (4 + 16) / 2 = 10
A geometric mean multiplies numbers together and takes the nth root:
Geometric mean of 4 and 16: (4 x 16)^(1/2) = sqrt(64) = 8
With 3 numbers:
Geometric mean of 2, 8, and 32: (2 x 8 x 32)^(1/3) = (512)^(1/3) = 8
Why use it? When your numbers represent rates, probabilities, or things that compound, the geometric mean gives a more honest "middle ground." It's less influenced by a single large outlier than a regular average, though it is very sensitive to values near zero.
For example, if one action has a score of 0.9 and another has 0.01:
Regular average: (0.9 + 0.01) / 2 = 0.455 <- looks okay-ish
Geometric mean: (0.9 x 0.01)^(1/2) = 0.095 <- reflects the weakness
The regular average hides the fact that one action is terrible. The geometric mean doesn't let a single strong score mask a weak one.
Another way to think about it: if your investment returns 100% one year (doubles) and loses 50% the next year (halves), the regular average says +25% per year. But you actually ended up right where you started. The geometric mean correctly says 0% -- it captures the compounding reality.
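Both means are easy to sketch in a few lines of Python (stdlib only; the function names here are just for illustration):

```python
import math

def arithmetic_mean(xs):
    # add everything up, divide by the count
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # multiply everything together, take the nth root
    return math.prod(xs) ** (1 / len(xs))

print(arithmetic_mean([4, 16]))                 # 10.0
print(geometric_mean([4, 16]))                  # 8.0
print(round(geometric_mean([2, 8, 32]), 6))     # 8.0
print(round(arithmetic_mean([0.9, 0.01]), 3))   # 0.455 <- hides the weak score
print(round(geometric_mean([0.9, 0.01]), 3))    # 0.095 <- reflects it
```

(Python 3.8+ also ships `statistics.geometric_mean`, which does the same thing.)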
What Is a Z-Score?
A z-score answers one question: "How far from normal is this?"
Imagine a class of students takes a test. The average score is 70, and most students score within 10 points of that (standard deviation = 10).
Student scores 70 -> z-score = 0 (exactly average)
Student scores 80 -> z-score = +1 (one std dev above)
Student scores 90 -> z-score = +2 (two std devs above -- top ~2%)
Student scores 50 -> z-score = -2 (two std devs below -- bottom ~2%)
The formula:
z = (your score - the average) / standard deviation
The power of z-scores is that they let you compare across completely different scales. Say you want to compare a student who scored 85 on a math test vs 72 on an English test -- which performance was more impressive?
Math: average = 75, std dev = 10 -> z = (85 - 75) / 10 = +1.0
English: average = 65, std dev = 5 -> z = (72 - 65) / 5 = +1.4
The English score was actually more impressive relative to the class, even though 85 > 72. Z-scores strip away the different scales and tell you how exceptional each result really was.
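The same comparison in code, as a minimal sketch (the `z_score` helper is just illustrative):

```python
def z_score(x, mean, std_dev):
    """How many standard deviations x sits above (or below) the mean."""
    return (x - mean) / std_dev

# Math test: class average 75, std dev 10
z_math = z_score(85, 75, 10)      # +1.0
# English test: class average 65, std dev 5
z_english = z_score(72, 65, 5)    # +1.4

print(z_math, z_english)  # English wins on the common scale, despite 85 > 72
```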
This is exactly why z-scores help us with action groups -- groups of different sizes operate on different scales, and z-scores normalize them to a common one.
What Is a Beta Distribution?
When we don't know the true probability of something, we use a Beta distribution to represent our uncertainty.
Think of it like a restaurant review score:
Restaurant A: 1 review, 5.0 stars -> could be great or just lucky
Restaurant B: 500 reviews, 4.2 stars -> we're pretty confident it's around 4.2
Restaurant C: 50 reviews, 4.5 stars -> fairly confident, probably between 4.0-5.0
If you had to pick one restaurant, you might pick C over A -- even though A has a higher rating -- because you trust C's score more. The Beta distribution captures exactly this: what we think the probability is and how confident we are.
It's defined by two numbers, alpha and beta:
- alpha relates to the number of successes we've seen
- beta relates to the number of failures we've seen
- More data (higher alpha + beta) = tighter, more confident distribution
- The ratio alpha/(alpha+beta) gives the expected probability
Some concrete examples:
alpha=2, beta=2 -> "I think it's around 50%, but I'm not sure at all"
(could easily be anywhere from 20% to 80%)
alpha=10, beta=10 -> "I think it's around 50%, and I'm fairly confident"
(probably between 35% and 65%)
alpha=50, beta=50 -> "I think it's around 50%, and I'm very confident"
(almost certainly between 43% and 57%)
alpha=8, beta=2 -> "I think it's around 80%, fairly confident"
(probably between 55% and 95%)
Notice how higher numbers make the distribution narrower (more confident), and the ratio between alpha and beta shifts where the "peak" is.
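You can see the narrowing directly from the standard Beta moment formulas (mean = alpha/(alpha+beta), variance = alpha*beta / ((alpha+beta)^2 * (alpha+beta+1))). A quick stdlib-only sketch:

```python
import math

def beta_mean_std(alpha, beta):
    # standard moments of a Beta(alpha, beta) distribution
    total = alpha + beta
    mean = alpha / total
    var = (alpha * beta) / (total ** 2 * (total + 1))
    return mean, math.sqrt(var)

for a, b in [(2, 2), (10, 10), (50, 50), (8, 2)]:
    mean, std = beta_mean_std(a, b)
    print(f"alpha={a}, beta={b}: mean={mean:.2f}, std={std:.3f}")
    # std shrinks as alpha+beta grows: same belief, more confidence
```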
In our system, each action has a probability and an aggregated signal (confidence). We convert these into alpha and beta:
alpha = 1 + (probability x aggregated_signal)
beta = 1 + ((1 - probability) x aggregated_signal)
For example, an action with probability=0.7 and aggregated_signal=100:
alpha = 1 + (0.7 x 100) = 71
beta = 1 + (0.3 x 100) = 31
-> High confidence that the true probability is around 70%
vs an action with probability=0.7 and aggregated_signal=5:
alpha = 1 + (0.7 x 5) = 4.5
beta = 1 + (0.3 x 5) = 2.5
-> Low confidence -- could easily be anywhere from 30% to 90%
We then "draw" a random sample from this distribution -- this is the beta draw. It's like asking: "Given what we know about this action, what's a plausible score for it right now?" The high-confidence action (alpha=71, beta=31) will draw values tightly clustered around 0.70. The low-confidence one (alpha=4.5, beta=2.5) will draw values all over the place -- sometimes 0.3, sometimes 0.9.
This randomness is intentional -- it lets the system explore. Actions we're uncertain about get a chance to prove themselves (or not), rather than always picking the same "safe" action.
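The conversion and the draw fit in a few lines; a sketch using the stdlib's `random.betavariate` (the helper name `to_alpha_beta` is mine, not the system's):

```python
import random

def to_alpha_beta(probability, aggregated_signal):
    # the conversion described above
    alpha = 1 + probability * aggregated_signal
    beta = 1 + (1 - probability) * aggregated_signal
    return alpha, beta

alpha, beta = to_alpha_beta(0.7, 100)   # -> alpha 71, beta 31
draw = random.betavariate(alpha, beta)  # clusters tightly around ~0.70
print(alpha, beta, draw)

low_a, low_b = to_alpha_beta(0.7, 5)    # -> alpha 4.5, beta 2.5
print(random.betavariate(low_a, low_b)) # all over the place: sometimes 0.3, sometimes 0.9
```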
The Old Way: Geometric Mean of Beta Draws
Previously, we scored an action group by taking the geometric mean of its beta draws.
Each action in a group gets a beta draw (a number between 0 and 1). The group's score is all the draws multiplied together, then taken to the nth root.
Group A (2 actions): score = (0.7 x 0.8)^(1/2) = 0.748
Group B (3 actions): score = (0.7 x 0.8 x 0.75)^(1/3) = 0.749
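Scoring the two groups above, as a one-function sketch:

```python
import math

def group_score(draws):
    # old scoring: geometric mean of the group's beta draws
    return math.prod(draws) ** (1 / len(draws))

print(round(group_score([0.7, 0.8]), 3))        # Group A: 0.748
print(round(group_score([0.7, 0.8, 0.75]), 3))  # Group B: 0.749
```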
Looks fair, right? The problem is subtle and shows up over thousands of draws.
The Problem: More Actions = Lower Scores
Imagine you draw random numbers between 0 and 1. On average you get 0.5. But when you multiply numbers between 0 and 1, the result always gets smaller:
1 draw: average product ~ 0.50
2 draws: average product ~ 0.50 x 0.50 = 0.25
3 draws: average product ~ 0.50 x 0.50 x 0.50 = 0.125
The geometric mean (taking the nth root) tries to fix this, but it doesn't fully compensate. Here's a simulation to show why:
Suppose every action has the same quality (probability=0.5, signal=10).
Run 10,000 trials comparing a 2-action group vs a 3-action group.
With geometric mean:
2-action group wins: ~70% of the time <- unfair!
3-action group wins: ~30% of the time
Expected if fair:
Should be roughly 50/50
The root cause: variance shrinks as you add more draws. With fewer draws, you get wilder swings -- and wilder swings mean more chances to land a high score.
Think of it like flipping coins:
- Flip 2 coins: getting all heads (100%) happens 25% of the time
- Flip 10 coins: getting all heads happens 0.1% of the time
- More coins = results cluster closer to the true average = fewer "lucky streaks"
So groups with fewer actions have a built-in edge -- not because they're better, but because they're "rolling fewer dice" and have more room to get lucky.
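The variance-shrinking effect is easy to see empirically. A stdlib-only sketch, assuming every action is Beta(6, 6) (i.e. probability=0.5, signal=10 under the conversion above) -- it measures how much the geometric-mean score swings for different group sizes, without asserting any particular win rate:

```python
import math
import random

random.seed(42)

def group_score(n_actions, alpha=6.0, beta=6.0):
    # geometric mean of n independent beta draws (the old scoring)
    draws = [random.betavariate(alpha, beta) for _ in range(n_actions)]
    return math.prod(draws) ** (1 / n_actions)

def spread(n_actions, trials=10_000):
    # standard deviation of the group score over many simulated trials
    scores = [group_score(n_actions) for _ in range(trials)]
    mean = sum(scores) / trials
    var = sum((s - mean) ** 2 for s in scores) / trials
    return math.sqrt(var)

# Fewer actions -> wilder swings -> more chances to land an extreme high score
print(spread(1), spread(2), spread(3))
```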
A real example from production (Deezer at-risk): Groups with 2 actions were winning 70%+ of the time against groups with 3 actions, even when all actions had identical quality. The expected fair rate would be around 35% (since there are fewer 2-action groups).
A Candy Analogy
Imagine two kids picking candy from the same jar and rating tastiness (1-10 scale):
- Kid A picks 2 candies and averages their tastiness
- Kid B picks 3 candies and averages their tastiness
Round 1: Kid A gets [9, 7] -> average 8.0. Kid B gets [9, 7, 6] -> average 7.3. Kid A wins.
Round 2: Kid A gets [4, 8] -> average 6.0. Kid B gets [4, 8, 7] -> average 6.3. Kid B wins.
Round 3: Kid A gets [8, 9] -> average 8.5. Kid B gets [8, 9, 5] -> average 7.3. Kid A wins.
See the pattern? Kid B's 3rd candy keeps pulling them toward the jar's true average. Kid A's smaller sample gives more room for extremes. Over 1000 rounds, Kid A wins more often -- not from skill, but from math.
The New Way: Z-Score Normalization
Instead of comparing raw scores, we now ask: "How lucky did this group get, relative to what we'd expect?"
For each action, we know the Beta distribution it was drawn from (its alpha and beta). That means we can mathematically compute:
- The expected value of its log-score (using the digamma function)
- The variance of its log-score (using the polygamma function)
We combine these across all actions in a group to compute a z-score:
z = (what we observed - what we expected) / standard deviation
Same Candy Analogy, But Fair
Now instead of comparing raw tastiness, we ask each kid:
- "Given the candy jar you picked from, how much luckier than average were you?"
Kid A picked 2 candies and got an average tastiness of 8/10. Expected average from that jar: 6/10. Spread (std dev): +/-1.5.
Kid A's z-score: (8 - 6) / 1.5 = 1.33 -> "Pretty lucky!"
Kid B picked 3 candies and got an average tastiness of 7/10. Expected average from that jar: 6/10. Spread (std dev): +/-1.0.
Kid B's z-score: (7 - 6) / 1.0 = 1.00 -> "Somewhat lucky."
Kid A still wins here, but now it's because Kid A genuinely got luckier -- not because having fewer candies gives a mathematical advantage.
Notice the key difference: Kid B's smaller spread (+/-1.0 vs +/-1.5) means Kid B's deviation gets divided by a smaller number, boosting it proportionally. This exactly compensates for the fact that 3-candy averages are naturally less variable. Over many rounds, neither kid has a built-in edge.
A Worked Example With Real Numbers
Suppose we have two action groups competing:
Group X: 2 actions
- Action 1: probability=0.6, signal=10 -> alpha=7, beta=5 -> beta_draw=0.75
- Action 2: probability=0.5, signal=8 -> alpha=5, beta=5 -> beta_draw=0.60
Group Y: 3 actions
- Action 3: probability=0.6, signal=10 -> alpha=7, beta=5 -> beta_draw=0.65
- Action 4: probability=0.5, signal=8 -> alpha=5, beta=5 -> beta_draw=0.55
- Action 5: probability=0.7, signal=12 -> alpha=9.4, beta=4.6 -> beta_draw=0.70
Old way (geometric mean):
Group X: (0.75 x 0.60)^(1/2) = 0.671
Group Y: (0.65 x 0.55 x 0.70)^(1/3) = 0.630
Winner: Group X
New way (z-score):
Group X: observed log-mean = -0.3993, expected log-mean = -0.6578, std_dev = 0.2138
z = (-0.3993 - (-0.6578)) / 0.2138 = +1.21
Group Y: observed log-mean = -0.4618, expected log-mean = -0.5773, std_dev = 0.1567
z = (-0.4618 - (-0.5773)) / 0.1567 = +0.74
Winner: Group X (but the margin is different -- it properly accounts for group sizes)
Both methods pick Group X here, but the z-scores are on a level playing field. Group X isn't getting a free boost just for being smaller.
What Changes in the Score
| | Before (Geometric Mean) | After (Z-Score) |
|---|---|---|
| Score range | Always between 0 and 1 | Can be any number (typically -3 to +3) |
| A score of 0 means | Impossible (one action scored 0) | Exactly average luck |
| Positive score means | -- | Luckier than expected |
| Negative score means | -- | Less lucky than expected |
| Affected by group size? | Yes -- fewer actions = systematic advantage | No -- normalized for group size |
The Math (One Level Deeper)
For a group of n actions with beta draws x1, x2, ..., xn:
Before:
score = (x1 x x2 x ... x xn)^(1/n)
After:
log_geo_mean = mean(log(x1), log(x2), ..., log(xn))
expected = mean(psi(a1) - psi(a1+b1), ..., psi(an) - psi(an+bn))
std_dev = sqrt(sum(psi1(ai) - psi1(ai+bi)) / n^2)
z_score = (log_geo_mean - expected) / std_dev
Where:
- psi (digamma) gives the expected value of log(X) for a Beta distribution
- psi1 (trigamma/polygamma) gives the variance of log(X)
- ai, bi are the Beta distribution parameters for each action
Why do we work in log-space? Because the geometric mean is really just an arithmetic mean in log-space:
geometric_mean(x1, x2, ..., xn) = exp(mean(log(x1), log(x2), ..., log(xn)))
By working in log-space, we can use the nice additive properties of means and variances, which makes the z-score normalization straightforward.
The key insight: dividing by std_dev (which accounts for n) is what removes the size bias. Larger groups have smaller standard deviations, so their deviations from expected get amplified proportionally -- putting groups of all sizes on equal footing.
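Putting the formulas together, here's a self-contained sketch of the new scoring (pure stdlib -- `math` has no digamma, so `digamma` and `trigamma` are computed with the standard recurrence-plus-asymptotic-series trick; `z_score_group` is my illustrative name, not the system's):

```python
import math

def digamma(x):
    # psi(x) = psi(x+1) - 1/x until x is large, then asymptotic series
    r = 0.0
    while x < 10:
        r -= 1.0 / x
        x += 1
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12 - f * (1/120 - f / 252))

def trigamma(x):
    # psi1(x) = psi1(x+1) + 1/x^2 until x is large, then asymptotic series
    r = 0.0
    while x < 10:
        r += 1.0 / (x * x)
        x += 1
    f = 1.0 / (x * x)
    return r + 1.0 / x + 0.5 * f + (f / x) * (1/6 - f * (1/30 - f / 42))

def z_score_group(draws, alphas, betas):
    n = len(draws)
    # observed: arithmetic mean of the log draws (= log of the geometric mean)
    log_geo_mean = sum(math.log(x) for x in draws) / n
    # expected: E[log X] = psi(a) - psi(a+b), averaged over the group
    expected = sum(digamma(a) - digamma(a + b) for a, b in zip(alphas, betas)) / n
    # variance of the mean: sum of Var[log X] = psi1(a) - psi1(a+b), over n^2
    variance = sum(trigamma(a) - trigamma(a + b) for a, b in zip(alphas, betas)) / n**2
    return (log_geo_mean - expected) / math.sqrt(variance)

# Groups X and Y from the worked example
print(round(z_score_group([0.75, 0.60], [7, 5], [5, 5]), 2))
print(round(z_score_group([0.65, 0.55, 0.70], [7, 5, 9.4], [5, 5, 4.6]), 2))
```

Note how the `n**2` in the variance is what does the size normalization: larger groups get a smaller std_dev, so the same deviation from expected counts for more.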