Scoring rules

The similarity-sensitive (nonlocal) scoring rule

The familiar logarithmic scoring rule is local: to score a report for an observation $x$, it only sees the probability assigned to $x$. That is fine when only exact identity matters, such as in many predictions over a categorical space, but it isn't the right property if you want to reward near-misses. In the spirit of continuing to develop the Leinster-Cobbold similarity-sensitive entropy, this page shows the nonlocal proper scoring rule derived from it. Just as SS-entropy with an identity kernel reduces to ordinary entropy, the SS-scoring rule with an identity kernel reduces to the logarithmic scoring rule. I'll begin with examples to motivate it.

Examples

Weather forecast

Suppose the outcome we want to predict is rain, cloudy, or sunny. If the consumer of the forecast treats confusing rain with cloudy as a smaller mistake than confusing either with sunny, we can encode that with a similarity kernel such as

$$K=\begin{pmatrix} 1&0.3&0\\ 0.3&1&0\\ 0&0&1 \end{pmatrix}.$$

Then probability assigned to cloudy should count as partial credit when rain occurs (and vice versa), while probability assigned to sunny should not. Ordinary log loss cannot express that distinction because it only cares about the probability assigned to the exact observed state.

For example, take the truth to be $p=(0.8,0.1,0.1)$ over rain, cloudy, and sunny, and compare $q_{\mathrm{near}}=(0.6,0.3,0.1)$ with $q_{\mathrm{far}}=(0.6,0.1,0.3)$. Ordinary cross-entropy gives them the same risk/reward because the reports assign the same probability to rain and only swap the equal-mass cloudy and sunny terms. But rain-or-cloudy has $90\%$ probability under the truth, so a similarity-sensitive cross-entropy should prefer the report that confuses between those two.
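
The tie under ordinary cross-entropy is easy to confirm numerically. The following is a minimal sketch using the numbers above; the helper name `cross_entropy` is an illustrative choice, and natural logarithms are assumed.

```python
import math

def cross_entropy(p, q):
    """Ordinary cross-entropy -sum_i p_i log q_i (natural log)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = (0.8, 0.1, 0.1)        # truth over (rain, cloudy, sunny)
q_near = (0.6, 0.3, 0.1)   # spare mass goes to cloudy
q_far = (0.6, 0.1, 0.3)    # spare mass goes to sunny

ce_near = cross_entropy(p, q_near)
ce_far = cross_entropy(p, q_far)
print(round(ce_near, 3), round(ce_far, 3))  # 0.759 0.759
```

Both reports get the same risk, so log loss alone gives no reason to prefer the report that confuses rain with cloudy.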

You may note that in practice you could coarsen the problem by merging rain and cloudy into a single "not sunny" category. That works, but it throws away information the forecaster might have and biases the model in a way that is not transparent. The kernel expresses the same merge transparently (and equivalently) by setting the 0.3 similarity to 1. The next examples show that the problem of rewarding near-misses is not just a discrete issue, so coarsening by merging labels is never a complete answer.

Wrapped likelihood

Consider a simple misspecified model-fitting problem. Let the true distribution be uniform on $[0,1]$, and fit only the location $\mu$ of a fixed-variance Gaussian that is wrapped around the interval:

$$P=\mathrm{Unif}[0,1],\qquad q_\mu(x)=\sum_{m\in\mathbb Z}\phi_\sigma(x+m-\mu),\qquad x\in[0,1].$$

Here $\phi_\sigma$ is the centered Gaussian density with standard deviation $\sigma$. The sum turns it into a wrapped Gaussian density: it is the law of $\mu+\sigma Z \pmod 1$, written as a density on $[0,1]$.

Because each $q_\mu$ is just a circular translation of the same density, ordinary cross-entropy cannot choose among values of $\mu$:

$$\operatorname{CE}(P,q_\mu)=-\int_0^1 \log q_\mu(x)\,dx$$

is the same for every $\mu$. A centered Gaussian and an off-center Gaussian would tie under MLE (in the infinite-data limit).
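
A quick numerical check of this invariance, as a sketch: the width `SIGMA = 0.1`, the truncation of the wrapping sum, and the midpoint-rule grid are all assumptions made for illustration.

```python
import math

SIGMA = 0.1  # assumed standard deviation of the wrapped Gaussian

def wrapped_gaussian(x, mu, sigma=SIGMA, wraps=5):
    """Wrapped Gaussian density on [0, 1], truncating the sum over m."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return sum(norm * math.exp(-0.5 * ((x + m - mu) / sigma) ** 2)
               for m in range(-wraps, wraps + 1))

def cross_entropy_uniform(mu, n=1000):
    """Midpoint-rule estimate of -int_0^1 log q_mu(x) dx."""
    return -sum(math.log(wrapped_gaussian((k + 0.5) / n, mu))
                for k in range(n)) / n

ce_centered = cross_entropy_uniform(0.5)
ce_off_center = cross_entropy_uniform(0.2)
print(ce_centered, ce_off_center)  # equal up to floating-point error
```

Shifting $\mu$ only relabels which grid points see which density values, so the average of $-\log q_\mu$ is unchanged.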

Now imagine that we care about the geometry of the interval: values close to each other are more similar. A reasonable task kernel for this geometry is a Gaussian kernel that uses ordinary distance:

$$K_\ell(x,y)=\exp\!\left(-\frac{|x-y|^2}{2\ell^2}\right).$$

A loss that accounts for similarity should have a way to say that these two reports are different, but a loss that looks only at the density at each observed point cannot measure this difference.

An off-center candidate and centered candidate for a fixed-variance wrapped Gaussian model, each shown against a uniform truth with equal ordinary cross-entropy.
Wrapped model, distance kernel. Ordinary cross-entropy thinks all circular translations of the fixed-width density are equivalent. If you want to reward a model for near-misses, you need a nonlocal loss that considers the geometry of the interval.

Shuffled likelihoods

A difference begins to appear when we incorporate neighborhood information via $K$ directly. For a reported probability law $q$, the typicality of $x$ is

$$(Kq)(x)=\int K(x,y)\,q(dy).$$

The plot below compares normalized typicality for a Gaussian-shaped density and for a version of the same density cut into equal-width pieces and shuffled across the interval. Ordinary cross-entropy is the same for both, but the shuffled version has more uniform typicality because it spreads out the high-density pieces and the similarity kernel smooths over them. The plotted $Kq$ curves below use $\ell=0.04$ on the unit interval.

A contiguous Gaussian and a version cut into equal-width chunks and shuffled across the interval, with ordinary cross-entropy and normalized Kq typicality.
Piecewise shuffling. Cutting a density into equal-width pieces and permuting the pieces is irrelevant to ordinary cross-entropy under a uniform true distribution. It is however important for a local task kernel. Once high-density pieces are spread across the interval, the smoothed typicality $Kq$ is much closer to flat.
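
The same comparison can be sketched on a grid. Everything specific here is an assumption for illustration: the bump width (0.1), the number of pieces (20), the interleaving permutation, and the use of the coefficient of variation of $Kq$ as a flatness measure. Only $\ell=0.04$ comes from the text.

```python
import math

N = 200        # grid bins on [0, 1]
CHUNKS = 20    # number of equal-width pieces (assumed)
ELL = 0.04     # kernel scale used for the plotted Kq curves

xs = [(k + 0.5) / N for k in range(N)]

# Contiguous Gaussian-shaped density, normalized on the grid.
raw = [math.exp(-0.5 * ((x - 0.5) / 0.1) ** 2) for x in xs]
total = sum(raw) / N
contiguous = [r / total for r in raw]

# Cut into pieces and interleave the left half with the right half.
size = N // CHUNKS
pieces = [contiguous[c * size:(c + 1) * size] for c in range(CHUNKS)]
order = [c for pair in zip(range(CHUNKS // 2), range(CHUNKS // 2, CHUNKS))
         for c in pair]
shuffled = [v for c in order for v in pieces[c]]

def cross_entropy_uniform(q):
    """-int log q(x) dx under a uniform truth, on the grid."""
    return -sum(math.log(v) for v in q) / N

def typicality(q):
    """(Kq)(x) = int K(x, y) q(y) dy, approximated by a Riemann sum."""
    return [sum(math.exp(-0.5 * ((x - y) / ELL) ** 2) * qy
                for y, qy in zip(xs, q)) / N for x in xs]

def coeff_of_variation(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(var) / mean

ce_contiguous = cross_entropy_uniform(contiguous)
ce_shuffled = cross_entropy_uniform(shuffled)
cv_contiguous = coeff_of_variation(typicality(contiguous))
cv_shuffled = coeff_of_variation(typicality(shuffled))
```

The two cross-entropies agree because the density values are only permuted, while the smaller coefficient of variation for the shuffled density reflects its flatter typicality.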

Finite atoms

Now, we approximate a continuous uniform distribution with a finite number of atoms. Ordinary log loss is annoyingly infinite for any finite-atom approximation, because the model has zero density between atoms. But the local typicality $(Kq)(x)$ is positive between atoms, so a similarity-aware loss should be able to compare these approximations rather than declare all of them infinitely bad.

Finite-atom approximations to a uniform distribution. Ordinary cross-entropy is infinite because the model has zero density between atoms, while normalized Kq remains finite.
Finite atoms. An empirical or particle approximation is singular with respect to a uniform true distribution, so ordinary log loss gives infinite risk. But the local typicality $(Kq)(x)$ is positive between atoms, and agrees with intuition that a higher density of atoms is a better approximation.
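
As a rough sketch of why typicality stays usable here, the code below places equal-weight atoms on an evenly spaced grid that includes both endpoints and uses $\ell=0.04$; both choices are assumptions. It evaluates only the log-typicality term $-\log (Kq)(x)$ averaged under the uniform truth, not the full proper score introduced later, so the numbers are not expected to match the table in "Examples again".

```python
import math

ELL = 0.04  # assumed kernel scale

def typicality(x, atoms):
    """(Kq)(x) when q puts equal mass on each atom."""
    return sum(math.exp(-0.5 * ((x - a) / ELL) ** 2) for a in atoms) / len(atoms)

def typicality_summary(n_atoms, grid=2000):
    """Return (min of Kq, mean of -log Kq) over a midpoint grid on [0, 1]."""
    atoms = [i / (n_atoms - 1) for i in range(n_atoms)]
    values = [typicality((k + 0.5) / grid, atoms) for k in range(grid)]
    return min(values), -sum(math.log(v) for v in values) / grid

min11, loss11 = typicality_summary(11)
min31, loss31 = typicality_summary(31)
print(min11 > 0, min31 > 0)   # typicality never hits zero between atoms
print(loss11, loss31)         # both finite; the denser set scores lower
```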

The Scoring Rule

Probability versus similarity

Let $\mathcal X$ be the outcome space, let $p$ be a probability law on $\mathcal X$, and let $K:\mathcal X\times\mathcal X\to[0,1]$ be a similarity kernel. We want to keep these two objects conceptually separate: $p$ says what I believe will happen, while $K$ says which outcomes the task treats as similar.

For a law $p$, the typicality function again is

$$(Kp)(x)=\int K(x,y)\,p(dy).$$

Similarity-sensitive entropy is expected surprisal of that typicality:

$$H_K(p)=-\mathbb E_{X\sim p}\log (Kp)(X).$$

In finite notation, $(Kp)_i=\sum_j K_{ij}p_j$ and $H_K(p)=-\sum_i p_i\log (Kp)_i$. When $K$ is the identity kernel, this is Shannon entropy.
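
A minimal sketch of the finite formulas, using the weather kernel and truth from the first example. The function names are illustrative, and natural logarithms are assumed.

```python
import math

def typicality(K, p):
    """(Kp)_i = sum_j K_ij p_j."""
    n = len(p)
    return [sum(K[i][j] * p[j] for j in range(n)) for i in range(n)]

def ss_entropy(K, p):
    """H_K(p) = -sum_i p_i log (Kp)_i."""
    Kp = typicality(K, p)
    return -sum(p[i] * math.log(Kp[i]) for i in range(len(p)) if p[i] > 0)

def shannon_entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weather = [[1.0, 0.3, 0.0], [0.3, 1.0, 0.0], [0.0, 0.0, 1.0]]
p = [0.8, 0.1, 0.1]

h_identity = ss_entropy(identity, p)
h_weather = ss_entropy(weather, p)
print(round(h_identity, 3), round(h_weather, 3))  # 0.639 0.487
```

With the identity kernel the value matches Shannon entropy; treating rain and cloudy as partly similar lowers it.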

Log-typicality loss

The first scoring rule to try is the log-typicality loss, which is nonlocal and looks at the typicality of the observed state under the report:

$$S_{\mathrm{typ}}(q,x)=-\log (Kq)(x).$$

This is somewhat meaningful, but it is not enough. A report can make $x$ typical by putting mass on states near $x$, without paying attention to how much probability is assigned to $x$ by itself. In that sense, log-typicality blurs identity as well as distance.

The issue shows up even with two states. Let

$$K=\begin{pmatrix}1&1/2\\1/2&1\end{pmatrix},\qquad p=(0.8,0.2).$$

Truthful reporting gives $(Kp)=(0.9,0.6)$, so

$$\mathbb E_p[-\log(Kp)(X)] =-0.8\log 0.9-0.2\log 0.6\approx 0.186.$$

But the overly confident report that puts all mass on the first state $q=(1,0)$ gives $(Kq)=(1,0.5)$ and lower risk:

$$\mathbb E_p[-\log(Kq)(X)] =-0.8\log 1-0.2\log 0.5\approx 0.139.$$

The score prefers the report that denies the second state but keeps it typical through similarity to the first. So log-typicality is an improper scoring rule.
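
The two risks above can be reproduced with a few lines; this is a sketch with an illustrative helper name, not a reference implementation.

```python
import math

K = [[1.0, 0.5], [0.5, 1.0]]
p = [0.8, 0.2]

def log_typicality_risk(p, q, K):
    """E_{X~p}[-log (Kq)(X)] in finite notation."""
    n = len(q)
    Kq = [sum(K[i][j] * q[j] for j in range(n)) for i in range(n)]
    return -sum(p[i] * math.log(Kq[i]) for i in range(n) if p[i] > 0)

honest = log_typicality_risk(p, p, K)
overconfident = log_typicality_risk(p, [1.0, 0.0], K)
print(round(honest, 3), round(overconfident, 3))  # 0.186 0.139
```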

The $K$-corrected log-typicality rule

The correction is to keep the log-typicality term and add a reward (negative risk) for observations that represent a large share of the similarity mass under the report:

$$S_K(q,x)=-\log (Kq)(x) -\left[ \mathbb E_{Y\sim q}\frac{K(Y,x)}{(Kq)(Y)}-1 \right].$$

In the bracketed part, for each state $Y$ drawn from the reported distribution, the ratio compares the similarity between $Y$ and the observed point $x$ to the total typicality of $Y$ under the report. If $x$ accounts for a large share of many such neighborhoods, the reward is large. This creates the desired tradeoff between rewarding reports for putting mass on the observed point and rewarding them for putting mass near the observed point.

This reward cancels out under the report itself:

$$\mathbb E_{X\sim q}\left[ \mathbb E_{Y\sim q}\frac{K(Y,X)}{(Kq)(Y)} \right]=1.$$

So the extra term has mean zero when the data really come from $q$, while still rewarding observed points that are representative of the report as a whole, which is what log-typicality alone was missing.

In finite notation, it looks like

$$S_K(q,j)=-\log (Kq)_j -\left[ \sum_i q_i\frac{K_{ij}}{(Kq)_i}-1 \right].$$

If $K=I$, then $(Kq)_j=q_j$ and $\sum_i q_iK_{ij}/(Kq)_i=1$, so the expression collapses to $-\log q_j$. Log loss is the identity-kernel special case.
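
Here is a small sketch of the finite formula, applied to the two-state example from the previous section. The names `k_score` and `k_risk` are illustrative, and the code assumes $(Kq)_j>0$ for every state that can be observed.

```python
import math

def typicality(K, q):
    n = len(q)
    return [sum(K[i][j] * q[j] for j in range(n)) for i in range(n)]

def k_score(K, q, j):
    """S_K(q, j) = -log (Kq)_j - [sum_i q_i K_ij / (Kq)_i - 1]."""
    Kq = typicality(K, q)
    share = sum(q[i] * K[i][j] / Kq[i] for i in range(len(q)) if q[i] > 0)
    return -math.log(Kq[j]) - (share - 1.0)

def k_risk(K, p, q):
    """Expected score of report q when outcomes are drawn from p."""
    return sum(p[j] * k_score(K, q, j) for j in range(len(p)) if p[j] > 0)

K = [[1.0, 0.5], [0.5, 1.0]]
p = [0.8, 0.2]

honest = k_risk(K, p, p)                   # equals H_K(p)
overconfident = k_risk(K, p, [1.0, 0.0])
print(round(honest, 3), round(overconfident, 3))  # 0.186 0.239
```

With the correction term, the overconfident report that beat honesty under plain log-typicality now has the higher risk.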

Where the rule comes from

On kernel classes where this entropy is concave, the corrected score is the tangent scoring rule generated by similarity-sensitive entropy:

$$H_K(q)=-\sum_i q_i\log (Kq)_i.$$

More generally, any concave entropy functional $H$ generates a proper score by taking the tangent plane at the reported distribution $q$:

$$S_H(q,j)=H(q)+\nabla H(q)\cdot(e_j-q).$$

Concavity gives $\mathbb E_{X\sim p}S_H(q,X)\ge H(p)$, with equality at $q=p$. That ensures "properness": if I believe $p$, I should report $p$.
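
To see that the tangent construction reproduces the corrected score, one can differentiate $H_K$ in finite notation. The following is a sketch that treats $q$ as an unconstrained vector when taking partial derivatives:

$$\frac{\partial H_K}{\partial q_j}(q)=-\log (Kq)_j-\sum_i q_i\frac{K_{ij}}{(Kq)_i},\qquad \nabla H_K(q)\cdot q=H_K(q)-\sum_i q_i\frac{(Kq)_i}{(Kq)_i}=H_K(q)-1.$$

Substituting into $S_H(q,j)=H(q)+\nabla H(q)\cdot(e_j-q)$ gives

$$S_{H_K}(q,j)=H_K(q)+\frac{\partial H_K}{\partial q_j}(q)-\bigl(H_K(q)-1\bigr)=-\log (Kq)_j-\left[\sum_i q_i\frac{K_{ij}}{(Kq)_i}-1\right]=S_K(q,j).$$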

Examples again

Using the proper score, we can check the examples above to see if this works. The entries below are risks under the true distribution used in each example.

| Example | Report | Ordinary CE | $K$-CE |
|---|---|---|---|
| Weather forecast | Rain + sunny mass | 0.759 | 0.605 |
| | Rain + cloudy mass | 0.759 | 0.566 |
| Wrapped Gaussian | Off-center | 2.767 | 3.731 |
| | Centered | 2.767 | 3.500 |
| Shuffled pieces | Contiguous bump | 1.692 | 3.959 |
| | Shuffled pieces | 1.692 | 2.461 |
| Finite atoms | 11 atoms | $\infty$ | 2.474 |
| | 31 atoms | $\infty$ | 2.345 |
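
The weather rows can be reproduced directly from the finite formula for $S_K$. This sketch covers only those two rows; the continuous examples depend on discretization and kernel settings that are not reproduced here.

```python
import math

K = [[1.0, 0.3, 0.0], [0.3, 1.0, 0.0], [0.0, 0.0, 1.0]]
p = [0.8, 0.1, 0.1]  # truth over (rain, cloudy, sunny)

def k_cross_entropy(K, p, q):
    """E_{X~p} S_K(q, X) in finite notation."""
    n = len(q)
    Kq = [sum(K[i][j] * q[j] for j in range(n)) for i in range(n)]
    risk = 0.0
    for j in range(n):
        share = sum(q[i] * K[i][j] / Kq[i] for i in range(n) if q[i] > 0)
        risk += p[j] * (-math.log(Kq[j]) - (share - 1.0))
    return risk

kce_near = k_cross_entropy(K, p, [0.6, 0.3, 0.1])  # rain + cloudy mass
kce_far = k_cross_entropy(K, p, [0.6, 0.1, 0.3])   # rain + sunny mass
print(round(kce_near, 3), round(kce_far, 3))  # 0.566 0.605
```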

Fitting models

Maximum likelihood minimizes ordinary cross-entropy:

$$\theta_{\mathrm{MLE}}=\arg\min_\theta \mathbb E_{X\sim P}[-\log q_\theta(X)].$$

Under misspecification this chooses the KL projection of $P$ onto the model class.

If the task kernel is known, the aligned estimator is instead

$$\theta_K=\arg\min_\theta \mathbb E_{X\sim P}S_K(q_\theta,X).$$

With a correctly specified model and infinite data, strictly proper scores target the same true law. Under misspecification, finite data, regularization, approximate inference, or model selection, the scoring rule changes the projection because it changes the shape of the tradeoff for being wrong.
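
A toy illustration of how the projection can move under misspecification. The one-parameter family `model(theta)` below is hypothetical: it pins the rain probability at 0.6, so it cannot match the weather truth $(0.8,0.1,0.1)$, and the fit is a plain grid search rather than a real optimizer.

```python
import math

K = [[1.0, 0.3, 0.0], [0.3, 1.0, 0.0], [0.0, 0.0, 1.0]]
p = [0.8, 0.1, 0.1]  # truth over (rain, cloudy, sunny)

def model(theta):
    """Hypothetical misspecified family: rain fixed at 0.6, cloudy = theta."""
    return [0.6, theta, 0.4 - theta]

def log_loss_risk(p, q):
    return -sum(pj * math.log(qj) for pj, qj in zip(p, q))

def k_risk(p, q):
    n = len(q)
    Kq = [sum(K[i][j] * q[j] for j in range(n)) for i in range(n)]
    risk = 0.0
    for j in range(n):
        share = sum(q[i] * K[i][j] / Kq[i] for i in range(n))
        risk += p[j] * (-math.log(Kq[j]) - (share - 1.0))
    return risk

grid = [i / 100 for i in range(1, 40)]  # theta in 0.01 .. 0.39
theta_mle = min(grid, key=lambda t: log_loss_risk(p, model(t)))
theta_k = min(grid, key=lambda t: k_risk(p, model(t)))
print(theta_mle, theta_k)
```

In this toy family, log loss splits the leftover mass evenly between cloudy and sunny, while the $K$-score shifts some of it toward cloudy, the state the kernel treats as similar to rain.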

Fitting a mixture model

Another important case where this matters is fitting a Gaussian mixture to two small clusters of data. Ordinary MLE can earn infinite reward by collapsing one mixture component onto a single point and using the other component more broadly. Keeping things finite then requires an arbitrary minimum variance, or priors, cross-validation, and other forms of regularization to select among fits. From the perspective presented above, the problem is that we use an implicit identity kernel that rewards precision far beyond what is appropriate for the task.
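
The blow-up is easy to reproduce. In this sketch the six data points, the broad component, the equal weights, and the kernel scale `ELL` are all made up for illustration. It compares the average negative log-likelihood with the average log-typicality term $-\log (Kq)(x)$, which for a Gaussian kernel and Gaussian components has the closed form used in `smoothed`. This is only the first term of the proper $K$-score, not the full score.

```python
import math

data = [-1.1, -1.0, -0.9, 0.9, 1.0, 1.1]  # two small made-up clusters
ELL = 0.15   # assumed kernel scale
W = 0.5      # assumed mixture weight of the spike component

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def smoothed(x, mu, sd):
    """Closed form of int K_ell(x, y) N(y; mu, sd^2) dy for a Gaussian kernel."""
    var = ELL ** 2 + sd ** 2
    return ELL / math.sqrt(var) * math.exp(-0.5 * (x - mu) ** 2 / var)

def losses(spike_sd):
    """Average NLL and average -log Kq with a spike centered on data[1]."""
    nll = typ = 0.0
    for x in data:
        density = W * normal_pdf(x, data[1], spike_sd) + (1 - W) * normal_pdf(x, 0.0, 1.2)
        kq = W * smoothed(x, data[1], spike_sd) + (1 - W) * smoothed(x, 0.0, 1.2)
        nll -= math.log(density) / len(data)
        typ -= math.log(kq) / len(data)
    return nll, typ

results = [losses(sd) for sd in (1e-2, 1e-4, 1e-6, 1e-8)]
nlls = [r[0] for r in results]
typs = [r[1] for r in results]
print(nlls)  # keeps falling as the spike narrows
print(typs)  # settles to a finite value
```

Because $K\le 1$, the typicality $(Kq)(x)$ never exceeds 1, so the log-typicality term cannot go below zero however narrow the spike becomes.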

In the figure below, I optimize the component means, variances, and mixture weight under ordinary CE with a small variance floor, and under proper $K$-CE with a Gaussian kernel using similarity scale $\ell=0.15$. The CE fit drives one component to the minimum variance, making it a spike on a single observation; the $K$-CE optimum is interior and instead fits the two natural groups.

A two-component Gaussian mixture example comparing an optimized ordinary cross-entropy fit with a variance floor against an optimized proper K cross-entropy fit.
Gaussian mixture singularity. The left panel shows the fully optimized fits. The right panel compares ordinary CE and proper $K$-CE along a one-dimensional slice in the second component's variance.

To show how strongly MLE favors this degenerate fit, you can click the plot to add an observation and refit. The sample button draws the corresponding number of samples from a synthetic distribution with the selected number of components. The negatively infinite loss available to ordinary CE leads to fits that strongly favor individual points; proper $K$-CE always keeps a finite score and encourages more balanced allocation. Even when its optimum is a degenerate fit, the risk is still finite and interpretable.

Interactive Gaussian mixture fitting comparison: a one-dimensional plot comparing ordinary cross-entropy and proper $K$-cross-entropy Gaussian mixture fits.

The comparison below shows how much empirical risk is reduced by moving from the best $(k-1)$-component fit to the best $k$-component fit.

Information gain is risk reduction

Now return to experiment design. Suppose the future quantity we care about is $Y$. Before performing experiment $a$, the predictive law is $p_Y$. After observing data $Z$ from that experiment, the predictive law becomes $p_{Y\mid Z,a}$.

If future reports will be evaluated by the proper $K$-score and the model is calibrated, then the value of action $a$ is

$$\operatorname{IG}_K(a;Y)=H_K(p_Y) -\mathbb E_{Z\mid a}H_K(p_{Y\mid Z,a}).$$

This is the expected reduction in future $K$-proper cross-entropy risk. Before seeing $Z$, the best report is $p_Y$ and the risk is $H_K(p_Y)$. After seeing $Z$, the best report is $p_{Y\mid Z,a}$, and the expected risk is the second term.
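
For a finite toy case the formula is a few lines of code. The joint table below is made up, and `info_gain` is an illustrative name; `joint[z][y]` is taken to be $P(Z=z,Y=y)$ for one fixed experiment $a$.

```python
import math

def ss_entropy(K, p):
    """H_K(p) = -sum_i p_i log (Kp)_i."""
    n = len(p)
    Kp = [sum(K[i][j] * p[j] for j in range(n)) for i in range(n)]
    return -sum(p[i] * math.log(Kp[i]) for i in range(n) if p[i] > 0)

def info_gain(K, joint):
    """IG_K = H_K(p_Y) - E_Z H_K(p_{Y|Z}) for a finite joint table."""
    n_y = len(joint[0])
    p_y = [sum(row[y] for row in joint) for y in range(n_y)]
    posterior_entropy = 0.0
    for row in joint:
        p_z = sum(row)
        if p_z > 0:
            posterior_entropy += p_z * ss_entropy(K, [v / p_z for v in row])
    return ss_entropy(K, p_y) - posterior_entropy

joint = [[0.4, 0.1], [0.1, 0.4]]        # made-up P(Z=z, Y=y)
identity = [[1.0, 0.0], [0.0, 1.0]]
similar = [[1.0, 0.5], [0.5, 1.0]]

ig_identity = info_gain(identity, joint)  # ordinary mutual information
ig_similar = info_gain(similar, joint)
print(round(ig_identity, 3), round(ig_similar, 3))  # 0.193 0.101
```

With the identity kernel this is the usual mutual information between $Y$ and $Z$. When the two outcomes are treated as partly similar, telling them apart is worth less, so the same experiment has a smaller $K$-information gain.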

This is the same interpretation as ordinary predictive mutual information (MI), except that the evaluation geometry is flexible and transparent.

Takeaway

Shannon entropy generates the logarithmic scoring rule, which gives ordinary cross-entropy, KL divergence, mutual information, and maximum likelihood. Similarity-sensitive entropy gives the parallel quantities: a similarity-sensitive scoring rule, $K$-cross-entropy, $K$-divergence, $K$-information gain, and corresponding model fitting.

With the identity kernel, the similarity-sensitive version becomes local and recovers the above Shannon/log-loss family. With an explicit similarity kernel designed for the task, we fit models and evaluate experiments to reduce the mistakes that matter for the task, rather than treating every imperfect prediction as equally wrong and infinitely precise ones as infinitely right.