Technical note

Kernelized coarse graining

A small algebraic restatement motivates similarity-sensitive entropy: keep a latent uniform seed $U \sim \mathrm{Unif}[0,1]$, and represent what you can (and can’t) distinguish using a kernel rather than only a pushed-forward variable.

Takeaway

Any probability distribution can be generated from a uniform seed $U\sim\mathrm{Unif}[0,1]$ and a map $X=\psi(U)$. If $\psi$ is invertible, observing $X$ is equivalent to observing $U$ exactly; it is natural to read $\psi$ as a relabeling of the events of $U$, a semantic that Shannon entropy respects but differential entropy does not. More often, what we observe or model is a “coarsening”: some information is lost and the exact value of $U$ cannot be recovered. Writing $Y = f(U)$ for this coarse measurement, the map $f$ encodes which distinctions about $U$ survive when we observe $Y$. Making that loss of distinguishability explicit motivates the entire similarity-sensitive entropy framework: Shannon and differential entropy appear as special cases, we can interpolate between them, and we sharpen our notion of statistical information.

The key step is to replace the many-to-one map by a similarity kernel encoding how distinguishable points are: $K_f(u,u')=\mathbf 1\{f(u)=f(u')\}$. Its mass $\tau_f(u)=\int_0^1 K_f(u,u')\,du'$ is the size of the fiber containing $u$ (the preimage of $f(u)$), and Shannon entropy becomes an expected logarithm of typicality:

$$H(f(U))=\int_0^1 -\log \tau_f(u)\,du.$$

Similarity-sensitive entropy keeps the same template but allows graded kernels $K(x,x')\in[0,1]$, making distinguishability explicit and letting you interpolate between infinite and finite distinguishability.

For more: Similarity-Sensitive Entropy: Induced Kernels and Data-Processing Inequalities (arXiv:2601.03064).

In more detail

Randomness from a universal noise source

The starting point is the “uniform representation” idea: distributions can be generated from a single random seed. Let

$$U \sim \mathrm{Unif}[0,1], \qquad X = \psi(U).$$

In one dimension, $\psi$ is often the quantile map $F^{-1}$, so $X=F^{-1}(U)$. More generally, you can treat $\psi$ as a measurable map that pushes the Lebesgue measure forward to whatever law you want.
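
For concreteness, here is a minimal sketch of this inverse-CDF construction in Python (assuming NumPy and SciPy are available); the standard-normal target and the sample size are illustrative choices, not part of the construction.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Universal noise source: U ~ Unif[0, 1].
u = rng.uniform(size=100_000)

# Reorganizing structure: the quantile map psi = F^{-1} of a standard normal.
x = norm.ppf(u)  # X = psi(U) now has the target law

# Sanity check: the pushed-forward samples match the target's moments.
print(x.mean(), x.std())  # roughly 0 and 1
```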

Any law can be read as uniform randomness plus a reorganizing structure (the map $\psi$).

Coarse observations as partition maps

Suppose we don’t observe $U$ precisely; we only observe which “bucket” it falls into. Formally, let $f:[0,1]\to\{1,\dots,m\}$ and define $Y := f(U)$. The fibers $A_j := f^{-1}(j)$ form a partition of $[0,1]$.

$$\mathbb P(Y=j)=\lambda(A_j)=:p_j,\qquad H(Y) = -\sum_{j=1}^m p_j \log p_j.$$
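
As a small concrete instance, the sketch below computes $H(Y)$ for an arbitrary four-bucket partition of $[0,1]$ (the breakpoints are illustrative, assuming NumPy).

```python
import numpy as np

# Illustrative partition of [0, 1] into m = 4 buckets.
edges = np.array([0.0, 0.1, 0.35, 0.7, 1.0])

# Because U is uniform, p_j is the Lebesgue measure (length) of the fiber A_j.
p = np.diff(edges)

# Shannon entropy of the coarse variable Y = f(U), in nats.
H_Y = -np.sum(p * np.log(p))
print(p, H_Y)
```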

Fiber-size form of Shannon entropy

Because $U$ is uniform, the probability of “being indistinguishable from $u$ under $f$” is literally the Lebesgue measure of the fiber that contains $u$:

$$\mathbb P\!\left(f(U)=f(u)\right)=\lambda\!\left(f^{-1}(f(u))\right).$$

Plugging this into the definition yields the identity:

$$H(Y)=\int_0^1 -\log\Big(\lambda\big(f^{-1}(f(u))\big)\Big)\,du.$$
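
A quick Monte Carlo check of this identity, under the same illustrative partition as above: averaging $-\log$ of the fiber length over uniform draws of $u$ should reproduce the discrete entropy.

```python
import numpy as np

rng = np.random.default_rng(0)
edges = np.array([0.0, 0.1, 0.35, 0.7, 1.0])  # same illustrative partition
p = np.diff(edges)                            # fiber lengths

# Monte Carlo estimate of  integral_0^1 -log lambda(f^{-1}(f(u))) du.
u = rng.uniform(size=200_000)
bucket = np.searchsorted(edges, u, side="right") - 1  # index of the fiber containing u
mc_estimate = np.mean(-np.log(p[bucket]))

# Discrete Shannon entropy for comparison.
H_Y = -np.sum(p * np.log(p))
print(mc_estimate, H_Y)  # agree up to Monte Carlo error
```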

Read informally: Shannon entropy is the average surprisal of the size (mass) of the indistinguishability class containing the state.

Kernelized coarse graining

Define the partition (equivalence) kernel induced by $f$:

$$K_f(u,u') := \mathbf 1\{f(u)=f(u')\}.$$

Interpretation: $K_f(u,u')=1$ exactly when the observation $f$ cannot tell $u$ and $u'$ apart.

From a kernel, a natural notion of “typicality” is just the kernel mass around a point:

$$\tau_f(u) := \int_0^1 K_f(u,u')\,du'.$$

Since $K_f(u,u')$ is $1$ precisely on the fiber $f^{-1}(f(u))$, this reduces to $\tau_f(u)=\lambda(f^{-1}(f(u)))=p_{f(u)}$. Substitute into the fiber-size identity:

$$H(Y)=\int_0^1 -\log \tau_f(u)\,du.$$
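
The same number can be computed directly from the kernel, which is the form that generalizes: estimate $\tau_f(u)$ as a row mean of the kernel matrix over uniform samples, then average $-\log\tau_f$. A minimal sketch (sample size and partition are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
edges = np.array([0.0, 0.1, 0.35, 0.7, 1.0])  # same illustrative partition

def f(u):
    """Coarse map: index of the bucket of the partition containing u."""
    return np.searchsorted(edges, u, side="right") - 1

# Partition kernel K_f(u, u') = 1{f(u) = f(u')} on a uniform sample.
u = rng.uniform(size=2_000)
K = (f(u)[:, None] == f(u)[None, :])

# Row means approximate the kernel mass tau_f(u) = integral_0^1 K_f(u, u') du'.
tau = K.mean(axis=1)

H_kernelized = np.mean(-np.log(tau))
H_Y = -np.sum(np.diff(edges) * np.log(np.diff(edges)))
print(H_kernelized, H_Y)  # close, up to Monte Carlo error
```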

Instead of pushing the measure forward to $\{1,\dots,m\}$, we keep the base measure on $[0,1]$ and move the coarse map “inside” the logarithm as a kernel. The result is a kernelized form of Shannon entropy.

From partitions to SS-entropy

Partition kernels are $0$–$1$. Similarity-sensitive (SS) entropy generalizes by allowing a graded similarity kernel $K(x,x')\in[0,1]$ (with $K(x,x)=1$) on a state space with distribution $\mu$:

$$\tau(x):=\int K(x,x')\,d\mu(x'), \qquad H_K(\mu):=\int -\log \tau(x)\,d\mu(x).$$
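
As a sketch of the general recipe, the snippet below estimates $H_K(\mu)$ by Monte Carlo for $\mu=\mathcal N(0,1)$ and a Gaussian similarity kernel; the kernel shape and length scale $\ell$ are illustrative assumptions, not prescribed by the framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# State space and distribution: mu = N(0, 1), represented by samples.
x = rng.normal(size=2_000)

# Graded similarity kernel K(x, x') = exp(-(x - x')^2 / (2 l^2)), with K(x, x) = 1.
l = 0.5  # similarity length scale (illustrative)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * l**2))

# Typicality tau(x) = integral K(x, x') dmu(x'), estimated by row means over the sample.
tau = K.mean(axis=1)

# SS-entropy: expected surprisal of typicality.
H_K = np.mean(-np.log(tau))
print(H_K)
```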

In words, SS-entropy is expected surprisal of being typical under the similarity notion $K$. Partition kernels recover ordinary Shannon entropy for the corresponding coarse variable, while intermediate kernels interpolate between “everything is distinct” and “many things count as similar.”

A benefit of this formulation is that you can define similarity where the semantics live, then transport it through a change of variables by pulling the kernel back.

Example: a partition kernel on $\mathbb R$, pulled back to $[0,1]$

To match the latent-uniform story above, represent a Gaussian $X\sim\mathcal N(m,\sigma^2)$ as $X=\psi(U)$ with $U\sim\mathrm{Unif}[0,1]$. Here $\psi$ is a quantile map: if $F$ is the CDF of $X$, one choice is $\psi=F^{-1}$ (and $U=F(X)$).

Now “set the distinguishability scale” in $x$-space by binning. Fix a bin width $\Delta=0.4\sigma$ and let $g_\Delta:\mathbb R\to\mathbb Z$ record which interval of length $\Delta$ contains $x$ (anchored at the mean $m$). Define the coarse variable $$Y:=g_\Delta(X)=g_\Delta(\psi(U)).$$

This induces a partition kernel on $\mathbb R$:

$$K_\Delta(x,x') := \mathbf 1\{g_\Delta(x)=g_\Delta(x')\}.$$

Typicality is the probability mass of your bin, $\tau_\Delta(x)=\mathbb P(g_\Delta(X')=g_\Delta(x))$, so SS-entropy reduces to an ordinary Shannon entropy: $$H_{K_\Delta}(\mu_X)=H(Y).$$

To express the same coarse-graining on the latent space, define $f_\Delta(u):=g_\Delta(\psi(u))$ and pull the kernel back: $$\widetilde K_\Delta(u,u') := \mathbf 1\{f_\Delta(u)=f_\Delta(u')\}=K_\Delta(\psi(u),\psi(u')).$$ Since $\psi$ pushes Lebesgue measure $\lambda$ forward to $\mu_X$, this is just a change of variables, so the entropy is unchanged: $$H_{K_\Delta}(\mu_X)=H_{\widetilde K_\Delta}(\lambda).$$
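
A numerical sanity check of this equality (assuming SciPy's norm; the mean, variance, and sample size are illustrative): compute $H(Y)$ from the exact bin probabilities in $x$-space and compare it with a Monte Carlo estimate of $H_{\widetilde K_\Delta}(\lambda)$ built from the pulled-back kernel on $[0,1]$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, sigma = 0.0, 1.0          # illustrative Gaussian parameters
delta = 0.4 * sigma          # bin width from the text

def g_delta(x):
    """Index of the length-delta interval containing x, anchored at the mean m."""
    return np.floor((x - m) / delta).astype(int)

# Exact bin probabilities and the Shannon entropy H(Y) in x-space.
j = np.arange(-30, 30)
p = norm.cdf(m + (j + 1) * delta, m, sigma) - norm.cdf(m + j * delta, m, sigma)
p = p[p > 0]
H_Y = -np.sum(p * np.log(p))

# Monte Carlo estimate of the SS-entropy with the pulled-back kernel on ([0, 1], lambda):
# f_delta(u) = g_delta(psi(u)) with psi = F^{-1}.
u = rng.uniform(size=2_000)
fu = g_delta(norm.ppf(u, m, sigma))
K = (fu[:, None] == fu[None, :])
H_pullback = np.mean(-np.log(K.mean(axis=1)))

print(H_Y, H_pullback)  # agree up to Monte Carlo error
```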

In $u$-space the bin boundaries become quantiles $u_j=\psi^{-1}(m+j\Delta)=F(m+j\Delta)$, which crowd near $0$ and $1$ because the Gaussian tails have low density.
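
To see the crowding concretely, the short sketch below (standard-normal choice $m=0$, $\sigma=1$, assumed for illustration) prints a few mapped boundaries $u_j=F(m+j\Delta)$ and the gaps between them.

```python
import numpy as np
from scipy.stats import norm

m, sigma = 0.0, 1.0
delta = 0.4 * sigma

# Bin boundaries m + j*delta in x-space and their images u_j = F(m + j*delta) in u-space.
j = np.arange(-6, 7)
u_j = norm.cdf(m + j * delta, m, sigma)
print(np.round(u_j, 3))
# Successive gaps are the bin probabilities; they shrink toward 0 and 1,
# i.e. equal-width x-bins receive little u-mass in the tails.
print(np.round(np.diff(u_j), 3))
```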

Figure: partition kernel bins on a Gaussian pdf and their pullback under the CDF to a uniform density.
A fixed-width partition in $x$ (top left) maps to a nonuniform partition in $u$ (top right); near the tails the mapped boundaries squeeze together. The bottom plot shows the resulting discrete distribution of $Y=g_\Delta(X)$. With the pulled-back partition kernel $\widetilde K_\Delta$ on $[0,1]$ defined above, $$H_{K_\Delta}(\mu_X)=H_{\widetilde K_\Delta}(\lambda)=H(Y).$$ That is, the SS-entropy with either kernel equals the Shannon entropy of the coarse variable, because all three describe the same coarse observation, up to what the kernel treats as indistinguishable.

Relationship to differential entropy

The “infinite distinguishability” limit corresponds to a kernel that only matches a point to itself:

$$K_{\mathrm{id}}(x,x') := \mathbf 1\{x=x'\}, \qquad \tau(x)=\int K_{\mathrm{id}}(x,x')\,d\mu(x')=\mu(\{x\}).$$

If $\mu$ is atomless (no point masses), then $\mu(\{x\})=0$ for every $x$, so $\tau(x)=0$ almost everywhere and $-\log \tau(x)=+\infty$. In SS-entropy terms: considering each event on a continuous space to be uniquely distinct leads to unbounded surprisal.

Differential entropy can be understood as what remains after you subtract a kernel-dependent distinguishability baseline from a family of finite-distinguishability kernels with a shrinking scale $\varepsilon$. For example, take a translation-invariant kernel $$K_\varepsilon(x,x') := \kappa\!\left(\frac{x-x'}{\varepsilon}\right), \qquad \kappa(t)=\kappa(-t), \qquad \kappa(0)=1.$$ For many “local” choices of $\kappa$ and a smooth density $p$ of $\mu$, typicality behaves like $\tau_\varepsilon(x)\approx \varepsilon\, p(x)$ up to a kernel-dependent constant, so

$$H_{K_\varepsilon}(\mu)=\mathbb E[-\log \tau_\varepsilon(X)] \approx \log(1/\varepsilon) + \mathbb E[-\log p(X)] + \text{const}.$$

SS-entropy keeps the kernel (and therefore the baseline) explicit; differential entropy is the distribution part you get by renormalizing away that baseline as $\varepsilon\to 0$. By the pullback identity above, you can compute the same small-scale behavior either on $(\mathbb R,\mu)$ or on $([0,1],\lambda)$ with the pulled-back kernel.
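
A rough numerical illustration of this limit, using a box kernel $\kappa(t)=\mathbf 1\{|t|\le 1/2\}$ (for which the kernel-dependent constant vanishes) and $\mu=\mathcal N(0,1)$: subtracting the $\log(1/\varepsilon)$ baseline from a Monte Carlo estimate of $H_{K_\varepsilon}$ should approach the Gaussian differential entropy $\tfrac12\log(2\pi e)\approx 1.419$ as $\varepsilon$ shrinks (the estimate degrades for very small $\varepsilon$ at fixed sample size).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3_000)  # samples from mu = N(0, 1)

# Box kernel: K_eps(x, x') = 1{|x - x'| <= eps/2}, so tau_eps(x) ~= eps * p(x) for small eps.
for eps in [1.0, 0.3, 0.1, 0.03]:
    tau = (np.abs(x[:, None] - x[None, :]) <= eps / 2).mean(axis=1)
    H_eps = np.mean(-np.log(tau))
    # Removing the log(1/eps) baseline should leave (approximately) the differential entropy.
    print(eps, H_eps - np.log(1 / eps))

print("differential entropy of N(0, 1):", 0.5 * np.log(2 * np.pi * np.e))
```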