Takeaway
Any probability distribution can be generated from a uniform seed $U\sim\mathrm{Unif}[0,1]$ and a map $X=\psi(U)$. If $\psi$ is invertible, observing $X$ is equivalent to observing $U$ exactly; it is natural to think of $\psi$ as a relabeling of the events of $U$, although, in contrast to Shannon entropy, differential entropy does not respect this relabeling. Often, though, what we observe or model is a "coarsening": some information is lost and the exact value of $U$ cannot be recovered. Writing $Y = f(U)$ for this coarse measurement, $f$ encodes which distinctions about $U$ you can still make after observing $Y$. Making that loss of distinguishability explicit motivates the entire similarity-sensitive entropy framework: it recovers Shannon and differential entropy as special cases, interpolates between them, and sharpens our notion of statistical information.
The key step is to replace the many-to-one map by a similarity kernel encoding how distinguishable points are: $K_f(u,u')=\mathbf 1\{f(u)=f(u')\}$. Its mass $\tau_f(u)=\int_0^1 K_f(u,u')\,du'$ is the size of the fiber containing $u$ (the preimage of $f(u)$), and Shannon entropy becomes an expected logarithm of typicality: $$H(Y)=\mathbb E\big[-\log \tau_f(U)\big]=-\int_0^1 \log \tau_f(u)\,du.$$
Similarity-sensitive entropy keeps the same template but allows graded kernels $K(x,x')\in[0,1]$, making distinguishability explicit and letting you interpolate between infinite and finite distinguishability.
For more: Similarity-Sensitive Entropy: Induced Kernels and Data-Processing Inequalities (arXiv:2601.03064).
In more detail
Randomness from a universal noise source
The starting point is the “uniform representation” idea: distributions can be generated from a single random seed. Let $U\sim\mathrm{Unif}[0,1]$ and $X:=\psi(U)$ for a measurable map $\psi:[0,1]\to\mathcal X$.
In one dimension, $\psi$ is often the quantile map $F^{-1}$, so $X=F^{-1}(U)$. More generally, you can treat $\psi$ as a measurable map that pushes the Lebesgue measure forward to whatever law you want.
Any law can be read as uniform randomness plus a reorganizing structure (the map $\psi$).
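A minimal sketch of this construction in Python (the exponential law and the `rate` parameter below are assumed purely for illustration; any quantile map works the same way):

```python
# A toy instance of X = psi(U): psi is the quantile map of an assumed Exponential(rate) law.
import numpy as np

rng = np.random.default_rng(0)

def psi(u, rate=1.0):
    """Quantile map F^{-1}(u) = -log(1 - u) / rate of an Exponential(rate) law."""
    return -np.log1p(-u) / rate

u = rng.uniform(0.0, 1.0, size=100_000)  # the universal noise source U ~ Unif[0, 1]
x = psi(u)                               # X = psi(U) follows the exponential law

print(x.mean())  # sanity check: should be close to 1 / rate = 1
```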
Coarse observation as a partition map
Suppose we don’t observe $U$ precisely; we only observe which “bucket” it falls into. Formally, let $f:[0,1]\to\{1,\dots,m\}$ and define $Y := f(U)$. The fibers $A_j := f^{-1}(j)$ form a partition of $[0,1]$.
Fiber-size form of Shannon entropy
Because $U$ is uniform, the probability of “being indistinguishable from $u$ under $f$” is literally the Lebesgue measure of the fiber that contains $u$: $$\mathbb P\big(f(U)=f(u)\big)=\lambda\big(A_{f(u)}\big)=:p_{f(u)}.$$
Plugging this into the definition of Shannon entropy yields the identity: $$H(Y)=-\sum_{j=1}^m p_j\log p_j=\mathbb E\big[-\log p_{f(U)}\big]=\mathbb E\big[-\log \lambda(A_{f(U)})\big].$$
Read informally: Shannon entropy is the average surprisal of the size (mass) of the indistinguishability class containing the state.
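A quick numerical check of this identity, assuming a toy three-bucket partition of $[0,1)$ (the bucket edges below are made up for illustration):

```python
# Sketch: check H(Y) = E[-log lambda(A_{f(U)})] on an assumed three-bucket partition.
import numpy as np

# Fibers A_1 = [0, 0.2), A_2 = [0.2, 0.7), A_3 = [0.7, 1).
edges = np.array([0.0, 0.2, 0.7, 1.0])
p = np.diff(edges)                      # Lebesgue measure (= probability) of each fiber

# Ordinary Shannon entropy of Y = f(U) from the bucket probabilities.
H_buckets = -np.sum(p * np.log(p))

# Fiber-size form: average surprisal of the mass of the fiber containing u.
rng = np.random.default_rng(1)
u = rng.uniform(size=200_000)
fiber_mass = p[np.searchsorted(edges, u, side="right") - 1]   # lambda(A_{f(u)})
H_fiber = -np.log(fiber_mass).mean()

print(H_buckets, H_fiber)   # the two estimates agree up to Monte Carlo error
```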
Kernelized coarse graining
Define the partition (equivalence) kernel induced by $f$: $$K_f(u,u'):=\mathbf 1\{f(u)=f(u')\}.$$
Interpretation: $K_f(u,u')=1$ exactly when the observation $f$ cannot tell $u$ and $u'$ apart.
From a kernel, a natural notion of “typicality” is just the kernel mass around a point: $$\tau_f(u):=\int_0^1 K_f(u,u')\,du'.$$
Since $K_f(u,u')$ is $1$ precisely on the fiber $f^{-1}(f(u))$, this reduces to $\tau_f(u)=\lambda(f^{-1}(f(u)))=p_{f(u)}$. Substitute into the fiber-size identity: $$H(Y)=\mathbb E\big[-\log \tau_f(U)\big]=-\int_0^1\log\tau_f(u)\,du.$$
Instead of pushing the measure forward to $\{1,\dots,m\}$, we keep the base measure on $[0,1]$ and move the coarse map “inside” the logarithm as a kernel. This is a kernelized form of Shannon entropy.
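A sketch of the kernelized form, reusing the same assumed three-bucket map: the typicality $\tau_f(u)$ is estimated as a Monte Carlo average of the partition kernel over an independent uniform sample.

```python
# Sketch: estimate the kernelized form E[-log tau_f(U)] by Monte Carlo,
# where tau_f(u) is the average of the partition kernel K_f(u, .) over [0,1].
import numpy as np

edges = np.array([0.0, 0.2, 0.7, 1.0])   # same assumed partition as above
def f(u):
    """Coarse map f: [0,1) -> {0, 1, 2} (which bucket contains u)."""
    return np.searchsorted(edges, u, side="right") - 1

rng = np.random.default_rng(2)
u = rng.uniform(size=2_000)        # evaluation points
u_prime = rng.uniform(size=2_000)  # integration sample for the kernel mass

# K_f(u, u') = 1{f(u) = f(u')}; tau_f(u) is its average over u'.
K = (f(u)[:, None] == f(u_prime)[None, :]).astype(float)
tau = K.mean(axis=1)

H_kernelized = -np.log(tau).mean()
print(H_kernelized)   # close to H(Y) = -(0.2 log 0.2 + 0.5 log 0.5 + 0.3 log 0.3) ≈ 1.03
```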
From partitions to SS-entropy
Partition kernels are $0$–$1$. Similarity-sensitive entropy generalizes by allowing a graded similarity kernel $K(x,x')\in[0,1]$ (with $K(x,x)=1$) on a state space with distribution $\mu$: $$H_K(\mu):=\mathbb E_{X\sim\mu}\big[-\log \tau_K(X)\big],\qquad \tau_K(x):=\int K(x,x')\,d\mu(x').$$
In words, SS-entropy is expected surprisal of being typical under the similarity notion $K$. Partition kernels recover ordinary Shannon entropy for the corresponding coarse variable, while intermediate kernels interpolate between “everything is distinct” and “many things count as similar.”
A benefit of this formulation is that you can define similarity where the semantics live, then transport it through a change of variables by pulling the kernel back.
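A sketch of the interpolation with a graded kernel, under assumed choices (standard normal $\mu$, a Gaussian-shaped kernel, and a handful of bandwidths): wide kernels treat many states as similar and drive the entropy down, while narrow kernels approach infinite distinguishability and the entropy grows.

```python
# Sketch: SS-entropy with a graded translation-invariant kernel on samples from mu.
# The Gaussian kernel and the bandwidths here are assumed choices to show the interpolation.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=2_000)                      # samples from an assumed standard normal mu

def ss_entropy(samples, bandwidth):
    """Monte Carlo SS-entropy: E[-log tau_K(X)] with tau_K(x) = E_{X'~mu} K(x, X')."""
    d = samples[:, None] - samples[None, :]
    K = np.exp(-0.5 * (d / bandwidth) ** 2)      # graded kernel with K(x, x) = 1
    tau = K.mean(axis=1)
    return -np.log(tau).mean()

# Wide kernel -> many states count as similar -> low entropy;
# narrow kernel -> near-infinite distinguishability -> entropy grows (diverging as bw -> 0).
for bw in (2.0, 0.5, 0.1, 0.02):
    print(bw, ss_entropy(x, bw))
```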
Example: a partition kernel on $\mathbb R$, pulled back to $[0,1]$
Take a Gaussian $X\sim\mathcal N(m,\sigma^2)$. To match the latent-uniform story above, represent it as $X=\psi(U)$ with $U\sim\mathrm{Unif}[0,1]$. Here $\psi$ is a quantile map: if $F$ is the CDF of $X$, then one choice is $\psi=F^{-1}$ (and $U=F(X)$).
Now “set the distinguishability scale” in $x$-space by binning. Fix a bin width $\Delta=0.4\sigma$ and let $g_\Delta:\mathbb R\to\mathbb Z$ record which interval of length $\Delta$ contains $x$ (anchored at the mean). Define the coarse variable $$Y:=g_\Delta(X)=g_\Delta(\psi(U)).$$
This induces a partition kernel on $\mathbb R$: $$K_\Delta(x,x'):=\mathbf 1\{g_\Delta(x)=g_\Delta(x')\}.$$
Typicality is the probability mass of your bin, $\tau_\Delta(x)=\mathbb P(g_\Delta(X')=g_\Delta(x))$ with $X'\sim\mu_X$ an independent copy, so SS-entropy reduces to an ordinary Shannon entropy: $$H_{K_\Delta}(\mu_X)=H(Y).$$
To express the same coarse-graining on the latent space, define $f_\Delta(u):=g_\Delta(\psi(u))$ and pull the kernel back: $$\widetilde K_\Delta(u,u') := \mathbf 1\{f_\Delta(u)=f_\Delta(u')\}=K_\Delta(\psi(u),\psi(u')).$$ Since $\psi$ pushes Lebesgue measure $\lambda$ forward to $\mu_X$, this is just a change of variables, so the entropy is unchanged: $$H_{K_\Delta}(\mu_X)=H_{\widetilde K_\Delta}(\lambda).$$
In $u$-space the bin boundaries become quantiles $u_j=\psi^{-1}(m+j\Delta)=F(m+j\Delta)$, which crowd near $0$ and $1$ because the Gaussian tails have low density.
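A numerical sketch of this example (NumPy/SciPy, with assumed values $m=0$, $\sigma=1$, and a finite sample size), computing the same coarse entropy once by binning $X$ directly and once on the latent space via $\psi=F^{-1}$:

```python
# Sketch of the binned-Gaussian example: the same coarse entropy computed in x-space
# (bin X directly) and in u-space (pull the partition back through psi = F^{-1}).
# Mean m, sigma, and the sample sizes are assumed for illustration.
import numpy as np
from scipy.stats import norm

m, sigma = 0.0, 1.0
delta = 0.4 * sigma                               # bin width Delta = 0.4 sigma
rng = np.random.default_rng(4)

# x-space: Y = g_Delta(X), H(Y) from empirical bin probabilities.
x = rng.normal(m, sigma, size=500_000)
y = np.floor((x - m) / delta).astype(int)         # g_Delta, anchored at the mean
_, counts = np.unique(y, return_counts=True)
p = counts / counts.sum()
H_x = -np.sum(p * np.log(p))

# u-space: U ~ Unif[0,1], f_Delta(u) = g_Delta(psi(u)) with psi = F^{-1} (Gaussian quantile).
u = rng.uniform(size=500_000)
y_u = np.floor((norm.ppf(u, loc=m, scale=sigma) - m) / delta).astype(int)
_, counts_u = np.unique(y_u, return_counts=True)
p_u = counts_u / counts_u.sum()
H_u = -np.sum(p_u * np.log(p_u))

print(H_x, H_u)   # equal up to Monte Carlo error: the pullback preserves the entropy
```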
Relationship to differential entropy
The “infinite distinguishability” limit corresponds to a kernel that only matches a point to itself: $$K(x,x'):=\mathbf 1\{x=x'\},\qquad\text{so}\qquad \tau(x)=\int \mathbf 1\{x'=x\}\,d\mu(x')=\mu(\{x\}).$$
If $\mu$ is atomless (no point masses), then $\mu(\{x\})=0$ for every $x$, so $\tau(x)=0$ almost everywhere and $-\log \tau(x)=+\infty$. In SS-entropy terms: considering each event on a continuous space to be uniquely distinct leads to unbounded surprisal.
Differential entropy can be understood as what remains after you subtract a kernel-dependent distinguishability baseline from a finite-distinguishability family of kernels with a shrinking scale $\varepsilon$. For example, take a translation-invariant kernel $$K_\varepsilon(x,x') := \kappa\!\left(\frac{x-x'}{\varepsilon}\right), \qquad \kappa(t)=\kappa(-t), \qquad \kappa(0)=1.$$ For many “local” choices and smooth densities $f$, typicality behaves like $\tau_\varepsilon(x)\approx \varepsilon f(x)$ up to a kernel-dependent constant, so $$-\log\tau_\varepsilon(x)\approx-\log\big(c_\kappa\,\varepsilon\big)-\log f(x),\qquad H_{K_\varepsilon}(\mu)+\log\big(c_\kappa\,\varepsilon\big)\;\longrightarrow\; h(\mu)=-\int f(x)\log f(x)\,dx \quad\text{as }\varepsilon\to0,$$ where $c_\kappa=\int\kappa(t)\,dt$ is that constant.
SS-entropy keeps the kernel (and therefore the baseline) explicit; differential entropy is the distribution part you get by renormalizing away that baseline as $\varepsilon\to 0$. By the pullback identity above, you can compute the same small-scale behavior either on $(\mathbb R,\mu)$ or on $([0,1],\lambda)$ with the pulled-back kernel.
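A sketch of this limit, under assumed choices (standard normal $\mu$ and a Gaussian-shaped $\kappa$, so $c_\kappa=\sqrt{2\pi}$): adding the baseline $\log(c_\kappa\varepsilon)$ back to a Monte Carlo estimate of $H_{K_\varepsilon}(\mu)$ should approach the differential entropy $\tfrac12\log(2\pi e)$ as $\varepsilon$ shrinks.

```python
# Sketch: as eps -> 0, H_{K_eps}(mu) + log(c_kappa * eps) approaches the differential
# entropy h(mu). Standard normal mu and kappa(t) = exp(-t^2 / 2) are assumed choices.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=3_000)                       # samples from mu
h_true = 0.5 * np.log(2 * np.pi * np.e)          # differential entropy of N(0, 1)

c_kappa = np.sqrt(2 * np.pi)                     # integral of kappa(t) = exp(-t^2 / 2)
d = x[:, None] - x[None, :]

for eps in (1.0, 0.3, 0.1, 0.03):
    K = np.exp(-0.5 * (d / eps) ** 2)            # K_eps(x, x') = kappa((x - x') / eps)
    tau = K.mean(axis=1)                         # Monte Carlo typicality
    H_eps = -np.log(tau).mean()
    print(eps, H_eps + np.log(c_kappa * eps), h_true)
```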