Histograms are probably one of the most intuitive ways of performing density estimation on numerical data. In this writeup, I discuss some of their deeper meanings and properties, as I try to digest Wayne Oldford’s excellent lecture on the subject.
In particular, we look at how well one can expect to approximate an underlying density as one has access to more samples and utilizes narrower bins. We then discuss a form of histogram that is much less sensitive to the location of bin boundaries, called an Average Shifted Histogram [1]. This kind of histogram has deeper connections to kernel density estimation, and we will therefore touch briefly on that too.
I believe this kind of reflection to be a useful exercise for any practitioner without a background in statistics (e.g., me) who has found herself tweaking histogram parameters, wondering about their limitations and whether better methods exist.
Histograms
When data is one-dimensional and real-valued, histograms are typically constructed by partitioning the region where the observed data lies into a set of $m \ge 1$ bins. Each bin $A_k$ ($1 \le k \le m$) is then represented as an interval:
$$A_k = [t_k, t_{k+1})$$
where the bin width is given by:
$$h = t_{k+1} - t_k$$
To simplify things, and because this is the most common case, we assume that all bins are of the same width. Other arrangements, however, are clearly possible (e.g., exponentially-sized bins).
Bin $A_k$ is also assigned a height, which is proportional to the number of points $n_k$ that fall into it. In other words, if we have a set of observed data points $D = \{x_1, \cdots, x_n\}$ ordered by value (i.e., $x_i \le x_j$ iff $i \le j$), we define $n_k$ as:
$$n_k = |D \cap A_k|$$
and assign a height to $A_k$ which is proportional to $n_k$. In particular, we could assign a height to $A_k$ which is equal to $n_k$ (as $n_k$ is surely proportional to itself), in which case we would have a frequency histogram.
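For concreteness, the counting above can be sketched in a few lines. This is a toy illustration of mine in Python (the post itself works in R), on synthetic data rather than the geyser sample:

```python
import numpy as np

# Synthetic one-dimensional sample (a stand-in for the geyser durations).
rng = np.random.default_rng(0)
x = rng.normal(loc=3.5, scale=1.0, size=200)

m = 10                                               # number of bins
edges = np.linspace(x.min(), x.max() + 1e-9, m + 1)  # equal-width bins [t_k, t_{k+1})
n_k, _ = np.histogram(x, bins=edges)                 # frequency histogram heights

print(n_k)   # every point falls in exactly one bin, so the counts sum to n
```

Since the bins partition the range of the data, the counts $n_k$ always sum to $n$.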
Let us now look at an example, based on the Old Faithful eruptions dataset. The first 5 rows are shown in Table 1.
data(geyser, package = 'MASS')
| waiting | duration |
|---|---|
| 80 | 4.016667 |
| 71 | 2.150000 |
| 57 | 4.000000 |
| 80 | 4.000000 |
| 75 | 4.000000 |
The dataset contains two columns, duration and waiting, which describe the duration of an eruption and the time elapsed until that eruption, respectively.
We plot the points corresponding to the duration of each eruption in a line in Fig. 1. This gives us a rough idea of their distribution.
ggplot(geyser) +
geom_segment(aes(x = duration, xend = duration, y = -.1, yend = .1)) +
ylim(-1, 1) +
theme(
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
) +
xlab('duration (minutes)')
Figure 1: Duration of Old Faithful’s eruptions.
We can then use those to construct a frequency histogram, with 10 equally-sized bins:
ggplot(geyser) +
geom_histogram(
aes(x = duration),
fill = 'salmon',
bins = 10,
col = 'black'
)

When used for density estimation, however, histograms are typically rescaled so that the sum of the areas of the bars equals one. This is accomplished by dividing each $n_k$ by the number of points $n$ times the bin width $h$, and assigning the resulting value to the bin’s height. We call this resulting value $\hat{p}_k$:
$$\hat{p}_k = \frac{n_k}{nh}$$
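The rescaling is easy to verify numerically. Again a Python sketch of mine on synthetic data, just as an illustration:

```python
import numpy as np

# Synthetic sample; the same construction works for any 1-d data.
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=500)

m = 10
edges = np.linspace(x.min(), x.max() + 1e-9, m + 1)
h = edges[1] - edges[0]                  # common bin width
n_k, _ = np.histogram(x, bins=edges)

p_hat = n_k / (len(x) * h)               # density heights: n_k / (n h)
print((p_hat * h).sum())                 # bar areas sum to one
```

Each bar has area $\hat{p}_k \cdot h = n_k / n$, and the $n_k$ sum to $n$, which is why the total area is one.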
ggplot(geyser) +
geom_histogram(
aes(x = duration, y = ..density..),
fill = 'salmon',
bins = 10,
col = 'black'
)
Figure 2: Durations of Old Faithful’s eruptions (density histogram).
Now let $f(x)$ be the unknown density function we are trying to estimate. A histogram essentially defines an approximate density function based on the sample $D$, which we name $\hat{f}(x)$ and which, for every $x \in A_k$, states that the density of $x$ is $\hat{f}(x) = \hat{p}_k$. How far, then, is this approximation from the real density, $f(x)$?
Histogram Bias
To understand the quality of the approximation that the histogram provides, we look at its bias:
$$\text{Bias}[\hat{f}(x)] = E[\hat{f}(x)] - f(x)$$
where the expectation on the right-hand side is taken with respect to all samples $D$ of size $n$. To understand how to compute the expectation let us assume, without loss of generality, that $x$ belongs to some interval $A_k$. For a given sample $D$ drawn from $f$, let $N_k$ be the random variable which represents the number of points that ended up in bin $A_k$. We have that:
$$E[\hat{f}(x)] = E\left[\frac{N_k}{nh}\right] = \frac{1}{nh}E[N_k]$$
Suppose we knew that the probability that a single sample ends up in bin $A_k$ is $p_k$ (which is different from its estimator, $\hat{p}_k$). The probability that we get $N_k = m$ samples out of $n$ (the size of $D$) in $A_k$ is then just the probability that we get $m$ heads out of $n$ coin flips for a coin where the probability of getting heads is $p_k$. This means that $N_k$ follows a binomial distribution; i.e., $N_k \sim \text{Binomial}(n, p_k)$. $E[N_k]$ is therefore equal to $np_k$, so that:
$$\frac{1}{nh}E[N_k] = \frac{np_k}{nh} = \frac{p_k}{h}$$
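As a sanity check on the binomial argument, here is a small simulation (mine, in Python rather than the post's R): for a $\text{Uniform}(0,1)$ sample and the fixed bin $A_k = [0.2, 0.3)$, the average of $N_k$ over many samples should be close to $np_k$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 20000
t_lo, t_hi = 0.2, 0.3                  # a fixed bin A_k
p_k = t_hi - t_lo                      # for Uniform(0, 1): F(t_{k+1}) - F(t_k)

# Draw many samples of size n and count the points landing in A_k each time.
samples = rng.uniform(0.0, 1.0, size=(reps, n))
N_k = ((samples >= t_lo) & (samples < t_hi)).sum(axis=1)

print(N_k.mean(), n * p_k)             # N_k ~ Binomial(n, p_k), so these match
```

The empirical mean of $N_k$ lands within a small fraction of $np_k = 10$.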
Now note that $p_k$ is the probability that one sample falls in bin $A_k$; i.e., that it falls between $t_k$ and $t_{k+1}$.
Figure 3: Histogram bins and the underlying density.
Fig. 3 tells us, however, that $p_k$ can be expressed in terms of $f$ as:
$$p_k = P[t_k \le X \le t_{k+1}] = \int_{t_k}^{t_{k+1}} f(x)\,dx = F(t_{k+1}) - F(t_k) \tag{1}$$
where $F$ is the cumulative distribution function corresponding to $f$. The final equation for the bias is, then:
$$\text{Bias}[\hat{f}(x)] = \frac{p_k}{h} - f(x) \tag{2}$$
What Eq. (2) tells us is that the bias of the histogram depends on $h$ but not on $n$, which is reasonable. In particular, the bias will be small at $x$ as $p_k$ approaches $hf(x)$. Further, Eq. (1) tells us that the bias will be small for any $x$ as $h$ tends to zero, given that:
$$\lim_{h \to 0}\frac{p_k}{h} = \lim_{h \to 0}\frac{F(t_{k+1}) - F(t_{k+1} - h)}{h} = f(t_{k+1}) = f(x) \tag{3}$$
where the last equality results from the fact that $x$ is by construction contained in $A_k$ and, therefore, as $h$ approaches zero we get that $t_k = t_{k+1} = x$.
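Eq. (3) can be checked numerically. The Python sketch below (my own illustration, using the standard normal, whose CDF is available through `math.erf`) shows the ratio $p_k/h = (F(x) - F(x-h))/h$ approaching $f(x)$ as $h$ shrinks:

```python
import math

def F(x):
    # Standard normal CDF, via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def f(x):
    # Standard normal density.
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

x = 0.7                                  # right bin edge t_{k+1}, pinned at x
for h in (0.5, 0.1, 0.01, 0.001):
    print(h, (F(x) - F(x - h)) / h)      # p_k / h, tending to f(x)
print(f(x))
```

This is just the definition of the derivative: $F' = f$, so the difference quotient converges to the density at $x$.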
Great! So we just have to make our hh very small and we will get a great histogram, right? Well, not so fast…
Histogram Variance
As is often the case, a lower bias does not come for free. Indeed, if we look at the variance of our estimator $\hat{f}(x)$, we get:
$$\text{Var}[\hat{f}(x)] = \text{Var}\left[\frac{N_k}{nh}\right] = \frac{p_k(1-p_k)}{nh^2} \tag{4}$$
which does not look like good news: to prevent variance from blowing up, it would appear that we have to increase nn faster than we decrease hh, and that the required increase in nn becomes more dramatic as hh approaches zero.
I say that it would appear to be that way because what happens to the variance as we keep $n$ fixed and let $h$ go to zero is not immediately obvious: as we narrow down our bin width $h$, $p_k$, the probability that a point falls in our bin, also decreases towards zero. If $p_k$ manages to compensate for the decrease in $h$ somehow, perhaps the variance does not blow up after all.
We can solve this mystery by rewriting Eq. (4) and looking at its limit as we let $h \to 0$:
$$\lim_{h \to 0}\frac{1}{nh}\left(\frac{p_k}{h} - \frac{p_k^2}{h}\right)$$
We know from Eq. (3) that $\lim_{h \to 0} p_k/h = f(x)$, which is a finite non-negative real number. Moreover, $p_k^2/h = p_k \cdot (p_k/h)$ tends to $0 \cdot f(x) = 0$, so the term in parentheses converges to $f(x)$ (limits are linear). This leaves us with the factor $\frac{1}{nh}$, which, barring an equivalent increase in $n$, blows up to $+\infty$. As $h$ tends to zero, therefore, $\text{Var}[\hat{f}(x)]$ tends to infinity.
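A small simulation makes the blow-up tangible. This is my own Python sketch, using a $\text{Uniform}(0,1)$ density, for which $p_k = h$ and Eq. (4) reduces to $\text{Var}[\hat{f}(x)] = (1-h)/(nh)$: keeping $n$ fixed while shrinking $h$ by a factor of 20 inflates the empirical variance of $\hat{f}(x)$ by well over an order of magnitude.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, x0 = 200, 5000, 0.5            # fixed sample size, evaluation point

def fhat_var(h):
    # Empirical variance of the histogram estimate at x0, over many
    # Uniform(0, 1) samples, using the bin [x0 - h/2, x0 + h/2).
    samples = rng.uniform(0.0, 1.0, size=(reps, n))
    N_k = ((samples >= x0 - h / 2) & (samples < x0 + h / 2)).sum(axis=1)
    return (N_k / (n * h)).var()

v_wide, v_narrow = fhat_var(0.2), fhat_var(0.01)
print(v_wide, v_narrow)                 # theory: (1 - h) / (n h) for each h
```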
We are, therefore, stuck with this fact of life: when using histograms we can either choose to have wide but imprecise (biased) bins, or narrow but unreliable ones.
More on Bias
Another interesting point to consider is the relationship between bin width, bias, and the shape of $f$. Clearly, an $f$ which contains narrow peaks will be much harder to approximate with a wide rectangle than an $f$ which is completely flat (think uniform distribution). This intuition can be made formal if we look at the bias equation (Eq. (2)) again, but transform it slightly. Specifically, we will rewrite the term $p_k/h$ in another way.
Recall that Eq. (1) tells us that:
$$p_k = \int_{t_k}^{t_{k+1}} f(x)\,dx$$
The mean value theorem for definite integrals, then, tells us that if we were to draw a rectangle with base in $A_k = [t_k, t_{k+1})$ and make it tall enough so that its area is equal to $p_k$, then its height must be equal to $f(\xi)$ for some $\xi \in A_k$.
The intuition for this is depicted in Fig. 4: if we draw a rectangle with base in $A_k = [t_k, t_{k+1})$ and area $a_k$, we have three options. Either the top side of the rectangle:
- is completely under $f(x)$, in which case $a_k < p_k$;
- is completely above $f(x)$, in which case $a_k > p_k$;
- is neither above nor below $f(x)$, in which case we can pick $t_k \le \xi < t_{k+1}$ such that $a_k = p_k$.
Figure 4: Mean value theorem for definite integrals.
Recalling that $t_{k+1} - t_k = h$, we can therefore say that:
$$p_k = \int_{t_k}^{t_{k+1}} f(x)\,dx = h f(\xi)$$
for some $t_k \le \xi < t_{k+1}$. We can therefore rewrite the bias from Eq. (2) as:
$$\text{Bias}[\hat{f}(x)] = \frac{p_k}{h} - f(x) = f(\xi) - f(x)$$
If we now make the rather mild assumption that $f$ is Lipschitz continuous on $A_k$; i.e., that there exists some $\gamma_k \in \mathbb{R}$ such that, for every $x, y \in A_k$, we have:
$$|f(y) - f(x)| < \gamma_k |y - x|$$
we get that the absolute value of the bias is also bounded:
$$\left|\text{Bias}[\hat{f}(x)]\right| = |f(\xi) - f(x)| \le \gamma_k |\xi - x| \le \gamma_k h \tag{5}$$
where the last inequality holds because both $x$ and $\xi$ are in $A_k$ and, therefore, cannot be more than $h$ apart, $h$ being the width of the bin $A_k$.
One way of looking at what Eq. (5) is saying is that the closer the slope of $f$ over $A_k$ is to zero, the wider the rectangles we can use to approximate $f$ acceptably. Although this is only an upper bound, the result is intuitive: a flatter function is easier to represent with a rectangle than a function that varies steeply, up or down.¹
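The bound in Eq. (5) is easy to check numerically. A Python sketch of mine, treating $f(x) = \sin(x)$ on a hypothetical bin as the function being averaged (ignoring, for illustration, that a true density integrates to one); since $|f'| \le 1$, $f$ is Lipschitz with $\gamma_k = 1$:

```python
import math

# Hypothetical bin A_k = [1.0, 1.3) of width h = 0.3, with gamma_k = 1.
t_k, h, gamma_k = 1.0, 0.3, 1.0

# p_k / h is the average of f over the bin; by the mean value theorem it
# equals f(xi) for some xi in A_k.  The integral of sin is -cos.
pk_over_h = (math.cos(t_k) - math.cos(t_k + h)) / h

# Worst-case |f(xi) - f(x)| over a grid of x values in the bin.
worst_bias = max(
    abs(pk_over_h - math.sin(t_k + u * h / 100)) for u in range(101)
)
print(worst_bias, gamma_k * h)           # the bound gamma_k * h holds
```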
Location Variance and Average Shifted Histograms
So far we have touched upon fundamental theoretical properties of histograms which we cannot do much about. We will now discuss something that we can actually improve: histogram variance due to bin alignment.
The main issue relates to the fact that we can generate differently shaped histograms by simply varying the location of the first bin or, equivalently, by shifting the data itself.² The effect is shown in Fig. 5, where we plot three histograms in which we shift the bins, initially aligned at $0$, by two different values. Some features of the dataset, notably the peak near $4$, get more pronounced in the rightmost histogram.
breaks <- function(X, bins, offset) {
width <- diff(range(X)) / bins
offset + seq(min(X) - width, max(X), by = width)
}
par(mfrow = c(1, 3))
for (offset in c(0, 0.2, 0.34)) {
hist(
geyser$duration,
breaks = breaks(geyser$duration, 10, offset),
main = sprintf('11 bins, bin shift %.2f', offset),
right = FALSE,
ylim = c(0, 1),
xlab = 'duration',
freq = FALSE
)
}
Figure 5: Variation in shifted histograms.
This variability gets in the way of us understanding the actual nature of the underlying density. Clearly, we could mitigate the issue by using very narrow bins, but that would increase our data-related variance (Eq. (4)), thus producing a noisy and equally unreliable histogram. What, then, can we do about it?
Histogram Overplots
One simple way of improving the situation is by plotting superimposed, transparent versions of our shifted histograms on top of each other. We call these shifted histogram overplots, or SHOs. The idea with SHOs is that we can amortize the effects of variations due to shifting by preemptively shifting the histograms ourselves and “visually averaging” the result.
A SHO is specified by adding an extra parameter, $m$, which controls the number of shifted copies we plot. For a given bin width $h$, we plot $m$ histograms, shifting bin alignment for the $j$th histogram ($0 \le j < m$) by $j \times \frac{h}{m}$. For simplicity we assume that, in the absence of shifting, the first bin of the histogram aligns with $0$.
In R:
sho <- function(X, width, m, alpha, ...) {
for (shift in 0:(m - 1)) {
offset <- shift * (width / m)
min_k <- floor((min(X) - offset) / width)
max_k <- ceiling((max(X) - offset) / width)
hist(
X,
breaks = offset + width * seq(
from = min_k, to = max_k
),
freq = FALSE,
add = (shift != 0),
col = scales::alpha('gray', alpha = alpha),
lty = 'blank',
...
)
}
}
We then plot our data again and simulate what would happen had we gotten our data with a different shift each time: shifting the data changes the data/bin alignment just as shifting bin boundaries did, and so generates similar disturbances. The results are shown in Fig. 6.
par(mfrow = c(1, 3))
for (offset in c(0, 0.2, 0.34)) {
sho(
geyser$duration + offset,
width = 0.9,
m = 5,
alpha = 0.3,
main = sprintf('m = 5, shift %.2f', offset),
xlab = 'duration',
ylim = c(0, 0.7),
xlim = c(0, 7)
)
}
Figure 6: Shifted Histogram Overplots with $h = 0.9$, $m = 5$.
As we can see, although we still get some variability, the main features of the density remain largely unchanged. SHOs are therefore more resilient to changing data/bin alignment than plain histograms.
The issue with SHOs is that they are essentially a visual tool. While we could take the maximum of all shifted histograms and let this be our estimate for $f$, this would fail to take into account that some regions of the SHO are less transparent (and thus more likely to be part of the actual density) than others.
The final step in our histogram ladder, then, is a technique which takes these factors into account while constructing a numeric density estimate for $f$.
Average Shifted Histograms
An Average Shifted Histogram (ASH) [1] constructs a density estimate by averaging together the estimates produced by a set of shifted histograms just like the ones we built for our SHOs. Formally, if we name the estimates produced by the $m$ shifted histograms at $x$ as $\hat{f}_0(x), \cdots, \hat{f}_{m-1}(x)$, the ASH estimate $\hat{f}_{\text{ash}}(x)$ for $x$ can be written as:
$$\hat{f}_{\text{ash}}(x) = \frac{1}{m}\sum_{j=0}^{m-1}\hat{f}_j(x)$$
Or, in R:
# Average Shifted Histogram
ash <- function(X, nbins, m) {
width <- diff(range(X)) / nbins
f_ash <- rep(0, length(X))
for (shift in 1:m) {
f_ash <- f_ash +
f_j(
X = X,
breaks = shift * (width / m) + seq(
from = min(X) - width, to = max(X) + 1, by = width
)
)
}
i <- order(X)
ash <- tibble(f = f_ash[i]/m, x = X[i])
class(ash) <- c('ash', class(ash))
ash
}
# Histogram Density Estimate
f_j <- function(X, breaks) {
n <- length(X)
f_dens <- rep(NA, n)
for (i in 1:(length(breaks) - 1)) {
low <- breaks[i]
high <- breaks[i + 1]
elements <- (X >= low) & (X < high)
f_dens[elements] <- sum(elements) / ((high - low) * n)
}
stopifnot(all(!is.na(f_dens)))
f_dens
}
# Plot ASH
plot.ash <- function(ash) {
ggplot(ash) +
geom_line(aes(x = x, y = f)) +
ylab('density')
}
Let’s see how the plot looks for our Old Faithful’s data:
plot(ash(geyser$duration, m = 5, nbins = 10)) +
xlab('duration') +
ggtitle('Duration of Old Faithful\'s Eruptions (ASH, m = 5, 10 bins)')

As we can see, the result is a much smoother function, with features that resemble the ones we could see in the SHOs of Fig. 6.
Connection to Kernel Methods
ASHes have some nice properties which they share with kernel methods. Scott [1] describes them as a middle ground: they approach kernel estimators in statistical efficiency (which I presume to mean that their variance decreases faster with an equivalent number of samples than that of a histogram of equivalent bias), while remaining much cheaper to compute due to their relative simplicity.
While discussing matters of statistical efficiency is beyond the scope of this post (though I am certain to revisit it in later posts), it is nevertheless interesting to discuss the shape of the weight function and its relationship to the triangle kernel.
We start by dividing each of our intervals $A_k$ (our bins of width $h$) into $m$ subparts, as depicted in Fig. 7. Each subpart is actually a smaller bin, which we refer to as $B_i$. For ease of notation, we assume that $x$ is contained in small bin $B_0$, and give negative indices to the small bins to the left of $x$. Finally, let $v_i$ be the number of points in our sample that fall in small bin $B_i$ (so that $\sum_i v_i = \sum_k n_k = n$).
Figure 7: Shifted histograms and their partial estimates.
We know that the density estimate at $x$ is the average density coming from $m$ overlapping histograms. Looking at Fig. 7, we see that we can express those in terms of the $v_i$ as follows:
$$\hat{f}_0(x) = \frac{1}{nh}\left(v_0 + v_1 + \cdots + v_{m-1}\right)$$
$$\hat{f}_1(x) = \frac{1}{nh}\left(v_{-1} + v_0 + \cdots + v_{m-2}\right)$$
$$\hat{f}_2(x) = \frac{1}{nh}\left(v_{-2} + v_{-1} + v_0 + \cdots + v_{m-3}\right)$$
$$\vdots$$
$$\hat{f}_i(x) = \frac{1}{nh}\left(\sum_{j=1}^{i} v_{-j} + v_0 + \sum_{k=1}^{m-i-1} v_k\right)$$
$$\vdots$$
$$\hat{f}_{m-1}(x) = \frac{1}{nh}\left(\sum_{j=1}^{m-1} v_{-j} + v_0\right)$$
We have, therefore, that:
$$\frac{1}{m}\sum_{j=0}^{m-1}\hat{f}_j(x) = \frac{1}{mnh}\left[m\,v_0 + (m-1)(v_{-1} + v_1) + \cdots + 1\cdot\left(v_{-(m-1)} + v_{m-1}\right)\right] = \frac{1}{nh}\sum_{j=-(m-1)}^{m-1}\frac{m-|j|}{m}\,v_j \tag{6}$$
We can see Eq. (6) as a weighted sum, where the underlying weight function assigns weight $\frac{m-|j|}{mnh}$ to each count $v_j$. If we plot these weights as a function of $j$, we can observe the shape of the weight function. We plot those in Fig. 8 for $m \in \{5, 15, 100\}$. As we increase $m$, the weight function approaches a continuous triangle and the behavior of ASHes approaches that of kernel density estimation with a triangular kernel.
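Up to the common factor $1/(nh)$, the weight profile is $(m - |j|)/m$, which is exactly the triangle kernel $K(u) = 1 - |u|$ sampled at $u = j/m$. A quick Python sketch of mine makes the comparison explicit:

```python
import numpy as np

# ASH weights (m - |j|) / m for |j| < m, versus the triangle kernel
# K(u) = 1 - |u| sampled at u = j / m.  Up to the 1/(nh) normalization,
# these are the weights Eq. (6) assigns to the fine-bin counts v_j.
for m in (5, 15, 100):
    j = np.arange(-(m - 1), m)          # j = -(m-1), ..., m-1
    w = (m - np.abs(j)) / m             # ASH weight profile
    tri = 1.0 - np.abs(j) / m           # triangle kernel at u = j/m
    print(m, np.abs(w - tri).max())     # identical up to rounding
```

The two coincide at every grid point; the kernel limit arises because the grid $u = j/m$ becomes dense as $m$ grows.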
Figure 8: The ASH weight function tends to the triangle kernel as $m \to +\infty$.
Summary
Histograms are an intuitive and widely used tool for density estimation. In this post, we have presented a slightly more formal treatment of histograms and their relationship with the underlying density, formalizing some intuitive notions, such as the fact that histograms are inherently biased estimators (it is difficult to represent a smooth function with rectangles) and that reducing this bias by adopting narrower bins entails a cost in terms of increased variance.
We have then discussed a different type of variance that arises in histograms as we change the alignment of bins and data. We have seen that Shifted Histogram Overplots can successfully mitigate this type of variance, and used them as a building block to explain Average Shifted Histograms. Finally, we have discussed the relationship between Average Shifted Histograms and the triangle kernel.
[1] D. Scott, Multivariate density estimation. Wiley, 1992.
It is actually a bit more complicated than that: the equation says that the slope of whatever remains outside of the rectangle after we matched the areas will bound the bias, and you could have a very tall function which still looks like a rectangle. The intuition that “flatter” functions will have smaller biases is nevertheless still correct.↩︎
While you could clearly just center the data, this is more of an issue when your data is already “cut” at a shifted location but you cannot know what the shift is, as with an image or a time series. Further, the techniques developed here will pave the way for discussing kernel density estimation later on.↩︎