Normal (Gaussian) Distribution

The normal distribution, often referred to as the “bell curve” due to its shape, is arguably the most important continuous probability distribution in statistics and machine learning.

What is it and why is it useful?

In nature and data science, when you add up many independent random factors (each with finite variance), their sum tends toward a normal distribution, regardless of the distribution of the individual factors. This result is known as the Central Limit Theorem.
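As a quick sanity check, the following plain-JavaScript sketch (an illustration, not part of the WebPPL code used later in this tutorial) sums uniform random numbers and standardizes the sums; the resulting draws behave like a standard normal, just as the theorem predicts.

```javascript
// Sum n Uniform(0,1) draws and standardize the sum.
// A Uniform(0,1) has mean 1/2 and variance 1/12, so the sum of n
// draws has mean n/2 and variance n/12.
function standardizedUniformSum(n) {
  let s = 0;
  for (let i = 0; i < n; i++) s += Math.random();
  return (s - n / 2) / Math.sqrt(n / 12);
}

const draws = Array.from({length: 100000}, () => standardizedUniformSum(12));
const mean = draws.reduce((a, b) => a + b, 0) / draws.length;
const variance = draws.reduce((a, b) => a + (b - mean) ** 2, 0) / draws.length;

console.log("Empirical mean (should be near 0):", mean.toFixed(3));
console.log("Empirical variance (should be near 1):", variance.toFixed(3));
```

Even though each individual draw is uniform, a histogram of these standardized sums already looks strikingly bell-shaped.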

Because of this, the normal distribution is well suited for modeling:

  • Measurement errors: the noise in sensor readings (like a GPS or a thermometer).

  • Biological traits: human heights, blood pressure, or test scores in a large population.

  • Uncertainty in Bayesian inference: it is heavily used as both a prior (what we believe before seeing data) and a likelihood (how noisy the data is).

Theory and Mathematical Definition

The normal distribution is completely defined by two parameters:

  • Mean (\(\mu\)): The expected value or center of the distribution. It dictates where the peak of the bell curve is located.

  • Standard Deviation (\(\sigma\)): The spread of the distribution. A larger standard deviation means the data is more spread out and uncertain.

The Probability Density Function (PDF) is given by the following equation:

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2} \,\right)\]
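The formula translates directly into code. The helper below is a plain-JavaScript transcription of the PDF (a hypothetical function for illustration, not a WebPPL built-in):

```javascript
// Evaluate the normal PDF f(x) = exp(-0.5 * ((x - mu)/sigma)^2) / (sigma * sqrt(2*pi))
function normalPdf(x, mu, sigma) {
  const z = (x - mu) / sigma;
  return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2 * Math.PI));
}

// At the mean of a standard normal, the density is 1/sqrt(2*pi) ≈ 0.3989.
console.log(normalPdf(0, 0, 1));
// One standard deviation away, the density drops by a factor of e^(1/2).
console.log(normalPdf(1, 0, 1));
```

Note that \(f(x)\) is a density, not a probability: it can exceed 1 for small \(\sigma\), and probabilities come from integrating it over an interval.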

Visualizing the Distribution

Here is how a standard normal distribution (\(\mu=0\), \(\sigma=1\)) looks when sampled 10,000 times:

var d = Gaussian({mu: 0, sigma: 1});
var samples = repeat(10000, function() { return sample(d); });
viz.hist(samples, {title: "Standard Normal Distribution (mu=0, sigma=1)"});
Histogram of a standard normal distribution

(Note: You can easily generate this visualization on webppl.org using the viz.hist() function.)

Executable example: Basic properties and sampling

In WebPPL, you can define a normal distribution using the Gaussian object. Here we sample from it and calculate its probability density.

// Define a normal distribution with mean 175 and standard deviation 7
var d = Gaussian({mu: 175, sigma: 7});

// Sample from the distribution
var s = sample(d);
display("A single sample: " + s);

// Calculate the probability density of a specific value:
// score returns the log density, so exponentiate to get the density
var p = Math.exp(d.score(175));
display("Probability density at mean (175): " + p);

// Generate multiple samples and calculate the empirical mean
var samples = repeat(1000, function() { return sample(d); });
var empiricalMean = listMean(samples);
display("Empirical mean of 1000 samples: " + empiricalMean);

Output:

A single sample: 175.83471658080208
Probability density at mean (175): 0.056991754343061835
Empirical mean of 1000 samples: 175.01584749248636
undefined

(The trailing undefined is the return value of the program itself; the next example suppresses it by ending with an empty string expression, "";.)
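The same sampling experiment can be reproduced outside WebPPL. The plain-JavaScript sketch below draws from Gaussian(175, 7) using the Box-Muller transform (an illustration of one standard sampling technique; WebPPL's internal sampler is not specified here):

```javascript
// Draw one sample from Gaussian(mu, sigma) via the Box-Muller transform.
function gaussianSample(mu, sigma) {
  const u1 = 1 - Math.random(); // in (0, 1], avoids log(0)
  const u2 = Math.random();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return mu + sigma * z;
}

const draws = Array.from({length: 100000}, () => gaussianSample(175, 7));
const empiricalMean = draws.reduce((a, b) => a + b, 0) / draws.length;
console.log("Empirical mean of 100000 samples:", empiricalMean.toFixed(2));
```

With 100,000 samples the empirical mean lands very close to 175, since the standard error of the mean shrinks as \(\sigma/\sqrt{n}\).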

Executable example: Bayesian Inference with Gaussian

In this example, we estimate the true length of an object based on several noisy measurements. We use the Gaussian distribution twice: first as a prior for the unknown length, and second as the likelihood to model the measurement noise.

// Observed data (e.g., measurements of an object's length)
var observedData = [10.2, 9.8, 10.1, 10.5, 9.9];

var model = function() {
  // Prior: we believe the true length is around 10, but we are uncertain
  var trueLength = sample(Gaussian({mu: 10, sigma: 2}));

  // Likelihood: our measuring tool has a known noise (sigma = 0.5)
  var obsFn = function(d) {
    observe(Gaussian({mu: trueLength, sigma: 0.5}), d);
  };
  map(obsFn, observedData);

  return trueLength;
};

// Run MCMC inference
var posterior = Infer({method: 'MCMC', samples: 5000}, model);

// Print the expected value based on the posterior distribution
display("Expected true length given the noisy data: " + expectation(posterior));
"";

Output:

Expected true length given the noisy data: 10.107397532282137
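Because a Gaussian prior combined with a Gaussian likelihood of known noise is conjugate, this particular posterior also has a closed form, which gives a useful cross-check on the MCMC estimate. A plain-JavaScript sketch (independent of WebPPL):

```javascript
// Closed-form posterior mean for a Gaussian prior and Gaussian
// likelihood with known noise: precisions (inverse variances) add,
// and the posterior mean is the precision-weighted combination of
// the prior mean and the data.
const observedData = [10.2, 9.8, 10.1, 10.5, 9.9];
const mu0 = 10, sigma0 = 2; // prior, as in the model above
const sigma = 0.5;          // known measurement noise

const n = observedData.length;
const sum = observedData.reduce((a, b) => a + b, 0);

const priorPrecision = 1 / (sigma0 * sigma0);
const dataPrecision = n / (sigma * sigma);
const posteriorMean =
  (mu0 * priorPrecision + sum / (sigma * sigma)) /
  (priorPrecision + dataPrecision);

console.log("Closed-form posterior mean:", posteriorMean.toFixed(4)); // ≈ 10.0988
```

The exact value of about 10.0988 agrees closely with the MCMC estimate above; the small discrepancy is the Monte Carlo error from using a finite number of samples.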