Probability Tutorial for Biology 231
Basic notation
Applying basic probability to Mendelian genetics
Conditional probability
Probability in statistical analysis
The binomial distribution
Bayes' theorem
The aim of this tutorial is to guide you through the basics of probability. An understanding of probability is the key to success in Mendelian and evolutionary genetics. Along the way, you will be challenged with eight problems to test your understanding of the concepts.
Basic Notation.
Applying Basic Probability to Mendelian Genetics.
. | ABDE | ABdE | AbDE | AbdE |
aBDE | AaBBDDEE | AaBBDdEE | AaBbDDEE | AaBbDdEE |
aBDe | AaBBDDEe | AaBBDdEe | AaBbDDEe | AaBbDdEe |
aBdE | AaBBDdEE | AaBBddEE | AaBbDdEE | AaBbddEE |
aBde | AaBBDdEe | AaBBddEe | AaBbDdEe | AaBbddEe |
abDE | AaBbDDEE | AaBbDdEE | AabbDDEE | AabbDdEE |
abDe | AaBbDDEe | AaBbDdEe | AabbDDEe | AabbDdEe |
abdE | AaBbDdEE | AaBbddEE | AabbDdEE | AabbddEE |
abde | AaBbDdEe | AaBbddEe | AabbDdEe | AabbddEe |
TEST YOUR UNDERSTANDING.
Let's cross AaBBCcDdEEffGGHh × AaBbccDDEeFfGgHh. Again, we'll assume that the genes are independently assorting.
First, what is the chance that a particular offspring has the AaBbccDDEeFfGghh genotype? If you choose to set up a Punnett square, beware! You'll have 16 columns and 64 rows, for a grand total of 1024 boxes. Don't make any mistakes...
From the same cross... what is the probability that the offspring has the dominant phenotype for all eight genes, assuming that upper-case alleles are dominant to lower case alleles?
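The arithmetic behind problems like these -- multiply the per-locus probabilities of independently assorting genes -- can be sketched in a few lines of code. The Python sketch below is not part of the tutorial, and the function names are invented for illustration; it uses a deliberately small dihybrid cross so as not to give away the answers above.

```python
from fractions import Fraction

def locus_probability(parent1, parent2, child):
    """Chance that one offspring has genotype `child` at a single locus,
    given parental genotypes written as two-letter strings such as 'Aa'.
    Each parent passes one of its two allele copies at random."""
    total = Fraction(0)
    for a1 in parent1:                    # allele copy from parent 1 (each w.p. 1/2)
        for a2 in parent2:                # allele copy from parent 2 (each w.p. 1/2)
            if {a1, a2} == set(child):    # genotypes are unordered allele pairs
                total += Fraction(1, 4)
    return total

def cross_probability(parent1_loci, parent2_loci, child_loci):
    """Product rule: with independent assortment, multiply the per-locus
    probabilities together."""
    prob = Fraction(1)
    for g1, g2, c in zip(parent1_loci, parent2_loci, child_loci):
        prob *= locus_probability(g1, g2, c)
    return prob

# A small dihybrid example: AaBb x AaBb -> AaBb
print(cross_probability(["Aa", "Bb"], ["Aa", "Bb"], ["Aa", "Bb"]))  # 1/4
```

The same two functions handle the eight-gene cross above: list the eight parental genotypes and the target offspring genotype, and the product rule does the rest, with no 1024-box Punnett square required.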
TEST YOUR UNDERSTANDING.
Let's apply this to a common Mendelian genetics problem. There is a gene in cats that affects development of the spine. Individuals with the MM genotype are phenotypically normal. Individuals with the Mm genotype are tailless (Manx) cats. The mm genotype is developmentally lethal, so zygotes with this genotype do not develop into kittens. If you cross two Manx cats, what fraction of the kittens are expected to be Manx?
Let's try a different problem. In fruit flies, brown eyes result from a homozygous recessive genotype (br/br). A pair of heterozygous parents produce a son with wild type eye color. He is mated with a brown-eyed female. What is the probability that their first offspring has brown eyes?
Probability in Statistical Analysis.
For many statistical tests, we are interested in the so-called p-value. This is the probability of obtaining a particular value of a test statistic (or greater) just by chance. In general, we use the statistical test to contrast observed results (our data) with expected results (those predicted by the hypothesis being tested). [We usually must make certain assumptions about the data in order to use the p-value to reject or fail to reject the hypothesis.] If the difference between the observed and expected results is sufficiently great -- by convention, such that the p-value corresponding to the test statistic value is less than 0.05 -- we reject the hypothesis used to generate the expected results. If the p-value is greater than 0.05, we fail to reject the hypothesis.
How do we put this in terms of formal probability? Define A as "the observed results or any results less likely given the hypothesis" and B as "the hypothesis is correct." If all of the assumptions of the statistical test are valid, then the p-value = p(A|B): the probability of observing the results or any less likely results given that the hypothesis is correct.
Another, more intuitive (though not strictly correct) way to think about a p-value: it is roughly the chance that, in rejecting the hypothesis, we are rejecting a hypothesis that is actually true. Obviously, we don't like to make mistakes. So we feel better about rejecting a hypothesis if our statistical test gives us a very low p-value.
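To make the definition concrete, here is a short Python sketch (an invented example, not from the tutorial) that computes an exact p-value for a simple hypothesis -- "this coin is fair" -- by counting equally likely flip sequences. For simplicity it is one-tailed: it counts only outcomes with at least as many heads as observed.

```python
from math import comb

# Hypothesis: the coin is fair, so all 2**n flip sequences are equally likely.
n = 10
observed_heads = 9

# A (one-tailed, for simplicity): the observed result or any result with
# even more heads -- here, 9 or 10 heads out of 10 flips.
favorable = sum(comb(n, k) for k in range(observed_heads, n + 1))
p_value = favorable / 2**n
print(p_value)   # 11/1024, about 0.011, so by convention we would reject fairness
```

Since 0.011 < 0.05, observing 9 heads in 10 flips would lead us to reject the hypothesis that the coin is fair.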
The Binomial Distribution.
A particularly broad class of repeated experiments falls into the category of Bernoulli trials. By definition, Bernoulli trials have three characteristics: each trial has exactly two possible outcomes, conventionally called "success" and "failure"; the probability of success, p, is the same for every trial; and the trials are independent of one another.
If one knows in advance the probability of success (p), then one can predict the exact probability of k successes in N Bernoulli trials. This probability can be written formally as:
p(k|N,p) = [N! ÷ (k! × (N-k)!)] × p^k × (1-p)^(N-k)
In terms of formal probability, this is the probability of k successes given N trials and given probability of success = p. [Note the awkward use of p for two different purposes in the equation.] This formula is the basis of the Binomial Distribution.
Perhaps a more proper way to think about the Binomial Distribution is to consider the distribution itself. The Binomial Distribution describes the probabilities of all possible outcomes of N Bernoulli trials given probability of success = p. It should be evident that one could observe, in principle, any integer number of successes ranging from 0 to N.
To better understand the Binomial Distribution, it makes sense to break down the formula. First, consider all possible outcomes of N = 4 Bernoulli trials, grouped by the number of successes, k:
. | Trial 1 | Trial 2 | Trial 3 | Trial 4 |
k=0 | Fail | Fail | Fail | Fail |
k=1 | Success | Fail | Fail | Fail |
. | Fail | Success | Fail | Fail |
. | Fail | Fail | Success | Fail |
. | Fail | Fail | Fail | Success |
k=2 | Success | Success | Fail | Fail |
. | Success | Fail | Success | Fail |
. | Success | Fail | Fail | Success |
. | Fail | Success | Success | Fail |
. | Fail | Success | Fail | Success |
. | Fail | Fail | Success | Success |
k=3 | Fail | Success | Success | Success |
. | Success | Fail | Success | Success |
. | Success | Success | Fail | Success |
. | Success | Success | Success | Fail |
k=4 | Success | Success | Success | Success |
k | N! | k! | (N-k)! | N! ÷ (k! × (N-k)!)
0 | 1 × 2 × 3 × 4 = 24 | 1 (by definition) | 1 × 2 × 3 × 4 = 24 | 24 ÷ (1 × 24) = 1
1 | 1 × 2 × 3 × 4 = 24 | 1 | 1 × 2 × 3 = 6 | 24 ÷ (1 × 6) = 4
2 | 1 × 2 × 3 × 4 = 24 | 1 × 2 = 2 | 1 × 2 = 2 | 24 ÷ (2 × 2) = 6
3 | 1 × 2 × 3 × 4 = 24 | 1 × 2 × 3 = 6 | 1 | 24 ÷ (6 × 1) = 4
4 | 1 × 2 × 3 × 4 = 24 | 1 × 2 × 3 × 4 = 24 | 1 (by definition) | 24 ÷ (24 × 1) = 1
k | p^k | (1-p)^(N-k) | p^k × (1-p)^(N-k)
0 | (1/6)^0 = 1.0000 | (5/6)^4 = 0.4823 | 1.0000 × 0.4823 = 0.4823
1 | (1/6)^1 = 0.1667 | (5/6)^3 = 0.5787 | 0.1667 × 0.5787 = 0.0965
2 | (1/6)^2 = 0.0278 | (5/6)^2 = 0.6944 | 0.0278 × 0.6944 = 0.0193
3 | (1/6)^3 = 0.0046 | (5/6)^1 = 0.8333 | 0.0046 × 0.8333 = 0.0039
4 | (1/6)^4 = 0.0008 | (5/6)^0 = 1.0000 | 0.0008 × 1.0000 = 0.0008
k | N! ÷ (k! × (N-k)!) | × | p^k × (1-p)^(N-k) | = | p(k|N,p)
0 | 1 | × | 0.4823 | = | 0.4823 |
1 | 4 | × | 0.0965 | = | 0.3858 |
2 | 6 | × | 0.0193 | = | 0.1157 |
3 | 4 | × | 0.0039 | = | 0.0154 |
4 | 1 | × | 0.0008 | = | 0.0008 |
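As a check on the tables above, the full formula can be evaluated directly. The following Python sketch (the function name is invented for illustration) reproduces the worked example with N = 4 and p = 1/6:

```python
from math import comb

def binomial_pmf(k, N, p):
    """p(k|N,p) = [N! / (k! * (N-k)!)] * p^k * (1-p)^(N-k)"""
    return comb(N, k) * p**k * (1 - p)**(N - k)

# Reproduce the table above: N = 4 trials, probability of success p = 1/6
for k in range(5):
    print(k, round(binomial_pmf(k, 4, 1/6), 4))
```

Note that the five probabilities sum to exactly 1, as they must: every possible number of successes from 0 to N is accounted for.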
Below are binomial distribution plots for 10 Bernoulli trials with three different probabilities of success.
As the number of trials is increased, the binomial distribution becomes smoother. In fact, the normal distribution can be derived mathematically as the limit of a binomial distribution as N approaches infinity (the classic derivation uses p = 0.5, but the result holds for any fixed p).
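To see this convergence numerically, one can compare binomial probabilities with the corresponding normal density. The Python sketch below (an illustration, not from the tutorial) matches the normal curve's mean, N × p, and standard deviation, sqrt(N × p × (1-p)), to the binomial:

```python
from math import comb, exp, pi, sqrt

N, p = 100, 0.5
mean, sd = N * p, sqrt(N * p * (1 - p))   # mean = 50, s.d. = 5

def binomial(k):
    """Exact binomial probability of k successes in N trials."""
    return comb(N, k) * p**k * (1 - p)**(N - k)

def normal_density(x):
    """Normal density with the matching mean and standard deviation."""
    return exp(-(x - mean)**2 / (2 * sd**2)) / (sd * sqrt(2 * pi))

# Near the mean, the two already agree closely at N = 100:
print(binomial(50), normal_density(50))
```

At N = 100 the two values at the mean differ by only a few parts in a thousand, and the agreement improves as N grows.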
TEST YOUR UNDERSTANDING.
Do we really expect the expected results of a cross? Hmmm... In mice, individuals with either the BB or Bb genotype have black fur, while those with the bb genotype have brown fur. [We are ignoring other genes that can interact with this gene to produce other fur colors.] You cross true-breeding black and brown mice to produce heterozygotes, then cross these to produce an F2 generation with sixteen mouse pups. What is the exact probability that you will observe the expected result: twelve black mice and four brown mice?
Consider, then, the two closest outcomes: eleven black/five brown mice and thirteen black/three brown mice. How much more likely is the expected result than each of these alternative results?
Bayes' Theorem.
In traditional statistical analysis, we are estimating the probability of observed data given the hypothesis. Sometimes, however, we are interested in the inverse: the probability of a hypothesis given the observed data.
Consider the following scenario. A female human (Gladys) with an autosomal recessive phenotype has mated with a male human (Mickey) with the dominant phenotype. They have three offspring, all of whom show the dominant phenotype. What is the probability that Mickey was a heterozygote?
If we define A as the observed results (i.e., the data) and B as the hypothesis that Mickey is heterozygous and Gladys is homozygous recessive, we are interested in the value of p(B|A). As a conditional probability, p(B|A) = p(BA) ÷ p(A).
Consider, also, that p(A|B) = p(AB) ÷ p(B).
It should be obvious that p(AB) = p(BA): the probability that A and B both occur does not depend on the order in which we write them.
Therefore, rearranging the formula for p(A|B) and substituting p(BA) for p(AB), we get p(BA) = p(A|B) × p(B).
If we substitute this into the first formula, we get p(B|A) = [p(A|B) × p(B)] ÷ p(A).
This equation represents Bayes' Theorem. It has three components on the right side: the likelihood, p(A|B), the probability of the data given the hypothesis; the prior probability, p(B), the probability of the hypothesis before the data are considered; and p(A), the overall probability of the data. The left side, p(B|A), is called the posterior probability: the probability of the hypothesis given the data.
At first glance, solving the Mickey/Gladys problem might seem straightforward. We want to calculate the posterior probability of Mickey being a heterozygote given the observation that three children have the dominant phenotype. However, it turns out that only one of the terms on the right side of the formula can actually be calculated with the information provided: the likelihood. If Mickey is heterozygous (Gg) and Gladys is homozygous recessive (gg), each child independently has a 1/2 chance of showing the dominant phenotype, so p(A|B) = (1/2) × (1/2) × (1/2) = 1/8.
The other two terms, p(A) and p(B), cannot be calculated with the information provided. We need one more piece of information: the prior probability that Mickey is heterozygous. That is, before we had any offspring data, what was the chance that Mickey was heterozygous? It depends on his parents. If they were both heterozygous, then there is a 2/3 chance that Mickey is heterozygous and a 1/3 chance that he is homozygous dominant. [Remember, we are conditioning these probabilities on the observation that Mickey has the dominant phenotype. Therefore, we only consider the outcomes of the cross that produce dominant offspring.] But if Mickey's parents had different genotypes, the chance that he is a heterozygote will change. So we need more information. Here it is: let's assume that we had prior information that led us to believe that both of Mickey's parents were heterozygous.
Now we can plug in a value for p(B), the prior probability that Mickey is heterozygous and Gladys is homozygous. We know that Gladys has the gg genotype. We also know that Mickey has the dominant phenotype, so his genotype must be either GG or Gg. If both of his parents were heterozygous, then there is a 2/3 chance that Mickey is heterozygous. Therefore, we will assume that the prior probability of Mickey being heterozygous and Gladys being homozygous, p(B), is 2/3.
What about p(A)? This actually still has to be calculated. In terms of formal probability, p(A) = [p(A|B) × p(B)] + [p(A|~B) × p(~B)], where ~B ("not B") is the hypothesis that Mickey is not heterozygous.
So, first, what is p(A|B)? We calculated this already: 1/8. So, next, what is p(A|~B)? That is, what is the probability of seeing three phenotypically dominant offspring if Mickey is not heterozygous? Since Mickey has the dominant phenotype, not being heterozygous means he must have the homozygous dominant genotype. In that case, there is a 1.0 probability that the three offspring are phenotypically dominant. [GG × gg can only produce Gg offspring.] Therefore, using the formula above, p(A) = (1/8 × 2/3) + (1.0 × 1/3) = 1/12 + 4/12 = 5/12.
We are now ready to calculate the probability that Mickey is a heterozygote given the fact that he and Gladys have three phenotypically dominant offspring. From Bayes' Theorem, p(B|A) = [p(A|B) × p(B)] ÷ p(A) = 1/8 × 2/3 ÷ 5/12 = 1/8 × 2/3 × 12/5 = 0.2.
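The entire calculation can be checked in a few lines. The following Python sketch (not part of the tutorial; the variable names are invented) uses exact fractions to reproduce the numbers above:

```python
from fractions import Fraction

p_B      = Fraction(2, 3)              # prior: Mickey is Gg (both parents Gg)
p_not_B  = 1 - p_B                     # prior: Mickey is GG
p_A_B    = Fraction(1, 2) ** 3         # Gg x gg: each child dominant w.p. 1/2
p_A_notB = Fraction(1)                 # GG x gg: every child is Gg (dominant)

p_A = p_A_B * p_B + p_A_notB * p_not_B   # total probability of the data
posterior = p_A_B * p_B / p_A            # Bayes' theorem

print(p_A, posterior)   # 5/12 1/5
```

The complementary posterior, 1 - 1/5 = 4/5, is the probability that Mickey is homozygous dominant.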
This is a very important point: if we had made different assumptions about the genotypes of Mickey's parents, we would have obtained a different answer.
This is another very important point: the posterior probability of a hypothesis is generally different from the prior probability of that hypothesis. This is because the posterior probability is calculated after additional information (the data) has been provided.
Let's take the Mickey/Gladys problem one step further. Given the data, what are the relative likelihoods of our two competing hypotheses: B, the hypothesis that Mickey is heterozygous, and ~B, the hypothesis that Mickey is homozygous dominant (i.e., an individual with the dominant phenotype but not the Gg genotype)? We have already calculated the posterior probability that Mickey is heterozygous (assuming a prior probability of 2/3). We now must calculate the posterior probability that Mickey is homozygous (with the corresponding prior probability of 1/3). This can be written as p(~B|A) = [p(A|~B) × p(~B)] ÷ p(A) = (1.0 × 1/3) ÷ (5/12) = 0.8.
This should actually make sense. If we already calculated that the posterior probability of Mickey being a heterozygote is 0.2, then the posterior probability that he is not a heterozygote should be 1 - 0.2, or 0.8.
So, given the data, what are the relative likelihoods of the two competing hypotheses? p(~B|A) ÷ p(B|A) = 0.8 ÷ 0.2 = 4.
In other words, it is four times more likely that Mickey is a homozygote than it is that he is a heterozygote.
TEST YOUR UNDERSTANDING.
Consider a scenario where healthy individuals heterozygous for a recessive genetic disease represent 18% of the general population, while those with the disease represent 1% of the general population. A healthy male has undergone testing for the recessive allele and learns that he is heterozygous. His spouse is also healthy, but we do not know her genotype. They have a healthy child. What is the posterior probability that she is homozygous?
This next problem is pretty challenging. How many healthy children must they have before she can be more than 95% confident that she is homozygous? [Note: if they have even one child with the disease, the question is moot. We would know that she is heterozygous.]