This book explores important aspects of Markov and hidden Markov processes and the applications of these ideas to various problems in computational biology. The book starts from first principles, so that no previous knowledge of probability is necessary. However, the work is rigorous and mathematical, making it useful to engineers and mathematicians, even those not interested in biological applications. A range of exercises is provided, including drills to familiarize the reader with concepts and more advanced problems that require deep thinking about the theory. Biological applications are taken from post-genomic biology, especially genomics and proteomics.
The topics examined include standard material such as the Perron-Frobenius theorem, transient and recurrent states, hitting probabilities and hitting times, maximum likelihood estimation, the Viterbi algorithm, and the Baum-Welch algorithm. The book contains discussions of extremely useful topics not usually seen at the basic level, such as ergodicity of Markov processes, Markov Chain Monte Carlo (MCMC), information theory, and large deviation theory for both i.i.d. and Markov processes. The book also presents state-of-the-art realization theory for hidden Markov models. Among biological applications, it offers an in-depth look at the BLAST (Basic Local Alignment Search Tool) algorithm, including a comprehensive explanation of the underlying theory. Other applications such as profile hidden Markov models are also explored.
PART 1. PRELIMINARIES 1
Chapter 1. Introduction to Probability and Random Variables 3
1.1 Introduction to Random Variables 3
1.1.1 Motivation 3
1.1.2 Definition of a Random Variable and Probability 4
1.1.3 Function of a Random Variable, Expected Value 8
1.1.4 Total Variation Distance 12
1.2 Multiple Random Variables 17
1.2.1 Joint and Marginal Distributions 17
1.2.2 Independence and Conditional Distributions 18
1.2.3 Bayes' Rule 27
1.2.4 MAP and Maximum Likelihood Estimates 29
1.3 Random Variables Assuming Infinitely Many Values 32
1.3.1 Some Preliminaries 32
1.3.2 Markov and Chebycheff Inequalities 35
1.3.3 Hoeffding's Inequality 38
1.3.4 Monte Carlo Simulation 41
1.3.5 Introduction to Cramér's Theorem 43
Chapter 2. Introduction to Information Theory 45
2.1 Convex and Concave Functions 45
2.2 Entropy 52
2.2.1 Definition of Entropy 52
2.2.2 Properties of the Entropy Function 53
2.2.3 Conditional Entropy 54
2.2.4 Uniqueness of the Entropy Function 58
2.3 Relative Entropy and the Kullback-Leibler Divergence 61
Chapter 3. Nonnegative Matrices 71
3.1 Canonical Form for Nonnegative Matrices 71
3.1.1 Basic Version of the Canonical Form 71
3.1.2 Irreducible Matrices 76
3.1.3 Final Version of Canonical Form 78
3.1.4 Irreducibility, Aperiodicity, and Primitivity 80
3.1.5 Canonical Form for Periodic Irreducible Matrices 86
3.2 Perron-Frobenius Theory 89
3.2.1 Perron-Frobenius Theorem for Primitive Matrices 90
3.2.2 Perron-Frobenius Theorem for Irreducible Matrices 95
PART 2. HIDDEN MARKOV PROCESSES 99
Chapter 4. Markov Processes 101
4.1 Basic Definitions 101
4.1.1 The Markov Property and the State Transition Matrix 101
4.1.2 Estimating the State Transition Matrix 107
4.2 Dynamics of Stationary Markov Chains 111
4.2.1 Recurrent and Transient States 111
4.2.2 Hitting Probabilities and Mean Hitting Times 114
4.3 Ergodicity of Markov Chains 122
Chapter 5. Introduction to Large Deviation Theory 129
5.1 Problem Formulation 129
5.2 Large Deviation Property for I.I.D. Samples: Sanov's Theorem 134
5.3 Large Deviation Property for Markov Chains 140
5.3.1 Stationary Distributions 141
5.3.2 Entropy and Relative Entropy Rates 143
5.3.3 The Rate Function for Doubleton Frequencies 148
5.3.4 The Rate Function for Singleton Frequencies 158
Chapter 6. Hidden Markov Processes: Basic Properties 164
6.1 Equivalence of Various Hidden Markov Models 164
6.1.1 Three Different-Looking Models 164
6.1.2 Equivalence between the Three Models 166
6.2 Computation of Likelihoods 169
6.2.1 Computation of Likelihoods of Output Sequences 170
6.2.2 The Viterbi Algorithm 172
6.2.3 The Baum-Welch Algorithm 174
Chapter 7. Hidden Markov Processes: The Complete Realization Problem 177
7.1 Finite Hankel Rank: A Universal Necessary Condition 178
7.2 Nonsufficiency of the Finite Hankel Rank Condition 180
7.3 An Abstract Necessary and Sufficient Condition 190
7.4 Existence of Regular Quasi-Realizations 195
7.5 Spectral Properties of Alpha-Mixing Processes 205
7.6 Ultra-Mixing Processes 207
7.7 A Sufficient Condition for the Existence of HMMs 211
PART 3. APPLICATIONS TO BIOLOGY 223
Chapter 8. Some Applications to Computational Biology 225
8.1 Some Basic Biology 226
8.1.1 The Genome 226
8.1.2 The Genetic Code 232
8.2 Optimal Gapped Sequence Alignment 235
8.2.1 Problem Formulation 236
8.2.2 Solution via Dynamic Programming 237
8.3 Gene Finding 240
8.3.1 Genes and the Gene-Finding Problem 240
8.3.2 The GLIMMER Family of Algorithms 243
8.3.3 The GENSCAN Algorithm 246
8.4 Protein Classification 247
8.4.1 Proteins and the Protein Classification Problem 247
8.4.2 Protein Classification Using Profile Hidden Markov Models 249
Chapter 9. BLAST Theory 255
9.1 BLAST Theory: Statements of Main Results 255
9.1.1 Problem Formulations 255
9.1.2 The Moment Generating Function 257
9.1.3 Statement of Main Results 259
9.1.4 Application of Main Results 263
9.2 BLAST Theory: Proofs of Main Results 264
Bibliography 273
Index 285