Probability
This document omits basic intuitive definitions, such as comparison and arithmetic operations on random variables. Arithmetic operations on random variables are performed element-wise on the sample space. Please see the Lean file for complete details.
0.1 Probability
The treatment of probability mostly follows the treatment in
Deisenroth, M.P., Faisal, A.A., and Ong, C.S. (2021). Mathematics for Machine Learning.
Although this document states the values in \(\mathbb {R}\), everything in the Lean file is defined in terms of rational numbers \(\mathbb {Q}\) for computability reasons.
0.1.1 Prelude
This section sets up some basic tools for working with probabilities. See Probability/Prelude.lean.
A \(p\in \mathbb {R}\) is a probability value when \(p \ge 0 \wedge p \le 1\).
The Lean file also defines several basic properties for definition 0.1.1.1, which we do not summarize here.
A finite set of natural numbers is defined as \(\mathbb {N}_n := \left\{ 0, \dots , n-1 \right\} \).
A list \(\mathbb {L}_n \colon \mathbb {N}_n \to \tau \) of type \(\tau \) and length \(n\) is a function from \(\mathbb {N}_{n} := \left\{ 0, \dots , n-1 \right\} \) to \(\tau \).
The following definition generalizes the inverse of a function to non-injective functions, which do not have an inverse.
The preimage \(f^{-1}\colon \mathcal{Y} \to \mathcal{X}\) of a function \(f\colon \mathcal{X} \to \mathcal{Y}\) is defined as
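the set of all points that map to a given value. Since \(f\) need not be injective, the preimage is naturally set-valued (so that, strictly, \(f^{-1}\) maps into subsets of \(\mathcal{X}\)):
\[ f^{-1}(y) := \left\{ x \in \mathcal{X} \mid f(x) = y \right\} , \qquad y \in \mathcal{Y}. \]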
0.1.2 Defs
This section contains definitions of probability spaces, probability operators, and the expectation operator. See Probability/Defs.lean.
A finite probability distribution \(\Delta (n)\) for \(n\in \mathbb {N}\) is defined as
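the probability simplex (with \(\mathbb {R}\) replaced by \(\mathbb {Q}\) in the Lean file; the exact Lean encoding may differ):
\[ \Delta (n) := \left\{ p \in \mathbb {R}^{n} \;\middle|\; p_i \ge 0 \text{ for all } i \in \mathbb {N}_n, \ \sum _{i \in \mathbb {N}_n} p_i = 1 \right\} . \]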
We call a probability distribution degenerate if \(p_i = 1\) for some \(i\); otherwise the probability distribution is supported.
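For example, in \(\Delta (3)\) the distribution \((1, 0, 0)\) is degenerate, while \((\tfrac {1}{2}, \tfrac {1}{4}, \tfrac {1}{4})\) is supported.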
The formal model we use to represent random events and processes is the following.
A complete probability space consists of:
A sample space \(\Omega := \mathbb {N}_n\), which is the set of all possible outcomes.
An implicit event space \(\mathcal{F} := 2^{\Omega }\), which we do not define explicitly in Lean.
A probability function \(\mathbb {P} \in \Delta (n)\) that satisfies
\(\mathbb {P}(\Omega ) = 1\)
\(\mathbb {P}(A_1 \cup A_2) = \mathbb {P}(A_1) + \mathbb {P}(A_2)\) for all \(A_1, A_2\in \mathcal{F}\) such that \(A_1 \cap A_2 = \emptyset \).
The notation \(2^{\Omega }\) in definition 0.1.2.2 represents the power set of \(\Omega \).
The set \(\mathcal{F}\) is often, but not always, the power set of \(\Omega \). One reason it may not be the power set is that a smaller \(\sigma \)-algebra is necessary when \(\Omega \) is infinite; see Borel sets. Another reason to choose an \(\mathcal{F}\) that is not the power set is to represent the available information. We will see an example of this later. Unless specified otherwise, we assume that the event set for a finite sample space is its power set and that the \(\sigma \)-algebra of \(\mathbb {R}\) is the Borel \(\sigma \)-algebra.
The measure \(\mathbb {P}[\mathcal{S}]\) of the set \(\mathcal{S} \subseteq \mathbb {N}\) given a probability space is defined as
The definition requires that membership in the set is decidable.
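Concretely, in the finite setting this amounts to summing the probabilities of the outcomes that belong to \(\mathcal{S}\):
\[ \mathbb {P}\left[ \mathcal{S} \right] := \sum _{\omega \in \Omega ,\ \omega \in \mathcal{S}} p(\omega ), \]
where \(p(\omega )\) denotes the probability of the outcome \(\omega \). Decidability of \(\omega \in \mathcal{S}\) is what makes this sum computable.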
Suppose I want to model the behavior of coin flips. I have a coin that is equally likely to come up heads and tails. I flip the coin twice. One possible probability space would be
\(\Omega := \left\{ \mathrm{HH}, \mathrm{HT}, \mathrm{TH}, \mathrm{TT} \right\} \).
\(\mathcal{F} := 2^{\Omega } = \left\{ \emptyset , \Omega , \left\{ \mathrm{HH} \right\} , \left\{ \mathrm{HH}, \mathrm{HT} \right\} , \dots \right\} \).
\(P(\left\{ \mathrm{HH} \right\} ) := \frac{1}{4}, P(\left\{ \mathrm{HH}, \mathrm{HT} \right\} ) := \frac{1}{2}\). The probabilities of other sets are derived using the properties in definition 0.1.2.2.
For the sake of completeness, we also include the notion of measurability. However, since we are only concerned with complete probability spaces, all random variables we define will be measurable.
A function \(f\colon \mathcal{X} \to \mathcal{Y}\) with \(\sigma \)-algebras \(\mathcal{D} \subseteq 2^{\mathcal{X}}\) and \(\mathcal{E} \subseteq 2^{\mathcal{Y}}\) is measurable when the pre-image of each \(E\in \mathcal{E}\) is in \(\mathcal{D}\):
Here, the application of \(f^{-1}\) to a set \(E \subseteq \mathcal{Y}\) is defined as:
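\[ f^{-1}(E) := \left\{ x \in \mathcal{X} \mid f(x) \in E \right\} , \]
so that measurability requires \(f^{-1}(E) \in \mathcal{D}\) for every \(E \in \mathcal{E}\); these are the standard forms, which we assume match the Lean definitions.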
The measurability in the definition of random variables is required for us to reason about their probability distributions.
A random variable is a measurable function \(\tilde{x}\colon \mathbb {N}\to \mathcal{Y}\) where \(\mathcal{Y}\) is a measurable set. Note that, as defined, random variables are independent of any particular probability space. The measurability requirement is vacuous in complete probability spaces.
When manipulating random variables, we denote them with a tilde, such as \(\tilde{x}, \tilde{s}\). This corresponds to the common practice of using uppercase letters to denote random variables. We drop the tilde when treating the random variable as an ordinary function.
[TODO: We could get a lot of mileage from treating random variables as vectors. Expectation and probability are then linear operators on this vector space of random variables. This is only possible when using a complete vector space. Question: Do we need finiteness for this? That is, do random variables need to be defined for a finite \(\Omega \)?]
We define some ad-hoc operations on random variables. However, it would be better to treat them as a vector space. Then probability and expectation can be seen as linear operators on this vector space.
Another notable definition is that equality with a scalar \(\tilde{x} = x_0\) is interpreted as a boolean random variable \(\tilde{b}\colon \Omega \to \mathbb {B}\).
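Concretely, a reading consistent with the boolean random variables used below is
\[ \tilde{b}(\omega ) := \big ( \tilde{x}(\omega ) = x_0 \big ), \qquad \omega \in \Omega , \]
which is \(\operatorname {true}\) exactly on the outcomes where \(\tilde{x}\) attains the value \(x_0\).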
We use the operator (function) \(\mathbb {P}\) to denote the probability of the subset of the sample space that corresponds to a given condition. We use a non-standard definition for convenience (and show later that this definition equals the standard one).
For any complete probability space, \((\Omega , \mathcal{F}, P)\) and a random variable \(\tilde{b}\colon \Omega \to \mathbb {B}\), the operator \(\mathbb {P}\) is defined as
where the boolean is interpreted as \(0\) for false and \(1\) for true.
The definition can be seen as an inner product between the probabilities and the random variable.
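A form consistent with this inner-product view, writing \(p(\omega )\) for the probability of the outcome \(\omega \), is
\[ \mathbb {P}\left[ \tilde{b} \right] := \sum _{\omega \in \Omega } p(\omega )\, \tilde{b}(\omega ). \]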
The following theorem shows that our definition of a probability operator is equal to the standard definition.
For any probability space, \((\Omega , \mathcal{F}, P)\) and a random variable \(\tilde{b}\colon \Omega \to \mathbb {B}\):
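\[ \mathbb {P}\left[ \tilde{b} \right] = \mathbb {P}\left[ \left\{ \omega \in \Omega \mid \tilde{b}(\omega ) = \operatorname {true} \right\} \right] , \]
that is, the operator from the previous definition agrees with the measure of the event on which \(\tilde{b}\) is true.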
Probability distributions are typically defined for random variables. The distribution of a random variable is characterized by the cumulative distribution function (CDF), defined as follows.
The cumulative distribution function \(F_{\tilde{x}}\colon \mathbb {R}\to [0,1]\) of a real-valued random variable \(\tilde{x}\colon \Omega \to \mathbb {R}\) is defined as
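the probability that the random variable does not exceed a given threshold:
\[ F_{\tilde{x}}(x) := \mathbb {P}\left[ \tilde{x} \le x \right] , \qquad x \in \mathbb {R}. \]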
Some random variables also have a probability density function (PDF), and discrete random variables have a probability mass function (PMF). In this class, we will be primarily dealing with discrete random variables.
The probability mass function \(p_{\tilde{x}}\colon \mathcal{E} \to [0,1]\) for a discrete random variable \(\tilde{x}\colon \Omega \to \mathcal{E}\) with a finite \(\mathcal{E}\) is defined as
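the probability of each individual value:
\[ p_{\tilde{x}}(x) := \mathbb {P}\left[ \tilde{x} = x \right] , \qquad x \in \mathcal{E}. \]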
The joint probability of two random variables \(\tilde{x}\colon \Omega \to \mathcal{X}\) and \(\tilde{y} \colon \Omega \to \mathcal{Y}\) is defined as
One can also define a joint PMF for two discrete random variables \(p_{\tilde{x}, \tilde{y}}(x,y)\) for \(x\in \mathcal{X}\) and \(y\in \mathcal{Y}\) as
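the probability that both values occur simultaneously:
\[ p_{\tilde{x}, \tilde{y}}(x, y) := \mathbb {P}\left[ \tilde{x} = x \wedge \tilde{y} = y \right] . \]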
The conditional probability of \(\tilde{b}\colon \Omega \to \mathbb {B}\) on \(\tilde{c}\colon \Omega \to \mathbb {B}\) is defined as
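the joint probability normalized by the probability of the conditioning event:
\[ \mathbb {P}\left[ \tilde{b} \mid \tilde{c} \right] := \frac{\mathbb {P}\left[ \tilde{b} \wedge \tilde{c} \right] }{\mathbb {P}\left[ \tilde{c} \right] }, \]
following the convention, discussed below, that division by zero yields zero.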
Events \(A,B \in \mathcal{F}\) are independent when
Random variables \(\tilde{x}\colon \Omega \to \mathcal{X}\) and \(\tilde{y}\colon \Omega \to \mathcal{Y}\) are independent when
Discrete random variables \(\tilde{x}\) and \(\tilde{y}\) are independent if and only if
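their joint PMF factorizes:
\[ p_{\tilde{x}, \tilde{y}}(x, y) = p_{\tilde{x}}(x) \cdot p_{\tilde{y}}(y) \qquad \text{for all } x \in \mathcal{X},\ y \in \mathcal{Y}. \]
For events, the analogous standard condition is \(\mathbb {P}(A \cap B) = \mathbb {P}(A) \cdot \mathbb {P}(B)\); for general random variables, it is \(\mathbb {P}\left[ \tilde{x} \in X' \wedge \tilde{y} \in Y' \right] = \mathbb {P}\left[ \tilde{x} \in X' \right] \cdot \mathbb {P}\left[ \tilde{y} \in Y' \right] \) for all \(X' \subseteq \mathcal{X}\) and \(Y' \subseteq \mathcal{Y}\).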
The expectation of a random variable is its mean. As with probabilities, we use a non-standard definition for the sake of computational convenience and then show that it is equivalent to the common definition of expectation.
Suppose that \(\tilde{x}\colon \Omega \to \mathbb {R}\) is a random variable and that \(x(\Omega ) := \left\{ x(\omega ) \mid \omega \in \Omega \right\} \) is finite. Then
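a computational form consistent with this setup is the outcome-weighted sum
\[ \mathbb {E}\left[ \tilde{x} \right] := \sum _{\omega \in \Omega } p(\omega )\, \tilde{x}(\omega ), \]
where \(p(\omega )\) denotes the probability of the outcome \(\omega \).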
The following theorem shows that our definition of the expected value equals the standard definition.
Suppose that \(\tilde{x}\colon \Omega \to \mathbb {R}\) is a random variable and that \(x(\Omega ) := \left\{ x(\omega ) \mid \omega \in \Omega \right\} \) is finite. Then
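the expectation can be written in the standard value-weighted form
\[ \mathbb {E}\left[ \tilde{x} \right] = \sum _{x \in x(\Omega )} x \cdot \mathbb {P}\left[ \tilde{x} = x \right] . \]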
In general probability spaces, the expectation is defined using the Lebesgue integral \(\mathbb {E}\left[ \tilde{x} \right] = \int _{\Omega } \tilde{x} \, \mathrm{d}\mathbb {P}\), which is equivalent to the form in the theorem above in finite probability spaces.
Some of the most important properties of expectations are as follows.
Assume \(\tilde{x},\tilde{y}\colon \Omega \to \mathbb {R}\) and \(c\in \mathbb {R}\). Then:
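\[ \mathbb {E}\left[ \tilde{x} + \tilde{y} \right] = \mathbb {E}\left[ \tilde{x} \right] + \mathbb {E}\left[ \tilde{y} \right] , \qquad \mathbb {E}\left[ c \cdot \tilde{x} \right] = c \cdot \mathbb {E}\left[ \tilde{x} \right] , \qquad \tilde{x} \le \tilde{y} \ \Rightarrow \ \mathbb {E}\left[ \tilde{x} \right] \le \mathbb {E}\left[ \tilde{y} \right] ; \]
that is, expectation is linear and monotone (the Lean file may include additional properties).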
A random variable defined on a finite probability space \(P\) is a mapping \(\tilde{x}\colon \Omega \to \mathbb {R}\).
0.1.3 Induction
This section includes properties that aim to enable one to perform induction on probability spaces.
0.1.4 Basic
This section states the basic properties of probability spaces and operators.
A finite probability measure \(p\colon \Omega \to \mathbb {R}_+\) on a finite set \(\Omega \) is any function that satisfies
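non-negativity, which is already built into the codomain \(\mathbb {R}_+\), and normalization:
\[ \sum _{\omega \in \Omega } p(\omega ) = 1. \]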
For any discrete \(\tilde{x}\colon \Omega \to \mathbb {R}\) and \(g\colon \mathbb {R}\to \mathbb {R}\):
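\[ \mathbb {E}\left[ g(\tilde{x}) \right] = \sum _{x \in \tilde{x}(\Omega )} g(x) \cdot \mathbb {P}\left[ \tilde{x} = x \right] , \]
a standard change-of-variables identity in the finite setting.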
For the remainder of ??, we assume that \(P = (\Omega , p)\) is a finite probability space. All random variables are defined on the space \(P\) unless specified otherwise.
0.1.5 OLD CONTENT: TO BE MOVED
Here, we define probability and expectation operators.
A boolean set is \(\mathbb {B} = \left\{ \operatorname {false}, \operatorname {true}\right\} \).
The expectation of a random variable \(\tilde{x} \colon \Omega \to \mathbb {R}\) is
An indicator function \(\mathbb {I}\colon \mathbb {B} \to \left\{ 0, 1 \right\} \) is defined for \(b\in \mathbb {B}\) as
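the standard 0–1 encoding
\[ \mathbb {I}(b) := \begin{cases} 1 & \text{if } b = \operatorname {true}, \\ 0 & \text{if } b = \operatorname {false}. \end{cases} \]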
The conditional expectation of \(\tilde{x} \colon \Omega \to \mathbb {R}\) conditioned on \(\tilde{b} \colon \Omega \to \mathbb {B}\) is defined as
where we define that \(x / 0 = 0\) for each \(x\in \mathbb {R}\).
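A form consistent with this convention is
\[ \mathbb {E}\left[ \tilde{x} \mid \tilde{b} \right] := \frac{\mathbb {E}\left[ \tilde{x} \cdot \mathbb {I}(\tilde{b}) \right] }{\mathbb {P}\left[ \tilde{b} \right] }, \]
which evaluates to \(0\) whenever \(\mathbb {P}\left[ \tilde{b} \right] = 0\).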
It is common to prohibit conditioning on a zero probability event both for expectation and probabilities. In this document, we follow the Lean convention, where the division by \(0\) is \(0\); see div_zero. However, even some basic probability and expectation results may require that we assume that the conditioned event does not have probability zero for it to hold.
The random conditional expectation of a random variable \(\tilde{x} \colon \Omega \to \mathbb {R}\) conditioned on \(\tilde{y} \colon \Omega \to \mathcal{Y}\) for a finite set \(\mathcal{Y}\) is the random variable \(\mathbb {E}\left[ \tilde{x} \mid \tilde{y} \right]\colon \Omega \to \mathbb {R}\) defined as
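conditioning pointwise on the observed value of \(\tilde{y}\) (the standard form):
\[ \mathbb {E}\left[ \tilde{x} \mid \tilde{y} \right] (\omega ) := \mathbb {E}\left[ \tilde{x} \mid \tilde{y} = \tilde{y}(\omega ) \right] , \qquad \omega \in \Omega . \]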
The Lean file defines expectations more broadly for a data type \(\rho \), which is more general than just \(\mathbb {R}\). The main reason is to cover both \(\mathbb {R}\) and \(\mathbb {R}_+\). However, in principle, the definitions could be used to reason about expectations that go beyond real numbers and may include other algebras, such as vectors or matrices.
Suppose that \(\tilde{b}, \tilde{c} \colon \Omega \to \mathbb {B}\). Then:
where the equality applies for all \(\omega \in \Omega \).
Suppose that \(\tilde{c} \colon \Omega \to \mathbb {B}\) such that \(\mathbb {P}\left[ \tilde{c} \right] = 0\). Then for any \(\tilde{x} \colon \Omega \to \mathbb {R}\):
Immediate from the definition and the fact that \(0 \cdot x = 0\) for \(x\in \mathbb {R}\).
Suppose that \(\tilde{c} \colon \Omega \to \mathbb {B}\) such that \(\mathbb {P}\left[ \tilde{c} \right] = 0\). Then for any \(\tilde{b} \colon \Omega \to \mathbb {B}\):
Immediate from theorem 0.1.5.9.
Suppose that \(\tilde{b}, \tilde{c} \colon \Omega \to \mathbb {B}\), then
The property holds immediately when \(\mathbb {P}\left[ \tilde{c} \right] = 0\). Assume that \(\mathbb {P}\left[ \tilde{c} \right] {\gt} 0\). Then:
Let \(\tilde{y}\colon \Omega \to \mathcal{Y}\) with a finite \(\mathcal{Y}\). Then
theorem 0.1.5.19 shows the equivalence of expectations for surely equal random variables.
Random variables \(\tilde{x}, \tilde{y} \colon \Omega \to \mathbb {R}\) satisfy that
From the distributive property of sums.
A random variable \(\tilde{x}\colon \Omega \to \mathbb {R}\) and \(c\in \mathbb {R}\) satisfy that
Suppose that \(\tilde{x} \colon \Omega \to \mathbb {R}\) and \(c\in \mathbb {R}\). Then
From ??.
Suppose that \(\tilde{x}, \tilde{y} \colon \Omega \to \mathbb {R}\) and \(\tilde{z}\colon \Omega \to \mathcal{V}\) are random variables and \(c\in \mathbb {R}\), such that \(\tilde{y}(\omega ) = c + \tilde{x}(\omega )\). Then
From theorem 0.1.5.16.
Suppose that \(\tilde{x}, \tilde{y} \colon \Omega \to \mathbb {R}\) satisfy that
Then
Suppose that \(\tilde{x}, \tilde{z} \colon \Omega \to \mathbb {R}\) satisfy that
Then
Immediate from the congruence of sums.
0.1.6 The Laws of the Unconscious Statistician
Let \(\tilde{x} \colon \Omega \to \mathbb {R}\) be a random variable. Then:
Let \(\mathcal{X} := \tilde{x}(\Omega )\), which is a finite set. Then:
The following theorem generalizes the theorem above.
Let \(\tilde{x} \colon \Omega \to \mathbb {R}\) and \(\tilde{b} \colon \Omega \to \mathcal{Y}\) be random variables. Then:
Let \(\tilde{x} \colon \Omega \to \mathbb {R}\) and \(\tilde{y} \colon \Omega \to \mathcal{Y}\) be random variables with \(\mathcal{Y}\) finite. Then:
0.1.7 Total Expectation and Probability
Let \(\tilde{b} \colon \Omega \to \mathbb {B}\) and \(\tilde{y} \colon \Omega \to \mathcal{Y}\) be random variables with a finite set \(\mathcal{Y}\). Then:
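\[ \mathbb {P}\left[ \tilde{b} \right] = \sum _{y \in \mathcal{Y}} \mathbb {P}\left[ \tilde{y} = y \right] \cdot \mathbb {P}\left[ \tilde{b} \mid \tilde{y} = y \right] , \]
which is the law of total probability in its finite form.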
Let \(\tilde{x} \colon \Omega \to \mathcal{X}\) and \(\tilde{y} \colon \Omega \to \mathcal{Y}\) be random variables with a finite set \(\mathcal{Y}\). Then:
Recall that we are allowing the division by 0 and assume that \(x / 0 = 0\).
Above, we used the fact that
which follows by analyzing two cases. First, when \(p(\omega ') = 0\), then the equality holds immediately. If \(p(\omega ') {\gt} 0\) then by lemma 0.1.5.12, \(\mathbb {P}\left[ \tilde{y} = \tilde{y}(\omega ')\right] {\gt} 0\), we get from ?? that
which completes the step.
The following proof is simpler but may require some more advanced properties.
0.2 Formal Decision Framework
0.2.1 Markov Decision Process
A Markov decision process \(M := (S, A, P, r)\) consists of:
a positive integer \(S \in \mathbb {N}_{{\gt}0}\) representing the number of states, with index set \(\operatorname {Fin}(S) = \{ 0,1, \dots , S-1\} \)
a positive integer \(A \in \mathbb {N}_{{\gt}0}\) representing the number of actions, with index set \(\operatorname {Fin}(A) = \{ 0,1, \dots , A-1\} \)
a transition function \(P \colon \operatorname {Fin}(S) \times \operatorname {Fin}(A) \to \Delta (\operatorname {Fin}(S))\), mapping each state–action pair to a finite probability distribution over next states
a reward function \(r \colon \operatorname {Fin}(S) \times \operatorname {Fin}(A) \times \operatorname {Fin}(S) \to \mathbb {R}\), mapping each transition \((s,a,s')\) to a real number
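As a small illustration (a toy example, not taken from the Lean file), let \(S = 2\) and \(A = 1\), let \(P(s, 0)\) be the uniform distribution over \(\operatorname {Fin}(2)\) for every state \(s\), and let \(r(s, a, s') := 1\) if \(s' = 1\) and \(0\) otherwise. Each \(P(s, a)\) is a valid element of \(\Delta (\operatorname {Fin}(S))\) because its entries are non-negative and sum to one.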
0.2.2 Histories
In the remainder of this section, we implicitly assume an MDP \(M = (S, A, P, r)\).
A history \(h\) in a set of histories \(\mathcal{H}\) is a sequence of states and actions defined for \(M\) recursively as
where \(s_0, s \in \operatorname {Fin}(S)\), \(a \in \operatorname {Fin}(A)\), and \(h'\in \mathcal{H}\).
The length \(|h|\in \mathbb {N}\) of a history \(h\in \mathcal{H}\) is defined as
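the number of decisions taken, via the natural recursion
\[ |\langle s_0 \rangle | := 0, \qquad |\langle h', a, s \rangle | := |h'| + 1, \]
which is consistent with the convention below that \(s_{|h|}\) denotes the last state of \(h\).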
The set \(\mathcal{H}_{\mathrm{NE}}\) of non-empty histories is
Following histories \(\mathcal{H}(h,t) \subseteq \mathcal{H}\) for \(h\in \mathcal{H}\) of length \(t\in \mathbb {N}\) are defined recursively as
The set of histories \(\mathcal{H}_{t}\) of length \(t \in \mathbb {N}\) is defined recursively as
For \(h \in \mathcal{H}\):
The theorem follows by induction on \(t\) from the definition.
We use \(\tilde{s}_k\colon \mathcal{H} \to \mathcal{S}\) to denote the 0-based \(k\)-th state of each history.
We use \(\tilde{a}_k\colon \mathcal{H} \to \mathcal{A}\) to denote the 0-based \(k\)-th action of each history.
The history-reward random variable \(\tilde{r}^{\mathrm{h}}\colon \mathcal{H} \to \mathbb {R}\) for \(h = \langle h', a, s \rangle \in \mathcal{H}\), where \(h'\in \mathcal{H}\), \(a\in \mathcal{A}\), and \(s\in \mathcal{S}\), is defined recursively as
The history-reward random variable \(\tilde{r}^{\mathrm{h}}_k \colon \mathcal{H} \to \mathbb {R}\) for \(h = \langle h', a, s \rangle \in \mathcal{H}\), where \(h'\in \mathcal{H}\), \(a\in \mathcal{A}\), and \(s\in \mathcal{S}\), is defined as the \(k\)-th reward (0-based) of a history.
The history-reward random variable \(\tilde{r}^{\mathrm{h}}_{\le k} \colon \mathcal{H} \to \mathbb {R}\) for \(h = \langle h', a, s \rangle \in \mathcal{H}\), where \(h'\in \mathcal{H}\), \(a\in \mathcal{A}\), and \(s\in \mathcal{S}\), is defined as the sum of all \(k\)-th or earlier rewards (0-based) of a history.
The history-reward random variable \(\tilde{r}^{\mathrm{h}}_{\ge k}\colon \mathcal{H} \to \mathbb {R}\) for \(h = \langle h', a, s \rangle \in \mathcal{H}\), where \(h'\in \mathcal{H}\), \(a\in \mathcal{A}\), and \(s\in \mathcal{S}\), is defined as the sum of all \(k\)-th or later rewards (0-based) of a history.
0.2.3 Policies
The set of decision rules \(\mathcal{D}\) is defined as \(\mathcal{D} := \mathcal{A}^{\mathcal{S}}. \) A single action \(a \in \mathcal{A}\) can also be interpreted as a decision rule \(d := s \mapsto a\).
The set of history-dependent policies is \(\Pi _{\mathrm{HR}}:= \Delta (\mathcal{A})^{\mathcal{H}}. \)
The set of Markov deterministic policies \(\Pi _{\mathrm{MD}}\) is \(\Pi _{\mathrm{MD}}:= \mathcal{D}^{\mathbb {N}}. \) A Markov deterministic policy \(\pi \in \Pi _{\mathrm{MD}}\) can also be interpreted as \(\bar{\pi } \in \Pi _{\mathrm{HR}}\):
where \(\delta \) is the Dirac distribution, and \(s_{|h|}\) is the history’s last state.
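Concretely, a reading consistent with this description is
\[ \bar{\pi }(h) := \delta _{\pi _{|h|}(s_{|h|})}, \qquad h \in \mathcal{H}, \]
that is, the decision rule for the current time step \(|h|\) is applied to the last state of the history.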
The set of stationary deterministic policies \(\Pi _{\mathrm{SD}}\) is defined as \(\Pi _{\mathrm{SD}} := \mathcal{D}. \) A stationary policy \(\pi \in \Pi _{\mathrm{SD}}\) can be interpreted as \(\bar{\pi } \in \Pi _{\mathrm{HR}}\):
where \(\delta \) is the Dirac distribution and \(s_{|h|}\) is the history’s last state.
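Analogously,
\[ \bar{\pi }(h) := \delta _{\pi (s_{|h|})}, \qquad h \in \mathcal{H}, \]
so the same decision rule is applied at every time step.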
0.2.4 Distribution
The history probability distribution \(p^{\mathrm{h}}_T \colon \Pi _{\mathrm{HR}}\to \Delta (\mathcal{H}(\hat{h},T))\) for \(\pi \in \Pi _{\mathrm{HR}}\) is defined for each \(T \in \mathbb {N}\), \(\hat{h}\in \mathcal{H}\), and \(h\in \mathcal{H}(\hat{h},T)\) as
Moreover, the function \(p^{\mathrm{h}}\) maps policies to valid probability distributions.
TODO: This definition needs to be updated. A probability space \((\Omega _{h,t}, 2^{\Omega _{h,t} }, \hat{p}_{h,\pi })\) is defined as
for each \(\left\langle h', a, s \right\rangle \in \Omega _{h,t}\). The random variables are defined as \(\tilde{s}_k(h') := s_k(h')\) and \(\tilde{a}_k(h') := a_k(h')\) for all \(h'\in \Omega _{h,t}\). We interpret the subscripts analogously on all operators, including \(\mathbb {E}\), \(\mathbb {P}\), and other risk measures.
The history-dependent expectation is defined for each \(t\in \mathbb {N}\), \(\pi \in \Pi _{\mathrm{HR}}\), \(\hat{h}\in \mathcal{H}\) and a \(\tilde{x}\colon \mathcal{H} \to \mathbb {R}\) as
In the \(\mathbb {E}\) operator above, the random variable \(\tilde{x}\) lives in a probability space \((\Omega , p)\) where \(\Omega = \mathcal{H}(\hat{h}, t)\) and \(p(h) = p^{\mathrm{h}}(h, \pi ), \forall h\in \Omega \). Moreover, if \(\hat{h}\) is a state, then it is interpreted as a history with the single initial state.
The history-dependent expectation is defined for each \(t\in \mathbb {N}\), \(\pi \in \Pi _{\mathrm{HR}}\), \(\hat{h}\in \mathcal{H}\), \(\tilde{x}\colon \mathcal{H} \to \mathbb {R}\), and \(\tilde{b}\colon \mathcal{H} \to \mathbb {B}\) as
In the \(\mathbb {E}\) operator above, the random variables \(\tilde{x}\) and \(\tilde{b}\) live in a probability space \((\Omega , p)\) where \(\Omega = \mathcal{H}(\hat{h}, t)\) and \(p(h) = p^{\mathrm{h}}(h, \pi ), \forall h\in \Omega \). Moreover, if \(\hat{h}\) is a state, then it is interpreted as a history with the single initial state.
The history-dependent expectation is defined for each \(t\in \mathbb {N}\), \(\pi \in \Pi _{\mathrm{HR}}\), \(\hat{h}\in \mathcal{H}\), \(\tilde{x}\colon \mathcal{H} \to \mathbb {R}\), and \(\tilde{y}\colon \mathcal{H} \to \mathcal{V}\) as
In the \(\mathbb {E}\) operator above, the random variables \(\tilde{x}\) and \(\tilde{y}\) live in a probability space \((\Omega , p)\) where \(\Omega = \mathcal{H}(\hat{h}, t)\) and \(p(h) = p^{\mathrm{h}}(h, \pi ), \forall h\in \Omega \). Moreover, if \(\hat{h}\) is a state, then it is interpreted as a history with the single initial state.
0.2.5 Basic Properties
Assume \(\tilde{x} \colon \mathcal{H} \to \mathbb {R}\) and \(c \in \mathbb {R}\). Then \(\forall h\in \mathcal{H}, \pi \in \Pi _{\mathrm{HR}}, t \in \mathbb {N}\):
Directly from theorem 0.1.5.19.
Suppose that \(\tilde{x}, \tilde{y} \colon \mathcal{H} \to \mathbb {R}\). Then \(\forall h\in \mathcal{H}, \pi \in \Pi _{\mathrm{HR}}, t \in \mathbb {N}\):
From theorem 0.1.5.14.
Suppose that \(c\in \mathbb {R}\). Then \(\forall h\in \mathcal{H}, \pi \in \Pi _{\mathrm{HR}}, t \in \mathbb {N}\):
From theorem 0.1.5.15.
Suppose that \(\tilde{x}, \tilde{y} \colon \mathcal{H} \to \mathbb {R}\) satisfy that \(\tilde{x}(h) = \tilde{y}(h), \forall h \in \mathcal{H}\). Then \(\forall h\in \mathcal{H}, \pi \in \Pi _{\mathrm{HR}}, t \in \mathbb {N}\):
From theorem 0.1.5.16.
For each \(\hat{h}\in \mathcal{H}\), \(\pi \in \Pi _{\mathrm{HR}}\), and \(t\in \mathbb {N}\):
where \(\tilde{\operatorname {id}}(h)\) is the identity function, \(|\cdot |\) is the length of a history (0-based), \(\tilde{s}_k\colon \mathcal{H} \to \mathcal{S}\) and \(\tilde{a}_k\colon \mathcal{H} \to \mathcal{A}\) are the 0-based \(k\)-th state and action, respectively of each history.
Follows from theorem 0.2.5.1 and the equality of the reward function \(\tilde{r}^{\mathrm{h}}\) and the sum in the expectation.
For each \(h\in \mathcal{H}\), \(\pi \in \Pi _{\mathrm{HR}}\), and \(t\in \mathbb {N}\):
where \(k_0:=|h|\).
Follows from theorem 0.2.5.4.
For each \(\hat{h}\in \mathcal{H}\), \(\pi \in \Pi _{\mathrm{HR}}\), \(t\in \mathbb {N}\), \(h\in \mathcal{H}\):
where \(k_0:= |\hat{h}|\).
From theorem 0.1.5.15.
Assume \(h \in \mathcal{H}\) and \(f \colon \mathcal{H} \to \mathbb {R}\) such that \(s_0 := s_{|h|}(h)\)
Then
Directly from the definition of the expectation.
0.2.6 Total Expectation
For each \(h\in \mathcal{H}\), \(\pi \in \Pi _{\mathrm{HR}}\), \(t\in \mathbb {N}\), \(\tilde{x}\colon \mathcal{H} \to \mathbb {R}\) and \(\tilde{y}\colon \mathcal{H} \to \mathcal{V}\):
From theorem 0.1.7.2.
Suppose that the random variable \(\tilde{x}\colon \mathcal{H} \to \mathbb {R}\) satisfies for some \(k, t \in \mathbb {N}\), with \(k \le t\), that
where \(h_{\le k}\) is the prefix of \(h\) of length \(k\). Then for each \(h\in \mathcal{H}, \pi \in \Pi _{\mathrm{HR}}\):
0.2.7 Conditional Properties
For each \(\beta {\gt}0\), \(h\in \mathcal{H}\), \(\pi \in \Pi _{\mathrm{HR}}\), \(t\in \mathbb {N}\), \(\tilde{x}\colon \mathcal{H} \to \mathbb {R}\), \(s\in \mathcal{S}\), \(a\in \mathcal{A}\):
Let
where the inequality holds from the hypothesis. Also, let
Note that
which can be seen by algebraic manipulation from (12).
Using the notation above, we can show the result as
0.3 Dynamic Program: History-Dependent Finite Horizon
In this section, we derive dynamic programming equations for histories. We assume an MDP \(M = (\mathcal{S}, \mathcal{A}, p, r)\) throughout this section.
The main idea of the proof is to:
Derive (exponential-size) dynamic programming equations for the history-dependent value function of history-dependent policies
Define the value function
Define an optimal value function
Show that value functions decompose to equivalence classes
Show that the value function for the equivalence classes can be computed efficiently
0.3.1 Definitions
A finite horizon objective is given by \(O := (s_0, T)\) where \(s_0\in \mathcal{S}\) is the initial state and \(T\in \mathbb {N}\) is the horizon.
In the remainder of the section, we assume an objective \(O = (s_0, T)\).
The finite horizon objective function for an objective \(O\) and a policy \(\pi \in \Pi _{\mathrm{HR}}\) is defined as
A policy \(\pi ^{\star }\in \Pi _{\mathrm{HR}}\) is return optimal for an objective \(O\) if
The set of history-dependent value functions \(\mathcal{U}\) is defined as
A history-dependent policy value function \(\hat{u}^{\pi }_t\colon \mathcal{H} \to \mathbb {R}\) for each \(h\in \mathcal{H}\), \(\pi \in \Pi _{\mathrm{HR}}\), and \(t\in \mathbb {N}\) is defined as
The optimal history-dependent value function \(\hat{u}_t^{\star }\colon \mathcal{H} \to \mathbb {R}\) is defined for a horizon \(t\in \mathbb {N}\) as
The following definition is another way of defining an optimal policy.
For each \(t\in \mathbb {N}\), a policy \(\pi ^{\star }\in \Pi _{\mathrm{HR}}\) is optimal if
A policy \(\pi ^{\star }\in \Pi _{\mathrm{HR}}\) that is optimal in the sense of definition 0.3.1.7 is also return optimal for any initial state \(s_0\) and horizon \(T\).
0.3.2 History-dependent Dynamic Program
The following definitions of history-dependent value functions use a dynamic program formulation.
The history-dependent policy Bellman operator \(L_{\mathrm{h}}^{\pi } \colon \mathcal{U} \to \mathcal{U}\) is defined for each \(\pi \in \Pi _{\mathrm{HR}}\) as
where the value function \(\tilde{u}\) is interpreted as a random variable defined on the sample space \(\Omega = \mathcal{H}\).
The history-dependent optimal Bellman operator \(L_{\mathrm{h}}^{\star }\colon \mathcal{U} \to \mathcal{U}\) is defined as
where the value function \(\tilde{u}\) is interpreted as a random variable defined on the sample space \(\Omega = \mathcal{H}\).
The history-dependent DP value function \(u_t^{\pi } \in \mathcal{U}\) for a policy \(\pi \in \Pi _{\mathrm{HR}}\) and \(t\in \mathbb {N}\) is defined as
The history-dependent DP value function \(u_t^{\star }\in \mathcal{U}\) for \(t\in \mathbb {N}\) is defined as
Suppose that \(u^1, u^2 \in \mathcal{U}\) satisfy that \(u^1(h) \ge u^2(h), \forall h\in \mathcal{H}\). Then
From theorem 0.1.5.18.
The following theorem shows that the history-dependent value function can be computed by the dynamic program.
For each \(\pi \in \Pi _{\mathrm{HR}}\) and \(t\in \mathbb {N}\):
By induction on \(t\). The base case for \(t=0\) follows from the definition. The inductive case for \(t + 1\) follows for each \(h\in \mathcal{H}\) when \(|h| = k_0\) as
The following theorem is the analogous result for the optimal value function.
For each \(t\in \mathbb {N}\):
By induction on \(t\). The base case is immediate. The inductive case follows for \(t+1\) as follows. For each \(\pi \in \Pi _{\mathrm{HR}}\):
0.4 Expected Dynamic Program: Markov Policy
0.4.1 Optimality
We discuss results needed to prove the optimality of Markov policies.
The set of independent value functions is defined as \(\mathcal{V} := \mathbb {R}^{\mathcal{S}}\).
A Markov Bellman operator \(L^{\star }\colon \mathcal{V} \to \mathcal{V}\) is defined as
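the standard one-step optimization over actions followed by expectation over next states:
\[ (L^{\star } v)(s) := \max _{a \in \mathcal{A}} \sum _{s' \in \mathcal{S}} P(s, a)(s') \, \big ( r(s, a, s') + v(s') \big ), \qquad s \in \mathcal{S}, \]
where \(P(s, a)(s')\) denotes the probability that the distribution \(P(s, a)\) assigns to the next state \(s'\); we expect the Lean definition to match this form.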
The optimal value function \(v^{\star }_t\in \mathcal{V}, t\in \mathbb {N}\) is defined as
Suppose that \(t\in \mathbb {N}\). Then:
By induction on \(t\). The base case follows immediately from the definition. The inductive step for \(t+1\) follows as:
The optimal finite-horizon policy \(\pi ^{\star }_t, t\in \mathbb {N}\) is defined as
where \(a_0\) is an arbitrary action.
Assume a horizon \(T\in \mathbb {N}\). Then:
Fix some \(T \in \mathbb {N}\). By induction on \(k\) from \(k = T\) to \(k=0\). The base case is immediate from the definition. We prove the inductive case for \(k-1\) from \(k\) as
Here, \(k := |h|\), \(a^{\star }:= \pi ^{\star }_T(k, s_0)\), and \(s_0 := s_{|h|}(h)\)
0.4.2 Evaluation
We discuss results pertinent to the evaluation of Markov policies.
Markov value functions depend on the length of the history.
The set of Markov value functions is defined as \(\mathcal{V}_{\mathrm{M}} := \mathbb {R}^{\mathbb {N}\times \mathcal{S}}\).
A Markov policy Bellman operator \(L_k^{\pi }\colon \mathcal{V} \to \mathcal{V}\) for \(\pi \in \Pi \) is defined as
The Markov policy value function \(v^{\pi }_t\in \mathcal{V}_{\mathrm{M}}, t \in \mathbb {N}\) for \(\pi \in \Pi _{\mathrm{MD}}\) is defined as