State-Value and Action-Value Functions
State-Value Function: It is the expected return of being in state \(s\) and following policy \(\pi\) thereafter.
\[v_{\pi} (s) = \mathbb{E}_{\pi} \left[ G_t | S_t = s \right]\]
\[\implies v_{\pi} (s) = \mathbb{E}_{\pi} \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s \right]\]
\[\implies v_{\pi} (s) = \mathbb{E}_{\pi} \left[ R_{t+1} + \gamma G_{t+1} | S_t = s, A_t \sim \pi (s) \right]\]
\[\therefore v_{\pi} (s) = \mathbb{E}_{\pi} \left[ R_{t+1} + \gamma v_{\pi} (S_{t+1}) | S_t = s, A_t \sim \pi (s) \right]\]
This is known as the Bellman expectation equation.
Another form of it, written in terms of the dynamics \(p\) and the policy \(\pi\), is:
\[\therefore v_{\pi} (s) = \sum_{a \in A(s)} \pi (a | s) \sum_{s' \in S, r \in R} p (s', r | s, a) (r + \gamma v_{\pi} (s'))\]
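This backup can be applied as a simple iterative procedure: sweep over the states and replace \(v_{\pi}(s)\) by the right-hand side until the values stop changing (iterative policy evaluation). The sketch below assumes a small tabular MDP whose dynamics are stored as ``P[s][a]`` = list of ``(prob, next_state, reward)`` triples and whose policy is an array ``pi[s, a]``; these names and the data layout are illustrative, not part of the text above.

```python
import numpy as np

def policy_evaluation(P, pi, n_states, n_actions, gamma=0.9, tol=1e-8):
    """Evaluate v_pi by repeatedly applying the Bellman expectation backup.

    P[s][a]  : list of (prob, next_state, reward) triples (assumed layout)
    pi[s, a] : probability of taking action a in state s (assumed layout)
    """
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # v_pi(s) = sum_a pi(a|s) * sum_{s',r} p(s',r|s,a) * (r + gamma * v_pi(s'))
            new_v = sum(
                pi[s, a] * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v
```

On a small tabular problem this in-place sweep converges to the same \(v_{\pi}\) as solving the corresponding linear system directly.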
Action-Value Function: It is the expected return of being in state \(s\), having taken action \(a\), and following policy \(\pi\) thereafter.
\[q_{\pi} (s,a) = \mathbb{E}_{\pi} \left[ G_t | S_t = s, A_t = a \right]\]
\[\implies q_{\pi} (s,a) = \mathbb{E}_{\pi} \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s, A_t = a \right]\]
\[\implies q_{\pi} (s,a) = \mathbb{E}_{\pi} \left[ R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a \right]\]
\[\therefore q_{\pi} (s,a) = \mathbb{E}_{\pi} \left[ R_{t+1} + \gamma v_{\pi} (S_{t+1}) | S_t = s, A_t = a \right]\]
\[\text{or,} \quad q_{\pi} (s,a) = \mathbb{E}_{\pi} \left[ R_{t+1} + \gamma q_{\pi} (S_{t+1}, A_{t+1}) | S_t = s, A_t = a \right]\]
This is known as the Bellman expectation equation for \(q_{\pi}\).
Another form of it is:
\[\therefore q_{\pi} (s, a) = \sum_{s' \in S, r \in R} p (s', r | s, a) \left( r + \gamma \sum_{a' \in A(s')} \pi (a'|s') \, q_{\pi} (s', a') \right)\]
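The same sweep idea works for \(q_{\pi}\): each update plugs \(\sum_{a'} \pi(a'|s')\, q_{\pi}(s', a')\) (which equals \(v_{\pi}(s')\)) into the right-hand side. The sketch below reuses the same hypothetical ``P`` / ``pi`` layout as before.

```python
import numpy as np

def q_policy_evaluation(P, pi, n_states, n_actions, gamma=0.9, tol=1e-8):
    """Evaluate q_pi by sweeping the Bellman expectation backup for q.

    P[s][a]  : list of (prob, next_state, reward) triples (assumed layout)
    pi[s, a] : probability of taking action a in state s (assumed layout)
    """
    q = np.zeros((n_states, n_actions))
    while True:
        delta = 0.0
        for s in range(n_states):
            for a in range(n_actions):
                # q_pi(s,a) = sum_{s',r} p(s',r|s,a) *
                #             (r + gamma * sum_{a'} pi(a'|s') * q_pi(s',a'))
                new_q = sum(
                    p * (r + gamma * float(np.dot(pi[s2], q[s2])))
                    for p, s2, r in P[s][a]
                )
                delta = max(delta, abs(new_q - q[s, a]))
                q[s, a] = new_q
        if delta < tol:
            return q
```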