State-Value and Action-Value Functions

State-Value Function v_{\pi}(s) : It is the expected return when starting in state s and following policy \pi thereafter.

\[v_{\pi} (s) = \mathbb{E}_{\pi} \left[ G_t | S_t = s \right]\]
\[\implies v_{\pi} (s) = \mathbb{E}_{\pi} \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s \right]\]
\[\implies v_{\pi} (s) = \mathbb{E}_{\pi} \left[ R_{t+1} + \gamma G_{t+1} | S_t = s, A_t \sim \pi (\cdot | s) \right]\]
\[\therefore v_{\pi} (s) = \mathbb{E}_{\pi} \left[ R_{t+1} + \gamma v_{\pi} (S_{t+1}) | S_t = s, A_t \sim \pi (\cdot | s) \right]\]

This is known as the Bellman expectation equation for v_{\pi}.

An equivalent form is,

\[\therefore v_{\pi} (s) = \sum_{a \in A(s)} \pi (a | s) \sum_{s' \in S, r \in R} p (s', r | s, a) (r + \gamma v_{\pi} (s'))\]
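To make this summed form concrete, the sketch below runs iterative policy evaluation on a tiny MDP, repeatedly applying the equation above until v_{\pi} stops changing. The two-state MDP, the transition model p, the uniform-random policy, and \gamma = 0.9 are illustrative assumptions, not anything specified in this section.

```python
# Minimal sketch: iterative policy evaluation via the Bellman expectation equation.
# The MDP (states, actions, p), the uniform-random policy pi, and gamma are
# hypothetical choices made only for this example.
import numpy as np

states, actions, gamma = [0, 1], [0, 1], 0.9

# p[(s, a)] is a list of (s_next, reward, probability) triples, i.e. p(s', r | s, a).
p = {
    (0, 0): [(0, 0.0, 0.9), (1, 1.0, 0.1)],
    (0, 1): [(1, 1.0, 1.0)],
    (1, 0): [(0, 0.0, 1.0)],
    (1, 1): [(1, 2.0, 0.5), (0, 0.0, 0.5)],
}

# pi[s][a] = pi(a | s): uniform-random policy over both actions.
pi = {s: {a: 0.5 for a in actions} for s in states}

v = np.zeros(len(states))
for _ in range(1000):
    v_new = np.zeros_like(v)
    for s in states:
        # v_pi(s) = sum_a pi(a|s) * sum_{s',r} p(s',r|s,a) * (r + gamma * v_pi(s'))
        v_new[s] = sum(
            pi[s][a]
            * sum(prob * (r + gamma * v[s_next]) for s_next, r, prob in p[(s, a)])
            for a in actions
        )
    if np.max(np.abs(v_new - v)) < 1e-10:  # stop once a sweep no longer changes v
        v = v_new
        break
    v = v_new

print("v_pi:", v)
```

Each sweep is one application of the Bellman expectation backup; iterating it to a fixed point yields v_{\pi} for the chosen policy.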

Action-Value Function q_{\pi}(s,a) : It is the expected return when starting in state s, taking action a, and following policy \pi thereafter.

\[q_{\pi} (s,a) = \mathbb{E}_{\pi} \left[ G_t | S_t = s, A_t = a \right]\]
\[\implies q_{\pi} (s,a) = \mathbb{E}_{\pi} \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s, A_t = a \right]\]
\[\implies q_{\pi} (s,a) = \mathbb{E}_{\pi} \left[ R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a \right]\]
\[\therefore q_{\pi} (s,a) = \mathbb{E}_{\pi} \left[ R_{t+1} + \gamma v_{\pi} (S_{t+1}) | S_t = s, A_t = a \right]\]
\[\text{or,} \, q_{\pi} (s,a) = \mathbb{E}_{\pi} \left[ R_{t+1} + \gamma q_{\pi} (S_{t+1}, A_{t+1}) | S_t = s, A_t = a \right]\]

This is known as the Bellman expectation equation for q_{\pi}.

An equivalent form is,

\[\therefore q_{\pi} (s, a) = \sum_{s' \in S, r \in R} p ( s', r | s, a ) \left( r + \gamma \sum_{a' \in A(s')} \pi (a'|s') \, q_{\pi} (s', a') \right)\]
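As with v_{\pi}, this nested-sum form can be iterated to a fixed point to evaluate q_{\pi}. The sketch below reuses the same hypothetical two-state MDP, uniform-random policy, and discount as in the v_{\pi} example; those specifics are assumptions made only for illustration.

```python
# Minimal sketch: evaluating q_pi by iterating the nested-sum form of the
# Bellman expectation equation. The MDP, policy, and gamma are the same
# hypothetical choices as in the v_pi sketch above.
states, actions, gamma = [0, 1], [0, 1], 0.9

# p[(s, a)] lists (s_next, reward, probability) triples: p(s', r | s, a).
p = {
    (0, 0): [(0, 0.0, 0.9), (1, 1.0, 0.1)],
    (0, 1): [(1, 1.0, 1.0)],
    (1, 0): [(0, 0.0, 1.0)],
    (1, 1): [(1, 2.0, 0.5), (0, 0.0, 0.5)],
}
pi = {s: {a: 0.5 for a in actions} for s in states}  # uniform-random policy

q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(1000):
    q_new = {}
    for (s, a) in q:
        # q_pi(s,a) = sum_{s',r} p(s',r|s,a) * (r + gamma * sum_{a'} pi(a'|s') * q_pi(s',a'))
        q_new[(s, a)] = sum(
            prob * (r + gamma * sum(pi[s_next][a2] * q[(s_next, a2)] for a2 in actions))
            for s_next, r, prob in p[(s, a)]
        )
    if max(abs(q_new[k] - q[k]) for k in q) < 1e-10:  # fixed point reached
        q = q_new
        break
    q = q_new

print({k: round(val, 3) for k, val in q.items()})
```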