Action and Policy

A policy \pi is the agent’s strategy for choosing actions, while an action is the agent’s degree of freedom to act on the environment in order to maximize the reward.

A policy maps states to actions, \pi : S \longrightarrow A. There are the following approaches for training this function to find the optimal policy.
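At its simplest, a deterministic policy is just a lookup from states to actions. A minimal sketch, where the state and action labels are hypothetical placeholders:

```python
# A tabular deterministic policy: a dictionary mapping each state to an action.
# The states "s0".."s2" and actions "left"/"right" are made up for illustration.
policy = {
    "s0": "left",
    "s1": "right",
    "s2": "right",
}

def pi(state):
    """Return the action the policy selects in the given state."""
    return policy[state]

print(pi("s1"))  # right
```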

  1. Policy-based: Directly train the policy.
    \[\begin{split}a = \pi ( s ) \quad \text{or} \quad a = \underset{a}{\arg\max} \; \pi ( a \mid s )\end{split}\]
  2. Value-based: Train a value function, and act greedily with respect to it, so that the policy moves to the state (or takes the action) with the highest value.
    \[\textbf{State-Value Function: } V_{\pi} (s) = \mathbb{E}_{\pi} [ G_t \mid S_t = s ], \]
    \[\textbf{Action-Value Function: } Q_{\pi} (s, a) = \mathbb{E}_{\pi} [ G_t \mid S_t = s, A_t = a ]\]
  3. Model-based: Learn a model of the environment’s dynamics and use it to plan actions.
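The first two approaches can be sketched side by side. Below, a policy-based agent stores action probabilities \pi(a|s) and acts by argmax, while a value-based agent stores an action-value table Q(s, a) and acts greedily on it; all the numbers are hypothetical placeholders, not learned values.

```python
import numpy as np

# Hypothetical toy problem: 3 states, 2 actions (made-up numbers).
n_states, n_actions = 3, 2

# Policy-based: a table of action probabilities pi(a|s),
# one row per state, one column per action.
pi_table = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
])

def act_policy_based(s):
    # a = argmax_a pi(a|s): pick the most probable action under the policy.
    return int(np.argmax(pi_table[s]))

# Value-based: an action-value table Q(s, a); the greedy policy
# picks the action with the highest estimated value.
q_table = np.array([
    [1.0, 0.0],
    [0.3, 0.7],
    [0.2, 0.9],
])

def act_value_based(s):
    # a = argmax_a Q(s, a): act greedily with respect to the value function.
    return int(np.argmax(q_table[s]))

print(act_policy_based(1))  # 1
print(act_value_based(2))   # 1
```

In practice the tables above would be replaced by learned function approximators (e.g. neural networks), but the action-selection rule is the same.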