Reward

  • A reward R_t is a scalar feedback signal, typically R_t \in \mathbb{R} (sometimes integer-valued, R_t \in \mathbb{Z}).

  • Indicates how well the agent is doing at step t.

  • The agent’s job is to maximize cumulative reward.

Reinforcement Learning is based on the reward hypothesis:

Reward Hypothesis

All goals can be described by the maximization of expected cumulative reward.

\[\max \, \mathbb{E} \left[ \sum_{i = 0}^{\infty} R_{t+i+1} \right]\]

Our goal is to sequentially perform actions that maximize expected cumulative reward, as in the sketch below.
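
As a concrete illustration, here is a minimal agent-environment loop (the toy dynamics and the random policy are hypothetical, invented purely for illustration): the agent repeatedly acts, receives a scalar reward, and accumulates it.

    import random

    def step(state, action):
        # Hypothetical toy dynamics: reward 1 for the "correct" action, else 0;
        # the episode ends after 10 steps.
        reward = 1.0 if action == state % 2 else 0.0
        next_state = state + 1
        return next_state, reward, next_state >= 10

    state, done, total_reward = 0, False, 0.0
    while not done:
        action = random.choice([0, 1])             # a (poor) random policy
        state, reward, done = step(state, action)
        total_reward += reward                     # cumulative reward the agent should maximize
    print(total_reward)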

However,

  • Any action may have long-term consequences.

  • Reward may be delayed.

  • It may be better to sacrifice immediate reward to gain greater long-term reward.

Fig. 10 Short-term vs. long-term reward trade-off (mouse-and-cheese example)

Examples:

  1. A financial investment (may take months to mature).

  2. Refueling a helicopter (may prevent future crash).

  3. Blocking opponent moves in chess (may improve chances of winning).

Note

An RL agent’s learning process is closely tied to the reward distribution over time; however, there is no predefined way to design the best reward function.

Tip: Be careful what you wish for, for you might get it

The agent learns the policy the reward function actually asked for, not the one that should have been asked for, nor the one that was intended.

Discounted return

The discounted return G_t is the discounted cumulative reward, defined as follows:

\[G_t = \sum_{i = 0}^{\infty} \gamma^i \cdot R_{t+i+1}, \quad \text{where } \gamma \in \left[ 0, 1 \right]\]

  • \gamma \longrightarrow Discount rate

  • Larger \gamma \implies smaller discount. The agent cares more about long-term reward.

  • Smaller \gamma \implies larger discount. The agent cares more about short-term reward, as the sketch below illustrates.
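
A small sketch (the reward sequence is hypothetical, chosen for illustration) makes the effect of \gamma concrete:

    def discounted_return(rewards, gamma):
        # G_t = sum_i gamma^i * R_{t+i+1} for a finite reward sequence
        return sum(gamma**i * r for i, r in enumerate(rewards))

    rewards = [0.0, 0.0, 0.0, 10.0]                # a single delayed reward
    print(discounted_return(rewards, gamma=0.99))  # ~9.70: far-sighted agent keeps most of it
    print(discounted_return(rewards, gamma=0.10))  # ~0.01: short-sighted agent nearly ignores it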

Task-dependent discounting

Episodic Task:

  • Tasks that have a terminal state.

  • Problem naturally breaks into episodes.

  • The return becomes a finite sum.

Continuing Task:

  • Tasks that have no terminal state and can continue indefinitely unless stopped externally.

  • Problem lacks a natural end.

  • The return must be discounted (\gamma < 1) so that the infinite sum remains finite, as the example below shows.
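
For example, a continuing task that pays a constant reward of 1 at every step has an infinite undiscounted return, whereas the discounted return is a convergent geometric series:

\[G_t = \sum_{i = 0}^{\infty} \gamma^i \cdot 1 = \frac{1}{1 - \gamma}, \quad \gamma \in \left[ 0, 1 \right)\]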

Two ways of calculating return

  1. Batch Learning
    \[G_t = \sum_{i = 0}^{\infty} \gamma^i \cdot R_{t+i+1}\]
  2. Online Learning
    \[G_{t} = R_{t+1} + \gamma \cdot G_{t+1}\]
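
The two formulations agree: unrolling the recursion G_t = R_{t+1} + \gamma \cdot G_{t+1} reproduces the batch sum. A short sketch (with an arbitrary reward sequence, for illustration) verifies this numerically:

    def batch_return(rewards, gamma):
        # Batch: evaluate the full discounted sum directly
        return sum(gamma**i * r for i, r in enumerate(rewards))

    def recursive_return(rewards, gamma):
        # Online/recursive: G_t = R_{t+1} + gamma * G_{t+1}, rolled out backwards
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        return g

    rewards, gamma = [1.0, 2.0, 3.0], 0.9
    assert abs(batch_return(rewards, gamma) - recursive_return(rewards, gamma)) < 1e-12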