List of Homeworks and Due Dates

Author

Saúl Díaz Infante Velasco

Homework 001 due date: September 20, 2024, 12:00:00

Exercise 1.1 Read [Sec. 1.1, pp. 1–2, 1] and answer the following.

Explain why Reinforcement Learning differs from supervised and unsupervised learning.

Exercise 1.2 Watch the first of Steve Brunton's YouTube videos on Reinforcement Learning. Then, following his presentation, explain the meaning of the following expression:

\[ V_{\pi}(s) = \mathbb{E} \left( \sum_{t} \gamma ^ {t} r_t | s_0 = s \right). \]
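
For intuition only (this is not part of the exercise), the expectation above can be approximated by averaging sampled discounted returns. The sketch below does this in a made-up two-state chain with an equiprobable policy; the dynamics, rewards, and all numerical values are assumptions chosen purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
gamma, n_episodes, horizon = 0.9, 5000, 50


def step(state: int, action: int) -> tuple:
    # hypothetical dynamics: action 0 stays put, action 1 switches state
    next_state = state if action == 0 else 1 - state
    reward = 1.0 if next_state == 1 else 0.0
    return reward, next_state


def sampled_return(s0: int) -> float:
    # one realization of sum_t gamma^t r_t starting from s_0 = s0
    g, discount, state = 0.0, 1.0, s0
    for _ in range(horizon):          # truncate the infinite sum
        action = rng.integers(2)      # pi selects each action with probability 1/2
        reward, state = step(state, action)
        g += discount * reward
        discount *= gamma
    return g


v_hat = np.mean([sampled_return(0) for _ in range(n_episodes)])
print(f"Monte Carlo estimate of V_pi(s=0): {v_hat:.3f}")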

Exercise 1.3 From [1], build a timeline, year by year, from 1950 to 2012.

Use the format illustrated at https://kim.quarto.pub/milestones–bar-timelines/

Exercise 1.4 Consider the following consumption–saving problem with dynamics \[ x_{k+1} = (1+r)(x_k-a_k),\qquad k=0,1,\ldots,N-1, \] and utility function

\[ \beta^N(x_N)^{1-\gamma} + \sum_{k=0}^{N-1}\beta^k (a_k)^{1-\gamma}. \]

Show that the value functions of the DP algorithm take the form \[J_k(x)=A_k\beta^kx^{1-\gamma},\] where \(A_N=1\) and for \(k=N-1,\ldots,0\),

\[ A_k = \left[1 + ((1+r)\beta A_{k+1})^{1/\gamma} \right]^\gamma. \] Show also that the optimal policies are \(h_k(x)=A_k^{-1/\gamma}\, x\) for \(k=N-1,\ldots,0\).
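
A rough numerical sketch (not a substitute for the requested proof) of the backward DP recursion for this problem on a discretized wealth grid. The values r = 0.05, beta = 0.9, gamma = 0.5, N = 5, the grid, and the linear interpolation are all assumptions made for illustration; values near the upper end of the grid are inaccurate because next states are clamped to the grid.

import numpy as np

r, beta, gamma, N = 0.05, 0.9, 0.5, 5
x_grid = np.linspace(1e-6, 10.0, 400)        # wealth grid (avoid x = 0)
J = beta ** N * x_grid ** (1.0 - gamma)      # terminal value J_N(x)

for k in range(N - 1, -1, -1):
    J_next = J.copy()
    J = np.empty_like(x_grid)
    for i, x in enumerate(x_grid):
        a = np.linspace(1e-8, x, 200)        # feasible consumptions 0 < a <= x
        x_next = (1.0 + r) * (x - a)         # next wealth; clamped by np.interp at the grid edge
        values = beta ** k * a ** (1.0 - gamma) + np.interp(x_next, x_grid, J_next)
        J[i] = values.max()                  # J_k(x) = max over feasible a

print("approximate J_0 on a few grid points:", J[100:103])

Comparing the printed values against the claimed closed form \(J_0(x)=A_0 x^{1-\gamma}\) at the same grid points is a useful sanity check of your derivation.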

Exercise 1.5 Consider now the infinite–horizon version of the above consumption–saving problem.

  1. Write down the associated Bellman equation.

  2. Argue why a solution to the Bellman equation should be of the form \[v(x)=cx^{1-\gamma},\] where \(c\) is a constant. Find the constant \(c\) and the stationary optimal policy.1

1 Hint: Insert \(v(x)=cx^{1-\gamma}\) into the Bellman equation and solve the resulting optimization problem.

Exercise 1.6 Let \(\{\xi_k\}\) be a sequence of i.i.d. random variables such that \(\mathbb{E}[\xi_k]=0\) and \(\mathbb{E}[\xi_k^2]=d\). Consider the dynamics \[ x_{k+1} = x_k + a_k + \xi_k, \qquad k= 0,1,2,\ldots, \] and the discounted cost \[ \mathbb{E}\left[\, \sum_{k=0}^{\infty} \beta^k \bigl(a_k^2+x_k^2\bigr) \right]. \]

  1. Write down the associated Bellman equation.

  2. Conjecture that the solution to the Bellman equation takes the form \(v(x)=ax^2+b\), where \(a\) and \(b\) are constants (a rough numerical check of this conjecture is sketched after the list).

  3. Determine the constants \(a\) and \(b\).

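
For item 2, a rough numerical sanity check (not the requested derivation), assuming beta = 0.9, d = 1, a truncated state and action grid, and a two-point zero-mean noise with variance d: run approximate value iteration and fit a quadratic to the result. The fitted linear coefficient should be close to zero, which supports the conjectured form; grid truncation makes the check crude near the boundary.

import numpy as np

beta, d = 0.9, 1.0
xs = np.linspace(-10.0, 10.0, 201)              # truncated state grid
acts = np.linspace(-5.0, 5.0, 101)              # truncated action grid
noise = np.array([np.sqrt(d), -np.sqrt(d)])     # two-point noise: E[xi] = 0, E[xi^2] = d
v = np.zeros_like(xs)

for _ in range(300):                            # value-iteration sweeps
    v_new = np.empty_like(v)
    for i, x in enumerate(xs):
        nxt = x + acts[:, None] + noise[None, :]        # next states for every (action, noise) pair
        ev = np.interp(nxt, xs, v).mean(axis=1)         # E_xi[ v(x + a + xi) ]
        v_new[i] = np.min(x ** 2 + acts ** 2 + beta * ev)
    v = v_new

# quadratic fit on the interior of the grid; coefficients are [quadratic, linear, constant]
coef = np.polyfit(xs[50:151], v[50:151], deg=2)
print("quadratic fit:", coef)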

Homework 002 due date: October 25, 2024, 12:00:00

Exercise 2.1 Write docstrings for all the Python source code in each of the reported scripts; a minimal example of the intended docstring style is sketched after the list below.

  • Incremental Implementation
  • Optimistic Initial Values
  • Upper-Confidence-Bound Action Selection
  • Gradient Bandit method
  • Scripts for visualization
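
A minimal sketch of the intended docstring style. The function below is a hypothetical incremental sample-average update used only as an example; it is not one of the reported scripts.

def incremental_update(q: float, reward: float, n: int) -> float:
    """Update an action-value estimate with the incremental sample-average rule.

    Implements Q_{n+1} = Q_n + (R_n - Q_n) / n, which avoids storing all past
    rewards.

    Parameters
    ----------
    q : float
        Current estimate Q_n of the action value.
    reward : float
        Newly observed reward R_n for the selected action.
    n : int
        Number of times the action has been selected so far (n >= 1).

    Returns
    -------
    float
        The updated estimate Q_{n+1}.
    """
    return q + (reward - q) / n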

Exercise 2.2 Explain and illustrate the experiments comparing \(\varepsilon\)-greedy action-value methods on the 10-armed testbed. Include plots similar to Figure 2.2 from [p. 29, 1].

Exercise 2.3 Explain and illustrate the experiments on optimistic initial values. Include plots similar to Figure 2.3 from [p. 34, 1].

Exercise 2.4 Explain and describe the experiments on the average performance of UCB action selection. Include plots similar to Figure 2.4 from [p. 36, 1].

Exercise 2.5 Explain and describe the experiments on the average performance of the gradient bandit algorithm with and without a reward baseline. Include plots similar to Figure 2.5 from [p. 38, 1].

Exercise 2.6 Show that for two actions, the soft-max (Boltzmann or Gibbs) distribution is equivalent to the logistic, or sigmoid, function commonly used in statistics and artificial neural networks.
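
A quick numerical check (not a proof) of this equivalence, with arbitrary preference values chosen for illustration: the soft-max probability of the first action equals the logistic function of the preference difference.

import numpy as np


def softmax(h: np.ndarray) -> np.ndarray:
    e = np.exp(h - np.max(h))        # subtract the max for numerical stability
    return e / e.sum()


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))


h1, h2 = 0.7, -1.3                   # hypothetical action preferences
p = softmax(np.array([h1, h2]))
print(p[0], sigmoid(h1 - h2))        # both print the same probability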

Homework 003 due date: November 08, 2024, 12:00:00

Exercise 3.1 Suppose \(\gamma = 0.5\) and the following sequence of rewards is received: \(R_1 = 1\), \(R_2 = 2\), \(R_3 = 6\), \(R_4 = 3\), and \(R_5 = 2\), with \(T = 5\). What are \(G_0, G_1, \ldots, G_5\)?2

2 Hint: Work backwards.
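
Following the hint, a small sketch of the backward recursion \(G_t = R_{t+1} + \gamma G_{t+1}\), applied here to a made-up reward sequence rather than the one in the exercise.

def returns_backward(rewards: list, gamma: float) -> list:
    """Return [G_0, ..., G_T] for a finite episode with rewards [R_1, ..., R_T]."""
    g = 0.0
    out = [0.0]                      # G_T = 0 by convention
    for r in reversed(rewards):
        g = r + gamma * g            # G_t = R_{t+1} + gamma * G_{t+1}
        out.append(g)
    return list(reversed(out))


print(returns_backward([0.0, 1.0, 2.0], gamma=0.9))  # illustrative rewards only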

Exercise 3.2 Give a table analogous to that in [p. 53, 1], but for \(p(s', r \mid s, a)\). It should have columns for \(s\), \(a\), \(s'\), \(r\), and \(p(s', r \mid s, a)\), and a row for every 4-tuple for which \(p(s', r \mid s, a) > 0\).

Exercise 3.3 If the current state is \(S_t\), and actions are selected according to a stochastic policy \(\pi\), then what is the expectation of \(R_{t+1}\) in terms of \(\pi\) and the four-argument function \(p\) in [Eq. (3.2), p. 48, 1]?

Let the value function of a state \(s\) under a policy \(\pi\) be denoted and defined by \[ \begin{aligned} v_{\pi}(s):= & \mathbb{E}_{\pi} \left[ G_t \mid S_t = s \right] \\ = & \mathbb{E}_{\pi} \left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+1+k} \,\Big|\, S_t = s \right], \quad \text{for all } s\in \mathcal{S}. \end{aligned} \] Similarly, we denote and define the value of taking action \(a\) in state \(s\) under a policy \(\pi\) by \[ \begin{aligned} q_{\pi}(s,a):= & \mathbb{E}_{\pi} \left[ G_t \mid S_t = s, A_t = a \right] \\ = & \mathbb{E}_{\pi} \left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+1+k} \,\Big|\, S_t = s, A_t = a \right], \quad \text{for all } (s,a) \in \mathcal{S}\times \mathcal{A}(s). \end{aligned} \]

Exercise 3.4 Give an equation for \(v_{\pi}\) in terms of \(q_{\pi}\) and \(\pi\).

Exercise 3.5  

  1. Give an equation for \(q_{\pi}\) in terms of \(v_{\pi}\) and the four-argument function \(p(\cdot,\cdot\mid\cdot,\cdot)\).
  2. Give the Bellman equation for \(q_{\pi}\) for the recycling robot.

Exercise 3.6 Consider the following script:

gridword_4x4_exm.py
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


# take a step, get next state and reward
def env_step(
        state: tuple,
        act: tuple
        ) -> tuple:
    assert state not in [(0, 0), (3, 3)], "Terminal states should not be selected"
    next_row = state[0] + act[0]
    next_col = state[1] + act[1]
    
    if 0 <= next_row < 4 and 0 <= next_col < 4:
        return -1., (next_row, next_col)
    else:
        return -1., state


# Iterative policy evaluation
def policy_evaluation(
        grid_world: np.ndarray,
        actions: list,
        probs: list,
        gamma: float,
        delta: float,
        terminal_states: list
        ) -> tuple:
    # The author uses the two-array method: new values are computed from the old grid
    Q = np.zeros(shape=(grid_world.shape[0], grid_world.shape[1], len(actions)))
    
    # Loop over all of the states
    for row in range(4):
        for col in range(4):
            if (row, col) in terminal_states: continue
            for i, act in enumerate(actions):
                reward, next_state = env_step((row, col), act)
                next_value = grid_world[next_state]
                Q[row, col, i] = probs[i] * (reward + gamma * next_value)
    
    # calculate the abs difference between two grids and return the max diff value
    new_grid_world = Q.sum(axis=-1)
    max_diff = np.abs(new_grid_world - grid_world).max()
    return new_grid_world, max(max_diff, delta)


# Get greedy policy from current values
def greedy_policy(
        grid_world: np.ndarray,
        actions: list,
        probs: list,
        gamma: float,
        terminal_states: list
        ) -> list:
    # Compute action values from the current value estimates
    Q = np.zeros(shape=(grid_world.shape[0], grid_world.shape[1], len(actions)))
    act_list = np.array([0x2190, 0x2193, 0x2192, 0x2191])
    greedy_acts = []
    
    # Loop over all of the states
    for row in range(4):
        acts_row = []
        for col in range(4):
            # Skip terminal states
            if (row, col) in terminal_states:
                acts_row.append('')
                continue
            
            for i, act in enumerate(actions):
                reward, next_state = env_step((row, col), act)
                next_value = grid_world[next_state]
                Q[row, col, i] = probs[i] * (reward + gamma * next_value)
            
            # include all possible greedy actions
            greedy_sels = \
            np.where(np.abs(Q[row, col] - Q[row, col].max()) < 0.0001)[0]
            acts = "".join([chr(act_list[i]) for i in greedy_sels])
            acts_row.append(acts)
        greedy_acts.append(acts_row)
    
    return greedy_acts


# plot values of the gridworld
def plot_heatmap(
        data: np.ndarray,
        annots: list,
        axes: np.ndarray,
        curr_row: int
        ) -> None:
    assert len(data.shape) == 2, 'Input must be a 2-D array'
    ax_val = axes[curr_row, 0]  # value axis
    ax_act = axes[curr_row, 1]  # action axis
    
    title = f'k={curr_row}' if curr_row != -1 else r'k=$\infty$'
    
    # Plot values
    sns.heatmap(
        data,
        ax=ax_val,
        cbar=False,
        fmt='.1f',
        annot=data,
        cmap='Blues',
        linewidths=0.5
        )
    ax_val.set_yticks([])
    ax_val.set_xticks([])
    ax_val.set_title(title, fontweight='bold')
    
    # Plot actions
    sns.heatmap(
        data,
        ax=ax_act,
        cbar=False,
        annot=annots,
        fmt='',
        cmap='Blues',
        linewidths=0.5
        )
    ax_act.set_yticks([])
    ax_act.set_xticks([])
    ax_act.set_title(title, fontweight='bold')


if __name__ == '__main__':
    # Initialize the grid world
    actions = [(0, -1), (1, 0), (0, 1), (-1, 0)]
    grid_world = np.zeros(shape=(4, 4))
    gamma = 1.0  # Normally gamma should be between (0, 1)
    theta = 0.001
    
    # random policy with equal probability on each action
    probs = [1. / len(actions)] * len(actions)
    terminal_states = [(0, 0), (3, 3)]
    
    # store values in the record for plotting
    fig, axes = plt.subplots(6, 2, figsize=(5, 16), dpi=150)
    curr_row = 0
    # Loop until delta is smaller than theta
    for i in range(1000):
        
        # plot value and policy in given timestep
        if i in [0, 1, 2, 3, 10]:
            greedy_acts = greedy_policy(
                grid_world,
                actions,
                probs,
                gamma,
                terminal_states
                )
            plot_heatmap(grid_world, greedy_acts, axes, curr_row)
            curr_row += 1
        
        delta = 0  # reset delta
        grid_world, delta = policy_evaluation(
                grid_world,
                actions,
                probs,
                gamma,
                delta,
                terminal_states
                )
        
        if delta < theta:
            break
    
    # get greedy policy when converged
    greedy_acts = greedy_policy(
        grid_world,
        actions,
        probs,
        gamma,
        terminal_states
        )
    
    plot_heatmap(grid_world, greedy_acts, axes, -1)
    plt.tight_layout()
    # plt.show()
    plt.savefig('example_4_1.png')

  1. Document this numerical experiment.
  2. Display the resulting figure and explain each iteration according to the iterative policy evaluation algorithm.
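
As a usage note (assuming numpy, seaborn, and matplotlib are installed): running the script with python gridword_4x4_exm.py performs the policy-evaluation sweeps and saves the figure as example_4_1.png in the working directory (the plt.show() call is commented out); that file is the figure to display and discuss in item 2.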

Bibliography

[1]
R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, Cambridge, MA, 2018.