List of homeworks and due dates
Homework 001 due date: September 20, 2024, 12:00:00
Exercise 1.1 Read [Sec. 1.1, pp. 1-2, 1] and answer the following.
Explain how Reinforcement Learning differs from supervised and unsupervised learning.
Exercise 1.2 Watch Steve Brunton's first YouTube video about Reinforcement Learning. Then, following its presentation, explain the meaning of the following expression:
\[ V_{\pi}(s) = \mathbb{E} \left( \sum_{t} \gamma ^ {t} r_t | s_0 = s \right). \]
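One way to read this expression: \(V_{\pi}(s)\) is the expected discounted sum of rewards collected when starting in state \(s\) and following the policy \(\pi\). The sketch below estimates such an expectation by Monte Carlo on a small made-up Markov chain; the transition matrix, rewards, and discount factor are illustrative assumptions, not part of the exercise.

```python
import numpy as np

# Hypothetical 2-state chain used only to illustrate the expectation:
# under a fixed policy pi the state evolves according to P and the
# reward r[s] is collected at every visited state s.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # P[s, s'] = transition probability under pi
r = np.array([1.0, 0.0])     # reward received in each state
gamma = 0.9                  # discount factor

def value_mc(s0, episodes=2000, horizon=100, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of V_pi(s0) = E[sum_t gamma^t r_t | s_0 = s0]."""
    total = 0.0
    for _ in range(episodes):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):       # truncate the infinite sum
            g += discount * r[s]
            discount *= gamma
            s = rng.choice(2, p=P[s])
        total += g
    return total / episodes

print(value_mc(0), value_mc(1))
```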
Exercise 1.3 From [see 1], obtain a timeline per year from 1950 to 2012.
Use the format shown at https://kim.quarto.pub/milestones–bar-timelines/.
Exercise 1.4 Consider the following consumption–saving problem with dynamics \[ x_{k+1} = (1+r)(x_k-a_k),\qquad k=0,1,\ldots,N-1, \] and utility function
\[ \beta^N(x_N)^{1-\gamma} + \sum_{k=0}^{N-1}\beta^k (a_k)^{1-\gamma}. \]
Show that the value functions of the DP algorithm take the form \[J_k(x)=A_k\beta^kx^{1-\gamma},\] where \(A_N=1\) and for \(k=N-1,\ldots,0\),
\[ A_k = [1 + ((1+r)\beta A_{k+1})^{1/\gamma} ]^\gamma. \] Show also that the optimal policies are \(h_k(x)=A_k^{-1/\gamma} x\) for \(k=N-1,\ldots,0\).
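The recursion above is straightforward to evaluate numerically, which can help check a hand derivation. A minimal sketch follows; the values of \(N\), \(r\), \(\beta\), and \(\gamma\) are illustrative assumptions, not values given by the exercise.

```python
import numpy as np

# Illustrative parameters (assumed for the example only).
N, r, beta, gamma = 10, 0.05, 0.95, 0.5

# Backward recursion: A_N = 1, A_k = [1 + ((1+r)*beta*A_{k+1})^(1/gamma)]^gamma.
A = np.empty(N + 1)
A[N] = 1.0
for k in range(N - 1, -1, -1):
    A[k] = (1.0 + ((1.0 + r) * beta * A[k + 1]) ** (1.0 / gamma)) ** gamma

def J(k, x):
    """Claimed value function J_k(x) = A_k * beta^k * x^(1 - gamma)."""
    return A[k] * beta ** k * x ** (1.0 - gamma)

def h(k, x):
    """Claimed optimal policy h_k(x) = A_k^(-1/gamma) * x."""
    return A[k] ** (-1.0 / gamma) * x

print(A)
print(J(0, 1.0), h(0, 1.0))
```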
Exercise 1.5 Consider now the infinite–horizon version of the above consumption–saving problem.
Write down the associated Bellman equation.
Argue why a solution to the Bellman equation should be of the form \[v(x)=cx^{1-\gamma},\] where \(c\) is constant. Find the constant \(c\) and the stationary optimal policy 1.
1 Hint: Insert \(v(x)=cx^{1-\gamma}\) into the Bellman equation and solve the minimization problem.
Exercise 1.6 Let \(\{\xi_k\}\) be a sequence of i.i.d. random variables such that \(E[\xi_k]=0\) and \(E[\xi_k^2]=d\). Consider the dynamics \[ x_{k+1} = x_k + a_k + \xi_k, \qquad k= 0,1,2,\ldots, \] and the discounted cost \[ E\left[\sum_{k=0}^{\infty} \beta^k(a_k^2+x_k^2)\right]. \]
Write down the associated Bellman equation.
Conjecture that the solution to the Bellman equation takes the form \(v(x)=ax^2+b\), where \(a\) and \(b\) are constant.
Determine the constants \(a\) and \(b\).
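One way to sanity-check a candidate solution is to simulate the dynamics under a fixed stationary linear policy and compare the empirical discounted cost with a quadratic function of the initial state. The sketch below does this by Monte Carlo; the noise distribution (Gaussian), the values of \(\beta\), \(d\), the policy gain \(K\), and the candidate coefficients are all illustrative assumptions, not the solution of the exercise.

```python
import numpy as np

# Illustrative parameters (assumptions made for this sketch only).
beta, d = 0.9, 1.0         # discount factor and noise variance E[xi^2] = d
K = 0.5                    # gain of a fixed stationary linear policy a = -K*x
rng = np.random.default_rng(0)

def discounted_cost(x0, episodes=1000, horizon=200):
    """Monte Carlo estimate of E[sum_k beta^k (a_k^2 + x_k^2)] starting at x0,
    under the policy a_k = -K*x_k and the dynamics x_{k+1} = x_k + a_k + xi_k."""
    total = 0.0
    for _ in range(episodes):
        x, cost, disc = x0, 0.0, 1.0
        for _ in range(horizon):            # truncate the infinite horizon
            a = -K * x
            cost += disc * (a ** 2 + x ** 2)
            x = x + a + rng.normal(0.0, np.sqrt(d))  # Gaussian noise assumed
            disc *= beta
        total += cost
    return total / episodes

# If v(x) = a*x**2 + b solves the Bellman equation, the cost of the optimal
# policy estimated this way should be close to a*x0**2 + b for every x0.
for x0 in (0.0, 1.0, 2.0):
    print(x0, discounted_cost(x0))
```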
Bibliography
Homework 002 due date: October 25, 2024, 12:00:00
Exercise 2.1 Write docstrings for all the Python source code in each of the following scripts; a minimal example of the expected docstring style is sketched after this list.
- Incremental Implementation
- Optimistic Initial Values
- Upper-Confidence-Bound Action Selection
- Gradient Bandit method
- Scripts for visualization
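As a reference for the expected level of documentation, here is a hypothetical docstring on a function implementing the incremental sample-average update; the function name and signature are illustrative and do not come from the course scripts.

```python
def incremental_update(q_estimate, reward, step_count):
    """Update an action-value estimate incrementally.

    Implements Q_{n+1} = Q_n + (R_n - Q_n) / n, i.e. the sample-average
    estimate written in incremental form, so the full reward history
    does not have to be stored.

    Parameters
    ----------
    q_estimate : float
        Current estimate Q_n of the action value.
    reward : float
        Reward R_n just received after selecting the action.
    step_count : int
        Number of times this action has been selected so far (n >= 1).

    Returns
    -------
    float
        The updated estimate Q_{n+1}.
    """
    return q_estimate + (reward - q_estimate) / step_count
```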
Exercise 2.2 Explain and illustrate the experiments on the average performance of \(\varepsilon\)-greedy action-value methods on the 10-armed testbed. Include plots similar to Figure 2.2 from [p. 29, 1].
Exercise 2.3 Explain and illustrate the experiments on optimistic initial values. Include plots similar to Figure 2.3 from [p. 34, 1].
Exercise 2.4 Explain and describe the experiments on the average performance of UCB. Include plots similar to Figure 2.4 from [p. 36, 1].
Exercise 2.5 Explain and describe the experiments on the average performance of the gradient bandit algorithm with and without a reward baseline. Include plots similar to Figure 2.5 from [p. 38, 1].
Exercise 2.6 Show that for two actions, the soft-max (Boltzmann or Gibbs) distribution is equivalent to the logistic, or sigmoid, function commonly used in statistics and artificial neural networks.
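A quick numerical check of this equivalence can complement the algebraic argument the exercise asks for; the preference values below are arbitrary.

```python
import numpy as np

def softmax(prefs):
    """Soft-max (Boltzmann/Gibbs) distribution over action preferences."""
    z = np.exp(prefs - prefs.max())    # subtract the max for numerical stability
    return z / z.sum()

def sigmoid(x):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-x))

h = np.array([0.3, -1.2])              # two arbitrary action preferences
p = softmax(h)
# With two actions, the soft-max probability of the first action should
# coincide with the sigmoid of the preference difference.
print(p[0], sigmoid(h[0] - h[1]))
```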
Bibliography
Homework 003 due date: November 08, 2024, 12:00:00
Exercise 3.1 Suppose \(\gamma = 0.5\) and the following sequence of rewards is received: \(R_1 = 1\), \(R_2 = 2\), \(R_3 = 6\), \(R_4 = 3\), and \(R_5 = 2\), with \(T = 5\). What are \(G_0, G_1, \ldots, G_5\)? 2
2 Hint: Work backwards.
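Following the hint, the returns satisfy \(G_t = R_{t+1} + \gamma G_{t+1}\) with \(G_T = 0\), so they can be computed in a single backward pass. A minimal sketch with placeholder rewards (substitute the sequence given in the exercise):

```python
def returns_backward(rewards, gamma):
    """Compute G_0, ..., G_T from the rewards R_1, ..., R_T using
    G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0 (working backwards)."""
    T = len(rewards)
    G = [0.0] * (T + 1)                  # G[T] = 0 by convention
    for t in range(T - 1, -1, -1):
        G[t] = rewards[t] + gamma * G[t + 1]   # rewards[t] is R_{t+1}
    return G

# Placeholder rewards; replace with the values from the exercise.
print(returns_backward([1.0, 0.0, 2.0], gamma=0.5))
```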
Exercise 3.2 Give a table analogous to that in [p. 53, 1], but for \(p(s', r \,|\, s, a)\). It should have columns for \(s\), \(a\), \(s'\), \(r\), and \(p(s', r \,|\, s, a)\), and a row for every 4-tuple for which \(p(s', r \,|\, s, a) > 0\).
Exercise 3.3 If the current state is \(S_t\) , and actions are selected according to a stochastic policy \(\pi\), then what is the expectation of \(R_{t+1}\) in terms of \(\pi\) and the four-argument function \(p\) in [Eq. (3.2), p.48, 1]?
Let the value function of a state \(s\) under a policy \(\pi\) be denoted and defined by \[ \begin{aligned} v_{\pi}(s):= & \mathbb{E}_{\pi} \left[ G_t | S_t = s \right] \\ = & \mathbb{E}_{\pi} \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \Big | S_t = s \right], \quad \text{for all } s\in \mathcal{S}. \end{aligned} \] Similarly, we denote and define the value of taking action \(a\) in state \(s\) under a policy \(\pi\) by \[ \begin{aligned} q_{\pi}(s,a):= & \mathbb{E}_{\pi} \left[ G_t | S_t = s, A_t = a \right] \\ = & \mathbb{E}_{\pi} \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \Big | S_t = s, A_t = a \right], \quad \text{for all } (s,a) \in \mathcal{S}\times \mathcal{A}(s). \end{aligned} \]
Exercise 3.4 Give an equation for \(v_{\pi}\) in terms of \(q_{\pi}\) and \(\pi\).
Exercise 3.5
- Give an equation for \(q_{\pi}\) in terms of \(v_{\pi}\) and the four-argument function \(p(\cdot,\cdot|\cdot,\cdot)\).
- Give the Bellman equation for \(q_{\pi}\) for the recycling robot.
Exercise 3.6 Based on the following script:
- Document this numerical experiment
- Display the corresponding figure and explain each iteration according to the policy evaluation algorithm.