
David Silver Lecture 3: planning by dynamic programming


1 introduction

1.1 definition

Core idea: break a complex problem down into simpler subproblems, solve the subproblems, and combine their solutions.

1.2 requirements for dynamic programming

  • optimal substructure
    • principle of optimality applies
    • optimal solution can be decomposed into subproblems
  • overlapping subproblems
    • subproblems recur many times
    • solutions can be cached and reused (see the memoization sketch after this list)
  • markov decision processes satisfy both properties
    • the bellman equation gives a recursive decomposition
    • the value function stores and reuses solutions
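As a generic illustration of overlapping subproblems (not from the lecture), here is a minimal python sketch of naive recursion versus memoization; the cached subproblem solutions play the same role the value function plays in an mdp:

```python
from functools import lru_cache

def fib_naive(n):
    # recomputes the same subproblems exponentially many times
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_memo(n):
    # each subproblem is solved once, stored, and reused
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

print(fib_memo(50))  # fast; fib_naive(50) would take a very long time
```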

1.3 planning by dynamic programming

Assumptions and problem settings:
1) dp assumes the mdp is fully known (full knowledge of the model), and is used for planning
2) for prediction: the input is an mdp and a policy π, the output is the value function v_π
3) for control: the input is an mdp, the output is the optimal value function v_* and an optimal policy π_*

1.4 other applications of dynamic programming

  • scheduling algorithm
  • string algorithms (e.g. sequence alignment)
  • graph algorithms (e.g. shortest path algorithms)
  • graphical models (e.g. viterbi algorithm)
  • bioinformatics (e.g. lattice models)

2 policy evaluation

The structure of a dynamic-programming problem can be split into the following three parts:
\begin{aligned} \text{state transition} & \quad s_{n} = f(s_{n-1}, a_{n}) \\ \text{value function} & \quad j^*(s) = \min_{a \in A} \Bigl\{ l(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, j^*(s') \Bigr\} \\ \text{policy} & \quad \pi(s) = \arg\min_{a \in A} \Bigl\{ l(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, j^*(s') \Bigr\} \end{aligned}
The structure of the code looks roughly like this:

```matlab
% one backward sweep: compute the cost-to-go q of every action from every
% state, then keep the best (minimum-cost) action value
for state = 1:num_state
	for action = 1:num_action
		q(action) = instantaneous_cost(state, action);
		next_state = transition(state, action);
		q(action) = q(action) + j(next_state);
	end
	j(state) = min(q);
end
```

Expressed as iterative policy evaluation, the bellman expectation equation is applied as a synchronous backup at every state in each sweep.
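In the notation used in the summary at the end of this post, the update applied to every state in each sweep is

v_{k+1}(s) = \sum_{a} \pi(a|s) \biggl( r(s, a) + \gamma \sum_{s'} p(s'|s, a) v_k(s') \biggr)

A matlab implementation for the classic 4×4 gridworld (uniform random policy, reward −1 per step, terminal corners) follows.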

```matlab
% initialize the grid world and parameters
grid_size = 4;
num_actions = 4;
discount_factor = 1.0;
theta = 1e-4;

% initialize state-value function
v = zeros(grid_size);

% define the reward function
reward = -1;

% define the transition probabilities for the uniform random policy
policy = ones(grid_size, grid_size, num_actions) / num_actions;

% iterative policy evaluation
while true
    delta = 0;
    
    % loop over all states
    for i = 1:grid_size
        for j = 1:grid_size
            
            % skip the two terminal states (top-left and bottom-right corners)
            if (i == 1 && j == 1) || (i == grid_size && j == grid_size)
                continue;
            end
            
            % store the old value
            old_value = v(i, j);
            
            % calculate the new value by averaging over actions
            new_value = 0;
            for action = 1:num_actions
                [next_i, next_j] = apply_action(i, j, action);
                % reward is -1 per step; terminal values stay fixed at 0
                new_value = new_value + policy(i, j, action) * (reward + discount_factor * v(next_i, next_j));
            end
            
            % update the value function
            v(i, j) = new_value;
            delta = max(delta, abs(old_value - new_value));
        end
    end
    
    % check for convergence
    if delta < theta
        break;
    end
end

% apply the given action to the current state (i, j)
function [next_i, next_j] = apply_action(i, j, action)
    grid_size = 4;
    
    % actions: 1 = up, 2 = down, 3 = left, 4 = right
    if action == 1
        next_i = max(i - 1, 1);
        next_j = j;
    elseif action == 2
        next_i = min(i + 1, grid_size);
        next_j = j;
    elseif action == 3
        next_i = i;
        next_j = max(j - 1, 1);
    else
        next_i = i;
        next_j = min(j + 1, grid_size);
    end
end
```

3 policy iteration

3.1 how to improve policy

Each iteration alternates two steps: evaluate the current policy to obtain its value function, then improve the policy by acting greedily with respect to that value function (see the formula below).
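The greedy improvement step, in the same notation as above:

\pi'(s) = \arg\max_{a} q_{\pi}(s, a) = \arg\max_{a} \biggl( r(s, a) + \gamma \sum_{s'} p(s'|s, a) v_{\pi}(s') \biggr)

Acting greedily guarantees q_{\pi}(s, \pi'(s)) \ge v_{\pi}(s), and therefore v_{\pi'}(s) \ge v_{\pi}(s); once improvement stops, the bellman optimality equation holds and the policy is optimal. A python implementation of policy iteration on the 4×4 gridworld: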

```python
import numpy as np

def gridworld_policy_iteration():
    n_states = 16
    n_actions = 4
    terminal_state = 15
    rewards = -1 * np.ones((n_states, n_actions))
    rewards[terminal_state, :] = 0

    transition_matrix = np.zeros((n_states, n_actions, n_states))

    for state in range(n_states):
        for action in range(n_actions):
            if state == terminal_state:
                transition_matrix[state, action, state] = 1
                continue

            next_state = state

            if action == 0:  # up
                next_state = max(state - 4, 0)
            elif action == 1:  # down
                next_state = min(state + 4, n_states - 1)
            elif action == 2:  # left
                next_state = state - 1
                if state % 4 == 0:
                    next_state = state
            elif action == 3:  # right
                next_state = state + 1
                if (state + 1) % 4 == 0:
                    next_state = state

            transition_matrix[state, action, next_state] = 1

    gamma = 0.99
    max_iter = 1000
    theta = 1e-10

    policy = np.ones((n_states, n_actions)) / n_actions

    for _ in range(max_iter):
        policy_stable = True

        # policy evaluation: iterate the bellman expectation backup for the current policy
        v = np.zeros(n_states)
        while True:
            delta = 0
            for state in range(n_states):
                v_old = v[state]
                v[state] = np.sum(policy[state, :] * (rewards[state, :] + gamma * transition_matrix[state, :, :] @ v))
                delta = max(delta, abs(v_old - v[state]))
            if delta < theta:
                break

        # policy improvement
        for state in range(n_states):
            old_action = np.argmax(policy[state, :])
            action_returns = np.zeros(n_actions)
            for action in range(n_actions):
                action_returns[action] = rewards[state, action] + gamma * np.dot(transition_matrix[state, action, :], v)
            best_action = np.argmax(action_returns)
            policy[state, :] = 0
            policy[state, best_action] = 1

            if old_action != best_action:
                policy_stable = False

        if policy_stable:
            break

    optimal_policy = policy
    state_values = v

    return optimal_policy, state_values

optimal_policy, state_values = gridworld_policy_iteration()
print("optimal policy:")
print(optimal_policy)
print("state values:")
print(state_values)
```

The key part of the code is the greedy improvement step:

```python
# find the action with the largest one-step return and make the policy
# greedy: probability 1 for the best action, 0 for all others
for action in range(n_actions):
    action_returns[action] = rewards[state, action] + gamma * np.dot(transition_matrix[state, action, :], v)
best_action = np.argmax(action_returns)
policy[state, :] = 0
policy[state, best_action] = 1
```



4 value iteration

4.1 value iteration in mdps

4.1.1 principle of optimality

Any optimal policy can be subdivided into two components:
1. an optimal first action a_*;
2. followed by an optimal policy from the successor state s'.
Principle of optimality: a policy π(a|s) achieves the optimal value from state s, v_π(s) = v_*(s), if and only if, for every state s' reachable from s, π achieves the optimal value from s', i.e. v_π(s') = v_*(s').

4.1.2 deterministic value iteration

  • if we know the solution to the subproblems, v_*(s')
  • then the solution v_*(s) can be found by a one-step lookahead (see the update below)
  • intuition: start with the final rewards and work backwards
  • it still works with loopy, stochastic mdps
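The one-step lookahead referred to above, written in the notation of the summary section:

v_*(s) \leftarrow \max_{a} \biggl( r(s, a) + \gamma \sum_{s'} p(s'|s, a) v_*(s') \biggr)

Value iteration simply applies this update repeatedly to all states.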
The idea behind value iteration is very simple, and the code reads a bit more cleanly:
```python
import numpy as np

# gridworld environment: reward -1 per step, single goal state at the
# bottom-right corner (so the "shortest path" from (0, 0) is well defined)
rows = 4
cols = 4
terminal_states = [(rows-1, cols-1)]
actions = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # right, down, left, up

def is_valid_state(state):
    # any cell inside the grid is a valid destination, including the goal
    r, c = state
    return 0 <= r < rows and 0 <= c < cols

def next_state(state, action):
    r, c = np.array(state) + np.array(action)
    if is_valid_state((r, c)):
        return int(r), int(c)
    return state  # bumping into a wall leaves the state unchanged

# value iteration
def value_iteration(gamma=1, theta=1e-6):
    v = np.zeros((rows, cols))
    while True:
        delta = 0
        for r in range(rows):
            for c in range(cols):
                state = (r, c)
                if state in terminal_states:
                    continue

                v_old = v[state]
                max_value = float('-inf')
                for a in actions:
                    next_s = next_state(state, a)
                    value = -1 + gamma * v[next_s]  # reward -1 per step
                    max_value = max(max_value, value)
                v[state] = max_value
                delta = max(delta, abs(v_old - v[state]))

        if delta < theta:
            break
    return v

# find optimal policy
def find_optimal_policy(v, gamma=1):
    policy = np.zeros((rows, cols, len(actions)))
    for r in range(rows):
        for c in range(cols):
            state = (r, c)
            if state in terminal_states:
                continue

            q_values = np.zeros(len(actions))
            for i, a in enumerate(actions):
                next_s = next_state(state, a)
                q_values[i] = -1 + gamma * v[next_s]
            optimal_action = np.argmax(q_values)
            policy[state][optimal_action] = 1
    return policy

# find the shortest path using the optimal policy
def find_shortest_path(policy):
    state = (0, 0)
    path = [state]
    while state != (rows-1, cols-1):
        action_idx = np.argmax(policy[state])
        state = next_state(state, actions[action_idx])
        path.append(state)
    return path

v = value_iteration()
policy = find_optimal_policy(v)
path = find_shortest_path(policy)

print("shortest path:", path)

5 extensions to dp

5.1 asynchronous dynamic programming

Asynchronous dp backs up states individually, in any order, instead of sweeping over the whole state set; it can significantly reduce computation and is still guaranteed to converge as long as every state continues to be selected. Three simple ideas: in-place dynamic programming, prioritised sweeping, and real-time dynamic programming.

5.1.1 in place dynamic programming

  • synchronous value iteration stores two copies of the value function: every new value v_new(s) is computed from the old copy, and the whole array is swapped at the end of the sweep
  • in-place value iteration stores only one copy of the value function, so updates made earlier in a sweep are used immediately by later updates (see the sketch below)
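A minimal python sketch of the difference, for a small generic mdp with hypothetical arrays P (transition probabilities) and R (rewards) that are not part of the lecture:

```python
import numpy as np

# hypothetical finite mdp: P[s, a, s'] transition probabilities, R[s, a] rewards
n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # each P[s, a] sums to 1
R = rng.normal(size=(n_states, n_actions))

def synchronous_sweep(v_old):
    # all new values are computed from the frozen old array
    v_new = np.empty_like(v_old)
    for s in range(n_states):
        v_new[s] = np.max(R[s] + gamma * P[s] @ v_old)
    return v_new

def in_place_sweep(v):
    # a single array: later states in the sweep already see the values
    # updated earlier in the same sweep
    for s in range(n_states):
        v[s] = np.max(R[s] + gamma * P[s] @ v)
    return v
```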

5.1.2 prioritised sweeping

  • use the magnitude of the bellman error to guide state selection: back up the state with the largest remaining error first, then update the priorities of the affected states (see the sketch below)
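A simplified sketch of the idea, with the same hypothetical mdp arrays as in the in-place sketch (re-declared here so the snippet runs on its own). It only re-prioritises the state that was just backed up; a full implementation would also re-prioritise its predecessor states.

```python
import heapq
import numpy as np

# hypothetical finite mdp, as in the previous sketch
n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

def bellman_error(v, s):
    # | max_a ( R(s,a) + gamma * sum_s' P(s'|s,a) v(s') ) - v(s) |
    return abs(np.max(R[s] + gamma * P[s] @ v) - v[s])

def prioritised_sweeping(n_backups=200):
    v = np.zeros(n_states)
    # max-heap via negated priorities
    queue = [(-bellman_error(v, s), s) for s in range(n_states)]
    heapq.heapify(queue)
    for _ in range(n_backups):
        _, s = heapq.heappop(queue)              # state with the largest bellman error
        v[s] = np.max(R[s] + gamma * P[s] @ v)   # back it up
        heapq.heappush(queue, (-bellman_error(v, s), s))
    return v
```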

5.1.3 real time dp

The idea of real-time dp is to use only the states that are relevant to the agent: after each real experience step (s_t, a_t, r_{t+1}), back up the state s_t that was actually visited.
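A rough sketch under assumed interfaces (the env object with reset()/step() and the P, R, gamma arrays are hypothetical, not from the lecture):

```python
import numpy as np

def real_time_dp(env, P, R, gamma, n_episodes=10):
    n_states = P.shape[0]
    v = np.zeros(n_states)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            q = R[s] + gamma * P[s] @ v   # one-step lookahead at the visited state
            v[s] = np.max(q)              # back up only the state the agent is in
            a = int(np.argmax(q))         # act greedily w.r.t. the current values
            s, done = env.step(a)         # hypothetical env returns (next_state, done)
    return v
```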

6 contraction mapping

The idea is analogous to lyapunov stability: the bellman backup is a contraction, so repeated application converges to a unique fixed point.

6.1 technical questions

How do we know that value iteration converges to v_*? Or that iterative policy evaluation converges to v_π, and policy iteration to v_*? Is the solution unique? How fast do these algorithms converge? These questions are resolved by the contraction mapping theorem.

6.2 value function space

Consider the vector space V over value functions: it has |S| dimensions, and each point in this space fully specifies a value function v(s). The distance between two value functions in this space is measured with the ∞-norm, i.e. the largest difference between state values.
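In symbols:

\| u - v \|_{\infty} = \max_{s \in S} | u(s) - v(s) |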

6.3 bellman expectation backup is a contraction

Define the bellman expectation backup operator T^π(v) = r^π + γ p^π v, where r^π and p^π are the reward vector and transition matrix under policy π. This operator is a γ-contraction: applying it to two value functions makes them closer by at least a factor γ.
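The standard argument, in the notation just introduced:

\begin{aligned} \| T^{\pi}(u) - T^{\pi}(v) \|_{\infty} &= \| (r^{\pi} + \gamma p^{\pi} u) - (r^{\pi} + \gamma p^{\pi} v) \|_{\infty} \\ &= \gamma \| p^{\pi} (u - v) \|_{\infty} \\ &\le \gamma \| u - v \|_{\infty} \end{aligned}

The last step uses the fact that each row of p^π sums to 1. By the contraction mapping (banach fixed point) theorem, iterative policy evaluation therefore converges to the unique fixed point v_π; the same argument applied to the bellman optimality backup shows that value iteration converges to v_*.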

summary

  • policy evaluation
    v(s) = \sum_{a} \pi(a|s) \biggl( r(s, a) + \gamma \sum_{s'} p(s'|s, a) v(s') \biggr)

  • policy improvement
    act greedily with respect to the current value function; stop once the policy is stable (the greedy policy no longer changes)

  • value iteration
    v_{k+1}(s) = \max_{a} \biggl( r(s, a) + \gamma \sum_{s'} p(s'|s, a) v_k(s') \biggr)

  • intuition
    break the problem into subproblems, then act greedily with respect to the subproblem solutions

  • analogy with lyapunov stability
    the backup operator is a contraction with a unique fixed point, so the iterates converge to that point at a geometric (linear) rate
