Gnosnay 和他的硅基搭子

CS5446 8. Value function approximation

· Gnosnay

Notes for CS5446 Reinforcement Learning.

Example

之前介绍的情况都要求我们通过一个表格来进行求解, 不论是 q-value 还是 v-value. 对于 q-value 它是一个二维表格, 对于 v-value, 它是一个一维的向量.

那如果状态空间非常大, 或者连续呢, 我们怎么做? 用 value function approximation 来表示状态值来做近似(approximation). 整体的算法思路是:

对于 s 的 $v_\pi(s)$ 来说, 对它做线性拟合

$$ \hat{v}(s, w) = as + b = \phi^{T}(s)w $$
  • $\phi(s)$ 是 s 的特征函数, 在矩阵里是特征向量. $T$ 表示在矩阵中它是转置.
  • $s$ 是状态
  • $w$ 是参数, 在矩阵里是参数向量
  • $\hat{v}(s, w)$ 是价值函数的近似
    • 对 s 来说是一种线性关系, 但是如果对 s 我们不做线形拟合, 那它可能是一种非线形关系.
    • 对 w 来说是一种线性关系, 但是它永远对 w 成线性关系. 非线形的部分被 $\phi(s)$ 所吸收.

Value function approximation

正式地定义 value function approximation: 我们希望得到 $v_{\pi}(s)$ 的近似, 即 $\hat{v}(s, w)$ . 即我们希望找到一个 $w$ 使得 $\hat{v}(s, w)$ 尽可能接近 $v_{\pi}(s)$ . 那么我们可以定义如下的目标函数:

$$ J(w) = \mathbb{E}[(v_{\pi}(S) - \hat{v}(S, w))^2] $$

我们的优化目标是是,希望找到一个 $w$ 使得 $J(w)$ 尽可能小. 因此我们可以用梯度下降的方式求解:

$$ \begin{aligned} & \nabla_w J(w) = \nabla \mathbb{E}[(v_{\pi}(S) - \hat{v}(S, w))^2] \\ & = \mathbb{E}[\nabla (v_{\pi}(S) - \hat{v}(S, w))^2] \\ & = 2\mathbb{E}[(v_{\pi}(S) - \hat{v}(S, w)) (- \nabla \hat{v}(S, w))] \\ & = -2\mathbb{E}[(v_{\pi}(S) - \hat{v}(S, w)) \nabla \hat{v}(S, w)] \end{aligned} $$

对此我们可以用迭代的方法进行求解:

$$ \begin{aligned} & w_{t+1} = w_t - \alpha_t \nabla_w J(w_t) \\ & = w_t \red{+} \alpha_t [(v_{\pi}(s_t) - \hat{v}(s_t, w_t)) \nabla \hat{v}(s_t, w_t)] \end{aligned} $$

注意上面梯度下降的式子是负的

但是这里有一个未知的值: $v_{\pi}(s_t)$ . 因此我们可以用一个估计值来代替 $v_{\pi}(s_t)$ . 我们可以通过 MC learning 或者 TD learning 的方法来估计它, 比如使用 TD learning 的方式:

$$ \begin{aligned} & w_{t+1} = w_t + \alpha_t [(v_{\pi}(s_t) - \hat{v}(s_t, w_t)) \nabla \hat{v}(s_t, w_t)] \\ & = w_t + \alpha_t [r_t + \gamma \hat{v}(s_{t+1}, w_t)- \hat{v}(s_t, w_t)] \nabla \hat{v}(s_t, w_t) \end{aligned} $$

这样我们就可以用迭代法来求了:

  • 我们对 $(s_t, r_t, s_{t+1})$ 进行采样
  • 然后不断迭代求解 $w$ 使得 $\hat{v}(s, w)$ 最终接近 $v_{\pi}(s)$

这个算法最终解决了如何估计 state value 的问题.

$\hat{v}(s, w)$ 的选取

上面的式子中, 我们发现我们仍然需要知道 $\hat{v}(s, w)$ 的值. 因此选取拟合的方式很重要. 我们除了使用线形的方式进行拟合, 还可以通过神经网络的方式进行拟合. 神经网络本质上就是对非线形关系进行拟合.

Example: Linear Approximation

以这个例子为例, 由于这个例子的 model 我们都知道, 因此我们可以直接线形求解:

import numpy as np

H, W = 5, 5
gamma = 0.9

ACTIONS = {
    0: (-1, 0),  # up
    1: (1, 0),  # down
    2: (0, -1),  # left
    3: (0, 1),  # right
    4: (0, 0),  # stay
}
pi = 1.0 / len(ACTIONS)  # 0.2

# 1-based coordinates
forbidden = {(2, 2), (2, 3), (3, 3), (4, 2), (4, 4), (5, 2)}
target = (4, 3)

r_forbidden = -1.0
r_boundary = -1.0
r_target = 1.0


def idx(r, c):  # r,c are 1-based
    return (r - 1) * W + (c - 1)


def step(r, c, a):
    # direction row, column
    dr, dc = ACTIONS[a]
    # next row, column
    nr, nc = r + dr, c + dc

    # boundary -> stay (and get boundary reward)
    if nr < 1 or nr > H or nc < 1 or nc > W:
        return (r, c), r_boundary

    ns = (nr, nc)

    # reward by landing cell
    if ns == target:
        return ns, r_target
    if ns in forbidden:
        return ns, r_forbidden

    return ns, 0.0  # normal cell reward


S = H * W
# state transition probability matrix
P_pi = np.zeros((S, S), dtype=float)
# state reward vector
r_pi = np.zeros(S, dtype=float)

for r in range(1, H + 1):
    for c in range(1, W + 1):
        s = idx(r, c)
        for a in ACTIONS:
            (nr, nc), reward = step(r, c, a)
            sp = idx(nr, nc)
            P_pi[s, sp] += pi
            r_pi[s] += pi * reward

I = np.eye(S)
A = I - gamma * P_pi
# Solve (I - gamma P) v = r
v = np.linalg.solve(A, r_pi)
# print as 5x5
V = v.reshape(H, W)
np.set_printoptions(precision=3, suppress=True)
print("\n=== V (5x5) ===")
print(V)

注意这里的计算都是矩阵计算

np.linalg.solve(A, b) 是 NumPy 里用来解线性方程组的函数:它直接求出满足 $Ax = b$ 的 $x$ 的值. 因此原先 $v = r_{\pi} + \gamma P_{\pi}v$ 被移项成了: $(I - \gamma P_{\pi})v = r_{\pi}$

我们只需要构造出来 $A$ 和 $r_{\pi}$ 就可以求解出 $v$ 的值.

最终我们得到:

=== V (5x5) ===
[[-3.846 -3.814 -3.644 -3.121 -3.235]
 [-3.794 -3.847 -3.8   -3.106 -2.924]
 [-3.572 -3.896 -3.382 -3.194 -2.944]
 [-3.901 -3.616 -3.403 -2.898 -3.239]
 [-4.459 -4.161 -3.386 -3.363 -3.453]]

使用一个三维图像进行可视化:

有了这个答案我们可以看看如果我们用 linear approximation 来求解会怎么样.

import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401

# =========================
# 1) Gridworld definition (keep yours)
# =========================
H, W = 5, 5
gamma = 0.9

ACTIONS = {
    0: (-1, 0),  # up
    1: (1, 0),  # down
    2: (0, -1),  # left
    3: (0, 1),  # right
    4: (0, 0),  # stay
}
A = len(ACTIONS)

r_forbidden = -1.0
r_boundary = -1.0
r_target = 1.0

FORBIDDEN = {(2, 2), (2, 3), (3, 3), (4, 2), (4, 4), (5, 2)}
TARGET = (4, 3)


def step(state_rc, action):
    r, c = state_rc
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc

    if nr < 1 or nr > H or nc < 1 or nc > W:
        return (r, c), r_boundary

    ns = (nr, nc)
    if ns == TARGET:
        return ns, r_target
    if ns in FORBIDDEN:
        return ns, r_forbidden
    return ns, 0.0


# =========================================
# 2) Polynomial features phi(s) up to degree K
# =========================================
def make_poly_phi(degree: int):
    """
    Build polynomial features in (rn, cn) up to total degree=degree.
    Example:
      degree=1 -> [1, rn, cn]
      degree=2 -> [1, rn, cn, rn^2, rn*cn, cn^2]
      degree=3 -> add rn^3, rn^2*cn, rn*cn^2, cn^3
    """
    assert degree >= 0

    def phi(state_rc):
        r, c = state_rc
        rn = (r - 1) / (H - 1)  # [0,1]
        cn = (c - 1) / (W - 1)  # [0,1]

        feats = [1.0]
        # total degree d: rn^i * cn^j with i+j=d
        for d in range(1, degree + 1):
            for i in range(d, -1, -1):
                j = d - i
                feats.append((rn**i) * (cn**j))
        return np.asarray(feats, dtype=np.float64)

    # feature dimension
    dim = 1 + sum((d + 1) for d in range(1, degree + 1))  # 1 + (2+3+...+degree+1)
    return phi, dim


def v_hat(state_rc, w, phi_fn):
    return float(np.dot(w, phi_fn(state_rc)))


# =========================================
# 3) TD(0) trainer
# =========================================
rng = np.random.default_rng(0)

num_episodes = 500
steps_per_episode = 500
alpha0 = 0.1


def alpha(t):
    return alpha0 / np.sqrt(1.0 + t / 2000.0)


def train_td0(phi_fn, d, seed=0):
    rng_local = np.random.default_rng(seed)
    w = np.zeros(d, dtype=np.float64)

    td_err_log = []
    t_global = 0

    for ep in range(num_episodes):
        s = (int(rng_local.integers(1, H + 1)), int(rng_local.integers(1, W + 1)))

        for _ in range(steps_per_episode):
            a = int(rng_local.integers(0, A))
            s_next, r = step(s, a)

            v_s = v_hat(s, w, phi_fn)
            v_sn = v_hat(s_next, w, phi_fn)
            delta = r + gamma * v_sn - v_s

            w += alpha(t_global) * delta * phi_fn(s)

            if t_global % 1000 == 0:
                td_err_log.append(abs(delta))

            s = s_next
            t_global += 1

    return w, np.asarray(td_err_log, dtype=np.float64)


def compute_V(w, phi_fn):
    V = np.zeros((H, W), dtype=np.float64)
    for r in range(1, H + 1):
        for c in range(1, W + 1):
            V[r - 1, c - 1] = v_hat((r, c), w, phi_fn)
    return V


# =========================================
# 4) Train for degree 1/2/3, compute V
# =========================================
degrees = [1, 2, 3]
results = {}

for deg in degrees:
    phi_fn, d = make_poly_phi(deg)
    w, td_err = train_td0(phi_fn, d, seed=0)
    V = compute_V(w, phi_fn)
    results[deg] = {"phi": phi_fn, "d": d, "w": w, "V": V, "td_err": td_err}

    print(f"\n=== degree={deg} ===")
    print("d =", d)
    print("w =", w)
    print("Estimated V =\n", np.round(V, 3))


# =========================================
# 5) 3D visualization with gradient color + colorbar
# =========================================
rows = np.arange(1, H + 1)
cols = np.arange(1, W + 1)
C, R = np.meshgrid(cols, rows)  # X=col, Y=row

# optional: use same color scale across degrees for easy comparison
all_vals = np.concatenate([results[d]["V"].ravel() for d in degrees])
vmin, vmax = float(all_vals.min()), float(all_vals.max())

for deg in degrees:
    Z = results[deg]["V"]

    fig = plt.figure(figsize=(9, 6))
    ax = fig.add_subplot(111, projection="3d")

    surf = ax.plot_surface(
        C,
        R,
        Z,
        rstride=1,
        cstride=1,
        linewidth=0,
        antialiased=True,
        cmap="viridis",  # 渐变色
        vmin=vmin,
        vmax=vmax,  # 统一色阶(想各自自适应就删掉这行)
    )

    # colorbar(显示渐变对应的数值)
    fig.colorbar(surf, ax=ax, shrink=0.7, pad=0.1, label="v_hat")

    ax.set_xlabel("col")
    ax.set_ylabel("row")
    ax.set_zlabel("v_hat(row,col)")
    ax.set_title(f"TD(0) + Poly Features (degree={deg})")
    ax.set_box_aspect((1.0, 1.0, 0.5))
    ax.invert_yaxis()
    ax.view_init(elev=30, azim=225)

    plt.savefig(f"value_surface_deg{deg}.png", dpi=300, bbox_inches="tight")
    plt.show()

我们可以看到以下结果:

我们可以看到, 随着 degree 的增加, 拟合的效果在变好.

一句感叹… AI 好强. 这种POC的代码都不用手写了.