
Commit fc3ef73

Updates dl/reinforcement/reinforcement.md
Auto commit by GitBook Editor
1 parent de30dc9 commit fc3ef73

File tree

3 files changed: +34 -3 lines changed


assets/reinforcementlearing3.png (123 KB)

assets/reinforcementlearning2.png (61.2 KB)

dl/reinforcement/reinforcement.md

+34 -3
@@ -1,13 +1,42 @@
Besides the agent and the environment, the elements of reinforcement learning also include the **policy**, the **reward signal**, the **value function**, and the **model** of the environment. Each of these elements is described below:

1. **Policy**
A policy is a mapping from the current environment state to an action. It can be deterministic or stochastic:

$$a_t = \mu_\theta(s_t)$$ (deterministic)

$$a_t \sim \pi_\theta(\cdot|s_t)$$ (stochastic)

Stochastic policies are further divided into **categorical policies and diagonal Gaussian policies**.

**Categorical policies are typically used in discrete action spaces.**

Sampling: the policy outputs a probability for each action, and an action is sampled at random according to those probabilities.

Log-likelihood: the log-probability of an action $$a$$ is $$\log\pi_\theta(a|s) = \log\left[P_\theta(s)\right]_a$$, where $$P_\theta(s)$$ denotes the vector of action probabilities for state $$s$$.

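As a concrete illustration, here is a minimal sketch of sampling and log-likelihood computation for a categorical policy using PyTorch's `torch.distributions`; the small network, observation size, and number of actions are assumptions invented for the example.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Hypothetical policy network: maps a state to one logit per discrete action.
obs_dim, n_actions = 4, 3
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

state = torch.randn(1, obs_dim)     # example state s_t
logits = policy_net(state)          # unnormalized log-probabilities; softmax gives P_theta(s)
dist = Categorical(logits=logits)   # categorical distribution over the action space

action = dist.sample()              # a_t ~ pi_theta(.|s_t)
log_prob = dist.log_prob(action)    # log pi_theta(a|s) = log [P_theta(s)]_a
```
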
**Diagonal Gaussian policies are typically used in continuous action spaces.**

Sampling: an action is drawn as $$a = \mu_\theta(s) + \sigma_\theta(s) \odot z$$, where $$z \sim \mathcal{N}(0, I)$$.

Log-likelihood: $$\log\pi_\theta(a|s) = -\frac{1}{2}\left(\sum_{i=1}^{k}\left(\frac{(a_i-\mu_i)^2}{\sigma_i^2} + 2\log\sigma_i\right) + k\log 2\pi\right)$$, where $$k$$ is the action dimension.

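The same computation as a minimal PyTorch sketch; the mean network and the state-independent log-standard-deviation parameter are illustrative assumptions, not something specified in the text above.

```python
import math
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2
mu_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
log_std = nn.Parameter(-0.5 * torch.ones(act_dim))  # log sigma_theta (state-independent here)

state = torch.randn(1, obs_dim)
mu = mu_net(state)                 # mu_theta(s)
sigma = torch.exp(log_std)         # sigma_theta(s)

z = torch.randn_like(mu)           # z ~ N(0, I)
action = mu + sigma * z            # a = mu_theta(s) + sigma_theta(s) (elementwise) z

# log pi_theta(a|s) for a diagonal Gaussian, matching the formula above.
k = act_dim
log_prob = -0.5 * ((((action - mu) / sigma) ** 2).sum(-1)
                   + 2 * log_std.sum()
                   + k * math.log(2 * math.pi))
```
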
The next state can be represented as:

$$s_{t+1} = f(s_t,a_t)$$

$$s_{t+1} \sim P(\cdot|s_t, a_t)$$

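A toy sketch of these two kinds of transitions; the one-dimensional dynamics and the noise scale are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(s, a):
    """Deterministic transition: s_{t+1} = f(s_t, a_t)."""
    return s + 0.1 * a

def sample_next_state(s, a):
    """Stochastic transition: s_{t+1} ~ P(.|s_t, a_t), here Gaussian noise around f(s, a)."""
    return rng.normal(loc=f(s, a), scale=0.05)

s_t, a_t = 0.0, 1.0
print(f(s_t, a_t))                   # deterministic next state
print(sample_next_state(s_t, a_t))   # one sample from the stochastic transition
```
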
2. **Reward signal**
The reward is the feedback the agent receives after performing an action. The goal of the reinforcement learning system is to maximize the cumulative reward; performing the same action in different states may yield different rewards.
3. **Value function**
The value of a state is the cumulative reward obtained starting from that state until a terminal state is reached (a short sketch of computing this cumulative reward follows the list below).

![](/assets/reinforcementlearing3.png)

4. **Model**
The agent can use a model of the environment to predict how the environment will behave. Reinforcement learning methods that use an environment model are called model-based methods, and methods that do not are called model-free methods.

![](/assets/reinforcementlearning2.png)

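As referenced in the value-function item above, here is a minimal sketch of the quantity being described: the cumulative reward collected from the current state until the terminal state. The sample reward sequence and the optional discount factor are assumptions for the example.

```python
# Cumulative reward from the current state to the terminal state,
# computed backwards so each step's return builds on the return of the next step.
def returns(rewards, gamma=1.0):
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

# Rewards collected along one episode, starting from the current state.
episode_rewards = [1.0, 0.0, 0.0, 5.0]
print(returns(episode_rewards))             # plain cumulative rewards
print(returns(episode_rewards, gamma=0.9))  # discounted variant
```
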

![](/assets/reinforcementlearning1.png)

@@ -23,3 +52,5 @@

[^1]: Enter footnote here.
