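Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy deep reinforcement learning algorithm inspired by Deep Q-Network. It is an Actor-Critic method built on policy gradients. This article walks through a complete implementation in PyTorch and explains each part.
The key components of DDPG are: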
- Replay Buffer
- Actor-Critic neural network
- Exploration Noise
- Target network
- Soft Target Updates for Target Network
Let's implement them one by one:
Replay Buffer
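DDPG uses a Replay Buffer to store the transitions and rewards (Sₜ, aₜ, Rₜ, Sₜ₊₁) sampled while exploring the environment. The Replay Buffer plays a crucial role in helping the agent learn faster and in keeping DDPG stable:
- Minimizing correlation between samples: past experiences are stored in the Replay Buffer, so the agent can learn from a wide variety of experiences.
- Enabling off-policy learning: the agent samples transitions from the replay buffer rather than from the current policy.
- Efficient sampling: because past experiences are kept in the buffer, the agent can learn from them multiple times.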
    import numpy as np


    class Replay_buffer():
        '''
        Code based on:
        https://github.com/openai/baselines/blob/master/baselines/deepq/replay_buffer.py
        Expects tuples of (state, next_state, action, reward, done)
        '''

        def __init__(self, max_size=1000000):
            """Create Replay buffer.
            Parameters
            ----------
            max_size: int
                Max number of transitions to store in the buffer. When the buffer
                overflows the old memories are dropped. The default equals the
                `capacity` hyperparameter used in this article.
            """
            self.storage = []
            self.max_size = max_size
            self.ptr = 0

        def push(self, data):
            if len(self.storage) == self.max_size:
                # Buffer is full: overwrite the oldest transition
                self.storage[int(self.ptr)] = data
                self.ptr = (self.ptr + 1) % self.max_size
            else:
                self.storage.append(data)

        def sample(self, batch_size):
            """Sample a batch of experiences uniformly, with replacement.
            Parameters
            ----------
            batch_size: int
                How many transitions to sample.
            Returns
            -------
            state: np.array
                batch of states or observations
            next_state: np.array
                next states or observations seen after executing action
            action: np.array
                batch of actions executed given a state
            reward: np.array, shape (batch_size, 1)
                rewards received as results of executing action
            done: np.array, shape (batch_size, 1)
                done[i] = 1 if executing action[i] resulted in
                the end of an episode and 0 otherwise.
            """
            ind = np.random.randint(0, len(self.storage), size=batch_size)
            state, next_state, action, reward, done = [], [], [], [], []
            for i in ind:
                st, n_st, act, rew, dn = self.storage[i]
                # np.asarray avoids a copy when the item is already an array
                state.append(np.asarray(st))
                next_state.append(np.asarray(n_st))
                action.append(np.asarray(act))
                reward.append(np.asarray(rew))
                done.append(np.asarray(dn))
            return (np.array(state), np.array(next_state), np.array(action),
                    np.array(reward).reshape(-1, 1), np.array(done).reshape(-1, 1))
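To make the overflow behaviour concrete, here is a minimal sketch of the same ring-buffer idea using only the standard library. `TinyBuffer`, its tiny capacity, and the dummy transitions are hypothetical and exist only for illustration; this is not the class used in the rest of the article.

```python
import random


class TinyBuffer:
    """Hypothetical, stripped-down ring buffer mirroring push() and sample()."""

    def __init__(self, max_size):
        self.storage = []
        self.max_size = max_size
        self.ptr = 0

    def push(self, data):
        if len(self.storage) == self.max_size:
            # Full: overwrite the oldest entry and advance the write pointer
            self.storage[self.ptr] = data
            self.ptr = (self.ptr + 1) % self.max_size
        else:
            self.storage.append(data)

    def sample(self, batch_size, rng):
        # Uniform sampling with replacement
        return [rng.choice(self.storage) for _ in range(batch_size)]


buf = TinyBuffer(max_size=3)
for step in range(5):
    # (state, next_state, action, reward, done) with dummy values
    buf.push((step, step + 1, 0.0, -1.0, 0.0))

stored_states = sorted(t[0] for t in buf.storage)
batch = buf.sample(batch_size=4, rng=random.Random(0))
print(stored_states)  # the two oldest transitions (states 0 and 1) are gone
```

Because sampling is with replacement, a batch can be larger than the number of stored transitions, which is why the agent can start updating before the buffer is full.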
Actor-Critic Neural Network
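This is a PyTorch implementation of the Actor-Critic part of the algorithm. The code defines two neural network models, an Actor and a Critic.
Actor input: the environment state. Actor output: an action with continuous values.
Critic input: the environment state and the action. Critic output: the Q-value, that is, the expected total reward of the current state-action pair.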
    import torch
    import torch.nn as nn


    class Actor(nn.Module):
        """
        The Actor model takes in a state observation as input and
        outputs an action, which is a continuous value.
        It consists of four fully connected linear layers, with ReLU activations
        after the first three. The final layer outputs one value per action dimension.
        """

        def __init__(self, n_states, action_dim, hidden1):
            super(Actor, self).__init__()
            self.net = nn.Sequential(
                nn.Linear(n_states, hidden1),
                nn.ReLU(),
                nn.Linear(hidden1, hidden1),
                nn.ReLU(),
                nn.Linear(hidden1, hidden1),
                nn.ReLU(),
                nn.Linear(hidden1, action_dim),  # one output per action dimension
            )

        def forward(self, state):
            return self.net(state)


    class Critic(nn.Module):
        """
        The Critic model takes in both a state observation and an action as input and
        outputs a Q-value, which estimates the expected total reward for the current state-action pair.
        It consists of four linear layers, with ReLU activations after the first three.
        State and action inputs are concatenated before being fed into the first linear layer.
        The output layer has a single output, representing the Q-value.
        """

        def __init__(self, n_states, action_dim, hidden2):
            super(Critic, self).__init__()
            self.net = nn.Sequential(
                nn.Linear(n_states + action_dim, hidden2),
                nn.ReLU(),
                nn.Linear(hidden2, hidden2),
                nn.ReLU(),
                nn.Linear(hidden2, hidden2),
                nn.ReLU(),
                nn.Linear(hidden2, 1),  # single scalar Q-value
            )

        def forward(self, state, action):
            return self.net(torch.cat((state, action), 1))
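A quick way to check the input and output contract described above is to push a random batch through two small networks and look at the shapes. The sketch below is self-contained and deliberately uses shallower networks than the article's; the hidden size and batch size are arbitrary assumptions, while the state and action sizes (2 and 1) are those of MountainCarContinuous-v0.

```python
import torch
import torch.nn as nn

# Illustrative sizes: 2-dimensional observation, 1-dimensional action.
# hidden and batch are arbitrary assumptions.
n_states, action_dim, hidden, batch = 2, 1, 16, 5

actor = nn.Sequential(
    nn.Linear(n_states, hidden), nn.ReLU(),
    nn.Linear(hidden, action_dim),
)
critic = nn.Sequential(
    nn.Linear(n_states + action_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, 1),
)

state = torch.randn(batch, n_states)
action = actor(state)                                 # (batch, action_dim)
q_value = critic(torch.cat((state, action), dim=1))   # (batch, 1)

print(action.shape, q_value.shape)
```

The actor's last layer must have `action_dim` outputs and the critic's last layer exactly one, whatever the depth of the hidden stack.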
Exploration Noise
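Adding noise to the action chosen by the Actor is a technique used in DDPG to encourage exploration and improve the learning process.
Either Gaussian noise or Ornstein-Uhlenbeck noise can be used. Gaussian noise is simple and easy to implement. Ornstein-Uhlenbeck noise is temporally correlated, which can help the agent explore the action space more effectively. Compared with Gaussian noise, however, Ornstein-Uhlenbeck noise fluctuates more smoothly and is less random.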
    import copy

    import numpy as np


    class OU_Noise(object):
        """Ornstein-Uhlenbeck process.
        code from :
        https://math.stackexchange.com/questions/1287634/implementing-ornstein-uhlenbeck-in-matlab
        The OU_Noise class has four attributes
        size: the size of the noise vector to be generated
        mu: the mean of the noise, set to 0 by default
        theta: the rate of mean reversion, controlling how quickly the noise returns to the mean
        sigma: the volatility of the noise, controlling the magnitude of fluctuations
        """

        def __init__(self, size, seed, mu=0., theta=0.15, sigma=0.2):
            self.mu = mu * np.ones(size)
            self.theta = theta
            self.sigma = sigma
            # Seed the generator that sample() actually draws from
            self.rng = np.random.default_rng(seed)
            self.reset()

        def reset(self):
            """Reset the internal state (= noise) to mean (mu)."""
            self.state = copy.copy(self.mu)

        def sample(self):
            """Update internal state and return it as a noise sample.
            This method uses the current state of the noise and generates the next sample
            """
            dx = (self.theta * (self.mu - self.state)
                  + self.sigma * self.rng.standard_normal(len(self.state)))
            self.state += dx
            # Return a copy so callers cannot modify the internal state
            return self.state.copy()
To use Gaussian noise in DDPG, simply add it directly to the action the agent selects.
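The difference between the two noise types can be checked numerically. The following standard-library sketch generates both kinds of noise and compares their lag-1 autocorrelation; `lag1_autocorr` is a hypothetical helper written for this illustration, and the OU parameters simply reuse the defaults of `OU_Noise`.

```python
import random


def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


rng = random.Random(0)
theta, sigma, mu = 0.15, 0.2, 0.0   # same defaults as OU_Noise
n_steps = 5000

# Independent Gaussian noise: each sample ignores the previous one
gaussian = [rng.gauss(0.0, sigma) for _ in range(n_steps)]

# OU noise: each step pulls the state back toward mu, then adds a Gaussian kick
ou, x = [], mu
for _ in range(n_steps):
    x += theta * (mu - x) + sigma * rng.gauss(0.0, 1.0)
    ou.append(x)

gaussian_ac = lag1_autocorr(gaussian)
ou_ac = lag1_autocorr(ou)
print(f"Gaussian lag-1 autocorrelation: {gaussian_ac:.2f}")
print(f"OU lag-1 autocorrelation:       {ou_ac:.2f}")
```

Consecutive OU samples are strongly correlated (the discrete update keeps a fraction 1 - theta of the previous state), so the perturbation drifts in one direction for several steps instead of flipping sign at random, while the Gaussian samples are close to uncorrelated.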
DDPG
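DDPG (Deep Deterministic Policy Gradient) uses two sets of Actor-Critic neural networks for function approximation. The second set is the target networks: a target Actor and a target Critic with the same structure and parameterization as the main Actor-Critic networks.
During training, the agent uses its Actor-Critic networks to interact with the environment and stores experience tuples (Sₜ, Aₜ, Rₜ, Sₜ₊₁) in the Replay Buffer. The agent then samples from the Replay Buffer and uses the data to update the Actor-Critic networks. Instead of updating the target network weights by copying them directly from the Actor-Critic networks, DDPG updates them slowly through a process called soft target updates.
A soft target update transfers only a small fraction of the weights, set by the target update rate (τ), from the Actor-Critic networks to the target networks.
The soft target update rule is:

$$\theta' \leftarrow \tau\,\theta + (1 - \tau)\,\theta'$$

where θ denotes the weights of the Actor or Critic network and θ′ the weights of the corresponding target network.
Using soft target updates greatly improves the stability of learning.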
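As a worked example of how slowly the target weights move, the sketch below applies the soft update repeatedly to a pair of plain Python lists. The weight values are made up, `soft_update` is a hypothetical helper, and the online weights are held fixed, which is a simplification since in real training they change between updates.

```python
tau = 0.001  # target update rate used in this article


def soft_update(online, target, tau):
    """Return new target weights: tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


online_weights = [1.0, -2.0]   # made-up values standing in for network weights
target_weights = [0.0, 0.0]

n_updates = 1000
for _ in range(n_updates):
    target_weights = soft_update(online_weights, target_weights, tau)

# With fixed online weights, the remaining gap shrinks by (1 - tau) per update
remaining = (1.0 - tau) ** n_updates
print(target_weights, remaining)
```

With τ = 0.001, a thousand updates close only about 63% of the gap between target and online weights, so the targets used in the Critic loss change gradually.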
    # Set hyperparameters
    # Hyperparameters adapted for performance from
    capacity = 1000000        # maximum replay buffer size
    batch_size = 64
    update_iteration = 200    # gradient steps per call to update()
    tau = 0.001               # tau for soft updating
    gamma = 0.99              # discount factor
    directory = './'
    hidden1 = 20              # hidden layer size for the actor
    hidden2 = 64              # hidden layer size for the critic (must be an int)
    import torch
    import torch.nn.functional as F
    import torch.optim as optim


    class DDPG(object):
        def __init__(self, state_dim, action_dim):
            """
            Initializes the DDPG agent.
            Takes two arguments:
            state_dim, the dimensionality of the state space, and
            action_dim, the dimensionality of the action space.
            Creates a replay buffer, the actor and critic networks and their corresponding target networks.
            It also initializes the optimizers for both networks along with
            counters to track the number of training iterations.
            Uses the module-level `device`, which must be defined before the agent is created.
            """
            self.replay_buffer = Replay_buffer(max_size=capacity)

            self.actor = Actor(state_dim, action_dim, hidden1).to(device)
            self.actor_target = Actor(state_dim, action_dim, hidden1).to(device)
            self.actor_target.load_state_dict(self.actor.state_dict())
            self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=3e-3)

            self.critic = Critic(state_dim, action_dim, hidden2).to(device)
            self.critic_target = Critic(state_dim, action_dim, hidden2).to(device)
            self.critic_target.load_state_dict(self.critic.state_dict())
            self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=2e-2)

            # Counters for the number of updates performed
            self.num_critic_update_iteration = 0
            self.num_actor_update_iteration = 0
            self.num_training = 0

        def select_action(self, state):
            """
            Takes the current state as input and returns an action to take in that state.
            It uses the actor network to map the state to an action.
            """
            state = torch.FloatTensor(state.reshape(1, -1)).to(device)
            return self.actor(state).cpu().detach().numpy().flatten()

        def update(self):
            """
            Updates the actor and critic networks using batches sampled from the replay buffer.
            For each batch, it computes the target Q value using the target critic and target actor networks,
            then the current Q value using the critic network and the actions stored in the batch.
            The critic loss is the mean squared error between the target Q value and the current Q value,
            and the critic is updated by gradient descent on it.
            The actor loss is the negative mean Q value of the actor's actions as judged by the critic,
            so minimizing it performs gradient ascent on Q.
            Finally, the target networks are moved toward the main networks with soft updates.
            This process is repeated update_iteration times.
            """
            for it in range(update_iteration):
                # Sample a random batch of transitions from the replay buffer
                state, next_state, action, reward, done = self.replay_buffer.sample(batch_size)
                state = torch.FloatTensor(state).to(device)
                action = torch.FloatTensor(action).to(device)
                next_state = torch.FloatTensor(next_state).to(device)
                # Mask that is 0 for terminal transitions and 1 otherwise
                not_done = torch.FloatTensor(1 - done).to(device)
                reward = torch.FloatTensor(reward).to(device)

                # Compute the target Q value
                target_Q = self.critic_target(next_state, self.actor_target(next_state))
                target_Q = reward + (not_done * gamma * target_Q).detach()

                # Get current Q estimate
                current_Q = self.critic(state, action)

                # Compute critic loss
                critic_loss = F.mse_loss(current_Q, target_Q)

                # Optimize the critic
                self.critic_optimizer.zero_grad()
                critic_loss.backward()
                self.critic_optimizer.step()

                # Compute actor loss as the negative mean Q value of the actor's actions
                actor_loss = -self.critic(state, self.actor(state)).mean()

                # Optimize the actor
                self.actor_optimizer.zero_grad()
                actor_loss.backward()
                self.actor_optimizer.step()

                # Soft-update the target networks: a fraction tau of the main
                # network weights is blended into their target counterparts
                for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
                    target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

                for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
                    target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

                self.num_actor_update_iteration += 1
                self.num_critic_update_iteration += 1

        def save(self):
            """Saves the state dictionaries of the actor and critic networks to files"""
            torch.save(self.actor.state_dict(), directory + 'actor.pth')
            torch.save(self.critic.state_dict(), directory + 'critic.pth')

        def load(self):
            """Loads the state dictionaries of the actor and critic networks from files"""
            self.actor.load_state_dict(torch.load(directory + 'actor.pth'))
            self.critic.load_state_dict(torch.load(directory + 'critic.pth'))
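For reference, the quantities computed inside `update()` can be written out as equations. The notation is an assumption introduced here, not taken from the article: μ and Q are the actor and critic, μ′ and Q′ their target networks, θ and θ′ the weights of a network and of its target, d_i the done flag of transition i, and N the batch size.

```latex
\begin{align}
y_i &= r_i + \gamma\,(1 - d_i)\,Q'\big(s'_i,\ \mu'(s'_i)\big)
  && \text{(target Q value)} \\
L_{\text{critic}} &= \frac{1}{N}\sum_{i=1}^{N}\big(Q(s_i, a_i) - y_i\big)^2
  && \text{(mean squared error)} \\
L_{\text{actor}} &= -\frac{1}{N}\sum_{i=1}^{N} Q\big(s_i,\ \mu(s_i)\big)
  && \text{(negative mean Q value)} \\
\theta' &\leftarrow \tau\,\theta + (1 - \tau)\,\theta'
  && \text{(soft target update)}
\end{align}
```

The factor (1 − d_i) removes the bootstrapped term for terminal transitions, which is what the `1 - done` mask in the code does.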
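Training DDPG
Here we use the "MountainCarContinuous-v0" environment from OpenAI Gym to train our DDPG RL model. The environment has continuous action and observation spaces, and the goal is to get the car to the top of the hill as quickly as possible.
Below we define the parameters of the run, such as the maximum number of training episodes, the exploration noise, and the rendering interval. Using fixed random seeds makes the run reproducible.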
    import gym
    import numpy as np
    import torch

    # Create the environment
    env_name = 'MountainCarContinuous-v0'
    env = gym.make(env_name)
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Define different parameters for training the agent
    max_episode = 100
    max_time_steps = 5000
    ep_r = 0
    total_step = 0
    score_hist = []

    # For rendering the environment
    render = True
    render_interval = 10

    # For reproducibility
    # (env.seed belongs to the classic Gym API, i.e. gym < 0.26)
    env.seed(0)
    torch.manual_seed(0)
    np.random.seed(0)

    # Environment state and action dimensions
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]
    max_action = float(env.action_space.high[0])
    min_Val = torch.tensor(1e-7).float().to(device)

    # Exploration noise
    exploration_noise = 0.1 * max_action
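The `exploration_noise` scale defined above is not used by the training loop in this article, which draws its noise from a standard normal distribution. As a sketch of how such a scale could be applied, the hypothetical helper `noisy_action` below adds zero-mean Gaussian noise with that standard deviation and clips the result to the action bounds. It uses only the standard library, and the base action 0.95 is an arbitrary value chosen to sit near the upper bound.

```python
import random


def noisy_action(action, noise_std, max_action, rng):
    """Add zero-mean Gaussian noise to each action component, then clip to bounds."""
    noisy = [a + rng.gauss(0.0, noise_std) for a in action]
    return [max(-max_action, min(max_action, a)) for a in noisy]


rng = random.Random(0)
max_action = 1.0                       # action bound of MountainCarContinuous-v0
exploration_noise = 0.1 * max_action   # same scale as defined above

actions = [noisy_action([0.95], exploration_noise, max_action, rng)
           for _ in range(1000)]
lowest = min(a[0] for a in actions)
highest = max(a[0] for a in actions)
print(lowest, highest)
```

Clipping matters because the environment only accepts actions within its bounds. Whenever the policy output is close to a bound, a noticeable share of the noisy actions ends up exactly at that bound.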
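Create an instance of the DDPG agent class and train the agent for the specified number of episodes. At the end of each episode the agent's update() method is called to update the parameters, and every ten episodes the save() method writes the agent's parameters to files.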
    # Create a DDPG instance
    agent = DDPG(state_dim, action_dim)

    # Train the agent for max_episode episodes
    # (classic Gym API: reset() returns the observation, step() returns 4 values)
    for i in range(max_episode):
        total_reward = 0
        step = 0
        state = env.reset()
        for t in range(max_time_steps):
            action = agent.select_action(state)
            # Add Gaussian noise to actions for exploration
            action = (action + np.random.normal(0, 1, size=action_dim)).clip(-max_action, max_action)
            # action += ou_noise.sample()
            next_state, reward, done, info = env.step(action)
            total_reward += reward
            if render and i >= render_interval:
                env.render()
            # np.float was removed from NumPy; the built-in float does the same job
            agent.replay_buffer.push((state, next_state, action, reward, float(done)))
            state = next_state
            if done:
                break
            step += 1

        score_hist.append(total_reward)
        total_step += step + 1
        print("Episode: \t{} Total Reward: \t{:0.2f}".format(i, total_reward))
        agent.update()
        if i % 10 == 0:
            agent.save()
    env.close()
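The environment calls in this article follow the classic Gym API, where `reset()` returns only the observation and `step()` returns four values. Gym 0.26 and later, as well as Gymnasium, return `(obs, info)` from `reset()` and five values from `step()`. One possible way to keep the loop unchanged is a pair of small adapter functions. `reset_env` and `step_env` below are hypothetical helpers, not part of Gym, and the two dummy environment classes exist only to exercise them.

```python
def reset_env(env):
    """Return only the observation, whether reset() yields obs or (obs, info).

    Assumes observations are never 2-tuples themselves (true for array observations).
    """
    result = env.reset()
    if isinstance(result, tuple) and len(result) == 2:
        return result[0]
    return result


def step_env(env, action):
    """Return (next_state, reward, done, info) for both 4- and 5-value step()."""
    result = env.step(action)
    if len(result) == 5:
        next_state, reward, terminated, truncated, info = result
        return next_state, reward, bool(terminated or truncated), info
    return tuple(result)


class OldStyleEnv:
    """Dummy environment mimicking the classic API."""
    def reset(self):
        return [0.0, 0.0]

    def step(self, action):
        return [0.1, 0.0], -1.0, False, {}


class NewStyleEnv:
    """Dummy environment mimicking the newer API."""
    def reset(self):
        return [0.0, 0.0], {}

    def step(self, action):
        return [0.1, 0.0], -1.0, False, True, {}


old_obs = reset_env(OldStyleEnv())
new_obs = reset_env(NewStyleEnv())
old_step = step_env(OldStyleEnv(), [0.0])
new_step = step_env(NewStyleEnv(), [0.0])
print(old_obs, new_obs, old_step[2], new_step[2])
```

Merging `terminated` and `truncated` into a single `done` flag is the simplest mapping, but it also means a time-limit truncation is stored as terminal in the replay buffer, which removes the bootstrapped term for that transition. Whether that is acceptable depends on the task.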
Testing DDPG
    from itertools import count

    test_iteration = 100
    for i in range(test_iteration):
        state = env.reset()
        for t in count():
            # No exploration noise at test time: use the actor's action directly
            action = agent.select_action(state)
            next_state, reward, done, info = env.step(np.float32(action))
            ep_r += reward
            print(reward)
            env.render()
            if done:
                print("reward{}".format(reward))
                print("Episode \t{}, the episode reward is \t{:0.2f}".format(i, ep_r))
                ep_r = 0
                env.render()
                break
            state = next_state
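We used the following settings to get the model to converge:
- Sample the noise from a standard normal distribution rather than sampling it at random.
- Change the polyak constant (tau) from 0.99 to 0.001.
- Change the hidden layer sizes of the Critic network to [64, 64]. Remove the ReLU activation after the second layer of the Critic network, giving (Linear, ReLU, Linear, Linear).
- Change the maximum buffer size to 1000000.
- Change batch_size from 128 to 64.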
The result after 75 episodes of training is as follows: [figure]
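Summary
DDPG is a model-free, off-policy Actor-Critic algorithm inspired by the deep Q-Network (DQN) algorithm. It combines the strengths of policy gradient methods and Q-learning to learn a deterministic policy over a continuous action space.
Like DQN, it uses a replay buffer to store past experiences and target networks to compute training targets, which improves the stability of training.
DDPG needs careful hyperparameter tuning to reach its best performance. The hyperparameters include the learning rates, the batch size, the target network update rate, and the exploration noise parameters. Small changes in them can have a large effect on the algorithm's performance.
The parameters above come from: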
https://avoid.overfit.cn/post/9951ac196ec84629968ce7168215e461
Author: Renu Khandelwal