Deep Reinforcement Learning (DRL 4) - DQN in Practice (DQN, Double DQN, Dueling DQN)


Contents

    • 1. Environment
    • 2. DQN
    • 3. Double DQN
    • 4. Dueling DQN (D3QN)
    • 5. Summary

Full code:

https://github.com/ColinFred/Reinforce_Learning_Pytorch/tree/main/RL/DQN

1. Environment

List the available environments:

    from gym import envs
    print(envs.registry.all())

Output (abridged):

    ValuesView(├──CartPole: [ v0, v1 ]
    ├──MountainCar: [ v0 ]
    ├──MountainCarContinuous: [ v0 ]
    ├──Pendulum: [ v1 ]
    ├──Acrobot: [ v1 ]
    ├──LunarLander: [ v2 ]
    ├──LunarLanderContinuous: [ v2 ]
    ├──BipedalWalker: [ v3 ]
    ├──BipedalWalkerHardcore: [ v3 ]
    ├──CarRacing: [ v1 ]
    ├──Blackjack: [ v1 ]
    ├──FrozenLake: [ v1 ]
    ├──FrozenLake8x8: [ v1 ]
    ├──CliffWalking: [ v0 ]
    ├──Taxi: [ v3 ]
    ├──Reacher: [ v2 ]
    ├──Pusher: [ v2 ]
    ├──Thrower: [ v2 ]
    ├──Striker: [ v2 ]
    ├──InvertedPendulum: [ v2 ]

We again use the CartPole-v1 environment, but reshape the reward:

    import time
    import gym

    env = gym.make('CartPole-v1')

    # print(env.action_space)            # number of actions
    # print(env.observation_space)       # state space
    # print(env.observation_space.high)
    # print(env.observation_space.low)

    NUM_ACTIONS = env.action_space.n
    NUM_STATES = env.observation_space.shape[0]
    ENV_A_SHAPE = 0 if isinstance(env.action_space.sample(), int) else env.action_space.sample().shape

    RL = DQN(n_action=NUM_ACTIONS, n_state=NUM_STATES, learning_rate=0.01)  # choose algorithm
    total_steps = 0

    for episode in range(1000):
        state, info = env.reset(return_info=True)
        ep_r = 0
        while True:
            env.render()                                   # update env
            action = RL.choose_action(state)               # choose action
            state_, reward, done, info = env.step(action)  # take action, get next state and reward

            # reshape the reward: consider both cart position and pole angle
            x, x_dot, theta, theta_dot = state_
            r1 = (env.x_threshold - abs(x)) / env.x_threshold - 0.8
            r2 = (env.theta_threshold_radians - abs(theta)) / env.theta_threshold_radians - 0.5
            reward = r1 + r2

            RL.store_transition(state, action, reward, state_)  # store transition
            RL.learn()                                          # learn
            ep_r += reward

            if total_steps % C == 0:  # every C steps, update the target network
                RL.update_target_network()

            if done:
                print('episode: ', episode, 'ep_r: ', round(ep_r, 2))
                break

            state = state_
            total_steps += 1
            time.sleep(0.05)

2. DQN

Code:

1. Experience replay

Play a number of episodes first and store the transitions in a replay memory. A random batch (batch_memory) is then sampled from memory for mini-batch training.

    from collections import namedtuple
    from random import sample

    # A single transition
    Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))

    class ReplayMemory(object):  # fixed-capacity replay memory (ring buffer)
        def __init__(self, capacity):
            self.capacity = capacity
            self.memory = []
            self.position = 0

        def Push(self, *args):
            if len(self.memory) < self.capacity:
                self.memory.append(None)
            self.memory[self.position] = Transition(*args)
            self.position = (self.position + 1) % self.capacity

        def Sample(self, batch_size):
            return sample(self.memory, batch_size)

        def __len__(self):
            return len(self.memory)
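A quick sanity check of the buffer with dummy transitions (a sketch, not part of the original code):

    import torch

    buffer = ReplayMemory(capacity=100)
    for _ in range(10):
        s  = torch.zeros(1, 4)              # dummy state (CartPole has 4 state variables)
        a  = torch.LongTensor([0])
        s_ = torch.zeros(1, 4)
        r  = torch.FloatTensor([0.0])
        buffer.Push(s, a, s_, r)            # field order: state, action, next_state, reward

    print(len(buffer))                      # 10
    batch = buffer.Sample(4)                # list of 4 randomly chosen Transition tuples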

2. The DNN network

The input dimension is the number of state variables; the output is the Q-value Q(s, a) of each action in state s. The initial weights are drawn from a normal distribution with mean 0 and standard deviation 0.1.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DNN(nn.Module):  # fully-connected network used by DQN / Double DQN
        def __init__(self, n_state, n_action):
            super(DNN, self).__init__()
            self.input_layer = nn.Linear(n_state, 64)
            self.input_layer.weight.data.normal_(0, 0.1)
            self.middle_layer = nn.Linear(64, 32)
            self.middle_layer.weight.data.normal_(0, 0.1)
            self.middle_layer_2 = nn.Linear(32, 32)
            self.middle_layer_2.weight.data.normal_(0, 0.1)
            self.adv_layer = nn.Linear(32, n_action)
            self.adv_layer.weight.data.normal_(0, 0.1)

        def forward(self, state):
            x = F.relu(self.input_layer(state))
            x = F.relu(self.middle_layer(x))
            x = F.relu(self.middle_layer_2(x))
            out = self.adv_layer(x)
            return out

The tricky part of DQN is the learn() function, which requires some familiarity with PyTorch tensor operations (gather, max, unsqueeze, and so on).

    from random import randrange
    import numpy as np
    import torch.optim as optim

    class DQN:
        def __init__(self, n_action, n_state, learning_rate):
            self.n_action = n_action
            self.n_state = n_state
            self.memory = ReplayMemory(capacity=MEMORY_CAPACITY)
            self.memory_counter = 0
            self.model_policy = DNN(self.n_state, self.n_action)
            self.model_target = DNN(self.n_state, self.n_action)
            self.model_target.load_state_dict(self.model_policy.state_dict())
            self.model_target.eval()
            self.optimizer = optim.Adam(self.model_policy.parameters(), lr=learning_rate)

        def store_transition(self, s, a, r, s_):
            state = torch.FloatTensor([s])
            action = torch.LongTensor([a])
            reward = torch.FloatTensor([r])
            next_state = torch.FloatTensor([s_])
            self.memory.Push(state, action, next_state, reward)

        def choose_action(self, state):  # epsilon-greedy policy
            state = torch.FloatTensor(state)
            if np.random.uniform() <= EPISILO:  # with probability EPISILO, take the greedy action
                with torch.no_grad():
                    q_value = self.model_policy(state)
                action = q_value.max(0)[1].view(1, 1).item()
            else:  # otherwise, take a random action
                action = torch.tensor([randrange(self.n_action)], dtype=torch.long).item()
            return action

        def learn(self):
            if len(self.memory) < BATCH_SIZE:
                return
            transitions = self.memory.Sample(BATCH_SIZE)
            batch = Transition(*zip(*transitions))  # batch of Transitions -> Transition of batches
            state_batch = torch.cat(batch.state)
            action_batch = torch.cat(batch.action).unsqueeze(1)
            reward_batch = torch.cat(batch.reward)
            next_state_batch = torch.cat(batch.next_state)

            # Q(s, a) of the actions actually taken
            state_action_values = self.model_policy(state_batch).gather(1, action_batch)
            # DQN target: the target network both selects and evaluates the next action
            next_action_batch = torch.unsqueeze(self.model_target(next_state_batch).max(1)[1], 1)
            next_state_values = self.model_target(next_state_batch).gather(1, next_action_batch)
            expected_state_action_values = (next_state_values * GAMMA) + reward_batch.unsqueeze(1)

            loss = F.smooth_l1_loss(state_action_values, expected_state_action_values)
            self.optimizer.zero_grad()
            loss.backward()
            for param in self.model_policy.parameters():
                param.grad.data.clamp_(-1, 1)  # gradient clipping
            self.optimizer.step()

        def update_target_network(self):
            self.model_target.load_state_dict(self.model_policy.state_dict())

3. Double DQN

Double DQN (DDQN) reduces over-estimation by decoupling two steps: selecting the action for the target Q-value and evaluating that Q-value. Instead of taking the maximum Q-value over all actions directly from the target network, DDQN first uses the current (policy) network to find the action with the highest Q-value, and then uses the target network to evaluate that action and compute the target Q-value. In code, this amounts to changing the action-selection line in learn(),

    next_action_batch = torch.unsqueeze(self.model_target(next_state_batch).max(1)[1], 1)

to

    next_action_batch = torch.unsqueeze(self.model_policy(next_state_batch).max(1)[1], 1)

A very small change; the resulting target computation is sketched below.
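For clarity, the target computation inside learn() then reads as follows (a sketch reusing the variable names from the DQN class above; nothing else changes):

    # Double DQN target: the policy network selects the next action,
    # the target network evaluates it
    next_action_batch = self.model_policy(next_state_batch).max(1)[1].unsqueeze(1)
    next_state_values = self.model_target(next_state_batch).gather(1, next_action_batch)
    expected_state_action_values = reward_batch.unsqueeze(1) + GAMMA * next_state_values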

4. Dueling DQN (D3QN)

The basic idea is that Q(s, a) depends on both the state and the action, but to different degrees. We would like Q(s, a) to reflect two kinds of differences:

1. For a given state s, it should clearly distinguish the effect of different actions.
2. Across different states, it should clearly distinguish the value of the states themselves.

So the Q-value is split into a state value V and an advantage A, related by Q = V + A.

V evaluates the current state and is a single scalar; A evaluates the actions and has the same dimensionality as Q.

In other words, the DNN network is changed to:

    class DNN(nn.Module):  # Dueling DQN (D3QN) network
        def __init__(self, n_state, n_action):
            super(DNN, self).__init__()
            self.input_layer = nn.Linear(n_state, 64)
            self.input_layer.weight.data.normal_(0, 0.1)
            self.middle_layer = nn.Linear(64, 32)
            self.middle_layer.weight.data.normal_(0, 0.1)
            self.middle_layer_2 = nn.Linear(32, 32)
            self.middle_layer_2.weight.data.normal_(0, 0.1)
            self.adv_layer = nn.Linear(32, n_action)      # advantage head A(s, a)
            self.adv_layer.weight.data.normal_(0, 0.1)
            self.value_layer = nn.Linear(32, 1)           # state-value head V(s)
            self.value_layer.weight.data.normal_(0, 0.1)

        def forward(self, state):
            x = F.relu(self.input_layer(state))
            x = F.relu(self.middle_layer(x))
            x = F.relu(self.middle_layer_2(x))
            value = self.value_layer(x)
            adv = self.adv_layer(x)
            # Q = V + A - mean(A); the mean is taken over the action dimension
            # so that batched inputs are handled correctly
            out = value + adv - adv.mean(dim=-1, keepdim=True)
            return out

Q = V + A

Most implementations also subtract the mean of A, which centers the advantages:

Q = V + A - A.mean()

Without this, the decomposition is not unique: if the Q-values are [11, 12, 13, 14, 15], then V = 10 with A = [1, 2, 3, 4, 5], V = 9 with A = [2, 3, 4, 5, 6], and infinitely many other combinations all produce the same Q.

After subtracting the mean, V and A are unique: V = 13 and A = [-2, -1, 0, 1, 2].
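Because the advantages are forced to have zero mean, V is simply the mean of the Q-values. A quick numerical check of the example above (a sketch, not part of the repository code):

    import torch

    q = torch.tensor([11., 12., 13., 14., 15.])   # the example Q-values above
    v = q.mean()                                   # tensor(13.)  -> the unique state value V
    adv = q - v                                    # tensor([-2., -1., 0., 1., 2.]) -> zero-mean advantages A
    print(torch.allclose(q, v + adv))              # True: Q is recovered from V + A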

5. Summary

Tune the hyperparameters:

    MEMORY_CAPACITY = 1000
    C = 50               # target-network update interval (in steps)
    BATCH_SIZE = 32
    LR = 0.01
    GAMMA = 0.90
    EPISILO = 0.9        # epsilon for the epsilon-greedy policy
    TEST_EPISODE = 30

Running DQN, Double DQN, and Dueling DQN in turn gives a direct picture of their relative performance; a sketch of how each trained agent could be evaluated follows.
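For example, each trained agent can be run for TEST_EPISODE episodes and its average episode reward compared. A minimal sketch reusing the environment and agent interface from above (the evaluate helper itself is hypothetical, not from the repository):

    def evaluate(agent, env, episodes=TEST_EPISODE):
        """Hypothetical helper: run a trained agent and return its average episode reward."""
        returns = []
        for _ in range(episodes):
            state, info = env.reset(return_info=True)
            ep_r, done = 0.0, False
            while not done:
                action = agent.choose_action(state)           # epsilon-greedy policy defined above
                state, reward, done, info = env.step(action)  # original (unshaped) reward
                ep_r += reward
            returns.append(ep_r)
        return sum(returns) / len(returns)

    print('average reward over', TEST_EPISODE, 'episodes:', evaluate(RL, env))

The same call can be repeated for the Double DQN and Dueling DQN agents to compare the three averages.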
