In this project, our goal is to have the agent protect villagers and kill zombies. We do this with deep reinforcement learning, which combines deep learning and reinforcement learning. The agent starts by choosing actions at random, then gradually learns from the algorithm to pick the best action for accomplishing our goal.
The following diagram shows the overall process of our algorithm:
To be specific, we use DQN (Deep Q-Network) as our main algorithm. The key idea of this algorithm is to approximate the original q-value function with a neural network:
Q(s, a, w) ≈ q(s, a)
where Q is the neural network and w its weights. We designed two ways of representing the relationship between states, actions and q-values, which makes the computation easier:
1. Q(state) = (q-value for every action)
2. Q(state, action) = q-value
The first form outputs the q-values of all actions for a given state; the second outputs the q-value of one given state-action pair.
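As a concrete illustration, here is a minimal sketch of the first representation (state in, one q-value per action out); the framework (PyTorch), the layer sizes, and the 4x5x5 input shape (the stacked-frame state described later) are assumptions for illustration, not our exact architecture.

```python
import torch
import torch.nn as nn

# Hypothetical Q-network for representation 1: Q(state) -> one q-value per action.
# The input shape (4, 5, 5) and hidden layer sizes are illustrative assumptions.
N_ACTIONS = 5

q_net = nn.Sequential(
    nn.Flatten(),                 # 4 x 5 x 5 -> 100
    nn.Linear(4 * 5 * 5, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, N_ACTIONS),    # one q-value for each of the 5 actions
)

state = torch.zeros(1, 4, 5, 5)   # a dummy stacked-frame state
q_values = q_net(state)           # shape (1, 5): a q-value for every action
q_sa = q_values[0, 2]             # representation 2: read off Q(state, action=2)
```

The second representation is then just a lookup into the first network's output, which is why the single network can serve both forms.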
In the original Q-learning algorithm, we use the Bellman equation to iteratively update the q-function:
q_new(s, a) = E_s' [ reward + γ * max_a' q(s', a') | S_t = s, A_t = a ]
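For comparison, the tabular version of this update fits in a few lines. The sketch below uses the sampled form of the backup with a learning rate α; the discount and learning-rate values are illustrative assumptions.

```python
from collections import defaultdict

GAMMA = 0.99   # discount factor gamma; the value is illustrative
ALPHA = 0.1    # learning rate for the sampled update (an assumption)
N_ACTIONS = 5

# q_table maps a hashable state encoding (e.g. a tuple) to one q-value per action.
q_table = defaultdict(lambda: [0.0] * N_ACTIONS)

def q_update(s, a, reward, s_next):
    """One sampled Bellman backup: q(s,a) <- q(s,a) + alpha * (target - q(s,a))."""
    target = reward + GAMMA * max(q_table[s_next])
    q_table[s][a] += ALPHA * (target - q_table[s][a])
```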
But apparently, if we already knew q(s, a), there would be no need to fit it with a neural network. Also, like any traditional machine-learning model, the neural network needs a loss function to update its weights. How do we accomplish these two things? We use the following loss:
Loss = ( reward + γ * max_a' Q(s', a', w) − Q(s, a, w) )²
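A minimal sketch of this loss, assuming a q_net like the earlier sketch plus a separate frozen copy target_net playing the role of Q' from the next paragraph; the zeroing of the bootstrap term on terminal frames uses the is_terminal flag we record, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99   # discount factor gamma; the value is illustrative

def dqn_loss(q_net, target_net, state, action, reward, new_state, is_terminal):
    """(reward + gamma * max_a' Q'(s', a') - Q(s, a))^2, averaged over the batch.

    action is a LongTensor of shape (B,); reward and is_terminal are float tensors.
    """
    q_sa = q_net(state).gather(1, action.unsqueeze(1)).squeeze(1)     # Q(s, a, w)
    with torch.no_grad():
        max_next_q = target_net(new_state).max(dim=1).values          # max_a' Q'(s', a')
        target = reward + GAMMA * max_next_q * (1.0 - is_terminal)    # no bootstrap on terminal frames
    return F.mse_loss(q_sa, target)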
Also, if we updated w every frame, the target value would become very unstable: while Q gets a new w, the target Q' would also change, so the model would never be able to aim at a fixed target. Therefore we only refresh the weights used by the target Q' every 10,000 frames.
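The periodic refresh could look like the snippet below; the 10,000-frame interval comes from our setup, while the state_dict copy is the usual PyTorch way of cloning weights and is an implementation assumption.

```python
TARGET_SYNC_INTERVAL = 10_000   # copy w into the target network every 10,000 frames

def maybe_sync_target(frame_count, q_net, target_net):
    # Keep Q'(s', a', w) fixed between syncs so the training target stays stable.
    if frame_count % TARGET_SYNC_INTERVAL == 0:
        target_net.load_state_dict(q_net.state_dict())
```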
Our input state is a 5x5 map taken from each frame, containing the positions of the agent, the zombie and the villager, plus some direction information for the agent. Every distinct arrangement of the agent, villager and zombie is a unique position state, so a 5x5 map has 25 x 24 x 23 = 13,800 position states. Once the agent's facing direction (in degrees) is added, the true number of states becomes hard to enumerate, which is why we use a neural network to process them. To be more precise, we actually treat 4 consecutive frames as one complete state, i.e. a complete state is a 4x5x5 array. This is because the difference between consecutive frames is very small, especially for position information: any unit on the map may stay in the same block for 3 to 5 frames.
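A sketch of how the 4x5x5 state could be assembled from the last four 5x5 frames; the per-cell entity encoding is a hypothetical choice made only for this illustration.

```python
import numpy as np
from collections import deque

STACK = 4   # a complete state is the last 4 frames

# Hypothetical cell codes: 0 = empty, 1 = agent, 2 = villager, 3 = zombie.
frame_history = deque(maxlen=STACK)

def push_frame(frame_5x5):
    """Append the newest 5x5 map; once 4 frames exist, return the 4x5x5 state."""
    frame_history.append(np.asarray(frame_5x5, dtype=np.float32))
    if len(frame_history) < STACK:
        return None                      # not enough history yet
    return np.stack(frame_history)       # shape (4, 5, 5)
```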
In a traditional machine-learning pipeline, we shuffle the data to break correlation. That is, for a classification problem it is better to train on a mix of Class A and Class B samples than on all of Class A followed by all of Class B. This raises another problem when preparing data for DQN: there is no way to obtain future states (data) before we start training the neural network. How do we handle that?
We use an experience/memory array to record every state. This array stores the past states, and we randomly draw samples from it to feed the neural network during training.
Each record in the memory array has the form:
[frame, action, reward, is_terminal]
That is, for every frame we record its corresponding action, its reward, and whether it is a terminal frame.
Every time we need to draw data, we randomly choose one frame, which we call the "key frame". The key frame k forms a state (frames k-3 to k) and a new_state (frames k-2 to k+1). The action is simply the action taken at key frame k.
So every piece of data taken out of the memory array has the form:
[state, action, reward, new_state, is_terminal]
We feed the neural network with batches of data in this form.
We also have to consider storage. Our memory array has a size limit: once it is full, new records replace the oldest ones. We do this because training can take a long time and RAM could fill up, and we do not want training to stop because the hardware runs out of memory.
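Putting the memory array together, here is a rough sketch of the storage and key-frame sampling just described; the capacity value, the ring-buffer details, and the fact that episode boundaries are ignored are all simplifying assumptions.

```python
import random
import numpy as np

class ReplayMemory:
    """Stores [frame, action, reward, is_terminal] rows and samples around key frames."""

    def __init__(self, capacity=100_000):   # capacity is an illustrative value
        self.capacity = capacity
        self.rows = []                       # ring buffer of (frame, action, reward, is_terminal)
        self.next_index = 0

    def add(self, frame, action, reward, is_terminal):
        row = (frame, action, reward, is_terminal)
        if len(self.rows) < self.capacity:
            self.rows.append(row)
        else:
            self.rows[self.next_index] = row          # full: overwrite the oldest row
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size):
        """Build (state, action, reward, new_state, is_terminal) tuples from key frames.

        Assumes at least 5 rows are stored and ignores episode boundaries for brevity.
        """
        batch = []
        while len(batch) < batch_size:
            k = random.randint(3, len(self.rows) - 2)         # key frame index
            frames = [self.rows[i][0] for i in range(k - 3, k + 2)]
            _, action, reward, is_terminal = self.rows[k]
            state = np.stack(frames[:4])                      # frames k-3 .. k
            new_state = np.stack(frames[1:])                  # frames k-2 .. k+1
            batch.append((state, action, reward, new_state, is_terminal))
        return batch
```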
We define 6 reward situations in total. For every frame in which the villager is alive, the agent receives +0.02. If the villager dies, the agent receives -50. If the zombie dies, it scores +40. Every time the zombie is hit by the agent, it scores +10. If the agent is hit by the zombie, it scores -5. Finally, if our agent dies, it scores -40.
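The six cases can be collected into a small reward function; the event flags passed in are hypothetical names, while the numbers are the ones listed above.

```python
def compute_reward(villager_alive, villager_died, zombie_died,
                   zombie_hit_by_agent, agent_hit_by_zombie, agent_died):
    """Sum the rewards for the six situations described above (flag names are illustrative)."""
    reward = 0.0
    if villager_alive:
        reward += 0.02    # villager still alive this frame
    if villager_died:
        reward -= 50.0
    if zombie_died:
        reward += 40.0
    if zombie_hit_by_agent:
        reward += 10.0
    if agent_hit_by_zombie:
        reward -= 5.0
    if agent_died:
        reward -= 40.0
    return reward
```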
We only have 5 actions: move forward, move backward, turn left, turn right, and do nothing. We make the agent attack continuously, so it only needs to decide which way to move. The action-selection algorithm we use is the epsilon-greedy policy with epsilon = 0.05. At each step we draw a random number between 0 and 1: if the number is greater than or equal to 0.05, the agent picks the most valuable action from the q-value list; otherwise, it picks a random action to execute.
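A sketch of the epsilon-greedy choice with epsilon = 0.05, assuming a q_net like the earlier sketch that returns one q-value per action:

```python
import random
import torch

EPSILON = 0.05
N_ACTIONS = 5   # forward, backward, turn left, turn right, do nothing

def choose_action(q_net, state):
    """Exploit with probability 1 - epsilon, otherwise pick a random action."""
    if random.random() >= EPSILON:
        with torch.no_grad():
            q_values = q_net(state.unsqueeze(0))      # add a batch dimension
        return int(q_values.argmax(dim=1).item())     # most valuable action
    return random.randrange(N_ACTIONS)                # random exploration
```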
By setting up our initial map as a 5 x 5 arena (25 blocks), we can train the agent to observe its surroundings, act according to the constraints/scores, and learn the reward/consequence of attacking each kind of entity. To ensure accurate observation, we need coordinates that locate each entity (Agent, Villager, Zombie) and distinguish their standing positions (25 blocks). Accuracy is the number of "correct" actions, i.e. actions that generate positive reward, divided by the total number of observations in a mission. Since our project mainly focuses on the reward policy, the performance of this project is measured by how accurately our agent observes its surrounding environment. After training the agent for a certain number of rounds, we can see that it is able to identify and attack the zombie.
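The accuracy metric reduces to a simple ratio; treating "correct" actions as those with positive reward follows the definition above, and the helper below is only an illustrative reading of it.

```python
def mission_accuracy(rewards):
    """Fraction of observations in a mission whose action produced a positive reward."""
    correct = sum(1 for r in rewards if r > 0)
    return correct / len(rewards) if rewards else 0.0
```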
Here are some examples regarding our accuracy measurement.
In the picture above, the agent performed an action we did not want. We count this as a false positive because the agent is able to learn that attacking the villager results in a negative reward. There are also cases where the villager and the zombie stand very close together; that is, the agent may accidentally hit the villager while its intended target was the zombie.
Table. 5 x 5 [one zombie/one villager] map