A.I. Learns to Drive From Scratch in Trackmania - YouTube

Channel: unknown

[0]
Each of these cars is controlled by an Artificial Intelligence (AI) in the racing game Trackmania.
[5]
This AI is not very intelligent yet. But that's normal: it has just started to learn. In fact,
[12]
I want to use a method called Reinforcement Learning to make this AI learn by itself
[17]
how to drive as fast as possible. I also want it to become intelligent enough to
[21]
master various combinations of turns without ever falling off the road. And to ensure this,
[27]
the AI will have to pass a final challenge: to complete this giant track. But first of all,
[34]
how is a simple computer program supposed to learn things? This isn't the first time I've experimented
[42]
with AI in Trackmania. And to achieve this, I'm using a method called Machine Learning.
[49]
First, I'm running a program that controls the car in-game to make it turn and accelerate. So the AI
[56]
can choose between 6 different actions. But how can it decide which action to take? The AI needs
[62]
to get information about the game. It receives that in the form of numbers called inputs.
[68]
Some inputs describe the state of the car, such as its current speed and acceleration.
[74]
Others indicate how the car is positioned on the road section it's currently crossing.
[80]
And the last inputs indicate what's further ahead. This is now what the AI sees when playing. But how
[87]
can it interpret that? It needs to use this data in an intelligent way. To link inputs
[94]
to the desired action, the AI is going to use a neural network, which basically acts like a brain.
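For illustration, here is a minimal sketch in PyTorch of what such a network could look like. The layer sizes and the number of inputs are made-up values for the example, not the ones used in the video:

```python
import torch
import torch.nn as nn

N_INPUTS = 20   # hypothetical number of input values from the game
N_ACTIONS = 6   # the six actions mentioned above

# Small fully connected network: maps the game inputs (speed,
# acceleration, position on the road, what lies ahead) to one
# score per possible action.
policy_net = nn.Sequential(
    nn.Linear(N_INPUTS, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, N_ACTIONS),  # one output value per action
)

inputs = torch.randn(N_INPUTS)        # stand-in for real game inputs
action = policy_net(inputs).argmax()  # pick the highest-scoring action
```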
[101]
Now, all that remains is to parameterize the neural network so that it results in fast
[106]
driving. And that's where Machine Learning comes into play. As I said earlier, the objective here
[112]
is that the AI learns to drive by itself. So it will have to experiment with different strategies,
[119]
through trial and error, to progressively select the neural network that leads to the best driving.
[125]
One way to do this would be to use a genetic algorithm.
[129]
I've already tried that in Trackmania and it works fairly well. Basically, the idea is to start with
[135]
a population of several AIs, each with its own neural network. All AIs compete on the same map,
[142]
and the best ones are selected and recombined through a process similar to natural selection.
[148]
This can be repeated for many generations, to get a better and better neural network.
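As a rough sketch of that idea (not the actual code behind those earlier experiments), a generational loop could look like this, where each individual is a flat list of neural-network weights and `evaluate` is a placeholder for racing one AI on the map and returning its score:

```python
import random

def crossover(a, b):
    # Each weight comes from one of the two parents, chosen at random.
    return [x if random.random() < 0.5 else y for x, y in zip(a, b)]

def mutate(w, rate=0.1, scale=0.1):
    # Randomly nudge a fraction of the weights.
    return [x + random.gauss(0, scale) if random.random() < rate else x
            for x in w]

def evolve(population, evaluate, n_generations=100):
    """Generational loop: score every AI on the same map, keep the
    best half, and refill the population with mutated offspring."""
    for _ in range(n_generations):
        population.sort(key=evaluate, reverse=True)
        parents = population[: len(population) // 2]
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(len(population) - len(parents))]
        population = parents + children
    return max(population, key=evaluate)
```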
[154]
One problem with this method is that you only compare the different AIs based on their end
[159]
result. To make an AI progress, it might be better to give it feedback on what it did
[165]
well or not so well during the race. So it's time to try something else: Reinforcement Learning.
[172]
And this goes with a crucial idea: the concept of reward.
[179]
This time, the AI has only one goal in mind: to get as many rewards as possible.
[185]
The idea of reinforcement learning is to learn to pick the action that brings the most reward,
[190]
in any situation. In fact, this is quite like a pet being trained, which will interpret pleasure
[196]
or food intake as positive reinforcement. But in Trackmania, there is no food. So how can we define
[202]
rewards? The AI can take 10 actions per second. Each action will be associated with a reward equal
[210]
to the distance traveled up to the next action. So the faster the AI goes, the more rewards it gets.
[217]
If the AI ever tries to go the wrong way, it will receive a punishment,
[221]
which is actually just a negative reward. And if the AI falls off the road, it will be directly
[227]
punished with a zero reward, but also indirectly by the race stopping, which means no more rewards.
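Put as code, the reward for one step could look like this sketch. The magnitude of the wrong-way punishment is my assumption; the video only gives the signs:

```python
def step_reward(distance_gained, wrong_way, fell_off):
    """Reward for one action step (the AI acts 10 times per second)."""
    if fell_off:
        return 0.0  # zero reward, and the run stops: no future rewards
    if wrong_way:
        return -abs(distance_gained)  # negative reward as punishment
                                      # (illustrative value)
    return distance_gained  # distance covered until the next action
```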
[235]
Now, it's time to start training. To learn which inputs and actions lead to which reward,
[240]
the AI must first gather information about the game. This is the exploration phase. The
[246]
AI simply takes random actions and doesn't use its neural network for the moment. The runs are driven
[253]
one by one. And after a thousand of them, here is what the AI has explored of the map so far.
[260]
Each line corresponds to one race trajectory. The AI has already collected plenty of data about the
[267]
rewards it can expect to get for various sets of inputs and actions. Now, it's time to use this
[273]
data to train its neural network. This is the role of the reinforcement learning algorithm. There
[279]
are many different variants of this method, and here I chose to use one called Deep Q-Learning.
[287]
Basically, for a given set of inputs, the role of the neural network is to predict the expected
[292]
reward for each possible action. But which reward are we talking about? Is it an immediate one? In
[299]
Trackmania, although some actions may result in an immediate positive reward, they may have
[305]
negative consequences in the long run. Sometimes, it may be useful to sacrifice short-term income,
[312]
for example by slowing down when approaching a turn, in order to gain more long-term reward.
[318]
The AI therefore needs to consider the long-term consequences of each action. To achieve this,
[324]
the AI tries to imagine the cumulative reward that it's most likely to obtain in the future.
[331]
Although the long term is important, an action still has more impact in the short term. Thus,
[336]
the events in the immediate future are weighted more. So each time the AI gets inputs, its neural network
[342]
tries to predict the expected cumulative reward for each possible action, and the AI just
[348]
selects the one with the highest value.
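Put as code, that weighting and selection step might look like this minimal sketch. The discount factor GAMMA and its value are my assumption; the video only says that nearer rewards weigh more:

```python
import numpy as np

GAMMA = 0.99  # discount factor; an assumed value, not given in the video

def discounted_return(rewards):
    """Cumulative future reward where near-term rewards weigh more:
    rewards[0] counts fully, rewards[1] is scaled by GAMMA, and so on."""
    return sum(r * GAMMA**k for k, r in enumerate(rewards))

def choose_action(q_values):
    """Greedy choice: the network outputs one expected cumulative
    reward (Q-value) per action; take the action with the highest."""
    return int(np.argmax(q_values))
```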
[355]
Let's resume training where we left off. In parallel to driving, the AI is continuously trying to improve its neural network with the data it collects.
[361]
But by only doing random exploration, the AI ends up not having much new to learn. Instead
[367]
of just exploring, it's time for the AI to also start exploiting the knowledge it has acquired,
[373]
meaning using its neural network instead of just acting randomly. The AI is still a bit
[379]
immature, though, to rely only on its neural network. If it does too much exploitation, it
[384]
will just experience the same things over and over again, which will not teach it much. For now, I'm
[391]
setting the proportion of exploration at 90%, and I'll decrease it progressively during training.
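A sketch of this exploration/exploitation trade-off, assuming an epsilon-greedy scheme; the decay schedule below is illustrative, since the video only gives the starting proportion and says it decreases progressively (down to 5% later on):

```python
import random

epsilon = 0.9  # start with 90% exploration, as stated above

def pick_action(q_values):
    """Epsilon-greedy: random action with probability epsilon
    (exploration), otherwise trust the network (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decay_epsilon(eps, rate=0.9999, floor=0.05):
    # Illustrative schedule; the floor matches the 5% mentioned later.
    return max(eps * rate, floor)
```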
[426]
After more than 20 000 attempts on this map, here is the best run the AI has done so far. The AI
[432]
drives quite carefully, and it's not too bad for a start! It has definitely learned something.
[440]
Going further into the map, things seem a bit more complicated,
[443]
and the AI ends up falling. Time to get back to training!
[450]
At this point, you might think that the AI hasn't learned much,
[454]
after training on the same map for so many hours. But I think it's quite
[458]
normal. Reinforcement learning is known to require a large number of iterations to work.
[464]
The time displayed here is in-game time. Fortunately, training is faster in practice,
[470]
since I can increase the game speed using a tool called TMInterface. This project would probably
[476]
not have been possible without this tool, so a big thanks to Donadigo, its developer.
[483]
The AI has made some nice progress. The driving style it learned in the first turns
[488]
seems to apply well to the following ones, which shows a good capacity
[492]
for generalization. The AI has now reached 5% exploration, which I will not decrease further.
[515]
It seems that the AI is stuck and can no longer progress. Here is its current personal best.
[524]
In the first part of the map, the AI shows very little hesitation.
[528]
This first portion has a lot of turns and short straights. But then the AI arrives
[533]
in a new section with mainly long straight lines. Its driving becomes a little sketchy.
[541]
At one point, it even stops, as if it's afraid to continue. After a long minute, it finally decides
[550]
to continue, and dies. The AI seems to have difficulty adapting to this new type of road.
[557]
Or maybe it just needs more time. To be sure, I decided to push the training a little longer.
[574]
After 10 000 more attempts, the AI hasn't made much progress. It still has a lot of trouble with
[580]
long straight lines. There may be several reasons for this, but I think the main one is overfitting,
[586]
which is common in machine learning. In the exploration phase, the AI practiced the same first
[591]
few turns over and over again. Its neural network became a specialist in these trajectories,
[598]
learning them almost by heart, as if nothing else existed. But when the AI faces a new situation,
[604]
the driving style it learned in the past is no longer appropriate: it needs to adapt. In a way,
[610]
adapting means questioning everything it has learned in the past. If the AI tries to
[615]
drastically change its strategy to adapt to these new roads, it risks breaking everything that was
[621]
working for the first few turns. When there is overfitting, there is no generalization.
[627]
So what's the solution? Maybe the AI could drive each run on a different map, to constantly learn
[634]
new things. But at this point, I really don't want to spend hours building dozens of different maps.
[640]
So, I'm gonna do things differently. I'm going to restart training from the beginning. But now,
[647]
each time the AI starts a new run, it will spawn at a random location on
[652]
the map, with a random speed and a random orientation. This should limit overfitting,
[657]
since the AI will be forced to consider many different situations from the beginning.
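A sketch of what such a randomized reset could look like; the interface to the game and the value ranges are assumptions, since the video doesn't detail how the spawning is implemented:

```python
import random

def random_start(track_sections):
    """Pick a random starting state for the next run so the AI sees
    many different situations from the start of training.
    All ranges below are illustrative guesses."""
    section = random.choice(track_sections)  # random map location
    speed = random.uniform(0.0, 400.0)       # random speed (km/h)
    heading = random.uniform(-30.0, 30.0)    # random orientation (degrees)
    return section, speed, heading
```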
[678]
This time, the AI is learning way faster. However, perhaps the AI managed to cover long distances
[685]
just because it spawned in easy sections of the map. The real challenge is still to complete
[690]
the track from start to finish. From now on, I will regularly test the AI outside of training,
[697]
on a normal race. Outside of training, I remove any exploration to optimize
[702]
the AI's performance. I also increase the action frequency from 10 to 30 per second.
[720]
The AI is able to drive in all sections of the map, so there is clearly less
[724]
overfitting this time! Now, the AI only has to combine everything in one run.
[735]
In this attempt, the AI manages to surpass its previous record, going further than ever. But
[741]
it fails within 500 meters of the finish. It has never been so close to finishing this map.
[748]
And finally, a few attempts later, and after 53 hours of training, the AI gets this run.
[772]
The AI was able to complete 230 turns without ever falling. Sounds good, but
[779]
is the AI fast? Now, it's my turn to drive, to compare.
[788]
After a few attempts, I made a run of 4 minutes and 44 seconds.
[792]
Without using the brake, of course, for a fair comparison. So yeah, the AI is not very fast. But
[800]
training is not over! Now, the AI has one goal: to finish this map as fast as possible.
[823]
6 minutes and 28 seconds. After this run, I continued training, and the AI kept getting
[830]
slightly faster on average, more consistent too, but it never managed to beat its personal best.
[837]
With this version of its neural network, the AI drives quite aggressively, and takes most
[841]
turns very sharply. It's quite surprising that it survived the whole race with such a driving
[847]
style. But it's the best the AI has found. Perhaps there is still a way to improve the AI's record
[854]
one last time, still with the same neural network. If I randomly force some of the AI's actions at the
[861]
beginning, the AI will have to adapt to this small perturbation. And this is the start
[867]
of a completely different run. Now, I can repeat this a few hundred times to see what happens.
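A sketch of this perturbation trick, with a hypothetical `env`/`policy` interface standing in for the game and the trained network; the number of forced steps is an assumption:

```python
import random

def perturbed_run(env, policy, n_forced=10, n_actions=6):
    """Force a few random actions at the start of the run, then hand
    control back to the trained network. Each perturbation seeds a
    different trajectory; repeating this hundreds of times may stumble
    on a run faster than the current record."""
    obs = env.reset()
    done, total_reward, step = False, 0.0, 0
    while not done:
        if step < n_forced:
            action = random.randrange(n_actions)  # forced perturbation
        else:
            action = policy(obs)                  # normal greedy choice
        obs, reward, done = env.step(action)
        total_reward += reward
        step += 1
    return total_reward
```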
[898]
And here is the final improvement of the AI's record. Not a big improvement,
[903]
but it was visually worth it! There is still a big gap with human performance, but I'm still
[909]
very happy with the result. Trackmania is a game that requires a lot of practice, even for humans,
[915]
and from my experience, I'm pretty sure this AI could beat a good number of beginners.
[921]
If there's anything this AI is doing well, it's generalization. It can adapt to any new map with a
[927]
similar road structure. I even tried to change the road surface to see if it could drive on grass,
[934]
and the AI is doing quite well! Same thing on dirt, even though the AI has never experienced these
[941]
surfaces during training. But can it still survive on a new map, with a mix of road,
[947]
dirt, and grass surfaces, and a few slopes and obstacles?
[971]
So yeah, of course there is room to improve this AI. But with reinforcement learning,
[976]
it seems that the main limitation is always the same: training time, even with a tool to increase
[983]
game speed. That's why I never venture into more complex maps, and that's why I try to
[989]
limit any complexity in general: few inputs, no brakes, not too many actions per second,
[995]
and so on. Anyway, for now, the AI deserves a rest after those long hours of training.
[1002]
And maybe it will be back one day, with new surprises!