A.I. Learns to Drive From Scratch in Trackmania
Channel: unknown
[0]
Each of these cars is controlled by an Artificial Intelligence (AI) in the racing game Trackmania.
[5]
This AI is not very intelligent yet. But that's normal: it has just started to learn. In fact,
[12]
I want to use a method called Reinforcement Learning to make this AI learn by itself
[17]
how to drive as fast as possible. I also want it to become intelligent enough to
[21]
master various combinations of turns without ever falling off the road. And to ensure this,
[27]
the AI will have to pass a final challenge: to complete this giant track. But first of all,
[34]
how is a simple computer program supposed to learn things? It's not the first time I've experimented
[42]
with AI in Trackmania. And to achieve this, I'm using a method called Machine Learning.
[49]
First, I'm running a program that controls the car in-game to make it turn and accelerate. So the AI
[56]
can choose between 6 different actions. But how can it decide which action to take? The AI needs
[62]
to get information about the game. It receives that in the form of numbers called inputs.
[68]
Some inputs describe the state of the car, such as its current speed and acceleration.
[74]
Others indicate how the car is positioned on the road section it's currently crossing.
[80]
And the last inputs indicate what's further ahead. This is now what the AI sees when playing. But how
[87]
can it interpret that? It needs to use this data in an intelligent way. To link inputs
[94]
to the desired action, the AI is going to use a neural network, which basically acts like a brain.
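To make this concrete, here is a minimal sketch of what such a network could look like, written in Python with PyTorch. The input count and layer sizes are my own assumptions; the only detail taken from the video is the output of 6 values, one per action.

```python
import torch
import torch.nn as nn

class DrivingNet(nn.Module):
    """Maps the game inputs to one score per possible action."""
    def __init__(self, n_inputs: int = 20, n_actions: int = 6):
        # n_inputs = 20 is a placeholder; the video doesn't give the exact count.
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_inputs, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),  # one output per action (6 in the video)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)

# The car picks the action whose predicted score is highest:
net = DrivingNet()
inputs = torch.randn(1, 20)               # speed, acceleration, road geometry, ...
action = int(net(inputs).argmax(dim=1))   # index of the chosen action
```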
[101]
Now, all that remains is to parameterize the neural network so that it results in fast
[106]
driving. And that's where Machine Learning comes into play. As I said earlier, the objective here
[112]
is that the AI learns to drive by itself. So it will have to experiment with different strategies,
[119]
through trial and error, to progressively select the neural network that leads to the best driving.
[125]
One way to do this would be to use a genetic algorithm.
[129]
I've already tried that in Trackmania and it works fairly well. Basically, the idea is to start with
[135]
a population of several AIs, each with its own neural network. All AIs compete on the same map,
[142]
and the best ones are selected and recombined through a process similar to natural selection.
[148]
This can be repeated for many generations, to get a better and better neural network.
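As a rough sketch of that generational loop, assuming each network is flattened into a single weight vector and that fitness is simply how far the AI got on the map (both details are my assumptions):

```python
import numpy as np

def evolve(population, fitness, n_keep=10, mutation_std=0.05):
    """One generation: keep the best networks, then recombine and mutate them.

    population: list of flat weight vectors, one per AI
    fitness:    one score per AI, e.g. the distance reached on the map
    """
    order = np.argsort(fitness)[::-1]               # best performers first
    parents = [population[i] for i in order[:n_keep]]
    children = []
    while len(children) < len(population):
        i, j = np.random.choice(n_keep, 2, replace=False)
        a, b = parents[i], parents[j]
        mask = np.random.rand(a.size) < 0.5         # uniform crossover
        child = np.where(mask, a, b)
        child += np.random.normal(0.0, mutation_std, child.shape)  # mutation
        children.append(child)
    return children
```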
[154]
One problem with this method is that you only compare the different AIs based on their end
[159]
result. To make an AI progress, it might be better to give it feedback on what it did
[165]
well or not so well during the race. So it's time to try something else: Reinforcement Learning.
[172]
And this goes with a crucial idea: the concept of reward.
[179]
This time, the AI has only one goal in mind: to get as many rewards as possible.
[185]
The idea of reinforcement learning is to learn to pick the action that brings the most reward,
[190]
in any situation. In fact, this is quite like a pet being trained, which will interpret pleasure
[196]
or food intake as positive reinforcement. But in Trackmania, there is no food. So how can we define
[202]
rewards? The AI can take 10 actions per second. Each action will be associated with a reward equal
[210]
to the distance traveled up to the next action. So the faster the AI goes, the more rewards it gets.
[217]
If the AI ever tries to go the wrong way, it will receive a punishment,
[221]
which is actually just a negative reward. And if the AI falls off the road, it will be directly
[227]
punished by a zero reward, but also indirectly by the race stopping, which means no more rewards.
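Put into code, that reward scheme might look like the sketch below. How progress along the track is measured is my assumption; the video only specifies the three cases: distance traveled per step, a punishment for going the wrong way, and a zero reward plus the end of the run for falling off.

```python
def compute_reward(prev_progress: float, progress: float, fell_off: bool):
    """Reward for one step (the AI acts 10 times per second).

    prev_progress / progress: distance along the track, in meters
    (how this is measured is assumed, the video doesn't detail it).
    Returns (reward, done).
    """
    if fell_off:
        return 0.0, True    # zero reward, and the race stops: no more rewards at all
    delta = progress - prev_progress
    return delta, False     # positive when moving forward, negative the wrong way
```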
[235]
Now, it's time to start training. To learn which inputs and actions lead to which reward,
[240]
the AI must first gather information about the game. This is the exploration phase. The
[246]
AI simply takes random actions and doesn't use its neural network for the moment. The runs are driven
[253]
one by one. And after a thousand of them, here is what the AI has explored of the map so far.
[260]
Each line corresponds to one race trajectory. The AI has already collected plenty of data about the
[267]
rewards it can expect to get for various sets of inputs and actions. Now, it's time to use this
[273]
data to train its neural network. This is the role of the reinforcement learning algorithm. There
[279]
are many different variants of this method, and here I chose to use one called Deep Q-Learning.
[287]
Basically, for a given set of inputs, the role of the neural network is to predict the expected
[292]
reward for each possible action. But which reward are we talking about? Is it an immediate one? In
[299]
Trackmania, although some actions may result in an immediate positive reward, they may have
[305]
negative consequences in the long run. Sometimes, it may be useful to sacrifice short-term rewards,
[312]
for example by slowing down when approaching a turn, in order to gain more long-term reward.
[318]
The AI therefore needs to consider the long-term consequences of each action. To achieve this,
[324]
the AI tries to imagine the cumulative reward that it's most likely to obtain in the future.
[331]
Although the long term is important, an action still has more impact in the short term. Thus,
[336]
events in the immediate future are weighted more. So each time the AI gets inputs, its neural
[342]
network tries to predict the expected cumulative reward for each possible action. And the AI just
[348]
selects the one with the highest value. Let's resume training where we left off. In parallel with
[355]
driving, the AI is continuously trying to improve its neural network with the data it collects.
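That is the heart of Deep Q-Learning: the network's prediction for the action taken is nudged toward the reward actually received plus the discounted best prediction for the next state. Here is a minimal sketch of that update in PyTorch. The replay buffer batch, the separate target network, and the discount factor value are standard DQN ingredients that I'm assuming; the video doesn't spell them out.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount: rewards in the immediate future weigh more (value assumed)

def dqn_update(net, target_net, optimizer, batch):
    """One Deep Q-Learning step on a batch of (state, action, reward, next_state, done)."""
    states, actions, rewards, next_states, dones = batch
    # Predicted cumulative reward for the actions that were actually taken
    q = net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bellman target: reward now + discounted best prediction for what follows
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + GAMMA * next_q * (1.0 - dones)  # no future reward after a fall
    loss = F.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```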
[361]
But by only doing random exploration, the AI ends up not having much new to learn. Instead
[367]
of just exploring, it's time for the AI to also start exploiting the knowledge it has acquired,
[373]
meaning using its neural network instead of just acting randomly. The AI is still a bit
[379]
too immature, though, to rely only on its neural network. If it does too much exploitation, it
[384]
will just experience the same things over and over again, which will not teach it much. For now, I'm
[391]
setting the proportion of exploration at 90%, and I'll decrease it progressively during training.
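This is the classic epsilon-greedy policy. A sketch, using the 90% starting proportion and the 5% floor mentioned in the video; the decay rate is my own choice:

```python
import random
import torch

epsilon = 0.90        # start with 90% random exploration (from the video)
EPSILON_MIN = 0.05    # the video later settles at 5% exploration
DECAY = 0.9999        # per-decision decay rate: an assumption, not given in the video

def select_action(net, inputs: torch.Tensor, n_actions: int = 6) -> int:
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    global epsilon
    epsilon = max(EPSILON_MIN, epsilon * DECAY)
    if random.random() < epsilon:
        return random.randrange(n_actions)             # explore: random action
    with torch.no_grad():
        return int(net(inputs).argmax(dim=1).item())   # exploit: best predicted action
```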
[426]
After more than 20,000 attempts on this map, here is the best run the AI has done so far. The AI
[432]
drives quite carefully, and it's not too bad for a start! It has definitely learned something.
[440]
Going further into the map, it seems a bit more complicated,
[443]
and the AI ends up falling. Time to get back to training!
[450]
At this point, you might think that the AI hasn't learned much,
[454]
after training on the same map for so many hours. But I think it's quite
[458]
normal. Reinforcement learning is known to require a large number of iterations to work.
[464]
The time displayed here is in-game time. Fortunately, training is faster in practice,
[470]
since I can increase the game speed using a tool called TMInterface. This project would probably
[476]
not have been possible without this tool, so a big thanks to Donadigo, its developer.
[483]
The AI has made some nice progress. The driving style it learned in the first turns
[488]
seems to apply well to the following ones, which shows a good capacity
[492]
for generalization. The AI has now reached 5% exploration, which I will not decrease further.
[515]
It seems that the AI is stuck and can no longer progress. Here is its current personal best.
[524]
In the first part of the map, the AI shows very little hesitation.
[528]
This first portion has a lot of turns and short straights. But then the AI arrives
[533]
in a new section with mainly long straight lines. Its driving becomes a little sketchy.
[541]
At one point, it even stops, as if it's afraid to continue. After a long minute, it finally decides
[550]
to continue, and dies. The AI seems to have difficulty adapting to this new type of road.
[557]
Or maybe it just needs more time. To be sure, I decided to push the training a little longer.
[574]
After 10,000 more attempts, the AI hasn't made much progress. It still has a lot of trouble with
[580]
long straight lines. There may be several reasons for this, but I think the main one is overfitting,
[586]
which is common in machine learning. In the exploration phase, the AI practiced the same first
[591]
few turns over and over again. Its neural network became a specialist in these kinds of trajectories,
[598]
learning them almost by heart, as if nothing else existed. But when the AI faces a new situation,
[604]
the driving style it learned in the past is no longer appropriate: it needs to adapt. In a way,
[610]
adapting means questioning everything it has learned in the past. If the AI tries to
[615]
drastically change its strategy to adapt to these new roads, it risks breaking everything that was
[621]
working for the first few turns. When there is overfitting, there is no generalization.
[627]
So what's the solution? Maybe the AI could drive each run on a different map, to constantly learn
[634]
new things. But at this point, I really don't want to spend hours building dozens of different maps.
[640]
So, I'm going to do things differently. I'm going to restart training from the beginning. But now,
[647]
each time the AI starts a new run, it will spawn at a random location on
[652]
the map, with a random speed and a random orientation. This should limit overfitting,
[657]
since the AI will be forced to consider many different situations from the beginning.
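A sketch of what that randomized reset could look like. The helper functions standing in for the game interface are hypothetical, as are the value ranges; the video only states that the spawn location, speed and orientation are randomized.

```python
import random

# Hypothetical stand-ins for the real game interface (not described in the video)
def sample_point_on_track(progress_m: float):
    """Return (x, y, z, heading) of the track centerline at the given distance."""
    return 0.0, 0.0, progress_m, 0.0  # placeholder geometry

def spawn_car(x, y, z, yaw, speed):
    print(f"spawn at ({x:.1f}, {y:.1f}, {z:.1f}), yaw={yaw:.2f}, speed={speed:.0f}")

def reset_episode(track_length_m: float = 1000.0):
    """Start each run from a random state so the AI sees many situations early on."""
    progress = random.uniform(0.0, track_length_m)    # random location on the map
    x, y, z, heading = sample_point_on_track(progress)
    speed = random.uniform(0.0, 200.0)                # random speed (range assumed)
    yaw = heading + random.uniform(-0.3, 0.3)         # random orientation (range assumed)
    spawn_car(x, y, z, yaw, speed)
```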
[678]
This time, the AI is learning way faster. However, perhaps the AI managed to cover long distances
[685]
just because it spawned in easy sections of the map. The real challenge is still to complete
[690]
the track from start to finish. From now on, I will regularly test the AI outside of training,
[697]
on a normal race. Outside of training, I remove any exploration to optimize
[702]
the AI's performance. I also increase the action frequency from 10 to 30 per second.
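So evaluation runs are purely greedy. A sketch of such a test, assuming a hypothetical gym-style `env` wrapper around the game that is stepped 30 times per second:

```python
import torch

def evaluate(net, env):
    """Test run outside of training: no exploration at all, 30 actions per second.

    `env` is a hypothetical wrapper exposing reset() and step(action); the real
    interface to the game is not described in the video.
    """
    obs = env.reset()
    done, distance = False, 0.0
    while not done:
        with torch.no_grad():
            action = int(net(obs).argmax(dim=1).item())  # greedy: epsilon = 0
        obs, reward, done = env.step(action)             # stepped at 30 Hz
        distance += reward  # rewards are distances, so this totals meters covered
    return distance
```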
[720]
The AI is able to drive in all sections of the map, so there is clearly less
[724]
overfitting this time! Now, the AI only has to combine everything in one run.
[735]
In this attempt, the AI manages to surpass its previous record, going further than ever. But
[741]
it fails within 500 meters of the finish. It has never been so close to finishing this map.
[748]
And finally, a few attempts later, and after 53 hours of training, the AI gets this run.
[772]
The AI was able to complete 230 turns without ever falling. Sounds good, but
[779]
is the AI fast? Now, it's my turn to drive, to compare.
[788]
After a few attempts, I made a run of 4 minutes and 44 seconds.
[792]
Without using the brake, of course, for a fair comparison. So yeah, the AI is not very fast. But
[800]
training is not over! Now, the AI has one goal: to finish this map as fast as possible.
[823]
6 minutes and 28 seconds. After this run, I continued training, and the AI kept getting
[830]
slightly faster on average, and more consistent too, but it never managed to beat its personal best.
[837]
With this version of its neural network, the AI drives quite aggressively, and takes most
[841]
turns very sharply. It's quite surprising to see it survive the whole race with such a driving
[847]
style. But it's the best the AI has found. Perhaps there is still a way to improve the AI's record
[854]
one last time, still with the same neural network. If I randomly force some of the AI's actions at the
[861]
beginning, the AI will have to adapt to this small perturbation. And this is the start
[867]
of a completely different run. Now, I can repeat this a few hundred times to see what happens.
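A sketch of that record-hunting trick, reusing the same hypothetical `env` wrapper: force random actions during a short initial window, then hand control back to the trained network. The length of the window is my assumption.

```python
import random
import torch

def perturbed_run(net, env, forced_steps: int = 30, n_actions: int = 6):
    """Force random actions at the start, then let the neural network drive.

    Each perturbation seeds a completely different run; repeating this a few
    hundred times searches for a lucky new record.
    """
    obs = env.reset()
    done, total, step = False, 0.0, 0
    while not done:
        if step < forced_steps:
            action = random.randrange(n_actions)   # the small initial perturbation
        else:
            with torch.no_grad():
                action = int(net(obs).argmax(dim=1).item())
        obs, reward, done = env.step(action)
        total += reward
        step += 1
    return total

# best = max(perturbed_run(net, env) for _ in range(300))
```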
[898]
And here is the final improvement of the AI's record. Not a big improvement,
[903]
but it was visually worth it! There is still a big gap with human performance, but I'm still
[909]
very happy with the result. Trackmania is a game that requires a lot of practice, even for humans,
[915]
and from my experience, I'm pretty sure this AI could beat a good number of beginners.
[921]
If there's anything this AI does well, it's generalization. It can adapt to any new map with a
[927]
similar road structure. I even tried changing the road surface to see if it could drive on grass,
[934]
and the AI does quite well! Same thing on dirt, even though the AI has never experienced these
[941]
surfaces during training. But can it still survive on a new map, with a mix of road,
[947]
dirt and grass surfaces, and a few slopes and obstacles?
[971]
So yeah, of course there is room to improve this AI. But with reinforcement learning,
[976]
it seems that the main limitation is always the same: training time, even with a tool to increase
[983]
game speed. That's why I never venture into more complex maps, and that's why I try to
[989]
limit complexity in general: few inputs, no brakes, not too many actions per second,
[995]
and so on. Anyway, for now, the AI has deserved a rest after those long hours of training.
[1002]
And maybe it will be back one day, with new surprises!