What is YOLO algorithm? | Deep Learning Tutorial 31 (Tensorflow, Keras & Python) - YouTube

Channel: codebasics

[0]
YOLO is a state-of-the-art object
[2]
detection algorithm
[4]
and it is so fast that it has become
[6]
almost a standard way
[8]
of detecting objects in the field of
[10]
computer vision.
[12]
Previously people were using sliding
[14]
window object detection
[16]
then faster versions were invented
[20]
such as
[21]
RCNN, fast RCNN and faster
[24]
RCNN but in 2015
[28]
YOLO was invented which
[31]
outperformed all the previous object
[34]
detection algorithms
[35]
and that's what we are going to discuss
[37]
today. We will go over the theory
[39]
on how exactly YOLO works and in the
[42]
future video we will also do coding so
[45]
this video is just
[46]
about the theory behind how YOLO works
[48]
and we'll
[49]
try to see why it is faster. Full form of
[52]
YOLO is
[53]
You Only Look Once. Let's say you're
[55]
working on
[57]
an image classification problem where
[59]
you want to decide if the
[60]
image is of a dog or a person. In this
[63]
case the
[64]
output of neural network is pretty
[66]
simple. You will say dog is equal to one
[68]
person is equal to zero but when you
[70]
talk about object localization
[73]
you're not only telling which class this
[75]
is you're also telling the bounding box
[78]
or the position of an object within the
[81]
image.
[82]
So here in addition to dog is equal to 1
[85]
and person is equal to 0
[86]
you are also telling about the bounding
[89]
box. Now how exactly do you do that?
[90]
So in terms of the neural network output
[93]
you can have a vector like this
[95]
where pc is the probability of a class.
[99]
So here if there is a dog or a person
[102]
then this number will be one. If there is
[105]
no dog or no person
[106]
this number will be zero then the
[109]
bounding box
[110]
so bx and by are the coordinates
[114]
of the center which is indicated in
[116]
yellow
[117]
circle here and bw and bh are the
[120]
width and height of this red box.
[124]
C1 is class one that is for dog.
[127]
So here it will be one c2 is for
[130]
person and it will be zero if you have a
[133]
different image like this.
[134]
There is a person here. This is my
[136]
picture in my
[137]
high school. The pc probability of any
[141]
class is 1
[142]
because there is some object and these
[145]
are like bounding box
[146]
coordinates and c1 is 0 because it's
[149]
not a dog and
[150]
c2 is 1 because it's a person and when
[153]
you have
[154]
no object in the image the pc will be
[157]
zero
[158]
and rest of the values don't matter. So
[160]
now you can train
[161]
a neural network to
[164]
classify the object as well as predict its
[168]
bounding box. So you can have I am just
[171]
showing
[172]
three images here but you can have, let's say,
[173]
ten thousand such images
[176]
and for each of these images since it's
[178]
a supervised learning problem
[180]
you need to give the bounding boxes. And
[183]
the way you give bounding boxes to
[185]
neural network,
[186]
understand, a neural network only
[187]
understands numbers so you
[189]
have to convert this into
[192]
this kind of vectors. So you will have a
[195]
vector of size
[196]
7 for each corresponding image. So the
[199]
image is X_train
[200]
and y_train will be a vector of size 7.
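To make that size-7 target concrete, here is a minimal Python sketch of how one label could be encoded. The helper name `encode_label` and the box numbers are made up for illustration; only the layout (pc, bx, by, bw, bh, c1, c2) comes from the video.

```python
# Hypothetical helper: builds the 7-value target [pc, bx, by, bw, bh, c1, c2].
CLASSES = ["dog", "person"]  # c1 = dog, c2 = person, as in the video


def encode_label(box=None, class_name=None):
    """Return the target vector for one image.

    box is (bx, by, bw, bh): center x, center y, width, height.
    Call with no arguments for an image that contains no object.
    """
    if box is None:
        # pc = 0; the other values are "don't care", zeros are just filler.
        return [0.0] * (5 + len(CLASSES))
    one_hot = [1.0 if name == class_name else 0.0 for name in CLASSES]
    return [1.0, *box, *one_hot]


# Made-up box values, only to show the layout.
dog_target = encode_label(box=(50, 70, 60, 70), class_name="dog")
empty_target = encode_label()
print(dog_target)
print(empty_target)
```

Each image in X_train would be paired with one such vector in y_train.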
[204]
You can have 10,000 such images, and you can
[207]
train a neural network in a way
[210]
that
[211]
if you input a new image now it will
[214]
tell you
[215]
that particular vector and now this
[218]
vector is telling you
[219]
that this is a dog because c1 is set to
[222]
1 and it is also telling you the
[224]
bounding box so basically it's
[226]
essentially giving you the
[229]
answer for your object detection or
[231]
object localization rather.
[234]
This only works for a single object. If
[237]
you have multiple objects what do you
[239]
do?
[239]
Here there is person and a dog in the
[241]
same image.
[243]
One might say that okay
[246]
you know in my image there could be
[249]
n number of objects. There could be two
[251]
dogs three people there could be five
[254]
dogs one person you don't know
[257]
how many objects are there in the
[258]
picture. So it's
[260]
hard to determine the dimension of your
[263]
neural network
[264]
output. If you have one object
[268]
it's pretty fixed, right? But if you
[271]
have
[272]
n number of objects and you don't know
[274]
then
[275]
determining the size of the output of
[278]
neural network is hard.
[279]
You can say upper max is 10 let's say
[282]
there will be only 10 objects
[284]
and you can have 10 times 7 which is like
[286]
a
[287]
70 size vector but what if there are 11
[290]
objects?
[291]
See so that doesn't work so you have to
[293]
do something
[294]
else. All right! So let's say you have
[297]
this image
[298]
and there are two bounding boxes that
[301]
this image has.
[303]
What the YOLO algorithm will do is it will
[306]
divide this
[307]
image into this kind of grid cells. So
[310]
I'm using
[311]
four by four grid here. It could be three
[314]
by three it could be 19 by 19.
[316]
There's no fixed rule that it has to be
[318]
four by four.
[320]
And for each of the grid cells for
[321]
example this grid cell
[323]
you can encode or you can come up with
[326]
that vector that we saw previously
[328]
which is pc bounding box c1 and c2
[332]
there are no objects here so probability
[334]
of class will be zero
[335]
and then rest of the values don't matter.
[338]
But for
[338]
this particular grid cell,
[342]
So I have highlighted here the dog is
[345]
there in the picture
[346]
see, when the dog spans multiple
[348]
grid cells you try to
[350]
find the central place of that dog and
[354]
the dog belongs to that particular grid
[356]
cell.
[357]
So I'm in this particular cell here and
[360]
when I
[361]
look at the coordinates you can think
[363]
about this point as the (0, 0)
[364]
coordinate and this point as the (1, 1)
[367]
coordinate.
[368]
And now you can create this vector where
[370]
pc is one which means you have
[372]
some object, then c1 and c2. c1
[376]
is for dog so it is one,
[377]
c2 is for person so it is 0. There is the
[380]
person's head here but the person's
[382]
center is here so this person object
[385]
belongs to this cell
[388]
and then 0.05. Like this particular
[391]
distance is 0.05.
[393]
This is 0.3 because see this whole thing
[396]
is 1
[397]
and then your bounding rectangle can go
[400]
out of your grid cell. It is fine
[402]
that's why these values are more than
[403]
one. So 1.3 and 1.
[406]
oh sorry 2 and 1.3 so that is the width
[409]
so 2 is this width
[411]
and 1.3 is height. So it is this height
[416]
and now talking about this particular
[418]
grid cell.
[419]
So there is a person center here so we
[422]
can say person
[423]
is in this grid cell and therefore the c2 class
[426]
value
[427]
is 1 and c1 is 0 because there is no dog
[431]
and these are like bounding boxes so
[434]
0.32 is see 0.32 is this much.
[438]
0.02 is this particular height and
[440]
it is 3 because
[442]
the rectangle, along this yellow line, is
[444]
equal to almost 3
[446]
times, see, the width of this grid
[448]
cell and if you compare this
[450]
this is three times this. That's why i
[452]
have three here
[454]
and now you can have
[457]
for remaining all the cells
[461]
the vector will be this. So pc will be
[464]
zero remaining will be
[466]
don't care so now you have four by four
[470]
by
[470]
seven volume. Why?
[474]
Because you have four by four total grid
[477]
cells. 16 cells
[479]
each cell is a vector of size seven.
[482]
That's why I'm saying four by four by
[484]
seven so if you're talking about this
[486]
top left cell and if you expand it in a
[490]
z
[491]
direction that will be
[494]
this vector of size 7. So I hope you're
[497]
getting an idea. If you don't please
[498]
pause the video and just think about
[500]
what I just said.
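If it helps, the 4 by 4 by 7 volume can be sketched in plain Python. This is only an illustration under some assumptions: the function name `encode_grid` is hypothetical, the object below uses made-up numbers, and object positions are given as fractions of the whole image before being converted to cell-relative values.

```python
GRID = 4
CLASSES = ["dog", "person"]  # c1 = dog, c2 = person


def encode_grid(objects, grid=GRID):
    """Build a grid x grid x 7 target from a list of objects.

    Each object is (class_name, cx, cy, w, h), all as fractions (0..1)
    of the whole image. The object is assigned to the cell that
    contains its center. This basic version holds one object per cell,
    so a second object in the same cell would overwrite the first.
    """
    depth = 5 + len(CLASSES)
    target = [[[0.0] * depth for _ in range(grid)] for _ in range(grid)]
    for class_name, cx, cy, w, h in objects:
        col = min(int(cx * grid), grid - 1)
        row = min(int(cy * grid), grid - 1)
        bx = cx * grid - col  # center offset inside the cell, 0..1
        by = cy * grid - row
        bw = w * grid  # size measured in cells, so it can exceed 1
        bh = h * grid
        one_hot = [1.0 if name == class_name else 0.0 for name in CLASSES]
        target[row][col] = [1.0, bx, by, bw, bh, *one_hot]
    return target


# Made-up dog whose box is 2 cells wide and about 1.3 cells tall.
y = encode_grid([("dog", 0.3125, 0.625, 0.5, 0.325)])
print(y[2][1])  # the cell that owns the dog's center
print(y[0][0])  # an empty cell: pc = 0, the rest is filler
```

Every cell without an object center keeps pc at 0, which matches the "don't care" vectors described above.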
[504]
So now you have the image and then the
[506]
bounding rectangles.
[508]
Now you can form your training data set.
[510]
So your training data set will have
[513]
so many such images. Let's say I am
[515]
showing only three
[516]
for example, but you will have 10,000
[518]
such images.
[520]
Each image will have bounding rectangle
[522]
and based on that rectangle
[524]
you will try to derive. You will first
[528]
form this kind of grid 4x4 grid or 3x3
[531]
or 19x19. It varies.
[533]
It doesn't have to be four by four and
[536]
you will come up with the
[538]
y or a target vector which will be
[541]
for each cell there will be one vector
[544]
so there will be 16
[545]
such vector per training sample
[548]
or per training image. Using this
[552]
now you can train your neural network
[555]
and
[556]
after you have trained it it can do
[558]
prediction.
[559]
So when you now give this type of image
[563]
it can produce 16 such vectors
[567]
(why 16? because this is a 4 by 4 grid)
[571]
which will basically tell you the
[572]
bounding rectangle for each of these
[575]
objects.
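Going the other way, from the 16 predicted vectors back to boxes, could look roughly like this. It is a sketch, not the actual YOLO decoding code: the function name `decode_grid` and the 0.5 cutoff on pc are assumptions, and the prediction below is hand-written rather than produced by a trained network.

```python
def decode_grid(pred, threshold=0.5, class_names=("dog", "person")):
    """Turn a grid x grid x 7 prediction into a list of detections.

    Each detection is (class_name, pc, cx, cy, w, h) with the box
    expressed as fractions of the whole image. Cells whose pc is
    below the threshold are treated as empty.
    """
    grid = len(pred)
    detections = []
    for row in range(grid):
        for col in range(grid):
            pc, bx, by, bw, bh, *class_scores = pred[row][col]
            if pc < threshold:
                continue
            best = max(range(len(class_scores)), key=class_scores.__getitem__)
            detections.append((
                class_names[best],
                pc,
                (col + bx) / grid,  # cell offset back to image fraction
                (row + by) / grid,
                bw / grid,
                bh / grid,
            ))
    return detections


# Hand-written prediction: every cell empty except row 2, column 1.
pred = [[[0.0] * 7 for _ in range(4)] for _ in range(4)]
pred[2][1] = [0.9, 0.25, 0.5, 2.0, 1.3, 1.0, 0.0]
detections = decode_grid(pred)
print(detections)
```

The whole grid is read in one go, with no need to run the network once per cell.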
[576]
So this is the YOLO algorithm. It is
[579]
called You Only Look Once
[580]
because we are not repeating it. See we
[583]
are not doing something like okay
[584]
we have 16 cells. So it's not like we are
[587]
inputting it 16 times and doing 16
[589]
iterations.
[591]
In one forward pass you can make all
[594]
your predictions;
[595]
that is why it is called You Only Look
[598]
Once. Now this is a basic algorithm
[602]
we need some tweaks because there could
[604]
be a few issues with this approach.
[607]
First issue is the algorithm might
[610]
detect
[611]
multiple bounding rectangles for a given
[614]
object.
[615]
It is possible, so how do you tackle that?
[618]
So let's think about this-
[620]
Let's say for a person it detected
[624]
these two yellow and this one white
[626]
rectangle and we know by visual
[630]
observation that this white one
[632]
is the most accurate one and the
[635]
algorithm will also throw out the
[636]
probability.
[637]
It will say this is 0.9,
[639]
you know, the pc, the pc
[641]
class. It will say this is a 90
[643]
percent match with
[644]
person and
[648]
the other rectangles have less
[649]
probability.
[651]
So maybe we can look at all the
[653]
probabilities for a person class and
[654]
take the max right?
[656]
Well we cannot do this okay?
[660]
If you just take a max and if there is
[662]
another person
[664]
what happens to that you don't know
[667]
where that person is right. So as a
[670]
neural network, as a computer you don't
[672]
know
[673]
so you can't take a max you have to use
[675]
different approach.
[677]
So we use this concept of IOU. IOU is
[681]
basically
[681]
intersection over union- which is you
[684]
take this
[685]
rectangle which is 0.9 this is that
[687]
white rectangle
[689]
and then for that same class which is
[691]
person
[693]
you will take all other rectangles and
[695]
try to find
[696]
overlapping area and to find
[699]
overlapping area you use IOU. So here in
[702]
this case
[703]
see this is that yellow box okay?
[706]
So this is that yellow box here and this
[709]
is the white box
[710]
and the area indicated in this orange
[713]
color
[714]
is the intersection area. The area indicated in
[717]
purple is the union area. So you divide
[721]
the intersection by the union, and the more the boxes
[725]
overlap the higher this value will be.
[727]
So let's say if the value is more
[729]
than 0.6 or 0.7
[731]
we can say these rectangles are
[732]
overlapping, if they are completely
[734]
overlapping the value will be 1.
[736]
If they are not overlapping at all value
[738]
will be 0.
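Here is a small Python sketch of that intersection over union calculation. One assumption to flag: boxes are written as corner coordinates (x1, y1, x2, y2) instead of the center, width and height format used in the vectors earlier, because corners make the overlap easier to compute.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if there is one.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # boxes that do not touch
print(iou((0, 0, 2, 2), (1, 0, 3, 2)))  # half of each box overlaps
```

In the third case the intersection is 2 and the union is 6, so the value is one third, which would fall below a 0.6 or 0.7 cutoff.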
[740]
So now we find that
[743]
these two yellow boxes are overlapping
[745]
because their
[746]
IOU is let's say greater than 0.65
[750]
and then you discard those rectangles.
[755]
So I discarded all the rectangles which
[758]
had IOU greater than 0.65
[761]
and kept the rectangle which has class
[762]
probability as max.
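The discard step can be sketched as a short loop. This is a minimal version under a few assumptions: boxes use corner coordinates (x1, y1, x2, y2), the scores and boxes are made up, and the function is meant to be called once per class. The 0.65 threshold is the one mentioned above.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.65):
    """Keep the best box of each overlapping group for ONE class.

    detections is a list of (score, box). The highest-scoring box is
    kept, every remaining box that overlaps it by more than the
    threshold is discarded, and the process repeats on what is left.
    """
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining
                     if iou(best[1], d[1]) <= iou_threshold]
    return kept


# Made-up person detections: three boxes on one person, one on another.
person_boxes = [
    (0.9, (10, 10, 50, 90)),
    (0.7, (12, 12, 52, 92)),
    (0.6, (8, 14, 48, 88)),
    (0.8, (100, 10, 140, 90)),
]
kept = non_max_suppression(person_boxes)
print(kept)
```

The second person at score 0.8 survives because its box does not overlap the 0.9 box, which is exactly why simply taking the max would not work.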
[766]
Okay. So
[770]
So I do this for the person object,
[773]
then I do the same thing for a dog
[775]
object. So for dog
[777]
I find that okay, 0.81, this is the
[779]
max probability.
[780]
I find all other rectangles in this
[783]
image
[783]
again there could be two more dogs here
[786]
and there will be rectangles for those
[788]
also.
[789]
So you will try to find overlap.
[792]
Okay, so let's say if there is a dog here
[795]
you will not find overlap so you will
[796]
not discard that
[798]
particular rectangle but this rectangle
[801]
you find it to be overlapping and since
[803]
0.81 is the max and 0.7
[805]
is less you discard this and you get
[809]
final bounding boxes. This technique is
[812]
also called
[813]
non-max suppression. So after
[816]
the neural network has detected all the
[818]
objects you apply non-max suppression and
[820]
you get
[821]
these unique bounding boxes. Another issue
[824]
could be:
[826]
what if a single cell contains the
[828]
center of two objects?
[830]
In this case the dog and the person both
[833]
are in the middle
[836]
grid cell. Now we use this vector to
[840]
represent the grid cell but
[842]
see this vector can represent only one
[843]
class.
[845]
So how do you represent two classes? Well I
[848]
have
[849]
this value for dog. I have this value for
[851]
person
[852]
so instead of having a seven-dimensional
[854]
vector
[855]
how about we have a vector of size 14
[859]
where you're just concatenating these
[862]
two
[862]
vectors. Okay so this is said to have,
[866]
basically, two anchor boxes, so
[870]
this is one anchor box this is second
[872]
anchor box.
[873]
So here you have two anchor boxes and
[877]
you can actually have more than two
[880]
anchor boxes. Let's say if there are
[881]
three objects
[882]
which have the same center, then you can
[885]
have
[886]
three anchor boxes, you can have five
[888]
anchor boxes but
[890]
if your grid cells are small enough then
[892]
in real life
[893]
it's hard to have, you know, many objects
[897]
belonging to one grid cell
[901]
So now a CNN with two anchor boxes will
[904]
look something like this. Instead of
[906]
a vector of size 7, the only change is now
[908]
you have a vector of size
[910]
14. If you want to have three
[913]
anchor boxes you'll have a vector of
[915]
size 21, 7 times three,
[917]
okay? And that will give you your final
[921]
output.
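As a rough illustration of the concatenation idea, here is how one grid cell with two anchor boxes could be filled. The helper name `encode_cell_with_anchors` and the numbers are hypothetical, and objects are placed into anchor slots simply in list order, which is a simplification since the video does not say how objects are assigned to slots.

```python
def encode_cell_with_anchors(objects_in_cell, num_anchors=2, depth=7):
    """Concatenate one 7-value vector per anchor box for a single cell.

    objects_in_cell holds the [pc, bx, by, bw, bh, c1, c2] vectors of
    the objects whose centers fall in this cell. Unused anchor slots
    are filled with zeros (pc = 0, the rest is filler).
    """
    if len(objects_in_cell) > num_anchors:
        raise ValueError("more objects in this cell than anchor boxes")
    cell = []
    for slot in range(num_anchors):
        if slot < len(objects_in_cell):
            cell.extend(objects_in_cell[slot])
        else:
            cell.extend([0.0] * depth)
    return cell


# Made-up vectors for a dog and a person sharing the same cell.
dog = [1.0, 0.5, 0.6, 2.0, 1.3, 1.0, 0.0]
person = [1.0, 0.4, 0.5, 1.0, 3.0, 0.0, 1.0]

cell = encode_cell_with_anchors([dog, person])
print(len(cell))  # 2 anchors x 7 values
print(len(encode_cell_with_anchors([dog], num_anchors=3)))  # 3 x 7
```

With a 4 by 4 grid and two anchor boxes the full target would then be 4 by 4 by 14 instead of 4 by 4 by 7.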
[921]
So that was all about you only look once
[924]
or YOLO algorithm
[926]
It's a very very fast algorithm even on
[929]
a video clip
[930]
which is, let's say, at 40 frames per
[932]
second it can detect objects
[934]
really fast and it is the most modern
[937]
way of detecting objects so if you are
[939]
in the computer vision field
[941]
if you want to do object detection you
[943]
have to use
[944]
YOLO because it is very fast and
[947]
accurate.
[948]
In the next video we will be looking at
[951]
some code we will do a real object
[954]
detection
[955]
in image and in video using YOLO
[958]
framework.
[959]
I hope you're liking this series so far.
[961]
If you do give it a thumbs up and share
[962]
it with your friends. Thanks.