What is YOLO algorithm? | Deep Learning Tutorial 31 (Tensorflow, Keras & Python) - YouTube

Channel: codebasics

[0]

YOLO is a state-of-the-art object

[2]

detection algorithm

[4]

and it is so fast that it has become

[6]

almost a standard way

[8]

of detecting objects in the field of

[10]

computer vision.

[12]

Previously people were using sliding

[14]

window object detection

[16]

then faster versions were invented

[20]

such as

[21]

R-CNN, Fast R-CNN and Faster

[24]

R-CNN, but in 2015

[28]

YOLO was invented which

[31]

was much faster than the previous object

[34]

detection algorithms

[35]

and that's what we are going to discuss

[37]

today. We will go over the theory

[39]

on how exactly YOLO works and in the

[42]

future video we will also do coding so

[45]

this video is just

[46]

about the theory behind how YOLO works

[48]

and we'll

[49]

try to see why it is faster. The full form of

[52]

YOLO is

[53]

You Only Look Once. Let's say you're

[55]

working on

[57]

an image classification problem where

[59]

you want to decide if the

[60]

image is of a dog or a person. In this

[63]

case the

[64]

output of the neural network is pretty

[66]

simple. You will say dog is equal to one

[68]

person is equal to zero but when you

[70]

talk about object localization

[73]

you're not only telling which class this

[75]

is you're also telling the bounding box

[78]

or the position of an object within the

[81]

image.

[82]

So here in addition to dog is equal to 1

[85]

and person is equal to 0

[86]

you are also telling about the bounding

[89]

box. Now how exactly do you do that?

[90]

So in terms of the neural network output

[93]

you can have a vector like this

[95]

where pc is the probability of a class.

[99]

So here if there is a dog or a person

[102]

then this number will be one. If there is

[105]

no dog or no person

[106]

this number will be zero then the

[109]

bounding box

[110]

so bx, by is the coordinate

[114]

of the center which is indicated in

[116]

yellow

[117]

circle here and 1670 is the

[120]

width and height of this red box.

[124]

C1 is class one that is for dog.

[127]

So here it will be one. c2 is for

[130]

person and it will be zero. If you have a

[133]

different image like this,

[134]

there is a person here. This is my

[136]

picture in my

[137]

high school. The pc probability of any

[141]

class is 1

[142]

because there is some object and these

[145]

are like bounding box

[146]

coordinates and c1 is 0 because it's

[149]

not a dog and

[150]

c2 is 1 because it's a person and when

[153]

you have

[154]

no object in the image the pc will be

[157]

zero

[158]

and rest of the values don't matter. So

[160]

now you can train

[161]

a neural network to

[164]

classify the object as well as the

[168]

bounding box. So you can have I am just

[171]

showing

[172]

three images here but you can have, let's say,

[173]

ten thousand such images

[176]

and for each of these images since it's

[178]

a supervised learning problem

[180]

you need to give the bounding boxes. And

[183]

the way you give bounding boxes to

[185]

a neural network...

[186]

understand, a neural network only

[187]

understands numbers, so you

[189]

have to convert this into

[192]

this kind of vectors. So you will have a

[195]

vector of size

[196]

7 for each corresponding image. So

[199]

the image is x train

[200]

and y train will be a vector of size 7.
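
As a rough illustration of that size-7 target, here is a minimal Python sketch. The helper name `make_target`, the names `bw` and `bh` for width and height, and the example numbers are assumptions for illustration only; zeros stand in for the "don't care" values when there is no object.

```python
def make_target(label, box=None, classes=("dog", "person")):
    """Return [pc, bx, by, bw, bh, c1, c2] for one image.

    label: class name, or None when the image contains no object.
    box:   (bx, by, bw, bh) = center x, center y, width, height.
    """
    if label is None:
        # pc = 0: the remaining values are "don't care"; zeros are placeholders.
        return [0.0] * (5 + len(classes))
    one_hot = [1.0 if c == label else 0.0 for c in classes]
    return [1.0, *(float(v) for v in box), *one_hot]


# Hypothetical box values, only to show the layout of the vector.
dog_target = make_target("dog", (50, 70, 60, 70))
empty_target = make_target(None)
print(dog_target)
print(empty_target)
```

Each training image (x train) is then paired with one such vector (y train).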

[204]

You can have 10,000 such images. You can

[207]

train a neural network in a way

[210]

that

[211]

if you input a new image now it will

[214]

tell you

[215]

that particular vector and now this

[218]

vector is telling you

[219]

that this is a dog because c1 is set to

[222]

1 and it is also telling you the

[224]

bounding box so basically it's

[226]

essentially giving you the

[229]

answer for your object detection or

[231]

object localization rather.

[234]

This only works for a single object. If

[237]

you have multiple objects, what do you

[239]

do?

[239]

Here there is person and a dog in the

[241]

same image.

[243]

One might say that okay

[246]

you know in my image there could be

[249]

n number of objects. There could be two

[251]

dogs three people there could be five

[254]

dogs one person you don't know

[257]

how many objects are there in the

[258]

picture. So it's

[260]

hard to determine the dimension of your

[263]

neural network

[264]

output. If you have one object,

[268]

it's pretty fixed, right? But if you

[271]

have

[272]

n number of objects and you don't know

[274]

then

[275]

determining the size of the output of

[278]

neural network is hard.

[279]

You can say upper max is 10 let's say

[282]

there will be only 10 objects

[284]

and you can have 10 into 7 which is like

[286]

[287]

70 size vector but what if there are 11

[290]

objects?

[291]

See so that doesn't work so you have to

[293]

do something

[294]

else. All right! So let's say you have

[297]

this image

[298]

and there are two bounding boxes that

[301]

this image has.

[303]

What the YOLO algorithm will do is it will

[306]

divide this

[307]

image into this kind of grid cells. So

[310]

I'm using

[311]

four by four grid here. It could be three

[314]

by three it could be 19 by 19.

[316]

There's no fixed rule that it has to be

[318]

four by four.

[320]

And for each of the grid cells for

[321]

example this grid cell

[323]

you can encode or you can come up with

[326]

that vector that we saw previously

[328]

which is pc bounding box c1 and c2

[332]

there are no objects here so probability

[334]

of class will be zero

[335]

and then rest of the values don't matter.

[338]

But for

[338]

this particular grid cell,

[342]

So I have highlighted here the dog is

[345]

there in the picture

[346]

See, when the dog spans multiple

[348]

grid cells you try to

[350]

find the central place of that dog and

[354]

the dog belongs to that particular grid

[356]

cell.

[357]

So I'm in this particular cell here and

[360]

when I

[361]

look at the coordinates you can think

[363]

about this point as the

[364]

(0, 0) and this point as the (1, 1)

[367]

coordinate.

[368]

And now you can create this vector where

[370]

p c is one which means you have

[372]

some object then c1 and c2. c1

[376]

is for dog so it is one,

[377]

c2 is for person so it is 0. There is a

[380]

person's head here but the person's

[382]

center is here so this person object

[385]

belongs to this cell

[388]

and then 0.05. Like this particular

[391]

distance is 0.05.

[393]

This is 0.3 because see this whole thing

[396]

is 1

[397]

and then your bounding rectangle can go

[400]

out of your grid cell. It is fine

[402]

that's why these values are more than

[403]

one. So 1.3 and 1.

[406]

oh sorry 2 and 1.3 so that is the width

[409]

so 2 is this width

[411]

and 1.3 is height. So it is this height

[416]

and now talking about this particular

[418]

grid cell.

[419]

So there is a person center here so we

[422]

can say person

[423]

is in this grid cell and therefore the c2 class

[426]

value

[427]

is 1. c1 is 0 because there is no dog

[431]

and these are like bounding boxes so

[434]

0.32 is, see, 0.32 is this much.

[438]

0.02 is this particular height and

[440]

it is 3 because

[442]

the rectangle width, this yellow line, is

[444]

equal to almost 3 times

[446]

the size of, see, the width of this grid

[448]

cell. If you compare this,

[450]

this is three times this. That's why I

[452]

have three here

[454]

and now you can have

[457]

for remaining all the cells

[461]

the vector will be this. So pc will be

[464]

zero remaining will be

[466]

don't care. So now you have a four by four by

[470]

seven volume. Why?

[474]

Because you have four by four total grid

[477]

cells. 16 cells

[479]

each cell is a vector of size seven.

[482]

That's why I'm saying four by four by

[484]

seven so if you're talking about this

[486]

top left cell and if you expand it in the depth

[490]

[491]

direction that will be

[494]

this vector of size 7. So I hope you're

[497]

getting an idea. If you don't please

[498]

pause the video and just think about

[500]

what I just said.
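
To make the four by four by seven volume concrete, here is a minimal sketch of how one image's boxes could be turned into that target. It assumes box centers and sizes are given as fractions of the whole image; the function name `encode` and the input numbers are hypothetical, chosen so that the dog's cell comes out as 0.05, 0.3, 2 and 1.3 like in the example.

```python
GRID = 4                      # 4x4 here; 3x3 or 19x19 work the same way
CLASSES = ("dog", "person")


def encode(objects, grid=GRID, classes=CLASSES):
    """objects: list of (label, cx, cy, w, h), all relative to the image (0..1).
    Returns a grid x grid x 7 nested list of [pc, bx, by, bw, bh, c1, c2]."""
    depth = 5 + len(classes)
    target = [[[0.0] * depth for _ in range(grid)] for _ in range(grid)]
    for label, cx, cy, w, h in objects:
        col = min(int(cx * grid), grid - 1)   # cell that contains the center
        row = min(int(cy * grid), grid - 1)
        bx = cx * grid - col                  # center offset inside the cell, 0..1
        by = cy * grid - row
        bw = w * grid                         # size in cell units, may exceed 1
        bh = h * grid
        one_hot = [1.0 if c == label else 0.0 for c in classes]
        target[row][col] = [1.0, bx, by, bw, bh, *one_hot]
    return target


y = encode([("dog", 0.5125, 0.575, 0.5, 0.325)])
print(y[2][2])   # the cell holding the dog's center
print(y[0][0])   # an empty cell: pc = 0, rest are placeholders
```

Every cell without an object center keeps pc = 0, so the whole image maps to 16 vectors of size 7.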

[504]

So now you have the image and then the

[506]

bounding rectangles.

[508]

Now you can form your training data set.

[510]

So your training data set will have

[513]

so many such images. Let's say I am

[515]

showing only three

[516]

for example, but you will have 10,000

[518]

such images.

[520]

Each image will have bounding rectangle

[522]

and based on that rectangle

[524]

you will try to derive. You will first

[528]

form this kind of grid 4x4 grid or 3x3

[531]

or 19x19. It varies.

[533]

It doesn't have to be four by four and

[536]

you will come up with the

[538]

y or a target vector which will be

[541]

for each cell there will be one vector

[544]

so there will be 16

[545]

such vectors per training sample

[548]

or per training image. Using this

[552]

now you can train your neural network

[555]

and

[556]

after you have trained it it can do

[558]

prediction.

[559]

So when you now give this type of image

[563]

it can produce 16 such vectors and

[567]

why 16? Because this is a 4 by 4 grid

[571]

which will basically tell you the

[572]

bounding rectangle for each of these

[575]

objects.
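
Going the other way, the 16 predicted vectors can be read back into boxes. This is only a sketch: the name `decode` and the 0.5 cutoff on pc are assumptions, and the prediction below is filled in by hand rather than produced by a trained network.

```python
def decode(pred, threshold=0.5, classes=("dog", "person")):
    """pred: grid x grid x 7 nested list from a single forward pass.
    Returns (label, pc, cx, cy, w, h) tuples, relative to the whole image."""
    grid = len(pred)
    boxes = []
    for row in range(grid):
        for col in range(grid):
            pc, bx, by, bw, bh, *scores = pred[row][col]
            if pc < threshold:
                continue  # no object centered in this cell
            best = max(range(len(classes)), key=lambda i: scores[i])
            boxes.append((classes[best], pc,
                          (col + bx) / grid, (row + by) / grid,
                          bw / grid, bh / grid))
    return boxes


# Hand-made prediction: every cell empty except the one at row 2, column 2.
pred = [[[0.0] * 7 for _ in range(4)] for _ in range(4)]
pred[2][2] = [0.9, 0.05, 0.3, 2.0, 1.3, 1.0, 0.0]
detections = decode(pred)
print(detections)
```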

[576]

So this is the YOLO algorithm. It is

[579]

called You Only Look Once

[580]

because we are not repeating it. See we

[583]

are not doing something like okay

[584]

we have 16 cells. So it's not like we are

[587]

inputting it 16 times and doing 16

[589]

iterations.

[591]

In one forward pass you can make all

[594]

your predictions;

[595]

that is why it is called You Only Look

[598]

Once. Now this is a basic algorithm

[602]

we need some tweaks because there could

[604]

be few issues with this approach.

[607]

First issue is the algorithm might

[610]

detect

[611]

multiple bounding rectangles for a given

[614]

object.

[615]

It is possible, so how do you tackle that?

[618]

So let's think about this-

[620]

Let's say for a person it detected all

[624]

these two yellow and this one white

[626]

rectangle and we know by visual

[630]

observation that this white one

[632]

is the most accurate one and the

[635]

algorithm will also throw out the

[636]

probability.

[637]

It will say this is 90 percent,

[639]

you know, the pc. The pc

[641]

class. It will say this is 90

[643]

percent matching with

[644]

person and

[648]

the other rectangles have less

[649]

probability.

[651]

So maybe we can look at all the

[653]

probabilities for a person class and

[654]

take the max right?

[656]

Well we cannot do this okay?

[660]

If you just take a max and if there is

[662]

another person

[664]

what happens to that you don't know

[667]

where that person is right. So as a

[670]

neural network, as a computer you don't

[672]

know

[673]

so you can't take a max you have to use

[675]

different approach.

[677]

So we use this concept of IOU. So IOU is

[681]

basically

[681]

intersection over union, which is: you

[684]

take this

[685]

rectangle which is 0.9 this is that

[687]

white rectangle

[689]

and then for that same class which is

[691]

person

[693]

you will take all other rectangles and

[695]

try to find

[696]

overlapping area and to find

[699]

overlapping area you use IOU. So here in

[702]

this case

[703]

see this is that yellow box okay?

[706]

So this is that yellow box here and this

[709]

is the white box

[710]

and the area indicated in this orange

[713]

color

[714]

is intersection area. Area indicated in

[717]

purple color is the union area. So you

[721]

divide the intersection by the union and if the objects

[725]

are overlapping this value will be more.

[727]

So let's say if the value is more

[729]

than 0.6 or 0.7

[731]

we can say these rectangles are

[732]

overlapping, if they are completely

[734]

overlapping the value will be 1.

[736]

If they are not overlapping at all value

[738]

will be 0.
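
The intersection-over-union ratio is short enough to write out. A minimal sketch, assuming boxes are given as corner coordinates (x1, y1, x2, y2), which is a representation chosen here for convenience:

```python
def iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2) corners with x1 < x2 and y1 < y2."""
    # Corners of the intersection rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (0, 0, 2, 2)))   # identical boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))   # disjoint boxes
print(iou((0, 0, 2, 2), (1, 0, 3, 2)))   # half of each box overlaps
```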

[740]

So now we find that

[743]

these two yellow boxes are overlapping the white one

[745]

because their

[746]

IOU is let's say greater than 0.65

[750]

and then you discard those rectangles.

[755]

So I discarded all the rectangles which

[758]

had IOU greater than 0.65

[761]

and kept the rectangle which has class

[762]

probability as max.
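
The keep-the-max, discard-the-overlaps step can be sketched per class as below. The 0.65 threshold comes from the example above; the function names, the corner-style boxes and the sample detections are assumptions for illustration.

```python
def box_iou(a, b):
    """Intersection over union for (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(dets, iou_threshold=0.65):
    """dets: list of (label, score, (x1, y1, x2, y2)). Returns kept detections."""
    kept = []
    for label in sorted({d[0] for d in dets}):          # handle each class separately
        candidates = sorted((d for d in dets if d[0] == label),
                            key=lambda d: d[1], reverse=True)
        while candidates:
            best = candidates.pop(0)                    # highest remaining score
            kept.append(best)
            # Drop boxes that overlap the kept one too much; others survive.
            candidates = [d for d in candidates
                          if box_iou(best[2], d[2]) <= iou_threshold]
    return kept


dets = [
    ("person", 0.9, (10, 10, 50, 90)),    # the "white" box
    ("person", 0.6, (12, 12, 52, 92)),    # overlapping duplicate
    ("person", 0.7, (9, 8, 49, 88)),      # overlapping duplicate
    ("person", 0.8, (70, 10, 110, 90)),   # a second person elsewhere
]
kept = non_max_suppression(dets)
print(kept)
```

Because only overlapping boxes are discarded, a second person elsewhere in the image survives, which is exactly what taking a plain max would lose.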

[766]

Okay. So

[770]

So I do this for the person object

[773]

then I do the same thing for a dog

[775]

object. So for dog

[777]

I find that, okay, 0.81, this is the

[779]

max probability.

[780]

I find all other rectangles in this

[783]

image

[783]

again there could be two more dogs here

[786]

and there will be rectangles for those

[788]

also.

[789]

So you will try to find overlap.

[792]

Okay so let's say if there is a dog here

[795]

you will not find overlap so you will

[796]

not discard that

[798]

particular rectangle but this rectangle

[801]

you find it to be overlapping and since

[803]

0.81 is the max and 0.7

[805]

is less, you discard this and you get

[809]

final bounding boxes. This technique is

[812]

also called

[813]

non-max suppression. So after

[816]

the neural network has detected all the

[818]

objects you apply non-max suppression and

[820]

you get

[821]

these unique bounding boxes. Another

[824]

issue could be:

[826]

what if a single cell contains the

[828]

center of two objects?

[830]

In this case the dog and the person both

[833]

are in the middle

[836]

grid cell. Now we use this vector to

[840]

represent the grid cell but

[842]

see this vector can represent only one

[843]

class.

[845]

So how do you represent two classes? Well I

[848]

have

[849]

this value for dog. I have this value for

[851]

person

[852]

so instead of having a seven dimension

[854]

vector

[855]

how about we have a vector of size 14

[859]

where you're just concatenating these

[862]

two

[862]

vectors. Okay so this is said to have,

[866]

basically, two anchor boxes so

[870]

this is one anchor box this is second

[872]

anchor box.

[873]

So here you have two anchor boxes and

[877]

you can actually have more than two

[880]

anchor boxes. Let's say if there are

[881]

three objects

[882]

which have the same center, then you can

[885]

have

[886]

three anchor boxes, you can have five

[888]

anchor boxes but

[890]

if your grid cells are small enough then

[892]

in real life

[893]

it's hard to have, you know, many objects

[897]

belonging to one grid cell

[901]

So now a CNN with two anchor boxes will

[904]

look something like this. So instead of

[906]

a vector of size 7 the only change is now

[908]

you have a vector of size

[910]

14. If you want to have three

[913]

anchor boxes you'll have a vector of

[915]

size 21, 7 into three

[917]

okay? And that will give you your final

[921]

output.
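
The anchor-box bookkeeping is just concatenation, as this small sketch shows. The helper names are hypothetical and the box numbers are made up; how an object gets assigned to a particular anchor slot is not covered here, so the order of the slots is simply assumed.

```python
def output_shape(grid=4, num_classes=2, num_anchors=2):
    """Shape of the network output: one (5 + classes) vector per anchor per cell."""
    per_anchor = 5 + num_classes
    return (grid, grid, num_anchors * per_anchor)


def cell_vector(anchor_slots, num_classes=2):
    """anchor_slots: one entry per anchor box; None for an empty slot,
    otherwise a [pc, bx, by, bw, bh, c1, c2] list."""
    per_anchor = 5 + num_classes
    vec = []
    for slot in anchor_slots:
        vec.extend(slot if slot is not None else [0.0] * per_anchor)
    return vec


# Made-up values for a dog and a person whose centers fall in the same cell.
dog = [1.0, 0.4, 0.6, 2.0, 1.3, 1.0, 0.0]
person = [1.0, 0.5, 0.5, 0.9, 3.0, 0.0, 1.0]
both = cell_vector([dog, person])

print(output_shape())                 # two anchor boxes
print(output_shape(num_anchors=3))    # three anchor boxes
print(len(both))
```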

[921]

So that was all about you only look once

[924]

or the YOLO algorithm.

[926]

It's a very very fast algorithm even on

[929]

a video clip

[930]

which is, let's say, at 40 frames per

[932]

second it can detect objects

[934]

really fast and it is the most modern

[937]

way of detecting objects so if you are

[939]

in the computer vision field,

[941]

if you want to do object detection you

[943]

have to use

[944]

YOLO because it is very fast and

[947]

accurate.

[948]

In the next video we will be looking at

[951]

some code we will do a real object

[954]

detection

[955]

in images and in video using the YOLO

[958]

framework.

[959]

I hope you're liking this series so far.

[961]

If you do give it a thumbs up and share

[962]

it with your friends. Thanks.
