NVIDIA’s AI Puts Video Calls On Steroids! 💪

Channel: Two Minute Papers

Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. This paper is really something else. Scientists at NVIDIA just came up with an absolutely insane idea for video conferencing.
Their idea is not to do what everyone else is doing, which is transmitting our video to the person on the other end. No, of course not, that would be too easy! What they do in this work is take only the first image from the video, and they throw away the entire video afterwards! But before discarding it, the method stores a tiny bit of information: how our head is moving over time, and how our expressions change. That is an absolutely outrageous idea… and of course, we like those around here, so, does this work? Well, let’s have a look.
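In other words, the heavy lifting moves from the network to the receiver. Here is a minimal sketch of that transmission scheme, purely to make the idea concrete; `extract_keypoints` and `generate_frame` are hypothetical stand-ins for the paper's learned networks, not NVIDIA's actual API:

```python
# Sketch of the idea only, with hypothetical stand-ins for the learned networks.

def sender(video_frames, extract_keypoints):
    """Transmit the first frame once, then only tiny per-frame keypoint/pose data."""
    yield ("source", video_frames[0])                 # the single image that gets sent
    for frame in video_frames:
        yield ("driving", extract_keypoints(frame))   # a handful of numbers per frame

def receiver(stream, generate_frame):
    """Re-render every frame from the single source image plus the received keypoints."""
    source_image = None
    for kind, payload in stream:
        if kind == "source":
            source_image = payload
        else:
            yield generate_frame(source_image, payload)
```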
This is the input video. Note that this is not transmitted; only the first image and some additional information are, and the rest of this video is discarded. And hold on to your papers, because this is the output of the algorithm compared to the input video. No, this is not some kind of misunderstanding, nobody has copy-pasted the results there. This is a near-perfect reconstruction of the input, except that the amount of information we need to transmit through the network is significantly less than with previous compression techniques.
How much less? Well, you know what’s coming, so let’s try it out! Here is the output of the new technique, and here is the comparison against H.264, a powerful and commonly used video compression standard. Well, to our disappointment, the two seem close; the new technique appears better, especially around the glasses, but the rest is similar. And if you have been holding on to your papers so far, now squeeze that paper, because this is not a reasonable comparison. And that is because the previous method was allowed to transmit 6 to 12 times more information. Look, as we further decrease the data allowance of the previous method, it can still transmit more than twice as much information, and at this point, there is no contest. This bitrate would be unusable for any kind of videoconferencing, while the new method uses less than half as much information and still transmits a sharp and perfectly fine video. Overall, the authors report that their new method is ten times more efficient. That is unreal.
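To get a feel for where that saving comes from, here is a back-of-the-envelope estimate; every number in it (frame rate, keypoint count, values per keypoint) is an assumption for illustration, not a figure from the paper:

```python
# Rough, assumed numbers only; not measurements from the paper.
fps = 30                    # assumed frame rate
num_keypoints = 10          # assumed number of facial keypoints
values_per_keypoint = 4     # assumed: 3-D position plus some pose/expression data
bytes_per_value = 4         # 32-bit floats

keypoint_kbps = fps * num_keypoints * values_per_keypoint * bytes_per_value * 8 / 1000
print(f"keypoint stream: ~{keypoint_kbps:.1f} kbit/s")   # ≈ 38 kbit/s with these guesses

# Conventional compressed video for a call typically needs hundreds of kbit/s or more,
# which is why an order-of-magnitude saving, like the roughly ten-fold figure the
# authors report, is plausible even after the one-time source image is sent.
```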
This is an excellent video reconstruction technique, that much is clear. And if it only did that, it would be a great paper. But this is not just a great paper, this is an absolutely amazing paper, so it does even more. Much, much more! For instance, it can also rotate our head to make a frontal video, it can fix potential framing issues by translating our head, and it can transfer all of our gestures to a new model. And it is also evaluated well, so all of these new features are tested in isolation.
Look at these two previous methods trying to frontalize the input video. One would think the task is hardly even possible to perform properly, given how much these techniques are struggling with it… until we look at the new method. My goodness. There is some jumpiness in the neck movement in the output video here, and some warping issues here, but otherwise, very impressive results. Now, if you have been holding on to your papers so far, squeeze that paper, because these previous methods are not some ancient papers that were published a long time ago. Not at all! Both of them were published within the same year as the new paper. How amazing is that? Wow.
I really liked this page from the paper, which showcases both the images and the mathematical measurements against previous methods side by side. There are many ways to measure how close two videos are to each other. The up and down arrows tell us whether the given quality metric is subject to minimization or maximization; for instance, pixelwise errors are typically minimized, so lower is better, but we are to maximize the peak signal-to-noise ratio. And the cool thing is that none of this matters too much as soon as we insert the new technique, which really outpaces all of these.
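As a concrete example of one of those metrics, here is the standard peak signal-to-noise ratio computation (this is the textbook definition, not code from the paper):

```python
import numpy as np

def psnr(reference: np.ndarray, reconstruction: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means the two frames match more closely."""
    mse = np.mean((reference.astype(np.float64) - reconstruction.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                      # identical frames
    return 10.0 * np.log10(max_value ** 2 / mse)

# Pixelwise errors like MSE get a "down" arrow (lower is better),
# while PSNR gets an "up" arrow (higher is better).
```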
And we are still not done yet! So we said that the technique takes the first image, reads the evolution of expressions and the head pose from the input video, and then discards the entirety of the video, save for the first image. The cool thing about this was that we could pretend to rotate the head pose information, and the result is that the head appears rotated in the output image. That was great. But what if we take the source image from someone, and take this data, the driving keypoint sequence, from someone else? Well, what we get is motion transfer. Look! We only need one image of the target person, and we can transfer all of our gestures to them, in a way that is significantly better than most previous methods.
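Viewed through the earlier sketch, all three tricks run through one and the same pipeline; the only thing that changes is which keypoints drive the generator. The helpers below (`extract_keypoints`, `generate_frame`, `rotate_pose`) are again hypothetical stand-ins for the paper's learned components:

```python
# Sketch only: the three functions passed in are hypothetical stand-ins
# for the paper's learned keypoint representation and image generator.

def reconstruct(source_image, driving_video, extract_keypoints, generate_frame):
    # Ordinary video-call reconstruction: drive the source person with their own motion.
    return [generate_frame(source_image, extract_keypoints(f)) for f in driving_video]

def frontalize(source_image, driving_video, extract_keypoints, generate_frame, rotate_pose):
    # Same pipeline, but the head pose in each keypoint set is rotated before
    # rendering, so the output faces the camera.
    return [generate_frame(source_image, rotate_pose(extract_keypoints(f)))
            for f in driving_video]

def motion_transfer(target_image, driving_video, extract_keypoints, generate_frame):
    # Same pipeline again, except the single source image now shows a different
    # person than the one whose keypoints are driving the motion.
    return [generate_frame(target_image, extract_keypoints(f)) for f in driving_video]
```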
Now, of course, not even this technique is perfect; it still struggles a great deal in the presence of occluder objects. But still, just the fact that this is possible feels like something straight out of a science fiction movie. What a time to be alive!
Thanks for watching and for your generous support, and I'll see you next time!