CS224W: Machine Learning with Graphs | 2021 | Lecture 15.3 - Scaling Up & Evaluating Graph Gen - YouTube
What I want to discuss next is: how do we make the RNN tractable, and how do we evaluate it? We are first going to discuss tractability: how do we make our model more scalable?
The issue is that, in a graph, in principle any node can connect to any prior node. This means there might be a lot of steps for edge generation. When node 1,000 joins, we may need to generate 1,000 possible edges: does it link to node 999, then to 998, all the way down to, let's say, node number 1? So in principle we would need to generate a full adjacency matrix.
This is complex, and it leads to very long-range edge dependencies, because the model needs to memorize a lot: which previous nodes were added, and which of those previous nodes a new node wants to connect to.
So if you have a node ordering like the one I show here, then the way you generate this graph would be: add node 1, add node 2, add node 3, and only now start creating edges. This doesn't feel natural; when I added node 3, I needed to remember that nodes 1 and 2 were already added, and so on.
And the point is that a new node can, in principle, link to any previous node; even node 5 could link to node number 1. The question is: how do we limit this complexity, and how do we limit these long-range dependencies? Because this means that the hidden state of node 5 somehow has to remember that node 1 was added all the way back at the beginning, and that 5 should link to 1. That's a lot to ask of our model, so if we can help it in any way, that would definitely help.
The insight is that we can come up with a node ordering that makes our model much more tractable. Rather than using a random node ordering and having the model worry about long-range dependencies, we are going to use a node ordering that helps the model learn better.
The node ordering we propose is called the breadth-first search (BFS) node ordering. Basically, we start at some random node in the graph and label it node number 1. We label its neighbors as, let's say, 2 and 3. Then their neighbors are 4 and 5, and so on.
This leads to a much more natural recipe for generating the graph. We are saying: add node 1; add node 2 and connect 2 to 1; add node 3 and connect it with 1; add node 4 and connect it with 2 and 3. These steps are much more nicely interwoven. And if you ask yourself how you would draw this graph, you'd probably draw it this way, not in some other way where you first put all the nodes down and only then connect them.
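To make this concrete, here is a minimal sketch, not from the lecture, of how one might compute a BFS node ordering with NetworkX and relabel a graph accordingly; the toy edge list and the helper name bfs_node_ordering are my own illustration.

```python
import networkx as nx

def bfs_node_ordering(G, start):
    """Return the nodes of G in breadth-first search order, starting from `start`."""
    order = [start]
    for _, v in nx.bfs_edges(G, source=start):
        order.append(v)  # each node is appended the first time BFS reaches it
    return order

# Toy graph (illustrative edge list, not the exact graph from the slides).
G = nx.Graph([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")])
order = bfs_node_ordering(G, start="a")
# Relabel so that the i-th node visited by BFS gets ID i.
mapping = {old: i + 1 for i, old in enumerate(order)}
G_bfs = nx.relabel_nodes(G, mapping)
print(order)                  # ['a', 'b', 'c', 'd', 'e']
print(sorted(G_bfs.edges()))  # [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
```

Generating under this relabeled ordering is exactly the "add a node, connect it to recently added nodes" recipe described above.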
So what does the BFS ordering buy us? It buys us the following: because node 4 does not connect to node 1, we know that no later node is going to connect to node 1 from now on, because it's a breadth-first search ordering. If, for example, node 5 were to connect to node 1, then its ID wouldn't be 5, it would be less than 5. In other words, as soon as a newly added node does not link to node 1, we know that all of node 1's neighbors have already been traversed. Therefore node 5, and all the following nodes, will never connect to node 1.
And why is this important? Because it means that when node 5 comes, I don't need to worry about node 1 anymore. I can essentially forget node 1, so I need much less memory: I only need a memory of two steps, rather than having to remember what I did all the way back at the beginning.
So with the BFS ordering, the key insight is that node 5 will never connect to node 1. This means we only need a memory of two steps, rather than a memory of n minus 1 steps, where n is the number of nodes in the network. It also reduces the number of possible orderings: rather than considering all possible orderings, of which there are n factorial, we only have to consider the number of distinct BFS orderings.
It also reduces the number of steps for edge generation, because of this insight that we know 5 won't link to 1. This is important because so far I explained that the edge-level RNN generates a column of the adjacency matrix. But if you take this BFS-based ordering, then the edge-level RNN does not really need to generate the entire column; it only needs to generate a small part of the column, because we know that the rest is 0.
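As a sketch of what this buys us in practice (my own illustration, with a hypothetical band size M rather than a value specified in the lecture), each new node's edge decisions can be encoded as a fixed-length vector over only its M most recent predecessors, since everything earlier is known to be 0:

```python
import numpy as np

def band_limited_targets(adj, M):
    """For each node i (in BFS order), keep only its connections to the M most
    recent predecessors; connections to earlier nodes are guaranteed to be 0."""
    n = adj.shape[0]
    targets = np.zeros((n - 1, M), dtype=adj.dtype)
    for i in range(1, n):
        lo = max(0, i - M)
        row = adj[i, lo:i]                   # connections to predecessors inside the band
        targets[i - 1, M - len(row):] = row  # right-align so position encodes recency
    return targets

# Toy BFS-ordered adjacency matrix (illustrative).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [1, 0, 0, 1, 0],
              [0, 1, 1, 0, 1],
              [0, 0, 0, 1, 0]])
print(band_limited_targets(A, M=3))  # one 3-long row of edge decisions per new node
```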
So rather than generating connectivity to all previous nodes, and having to remember all of that, with the proper node ordering we are guaranteed that all we need to do is generate just a small band of the adjacency matrix. And again, this doesn't prevent us from generating any kind of graph; the approach is still fully general, it is just exploiting the fact that we can re-number the nodes.
We can order the nodes in whatever order we want. And because real graphs are sparse, a favorable ordering gives us a lot of benefit, because it's much easier to learn to generate just this blue band than to learn to generate the entire upper triangle of the adjacency matrix.
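A rough way to see how narrow this band really is for a sparse graph (again my own sketch, not code from the lecture): under a BFS ordering, check for each node how far back its earliest-indexed neighbor sits, and take the maximum of that look-back over all nodes.

```python
import networkx as nx

def bfs_band_width(G, start):
    """Maximum look-back distance under a BFS ordering: how many previous
    nodes the edge-level model would ever need to consider."""
    order = [start] + [v for _, v in nx.bfs_edges(G, source=start)]
    idx = {node: i for i, node in enumerate(order)}
    width = 0
    for i, node in enumerate(order[1:], start=1):
        prev = [idx[nbr] for nbr in G.neighbors(node) if idx[nbr] < i]
        if prev:
            width = max(width, i - min(prev))
    return width

G = nx.grid_2d_graph(10, 10)  # 10x10 grid, 100 nodes
print(bfs_band_width(G, next(iter(G.nodes))))
```

For the grid, this maximum look-back is roughly a couple of BFS frontiers wide, far smaller than the 99 steps a worst-case ordering could require.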
So this was the discussion of how we scale up our model, and its key insight: we get to decide on the node ordering, and if we are smart about it, that can really help the model learn.
The second thing I want to talk about is: how do we evaluate graph generation? There are two ways you can do it. One is to visually look at the graphs and see whether they are similar. That's good for getting some intuition, but we also want to define a statistical notion of graph similarity. Of course, you could try to take two graphs and somehow align them one on top of the other, but that is very expensive, and for big graphs you cannot do it. So we have to define some kind of statistical measure of similarity between two graphs.
So first, let me show you some visual examples of what GraphRNN is able to do. What I'm showing here are three input training graphs; these are the output graphs from GraphRNN; and here are some outputs from three traditional generative models: the Kronecker graph model, the mixed-membership stochastic block model, and the preferential attachment, or Barabasi-Albert, model.
What you notice is that if you give GraphRNN grids, it's going to generate grids. You see little mistakes because the model is stochastic, so it may make some small mistakes; that's OK. But you see that the other models completely fail: they are not able to generate the grid. This is not surprising, because none of these models was developed to generate grids; they were developed to generate networks with different types of properties. So that's OK: GraphRNN can generate the grid, the others cannot.
What is interesting, though, is that if you give GraphRNN examples of graphs with two clusters, with this kind of community structure, GraphRNN is able to learn that these graphs have that structure and to generate graphs with such a structure. Kronecker graphs or Barabasi-Albert, for example, were not designed to generate graphs with community structure, so they cannot do that; but the mixed-membership stochastic block model was developed for community structure, so it does a good job.
And what I want to say here is the following: GraphRNN is super general. It is basically able to take these input graphs, learn about their structure, and then generate new graphs with similar structure. You don't have to tell it, "Hey, I have communities," or "Hey, this is a grid." You just give it the graph, and it will figure out by itself that the given graph is a grid and it needs to generate a grid, or that it's a community-structured graph, or anything else. So it's quite remarkable that the same model can cover such a diversity of graphs, and you don't even have to tell it what the input graphs are.
So now, how about doing a more rigorous comparison of statistical similarity? How do we do that? As I said, we cannot do a direct comparison between two graphs by trying to do graph alignment, because graph isomorphism testing, as we have seen, is a hard problem. So the solution is to compare graph statistics, and the typical graph statistics we have already discussed would be the degree distribution, the clustering coefficient, or the orbit count statistics from the graphlets that I discussed, I think, in lecture number two.
The point is that, given a graph, we are going to describe it with a set of statistics, and then we are going to say that two graphs are more similar if their corresponding statistics are more similar. In our case, we are going to think of each of these statistics as a probability distribution, and I'm going to explain why this is important.
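For instance, a minimal sketch (my own, using NetworkX) of describing a graph by such statistics, each kept as a normalized histogram, might look like this; the helper name graph_statistics and the bin choices are assumptions, not anything prescribed in the lecture.

```python
import networkx as nx
import numpy as np

def graph_statistics(G):
    """Describe a graph by a set of statistics, each treated as a probability distribution."""
    degrees = np.array([d for _, d in G.degree()])
    clustering = np.array(list(nx.clustering(G).values()))
    deg_hist = np.bincount(degrees) / len(degrees)             # degree distribution
    clus_hist, _ = np.histogram(clustering, bins=10, range=(0, 1))
    return {"degree": deg_hist, "clustering": clus_hist / clus_hist.sum()}

stats = graph_statistics(nx.karate_club_graph())
print(stats["degree"])  # fraction of nodes with degree 0, 1, 2, ...
```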
So first: given two statistics or graph properties, say two degree sequences, two degree distributions, or two orbit count distributions, we want to compare them on the set of training graphs as well as on the synthetic graphs. We want to see how similar the set of training graphs is to the set of synthetically generated graphs, and we'd like to measure the level of similarity between the two.
We are going to use a two-step approach. In the first step, we take each of these graphs and describe it with a set of statistics: here is the degree distribution, here is the clustering coefficient distribution, and so on. We do this for all the input training graphs as well as for all the generated graphs. Then we take, let's say, the degree distribution of a synthetic graph and the degree distribution of a real graph, and we want to see how much these distributions differ. To quantify that, we are going to use something called the earth mover's distance.
Now that we have compared the individual statistics, we need to aggregate. Once we have a measure of degree distribution similarity and a measure of clustering coefficient similarity, we need to take these similarities and aggregate them further to get the overall similarity. For this second-level aggregation, we are going to use what is called the maximum mean discrepancy, which will be based on the earth mover's distance.
So let me first define the earth mover's distance, and then the MMD.
The earth mover's distance tries to measure the similarity between two distributions, or the similarity between two histograms. The intuition is: what is the minimum amount of effort, the minimum amount of earth, the minimum amount of probability mass, that has to be moved from one pile to the other so that one pile gets transformed into the other?
So I say: I have two distributions; how much mass, how much of this yellow dirt, this yellow earth, do I have to move between these different piles so that one distribution is transformed into the other? If the two distributions are very different, then the earth mover's distance will be high. And if they are quite similar, the earth mover's distance will be low.
The earth mover's distance can be posed as an optimal flow problem and solved as a linear program. We are basically asking how much earth we have to move, that is, for the flow F that minimizes the overall cost, the sum of the flows f_ij times the ground distances d_ij, between the probability distributions x and y. That's the intuition behind the earth mover's distance metric.
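The optimization itself is not spelled out in the transcript; a standard way to write the earth mover's distance between two histograms x and y, which I take to be what is being described here, is

```latex
\mathrm{EMD}(x, y) \;=\; \frac{\sum_i \sum_j f_{ij}\, d_{ij}}{\sum_i \sum_j f_{ij}},
\qquad
F = (f_{ij}) \;=\; \arg\min_{F \ge 0} \sum_i \sum_j f_{ij}\, d_{ij},
```

where d_ij is the ground distance between bins i and j, and the flow F is constrained so that its row and column sums match the mass in x and y.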
The second part is that we are going to use the maximum mean discrepancy, or MMD. The idea of the maximum mean discrepancy is to represent distances between distributions as distances between mean embeddings of their features. Here I give you the formula for it: basically, the MMD between two distributions p and q, written in terms of some kernel k, is an expectation over elements x and y drawn from p and q, taking the expectation over the distributions p and q. And of course, we need this kernel k; in our case, the kernel will be the L2 distance.
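The formula on the slide is not captured in the transcript; the standard form of the squared MMD with a kernel k, which I understand this passage to be describing, is

```latex
\mathrm{MMD}^2(p, q)
= \mathbb{E}_{x, y \sim p}\big[k(x, y)\big]
+ \mathbb{E}_{x, y \sim q}\big[k(x, y)\big]
- 2\, \mathbb{E}_{x \sim p,\, y \sim q}\big[k(x, y)\big].
```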
So now, let me summarize how we put all of this together. We are given two sets of graphs, and we want to see how similar they are. The way we do this is: for every graph, we compute its statistics. We then use the earth mover's distance to measure the discrepancy between two statistics, that is, between the two distributions. And then we apply the maximum mean discrepancy to measure the similarity between these sets of statistics.
Here, the similarity between the elements of the sets, meaning the individual distributions of the statistics, is computed with the earth mover's distance. And this means, for example, that in this way we can rigorously evaluate the correspondence between a particular set of statistics on the training graphs as well as on the testing graphs.
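To tie the pieces together, here is a rough end-to-end sketch (my own, not the lecture's or the GraphRNN paper's exact implementation) that compares two sets of graphs by their degree distributions, using SciPy's 1-D Wasserstein distance as the earth mover's distance and a Gaussian kernel on top of it for the MMD; the kernel choice and the bandwidth sigma are assumptions.

```python
import networkx as nx
import numpy as np
from scipy.stats import wasserstein_distance

def degree_distribution(G):
    """Degree histogram of G, normalized to a probability distribution."""
    degs = np.array([d for _, d in G.degree()])
    hist = np.bincount(degs).astype(float)
    return hist / hist.sum()

def emd(p, q):
    """Earth mover's distance between two histograms over bin positions 0, 1, ..."""
    n = max(len(p), len(q))
    p, q = np.pad(p, (0, n - len(p))), np.pad(q, (0, n - len(q)))
    return wasserstein_distance(np.arange(n), np.arange(n), p, q)

def mmd_squared(set_a, set_b, sigma=1.0):
    """Squared MMD between two sets of distributions, with a Gaussian kernel
    built on top of the earth mover's distance."""
    k = lambda x, y: np.exp(-emd(x, y) ** 2 / (2 * sigma ** 2))
    kaa = np.mean([k(x, y) for x in set_a for y in set_a])
    kbb = np.mean([k(x, y) for x in set_b for y in set_b])
    kab = np.mean([k(x, y) for x in set_a for y in set_b])
    return kaa + kbb - 2 * kab

train = [degree_distribution(nx.gnp_random_graph(50, 0.1, seed=s)) for s in range(5)]
gen = [degree_distribution(nx.gnp_random_graph(50, 0.1, seed=s)) for s in range(5, 10)]
print(mmd_squared(train, gen))  # small value: the two sets look statistically similar
```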