CS224W: Machine Learning with Graphs | 2021 | Lecture 15.3 - Scaling Up & Evaluating Graph Gen - YouTube


What I want to discuss next is how we make the RNN tractable, and how we evaluate it. So we are first going to discuss tractability: how do we make our model more scalable?
And the issue is that, in a graph, in principle any node can connect to any prior node. This means there might be a lot of steps for edge generation, right? When node 1,000 joins, we may need to generate 1,000 possible edges: does it link to node 999, then to 998, all the way down to node number 1? So in principle we would need to generate a full adjacency matrix. This is complex, and it leads to very long-range edge dependencies, because the model needs to memorize a lot about which previous nodes were added and which of those previous nodes a new node wants to connect to.
So if you have a node ordering like the one I show here, then the way you generate this graph would be: add node 1, add node 2, add node 3, and only now start creating edges. This doesn't feel natural; when I added node 3, I needed to remember that nodes 1 and 2 were already added, and so on. And the point is that a new node can in principle link to any previous node; even node 5 could link to node number 1. The question is, how do we limit this complexity, and how do we limit these long-range dependencies? Because this means that the hidden state at node 5 somehow has to remember that node 1 was added all the way at the beginning and that node 5 should link to it. That's a lot to ask of our model, so if we can help it in any way, that will definitely help.
The insight is that we can come up with a node ordering that makes our model much more tractable. So rather than having a random node ordering and making the model worry about long-range dependencies, we are going to use a node ordering that helps the model learn better. The node ordering we propose is called the breadth-first search (BFS) node ordering. Basically, we start at some random node in the graph and label it node number 1. We label its neighbors as, let's say, 2 and 3; their neighbors become 4 and 5, and so on. This leads to a much more natural recipe for generating the graph: add node 1, add node 2, connect 2 to 1, add node 3, connect it with 1, add node 4, connect it with 2 and 3. The steps are much more nicely interwoven. And if you asked yourself how you would draw this graph, you would probably draw it this way, not by first putting all the nodes down and only then connecting them.
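To make the BFS ordering concrete, here is a minimal sketch (not the GraphRNN authors' code) of computing a BFS node ordering and relabeling a graph accordingly with networkx; the function name bfs_node_ordering and the grid example are just illustrative choices.

```python
import networkx as nx

def bfs_node_ordering(G, start=None):
    """Return G's nodes in breadth-first order, starting from `start`
    (an arbitrary node if none is given)."""
    if start is None:
        start = next(iter(G.nodes))
    # nx.bfs_edges yields tree edges (u, v) in breadth-first order,
    # so collecting the targets v visits every remaining node exactly once.
    return [start] + [v for _, v in nx.bfs_edges(G, start)]

# Relabel nodes so that IDs 1, 2, 3, ... follow the BFS ordering.
G = nx.grid_2d_graph(3, 3)
mapping = {node: i + 1 for i, node in enumerate(bfs_node_ordering(G))}
G_bfs = nx.relabel_nodes(G, mapping)
```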
So what does the BFS ordering buy us? It buys us the following: because node 4 does not connect to node 1, we know that no later node is going to connect to node 1 either, because it's a breadth-first search ordering. If, for example, node 5 were to connect to node 1, then its ID wouldn't be 5; it would be less than 5. So when a given node does not link to node 1, we know that all of node 1's neighbors have already been traversed. Therefore node 5, and all the following nodes, will never connect to node 1. And why is this important? Because it means that when node 5 comes, I don't need to worry about node 1 anymore. I can essentially forget node 1, so I need much less memory; I only need memory of two steps rather than having to remember what I did all the way back at the beginning.
So the key insight of the BFS ordering is that node 5 will never connect to node 1. This means we only need a memory of two steps, rather than a memory of n minus 1 steps, where n is the number of nodes in the network. It also reduces the number of possible orderings: rather than considering all n factorial node orderings, we only have to consider the number of distinct BFS orderings. And it reduces the number of steps for edge generation, because of this insight that we know node 5 won't link to node 1.
And this is important because, so far, I said that the edge-level RNN generates a column of the adjacency matrix. But if you take this BFS-based ordering, the edge-level RNN does not really need to generate the entire column; it only needs to generate a small part of the column, because we know that the rest is 0. So rather than generating connectivity to all previous nodes, and having to remember all of that, with the proper node ordering we are guaranteed that all we need to do is generate just a small band of the adjacency matrix. And again, this doesn't prevent us from generating any kind of graph; the approach is still fully general. It is just exploiting the fact that we can renumber the nodes; we can order the nodes in whatever order we want. And because real graphs are sparse, a favorable ordering gives us a lot of benefit, because it's much easier to learn to generate just this blue band than to learn to generate the entire upper triangle of the adjacency matrix.
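As a rough sketch of what "generating just a small band" means in practice, the snippet below slices out, for each node in the BFS order, only its connections to the M most recent previous nodes; the band width M and the helper name adjacency_to_band_sequences are my own illustrative choices, not the exact GraphRNN implementation. Training then only has to predict about n*M entries instead of the full upper triangle.

```python
import numpy as np

def adjacency_to_band_sequences(A, M):
    """For node i (in BFS order), keep only its connections to the M most
    recent previous nodes, i.e. A[i, i-M:i], right-aligned and zero-padded.
    Under a BFS ordering this band is all the edge-level RNN must generate."""
    n = A.shape[0]
    seqs = np.zeros((n, M), dtype=A.dtype)
    for i in range(1, n):
        lo = max(0, i - M)
        row = A[i, lo:i]               # connections to at most M previous nodes
        seqs[i, M - len(row):] = row   # right-align so position encodes recency
    return seqs
```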
So this was the discussion of how we scale up our model. The insight is that we get to decide on the node ordering, and if we are smart about it, it can really help us; it can help the model learn.
The second thing I want to talk about is, how do we evaluate graph generation? There are two ways you can do it. One is to look at the graphs visually and see whether they are similar. That's good for intuition, but we also want to define a statistical notion of graph similarity. Of course, you could try to take two graphs and somehow align one on top of the other, but that is very expensive, and for big graphs you cannot do it. So we have to define some kind of statistical measure of similarity between two graphs. First, let me show you some visual examples of what GraphRNN is able to do.
What I'm showing here are three input training graphs, the output graphs from GraphRNN, and some outputs from three traditional generative models: the Kronecker graph model, the mixed-membership stochastic block model, and preferential attachment, that is, the Barabasi-Albert model. What you notice is that if you give GraphRNN grids, it is going to generate grids. You see little mistakes because the model is stochastic, so it may make some small mistakes, and that's OK. But you see that the other models completely fail; they are not able to generate the grid. This is not surprising, because none of these models was developed to generate grids; they were developed to generate networks with different types of properties. So that's OK: GraphRNN can generate the grid, the others cannot.
What is interesting, though, is that even if you give GraphRNN examples of graphs with two clusters, with this kind of community structure, GraphRNN is able to learn that these graphs have that structure and to generate graphs with such a structure. Kronecker graphs or Barabasi-Albert, for example, were not designed to generate graphs with community structure, so they cannot do it; but the mixed-membership stochastic block model was developed for community structure, so it does a good job. And what I want to say here is that GraphRNN is super general. It is basically able to take these input graphs, learn about their structure, and then generate new graphs with similar structure. You don't have to tell it, "Hey, I have communities," or, "Hey, this is a grid." You just give it the graph, and it will figure out by itself that the given graph is a grid and it needs to generate a grid, or that it's a community-structured graph, or anything else. So it's quite remarkable that the same model can cover such a diversity of graphs, and you don't even have to tell it what the input graphs are.
So now, how about a more rigorous comparison of statistical similarity? How do we do that? As I said, we cannot do a direct comparison between two graphs by trying to do graph alignment, because graph isomorphism testing, as we have seen, is a hard problem. So the solution is to compare graph statistics, and the typical graph statistics we have already discussed would be the degree distribution, the clustering coefficient, or the orbit count statistics from the graphlets that I discussed, I think, in lecture number two. The point is that, given a graph, we are going to describe it with a set of statistics, and then we are going to say that two graphs are more similar if their corresponding statistics are more similar. In our case, we are going to think of each of these statistics as a probability distribution, and I'm going to explain why this is important.
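As a small sketch of what "describing a graph with statistics treated as probability distributions" could look like, the helpers below build normalized histograms of node degrees and clustering coefficients with networkx; the function names and bin counts are my own illustrative choices.

```python
import numpy as np
import networkx as nx

def degree_histogram(G, n_bins=20):
    """Normalized histogram of node degrees, used as a probability distribution."""
    degrees = np.array([d for _, d in G.degree()])
    hist, _ = np.histogram(degrees, bins=n_bins, range=(0, degrees.max() + 1))
    return hist / hist.sum()

def clustering_histogram(G, n_bins=20):
    """Normalized histogram of per-node clustering coefficients."""
    coeffs = np.array(list(nx.clustering(G).values()))
    hist, _ = np.histogram(coeffs, bins=n_bins, range=(0.0, 1.0))
    return hist / hist.sum()
```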
OK. So first: given two statistics, two graph properties, maybe two degree sequences, two degree distributions, or two orbit count distributions, we want to compare them on the set of training graphs as well as on the set of synthetic graphs. We want to see how similar the set of training graphs is to the set of synthetically generated graphs, and we'd like to measure the level of similarity between the two. We are going to do this with a 2-step approach.
In the first step, we are going to take each of these graphs and describe it with a set of statistics: here is the degree distribution, here is the clustering coefficient distribution. We do this for all the input training graphs as well as for all the generated graphs. Then we take, let's say, the degree distribution of a synthetic graph and the degree distribution of a real graph, and we want to see how much these distributions differ. To quantify that, we are going to use something called the earth mover's distance. Once we have compared the individual statistics, we need to aggregate: once we have a measure of degree distribution similarity and a measure of clustering coefficient similarity, we take these similarities and aggregate them further to get the overall similarity. For this second-level aggregation, we are going to use what is called the maximum mean discrepancy (MMD), which will be based on the earth mover's distance. So let me first define the earth mover's distance and then the MMD.
The earth mover's distance tries to measure the similarity between two distributions, or between two histograms. The intuition is: what is the minimum amount of effort, the minimum amount of earth, the minimum amount of probability mass, that has to be moved from one pile to the other so that one pile gets transformed into the other? So if I have two distributions, how much of this yellow dirt, this yellow earth, do I have to move between these different pillars so that I transform one distribution into the other? If the distributions are very different, the earth mover's distance will be high; if they are similar, the earth mover's distance will be low. The earth mover's distance can be solved as an optimal flow and is found by solving a linear programming optimization problem. We are basically saying that the amount of work is the total cost of moving earth, the sum over bins i and j of the flow f_ij times the ground distance d_ij, and we look for the flow that minimizes this cost while transforming the probability distribution x into the distribution y. So that's the intuition behind the earth mover's distance metric.
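For 1-D histograms, the earth mover's distance coincides with the Wasserstein-1 distance, so a minimal way to compute it is scipy.stats.wasserstein_distance; the toy histograms below are made-up values for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two normalized degree histograms over the same bins (toy values)
p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.3, 0.3, 0.2, 0.2])

bins = np.arange(len(p))  # bin positions supply the ground distances d_ij
emd = wasserstein_distance(bins, bins, u_weights=p, v_weights=q)
print(emd)  # small value: little mass needs to be moved to match the piles
```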
And then, for the second part, we are going to use the maximum mean discrepancy, or MMD. The idea of maximum mean discrepancy is to represent distances between distributions as distances between mean embeddings of their features. Here is the formula for it: if I write it in terms of some kernel k, the MMD between two distributions p and q is

MMD^2(p, q) = E_{x,x'~p}[k(x, x')] + E_{y,y'~q}[k(y, y')] - 2 E_{x~p, y~q}[k(x, y)],

where the expectations are taken over elements x and y drawn from the distributions p and q. And of course, we need to have this kernel k; in our case, the kernel will be based on the L2 distance.
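Here is a minimal sketch of estimating MMD^2 from samples, plugging in a Gaussian-style kernel built on the earth mover's distance between histograms; the kernel form, the sigma parameter, and the function names are my assumptions for illustration, not necessarily the exact kernel used in the GraphRNN evaluation.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def emd_kernel(x, y, sigma=1.0):
    """Kernel on two normalized histograms, built on their earth mover's distance."""
    bins = np.arange(len(x))
    emd = wasserstein_distance(bins, bins, u_weights=x, v_weights=y)
    return np.exp(-emd**2 / (2 * sigma**2))

def mmd_squared(P, Q, kernel=emd_kernel):
    """Sample estimate of MMD^2(p, q) = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)],
    where P and Q are lists of histograms drawn from p and q."""
    k_pp = np.mean([kernel(x, xp) for x in P for xp in P])
    k_qq = np.mean([kernel(y, yp) for y in Q for yp in Q])
    k_pq = np.mean([kernel(x, y) for x in P for y in Q])
    return k_pp + k_qq - 2 * k_pq
```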
So now, let me just summarize how we put all of this together. We are given two sets of graphs, and we want to see how similar they are. The way we do this is: for every graph, we compute its statistics. We then use the earth mover's distance to measure the discrepancy between two statistics, between the two distributions. And then we apply the maximum mean discrepancy to measure the similarity between these sets of statistics, where the similarity between the sets' elements, meaning the individual distributions of the statistics, is computed with the earth mover's distance. And this means, for example, that we can rigorously evaluate the correspondence between a particular set of statistics on the training graphs and on the generated test graphs.
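Putting the pieces together, a minimal end-to-end sketch (reusing the illustrative degree_histogram and mmd_squared helpers defined above; the toy graph sets are made up) would look like this:

```python
import networkx as nx

# Toy stand-ins for the training set and the generated set
train_graphs = [nx.grid_2d_graph(4, 4) for _ in range(8)]
generated_graphs = [nx.barabasi_albert_graph(16, 2) for _ in range(8)]

# Step 1: describe each graph by a statistic treated as a distribution
P = [degree_histogram(G) for G in train_graphs]
Q = [degree_histogram(G) for G in generated_graphs]

# Step 2: aggregate with MMD, whose kernel is built on the earth mover's distance
score = mmd_squared(P, Q)  # smaller score = more similar degree statistics
```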