CS224W: Machine Learning with Graphs | 2021 | Lecture 15.3 - Scaling Up & Evaluating Graph Gen - YouTube


What I want to discuss next is how we make the RNN tractable, and how we evaluate it. So we are first going to discuss tractability: how do we make our model more scalable?
And the issue is that, in a graph, in principle any node can connect to any prior node. This means there might be a lot of steps for edge generation, right? When node 1,000 joins, we may need to generate 1,000 possible edges: does it link to node 999, then to 998, all the way down to node number 1? So in principle we would need to generate a full adjacency matrix. This is complex, and it leads to very long-range edge dependencies, because the model needs to memorize a lot about which previous nodes were added and which of those previous nodes a new node wants to connect to.
So if you have a node ordering like the one I show here, then the way you generate this graph would be: add node 1, add node 2, add node 3, and only now start creating edges. This doesn't feel natural; when I added node 3, I needed to remember that nodes 1 and 2 were already added, and so on. And the point is that a new node can in principle link to any previous node; even node 5 could link to node number 1. The question is, how do we limit this complexity, and how do we limit these long-range dependencies? Because this means that the hidden state at node 5 somehow has to remember that node 1 was added all the way at the beginning and that node 5 should link to it. That's a lot to ask of our model, so if we can help it in any way, that will definitely help.
The insight is that we can come up with a node ordering that makes our model much more tractable. So rather than having a random node ordering and making the model worry about long-range dependencies, we are going to use a node ordering that helps the model learn better. The node ordering we propose is called the breadth-first search (BFS) node ordering. Basically, we start at some random node in the graph and label it node number 1. We label its neighbors as, let's say, 2 and 3; their neighbors become 4 and 5, and so on. This leads to a much more natural recipe for generating the graph: add node 1, add node 2, connect 2 to 1, add node 3, connect it with 1, add node 4, connect it with 2 and 3. The steps are much more nicely interwoven. And if you asked yourself how you would draw this graph, you would probably draw it this way, not by first putting all the nodes down and only then connecting them.
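To make the BFS ordering concrete, here is a minimal sketch (not the GraphRNN authors' code) of computing a BFS node ordering and relabeling a graph accordingly with networkx; the function name bfs_node_ordering and the grid example are just illustrative choices.

```python
import networkx as nx

def bfs_node_ordering(G, start=None):
    """Return G's nodes in breadth-first order, starting from `start`
    (an arbitrary node if none is given)."""
    if start is None:
        start = next(iter(G.nodes))
    # nx.bfs_edges yields tree edges (u, v) in breadth-first order,
    # so collecting the targets v visits every remaining node exactly once.
    return [start] + [v for _, v in nx.bfs_edges(G, start)]

# Relabel nodes so that IDs 1, 2, 3, ... follow the BFS ordering.
G = nx.grid_2d_graph(3, 3)
mapping = {node: i + 1 for i, node in enumerate(bfs_node_ordering(G))}
G_bfs = nx.relabel_nodes(G, mapping)
```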
So what does the BFS ordering buy us? It buys us the following: because node 4 does not connect to node 1, we know that no later node is going to connect to node 1 either, because it's a breadth-first search ordering. If, for example, node 5 were to connect to node 1, then its ID wouldn't be 5; it would be less than 5. So when a given node does not link to node 1, we know that all of node 1's neighbors have already been traversed. Therefore node 5, and all the following nodes, will never connect to node 1. And why is this important? Because it means that when node 5 comes, I don't need to worry about node 1 anymore. I can essentially forget node 1, so I need much less memory; I only need memory of two steps rather than having to remember what I did all the way back at the beginning.
So the key insight of the BFS ordering is that node 5 will never connect to node 1. This means we only need a memory of two steps, rather than a memory of n minus 1 steps, where n is the number of nodes in the network. It also reduces the number of possible orderings: rather than considering all n factorial node orderings, we only have to consider the number of distinct BFS orderings. And it reduces the number of steps for edge generation, because of this insight that we know node 5 won't link to node 1.
And this is important because, so far, I said that the edge-level RNN generates a column of the adjacency matrix. But if you take this BFS-based ordering, the edge-level RNN does not really need to generate the entire column; it only needs to generate a small part of the column, because we know that the rest is 0. So rather than generating connectivity to all previous nodes, and having to remember all of that, with the proper node ordering we are guaranteed that all we need to do is generate just a small band of the adjacency matrix. And again, this doesn't prevent us from generating any kind of graph; the approach is still fully general. It is just exploiting the fact that we can renumber the nodes; we can order the nodes in whatever order we want. And because real graphs are sparse, a favorable ordering gives us a lot of benefit, because it's much easier to learn to generate just this blue band than to learn to generate the entire upper triangle of the adjacency matrix.
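As a rough sketch of what "generating just a small band" means in practice, the snippet below slices out, for each node in the BFS order, only its connections to the M most recent previous nodes; the band width M and the helper name adjacency_to_band_sequences are my own illustrative choices, not the exact GraphRNN implementation. Training then only has to predict about n*M entries instead of the full upper triangle.

```python
import numpy as np

def adjacency_to_band_sequences(A, M):
    """For node i (in BFS order), keep only its connections to the M most
    recent previous nodes, i.e. A[i, i-M:i], right-aligned and zero-padded.
    Under a BFS ordering this band is all the edge-level RNN must generate."""
    n = A.shape[0]
    seqs = np.zeros((n, M), dtype=A.dtype)
    for i in range(1, n):
        lo = max(0, i - M)
        row = A[i, lo:i]               # connections to at most M previous nodes
        seqs[i, M - len(row):] = row   # right-align so position encodes recency
    return seqs
```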
So this was the discussion of how we scale up our model. The insight is that we get to decide on the node ordering, and if we are smart about it, it can really help us; it can help the model learn.
The second thing I want to talk about is, how do we evaluate graph generation? There are two ways you can do it. One is to look at the graphs visually and see whether they are similar. That's good for intuition, but we also want to define a statistical notion of graph similarity. Of course, you could try to take two graphs and somehow align one on top of the other, but that is very expensive, and for big graphs you cannot do it. So we have to define some kind of statistical measure of similarity between two graphs. First, let me show you some visual examples of what GraphRNN is able to do.
What I'm showing here are three input training graphs, the output graphs from GraphRNN, and some outputs from three traditional generative models: the Kronecker graph model, the mixed-membership stochastic block model, and preferential attachment, that is, the Barabasi-Albert model. What you notice is that if you give GraphRNN grids, it is going to generate grids. You see little mistakes because the model is stochastic, so it may make some small mistakes, and that's OK. But you see that the other models completely fail; they are not able to generate the grid. This is not surprising, because none of these models was developed to generate grids; they were developed to generate networks with different types of properties. So that's OK: GraphRNN can generate the grid, the others cannot.
What is interesting, though, is that even if you give GraphRNN examples of graphs with two clusters, with this kind of community structure, GraphRNN is able to learn that these graphs have that structure and to generate graphs with such a structure. Kronecker graphs or Barabasi-Albert, for example, were not designed to generate graphs with community structure, so they cannot do it; but the mixed-membership stochastic block model was developed for community structure, so it does a good job. And what I want to say here is that GraphRNN is super general. It is basically able to take these input graphs, learn about their structure, and then generate new graphs with similar structure. You don't have to tell it, "Hey, I have communities," or, "Hey, this is a grid." You just give it the graph, and it will figure out by itself that the given graph is a grid and it needs to generate a grid, or that it's a community-structured graph, or anything else. So it's quite remarkable that the same model can cover such a diversity of graphs, and you don't even have to tell it what the input graphs are.
So now, how about a more rigorous comparison of statistical similarity? How do we do that? As I said, we cannot do a direct comparison between two graphs by trying to do graph alignment, because graph isomorphism testing, as we have seen, is a hard problem. So the solution is to compare graph statistics, and the typical graph statistics we have already discussed would be the degree distribution, the clustering coefficient, or the orbit count statistics from the graphlets that I discussed, I think, in lecture number two. The point is that, given a graph, we are going to describe it with a set of statistics, and then we are going to say that two graphs are more similar if their corresponding statistics are more similar. In our case, we are going to think of each of these statistics as a probability distribution, and I'm going to explain why this is important.
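As a small sketch of what "describing a graph with statistics treated as probability distributions" could look like, the helpers below build normalized histograms of node degrees and clustering coefficients with networkx; the function names and bin counts are my own illustrative choices.

```python
import numpy as np
import networkx as nx

def degree_histogram(G, n_bins=20):
    """Normalized histogram of node degrees, used as a probability distribution."""
    degrees = np.array([d for _, d in G.degree()])
    hist, _ = np.histogram(degrees, bins=n_bins, range=(0, degrees.max() + 1))
    return hist / hist.sum()

def clustering_histogram(G, n_bins=20):
    """Normalized histogram of per-node clustering coefficients."""
    coeffs = np.array(list(nx.clustering(G).values()))
    hist, _ = np.histogram(coeffs, bins=n_bins, range=(0.0, 1.0))
    return hist / hist.sum()
```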
OK. So first: given two statistics, two graph properties, maybe two degree sequences, two degree distributions, or two orbit count distributions, we want to compare them on the set of training graphs as well as on the set of synthetic graphs. We want to see how similar the set of training graphs is to the set of synthetically generated graphs, and we'd like to measure the level of similarity between the two. We are going to do this with a 2-step approach.
In the first step, we are going to take each of these graphs and describe it with a set of statistics: here is the degree distribution, here is the clustering coefficient distribution. We do this for all the input training graphs as well as for all the generated graphs. Then we take, let's say, the degree distribution of a synthetic graph and the degree distribution of a real graph, and we want to see how much these distributions differ. To quantify that, we are going to use something called the earth mover's distance. Once we have compared the individual statistics, we need to aggregate: once we have a measure of degree distribution similarity and a measure of clustering coefficient similarity, we take these similarities and aggregate them further to get the overall similarity. For this second-level aggregation, we are going to use what is called the maximum mean discrepancy (MMD), which will be based on the earth mover's distance. So let me first define the earth mover's distance and then the MMD.
The earth mover's distance tries to measure the similarity between two distributions, or between two histograms. The intuition is: what is the minimum amount of effort, the minimum amount of earth, the minimum amount of probability mass, that has to be moved from one pile to the other so that one pile gets transformed into the other? So if I have two distributions, how much of this yellow dirt, this yellow earth, do I have to move between these different pillars so that I transform one distribution into the other? If the distributions are very different, the earth mover's distance will be high; if they are similar, the earth mover's distance will be low. The earth mover's distance can be solved as an optimal flow and is found by solving a linear programming optimization problem. We are basically saying that the amount of work is the total cost of moving earth, the sum over bins i and j of the flow f_ij times the ground distance d_ij, and we look for the flow that minimizes this cost while transforming the probability distribution x into the distribution y. So that's the intuition behind the earth mover's distance metric.
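For 1-D histograms, the earth mover's distance coincides with the Wasserstein-1 distance, so a minimal way to compute it is scipy.stats.wasserstein_distance; the toy histograms below are made-up values for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two normalized degree histograms over the same bins (toy values)
p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.3, 0.3, 0.2, 0.2])

bins = np.arange(len(p))  # bin positions supply the ground distances d_ij
emd = wasserstein_distance(bins, bins, u_weights=p, v_weights=q)
print(emd)  # small value: little mass needs to be moved to match the piles
```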
And then, for the second part, we are going to use the maximum mean discrepancy, or MMD. The idea of maximum mean discrepancy is to represent distances between distributions as distances between mean embeddings of their features. Here is the formula for it: if I write it in terms of some kernel k, the MMD between two distributions p and q is

MMD^2(p, q) = E_{x,x'~p}[k(x, x')] + E_{y,y'~q}[k(y, y')] - 2 E_{x~p, y~q}[k(x, y)],

where the expectations are taken over elements x and y drawn from the distributions p and q. And of course, we need to have this kernel k; in our case, the kernel will be based on the L2 distance.
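Here is a minimal sketch of estimating MMD^2 from samples, plugging in a Gaussian-style kernel built on the earth mover's distance between histograms; the kernel form, the sigma parameter, and the function names are my assumptions for illustration, not necessarily the exact kernel used in the GraphRNN evaluation.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def emd_kernel(x, y, sigma=1.0):
    """Kernel on two normalized histograms, built on their earth mover's distance."""
    bins = np.arange(len(x))
    emd = wasserstein_distance(bins, bins, u_weights=x, v_weights=y)
    return np.exp(-emd**2 / (2 * sigma**2))

def mmd_squared(P, Q, kernel=emd_kernel):
    """Sample estimate of MMD^2(p, q) = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)],
    where P and Q are lists of histograms drawn from p and q."""
    k_pp = np.mean([kernel(x, xp) for x in P for xp in P])
    k_qq = np.mean([kernel(y, yp) for y in Q for yp in Q])
    k_pq = np.mean([kernel(x, y) for x in P for y in Q])
    return k_pp + k_qq - 2 * k_pq
```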
So now, let me just summarize how we put all of this together. We are given two sets of graphs, and we want to see how similar they are. The way we do this is: for every graph, we compute its statistics. We then use the earth mover's distance to measure the discrepancy between two statistics, between the two distributions. And then we apply the maximum mean discrepancy to measure the similarity between these sets of statistics, where the similarity between the sets' elements, meaning the individual distributions of the statistics, is computed with the earth mover's distance. And this means, for example, that we can rigorously evaluate the correspondence between a particular set of statistics on the training graphs and on the generated test graphs.
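Putting the pieces together, a minimal end-to-end sketch (reusing the illustrative degree_histogram and mmd_squared helpers defined above; the toy graph sets are made up) would look like this:

```python
import networkx as nx

# Toy stand-ins for the training set and the generated set
train_graphs = [nx.grid_2d_graph(4, 4) for _ in range(8)]
generated_graphs = [nx.barabasi_albert_graph(16, 2) for _ in range(8)]

# Step 1: describe each graph by a statistic treated as a distribution
P = [degree_histogram(G) for G in train_graphs]
Q = [degree_histogram(G) for G in generated_graphs]

# Step 2: aggregate with MMD, whose kernel is built on the earth mover's distance
score = mmd_squared(P, Q)  # smaller score = more similar degree statistics
```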