Levels of variation and intraclass correlation

Channel: unknown

[0]
One of the first things that we analyze when we start working
[3]
with multi-level data is on which level each individual observation varies.
[8]
We also calculate the intraclass correlation to quantify these variances.
[12]
Let's take a look at an example to understand
[16]
what levels of variation mean for an individual variable.
[19]
We have here profitability data for a company. So we have five observations for a single company
[25]
and the average profitability for this company is about fourteen percent, and the individual
[30]
observations vary randomly around the average profitability because companies sometimes have
[36]
good years and sometimes they have bad years, so the performance is not always the same. So there's
[40]
always some year-to-year variation. But that doesn't fully explain why a large data
[49]
set of profitability figures would vary, because there can also be other levels of variation.
[54]
For example, there can be company level variation. These red mounds each represent one company and
[61]
they are all working within an industry, and this blue area here represents the variation
[68]
of the performance of all companies within that industry. So we can see that different companies
[74]
vary. Their performance varies within company, but there are also variations between companies. So
[79]
this company here is consistently less profitable than this company here.
[84]
So we have two levels. We have the within company level and we
[89]
have the between company level, which is also the within industry level.
[92]
We can also add more levels. There is no limit on
[94]
how many levels we can have, but let's go for an industry level.
[98]
So we have these five different industries in blue, and the industries are different in their
[106]
profitability. Some industries are highly profitable, others are not, and we can see that
[112]
the individual variation of the data here is a function of these three sources of variation: the
[119]
between industry level, the between company level, and the year-to-year variation within companies.
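
A minimal sketch of how these three sources of variation could add up in simulated data. The grand mean of fourteen percent and the five industries come from the example; the normal distributions, the three standard deviations, the 5 x 5 x 5 layout and all names are assumptions made only for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

N_INDUSTRIES, N_COMPANIES, N_YEARS = 5, 5, 5
GRAND_MEAN = 0.14      # average profitability from the example
SD_INDUSTRY = 0.05     # assumed spread between industries
SD_COMPANY = 0.03      # assumed spread between companies within an industry
SD_YEAR = 0.02         # assumed year-to-year spread within a company

rows = []  # one tuple per observation: (industry, company, year, profitability)
for industry in range(N_INDUSTRIES):
    industry_effect = random.gauss(0, SD_INDUSTRY)
    for company in range(N_COMPANIES):
        company_effect = random.gauss(0, SD_COMPANY)
        for year in range(N_YEARS):
            year_noise = random.gauss(0, SD_YEAR)
            # Each observation is the sum of the three sources of variation.
            y = GRAND_MEAN + industry_effect + company_effect + year_noise
            rows.append((industry, company, year, y))

print(len(rows), "observations")
print("overall mean:", round(statistics.mean(r[3] for r in rows), 3))
```

Changing the three standard deviations changes how much of the total variation sits on each level.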
[125]
To understand our data and understand the phenomenon that the data represent, we typically
[132]
need to decompose the variance, to somehow come up with percentages or some other statistics
[138]
that quantify how much of the variation is here and how much of the variation is here in our data.
[144]
If our data set is small, we typically start with a graphical analysis. So we can just plot the
[151]
data. This is 25 observations: 5 observations for each of 5 companies within one industry.
[159]
We can see that there are some patterns. For example, for this company there is not much
[164]
variation in performance. This company is less profitable than that company, and so on.
[169]
This kind of analysis works well when you have a small set of observations. If we have a large
[176]
number of observations but still a fairly manageable number of clusters, let's say up
[183]
to 30 companies or 30 industries or whatever is our level two unit, we can use box plots.
[189]
Box plots are graphical presentations of individual variables and we can do
[195]
box plots by groups. The idea of a box plot is that we first calculate the median
[202]
for a variable, and the median gets this thick line here. That marks the median. Then
[209]
we calculate the first quartile and the third quartile of the data. So the first quartile
[215]
means that below this line lie 25 percent of our observations and above the line 75
[221]
percent. The median is half-and-half, and for the third quartile it is 75% below and 25% above.
[228]
We draw a box between the first quartile and the third quartile, and half of our data is
[235]
within this box. Then we have these whiskers that indicate the minimum and maximum, and
[241]
sometimes we also have outliers, which the box plot algorithm marks as circles.
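
The numbers behind one box can be sketched as follows. The talk does not say how outliers are identified; this sketch assumes the common convention of flagging points more than 1.5 times the box height beyond the quartiles, and the function name and the data are hypothetical.

```python
import statistics

def box_plot_stats(values):
    """Return the quantities a box plot draws for one group of values."""
    # Quartiles: 25%, 50% (median) and 75% cut points of the data.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # height of the box
    # Assumed outlier rule: beyond 1.5 box heights from the box edges.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # smallest value that is not an outlier
        "whisker_high": max(inside),   # largest value that is not an outlier
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }

# Hypothetical profitability figures in percent, with one extreme year.
profitability = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
stats = box_plot_stats(profitability)
print(stats)
```

With this rule the whiskers stop at the most extreme values that are not flagged as outliers, so the extreme year is drawn as a separate point.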
[247]
So why is this box plot presentation useful and how can we analyze the box plots?
[252]
First of all, we can start to understand the between and within variance by looking at the
[259]
box plots. We can compare these medians, or we can draw box plots with means, and we can check how
[267]
much variation there is between these means or medians, and that is our between variation.
[273]
We can also take a look at how tall the boxes are, and that shows the within
[280]
variance, and comparing these two dimensions tells us whether the variation in this variable
[289]
is more due to the differences between firms or is just random variation or some other
[295]
variation within firms. So is it within firm or between firm variation that explains the data?
[300]
We can also quantify the variation at the two levels numerically by
[307]
calculating the within variance and between variance. This is our data and we start by
[313]
calculating group means. So we take each of these companies and we calculate a mean
[319]
for it. So these are the group means or cluster means for these five firms. And we
[327]
check how much these means vary. The variation is quantified here with this statistic, and then
[335]
we calculate how much these individual observations vary from the group mean.
[341]
In practice we do group mean centering. So we take each of these observations, we subtract
[347]
the group mean, and that gives us the group mean centered values. Then we calculate how much the
[354]
group mean centered data varies. This is our between variation, this is our within variation,
[361]
and this is our total variation, which is the sum of the between variation and the within variation.
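
The calculation can be sketched like this. The profitability figures are made up, and the sketch assumes population variances (dividing by the number of observations), which makes the between and within parts add up exactly to the total.

```python
import statistics

# Hypothetical profitability figures (percent): 5 companies x 5 years.
data = {
    "A": [12, 14, 13, 15, 16],
    "B": [8, 9, 7, 10, 6],
    "C": [20, 19, 21, 22, 18],
    "D": [14, 14, 15, 13, 14],
    "E": [5, 9, 3, 11, 7],
}

# Step 1: group means, one per company.
group_means = {firm: statistics.mean(values) for firm, values in data.items()}

# Step 2: group mean centering, each observation minus its company's mean.
centered = [y - group_means[firm] for firm, values in data.items() for y in values]

# Between variation: how much the group means vary (one copy per observation).
between = statistics.pvariance(
    [group_means[firm] for firm, values in data.items() for _ in values]
)
# Within variation: how much the group mean centered data varies.
within = statistics.pvariance(centered)
# Total variation of the raw data.
total = statistics.pvariance([y for values in data.values() for y in values])

print(f"between={between:.2f} within={within:.2f} total={total:.2f}")
# prints: between=22.24 within=2.88 total=25.12
```

Sample variances (dividing by n - 1) would give slightly different numbers, and the two parts would no longer sum exactly to the total.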
[368]
So the variation, or variance, is a statistic that depends on the
[373]
scale. It would be useful to have a scale-free way to explain on which
[379]
level the data varies, and this is where the intraclass correlation comes into play.
[385]
So intraclass correlation is simply calculated as the variance between groups divided by the total
[392]
variance. So it answers the question of how much of the variation in the data is attributed to the
[397]
groups and how much is attributed to the variation within the groups. This
[403]
is called ICC 1 for the reason that there are many other kinds of intraclass correlations.
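
Written as a formula, with notation assumed here rather than taken from the slides (between for the variance between groups, within for the variance within groups):

```latex
\[
\mathrm{ICC}(1)
  = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{total}}}
  = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}}
\]
```

Because the same scale appears in the numerator and the denominator, the ratio does not change if the variable is rescaled.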
[409]
So intraclass correlation generally refers to correlation between observations, and
[415]
because there are many kinds, this one is called ICC 1. There are a few others, but
[420]
this is the most important one that you need to understand when you work with multi-level data.
[424]
Other intraclass correlations are mostly about the reliability of multiple
[430]
raters, but ICC 1 is this simple equation that simply quantifies variation.
[435]
When ICC 1 is 0, that indicates that there is no variance between
[444]
groups. So the box plots are all on the same level here. There are no differences
[451]
between means, and in this case the medians are close as well, and all variation
[456]
is simply because there is variation within these groups.
[459]
Then when the intraclass correlation is 1, that means there is no variance within
[466]
clusters at all. All observations equal the level two means. So this firm's profitability
[473]
is always here. This firm's is always here, and so on. So there's no within unit variation.
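
These two extremes can be checked with a small sketch. The function below implements the simple ratio described in the talk (variance between groups over total variance, using population variances); its name and the data are hypothetical, and statistical packages typically estimate ICC 1 from an ANOVA or a mixed model, which can give somewhat different values.

```python
import statistics

def icc1(groups):
    """Variance between groups divided by total variance (descriptive ratio)."""
    # Repeat each group mean once per observation so group size is respected.
    means_per_obs = [statistics.mean(g) for g in groups for _ in g]
    between = statistics.pvariance(means_per_obs)
    total = statistics.pvariance([y for g in groups for y in g])
    return between / total

# All firms have the same mean: variation is only within firms.
no_between = [[10, 14, 18], [14, 18, 10], [18, 10, 14]]
# Each firm is constant over time: variation is only between firms.
no_within = [[10, 10, 10], [14, 14, 14], [18, 18, 18]]

print(icc1(no_between))  # 0.0
print(icc1(no_within))   # 1.0
```

The function is undefined when the variable does not vary at all, because the total variance is then zero.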
[479]
Why do people calculate the intraclass correlation and how is it typically reported? The first
[488]
role of the intraclass correlation is to help decide whether something needs to be done about
[496]
the clustering. If all the observations within a cluster have the same value, then you can just pick
[501]
one observation for each cluster and use those in regression analysis, and it doesn't really
[506]
matter that you have the remaining observations because they don't provide you any more information.
[510]
If ICC 1 is 0 then there is no clustering in that variable, and if all your ICC 1 values are very
[518]
low then there is no meaningful clustering in your data and it's possibly safe to go without
[525]
multi-level modeling. There are exceptions to that rule, but generally when ICC 1 is close to
[532]
0 or when ICC 1 is close to 1, multi-level modeling may not be needed, but if it's somewhere
[539]
in between, like 50%, then you typically need to take levels into account in your analysis somehow.
[545]
Let's take a look at an example of how ICC 1 has been reported in published research. So
[551]
this comes from Hausknecht's paper and this is a good example because they first explain what
[560]
the statistic is. Quite often people just report a statistic or report a number without
[564]
explaining what ICC 1 is. And this study provides a concise description: ICC 1 values
[573]
can be interpreted as the total amount of variance in the dependent variable that
[577]
is attributable to between unit rather than within unit differences over time.
[581]
So that explains what the statistic's interpretation is, and also that if the
[586]
values are high then regression analysis could be inappropriate and
[593]
then you would have to do something else, for example use cluster-robust standard
[598]
errors or multi-level modeling. And then they go on and they give the
[603]
actual statistic, absenteeism .76, and then they explain what the statistic means.
[609]
So giving this short introduction to your statistics is very useful for your readers
[616]
because your readers may not be experts in using multi-level data, so make it easier for them.