Home » Posts tagged 'data analysis'

# Tag Archives: data analysis

## Levels of Data in Research

I admit it. Part of my business’ marketing strategy is having at least some menial social media presence. That is what all the marketing advice columns say to do and who am I to buck this trend? I have focused my efforts on LinkedIn because the Groups function provides me with an opportunity to answer, hopefully intelligently, research-related questions posed by other investigators, and a recent question I answered got me thinking.

A LinkedIn user who listed her job title as a research associate asked about how to analyze data that contained different levels. I answered her question to the best of my abilities, but with so little information to work on (I still don’t know what “levels” meant to her), I’m confident I only partially answered her inquiry.

*But the idea of levels in data is interesting.*

There are many ways to interpret data levels. The most traditional interpretation likely occurs when we discuss the response options to a single question. If a question asks about your age, each possible response is a different level. Strongly disagree is a different level than strongly agree for Likert scale questions. Perhaps the mysterious researcher was asking how to test for differences between different levels of an intervention. Another possible interpretation is in how data is collected. Most researchers conceptualize a hierarchy of research designs, with randomized controlled trials at the top and a myriad of observational designs closer to the bottom. I don’t think this is what the researcher was referring to because I can’t think of many circumstances where you would have the capability of even testing, for example, the findings of an RCT versus a cohort study.

Instead, I think our researcher was talking about how the things that produce data, typically humans but not necessarily so, tend to cluster into groups, and this clustering creates a hierarchy within the data that should be accounted for. Each *level* of the hierarchy is a different *level* of data available to the researcher.

The simple answer is that data that contains some type of hierarchical structure should be evaluated using hierarchical linear modeling/multi-level modeling (HLM), structural equation modeling (SEM), or generalized linear mixed models (GLMM). But providing such a simple answer doesn’t provide any information about why you should use such complex statistical methods.

(To provide a reference point, it is relatively easy to hand calculate a t-test or a chi-square. Odds ratios are a breeze and even ANOVA’s aren’t beyond our reach. Those only take a few minutes. Because of the iterative processes that are used, it would probably take years to solve an HLM, SEM, and GLMM model by hand.)

*Data Likes to Cluster*

ANOVA’s, regressions, t-tests, and chi-square make the same large assumption: all observations are independent and uncorrelated, at least the errors are uncorrelated. But in the research world, we very often experience correlated observations. The simplest example of correlated observations is when a study incorporates a longitudinal design. Since the same people (or other units of analysis) are being measured multiple times, we would expect that different measurements from the same person will be correlated. In fact, they should be correlated because *the same *person is answering *the same *question. Even when responses change over time, we would not expect such differences to occur at random (e.g. age). If these within-subjects measurements are not correlated, we should question how the data was collected, labeled, and cleaned.

But there are other situations where data can cluster even if we don’t expect it to. When we try to determine the difference of a between-subjects effect, we assume that the participants don’t know each other, but that isn’t necessarily true. When I was conducting research on tobacco control policies, convenience samples were routinely recruited. Several study subjects knew of each other; some were friends and completed the study as a group; and in one scenario, a subject was actually a participant in another subject’s research study!

Connections like these are relatively random (a participant in another subject’s study? Really?), and our basic statistical tools are typically robust enough to withstand such correlations. However, there are numerous other situations that require relationships between participants to be considered when determine statistical effects because these correlations can have dramatic effects on our results. Twin studies should take into account shared genomes and family environments. School-based programs must consider how students are clustered into classrooms or even schools. Clinical trials need to consider how patients may cluster within hospitals. Evaluation studies may need to assess how program participants cluster within neighborhoods.

*Why do we need to account for clustering?*

When performing a statistical test, we are trying to see if the distribution of scores in group A is different than the distribution of scores in Group B. This distribution of scores is called the variance. When study participants are clustered or related, there is a greater likelihood that these individuals will provide similar responses to the questions being asked or measurements being taken. The greater sameness between members of the same group reduces the variance of group and creates a statistical bias towards better being able to find a between-subject difference. In effect, you increase the probability that you’ll find a significant difference when one doesn’t truly exist, known as Type I error.

HLM, SEM, and GLMM can account for this bias and make statistical adjustments to ensure that this sameness among the participants does not influence the final conclusions of the study. Yes, for many reasons detecting significant effects becomes more difficult, but when significant effects do occur, there is greater confidence that any differences truly exist.

*What tests should I use?*

Before you try to use HLM, SEM, or GLMM, you might need to take a class or 2, read a couple of books (stay away from journal articles unless you really like statistical theory), and/or watch a whole bunch of YouTube videos.

With that note of caution out of the way, here are my recommendations. If you are working with within-subjects comparisons, HLM and SEM perform best. If your time component is structured (e.g. all measurements were taken exactly 12 months apart), HLM and SEM work equally well. If you time component is unstructured (e.g. some measurements were taken at 6 months while others at 9 months), HLM performs better. If your data contains multiple measurements for each participant without respect to time (e.g. 3 cholesterol tests run on the same blood sample), GLMM is appropriate. If you are concerned about between-subjects clusters, HLM and SEM both perform well.

* The take home message:* We often work with data that is structured in levels or hierarchies, and measurements within such levels are often correlated. When such hierarchies are non-random and measurement correlation is expected to be high, sophisticated statistical models are required to account and adjust for the clustering effects. If no adjustments are made, the analysis is prone to finding significant differences that don’t really exist.

## Understanding Factor Analysis (wait, what are latent variables again?)

In social science, collecting data is an interesting process. Whether we observe or ask questions, it takes time, thought, and precious energy to select the right process or questions to answer our research questions. Even if we do select the perfect question, we can still never exactly measure the true nature of a phenomenon. This inability to measure the real world is known as *measurement error*, specifically *random measurement error*. (It’s more insidious cousin is *systematic measurement error*, which occurs when we, the researcher, make the wrong decisions and introduce bias into a study.) Because of this error, I am highly jealous of the “hard” sciences (e.g. biology, chemistry, physics). Yes, not every reaction works as predicted and sensors need to be calibrated correctly, but their research doesn’t need to deal with people!

And let’s be realistic. People are not good research subjects. They forget things. They give different answers to the same question, and they give the same answer to different questions. We get around this inherent difficulty of working with people, or at least try to, through a pretty simple mechanism: we ask multiple questions about the same topic.

Let’s use depression as an example. Depression is a multi-faceted disease. Each person can have a unique manifestation of depression, and each person can recover from depression in a unique way. Depression will resolve spontaneously in some but require lifelong treatment in others. Even the diagnosis of depression is rather complex.

How do we accurately determine if someone is suffering from depression? We can use a series of reliable and validated questions, such as the Beck’s Depression Inventory (BDI). The BDI consists of 21 multiple choice questions that can be answered by interview or self-report, and each response option to each question is coded with a number. In its simplest implementation, all someone needs to do to reasonably diagnose someone with depression is add up all the numbers and see what category the person falls in.

This aggregating of responses across questions creates an *index variable* because within the single number, say a BDI of 14, multiple facets of depression are represented. At the risk of repeating myself, depression is a complex disease, and we ask multiple questions about depression because we don’t want to miss any aspect of the disease that may be important to research or treatment. While this single number is useful, it is not necessary informative because we can’t fully understand how depression is being externalized in any given individual. Instead of an index variable, which contains information on multiple facets, it is often more fruitful to work with *scale variables*, which are created by aggregating the responses of multiple questions that measure the same thing.

*Latent Variables*

This leads us into the idea of *latent variables*. Latent variables are a little strange. They exist. We give them names. They are real, but we can never directly measure them. In actuality, latent variables are THE thing we want to measure in the real world but can’t because of measurement error (and people. It’s always people too). Because we can’t measure these real-world things, which we know really exist, directly, we use multiple questions and then combine these questions into a scale variable. Essentially, a scale variable is a numerical representation of the real-world thing that exists but we can’t directly measure, and each scale variable represents one latent variable.

Now back to depression. A total BDI score is not a scale variable because depression has multiple facets and can’t be represented by a single value. Each facet of depression is a separate latent variable that makes of the disease we see as depression. It turns out depression, according to the BDI, consists of 2 facets, or latent variables: an affective facet, which is the psychological side of the disease, and a somatic facet, which is the physical side of the disease. Affective and Somatic are 2 latent variables within depression. We can’t directly measure them, but we can construct scale variables that come pretty close.

*Factor Analysis*

Alright, if you bought into the idea of facets of disease and latent variables so far, a logical question to ask is: how do we know what questions to combine to create these scale variables?

This is where factor analysis comes in and a bit of methodology which isn’t necessarily the most scientific. In its simplest form, *factor analysis* is the act of identifying and confirming what questions measure different parts of the same underlying latent variable. Ideally, we would know how to combine the questions as the questions were being written. Unfortunately, this is often impossible because we can’t predict how the questions will perform in the real world. A layperson’s interpretation of a question may be remarkably different than what the researcher intended. This information is still useful but in a slightly different way than envisioned.

Instead of assuming how questions should be combined to form scale variables to represent latent variables, we conduct an *exploratory factor analysis*, which is just how it sounds. We explore the data. We let the data tell us how to combine the questions. We, for lack of a better term, go on a small fishing expedition. In an exploratory factor analysis, we look for sets of questions whose responses are highly correlated with each other. (Thankfully, some very sophisticated algorithms exist to do this for us so we aren’t staring at correlation tables for hours on end).

Suppose we run an exploratory factor analysis on a 10 item questionnaire. The results of the analysis show that there are likely 3 latent variables being measured by this questionnaire. Questions 1, 2, and 7 are highly correlated (let’s call it Physical Health). Questions 3, 4, 6, and 10 are highly correlated (Mental Health), and questions 5, 8, and 9 are highly correlated (Spiritual Health). So it appears that our 10 item questionnaire measures 3 different facets, or latent variables, of health: physical, mental, and spiritual.

*How do we know an exploratory factor analysis is correct?*

Replicate. Replicate. Replicate.

After we run an exploratory factor analysis, we need to confirm our findings. The best method to do that is to recruit new samples of people from the same population as the original study and recruit samples of people from different populations compared to the original study. Once these new samples are recruited and the questions have been answered, we can test whether questions 1, 2 and 7, questions 3, 4, 6 and 10, and questions 5, 8 and 9 remain highly correlated. When we attempt to confirm the findings of an exploratory factor analysis, the procedure is called a *confirmatory factor analysis* because we want to confirm the findings (get it?).

If the findings of a confirmatory factor analysis replicate those of an exploratory factor analysis, you have just discovered a method to reliably measure a real, but unmeasurable, latent variable. If your findings differ between samples of the same population, perhaps the questionnaire has a more complicated structure than originally thought. If your findings differ between samples of different populations, then you need to explore why the findings differ between populations, a research path that can be very intriguing.

* The take home message:* We often use multiple questions to measure some real-world construct because it is impossible to do so with a single question. These unmeasurable constructs are called latent variables. We identify how to combine these questions, and create latent variables, using exploratory factor analysis and confirm the findings using exploratory factor analysis.