I admit it. Part of my business’s marketing strategy is having at least a minimal social media presence. That is what all the marketing advice columns say to do, and who am I to buck the trend? I have focused my efforts on LinkedIn because the Groups function gives me an opportunity to answer, hopefully intelligently, research-related questions posed by other investigators, and a recent question I answered got me thinking.
A LinkedIn user who listed her job title as research associate asked how to analyze data that contained different levels. I answered her question to the best of my abilities, but with so little information to work with (I still don’t know what “levels” meant to her), I’m confident I only partially answered her inquiry.
But the idea of levels in data is interesting.
There are many ways to interpret data levels. The most traditional interpretation likely occurs when we discuss the response options to a single question. If a question asks about your age, each possible response is a different level. Strongly disagree is a different level from strongly agree for Likert scale questions. Perhaps the mysterious researcher was asking how to test for differences between different levels of an intervention. Another possible interpretation is in how data is collected. Most researchers conceptualize a hierarchy of research designs, with randomized controlled trials at the top and a myriad of observational designs closer to the bottom. I don’t think this is what the researcher was referring to because I can’t think of many circumstances where you could even test, for example, the findings of an RCT against those of a cohort study.
Instead, I think our researcher was talking about how the things that produce data, typically humans but not necessarily so, tend to cluster into groups, and this clustering creates a hierarchy within the data that should be accounted for. Each level of the hierarchy is a different level of data available to the researcher.
The simple answer is that data that contains some type of hierarchical structure should be evaluated using hierarchical linear modeling/multi-level modeling (HLM), structural equation modeling (SEM), or generalized linear mixed models (GLMM). But providing such a simple answer doesn’t provide any information about why you should use such complex statistical methods.
(To provide a reference point, it is relatively easy to hand calculate a t-test or a chi-square. Odds ratios are a breeze, and even ANOVAs aren’t beyond our reach. Those only take a few minutes. Because of the iterative processes that are used, it would probably take years to solve an HLM, SEM, or GLMM model by hand.)
Data Likes to Cluster
ANOVAs, regressions, t-tests, and chi-squares make the same large assumption: that all observations are independent, or at least that the errors are uncorrelated. But in the research world, we very often encounter correlated observations. The simplest example of correlated observations is when a study incorporates a longitudinal design. Since the same people (or other units of analysis) are being measured multiple times, we would expect that different measurements from the same person will be correlated. In fact, they should be correlated because the same person is answering the same question. Even when responses change over time, we would not expect such differences to occur at random (age, for example, changes in an entirely predictable way). If these within-subjects measurements are not correlated, we should question how the data was collected, labeled, and cleaned.
But there are other situations where data can cluster even if we don’t expect it to. When we try to estimate a between-subjects effect, we assume that the participants don’t know each other, but that isn’t necessarily true. When I was conducting research on tobacco control policies, convenience samples were routinely recruited. Several study subjects knew of each other; some were friends and completed the study as a group; and in one scenario, a subject was actually a participant in another subject’s research study!
Connections like these are relatively random (a participant in another subject’s study? Really?), and our basic statistical tools are typically robust enough to withstand such correlations. However, there are numerous other situations that require relationships between participants to be considered when determining statistical effects because these correlations can have dramatic effects on our results. Twin studies should take into account shared genomes and family environments. School-based programs must consider how students are clustered into classrooms or even schools. Clinical trials need to consider how patients may cluster within hospitals. Evaluation studies may need to assess how program participants cluster within neighborhoods.
Why do we need to account for clustering?
When performing a statistical test, we are trying to see if the distribution of scores in Group A is different from the distribution of scores in Group B, and the spread of those scores is called the variance. When study participants are clustered or related, there is a greater likelihood that these individuals will provide similar responses to the questions being asked or measurements being taken. This sameness among members of the same group shrinks the within-group variance, which biases the test toward finding a between-subjects difference. In effect, you increase the probability that you’ll find a significant difference when one doesn’t truly exist, known as a Type I error.
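A quick way to put a number on this problem is the design effect, which says how much the variance of an estimate is inflated by clustering, given the cluster size and the intraclass correlation (ICC). Here is a minimal sketch in Python; the classroom size and ICC are made-up numbers for illustration:

```python
def design_effect(cluster_size, icc):
    """Kish design effect: how much clustering inflates the variance
    of an estimate relative to a simple random sample."""
    return 1 + (cluster_size - 1) * icc

def effective_n(n, cluster_size, icc):
    """Number of truly 'independent' observations after discounting
    for within-cluster correlation."""
    return n / design_effect(cluster_size, icc)

# 1,000 students sampled in classrooms of 25 with a modest ICC of 0.10:
print(design_effect(25, 0.10))      # 1 + 24 * 0.10 = 3.4
print(effective_n(1000, 25, 0.10))  # ~294 independent observations
```

Even a small ICC erodes the sample quickly, which is why analyzing clustered data as if it were independent makes p-values look better than they should.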
HLM, SEM, and GLMM can account for this bias and make statistical adjustments to ensure that this sameness among the participants does not influence the final conclusions of the study. Yes, for many reasons detecting significant effects becomes more difficult, but when significant effects do occur, there is greater confidence that any differences truly exist.
What tests should I use?
Before you try to use HLM, SEM, or GLMM, you might need to take a class or 2, read a couple of books (stay away from journal articles unless you really like statistical theory), and/or watch a whole bunch of YouTube videos.
With that note of caution out of the way, here are my recommendations. If you are working with within-subjects comparisons, HLM and SEM perform best. If your time component is structured (e.g. all measurements were taken exactly 12 months apart), HLM and SEM work equally well. If your time component is unstructured (e.g. some measurements were taken at 6 months while others at 9 months), HLM performs better. If your data contains multiple measurements for each participant without respect to time (e.g. 3 cholesterol tests run on the same blood sample), GLMM is appropriate. If you are concerned about between-subjects clusters, HLM and SEM both perform well.
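To make those recommendations slightly more concrete, here is a minimal random-intercept model in Python using statsmodels. The dataset is simulated, and every name and number below (schools, a treatment effect of 2.0, and so on) is invented for illustration; treat this as a sketch, not a recipe:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_schools, per_school = 20, 10

# Students clustered within schools; each school shifts everyone's score
school = np.repeat(np.arange(n_schools), per_school)
school_effect = rng.normal(0, 1.0, n_schools)[school]
treat = rng.integers(0, 2, n_schools * per_school)
y = 5.0 + 2.0 * treat + school_effect + rng.normal(0, 1.0, n_schools * per_school)
df = pd.DataFrame({"y": y, "treat": treat, "school": school})

# A random intercept per school absorbs the cluster-level sameness,
# so the treatment effect is tested against honest standard errors
res = smf.mixedlm("y ~ treat", df, groups=df["school"]).fit()
print(res.fe_params)  # the "treat" coefficient should land near 2.0
```

Ignoring the school variable and running an ordinary regression on the same data would produce overconfident standard errors for any school-level predictor, which is exactly the Type I error problem described above.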
The take home message: We often work with data that is structured in levels or hierarchies, and measurements within such levels are often correlated. When such hierarchies are non-random and measurement correlation is expected to be high, sophisticated statistical models are required to account and adjust for the clustering effects. If no adjustments are made, the analysis is prone to finding significant differences that don’t really exist.
In social science, collecting data is an interesting process. Whether we observe or ask questions, it takes time, thought, and precious energy to select the right process or questions to answer our research questions. Even if we do select the perfect question, we can still never exactly measure the true nature of a phenomenon. This inability to measure the real world is known as measurement error, specifically random measurement error. (Its more insidious cousin is systematic measurement error, which occurs when we, the researchers, make the wrong decisions and introduce bias into a study.) Because of this error, I am highly jealous of the “hard” sciences (e.g. biology, chemistry, physics). Yes, not every reaction works as predicted and sensors need to be calibrated correctly, but their research doesn’t need to deal with people!
And let’s be realistic. People are not good research subjects. They forget things. They give different answers to the same question, and they give the same answer to different questions. We get around this inherent difficulty of working with people, or at least try to, through a pretty simple mechanism: we ask multiple questions about the same topic.
Let’s use depression as an example. Depression is a multi-faceted disease. Each person can have a unique manifestation of depression, and each person can recover from depression in a unique way. Depression will resolve spontaneously in some but require lifelong treatment in others. Even the diagnosis of depression is rather complex.
How do we accurately determine if someone is suffering from depression? We can use a series of reliable and validated questions, such as the Beck Depression Inventory (BDI). The BDI consists of 21 multiple-choice questions that can be answered by interview or self-report, and each response option to each question is coded with a number. In its simplest implementation, all you need to do to reasonably screen someone for depression is add up the numbers and see which severity category the total falls in.
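As a sketch of that simplest implementation, here is the scoring logic in Python. The cutoffs below are the commonly cited BDI-II severity bands (0-13 minimal, 14-19 mild, 20-28 moderate, 29-63 severe); check the scoring manual for the version you actually use, and note that the item scores here are invented:

```python
# Commonly cited BDI-II severity bands; verify against the manual for your version
BDI_BANDS = [(13, "minimal"), (19, "mild"), (28, "moderate"), (63, "severe")]

def bdi_category(item_scores):
    """Sum the 21 item scores (each coded 0-3) and map the total to a band."""
    if len(item_scores) != 21:
        raise ValueError("the BDI has 21 items")
    total = sum(item_scores)
    for upper, label in BDI_BANDS:
        if total <= upper:
            return total, label
    raise ValueError("total outside the 0-63 range")

# A hypothetical respondent:
scores = [1, 0, 2, 1, 0, 1, 2, 0, 1, 1, 0, 2, 1, 0, 1, 0, 1, 0, 0, 0, 0]
print(bdi_category(scores))  # (14, 'mild')
```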
This aggregating of responses across questions creates an index variable because within that single number, say a BDI of 14, multiple facets of depression are represented. At the risk of repeating myself, depression is a complex disease, and we ask multiple questions about depression because we don’t want to miss any aspect of the disease that may be important to research or treatment. While this single number is useful, it is not necessarily informative because we can’t fully understand how depression is being externalized in any given individual. Instead of an index variable, which contains information on multiple facets, it is often more fruitful to work with scale variables, which are created by aggregating the responses of multiple questions that measure the same thing.
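Before summing questions into a scale variable, researchers typically check that the items actually hang together, and the usual statistic for that is Cronbach’s alpha (values around 0.7 or higher are conventionally taken as acceptable internal consistency). A minimal pure-Python sketch with invented responses:

```python
def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(items):
    """items: one list of responses per question, all the same length."""
    k = len(items)
    totals = [sum(resp) for resp in zip(*items)]  # each person's scale score
    item_var = sum(variance(q) for q in items)
    return (k / (k - 1)) * (1 - item_var / variance(totals))

# Three questions, five respondents, answers on a 1-5 scale:
q1 = [4, 5, 2, 3, 5]
q2 = [4, 4, 1, 3, 5]
q3 = [5, 5, 2, 2, 4]
alpha = cronbach_alpha([q1, q2, q3])
print(round(alpha, 2))  # 0.93 for this toy data
```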
This leads us into the idea of latent variables. Latent variables are a little strange. They exist. We give them names. They are real, but we can never directly measure them. In actuality, latent variables are THE thing we want to measure in the real world but can’t because of measurement error (and people. It’s always people too). Because we can’t measure these real-world things, which we know really exist, directly, we use multiple questions and then combine these questions into a scale variable. Essentially, a scale variable is a numerical representation of the real-world thing that exists but we can’t directly measure, and each scale variable represents one latent variable.
Now back to depression. A total BDI score is not a scale variable because depression has multiple facets and can’t be represented by a single value. Each facet of depression is a separate latent variable that makes up the disease we see as depression. It turns out depression, according to the BDI, consists of 2 facets, or latent variables: an affective facet, which is the psychological side of the disease, and a somatic facet, which is the physical side of the disease. Affective and Somatic are 2 latent variables within depression. We can’t directly measure them, but we can construct scale variables that come pretty close.
Alright, if you bought into the idea of facets of disease and latent variables so far, a logical question to ask is: how do we know what questions to combine to create these scale variables?
This is where factor analysis comes in, along with a bit of methodology that isn’t necessarily the most scientific. In its simplest form, factor analysis is the act of identifying and confirming which questions measure different parts of the same underlying latent variable. Ideally, we would know how to combine the questions as the questions were being written. Unfortunately, this is often impossible because we can’t predict how the questions will perform in the real world. A layperson’s interpretation of a question may be remarkably different from what the researcher intended. This information is still useful but in a slightly different way than envisioned.
Instead of assuming how questions should be combined to form scale variables to represent latent variables, we conduct an exploratory factor analysis, which is just how it sounds. We explore the data. We let the data tell us how to combine the questions. We, for lack of a better term, go on a small fishing expedition. In an exploratory factor analysis, we look for sets of questions whose responses are highly correlated with each other. (Thankfully, some very sophisticated algorithms exist to do this for us so we aren’t staring at correlation tables for hours on end).
Suppose we run an exploratory factor analysis on a 10 item questionnaire. The results of the analysis show that there are likely 3 latent variables being measured by this questionnaire. Questions 1, 2, and 7 are highly correlated (let’s call it Physical Health). Questions 3, 4, 6, and 10 are highly correlated (Mental Health), and questions 5, 8, and 9 are highly correlated (Spiritual Health). So it appears that our 10 item questionnaire measures 3 different facets, or latent variables, of health: physical, mental, and spiritual.
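The grouping step can be caricatured in a few lines of Python. This is only a toy sketch of the intuition, not a real factor analysis (real EFA extracts factors from the eigenstructure of the correlation matrix); the data is simulated so that three hidden factors drive the ten items exactly as in the example above:

```python
import math, random

random.seed(1)
n = 500
# Three hidden "latent" factors; item_factor[i] says which factor drives item i
factors = [[random.gauss(0, 1) for _ in range(n)] for _ in range(3)]
item_factor = [0, 0, 1, 1, 2, 1, 0, 2, 2, 1]
items = [[factors[f][j] + 0.3 * random.gauss(0, 1) for j in range(n)]
         for f in item_factor]

def corr(x, y):
    """Pearson correlation of two equal-length response lists."""
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

# Greedy grouping: put items together whenever they correlate above 0.5
groups = []
for i in range(len(items)):
    for g in groups:
        if corr(items[i], items[g[0]]) > 0.5:
            g.append(i)
            break
    else:
        groups.append([i])

# 0-indexed groups correspond to questions 1, 2, 7 / 3, 4, 6, 10 / 5, 8, 9
print(groups)  # [[0, 1, 6], [2, 3, 5, 9], [4, 7, 8]]
```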
How do we know an exploratory factor analysis is correct?
Replicate. Replicate. Replicate.
After we run an exploratory factor analysis, we need to confirm our findings. The best method to do that is to recruit new samples of people from the same population as the original study and recruit samples of people from different populations compared to the original study. Once these new samples are recruited and the questions have been answered, we can test whether questions 1, 2 and 7, questions 3, 4, 6 and 10, and questions 5, 8 and 9 remain highly correlated. When we attempt to confirm the findings of an exploratory factor analysis, the procedure is called a confirmatory factor analysis because we want to confirm the findings (get it?).
If the findings of a confirmatory factor analysis replicate those of an exploratory factor analysis, you have just discovered a method to reliably measure a real, but unmeasurable, latent variable. If your findings differ between samples of the same population, perhaps the questionnaire has a more complicated structure than originally thought. If your findings differ between samples of different populations, then you need to explore why the findings differ between populations, a research path that can be very intriguing.
The take home message: We often use multiple questions to measure some real-world construct because it is impossible to do so with a single question. These unmeasurable constructs are called latent variables. We identify how to combine these questions, and create scale variables for those latent variables, using exploratory factor analysis, and we confirm the findings using confirmatory factor analysis.
In the last 2 years, two studies have thrown a large bucket of ice water on the notion that a drink a day, whether beer, wine, or spirits, will really help you live longer. The problem these researchers confronted was the sick abstainer bias. Essentially, there are many reasons for a person not to drink. Voluntarily abstaining from alcohol is only one of those reasons. Others include having a medical condition that makes alcohol consumption unsafe or being a former alcoholic. These non-voluntary reasons to abstain from alcohol are also significant risk factors for early death, but in most studies, the unhealthy non-drinkers are collected into the same group as the healthy non-drinkers, which potentially introduces bias into the study (it is usually unwise to have sick participants in control groups). When the researchers reviewed a large body of scientific literature and accounted for this sick abstainer bias, the benefits of moderate drinking (aka 1 drink a day) disappeared. Sadly, a drink a day won’t help you (it probably won’t harm you though).
At this point, it is important to note that these studies were funded by the National Institutes of Health (NIH) and, specifically, by the National Institute on Alcohol Abuse and Alcoholism (NIAAA). Both are US government entities within the executive branch, and NIH/NIAAA funding is largely seen as unbiased, by which I mean NIH/NIAAA does not expect any specific outcome of the research. Instead, they want to know if your hypothesis is true or false because proving a hypothesis false can be as important as proving a hypothesis true (and a necessary possibility in experimental research).
It’s also important to note that the alcohol industry, unsurprisingly, was not supportive of the conclusion that moderate drinking isn’t healthy for you. The International Scientific Forum on Alcohol Research (ISFAR), which consists of approximately 50 researchers who are financially supported by or sympathetic to the alcohol industry, issued a scathing critique of the study within days of publication (a little off track, but how do 50 researchers read and reach consensus on a study critique within days? It takes me weeks to get a single researcher to review a paper.). The President of the Distilled Spirits Council of the US, an alcohol industry trade association, called the paper an “attack.”
So independent researchers concluded that alcohol use is probably not healthy for you (a pretty logical conclusion), and the alcohol industry didn’t like the findings (an expected response).
What happens now?
The alcohol industry throws a bunch of money at the problem.
But not just any money. Money that in essence gets laundered so it looks clean on the other side.
Anheuser-Busch InBev, Heineken, Diageo, Pernod Ricard and Carlsberg, the largest alcohol producers in the world, have pledged nearly $68 million (so far) to the NIH Foundation in support of a study to determine the health consequences of 1 drink of alcohol per day. The entire study is expected to cost $100 million.
If you haven’t heard of the NIH Foundation, you are not alone. I didn’t know it existed until learning about this controversy. It is a 501(c)(3) non-profit organization that raises private funds to support NIH research. Its donors include several pharmaceutical companies, the Gates Foundation, the National Football League (which is also the subject of controversy), and now the alcohol industry.
By donating this large sum of money to the NIH Foundation, the alcohol industry is intending to build a wall between itself and the research outcomes. If the study produces positive results, the industry needs the ability to say the study was done independently of industry influence. The problem is that by providing the money to fund the study, the alcohol industry is at least indirectly influencing the results. As Dr. Thomas Babor, from the University of Connecticut School of Medicine, said in an article in Wine Spectator, “there is the potential for people to subtly or not-so-subtly change their findings or interpretations based on the expectation of the funder.” In sum, the alcohol industry may not be directing the research, but there are ways to influence the process.
Funding the study through the NIH Foundation is even more insidious than it appears at first glance because the researchers do not need to disclose that the alcohol industry funded the project when the time comes to publish the findings. Instead, they only need to disclose that the funding was provided by the NIH Foundation, which on paper looks like a pretty benign funding source.
This has been done before.
For some, gambling is an addiction, and heavy gamblers risk serious negative social and health consequences due to their addiction. In a not-so-deceptive effort to influence the direction of gambling research, the gaming industry has been funding gambling research through the National Center for Responsible Gaming (NCRG). The NCRG was started by a gaming company, and the NCRG remains fully funded by the gaming industry. This firewall allows researchers who accept such money to truthfully state they were not directly funded by industry dollars, and allows gambling industry members to fund researchers who will most likely support their positions.
Frankly, the NIH Foundation is being used by the alcohol industry as the NCRG is used by the gambling industry.
What can be done? What is the purpose of discussing this?
First, research needs to be fully independent, with no expectations of results placed on the investigators. I support government funded research for this very specific reason. Once investigators expect a certain result before a study has even begun, they will make decisions, small and large, to ensure that such a result is achieved. These decisions can be as large as what criteria to use to include or exclude potential participants or as small as whose data to include or exclude in the final analysis. Maybe the intervention group gets a little more attention than the control group, or maybe the results are downplayed or even withheld from the public if they are unfavorable to the funder. Moreover, these decisions may be made consciously or unconsciously, and no one is immune to this influence. I cannot honestly say I would be unaffected by a funder’s intentions, and I feel like I have a pretty good grasp of the problem.
Second, follow the money when it comes to research. Just like political donations, research “funded” by foundations and other non-governmental groups may actually be funded by for-profit industries that stand to benefit from favorable results or suffer from unfavorable results. The investigators who will publish the NIAAA alcohol use study will claim the study is funded by the NIH Foundation, which is technically correct, but the study actually has the fingerprints of numerous transnational alcohol producers.
The take home message: One drink a day may not be healthy after all, and the NIH/NIAAA is accepting a large amount of money from the alcohol industry to study this exact problem. Beware the final results of this project. It will likely be influenced by the alcohol industry itself. For a more critical analysis of the study methods, please read: http://tobaccoanalysis.blogspot.com/2017/07/niaaa-prostitutes-its-scientific.html.
As a researcher, data is important. Data is life. Data is everything. I need to use the best methods available to collect data and the best statistical tests to analyze it. But there is a big problem I often face, and I bet many other researchers have the same issue. Data is expensive. Data can be very expensive and out of reach for many investigators, particularly junior investigators who don’t have access to alternative funding streams. So what’s left? What can a researcher do if there are no resources to collect data?
Use someone else’s.
This idea, that another researcher’s dataset can be used for novel purposes, is the entire premise of secondary data analysis. This isn’t a novel approach, and pointing out the pros and cons of secondary data analysis at this point would simply seem duplicative. The key to secondary data is finding it, which is the purpose of this post.
But before I delve into finding data, I want to make a distinction between secondary data and “Big Data.” Big data has been a trendy research area for several years, but even I get confused about what is and is not big data. A really large study database is not big data. Surveillance studies that include hundreds of thousands of people are not big data. Decades-long longitudinal studies are not big data. Instead, big data, with some notable exceptions, is generated by the things we do in everyday life. Big data is combining information on the type of posts you Like on Facebook with your purchase history. Big data is combining medical records information with information from grocery store receipts. Big data is using credit card transactions at gas stations to determine the popularity of tourist attractions. These datasets are massive, encompassing millions of people and potentially billions of data points. The sheer size of these datasets requires researchers to essentially program their own apps in order to analyze them effectively (something I am unable to do, though I am jealous of the people who can). SAS or SPSS simply can’t handle the workload.
But back to secondary data. We all know what it is but where do we get some?
(Note: This is an anti-conflict of interest statement. I am not affiliated with ICPSR in any way. I just like the system that has been created.)
I feel like I’ve given away the punchline before even telling the joke, but ICPSR is the clearinghouse for data. This database of databases has been maintained for over 50 years and includes data on almost every conceivable topic. For instance, if I am interested in alcohol use, ICPSR has information on 1,325 studies that contain questions on alcohol use. There are 517 studies that contain information on pets; 129 studies on aspirin; 142 studies that have information on media literacy in urban schools; 1,735 studies on sexuality; and 2,263 studies on policy. Remember, those are studies. Each study can contain one or more variables on your topic of interest. Even a handful of studies on your topic may have hundreds of relevant variables. (For example, there are 57,468 variables pertaining to alcohol use).
An additional benefit of ICPSR is that it contains important information on all the large national surveillance studies that are currently being conducted in the U.S. (e.g. BRFSS, YRBS, NHANES, etc.). There will always be some database that isn’t within ICPSR’s search parameters, but there are no better systems to access the amount of data available (If you know of a better source, let me know.).
A potential problem
There’s one problem though, a problem shared with all research on human subjects. In order to conduct a secondary analysis of individual-level human subjects data, approval is needed from an Institutional Review Board (IRB). For a professor at nearly any university, this is not a big hurdle. There are always some administrative inefficiencies, but at least you have access to an IRB. For graduate students, post-docs, research assistants/associates, or researchers who aren’t affiliated with an institution that has an IRB, IRB approval is a significant roadblock. For grad students, post-docs, and research assistants/associates, I know the obvious answer is to have a supervisor sign off on the application, but there are ethical implications to consider. The supervisor may know nothing about the project or may not be interested in the project. Therefore, is it morally right to reward a supervisor for doing literally nothing? And is it right for a supervisor to sign off on a protocol that they have no knowledge of? These are the questions I dealt with while a doctoral student, and questions I still haven’t fully answered for myself (essentially, I’m not sure the cons outweigh the pros, particularly because future career opportunities in academia are almost universally reliant on publication history). It’s possible to send an IRB application to an unaffiliated, for-profit IRB, like WIRB, but if there are no resources to collect the data, I doubt there will be resources available to pay the required fees.
So for those of us who want to answer a research question but don’t have the resources to collect our own data and aren’t in a position to get IRB approval, there is one last type of data that can be used: ecological data. Ecological data is data that has been summarized across a population. For instance, the cancer rate per state, the prevalence of obesity by country, or average income by city. This type of data does not need IRB approval because it is not considered human subjects data since no single individual can be identified in the dataset. I have been fortunate enough to be able to publish ecological data using countries and villages as my unit of measurement, although researchers must be careful not to over-interpret the findings from ecological studies.
Where can you find ecological secondary data?
The answer is a little more complicated because there is no single website housing all of this data. If you are in the U.S. and are interested in a purely demographic/geographic analysis, then the U.S. Census Bureau is the right place to look. You’ll likely have to create a dataset by hand, but the Bureau has already created thousands of tables that will get you started. If you have a health-related research question, the CDC and the various NIH institutes are likely the best sources. State-level surveillance data will be available from nearly every surveillance survey that is conducted, and although the sample size is relatively small (n = ~50, depending on whether DC and territories are included), the number is large enough to perform multivariable regression, among other techniques. Often, you may have to combine health-related data from a source like the CDC with demographic information available from the Census Bureau. If your research question has a larger geographic scope, then search through the World Health Organization’s Global Health Observatory or the World Bank’s Global Health Indicators. Each source provides summary statistics at the country level, and datasets can be linked by country name.
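Linking those sources usually comes down to a join on the geographic unit’s name. Here is a minimal sketch in Python; the country names and figures are placeholders, not real WHO or World Bank statistics:

```python
# Hypothetical country-level indicators from two different sources
who_life_expectancy = {"Freedonia": 71.2, "Sylvania": 78.9, "Latveria": 74.5}
wb_gdp_per_capita = {"Freedonia": 6_200, "Sylvania": 41_000, "Genosha": 12_300}

# Inner join on country name: keep only countries present in BOTH sources
merged = {
    country: {"life_expectancy": who_life_expectancy[country],
              "gdp_per_capita": wb_gdp_per_capita[country]}
    for country in who_life_expectancy.keys() & wb_gdp_per_capita.keys()
}
for country in sorted(merged):
    print(country, merged[country])
```

In practice the fiddly part is harmonizing names across sources (“USA” vs. “United States of America”), so expect some manual cleanup before the join works.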
The take home message: Any research question can be answered, even if funding isn’t available to collect new data. If data is too expensive to generate, consider performing a secondary analysis using publicly available datasets. If you don’t have access to an IRB, consider performing an ecological analysis using country-, state-, county-, or city-level data.