As a researcher, I consider data important. Data is life. Data is everything. I need to use the best methods available to collect data, and the best statistical tests to analyze it. But there is a big problem I often face, and I bet many other researchers have the same issue. Data is expensive. It can be very expensive and out of reach for many investigators, particularly junior investigators who don’t have access to alternative funding streams. So what’s left? What can a researcher do if there are no resources to collect data?
Use someone else’s.
This idea, that another researcher’s dataset can be used for novel purposes, is the entire premise of secondary data analysis. This isn’t a novel approach, and pointing out the pros and cons of secondary data analysis at this point would simply seem duplicative. The key to secondary data is finding it, which is the purpose of this post.
But before I delve into finding data, I want to make a distinction between secondary data and “Big Data.” Big data has been a trendy research area for several years, but even I get confused about what is and is not big data. A really large study database is not big data. Surveillance studies that include hundreds of thousands of people are not big data. Decades-long longitudinal studies are not big data. Instead, big data, with some notable exceptions, is generated by the things we do in everyday life. Big data is combining information on the type of posts you Like on Facebook with your purchase history. Big data is combining medical records information with information from grocery store receipts. Big data is using credit card transactions at gas stations to determine the popularity of tourist attractions. These datasets are massive, encompassing millions of people and potentially billions of data points. The sheer size of these datasets requires researchers to essentially program their own apps in order to analyze them effectively (something I am unable to do, and I am jealous of the people who can). SAS or SPSS simply can’t handle the workload.
But back to secondary data. We all know what it is but where do we get some?
(Note: This is an anti-conflict of interest statement. I am not affiliated with ICPSR in any way. I just like the system that has been created.)
I feel like I’ve given away the punchline before even telling the joke, but ICPSR (the Inter-university Consortium for Political and Social Research) is the clearinghouse for data. This database of databases has been maintained for over 50 years and includes data on almost every conceivable topic. For instance, if I am interested in alcohol use, ICPSR has information on 1,325 studies that contain questions on alcohol use. There are 517 studies that contain information on pets; 129 studies on aspirin; 142 studies that have information on media literacy in urban schools; 1,735 studies on sexuality; and 2,263 studies on policy. Remember, those are studies. Each study can contain one or more variables on your topic of interest. Even a handful of studies on your topic may have hundreds of relevant variables. (For example, there are 57,468 variables pertaining to alcohol use.)
An additional benefit of ICPSR is that it contains important information on all the large national surveillance studies that are currently being conducted in the U.S. (e.g., BRFSS, YRBS, and NHANES). There will always be some database that isn’t within ICPSR’s search parameters, but I know of no better system for accessing this much data. (If you know of a better source, let me know.)
A potential problem
There’s one problem though, a problem shared with all research on human subjects. In order to conduct a secondary analysis of individual-level human subjects data, approval is needed from an Institutional Review Board (IRB). For a professor at nearly any university, this is not a big hurdle. There are always some administrative inefficiencies, but at least you have access to an IRB. For graduate students, post-docs, research assistants/associates, or researchers who aren’t affiliated with an institution that has an IRB, IRB approval is a significant roadblock. For grad students, post-docs, and research assistants/associates, I know the obvious answer is to have a supervisor sign off on the application, but there are ethical implications to consider. The supervisor may know nothing about the project or may not be interested in it. Therefore, is it morally right to reward a supervisor for doing literally nothing? And is it right for a supervisor to sign off on a protocol that they have no knowledge of? These are the questions I dealt with as a doctoral student, and questions I still haven’t fully answered for myself. (Essentially, I’m not sure the cons outweigh the pros, particularly because future career opportunities in academia are almost universally reliant on publication history.) It’s possible to send an IRB application to an unaffiliated, for-profit IRB, like WIRB, but if there are no resources to collect the data, I doubt there will be resources available to pay the required fees.
So for those of us who want to answer a research question but don’t have the resources to collect our own data and aren’t in a position to get IRB approval, there is one last type of data that can be used: ecological data. Ecological data is data that has been summarized across a population, such as the cancer rate per state, the prevalence of obesity by country, or average income by city. This type of data does not need IRB approval because it is not considered human subjects data, since no single individual can be identified in the dataset. I have been fortunate enough to publish ecological studies using countries and villages as my units of analysis, although researchers must be careful not to over-interpret the findings from ecological studies.
Where can you find ecological secondary data?
The answer is a little more complicated because there is no single website housing all of this data. If you are in the U.S. and are interested in a purely demographic or geographic analysis, then the U.S. Census Bureau is the right place to look. You’ll likely have to create a dataset by hand, but the Bureau has already created thousands of tables that will get you started. If you have a health-related research question, the CDC and the NIH institutes are likely the best sources. State-level data will be available from nearly every surveillance survey that is conducted, and although the sample size is relatively small (n=~50, depending on whether DC and the territories are included), the number is large enough to perform multivariable regression, among other techniques. Often, you may have to combine health-related data from a source like the CDC with demographic information available from the Census Bureau. If your research question has a larger geographic scope, then search through the World Health Organization’s Global Health Observatory or the World Bank’s Global Health Indicators. Each source provides summary statistics at the country level, and datasets can be linked by country name.
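For readers who have never linked two summary tables, here is a minimal Python sketch of the idea, using only the standard library. Everything in it is an assumption for illustration: the column names (state, obesity_pct, median_income), the helper functions, and above all the numbers, which are made up and are not real statistics. In practice the two tables would come from files you downloaded or built by hand.

```python
import csv
import io

# Hypothetical, made-up values purely for illustration; not real statistics.
HEALTH_CSV = """state,obesity_pct
Alabama,30.0
Alaska,25.0
Arizona,20.0
"""

CENSUS_CSV = """state,median_income
alabama,50000
Alaska ,70000
Colorado,80000
"""


def normalize(name):
    """Make join keys comparable: trim whitespace and ignore case."""
    return name.strip().casefold()


def read_table(text, value_column):
    """Read a two-column CSV into {normalized state name: float value}."""
    rows = csv.DictReader(io.StringIO(text))
    return {normalize(row["state"]): float(row[value_column]) for row in rows}


def link(health, census):
    """Keep states present in both tables; also list states found in only one."""
    matched = {
        state: {"obesity_pct": health[state], "median_income": census[state]}
        for state in sorted(health.keys() & census.keys())
    }
    unmatched = sorted(health.keys() ^ census.keys())
    return matched, unmatched


health = read_table(HEALTH_CSV, "obesity_pct")
census = read_table(CENSUS_CSV, "median_income")
linked, unmatched = link(health, census)

for state, values in linked.items():
    print(state, values)
print("unmatched:", unmatched)
```

The unmatched list is the part worth checking by eye. When linking on names rather than codes, small spelling differences between sources (for example, two different spellings of the same country) will silently drop rows unless you look for them.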
The take-home message: Any research question can be answered, even if funding isn’t available to collect new data. If data is too expensive to generate, consider performing a secondary analysis using publicly available datasets. If you don’t have access to an IRB, consider performing an ecological analysis using country-, state-, county-, or city-level data.