5 Effective Strategies for Accessing and Utilizing Public Datasets in Independent Research
Introduction
The Open Knowledge Foundation’s Open Definition summarizes public data, or open data, in a single sentence: “Open data and content can be freely used, modified, and shared by anyone for any purpose.”[1] Described in another instance as “the democratization of information,”[2] the advent of public-use datasets has sparked debate over whether these data present more of a benefit or a risk to society. Benefits cited include open data’s role in facilitating innovation, the sheer volume of open data available, the free and high-quality nature of many of these datasets, and the vast variety of topics covered by open data sources.[3]
One common concern regarding open data is the risk implicit in having one’s data publicized (even when anonymized). Indeed, even well-meaning researchers may inadvertently endanger confidential participant information by making seemingly innocuous identifiers, such as zip code, available to the public.[4] Also relevant to maintaining data integrity is ensuring that the data are accurate, up-to-date, and high-quality before using a dataset to conduct research. Failing to do so may have severe consequences, particularly if the resulting research is used by policy makers.[5]
5 Effective Strategies for Public Dataset Usage
Finding the right dataset for your project, knowing how to utilize these data, and doing so in an intentional and ethical way are all important aspects of conducting research. Read on for 5 ways to effectively and ethically access and utilize open data in your independent research.
- Identify high-quality sources of open data. Some credible open data sources of note include (but are not limited to) Google Dataset Search, Kaggle, GitHub, FiveThirtyEight, data.world, and government databases such as data.gov, usa.gov, Federal Reserve Data, and U.S. Bureau of Labor Statistics.[6]
- Choose a data source to use. Identifying the best-suited data source for your needs depends on what kind of research you plan on doing. Some considerations include:
- Do you need longitudinal data? If you are planning to conduct longitudinal research, you might consider diving deeper into public-use longitudinal datasets such as the Centers for Disease Control and Prevention (CDC)’s Behavioral Risk Factor Surveillance System or data from the U.S. Census.[7]
- Would you benefit from using a data repository? A data repository is a database of various datasets. Some examples of data repositories include: Databrary.org (behavioral science), National Institute of Child Health and Human Development (NICHD) Datasets (genomics, brain development, etc.), Stanford Education Data Archive (education research), and OpenNeuro (neuroimaging).[8] Of course, once you choose a data repository to use, you will then need to identify one or more specific datasets before you begin your data analysis.
- What is your population of interest? Regardless of the target population of your planned research, there is likely a data source that fits the bill. For example, if you want to focus on adolescents, the ongoing Add Health study, launched in 1994, may be a good fit.[9]
- What is your topic of interest? Open data sources on an array of topics are available for a variety of research needs – this includes both data repositories and individual datasets. If you are a researcher who intends to visualize socioeconomic, education, and housing status, you will want to check out the University of Wisconsin’s Neighborhood Atlas resource.[10]
- Choose which data analysis process(es) to use. Once you’ve chosen your data set(s), you will need to choose which data analysis process or processes to use with your data. Considerations include:
- Which data analysis software do you have access to, and which will you use? Common choices include SPSS, R, SAS, SQL, and Stata. R is a favored choice for data analysis, particularly for its strong data visualization capabilities and its accessibility as free, open-source software.
- Which language will you use? Some researchers use specific programming languages for conducting data analysis; R is not only a favored program but also a favored programming language for this purpose. On the other hand, if you are partial to SPSS, the question of a programming language is largely moot (as you probably know).
- Which data analysis approach will you use? Do you need to conduct a statistical analysis? If so, which kind, and how is it performed? Is it informed by your research questions and/or hypotheses? If you intend to create a data visualization, you will ask yourself different questions, such as whether you have the data needed to investigate your chosen research questions and hypotheses and to create said visualization.
- Evaluate the quality of your chosen data. Even sources generally deemed reliable may vary in quality from one dataset to another.[11] When evaluating a source for potential use, consider whether the data are reliable:
- Is the data source trustworthy? While government-sponsored sources are generally credible, verifying source trustworthiness is particularly imperative when considering data from sources like GitHub and data.world.[12] If you have doubts about the credibility of a specific dataset, compare the data to other, similar datasets to get a feel for whether it can be trusted.[13]
- Is the data accurate? Most datasets need to be reviewed and cleaned before they can be analyzed. For example, some data include typos, and other data contain items that need to be recoded in a way that makes sense.[14]
- Is the data complete? An initial cursory review of the data, checking for null (missing) values, will reveal whether the dataset is missing data.[15] Should you discover that some of your data are missing, best practices recommend[16] running missingness tests (e.g., Little’s MCAR test) to determine the pattern of missing data[17][18] and whether measures need to be taken to address the missingness.[19]
- Is the data skewed? Data skewness can impact the quality of the dataset you are using.[20] To determine whether your chosen data are skewed, plot a histogram;[21] this is most effective for numeric data.[22] If any of your data are non-numeric (e.g., gender), you can run a frequency table to get a bird’s-eye view of your data and check for possible imbalance.[23] Once you see the big picture of your data, you can make more informed decisions about which analyses are best to use. For example, if you find that your data are left skewed or right skewed, it is more advisable to report a median rather than a mean, because the median is far less affected by skew.[24]
- Is the data recent? What counts as “recent” data varies with the nature of your topic and research. According to one medical journal, data may be most relevant when results are published within 3 years of data collection.[25] A nursing journal editorial, however, suggests that this 3-year benchmark may mean missing out on valuable insights from data collected outside this narrow parameter.[26] Indeed, sometimes one must simply use the most recent data available, regardless of any prescriptive criteria.[27] “Old” data may also be acceptable when one is studying a past event or when the data were collected with painstaking exactness.[28]
- Ensure that IRB approval is not required for data usage and analysis. Research using data that a) involve human participants and b) are not de-identified generally requires approval from an institutional review board (IRB) before analysis can commence. Exemptions to this rule include data that are publicly available for use and data that no longer have identifiers attached to participants.[29] Some institutions maintain a page dedicated to pre-approved, publicly available, de-identified data sources.[30] When in doubt about whether you need IRB approval, contact an IRB to verify.
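The accuracy check above often comes down to recoding messy values. Here is a minimal sketch in standard-library Python; the column and category labels are hypothetical, and real projects would typically do this with a tool like pandas or your statistical package’s recode facilities.

```python
# A minimal cleaning sketch: recode inconsistent labels before analysis.
# The category labels below are hypothetical examples.
RECODE = {"f": "female", "female": "female", "femal": "female",  # typo variant
          "m": "male", "male": "male"}

def clean_label(value):
    """Normalize a raw entry to a canonical label; flag anything unrecognized."""
    return RECODE.get(value.strip().lower(), "unknown")

raw = ["F", "femal", "male ", "M", "x"]
print([clean_label(v) for v in raw])  # → ['female', 'female', 'male', 'male', 'unknown']
```

Flagging unrecognized entries as "unknown" (rather than silently dropping them) makes it easy to review what the recode map missed.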
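The cursory null-value review described in the completeness check can be sketched in a few lines of standard-library Python. The rows and column names here are hypothetical; in practice you would more likely use pandas’ `isna()`, and formal tests such as Little’s MCAR test live in dedicated statistical packages.

```python
from collections import Counter

# Hypothetical rows; we treat both None and empty strings as missing.
rows = [
    {"age": "34", "zip": "02139"},
    {"age": "",   "zip": "10001"},
    {"age": "57", "zip": None},
]

def missing_counts(rows):
    """Count missing (None or empty-string) values per column."""
    counts = Counter()
    for row in rows:
        for col, val in row.items():
            if val is None or val == "":
                counts[col] += 1
    return dict(counts)

print(missing_counts(rows))  # → {'age': 1, 'zip': 1}
```

A per-column count like this is only the first step: it tells you how much is missing, while the missingness tests cited above address *why* it is missing.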
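The skew check above, comparing mean and median, can be sketched as follows; a right-skewed variable pulls the mean above the median, which is why the median is the safer summary. The income and gender values are hypothetical.

```python
import statistics
from collections import Counter

# Hypothetical right-skewed incomes: one large value drags the mean upward.
incomes = [28_000, 31_000, 33_000, 35_000, 36_000, 250_000]
mean, median = statistics.mean(incomes), statistics.median(incomes)
print(f"mean={mean:.0f}, median={median:.0f}")  # → mean=68833, median=34000

# For non-numeric data, a frequency count stands in for the histogram.
genders = ["female", "female", "male", "female", "nonbinary"]
print(Counter(genders).most_common())  # most frequent category first
```

Here the single outlier roughly doubles the mean while leaving the median near the bulk of the data, illustrating the article’s point about reporting medians for skewed variables.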
Conclusion
The use of public datasets (open data) offers significant potential for innovation and the advancement of public policy, provided it is approached with both ethical rigor and a clear understanding of its associated risks and benefits. As such, researchers must prioritize selecting high-quality, credible sources of open data—such as government databases or established data repositories—and apply appropriate analytical methods tailored to their specific research questions.
Key steps for ethical and effective use of open data include identifying the most relevant datasets, ensuring data accuracy and completeness, and adhering to data privacy standards. Researchers should conduct thorough evaluations to confirm data reliability, address any issues of missing or skewed data, and ensure the data’s recency aligns with the research objectives. Such diligence is essential to maintain the trustworthiness and relevance of research outcomes. Moreover, understanding the ethical implications, such as obtaining necessary IRB approvals for data involving human subjects, is crucial to maintain the integrity of the research process. Researchers must also be vigilant in ensuring that the data they use does not inadvertently compromise participant confidentiality.
By following these best practices, researchers can leverage open data to generate meaningful insights while upholding high standards of data integrity and participant privacy. This careful balance enhances the credibility and impact of research and helps ensure that the democratization of data benefits all segments of society, reinforcing the role of open data in shaping informed policies.
Take Away
As data scientist and machine learning engineer Shattesh Mani describes, open data enables the “democratization of information,” allowing free use, modification, and sharing. This concept is key to its impact on research and public policy. Researchers must prioritize high-quality data sources and rigorous analytical methods to maintain integrity and privacy. Upholding these standards enhances research credibility and ensures data-driven advancements benefit society.
[1] Open Knowledge. Open Definition. https://opendefinition.org/
[2] Mani, S. 7 Awesome Ways Publicly Available Datasets Can Help Your Business Flourish Unparalleled. Medium. https://medium.com/swlh/12-awesome-ways-publicly-available-datasets-can-help-your-business-flourish-unparalleled-cd2fb9f206d3
[3] Mani, S. 7 Awesome Ways Publicly Available Datasets Can Help Your Business Flourish Unparalleled. Medium. https://medium.com/swlh/12-awesome-ways-publicly-available-datasets-can-help-your-business-flourish-unparalleled-cd2fb9f206d3
[4] David, M. Where to find free datasets & how to know if they’re good quality. https://www.atlassian.com/data/business-intelligence/free-datasets
[5] David, M. Where to find free datasets & how to know if they’re good quality. https://www.atlassian.com/data/business-intelligence/free-datasets
[6] David, M. Where to find free datasets & how to know if they’re good quality. https://www.atlassian.com/data/business-intelligence/free-datasets
[7] Weir, K. Finding treasure in public data. American Psychological Association. https://www.apa.org/monitor/2019/05/finding-treasure
[8] Harrison, E. Missing data. The Comprehensive R Archive Network. https://cran.r-project.org/web/packages/finalfit/vignettes/missing.html
[9] Harrison, E. Missing data. The Comprehensive R Archive Network. https://cran.r-project.org/web/packages/finalfit/vignettes/missing.html
[10] Harrison, E. Missing data. The Comprehensive R Archive Network. https://cran.r-project.org/web/packages/finalfit/vignettes/missing.html
[11] Weir, K. Finding treasure in public data. American Psychological Association. https://www.apa.org/monitor/2019/05/finding-treasure
[12] Weir, K. Finding treasure in public data. American Psychological Association. https://www.apa.org/monitor/2019/05/finding-treasure
[13] Weir, K. Finding treasure in public data. American Psychological Association. https://www.apa.org/monitor/2019/05/finding-treasure
[14] Weir, K. Finding treasure in public data. American Psychological Association. https://www.apa.org/monitor/2019/05/finding-treasure
[15] Weir, K. Finding treasure in public data. American Psychological Association. https://www.apa.org/monitor/2019/05/finding-treasure
[16] Harrison, E. Missing data. The Comprehensive R Archive Network. https://cran.r-project.org/web/packages/finalfit/vignettes/missing.html
[17] Howell, D. C. The Treatment of Missing Data. University of Vermont. https://www.uvm.edu/~statdhtx/StatPages/Missing_Data/Missing_data2x.pdf
[18] Mack, C., Su, Z. & Westreich, D. Types of Missing Data. National Library of Medicine. https://www.ncbi.nlm.nih.gov/books/NBK493614/
[19] Mack, C., Su, Z. & Westreich, D. Types of Missing Data. National Library of Medicine. https://www.ncbi.nlm.nih.gov/books/NBK493614/
[20] Weir, K. Finding treasure in public data. American Psychological Association. https://www.apa.org/monitor/2019/05/finding-treasure
[21] Weir, K. Finding treasure in public data. American Psychological Association. https://www.apa.org/monitor/2019/05/finding-treasure
[22] Weir, K. Finding treasure in public data. American Psychological Association. https://www.apa.org/monitor/2019/05/finding-treasure
[23] Weir, K. Finding treasure in public data. American Psychological Association. https://www.apa.org/monitor/2019/05/finding-treasure
[24] David, M. Statistic vs. Distribution. The Data School. https://dataschool.com/misrepresenting-data/statistic-vs-distribution/
[25] Welsh, J., Lu, Y. & Dhruva, S. S. Age of Data at the Time of Publication of Contemporary Clinical Trials. JAMA Network Open. https://doi.org/10.1001/jamanetworkopen.2018.1065
[26] Dale, C. M. & Logsdon, M. C. When is data too old to inform nursing science and practice? Journal of Advanced Nursing. https://onlinelibrary.wiley.com/doi/10.1111/jan.15411
[27] Ketchen, D. J., Roccapriore, A. Y. & Connelly, B. L. Using Old Data: When Is It Appropriate? Journal of Management. https://doi.org/10.1177/01492063231177785
[28] Office of Research and Creative Achievement. Use of Publicly Available OR Identifiable Private Sources of Information. University of Maryland, Baltimore County. https://research.umbc.edu/use-of-pre-existing-data/
[29] Office of Research and Creative Achievement. Use of Publicly Available OR Identifiable Private Sources of Information. University of Maryland, Baltimore County. https://research.umbc.edu/use-of-pre-existing-data/
[30] Office of Research and Creative Achievement. IRB Pre-Approved Publicly Available, De-Identified Data Sources. University of Maryland, Baltimore County. https://research.umbc.edu/irb-pre-approved-publicly-available-de-identified-data-sources/