Optimizing Database Queries for Efficient Data Retrieval in Independent Research
Introduction
Some research has described open data as “having the potential to transform science and fast-track the development of new knowledge”;[1] other researchers have observed that this potential is limited if data-seekers are unable to find the data that they seek.[2] We present these guidelines on optimizing database queries for efficient data retrieval not only for this reason but also because searching for open data typically requires a different approach than searching for literature.[3] Indeed, while many researchers may be experienced at finding empirical research for a literature review, we anticipate that fewer are practiced at locating and retrieving datasets to analyze.[4] If this describes you, keep reading for our tips and best practices on optimizing data retrieval for your independent research.
How to Prepare for Data Retrieval
To prepare for data retrieval, identify which database or data repository is most relevant to your research topic. If you need help choosing the database best suited to your research, you can use credible online resources such as FAIRsharing or re3data.org, both of which offer lists of certified data repositories; the National Institutes of Health (NIH) also provides a list of supported scientific repositories. For more personalized assistance, ask a funding coordinator or librarian, or consult a colleague; doing so draws on the resources around you and makes the search for the right database an interactive process.[5]
Databases and data repositories cover a wide range of subject areas. For example, if your topic is health related, you may find an NIH-supported data repository to be a valuable source of data for your research.[6] If no subject-specific repository seems to exist for your research topic, consider generalist data repositories such as Harvard Dataverse, Mendeley Data, Open Science Framework, Science Data Bank or Code Ocean. Once you have chosen a database or data repository, use the search function to locate the datasets most relevant to your topic of interest.
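Many repositories also expose their search function programmatically. The following is a minimal sketch, assuming Harvard Dataverse's public Search API is reachable at `https://dataverse.harvard.edu/api/search` and accepts the `q`, `type`, and `per_page` parameters; check the repository's current API guide before relying on it. The function name `build_dataset_search_url` is hypothetical and used only for illustration. The code only builds the query URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumed endpoint: Harvard Dataverse's public Search API.
DATAVERSE_SEARCH = "https://dataverse.harvard.edu/api/search"


def build_dataset_search_url(keywords, per_page=10):
    """Return a search URL for the given keywords, restricted to datasets."""
    params = {
        "q": " ".join(keywords),   # space-separated search terms
        "type": "dataset",         # ask for datasets, not files or collections
        "per_page": per_page,      # number of results per page
    }
    return f"{DATAVERSE_SEARCH}?{urlencode(params)}"


print(build_dataset_search_url(["diabetes", "lifestyle"]))
```

Opening the printed URL in a browser, or fetching it with an HTTP client of your choice, should return the matching records in JSON form if the assumptions above hold.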
Best Practices in Data Retrieval
The process of efficient data retrieval begins before you even open a database or data repository. To optimize your database queries, consider the following recommendations for searching databases:
- Form and refine your research question. You will know that your research question is effectively refined when you can a) easily identify which analytical procedure(s) will help you investigate your research question(s) and b) easily break down your question into keywords that will likely yield the data that you are looking for. If your research question is: “Does lifestyle or genetics more strongly predict whether one is diagnosed with diabetes?”, you would search using words like “lifestyle”, “genetics” and “diabetes”[7] to find datasets relevant to your research.
- Identify the most relevant databases/data repositories.[8] Continuing with the above example, if you were seeking data related to diabetes, you would likely find health research databases, such as those supported by NIH, most relevant. Should you need additional resources, check out a generalist data repository such as Open Science Framework or Science Data Bank.
- Ensure you have appropriate filters turned on. Some databases contain not only datasets but also other resources such as empirical literature. To make sure your search returns only datasets for analysis rather than research articles, check the relevant filters. For example, if you are using the Open Science Framework, one of the filter options is “Resource Type”, which can be set to show only datasets.
- Review keywords/tags on data files that appear relevant. The Diabetes Study of Northern California, a dataset found through re3data.org, is tagged with keywords such as “diabetes-related complication”, “social disparities” and “ethnicity”; these keywords suggest that this dataset is likely relevant to a diabetes-related research topic.
- Utilize multiple databases.[9] [10] As with a literature search, we recommend using multiple databases/data repositories to cross-reference your resources. This way you will likely have more than one dataset to choose from, and you may find datasets that can be merged or appended together.
- Just keep swimming.[11] If your first database search fails to yield the results you are hoping for, don’t give up! Take a few deep breaths, brainstorm additional keywords, try a different database, and just keep swimming until you find a dataset you can use.
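The keyword and tag steps above can be sketched in a few lines of Python. This is an illustrative sketch, not a definitive method: the stopword list is hand-tuned for the example question, the helper names `extract_keywords` and `rank_by_tag_overlap` are hypothetical, and the first sample record is invented for contrast. The second record uses the tags quoted above for the Diabetes Study of Northern California.

```python
import re

# A small, hand-tuned stopword list for this example; extend it for your own questions.
STOPWORDS = {
    "a", "an", "and", "are", "diagnosed", "do", "does", "is", "more", "of",
    "one", "or", "predict", "strongly", "the", "whether", "with",
}


def extract_keywords(question):
    """Lowercase the question, split it into words, and drop stopwords and repeats."""
    keywords = []
    for word in re.findall(r"[a-z]+", question.lower()):
        if word not in STOPWORDS and word not in keywords:
            keywords.append(word)
    return keywords


def rank_by_tag_overlap(records, keywords):
    """Sort records so those whose tags mention the most keywords come first."""
    def score(record):
        tag_text = " ".join(record["tags"]).lower()
        return sum(1 for keyword in keywords if keyword in tag_text)
    return sorted(records, key=score, reverse=True)


question = ("Does lifestyle or genetics more strongly predict whether one "
            "is diagnosed with diabetes?")
keywords = extract_keywords(question)

records = [
    # Hypothetical record, included only to show a non-matching result.
    {"title": "Hypothetical commuting survey", "tags": ["transport", "cities"]},
    {"title": "Diabetes Study of Northern California",
     "tags": ["diabetes-related complication", "social disparities", "ethnicity"]},
]
ranked = rank_by_tag_overlap(records, keywords)

print(keywords)
print(ranked[0]["title"])
```

Simple substring matching on tags is a rough first pass. It helps you decide which records to open first, but it does not replace reading each dataset's documentation.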
Below are some guidelines to keep in mind as you are choosing a dataset (or datasets) for your independent research.
- According to the BioMedical Informatics Coordinating Committee (BMIC), the characteristics of a credible data repository include (but are not limited to): persistent unique identifiers, long-term sustainability, maximally open access, security, and free access to the data.[12]
- If you seek data that involves human participants, you must also take into account aspects such as whether participants granted consent for their data to be collected/used, participant privacy, clear use guidance, plans for potential breaches and use violations, and other related requirements for using human data.[13]
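One way to keep track of these characteristics while comparing repositories is a simple checklist. The sketch below is only an illustration: the checklist entries repeat the characteristics named in the list above, while the function name `missing_characteristics` and the sample notes are hypothetical.

```python
# Characteristics named in the guidelines above; the list is not exhaustive.
REPOSITORY_CHECKLIST = (
    "persistent unique identifiers",
    "long-term sustainability",
    "maximally open access",
    "security",
    "free access to the data",
)


def missing_characteristics(observed):
    """Return the checklist items not yet confirmed for a repository."""
    confirmed = {item.lower() for item in observed}
    return [item for item in REPOSITORY_CHECKLIST if item not in confirmed]


# Hypothetical notes taken while reviewing a repository's documentation.
notes = ["Persistent unique identifiers", "Security"]
print(missing_characteristics(notes))
```

Anything the function returns is a prompt to look further into the repository's documentation, not proof that the repository lacks that characteristic.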
Conclusion
In an era where data is increasingly recognized as a pivotal asset for scientific advancement, mastering the art of data retrieval is essential. Effective data retrieval involves more than just accessing any available data: it requires targeted strategies to find the most relevant and high-quality datasets. The guidelines and best practices provided in this article serve as a comprehensive roadmap for researchers, particularly those less experienced with the nuances of data search as compared to literature review.
Moreover, understanding the characteristics of credible data repositories—such as long-term sustainability, open access, and robust security measures—can significantly impact the reliability and effectiveness of data used in research. These elements ensure that the data not only serves the immediate needs of the researcher but also adheres to broader ethical and operational standards.
As researchers embark on their journey of data retrieval, they should remain diligent, flexible, and persistent. The path to finding the right data is often iterative, requiring adjustments and, sometimes, starting anew. By embracing the approach detailed in this article, researchers can enhance their capability to harness the full potential of open data, thereby contributing to the acceleration of scientific discovery and the expansion of human knowledge. In summary, while the challenges of data retrieval are non-trivial, the strategies provided here equip researchers with the tools to overcome these hurdles and make impactful scientific contributions.
Take Away
Harnessing the power of open data can accelerate research and expand scientific knowledge. Open data holds the potential to “transform science and fast-track the development of new knowledge.” Mastering the effective navigation and utilization of diverse data repositories can significantly boost the impact of scientific endeavors.
[1] Gray, J. Jim Gray on eScience: A transformed scientific method. In Hey T., Tansley S., & Tolle K. (Eds.), The fourth paradigm: Data‐intensive scientific discovery (pp. xvii–xxxi).
[2] Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A. & Wyatt, S. Searching data: A review of observational data retrieval practices in selected disciplines. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6853156/#asi24165-bib-0046
[3] Kern, D. & Mathiak, B. Are there any differences in data set retrieval compared to well‐known literature retrieval? In Kapidakis S., Mazurek C., & Werla M. (Eds.), Research and advanced technology for digital libraries. Lecture notes in computer science (Vol. 9316).
[4] Kern, D. & Mathiak, B. Are there any differences in data set retrieval compared to well‐known literature retrieval? In Kapidakis S., Mazurek C., & Werla M. (Eds.), Research and advanced technology for digital libraries. Lecture notes in computer science (Vol. 9316).
[5] Kern, D. & Mathiak, B. Are there any differences in data set retrieval compared to well‐known literature retrieval? In Kapidakis S., Mazurek C., & Werla M. (Eds.), Research and advanced technology for digital libraries. Lecture notes in computer science (Vol. 9316).
[6] National Institutes of Health. Selecting a Data Repository. https://sharing.nih.gov/data-management-and-sharing-policy/sharing-scientific-data/selecting-a-data-repository
[7] Zickel, E. Basic Guidelines for Research in Academic Databases. Middle Tennessee State University. https://mtsu.pressbooks.pub/1020mtsu/chapter/basic-guidelines-for-academic-research-database-searches/
[8] Walden University Library. Database Search Skills: Introduction. Walden University. https://academicguides.waldenu.edu/library/databasesearchskills
[9] University of Dayton Libraries. Using Databases for Your Research. University of Dayton. https://libguides.udayton.edu/searching-in-databases/general-search-tips-EBSCO
[10] Columbia University Libraries. Database Searching Guide: Best Practices for Database Searching. Columbia University. https://guides.library.columbia.edu/c.php?g=518800&p=3593167
[11] Zickel, E. Basic Guidelines for Research in Academic Databases. Middle Tennessee State University. https://mtsu.pressbooks.pub/1020mtsu/chapter/basic-guidelines-for-academic-research-database-searches/
[12] Hofstra University Library. SOM NIH Data Management and Sharing Policy: Choosing a Repository. Hofstra University. https://libguides.hofstra.edu/c.php?g=1275561&p=9363881
[13] Hofstra University Library. SOM NIH Data Management and Sharing Policy: Choosing a Repository. Hofstra University. https://libguides.hofstra.edu/c.php?g=1275561&p=9363881