5 Advanced Data Mining Techniques for Insights in Medical Health Databases
Introduction
Data mining refers to “the use of machine learning and statistical analysis to uncover patterns and other valuable information from large datasets”.[1] In other words, it is primarily used for either describing the dataset in question or predicting results by utilizing machine learning algorithms. Data mining is often performed through the use of programming languages such as R, Python, or SQL, and may be used with both unstructured and structured datasets.
Like any modern innovation, data mining offers both benefits and risks. Benefits include the ability to detect previously hidden patterns and insights, as well as increased efficiency.[2] Data mining also introduces certain liabilities into the data analysis process: incorrectly mined data can yield dangerous or misleading results, data mining can be costly, and a degree of uncertainty is “baked into” the process. Data mining can be used in a variety of fields; this article focuses on data mining with medical health data. More specifically, we offer not only an overview of the principles of data mining in this type of database, but also some advanced techniques for your data mining endeavors.
How to Use Basic Techniques
On a basic level, data mining typically includes the following five main steps: identifying objectives, choosing the data to mine, preparing the chosen data, building the data model, and finally detecting and evaluating patterns.[1]
Objectives Identification
Identifying objectives ahead of time is perhaps one of the most important steps of the process.[3] Before diving into the data, those performing the data mining should define the problem they are trying to solve. By doing so, data questions and parameters are much more likely to have a clear purpose and direction.
Choosing Data
Once a team has identified the objectives for its data mining, it will then choose the data to use.[4] With the problem and objectives defined, data scientists have a clearer direction when choosing which dataset to utilize. In the case of medical health data, possible sources of relevant datasets include the National Institutes of Health, Healthdata.gov, the University of Illinois Chicago, and the CDC, just to name a few. After the team has chosen its data, it can work with the organization’s IT representative to decide how to store the data most safely.
Preparing Data
Preparing data is also known as “cleaning” it: removing “noise” from the dataset, such as missing data, duplicates, and outliers.[5] Data scientists may also find that they need to reduce the number of variables they use for data mining, which can simplify the process. When doing so, they should take care to keep the most crucial predictors in order to maximize the accuracy of the data model. As stated by IBM, “Responsible data science means thinking about the model beyond the code and performance.”
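The cleaning steps described above can be sketched in a few lines of Python. This is a minimal illustration only: the records, the field names (patient_id, age, systolic_bp), and the plausible range of 50 to 300 are all invented assumptions, not values from any real dataset or clinical guideline.

```python
# Hypothetical patient records; field names and values are invented for illustration.
records = [
    {"patient_id": 1, "age": 34, "systolic_bp": 118},
    {"patient_id": 2, "age": 51, "systolic_bp": None},  # missing value
    {"patient_id": 1, "age": 34, "systolic_bp": 118},   # exact duplicate of the first row
    {"patient_id": 3, "age": 47, "systolic_bp": 131},
    {"patient_id": 4, "age": 62, "systolic_bp": 900},   # implausible reading (outlier)
    {"patient_id": 5, "age": 29, "systolic_bp": 122},
]


def clean(rows, low=50, high=300):
    """Drop exact duplicates, rows with missing values, and out-of-range readings.

    The low/high bounds are assumed for this sketch, not clinical thresholds.
    """
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the whole row
        if key in seen:
            continue  # duplicate
        seen.add(key)
        if any(value is None for value in row.values()):
            continue  # missing data
        if not (low <= row["systolic_bp"] <= high):
            continue  # outlier
        cleaned.append(row)
    return cleaned


cleaned_records = clean(records)
print([row["patient_id"] for row in cleaned_records])  # prints [1, 3, 5]
```

In practice, whether to drop or impute missing values and how to define an outlier are analytic decisions that depend on the objectives identified earlier.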
Building the Data Model and Mining Patterns
Data mining techniques can be categorized into three modeling groups: descriptive modeling (e.g., clustering analysis), predictive modeling, and prescriptive modeling.[6] Depending on the kind of analysis they employ in their model(s), data analysts may assess trends or data relationships of interest. For example, they might examine detected correlations, sequential patterns, or association rules. While high-frequency patterns may be more generalizable, deviations in the data often attract more interest among data scientists.
Perhaps the most well-known recent use of data modeling is that used to predict the trajectory of COVID-19 infections.[7] [8] In July 2020, the Government Accountability Office (GAO) published a series of data models designed to help predict the spread of COVID-19 and expected deaths resulting from the virus.[9] The GAO explained that forecasting models are useful for predicting health trends such as rates of infection and/or mortality.[10]
Result Evaluation and Knowledge Implementation
Once all the data are collected and aggregated, analysts can then interpret and evaluate the results.[11] These results can then be converted into a data visualization and presented in meetings and conferences. Should the findings be relevant, credible, engaging, and novel, they may be applied in real-world settings. In the case of COVID-19 models, these results were used to influence COVID-related policies, which evolved along with the changing nature of coronavirus.[12] However, the GAO cautions that data models may rely on data that has been collected and reported in different ways, which may make it difficult to compare data, even if said data is similar at a basic level (e.g., multiple COVID-19 datasets collected by different organizations).[13]
Common Data Mining Techniques
Some of the most common data mining techniques consist of classification, association rule learning, affinity grouping, clustering analysis, and anomaly/outlier detection.[14] We will briefly discuss each below, within the context of COVID.
Classification Analysis
Classification analysis refers to the data mining technique wherein data scientists assign data points to various classes or groups.[15] In other words, data are categorized and grouped together according to shared characteristics.[16] A classification analysis for COVID-19 data might involve classifying COVID-19 symptoms as “mild” or “severe”.
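As a rough illustration of assigning data points to classes, the sketch below uses a nearest-centroid rule, one simple classification approach among many. The two features (body temperature in degrees Celsius and oxygen saturation in percent), the training values, and the labels are invented for this example and carry no clinical meaning.

```python
from math import dist

# Synthetic training examples: (temperature_c, oxygen_saturation_pct) -> label.
training = [
    ((37.2, 98.0), "mild"),
    ((37.8, 97.0), "mild"),
    ((38.0, 96.0), "mild"),
    ((39.5, 90.0), "severe"),
    ((39.1, 88.0), "severe"),
    ((40.0, 86.0), "severe"),
]


def centroids(samples):
    """Average the feature vectors of each class."""
    sums = {}
    for features, label in samples:
        total, count = sums.get(label, ([0.0] * len(features), 0))
        sums[label] = ([t + f for t, f in zip(total, features)], count + 1)
    return {label: tuple(t / n for t in total) for label, (total, n) in sums.items()}


def classify(features, class_centroids):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(class_centroids, key=lambda label: dist(features, class_centroids[label]))


class_centroids = centroids(training)
print(classify((39.3, 89.0), class_centroids))  # prints severe
print(classify((37.5, 97.5), class_centroids))  # prints mild
```

A real classifier would normally scale the features first and be evaluated on held-out data; both steps are omitted here for brevity.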
Association Rule Learning
Data analysts use this technique to identify potential relationships between data points.[17] Specifically, an association rule uses an “if/then” method to detect potential relationships (associations) between data variables.[1] Once analysts identify variable relationships, they measure the strength of those associations using support and confidence.[18] In relation to COVID data, data scientists might identify a relationship between employment as an essential worker and developing COVID.
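Support and confidence have simple definitions: support is the share of records that contain every item in a rule, and confidence is the share of records containing the “if” part that also contain the “then” part. The sketch below computes both for the rule “if essential worker, then COVID-positive”. The five records and the attribute names are made up for illustration.

```python
# Hypothetical records: each set lists attributes observed for one person.
transactions = [
    {"essential_worker", "covid_positive"},
    {"essential_worker", "covid_positive"},
    {"essential_worker"},
    {"covid_positive"},
    {"remote_worker"},
]


def support(itemset, rows):
    """Fraction of rows that contain every item in the itemset."""
    return sum(itemset <= row for row in rows) / len(rows)


def confidence(antecedent, consequent, rows):
    """Of the rows matching the antecedent, the fraction that also match the consequent."""
    return support(antecedent | consequent, rows) / support(antecedent, rows)


rule_support = support({"essential_worker", "covid_positive"}, transactions)
rule_confidence = confidence({"essential_worker"}, {"covid_positive"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
# prints support=0.40, confidence=0.67
```

Here two of the five records contain both attributes (support 0.40), and two of the three essential workers are COVID-positive (confidence about 0.67). An association of this kind describes co-occurrence in the data and does not by itself establish cause.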
Affinity Grouping
Similar to association rule learning is affinity grouping, in which data scientists analyze data for significant relationships.[19] In the context of COVID, affinity grouping might involve analysts looking for associations between COVID symptoms and comorbidities in people with COVID.
Clustering Analysis and Anomaly/Outlier Detection
Clustering analysis refers to grouping similar records together so that potential outliers or anomalies are easier to identify.[20] Conversely, data scientists who are seeking out unusual data may implement the anomaly or outlier detection technique.[21] As its name implies, this method refers to identifying data that does not conform to an established pattern.[22] One way in which these methods are related to COVID is that the development of a new COVID strain might be considered an “anomaly” or an “outlier” detected among the typical COVID strains; the identification of a new COVID variant (anomaly) would likely be facilitated by a cluster analysis.
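One common way to flag data that does not conform to an established pattern is a z-score rule: a value is treated as an outlier when it lies more than a chosen number of standard deviations from the mean. The sketch below applies that rule to invented daily case counts. The cutoff of 2 is an assumption for this example, and the sketch illustrates outlier detection only, not clustering.

```python
from statistics import mean, stdev

# Synthetic daily case counts; the final value is deliberately unusual.
daily_cases = [102, 98, 105, 99, 101, 97, 103, 250]


def find_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold.

    The threshold of 2.0 is an illustrative choice, not a standard.
    """
    center = mean(values)
    spread = stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - center) / spread > threshold]


outliers = find_outliers(daily_cases)
print(outliers)  # prints [250]
```

Because a single extreme value inflates both the mean and the standard deviation, rules based on the median are often preferred for small samples.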
5 Advanced Techniques
Advanced data mining techniques are predictive models[23] that enable data scientists to assess the nature of potential future trends; the most advanced models can make rapid in-vivo predictions[24] (e.g., the aforementioned CDC predictions of the COVID-19 trajectory). Specifically, we will discuss the following advanced data mining techniques: regression, neural networks, natural language processing, dimensionality reduction, and decision trees.
Regression
Data analysts use regression analysis in their data mining work to understand which of a dataset’s variables are the most important, which variables they can ignore, and how these variables interact with one another.[25] That is, regression helps analysts determine which data matters most and which of it matters least.[26] Data scientists using regression to analyze COVID data might investigate whether getting a COVID vaccine predicts whether a person gets a COVID booster shot.
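To make the idea concrete, the sketch below fits a simple linear regression (one predictor, ordinary least squares) by hand. The variables are hypothetical: x stands for age in decades and y for recovery time in days, with values invented for illustration. The fitted slope indicates how strongly the outcome changes with the predictor.

```python
# Synthetic data: x = age in decades, y = recovery time in days (invented values).
ages = [1, 2, 3, 4, 5]
recovery_days = [2.1, 3.9, 6.2, 7.8, 10.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ages, recovery_days)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
# prints slope=1.97, intercept=0.09
```

A binary outcome, such as whether a person receives a booster shot, would normally call for logistic rather than linear regression; the linear case is shown because its arithmetic is easiest to follow.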
Neural Networks and Natural Language Processing
Neural networks are computer programs that learn and make predictions by detecting patterns.[27] In fact, neural networks are so named because they work similarly to the neurons found in the human brain;[28] some examples include convolutional neural networks (CNNs), recurrent neural networks (RNNs), feedforward neural networks, and autoencoder neural networks.[29]
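The building block of these networks is a single artificial neuron: it weights its inputs, adds a bias, and passes the result through an activation function, and it learns by adjusting the weights to reduce its prediction error. The sketch below trains one such neuron with gradient descent on a toy pattern (the logical AND of two inputs). The data, learning rate, and number of epochs are assumptions chosen for illustration; real networks stack many neurons into layers.

```python
from math import exp

# Toy training data (logical AND): two binary inputs, one binary label.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def predict(x, weights, bias):
    """Output of one neuron: sigmoid of the weighted sum plus bias."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(xs, ys, epochs=5000, learning_rate=0.5):
    """Full-batch gradient descent on the log loss, starting from zero weights."""
    weights = [0.0, 0.0]
    bias = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = predict(x, weights, bias) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


weights, bias = train(inputs, labels)
predictions = [round(predict(x, weights, bias)) for x in inputs]
print(predictions)
```

After training, the rounded outputs should match the labels, since this pattern is linearly separable.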
Perhaps the most well-known application of neural networks is ChatGPT, which is built on a transformer neural network developed by OpenAI.[30] ChatGPT also relies on natural language processing (NLP). Its NLP abilities allow it to understand context (e.g., carry a coherent conversation with users), understand a variety of languages and language formats, manage different text types, adjust tone, and learn from interactions. According to one data scientist and consultant: “In the future, we can use [neural networks] to give doctors a second opinion – for example, if something is cancer, or what some unknown problem is”.[31] Neural networks and NLP are thus quite relevant for a variety of health topics.
Dimensionality Reduction
Dimensionality reduction refers to the data mining method employed when data analysts wish to convert data from a high-dimensional space to a low-dimensional one.[32] In other words, this technique is used to make large amounts of data more manageable so that an analyst can conduct data analysis more effectively; this might consist of removing irrelevant variables that are not necessary for the analyses the data scientist wants to conduct. Two methods of dimensionality reduction are Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE): PCA is a mathematical procedure that reduces dimension while preserving as much variability as possible, and t-SNE is a statistical method used to create data visualizations of large datasets.
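A small worked example can show what “preserving as much variability as possible” means. The sketch below performs PCA on two-dimensional points and reduces them to one dimension, using the closed-form eigenvalues of a 2x2 covariance matrix so that no external library is needed. The five points are synthetic and were chosen to lie close to a straight line; production code would typically rely on a numerical library rather than this hand-rolled version.

```python
from math import sqrt

# Synthetic 2-D measurements that vary mostly along one direction.
points = [(2.0, 1.9), (4.0, 4.1), (6.0, 6.2), (8.0, 7.8), (10.0, 10.0)]


def first_principal_component(data):
    """Return (unit axis, 1-D projections, share of variance explained)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    centered = [(x - mean_x, y - mean_y) for x, y in data]
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(dx * dx for dx, _ in centered) / (n - 1)
    c = sum(dy * dy for _, dy in centered) / (n - 1)
    b = sum(dx * dy for dx, dy in centered) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    spread = sqrt(((a - c) / 2) ** 2 + b ** 2)
    largest, smallest = half_trace + spread, half_trace - spread
    # Eigenvector belonging to the largest eigenvalue.
    if b != 0:
        vx, vy = b, largest - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = sqrt(vx * vx + vy * vy)
    axis = (vx / norm, vy / norm)
    projected = [dx * axis[0] + dy * axis[1] for dx, dy in centered]
    explained = largest / (largest + smallest)
    return axis, projected, explained


axis, projected, explained = first_principal_component(points)
print(f"variance explained by one component: {explained:.4f}")
```

For these points, a single component retains more than 99% of the variance, so the second dimension could be dropped with little loss of information.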
Decision Trees
As their name suggests, decision trees are an advanced data mining technique involving diagrams that are a) tree-shaped and b) contain “branches” that each hold a probable outcome.[33] Like the aforementioned neural networks and regression techniques, the decision tree method is a predictive data mining model.[34] With a decision tree, also known as a “tree induction model”, data scientists can categorize data points according to their attributes,[35] thus diagramming the potential outcomes of a given decision.[1] Decision trees are relevant to COVID-19 data in that analysts may use them to predict COVID-19 case severity among people of different demographics, medical histories, etc. That is, this data mining method can help healthcare professionals identify people who are at high risk for severe COVID symptoms and allocate resources accordingly.
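Each branch point in a decision tree is a question about one attribute, chosen so that the resulting groups are as pure as possible. The sketch below finds a single such split (a one-level tree, sometimes called a decision stump) using Gini impurity, one common purity measure. The ages and severity labels are invented for illustration; a full tree would repeat this search recursively on each branch.

```python
# Synthetic cases: (age_in_years, severity_label); values are invented.
cases = [
    (25, "mild"), (32, "mild"), (41, "mild"),
    (58, "severe"), (67, "severe"), (74, "severe"),
]


def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows):
    """Try midpoints between sorted values; return (threshold, weighted impurity)."""
    values = sorted({value for value, _ in rows})
    best = (None, float("inf"))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [label for value, label in rows if value <= threshold]
        right = [label for value, label in rows if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best[1]:
            best = (threshold, score)
    return best


def predict(age, threshold):
    """Labels are hard-coded to match the synthetic data above."""
    return "mild" if age <= threshold else "severe"


threshold, impurity = best_split(cases)
print(threshold, impurity)       # prints 49.5 0.0
print(predict(63, threshold))    # prints severe
```

The perfect split here is an artifact of the toy data; real patient data would rarely separate this cleanly on a single attribute.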
Conclusion
Advanced data mining techniques have proven essential in navigating health-related challenges, such as those posed by the COVID-19 pandemic, by significantly enhancing the management and analysis of medical health data. Techniques such as regression analysis, neural networks, natural language processing, dimensionality reduction, and decision trees have extracted vital insights from extensive datasets, improving predictions for COVID-19 case trajectories and mortality rates.
Regression analysis has identified critical variables affecting the spread and severity of COVID-19, aiding in the design of targeted public health interventions. Neural networks, with their rapid data processing capabilities, have offered predictive insights that have been crucial for timely and effective public health responses. Natural language processing has played a key role in analyzing unstructured data, such as clinical notes and social media, providing a richer epidemiological understanding and aiding in real-time surveillance and public sentiment analysis. Dimensionality reduction techniques have made complex COVID-19 data more accessible and interpretable, aiding health analysts in focusing on the most relevant information for decision-making. Decision trees have helped visualize the progression and impact of the virus across different demographics, enabling tailored patient care and resource allocation.
These tools are also likely to influence future public health policies by providing a deeper understanding of disease patterns and treatment outcomes. By continuing to advance our data mining capabilities, healthcare systems can become more proactive rather than reactive, optimizing health outcomes on a global scale and building more resilient public health infrastructures. As we navigate the evolving landscape of healthcare challenges, the strategic implementation of advanced data mining will likely play a critical role in shaping effective and efficient healthcare interventions and strategies.
Take Away
The transformative potential of data mining in healthcare is evident in the claim that “we can use [neural networks] to give doctors a second opinion – for example, if something is cancer, or what some unknown problem is.” This underscores how neural networks revolutionize diagnostics, as seen during the COVID-19 pandemic, where they enhanced decision-making and potentially saved lives through more accurate, timely insights.
[1] Holdsworth, J. What is data mining? IBM. https://www.ibm.com/topics/data-mining
[2] Holdsworth, J. What is data mining? IBM. https://www.ibm.com/topics/data-mining
[3] Holdsworth, J. What is data mining? IBM. https://www.ibm.com/topics/data-mining
[4] Holdsworth, J. What is data mining? IBM. https://www.ibm.com/topics/data-mining
[5] Holdsworth, J. What is data mining? IBM. https://www.ibm.com/topics/data-mining
[6] SAS. Data Mining. https://www.sas.com/en_us/insights/analytics/data-mining.html
[7] Centers for Disease Control and Prevention. CDC COVID-19 Cases and Deaths Ensemble Forecast Archive. https://data.cdc.gov/Models/CDC-COVID-19-Cases-and-Deaths-Ensemble-Forecast-Ar/ci7c-73kg/about_data
[8] U.S. Government Accountability Office. COVID-19: Data Quality and Considerations for Modeling and Analysis. https://www.gao.gov/products/gao-20-635sp
[9] U.S. Government Accountability Office. COVID-19: Data Quality and Considerations for Modeling and Analysis. https://www.gao.gov/products/gao-20-635sp
[10] U.S. Government Accountability Office. COVID-19: Data Quality and Considerations for Modeling and Analysis. https://www.gao.gov/products/gao-20-635sp
[11] Holdsworth, J. What is data mining? IBM. https://www.ibm.com/topics/data-mining
[12] Berman Institute of Bioethics. COVID-19 Modeling. Johns Hopkins University. https://bioethics.jhu.edu/research-and-outreach/covid-19-bioethics-expert-insights/resources-for-addressing-key-ethical-areas/grappling-with-the-ethics-of-social-distancing/covid-19-modeling/
[13] Centers for Disease Control and Prevention. CDC COVID-19 Cases and Deaths Ensemble Forecast Archive. https://data.cdc.gov/Models/CDC-COVID-19-Cases-and-Deaths-Ensemble-Forecast-Ar/ci7c-73kg/about_data
[14] Rutgers University Bootcamps. What is Data Mining? A Beginner’s Guide (2022). Rutgers University. https://bootcamp.rutgers.edu/blog/what-is-data-mining/
[15] Rutgers University Bootcamps. What is Data Mining? A Beginner’s Guide (2022). Rutgers University. https://bootcamp.rutgers.edu/blog/what-is-data-mining/
[16] Holdsworth, J. What is data mining? IBM. https://www.ibm.com/topics/data-mining
[17] Rutgers University Bootcamps. What is Data Mining? A Beginner’s Guide (2022). Rutgers University. https://bootcamp.rutgers.edu/blog/what-is-data-mining/
[18] Holdsworth, J. What is data mining? IBM. https://www.ibm.com/topics/data-mining
[19] Transforming Data with Intelligence. How Machine-Learning Techniques Use Methods (Part 2 in a Series). https://tdwi.org/articles/2016/03/16/machine-learning-techniques-methods.aspx
[20] DX Adobe. Data Mining. Adobe Experience Cloud Blog. https://business.adobe.com/blog/basics/data-mining
[21] Rutgers University Bootcamps. What is Data Mining? A Beginner’s Guide (2022). Rutgers University. https://bootcamp.rutgers.edu/blog/what-is-data-mining/
[22] Rutgers University Bootcamps. What is Data Mining? A Beginner’s Guide (2022). Rutgers University. https://bootcamp.rutgers.edu/blog/what-is-data-mining/
[23] SAS. Data Mining. https://www.sas.com/en_us/insights/analytics/data-mining.html
[24] Holdsworth, J. What is data mining? IBM. https://www.ibm.com/topics/data-mining
[25] Rutgers University Bootcamps. What is Data Mining? A Beginner’s Guide (2022). Rutgers University. https://bootcamp.rutgers.edu/blog/what-is-data-mining/
[26] Holdsworth, J. What is data mining? IBM. https://www.ibm.com/topics/data-mining
[27] SAS. Data Mining. https://www.sas.com/en_us/insights/analytics/data-mining.html
[28] SAS. Artificial Neural Networks. https://www.sas.com/en_us/insights/analytics/neural-networks.html
[29] SAS. Artificial Neural Networks. https://www.sas.com/en_us/insights/analytics/neural-networks.html
[30] ChatGPT4. https://chatgpt.com/c/99d2134c-471c-4a90-af07-b96182d0c09a
[31] SAS. Artificial Neural Networks. https://www.sas.com/en_us/insights/analytics/neural-networks.html
[32] Berga, M. & Ochman, A. Advanced Analytics and the Top 6 Data Mining Techniques. Imaginary Cloud. https://www.imaginarycloud.com/blog/data-mining-techniques/#advanced
[33] SAS. Data Mining. https://www.sas.com/en_us/insights/analytics/data-mining.html
[34] SAS. Data Mining. https://www.sas.com/en_us/insights/analytics/data-mining.html
[35] Olson, D. L. & Delen, D. Advanced Data Mining Techniques.