5 Best Practices for Data Cleaning and Preprocessing in Independent Research Projects
Introduction
Achieving data integrity is, in part, a result of cleaning and preprocessing data before ever running the first statistical analyses to test research questions and hypotheses. While data cleaning and data preprocessing are technically separate terms, data cleaning is nested within the broader set of steps involved in preprocessing. Data cleaning itself comprises techniques such as data removal and data imputation; we will dive deeper into these approaches later on.
A critical part of this process involves addressing what is often referred to as “dirty data.” The three most common types of dirty data include incomplete data (missing or null values), inconsistent data (conflicting formats or representations), and inaccurate data (incorrect or outdated information). Recognizing these categories helps researchers design appropriate data cleaning strategies to enhance the reliability of their analyses.
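To make these three categories concrete, here is a minimal sketch of how each might be flagged programmatically. The field names and plausibility thresholds are hypothetical, chosen only for illustration:

```python
import re

# Hypothetical survey rows: each dict is one record.
rows = [
    {"id": 1, "age": 34, "signup_date": "2023-05-01"},
    {"id": 2, "age": None, "signup_date": "05/14/2023"},  # incomplete + inconsistent
    {"id": 3, "age": 212, "signup_date": "2023-06-09"},   # inaccurate (implausible age)
]

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def audit(row):
    """Return a list of dirty-data flags for one record."""
    flags = []
    if row["age"] is None:
        flags.append("incomplete: missing age")
    if not ISO_DATE.match(row["signup_date"]):
        flags.append("inconsistent: non-ISO date format")
    if row["age"] is not None and not (0 <= row["age"] <= 120):
        flags.append("inaccurate: implausible age")
    return flags

for row in rows:
    print(row["id"], audit(row))
```

Even a simple audit like this makes the later cleaning decisions explicit: each flag maps to one of the three dirty-data categories described above.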
More broadly, in this article, we will focus on data cleaning/preprocessing best practices and practical advice for those engaging in independent research projects, whether it is an independent scholar conducting the research or a university faculty member.
Why Clean Data Matters
Given the abundance of opportunities for data analysis in the modern age, ensuring that anyone interacting with data uses the highest-quality version is imperative. This is especially true for researchers, whether a tenured university professor, an independent researcher, or something in between: researchers not only work with data but also embark on analytical investigations with the goal of disseminating findings intended to shape the course of future research, policy, and/or applied practice in the professional world. As such, scholars carry an unparalleled burden to prioritize only the highest-quality data.
Beyond the scholarly world, data cleaning and preprocessing also matters because it minimizes the risk of data errors that may frustrate both employees and clients and provides an opportunity to map out the purpose and functions of the data.[1] It is also a means of anticipating, assessing, and accounting for any errors that may arise.[2] In short, taking the time to ensure clean data generally makes everyone’s lives easier in the long run: researchers, clients, and employees alike.
5 Best Practices for Data Cleaning and Preprocessing
- Understand the Characteristics of the Data
Prior to commencing any data cleaning or preprocessing, researchers are most likely to maximize data quality by first thoroughly reviewing the dataset(s) in question to ensure comprehension of the data’s nature.[3] For example, researchers wishing to conduct text analysis should take into account the nature of the text they will be cleaning and analyzing: text sourced from social media posts will likely contain contemporary and informal characteristics such as slang, emojis, and abbreviations (e.g., “fr” for “for real”).[4]
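As a small illustration of this kind of text-aware cleaning, the sketch below lowercases a post, strips emoji (crudely, by dropping non-ASCII characters), and expands a few abbreviations. The slang dictionary is hypothetical; a real project would use a curated lexicon:

```python
import re

# Hypothetical slang map; real projects would use a curated lexicon.
SLANG = {"fr": "for real", "brb": "be right back", "imo": "in my opinion"}

def clean_post(text):
    """Normalize an informal social media post for text analysis."""
    text = text.lower()
    # Drop characters outside the basic ASCII range (crude emoji removal).
    text = text.encode("ascii", errors="ignore").decode()
    # Expand known abbreviations token by token.
    tokens = [SLANG.get(tok, tok) for tok in re.findall(r"[a-z']+", text)]
    return " ".join(tokens)

print(clean_post("This dataset is huge fr 🔥"))
```

The point is not the specific transformations but that they follow from having reviewed the data first: a researcher who had not noticed the slang and emoji would not know these steps were needed.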
Having a preliminary grasp on the data is likely to enhance researcher knowledge of what steps need to be taken to increase data integrity. Without this initial review, effectively cleaning and preprocessing the data may prove difficult for the researcher(s) pursuing such a task.
- Set Clear Objectives and Responsibilities
Once the researcher(s) who will be involved in the data analysis are well acquainted with all of the data to be utilized, a data cleaning plan should be formed to increase efficiency and analysis quality. Indeed, before any further best practices can be observed, scholars and their research will benefit from first defining clear objectives and responsibilities.
Objective setting should include identifying what parts of the dataset(s) need to be cleaned and/or preprocessed;[5] this may include addressing issues such as inconsistencies within the data, “noise” (i.e., irrelevant or unnecessary data such as that collected from unqualified participants), missing data, etc.[6] Furthermore, this part of the data cleaning process is an ideal time to identify which tools would be most helpful and reliable in achieving the highest data quality possible.
After the objectives for data cleaning and preprocessing are established, research teams will then benefit from clearly identifying which team member is responsible for which objective(s);[7] having a set plan will support equitable divisions of responsibility, timeliness of task completion, and accountability between research team members. Furthermore, dividing up responsibilities can help ensure that each researcher is responsible for the task for which they are best equipped.[8] Of course, if the analysis is performed by a researcher engaging in a truly independent (i.e., solo) project, there will likely be no doubt regarding who is responsible for each aspect of data cleaning and preprocessing that is needed.
- Validate and Correct Data (as needed)
A significant part of data cleaning and preprocessing consists of ensuring that the data in question is as valid and accurate as possible; this includes removing irrelevant or duplicate observations, addressing structural errors in the data, handling outliers, and any other process that a particular dataset may require.[3] One way a researcher might validate and correct data, as necessary, is through validation checks.[9]
Prior to beginning a validation check, we recommend that researchers verify that the data they wish to use are well-suited for validation. Examples of data that are most relevant to validation checks are numeric data, textual data, dates and times, categorical data, Boolean (dichotomous) data, unique identifier data such as IDs or order numbers, and geospatial data like addresses or coordinates.[10] Once it has been determined that validation checks are relevant to a particular set of data, scholars may choose from a variety of techniques.
Common approaches to data validation include the following: data type validation, range validation, list validation, and pattern matching validation.[11] While these techniques differ, each compares (i.e., validates) the data being evaluated against some kind of reference.[12] Once invalid cases are identified by these validation checks, researchers can make the adjustments necessary to achieve data validity. For a real-world example of what data validation might look like, the Teacher Incentive Allotment provides a review of how it validates the data used to assess statewide student performance across Texas.[13] More generalized examples of data validation can be found in the BBC’s “Bitesize” units.[14]
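The four common check types can be sketched in a few lines each. The record below is hypothetical, and the email pattern is deliberately simplified for illustration:

```python
import re

# Each validator compares a value against a reference, per check type.
def type_check(value, expected_type):
    return isinstance(value, expected_type)

def range_check(value, low, high):
    return low <= value <= high

def list_check(value, allowed):
    return value in allowed

def pattern_check(value, pattern):
    return re.fullmatch(pattern, value) is not None

# Hypothetical record from a participant survey.
record = {"age": 29, "country": "US", "email": "jo@example.com"}

checks = [
    type_check(record["age"], int),                              # data type validation
    range_check(record["age"], 18, 99),                          # range validation
    list_check(record["country"], {"US", "CA", "MX"}),           # list validation
    pattern_check(record["email"], r"[^@\s]+@[^@\s]+\.[a-z]+"),  # pattern matching
]
print(all(checks))  # True only when the record passes every check
```

Note that every validator takes a reference (a type, a range, an allowed list, a pattern), which mirrors how the techniques are described above: validation is always a comparison against some standard.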
- Address Any Missing Data (as necessary)
Like validation, addressing missing data within a given dataset is imperative. Missing data, once identified, may be handled through imputation or another appropriate method for addressing missingness.[15] Without taking the time to check for and handle missing data, researchers increase the risk of conducting an incomplete and inaccurate analysis.[16] Missing data can also bias results.[17] That said, most (if not all) research is inherently biased, simply because non-objective human beings conduct it.[18][19][20]
As observed by the Center for Naval Analyses: “Some data bias always will occur when operating in real time and in the real world, rather than under laboratory conditions… it is human nature to empathize with the planners and participants of an experiment, exercise, or wargame, leading data collectors to leave out points they feel should not ‘count,’ so that the event succeeds.

The bias problem only grows when there is no right answer. ‘How many ships does the Navy have?’ may be quantitative, but it is also squishy. Is a submarine a ship? How about the USS Constitution? Include the National Defense Reserve Fleet? The definition of ‘ship’ will bias the data one way or the other, producing a smaller or larger number depending on the choices the data analyst makes—consciously or unconsciously.”[21]
After all, according to a data scientist at the University of California, Berkeley, data is “always cooked”[22] since humans bake their biases directly into the data lifecycle;[23] opportunities for human bias to permeate “objective” research include study conceptualization, data collection, and data interpretation.[23] The role of researchers, then, is to account for and mitigate bias where possible rather than to make claims of conducting “bias-free” research.[24]
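Returning to the practical side of missingness, here is a minimal sketch of simple mean imputation over hypothetical measurements; the literature cited in [15] covers more rigorous approaches (e.g., multiple imputation), and mean imputation itself can understate variability, so treat this as a starting point only:

```python
from statistics import mean

# Hypothetical reaction-time measurements with missing entries (None).
values = [412, None, 388, 401, None, 395]

observed = [v for v in values if v is not None]
fill = mean(observed)  # simple mean imputation; richer methods exist

imputed = [v if v is not None else fill for v in values]
print(imputed)
```

Whatever method is chosen, the key discipline is the same: find the missing values deliberately, decide how to handle them, and record that decision.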
- Track Changes
The final best practice of data cleaning and preprocessing that we will discuss in this article is tracking changes;[25][26] this allows researchers to retrace the steps taken to clean the data.
Version control, or the process of tracking and managing changes to code, also enhances reproducibility, as every modification is stored and accessible.[27] This transparency allows others to track changes, collaborate, and revert to previous versions easily, enhancing efficiency and accuracy. For example, scholars drafting the methods section of an article they intend to submit for publication will likely have an easier time doing so if they have kept careful track of the data cleaning and preprocessing procedures they have undertaken over the course of the project.
While particularly important for collaborative research projects, even researchers pursuing solo projects benefit from tracking the changes they make. Whether working alone or with colleagues, maximizing replicability and efficiency—and by extension quality—is likely going to be in one’s best interest.
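Version control systems such as Git handle change tracking for code, but cleaning steps themselves can also be logged in-script. The sketch below (hypothetical step names and data) records what each step did and when, producing exactly the kind of audit trail a methods section needs:

```python
import json
from datetime import datetime, timezone

log = []

def tracked(step_name):
    """Decorator that records each cleaning step as it runs."""
    def wrap(fn):
        def inner(data):
            result = fn(data)
            log.append({
                "step": step_name,
                "rows_before": len(data),
                "rows_after": len(result),
                "at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return inner
    return wrap

@tracked("drop duplicates")
def drop_duplicates(rows):
    return list(dict.fromkeys(rows))  # preserves first-seen order

@tracked("drop missing")
def drop_missing(rows):
    return [r for r in rows if r is not None]

data = ["a", "b", "b", None, "c"]
data = drop_missing(drop_duplicates(data))
print(json.dumps(log, indent=2))
```

A log like this, committed alongside the cleaning script, lets a solo researcher (or a collaborator) reconstruct the exact sequence of transformations months later.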
The Ultimate Goal: High-Quality Data
The five best practices discussed in this article are, of course, a means rather than an end: each is meant to help researchers achieve the highest-quality data they can before disseminating their findings. According to one source, the components of high-quality data consist of validity, completeness, consistency, uniformity, and accuracy.[28]
While these criteria may seem redundant, each component plays an invaluable role in maximizing data integrity. The first measure of high-quality data is validity, or how well the data in question meet established standards. Next, completeness refers to whether the dataset has missing data and, if so, whether it has been handled. Consistency assesses whether data are reliably represented across the entire dataset(s) that researchers plan to analyze. Distinct from consistency is uniformity, or the degree of measurement standardization across the data; for example, if one datapoint uses the metric system, then all datapoints should use the same system (i.e., rather than some measurements being in meters and others in feet). Perhaps the crowning characteristic of high-quality data is accuracy: it might be said that accuracy is achieved when all of the other qualities associated with data integrity are well accounted for. That is, accurate data are valid, complete, consistent, and uniform. Together, these characteristics maximize data quality.
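The meters-versus-feet example can be made concrete with a short unit-standardization pass. The sites and values are hypothetical; the conversion factor (1 ft = 0.3048 m) is exact:

```python
# Uniformity in practice: standardize all length measurements to meters.
FEET_TO_M = 0.3048  # exact conversion factor

measurements = [
    {"site": "A", "value": 12.0, "unit": "m"},
    {"site": "B", "value": 40.0, "unit": "ft"},  # mixed units break uniformity
]

def to_meters(m):
    value = m["value"] * FEET_TO_M if m["unit"] == "ft" else m["value"]
    return {"site": m["site"], "value": round(value, 3), "unit": "m"}

uniform = [to_meters(m) for m in measurements]
print(uniform)
```

After a pass like this, every downstream comparison or aggregate is computed on a single measurement system, which is what the uniformity criterion asks for.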
Conclusion
Data cleaning and preprocessing clearly play an important role in producing meaningful research results; they are best viewed not as obstacles but as opportunities to better understand datasets and refine analytical approaches. We conclude this article with some final tips and insights for those embarking on an independent research project (or two!).
First, we recommend that independent researchers leverage open-access resources available to them (if they haven’t already). These tools may include libraries, tutorials and open-access data and research articles; taking advantage of such resources may streamline not just data cleaning but also the entire research process. In the same vein, maximizing efficiency through the use of relevant programming languages, such as Python or R, can help automate repetitive tasks and reduce human error. Even with the most thorough methods, however, achieving perfect data integrity may not be feasible due to human error, evolving data sources, and contextual ambiguities.
We also encourage researchers to practice caution by making changes only to a copy of the data, preserving the original raw file for reference or rollback if needed. This preventive step can help minimize the impact of mistakes introduced during the cleaning process. Researchers further could benefit from adopting a mindset of continuous improvement, expecting to revisit and refine their datasets as new challenges emerge.
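The copy-first habit can be baked directly into a cleaning script. The paths below are hypothetical placeholders for a project layout; the essential idea is that the raw file is read-only by convention and only the working copy is ever modified:

```python
import shutil
from pathlib import Path

# Hypothetical locations; adjust to your project layout.
RAW = Path("data/raw/survey.csv")
WORK = Path("data/working/survey.csv")

def checkout_working_copy(raw=RAW, work=WORK):
    """Copy the raw file to a working path; the raw file is never edited."""
    work.parent.mkdir(parents=True, exist_ok=True)
    if not work.exists():  # don't clobber an in-progress working copy
        shutil.copy2(raw, work)  # copy2 also preserves file timestamps
    return work
```

If a cleaning step goes wrong, recovery is then a matter of deleting the working copy and checking out a fresh one, rather than hunting for an uncorrupted backup.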
Finally, consider how your data preparation efforts align with the broader goals of your research. Data cleaning and preprocessing are essential steps for ensuring high-quality data in research. By following best practices such as understanding the data, setting clear objectives, validating and correcting data, addressing missing data, and tracking changes, researchers can significantly enhance the integrity and reliability of their datasets. These practices help researchers achieve data that is valid, complete, consistent, uniform, and accurate, leading to meaningful and trustworthy research outcomes. Embracing these steps not only improves the quality of data but also contributes to the overall success and credibility of the research process.
Take Away
Data cleaning and preprocessing are crucial for ensuring high-quality research data. By understanding the data, setting clear objectives, validating and correcting errors, addressing missing data, and tracking changes, researchers enhance dataset integrity and reliability. These practices ensure data is valid, complete, consistent, and accurate, leading to meaningful, trustworthy research outcomes and strengthening overall research credibility.
[1] Tableau. Guide to Data Cleaning: Definition, Benefits, Components, And How To Clean Your Data
[2] Imarticus. Data Cleaning and Preprocessing: Ensuring Quality. https://imarticus.org/blog/data-cleaning-and-preprocessing-ensuring-data-quality/
[3] Imarticus. Data Cleaning and Preprocessing: Ensuring Quality. https://imarticus.org/blog/data-cleaning-and-preprocessing-ensuring-data-quality/
[4] Chatterjee, S. 5 Best Practices for Data Cleaning and Preprocessing a Data Analyst Beginner Should Know. Emeritus. https://emeritus.org/blog/data-science-and-analytics-data-analyst-beginner/
[5] Imarticus. Data Cleaning and Preprocessing: Ensuring Quality. https://imarticus.org/blog/data-cleaning-and-preprocessing-ensuring-data-quality/
[6] Chatterjee, S. 5 Best Practices for Data Cleaning and Preprocessing a Data Analyst Beginner Should Know. Emeritus. https://emeritus.org/blog/data-science-and-analytics-data-analyst-beginner/
[7] Chatterjee, S. 5 Best Practices for Data Cleaning and Preprocessing a Data Analyst Beginner Should Know. Emeritus. https://emeritus.org/blog/data-science-and-analytics-data-analyst-beginner/
[8] Chatterjee, S. 5 Best Practices for Data Cleaning and Preprocessing a Data Analyst Beginner Should Know. Emeritus. https://emeritus.org/blog/data-science-and-analytics-data-analyst-beginner/
[9] Chatterjee, S. 5 Best Practices for Data Cleaning and Preprocessing a Data Analyst Beginner Should Know. Emeritus. https://emeritus.org/blog/data-science-and-analytics-data-analyst-beginner/
[10] Data.org. How to improve data quality through validation and quality checks. https://data.org/guides/how-to-improve-data-quality-through-validation-and-quality-checks/
[11] BBC. Data validation and verification. BBC Bitesize. https://www.bbc.co.uk/bitesize/guides/zd9cy9q/revision/1
[12] BBC. Data validation and verification. BBC Bitesize. https://www.bbc.co.uk/bitesize/guides/zd9cy9q/revision/1
[13] Teacher Incentive Allotment. Data Validation & System Approval. https://tiatexas.org/for-districts/data-submission/data-validation-system-approval/
[14] BBC. Data validation and verification. BBC Bitesize. https://www.bbc.co.uk/bitesize/guides/zd9cy9q/revision/1
[15] Van Smeden, M., Penning de Vries, B.B.L., Nab, L. & Groenwold, R. H. H. Approaches to addressing missing values, measurement error, and confounding in epidemiologic studies. Journal of Clinical Epidemiology. https://doi.org/10.1016/j.jclinepi.2020.11.006
[16] BBC. Data validation and verification. BBC Bitesize. https://www.bbc.co.uk/bitesize/guides/zd9cy9q/revision/1
[17] BBC. Data validation and verification. BBC Bitesize. https://www.bbc.co.uk/bitesize/guides/zd9cy9q/revision/1
[18] Chollet, E. Seeing Through the Fog of Data Bias. Center for Naval Analyses. https://www.cna.org/our-media/indepth/2021/10/data-bias
[19] Arias, C. A. Foundation of a Successful Data Project: Identify and Mitigating Bias. Georgetown University Beeck Center. https://beeckcenter.georgetown.edu/foundation-of-a-successful-data-project-identify-and-mitigating-bias/
[20] National Institute of Standards and Technology. There’s More to AI Bias Than Biased Data, NIST Report Highlights. U.S. Department of Commerce. https://www.nist.gov/news-events/news/2022/03/theres-more-ai-bias-biased-data-nist-report-highlights
[21] Chollet, E. Seeing Through the Fog of Data Bias. Center for Naval Analyses. https://www.cna.org/our-media/indepth/2021/10/data-bias
[22] Arnold, C. How Biased Data and Algorithms Can Harm Health. Johns Hopkins Bloomberg School of Public Health. https://magazine.publichealth.jhu.edu/2022/how-biased-data-and-algorithms-can-harm-health
[23] Arnold, C. How Biased Data and Algorithms Can Harm Health. Johns Hopkins Bloomberg School of Public Health. https://magazine.publichealth.jhu.edu/2022/how-biased-data-and-algorithms-can-harm-health
[24] Arnold, C. How Biased Data and Algorithms Can Harm Health. Johns Hopkins Bloomberg School of Public Health. https://magazine.publichealth.jhu.edu/2022/how-biased-data-and-algorithms-can-harm-health
[25] Stony Brook Libraries. Data Cleaning and Wrangling Guide. Stony Brook University. https://guides.library.stonybrook.edu/c.php?g=1417828&p=10508533
[26] Xiong, A. & Vaz, R. How to Clean Your Data: Best Practices for Data Hygiene. Grantbook. https://www.grantbook.org/blog/how-to-clean-your-data-hygiene-best-practices
[27] BBC. Data validation and verification. BBC Bitesize. https://www.bbc.co.uk/bitesize/guides/zd9cy9q/revision/1
[28] Chatterjee, S. 5 Best Practices for Data Cleaning and Preprocessing a Data Analyst Beginner Should Know. Emeritus. https://emeritus.org/blog/data-science-and-analytics-data-analyst-beginner/