
SOM Data Management: Data Cleaning

Introduction

Unless you’ve collected your data yourself, and even if you have, chances are that the data you want to analyze will need to be cleaned up. Data cleaning refers to the process of preparing data for analysis, and often includes steps like normalizing values, handling missing (null) values, reorganizing data, and otherwise refining data into exactly what you need. Surveys of data scientists show that data cleaning can take anywhere from 15% (Kaggle, 2018) to 80% (CrowdFlower, 2016) of their time working with data.

How Do I Know if my Data Needs Cleaning?

Ask yourself, very generally: is the data correctly formatted, and does it provide what I need? More specifically:

  • Did you collect the data yourself or is it from somewhere else? If you’re re-using data, it’s likely that it’s not already formatted in the best way for your research and the tools you want to use.  
  • Do you know what all the columns or variables are?
  • Do you know what kinds of data you should include in your analysis, and how each kind is useful?
  • Do you know if there are any missing values or possible errors?
  • Have you looked for outliers? If outliers are present, you will need to decide how to handle them.

Adapted from University of Illinois.
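The missing-value and outlier checks above can be sketched in a few lines of code. This is a minimal example using only Python's standard library; the dataset, column names, and the 0–120 plausible-age range are all hypothetical stand-ins for your own data and domain rules:

```python
import csv
import io

# Hypothetical dataset standing in for a CSV file on disk.
raw = """name,age
Ana,34
Ben,
Cara,212
Dev,29
"""
rows = list(csv.DictReader(io.StringIO(raw)))

# Do you know if there are any missing values? Count blanks per column.
missing = {col: sum(1 for r in rows if not r[col].strip()) for col in rows[0]}
print(missing)  # {'name': 0, 'age': 1}

# Have you looked for outliers? Here: ages outside a plausible 0-120 range.
ages = [float(r["age"]) for r in rows if r["age"].strip()]
outliers = [a for a in ages if not 0 <= a <= 120]
print(outliers)  # [212.0]
```

A simple range check like this catches obvious entry errors; statistical methods (for example, flagging values far from the median) may suit data without a known valid range.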

A general workflow for cleaning data:

  1. Make a copy of your data.
  2. Choose a documentation method. 
  3. Determine your data type.
  4. Determine what you need to do.
  5. Choose a tool to use. 
  6. Clean!

Adapted from University of Illinois.
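Steps 1, 2, and 6 of the workflow above can be sketched as follows. The file names, the cleaning rule (dropping rows with a blank age), and the log format are purely illustrative; adjust them to your own project:

```python
import csv
import shutil
import tempfile
from datetime import date
from pathlib import Path

# Hypothetical files in a scratch directory -- adjust to your project.
base = Path(tempfile.mkdtemp())
original = base / "survey.csv"
original.write_text("name,age\nAna,34\nBen,\n")  # stand-in for real data

# Step 1: make a copy of your data, and clean only the copy.
working = base / "survey_clean.csv"
shutil.copy(original, working)

# Step 6: clean -- here, drop rows with a blank age.
with working.open(newline="") as f:
    rows = list(csv.DictReader(f))
kept = [r for r in rows if r["age"].strip()]
with working.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "age"])
    writer.writeheader()
    writer.writerows(kept)

# Step 2: document what you changed and when.
log = base / "CLEANING_LOG.txt"
log.write_text(f"{date.today()}: dropped {len(rows) - len(kept)} rows with blank age\n")
```

Because the original file is never modified, you can always rerun the cleaning from scratch if you change your mind about a rule.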

Documenting Changes to the Dataset

Before cleaning the data, make sure to save a backup copy of the original dataset. That way, if your cleaning or analysis accidentally deletes data, you can always start over with a fresh copy.

You can also use a system like GitHub that includes version control. Version control documents who made changes to your data, what those changes were, and when they were made, so that you can detail this process later. You can also roll back to older versions in case of errors. To learn more about Git/GitHub, check out this Carpentries tutorial. Some other platforms, such as Box, also incorporate version control.

If you do not want to use an automated system, it is still important to document the changes you make to your dataset. Best practices for preserving and disseminating data include creating a README file that gives important information about the dataset. Being able to discuss how and why you cleaned your data increases the validity of your research, and helps you if you need to repeat a step later on.
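For example, a README (or a cleaning log kept alongside it) might record entries like the following; the dataset, dates, and steps shown here are purely illustrative:

```
Dataset: survey_clean.csv (cleaned copy of survey.csv)
Date cleaned: 2024-03-15
Changes:
  - Dropped rows with a blank "age" field
  - Normalized "state" values to two-letter codes
Reason: blank ages could not be imputed reliably; two-letter codes
match the reference data used for comparison.
```

Even a few lines like this make it possible to explain, months later, exactly how the cleaned file differs from the original.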


Hofstra University
