
SOM Data Management: Data Cleaning

Introduction

Unless you’ve collected your data yourself, and even if you have, chances are that the data you want to analyze will need to be cleaned up. Data cleaning refers to the process of preparing data for analysis, and often includes steps like normalizing values, handling missing (null) values, reorganizing data, and otherwise refining data into exactly what you need. Surveys of data scientists show that data cleaning can take anywhere from 15% (Kaggle, 2018) to 80% (CrowdFlower, 2016) of their time working with data.

How Do I Know if my Data Needs Cleaning?

Ask yourself, very generally: is the data correctly formatted, and does it provide what I need? More specifically:

  • Did you collect the data yourself or is it from somewhere else? If you’re re-using data, it’s likely that it’s not already formatted in the best way for your research and the tools you want to use.  
  • Do you know what all the columns or variables are?
  • Do you know what kinds of data you should include in your analysis, and how each kind is useful?
  • Do you know if there are any missing values or possible errors?
  • Have you looked for outliers? If outliers are present, you will need to decide how to handle them.

Adapted from University of Illinois.
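The missing-value and outlier checks above can be sketched in a few lines of code. This is a minimal example using only Python's standard library; the dataset, column names, and the 0–120 plausible-age range are all hypothetical stand-ins for your own data and domain rules:

```python
import csv
import io

# Hypothetical dataset standing in for a CSV file on disk.
raw = """name,age
Ana,34
Ben,
Cara,212
Dev,29
"""
rows = list(csv.DictReader(io.StringIO(raw)))

# Do you know if there are any missing values? Count blanks per column.
missing = {col: sum(1 for r in rows if not r[col].strip()) for col in rows[0]}
print(missing)  # {'name': 0, 'age': 1}

# Have you looked for outliers? Here: ages outside a plausible 0-120 range.
ages = [float(r["age"]) for r in rows if r["age"].strip()]
outliers = [a for a in ages if not 0 <= a <= 120]
print(outliers)  # [212.0]
```

A simple range check like this catches obvious entry errors; statistical methods (for example, flagging values far from the median) may suit data without a known valid range.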

A general workflow for cleaning data:

  1. Make a copy of your data.
  2. Choose a documentation method. 
  3. Determine your data type.
  4. Determine what you need to do.
  5. Choose a tool to use. 
  6. Clean!

Adapted from University of Illinois.
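Steps 1, 2, and 6 of the workflow above can be sketched as follows. The file names, the cleaning rule (dropping rows with a blank age), and the log format are purely illustrative; adjust them to your own project:

```python
import csv
import shutil
import tempfile
from datetime import date
from pathlib import Path

# Hypothetical files in a scratch directory -- adjust to your project.
base = Path(tempfile.mkdtemp())
original = base / "survey.csv"
original.write_text("name,age\nAna,34\nBen,\n")  # stand-in for real data

# Step 1: make a copy of your data, and clean only the copy.
working = base / "survey_clean.csv"
shutil.copy(original, working)

# Step 6: clean -- here, drop rows with a blank age.
with working.open(newline="") as f:
    rows = list(csv.DictReader(f))
kept = [r for r in rows if r["age"].strip()]
with working.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "age"])
    writer.writeheader()
    writer.writerows(kept)

# Step 2: document what you changed and when.
log = base / "CLEANING_LOG.txt"
log.write_text(f"{date.today()}: dropped {len(rows) - len(kept)} rows with blank age\n")
```

Because the original file is never modified, you can always rerun the cleaning from scratch if you change your mind about a rule.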

Documenting Changes to the Dataset

Before cleaning the data, make sure to save a backup copy of the original dataset. That way, if your cleaning or analysis accidentally deletes data, you can always start over with a fresh copy.

You can also use a system like GitHub that includes version control. Version control documents who made changes to your data, what those changes were, and when they were made, so that you can detail this process later. You can also roll back to older versions in case of errors. To learn more about Git/GitHub, check out this Carpentries tutorial. Some other platforms, such as Box, also incorporate version control.

If you do not want to use an automated system, it is still important to document the changes you make to your dataset. Best practices for preserving and disseminating data include creating a README file that gives important information about the dataset. Being able to discuss how and why you cleaned your data increases the validity of your research, and helps you if you need to repeat a step later on.
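For example, a README (or a cleaning log kept alongside it) might record entries like the following; the dataset, dates, and steps shown here are purely illustrative:

```
Dataset: survey_clean.csv (cleaned copy of survey.csv)
Date cleaned: 2024-03-15
Changes:
  - Dropped rows with a blank "age" field
  - Normalized "state" values to two-letter codes
Reason: blank ages could not be imputed reliably; two-letter codes
match the reference data used for comparison.
```

Even a few lines like this make it possible to explain, months later, exactly how the cleaned file differs from the original.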


Hofstra University
