Unless you’ve collected your data yourself, and often even if you have, chances are that the data you want to analyze will need to be cleaned up. Data cleaning refers to the process of preparing data for analysis, and often includes steps like normalizing values, handling blank (null) values, re-organizing data, and otherwise refining the data into exactly what you need. Surveys of data scientists show that data cleaning can take anywhere from 15% (Kaggle, 2018) to 80% (CrowdFlower, 2016) of their time working with data.
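As a concrete illustration, here is a minimal sketch of a few common cleaning steps using Python's pandas library; the file name survey.csv and the column names are hypothetical examples, not part of any particular dataset:

```python
import pandas as pd

# Load the raw data (survey.csv and its columns are hypothetical examples)
df = pd.read_csv("survey.csv")

# Normalize values: trim stray whitespace and standardize capitalization
df["state"] = df["state"].str.strip().str.upper()

# Handle blank (null) values: drop rows missing a required field,
# and fill blanks in an optional numeric field with the column median
df = df.dropna(subset=["respondent_id"])
df["age"] = df["age"].fillna(df["age"].median())

# Re-organize: keep only the columns needed for the analysis
df = df[["respondent_id", "state", "age"]]
```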
Ask yourself, very generally: is the data correctly formatted, and does it provide what I need?
Adapted from University of Illinois.
Before cleaning the data, make sure to save a backup copy of the original dataset. That way, if your cleaning or analysis accidentally deletes data, you can always start over with a fresh copy.
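One simple way to do this from a script is sketched below; the file names are hypothetical:

```python
import shutil

# Copy the original file before any cleaning begins;
# copy2 also preserves file metadata such as timestamps
shutil.copy2("survey.csv", "survey_original_backup.csv")
```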
You can also use a system like GitHub that includes version control. Version control documents who made changes to your data, what the changes were, and when they were made, so that you can detail this process later. You can also roll back to older versions in case of errors. To learn more about Git/GitHub, check out this Carpentries tutorial. Some other platforms, such as Box, also incorporate version control.
If you do not want to use an automated system, it is still important to document the changes you make to your dataset. Best practices for preserving and disseminating data include creating a README file that gives important information about the dataset. Being able to explain how you cleaned your data, and why, increases the validity of your research and helps you if you need to repeat a previous step later on.
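If you clean data in a script, one lightweight way to keep such documentation current is to log each change as you make it. This sketch appends a dated note to a README file; the file name README.txt and the wording of the note are hypothetical examples:

```python
from datetime import date

# Append a dated note describing a cleaning step to the README
# (README.txt and the note's wording are hypothetical examples)
with open("README.txt", "a") as readme:
    readme.write(f"{date.today()}: Dropped rows with missing respondent_id; "
                 "filled blank ages with the column median.\n")
```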