Clean and Prepare Your Data

What should I do while I’m working with my files?

Here are some good habits, which we recommend as best practices, to help you avoid common pitfalls.

Step 1: Handle multiple versions

Best practices for handling multiple versions of your data files:

  • Keep a copy of the 'master' data
  • Never edit files in the master copy. Always make a new copy of the file and edit that instead.
  • Append version information to your file name
    • The date that the file was created or modified
    • An incremental numbering system to distinguish between versions (e.g. v1, v2, v2.1)
  • Track the following for each file:
    • The physical location of the file
    • Access permissions for the file
    • Changes made to the file
  • Consider using version control software to manage your files
    • Version control software (for example, Git) records each change you commit, along with who made it, when, and a note describing why.
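
As an illustration of the naming advice above, here is a minimal Python sketch that copies a file out of the master folder and appends the date and a version label to the new file name. The function name `versioned_copy`, the folder layout, and the `name_date_version` pattern are assumptions made for this example, not a required convention.

```python
import shutil
from datetime import date
from pathlib import Path


def versioned_copy(master: Path, work_dir: Path, version: str) -> Path:
    """Copy `master` into `work_dir` under a name carrying the date and a version label.

    The master file itself is only read, never modified.
    """
    work_dir.mkdir(parents=True, exist_ok=True)
    stamp = date.today().isoformat()  # ISO date, e.g. 2024-05-01
    target = work_dir / f"{master.stem}_{stamp}_{version}{master.suffix}"
    if target.exists():
        # Refuse to overwrite an earlier version; pick a new label instead.
        raise FileExistsError(f"{target} already exists; choose a new version label")
    shutil.copy2(master, target)  # copy2 also preserves file timestamps
    return target
```

For example, `versioned_copy(Path("master/survey.csv"), Path("working"), "v1")` would create a file such as `working/survey_<today's date>_v1.csv` and leave `master/survey.csv` untouched.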

Step 2: Record all modifications applied to your data

Make a high-level record of every technique that you apply to your data. For each one, document:

  • What technique you applied
  • The motivation for applying the technique
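
One lightweight way to keep such a record is a plain CSV log stored next to the data. The sketch below is only an illustration: the column names, the log file, and the function `record_modification` are hypothetical choices for this example, and a text file or lab notebook would serve the same purpose.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical column layout for the log; adapt it to your project.
FIELDS = ["timestamp", "file", "technique", "motivation"]


def record_modification(log_path: Path, data_file: str,
                        technique: str, motivation: str) -> None:
    """Append one row describing a technique applied to `data_file`."""
    is_new = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()  # write the header only once
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "file": data_file,
            "technique": technique,
            "motivation": motivation,
        })
```

A call such as `record_modification(Path("modifications.csv"), "survey_v2.csv", "Removed duplicate rows", "Duplicates inflated the response count")` adds one line to the log, capturing both what was done and why.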

Step 3: Track your changes

As you apply the high-level techniques described above, keep a record of every individual change you make to your dataset.

  • Note that some purpose-built data wrangling tools like OpenRefine will automatically keep a journal of every change you make.
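
If your tool does not keep a journal for you, comparing two saved versions of a text-based file (such as CSV) is one way to recover what changed between them. The following is a minimal sketch using Python's standard `difflib` module; the function name `diff_versions` is hypothetical, and the approach assumes files small enough to read into memory.

```python
import difflib
from pathlib import Path


def diff_versions(old: Path, new: Path) -> str:
    """Return a unified diff of two text files (empty string if identical)."""
    old_lines = old.read_text(encoding="utf-8").splitlines(keepends=True)
    new_lines = new.read_text(encoding="utf-8").splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines, new_lines, fromfile=old.name, tofile=new.name
    )
    return "".join(diff)
```

In the result, lines starting with `-` were removed from the older version and lines starting with `+` were added in the newer one. Saving this output alongside each new version gives you a line-by-line record of your changes.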

Step 4: Back up your files regularly

Follow the plan you created for backing up and securing your data. Back up your files whenever you finish a task so you’ll always be able to roll back to a usable copy. 
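
As a rough sketch of what an end-of-task backup could look like, the Python example below copies a file into a backup folder under a timestamped name and then compares checksums to confirm the copy matches the original. The function names, the folder, and the naming pattern are assumptions made for illustration; your own backup plan may call for different tools or locations.

```python
import hashlib
import shutil
from datetime import datetime
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` with a timestamp and verify the copy."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Second-level timestamp: two backups within the same second share a name.
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    target = backup_dir / f"{source.stem}_{stamp}{source.suffix}"
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"Backup of {source} does not match the original")
    return target
```

Calling `back_up(Path("working/survey_v2.csv"), Path("backups"))` after finishing a task leaves a verified, dated copy that you can roll back to later.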

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.