Clean and Prepare Your Data

What should I do while I’m working with my files?

Step 1: Handle multiple versions

Best practices for handling multiple versions of your data files:

  • Keep a copy of the 'master' data
  • Never edit files in the master copy. Always make a new copy of the file and edit that instead.
  • Append version information to your file name
    • The date that the file was created or modified
    • An incremental numbering system to distinguish between versions (e.g. v1, v2, v2.1)
  • Track
    • The physical location of the file
    • Access permissions for the file
    • Changes made to the file
  • Consider using version control software to manage your files
    • Version control software records each saved change to your files, along with when it was made and any note you attach to it, so you can compare versions or return to an earlier one.
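
The naming and tracking practices above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the function names (`versioned_name`, `make_working_copy`, `describe`), the `name_date_version` pattern, and the example file names are all assumptions chosen for the example.

```python
import shutil
import stat
from datetime import date
from pathlib import Path
from typing import Optional


def versioned_name(master: Path, version: str, on: date) -> str:
    """Build a name such as 'survey_2024-03-01_v1.csv' from 'survey.csv'."""
    return f"{master.stem}_{on.isoformat()}_{version}{master.suffix}"


def make_working_copy(master: Path, work_dir: Path, version: str,
                      on: Optional[date] = None) -> Path:
    """Copy the master file into work_dir under a versioned name.

    The master file is only read, never modified.
    """
    on = on or date.today()
    work_dir.mkdir(parents=True, exist_ok=True)
    target = work_dir / versioned_name(master, version, on)
    if target.exists():
        # Refuse to overwrite an existing version; choose a new label instead.
        raise FileExistsError(f"{target} already exists")
    shutil.copy2(master, target)  # copy2 also preserves file timestamps
    return target


def describe(path: Path) -> dict:
    """Return the location and access permissions of a file for your notes."""
    return {
        "location": str(path.resolve()),
        "permissions": stat.filemode(path.stat().st_mode),
    }
```

For example, `make_working_copy(Path("master/survey.csv"), Path("work"), "v1")` would leave the master untouched and create a dated `v1` copy in `work` that is safe to edit.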

Step 2: Record all modifications applied to your data

Make a high-level record of every technique you apply to your data. For each one, document:

  • What technique you applied
  • The motivation for applying the technique
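
One lightweight way to keep this record is a small CSV log with one row per technique. The sketch below is only an illustration: the file name `techniques.csv`, the column names, and the `record_technique` helper are assumptions, and a plain text file or lab notebook would serve equally well.

```python
import csv
from pathlib import Path

FIELDS = ["technique", "motivation"]


def record_technique(log_path: Path, technique: str, motivation: str) -> None:
    """Append one technique and the reason for applying it to a CSV log."""
    new_file = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()  # write the column names once
        writer.writerow({"technique": technique, "motivation": motivation})
```

A call such as `record_technique(Path("techniques.csv"), "Removed duplicate rows", "Some respondents were exported twice")` captures both what was done and why.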

Step 3: Track your changes

As you apply the high-level techniques described above, keep a record of every individual change you make to your dataset.

  • Some purpose-built data-wrangling tools, such as OpenRefine, automatically keep a journal of every change you make.
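
If your tool does not keep a journal for you, a change log can be maintained by hand or by a short script. The following is a hedged sketch of one possible approach, not a description of how OpenRefine or any other tool works internally: it assumes the dataset is held as a list of dictionaries, and the `apply_change` helper and the journal field names are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def apply_change(rows: list, row_index: int, column: str, new_value,
                 journal: Path, reason: str) -> dict:
    """Change one cell and append a journal entry describing the change."""
    old_value = rows[row_index][column]
    rows[row_index][column] = new_value
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "row": row_index,
        "column": column,
        "old": old_value,
        "new": new_value,
        "reason": reason,
    }
    # One JSON object per line, so the journal is append-only and easy to read.
    with journal.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Because every entry stores the old value alongside the new one, the journal can be read back later to review or reverse an individual change.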

Step 4: Back up your files regularly

Follow the plan you created for backing up and securing your data. Back up your files whenever you finish a task so you’ll always be able to roll back to a usable copy. 
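
As an illustration of backing up at the end of a task, the sketch below copies a file into a timestamped folder and confirms that the copy matches the original by comparing checksums. The folder layout, the `back_up` and `sha256_of` helpers, and the task label are assumptions for the example; your own backup plan determines where copies actually belong.

```python
import hashlib
import shutil
from datetime import datetime
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_root: Path, label: str) -> Path:
    """Copy source into a new timestamped folder and verify the copy."""
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    dest_dir = backup_root / f"{stamp}_{label}"
    dest_dir.mkdir(parents=True, exist_ok=False)  # never reuse a backup folder
    dest = dest_dir / source.name
    shutil.copy2(source, dest)
    if sha256_of(source) != sha256_of(dest):
        raise OSError(f"Backup of {source} does not match the original")
    return dest
```

Labelling each backup with the task just completed, for example `back_up(Path("work/survey_v2.csv"), Path("backups"), "dedup-done")`, makes it easier to find the right copy to roll back to.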

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.