
Clean and Prepare Your Data

Why should I create documentation for my data?

Adopting and documenting some best practices can go a long way toward setting you up for success and avoiding frustrations resulting from lost data, forgotten variable definitions, or missed steps in data analysis. Documenting this information can also help when collaborating and sharing your data with others. 

  • Create a directory in the top level of your project folder called Documentation.
  • Inside that directory, create one document for each of the steps outlined below.

Step 1: Gather information about your data

  • Information about your data is typically called metadata: descriptive information about your dataset.
  • Create a document that records the information describing each aspect of your data.
  • The metadata that you record will vary by the project that you're working on, but some important metadata to record might be:
    • Who collected your data?
    • How was it collected?
    • Who funded the research?
    • When was it collected?
    • Who can use the data?
    • What kind of data do you have?
    • How is the data formatted?
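
One way to keep this record consistent is to store it as a small structured file inside the Documentation directory. The following is a minimal Python sketch, assuming a JSON file named metadata.json. The function name write_metadata, the file name, and the field names are all hypothetical, and the bracketed values are placeholders to replace with your own project's details.

```python
import json
from pathlib import Path


def write_metadata(project_dir, metadata):
    """Save the metadata as Documentation/metadata.json and return the file path."""
    doc_dir = Path(project_dir) / "Documentation"
    doc_dir.mkdir(parents=True, exist_ok=True)
    out_path = doc_dir / "metadata.json"
    out_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return out_path


# Placeholder values only (assumptions for illustration).
example_metadata = {
    "collected_by": "<who collected the data>",
    "collection_method": "<how it was collected>",
    "funder": "<who funded the research>",
    "collection_dates": "<when it was collected>",
    "permitted_users": "<who can use the data>",
    "data_type": "<what kind of data you have>",
    "data_format": "<how the data is formatted>",
}

# Example call (writes a file, so it is left commented out here):
# write_metadata("MyProject", example_metadata)
```

A plain text or spreadsheet file works just as well; what matters is that the same questions are answered in the same place for every dataset.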

Step 2: Develop a file naming convention

Using a file naming convention will help you keep your files organized. Adopt a consistent convention that builds meaningful information into each file name, such as:

  • Project name, code or acronym
  • Creator initials or surname
  • File version
  • Process used to generate the file
  • Date of creation or modification
  • Geographical location
  • Description of content

When naming your files:

  • Do not use unusual characters (!-#@$%^*?) or spaces
  • Separate words with underscores or by capitalizing the first letter of each word
  • Document your file naming convention in your plan
    • For example, a file created using a file naming convention might be called “LLMM_TAS1_AGG_20170207.txt”.
    • Using our file naming convention we are able to understand a lot about the file from the file name alone: 
      • The file belongs to the Legacy of Lucy Maud Montgomery (LLMM) project
      • The file holds the results of text analysis (TA) script 1 (S1)
      • The text analysis was run on the text of the book Anne of Green Gables (AGG)
      • The analysis was performed on February 7, 2017
      • For more in-depth information, refer to the Research Data Management section of the Library website.
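
A convention like the one in the example above can be built and checked by a short script. The following is a minimal Python sketch, assuming a four-part pattern of project, process, subject, and date joined by underscores. The function names build_file_name and parse_file_name are hypothetical, and the pattern should be adjusted to match your own documented convention.

```python
import re
from datetime import date, datetime

# Each part may contain only letters and digits, so no spaces or
# unusual characters can slip into a file name.
SAFE_PART = re.compile(r"[A-Za-z0-9]+")


def build_file_name(project, process, subject, created, extension="txt"):
    """Join the parts with underscores, e.g. PROJECT_PROCESS_SUBJECT_YYYYMMDD.txt."""
    parts = [project, process, subject, created.strftime("%Y%m%d")]
    for part in parts:
        if not SAFE_PART.fullmatch(part):
            raise ValueError(f"Space or unusual character in {part!r}")
    return "_".join(parts) + "." + extension


def parse_file_name(file_name):
    """Split a file name that follows the assumed four-part convention."""
    stem, _, extension = file_name.rpartition(".")
    project, process, subject, stamp = stem.split("_")
    return {
        "project": project,
        "process": process,
        "subject": subject,
        "date": datetime.strptime(stamp, "%Y%m%d").date(),
        "extension": extension,
    }
```

For instance, build_file_name("LLMM", "TAS1", "AGG", date(2017, 2, 7)) produces the file name used in the example above, and parse_file_name recovers the project, process, subject, and date from it.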

Step 3: Decide how you will organize your files

As with your file naming convention, document how you'll organize your files:

  • Organize files into a series of hierarchical folders
  • Folders should be based on important elements of your project
    • Experiment or trial
    • Parameter assessed
    • Type of data
    • Date or year
  • Use concise but descriptive folder names
  • Follow the same naming conventions used for naming your files

For more in-depth information, refer to the Research Data Management section of the Library website.
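
A folder hierarchy can also be created by a script, so that every project starts from the same documented layout. The following is a minimal Python sketch. The folder names in LAYOUT (type of data at the top level, year underneath) and the function name create_folders are assumptions for illustration, not a prescribed structure.

```python
from pathlib import Path

# Hypothetical layout: top-level folders by type of data, subfolders by year.
LAYOUT = {
    "Documentation": [],
    "Raw_Data": ["2016", "2017"],
    "Processed_Data": ["2016", "2017"],
    "Scripts": [],
}


def create_folders(project_dir, layout=LAYOUT):
    """Create every folder in the layout and return the paths that were set up."""
    root = Path(project_dir)
    created = []
    for folder, subfolders in layout.items():
        # A folder with no subfolders is still created on its own.
        targets = [root / folder / sub for sub in subfolders] or [root / folder]
        for path in targets:
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created
```

Because mkdir is called with exist_ok=True, running the script again on an existing project leaves the folders and their contents untouched.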

Step 4: Create a backup plan

You never know when something will go wrong. It's important to create and stick to a plan for backing up your data.

  • Store a copy of your master files in a separate location from your working files
  • Set up a regular backup schedule
    • What files will be backed up?
    • How frequently will the files be backed up?
  • Synchronize files regularly
    • Across multiple devices
    • Across team members
  • Store sensitive data securely
    • Password protect your computer and your files
    • Always keep your antivirus installed and up to date
    • Impose access restrictions so others only have access to the data that they need
    • Encrypt sensitive data 
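
The first two points above, keeping a separate copy of your master files and backing up on a schedule, can be partly automated. The following is a minimal Python sketch that copies every file from a master folder to a backup folder and compares checksums to confirm each copy. The function names sha256_of and back_up are hypothetical, and the sketch assumes the backup folder is not located inside the master folder. It does not cover encryption or access restrictions.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(master_dir, backup_dir):
    """Copy all files from master_dir to backup_dir and verify each copy."""
    master_dir, backup_dir = Path(master_dir), Path(backup_dir)
    copied = []
    for source in sorted(master_dir.rglob("*")):
        if not source.is_file():
            continue
        # Keep the same folder structure in the backup location.
        target = backup_dir / source.relative_to(master_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        if sha256_of(source) != sha256_of(target):
            raise OSError(f"Backup copy of {source} does not match the original")
        copied.append(target)
    return copied
```

A script like this could be run by your operating system's task scheduler to follow the backup schedule you have documented.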

For more information about backup and security, visit the Information Security website.


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.