Skip to main content

Clean and Prepare Your Data

How do I choose a tool?

There is a wide variety of tools to use for cleaning and preparing data. Each tool will come with its own advantages and disadvantages. In general tools typically fall into one of two categories: 

  • General Purpose tools which are not designed specifically for Data Wrangling yet still provide Data Wrangling capabilities.
  • Purpose-built tools which are designed specifically for Data Wrangling or a specific task within Data Wrangling.

Often times a data wrangling project is best accomplished through a combination of general purpose tools and specialized tools.

What are general purpose tools?

General purpose tools are less focused than other tools, but offer functionality to support a wide range of data wrangling tasks. Since they are designed to fill many roles, general purpose tools may lack capabilities offered by other more specialized tools. 

General Tools: Spreadsheets

Spreadsheet software is an excellent option when dealing with tabular data. In particular, spreadsheet software is useful for performing bulk operations on columns of data.

Common tasks might include: 

  • Sorting data by columns
  • Converting units in a column
  • Formatting values in a column
  • Joining values
  • Simple value extraction

When Not to Use Spreadsheets:

  • Tasks that must be repeated across a large numbers of files.
  • As a solution for managing data long term

Examples of Spreadsheet Software:

General Tools: Databases

Database software provides a robust platform for managing data over the lifecycle of a project in addition to powerful capabilities for querying, reporting, and filtering data.

Database software can be complex so a time investment may be required to get the full value out of a database product.

Use a database  

  • Still collecting data or need to manage an evolving dataset over time
  • Very large volume of data
  • Multiple data files that relate to each other in some way
  • Need to ask questions or run reports of the data
  • Need to join or subset data into working sets

Do not use a database

  • One time operations
  • Simple transformations that can be handled by other easier to use tools

Examples of Database Software

What are purpose-built tools?

  • Purpose-built tools are created to solve a particular problem or class of problems.
  • They usually provide a deeper level of functionality that would be available in general purpose tools though they may lack in other areas. Because of this, purpose-built tools are often used in conjunction with more general tools.

Purpose-built Tools: Data Wrangling

OpenRefine:

  • Formerly Google Refine, a powerful tool for working with messy data:
    • cleaning it 
    • transforming it from one format into another 
    • and extending it with web services and external data.‚Äč

Trifacta Wrangler:

  • A commercial tool designed to provide a wide variety of Data Wrangling capabilities.
  • A free version of Trifacta Wrangler is available for individual use.

Purpose-built Tools: Format Conversion

CSVKit:

  • A suite of command line tools designed for working with CSV files.
    • CSVKit support converting to and from CSV, as well as functions for filtering, searching, and manipulating CSV files.

Mr. Data Converter:

  • A Web based tool for converting CSV and Excel data into a variety of other formats. 

Tabula:

  • Tabula is a tool for extracting data from tables locked inside PDF files.

Purpose-built Tools: Anonymization

Cornell Anonymization Toolkit:

  • The Cornell Anonymization Toolkit is designed for interactively anonymizing published dataset to limit identification disclosure of records under various attacker models.

ARX Data Anonymization Tool:

  • ARX is a comprehensive open source software for anonymizing sensitive personal data.

Suggest an edit to this guide

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.