Clean data for clear insights: part 1

May 23, 2024

Reda Belmejdoub

Facebook

Further Learning:

Data cleansing (or cleaning), is used to refer to the process of detecting and correcting inaccurate, corrupt or unusable data. It is an essential step before any data analysis project, since every step after it assumes the data is “clean” or, in other words, trustworthy and accurate.

In this 2-part series, we'll be going over the steps required for understanding what makes data "dirty", and how to clean it. Get your rubber gloves ready!

When do I need to clean my data?

There’s a high chance you’ll always have to do some cleaning. When working with datasets, especially large ones, you will almost inevitably find errors among the records- either because of human error or other issues during data collection and storage. The data can also be messy or unstructured in its raw form — which is often the case with data extracted from websites via web scraping, for example.

It is thus good practice to always take the time to familiarize yourself with the dataset you’re working with, examine it for potential errors and corrupt records, and deal with them before actually trying to make sense of the data.

What tools do we need to clean data?

Most small to medium sized datasets can be properly analyzed and cleansed using Google Sheets or Excel. In this article, we will work with a small a dataset in Google Sheets- think a few hundreds or thousands of records at most.

The dataset we will be using as an example (as well as the final, "cleansed", version) can be found here.

Get to know your Dataset

Before we do any cleaning, let's take the time to familiarize ourselves with the dataset. In this case, we will use a sample of records from the OpenFoodFacts database, containing information about a variety of food products and their nutritional values, brand, etc.

The first step to cleaning your data is being able to verbally describe your dataset, or in other words, speak to its characteristics.

To describe your data, you will need to know:

The size of the dataset

Knowing how many records you’re working with will not only give you an idea of how much data you have available, but it will also let you know how much data was expunged after you are done cleaning.

Specifically, we need to know how many columns (features, or properties) and rows (records) are in a dataset.

You can easily get the # of columns and rows by using the COUNTA formula, which returns the number of cells with values in a given range.

For rows, it should look something like:
=counta(A:A)

You should see a value of 2000. What's happening here is we're asking COUNTA to give us the amount of populated cells on column A.

Now, let's try the same for getting the number of columns:
=counta(A1:EY1)

Protip: You can avoid having to scroll to the right to get the last column by just clicking on row 1. That will select the entire first row for you, with all the columns it spans across.

You should get 155. That means the sample we’re using contains 155 columns and 2000 rows.

Great! Now we know how big our dataset is.

The Features’ Data Types

Now that we've taken a general glance at the dataset, we need to start looking at the types of data contained in each column. Most data can be categorized into 4 basic types:

Numerical Data: Numbers! It can also be called quantitative data. Measurements like height, weight, distances, or, in our dataset, sugar content, all fall within this category. Numerical data can further be split into continuous or discrete data. Continuous data assumes any value within a range (sugar for 100g can be any value between 0 and 100). Discrete data only takes distinct values (i.e a number of people will always be an integer).
Categorical Data represents a property of an object that allows grouping it in a category. Gender, nationality, hometown are all categories one could group people into. It is also possible to represent categorical data with numbers, such as 1 for male and 2 for female, which hold no mathematical significance, and thus still represent categorical, not numerical, data.
There is also what we call ordinal data, which is a mix of numerical and categorical data of sorts. In ordinal data, the data still falls into categories, but those categories are ordered or ranked in some particular way. An example would be age groups, such as baby boomers, Gen X, Millenials and Gen Z. These would represent groups of people that can be ranked chronologically.
In our dataset, the “countries” and “brands” columns are both examples of categorical data.
Time Series: Sequences of numbers representing time-based values that allow the data to be situated chronologically. These can dates or timestamps that allow you to visualize trends and evolutions in time for your data. It is important to know what type of representation is being used for time series data in your dataset. In the example we’re using, we have both UNIX epoch timestamps and ISO format timestamps for the record creation time. In cases where the format isn’t supported by Google Sheets like this one, it will be necessary to convert them to another, more manageable format.
Text Data: Words or sentences that don’t represent categories, such as item descriptions, names, addresses etc…

Now that you're ready

Once you understand the contents of your data, the data types you will be dealing with, and the general shape of your dataset, you are ready to tackle the real cleaning. It's now time to head on over to Part 2 of this article and find out how you can polish raw data into a jewel of insights.

‍

EBook

Augmented AI:

Blending human empathy and AI for exceptional customer support

This ebook provides an essential guide to understanding Augmented AI and how it revolutionizes customer service by combining the efficiency of AI with the empathy and creativity of human support.

Get your copy

All the CX goodness—straight to your inbox

Get industry news, AI guidance, and early access to tools and resources. Don’t worry, we don’t spam the people we love.

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.

Customer Experience

Reimagining CX metrics in an AI-driven world: practical strategies for CX leaders

May 7, 2025

Learn how AI is reshaping CX metrics with practical tips to balance speed and empathy, build trust, ensure data security, and improve customer satisfaction.

Read article

Customer Experience

Chatbot best practices: a comprehensive guide

May 5, 2025

From setup to optimization, these chatbot best practices will help you streamline CX, enhance efficiency, and improve customer satisfaction.

Read article

Customer Experience

Customer service outsourcing: when, why & how to do it right

Apr 30, 2025

Struggling to keep up with support? Customer service outsourcing might be the smart, scalable solution you need—here’s how to do it right.

‍

Read article

Customer Experience

Reimagining CX metrics in an AI-driven world: practical strategies for CX leaders

May 7, 2025

Learn how AI is reshaping CX metrics with practical tips to balance speed and empathy, build trust, ensure data security, and improve customer satisfaction.

Read article

Customer Experience

Chatbot best practices: a comprehensive guide

May 5, 2025

From setup to optimization, these chatbot best practices will help you streamline CX, enhance efficiency, and improve customer satisfaction.

Read article

Customer Experience

Customer service outsourcing: when, why & how to do it right

Apr 30, 2025

Struggling to keep up with support? Customer service outsourcing might be the smart, scalable solution you need—here’s how to do it right.

‍

Read article

Customer Experience

ASAT meaning: what agent satisfaction score is and how to measure it

Apr 30, 2025

The purpose of measuring ASAT is to proactively identify risks such as burnout or lack of engagement, and ensure the team is happy doing the work they’ve been hired to do.

‍

Read article

Blog

Clean data for clear insights: part 1

When do I need to clean my data?

What tools do we need to clean data?

The dataset we will be using as an example (as well as the final, "cleansed", version) can be found here.

Get to know your Dataset

The first step to cleaning your data is being able to verbally describe your dataset, or in other words, speak to its characteristics.

To describe your data, you will need to know:

The size of the dataset

Protip: You can avoid having to scroll to the right to get the last column by just clicking on row 1. That will select the entire first row for you, with all the columns it spans across.

You should get 155. That means the sample we’re using contains 155 columns and 2000 rows.

Great! Now we know how big our dataset is.

The Features’ Data Types

Now that we've taken a general glance at the dataset, we need to start looking at the types of data contained in each column. Most data can be categorized into 4 basic types:

Numerical Data: Numbers! It can also be called quantitative data. Measurements like height, weight, distances, or, in our dataset, sugar content, all fall within this category. Numerical data can further be split into continuous or discrete data. Continuous data assumes any value within a range (sugar for 100g can be any value between 0 and 100). Discrete data only takes distinct values (i.e a number of people will always be an integer).
Categorical Data represents a property of an object that allows grouping it in a category. Gender, nationality, hometown are all categories one could group people into. It is also possible to represent categorical data with numbers, such as 1 for male and 2 for female, which hold no mathematical significance, and thus still represent categorical, not numerical, data.
There is also what we call ordinal data, which is a mix of numerical and categorical data of sorts. In ordinal data, the data still falls into categories, but those categories are ordered or ranked in some particular way. An example would be age groups, such as baby boomers, Gen X, Millenials and Gen Z. These would represent groups of people that can be ranked chronologically.
In our dataset, the “countries” and “brands” columns are both examples of categorical data.
Time Series: Sequences of numbers representing time-based values that allow the data to be situated chronologically. These can dates or timestamps that allow you to visualize trends and evolutions in time for your data. It is important to know what type of representation is being used for time series data in your dataset. In the example we’re using, we have both UNIX epoch timestamps and ISO format timestamps for the record creation time. In cases where the format isn’t supported by Google Sheets like this one, it will be necessary to convert them to another, more manageable format.
Text Data: Words or sentences that don’t represent categories, such as item descriptions, names, addresses etc…

Now that you're ready

‍

Reda Belmejdoub

Clean data for clear insights: part 1

When do I need to clean my data?

What tools do we need to clean data?

Get to know your Dataset

The size of the dataset

The Features’ Data Types

Now that you're ready

All the CX goodness—straight to your inbox

You might also like

Reimagining CX metrics in an AI-driven world: practical strategies for CX leaders

Chatbot best practices: a comprehensive guide

Customer service outsourcing: when, why & how to do it right

Reimagining CX metrics in an AI-driven world: practical strategies for CX leaders

Chatbot best practices: a comprehensive guide

Customer service outsourcing: when, why & how to do it right

ASAT meaning: what agent satisfaction score is and how to measure it

Clean data for clear insights: part 1

When do I need to clean my data?

What tools do we need to clean data?

Get to know your Dataset

The size of the dataset

The Features’ Data Types

Now that you're ready

You might also like:

🌞 What is follow-the-sun and why should you care? 🌞

You’re already becoming less successful

Your guide to the best customer service software solutions