
How to clean data in Python?

  • Writer: Park Daniel
  • Oct 1, 2020
  • 2 min read

In this blog post, I am going to walk through the steps you should take to clean data with Python, introducing a number of functions that are useful in helping us reach that goal.



Check the size of the dataset


#checking the number of rows in the dataset to understand the size we are working with
import pandas as pd

airbnb = pd.read_csv(r"C:\Users\16094\Documents\iSTEM\AB_NYC_2019.csv")
len(airbnb)

The code above is an easy way to get a sense of how much data you are working with. It is important to assign the data to a variable so that it can be passed to the len() function. In the example above, I chose an Airbnb dataset because I could use the pandas library and its 'read_csv' function to read the CSV file downloaded from Kaggle.
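As a quick sketch of the same idea, here is the row count on a tiny stand-in frame (the real post loads AB_NYC_2019.csv; the columns below are made up for illustration). The shape attribute is a handy companion to len(), since it also reports the column count:

```python
import pandas as pd

# A tiny stand-in frame; the post loads AB_NYC_2019.csv instead.
airbnb = pd.DataFrame({"id": [1, 2, 3], "price": [100, 80, 120]})

print(len(airbnb))   # number of rows
print(airbnb.shape)  # (rows, columns)
```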



Check column types


#checking the type of every column in the dataset
airbnb.dtypes


The dtypes attribute gives us the type of every column. You might ask: what is a data type? In computer science and programming, a data type (or simply type) is an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.


id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64

The type for each column can be seen above, as reported by dtypes. Some columns, such as id, latitude, and longitude, hold numbers (integers or floats).

Both are numerical types, but an integer is a whole number without a decimal point, while a float has the precision to carry one.
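To make the integer/float distinction concrete, here is a minimal sketch with two made-up values standing in for the minimum_nights and latitude columns:

```python
# minimum_nights is stored as an integer: a whole number, no decimal part.
nights = 3

# latitude needs a decimal point, so pandas stores it as a float.
lat = 40.64749

print(type(nights).__name__)  # int
print(type(lat).__name__)     # float
```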


Which columns have null values
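The snippet for this check did not survive in the post, but the standard pandas approach is isnull() followed by sum(). A minimal sketch, using a tiny stand-in frame instead of the full Airbnb CSV:

```python
import pandas as pd
import numpy as np

# A tiny stand-in for the Airbnb frame (the post uses AB_NYC_2019.csv).
airbnb = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Cozy loft", None, "Sunny room"],
    "reviews_per_month": [0.5, np.nan, np.nan],
})

# isnull() marks each missing cell True; sum() counts them per column.
print(airbnb.isnull().sum())
```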



This code tells the programmer which columns have null values. Chaining the 'sum' function shows how many nulls appear in each column of the dataset. In our case, the missing data does not need much special treatment.

Looking at the nature of our dataset, we can say a bit more. The columns "name" and "host_name" are irrelevant and insignificant to our data analysis, while the columns "last_review" and "reviews_per_month" need only very simple handling. To elaborate, "last_review" is a date; if a listing has no reviews, the date simply does not exist. Since this column is irrelevant to our analysis, filling in those values is not needed. For the "reviews_per_month" column, we can simply fill missing values with 0.0: whenever "number_of_reviews" is 0, it follows that there are 0.0 reviews per month. Therefore, let's proceed with removing the unimportant columns and handling the missing data.


id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
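The cleanup steps described above can be sketched as follows, again on a small stand-in frame with the same problem columns (the real post applies this to the full Airbnb dataset):

```python
import pandas as pd
import numpy as np

# Stand-in frame with the columns the post singles out.
airbnb = pd.DataFrame({
    "id": [1, 2],
    "name": ["Cozy loft", None],
    "host_name": ["Ann", None],
    "last_review": ["2019-07-01", None],
    "number_of_reviews": [12, 0],
    "reviews_per_month": [0.8, np.nan],
})

# Drop the columns the analysis does not need.
airbnb = airbnb.drop(columns=["name", "host_name", "last_review"])

# A listing with 0 reviews gets 0.0 reviews per month.
airbnb["reviews_per_month"] = airbnb["reviews_per_month"].fillna(0.0)

# No nulls should remain after these two steps.
print(airbnb.isnull().sum().sum())
```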

Categorical unique values



The last step was to understand the unique values and categorical data in our dataset. For those columns' values, we will do some mapping to prepare the dataset for predictive analysis.
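The snippet for this step is also missing from the post; a minimal sketch of the usual approach is unique() to list the distinct categories, then map() to encode them as numbers. The frame and the code values below are made up for illustration:

```python
import pandas as pd

# Stand-in for the categorical columns in the Airbnb data.
airbnb = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Brooklyn", "Queens"],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
})

# unique() lists the distinct values of a column.
print(airbnb["neighbourhood_group"].unique())

# A simple mapping turns categories into numbers for predictive models.
codes = {"Brooklyn": 0, "Manhattan": 1, "Queens": 2}
airbnb["group_code"] = airbnb["neighbourhood_group"].map(codes)
print(airbnb["group_code"].tolist())
```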
