How to clean data in Python?
- Park Daniel
- Oct 1, 2020
- 2 min read
In this blog post, I am going to walk through the steps you should take to clean data with Python, with examples of functions that help us reach that goal.

Check to understand size
# Check the number of rows in the dataset to understand the size we are working with
import pandas as pd

airbnb = pd.read_csv(r"C:\Users\16094\Documents\iSTEM\AB_NYC_2019.csv")
len(airbnb)
The code above is an easy way to get an understanding of how much data you are working with. The data has to be assigned to a variable first so that it can be passed to the function len(). In the example above, I chose the Airbnb dataset because I could use the pandas library and its 'read_csv' function to read the CSV file downloaded from Kaggle.
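Alongside len(), the shape attribute of a DataFrame reports rows and columns together. Here is a minimal sketch of both, assuming a tiny made-up DataFrame (the name `sample`, its columns and its values are hypothetical stand-ins, not the real Airbnb file):

```python
import pandas as pd

# Hypothetical stand-in for the Airbnb data; all values are made up
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Cozy room", "Loft", None],
    "price": [80, 150, 120],
})

n_rows = len(sample)         # number of rows only
n_rows_cols = sample.shape   # (rows, columns) tuple
print(n_rows, n_rows_cols)
```

len() is enough when you only care about the row count; shape is handy when you also want to confirm the number of columns after loading.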
Check column types
The dtypes attribute will provide us with the type of every column. You might ask: what is a type? In computer science and programming, a data type (or simply type) is an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
id int64
name object
host_id int64
host_name object
neighbourhood_group object
neighbourhood object
latitude float64
longitude float64
The name and type of each column can be seen above, as reported by dtypes. Some columns, such as id, latitude, and longitude, hold numbers (integers or floats).
Both are numerical data, but an integer is a number without a decimal point, while a float can carry a fractional part.
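To see the difference between integer and float columns in practice, here is a small sketch. It assumes a made-up DataFrame called `sample` (hypothetical, with invented values) rather than the real Airbnb file. Note that dtypes is an attribute, so it is written without parentheses:

```python
import pandas as pd

# Hypothetical stand-in for a few Airbnb columns; values are made up
sample = pd.DataFrame({
    "id": [2539, 2595],
    "name": ["Room A", "Room B"],
    "latitude": [40.64749, 40.75362],
})

column_types = sample.dtypes  # one entry per column
print(column_types)

# Helper checks from pandas.api.types confirm the numeric kinds
id_is_integer = pd.api.types.is_integer_dtype(column_types["id"])
latitude_is_float = pd.api.types.is_float_dtype(column_types["latitude"])
```

Checking types early helps catch columns that were read as text when you expected numbers.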
Which columns have null values
This step tells us which columns have null values. Using the 'sum' function shows how many nulls are found in each column of the dataset. In our case, the missing data does not need much special treatment. Looking at the nature of our dataset, we can say more: the columns "name" and "host_name" are irrelevant to our data analysis, while "last_review" and "reviews_per_month" need only very simple handling. To elaborate, "last_review" is a date; if a listing has no reviews, the date simply does not exist. This column is also irrelevant to our analysis, so filling in those values is not needed. For "reviews_per_month", we can simply fill the missing values with 0.0: for those rows "number_of_reviews" is 0, and with 0 total reviews the rate of reviews per month is 0.0. So let's proceed with removing the columns that are not important and handling the missing data.
id 0
name 16
host_id 0
host_name 21
neighbourhood_group 0
neighbourhood 0
latitude 0
minimum_nights 0
number_of_reviews 0
last_review 10052
reviews_per_month 10052
calculated_host_listings_count 0
availability_365 0
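The null check and the cleanup described above can be sketched as follows. This is only an illustration under assumptions: `sample` is a made-up DataFrame with invented values, and dropping "name", "host_name" and "last_review" is my reading of which columns the text calls irrelevant.

```python
import pandas as pd

# Hypothetical stand-in for the Airbnb data; all values are made up
sample = pd.DataFrame({
    "name": ["Room A", None, "Room C"],
    "host_name": ["Ann", "Bob", None],
    "number_of_reviews": [9, 0, 0],
    "last_review": ["2019-05-21", None, None],
    "reviews_per_month": [0.21, None, None],
})

# isnull() marks missing cells as True; sum() counts them per column
null_counts = sample.isnull().sum()
print(null_counts)

# Assumption: these are the columns treated as irrelevant
cleaned = sample.drop(columns=["name", "host_name", "last_review"])

# Listings with no reviews get a rate of 0.0 reviews per month
cleaned["reviews_per_month"] = cleaned["reviews_per_month"].fillna(0.0)
print(cleaned.isnull().sum())
```

Running the null count again after the cleanup is a quick way to confirm that nothing is left missing.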
Categorical unique values
Understanding the unique values and categorical data in our dataset was the last step we had to do. It looks like we will need to map the values of those columns to prepare the dataset for predictive analysis.
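As a rough sketch of what inspecting unique values and mapping them could look like, assuming a made-up `sample` DataFrame and a hypothetical integer encoding (the post does not say which mapping is used):

```python
import pandas as pd

# Hypothetical stand-in for one categorical column; values are made up
sample = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Brooklyn", "Queens"],
})

# unique() returns the distinct values in order of first appearance
groups = sample["neighbourhood_group"].unique()
n_groups = sample["neighbourhood_group"].nunique()

# Hypothetical mapping from each category label to an integer code
group_codes = {label: code for code, label in enumerate(groups)}
sample["neighbourhood_group_code"] = sample["neighbourhood_group"].map(group_codes)
print(group_codes)
```

Integer codes are only one option; which encoding fits best depends on the model used later for the predictive analysis.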