Data Visualization is Fun!

In the last post, I shared some code for cleaning the Melbourne Housing dataset. In this part, we continue exploring the same dataset, this time with data visualization tools. Both Python and Tableau are used.

First, let’s import the packages, load the dataset, and repeat the data cleaning from last time.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Load the dataset
housing = pd.read_csv('housing.csv')
# Keep only rows whose BuildingArea is not zero
housing = housing[housing['BuildingArea']!=0]
# Drop three columns
housing = housing.drop(['Address','SellerG','Bedroom2'],axis = 1)
# Get the age of each property (relative to 2018)
housing['Age'] = 2018-housing['YearBuilt']
# Filter out one 420-year-old property
housing = housing[housing['Age']<500]
# Extract Year, Month, and Weekday from Date. First, convert the strings to datetime objects
housing['Date']=pd.to_datetime(housing['Date'],infer_datetime_format=True)
housing['Year']=pd.DatetimeIndex(housing['Date']).year
housing['Month']=pd.DatetimeIndex(housing['Date']).month
housing['Weekday']=pd.DatetimeIndex(housing['Date']).weekday  # Monday is 0, Sunday is 6
# Finally, drop all rows that contain NaN
housing=housing.dropna()
# Take a look!
housing.head()

        Suburb  Rooms  Type      Price  Method        Date  Distance  Postcode  Bathroom  Car  ...  YearBuilt         CouncilArea  Lattitude  Longtitude             Regionname  Propertycount    Age  Year  Month  Weekday
 2  Abbotsford      2     h  1035000.0       S  2016-04-02       2.5    3067.0       1.0  0.0  ...     1900.0  Yarra City Council   -37.8079    144.9934  Northern Metropolitan         4019.0  118.0  2016      4        5
 4  Abbotsford      3     h  1465000.0      SP  2017-04-03       2.5    3067.0       2.0  0.0  ...     1900.0  Yarra City Council   -37.8093    144.9944  Northern Metropolitan         4019.0  118.0  2017      4        0
 6  Abbotsford      4     h  1600000.0      VB  2016-04-06       2.5    3067.0       1.0  2.0  ...     2014.0  Yarra City Council   -37.8072    144.9941  Northern Metropolitan         4019.0    4.0  2016      4        2
11  Abbotsford      3     h  1876000.0       S  2016-07-05       2.5    3067.0       2.0  0.0  ...     1910.0  Yarra City Council   -37.8024    144.9993  Northern Metropolitan         4019.0  108.0  2016      7        1
14  Abbotsford      2     h  1636000.0       S  2016-08-10       2.5    3067.0       1.0  2.0  ...     1890.0  Yarra City Council   -37.8060    144.9954  Northern Metropolitan         4019.0  128.0  2016      8        2

5 rows × 22 columns

housing.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8841 entries, 2 to 34856
Data columns (total 22 columns):
Suburb           8841 non-null object
Rooms            8841 non-null int64
Type             8841 non-null object
Price            8841 non-null float64
Method           8841 non-null object
Date             8841 non-null datetime64[ns]
Distance         8841 non-null float64
Postcode         8841 non-null float64
Bathroom         8841 non-null float64
Car              8841 non-null float64
Landsize         8841 non-null float64
BuildingArea     8841 non-null float64
YearBuilt        8841 non-null float64
CouncilArea      8841 non-null object
Lattitude        8841 non-null float64
Longtitude       8841 non-null float64
Regionname       8841 non-null object
Propertycount    8841 non-null float64
Age              8841 non-null float64
Year             8841 non-null int64
Month            8841 non-null int64
Weekday          8841 non-null int64
dtypes: datetime64[ns](1), float64(12), int64(4), object(5)
memory usage: 1.6+ MB

Let us rescale the Price column to thousands and call it Price(K).

housing['Price(K)'] = housing['Price']/1000
housing.drop('Price',axis = 1,inplace=True) 

** The dataset is ready to visualize. **

Show, don’t tell!

Viz 1

This plot will show the distribution of Price(K)

plt.figure(figsize=(10,6))
ax1=housing['Price(K)'].hist(bins=31)
ax1.set_xlabel('Price(K)')
ax1.set_xlim([0,8000])
(0, 8000)

[Figure: histogram of Price(K)]

The price distribution looks right-skewed. Most properties are priced between 500K and 1,500K AUD.
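
One way to back up the visual impression of right skew is to compare the mean with the median and to compute the sample skewness. The sketch below uses a small made-up price list (the values are hypothetical, not taken from the dataset); on the real data the same calls would be applied to housing['Price(K)'].

```python
import pandas as pd

# Hypothetical prices in thousands of AUD, for illustration only.
prices = pd.Series([450, 600, 700, 800, 900, 1000, 1200, 1500, 3000, 6000],
                   name='Price(K)')

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()  # positive values indicate a longer right tail

print(f"mean={mean_price:.0f}, median={median_price:.0f}, skew={skewness:.2f}")
```

A mean well above the median, together with a positive skewness, is what a right-skewed histogram would lead us to expect.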

Viz 2

Price(K) VS Age of Property

plt.figure(figsize=(10,6))
ax2=housing.groupby('Age')['Price(K)'].mean().plot()
ax2.set_ylabel('Price(K)')

[Figure: line plot of mean Price(K) by Age]

This looks a bit like time series data. The mean price fluctuates with age, and the fluctuations grow larger as age increases. This might indicate that, for older properties, other features have a greater impact on price. Still, we can infer that for ages below about 125, older properties tend to be more expensive. Properties older than 130 seem to be about as expensive as those around 50 years old.
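
One possible reason for the larger swings at high ages is that each age has fewer sales behind its mean. That is an assumption worth checking rather than something the plot proves. A minimal sketch, on a made-up frame whose column names mirror the housing frame, that reports the group size next to the mean:

```python
import pandas as pd

# Hypothetical sample; only the column names match the housing frame.
sample = pd.DataFrame({
    'Age':      [10, 10, 10, 10, 120, 120, 150],
    'Price(K)': [800, 900, 850, 950, 1500, 2100, 700],
})

# Mean price and number of sales for each age.
by_age = sample.groupby('Age')['Price(K)'].agg(['mean', 'count'])
print(by_age)

# Keep only ages backed by at least min_count sales (the threshold is arbitrary).
min_count = 2
reliable = by_age[by_age['count'] >= min_count]
print(reliable)
```

Plotting only the ages that pass such a threshold would show whether the trend survives once thinly populated groups are set aside.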

Viz 3

Price(K) VS Year, Month, and Weekday

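# catplot replaces the deprecated factorplot (height replaces size).
# It is figure-level and creates its own figure, so no plt.figure() call is needed.
g1=sns.catplot(y='Price(K)',x='Year',kind='bar',height=4, aspect =3,data=housing)
g1.ax.set_title('Price(K) VS Year')

[Figure: bar chart of mean Price(K) by Year]

Average prices did not change much between 2016 and 2018. Keep in mind, though, that the scale is in thousands of AUD.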

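# catplot replaces the deprecated factorplot (height replaces size) and creates its own figure.
g2=sns.catplot(y='Price(K)',x='Month',kind='bar',height=4, aspect =3,data=housing)
g2.ax.set_title('Price(K) VS Month')

[Figure: bar chart of mean Price(K) by Month]

Properties sold in January and July had relatively lower prices.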

# catplot replaces the deprecated factorplot (height replaces size) and creates its own figure.
g3=sns.catplot(y='Price(K)',x='Weekday',kind='bar',height=4, aspect =3,data=housing)
g3.ax.set_title('Price(K) VS Day of Week')

[Figure: bar chart of mean Price(K) by Weekday]

Properties sold on Wednesday had the lowest average price. Should we buy a house on Wednesday then?
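
A note on reading the axis: pandas encodes weekdays as integers with Monday as 0, so the bar labelled 2 is Wednesday. A small sketch that maps the numbers to names, using three dates from the head() output above:

```python
import pandas as pd

# Three dates as they appear in the cleaned frame's head() output.
dates = pd.Series(pd.to_datetime(['2016-04-02', '2017-04-03', '2016-04-06']))

weekday_number = dates.dt.weekday     # Monday is 0, Sunday is 6
weekday_name = dates.dt.day_name()    # human-readable labels

print(pd.DataFrame({'Date': dates, 'Weekday': weekday_number, 'Name': weekday_name}))
```

Using the names as the x variable would make the bar chart easier to read than the raw integers.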

# catplot replaces the deprecated factorplot (height replaces size) and creates its own figure.
g4=sns.catplot(y='Price(K)',x='Month',kind='bar',hue='Weekday',height=5, aspect =3,data=housing)
g4.ax.set_title('Price(K) VS Month')

[Figure: bar chart of mean Price(K) by Month, split by Weekday]

Combining weekday with month gives a closer look at how prices differ across days. It seems that Wednesday in November and Thursday in March would be the better choices. The plot also shows that during the first two months, properties were sold only on Friday. Why?? Properties were only sold on Wednesday in March and November.
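
Before reading too much into the weekday pattern, it may be worth ruling out the date parsing step as a cause. If the raw Date strings are day-first (for example 4/03/2017 meaning 4 March 2017, which is an assumption about the CSV, not something shown in this post), then parsing them month-first silently changes both Month and Weekday. A small sketch with the standard library, using made-up strings:

```python
from datetime import datetime

# Hypothetical raw strings in d/m/yyyy style; the real CSV may differ.
raw_dates = ['4/03/2017', '7/05/2016']

results = []
for raw in raw_dates:
    day_first = datetime.strptime(raw, '%d/%m/%Y')
    month_first = datetime.strptime(raw, '%m/%d/%Y')
    # weekday(): Monday is 0 and Sunday is 6, the same convention pandas uses
    results.append((raw, day_first.month, day_first.weekday(),
                    month_first.month, month_first.weekday()))

for row in results:
    print(row)
```

Under the day-first reading both strings fall on a Saturday; under the month-first reading they land on a Monday and a Tuesday in different months. In pandas, pd.to_datetime accepts dayfirst=True or an explicit format='%d/%m/%Y' if the file turns out to be day-first.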

Viz 4

Price(K) VS Distance

plt.figure(figsize=(12,6))
sns.regplot(x='Distance',y='Price(K)',data=housing,scatter_kws={'alpha':0.1},fit_reg=False)

[Figure: scatter plot of Price(K) against Distance]

From this plot, we can draw two main insights:

  • Most properties lie roughly 2 to 15 units from the CBD (the unit is not stated here; most likely kilometres)
  • The further a property is from the CBD, the lower its price tends to be

This matches the common-sense expectation that properties close to the Melbourne CBD are expensive. In fact, the same rule applies to most big cities!
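
The downward trend can also be summarized with a single number, the Pearson correlation coefficient. The sketch below runs on invented values; on the real frame the call would be housing['Distance'].corr(housing['Price(K)']).

```python
import pandas as pd

# Hypothetical distances and prices, for illustration only.
sample = pd.DataFrame({
    'Distance': [2.5, 5.0, 8.0, 12.0, 20.0, 35.0],
    'Price(K)': [1600, 1400, 1100, 900, 700, 500],
})

# Series.corr computes the Pearson correlation by default.
r = sample['Distance'].corr(sample['Price(K)'])
print(round(r, 2))
```

A value close to -1 means price falls steadily with distance; a value near 0 would mean the scatter plot's trend is weaker than it looks.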

Viz 5

Price(K) VS Number of Rooms

# catplot replaces the deprecated factorplot (height replaces size).
g5=sns.catplot(y='Price(K)',x='Rooms',kind='bar',height=5, aspect =3,data=housing)
g5.ax.set_title('Price(K) VS Number of Rooms')

[Figure: bar chart of mean Price(K) by number of Rooms]

We can guess that the more rooms a property has, the bigger it is, and the bigger it is, the more it is worth. The guess holds for properties with up to 6 rooms. Beyond 6 rooms, the rule no longer applies. So, let’s move on to Viz 6.

Viz 6

Price(K) VS BuildingArea

plt.figure(figsize=(12,6))
ax3=sns.regplot(x='BuildingArea',y='Price(K)',data=housing,scatter_kws={'alpha':0.1},fit_reg=False)
ax3.set_title('Price(K) VS BuildingArea')

[Figure: scatter plot of Price(K) against BuildingArea]

Let’s take a closer look at the part where x is between 0 and 1100. In other words, ignore the extreme values (they could be outliers).

plt.figure(figsize=(12,6))
ax4=sns.regplot(x='BuildingArea',y='Price(K)',data=housing,scatter_kws={'alpha':0.3},fit_reg=False)
ax4.set_xlim((0,1100))
ax4.set_title('Price(K) VS BuildingArea(0-1100)')

[Figure: scatter plot of Price(K) against BuildingArea, limited to 0-1100]

For BuildingArea between 0 and 400, there seems to be a positive linear trend, although it is hard to see clearly because so many points overlap.
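
When points overlap this heavily, one option is to bin the x variable and look at a summary per bin instead of the raw scatter. A minimal sketch on made-up values; the bin edges are arbitrary and the column names simply mirror the housing frame:

```python
import pandas as pd

# Hypothetical building areas and prices, for illustration only.
sample = pd.DataFrame({
    'BuildingArea': [50, 80, 120, 150, 180, 220, 260, 310, 350, 390],
    'Price(K)':     [500, 650, 800, 900, 1000, 1200, 1300, 1500, 1700, 1900],
})

# Cut BuildingArea into intervals of width 100 and take the median price per interval.
bins = [0, 100, 200, 300, 400]
sample['AreaBin'] = pd.cut(sample['BuildingArea'], bins=bins)
binned = sample.groupby('AreaBin', observed=True)['Price(K)'].median()
print(binned)
```

If the binned medians rise steadily, that supports the positive trend. Other options in the same spirit are a lower alpha or a density-style plot such as matplotlib's hexbin.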

Viz 7

Price VS Region Name

plt.figure(figsize=(12,6))
housing.groupby('Regionname')['Price(K)'].mean().plot(kind='bar')

[Figure: bar chart of mean Price(K) by Regionname]

Southern Metropolitan and Eastern Metropolitan are the two regions with the highest average housing prices. You must be curious about where these two regions are. Let’s take a look at Google Maps!

[Figure: Google Maps view of the Melbourne regions]

These two regions are closer to the Melbourne CBD. It is also clear that more properties are located around Port Phillip Bay.

Tableau

Tableau is a strong tool built specifically for data visualization. With a student account, we can get a one-year free trial. How generous!

Now, you must be wondering how prices differ across locations. Python can handle geographical information, but Tableau is a simpler way to do it. All you need to do is load the dataset, drag in the features, and choose a plot style.
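
For readers who want to stay in Python, a rough equivalent is a scatter plot of the coordinates colored by price. This is only a sketch under assumptions: the first two coordinate pairs come from the head() output above, the remaining rows and all plotting choices are invented, and the Agg backend line is there only so the code runs without a display (it can be dropped in a notebook).

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Partly hypothetical sample; column names follow the housing frame.
sample = pd.DataFrame({
    'Longtitude': [144.9934, 144.9944, 145.0500, 145.1200],
    'Lattitude':  [-37.8079, -37.8093, -37.8500, -37.9000],
    'Price(K)':   [1035.0, 1465.0, 2100.0, 800.0],
})

fig, ax = plt.subplots(figsize=(8, 6))
# Each property is one point; the color encodes its price.
points = ax.scatter(sample['Longtitude'], sample['Lattitude'],
                    c=sample['Price(K)'], cmap='viridis', alpha=0.6)
cbar = fig.colorbar(points, ax=ax)
cbar.set_label('Price(K)')
ax.set_xlabel('Longtitude')
ax.set_ylabel('Lattitude')
ax.set_title('Price(K) by location')
```

This gives no base map, which is part of why Tableau is the more convenient choice here.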

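** Magic 1 **

How are properties distributed across the different regions?

[Figure: Tableau map of properties by region]

How do property prices change across locations?

[Figure: Tableau map of property prices by location]

The black points represent the properties with the highest prices. Most of them are in Southern Metropolitan and Eastern Metropolitan.

Next, select just these luxury properties. Here is what we find:

[Figure: Tableau map of the highest-priced properties]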

Yep!

Conclusion

  • Visualization helps you understand data better
  • Each kind of plot should be applied where it fits
  • Outliers can be detected in plots (though extreme values are not necessarily outliers)
  • Missing values can also be visualized to find patterns!
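
The last point is not demonstrated in this post, because rows with missing values were dropped before plotting. A minimal sketch of how the share of missing values per column could be inspected before calling dropna, using a made-up frame with a few of the dataset's column names:

```python
import pandas as pd

# Hypothetical raw rows with gaps; None becomes NaN in float columns.
raw = pd.DataFrame({
    'Price':        [1035.0, None, 1465.0, None],
    'BuildingArea': [79.0, None, None, 142.0],
    'YearBuilt':    [1900.0, None, 1900.0, 2014.0],
})

# Fraction of missing values in each column, largest first.
missing_share = raw.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Number of rows where Price and BuildingArea are missing together.
both_missing = (raw['Price'].isnull() & raw['BuildingArea'].isnull()).sum()
print(both_missing)
```

Calling missing_share.plot(kind='bar') would turn the first result into a chart, and counts like both_missing hint at whether values tend to be missing together.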

In the future

  • Handling Missing Values is important