Data Preprocessing

Introduction

In the real world, data is not clean. Several techniques are used as follows:

  • Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
  • Data integration: using multiple databases, data cubes, or files.
  • Data transformation: normalization and aggregation.
  • Data reduction: reducing the volume but producing the same or similar analytical results.
  • Data discretization: part of data reduction, replacing numerical attributes with nominal ones.

Case study

The dataset I chose is from Kaggle. It is called Melbourne Housing Market. The following link is: https://www.kaggle.com/anthonypino/melbourne-housing-market

1. Import useful packages, read datasets and get some general information

import pandas as pd
import matplotlib.pyplot as plt
import numpy as ny
import seaborn as sns
housing = pd.read_csv('housing.csv')
housing.head()
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode ... Bathroom Car Landsize BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount
0 Abbotsford 68 Studley St 2 h NaN SS Jellis 3/09/2016 2.5 3067.0 ... 1.0 1.0 126.0 NaN NaN Yarra City Council -37.8014 144.9958 Northern Metropolitan 4019.0
1 Abbotsford 85 Turner St 2 h 1480000.0 S Biggin 3/12/2016 2.5 3067.0 ... 1.0 1.0 202.0 NaN NaN Yarra City Council -37.7996 144.9984 Northern Metropolitan 4019.0
2 Abbotsford 25 Bloomburg St 2 h 1035000.0 S Biggin 4/02/2016 2.5 3067.0 ... 1.0 0.0 156.0 79.0 1900.0 Yarra City Council -37.8079 144.9934 Northern Metropolitan 4019.0
3 Abbotsford 18/659 Victoria St 3 u NaN VB Rounds 4/02/2016 2.5 3067.0 ... 2.0 1.0 0.0 NaN NaN Yarra City Council -37.8114 145.0116 Northern Metropolitan 4019.0
4 Abbotsford 5 Charles St 3 h 1465000.0 SP Biggin 4/03/2017 2.5 3067.0 ... 2.0 0.0 134.0 150.0 1900.0 Yarra City Council -37.8093 144.9944 Northern Metropolitan 4019.0

5 rows × 21 columns

The dataset had 21 attributes and 34857 observations in all.

housing.isnull().sum()
Suburb               0
Address              0
Rooms                0
Type                 0
Price             7610
Method               0
SellerG              0
Date                 0
Distance             1
Postcode             1
Bedroom2          8217
Bathroom          8226
Car               8728
Landsize         11810
BuildingArea     21115
YearBuilt        19306
CouncilArea          3
Lattitude         7976
Longtitude        7976
Regionname           3
Propertycount        3
dtype: int64
housing.isnull().sum()/len(housing)*100
Suburb            0.000000
Address           0.000000
Rooms             0.000000
Type              0.000000
Price            21.832057
Method            0.000000
SellerG           0.000000
Date              0.000000
Distance          0.002869
Postcode          0.002869
Bedroom2         23.573457
Bathroom         23.599277
Car              25.039447
Landsize         33.881286
BuildingArea     60.576068
YearBuilt        55.386293
CouncilArea       0.008607
Lattitude        22.882061
Longtitude       22.882061
Regionname        0.008607
Propertycount     0.008607
dtype: float64

It shows missing value percentage for each column. We can also get the percentage of all the missing values.

missing = housing.isnull().sum().sum()
all = housing.isnull().count().sum()
missing/all*100
13.794455441757275

There are a lot of missing values in the dataset, especially in Landsize, BuildingArea, and YearBuilt column. 13% of the data are missing. That is not bad.

You can also visualize missing values

plt.figure(figsize=(10,3))
sns.set(font_scale=1.2)
sns.heatmap(housing.isnull(),yticklabels = False, cbar = False, cmap = 'plasma')
<matplotlib.axes._subplots.AxesSubplot at 0x170625eb978>

png

2. Obtain descriptive statistics of some attributes.

Firstly, we need to understand the data type. Typically, they are:

  1. Categorical data
    • Nominal: variables are variables that have two or more categories, but which do not have an intrinsic order. Examples: Gender, weather, room type
    • Ordinal: variables are variables that have two or more categories just like nominal variables only the categories can also be ordered or ranked. Examples: Class, Ranking
  2. Numeric data
    • Interval
    • Ratio

Measures of the central tendency: Mean, Median, Mode

  • Mean: Average; Susceptible to outliers.
  • Median: Better in skewed data
  • Mode: the most frequent score. Best for nominal data
  • Importance: right-skewed data: median > mean. left-skewed data: median < mean

The more skewed the distribution, the greater the difference between the median and mean, and the greater emphasis should be placed on using the median as opposed to the mean.

Measures of the Dispersion: Variance/Standard Deviation, Range, interquantile range

  • Variance/Standatd Deviation
  • Range: Max-min,
  • Interquartile: Q3 - Q1. Interquartile range is where the middle 50% data locates.
housing.describe()
#'mean of housing price',housing['Price'].mean()
Rooms Price Distance Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt Lattitude Longtitude Propertycount
count 34857.000000 2.724700e+04 34856.000000 34856.000000 26640.000000 26631.000000 26129.000000 23047.000000 13742.00000 15551.000000 26881.000000 26881.000000 34854.000000
mean 3.031012 1.050173e+06 11.184929 3116.062859 3.084647 1.624798 1.728845 593.598993 160.25640 1965.289885 -37.810634 145.001851 7572.888306
std 0.969933 6.414671e+05 6.788892 109.023903 0.980690 0.724212 1.010771 3398.841946 401.26706 37.328178 0.090279 0.120169 4428.090313
min 1.000000 8.500000e+04 0.000000 3000.000000 0.000000 0.000000 0.000000 0.000000 0.00000 1196.000000 -38.190430 144.423790 83.000000
25% 2.000000 6.350000e+05 6.400000 3051.000000 2.000000 1.000000 1.000000 224.000000 102.00000 1940.000000 -37.862950 144.933500 4385.000000
50% 3.000000 8.700000e+05 10.300000 3103.000000 3.000000 2.000000 2.000000 521.000000 136.00000 1970.000000 -37.807600 145.007800 6763.000000
75% 4.000000 1.295000e+06 14.000000 3156.000000 4.000000 2.000000 2.000000 670.000000 188.00000 2000.000000 -37.754100 145.071900 10412.000000
max 16.000000 1.120000e+07 48.100000 3978.000000 30.000000 12.000000 26.000000 433014.000000 44515.00000 2106.000000 -37.390200 145.526350 21650.000000

Here, we choose price and number of rooms as two attributes we are interested in.

Also, we may need median, mode, range, quartiles for these two attributes.

'median of housing price',housing['Price'].median()
median of housing price 870000.0





('quartiles for housing price', 870000.0)
'mode of housing price',housing['Price'].mode()[0],housing['Price'].mode()[1]
('mode of housing price', 600000.0, 1100000.0)
'mode of housing price',[x for x in housing['Price'].mode()]
('mode of housing price', [600000.0, 1100000.0])
'range of housing price',housing['Price'].max()-housing['Price'].min()
('range of housing price', 11115000.0)
'quartiles for housing price',housing['Price'].quantile([0.25,0.75])
('quartiles for housing price', 0.25     635000.0
 0.75    1295000.0
 Name: Price, dtype: float64)
'standard deviation for housing price',housing['Price'].std()
('standard deviation for housing price', 641467.1301046001)

Visualize the distributions of numeric value

Histogram, Boxplot

plt.figure(figsize=[15,6])
housing['Price'].hist(bins=30)
<matplotlib.axes._subplots.AxesSubplot at 0x26f1caabfd0>

png

plt.figure(figsize=[10,8])
sns.distplot(housing['Price'].dropna(),bins=30)
<matplotlib.axes._subplots.AxesSubplot at 0x17062335128>

png

sns.boxplot(y='Price',data=housing)
<matplotlib.axes._subplots.AxesSubplot at 0x26f1ca8fe80>

png

Then apply the same techniques to Rooms column.

'median of room number',housing['Rooms'].median()
('median of room number', 3.0)
'mode of room number',housing['Rooms'].mode()[0]
('mode of room number', 3)
'range of room number',housing['Rooms'].max()-housing['Rooms'].min()
('range of room number', 15)
'quartiles for room number',housing['Rooms'].quantile([0.25,0.75])
('quartiles for room number', 0.25    2.0
 0.75    4.0
 Name: Rooms, dtype: float64)
plt.figure(figsize=[15,6])
housing['Rooms'].hist(bins=30)
<matplotlib.axes._subplots.AxesSubplot at 0x26f1d1fc550>

png

sns.countplot(housing['Rooms'])
<matplotlib.axes._subplots.AxesSubplot at 0x26f1d308c88>

png

sns.boxplot(y=housing['Rooms'])
<matplotlib.axes._subplots.AxesSubplot at 0x26f1d316e48>

png

3. Potential problem of data quality

1. Missing data

There are several methods to handle missing values. In summary, they are:

  • Ignore the tuple: usually done when class label is missing.
  • Use the attribute mean (or majority nominal value) to fill in the missing value.
  • Use the attribute mean (or majority nominal value) for all samples belonging to the same class.
  • Predict the missing value by using a learning algorithm: consider the attribute with the missing value as a dependent (class) variable and run a learning algorithm (usually Bayes or decision tree) to predict the missing value.

Reference: http://www.cs.ccsu.edu/~markov/ccsu_courses/datamining-3.html

To simply, we just remove all the observations which have missing values. Advanced methods will be studied in the future.

housing = housing.dropna()
housing.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8887 entries, 2 to 34856
Data columns (total 21 columns):
Suburb           8887 non-null object
Address          8887 non-null object
Rooms            8887 non-null int64
Type             8887 non-null object
Price            8887 non-null float64
Method           8887 non-null object
SellerG          8887 non-null object
Date             8887 non-null object
Distance         8887 non-null float64
Postcode         8887 non-null float64
Bedroom2         8887 non-null float64
Bathroom         8887 non-null float64
Car              8887 non-null float64
Landsize         8887 non-null float64
BuildingArea     8887 non-null float64
YearBuilt        8887 non-null float64
CouncilArea      8887 non-null object
Lattitude        8887 non-null float64
Longtitude       8887 non-null float64
Regionname       8887 non-null object
Propertycount    8887 non-null float64
dtypes: float64(12), int64(1), object(8)
memory usage: 1.5+ MB

Now, we drop all the missing values. Dataset is clean.

2. Outlier data

There are several methods to handle missing values. In summary, they are:

  • Binning
    • Sort the attribute values and partition them into bins (see “Unsupervised discretization” below);
    • Then smooth by bin means, bin median, or bin boundaries.
  • Clustering: group values in clusters and then detect and remove outliers (automatic or manual)
  • Regression: smooth by fitting the data into regression functions.
  • Univariate method: Boxplot
housing.describe()
Rooms Price Distance Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt Lattitude Longtitude Propertycount
count 8887.000000 8.887000e+03 8887.000000 8887.000000 8887.000000 8887.000000 8887.000000 8887.000000 8887.000000 8887.000000 8887.000000 8887.000000 8887.000000
mean 3.098909 1.092902e+06 11.199887 3111.662653 3.078204 1.646450 1.692247 523.480365 149.309477 1965.753348 -37.804501 144.991393 7475.940137
std 0.963786 6.793819e+05 6.813402 112.614268 0.966269 0.721611 0.975464 1061.324228 87.925580 37.040876 0.090549 0.118919 4375.024364
min 1.000000 1.310000e+05 0.000000 3000.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1196.000000 -38.174360 144.423790 249.000000
25% 2.000000 6.410000e+05 6.400000 3044.000000 2.000000 1.000000 1.000000 212.000000 100.000000 1945.000000 -37.858560 144.920000 4382.500000
50% 3.000000 9.000000e+05 10.200000 3084.000000 3.000000 2.000000 2.000000 478.000000 132.000000 1970.000000 -37.798700 144.998500 6567.000000
75% 4.000000 1.345000e+06 13.900000 3150.000000 4.000000 2.000000 2.000000 652.000000 180.000000 2000.000000 -37.748945 145.064560 10331.000000
max 12.000000 9.000000e+06 47.400000 3977.000000 12.000000 9.000000 10.000000 42800.000000 3112.000000 2019.000000 -37.407200 145.526350 21650.000000
housing[housing['BuildingArea']==0]
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode ... Bathroom Car Landsize BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount
7211 North Melbourne 19 Shands La 2 t 841000.0 S Jellis 4/03/2017 2.3 3051.0 ... 1.0 1.0 215.0 0.0 2000.0 Melbourne City Council -37.79530 144.94370 Northern Metropolitan 6821.0
19775 Balwyn North 14 Wanbrow Av 5 h 1950000.0 S RT 3/09/2017 9.7 3104.0 ... 3.0 2.0 743.0 0.0 1949.0 Boroondara City Council -37.80235 145.09311 Southern Metropolitan 7809.0
19840 Bundoora 22 Moreton Cr 3 h 814000.0 S Barry 3/09/2017 12.1 3083.0 ... 1.0 2.0 542.0 0.0 1970.0 Banyule City Council -37.70861 145.05691 Northern Metropolitan 10175.0
20223 Roxburgh Park 16 Sandover Dr 4 h 570000.0 S Raine 3/09/2017 20.6 3064.0 ... 2.0 2.0 504.0 0.0 2000.0 Hume City Council -37.61419 144.93448 Northern Metropolitan 5833.0
20262 Thornbury 19/337 Station St 3 t 900000.0 VB Jellis 3/09/2017 7.0 3071.0 ... 2.0 2.0 120.0 0.0 2000.0 Darebin City Council -37.76343 145.02096 Northern Metropolitan 8870.0
22040 Prahran 6 Aberdeen Rd 3 h 1390000.0 S Marshall 19/08/2017 4.6 3181.0 ... 2.0 1.0 125.0 0.0 2002.0 Stonnington City Council -37.85257 145.00296 Southern Metropolitan 7717.0
22507 Huntingdale 33 Beauford St 3 h 1205000.0 SA FN 23/09/2017 12.3 3166.0 ... 2.0 4.0 622.0 0.0 1960.0 Monash City Council -37.90823 145.10851 Southern Metropolitan 768.0
22931 Balwyn North 1 Hosken St 5 h 2800000.0 S Marshall 26/08/2017 9.7 3104.0 ... 2.0 2.0 1173.0 0.0 1960.0 Boroondara City Council -37.80385 145.09094 Southern Metropolitan 7809.0
22994 Brighton East 60 Cummins Rd 3 h 1650000.0 SP Buxton 26/08/2017 10.3 3187.0 ... 2.0 2.0 623.0 0.0 1920.0 Bayside City Council -37.92698 145.02673 Southern Metropolitan 6938.0
23022 Bundoora 37 Greenwood Dr 4 h 815000.0 S Ray 26/08/2017 12.1 3083.0 ... 2.0 1.0 525.0 0.0 1965.0 Banyule City Council -37.70765 145.05556 Northern Metropolitan 10175.0
23085 Craigieburn 28 Powell St 3 h 412500.0 S RE 26/08/2017 20.6 3064.0 ... 2.0 1.0 197.0 0.0 2012.0 Hume City Council -37.57687 144.91100 Northern Metropolitan 15510.0
23115 Epping 26 Lowalde Dr 3 h 595000.0 S hockingstuart 26/08/2017 19.6 3076.0 ... 1.0 2.0 536.0 0.0 1980.0 Whittlesea City Council -37.64972 145.04086 Northern Metropolitan 10926.0
23159 Glen Iris 6 Viva St 4 h 2690000.0 PI Marshall 26/08/2017 7.3 3146.0 ... 2.0 2.0 647.0 0.0 1910.0 Boroondara City Council -37.86133 145.04167 Southern Metropolitan 10412.0
23242 Kew 16 Hodgson St 5 h 3450000.0 PI Kay 26/08/2017 5.4 3101.0 ... 2.0 2.0 668.0 0.0 2006.0 Boroondara City Council -37.80795 145.01474 Southern Metropolitan 10331.0
23250 Kilsyth 17 Birkenhead Dr 3 h 803000.0 S Max 26/08/2017 26.0 3137.0 ... 1.0 2.0 862.0 0.0 1970.0 Maroondah City Council -37.79902 145.32092 Eastern Metropolitan 4654.0
23321 Moorabbin 7 Walsh Av 3 h 1290000.0 S Ray 26/08/2017 14.3 3189.0 ... 1.0 1.0 580.0 0.0 1970.0 Kingston City Council -37.94492 145.04938 Southern Metropolitan 2555.0
23378 Port Melbourne 44 Garton St 4 t 2455000.0 SP Marshall 26/08/2017 3.5 3207.0 ... 3.0 2.0 123.0 0.0 2010.0 Melbourne City Council -37.83349 144.94840 Southern Metropolitan 8648.0
23654 Cheltenham 5 Hannah St 3 h 975000.0 S O'Brien 7/10/2017 17.9 3192.0 ... 2.0 2.0 651.0 0.0 1970.0 Bayside City Council -37.95683 145.07184 Southern Metropolitan 9758.0
23690 Craigieburn 18 Pymble Gdns 4 h 590000.0 S LJ 7/10/2017 20.6 3064.0 ... 2.0 2.0 448.0 0.0 2009.0 Hume City Council -37.60902 144.91279 Northern Metropolitan 15510.0
24116 Werribee 21 Sinns Av 3 h 550000.0 S Ray 7/10/2017 14.7 3030.0 ... 2.0 2.0 580.0 0.0 1980.0 Wyndham City Council -37.90136 144.66925 Western Metropolitan 16166.0
24196 Balwyn North 5 Highview Rd 4 h 1500000.0 VB Fletchers 14/10/2017 9.7 3104.0 ... 2.0 1.0 620.0 0.0 1965.0 Boroondara City Council -37.78396 145.07942 Southern Metropolitan 7809.0
24205 Bentleigh 1 Donaldson St 3 h 1730000.0 S Jellis 14/10/2017 11.4 3204.0 ... 1.0 2.0 569.0 0.0 1940.0 Glen Eira City Council -37.91456 145.04109 Southern Metropolitan 6795.0
24344 Donvale 29 Martha St 2 h 1070000.0 S hockingstuart 14/10/2017 16.1 3111.0 ... 2.0 2.0 758.0 0.0 1980.0 Manningham City Council -37.79883 145.17337 Eastern Metropolitan 4790.0
24602 Northcote 155 Clarke St 3 h 2750000.0 VB Woodards 14/10/2017 5.3 3070.0 ... 2.0 4.0 634.0 0.0 1886.0 Darebin City Council -37.77625 144.99572 Northern Metropolitan 11364.0
25086 Glen Iris 60 Hortense St 4 h 2237500.0 S Marshall 21/10/2017 7.3 3146.0 ... 2.0 2.0 650.0 0.0 1985.0 Boroondara City Council -37.85776 145.07998 Southern Metropolitan 10412.0
25320 Reservoir 12 Kelverne St 3 h 650000.0 SP Barry 21/10/2017 12.0 3073.0 ... 1.0 1.0 491.0 0.0 1950.0 Darebin City Council -37.71445 144.98225 Northern Metropolitan 21650.0
25352 Roxburgh Park 15 Donvale Av 3 h 470000.0 S Raine 21/10/2017 20.6 3064.0 ... 2.0 2.0 328.0 0.0 2004.0 Hume City Council -37.61388 144.92270 Northern Metropolitan 5833.0
25376 Spotswood 104 Hudsons Rd 2 h 1225000.0 SP Greg 21/10/2017 6.2 3015.0 ... 1.0 1.0 361.0 0.0 1910.0 Hobsons Bay City Council -37.82940 144.88410 Western Metropolitan 1223.0
25412 Tarneit 10 Discovery Dr 4 h 585000.0 S hockingstuart 21/10/2017 18.4 3029.0 ... 2.0 2.0 448.0 0.0 2010.0 Wyndham City Council -37.84743 144.71243 Western Metropolitan 10160.0
25708 Bundoora 8 Oxford Dr 3 h 770000.0 S Ray 28/10/2017 12.1 3083.0 ... 1.0 1.0 551.0 0.0 1970.0 Banyule City Council -37.69939 145.06567 Northern Metropolitan 10175.0
26343 Preston 148 Albert St 3 h 833000.0 S Stockdale 28/10/2017 8.4 3072.0 ... 1.0 1.0 501.0 0.0 1960.0 Darebin City Council -37.73674 145.02418 Northern Metropolitan 14577.0
26633 Craigieburn 31 Yarcombe Cr 4 h 540000.0 S Barry 4/11/2017 20.6 3064.0 ... 2.0 1.0 541.0 0.0 1995.0 Hume City Council -37.60800 144.92530 Northern Metropolitan 15510.0
27441 Yallambie 21 Lowan Av 5 h 990000.0 S Buckingham 11/11/2017 12.7 3085.0 ... 3.0 2.0 510.0 0.0 1985.0 Banyule City Council -37.72040 145.10880 Northern Metropolitan 1369.0
27564 Brighton East 15 Bayview Rd 3 t 950000.0 VB Buxton 18/11/2017 10.3 3187.0 ... 2.0 2.0 217.0 0.0 2000.0 Bayside City Council -37.90670 145.02470 Southern Metropolitan 6938.0
27587 Brunswick West 1/1 Duggan St 2 u 420500.0 SP Pagan 18/11/2017 5.2 3055.0 ... 1.0 1.0 5497.0 0.0 2011.0 Moreland City Council -37.75820 144.94000 Northern Metropolitan 7082.0
27629 Carrum 18 Church Rd 4 h 980000.0 S hockingstuart 18/11/2017 31.2 3197.0 ... 1.0 3.0 987.0 0.0 1960.0 Kingston City Council -38.07920 145.12760 South-Eastern Metropolitan 1989.0
27922 Lalor 2 Orchid Ct 5 h 591000.0 S HAR 18/11/2017 16.3 3075.0 ... 1.0 2.0 636.0 0.0 1980.0 Whittlesea City Council -37.67010 145.00500 Northern Metropolitan 8279.0
30914 Wollert 21 Dalwood Wy 4 h 609000.0 S HAR 9/12/2017 25.5 3750.0 ... 2.0 2.0 350.0 0.0 2015.0 Whittlesea City Council -37.61031 145.04010 Northern Metropolitan 2940.0
31464 Balwyn North 4 Beverley Ct 4 t 1190000.0 VB Bekdon 3/03/2018 9.7 3104.0 ... 2.0 2.0 260.0 0.0 2017.0 Boroondara City Council -37.79589 145.09988 Southern Metropolitan 7809.0
31509 Blackburn South 5 Abercromby Rd 4 h 1400000.0 VB Jellis 3/03/2018 13.4 3130.0 ... 2.0 1.0 560.0 0.0 1960.0 Whitehorse City Council -37.83533 145.14455 Eastern Metropolitan 4387.0
31541 Brunswick 3 Austral Av 4 h 1500000.0 VB Nelson 3/03/2018 5.2 3056.0 ... 2.0 1.0 373.0 0.0 1940.0 Moreland City Council -37.76289 144.95552 Northern Metropolitan 11918.0
31717 Epping 28 Bail St 3 h 600000.0 S hockingstuart 3/03/2018 19.6 3076.0 ... 2.0 2.0 461.0 0.0 2010.0 Whittlesea City Council -37.62844 145.00884 Northern Metropolitan 10926.0
32403 Roxburgh Park 23 Wrigley Cr 4 h 622000.0 S Raine 10/03/2018 20.6 3064.0 ... 2.0 1.0 530.0 0.0 1998.0 Hume City Council -37.62352 144.93133 Northern Metropolitan 5833.0
33397 Greenvale 26 Perugia Av 4 h 677000.0 S Ray 17/03/2018 20.4 3059.0 ... 2.0 2.0 312.0 0.0 2013.0 Hume City Council -37.62439 144.88629 Northern Metropolitan 4864.0
33899 Wollert 40 Whitebark St 4 h 615000.0 S HAR 17/03/2018 25.5 3750.0 ... 2.0 2.0 392.0 0.0 2015.0 Whittlesea City Council -37.61252 145.04288 Northern Metropolitan 2940.0

45 rows × 21 columns

housing[housing['Rooms']==0]
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode ... Bathroom Car Landsize BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount

0 rows × 21 columns

BuildingArea is suspicious because of zero values. How could they be zero since they all have rooms? Some mistakes might be taken in the data entry. We choose to remove these observations.

housing = housing[housing['BuildingArea']!=0]
housing.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8842 entries, 2 to 34856
Data columns (total 21 columns):
Suburb           8842 non-null object
Address          8842 non-null object
Rooms            8842 non-null int64
Type             8842 non-null object
Price            8842 non-null float64
Method           8842 non-null object
SellerG          8842 non-null object
Date             8842 non-null object
Distance         8842 non-null float64
Postcode         8842 non-null float64
Bedroom2         8842 non-null float64
Bathroom         8842 non-null float64
Car              8842 non-null float64
Landsize         8842 non-null float64
BuildingArea     8842 non-null float64
YearBuilt        8842 non-null float64
CouncilArea      8842 non-null object
Lattitude        8842 non-null float64
Longtitude       8842 non-null float64
Regionname       8842 non-null object
Propertycount    8842 non-null float64
dtypes: float64(12), int64(1), object(8)
memory usage: 1.5+ MB

3. Data Reduction

Data reduction is necessary when there are too much redundant information in the dataset. In summary, common methods are:

  • Reducing the number of attributes
    • Data cube aggregation: applying roll-up, slice or dice operations.
    • Removing irrelevant attributes: attribute selection (filtering and wrapper methods), searching the attribute space (see Lecture 5: Attribute-oriented analysis).
    • Principle component analysis (numeric attributes only): searching for a lower dimensional space that can best represent the data
  • Reducing the number of attribute values
    • Binning (histograms): reducing the number of attributes by grouping them into intervals (bins).
    • Clustering: grouping values in clusters.
    • Aggregation or generalization
  • Reducing the number of tuples
    • Sampling

Reference: http://www.cs.ccsu.edu/~markov/ccsu_courses/datamining-3.html

In this case, Address and sellerG (real estate agent) seems to be unnecessary currently. In the future, natural language processing might help include in the model. Also, bedroom2 is scraped from different sources. We chose to ignore this column.

housing = housing.drop(['Address','SellerG'],axis = 1)    #axis=1 indicate the column name

4. Feature Engineering

Feature engineering is important to dig more inforamtion from row data. We just translate some attributes into new ones that make more sense in the reality. For example, in the case, we would translate YearBuilt into Age of property. Date sold can also be split into three attributes: Year, Month and Weekday.

Let’s do it!

housing['Age'] = 2018-housing['YearBuilt']
housing[housing['Age']<0]
Suburb Rooms Type Price Method Date Distance Postcode Bedroom2 Bathroom ... YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount Age Year Month Weekday
33033 Bentleigh 3 h 1100000.0 VB 2018-03-17 11.4 3204.0 3.0 1.0 ... 2019.0 Glen Eira City Council -37.92963 145.03666 Southern Metropolitan 6795.0 -1.0 2018 3 5

1 rows × 23 columns

** Notice Here: There is one property to be built in 2019 **

We are using groupby function to see the average price VS age of property

plt.figure(figsize=(10,6))
housing.groupby('Age')['Price'].mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x17065f3c128>

png

Here, we found a age of 800. It could be true, but it seems an outlier. We choose to take it out.

housing[housing['Age']>400]
Suburb Rooms Type Price Method Date Distance Postcode Bedroom2 Bathroom ... YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount Age Year Month Weekday
16424 Mount Waverley 3 h 1200000.0 VB 2017-06-24 14.2 3149.0 3.0 1.0 ... 1196.0 Monash City Council -37.86788 145.12121 Eastern Metropolitan 13366.0 822.0 2017 6 5

1 rows × 23 columns

housing = housing[housing['Age']<500]

Then we visualize again. It should be good now.

plt.figure(figsize=(10,6))
ax=housing.groupby('Age')['Price'].mean().plot()
ax.set_ylabel('Price')
Text(0,0.5,'Price')

png

Interesting Finding!

Then we transform Date column!

housing['Date']=pd.to_datetime(housing['Date'],infer_datetime_format=True)  ## transform Date column to datetime object
housing['Year']=pd.DatetimeIndex(housing['Date']).year
housing['Month']=pd.DatetimeIndex(housing['Date']).month
housing['Weekday']=pd.DatetimeIndex(housing['Date']).weekday
housing.head()
Suburb Rooms Type Price Method Date Distance Postcode Bedroom2 Bathroom ... YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount Age Year Month Weekday
2 Abbotsford 2 h 1035000.0 S 2016-04-02 2.5 3067.0 2.0 1.0 ... 1900.0 Yarra City Council -37.8079 144.9934 Northern Metropolitan 4019.0 118.0 2016 4 5
4 Abbotsford 3 h 1465000.0 SP 2017-04-03 2.5 3067.0 3.0 2.0 ... 1900.0 Yarra City Council -37.8093 144.9944 Northern Metropolitan 4019.0 118.0 2017 4 0
6 Abbotsford 4 h 1600000.0 VB 2016-04-06 2.5 3067.0 3.0 1.0 ... 2014.0 Yarra City Council -37.8072 144.9941 Northern Metropolitan 4019.0 4.0 2016 4 2
11 Abbotsford 3 h 1876000.0 S 2016-07-05 2.5 3067.0 4.0 2.0 ... 1910.0 Yarra City Council -37.8024 144.9993 Northern Metropolitan 4019.0 108.0 2016 7 1
14 Abbotsford 2 h 1636000.0 S 2016-08-10 2.5 3067.0 2.0 1.0 ... 1890.0 Yarra City Council -37.8060 144.9954 Northern Metropolitan 4019.0 128.0 2016 8 2

5 rows × 23 columns

sns.factorplot(y='Price',x='Month',kind='bar',data=housing)
<seaborn.axisgrid.FacetGrid at 0x170620c41d0>

png

sns.factorplot(y='Price',x='Weekday',kind='bar',data=housing) 

<seaborn.axisgrid.FacetGrid at 0x17066119d30>

png

sns.factorplot(y='Price',x='Year',kind='bar',data=housing)
<seaborn.axisgrid.FacetGrid at 0x170627f4358>

png

Conclusion

  1. Data preprocessing is very important.
  2. More advanced method in outliers detection and missing values handling is to be studied in the future
    • I am really not sure what outliers should be taken into account. Price has many extreme values from boxplot. Should we take all of them off?
    • Missing values had better be taken care of carefully. Here, I just drop them.
  3. Visualization will be introduced next time.

Thank you!