Data Preprocessing
Introduction
In the real world, data is not clean. Several techniques are used as follows:
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
- Data integration: using multiple databases, data cubes, or files.
- Data transformation: normalization and aggregation.
- Data reduction: reducing the volume but producing the same or similar analytical results.
- Data discretization: part of data reduction, replacing numerical attributes with nominal ones.
Case study
The dataset I chose is from Kaggle. It is called Melbourne Housing Market. The following link is: https://www.kaggle.com/anthonypino/melbourne-housing-market
1. Import useful packages, read datasets and get some general information
import pandas as pd
import matplotlib.pyplot as plt
import numpy as ny
import seaborn as sns
housing = pd.read_csv('housing.csv')
housing.head()
Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Abbotsford | 68 Studley St | 2 | h | NaN | SS | Jellis | 3/09/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 126.0 | NaN | NaN | Yarra City Council | -37.8014 | 144.9958 | Northern Metropolitan | 4019.0 |
1 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra City Council | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
2 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra City Council | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
3 | Abbotsford | 18/659 Victoria St | 3 | u | NaN | VB | Rounds | 4/02/2016 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 0.0 | NaN | NaN | Yarra City Council | -37.8114 | 145.0116 | Northern Metropolitan | 4019.0 |
4 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra City Council | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
5 rows × 21 columns
The dataset had 21 attributes and 34857 observations in all.
housing.isnull().sum()
Suburb 0
Address 0
Rooms 0
Type 0
Price 7610
Method 0
SellerG 0
Date 0
Distance 1
Postcode 1
Bedroom2 8217
Bathroom 8226
Car 8728
Landsize 11810
BuildingArea 21115
YearBuilt 19306
CouncilArea 3
Lattitude 7976
Longtitude 7976
Regionname 3
Propertycount 3
dtype: int64
housing.isnull().sum()/len(housing)*100
Suburb 0.000000
Address 0.000000
Rooms 0.000000
Type 0.000000
Price 21.832057
Method 0.000000
SellerG 0.000000
Date 0.000000
Distance 0.002869
Postcode 0.002869
Bedroom2 23.573457
Bathroom 23.599277
Car 25.039447
Landsize 33.881286
BuildingArea 60.576068
YearBuilt 55.386293
CouncilArea 0.008607
Lattitude 22.882061
Longtitude 22.882061
Regionname 0.008607
Propertycount 0.008607
dtype: float64
It shows missing value percentage for each column. We can also get the percentage of all the missing values.
missing = housing.isnull().sum().sum()
all = housing.isnull().count().sum()
missing/all*100
13.794455441757275
There are a lot of missing values in the dataset, especially in Landsize, BuildingArea, and YearBuilt column. 13% of the data are missing. That is not bad.
You can also visualize missing values
plt.figure(figsize=(10,3))
sns.set(font_scale=1.2)
sns.heatmap(housing.isnull(),yticklabels = False, cbar = False, cmap = 'plasma')
<matplotlib.axes._subplots.AxesSubplot at 0x170625eb978>
2. Obtain descriptive statistics of some attributes.
Firstly, we need to understand the data type. Typically, they are:
- Categorical data
- Nominal: variables are variables that have two or more categories, but which do not have an intrinsic order. Examples: Gender, weather, room type
- Ordinal: variables are variables that have two or more categories just like nominal variables only the categories can also be ordered or ranked. Examples: Class, Ranking
- Numeric data
- Interval
- Ratio
Measures of the central tendency: Mean, Median, Mode
- Mean: Average; Susceptible to outliers.
- Median: Better in skewed data
- Mode: the most frequent score. Best for nominal data
- Importance: right-skewed data: median > mean. left-skewed data: median < mean
The more skewed the distribution, the greater the difference between the median and mean, and the greater emphasis should be placed on using the median as opposed to the mean.
Measures of the Dispersion: Variance/Standard Deviation, Range, interquantile range
- Variance/Standatd Deviation
- Range: Max-min,
- Interquartile: Q3 - Q1. Interquartile range is where the middle 50% data locates.
housing.describe()
#'mean of housing price',housing['Price'].mean()
Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 34857.000000 | 2.724700e+04 | 34856.000000 | 34856.000000 | 26640.000000 | 26631.000000 | 26129.000000 | 23047.000000 | 13742.00000 | 15551.000000 | 26881.000000 | 26881.000000 | 34854.000000 |
mean | 3.031012 | 1.050173e+06 | 11.184929 | 3116.062859 | 3.084647 | 1.624798 | 1.728845 | 593.598993 | 160.25640 | 1965.289885 | -37.810634 | 145.001851 | 7572.888306 |
std | 0.969933 | 6.414671e+05 | 6.788892 | 109.023903 | 0.980690 | 0.724212 | 1.010771 | 3398.841946 | 401.26706 | 37.328178 | 0.090279 | 0.120169 | 4428.090313 |
min | 1.000000 | 8.500000e+04 | 0.000000 | 3000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 1196.000000 | -38.190430 | 144.423790 | 83.000000 |
25% | 2.000000 | 6.350000e+05 | 6.400000 | 3051.000000 | 2.000000 | 1.000000 | 1.000000 | 224.000000 | 102.00000 | 1940.000000 | -37.862950 | 144.933500 | 4385.000000 |
50% | 3.000000 | 8.700000e+05 | 10.300000 | 3103.000000 | 3.000000 | 2.000000 | 2.000000 | 521.000000 | 136.00000 | 1970.000000 | -37.807600 | 145.007800 | 6763.000000 |
75% | 4.000000 | 1.295000e+06 | 14.000000 | 3156.000000 | 4.000000 | 2.000000 | 2.000000 | 670.000000 | 188.00000 | 2000.000000 | -37.754100 | 145.071900 | 10412.000000 |
max | 16.000000 | 1.120000e+07 | 48.100000 | 3978.000000 | 30.000000 | 12.000000 | 26.000000 | 433014.000000 | 44515.00000 | 2106.000000 | -37.390200 | 145.526350 | 21650.000000 |
Here, we choose price and number of rooms as two attributes we are interested in.
Also, we may need median, mode, range, quartiles for these two attributes.
'median of housing price',housing['Price'].median()
median of housing price 870000.0
('quartiles for housing price', 870000.0)
'mode of housing price',housing['Price'].mode()[0],housing['Price'].mode()[1]
('mode of housing price', 600000.0, 1100000.0)
'mode of housing price',[x for x in housing['Price'].mode()]
('mode of housing price', [600000.0, 1100000.0])
'range of housing price',housing['Price'].max()-housing['Price'].min()
('range of housing price', 11115000.0)
'quartiles for housing price',housing['Price'].quantile([0.25,0.75])
('quartiles for housing price', 0.25 635000.0
0.75 1295000.0
Name: Price, dtype: float64)
'standard deviation for housing price',housing['Price'].std()
('standard deviation for housing price', 641467.1301046001)
Visualize the distributions of numeric value
Histogram, Boxplot
plt.figure(figsize=[15,6])
housing['Price'].hist(bins=30)
<matplotlib.axes._subplots.AxesSubplot at 0x26f1caabfd0>
plt.figure(figsize=[10,8])
sns.distplot(housing['Price'].dropna(),bins=30)
<matplotlib.axes._subplots.AxesSubplot at 0x17062335128>
sns.boxplot(y='Price',data=housing)
<matplotlib.axes._subplots.AxesSubplot at 0x26f1ca8fe80>
Then apply the same techniques to Rooms column.
'median of room number',housing['Rooms'].median()
('median of room number', 3.0)
'mode of room number',housing['Rooms'].mode()[0]
('mode of room number', 3)
'range of room number',housing['Rooms'].max()-housing['Rooms'].min()
('range of room number', 15)
'quartiles for room number',housing['Rooms'].quantile([0.25,0.75])
('quartiles for room number', 0.25 2.0
0.75 4.0
Name: Rooms, dtype: float64)
plt.figure(figsize=[15,6])
housing['Rooms'].hist(bins=30)
<matplotlib.axes._subplots.AxesSubplot at 0x26f1d1fc550>
sns.countplot(housing['Rooms'])
<matplotlib.axes._subplots.AxesSubplot at 0x26f1d308c88>
sns.boxplot(y=housing['Rooms'])
<matplotlib.axes._subplots.AxesSubplot at 0x26f1d316e48>
3. Potential problem of data quality
1. Missing data
There are several methods to handle missing values. In summary, they are:
- Ignore the tuple: usually done when class label is missing.
- Use the attribute mean (or majority nominal value) to fill in the missing value.
- Use the attribute mean (or majority nominal value) for all samples belonging to the same class.
- Predict the missing value by using a learning algorithm: consider the attribute with the missing value as a dependent (class) variable and run a learning algorithm (usually Bayes or decision tree) to predict the missing value.
Reference: http://www.cs.ccsu.edu/~markov/ccsu_courses/datamining-3.html
To simply, we just remove all the observations which have missing values. Advanced methods will be studied in the future.
housing = housing.dropna()
housing.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8887 entries, 2 to 34856
Data columns (total 21 columns):
Suburb 8887 non-null object
Address 8887 non-null object
Rooms 8887 non-null int64
Type 8887 non-null object
Price 8887 non-null float64
Method 8887 non-null object
SellerG 8887 non-null object
Date 8887 non-null object
Distance 8887 non-null float64
Postcode 8887 non-null float64
Bedroom2 8887 non-null float64
Bathroom 8887 non-null float64
Car 8887 non-null float64
Landsize 8887 non-null float64
BuildingArea 8887 non-null float64
YearBuilt 8887 non-null float64
CouncilArea 8887 non-null object
Lattitude 8887 non-null float64
Longtitude 8887 non-null float64
Regionname 8887 non-null object
Propertycount 8887 non-null float64
dtypes: float64(12), int64(1), object(8)
memory usage: 1.5+ MB
Now, we drop all the missing values. Dataset is clean.
2. Outlier data
There are several methods to handle missing values. In summary, they are:
- Binning
- Sort the attribute values and partition them into bins (see “Unsupervised discretization” below);
- Then smooth by bin means, bin median, or bin boundaries.
- Clustering: group values in clusters and then detect and remove outliers (automatic or manual)
- Regression: smooth by fitting the data into regression functions.
- Univariate method: Boxplot
housing.describe()
Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 8887.000000 | 8.887000e+03 | 8887.000000 | 8887.000000 | 8887.000000 | 8887.000000 | 8887.000000 | 8887.000000 | 8887.000000 | 8887.000000 | 8887.000000 | 8887.000000 | 8887.000000 |
mean | 3.098909 | 1.092902e+06 | 11.199887 | 3111.662653 | 3.078204 | 1.646450 | 1.692247 | 523.480365 | 149.309477 | 1965.753348 | -37.804501 | 144.991393 | 7475.940137 |
std | 0.963786 | 6.793819e+05 | 6.813402 | 112.614268 | 0.966269 | 0.721611 | 0.975464 | 1061.324228 | 87.925580 | 37.040876 | 0.090549 | 0.118919 | 4375.024364 |
min | 1.000000 | 1.310000e+05 | 0.000000 | 3000.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1196.000000 | -38.174360 | 144.423790 | 249.000000 |
25% | 2.000000 | 6.410000e+05 | 6.400000 | 3044.000000 | 2.000000 | 1.000000 | 1.000000 | 212.000000 | 100.000000 | 1945.000000 | -37.858560 | 144.920000 | 4382.500000 |
50% | 3.000000 | 9.000000e+05 | 10.200000 | 3084.000000 | 3.000000 | 2.000000 | 2.000000 | 478.000000 | 132.000000 | 1970.000000 | -37.798700 | 144.998500 | 6567.000000 |
75% | 4.000000 | 1.345000e+06 | 13.900000 | 3150.000000 | 4.000000 | 2.000000 | 2.000000 | 652.000000 | 180.000000 | 2000.000000 | -37.748945 | 145.064560 | 10331.000000 |
max | 12.000000 | 9.000000e+06 | 47.400000 | 3977.000000 | 12.000000 | 9.000000 | 10.000000 | 42800.000000 | 3112.000000 | 2019.000000 | -37.407200 | 145.526350 | 21650.000000 |
housing[housing['BuildingArea']==0]
Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7211 | North Melbourne | 19 Shands La | 2 | t | 841000.0 | S | Jellis | 4/03/2017 | 2.3 | 3051.0 | ... | 1.0 | 1.0 | 215.0 | 0.0 | 2000.0 | Melbourne City Council | -37.79530 | 144.94370 | Northern Metropolitan | 6821.0 |
19775 | Balwyn North | 14 Wanbrow Av | 5 | h | 1950000.0 | S | RT | 3/09/2017 | 9.7 | 3104.0 | ... | 3.0 | 2.0 | 743.0 | 0.0 | 1949.0 | Boroondara City Council | -37.80235 | 145.09311 | Southern Metropolitan | 7809.0 |
19840 | Bundoora | 22 Moreton Cr | 3 | h | 814000.0 | S | Barry | 3/09/2017 | 12.1 | 3083.0 | ... | 1.0 | 2.0 | 542.0 | 0.0 | 1970.0 | Banyule City Council | -37.70861 | 145.05691 | Northern Metropolitan | 10175.0 |
20223 | Roxburgh Park | 16 Sandover Dr | 4 | h | 570000.0 | S | Raine | 3/09/2017 | 20.6 | 3064.0 | ... | 2.0 | 2.0 | 504.0 | 0.0 | 2000.0 | Hume City Council | -37.61419 | 144.93448 | Northern Metropolitan | 5833.0 |
20262 | Thornbury | 19/337 Station St | 3 | t | 900000.0 | VB | Jellis | 3/09/2017 | 7.0 | 3071.0 | ... | 2.0 | 2.0 | 120.0 | 0.0 | 2000.0 | Darebin City Council | -37.76343 | 145.02096 | Northern Metropolitan | 8870.0 |
22040 | Prahran | 6 Aberdeen Rd | 3 | h | 1390000.0 | S | Marshall | 19/08/2017 | 4.6 | 3181.0 | ... | 2.0 | 1.0 | 125.0 | 0.0 | 2002.0 | Stonnington City Council | -37.85257 | 145.00296 | Southern Metropolitan | 7717.0 |
22507 | Huntingdale | 33 Beauford St | 3 | h | 1205000.0 | SA | FN | 23/09/2017 | 12.3 | 3166.0 | ... | 2.0 | 4.0 | 622.0 | 0.0 | 1960.0 | Monash City Council | -37.90823 | 145.10851 | Southern Metropolitan | 768.0 |
22931 | Balwyn North | 1 Hosken St | 5 | h | 2800000.0 | S | Marshall | 26/08/2017 | 9.7 | 3104.0 | ... | 2.0 | 2.0 | 1173.0 | 0.0 | 1960.0 | Boroondara City Council | -37.80385 | 145.09094 | Southern Metropolitan | 7809.0 |
22994 | Brighton East | 60 Cummins Rd | 3 | h | 1650000.0 | SP | Buxton | 26/08/2017 | 10.3 | 3187.0 | ... | 2.0 | 2.0 | 623.0 | 0.0 | 1920.0 | Bayside City Council | -37.92698 | 145.02673 | Southern Metropolitan | 6938.0 |
23022 | Bundoora | 37 Greenwood Dr | 4 | h | 815000.0 | S | Ray | 26/08/2017 | 12.1 | 3083.0 | ... | 2.0 | 1.0 | 525.0 | 0.0 | 1965.0 | Banyule City Council | -37.70765 | 145.05556 | Northern Metropolitan | 10175.0 |
23085 | Craigieburn | 28 Powell St | 3 | h | 412500.0 | S | RE | 26/08/2017 | 20.6 | 3064.0 | ... | 2.0 | 1.0 | 197.0 | 0.0 | 2012.0 | Hume City Council | -37.57687 | 144.91100 | Northern Metropolitan | 15510.0 |
23115 | Epping | 26 Lowalde Dr | 3 | h | 595000.0 | S | hockingstuart | 26/08/2017 | 19.6 | 3076.0 | ... | 1.0 | 2.0 | 536.0 | 0.0 | 1980.0 | Whittlesea City Council | -37.64972 | 145.04086 | Northern Metropolitan | 10926.0 |
23159 | Glen Iris | 6 Viva St | 4 | h | 2690000.0 | PI | Marshall | 26/08/2017 | 7.3 | 3146.0 | ... | 2.0 | 2.0 | 647.0 | 0.0 | 1910.0 | Boroondara City Council | -37.86133 | 145.04167 | Southern Metropolitan | 10412.0 |
23242 | Kew | 16 Hodgson St | 5 | h | 3450000.0 | PI | Kay | 26/08/2017 | 5.4 | 3101.0 | ... | 2.0 | 2.0 | 668.0 | 0.0 | 2006.0 | Boroondara City Council | -37.80795 | 145.01474 | Southern Metropolitan | 10331.0 |
23250 | Kilsyth | 17 Birkenhead Dr | 3 | h | 803000.0 | S | Max | 26/08/2017 | 26.0 | 3137.0 | ... | 1.0 | 2.0 | 862.0 | 0.0 | 1970.0 | Maroondah City Council | -37.79902 | 145.32092 | Eastern Metropolitan | 4654.0 |
23321 | Moorabbin | 7 Walsh Av | 3 | h | 1290000.0 | S | Ray | 26/08/2017 | 14.3 | 3189.0 | ... | 1.0 | 1.0 | 580.0 | 0.0 | 1970.0 | Kingston City Council | -37.94492 | 145.04938 | Southern Metropolitan | 2555.0 |
23378 | Port Melbourne | 44 Garton St | 4 | t | 2455000.0 | SP | Marshall | 26/08/2017 | 3.5 | 3207.0 | ... | 3.0 | 2.0 | 123.0 | 0.0 | 2010.0 | Melbourne City Council | -37.83349 | 144.94840 | Southern Metropolitan | 8648.0 |
23654 | Cheltenham | 5 Hannah St | 3 | h | 975000.0 | S | O'Brien | 7/10/2017 | 17.9 | 3192.0 | ... | 2.0 | 2.0 | 651.0 | 0.0 | 1970.0 | Bayside City Council | -37.95683 | 145.07184 | Southern Metropolitan | 9758.0 |
23690 | Craigieburn | 18 Pymble Gdns | 4 | h | 590000.0 | S | LJ | 7/10/2017 | 20.6 | 3064.0 | ... | 2.0 | 2.0 | 448.0 | 0.0 | 2009.0 | Hume City Council | -37.60902 | 144.91279 | Northern Metropolitan | 15510.0 |
24116 | Werribee | 21 Sinns Av | 3 | h | 550000.0 | S | Ray | 7/10/2017 | 14.7 | 3030.0 | ... | 2.0 | 2.0 | 580.0 | 0.0 | 1980.0 | Wyndham City Council | -37.90136 | 144.66925 | Western Metropolitan | 16166.0 |
24196 | Balwyn North | 5 Highview Rd | 4 | h | 1500000.0 | VB | Fletchers | 14/10/2017 | 9.7 | 3104.0 | ... | 2.0 | 1.0 | 620.0 | 0.0 | 1965.0 | Boroondara City Council | -37.78396 | 145.07942 | Southern Metropolitan | 7809.0 |
24205 | Bentleigh | 1 Donaldson St | 3 | h | 1730000.0 | S | Jellis | 14/10/2017 | 11.4 | 3204.0 | ... | 1.0 | 2.0 | 569.0 | 0.0 | 1940.0 | Glen Eira City Council | -37.91456 | 145.04109 | Southern Metropolitan | 6795.0 |
24344 | Donvale | 29 Martha St | 2 | h | 1070000.0 | S | hockingstuart | 14/10/2017 | 16.1 | 3111.0 | ... | 2.0 | 2.0 | 758.0 | 0.0 | 1980.0 | Manningham City Council | -37.79883 | 145.17337 | Eastern Metropolitan | 4790.0 |
24602 | Northcote | 155 Clarke St | 3 | h | 2750000.0 | VB | Woodards | 14/10/2017 | 5.3 | 3070.0 | ... | 2.0 | 4.0 | 634.0 | 0.0 | 1886.0 | Darebin City Council | -37.77625 | 144.99572 | Northern Metropolitan | 11364.0 |
25086 | Glen Iris | 60 Hortense St | 4 | h | 2237500.0 | S | Marshall | 21/10/2017 | 7.3 | 3146.0 | ... | 2.0 | 2.0 | 650.0 | 0.0 | 1985.0 | Boroondara City Council | -37.85776 | 145.07998 | Southern Metropolitan | 10412.0 |
25320 | Reservoir | 12 Kelverne St | 3 | h | 650000.0 | SP | Barry | 21/10/2017 | 12.0 | 3073.0 | ... | 1.0 | 1.0 | 491.0 | 0.0 | 1950.0 | Darebin City Council | -37.71445 | 144.98225 | Northern Metropolitan | 21650.0 |
25352 | Roxburgh Park | 15 Donvale Av | 3 | h | 470000.0 | S | Raine | 21/10/2017 | 20.6 | 3064.0 | ... | 2.0 | 2.0 | 328.0 | 0.0 | 2004.0 | Hume City Council | -37.61388 | 144.92270 | Northern Metropolitan | 5833.0 |
25376 | Spotswood | 104 Hudsons Rd | 2 | h | 1225000.0 | SP | Greg | 21/10/2017 | 6.2 | 3015.0 | ... | 1.0 | 1.0 | 361.0 | 0.0 | 1910.0 | Hobsons Bay City Council | -37.82940 | 144.88410 | Western Metropolitan | 1223.0 |
25412 | Tarneit | 10 Discovery Dr | 4 | h | 585000.0 | S | hockingstuart | 21/10/2017 | 18.4 | 3029.0 | ... | 2.0 | 2.0 | 448.0 | 0.0 | 2010.0 | Wyndham City Council | -37.84743 | 144.71243 | Western Metropolitan | 10160.0 |
25708 | Bundoora | 8 Oxford Dr | 3 | h | 770000.0 | S | Ray | 28/10/2017 | 12.1 | 3083.0 | ... | 1.0 | 1.0 | 551.0 | 0.0 | 1970.0 | Banyule City Council | -37.69939 | 145.06567 | Northern Metropolitan | 10175.0 |
26343 | Preston | 148 Albert St | 3 | h | 833000.0 | S | Stockdale | 28/10/2017 | 8.4 | 3072.0 | ... | 1.0 | 1.0 | 501.0 | 0.0 | 1960.0 | Darebin City Council | -37.73674 | 145.02418 | Northern Metropolitan | 14577.0 |
26633 | Craigieburn | 31 Yarcombe Cr | 4 | h | 540000.0 | S | Barry | 4/11/2017 | 20.6 | 3064.0 | ... | 2.0 | 1.0 | 541.0 | 0.0 | 1995.0 | Hume City Council | -37.60800 | 144.92530 | Northern Metropolitan | 15510.0 |
27441 | Yallambie | 21 Lowan Av | 5 | h | 990000.0 | S | Buckingham | 11/11/2017 | 12.7 | 3085.0 | ... | 3.0 | 2.0 | 510.0 | 0.0 | 1985.0 | Banyule City Council | -37.72040 | 145.10880 | Northern Metropolitan | 1369.0 |
27564 | Brighton East | 15 Bayview Rd | 3 | t | 950000.0 | VB | Buxton | 18/11/2017 | 10.3 | 3187.0 | ... | 2.0 | 2.0 | 217.0 | 0.0 | 2000.0 | Bayside City Council | -37.90670 | 145.02470 | Southern Metropolitan | 6938.0 |
27587 | Brunswick West | 1/1 Duggan St | 2 | u | 420500.0 | SP | Pagan | 18/11/2017 | 5.2 | 3055.0 | ... | 1.0 | 1.0 | 5497.0 | 0.0 | 2011.0 | Moreland City Council | -37.75820 | 144.94000 | Northern Metropolitan | 7082.0 |
27629 | Carrum | 18 Church Rd | 4 | h | 980000.0 | S | hockingstuart | 18/11/2017 | 31.2 | 3197.0 | ... | 1.0 | 3.0 | 987.0 | 0.0 | 1960.0 | Kingston City Council | -38.07920 | 145.12760 | South-Eastern Metropolitan | 1989.0 |
27922 | Lalor | 2 Orchid Ct | 5 | h | 591000.0 | S | HAR | 18/11/2017 | 16.3 | 3075.0 | ... | 1.0 | 2.0 | 636.0 | 0.0 | 1980.0 | Whittlesea City Council | -37.67010 | 145.00500 | Northern Metropolitan | 8279.0 |
30914 | Wollert | 21 Dalwood Wy | 4 | h | 609000.0 | S | HAR | 9/12/2017 | 25.5 | 3750.0 | ... | 2.0 | 2.0 | 350.0 | 0.0 | 2015.0 | Whittlesea City Council | -37.61031 | 145.04010 | Northern Metropolitan | 2940.0 |
31464 | Balwyn North | 4 Beverley Ct | 4 | t | 1190000.0 | VB | Bekdon | 3/03/2018 | 9.7 | 3104.0 | ... | 2.0 | 2.0 | 260.0 | 0.0 | 2017.0 | Boroondara City Council | -37.79589 | 145.09988 | Southern Metropolitan | 7809.0 |
31509 | Blackburn South | 5 Abercromby Rd | 4 | h | 1400000.0 | VB | Jellis | 3/03/2018 | 13.4 | 3130.0 | ... | 2.0 | 1.0 | 560.0 | 0.0 | 1960.0 | Whitehorse City Council | -37.83533 | 145.14455 | Eastern Metropolitan | 4387.0 |
31541 | Brunswick | 3 Austral Av | 4 | h | 1500000.0 | VB | Nelson | 3/03/2018 | 5.2 | 3056.0 | ... | 2.0 | 1.0 | 373.0 | 0.0 | 1940.0 | Moreland City Council | -37.76289 | 144.95552 | Northern Metropolitan | 11918.0 |
31717 | Epping | 28 Bail St | 3 | h | 600000.0 | S | hockingstuart | 3/03/2018 | 19.6 | 3076.0 | ... | 2.0 | 2.0 | 461.0 | 0.0 | 2010.0 | Whittlesea City Council | -37.62844 | 145.00884 | Northern Metropolitan | 10926.0 |
32403 | Roxburgh Park | 23 Wrigley Cr | 4 | h | 622000.0 | S | Raine | 10/03/2018 | 20.6 | 3064.0 | ... | 2.0 | 1.0 | 530.0 | 0.0 | 1998.0 | Hume City Council | -37.62352 | 144.93133 | Northern Metropolitan | 5833.0 |
33397 | Greenvale | 26 Perugia Av | 4 | h | 677000.0 | S | Ray | 17/03/2018 | 20.4 | 3059.0 | ... | 2.0 | 2.0 | 312.0 | 0.0 | 2013.0 | Hume City Council | -37.62439 | 144.88629 | Northern Metropolitan | 4864.0 |
33899 | Wollert | 40 Whitebark St | 4 | h | 615000.0 | S | HAR | 17/03/2018 | 25.5 | 3750.0 | ... | 2.0 | 2.0 | 392.0 | 0.0 | 2015.0 | Whittlesea City Council | -37.61252 | 145.04288 | Northern Metropolitan | 2940.0 |
45 rows × 21 columns
housing[housing['Rooms']==0]
Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
---|
0 rows × 21 columns
BuildingArea is suspicious because of zero values. How could they be zero since they all have rooms? Some mistakes might be taken in the data entry. We choose to remove these observations.
housing = housing[housing['BuildingArea']!=0]
housing.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8842 entries, 2 to 34856
Data columns (total 21 columns):
Suburb 8842 non-null object
Address 8842 non-null object
Rooms 8842 non-null int64
Type 8842 non-null object
Price 8842 non-null float64
Method 8842 non-null object
SellerG 8842 non-null object
Date 8842 non-null object
Distance 8842 non-null float64
Postcode 8842 non-null float64
Bedroom2 8842 non-null float64
Bathroom 8842 non-null float64
Car 8842 non-null float64
Landsize 8842 non-null float64
BuildingArea 8842 non-null float64
YearBuilt 8842 non-null float64
CouncilArea 8842 non-null object
Lattitude 8842 non-null float64
Longtitude 8842 non-null float64
Regionname 8842 non-null object
Propertycount 8842 non-null float64
dtypes: float64(12), int64(1), object(8)
memory usage: 1.5+ MB
3. Data Reduction
Data reduction is necessary when there are too much redundant information in the dataset. In summary, common methods are:
- Reducing the number of attributes
- Data cube aggregation: applying roll-up, slice or dice operations.
- Removing irrelevant attributes: attribute selection (filtering and wrapper methods), searching the attribute space (see Lecture 5: Attribute-oriented analysis).
- Principle component analysis (numeric attributes only): searching for a lower dimensional space that can best represent the data
- Reducing the number of attribute values
- Binning (histograms): reducing the number of attributes by grouping them into intervals (bins).
- Clustering: grouping values in clusters.
- Aggregation or generalization
- Reducing the number of tuples
- Sampling
Reference: http://www.cs.ccsu.edu/~markov/ccsu_courses/datamining-3.html
In this case, Address and sellerG (real estate agent) seems to be unnecessary currently. In the future, natural language processing might help include in the model. Also, bedroom2 is scraped from different sources. We chose to ignore this column.
housing = housing.drop(['Address','SellerG'],axis = 1) #axis=1 indicate the column name
4. Feature Engineering
Feature engineering is important to dig more inforamtion from row data. We just translate some attributes into new ones that make more sense in the reality. For example, in the case, we would translate YearBuilt into Age of property. Date sold can also be split into three attributes: Year, Month and Weekday.
Let’s do it!
housing['Age'] = 2018-housing['YearBuilt']
housing[housing['Age']<0]
Suburb | Rooms | Type | Price | Method | Date | Distance | Postcode | Bedroom2 | Bathroom | ... | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount | Age | Year | Month | Weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
33033 | Bentleigh | 3 | h | 1100000.0 | VB | 2018-03-17 | 11.4 | 3204.0 | 3.0 | 1.0 | ... | 2019.0 | Glen Eira City Council | -37.92963 | 145.03666 | Southern Metropolitan | 6795.0 | -1.0 | 2018 | 3 | 5 |
1 rows × 23 columns
** Notice Here: There is one property to be built in 2019 **
We are using groupby function to see the average price VS age of property
plt.figure(figsize=(10,6))
housing.groupby('Age')['Price'].mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x17065f3c128>
Here, we found a age of 800. It could be true, but it seems an outlier. We choose to take it out.
housing[housing['Age']>400]
Suburb | Rooms | Type | Price | Method | Date | Distance | Postcode | Bedroom2 | Bathroom | ... | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount | Age | Year | Month | Weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16424 | Mount Waverley | 3 | h | 1200000.0 | VB | 2017-06-24 | 14.2 | 3149.0 | 3.0 | 1.0 | ... | 1196.0 | Monash City Council | -37.86788 | 145.12121 | Eastern Metropolitan | 13366.0 | 822.0 | 2017 | 6 | 5 |
1 rows × 23 columns
housing = housing[housing['Age']<500]
Then we visualize again. It should be good now.
plt.figure(figsize=(10,6))
ax=housing.groupby('Age')['Price'].mean().plot()
ax.set_ylabel('Price')
Text(0,0.5,'Price')
Interesting Finding!
Then we transform Date column!
housing['Date']=pd.to_datetime(housing['Date'],infer_datetime_format=True) ## transform Date column to datetime object
housing['Year']=pd.DatetimeIndex(housing['Date']).year
housing['Month']=pd.DatetimeIndex(housing['Date']).month
housing['Weekday']=pd.DatetimeIndex(housing['Date']).weekday
housing.head()
Suburb | Rooms | Type | Price | Method | Date | Distance | Postcode | Bedroom2 | Bathroom | ... | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount | Age | Year | Month | Weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | Abbotsford | 2 | h | 1035000.0 | S | 2016-04-02 | 2.5 | 3067.0 | 2.0 | 1.0 | ... | 1900.0 | Yarra City Council | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 | 118.0 | 2016 | 4 | 5 |
4 | Abbotsford | 3 | h | 1465000.0 | SP | 2017-04-03 | 2.5 | 3067.0 | 3.0 | 2.0 | ... | 1900.0 | Yarra City Council | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 | 118.0 | 2017 | 4 | 0 |
6 | Abbotsford | 4 | h | 1600000.0 | VB | 2016-04-06 | 2.5 | 3067.0 | 3.0 | 1.0 | ... | 2014.0 | Yarra City Council | -37.8072 | 144.9941 | Northern Metropolitan | 4019.0 | 4.0 | 2016 | 4 | 2 |
11 | Abbotsford | 3 | h | 1876000.0 | S | 2016-07-05 | 2.5 | 3067.0 | 4.0 | 2.0 | ... | 1910.0 | Yarra City Council | -37.8024 | 144.9993 | Northern Metropolitan | 4019.0 | 108.0 | 2016 | 7 | 1 |
14 | Abbotsford | 2 | h | 1636000.0 | S | 2016-08-10 | 2.5 | 3067.0 | 2.0 | 1.0 | ... | 1890.0 | Yarra City Council | -37.8060 | 144.9954 | Northern Metropolitan | 4019.0 | 128.0 | 2016 | 8 | 2 |
5 rows × 23 columns
sns.factorplot(y='Price',x='Month',kind='bar',data=housing)
<seaborn.axisgrid.FacetGrid at 0x170620c41d0>
sns.factorplot(y='Price',x='Weekday',kind='bar',data=housing)
<seaborn.axisgrid.FacetGrid at 0x17066119d30>
sns.factorplot(y='Price',x='Year',kind='bar',data=housing)
<seaborn.axisgrid.FacetGrid at 0x170627f4358>
Conclusion
- Data preprocessing is very important.
- More advanced method in outliers detection and missing values handling is to be studied in the future
- I am really not sure what outliers should be taken into account. Price has many extreme values from boxplot. Should we take all of them off?
- Missing values had better be taken care of carefully. Here, I just drop them.
- Visualization will be introduced next time.
Thank you!