Data Preprocessing

Introduction

In the real world, data is not clean. Several techniques are used as follows:

Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
Data integration: using multiple databases, data cubes, or files.
Data transformation: normalization and aggregation.
Data reduction: reducing the volume but producing the same or similar analytical results.
Data discretization: part of data reduction, replacing numerical attributes with nominal ones.

Case study

The dataset I chose is from Kaggle. It is called Melbourne Housing Market. The following link is: https://www.kaggle.com/anthonypino/melbourne-housing-market

1. Import useful packages, read datasets and get some general information

import pandas as pd
import matplotlib.pyplot as plt
import numpy as ny
import seaborn as sns

housing = pd.read_csv('housing.csv')

housing.head()

	Suburb	Address	Rooms	Type	Price	Method	SellerG	Date	Distance	Postcode	...	Bathroom	Car	Landsize	BuildingArea	YearBuilt	CouncilArea	Lattitude	Longtitude	Regionname	Propertycount
0	Abbotsford	68 Studley St	2	h	NaN	SS	Jellis	3/09/2016	2.5	3067.0	...	1.0	1.0	126.0	NaN	NaN	Yarra City Council	-37.8014	144.9958	Northern Metropolitan	4019.0
1	Abbotsford	85 Turner St	2	h	1480000.0	S	Biggin	3/12/2016	2.5	3067.0	...	1.0	1.0	202.0	NaN	NaN	Yarra City Council	-37.7996	144.9984	Northern Metropolitan	4019.0
2	Abbotsford	25 Bloomburg St	2	h	1035000.0	S	Biggin	4/02/2016	2.5	3067.0	...	1.0	0.0	156.0	79.0	1900.0	Yarra City Council	-37.8079	144.9934	Northern Metropolitan	4019.0
3	Abbotsford	18/659 Victoria St	3	u	NaN	VB	Rounds	4/02/2016	2.5	3067.0	...	2.0	1.0	0.0	NaN	NaN	Yarra City Council	-37.8114	145.0116	Northern Metropolitan	4019.0
4	Abbotsford	5 Charles St	3	h	1465000.0	SP	Biggin	4/03/2017	2.5	3067.0	...	2.0	0.0	134.0	150.0	1900.0	Yarra City Council	-37.8093	144.9944	Northern Metropolitan	4019.0

5 rows × 21 columns

The dataset had 21 attributes and 34857 observations in all.

housing.isnull().sum()

Suburb               0
Address              0
Rooms                0
Type                 0
Price             7610
Method               0
SellerG              0
Date                 0
Distance             1
Postcode             1
Bedroom2          8217
Bathroom          8226
Car               8728
Landsize         11810
BuildingArea     21115
YearBuilt        19306
CouncilArea          3
Lattitude         7976
Longtitude        7976
Regionname           3
Propertycount        3
dtype: int64

housing.isnull().sum()/len(housing)*100

Suburb            0.000000
Address           0.000000
Rooms             0.000000
Type              0.000000
Price            21.832057
Method            0.000000
SellerG           0.000000
Date              0.000000
Distance          0.002869
Postcode          0.002869
Bedroom2         23.573457
Bathroom         23.599277
Car              25.039447
Landsize         33.881286
BuildingArea     60.576068
YearBuilt        55.386293
CouncilArea       0.008607
Lattitude        22.882061
Longtitude       22.882061
Regionname        0.008607
Propertycount     0.008607
dtype: float64

It shows missing value percentage for each column. We can also get the percentage of all the missing values.

missing = housing.isnull().sum().sum()
all = housing.isnull().count().sum()
missing/all*100

13.794455441757275

There are a lot of missing values in the dataset, especially in Landsize, BuildingArea, and YearBuilt column. 13% of the data are missing. That is not bad.

You can also visualize missing values

plt.figure(figsize=(10,3))
sns.set(font_scale=1.2)
sns.heatmap(housing.isnull(),yticklabels = False, cbar = False, cmap = 'plasma')

<matplotlib.axes._subplots.AxesSubplot at 0x170625eb978>

png

2. Obtain descriptive statistics of some attributes.

Firstly, we need to understand the data type. Typically, they are:

Categorical data
- Nominal: variables are variables that have two or more categories, but which do not have an intrinsic order. Examples: Gender, weather, room type
- Ordinal: variables are variables that have two or more categories just like nominal variables only the categories can also be ordered or ranked. Examples: Class, Ranking
Numeric data
- Interval
- Ratio

Measures of the central tendency: Mean, Median, Mode

Mean: Average; Susceptible to outliers.
Median: Better in skewed data
Mode: the most frequent score. Best for nominal data
Importance: right-skewed data: median > mean. left-skewed data: median < mean

The more skewed the distribution, the greater the difference between the median and mean, and the greater emphasis should be placed on using the median as opposed to the mean.

Measures of the Dispersion: Variance/Standard Deviation, Range, interquantile range

Variance/Standatd Deviation
Range: Max-min,
Interquartile: Q3 - Q1. Interquartile range is where the middle 50% data locates.

housing.describe()
#'mean of housing price',housing['Price'].mean()

	Rooms	Price	Distance	Postcode	Bedroom2	Bathroom	Car	Landsize	BuildingArea	YearBuilt	Lattitude	Longtitude	Propertycount
count	34857.000000	2.724700e+04	34856.000000	34856.000000	26640.000000	26631.000000	26129.000000	23047.000000	13742.00000	15551.000000	26881.000000	26881.000000	34854.000000
mean	3.031012	1.050173e+06	11.184929	3116.062859	3.084647	1.624798	1.728845	593.598993	160.25640	1965.289885	-37.810634	145.001851	7572.888306
std	0.969933	6.414671e+05	6.788892	109.023903	0.980690	0.724212	1.010771	3398.841946	401.26706	37.328178	0.090279	0.120169	4428.090313
min	1.000000	8.500000e+04	0.000000	3000.000000	0.000000	0.000000	0.000000	0.000000	0.00000	1196.000000	-38.190430	144.423790	83.000000
25%	2.000000	6.350000e+05	6.400000	3051.000000	2.000000	1.000000	1.000000	224.000000	102.00000	1940.000000	-37.862950	144.933500	4385.000000
50%	3.000000	8.700000e+05	10.300000	3103.000000	3.000000	2.000000	2.000000	521.000000	136.00000	1970.000000	-37.807600	145.007800	6763.000000
75%	4.000000	1.295000e+06	14.000000	3156.000000	4.000000	2.000000	2.000000	670.000000	188.00000	2000.000000	-37.754100	145.071900	10412.000000
max	16.000000	1.120000e+07	48.100000	3978.000000	30.000000	12.000000	26.000000	433014.000000	44515.00000	2106.000000	-37.390200	145.526350	21650.000000

Here, we choose price and number of rooms as two attributes we are interested in.

Also, we may need median, mode, range, quartiles for these two attributes.

'median of housing price',housing['Price'].median()

median of housing price 870000.0

('quartiles for housing price', 870000.0)

'mode of housing price',housing['Price'].mode()[0],housing['Price'].mode()[1]

('mode of housing price', 600000.0, 1100000.0)

'mode of housing price',[x for x in housing['Price'].mode()]

('mode of housing price', [600000.0, 1100000.0])

'range of housing price',housing['Price'].max()-housing['Price'].min()

('range of housing price', 11115000.0)

'quartiles for housing price',housing['Price'].quantile([0.25,0.75])

('quartiles for housing price', 0.25     635000.0
 0.75    1295000.0
 Name: Price, dtype: float64)

'standard deviation for housing price',housing['Price'].std()

('standard deviation for housing price', 641467.1301046001)

Visualize the distributions of numeric value

Histogram, Boxplot

plt.figure(figsize=[15,6])
housing['Price'].hist(bins=30)

<matplotlib.axes._subplots.AxesSubplot at 0x26f1caabfd0>

png

plt.figure(figsize=[10,8])
sns.distplot(housing['Price'].dropna(),bins=30)

<matplotlib.axes._subplots.AxesSubplot at 0x17062335128>

png

sns.boxplot(y='Price',data=housing)

<matplotlib.axes._subplots.AxesSubplot at 0x26f1ca8fe80>

png

Then apply the same techniques to Rooms column.

'median of room number',housing['Rooms'].median()

('median of room number', 3.0)

'mode of room number',housing['Rooms'].mode()[0]

('mode of room number', 3)

'range of room number',housing['Rooms'].max()-housing['Rooms'].min()

('range of room number', 15)

'quartiles for room number',housing['Rooms'].quantile([0.25,0.75])

('quartiles for room number', 0.25    2.0
 0.75    4.0
 Name: Rooms, dtype: float64)

plt.figure(figsize=[15,6])
housing['Rooms'].hist(bins=30)

<matplotlib.axes._subplots.AxesSubplot at 0x26f1d1fc550>

png

sns.countplot(housing['Rooms'])

<matplotlib.axes._subplots.AxesSubplot at 0x26f1d308c88>

png

sns.boxplot(y=housing['Rooms'])

<matplotlib.axes._subplots.AxesSubplot at 0x26f1d316e48>

png

3. Potential problem of data quality

1. Missing data

There are several methods to handle missing values. In summary, they are:

Ignore the tuple: usually done when class label is missing.
Use the attribute mean (or majority nominal value) to fill in the missing value.
Use the attribute mean (or majority nominal value) for all samples belonging to the same class.
Predict the missing value by using a learning algorithm: consider the attribute with the missing value as a dependent (class) variable and run a learning algorithm (usually Bayes or decision tree) to predict the missing value.

Reference: http://www.cs.ccsu.edu/~markov/ccsu_courses/datamining-3.html

To simply, we just remove all the observations which have missing values. Advanced methods will be studied in the future.

housing = housing.dropna()

housing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8887 entries, 2 to 34856
Data columns (total 21 columns):
Suburb           8887 non-null object
Address          8887 non-null object
Rooms            8887 non-null int64
Type             8887 non-null object
Price            8887 non-null float64
Method           8887 non-null object
SellerG          8887 non-null object
Date             8887 non-null object
Distance         8887 non-null float64
Postcode         8887 non-null float64
Bedroom2         8887 non-null float64
Bathroom         8887 non-null float64
Car              8887 non-null float64
Landsize         8887 non-null float64
BuildingArea     8887 non-null float64
YearBuilt        8887 non-null float64
CouncilArea      8887 non-null object
Lattitude        8887 non-null float64
Longtitude       8887 non-null float64
Regionname       8887 non-null object
Propertycount    8887 non-null float64
dtypes: float64(12), int64(1), object(8)
memory usage: 1.5+ MB

Now, we drop all the missing values. Dataset is clean.

2. Outlier data

There are several methods to handle missing values. In summary, they are:

Binning
- Sort the attribute values and partition them into bins (see “Unsupervised discretization” below);
- Then smooth by bin means, bin median, or bin boundaries.
Clustering: group values in clusters and then detect and remove outliers (automatic or manual)
Regression: smooth by fitting the data into regression functions.
Univariate method: Boxplot

housing.describe()

	Rooms	Price	Distance	Postcode	Bedroom2	Bathroom	Car	Landsize	BuildingArea	YearBuilt	Lattitude	Longtitude	Propertycount
count	8887.000000	8.887000e+03	8887.000000	8887.000000	8887.000000	8887.000000	8887.000000	8887.000000	8887.000000	8887.000000	8887.000000	8887.000000	8887.000000
mean	3.098909	1.092902e+06	11.199887	3111.662653	3.078204	1.646450	1.692247	523.480365	149.309477	1965.753348	-37.804501	144.991393	7475.940137
std	0.963786	6.793819e+05	6.813402	112.614268	0.966269	0.721611	0.975464	1061.324228	87.925580	37.040876	0.090549	0.118919	4375.024364
min	1.000000	1.310000e+05	0.000000	3000.000000	0.000000	1.000000	0.000000	0.000000	0.000000	1196.000000	-38.174360	144.423790	249.000000
25%	2.000000	6.410000e+05	6.400000	3044.000000	2.000000	1.000000	1.000000	212.000000	100.000000	1945.000000	-37.858560	144.920000	4382.500000
50%	3.000000	9.000000e+05	10.200000	3084.000000	3.000000	2.000000	2.000000	478.000000	132.000000	1970.000000	-37.798700	144.998500	6567.000000
75%	4.000000	1.345000e+06	13.900000	3150.000000	4.000000	2.000000	2.000000	652.000000	180.000000	2000.000000	-37.748945	145.064560	10331.000000
max	12.000000	9.000000e+06	47.400000	3977.000000	12.000000	9.000000	10.000000	42800.000000	3112.000000	2019.000000	-37.407200	145.526350	21650.000000

housing[housing['BuildingArea']==0]

	Suburb	Address	Rooms	Type	Price	Method	SellerG	Date	Distance	Postcode	...	Bathroom	Car	Landsize	YearBuilt	CouncilArea	Lattitude	Longtitude	Regionname	Propertycount
7211	North Melbourne	19 Shands La	2	t	841000.0	S	Jellis	4/03/2017	2.3	3051.0	...	1.0	1.0	215.0	2000.0	Melbourne City Council	-37.79530	144.94370	Northern Metropolitan	6821.0
19775	Balwyn North	14 Wanbrow Av	5	h	1950000.0	S	RT	3/09/2017	9.7	3104.0	...	3.0	2.0	743.0	1949.0	Boroondara City Council	-37.80235	145.09311	Southern Metropolitan	7809.0
19840	Bundoora	22 Moreton Cr	3	h	814000.0	S	Barry	3/09/2017	12.1	3083.0	...	1.0	2.0	542.0	1970.0	Banyule City Council	-37.70861	145.05691	Northern Metropolitan	10175.0
20223	Roxburgh Park	16 Sandover Dr	4	h	570000.0	S	Raine	3/09/2017	20.6	3064.0	...	2.0	2.0	504.0	2000.0	Hume City Council	-37.61419	144.93448	Northern Metropolitan	5833.0
20262	Thornbury	19/337 Station St	3	t	900000.0	VB	Jellis	3/09/2017	7.0	3071.0	...	2.0	2.0	120.0	2000.0	Darebin City Council	-37.76343	145.02096	Northern Metropolitan	8870.0
22040	Prahran	6 Aberdeen Rd	3	h	1390000.0	S	Marshall	19/08/2017	4.6	3181.0	...	2.0	1.0	125.0	2002.0	Stonnington City Council	-37.85257	145.00296	Southern Metropolitan	7717.0
22507	Huntingdale	33 Beauford St	3	h	1205000.0	SA	FN	23/09/2017	12.3	3166.0	...	2.0	4.0	622.0	1960.0	Monash City Council	-37.90823	145.10851	Southern Metropolitan	768.0
22931	Balwyn North	1 Hosken St	5	h	2800000.0	S	Marshall	26/08/2017	9.7	3104.0	...	2.0	2.0	1173.0	1960.0	Boroondara City Council	-37.80385	145.09094	Southern Metropolitan	7809.0
22994	Brighton East	60 Cummins Rd	3	h	1650000.0	SP	Buxton	26/08/2017	10.3	3187.0	...	2.0	2.0	623.0	1920.0	Bayside City Council	-37.92698	145.02673	Southern Metropolitan	6938.0
23022	Bundoora	37 Greenwood Dr	4	h	815000.0	S	Ray	26/08/2017	12.1	3083.0	...	2.0	1.0	525.0	1965.0	Banyule City Council	-37.70765	145.05556	Northern Metropolitan	10175.0
23085	Craigieburn	28 Powell St	3	h	412500.0	S	RE	26/08/2017	20.6	3064.0	...	2.0	1.0	197.0	2012.0	Hume City Council	-37.57687	144.91100	Northern Metropolitan	15510.0
23115	Epping	26 Lowalde Dr	3	h	595000.0	S	hockingstuart	26/08/2017	19.6	3076.0	...	1.0	2.0	536.0	1980.0	Whittlesea City Council	-37.64972	145.04086	Northern Metropolitan	10926.0
23159	Glen Iris	6 Viva St	4	h	2690000.0	PI	Marshall	26/08/2017	7.3	3146.0	...	2.0	2.0	647.0	1910.0	Boroondara City Council	-37.86133	145.04167	Southern Metropolitan	10412.0
23242	Kew	16 Hodgson St	5	h	3450000.0	PI	Kay	26/08/2017	5.4	3101.0	...	2.0	2.0	668.0	2006.0	Boroondara City Council	-37.80795	145.01474	Southern Metropolitan	10331.0
23250	Kilsyth	17 Birkenhead Dr	3	h	803000.0	S	Max	26/08/2017	26.0	3137.0	...	1.0	2.0	862.0	1970.0	Maroondah City Council	-37.79902	145.32092	Eastern Metropolitan	4654.0
23321	Moorabbin	7 Walsh Av	3	h	1290000.0	S	Ray	26/08/2017	14.3	3189.0	...	1.0	1.0	580.0	1970.0	Kingston City Council	-37.94492	145.04938	Southern Metropolitan	2555.0
23378	Port Melbourne	44 Garton St	4	t	2455000.0	SP	Marshall	26/08/2017	3.5	3207.0	...	3.0	2.0	123.0	2010.0	Melbourne City Council	-37.83349	144.94840	Southern Metropolitan	8648.0
23654	Cheltenham	5 Hannah St	3	h	975000.0	S	O'Brien	7/10/2017	17.9	3192.0	...	2.0	2.0	651.0	1970.0	Bayside City Council	-37.95683	145.07184	Southern Metropolitan	9758.0
23690	Craigieburn	18 Pymble Gdns	4	h	590000.0	S	LJ	7/10/2017	20.6	3064.0	...	2.0	2.0	448.0	2009.0	Hume City Council	-37.60902	144.91279	Northern Metropolitan	15510.0
24116	Werribee	21 Sinns Av	3	h	550000.0	S	Ray	7/10/2017	14.7	3030.0	...	2.0	2.0	580.0	1980.0	Wyndham City Council	-37.90136	144.66925	Western Metropolitan	16166.0
24196	Balwyn North	5 Highview Rd	4	h	1500000.0	VB	Fletchers	14/10/2017	9.7	3104.0	...	2.0	1.0	620.0	1965.0	Boroondara City Council	-37.78396	145.07942	Southern Metropolitan	7809.0
24205	Bentleigh	1 Donaldson St	3	h	1730000.0	S	Jellis	14/10/2017	11.4	3204.0	...	1.0	2.0	569.0	1940.0	Glen Eira City Council	-37.91456	145.04109	Southern Metropolitan	6795.0
24344	Donvale	29 Martha St	2	h	1070000.0	S	hockingstuart	14/10/2017	16.1	3111.0	...	2.0	2.0	758.0	1980.0	Manningham City Council	-37.79883	145.17337	Eastern Metropolitan	4790.0
24602	Northcote	155 Clarke St	3	h	2750000.0	VB	Woodards	14/10/2017	5.3	3070.0	...	2.0	4.0	634.0	1886.0	Darebin City Council	-37.77625	144.99572	Northern Metropolitan	11364.0
25086	Glen Iris	60 Hortense St	4	h	2237500.0	S	Marshall	21/10/2017	7.3	3146.0	...	2.0	2.0	650.0	1985.0	Boroondara City Council	-37.85776	145.07998	Southern Metropolitan	10412.0
25320	Reservoir	12 Kelverne St	3	h	650000.0	SP	Barry	21/10/2017	12.0	3073.0	...	1.0	1.0	491.0	1950.0	Darebin City Council	-37.71445	144.98225	Northern Metropolitan	21650.0
25352	Roxburgh Park	15 Donvale Av	3	h	470000.0	S	Raine	21/10/2017	20.6	3064.0	...	2.0	2.0	328.0	2004.0	Hume City Council	-37.61388	144.92270	Northern Metropolitan	5833.0
25376	Spotswood	104 Hudsons Rd	2	h	1225000.0	SP	Greg	21/10/2017	6.2	3015.0	...	1.0	1.0	361.0	1910.0	Hobsons Bay City Council	-37.82940	144.88410	Western Metropolitan	1223.0
25412	Tarneit	10 Discovery Dr	4	h	585000.0	S	hockingstuart	21/10/2017	18.4	3029.0	...	2.0	2.0	448.0	2010.0	Wyndham City Council	-37.84743	144.71243	Western Metropolitan	10160.0
25708	Bundoora	8 Oxford Dr	3	h	770000.0	S	Ray	28/10/2017	12.1	3083.0	...	1.0	1.0	551.0	1970.0	Banyule City Council	-37.69939	145.06567	Northern Metropolitan	10175.0
26343	Preston	148 Albert St	3	h	833000.0	S	Stockdale	28/10/2017	8.4	3072.0	...	1.0	1.0	501.0	1960.0	Darebin City Council	-37.73674	145.02418	Northern Metropolitan	14577.0
26633	Craigieburn	31 Yarcombe Cr	4	h	540000.0	S	Barry	4/11/2017	20.6	3064.0	...	2.0	1.0	541.0	1995.0	Hume City Council	-37.60800	144.92530	Northern Metropolitan	15510.0
27441	Yallambie	21 Lowan Av	5	h	990000.0	S	Buckingham	11/11/2017	12.7	3085.0	...	3.0	2.0	510.0	1985.0	Banyule City Council	-37.72040	145.10880	Northern Metropolitan	1369.0
27564	Brighton East	15 Bayview Rd	3	t	950000.0	VB	Buxton	18/11/2017	10.3	3187.0	...	2.0	2.0	217.0	2000.0	Bayside City Council	-37.90670	145.02470	Southern Metropolitan	6938.0
27587	Brunswick West	1/1 Duggan St	2	u	420500.0	SP	Pagan	18/11/2017	5.2	3055.0	...	1.0	1.0	5497.0	2011.0	Moreland City Council	-37.75820	144.94000	Northern Metropolitan	7082.0
27629	Carrum	18 Church Rd	4	h	980000.0	S	hockingstuart	18/11/2017	31.2	3197.0	...	1.0	3.0	987.0	1960.0	Kingston City Council	-38.07920	145.12760	South-Eastern Metropolitan	1989.0
27922	Lalor	2 Orchid Ct	5	h	591000.0	S	HAR	18/11/2017	16.3	3075.0	...	1.0	2.0	636.0	1980.0	Whittlesea City Council	-37.67010	145.00500	Northern Metropolitan	8279.0
30914	Wollert	21 Dalwood Wy	4	h	609000.0	S	HAR	9/12/2017	25.5	3750.0	...	2.0	2.0	350.0	2015.0	Whittlesea City Council	-37.61031	145.04010	Northern Metropolitan	2940.0
31464	Balwyn North	4 Beverley Ct	4	t	1190000.0	VB	Bekdon	3/03/2018	9.7	3104.0	...	2.0	2.0	260.0	2017.0	Boroondara City Council	-37.79589	145.09988	Southern Metropolitan	7809.0
31509	Blackburn South	5 Abercromby Rd	4	h	1400000.0	VB	Jellis	3/03/2018	13.4	3130.0	...	2.0	1.0	560.0	1960.0	Whitehorse City Council	-37.83533	145.14455	Eastern Metropolitan	4387.0
31541	Brunswick	3 Austral Av	4	h	1500000.0	VB	Nelson	3/03/2018	5.2	3056.0	...	2.0	1.0	373.0	1940.0	Moreland City Council	-37.76289	144.95552	Northern Metropolitan	11918.0
31717	Epping	28 Bail St	3	h	600000.0	S	hockingstuart	3/03/2018	19.6	3076.0	...	2.0	2.0	461.0	2010.0	Whittlesea City Council	-37.62844	145.00884	Northern Metropolitan	10926.0
32403	Roxburgh Park	23 Wrigley Cr	4	h	622000.0	S	Raine	10/03/2018	20.6	3064.0	...	2.0	1.0	530.0	1998.0	Hume City Council	-37.62352	144.93133	Northern Metropolitan	5833.0
33397	Greenvale	26 Perugia Av	4	h	677000.0	S	Ray	17/03/2018	20.4	3059.0	...	2.0	2.0	312.0	2013.0	Hume City Council	-37.62439	144.88629	Northern Metropolitan	4864.0
33899	Wollert	40 Whitebark St	4	h	615000.0	S	HAR	17/03/2018	25.5	3750.0	...	2.0	2.0	392.0	2015.0	Whittlesea City Council	-37.61252	145.04288	Northern Metropolitan	2940.0

45 rows × 21 columns

housing[housing['Rooms']==0]

	Suburb	Address	Rooms	Type	Price	Method	SellerG	Date	Distance	Postcode	...	Bathroom	Car	Landsize	BuildingArea	YearBuilt	CouncilArea	Lattitude	Longtitude	Regionname	Propertycount

0 rows × 21 columns

BuildingArea is suspicious because of zero values. How could they be zero since they all have rooms? Some mistakes might be taken in the data entry. We choose to remove these observations.

housing = housing[housing['BuildingArea']!=0]

housing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8842 entries, 2 to 34856
Data columns (total 21 columns):
Suburb           8842 non-null object
Address          8842 non-null object
Rooms            8842 non-null int64
Type             8842 non-null object
Price            8842 non-null float64
Method           8842 non-null object
SellerG          8842 non-null object
Date             8842 non-null object
Distance         8842 non-null float64
Postcode         8842 non-null float64
Bedroom2         8842 non-null float64
Bathroom         8842 non-null float64
Car              8842 non-null float64
Landsize         8842 non-null float64
BuildingArea     8842 non-null float64
YearBuilt        8842 non-null float64
CouncilArea      8842 non-null object
Lattitude        8842 non-null float64
Longtitude       8842 non-null float64
Regionname       8842 non-null object
Propertycount    8842 non-null float64
dtypes: float64(12), int64(1), object(8)
memory usage: 1.5+ MB

3. Data Reduction

Data reduction is necessary when there are too much redundant information in the dataset. In summary, common methods are:

Reducing the number of attributes
- Data cube aggregation: applying roll-up, slice or dice operations.
- Removing irrelevant attributes: attribute selection (filtering and wrapper methods), searching the attribute space (see Lecture 5: Attribute-oriented analysis).
- Principle component analysis (numeric attributes only): searching for a lower dimensional space that can best represent the data
Reducing the number of attribute values
- Binning (histograms): reducing the number of attributes by grouping them into intervals (bins).
- Clustering: grouping values in clusters.
- Aggregation or generalization
Reducing the number of tuples
- Sampling

Reference: http://www.cs.ccsu.edu/~markov/ccsu_courses/datamining-3.html

In this case, Address and sellerG (real estate agent) seems to be unnecessary currently. In the future, natural language processing might help include in the model. Also, bedroom2 is scraped from different sources. We chose to ignore this column.

housing = housing.drop(['Address','SellerG'],axis = 1)    #axis=1 indicate the column name

4. Feature Engineering

Feature engineering is important to dig more inforamtion from row data. We just translate some attributes into new ones that make more sense in the reality. For example, in the case, we would translate YearBuilt into Age of property. Date sold can also be split into three attributes: Year, Month and Weekday.

Let’s do it!

housing['Age'] = 2018-housing['YearBuilt']

housing[housing['Age']<0]

	Suburb	Rooms	Type	Price	Method	Date	Distance	Postcode	Bedroom2	Bathroom	...	YearBuilt	CouncilArea	Lattitude	Longtitude	Regionname	Propertycount	Age	Year	Month	Weekday
33033	Bentleigh	3	h	1100000.0	VB	2018-03-17	11.4	3204.0	3.0	1.0	...	2019.0	Glen Eira City Council	-37.92963	145.03666	Southern Metropolitan	6795.0	-1.0	2018	3	5

1 rows × 23 columns

** Notice Here: There is one property to be built in 2019 **

We are using groupby function to see the average price VS age of property

plt.figure(figsize=(10,6))
housing.groupby('Age')['Price'].mean().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x17065f3c128>

png

Here, we found a age of 800. It could be true, but it seems an outlier. We choose to take it out.

housing[housing['Age']>400]

	Suburb	Rooms	Type	Price	Method	Date	Distance	Postcode	Bedroom2	Bathroom	...	YearBuilt	CouncilArea	Lattitude	Longtitude	Regionname	Propertycount	Age	Year	Month	Weekday
16424	Mount Waverley	3	h	1200000.0	VB	2017-06-24	14.2	3149.0	3.0	1.0	...	1196.0	Monash City Council	-37.86788	145.12121	Eastern Metropolitan	13366.0	822.0	2017	6	5

1 rows × 23 columns

housing = housing[housing['Age']<500]

Then we visualize again. It should be good now.

plt.figure(figsize=(10,6))
ax=housing.groupby('Age')['Price'].mean().plot()
ax.set_ylabel('Price')

Text(0,0.5,'Price')

png

Interesting Finding!

Then we transform Date column!

housing['Date']=pd.to_datetime(housing['Date'],infer_datetime_format=True)  ## transform Date column to datetime object

housing['Year']=pd.DatetimeIndex(housing['Date']).year
housing['Month']=pd.DatetimeIndex(housing['Date']).month
housing['Weekday']=pd.DatetimeIndex(housing['Date']).weekday

housing.head()

	Suburb	Rooms	Type	Price	Method	Date	Distance	Postcode	Bedroom2	Bathroom	...	YearBuilt	CouncilArea	Lattitude	Longtitude	Regionname	Propertycount	Age	Year	Month	Weekday
2	Abbotsford	2	h	1035000.0	S	2016-04-02	2.5	3067.0	2.0	1.0	...	1900.0	Yarra City Council	-37.8079	144.9934	Northern Metropolitan	4019.0	118.0	2016	4	5
4	Abbotsford	3	h	1465000.0	SP	2017-04-03	2.5	3067.0	3.0	2.0	...	1900.0	Yarra City Council	-37.8093	144.9944	Northern Metropolitan	4019.0	118.0	2017	4	0
6	Abbotsford	4	h	1600000.0	VB	2016-04-06	2.5	3067.0	3.0	1.0	...	2014.0	Yarra City Council	-37.8072	144.9941	Northern Metropolitan	4019.0	4.0	2016	4	2
11	Abbotsford	3	h	1876000.0	S	2016-07-05	2.5	3067.0	4.0	2.0	...	1910.0	Yarra City Council	-37.8024	144.9993	Northern Metropolitan	4019.0	108.0	2016	7	1
14	Abbotsford	2	h	1636000.0	S	2016-08-10	2.5	3067.0	2.0	1.0	...	1890.0	Yarra City Council	-37.8060	144.9954	Northern Metropolitan	4019.0	128.0	2016	8	2

5 rows × 23 columns

sns.factorplot(y='Price',x='Month',kind='bar',data=housing)

<seaborn.axisgrid.FacetGrid at 0x170620c41d0>

png

sns.factorplot(y='Price',x='Weekday',kind='bar',data=housing) 

<seaborn.axisgrid.FacetGrid at 0x17066119d30>

png

sns.factorplot(y='Price',x='Year',kind='bar',data=housing)

<seaborn.axisgrid.FacetGrid at 0x170627f4358>

png

Conclusion

Data preprocessing is very important.
More advanced method in outliers detection and missing values handling is to be studied in the future
- I am really not sure what outliers should be taken into account. Price has many extreme values from boxplot. Should we take all of them off?
- Missing values had better be taken care of carefully. Here, I just drop them.
Visualization will be introduced next time.

Thank you!

Data Preprocessing with Melbourne Housing Data

Preprcessing is the key to the success