How to predict HDB resale prices using 3 different Machine learning models (Linear Regression, Neural Networks, K Nearest Neighbor)

In this tutorial, we aim to find the best model for predicting HDB resale prices using three different machine learning models. For the impatient person, K nearest neighbours achieved the best result. To try out the prediction accuracy for yourself, you may go to hdbpricer.com

This is part 1 of a 4-part tutorial
1. Building a good prediction model
2. Hosting the model prediction as an API endpoint on Flask
3. Building a simple VueJS frontend for a users to price their HDBs
4. Deploying the entire full stack application to the internet

The Git repository for this notebook can be found here
My hdbpricer colab notebook
Colab is Google’s hosted notebook service that allows you to use their GPU / TPUs for free, try it out!

Background

Well, at some point in time of my life I would want to buy a house (HDB resale flat to be exact). Therefore, I thought if I could understand prices well based on data, it would benefit me to a large extent. If you’re looking for a great example of a good pricing model for real estate in Singapore, I would recommend SRX.

Some inspiration (and code) for the first two models were taken from this US Housing data prediction example 

This is the Dataset for HDB Resale prices 

Importing packages

%matplotlib inline
import matplotlib.pyplot as plt

import math
import tensorflow as tf
from collections import defaultdict
import numpy as np
from numpy import unique
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import ModelCheckpoint, Callback, TensorBoard

Reading data

Here, I hosted my own dataset (no modifications) instead of using data.gov.sg’s API. Then you use pandas to read the file into a dataframe. As you can see, I prefer a separate copy so I do not need to reload everytime I alter the dataframe

file_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vQ8OfO82KXoRmO0E6c58MdwsOSc8ns5Geme87SiaiqTUrS_hI8u8mYE5KIOfQe4m2m3GGf9En22xuXx/pub?gid=382289391&single=true&output=csv"
data = pd.read_csv(file_url)

dataframe = data.copy()

Use dataframe.head() to take a sneak peak of the dataframe.

monthtownflat_typeblockstreet_namestorey_rangefloor_area_sqmflat_modellease_commence_dateremaining_leaseresale_price
02017-01ANG MO KIO2 ROOM406ANG MO KIO AVE 1010 TO 1244.0Improved197961 years 04 months232000.0
12017-01ANG MO KIO3 ROOM108ANG MO KIO AVE 401 TO 0367.0New Generation197860 years 07 months250000.0
22017-01ANG MO KIO3 ROOM602ANG MO KIO AVE 501 TO 0367.0New Generation198062 years 05 months262000.0
32017-01ANG MO KIO3 ROOM465ANG MO KIO AVE 1004 TO 0668.0New Generation198062 years 01 month265000.0
42017-01ANG MO KIO3 ROOM601ANG MO KIO AVE 501 TO 0367.0New Generation198062 years 05 months265000.0

I would then try to understand the data better by plotting some graphs – statistics always helps!

#visualizing house prices
fig = plt.figure(figsize=(10,7))
fig.add_subplot(2,1,1)
sns.distplot(dataframe['resale_price'])
plt.tight_layout()

#visualizing square metres
fig = plt.figure(figsize=(20,12))
fig.add_subplot(2,2,1)
sns.scatterplot(dataframe['floor_area_sqm'], dataframe['resale_price'])


#visualizing town, flat type, flat model, storey range
fig = plt.figure(figsize=(15,7))
fig.add_subplot(2,2,1)
sns.countplot(dataframe['town'])
plt.xticks(rotation=90)
fig.add_subplot(2,2,2)
sns.countplot(dataframe['flat_type'])
plt.xticks(rotation=90)
fig.add_subplot(2,2,3)
sns.countplot(dataframe['flat_model'])
plt.xticks(rotation=90)
fig.add_subplot(2,2,4)
sns.countplot(dataframe['storey_range'])
plt.xticks(rotation=90)

plt.tight_layout()

Preprocess data

After getting a rough understanding of the data we are working with, we preprocess the data help us in determining the input to our model.


#let's break date to years, months
dataframe['date'] = pd.to_datetime(dataframe['month']) dataframe['month'] = dataframe['date'].apply(lambda date:date.month) dataframe['year'] = dataframe['date'].apply(lambda date:date.year)

#Get number of years left on lease as a continuous number (ignoring months) dataframe['remaining_lease'] = dataframe['remaining_lease'].apply(lambda remaining_lease:remaining_lease[:2]) #Get storey range as a continuous number dataframe['storey_range'] = dataframe['storey_range'].apply(lambda storey_range:storey_range[:2])

#data visualization house price vs months and years fig = plt.figure(figsize=(16,5)) fig.add_subplot(1,2,1) dataframe.groupby('month').mean()['resale_price'].plot() fig.add_subplot(1,2,2) dataframe.groupby('year').mean()['resale_price'].plot()

Geocoding

I would also like to geocode the addresses. However since geocoders usually have a limit to the amount of addresses you can geocode for free, I decided to geocode the towns.

I used Geopy and Geopanda as shown here https://towardsdatascience.com/geocode-with-python-161ec1e62b89

#Concat address
dataframe['address'] = dataframe['block'].map(str) + ', ' + dataframe['street_name'].map(str) + ', Singapore' 
import geopandas as gpd
import geopy
from geopy.geocoders import Nominatim
import folium
from folium.plugins import FastMarkerCluster
from geopy.extra.rate_limiter import RateLimiter
#Geocode by town (Singapore is so small that geocoding by addresses might not make much difference compared to geocoding to town)

town = [x for x in dataframe['town'].unique().tolist() 
            if type(x) == str]
latitude = []
longitude =  []
for i in range(0, len(town)):
    # remove things that does not seem usefull here
    try:
        geolocator = Nominatim(user_agent="ny_explorer")
        loc = geolocator.geocode(town[i])
        latitude.append(loc.latitude)
        longitude.append(loc.longitude)
        print('The geographical coordinate of location are {}, {}.'.format(loc.latitude, loc.longitude))
    except:
        # in the case the geolocator does not work, then add nan element to list
        # to keep the right size
        latitude.append(np.nan)
        longitude.append(np.nan)
# create a dataframe with the locatio, latitude and longitude
df_ = pd.DataFrame({'town':town, 
                    'latitude': latitude,
                    'longitude':longitude})
# merge on Restaurant_Location with rest_df to get the column 
dataframe = dataframe.merge(df_, on='town', how='left')
monthtownflat_typeblockstreet_namestorey_rangefloor_area_sqmflat_modellease_commence_dateremaining_leaseresale_pricedateyearaddresslatitudelongitude
01ANG MO KIO2 ROOM406ANG MO KIO AVE 101044.0Improved197961232000.02017-01-012017406, ANG MO KIO AVE 10, Singapore1.370080103.849523
11ANG MO KIO3 ROOM108ANG MO KIO AVE 40167.0New Generation197860250000.02017-01-012017108, ANG MO KIO AVE 4, Singapore1.370080103.849523
21ANG MO KIO3 ROOM602ANG MO KIO AVE 50167.0New Generation198062262000.02017-01-012017602, ANG MO KIO AVE 5, Singapore1.370080103.849523
31ANG MO KIO3 ROOM465ANG MO KIO AVE 100468.0New Generation198062265000.02017-01-012017465, ANG MO KIO AVE 10, Singapore1.370080103.849523
41ANG MO KIO3 ROOM601ANG MO KIO AVE 50167.0New Generation198062265000.02017-01-012017601, ANG MO KIO AVE 5, Singapore1.370080103.849523

If you would like to use the lat/lng visually, you can opt for folium.

folium_map = folium.Map(location=[1.3521,103.8198],
                        zoom_start=12,
                        tiles='CartoDB dark_matter')


FastMarkerCluster(data=list(zip(dataframe['latitude'].values, dataframe['longitude'].values))).add_to(folium_map)
folium.LayerControl().add_to(folium_map)
folium_map
# check if there are any Null values
dataframe.isnull().sum()

Encoding

I usually use labelencoding or one-hot encoding, but this time I chose a bit of a more bare metal approach that I need for my webapp. This is not recommended if you have data that is unforeseen.


townDict = {'ANG MO KIO': 1,'BEDOK': 2,'BISHAN': 3,'BUKIT BATOK': 4,'BUKIT MERAH': 5,'BUKIT PANJANG': 6,'BUKIT TIMAH': 7,'CENTRAL AREA': 8,'CHOA CHU KANG': 9,'CLEMENTI': 10,'GEYLANG': 11,'HOUGANG': 12,'JURONG EAST': 13,'JURONG WEST': 14,'KALLANG/WHAMPOA': 15,'MARINE PARADE': 16,'PASIR RIS': 17,'PUNGGOL': 18,'QUEENSTOWN': 19,'SEMBAWANG': 20,'SENGKANG': 21,'SERANGOON': 22,'TAMPINES': 23,'TOA PAYOH': 24,'WOODLANDS': 25,'YISHUN': 26,}
flat_typeDict = {'1 ROOM': 1,'2 ROOM': 2,'3 ROOM': 3,'4 ROOM': 4,'5 ROOM': 5,'EXECUTIVE': 6,'MULTI-GENERATION': 7,}


dataframe['town'] = dataframe['town'].replace(townDict, regex=True)
dataframe['flat_type'] = dataframe['flat_type'].replace(flat_typeDict, regex=True)

Understanding correlations

I did not want to use all the columns of the tabular data, and instead wanted to find out what affected the prices most.

Side note: Do not remove too many columns as this might cause bias in your predictions.

#correlation matrix
corrmat = dataframe.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);

To get more insight on what correlates most to resale prices, we do the following

#resale_price correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'resale_price')['resale_price'].index
cm = np.corrcoef(dataframe[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

Drop columns

Drop columns with poor correlations. As mentioned previously, we will try not to drop too many columns as this might induce bias

# drop some unnecessary columns
dataframe = dataframe.drop('date',axis=1)
dataframe = dataframe.drop('block',axis=1)
dataframe = dataframe.drop('month',axis=1)
dataframe = dataframe.drop('street_name',axis=1)
dataframe = dataframe.drop('address',axis=1)
dataframe = dataframe.drop('flat_model',axis=1)
dataframe = dataframe.drop('year',axis=1)
dataframe = dataframe.drop('remaining_lease',axis=1)

This is what we will be working with for all 3 models.

townflat_typestorey_rangefloor_area_sqmlease_commence_dateresale_pricelatitudelongitude
0121044.01979232000.01.37008103.849523
1130167.01978250000.01.37008103.849523
2130167.01980262000.01.37008103.849523
3130468.01980265000.01.37008103.849523
4130167.01980265000.01.37008103.849523

Train – test split

Splitting the data to training and test. Also, the y values in the case will be the resale prices.

X = dataframe.drop('resale_price',axis =1).values
y = dataframe['resale_price'].values

#splitting Train and Test 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

Scaling the data

#standardization scaler - fit&transform on train, fit only on test

s_scaler = StandardScaler()
X_train = s_scaler.fit_transform(X_train.astype(np.float))
X_test = s_scaler.transform(X_test.astype(np.float))

After all the data preparation, we will now move into the 3 models we are working with.

Model 1: Linear Regression

# Multiple Liner Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()  
regressor.fit(X_train, y_train)
#evaluate the model (intercept and slope)
print(regressor.intercept_)
print(regressor.coef_)
#predicting the test set result
y_pred = regressor.predict(X_test)
#put results as a DataFrame
coeff_df = pd.DataFrame(regressor.coef_, dataframe.drop('resale_price',axis =1).columns, columns=['Coefficient']) 
#compare actual output values with predicted values
y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
ActualPredicted
0318000.0438507.992301
1770000.0520882.740728
2851800.0538331.202484
3475000.0512533.657286
4320000.0343942.388290
5305000.0360899.266588
6460000.0516238.557580
7470000.0413940.963681
8468000.0662196.628018
9340000.0517207.052462

We will then evaluate the algorithm

# evaluate the performance of the algorithm (MAE - MSE - RMSE)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))  
print('MSE:', metrics.mean_squared_error(y_test, y_pred))  
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('VarScore:',metrics.explained_variance_score(y_test,y_pred))

MAE: 72189.73698345898
MSE: 9413246271.327522
RMSE: 97021.88552758352
VarScore: 0.5966611367993274

As you can see, the RMSE is really high – I wouldn’t want my predictions to be inaccurate by almost a $100,000 when my prices average at about $500,000

Model 2: Keras neural network

# Creating a Neural Network Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation,Dropout
from tensorflow.keras.optimizers import Adam
# having 9 neurons is based on the number of available features
model = Sequential()
model.add(Dense(9,activation='relu'))
model.add(Dense(18,activation='relu'))
model.add(Dense(18,activation='relu'))
model.add(Dense(9,activation='relu'))
model.add(Dense(1))
model.compile(optimizer='Adam',loss='mse')
model.fit(x=X_train,y=y_train,
          validation_data=(X_test,y_test),
          batch_size=128,epochs=400)

model.summary()

Time to predict and evaluate

y_pred = model.predict(X_test)
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))  
print('MSE:', metrics.mean_squared_error(y_test, y_pred))  
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('VarScore:',metrics.explained_variance_score(y_test,y_pred))
# Visualizing Our predictions
fig = plt.figure(figsize=(10,5))
plt.scatter(y_test,y_pred)
# Perfect predictions
plt.plot(y_test,y_test,'r')

MAE: 54379.245910929014
MSE: 5658879298.306035
RMSE: 75225.52291812956
VarScore: 0.7578333837833753

Well an RMSE of 75,000 is definitely better than the simple linear regression. However, I am looking to be as accurate as possible when it comes to predictions, and the neural network model does not seem to fit the bill. Let’s also take a look at the predictions below.

y_pred_2 = []
for pred in y_pred:
  y_pred_2.append(pred[0])

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_2})
df1 = df.head(20)
df1
ActualPredicted
0318000.0390469.937500
1770000.0527135.250000
2851800.0531485.875000
3475000.0489226.625000
4320000.0271884.000000
5305000.0349331.218750
6460000.0436004.437500
7470000.0419963.031250
8468000.0662334.125000
9340000.0428492.031250
10355000.0394001.593750
11308000.0294318.593750
12250000.0275638.250000
13442000.0473855.781250
14228000.0251287.421875
15315000.0353884.187500
16363000.0342053.406250
17395000.0413878.531250
18398000.0386796.562500
19325000.0283567.468750

Not the best.. Therefore looking up online, it seems that I should try K nearest neighbours as it seems to be a better fit for tabular data with inputs such as housing data.

Model 3: K nearest neighbours

We would like to try K nearest neighbours to try to improve our pricing accuracy. We use scikit learn, with reference to this example https://www.dataquest.io/blog/machine-learning-tutorial/

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(algorithm='brute')
knn.fit(X_train,y_train)
predictions = knn.predict(X_test)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, predictions)
rmse = mse ** (1/2)
print(rmse)

Here, we have an RMSE of 46496, which is far better than the previous models.

# Visualizing Our predictions
fig = plt.figure(figsize=(10,5))
plt.scatter(y_test,predictions)
# Perfect predictions
plt.plot(y_test,y_test,'r')
df = pd.DataFrame({'Actual': y_test, 'Predicted': predictions})
df1 = df.head(20)
df1

Actual
Predicted
0318000.0362600.0
1770000.0641600.0
2851800.0734600.0
3475000.0452577.6
4320000.0305600.0
5305000.0303800.0
6460000.0408300.0
7470000.0436000.0
8468000.0476000.0
9340000.0333600.0
10355000.0374977.6
11308000.0311600.0
12250000.0256600.0
13442000.0452600.0
14228000.0224177.6
15315000.0382000.0
16363000.0325600.0
17395000.0420200.0
18398000.0364777.6
19325000.0312000.0

Conclusion

K nearest neighbours seems to have the best prediction with a relatively low RMSE. Neural networks come in second while Linear regression comes in last.

We can definitely improve the models, perhaps doing more with geolocation, getting more data, perhaps renovation level of this houses etc. Once the input data gets better, I believe the K nearest neighbours model will yield a better RMSE.

Stay tuned for my next blogpost – to see how this model went from a jupyter notebook into www.hdbpricer.com

5 thoughts on “How to predict HDB resale prices using 3 different Machine learning models (Linear Regression, Neural Networks, K Nearest Neighbor)

  1. Youre so cool! I dont suppose Ive learn anything like this before. So nice to seek out anyone with some original thoughts on this subject. realy thank you for starting this up. this website is one thing that’s wanted on the net, somebody with a bit of originality. useful job for bringing one thing new to the internet!

    Like

  2. I’d like to thank you for the efforts you’ve put in writing this website. I’m hoping to check out the same high-grade blog posts by you later on as well. In truth, your creative writing abilities has encouraged me to get my own, personal website now 😉

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s