Home Prices in King County, WA


Project description

King County is located in Washington State and includes Seattle and much of its metropolitan area. This project explores several regression methods to find the most accurate prediction of home prices in the area. The data were downloaded from Kaggle.

Some limited information is available on the county website; otherwise, best judgement was used to interpret the meaning of each feature in the dataset.

Home Price and Home Grade by Zipcode

* Maps generated with Tableau

Contents

  1. Exploratory data analysis
  2. Simple model
  3. Model including location
  4. Time-series model
  5. Combining estimators
  6. Conclusion

Next steps

  • Gradient boosting trees perform well here; however, it might be worthwhile to compare XGBoost performance against the scikit-learn gradient-boosting implementation (see the sketch below).
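
A possible shape for that comparison is sketched below. This is only a sketch: it assumes the xgboost package is installed, reuses the dfloc feature set defined in the location-model section, and was not run as part of this notebook.

# sketch: cross-validate the scikit-learn gradient-boosting model against XGBoost
# on the same features (assumes `dfloc` from the location section and xgboost installed)
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X = dfloc.drop('price', axis=1)
y = dfloc['price']

candidates = {
    'sklearn GradientBoostingRegressor': GradientBoostingRegressor(n_estimators=1000, max_depth=4),
    'XGBRegressor': XGBRegressor(n_estimators=1000, max_depth=4, learning_rate=0.1),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, scoring='r2', cv=3)
    print('{0}: mean cv r^2 = {1:.4f}'.format(name, scores.mean()))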
In [120]:
# import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_color_codes()
sns.set(font_scale=1.25)

import joblib
import os, sys
utilpath = os.path.abspath(os.path.join('..', '..', 'mypyutil'))
sys.path.append(utilpath)
# from InspectPandas import inspectdata

import statsmodels.formula.api as smf
import statsmodels.api as sm

from sklearn.linear_model import RidgeCV, LassoCV, LinearRegression
from sklearn.model_selection import train_test_split, TimeSeriesSplit, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.utils import shuffle
from sklearn.neighbors import KNeighborsRegressor

%matplotlib inline

Return to top


Exploratory data analysis

In [16]:
# load and inspect data

# df = inspectdata('kc_house_data.csv.zip')
df = pd.read_csv('kc_house_data.csv.zip', sep=',', compression='zip')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
id               21613 non-null int64
date             21613 non-null object
price            21613 non-null float64
bedrooms         21613 non-null int64
bathrooms        21613 non-null float64
sqft_living      21613 non-null int64
sqft_lot         21613 non-null int64
floors           21613 non-null float64
waterfront       21613 non-null int64
view             21613 non-null int64
condition        21613 non-null int64
grade            21613 non-null int64
sqft_above       21613 non-null int64
sqft_basement    21613 non-null int64
yr_built         21613 non-null int64
yr_renovated     21613 non-null int64
zipcode          21613 non-null int64
lat              21613 non-null float64
long             21613 non-null float64
sqft_living15    21613 non-null int64
sqft_lot15       21613 non-null int64
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB

Note:

Looks like the data provider cleaned up the dataset a bit so there aren't any missing values.

In [172]:
# distribution of prices

plt.figure(figsize=(10,6))
plt.xlabel('Price')
plt.ylabel('Occurrences')
_ = df['price'].hist(bins=100)
In [173]:
# what are the categorical variables?

catvars = ['waterfront', 'view', 'condition', 'zipcode']

for var in catvars:
    print('')
    print(var)
    print(df[var].unique())
waterfront
[0 1]

view
[0 3 4 2 1]

condition
[3 5 4 1 2]

zipcode
[98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 98007 98115
 98107 98126 98019 98103 98002 98133 98040 98092 98030 98119 98112 98052
 98027 98117 98058 98001 98056 98166 98023 98070 98148 98105 98042 98008
 98059 98122 98144 98004 98005 98034 98075 98116 98010 98118 98199 98032
 98045 98102 98077 98108 98168 98177 98065 98029 98006 98109 98022 98033
 98155 98024 98011 98031 98106 98072 98188 98014 98055 98039]

Note:

Based on the possible values, condition and view are likely ordinal variables, but without more information it is difficult to know with certainty.

Waterfront is likely a binary variable indicating whether the property is on the waterfront, and thus does not require encoding or additional processing.

Next, check correlations to see which variables are likely important.
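
If zipcode were treated as a nominal categorical feature, one-hot encoding would be one way to handle it. A minimal sketch is shown below for illustration only; the models later in this notebook drop zipcode in favor of lat/long.

# hypothetical sketch: one-hot encode zipcode as a nominal categorical feature
# (not used by the models below, which rely on lat/long for location)
zip_dummies = pd.get_dummies(df['zipcode'], prefix='zip')
df_onehot = pd.concat([df.drop('zipcode', axis=1), zip_dummies], axis=1)
print(df_onehot.shape)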

In [174]:
# summary statistics for each of the variables

df.describe()
Out[174]:
id price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
count 2.161300e+04 2.161300e+04 21613.000000 21613.000000 21613.000000 2.161300e+04 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000
mean 4.580302e+09 5.400881e+05 3.370842 2.114757 2079.899736 1.510697e+04 1.494309 0.007542 0.234303 3.409430 7.656873 1788.390691 291.509045 1971.005136 84.402258 98077.939805 47.560053 -122.213896 1986.552492 12768.455652
std 2.876566e+09 3.671272e+05 0.930062 0.770163 918.440897 4.142051e+04 0.539989 0.086517 0.766318 0.650743 1.175459 828.090978 442.575043 29.373411 401.679240 53.505026 0.138564 0.140828 685.391304 27304.179631
min 1.000102e+06 7.500000e+04 0.000000 0.000000 290.000000 5.200000e+02 1.000000 0.000000 0.000000 1.000000 1.000000 290.000000 0.000000 1900.000000 0.000000 98001.000000 47.155900 -122.519000 399.000000 651.000000
25% 2.123049e+09 3.219500e+05 3.000000 1.750000 1427.000000 5.040000e+03 1.000000 0.000000 0.000000 3.000000 7.000000 1190.000000 0.000000 1951.000000 0.000000 98033.000000 47.471000 -122.328000 1490.000000 5100.000000
50% 3.904930e+09 4.500000e+05 3.000000 2.250000 1910.000000 7.618000e+03 1.500000 0.000000 0.000000 3.000000 7.000000 1560.000000 0.000000 1975.000000 0.000000 98065.000000 47.571800 -122.230000 1840.000000 7620.000000
75% 7.308900e+09 6.450000e+05 4.000000 2.500000 2550.000000 1.068800e+04 2.000000 0.000000 0.000000 4.000000 8.000000 2210.000000 560.000000 1997.000000 0.000000 98118.000000 47.678000 -122.125000 2360.000000 10083.000000
max 9.900000e+09 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.500000 1.000000 4.000000 5.000000 13.000000 9410.000000 4820.000000 2015.000000 2015.000000 98199.000000 47.777600 -121.315000 6210.000000 871200.000000
In [6]:
# check for extreme outliers or nonsensical values

f,ax = plt.subplots(4, 1, figsize=(10,16))

df[['bedrooms', 'bathrooms', 'floors']].plot(kind='box', vert=False, sym='*', ax=ax[0])

sqfts = [col for col in df.columns if 'sqft' in col and 'lot' not in col]
df[sqfts].plot(kind='box', vert=False, sym='*', ax=ax[1])

df[[col for col in df.columns if 'sqft' in col and 'lot' in col]].plot(kind='box', vert=False, sym='*', ax=ax[2])

df[[col for col in df.columns if 'yr' in col]].plot(kind='box', vert=False, sym='*', ax=ax[3])

ax[0].set_xlabel('Count')
ax[1].set_xlabel('Square Feet')
ax[2].set_xlabel('Square Feet')
ax[3].set_xlabel('Year')
Out[6]:
<matplotlib.text.Text at 0x10e048710>

Note:

Many values for yr_renovated are zero; the most reasonable interpretation is that these houses were never renovated. Perhaps, then, the features ending in 15 give the dimensions of the house in 2015 rather than as originally constructed?

Also, 33 bedrooms seems excessive. Perhaps it's a typo?

In [3]:
# exploring the 33 bedrooms house

df[df['bedrooms']==33]
Out[3]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
15870 2402100895 20140625T000000 640000.0 33 1.75 1620 6000 1.0 0 0 ... 7 1040 580 1947 0 98103 47.6878 -122.331 1330 4700

1 rows × 21 columns

Note:

It would seem to be a typo, with the intended number of bedrooms probably being 3. While this makes sense intuitively, a quick regression model based on some obvious features (price, bathrooms, sqft_living) can check whether a value of 3 is supported.

In [177]:
# fit a quick regression of bedrooms on price, bathrooms and sqft_living,
# excluding the suspect row, then predict bedrooms for that row
lr = LinearRegression()
ss = StandardScaler()
Xstd = ss.fit_transform(df[['price', 'bathrooms', 'sqft_living']])
dfXstd = pd.DataFrame(Xstd, columns=['price', 'bathrooms', 'sqft_living'])
Xstd = dfXstd.drop(15870, axis=0)
y = df['bedrooms'].drop(15870, axis=0)

lr.fit(Xstd, y)
print('Predicted bedrooms for example with 33: {0:.1f}'.format(lr.predict(dfXstd.iloc[15870].values.reshape(1, -1))[0]))
Predicted bedrooms for example with 33: 3.0

Note:

So a simple linear regression model suggests the number of bedrooms for this house should be 3.

In [17]:
# set bedrooms to 3 for mistaken value

df.set_value(15870, 'bedrooms', 3)
df[df.id == 2402100895]
Out[17]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
15870 2402100895 20140625T000000 640000.0 3 1.75 1620 6000 1.0 0 0 ... 7 1040 580 1947 0 98103 47.6878 -122.331 1330 4700

1 rows × 21 columns

In [18]:
# examining the other variables...
# what are the changes over time for living space and lot size for non-renovated houses?

df['livingdiff'] = df['sqft_living15'] - df['sqft_living']
df['lotdiff'] = df['sqft_lot15'] - df['sqft_lot']
df[['livingdiff', 'lotdiff']][df['yr_renovated']==0].describe()
Out[18]:
livingdiff lotdiff
count 20699.000000 2.069900e+04
mean -82.317745 -2.316542e+03
std 585.238849 2.912757e+04
min -8690.000000 -1.225778e+06
25% -290.000000 -6.700000e+02
50% 0.000000 0.000000e+00
75% 220.000000 4.150000e+02
max 2310.000000 3.268790e+05
In [19]:
# differences for renovated houses?

df[['livingdiff', 'lotdiff']][df['yr_renovated']!=0].describe()
Out[19]:
livingdiff lotdiff
count 914.000000 914.000000
mean -343.128009 -2836.050328
std 845.506489 23496.634059
min -8110.000000 -280213.000000
25% -780.000000 -1099.000000
50% -215.000000 0.000000
75% 123.000000 632.000000
max 1960.000000 208216.000000
In [20]:
# do all rows with a non-zero renovation year have a renovation date after the year built?

all(df[df['yr_renovated']!=0].index == df[df['yr_renovated']>df['yr_built']].index)
Out[20]:
True
In [21]:
# how does the renovation year influence price?

df[['yr_renovated','price']][df['yr_renovated']!=0].corr()
Out[21]:
yr_renovated price
yr_renovated 1.000000 0.127894
price 0.127894 1.000000
In [22]:
# what about if a renovation is performed?

df['is_renovated'] = df['yr_renovated'].apply(lambda x: int(x>0))
df[['is_renovated', 'price']].corr()
Out[22]:
is_renovated price
is_renovated 1.000000 0.126092
price 0.126092 1.000000
In [184]:
# is the relationship statistically significant?

dfrenov = df[['yr_renovated', 'price']][df['yr_renovated']!=0]
dfrenov.plot(x='yr_renovated', y='price', kind='scatter')

smf.ols('price ~ yr_renovated', data=dfrenov).fit().summary()
Out[184]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.016
Model:                            OLS   Adj. R-squared:                  0.015
Method:                 Least Squares   F-statistic:                     15.17
Date:                Thu, 06 Oct 2016   Prob (F-statistic):           0.000106
Time:                        15:08:48   Log-Likelihood:                -13461.
No. Observations:                 914   AIC:                         2.693e+04
Df Residuals:                     912   BIC:                         2.694e+04
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                   coef      std err          t      P>|t|   [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept    -9.237e+06     2.57e+06     -3.598      0.000   -1.43e+07  -4.2e+06
yr_renovated  5009.0285     1286.248      3.894      0.000    2484.679  7533.378
==============================================================================
Omnibus:                      880.533   Durbin-Watson:                   2.122
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            50308.616
Skew:                           4.306   Prob(JB):                         0.00
Kurtosis:                      38.311   Cond. No.                     2.57e+05
==============================================================================
In [23]:
print('Number of non-renovated houses:')
print(len(df['yr_renovated'][df['yr_renovated']==0]))
df['yr_renovated'].head()
Number of non-renovated houses:
20699
Out[23]:
0       0
1    1991
2       0
3       0
4       0
Name: yr_renovated, dtype: int64
In [24]:
# create new yr_renovated feature where zeros are replaced with the year house was built

def yrreplace(x):
    if x==0:
        return np.nan
    else:
        return x

df['yr_renovated'] = df['yr_renovated'].apply(yrreplace)
df['yr_renovated'].fillna(df['yr_built'], inplace=True)

df['yr_renovated'].head()
Out[24]:
0    1955.0
1    1991.0
2    1933.0
3    1965.0
4    1987.0
Name: yr_renovated, dtype: float64
In [27]:
df[['yr_built', 'yr_renovated', 'is_renovated', 'price']].corr()[['price']]
Out[27]:
price
yr_built 0.054012
yr_renovated 0.105755
is_renovated 0.126092
price 1.000000

Note:

  1. So the dataset is consistent in that non-zero yr_renovated values have a renovation date after the build date.
  2. The living and lot size can increase or decrease with or without a renovation.
  3. After replacing zeros in yr_renovated with yr_built, the correlation with price is actually slightly reduced. Additionally, whether or not a renovation was performed has a correlation of slightly higher magnitude with price.
  4. There seems to be little correlation between the year the house was built and price.
In [28]:
# save progress
df.to_csv('modified01.csv.gz', sep=',', index=False, compression='gzip')

Return to top


Model with little data processing

Does not include time or location features.

In [124]:
# define a few useful functions

def splitbootstrap(df, yfeature, testsize=0.3, bootstrapsize=0.8):
    '''
    split dataframe into training and bootstrapped test data
    procedure:
    1. split dataset into train, test
    2. fit and scale train data (including binary variables)
    3. bootstrap and scale test data
    '''

    # split dataframe into two dataframes
    xytrain, xytest = train_test_split(df, test_size=testsize, random_state=1)
    
    # set scale and transform training data
    ss = StandardScaler()
    Xtrain = ss.fit_transform(xytrain.drop(yfeature, axis=1, inplace=False))
    ytrain = xytrain[yfeature]
    
    # bootstrap test data and scale
    xytestboot = xytest.sample(axis=0, frac=bootstrapsize, replace=False)
    Xtest = ss.transform(xytestboot.drop([yfeature], axis=1, inplace=False))
    ytest = xytestboot[yfeature]
    
    return (Xtrain, Xtest, ytrain, ytest)

def fitpredict(est, xytuple):
    '''
    given an estimator and a tuple (Xtrain, Xtest, ytrain, ytest), fit the estimator,
    report any available fit diagnostics and the test-set score, and return the fit estimator.
    '''
    # fit estimator
    est.fit(xytuple[0], xytuple[2])
    # report fit diagnostics where available (best regularization alpha, grid-search results)
    try: 
        print('best alpha:', est.alpha_)
    except AttributeError:
        pass
    try:
        print(est.best_estimator_)
        griddf = pd.DataFrame(est.cv_results_)
        print()
        print('training score: {0:.4f}'.format(griddf['mean_test_score'].max()))
    except AttributeError:
        pass
    
    print()
    print('test score: {0:.4f}'.format(r2_score(xytuple[3], est.predict(xytuple[1]))))
    print()

    return est
In [9]:
# load saved data

df = pd.read_csv('modified01.csv.gz', sep=',', compression='gzip')
In [10]:
# check for correlation with price for appropriate variables
# drop unexamined variables and any that we can't interpret

exclvars = ['id', 'zipcode', 'lat', 'long', 'date', 'livingdiff', 'lotdiff', 'yr_renovated']

dfcont = df.drop(exclvars, axis=1)

dfcont.corr()[['price']].sort_values(by='price', ascending=False)
Out[10]:
price
price 1.000000
sqft_living 0.702035
grade 0.667434
sqft_above 0.605567
sqft_living15 0.585379
bathrooms 0.525138
view 0.397293
sqft_basement 0.323816
bedrooms 0.315438
waterfront 0.266369
floors 0.256794
is_renovated 0.126092
sqft_lot 0.089661
sqft_lot15 0.082447
yr_built 0.054012
condition 0.036362
In [30]:
# overall correlation

f, ax = plt.subplots(figsize=(8,8))
_ = sns.heatmap(dfcont.corr(), vmax=0.9, square=True)

Note

Regularization will be used to address potential collinearity.
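
As a quick way to quantify that collinearity (a sketch only, not part of the original run), variance inflation factors can be computed with statsmodels. The square-footage features are closely related, so very large values are expected for those columns.

# sketch: variance inflation factors for the predictors in dfcont
# (values much greater than ~10 indicate strong collinearity)
from statsmodels.stats.outliers_influence import variance_inflation_factor

Xv = sm.add_constant(dfcont.drop('price', axis=1))
vif = pd.Series([variance_inflation_factor(Xv.values, i) for i in range(Xv.shape[1])],
                index=Xv.columns)
print(vif.drop('const').sort_values(ascending=False))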

In [69]:
# helper to fit a cross-validated linear model:
# 1. split dataset into train, test
# 2. fit and scale train data (toggle the commented lines below to leave the
#    binary variables waterfront and is_renovated unscaled)
# 3. fit model to train data
# 4. bootstrap test set and evaluate

def linearmodelcv(df, estobj):
    '''
    fit linear model of given estimator.
    procedure:
    1. split dataset into train, test
    2. fit and scale train data (including binary variables: waterfront and is_renovated)
    3. fit model to train data
    4. bootstrap test set and evaluate
    '''
    # instantiate estimator
    est = estobj
    # split dataframe into two dataframes
    xytrain, xytest = train_test_split(df, test_size=0.3, random_state=1)
    # set scale and transform training data
    ss = StandardScaler()
#     xscale = ss.fit_transform(xytrain.drop(['price', 'is_renovated', 'waterfront'], axis=1, inplace=False))
#     Xtrain = np.hstack((ss.fit_transform(xytrain.drop(['price', 'is_renovated', 'waterfront'], axis=1, inplace=False)), 
#                         xytrain[['is_renovated', 'waterfront']].values))
    Xtrain = ss.fit_transform(xytrain.drop(['price'], axis=1, inplace=False))
    ytrain = xytrain['price']
    # bootstrap test data and scale
    xytestboot = xytest.sample(axis=0, frac=0.8, replace=False)
#     Xtest = np.hstack((ss.transform(xytestboot.drop(['price', 'is_renovated', 'waterfront'], axis=1, inplace=False)), 
#                        xytestboot[['is_renovated', 'waterfront']].values))
    Xtest = ss.transform(xytestboot.drop(['price'], axis=1, inplace=False))
    ytest = xytestboot['price']
    # fit estimator
    est.fit(Xtrain, ytrain)
    # predict estimator
    print('best alpha:', est.alpha_)
    print()
    print('r^2 score: {0:.3f}'.format(r2_score(ytest, est.predict(Xtest))))
    print()
    print('coefficients:')
    print(pd.Series(est.coef_, index=xytest.columns[1:]).sort_values(ascending=False))
    return est
In [70]:
# with scaling binary features

rcv = linearmodelcv(dfcont, RidgeCV(alphas=np.logspace(-1,8,10), cv=3))
best alpha: 10.0

r^2 score: 0.655

coefficients:
grade            137918.087742
sqft_living       73410.411898
sqft_above        61948.305558
waterfront        48955.864191
sqft_basement     35596.527797
bathrooms         33923.962961
view              31459.742966
sqft_living15     16261.293363
floors            15950.778050
condition         12078.714883
is_renovated       5182.603157
sqft_lot           -921.393602
sqft_lot15       -13879.603068
bedrooms         -35928.913187
yr_built        -103301.234648
dtype: float64
In [64]:
# without scaling binary features

rcv = linearmodelcv(dfcont, RidgeCV(alphas=np.logspace(-1,8,10), cv=3))
best alpha: 0.1

r^2 score: 0.648

coefficients:
waterfront       560985.732319
grade            138126.776568
sqft_living       73436.120825
sqft_above        61977.136651
sqft_basement     35595.748747
bathrooms         33959.290739
view              31453.063820
is_renovated      25617.506324
sqft_living15     16152.411318
floors            15942.987107
condition         12055.722750
sqft_lot           -921.532118
sqft_lot15       -13891.669246
bedrooms         -36002.789327
yr_built        -103466.730930
dtype: float64
In [71]:
# with scaling binary features

lcv = linearmodelcv(dfcont, LassoCV(n_alphas=100, cv=3))
best alpha: 247.310661131

r^2 score: 0.657

coefficients:
sqft_living      141771.852208
grade            138117.041959
waterfront        48841.596995
bathrooms         33582.572178
view              31466.669153
sqft_living15     15871.978876
floors            15531.485556
condition         11815.423837
is_renovated       5036.471655
sqft_basement      1982.419634
sqft_above            0.000000
sqft_lot           -766.185212
sqft_lot15       -13697.570920
bedrooms         -35293.658801
yr_built        -102957.186604
dtype: float64
In [65]:
# without scaling binary features

lcv = linearmodelcv(dfcont, LassoCV(n_alphas=100, cv=3))
best alpha: 247.310661131

r^2 score: 0.656

coefficients:
waterfront       524828.704233
sqft_living      142139.586348
grade            137998.392392
bathrooms         33821.924074
view              32787.559220
is_renovated      20335.589055
sqft_living15     15670.076951
floors            15614.243338
condition         11692.241173
sqft_basement      1816.129750
sqft_above            0.000000
sqft_lot           -822.978961
sqft_lot15       -13659.185773
bedrooms         -35539.296847
yr_built        -103376.380754
dtype: float64
In [99]:
# run random forest regressor using fitpredict and splitbootstrap functions

params = {'n_estimators': [500,1000], 'max_depth': [16,32]}

gridrf = fitpredict(GridSearchCV(RandomForestRegressor(n_jobs=-1), params, scoring='r2', n_jobs=1), 
                    splitbootstrap(dfcont, 'price'))
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=32,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=1000, n_jobs=-1, oob_score=False,
           random_state=None, verbose=0, warm_start=False)

training score: 0.733

test score: 0.761

In [107]:
# support vector machine regressor

params = {'C': np.logspace(0,7,8)}

gridsvm = fitpredict(GridSearchCV(SVR(kernel='rbf'), params, scoring='r2', n_jobs=-1), 
                     splitbootstrap(dfcont, 'price'))
SVR(C=1000000.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
  gamma='auto', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001,
  verbose=False)

training score: 0.6888

test score: 0.6181

Result

The coefficient of determination ($r^2$) has a maximum of 1 (and can be negative for models that do worse than predicting the mean). The highest score here was achieved with the random forest, $r^2 = 0.761$, which suggests some room for improvement.
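
For a more interpretable error measure, the test RMSE can be expressed in dollars. A small sketch reusing the split helper and the fitted gridrf from above; since the test split is resampled, the number will vary slightly between runs.

# sketch: test-set RMSE in dollars for the random forest grid search above
Xtrain, Xtest, ytrain, ytest = splitbootstrap(dfcont, 'price')
rmse = np.sqrt(mean_squared_error(ytest, gridrf.predict(Xtest)))
print('test RMSE: ${0:,.0f}'.format(rmse))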

Return to top


Model including sale location

Previously, random forest regression performed best, so we will continue with tree-based regressors and see if including location improves model performance.

In [3]:
# load saved data

df = pd.read_csv('modified01.csv.gz', sep=',', compression='gzip')
In [101]:
# keep lat/long for location since it is more granular than zip code

exclvars = ['id', 'zipcode', 'date', 'livingdiff', 'lotdiff']

dfloc = df.drop(exclvars, axis=1)
dfloc.columns
Out[101]:
Index(['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long',
       'sqft_living15', 'sqft_lot15', 'is_renovated'],
      dtype='object')
In [111]:
# fit random forest regressor

params = {'n_estimators': [1000,2000], 'max_depth': [8,16,32], 'max_features':['sqrt', 0.5]}
 
gridrfloc = fitpredict(GridSearchCV(RandomForestRegressor(n_jobs=-1), params, scoring='r2', n_jobs=1), 
                       splitbootstrap(dfloc, 'price'))
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=32,
           max_features=0.5, max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

training score: 0.8694

test score: 0.8622

In [112]:
# random forest cv results

griddf = pd.DataFrame(gridrfloc.cv_results_)
for row in griddf[['params','mean_test_score']].values:
    print(row[0], '  ', '{0:.4f}'.format(row[1]))
{'max_depth': 8, 'n_estimators': 1000, 'max_features': 'sqrt'}    0.8190
{'max_depth': 8, 'n_estimators': 2000, 'max_features': 'sqrt'}    0.8192
{'max_depth': 8, 'n_estimators': 1000, 'max_features': 0.5}    0.8401
{'max_depth': 8, 'n_estimators': 2000, 'max_features': 0.5}    0.8401
{'max_depth': 16, 'n_estimators': 1000, 'max_features': 'sqrt'}    0.8553
{'max_depth': 16, 'n_estimators': 2000, 'max_features': 'sqrt'}    0.8559
{'max_depth': 16, 'n_estimators': 1000, 'max_features': 0.5}    0.8679
{'max_depth': 16, 'n_estimators': 2000, 'max_features': 0.5}    0.8681
{'max_depth': 32, 'n_estimators': 1000, 'max_features': 'sqrt'}    0.8562
{'max_depth': 32, 'n_estimators': 2000, 'max_features': 'sqrt'}    0.8571
{'max_depth': 32, 'n_estimators': 1000, 'max_features': 0.5}    0.8694
{'max_depth': 32, 'n_estimators': 2000, 'max_features': 0.5}    0.8686
In [113]:
# random forest feature importance

featimp = pd.Series(gridrfloc.best_estimator_.feature_importances_, index=dfloc.drop('price', axis=1).columns)
plt.figure(figsize=(12,8))
plt.title('Feature Importance')
_ = featimp.sort_values(ascending=True).plot(kind='barh')
In [114]:
# fit gradient boosting trees

params = {'n_estimators': [1000,2000], 'max_depth': [2,4,8]}
 
gridgbrloc = fitpredict(GridSearchCV(GradientBoostingRegressor(), params, scoring='r2', n_jobs=-1), 
                       splitbootstrap(dfloc, 'price'))
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=4, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=1000,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False)

training score: 0.8864

test score: 0.8940

In [115]:
# gradient boosting trees feature importance

featimp = pd.Series(gridgbrloc.best_estimator_.feature_importances_, index=dfloc.drop('price', axis=1).columns)
plt.figure(figsize=(12,8))
plt.title('Feature Importance')
_ = featimp.sort_values(ascending=True).plot(kind='barh')

Result

Interestingly, location is by far the most important feature, so its inclusion is critical. The gradient-boosted model also performs better than the random forest used previously.
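
Impurity-based importances from tree ensembles can overstate continuous, high-cardinality features, so a permutation-importance cross-check would be a reasonable follow-up. A sketch (requires scikit-learn >= 0.22, newer than the version used for the outputs above):

# sketch: permutation importance of the best gradient-boosting model on the test split
from sklearn.inspection import permutation_importance

Xtrain, Xtest, ytrain, ytest = splitbootstrap(dfloc, 'price')
perm = permutation_importance(gridgbrloc.best_estimator_, Xtest, ytest,
                              scoring='r2', n_repeats=5, random_state=42)
permimp = pd.Series(perm.importances_mean, index=dfloc.drop('price', axis=1).columns)
print(permimp.sort_values(ascending=False))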

Return to top


Modeled as a time series

Determine range of time spanned by data and appropriate granularity (day, week, month, year).

In [57]:
# extract year and month from date

df['saleyear'] = pd.to_datetime(df['date']).dt.year
df['salemonth'] = pd.to_datetime(df['date']).dt.month

df.head()
Out[57]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... zipcode lat long sqft_living15 sqft_lot15 livingdiff lotdiff yr_renovated_new saleyear salemonth
0 7129300520 20141013T000000 221900.0 3 1.00 1180 5650 1.0 0 0 ... 98178 47.5112 -122.257 1340 5650 160 0 1995.827133 2014 10
1 6414100192 20141209T000000 538000.0 3 2.25 2570 7242 2.0 0 0 ... 98125 47.7210 -122.319 1690 7639 -880 397 1991.000000 2014 12
2 5631500400 20150225T000000 180000.0 2 1.00 770 10000 1.0 0 0 ... 98028 47.7379 -122.233 2720 8062 1950 -1938 1995.827133 2015 2
3 2487200875 20141209T000000 604000.0 4 3.00 1960 5000 1.0 0 0 ... 98136 47.5208 -122.393 1360 5000 -600 0 1995.827133 2014 12
4 1954400510 20150218T000000 510000.0 3 2.00 1680 8080 1.0 0 0 ... 98074 47.6168 -122.045 1800 7503 120 -577 1995.827133 2015 2

5 rows × 26 columns

In [58]:
# correlation between saleyear, salemonth and price?

df[['saleyear', 'salemonth', 'price']].corr().price
Out[58]:
saleyear     0.003576
salemonth   -0.010081
price        1.000000
Name: price, dtype: float64
In [59]:
# number of 2014 and 2015 house sales

print('2014 records:', len(df[df['saleyear']==2014]))
print('2015 records:', len(df[df['saleyear']==2015]))
2014 records: 14633
2015 records: 6980
In [60]:
# what range of dates does the data cover? (are all months present for each year?)

print(pd.to_datetime(df['date']).min())
print(pd.to_datetime(df['date']).max())
2014-05-02 00:00:00
2015-05-27 00:00:00

Note:

The data span only about one year (May 2014 through May 2015), so yearly time groups don't make sense; test day, week, and month granularity instead.
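
Before building time-indexed models, a quick look at seasonality can help; a sketch using the salemonth column added above:

# sketch: median sale price by calendar month, to eyeball seasonality
monthly = df.groupby('salemonth')['price'].median()
plt.figure(figsize=(10, 6))
monthly.plot(kind='bar')
plt.xlabel('Sale month')
plt.ylabel('Median sale price')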

In [151]:
# load data for model construction

df = pd.read_csv('modified01.csv.gz', sep=',', compression='gzip')
df.head()
Out[151]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... yr_renovated zipcode lat long sqft_living15 sqft_lot15 livingdiff lotdiff is_renovated yr_renovated_new
0 7129300520 20141013T000000 221900.0 3 1.00 1180 5650 1.0 0 0 ... 0 98178 47.5112 -122.257 1340 5650 160 0 0 1995.827133
1 6414100192 20141209T000000 538000.0 3 2.25 2570 7242 2.0 0 0 ... 1991 98125 47.7210 -122.319 1690 7639 -880 397 1 1991.000000
2 5631500400 20150225T000000 180000.0 2 1.00 770 10000 1.0 0 0 ... 0 98028 47.7379 -122.233 2720 8062 1950 -1938 0 1995.827133
3 2487200875 20141209T000000 604000.0 4 3.00 1960 5000 1.0 0 0 ... 0 98136 47.5208 -122.393 1360 5000 -600 0 0 1995.827133
4 1954400510 20150218T000000 510000.0 3 2.00 1680 8080 1.0 0 0 ... 0 98074 47.6168 -122.045 1800 7503 120 -577 0 1995.827133

5 rows × 25 columns

In [9]:
# first convert date to proper date

df['date'] = pd.to_datetime(df['date'])
df.head()
Out[9]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15 livingdiff lotdiff is_renovated
0 7129300520 2014-10-13 221900.0 3 1.00 1180 5650 1.0 0 0 ... 1955 1955.0 98178 47.5112 -122.257 1340 5650 160 0 0
1 6414100192 2014-12-09 538000.0 3 2.25 2570 7242 2.0 0 0 ... 1951 1991.0 98125 47.7210 -122.319 1690 7639 -880 397 1
2 5631500400 2015-02-25 180000.0 2 1.00 770 10000 1.0 0 0 ... 1933 1933.0 98028 47.7379 -122.233 2720 8062 1950 -1938 0
3 2487200875 2014-12-09 604000.0 4 3.00 1960 5000 1.0 0 0 ... 1965 1965.0 98136 47.5208 -122.393 1360 5000 -600 0 0
4 1954400510 2015-02-18 510000.0 3 2.00 1680 8080 1.0 0 0 ... 1987 1987.0 98074 47.6168 -122.045 1800 7503 120 -577 0

5 rows × 24 columns

In [10]:
# function to generate proper time stamps based on desired granularity


# find earliest time points and save for proper time window
df2014 = df[df['date'].dt.year==2014]
firstmonth = df2014['date'].dt.month.min()
firstweek = df2014['date'][df2014['date']<pd.to_datetime('2014-12-29')].dt.weekofyear.min()
firstday = df2014['date'].dt.dayofyear.min()

print(firstmonth, firstweek, firstday)

# all of these quantities (day, week, month) are 1-indexed
def ordinalconv(row, timeframe):
    if timeframe == 'month':
        first = firstmonth
        last = 12
        currow = row.month
    elif timeframe == 'week':
        first = firstweek
        last = 52
        currow = row.weekofyear
    elif timeframe == 'day':
        first = firstday
        last = 365
        currow = row.dayofyear
    else:
        return np.nan
    # number of elapsed periods since the first observed period in 2014
    return currow + (row.year - 2014)*last - first
5 18 122
In [12]:
# compute the performance for each time delta and return best

# first sort values to preserve date order
df.sort_values(by='date', ascending=True, inplace=True)

# add appropriate time stamp
for timedelta in ['month', 'week', 'day']:
    df['timedelta'] = df['date'].apply(ordinalconv, args=(timedelta,))
    
    # compute model
    trainmask = df['date'].dt.year==2014
    testmask = df['date'].dt.year==2015
        
    exclvars = ['id', 'zipcode', 'date', 'livingdiff', 'lotdiff', 'yr_renovated']
    dftime = df.drop(exclvars, axis=1, inplace=False)
    
    y = dftime['price']
    Xraw = dftime.drop(['price'], axis=1)

    ss = StandardScaler()
    X = pd.DataFrame(ss.fit_transform(Xraw), index=Xraw.index)
    
    Xtrain = X[trainmask]
    Xtest = X[testmask]
    ytrain = y[trainmask]
    ytest = y[testmask]
    
    print('')
    print(timedelta)
    
    gbr = GradientBoostingRegressor(n_estimators=1000, max_depth=4, random_state=42)
#     params = {'learning_rate': [1,0.1,0.01]}

#     gridgbrtime = GridSearchCV(gbr, params, scoring='r2', cv=TimeSeriesSplit(n_splits=3), n_jobs=-1)
#     gridgbrtime.fit(Xtrain, ytrain)
    gbr.fit(Xtrain,ytrain)

    # display results

#     print(gridgbrtime.best_params_)
#     gridgbrtimedf = pd.DataFrame(gridgbrtime.cv_results_)
#     print('training set r^2 score = {0:.4f}'.format(gridgbrtimedf['mean_test_score'].max()))
    print('test set r^2 score = {0:.4f}'.format(r2_score(ytest, gbr.predict(Xtest))))
month
test set r^2 score = 0.8788

week
test set r^2 score = 0.8746

day
test set r^2 score = 0.8781
In [14]:
# feature importance

featimp = pd.Series(gbr.feature_importances_, index=Xraw.columns)
plt.figure(figsize=(12,8))
plt.title('Feature Importance')
_ = featimp.sort_values(ascending=True).plot(kind='barh')

Result

There is actually a slight decrease in model performance when considering the time evolution of this data. This could be a result of slight overfitting.
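
One way to probe that overfitting, sketched below, is to cross-validate the boosted model with time-ordered folds using the TimeSeriesSplit imported earlier, reusing the date-sorted Xtrain/ytrain from the loop above. This was not part of the original run.

# sketch: time-ordered cross-validation of the boosted model on the 2014 training data
from sklearn.model_selection import cross_val_score

tscv = TimeSeriesSplit(n_splits=3)
cvscores = cross_val_score(GradientBoostingRegressor(n_estimators=1000, max_depth=4, random_state=42),
                           Xtrain, ytrain, scoring='r2', cv=tscv)
print('time-ordered cv r^2 per fold:', np.round(cvscores, 4))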

Return to top


Combination of predictions

Although the gradient boosting model seems to generalize well (its test score is close to its training score), it may be possible to improve performance by combining the outputs of several models.
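
A more direct alternative to the manual combination below would be scikit-learn's StackingRegressor (available in releases newer than the one used here, >= 0.22), which fits the meta-learner on out-of-fold predictions rather than in-sample ones. A sketch with hyperparameters borrowed from the grid searches above:

# sketch: stacking ensemble with a linear meta-learner fit on out-of-fold predictions
from sklearn.ensemble import StackingRegressor

stack = StackingRegressor(
    estimators=[
        ('rf', RandomForestRegressor(n_estimators=1000, max_depth=32, max_features=0.5, n_jobs=-1)),
        ('gbr', GradientBoostingRegressor(n_estimators=1000, max_depth=4)),
        ('knn', KNeighborsRegressor(n_neighbors=17, weights='distance')),
    ],
    final_estimator=LinearRegression(),
    cv=3)

Xtrain, Xtest, ytrain, ytest = splitbootstrap(dfloc, 'price')
stack.fit(Xtrain, ytrain)
print('stacked test r^2: {0:.4f}'.format(r2_score(ytest, stack.predict(Xtest))))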

In [5]:
# load dataset

df = pd.read_csv('modified01.csv.gz', sep=',', compression='gzip')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 25 columns):
id                  21613 non-null int64
date                21613 non-null object
price               21613 non-null float64
bedrooms            21613 non-null int64
bathrooms           21613 non-null float64
sqft_living         21613 non-null int64
sqft_lot            21613 non-null int64
floors              21613 non-null float64
waterfront          21613 non-null int64
view                21613 non-null int64
condition           21613 non-null int64
grade               21613 non-null int64
sqft_above          21613 non-null int64
sqft_basement       21613 non-null int64
yr_built            21613 non-null int64
yr_renovated        21613 non-null int64
zipcode             21613 non-null int64
lat                 21613 non-null float64
long                21613 non-null float64
sqft_living15       21613 non-null int64
sqft_lot15          21613 non-null int64
livingdiff          21613 non-null int64
lotdiff             21613 non-null int64
is_renovated        21613 non-null int64
yr_renovated_new    21613 non-null float64
dtypes: float64(6), int64(18), object(1)
memory usage: 4.1+ MB
In [116]:
# support vector machine regressor
params = {'C': np.logspace(0,7,8)}
gridsvmloc = fitpredict(GridSearchCV(SVR(kernel='rbf'), params, scoring='r2', n_jobs=-1), 
                     splitbootstrap(dfloc, 'price'))
SVR(C=1000000.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
  gamma='auto', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001,
  verbose=False)

training score: 0.7873

test score: 0.7493

In [121]:
# k nearest neighbors regressor
params = {'n_neighbors':np.arange(2,52,5)}
gridknr = fitpredict(GridSearchCV(KNeighborsRegressor(weights='distance', n_jobs=-1), params, scoring='r2', n_jobs=1), 
                     splitbootstrap(dfloc[['lat','long','price']], 'price'))
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=-1, n_neighbors=17, p=2,
          weights='distance')

training score: 0.5883

test score: 0.5478

In [133]:
# combine outputs and feed into linear regression

Xtrain, Xtest, ytrain, ytest = splitbootstrap(dfloc, 'price')
lr = LinearRegression(n_jobs=-1)
lr.fit(np.hstack((gridrfloc.predict(Xtrain).reshape(-1,1), gridgbrloc.predict(Xtrain).reshape(-1,1), 
                  gridsvmloc.predict(Xtrain).reshape(-1,1), gridknr.predict(Xtrain[:,13:15]).reshape(-1,1))),ytrain)
Xtestlr = np.hstack((gridrfloc.predict(Xtest).reshape(-1,1), gridgbrloc.predict(Xtest).reshape(-1,1), 
                  gridsvmloc.predict(Xtest).reshape(-1,1), gridknr.predict(Xtest[:,13:15]).reshape(-1,1)))
print()
print('linear combination test score = {0:.4f}'.format(r2_score(ytest, lr.predict(Xtestlr))))
linear combination test score = 0.6409
In [134]:
# coefficients
print(lr.coef_, lr.intercept_)
[ 0.14843331  0.02525447 -0.03291231  0.86231419] -2021.31174303
In [21]:
# factor in time dependence

# convert date to datetime and sort values to preserve date order
df['date'] = pd.to_datetime(df['date'])
df.sort_values(by='date', ascending=True, inplace=True)

trainmask = df['date'].dt.year==2014
testmask = df['date'].dt.year==2015

X = pd.DataFrame(ss.fit_transform(Xraw), index=Xraw.index)

Xtrain = X.loc[trainmask]
Xtest = X.loc[testmask]
ytrain = y.loc[trainmask]
ytest = y.loc[testmask]

df['timedelta'] = df['date'].apply(ordinalconv, args=('month',))


for name,est in zip(['Random Forest', 'Gradient Boosting', 'SVM', 'Linear Combination'],[rf,gbr,svm,lr]):
    print(name)
    if est==lr:
        est.fit(np.hstack((rf.predict(Xtrain).reshape(-1,1), 
                           gbr.predict(Xtrain).reshape(-1,1), 
                           svm.predict(Xtrain).reshape(-1,1))), ytrain)
        Xtest3 = np.hstack((rf.predict(Xtest).reshape(-1,1), 
                           gbr.predict(Xtest).reshape(-1,1), 
                           svm.predict(Xtest).reshape(-1,1)))
    else:
        est.fit(Xtrain, ytrain)
        Xtest3 = Xtest
    ypred = est.predict(Xtest3)
    print('test set r^2 score = {0:.4f}'.format(r2_score(ytest, ypred)))
    print()
Random Forest
test set r^2 score = 0.8615

Gradient Boosting
test set r^2 score = 0.8738

SVM
test set r^2 score = 0.7755

Linear Combination
test set r^2 score = 0.8774

Result

The linear combination of random forest, gradient boosting trees, support vector machine, and k-nearest-neighbors predictions did not improve test-set performance; it actually scored well below gradient boosting alone. A similar linear combination gave only a marginal improvement over the time-dependent gradient-boosting model.

Return to top


Conclusion

The best performing model (gradient boosting trees, $r^2 = 0.89$) includes a variety of physical descriptors of the house, lot, and local neighborhood along with the actual geographic location (latitude/longitude). Because the data cover such a narrow time range, time evolution driven by demand or price inflation doesn't seem to have much influence on the model's predictive performance.

In [136]:
# best performing model

output = 'price'

print('Best performing model:')
print(gridgbrloc.best_estimator_)
print()
print('Features used:', dfloc.drop(output, axis=1).columns.tolist())
print()
print('Predicted feature:', output)
Best performing model:
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=4, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=1000,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False)

Features used: ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'is_renovated']

Predicted feature: price
In [137]:
# pickle and save best performing estimator

joblib.dump(gridgbrloc.best_estimator_, 'gbr.pkl')
Out[137]:
['gbr.pkl']
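
The pickled estimator can later be reloaded with joblib and used directly for prediction; a brief usage sketch:

# reload the saved model and score it on the (scaled) bootstrapped test split
gbrloaded = joblib.load('gbr.pkl')
Xtrain, Xtest, ytrain, ytest = splitbootstrap(dfloc, 'price')
print('reloaded model test r^2: {0:.4f}'.format(r2_score(ytest, gbrloaded.predict(Xtest))))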