Expedia Hotel Recommendations

https://www.kaggle.com/c/expedia-hotel-recommendations

Objective:

Generate personalized booking recommendations. For example, is the user booking a dream vacation, a weekend getaway, a business trip, or another travel type? In the absence of sufficient user data, use historical customer data to identify the type of destination the customer is seeking and present the appropriate recommendations.

In this project, the information to return to the user is the hotel cluster, a classification system developed by Expedia that takes into account historical price, customer star ratings, geographical location relative to the city center, and other factors.

Contents:

  1. Exploratory analysis
  2. Machine learning approach using random forest
  3. Refined random forest model
  4. Make recommendations directly, based on several search parameters

Exploratory Analysis


Import useful libraries

In [1]:
import pandas as pd
import numpy as np
from scipy import sparse
import random
import re
import multiprocessing
from collections import defaultdict, OrderedDict

from sklearn import preprocessing
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import log_loss, average_precision_score
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

import ml_metrics
import joblib

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Load random subset of data

Due to the large dataset and hardware limitations, a random subset (0.5%) of data is loaded to identify trends among features.

In [8]:
# Read the large training file in chunks
# Generate random row numbers for 1/200 of each chunk
# Create the dataframe from the first chunk, then append the remaining samples

dataurl = '/Users/dbricare/Documents/Python/datasets/expedia/'

reader = pd.read_csv(dataurl+'train.csv.gz', sep=',', compression='gzip', chunksize=100000)
first = True
for chunk in reader:
    randsamp = random.sample(range(len(chunk)),len(chunk)//200)
    if first:
        df = chunk.iloc[randsamp]
        first = False
    else:
        df = df.append(chunk.iloc[randsamp], ignore_index=True)
In [10]:
# save data to csv for reloading

df.to_csv(dataurl+'train200th.csv', sep=',', encoding='utf-8', index=False)

Relationships among features

Look for correlations among input variables (not including date columns).

In [213]:
cols = list(df.columns)
print(len(cols))
print(cols)
24
['date_time', 'site_name', 'posa_continent', 'user_location_country', 'user_location_region', 'user_location_city', 'orig_destination_distance', 'user_id', 'is_mobile', 'is_package', 'channel', 'srch_ci', 'srch_co', 'srch_adults_cnt', 'srch_children_cnt', 'srch_rm_cnt', 'srch_destination_id', 'srch_destination_type_id', 'is_booking', 'cnt', 'hotel_continent', 'hotel_country', 'hotel_market', 'hotel_cluster']
In [214]:
def dropcol(subcols, drops):
    for drop in drops:
        if drop in subcols:
            subcols.remove(drop)
    return subcols

subcols = dropcol(cols, ['date_time', 'srch_ci', 'srch_co'])
print(len(subcols))
print(subcols)
21
['site_name', 'posa_continent', 'user_location_country', 'user_location_region', 'user_location_city', 'orig_destination_distance', 'user_id', 'is_mobile', 'is_package', 'channel', 'srch_adults_cnt', 'srch_children_cnt', 'srch_rm_cnt', 'srch_destination_id', 'srch_destination_type_id', 'is_booking', 'cnt', 'hotel_continent', 'hotel_country', 'hotel_market', 'hotel_cluster']
In [177]:
corr = df[subcols].corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

sns.set(font_scale=1.25)
f,ax = plt.subplots(1,1,figsize=(12,12))
_ = sns.heatmap(corr, mask=mask, vmax=0.5, square=True, ax=ax)

Strong correlations are apparent between a few pairs of variables, suggesting that one member of each pair can be excluded from the model.

In [116]:
# Pairs correlated stronger than +/- 0.4

pairs = defaultdict(list)
for col in corr.columns:
    curr = list(corr[col][(abs(corr[col])>=0.4) & (corr[col]!=1)].index)
    if len(curr)>0 and not any([col in v for v in pairs.values()]):
        pairs[col]=curr
pairs
Out[116]:
defaultdict(list,
            {'orig_destination_distance': ['hotel_continent'],
             'site_name': ['posa_continent'],
             'srch_adults_cnt': ['srch_rm_cnt'],
             'srch_destination_id': ['srch_destination_type_id']})

Feature selection

To select which variable of a correlated pair to exclude, consider two quantities:

  1. Correlation with the output
  2. Feature variance

Greater variance and a stronger correlation with the output give a feature more discriminating power.
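
Concretely, the score computed for each feature $f$ of a correlated pair (as implemented in the featureselect function below) is

$\mathrm{score}(f) = \dfrac{\mathrm{var}(f)}{\sum_{g} \mathrm{var}(g)} + \dfrac{|\mathrm{corr}(f, y)|}{\sum_{g} |\mathrm{corr}(g, y)|}$

where the variance is computed on min-max scaled values, the sums run over the features in the pair, and the lower-scoring feature is dropped.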

However, for the correlated pair orig_destination_distance and hotel_continent, the distance metric will be kept. The correlation most likely reflects that customers from certain continents tend to book hotels on particular other continents, and discarding the distance feature would lose considerable information about customers who do not travel to foreign countries or continents.

In [143]:
# Check correlation of each feature with output

sercorrout = corr['hotel_cluster'].abs()
sercorrout.drop('hotel_cluster', inplace=True)
sercorrout.sort_values(inplace=True)
sercorrout
Out[143]:
user_location_region         0.002227
channel                      0.002301
user_id                      0.002303
cnt                          0.003040
user_location_city           0.003254
srch_rm_cnt                  0.005417
orig_destination_distance    0.009926
srch_adults_cnt              0.010323
srch_destination_id          0.010529
user_location_country        0.011087
is_mobile                    0.012176
hotel_continent              0.012815
posa_continent               0.013590
srch_children_cnt            0.017408
is_booking                   0.019307
site_name                    0.022608
hotel_country                0.023831
srch_destination_type_id     0.028516
hotel_market                 0.034823
is_package                   0.036482
Name: hotel_cluster, dtype: float64
In [163]:
# Min-max standardize values and compare standard deviation

cols = ['srch_adults_cnt','srch_rm_cnt']

def featureselect(cols):
    mmscaled = preprocessing.minmax_scale(df[cols])
    dfmm = pd.DataFrame(mmscaled, columns=cols)
    dvars = dfmm.var().to_dict()
    dcorrs = dict(zip(cols,[sercorrout.loc[k] for k in cols]))
    # compare relative differences
    dscores = {}
    for col in cols:
        val = dvars[col]/sum(dvars.values()) + dcorrs[col]/sum(dcorrs.values())
        dscores[val] = col
    print('Variance:', dvars)
    print('Correlation with output:', dcorrs)
    print('Score:', dscores)
    print('')
    print('Feature to keep from correlated pair:', dscores[max(dscores)])
    print('Feature to drop:', dscores[min(dscores)])
    return

featureselect(cols)
Variance: {'srch_rm_cnt': 0.0033440957707676832, 'srch_adults_cnt': 0.010348964884441899}
Correlation with output: {'srch_rm_cnt': 0.0054174647407086093, 'srch_adults_cnt': 0.010323018936554905}
Score: {0.58839225259332428: 'srch_rm_cnt', 1.4116077474066757: 'srch_adults_cnt'}

Feature to keep from correlated pair: srch_adults_cnt
Feature to drop: srch_rm_cnt
In [164]:
p1 = 'srch_destination_id'

featureselect([p1, pairs[p1][0]])
Variance: {'srch_destination_type_id': 0.072431445265535896, 'srch_destination_id': 0.028589649539407042}
Correlation with output: {'srch_destination_type_id': 0.028516253917846818, 'srch_destination_id': 0.010529017100625929}
Score: {0.55266850371112275: 'srch_destination_id', 1.4473314962888773: 'srch_destination_type_id'}

Feature to keep from correlated pair: srch_destination_type_id
Feature to drop: srch_destination_id
In [192]:
p1 = 'site_name'

featureselect([p1, pairs[p1][0]])
Variance: {'posa_continent': 0.035079832030670187, 'site_name': 0.055031869189630621}
Correlation with output: {'posa_continent': 0.013589939138183623, 'site_name': 0.022607651686052908}
Score: {0.76473042004943226: 'posa_continent', 1.2352695799505677: 'site_name'}

Feature to keep from correlated pair: site_name
Feature to drop: posa_continent
In [215]:
# Remove columns

subcols = dropcol(subcols, ['srch_rm_cnt','srch_destination_id','posa_continent'])
print(len(subcols))
print(subcols)
18
['site_name', 'user_location_country', 'user_location_region', 'user_location_city', 'orig_destination_distance', 'user_id', 'is_mobile', 'is_package', 'channel', 'srch_adults_cnt', 'srch_children_cnt', 'srch_destination_type_id', 'is_booking', 'cnt', 'hotel_continent', 'hotel_country', 'hotel_market', 'hotel_cluster']

Using this strategy for feature selection returns the above list of features to retain.


Additional feature selection

Do the binary features pass the variance threshold?

Binary column names begin with is.
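
For a binary feature that equals 1 with probability $p$, the variance is $p(1-p)$; with the threshold $p = 0.9$ used below, any binary column in which one value accounts for more than about 90% of rows (variance below 0.9 × 0.1 = 0.09) is dropped.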

In [182]:
p = re.compile('^(is)')
bicols = [c for c in df.columns if p.search(c)]
bicols
Out[182]:
['is_mobile', 'is_package', 'is_booking']
In [196]:
# Probability of occurrence threshold
p = 0.9

sel = VarianceThreshold(p*(1-p))
sel.fit_transform(df[bicols])
bipass = [bicols[i] for i in range(len(bicols)) if sel.get_support()[i]==True]
print('Features that pass threshold:', bipass)
bidrop = [c for c in bicols if c not in bipass]
print('Features to drop:', bidrop)
Features that pass threshold: ['is_mobile', 'is_package']
Features to drop: ['is_booking']
In [216]:
# Remove columns

subcols = dropcol(subcols, bidrop)
print(len(subcols))
print(subcols)
17
['site_name', 'user_location_country', 'user_location_region', 'user_location_city', 'orig_destination_distance', 'user_id', 'is_mobile', 'is_package', 'channel', 'srch_adults_cnt', 'srch_children_cnt', 'srch_destination_type_id', 'cnt', 'hotel_continent', 'hotel_country', 'hotel_market', 'hotel_cluster']

Remove and set aside output variable for model construction.

In [217]:
subcols = dropcol(subcols, ['hotel_cluster'])
print(len(subcols))
print(subcols)
16
['site_name', 'user_location_country', 'user_location_region', 'user_location_city', 'orig_destination_distance', 'user_id', 'is_mobile', 'is_package', 'channel', 'srch_adults_cnt', 'srch_children_cnt', 'srch_destination_type_id', 'cnt', 'hotel_continent', 'hotel_country', 'hotel_market']

Finally, is user_id important?

In [9]:
len(dfXy['user_id'].unique())/len(dfXy)
Out[9]:
0.8144952774341522

user_id consists mostly of unique values, so it is unlikely to have a significant effect on the outcome.

Machine Learning Approach Using Random Forest

Test random forest model


First identify and remove missing values

In [207]:
dnas = {}
for col in subcols:
    nullcount = df[col].isnull().sum()
    if nullcount > 0:
        dnas[col] = nullcount
print(dnas)
{'orig_destination_distance': 67584}
In [219]:
# Drop this column from the training data since roughly a third of its values are missing
# It is highly correlated with hotel_continent, so little information will be lost

subcols = dropcol(subcols, dnas.keys())
print(len(subcols))
print(subcols)
15
['site_name', 'user_location_country', 'user_location_region', 'user_location_city', 'user_id', 'is_mobile', 'is_package', 'channel', 'srch_adults_cnt', 'srch_children_cnt', 'srch_destination_type_id', 'cnt', 'hotel_continent', 'hotel_country', 'hotel_market']

Create save point

In [11]:
outcols = ['site_name', 'user_location_country', 'user_location_region', 'user_location_city', 
           'is_mobile', 'is_package', 'channel', 'srch_adults_cnt', 'srch_children_cnt', 
           'srch_destination_type_id', 'hotel_continent', 'hotel_country', 'hotel_market', 'hotel_cluster']
dfXy = dfXy[outcols].copy()

dfXy.to_csv(dataurl+'train200thdrop.csv', sep=',', encoding='utf-8', index=False)

Reload data if necessary

In [137]:
dataurl = '/Users/dbricare/Documents/Python/datasets/expedia/'
dfXy = pd.read_csv(dataurl+'train200th.csv', sep=',', encoding='utf-8')

# drop 'cnt' because it is not present in test dataset
# dfXy.drop('cnt', axis=1, inplace=True)
# dfXy.to_csv(dataurl+'train200thdrop.csv', sep=',', encoding='utf-8', index=False)
print(dfXy.shape)
print(list(dfXy.columns))
(188351, 24)
['date_time', 'site_name', 'posa_continent', 'user_location_country', 'user_location_region', 'user_location_city', 'orig_destination_distance', 'user_id', 'is_mobile', 'is_package', 'channel', 'srch_ci', 'srch_co', 'srch_adults_cnt', 'srch_children_cnt', 'srch_rm_cnt', 'srch_destination_id', 'srch_destination_type_id', 'is_booking', 'cnt', 'hotel_continent', 'hotel_country', 'hotel_market', 'hotel_cluster']

Parse date time

In [151]:
def parsedatetime(dfXy):
    # create new feature for length of stay
    for col in ['date_time', 'srch_ci', 'srch_co']:
        dfXy[col] = pd.to_datetime(dfXy[col], errors='coerce')
    # convert booking datetime to day of year
    dfXy['bookdoy'] = dfXy['date_time'].apply(lambda x: x.dayofyear)
    # some check-in/check-out values are missing; fill with the search datetime
    dfXy['srch_co'].fillna(dfXy['date_time'], inplace=True)
    dfXy['srch_ci'].fillna(dfXy['date_time'], inplace=True)
    # convert stay to length in days and check-in day of year
    dfXy['stay'] = dfXy['srch_co']-dfXy['srch_ci']
    dfXy['stay'] = dfXy['stay'].apply(lambda x: x.days)
    dfXy['staydoy'] = dfXy['srch_ci'].apply(lambda x: x.dayofyear)
    return dfXy

# drop 225 rows with missing check-in dates
dfXy.dropna(axis=0, subset=['srch_ci'], inplace=True)

dfXy = parsedatetime(dfXy)

# serdtbook = pd.to_datetime(X['date_time']).apply(lambda x: x.month)
dfXy.head()
Out[151]:
date_time site_name posa_continent user_location_country user_location_region user_location_city orig_destination_distance user_id is_mobile is_package ... srch_destination_type_id is_booking cnt hotel_continent hotel_country hotel_market hotel_cluster bookdoy stay staydoy
0 2014-02-10 21:49:43 18 2 231 68 42296 NaN 150736 0 1 ... 6 0 1 6 105 29 35 41 2 89
1 2014-10-21 17:17:41 13 1 46 171 56407 5763.2976 72327 0 0 ... 1 0 1 5 203 253 58 294 7 339
2 2013-08-31 16:52:22 2 3 66 220 14656 188.2171 193997 1 0 ... 1 0 1 2 50 682 79 243 3 266
3 2014-11-09 12:14:12 34 3 205 155 14703 60.4896 163092 0 0 ... 1 0 1 2 198 401 95 313 1 326
4 2014-06-18 09:22:40 2 3 66 435 40631 4362.4117 120073 0 0 ... 1 0 1 6 204 27 72 169 3 192

5 rows × 27 columns

Split data into train and test sets, separate the output variable, and drop unwanted columns

In [120]:
# dropcols = ['hotel_cluster', 'user_location_country', 'user_location_city', 'user_location_region', 
#             'hotel_country', 'hotel_market']
dropcols = ['hotel_cluster', 'cnt', 'user_id', 'orig_destination_distance', 'srch_rm_cnt', 'srch_destination_id', 
           'posa_continent', 'date_time', 'srch_ci', 'srch_co', 'is_mobile', 'user_location_city', 'channel', 
            'is_booking']
y = dfXy['hotel_cluster'].astype(int)
if len(dropcols)>0:
    X = dfXy.drop(dropcols, axis=1)
else:
    X = dfXy.drop('hotel_cluster', axis=1)
print(X.shape)
print(list(X.columns))

sss = StratifiedShuffleSplit(y, test_size=0.3, random_state=42, n_iter=1)
for train_index, test_index in sss:
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

X.head()
(188126, 13)
['site_name', 'user_location_country', 'user_location_region', 'is_package', 'srch_adults_cnt', 'srch_children_cnt', 'srch_destination_type_id', 'hotel_continent', 'hotel_country', 'hotel_market', 'bookdoy', 'stay', 'staydoy']
Out[120]:
site_name user_location_country user_location_region is_package srch_adults_cnt srch_children_cnt srch_destination_type_id hotel_continent hotel_country hotel_market bookdoy stay staydoy
0 18 231 68 1 4 0 6 6 105 29 41 2 89
1 13 46 171 0 2 0 1 5 203 253 294 7 339
2 2 66 220 0 3 0 1 2 50 682 243 3 266
3 34 205 155 0 2 0 1 2 198 401 313 1 326
4 2 66 435 0 2 0 1 6 204 27 169 3 192
In [121]:
rfparams = {'n_jobs': 3}
gridparams = {'n_estimators': [25,50,75], 'max_depth' : [8,12,16]}

clf = RandomForestClassifier(random_state=42, class_weight='balanced', **rfparams)
# clf = GradientBoostingClassifier(random_state=42, **params)
# clf.fit(X_train, y_train)

grid = GridSearchCV(clf, param_grid=gridparams, cv=3, scoring='log_loss')
grid.fit(X_train, y_train)

print(grid.best_params_)
grid.grid_scores_
{'n_estimators': 75, 'max_depth': 16}
Out[121]:
[mean: -3.89073, std: 0.01437, params: {'n_estimators': 25, 'max_depth': 8},
 mean: -3.88040, std: 0.00680, params: {'n_estimators': 50, 'max_depth': 8},
 mean: -3.87464, std: 0.00304, params: {'n_estimators': 75, 'max_depth': 8},
 mean: -3.76308, std: 0.00178, params: {'n_estimators': 25, 'max_depth': 12},
 mean: -3.73708, std: 0.00570, params: {'n_estimators': 50, 'max_depth': 12},
 mean: -3.72322, std: 0.00779, params: {'n_estimators': 75, 'max_depth': 12},
 mean: -3.91921, std: 0.00832, params: {'n_estimators': 25, 'max_depth': 16},
 mean: -3.76277, std: 0.01243, params: {'n_estimators': 50, 'max_depth': 16},
 mean: -3.71235, std: 0.01038, params: {'n_estimators': 75, 'max_depth': 16}]

Save trained model to disk

In [92]:
joblib.dump(clf, 'gbc200th.pkl', compress=True)
Out[92]:
['gbc200th.pkl']

Relative feature importance

In [60]:
sns.set(font_scale=1.5)
serfeatures = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=True)
_ = serfeatures.plot(kind='barh')
_ = plt.title('Relative Feature Importance')
In [16]:
log_loss(y_test, clf.predict_proba(X_test))
Out[16]:
4.3140226413733886

Looking at the relative feature importances, user city location, channel, and whether the booking was made on a mobile device do not strongly influence the result, so they are dropped to simplify the model.

In [122]:
clf.set_params(**grid.best_params_)
clf.fit(X_train, y_train)

sns.set(font_scale=1.5)
serfeatures = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=True)
_ = serfeatures.plot(kind='barh')
_ = plt.title('Relative Feature Importance')
In [123]:
log_loss(y_test, clf.predict_proba(X_test))
Out[123]:
3.6633599347203667
In [326]:
# Convert predicted class probabilities to the top five cluster recommendations
# (assumes the probability columns correspond to clusters 0-99, in order)

def logtorec(x):
    ypred = list(zip(x, list(range(100))))
    ypred5 = ' '.join([str(t[1]) for t in sorted(ypred, reverse=True)][:5])
    return ypred5
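
The next cell assumes pred holds the class-probability matrix returned by the fitted model on the submission features. A minimal sketch of how it would be produced (Xsubmit is a placeholder name for the formatted test-set features, which are not shown here):

# hypothetical: rows are test examples, columns are per-cluster probabilities
pred = clf.predict_proba(Xsubmit)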
In [327]:
dfpred = pd.DataFrame(pred)
serpred = dfpred.apply(logtorec, axis=1)
serpred.name = 'hotel_cluster'
serpred.to_csv('testoutput.csv', sep=',', index_label='id', header=True)

serpred.head()
Out[327]:
0      91 68 9 2 92
1     82 30 5 64 25
2    91 48 41 55 95
3     30 82 5 62 25
4    91 48 41 18 70
Name: hotel_cluster, dtype: object


The dataset above uses integer values to represent categories; however, the random forest model interprets these values as ordinal, which is not correct.

Repeat random forest


Convert integer-based category features to Bernoulli (binary) features


Using a one-hot encoding will create a series of binary-valued features for each category. To minimize memory usage, store the dataset as a sparse matrix.

Also reduce the number of features to simplify the model further.
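
As a toy illustration of what the encoder produces (hypothetical values, not the competition data): a single integer category column with values 0, 1, 2 expands into three binary columns.

from sklearn.preprocessing import OneHotEncoder
# fit on a toy column; observed categories 0, 1, 2 become three one-hot columns
toy = OneHotEncoder(sparse=True).fit_transform([[0], [2], [1], [2]])
print(toy.toarray())
# [[ 1.  0.  0.]
#  [ 0.  0.  1.]
#  [ 0.  1.  0.]
#  [ 0.  0.  1.]]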

In [72]:
# function definitions

def formatdata(chunk):
    '''Format the data of each stratified chunk.'''
    
    # fill or drop NAs for origin-destination distance
    chunk['orig_destination_distance'].fillna(0.0, inplace=True)
    chunk['orig_destination_distance'] = chunk['orig_destination_distance'].round(4)
    # chunk.dropna(axis=0, subset=['orig_destination_distance'], inplace=True)
      
    return chunk

def stratshufspl(chunk, fraction, ylabel):
    '''Stratified shuffle split of chunks.'''
    sss = StratifiedShuffleSplit(chunk[ylabel], test_size=fraction, 
                                 random_state=42, n_iter=1)
    for _, idx in sss:
        train = chunk.iloc[idx].copy()
    return train


def fractionate(trainiter, fraction, ylabel):
    '''Utilizes only one core.'''
    print('')
    print('loading data...')
    
    # create empty list and add formatted data chunks
    chunks = list()
    for chunk in trainiter:
        # chunk = chunk[chunk['is_booking']==1]   # train only on booking events
        # if using whole dataset skip this step
        if fraction < 1.0:
            chunk = stratshufspl(chunk, fraction, ylabel)
        curr = formatdata(chunk)
        chunks.append(curr)
        
    # concatenate chunks
    train = pd.concat(chunks, axis=0)
    
    # split concatenated set into X and y for ml model fitting
    # X = train.drop([ylabel, 'is_booking'], axis=1, inplace=False)
    # y = train[ylabel]
    return train
In [77]:
dataurl = '/Users/dbricare/Documents/Python/datasets/expedia/'

rawcols = ['user_location_city', 'orig_destination_distance', 'is_booking', 'hotel_cluster']

ylabel = rawcols[-1]

# csviter = pd.read_csv(dataurl+'train.csv.gz', sep=',', compression='gzip', chunksize=2000000, usecols=rawcols)
csviter = pd.read_csv(dataurl+'train.csv.gz', sep=',', chunksize=1000000, usecols=rawcols)

# fractionate returns a single dataframe; split it into features and target here
Xy = fractionate(csviter, 0.2, ylabel)
X = Xy.drop([ylabel, 'is_booking'], axis=1)
y = Xy[ylabel]

print(X.shape)
print(X.columns)
X.head()
loading data...
(396954, 2)
Index(['user_location_city', 'orig_destination_distance'], dtype='object')
Out[77]:
user_location_city orig_destination_distance
669455 25315 3939.6881
255860 5070 414.0835
117285 10553 4094.8274
982084 2428 924.9407
435781 10800 6808.6816
In [72]:
notcats = ['orig_destination_distance']
cats = list(X.columns)
for cat in notcats:
    cats.remove(cat)
    
print('expected number of columns in sparse matrix:', X[cats].max().sum()+len(X.columns))
expected number of columns in sparse matrix: 56509

Check that test data has the correct values after one-hot encoding

In [44]:
# check test data

testiter = pd.read_csv(dataurl+'test.csv.gz', sep=',', compression='gzip', 
                       chunksize=100000, usecols=rawcols[:-1])
first = True
for chunk in testiter:
    chunk = formatdata(chunk)
    currmin = chunk[cats].min()
    currmax = chunk[cats].max()
    if first:
        minvals = currmin
        maxvals = currmax
        first = False
    else:
        if any([both[0]!=both[1] for both in zip(minvals,currmin)]):
            minvals = [min(both[0],both[1]) for both in zip(minvals,currmin)]
        if any([both[0]!=both[1] for both in zip(maxvals,currmax)]):
            maxvals = [max(both[0],both[1]) for both in zip(maxvals,currmax)]
print(minvals,maxvals)
[0] [56507]
In [73]:
catdict = OrderedDict()
for col in X.columns:
    if col in notcats:
        val = False
    else:
        val = True
    catdict.update({col:val})

mask = np.array(list(catdict.values()))
maxfeatures = [56507+1]

print(mask)
print(maxfeatures)
[ True False]
[56508]
In [79]:
# actually create the encoder and check that the number of features matches the expected number

enc = OneHotEncoder(n_values=maxfeatures, categorical_features=mask, dtype=int, sparse=True)

Xsparse = enc.fit_transform(X.values)
print('all features:',sum(maxfeatures)+len(notcats))
print('sparse matrix shape:', Xsparse.shape)
print('total encoding values:', sum(enc.n_values)) 
all features: 56509
sparse matrix shape: (396954, 56509)
total encoding values: 56508

Train model

In [81]:
from sklearn.cross_validation import cross_val_score

est = RandomForestClassifier(random_state=42)
est.set_params(n_jobs=2, n_estimators=50, max_depth=8)

scores = cross_val_score(est, Xsparse, y, scoring='log_loss', cv=5)
print(scores)

# gridprms = {'max_features': ['sqrt', 0.1, 0.2]}

# grid =GridSearchCV(est, param_grid=gridprms, cv=3, scoring='log_loss')
# grid.fit(Xsparse, y)

# print(grid.best_params_)
# grid.grid_scores_
[-4.33611025 -4.34351671 -4.344757   -4.34226746 -4.33936352]

Even after using grid search to identify good hyperparameters for the random forest model, the resulting cross-validation scores remain disappointing. Alternative approaches should be explored.

Non-Machine Learning Approach

The strategy here is to see whether selecting certain features and making simpler, count-based predictions can improve the accuracy of the recommendations.


Load data

In [73]:
dataurl = '/Users/dbricare/Documents/Python/datasets/expedia/'

rawcols = ['srch_destination_id', 'user_location_country', 
           'user_location_region', 'user_location_city', 
           'hotel_market', 'orig_destination_distance', 
           'is_booking','hotel_cluster']

ylabel = rawcols[-1]

csviter = pd.read_csv(dataurl+'train.csv.gz', sep=',', compression='gzip', chunksize=2000000, usecols=rawcols)

Xy = fractionate(csviter, 1, rawcols[-1])

print(Xy.shape)
print(Xy.columns)
Xy.head()
loading data...
(37670293, 8)
Index(['user_location_country', 'user_location_region', 'user_location_city',
       'orig_destination_distance', 'srch_destination_id', 'is_booking',
       'hotel_market', 'hotel_cluster'],
      dtype='object')
Out[73]:
user_location_country user_location_region user_location_city orig_destination_distance srch_destination_id is_booking hotel_market hotel_cluster
0 66 348 48862 2234.2641 8250 0 628 1
1 66 348 48862 2234.2641 8250 1 628 1
2 66 348 48862 2234.2641 8250 0 628 1
3 66 442 35390 913.1932 14984 0 1457 80
4 66 442 35390 913.6259 14984 0 1457 21

What does the distribution of hotel clusters look like?

In [69]:
# Distribution of hotel cluster occurrences

f = Xy['hotel_cluster'].value_counts(sort=False).plot(figsize=(14,6))
mostcommon = Xy['hotel_cluster'].value_counts(sort=True).iloc[:5].index.tolist()
print('Most common hotel clusters:', mostcommon)
_ = f.set_ylabel('Number of Examples')
_ = f.set_xlabel('Hotel Cluster')
Most common hotel clusters: [91, 48, 42, 59, 28]

Divide data into train/test sets to test these models

In [4]:
sss = StratifiedShuffleSplit(Xy[Xy.columns[-1]], n_iter=1, test_size=0.2)
for trainidx, testidx in sss:
    # Xtrain, ytrain = Xy[Xy.columns[:-1]].iloc[trainidx], Xy[Xy.columns[-1]].iloc[trainidx]
    Xytrain = Xy.iloc[trainidx]
    Xtest, ytest = Xy[Xy.columns[:-1]].iloc[testidx], Xy[Xy.columns[-1]].iloc[testidx]

Strategy 1: Predict 5 top occurring hotel clusters based on srch_destination_id

Examining the data shows that certain hotel clusters appear with particular search destination IDs at high frequency relative to others.

Thus, by grouping on these two features and building a lookup dictionary, a strong prediction for the appropriate hotel_cluster can be made.

In [101]:
# create a dictionary keyed by destination ID that scores each hotel cluster
# by its number of occurrences (bookings weighted more heavily than clicks)

match_cols = ["srch_destination_id"]
cluster_cols = match_cols + ['hotel_cluster']

# group on destination ID then hotel cluster
groups = Xy.groupby(cluster_cols)

top_clusters = {}
for idx, data in groups:
    clicks = len(data.is_booking[data.is_booking == False])
    bookings = len(data.is_booking[data.is_booking == True])
    score = bookings + .15 * clicks
    dest_id = str(idx[0])  # idx is (destination ID, hotel cluster)
    if dest_id not in top_clusters:
        top_clusters[dest_id] = {}
    top_clusters[dest_id][idx[-1]] = score
In [74]:
srchidcols = [rawcols[0],rawcols[-2],rawcols[-1]]

Xgroup = Xy[srchidcols].groupby([srchidcols[0],srchidcols[-1]]).count().sort_values('is_booking', ascending=False)
Xgroup.columns = ['count']
Xgroup.head(10)
Out[74]:
count
srch_destination_id hotel_cluster
8250 1 309484
8791 65 177405
8250 45 177232
8267 56 167721
8250 79 167188
8267 70 125801
11439 65 124353
8250 24 122226
12206 1 121597
8250 54 108178
In [76]:
srchidcols = [rawcols[0],rawcols[-2],rawcols[-1]]
Xbookclick = Xy.copy()
Xbookclick['is_booking'] = Xbookclick['is_booking']+0.2   # bookings now weigh 1.2, clicks 0.2

Xgroup = Xbookclick[srchidcols].groupby([srchidcols[0],srchidcols[-1]]).sum().sort_values('is_booking', 
                                                                                          ascending=False)
Xgroup.columns = ['sum']
Xgroup.head(10)
Out[76]:
count
srch_destination_id hotel_cluster
8250 1 88415.799999
45 49206.400000
79 46233.600000
8267 56 44988.200000
8791 65 41152.000000
12206 1 34038.400000
8250 24 33847.200000
8267 70 32021.200000
8250 54 29584.600000
11439 65 29015.600000
In [77]:
Xgroup.columns = ['sum']
Xgroup = Xgroup.round(decimals=1)
Xgroup.head()
Out[77]:
sum
srch_destination_id hotel_cluster
8250 1 88415.799999
45 49206.400000
79 46233.600000
8267 56 44988.200000
8791 65 41152.000000
In [103]:
# extract top 5 hotel clusters for each destination ID

import operator

cluster_dict = {}
for n in top_clusters:
    val = top_clusters[n]
    top = [l[0] for l in sorted(val.items(), key=operator.itemgetter(1), reverse=True)[:5]]
    cluster_dict[n] = top
In [116]:
# peek at test data to make sure the srch_destination_id is included

Xtest.head()
Out[116]:
user_location_city orig_destination_distance srch_destination_id is_booking
315100 39058 569.8766 28622 1
788962 11984 3.9448 14797 1
483909 4868 2713.0000 22930 1
743928 2428 4482.8951 21790 1
283337 33958 319.2291 66 1

Make prediction

This is only an example of how this mechanism works, as the final model will combine it with another predictor.

For each test example, srch_destination_id is looked up in the dictionary and the appropriate top 5 hotel_cluster values are returned.

In [145]:
# example calculation
# create list of top 5 predictions for each row in test set and save to csv

import datetime

nowpt = datetime.datetime.now()
nowstr = nowpt.strftime('%Y%m%d_%H%M')

preds = Xtest['srch_destination_id'].apply(lambda x: ' '.join([str(i) for i in cluster_dict.get(str(x), [])]))
preds.reset_index(drop=True, inplace=True, name='hotel_cluster')
preds.head()
Out[145]:
0    17 51 72 42 43
1    33 32 91 48 13
2    36 62 82 46 29
3     59 21 82 9 99
4    77 48 28 91 47
Name: hotel_cluster, dtype: object
In [125]:
# mapk expects a list of predicted clusters per row rather than the space-joined strings
ml_metrics.mapk([[l] for l in ytest], [cluster_dict.get(str(x), []) for x in Xtest['srch_destination_id']], k=5)
Out[125]:
0.28583392131169572

From this small test sample, the mean average precision for $k=5$ is much improved over the previous random forest models.
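
As a quick illustration of the metric (a toy example, not competition data): if the true cluster is 91 and the prediction list is [48, 91, 3, 7, 2], the average precision at $k=5$ is 0.5, because the single correct cluster appears at rank 2.

ml_metrics.mapk([[91]], [[48, 91, 3, 7, 2]], k=5)   # returns 0.5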

Strategy 2: Find exact matches for several features

Again, other combinations of features appear with high regularity for particular hotel clusters. By grouping the training data in a pandas dataframe and looking up the corresponding index, the appropriate hotel clusters can be found.

In [ ]:
match_cols = ['user_location_country', 'user_location_region', 'user_location_city', 
              'hotel_market', 'orig_destination_distance']

groups = Xy.groupby(match_cols)
    
def generate_exact_matches(row, match_cols):
    index = tuple([row[t] for t in match_cols])   # grab value in cell
    try:
        group = groups.get_group(index)
    except Exception:
        return []
    clusters = list(set(group.hotel_cluster))   # unique hotel clusters
    print(group)
    return clusters

exact_matches = []
for i in range(2):
    exact_matches.append(generate_exact_matches(Xtest.iloc[i], match_cols))
In [78]:
# group by desired features

matchcols = ['user_location_country', 'user_location_region', 'user_location_city', 
              'hotel_market', 'orig_destination_distance']

groups = Xy[matchcols+['hotel_cluster']].groupby(matchcols)
# groups.count().head()
In [79]:
# generate dictionary with features as keys and list of clusters as values

exacts = groups.apply(lambda x: x['hotel_cluster'].unique().tolist())
print(exacts.shape)
dictexacts = exacts.to_dict()
print(list(dictexacts.items())[:2])
(10622932,)
[((66, 174, 16292, 1630, 113.0115), [93]), ((66, 226, 29254, 548, 388.3634), [40])]

Combine two prediction methods

Since certain combinations of the strategy 2 features are not present in the training data, fill in the remaining predictions (up to 5) for hotel_cluster using the strategy 1 methodology, falling back to the overall most common clusters if needed, as sketched below.
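
In sketch form (mirroring the makepred function defined further below, where dictid maps each srch_destination_id to its top clusters), the per-row lookup chain is:

def combined_lookup(row):
    # tier 1: exact match on the strategy 2 feature tuple
    preds = list(dictexacts.get(tuple(row[matchcols]), []))
    # tier 2: top clusters for the row's srch_destination_id (strategy 1)
    if len(preds) < 5:
        preds.extend(dictid[row['srch_destination_id']])
    # tier 3: globally most common clusters as a last resort
    if len(preds) < 5:
        preds.extend(mostcommon)
    return preds[:5]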

In [45]:
# load and format test data for submission

rawcols = ['srch_destination_id', 'user_location_country', 
           'user_location_region', 'user_location_city', 
           'hotel_market', 'orig_destination_distance', 
           'is_booking','hotel_cluster']

testiter = pd.read_csv(dataurl+'test.csv.gz', sep=',', compression='gzip', usecols=rawcols[:-2], 
                       chunksize=1000000)
chunks = list()
for chunk in testiter:
    chunk = formatdata(chunk)
    chunks.append(chunk)

dftest = pd.concat(chunks, axis=0)

print(dftest.shape)
dftest.head()
(2528243, 6)
Out[45]:
user_location_country user_location_region user_location_city orig_destination_distance srch_destination_id hotel_market
0 66 174 37449 5539.0567 12243 27
1 66 174 37449 5873.2923 14474 1540
2 66 142 17440 3975.9776 11353 699
3 66 258 34156 1508.5975 8250 628
4 66 467 36345 66.7913 11812 538
In [82]:
# look up test row based on matchcols
# fill in missing values with prediction from srch_destination_id
# trim list to 5 entries
# needs two dictionaries, srch_destination_id needs grouped data

def srchlist(x):
    if x not in Xgroup.index:
        return ''
    else:
        toplist = Xgroup.loc[x].index.tolist()[:5]
        # joinlist = ' '.join([str(i) for i in toplist])
        return toplist
    
dictid = {}
for srchid in dftest['srch_destination_id'].unique():
    dictid[srchid] = srchlist(srchid)
In [85]:
def makepred(x):
    # copy so the lists stored in dictexacts are not mutated by extend() below
    output = list(dictexacts.get(tuple(x[matchcols]), []))
    if len(output)<5:
        output.extend(dictid[x['srch_destination_id']])
    if len(output)<5:
        output.extend(mostcommon)
    return ' '.join([str(i) for i in output[:5]])

serpred = dftest.apply(makepred, axis=1)

print(serpred.shape)
serpred.head()
(2528243,)
Out[85]:
0     5 37 55 11 8
1    5 91 48 42 59
2    91 0 31 77 91
3     1 1 45 79 24
4    50 51 91 2 42
dtype: object
In [62]:
# how many missing values after strategy #2 required filling with strategy #1

len(serpred)-len(serpred[serpred.str.match(r'\d+')])
Out[62]:
20027

Save data for submission to kaggle

In [86]:
nowpt = datetime.datetime.now()
nowstr = nowpt.strftime('%Y%m%d_%H%M')
serpred.reset_index(drop=True, inplace=True, name='hotel_cluster')
serpred.to_csv('mloutput'+nowstr+'.csv', sep=',', index_label='id', header=True)