Housing Price Prediction with Regression & Ensembling

This project came from the ongoing Kaggle competition House Prices: Advanced Regression Techniques.

In this project, we'll explore historical housing data from Ames, Iowa, with the end goal of building the most accurate predictive model we can for final sale price. We'll take a systematic approach, which includes:

  1. Initial Exploratory Data Analysis
    • Variable Statistics
    • Correlation of Numerical Variables
  2. Data Preprocessing
    • Removing Unnecessary Columns
    • Outliers
    • Null Values
  3. Additional Preprocessing
    • Numerical to Categorical Variables
    • Categorical to Numerical Variables (Label Encoding)
    • Engineering New Features
  4. Adjusting Skewed Variables
    • Target Variable (SalePrice)
    • Independent Variables
  5. Dummy Variables
  6. Overfitted Variables and Other Outliers
  7. Baseline Model Performance
    • KNN Regression
    • SGD Regression
    • Random Forest Regression
    • Linear Regression
    • Ridge Regression
    • Support Vector Regression
    • Lasso Regression
    • Elastic Net Regression
    • Kernel Ridge Regression
    • Decision Tree Regression
    • Gradient Boosting Regression
    • LightGBM Regression
    • XGBoost Regression
  8. Ensemble Models: Simple Average
  9. Stacking Models: Meta-Model
  10. Stacking Models II: Meta-Model From Scratch
  11. Final Predictions
In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import missingno as msno

import warnings
warnings.filterwarnings('ignore')
In [4]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

Initial EDA

In [5]:
train_df
Out[5]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 1456 60 RL 62.0 7917 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 8 2007 WD Normal 175000
1456 1457 20 RL 85.0 13175 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv NaN 0 2 2010 WD Normal 210000
1457 1458 70 RL 66.0 9042 Pave NaN Reg Lvl AllPub ... 0 NaN GdPrv Shed 2500 5 2010 WD Normal 266500
1458 1459 20 RL 68.0 9717 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2010 WD Normal 142125
1459 1460 20 RL 75.0 9937 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 6 2008 WD Normal 147500

1460 rows × 81 columns

In [6]:
test_df
Out[6]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1454 2915 160 RM 21.0 1936 Pave NaN Reg Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2006 WD Normal
1455 2916 160 RM 21.0 1894 Pave NaN Reg Lvl AllPub ... 0 0 NaN NaN NaN 0 4 2006 WD Abnorml
1456 2917 20 RL 160.0 20000 Pave NaN Reg Lvl AllPub ... 0 0 NaN NaN NaN 0 9 2006 WD Abnorml
1457 2918 85 RL 62.0 10441 Pave NaN Reg Lvl AllPub ... 0 0 NaN MnPrv Shed 700 7 2006 WD Normal
1458 2919 60 RL 74.0 9627 Pave NaN Reg Lvl AllPub ... 0 0 NaN NaN NaN 0 11 2006 WD Normal

1459 rows × 80 columns

In [7]:
train_df.describe()
Out[7]:
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
count 1460.000000 1460.000000 1201.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1452.000000 1460.000000 ... 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000
mean 730.500000 56.897260 70.049958 10516.828082 6.099315 5.575342 1971.267808 1984.865753 103.685262 443.639726 ... 94.244521 46.660274 21.954110 3.409589 15.060959 2.758904 43.489041 6.321918 2007.815753 180921.195890
std 421.610009 42.300571 24.284752 9981.264932 1.382997 1.112799 30.202904 20.645407 181.066207 456.098091 ... 125.338794 66.256028 61.119149 29.317331 55.757415 40.177307 496.123024 2.703626 1.328095 79442.502883
min 1.000000 20.000000 21.000000 1300.000000 1.000000 1.000000 1872.000000 1950.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 2006.000000 34900.000000
25% 365.750000 20.000000 59.000000 7553.500000 5.000000 5.000000 1954.000000 1967.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000 2007.000000 129975.000000
50% 730.500000 50.000000 69.000000 9478.500000 6.000000 5.000000 1973.000000 1994.000000 0.000000 383.500000 ... 0.000000 25.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 2008.000000 163000.000000
75% 1095.250000 70.000000 80.000000 11601.500000 7.000000 6.000000 2000.000000 2004.000000 166.000000 712.250000 ... 168.000000 68.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.000000 2009.000000 214000.000000
max 1460.000000 190.000000 313.000000 215245.000000 10.000000 9.000000 2010.000000 2010.000000 1600.000000 5644.000000 ... 857.000000 547.000000 552.000000 508.000000 480.000000 738.000000 15500.000000 12.000000 2010.000000 755000.000000

8 rows × 38 columns

Correlation of Numerical Variables

In [8]:
plt.figure(figsize=(18,18))
sns.heatmap(train_df.corr(),annot=True,cmap="Blues",fmt='.1f',square=True)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9cfcef0550>

Interesting... OverallQual and many of the area/sqft-related variables are highly correlated with our SalePrice target variable. Notice also that many independent variables are correlated with each other. This matters because linear regression models (like the ones we'll use for prediction later in this notebook) give unstable coefficient estimates when independent variables are strongly collinear. We'll keep these variables for now, however, as regularization (e.g. Lasso, Ridge) can mitigate the effects of collinearity later on.
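To make the collinearity concern concrete, here is a minimal sketch that lists feature pairs whose absolute correlation exceeds a cutoff. The helper name correlated_pairs, the 0.8 threshold, and the toy data are assumptions for illustration only; they are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Return (feature_a, feature_b, corr) for pairs with |corr| above threshold."""
    corr = frame.select_dtypes('number').corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], round(float(value), 3)))
    # Strongest relationships first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Toy data: room count is built from living area, year sold is unrelated noise
rng = np.random.default_rng(0)
area = rng.normal(1500, 400, 200)
toy = pd.DataFrame({
    'GrLivArea': area,
    'TotRmsAbvGrd': area / 250 + rng.normal(0, 0.5, 200),
    'YrSold': rng.integers(2006, 2011, 200),
})
pairs = correlated_pairs(toy)
print(pairs)
```

On the real data, the same idea could be applied to train_df to see which predictors overlap most before relying on regularization to handle them.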

Data Preprocessing

Removing Unnecessary Columns

First, let's drop the Id column. There may be others, but for now Id is an obvious choice, since it is only a row identifier.

In [9]:
train_df.drop(['Id'],axis=1,inplace=True)
test_df.drop(['Id'],axis=1,inplace=True)

Outliers

Now let's look for potential outliers and address them.

In [10]:
# View features that are highly correlated with SalePrice
corrs = train_df.corr()[['SalePrice']]
corrs = corrs[corrs['SalePrice']>0.5]
corrs = corrs.sort_values(by='SalePrice',ascending=False)

high_corr_feats = corrs.index[1:]

fig, axes = plt.subplots(5,2,figsize=(13,16))

for i, ax in enumerate(axes.flatten()):
    feat = high_corr_feats[i]
    sns.scatterplot(x=train_df[feat], y=train_df['SalePrice'], ax=ax)
    # Label each subplot directly; plt.xlabel/plt.ylabel only affect the current axes
    ax.set_xlabel(feat)
    ax.set_ylabel('Sale Price')
plt.tight_layout()

On GrLivArea, it looks like those two points on the bottom right are outliers, given they have such high GrLivArea and low SalePrice. Same for the points on the bottom right of TotalBsmtSF and 1stFlrSF. Let's drop these for now.

In [11]:
train_df.shape
Out[11]:
(1460, 80)
In [12]:
# Drop GrLivArea outliers
train_df.drop(train_df[(train_df['SalePrice'] < 300000) &
                       (train_df['GrLivArea'] > 4000)].index,
                       inplace=True)

# Drop TotalBsmtSF and 1stFlrSF outliers
train_df.drop(train_df[(train_df['TotalBsmtSF'] > 6000) |
                       (train_df['1stFlrSF'] > 4000)].index,
                       inplace=True)
In [13]:
train_df.shape
Out[13]:
(1458, 80)

Great! Looks like these outliers boiled down to just two points. Let's visualize the graphs again to ensure all outliers were removed.

In [14]:
fig, axes = plt.subplots(1,3,figsize=(14,4))
feats = ['GrLivArea', 'TotalBsmtSF', '1stFlrSF']

for i, ax in enumerate(axes.flatten()):
    feat = feats[i]
    sns.scatterplot(x=train_df[feat], y=train_df['SalePrice'], ax=ax)
    # Label each subplot directly; plt.xlabel/plt.ylabel only affect the current axes
    ax.set_xlabel(feat)
    ax.set_ylabel('Sale Price')

plt.tight_layout()

Success! There are likely other outliers, but we will address these later in our analysis in a more automated way, using the outlier_test() method of a fitted statsmodels OLS model.

Null Values

Now, let's get an idea of the null values in our data, and let's figure out how best to replace them. First, we'll concatenate the train and test data into one df.

In [15]:
df = pd.concat([train_df.drop(['SalePrice'],axis=1),
                test_df]).reset_index(drop=True)
df.shape
Out[15]:
(2917, 79)

Awesome, now let's visualize our null values in a few different ways: msno matrices, a bar graph showing the percentage of null values per feature, and a table showing null-value totals and percentages.

In [16]:
msno.matrix(train_df)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9cfdfedc10>
In [17]:
msno.matrix(test_df)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d0030e050>
In [18]:
df_na = 100 * df.isnull().sum() / len(df)
df_na = pd.DataFrame(df_na,columns=['%NA'])
df_na = df_na.sort_values('%NA', ascending=False)
df_na = df_na[df_na['%NA']>0]

plt.figure(figsize=(14,6))
sns.barplot(x=df_na.index,y=df_na['%NA'])
plt.xticks(rotation=90)
plt.title('Feature Missing Value Percentage',fontsize=20,fontweight='bold')
Out[18]:
Text(0.5, 1.0, 'Feature Missing Value Percentage')
In [19]:
def missing_percentage(df):
    # Null count per column, sorted descending, keeping only columns with nulls
    total = df.isnull().sum().sort_values(ascending=False)
    total = total[total != 0]
    # Same counts expressed as a percentage of all rows
    percent = round(total / len(df) * 100, 2)
    return pd.concat([total, percent], axis=1, keys=['Total Nulls','Percent Null'])

missing_percentage(df)
Out[19]:
Total Nulls Percent Null
PoolQC 2908 99.69
MiscFeature 2812 96.40
Alley 2719 93.21
Fence 2346 80.43
FireplaceQu 1420 48.68
LotFrontage 486 16.66
GarageCond 159 5.45
GarageQual 159 5.45
GarageYrBlt 159 5.45
GarageFinish 159 5.45
GarageType 157 5.38
BsmtCond 82 2.81
BsmtExposure 82 2.81
BsmtQual 81 2.78
BsmtFinType2 80 2.74
BsmtFinType1 79 2.71
MasVnrType 24 0.82
MasVnrArea 23 0.79
MSZoning 4 0.14
BsmtHalfBath 2 0.07
Utilities 2 0.07
Functional 2 0.07
BsmtFullBath 2 0.07
BsmtFinSF2 1 0.03
BsmtFinSF1 1 0.03
Exterior2nd 1 0.03
BsmtUnfSF 1 0.03
TotalBsmtSF 1 0.03
Exterior1st 1 0.03
SaleType 1 0.03
Electrical 1 0.03
KitchenQual 1 0.03
GarageArea 1 0.03
GarageCars 1 0.03

Great! Now let's fill our null values. We'll take a specific approach for each variable, depending on the context:

In [20]:
# 'None' if NA
for i in ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
         'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'GarageType', 'GarageFinish',
         'GarageQual', 'GarageCond', 'MasVnrType', 'FireplaceQu', 'MSSubClass']:
    df[i] = df[i].fillna('None')


# 0 if NA
for i in ['GarageYrBlt', 'GarageArea', 'GarageCars', 'BsmtFinSF1', 'BsmtFinSF2',
          'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath', 'MasVnrArea']:
    df[i] = df[i].fillna(0)


# Mode if NA (categorical features with only a handful of missing values)
for i in ['Exterior1st', 'Exterior2nd', 'KitchenQual', 'Electrical', 'MSZoning',
         'SaleType', 'Functional']:
    df[i] = df[i].fillna(df[i].mode()[0])


# LotFrontage - Take median of neighborhood
df['LotFrontage'] = df.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))


# Utilities - Drop, as all are 'AllPub', except one 'NoSeWa' in training data.
df.drop(['Utilities'],inplace=True,axis=1)
In [21]:
missing_percentage(df)
Out[21]:
Total Nulls Percent Null

Awesome, we've addressed all our null values.
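The neighborhood-median fill used for LotFrontage is the least obvious of the strategies above, so here is a small sketch of it on toy data. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Neighborhood': ['NAmes', 'NAmes', 'NAmes', 'OldTown', 'OldTown', 'OldTown'],
    'LotFrontage': [70.0, np.nan, 80.0, 50.0, 60.0, np.nan],
})

# Fill each missing value with the median of its own neighborhood
toy['LotFrontage'] = (
    toy.groupby('Neighborhood')['LotFrontage']
       .transform(lambda s: s.fillna(s.median()))
)
print(toy)
```

One caveat: if every LotFrontage value in a neighborhood were missing, the group median would itself be NaN and those rows would stay unfilled, so it is worth re-checking for nulls afterwards, as we did.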

Additional Preprocessing

Numerical --> Categorical Variables

Next, let's change datatype on a few numerical variables that would be better represented categorically.

In [22]:
df['MSSubClass'] = df['MSSubClass'].astype(str)
df['OverallCond'] = df['OverallCond'].astype(str)
df['YrSold'] = df['YrSold'].astype(str)
df['MoSold'] = df['MoSold'].astype(str)

Categorical --> Numerical Variables (Label Encoding)

Now, let's go the other way: let's convert a few categorical variables that would be better represented numerically. Here, we use Label Encoding. Note that LabelEncoder assigns integer codes in sorted label order, so the codes do not necessarily follow the underlying quality ranking. Interestingly, Label Encoding outperformed One Hot Encoding on the final test submissions, which is surprising... usually we would expect the opposite to be true.

In [23]:
from sklearn.preprocessing import LabelEncoder

var = ['FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual',
        'BsmtFinType1', 'BsmtFinType2', 'Functional', 'Fence',
        'BsmtExposure', 'GarageFinish', 'LandSlope', 'LotShape',
        'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass',
        'OverallCond', 'YrSold', 'MoSold']

for i in var:
    mdl = LabelEncoder().fit(list(df[i].values))
    df[i] = mdl.transform(list(df[i].values))

df[var].head()
Out[23]:
FireplaceQu BsmtQual BsmtCond GarageQual GarageCond ExterQual ExterCond HeatingQC PoolQC KitchenQual ... LandSlope LotShape PavedDrive Street Alley CentralAir MSSubClass OverallCond YrSold MoSold
0 3 2 4 5 5 2 4 0 3 2 ... 0 3 2 1 1 1 10 4 2 4
1 5 2 4 5 5 3 4 0 3 3 ... 0 3 2 1 1 1 5 7 1 7
2 5 2 4 5 5 2 4 0 3 2 ... 0 0 2 1 1 1 10 4 2 11
3 2 4 1 5 5 3 4 2 3 2 ... 0 0 2 1 1 1 11 4 0 4
4 5 2 4 5 5 2 4 0 3 2 ... 0 0 2 1 1 1 10 4 2 3

5 rows × 26 columns
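As a side note on what the codes above mean, the sketch below compares LabelEncoder's output with an explicit ordinal mapping on the quality labels used by several of these columns. The rank dictionary is a hypothetical alternative shown for comparison; it is not what this notebook used.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

quality = pd.Series(['Ex', 'Gd', 'TA', 'Fa', 'Po', 'None'])

# LabelEncoder sorts the labels, so the codes follow alphabetical order
alphabetical = LabelEncoder().fit_transform(quality)

# Hypothetical explicit mapping that preserves the quality ranking
rank = {'None': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
ordinal = quality.map(rank)

print(dict(zip(quality, alphabetical)))
print(dict(zip(quality, ordinal)))
```

Tree-based models are largely indifferent to the ordering, but linear models treat the codes as magnitudes, so an explicit mapping may be worth testing against the label-encoded version.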

Engineering New Features

Below are a variety of different features introduced to try to improve prediction accuracy in our final models. Interestingly, only 'Total_SF_Main' improved our final test score (which is why the others are commented out).

In [24]:
df['Total_SF_Main'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
#df['Total_Porch_SF'] = df['WoodDeckSF'] + df['OpenPorchSF'] + df['EnclosedPorch'] + df['3SsnPorch'] + df['ScreenPorch']
#df['Total_Bathrooms'] = df['BsmtFullBath'] + df['FullBath'] + 0.5*(df['HalfBath'] + df['BsmtHalfBath'])
#df['YrBltRemod'] = df['YearBuilt'] + df['YearRemodAdd']
#df['Total_sqr_footage'] = df['BsmtFinSF1'] + df['BsmtFinSF2'] + df['1stFlrSF'] + df['2ndFlrSF']
#df['haspool'] = df['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
#df['has2ndfloor'] = df['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
#df['hasgarage'] = df['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
#df['hasbsmt'] = df['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
#df['hasfireplace'] = df['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)

Adjusting Skewed Variables

Alright, now let's address skew in our variables. Heavily skewed numeric variables (especially the target variable) tend to hurt linear regression models, because a few extreme values can dominate the fit. Let's identify the highly skewed variables and bring them closer to a normal distribution through log and Box-Cox transformations. Let's start with our target variable, SalePrice.

Target Variable

In [25]:
sns.distplot(train_df['SalePrice'])
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9cfdf49990>

Looks like SalePrice is positively skewed. Let's quantify this further...

In [26]:
mu = train_df['SalePrice'].mean()
med = train_df['SalePrice'].median()
std = train_df['SalePrice'].std()
skew = train_df['SalePrice'].skew()
kurt = train_df['SalePrice'].kurt()

print('SalePrice \n mean = {:.2f} \n median = {:.2f} \n standard deviation = {:.2f} \n skew = {:.2f} \n kurtosis = {:.2f}'.format(mu, med, std, skew, kurt))
SalePrice
 mean = 180932.92
 median = 163000.00
 standard deviation = 79495.06
 skew = 1.88
 kurtosis = 6.52
In [27]:
stats.probplot(train_df['SalePrice'], plot=plt)
Out[27]:
((array([-3.3047554 , -3.04752042, -2.90446807, ...,  2.90446807,
          3.04752042,  3.3047554 ]),
  array([ 34900,  35311,  37900, ..., 625000, 745000, 755000])),
 (74213.25959976624, 180932.91906721535, 0.9320154492892367))
In [28]:
sns.residplot(x='GrLivArea', y='SalePrice', data=train_df)
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9ce4917410>

SalePrice has a positive skew of 1.88 and an excess kurtosis of 6.52 (meaning heavy tails, i.e. more extreme values than a normal distribution would produce). Further evidence of skew can be seen in the probability plot above. Finally, we see a heteroscedastic relationship between certain independent variables (e.g. GrLivArea) and our target variable. Let's see if we can normalize SalePrice a bit.

In [29]:
train_df['SalePrice'] = np.log1p(train_df['SalePrice'])

mu = train_df['SalePrice'].mean()
med = train_df['SalePrice'].median()
std = train_df['SalePrice'].std()
skew = train_df['SalePrice'].skew()
kurt = train_df['SalePrice'].kurt()

print('SalePrice \n mean = {:.2f} \n median = {:.2f} \n standard deviation = {:.2f} \n skew = {:.2f} \n kurtosis = {:.2f}'.format(mu, med, std, skew, kurt))

sns.distplot(train_df['SalePrice'])
plt.figure()
stats.probplot(train_df['SalePrice'], plot=plt)
SalePrice
 mean = 12.02
 median = 12.00
 standard deviation = 0.40
 skew = 0.12
 kurtosis = 0.80
Out[29]:
((array([-3.3047554 , -3.04752042, -2.90446807, ...,  2.90446807,
          3.04752042,  3.3047554 ]),
  array([10.46027076, 10.47197813, 10.54273278, ..., 13.34550853,
         13.52114084, 13.53447435])),
 (0.3985294832980731, 12.024015155682548, 0.9953918721417083))
In [30]:
sns.residplot(x='GrLivArea', y='SalePrice', data=train_df)
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9ce57e7dd0>

Great! SalePrice is now much less skewed, more homoscedastic, and more normally distributed. Let's adjust our other highly skewed variables as well, but in a more automated way.
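Since the models will now be trained on log1p(SalePrice), their predictions come out on the log scale and need np.expm1 to get back to dollars. The sketch below illustrates both directions on synthetic log-normal "prices"; the data is simulated and only stands in for SalePrice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed prices (log-normal), standing in for SalePrice
prices = pd.Series(np.exp(rng.normal(12.0, 0.4, 2000)))
log_prices = np.log1p(prices)

print('skew before: {:.2f}'.format(prices.skew()))
print('skew after:  {:.2f}'.format(log_prices.skew()))

# Values on the log scale are mapped back to the original scale with expm1
recovered = np.expm1(log_prices)
print(np.allclose(recovered, prices))
```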

Independent Variables

In [31]:
numeric_var_skews = pd.DataFrame(df.dtypes[df.dtypes != 'object'].index,columns=['Numeric_Variables'])
numeric_var_skews['Skew'] = numeric_var_skews['Numeric_Variables'].apply(lambda x: df[x].skew())
numeric_var_skews.sort_values('Skew',ascending=False,inplace=True)
numeric_var_skews.reset_index(inplace=True,drop=True)
display(numeric_var_skews)
Numeric_Variables Skew
0 MiscVal 21.950962
1 PoolArea 17.697766
2 LotArea 13.116240
3 LowQualFinSF 12.090757
4 3SsnPorch 11.377932
5 LandSlope 4.975813
6 KitchenAbvGr 4.302763
7 BsmtFinSF2 4.146636
8 EnclosedPorch 4.004404
9 ScreenPorch 3.947131
10 BsmtHalfBath 3.932018
11 MasVnrArea 2.623068
12 OpenPorchSF 2.530660
13 WoodDeckSF 1.845741
14 1stFlrSF 1.257933
15 LotFrontage 1.103606
16 GrLivArea 1.069300
17 Total_SF_Main 1.009676
18 BsmtFinSF1 0.981149
19 BsmtUnfSF 0.920161
20 2ndFlrSF 0.861999
21 TotRmsAbvGrd 0.749618
22 Fireplaces 0.725651
23 HalfBath 0.697024
24 TotalBsmtSF 0.672097
25 BsmtFullBath 0.622735
26 OverallCond 0.569607
27 HeatingQC 0.485784
28 FireplaceQu 0.332782
29 BedroomAbvGr 0.326736
30 GarageArea 0.216968
31 OverallQual 0.189688
32 FullBath 0.165599
33 MSSubClass 0.139781
34 YrSold 0.132064
35 BsmtFinType1 0.083684
36 GarageCars -0.219410
37 YearRemodAdd -0.450365
38 BsmtQual -0.488614
39 YearBuilt -0.599503
40 GarageFinish -0.610267
41 LotShape -0.618882
42 MoSold -0.646506
43 Alley -0.652041
44 BsmtExposure -1.117896
45 KitchenQual -1.450560
46 ExterQual -1.800989
47 Fence -1.993675
48 ExterCond -2.497774
49 BsmtCond -2.862744
50 PavedDrive -2.979273
51 BsmtFinType2 -3.044545
52 GarageQual -3.074369
53 CentralAir -3.459334
54 GarageCond -3.596139
55 GarageYrBlt -3.906642
56 Functional -4.056212
57 Street -15.502729
58 PoolQC -21.228518
In [32]:
high_skew = numeric_var_skews[abs(numeric_var_skews['Skew']) > 0.75]

from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

high_skew_vars = high_skew['Numeric_Variables']
for var in high_skew_vars:
    # Fixed lambda of 0.15; boxcox_normmax(df[var] + 1) could estimate one per feature instead
    df[var] = boxcox1p(df[var], 0.15)
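For reference, boxcox1p(x, lambda) computes ((1 + x)^lambda - 1) / lambda, and reduces to log(1 + x) when lambda is 0. The sketch below writes that formula out with NumPy only; boxcox1p_manual is a hypothetical helper name used for illustration, not a SciPy function.

```python
import numpy as np

def boxcox1p_manual(x, lam):
    """Box-Cox transform of 1 + x: ((1 + x)**lam - 1) / lam, or log1p(x) when lam == 0."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log1p(x)
    return ((1.0 + x) ** lam - 1.0) / lam

x = np.array([0.0, 10.0, 100.0, 1000.0])
print(boxcox1p_manual(x, 0.15))   # the lambda used in this notebook
print(boxcox1p_manual(x, 1e-6))   # a tiny lambda approaches the log transform
print(np.log1p(x))
```

So a lambda of 0.15 sits between a log transform (lambda = 0) and no transform at all (lambda = 1): it compresses large values, but less aggressively than a log.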

Dummy Variables

Great! Now that we've tackled skewness, we're ready to create dummy variables.

In [33]:
# Interestingly, not removing the first dummy variable actually improved
# the final test score, thus we keep drop_first=False. Normally, one 
# would want to remove one of the dummy variables to avoid collinearity
# in situations where the dummies represent all possible scenarios.
df_dummy = pd.get_dummies(df, #drop_first = True
                         )
df_dummy.shape
Out[33]:
(2917, 220)
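To see what drop_first changes, here is a tiny sketch on a two-category column. The toy frame is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave'],
                    'LotArea': [8450, 9600, 11250]})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category dropped

print(list(full.columns))
print(list(reduced.columns))

# With all categories kept, the dummy columns of a feature always sum to 1,
# which is the exact collinearity that drop_first=True is meant to avoid
print(full[['Street_Grvl', 'Street_Pave']].astype(int).sum(axis=1).tolist())
```

Regularized models such as Ridge and Lasso can tolerate this redundancy, which may be part of why keeping every dummy did not hurt here.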

Overfitted Variables & Other Outliers

In general, it's a good idea to consider removing variables where the vast majority of values are the same, as this can cause overfitting.

In [34]:
def overfit_reducer(df):
    """
    Return the columns of df in which a single value accounts for more than
    99.94% of the rows. Such near-constant features invite overfitting.
    """
    overfit = []
    for i in df.columns:
        counts = df[i].value_counts()
        most_common = counts.iloc[0]  # count of the most frequent value
        if most_common / len(df) * 100 > 99.94:
            overfit.append(i)
    return overfit

overfitted_features = overfit_reducer(df_dummy[:train_df.shape[0]])

df_dummy = df_dummy.drop(overfitted_features, axis=1)
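It helps to check what the 99.94% cutoff means in practice. The sketch below applies the same test to a toy frame and then works out the arithmetic for 1,458 rows; dominant_share is a hypothetical helper name and the toy columns are made up.

```python
import pandas as pd

def dominant_share(column):
    """Percentage of rows taken by the most frequent value in a column."""
    return column.value_counts().iloc[0] / len(column) * 100

toy = pd.DataFrame({
    'rare_flag': [0] * 1999 + [1],          # 99.95% zeros
    'common_flag': [0] * 1200 + [1] * 800,  # 60% zeros
})
flagged = [c for c in toy.columns if dominant_share(toy[c]) > 99.94]
print(flagged)

# With 1,458 rows, a single differing row already drops the share below the cutoff
share_one_off = 1457 / 1458 * 100
print(round(share_one_off, 3))
```

In other words, on a training set of this size the cutoff only flags columns that are completely constant across the training rows; a lower threshold would be needed to catch columns with just a handful of non-zero entries.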

Let's also remove any additional outliers we may have missed.

In [35]:
# Remove additional outliers
train = df_dummy[:train_df.shape[0]]
Y_train = train_df['SalePrice'].values

import statsmodels.api as sm
ols = sm.OLS(endog = Y_train,
             exog = train)
fit = ols.fit()
test2 = fit.outlier_test()['bonf(p)']

outliers = list(test2[test2<1e-2].index)

print('There were {:.0f} outliers at indices:'.format(len(outliers)))
print(outliers)

train_df = train_df.drop(train_df.index[outliers])
df_dummy = df_dummy.drop(df_dummy.index[outliers])
df_dummy.shape
There were 9 outliers at indices:
[30, 88, 462, 587, 631, 967, 969, 1322, 1451]
Out[35]:
(2908, 220)

Baseline Model Performance

Awesome! We're finally done cleaning up our data, and we're ready to start making predictions! First we'll define a cross-validation strategy, and then we'll proceed with testing a variety of different base models to see which perform best.

In [36]:
# Helpful imports
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler
from sklearn import metrics
from sklearn.linear_model import Ridge, Lasso, ElasticNet, BayesianRidge, LassoLarsIC, LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
import xgboost as xgb
import lightgbm as lgb
from sklearn.svm import SVR

# Designate preprocessed train and test data
train = df_dummy[:train_df.shape[0]]
test = df_dummy[train_df.shape[0]:]
Y_train = train_df['SalePrice'].values

# Cross validation strategy
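def rmsle_cv(model):
    # Note: get_n_splits() returns the integer 5, so cross_val_score runs a plain
    # 5-fold split without shuffling. Pass the KFold object itself as cv to shuffle.
    kf = KFold(5, shuffle=True, random_state=42).get_n_splits(train.values)
    # The target is already log-transformed, so this RMSE is effectively an RMSLE
    rmse = np.sqrt(-cross_val_score(model, train.values, Y_train,
            scoring='neg_mean_squared_error', cv=kf))
    return rmse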
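One detail of rmsle_cv is worth verifying: KFold(...).get_n_splits(...) returns the number of splits, not the splitter itself. The sketch below checks this on a toy array and shows the folds you get from the splitter object versus a plain unshuffled KFold. The array X is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)

splitter = KFold(5, shuffle=True, random_state=42)
n_splits = splitter.get_n_splits(X)
print(type(n_splits).__name__, n_splits)

# Test indices of the first fold: shuffled splitter vs. plain 5-fold
shuffled_first_fold = next(iter(splitter.split(X)))[1]
plain_first_fold = next(iter(KFold(5).split(X)))[1]
print(shuffled_first_fold, plain_first_fold)
```

Passing the integer to cross_val_score is therefore equivalent to cv=5 with contiguous, unshuffled folds. To actually shuffle, one would pass cv=splitter instead, though the scores would then differ slightly from the ones reported in this notebook.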
In [37]:
models = pd.DataFrame([],columns=['model_name','model_object','score_mean','score_std'])

KNN Regression

In [38]:
knr = KNeighborsRegressor(9, weights='distance')
score = rmsle_cv(knr)
print('KNN Regression score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['knr',knr,score.mean(),score.std()]
KNN Regression score = 0.2253  (std = 0.0048)

SGD Regression

In [39]:
from sklearn.linear_model import SGDRegressor
sgd = make_pipeline(RobustScaler(), SGDRegressor(alpha=1000000000000000,l1_ratio=1))
score = rmsle_cv(sgd)
print('SGD score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['sgd',sgd,score.mean(),score.std()]
SGD score = 0.4089  (std = 0.0371)

Random Forest Regression

In [63]:
rfr = RandomForestRegressor()
score = rmsle_cv(rfr)
print('Random Forest score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['rfr',rfr,score.mean(),score.std()]
Random Forest score = 0.1300  (std = 0.0046)

Linear Regression

In [41]:
lnr = LinearRegression()
score = rmsle_cv(lnr)
print('Linear Regression score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['lnr',lnr,score.mean(),score.std()]
Linear Regression score = 0.1064  (std = 0.0083)

Ridge

In [42]:
ridg = make_pipeline(RobustScaler(), Ridge(alpha = .17,normalize=True, random_state=4))
score = rmsle_cv(ridg)
print('Ridge score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['ridg',ridg,score.mean(),score.std()]
Ridge score = 0.1034  (std = 0.0067)

Support Vector Regressor

In [43]:
svr = make_pipeline(RobustScaler(), SVR(C= 20, epsilon= 0.02, gamma=0.00046))
score = rmsle_cv(svr)
print('SVR score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['svr',svr,score.mean(),score.std()]
SVR score = 0.0991  (std = 0.0079)

Lasso Regression

In [44]:
lasso = make_pipeline(RobustScaler(), Lasso(alpha = 0.00042, max_iter=100000, random_state=1))
score = rmsle_cv(lasso)
print('Lasso Score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['lasso',lasso,score.mean(),score.std()]
Lasso Score = 0.0991  (std = 0.0061)

Elastic Net Regression

In [45]:
e_net = make_pipeline(RobustScaler(), ElasticNet(alpha = 0.00045, l1_ratio=0.9, random_state=1))
score = rmsle_cv(e_net)
print('Elastic Net score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['e_net',e_net,score.mean(),score.std()]
Elastic Net score = 0.0990  (std = 0.0061)

Kernel Ridge Regression

In [46]:
kr = make_pipeline(RobustScaler(), KernelRidge(alpha=0.04, kernel='polynomial', degree=1, coef0=2.5))
score = rmsle_cv(kr)
print('Kernel Ridge score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['kr',kr,score.mean(),score.std()]
Kernel Ridge score = 0.1000  (std = 0.0066)

Decision Tree Regression

In [47]:
dtr = make_pipeline(RobustScaler(), DecisionTreeRegressor(random_state=0, max_depth=20))
score = rmsle_cv(dtr)
print('Decision Tree score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['dtr',dtr,score.mean(),score.std()]
Decision Tree score = 0.1894  (std = 0.0115)

Gradient Boosting Regression

In [61]:
gbr = GradientBoostingRegressor(n_estimators=3000,
            learning_rate=0.05, max_depth=4, max_features='sqrt',
            min_samples_leaf=1, min_samples_split=2, loss='huber',
            random_state=5,)
score = rmsle_cv(gbr)
print('Gradient Boosting score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['gbr',gbr,score.mean(),score.std()]
Gradient Boosting score = 0.1030  (std = 0.0089)

LightGBM Regression

In [49]:
lgbr = lgb.LGBMRegressor(objective='regression',num_leaves=5,
        learning_rate=0.05, n_estimators=720, max_bin = 55,
        bagging_fraction = 0.8, bagging_freq = 5,
        feature_fraction = 0.2319, feature_fraction_seed=9, bagging_seed=9,
        min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)
score = rmsle_cv(lgbr)
print('LightGBM score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['lgbr',lgbr,score.mean(),score.std()]
LightGBM score = 0.1028  (std = 0.0063)

XGBoost Regression

In [50]:
xgbr = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,
        learning_rate=0.05, max_depth=3, min_child_weight=1.7817,
        n_estimators=2200, reg_alpha=0.4640, reg_lambda=0.8571,
        subsample=0.5213, silent=True, random_state =7, nthread = -1)
score = rmsle_cv(xgbr)
print('XGBoost score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['xgbr',xgbr,score.mean(),score.std()]
[12:16:15] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


XGBoost score = 0.1045  (std = 0.0065)

All Models Ranked:

In [51]:
models.sort_values(by='score_mean',inplace=True)
models.reset_index(inplace=True,drop=True)
models
Out[51]:
model_name model_object score_mean score_std
0 e_net (RobustScaler(copy=True, quantile_range=(25.0,... 0.099040 0.006107
1 lasso (RobustScaler(copy=True, quantile_range=(25.0,... 0.099052 0.006089
2 svr (RobustScaler(copy=True, quantile_range=(25.0,... 0.099071 0.007941
3 kr (RobustScaler(copy=True, quantile_range=(25.0,... 0.099997 0.006571
4 lgbr LGBMRegressor(bagging_fraction=0.8, bagging_fr... 0.102830 0.006320
5 gbr GradientBoostingRegressor(alpha=0.9, ccp_alpha... 0.102971 0.008909
6 ridg (RobustScaler(copy=True, quantile_range=(25.0,... 0.103362 0.006744
7 xgbr XGBRegressor(base_score=None, booster=None, co... 0.104543 0.006468
8 lnr LinearRegression(copy_X=True, fit_intercept=Tr... 0.106361 0.008303
9 rfr RandomForestRegressor(bootstrap=True, ccp_alph... 0.129406 0.004995
10 dtr (RobustScaler(copy=True, quantile_range=(25.0,... 0.189363 0.011477
11 knr KNeighborsRegressor(algorithm='auto', leaf_siz... 0.225321 0.004786
12 sgd (RobustScaler(copy=True, quantile_range=(25.0,... 0.408903 0.037110

Awesome! We have some pretty strong predictive models so far. Let's see if we can improve our predictions through ensembling.

Ensemble Models: Simple Average

Our goal here is to identify which combinations of models give the best overall cross validation score when taking a simple average of their predictions.

First we'll create the class "AveragingModels" that calculates the simple average prediction of a basket of models.

In [52]:
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone

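class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    """Ensemble that predicts the unweighted mean of its base models' predictions."""
    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        # Fit clones so the original model objects are left untouched
        self.models_ = [clone(x) for x in self.models]
        for model in self.models_:
            model.fit(X, y)
        return self

    def predict(self, X):
        # One column of predictions per base model, averaged row-wise
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)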
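As a point of comparison, scikit-learn's built-in VotingRegressor performs the same unweighted averaging. The sketch below is not part of the original notebook: it uses synthetic data and three placeholder linear models (all assumptions chosen for illustration) to check that the built-in ensemble matches a manual mean of the individual predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic regression data (assumption: stands in for the housing features)
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Placeholder base models; the names only mirror the notebook's naming style
estimators = [
    ("lnr", LinearRegression()),
    ("ridg", Ridge(alpha=1.0)),
    ("lasso", Lasso(alpha=0.1)),
]

# VotingRegressor clones each estimator, fits it, and averages the predictions
voter = VotingRegressor(estimators=estimators).fit(X, y)
voter_pred = voter.predict(X)

# Manual equivalent: fit each model separately and take the row-wise mean
manual_pred = np.mean(
    np.column_stack([est.fit(X, y).predict(X) for _, est in estimators]),
    axis=1,
)

print(np.allclose(voter_pred, manual_pred))
```

The hand-written class is still handy here because it accepts a plain list of estimators, which makes looping over many combinations simple.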

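Next, we'll create a list of every combination of the models with score_mean < 0.11. Note that the loop below stops at len(model_list) - 1, so it covers combinations of one to eight of the nine qualifying models and leaves out the single combination containing all nine.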

In [53]:
from itertools import combinations

def subset(lst, count):
    return list(set(combinations(lst, count)))

model_list = list(models[models['score_mean']<0.11]['model_name'])
combo = list()

for i in range(1,len(model_list)):
    combo = combo + subset(model_list, i)

print('There are {:.0f} combinations. First 20 include:'.format(len(combo)))
combo[:20]
There are 510 combinations. First 20 include:
Out[53]:
[('svr',),
 ('ridg',),
 ('lnr',),
 ('lgbr',),
 ('gbr',),
 ('kr',),
 ('e_net',),
 ('lasso',),
 ('xgbr',),
 ('e_net', 'ridg'),
 ('kr', 'ridg'),
 ('lgbr', 'ridg'),
 ('svr', 'lnr'),
 ('lasso', 'ridg'),
 ('gbr', 'xgbr'),
 ('lgbr', 'gbr'),
 ('svr', 'xgbr'),
 ('lgbr', 'lnr'),
 ('e_net', 'lgbr'),
 ('svr', 'lgbr')]
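The count of 510 can be checked with a little arithmetic: nine models give 2^9 - 1 = 511 non-empty subsets, and leaving out the full nine-model set gives 510. A minimal sketch of that check, with the model names copied from the ranking table above:

```python
from itertools import combinations
from math import comb

# The nine models with score_mean < 0.11 in the ranking table
model_names = ["e_net", "lasso", "svr", "kr", "lgbr", "gbr", "ridg", "xgbr", "lnr"]
n = len(model_names)

# Sizes 1..n-1, matching range(1, len(model_list)) in the cell above
count_sizes_1_to_8 = sum(comb(n, k) for k in range(1, n))

# Sizes 1..n, i.e. every non-empty subset
count_all_nonempty = sum(comb(n, k) for k in range(1, n + 1))

# Enumerate explicitly to confirm the closed-form count
enumerated = [c for k in range(1, n) for c in combinations(model_names, k)]

print(count_sizes_1_to_8, count_all_nonempty, len(enumerated))
```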

And finally, we'll apply AveragingModels to every combination. Note that this may take a while, since each of the 510 ensembles is cross-validated separately.

In [222]:
model_scores = pd.DataFrame([],columns=['models_averaged','score_mean','score_std'])

# Map each model name to its estimator once, rather than filtering the DataFrame per lookup
model_lookup = dict(zip(models['model_name'], models['model_object']))

for names in combo:
    mods = [model_lookup[name] for name in names]
    avg = AveragingModels(models=mods)
    score = rmsle_cv(avg)
    model_scores.loc[len(model_scores)] = [names, score.mean(), score.std()]

model_scores = model_scores.sort_values(by='score_mean')
In [55]:
model_scores.to_csv('simple_average_scores.csv')
In [57]:
model_scores.head(25)
Out[57]:
models_averaged score_mean score_std
248 (e_net, svr, gbr) 0.094975 0.007544
59 (lasso, svr, gbr) 0.094979 0.007540
293 (e_net, lasso, svr, lgbr, gbr) 0.095026 0.006907
346 (e_net, lasso, svr, gbr, xgbr) 0.095083 0.006951
218 (e_net, lasso, svr, gbr) 0.095164 0.007205
1 (e_net, svr, lgbr, gbr) 0.095167 0.007098
84 (lasso, svr, lgbr, gbr) 0.095171 0.007096
426 (e_net, svr, gbr, xgbr) 0.095178 0.007165
418 (lasso, svr, gbr, xgbr) 0.095186 0.007163
90 (e_net, svr, gbr, ridg, xgbr) 0.095273 0.007138
64 (lasso, svr, gbr, ridg, xgbr) 0.095274 0.007136
169 (e_net, lasso, svr, lgbr, gbr, ridg) 0.095280 0.006921
45 (e_net, lasso, svr, lgbr, gbr, ridg, xgbr) 0.095285 0.006813
91 (e_net, lasso, svr, lgbr, gbr, xgbr) 0.095287 0.006783
184 (lasso, svr, lgbr, gbr, ridg) 0.095292 0.007082
264 (e_net, svr, lgbr, gbr, ridg) 0.095294 0.007084
13 (e_net, lasso, svr, gbr, ridg, xgbr) 0.095300 0.006958
149 (lasso, svr, gbr, xgbr, lnr) 0.095318 0.007384
132 (e_net, svr, gbr, xgbr, lnr) 0.095319 0.007388
185 (e_net, lasso, svr, lgbr, gbr, xgbr, lnr) 0.095334 0.006981
94 (lasso, svr, lgbr, gbr, lnr) 0.095340 0.007314
51 (e_net, svr, lgbr, gbr, lnr) 0.095344 0.007317
296 (e_net, lasso, svr, lgbr, gbr, lnr) 0.095384 0.007140
323 (e_net, lasso, svr, gbr, xgbr, lnr) 0.095401 0.007188
209 (e_net, svr, lgbr, gbr, xgbr, lnr) 0.095409 0.007118

Awesome! Above are the top 25 model combinations by cross-validation score.

Note: After testing many of the top combinations above on the final Kaggle test data, we saw the best performance overall from (lasso, gbr, lgbr, kr).

In [54]:
simple_avg_final = AveragingModels(models = (lasso, gbr, lgbr, kr))
score = rmsle_cv(simple_avg_final)
print('Simple Average score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
Simple Average score = 0.0959  (std = 0.0067)

Stacking Models: Meta-Model

Let's see if we can improve our predictions even further by applying a meta-model atop our base model predictions. Keeping consistent with our cross-validation strategy, we'll use mlxtend's StackingCVRegressor to train our meta-model (as opposed to mlxtend's StackingRegressor, which does not train the meta-model on out-of-fold cross-validation predictions from the base models).

In [55]:
from mlxtend.regressor import StackingCVRegressor

stacked = StackingCVRegressor(regressors=(lasso, gbr, lgbr, kr),
                                meta_regressor=lasso,
                                use_features_in_secondary=True)

score = rmsle_cv(stacked)
print('Stacked score = {:.8f}  (std = {:.4f})'.format(score.mean(), score.std()))
Stacked score = 0.09567798  (std = 0.0065)

Great! We were able to improve our score using a stacked model approach. Using the same base models that gave the best simple-average test results above (lasso, gbr, lgbr, kr) and lasso as the meta-model, our cross-validation score improved marginally, from 0.0959 for the simple average to 0.0957.
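To make "out-of-fold" concrete, here is a minimal sketch of the idea using only scikit-learn. It is not mlxtend's implementation: the synthetic data, the two base models, and the Ridge meta-model are all assumptions chosen for illustration. Each base model's prediction for a row comes from a fit that never saw that row, and the meta-model is trained on those predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic data (assumption: stands in for the preprocessed training set)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Placeholder base models and CV splitter
base_models = [Ridge(alpha=1.0), Lasso(alpha=0.1)]
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# One column per base model; every entry is predicted by a model
# trained on the other folds only (out-of-fold prediction)
oof_preds = np.column_stack(
    [cross_val_predict(model, X, y, cv=kf) for model in base_models]
)

# The meta-model learns how to combine the base models' out-of-fold predictions
meta_model = Ridge(alpha=1.0).fit(oof_preds, y)

# For new data, refit each base model on the full training set first
fitted_base = [model.fit(X, y) for model in base_models]

def stacked_predict(X_new):
    """Feed the base models' predictions into the fitted meta-model."""
    base_preds = np.column_stack([m.predict(X_new) for m in fitted_base])
    return meta_model.predict(base_preds)

print(oof_preds.shape, stacked_predict(X[:5]).shape)
```

Training the meta-model on in-sample base predictions instead would let it reward whichever base model overfits the most, which is the leakage the out-of-fold approach is meant to avoid.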

Final Predictions

Yahoo! We made it! For our final prediction, we'll create an ensemble model that is

  • 50% a simple average of lasso, kr, gbr, and lgbr.
  • 50% a stacked meta-model with base models svr, ridg, and xgbr and meta-regressor e_net.
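The 50/50 blend can be sketched as a small helper. This assumes, as the np.expm1 calls in the final cell suggest, that both models predict log1p(SalePrice), so each prediction is converted back to dollars before the weighted average is taken. The function name and the sample numbers are hypothetical.

```python
import numpy as np

def blend_predictions(log_pred_1, log_pred_2, share_1=0.5):
    """Weighted average of two models' predictions, computed in price space.

    Assumes both inputs are on the log1p(price) scale (hypothetical helper,
    not part of the original notebook).
    """
    price_1 = np.expm1(np.asarray(log_pred_1, dtype=float))
    price_2 = np.expm1(np.asarray(log_pred_2, dtype=float))
    return share_1 * price_1 + (1.0 - share_1) * price_2

# Made-up log-scale predictions from two models for two houses
log_a = np.log1p([200000.0, 150000.0])
log_b = np.log1p([210000.0, 140000.0])

blended = blend_predictions(log_a, log_b)
print(np.round(blended, 2))
```

Averaging after the inverse transform gives an arithmetic mean of prices. Averaging the log-scale predictions first and then applying expm1 would give a slightly different, roughly geometric, mean.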

Interesting sidenote: While incorporating the stacked meta-model approach into our final prediction ensemble did improve predictive power in a variety of cases, my strongest result overall when submitting to Kaggle (0.11997) came from a simple average of Lasso, Gradient Boosting, LightGBM, and Kernel Ridge.

In [58]:
stacked_final = StackingCVRegressor(regressors=(svr, ridg, xgbr),
                                meta_regressor=e_net,
                                use_features_in_secondary=True)

score = rmsle_cv(stacked_final)
print('stacked_final score = {:.8f}  (std = {:.4f})'.format(score.mean(), score.std()))
[12:28:57] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


stacked_final score = 0.09646295  (std = 0.0066)
In [59]:
model_1 = simple_avg_final
model_2 = stacked_final
mod_1_share = .5
mod_2_share = .5

model_1.fit(train.values, Y_train)
model_1_test_predictions = np.expm1(model_1.predict(test.values))

model_2.fit(train.values, Y_train)
model_2_test_predictions = np.expm1(model_2.predict(test.values))

test_predictions = mod_1_share * model_1_test_predictions + mod_2_share * model_2_test_predictions

In [133]:
test_id = pd.read_csv('test.csv')[['Id']]
test_id['SalePrice'] = np.round(test_predictions,2)
test_id.to_csv('predictions_simple(lasso,gbr,lgbr,kr)_meta(e_net,svr,ridg,xgbr).csv',index=False)

Thank you so much for going on this journey with me! I hope you found this notebook helpful. Please let me know if you have any questions or suggestions for improving upon my approach. Having a conversation is the best way to improve. 😊
