Housing Price Prediction with Regression & Ensembling¶

This project came from the ongoing Kaggle competition House Prices: Advanced Regression Techniques.

In this project, we'll explore historical housing data in Ames, Iowa, with the end goal of developing the best predictive model on final sale price. We'll take a systematic approach to do so, which includes:

Initial Exploratory Data Analysis
- Variable Statistics
- Correlation of Numerical Variables
Data Preprocessing
- Removing Unnecessary Columns
- Outliers
- Null Values
Additional Preprocessing
- Numerical to Categorical Variables
- Categorical to Numerical Variables (Label Encoding)
- Engineering New Features
Adjusting Skewed Variables
- Target Variable (SalePrice)
- Independent Variables
Dummy Variables
Overfitted Variables and Other Outliers
Baseline Model Performance
- KNN Regression
- SGD Regression
- Random Forest Regression
- Linear Regression
- Ridge Regression
- Support Vector Regression
- Lasso Regression
- Elastic Net Regression
- Kernel Ridge Regression
- Gradient Boosting Regression
- LightGBM Regression
- XGBoost Regression
Ensemble Models: Simple Average
Stacking Models: Meta-Model
Stacking Models II: Meta-Model From Scratch
Final Predictions

In [3]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import missingno as msno

import warnings
warnings.filterwarnings('ignore')

In [4]:

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

Initial EDA¶

In [5]:

train_df

Out[5]:

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	2	2008	WD	Normal	208500
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	5	2007	WD	Normal	181500
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	9	2008	WD	Normal	223500
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	2	2006	WD	Abnorml	140000
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	12	2008	WD	Normal	250000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1455	1456	60	RL	62.0	7917	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	8	2007	WD	Normal	175000
1456	1457	20	RL	85.0	13175	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	MnPrv	NaN	0	2	2010	WD	Normal	210000
1457	1458	70	RL	66.0	9042	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	GdPrv	Shed	2500	5	2010	WD	Normal	266500
1458	1459	20	RL	68.0	9717	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	4	2010	WD	Normal	142125
1459	1460	20	RL	75.0	9937	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	6	2008	WD	Normal	147500

1460 rows × 81 columns

In [6]:

test_df

Out[6]:

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	ScreenPorch	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition
0	1461	20	RH	80.0	11622	Pave	NaN	Reg	Lvl	AllPub	...	120	0	NaN	MnPrv	NaN	0	6	2010	WD	Normal
1	1462	20	RL	81.0	14267	Pave	NaN	IR1	Lvl	AllPub	...	0	0	NaN	NaN	Gar2	12500	6	2010	WD	Normal
2	1463	60	RL	74.0	13830	Pave	NaN	IR1	Lvl	AllPub	...	0	0	NaN	MnPrv	NaN	0	3	2010	WD	Normal
3	1464	60	RL	78.0	9978	Pave	NaN	IR1	Lvl	AllPub	...	0	0	NaN	NaN	NaN	0	6	2010	WD	Normal
4	1465	120	RL	43.0	5005	Pave	NaN	IR1	HLS	AllPub	...	144	0	NaN	NaN	NaN	0	1	2010	WD	Normal
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1454	2915	160	RM	21.0	1936	Pave	NaN	Reg	Lvl	AllPub	...	0	0	NaN	NaN	NaN	0	6	2006	WD	Normal
1455	2916	160	RM	21.0	1894	Pave	NaN	Reg	Lvl	AllPub	...	0	0	NaN	NaN	NaN	0	4	2006	WD	Abnorml
1456	2917	20	RL	160.0	20000	Pave	NaN	Reg	Lvl	AllPub	...	0	0	NaN	NaN	NaN	0	9	2006	WD	Abnorml
1457	2918	85	RL	62.0	10441	Pave	NaN	Reg	Lvl	AllPub	...	0	0	NaN	MnPrv	Shed	700	7	2006	WD	Normal
1458	2919	60	RL	74.0	9627	Pave	NaN	Reg	Lvl	AllPub	...	0	0	NaN	NaN	NaN	0	11	2006	WD	Normal

1459 rows × 80 columns

In [7]:

train_df.describe()

Out[7]:

	Id	MSSubClass	LotFrontage	LotArea	OverallQual	OverallCond	YearBuilt	YearRemodAdd	MasVnrArea	BsmtFinSF1	...	WoodDeckSF	OpenPorchSF	EnclosedPorch	3SsnPorch	ScreenPorch	PoolArea	MiscVal	MoSold	YrSold	SalePrice
count	1460.000000	1460.000000	1201.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1452.000000	1460.000000	...	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000
mean	730.500000	56.897260	70.049958	10516.828082	6.099315	5.575342	1971.267808	1984.865753	103.685262	443.639726	...	94.244521	46.660274	21.954110	3.409589	15.060959	2.758904	43.489041	6.321918	2007.815753	180921.195890
std	421.610009	42.300571	24.284752	9981.264932	1.382997	1.112799	30.202904	20.645407	181.066207	456.098091	...	125.338794	66.256028	61.119149	29.317331	55.757415	40.177307	496.123024	2.703626	1.328095	79442.502883
min	1.000000	20.000000	21.000000	1300.000000	1.000000	1.000000	1872.000000	1950.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	2006.000000	34900.000000
25%	365.750000	20.000000	59.000000	7553.500000	5.000000	5.000000	1954.000000	1967.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	5.000000	2007.000000	129975.000000
50%	730.500000	50.000000	69.000000	9478.500000	6.000000	5.000000	1973.000000	1994.000000	0.000000	383.500000	...	0.000000	25.000000	0.000000	0.000000	0.000000	0.000000	0.000000	6.000000	2008.000000	163000.000000
75%	1095.250000	70.000000	80.000000	11601.500000	7.000000	6.000000	2000.000000	2004.000000	166.000000	712.250000	...	168.000000	68.000000	0.000000	0.000000	0.000000	0.000000	0.000000	8.000000	2009.000000	214000.000000
max	1460.000000	190.000000	313.000000	215245.000000	10.000000	9.000000	2010.000000	2010.000000	1600.000000	5644.000000	...	857.000000	547.000000	552.000000	508.000000	480.000000	738.000000	15500.000000	12.000000	2010.000000	755000.000000

8 rows × 38 columns

Correlation of Numerical Variables¶

In [8]:

plt.figure(figsize=(18,18))
sns.heatmap(train_df.corr(),annot=True,cmap="Blues",fmt='.1f',square=True)

Out[8]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9cfcef0550>

Interesting... We can see that OverallQual and many of the area/sqft-related variables are highly correlated with our SalePrice target variable. Furthermore, notice that many independent variables are correlated with each other... it's important to keep in mind that linear regression models (like the ones we'll be using for predictive purposes later on in this notebook) require independent variables to have little to no collinearity. We'll keep these variables for now, however, as we can account for collinearity through regularization (i.e. Lasso, Ridge) later on.

Data Preprocessing¶

Removing Unnecessary Columns¶

First, let's drop the ID column. There may be others, but for now ID is an obvious choice.

In [9]:

train_df.drop(['Id'],axis=1,inplace=True)
test_df.drop(['Id'],axis=1,inplace=True)

Outliers¶

Now let's look for potential outliers and address them.

In [10]:

# View features that are highly correlated with SalePrice
corrs = train_df.corr()[['SalePrice']]
corrs = corrs[corrs['SalePrice']>0.5]
corrs = corrs.sort_values(by='SalePrice',ascending=False)

high_corr_feats = corrs.index[1:]

fig, axes = plt.subplots(5,2,figsize=(13,16))

for i, ax in enumerate(axes.flatten()):
    feat = high_corr_feats[i]
    sns.scatterplot(x=train_df[feat], y=train_df['SalePrice'], ax=ax)
    plt.xlabel(feat)
    plt.ylabel('Sale Price')
plt.tight_layout()

On GrLivArea, it looks like those two points on the bottom right are outliers, given they have such high GrLivArea and low SalePrice. Same for the points on the bottom right of TotalBsmtSF and 1stFlrSF. Let's drop these for now.

In [11]:

train_df.shape

Out[11]:

(1460, 80)

In [12]:

# Drop GrLivArea outliers
train_df.drop(train_df[(train_df['SalePrice'] < 300000) &
                       (train_df['GrLivArea'] > 4000)].index,
                       inplace=True)

# Drop TotalBsmtSF and 1stFlrSF outliers
train_df.drop(train_df[(train_df['TotalBsmtSF'] > 6000) |
                       (train_df['1stFlrSF'] > 4000)].index,
                       inplace=True)

In [13]:

train_df.shape

Out[13]:

(1458, 80)

Great! Looks like these outliers boiled down to just two points. Let's visualize the graphs again to ensure all outliers were removed.

In [14]:

fig, axes = plt.subplots(1,3,figsize=(14,4))
feats = ['GrLivArea', 'TotalBsmtSF', '1stFlrSF']

for i, ax in enumerate(axes.flatten()):
    feat = feats[i]
    sns.scatterplot(x=train_df[feat], y=train_df['SalePrice'], ax=ax)
    plt.xlabel(feat)
    plt.ylabel('Sale Price')

plt.tight_layout()

Success! There are likely other outliers, but we will address these later on in our analysis in a more automated way using outlier_test() from statsmodels.api.

Null Values¶

Now, let's get an idea of the null values in our data, and let's figure out how best to replace them. First, we'll concatenate the train and test data into one df.

In [15]:

df = pd.concat([train_df.drop(['SalePrice'],axis=1),
                test_df]).reset_index(drop=True)
df.shape

Out[15]:

(2917, 79)

Awesome, now let's visualize our null values in a few different ways: msno matrices, a bargraph showing feature null-value percentages, and a table showing null-value totals & percentages.

In [16]:

msno.matrix(train_df)

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9cfdfedc10>

In [17]:

msno.matrix(test_df)

Out[17]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9d0030e050>

In [18]:

df_na = 100 * df.isnull().sum() / len(df)
df_na = pd.DataFrame(df_na,columns=['%NA'])
df_na = df_na.sort_values('%NA', ascending=False)
df_na = df_na[df_na['%NA']>0]

plt.figure(figsize=(14,6))
sns.barplot(x=df_na.index,y=df_na['%NA'],)
plt.xticks(rotation = '90')
plt.title('Feature Missing Value Percentage',fontsize=20,fontweight='bold')

Out[18]:

Text(0.5, 1.0, 'Feature Missing Value Percentage')

In [19]:

def missing_percentage(df):
    total = df.isnull().sum().sort_values(ascending = False)[df.isnull().sum().sort_values(ascending = False) != 0]
    percent = round(df.isnull().sum().sort_values(ascending = False)/len(df)*100,2)[round(df.isnull().sum().sort_values(ascending = False)/len(df)*100,2) != 0]
    return pd.concat([total, percent], axis=1, keys=['Total Nulls','Percent Null'])

missing_percentage(df)

Out[19]:

	Total Nulls	Percent Null
PoolQC	2908	99.69
MiscFeature	2812	96.40
Alley	2719	93.21
Fence	2346	80.43
FireplaceQu	1420	48.68
LotFrontage	486	16.66
GarageCond	159	5.45
GarageQual	159	5.45
GarageYrBlt	159	5.45
GarageFinish	159	5.45
GarageType	157	5.38
BsmtCond	82	2.81
BsmtExposure	82	2.81
BsmtQual	81	2.78
BsmtFinType2	80	2.74
BsmtFinType1	79	2.71
MasVnrType	24	0.82
MasVnrArea	23	0.79
MSZoning	4	0.14
BsmtHalfBath	2	0.07
Utilities	2	0.07
Functional	2	0.07
BsmtFullBath	2	0.07
BsmtFinSF2	1	0.03
BsmtFinSF1	1	0.03
Exterior2nd	1	0.03
BsmtUnfSF	1	0.03
TotalBsmtSF	1	0.03
Exterior1st	1	0.03
SaleType	1	0.03
Electrical	1	0.03
KitchenQual	1	0.03
GarageArea	1	0.03
GarageCars	1	0.03

Great! Now let's fill our null values. We'll take a specific approach for each variable, depending on the context:

In [20]:

# 'None' if NA
for i in ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
         'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'GarageType', 'GarageFinish',
         'GarageQual', 'GarageCond', 'MasVnrType', 'FireplaceQu', 'MSSubClass']:
    df[i] = df[i].fillna('None')


# 0 if NA
for i in ['GarageYrBlt', 'GarageArea', 'GarageCars', 'BsmtFinSF1', 'BsmtFinSF2',
          'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath', 'MasVnrArea']:
    df[i] = df[i].fillna(0)


# Exterior1st, Exterior2nd - mode if NA
for i in ['Exterior1st', 'Exterior2nd', 'KitchenQual', 'Electrical', 'MSZoning',
         'SaleType', 'Functional']:
    df[i] = df[i].fillna(df[i].mode()[0])


# LotFrontage - Take median of neighborhood
df['LotFrontage'] = df.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))


# Utilities - Drop, as all are 'AllPub', except one 'NoSeWa in training data.
df.drop(['Utilities'],inplace=True,axis=1)

In [21]:

missing_percentage(df)

Out[21]:

	Total Nulls	Percent Null

Awesome, we've addressed all our null values.

Additional Preprocessing¶

Numerical --> Categorical Variables¶

Next, let's change datatype on a few numerical variables that would be better represented categorically.

In [22]:

df['MSSubClass'] = df['MSSubClass'].astype(str)
df['OverallCond'] = df['OverallCond'].astype(str)
df['YrSold'] = df['YrSold'].astype(str)
df['MoSold'] = df['MoSold'].astype(str)

Categorical --> Numerical Variables (Label Encoding)¶

Now, let's go the other way -- let's change datatype on a few categorical variables that would be better represented numerically. Here, we use Label Encoding. Interestingly, Label Encoding outperformed One Hot Encoding on the final test submissions, which is surprising... usually we would expect the opposite to be true.

In [23]:

from sklearn.preprocessing import LabelEncoder

var = ['FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual',
        'BsmtFinType1', 'BsmtFinType2', 'Functional', 'Fence',
        'BsmtExposure', 'GarageFinish', 'LandSlope', 'LotShape',
        'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass',
        'OverallCond', 'YrSold', 'MoSold']

for i in var:
    mdl = LabelEncoder().fit(list(df[i].values))
    df[i] = mdl.transform(list(df[i].values))

df[var].head()

Out[23]:

	FireplaceQu	BsmtQual	BsmtCond	GarageQual	GarageCond	ExterQual	ExterCond	HeatingQC	PoolQC	KitchenQual	...	LotShape	PavedDrive	Street	Alley	CentralAir	MSSubClass	OverallCond	YrSold	MoSold
0	3	2	4	5	5	2	4	0	3	2	...	3	2	1	1	1	10	4	2	4
1	5	2	4	5	5	3	4	0	3	3	...	3	2	1	1	1	5	7	1	7
2	5	2	4	5	5	2	4	0	3	2	...	0	2	1	1	1	10	4	2	11
3	2	4	1	5	5	3	4	2	3	2	...	0	2	1	1	1	11	4	0	4
4	5	2	4	5	5	2	4	0	3	2	...	0	2	1	1	1	10	4	2	3

5 rows × 26 columns

Engineering New Features¶

Below are a variety of different features introduced to try to improve prediction accuracy in our final models. Interestingly, only 'Total_SF_Main' improved our final test score (which is why the others are commented out).

In [24]:

df['Total_SF_Main'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
#df['Total_Porch_SF'] = df['WoodDeckSF'] + df['OpenPorchSF'] + df['EnclosedPorch'] + df['3SsnPorch'] + df['ScreenPorch']
#df['Total_Bathrooms'] = df['BsmtFullBath'] + df['FullBath'] + 0.5*(df['HalfBath'] + df['BsmtHalfBath'])
#df['YrBltRemod'] = df['YearBuilt'] + df['YearRemodAdd']
#df['Total_sqr_footage'] = df['BsmtFinSF1'] + df['BsmtFinSF2'] + df['1stFlrSF'] + df['2ndFlrSF']
#df['haspool'] = df['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
#df['has2ndfloor'] = df['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
#df['hasgarage'] = df['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
#df['hasbsmt'] = df['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
#df['hasfireplace'] = df['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)

Adjusting Skewed Variables¶

Alright, now let's address skew in our variables. The more skewed our numeric variables (especially our target variable), the worse our linear regression models will perform. Let's see if we can identify these highly skewed variables and attempt to normalize them through log & boxcox transformations. Let's start with our target variable, SalePrice.

Target Variable¶

In [25]:

sns.distplot(train_df['SalePrice'])

Out[25]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9cfdf49990>

Looks like SalePrice is positively skewed. Let's quantify this further...

In [26]:

mu = train_df['SalePrice'].mean()
med = train_df['SalePrice'].median()
std = train_df['SalePrice'].std()
skew = train_df['SalePrice'].skew()
kurt = train_df['SalePrice'].kurt()

print('SalePrice \n mean = {:.2f} \n median = {:.2f} \n standard deviation = {:.2f} \n skew = {:.2f} \n kurtosis = {:.2f}'.format(mu, med, std, skew, kurt))

SalePrice
 mean = 180932.92
 median = 163000.00
 standard deviation = 79495.06
 skew = 1.88
 kurtosis = 6.52

In [27]:

stats.probplot(train_df['SalePrice'], plot=plt)

Out[27]:

((array([-3.3047554 , -3.04752042, -2.90446807, ...,  2.90446807,
          3.04752042,  3.3047554 ]),
  array([ 34900,  35311,  37900, ..., 625000, 745000, 755000])),
 (74213.25959976624, 180932.91906721535, 0.9320154492892367))

In [28]:

sns.residplot('GrLivArea', 'SalePrice', data=train_df)

Out[28]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9ce4917410>

SalePrice has a positive skew of 1.88 and a positive kurtosis of 6.52 (meaning it's vulnerable to outliers). Further evidence of skew can be seen in the probability plot above. Finally, we see a heteroscedastic relationship between certain independent variables (i.e. GrLivArea) and our target variable. Let's see if we can normalize SalePrice a bit.

In [29]:

train_df['SalePrice'] = np.log1p(train_df['SalePrice'])

mu = train_df['SalePrice'].mean()
med = train_df['SalePrice'].median()
std = train_df['SalePrice'].std()
skew = train_df['SalePrice'].skew()
kurt = train_df['SalePrice'].kurt()

print('SalePrice \n mean = {:.2f} \n median = {:.2f} \n standard deviation = {:.2f} \n skew = {:.2f} \n kurtosis = {:.2f}'.format(mu, med, std, skew, kurt))

sns.distplot(train_df['SalePrice'])
plt.figure()
stats.probplot(train_df['SalePrice'], plot=plt)

SalePrice
 mean = 12.02
 median = 12.00
 standard deviation = 0.40
 skew = 0.12
 kurtosis = 0.80

Out[29]:

((array([-3.3047554 , -3.04752042, -2.90446807, ...,  2.90446807,
          3.04752042,  3.3047554 ]),
  array([10.46027076, 10.47197813, 10.54273278, ..., 13.34550853,
         13.52114084, 13.53447435])),
 (0.3985294832980731, 12.024015155682548, 0.9953918721417083))

In [30]:

sns.residplot('GrLivArea', 'SalePrice', data=train_df)

Out[30]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9ce57e7dd0>

Great! SalePrice is now much less skewed, more homoscedastic, and more normally distributed. Let's adjust our other highly skewed variables as well, but in a more automated way.

Independent Variables¶

In [31]:

numeric_var_skews = pd.DataFrame(df.dtypes[df.dtypes != 'object'].index,columns=['Numeric_Variables'])
numeric_var_skews['Skew'] = numeric_var_skews['Numeric_Variables'].apply(lambda x: df[x].skew())
numeric_var_skews.sort_values('Skew',ascending=False,inplace=True)
numeric_var_skews.reset_index(inplace=True,drop=True)
display(numeric_var_skews)

	Numeric_Variables	Skew
0	MiscVal	21.950962
1	PoolArea	17.697766
2	LotArea	13.116240
3	LowQualFinSF	12.090757
4	3SsnPorch	11.377932
5	LandSlope	4.975813
6	KitchenAbvGr	4.302763
7	BsmtFinSF2	4.146636
8	EnclosedPorch	4.004404
9	ScreenPorch	3.947131
10	BsmtHalfBath	3.932018
11	MasVnrArea	2.623068
12	OpenPorchSF	2.530660
13	WoodDeckSF	1.845741
14	1stFlrSF	1.257933
15	LotFrontage	1.103606
16	GrLivArea	1.069300
17	Total_SF_Main	1.009676
18	BsmtFinSF1	0.981149
19	BsmtUnfSF	0.920161
20	2ndFlrSF	0.861999
21	TotRmsAbvGrd	0.749618
22	Fireplaces	0.725651
23	HalfBath	0.697024
24	TotalBsmtSF	0.672097
25	BsmtFullBath	0.622735
26	OverallCond	0.569607
27	HeatingQC	0.485784
28	FireplaceQu	0.332782
29	BedroomAbvGr	0.326736
30	GarageArea	0.216968
31	OverallQual	0.189688
32	FullBath	0.165599
33	MSSubClass	0.139781
34	YrSold	0.132064
35	BsmtFinType1	0.083684
36	GarageCars	-0.219410
37	YearRemodAdd	-0.450365
38	BsmtQual	-0.488614
39	YearBuilt	-0.599503
40	GarageFinish	-0.610267
41	LotShape	-0.618882
42	MoSold	-0.646506
43	Alley	-0.652041
44	BsmtExposure	-1.117896
45	KitchenQual	-1.450560
46	ExterQual	-1.800989
47	Fence	-1.993675
48	ExterCond	-2.497774
49	BsmtCond	-2.862744
50	PavedDrive	-2.979273
51	BsmtFinType2	-3.044545
52	GarageQual	-3.074369
53	CentralAir	-3.459334
54	GarageCond	-3.596139
55	GarageYrBlt	-3.906642
56	Functional	-4.056212
57	Street	-15.502729
58	PoolQC	-21.228518

In [32]:

high_skew = numeric_var_skews[abs(numeric_var_skews['Skew']) > 0.75]

from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

high_skew_vars = high_skew['Numeric_Variables']
for var in high_skew_vars:
    df[var] = boxcox1p(df[var], 0.15, #boxcox_normmax(df[var] + 1)
                      )

Dummy Variables¶

Great! Now that we've tackled skewness, we're ready to create dummy variables.

In [33]:

# Interestingly, not removing the first dummy variable actually improved
# the final test score, thus we keep drop_first=False. Normally, one 
# would want to remove one of the dummy variables to avoid collinearity
# in situations where the dummies represent all possible scenarios.
df_dummy = pd.get_dummies(df, #drop_first = True
                         )
df_dummy.shape

Out[33]:

(2917, 220)

Overfitted Variables & Other Outliers¶

In general, it's a good idea to consider removing variables where the vast majority of values are the same, as this can cause overfitting.

In [34]:

def overfit_reducer(df):
    """
    This function takes in a dataframe and returns a list of features that are overfitted.
    """
    overfit = []
    for i in df.columns:
        counts = df[i].value_counts()
        zeros = counts.iloc[0]
        if zeros / len(df) * 100 > 99.94:
            overfit.append(i)
    overfit = list(overfit)
    return overfit

overfitted_features = overfit_reducer(df_dummy[:train_df.shape[0]])

df_dummy = df_dummy.drop(overfitted_features, axis=1)

Let's also remove any additional outliers we may have missed.

In [35]:

# Remove additional outliers
train = df_dummy[:train_df.shape[0]]
Y_train = train_df['SalePrice'].values

import statsmodels.api as sm
ols = sm.OLS(endog = Y_train,
             exog = train)
fit = ols.fit()
test2 = fit.outlier_test()['bonf(p)']

outliers = list(test2[test2<1e-2].index)

print('There were {:.0f} outliers at indices:'.format(len(outliers)))
print(outliers)

train_df = train_df.drop(train_df.index[outliers])
df_dummy = df_dummy.drop(df_dummy.index[outliers])
df_dummy.shape

There were 9 outliers at indices:
[30, 88, 462, 587, 631, 967, 969, 1322, 1451]

Out[35]:

(2908, 220)

Baseline Model Performance¶

Awesome! We're finally done cleaning up our data, and we're ready to start making predictions! First we'll define a cross-validation strategy, and then we'll proceed with testing a variety of different base models to see which perform best.

In [36]:

# Helpful imports
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler
from sklearn import metrics
from sklearn.linear_model import Ridge, Lasso, ElasticNet, BayesianRidge, LassoLarsIC, LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
import xgboost as xgb
import lightgbm as lgb
from sklearn.svm import SVR

# Designate preprocessed train and test data
train = df_dummy[:train_df.shape[0]]
test = df_dummy[train_df.shape[0]:]
Y_train = train_df['SalePrice'].values

# Cross validation strategy
def rmsle_cv(model):
    kf = KFold(5, shuffle=True, random_state=42).get_n_splits(train.values)
    rmse = np.sqrt(-cross_val_score(model, train.values, Y_train,
            scoring='neg_mean_squared_error', cv=kf))
    return(rmse)

In [37]:

models = pd.DataFrame([],columns=['model_name','model_object','score_mean','score_std'])

KNN Regression¶

In [38]:

knr = KNeighborsRegressor(9, weights='distance')
score = rmsle_cv(knr)
print('KNN Regression score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['knr',knr,score.mean(),score.std()]

KNN Regression score = 0.2253  (std = 0.0048)

SGD Regression¶

In [39]:

from sklearn.linear_model import SGDRegressor
sgd = make_pipeline(RobustScaler(), SGDRegressor(alpha=1000000000000000,l1_ratio=1))
score = rmsle_cv(sgd)
print('SGD score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['sgd',sgd,score.mean(),score.std()]

SGD score = 0.4089  (std = 0.0371)

Random Forest Regression¶

In [63]:

rfr = RandomForestRegressor()
score = rmsle_cv(rfr)
print('Random Forest score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['rfr',rfr,score.mean(),score.std()]

Random Forest score = 0.1300  (std = 0.0046)

Linear Regression¶

In [41]:

lnr = LinearRegression()
score = rmsle_cv(lnr)
print('Linear Regression score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['lnr',lnr,score.mean(),score.std()]

Linear Regression score = 0.1064  (std = 0.0083)

Ridge¶

In [42]:

ridg = make_pipeline(RobustScaler(), Ridge(alpha = .17,normalize=True, random_state=4))
score = rmsle_cv(ridg)
print('Ridge score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['ridg',ridg,score.mean(),score.std()]

Ridge score = 0.1034  (std = 0.0067)

Support Vector Regressor¶

In [43]:

svr = make_pipeline(RobustScaler(), SVR(C= 20, epsilon= 0.02, gamma=0.00046))
score = rmsle_cv(svr)
print('SVR score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['svr',svr,score.mean(),score.std()]

SVR score = 0.0991  (std = 0.0079)

Lasso Regression¶

In [44]:

lasso = make_pipeline(RobustScaler(), Lasso(alpha = 0.00042, max_iter=100000, random_state=1))
score = rmsle_cv(lasso)
print('Lasso Score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['lasso',lasso,score.mean(),score.std()]

Lasso Score = 0.0991  (std = 0.0061)

Elastic Net Regression¶

In [45]:

e_net = make_pipeline(RobustScaler(), ElasticNet(alpha = 0.00045, l1_ratio=0.9, random_state=1))
score = rmsle_cv(e_net)
print('Elastic Net score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['e_net',e_net,score.mean(),score.std()]

Elastic Net score = 0.0990  (std = 0.0061)

Kernel Ridge Regression¶

In [46]:

kr = make_pipeline(RobustScaler(), KernelRidge(alpha=0.04, kernel='polynomial', degree=1, coef0=2.5))
score = rmsle_cv(kr)
print('Kernel Ridge score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['kr',kr,score.mean(),score.std()]

Kernel Ridge score = 0.1000  (std = 0.0066)

Decision Tree Regression¶

In [47]:

dtr = make_pipeline(RobustScaler(), DecisionTreeRegressor(random_state=0, max_depth=20))
score = rmsle_cv(dtr)
print('Decision Tree score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['dtr',dtr,score.mean(),score.std()]

Decision Tree score = 0.1894  (std = 0.0115)

Gradient Boosting Regression¶

In [61]:

gbr = GradientBoostingRegressor(n_estimators=3000,
            learning_rate=0.05, max_depth=4, max_features='sqrt',
            min_samples_leaf=1, min_samples_split=2, loss='huber',
            random_state=5,)
score = rmsle_cv(gbr)
print('Gradient Boosting score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['gbr',gbr,score.mean(),score.std()]

Gradient Boosting score = 0.1030  (std = 0.0089)

LightGBM Regression¶

In [49]:

lgbr = lgb.LGBMRegressor(objective='regression',num_leaves=5,
        learning_rate=0.05, n_estimators=720, max_bin = 55,
        bagging_fraction = 0.8, bagging_freq = 5,
        feature_fraction = 0.2319, feature_fraction_seed=9, bagging_seed=9,
        min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)
score = rmsle_cv(lgbr)
print('LightGBM score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['lgbr',lgbr,score.mean(),score.std()]

LightGBM score = 0.1028  (std = 0.0063)

XGBoost Regression¶

In [50]:

xgbr = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,
        learning_rate=0.05, max_depth=3, min_child_weight=1.7817,
        n_estimators=2200, reg_alpha=0.4640, reg_lambda=0.8571,
        subsample=0.5213, silent=True, random_state =7, nthread = -1)
score = rmsle_cv(xgbr)
print('XGBoost score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))
models.loc[len(models)] = ['xgbr',xgbr,score.mean(),score.std()]

[12:16:15] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:16:23] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:16:31] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:16:38] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:16:47] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


XGBoost score = 0.1045  (std = 0.0065)

All Models Ranked:¶

In [51]:

models.sort_values(by='score_mean',inplace=True)
models.reset_index(inplace=True,drop=True)
models

Out[51]:

	model_name	model_object	score_mean	score_std
0	e_net	(RobustScaler(copy=True, quantile_range=(25.0,...	0.099040	0.006107
1	lasso	(RobustScaler(copy=True, quantile_range=(25.0,...	0.099052	0.006089
2	svr	(RobustScaler(copy=True, quantile_range=(25.0,...	0.099071	0.007941
3	kr	(RobustScaler(copy=True, quantile_range=(25.0,...	0.099997	0.006571
4	lgbr	LGBMRegressor(bagging_fraction=0.8, bagging_fr...	0.102830	0.006320
5	gbr	GradientBoostingRegressor(alpha=0.9, ccp_alpha...	0.102971	0.008909
6	ridg	(RobustScaler(copy=True, quantile_range=(25.0,...	0.103362	0.006744
7	xgbr	XGBRegressor(base_score=None, booster=None, co...	0.104543	0.006468
8	lnr	LinearRegression(copy_X=True, fit_intercept=Tr...	0.106361	0.008303
9	rfr	RandomForestRegressor(bootstrap=True, ccp_alph...	0.129406	0.004995
10	dtr	(RobustScaler(copy=True, quantile_range=(25.0,...	0.189363	0.011477
11	knr	KNeighborsRegressor(algorithm='auto', leaf_siz...	0.225321	0.004786
12	sgd	(RobustScaler(copy=True, quantile_range=(25.0,...	0.408903	0.037110

Awesome! We have some pretty strong predictive models so far. Let's see if we can improve our predictions through ensembling.

Ensemble Models: Simple Average¶

Our goal here is to identify which combinations of models give the best overall cross validation score when taking a simple average of their predictions.

First we'll create the class "AveragingModels" that calculates the simple average prediction of a basket of models.

In [52]:

from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        for model in self.models_:
            model.fit(X, y)
        return self

    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)

Next, we'll create a list of every combination of the models with score_mean < 0.11.

In [53]:

from itertools import combinations

def subset(lst, count):
    return list(set(combinations(lst, count)))

model_list = list(models[models['score_mean']<0.11]['model_name'])
combo = list()

for i in range(1,len(model_list)):
    combo = combo + subset(model_list, i)

print('There are {:.0f} combinations. First 20 include:'.format(len(combo)))
combo[:20]

There are 510 combinations. First 20 include:

Out[53]:

[('svr',),
 ('ridg',),
 ('lnr',),
 ('lgbr',),
 ('gbr',),
 ('kr',),
 ('e_net',),
 ('lasso',),
 ('xgbr',),
 ('e_net', 'ridg'),
 ('kr', 'ridg'),
 ('lgbr', 'ridg'),
 ('svr', 'lnr'),
 ('lasso', 'ridg'),
 ('gbr', 'xgbr'),
 ('lgbr', 'gbr'),
 ('svr', 'xgbr'),
 ('lgbr', 'lnr'),
 ('e_net', 'lgbr'),
 ('svr', 'lgbr')]

And finally, we'll apply AveragingModels to every combination. Note, this may take a while.

In [222]:

model_scores = pd.DataFrame([],columns=['models_averaged','score_mean','score_std'])

for i in range(len(combo)):
    mods = list()
    for j in range(len(combo[i])):
        mods = mods + list(models[models['model_name']==list(combo[i])[j]]['model_object'])
    avg = AveragingModels(models = mods)
    score = rmsle_cv(avg)
    model_scores.loc[len(model_scores)] = [combo[i],score.mean(),score.std()]

model_scores = model_scores.sort_values(by='score_mean')

In [55]:

model_scores.to_csv('simple_average_scores.csv')

In [57]:

model_scores.head(25)

Out[57]:

	models_averaged	score_mean	score_std
248	(e_net, svr, gbr)	0.094975	0.007544
59	(lasso, svr, gbr)	0.094979	0.007540
293	(e_net, lasso, svr, lgbr, gbr)	0.095026	0.006907
346	(e_net, lasso, svr, gbr, xgbr)	0.095083	0.006951
218	(e_net, lasso, svr, gbr)	0.095164	0.007205
1	(e_net, svr, lgbr, gbr)	0.095167	0.007098
84	(lasso, svr, lgbr, gbr)	0.095171	0.007096
426	(e_net, svr, gbr, xgbr)	0.095178	0.007165
418	(lasso, svr, gbr, xgbr)	0.095186	0.007163
90	(e_net, svr, gbr, ridg, xgbr)	0.095273	0.007138
64	(lasso, svr, gbr, ridg, xgbr)	0.095274	0.007136
169	(e_net, lasso, svr, lgbr, gbr, ridg)	0.095280	0.006921
45	(e_net, lasso, svr, lgbr, gbr, ridg, xgbr)	0.095285	0.006813
91	(e_net, lasso, svr, lgbr, gbr, xgbr)	0.095287	0.006783
184	(lasso, svr, lgbr, gbr, ridg)	0.095292	0.007082
264	(e_net, svr, lgbr, gbr, ridg)	0.095294	0.007084
13	(e_net, lasso, svr, gbr, ridg, xgbr)	0.095300	0.006958
149	(lasso, svr, gbr, xgbr, lnr)	0.095318	0.007384
132	(e_net, svr, gbr, xgbr, lnr)	0.095319	0.007388
185	(e_net, lasso, svr, lgbr, gbr, xgbr, lnr)	0.095334	0.006981
94	(lasso, svr, lgbr, gbr, lnr)	0.095340	0.007314
51	(e_net, svr, lgbr, gbr, lnr)	0.095344	0.007317
296	(e_net, lasso, svr, lgbr, gbr, lnr)	0.095384	0.007140
323	(e_net, lasso, svr, gbr, xgbr, lnr)	0.095401	0.007188
209	(e_net, svr, lgbr, gbr, xgbr, lnr)	0.095409	0.007118

Awesome! Above are the top 25 model combinations by cross validation score.

Note: After testing many of the top combinations above on the final Kaggle test data, we saw the best performance overall from (lasso, gbr, lgbr, kr).

In [54]:

simple_avg_final = AveragingModels(models = (lasso, gbr, lgbr, kr))
score = rmsle_cv(simple_avg_final)
print('Simple Average score = {:.4f}  (std = {:.4f})'.format(score.mean(), score.std()))

Simple Average score = 0.0959  (std = 0.0067)

Stacking Models: Meta-Model¶

Let's see if we can improve our predictions even further through applying a meta-model atop our base model predictions. Keeping consistent with our cross-validation strategy, we'll use StackingCVRegressor to train our meta-model (as opposed to StackingRegressor, which does not train the meta-model using the out-of-fold cross-validation predictions from the base models).

In [55]:

from mlxtend.regressor import StackingCVRegressor

stacked = StackingCVRegressor(regressors=(lasso, gbr, lgbr, kr),
                                meta_regressor=lasso,
                                use_features_in_secondary=True)

score = rmsle_cv(stacked)
print('Stacked score = {:.8f}  (std = {:.4f})'.format(score.mean(), score.std()))

Stacked score = 0.09567798  (std = 0.0065)

Great! We were able to improve our score using a stacked model approach. In particular, defining our base models to be the same set of models for which we received the best simple-average test results above (lasso, gbr, lgbr, kr), we were able to marginally improve our cross-validation score by applying the lasso meta-model.

Final Predictions¶

Yahoo! We made it! For our final prediction, we'll create an ensemble model that is

50% a simple average of lasso, kr, gbr, lgbr.
50% a stacked meta model with base svr, ridg, and xgbr and meta regressor e_net.

Interesting sidenote: While incorporating the stacked meta-model approach into our final prediction ensemble did improve predictive power in a variety of cases, my strongest result overall when submitting to Kaggle (.11997) came from a simple average of Lasso, Gradient Boost, LightGMB, and Kernel Ridge.

In [58]:

stacked_final = StackingCVRegressor(regressors=(svr, ridg, xgbr),
                                meta_regressor=e_net,
                                use_features_in_secondary=True)

score = rmsle_cv(stacked_final)
print('stacked_final score = {:.8f}  (std = {:.4f})'.format(score.mean(), score.std()))

[12:28:57] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:29:03] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:29:09] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:29:16] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:29:22] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:29:29] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:29:38] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:29:45] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:29:51] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:29:57] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:30:03] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:30:10] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:30:19] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:30:25] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:30:31] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:30:37] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:30:44] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:30:50] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:31:00] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:31:06] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:31:13] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:31:19] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:31:25] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:31:32] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:31:42] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:31:49] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:31:55] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:32:01] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:32:07] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:32:14] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


stacked_final score = 0.09646295  (std = 0.0066)

In [59]:

model_1 = simple_avg_final
model_2 = stacked_final
mod_1_share = .5
mod_2_share = .5

model_1.fit(train.values, Y_train)
model_1_test_predictions = np.expm1(model_1.predict(test.values))

model_2.fit(train.values, Y_train)
model_2_test_predictions = np.expm1(model_2.predict(test.values))

test_predictions = mod_1_share * model_1_test_predictions + mod_2_share * model_2_test_predictions

[12:32:33] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:32:41] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:32:49] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:32:57] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:33:05] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[12:33:13] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480:
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.

In [133]:

test_id = pd.read_csv('test.csv')[['Id']]
test_id['SalePrice'] = np.round(test_predictions,2)
test_id.to_csv('predictions_simple(lasso,gbr,lgbr,kr)_meta(e_net,svr,ridg,xgbr).csv',index=False)

Thank you so much for going on this journey with me! I hope you found this notebook helpful. Please let me know if you have any questions or if you have suggestion for improving upon my approach - having a conversation is the best way to improve. 😊

In [ ]:

	FireplaceQu	BsmtQual	BsmtCond	GarageQual	GarageCond	ExterQual	ExterCond	HeatingQC	PoolQC	KitchenQual	...	LotShape	PavedDrive	Street	Alley	CentralAir	MSSubClass	OverallCond	YrSold	MoSold
0	3	2	4	5	5	2	4	0	3	2	...	3	2	1	1	1	10	4	2	4
1	5	2	4	5	5	3	4	0	3	3	...	3	2	1	1	1	5	7	1	7
2	5	2	4	5	5	2	4	0	3	2	...	0	2	1	1	1	10	4	2	11
3	2	4	1	5	5	3	4	2	3	2	...	0	2	1	1	1	11	4	0	4
4	5	2	4	5	5	2	4	0	3	2	...	0	2	1	1	1	10	4	2	3

	FireplaceQu	BsmtQual	BsmtCond	GarageQual	GarageCond	ExterQual	ExterCond	HeatingQC	PoolQC	KitchenQual	...	LotShape	PavedDrive	Street	Alley	CentralAir	MSSubClass	OverallCond	YrSold	MoSold
0	3	2	4	5	5	2	4	0	3	2	...	3	2	1	1	1	10	4	2	4
1	5	2	4	5	5	3	4	0	3	3	...	3	2	1	1	1	5	7	1	7
2	5	2	4	5	5	2	4	0	3	2	...	0	2	1	1	1	10	4	2	11
3	2	4	1	5	5	3	4	2	3	2	...	0	2	1	1	1	11	4	0	4
4	5	2	4	5	5	2	4	0	3	2	...	0	2	1	1	1	10	4	2	3

	FireplaceQu	BsmtQual	BsmtCond	GarageQual	GarageCond	ExterQual	ExterCond	HeatingQC	PoolQC	KitchenQual	...	LotShape	PavedDrive	Street	Alley	CentralAir	MSSubClass	OverallCond	YrSold	MoSold
0	3	2	4	5	5	2	4	0	3	2	...	3	2	1	1	1	10	4	2	4
1	5	2	4	5	5	3	4	0	3	3	...	3	2	1	1	1	5	7	1	7
2	5	2	4	5	5	2	4	0	3	2	...	0	2	1	1	1	10	4	2	11
3	2	4	1	5	5	3	4	2	3	2	...	0	2	1	1	1	11	4	0	4
4	5	2	4	5	5	2	4	0	3	2	...	0	2	1	1	1	10	4	2	3