California Housing Prices

Median house prices for California districts derived from the 1990 census.

Description

This is the dataset used in the second chapter of Aurélien Géron's recent book 'Hands-On Machine Learning with Scikit-Learn and TensorFlow'. It serves as an excellent introduction to implementing machine learning algorithms because it requires rudimentary data cleaning, has an easily understandable list of variables, and sits at an optimal size: neither too toyish nor too cumbersome.

The data contains information from the 1990 California census. So although it may not help you with predicting current housing prices like the Zillow Zestimate dataset, it does provide an accessible introductory dataset for teaching people about the basics of machine learning.

Content

The data pertains to the houses in a given California district, together with summary statistics about them based on the 1990 census data. Be warned: the data aren't cleaned, so some preprocessing steps are required! The columns are as follows; their names are pretty self-explanatory:

longitude

latitude

housing_median_age

total_rooms

total_bedrooms

population

households

median_income

median_house_value

ocean_proximity

Acknowledgements

This data was initially featured in the following paper: Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297.

I encountered it in 'Hands-On Machine Learning with Scikit-Learn and TensorFlow' by Aurélien Géron, who writes: "This dataset is a modified version of the California Housing dataset available from Luís Torgo's page (University of Porto)."

Prepare data

In [179]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Load data

In [180]:
data = pd.read_csv("housing.csv")

Explore data

In [181]:
data.head()
Out[181]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
In [4]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
In [5]:
data.describe()
Out[5]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000
In [6]:
data.hist(bins=50, figsize=(20,15))
Out[6]:
(3×3 grid of matplotlib AxesSubplot objects; the histograms of each numeric attribute are rendered below this cell in the notebook)
  • The attributes have very different scales, so rescaling all of them is recommended.
  • The median house value has a sudden peak around 500000 (its maximum is capped at 500001), which is very different from the rest of the distribution. Removing these capped districts before training the model is recommended (see the sketch just below this list).
  • The median income is centered around 3, and its unit is not documented; presumably 3 means roughly $30,000.
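
As a minimal sketch (not applied in the rest of this notebook), the capped districts could be dropped like this; data_uncapped is a hypothetical name:

# median_house_value tops out at 500001, so treat values at/above 500000 as capped
capped = data['median_house_value'] >= 500000
data_uncapped = data[~capped].copy()
print("removed", capped.sum(), "capped districts, kept", len(data_uncapped))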

Add a new categorical feature in order to split the dataset properly

This new feature will be deleted afterwards.

In [182]:
# divide median_income by 1.5 and round up to form discrete income categories
data['income_cat'] = np.ceil(data['median_income']/1.5)
# merge every category above 5 into category 5
data['income_cat'].where(data['income_cat']<5, 5, inplace=True)
In [183]:
plt.hist(data.income_cat)
Out[183]:
(array([ 822.,    0., 6581.,    0.,    0., 7236.,    0., 3639.,    0.,
        2362.]),
 array([1. , 1.4, 1.8, 2.2, 2.6, 3. , 3.4, 3.8, 4.2, 4.6, 5. ]),
 <a list of 10 Patch objects>)
In [184]:
data.income_cat.value_counts()/len(data.income_cat)
Out[184]:
3.0    0.350581
2.0    0.318847
4.0    0.176308
5.0    0.114438
1.0    0.039826
Name: income_cat, dtype: float64

Split the data into train and test sets based on the new feature

Method 1: use the train_test_split function, ignoring the new feature
Method 2: use StratifiedShuffleSplit, stratifying on the new feature

In [185]:
from sklearn.model_selection import train_test_split
In [186]:
data4train,data4test = train_test_split(data, test_size=0.2, random_state=42)
In [187]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state = 42)
In [188]:
for train_index, test_index in split.split(data, data['income_cat']):
    train_set = data.loc[train_index]
    test_set  = data.loc[test_index] 

Compare the two split methods

In [189]:
def table4income_cat(dataset,df,label):
    # add a column with the income-category proportions of the given dataset
    df[label]=pd.Series(dataset['income_cat'].value_counts()/len(dataset['income_cat']))
    return df
In [190]:
df = pd.DataFrame()
df = table4income_cat(data,df,'All_set')
df = table4income_cat(train_set,df,'train_set_Shuff')
df = table4income_cat(test_set,df,'test_set_Shuff')
df = table4income_cat(data4train,df,'train_set_split')
df = table4income_cat(data4test,df,'test_set_split')
df
Out[190]:
All_set train_set_Shuff test_set_Shuff train_set_split test_set_split
3.0 0.350581 0.350594 0.350533 0.348595 0.358527
2.0 0.318847 0.318859 0.318798 0.317466 0.324370
4.0 0.176308 0.176296 0.176357 0.178537 0.167393
5.0 0.114438 0.114402 0.114583 0.115673 0.109496
1.0 0.039826 0.039850 0.039729 0.039729 0.040213

The table above shows that the StratifiedShuffleSplit method works a little better, as the income-category proportions in its train and test sets are closer to those of the whole dataset.

After splitting the data into train and test sets, we delete the newly added feature.

In [191]:
# delete the new feature
for set_ in (data, data4train, data4test, train_set, test_set):
    set_.drop("income_cat",axis=1,inplace=True)
/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py:3694: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [17]:
data4train.columns
Out[17]:
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')
In [193]:
housing4train = train_set.copy()
In [194]:
housing4train.head()
Out[194]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
17606 -121.89 37.29 38.0 1568.0 351.0 710.0 339.0 2.7042 286600.0 <1H OCEAN
18632 -121.93 37.05 14.0 679.0 108.0 306.0 113.0 6.4214 340600.0 <1H OCEAN
14650 -117.20 32.77 31.0 1952.0 471.0 936.0 462.0 2.8621 196900.0 NEAR OCEAN
3230 -119.61 36.31 25.0 1847.0 371.0 1460.0 353.0 1.8839 46300.0 INLAND
3555 -118.59 34.23 17.0 6592.0 1525.0 4459.0 1463.0 3.0347 254500.0 <1H OCEAN
In [196]:
housing4train.plot(kind='scatter', x='longitude', y='latitude', alpha=0.3,
         s=housing4train['population']/100, label='population',   # set symbol size on population
         c=housing4train['median_house_value'],                  #  set symbol color on house value    
         cmap=plt.get_cmap('jet'),      
         colorbar=True,
         figsize=(10,7))
plt.legend()
Out[196]:
<matplotlib.legend.Legend at 0x1396f09e8>
  • The house value depends strongly on the location and on the population.
In [19]:
import folium

from geopy.geocoders import Nominatim 
In [20]:
def get_latlon(address):
    # look up the latitude/longitude of an address with the Nominatim geocoder
    # (newer geopy versions require a user agent, e.g. Nominatim(user_agent="housing-notebook"))
    geolocator = Nominatim()
    location   = geolocator.geocode(address)
    latitude   = location.latitude
    longitude  = location.longitude
    return latitude, longitude
In [21]:
# latitude, longitude = get_latlon("1070 E Arques Ave, Sunnyvale, CA") 

# California_map = folium.Map(location=[latitude, longitude], zoom_start=12) 
# for lat, lng, house_value, population in zip(data['latitude'], data['longitude'], data['median_house_value'], 
#                                          data['population']): 
#     labels = 'House_Value:{}, Population: {}'.format(house_value,population)
#     label = folium.Popup(labels,parse_html=True)
#     folium.CircleMarker(
#         [lat, lng],
#         radius=5, 
#         color='blue',
#         popup=label,
#         fill=True,
#         #fill_color='#3186cc',
#         fill_opacity=0.7,
#         parse_html=False).add_to(California_map)  
    
# California_map

Check the correlations between pairs of features

Analyze the correlations

In [197]:
sns.heatmap(housing4train.corr(),annot=True)
Out[197]:
<matplotlib.axes._subplots.AxesSubplot at 0x13a5fc748>
In [198]:
corr_matrix = housing4train.corr()
corr_matrix['median_house_value'].sort_values(ascending=False)
Out[198]:
median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64
In [199]:
corr_matrix
Out[199]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
longitude 1.000000 -0.924478 -0.105848 0.048871 0.076598 0.108030 0.063070 -0.019583 -0.047432
latitude -0.924478 1.000000 0.005766 -0.039184 -0.072419 -0.115222 -0.077647 -0.075205 -0.142724
housing_median_age -0.105848 0.005766 1.000000 -0.364509 -0.325047 -0.298710 -0.306428 -0.111360 0.114110
total_rooms 0.048871 -0.039184 -0.364509 1.000000 0.929379 0.855109 0.918392 0.200087 0.135097
total_bedrooms 0.076598 -0.072419 -0.325047 0.929379 1.000000 0.876320 0.980170 -0.009740 0.047689
population 0.108030 -0.115222 -0.298710 0.855109 0.876320 1.000000 0.904637 0.002380 -0.026920
households 0.063070 -0.077647 -0.306428 0.918392 0.980170 0.904637 1.000000 0.010781 0.064506
median_income -0.019583 -0.075205 -0.111360 0.200087 -0.009740 0.002380 0.010781 1.000000 0.687160
median_house_value -0.047432 -0.142724 0.114110 0.135097 0.047689 -0.026920 0.064506 0.687160 1.000000
In [200]:
df = corr_matrix.tail(1).T
df
Out[200]:
median_house_value
longitude -0.047432
latitude -0.142724
housing_median_age 0.114110
total_rooms 0.135097
total_bedrooms 0.047689
population -0.026920
households 0.064506
median_income 0.687160
median_house_value 1.000000
In [201]:
df.sort_values(by='median_house_value',inplace=True) 
df
Out[201]:
median_house_value
latitude -0.142724
longitude -0.047432
population -0.026920
total_bedrooms 0.047689
households 0.064506
housing_median_age 0.114110
total_rooms 0.135097
median_income 0.687160
median_house_value 1.000000
In [202]:
features = list(df[abs(df['median_house_value'])>0.1].T.columns)
features
Out[202]:
['latitude',
 'housing_median_age',
 'total_rooms',
 'median_income',
 'median_house_value']

The strongest correlation is between the median house value and the median income; total_rooms and housing_median_age show weaker positive correlations.

In [28]:
from pandas.plotting import scatter_matrix
In [178]:
#scatter_matrix(data[features],figsize=(12,8))

Interactive plot

In [30]:
from plotly.graph_objs import Scatter,Layout
import plotly
import plotly.offline as py
import plotly.graph_objs as go
In [31]:
plotly.offline.init_notebook_mode(connected=True)
In [32]:
trace0 = go.Scatter(x=data['median_income'], y=data['median_house_value'], mode = 'markers', name = 'value vs income')
In [177]:
#py.iplot([trace0])

Attribute combination

The total number of rooms in a district is not very informative about house prices unless the population is also taken into account. A more relevant variable is the number of people per household, so we introduce a new attribute population_per_household.

In [203]:
housing4train['population_per_household'] = housing4train['population']/housing4train['households']

In addition, the number of rooms per household and the number of bedrooms per room are introduced.

In [204]:
housing4train['bedrooms_per_room'] = housing4train['total_bedrooms']/housing4train['total_rooms']
housing4train['rooms_per_household'] = housing4train['total_rooms']/housing4train['households']
In [205]:
corr_matrix = housing4train.corr() 
In [206]:
df = pd.DataFrame(corr_matrix['median_house_value'])
df.sort_values(by='median_house_value',inplace=True)
In [207]:
df.drop(index='median_house_value',inplace=True)
df.plot(kind='bar')
Out[207]:
<matplotlib.axes._subplots.AxesSubplot at 0x13aed5a90>
In [208]:
features = list(df[abs(df['median_house_value'])>0.15].T.columns)
features
Out[208]:
['bedrooms_per_room', 'median_income']

Prepare Data for Machine Learning Algorithms

In [211]:
X_train = data4train.drop("median_house_value",axis=1)
y_train = data4train["median_house_value"].copy()
In [214]:
X_train.shape,y_train.shape
Out[214]:
((16512, 9), (16512,))
In [217]:
X_test = data4test.drop("median_house_value",axis=1)
y_test = data4test["median_house_value"].copy()

Missing data?

In [215]:
#missing data
def report_missing_data(dataset):
    # count and fraction of missing values per column
    total = dataset.isnull().sum().sort_values(ascending=False)
    percent = dataset.isnull().sum()/len(dataset)

    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    missing_data.plot(kind='bar',y='Total',figsize=(10,6),fontsize=20)
    print(missing_data)
In [218]:
report_missing_data(X_test)
                    Total   Percent
households              0  0.000000
housing_median_age      0  0.000000
latitude                0  0.000000
longitude               0  0.000000
median_income           0  0.000000
ocean_proximity         0  0.000000
population              0  0.000000
total_bedrooms        207  0.050145
total_rooms             0  0.000000
/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:6: FutureWarning:

Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.

To retain the current behavior and silence the warning, pass sort=False


In [216]:
report_missing_data(X_train)
                    Total  Percent
households              0      0.0
housing_median_age      0      0.0
latitude                0      0.0
longitude               0      0.0
median_income           0      0.0
ocean_proximity         0      0.0
population              0      0.0
total_bedrooms          0      0.0
total_rooms             0      0.0
/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:6: FutureWarning:

Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.

To retain the current behavior and silence the warning, pass sort=False


In [221]:
# from here on, use the stratified split (train_set / test_set)
X_train = train_set.drop("median_house_value",axis=1)
y_train = train_set["median_house_value"].copy()

X_test = test_set.drop("median_house_value",axis=1)
y_test = test_set["median_house_value"].copy()
In [222]:
report_missing_data(X_train)
                    Total   Percent
households              0  0.000000
housing_median_age      0  0.000000
latitude                0  0.000000
longitude               0  0.000000
median_income           0  0.000000
ocean_proximity         0  0.000000
population              0  0.000000
total_bedrooms        158  0.009569
total_rooms             0  0.000000
/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:6: FutureWarning:

Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.

To retain the current behavior and silence the warning, pass sort=False


Clean the data

Three ways to take care of the missing values (NaN); see also the Imputer sketch after this list:

  • data.dropna(subset=["total_bedrooms"])
  • data.drop("total_bedrooms",axis=1)
  • data["total_bedrooms"].fillna(median,inplace=True), where median = data["total_bedrooms"].median()
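
As an alternative sketch, the third option can be automated with scikit-learn's Imputer (renamed SimpleImputer in sklearn.impute in later releases), which learns the median on the training set and can reuse it on the test set; X_train_num is a hypothetical name:

from sklearn.preprocessing import Imputer   # from sklearn.impute import SimpleImputer in newer versions

# Imputer only handles numeric columns, so drop the text attribute first
housing_num = X_train.drop("ocean_proximity", axis=1)

imputer = Imputer(strategy="median")
imputer.fit(housing_num)                    # learns the median of every numeric column

# fill the missing values with the learned medians
X_train_num = pd.DataFrame(imputer.transform(housing_num),
                           columns=housing_num.columns,
                           index=housing_num.index)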
In [223]:
X_train['total_bedrooms'].isnull().sum()
Out[223]:
158
In [224]:
missing_feature = pd.DataFrame(X_train.isnull().sum().sort_values(ascending=False)).index[0]
missing_feature
Out[224]:
'total_bedrooms'

Only the total bedrooms contains missing values.

In [225]:
median=X_train[missing_feature].median()
X_train[missing_feature]=X_train[missing_feature].replace(np.nan,median)
X_train[missing_feature].isnull().sum()
Out[225]:
0
In [320]:
# reuse the median learned from the training set, so no test-set statistics leak in
X_test[missing_feature]=X_test[missing_feature].replace(np.nan,median)
X_test[missing_feature].isnull().sum()
Out[320]:
0

Categorical data with one-hot encoding

In [226]:
X_train['ocean_proximity'].unique()
Out[226]:
array(['<1H OCEAN', 'NEAR OCEAN', 'INLAND', 'NEAR BAY', 'ISLAND'],
      dtype=object)
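
For a quick look at what a one-hot encoding of this column produces, pandas' get_dummies can be used (illustration only; the pipeline below uses a CategoricalEncoder so that the same encoding can be reapplied to the test set):

# one 0/1 indicator column per ocean_proximity category
pd.get_dummies(X_train['ocean_proximity']).head()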
In [227]:
housing_num = X_train.drop("ocean_proximity",axis=1)
num_attribs = list(housing_num)
In [228]:
num_attribs
Out[228]:
['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income']
In [229]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,RobustScaler
from sklearn.preprocessing import Imputer   
from sklearn.pipeline import FeatureUnion
#CategoricalEncoder(encoding='onehot-dense')


from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.preprocessing import LabelEncoder
from scipy import sparse

class CategoricalEncoder(BaseEstimator, TransformerMixin):
    """Encode categorical features as a one-hot (sparse/dense) or ordinal numeric array.

    (This is the CategoricalEncoder used in Aurélien Géron's notebooks; it is included
    here because the OneHotEncoder in this scikit-learn version cannot encode string
    categories directly.)
    """

    def __init__(self, encoding='onehot', categories='auto', dtype=np.float64,
                 handle_unknown='error'):
        self.encoding = encoding
        self.categories = categories
        self.dtype = dtype
        self.handle_unknown = handle_unknown

    def fit(self, X, y=None):

        if self.encoding not in ['onehot', 'onehot-dense', 'ordinal']:
            template = ("encoding should be either 'onehot', 'onehot-dense' "
                        "or 'ordinal', got %s")
            raise ValueError(template % self.handle_unknown)

        if self.handle_unknown not in ['error', 'ignore']:
            template = ("handle_unknown should be either 'error' or "
                        "'ignore', got %s")
            raise ValueError(template % self.handle_unknown)

        if self.encoding == 'ordinal' and self.handle_unknown == 'ignore':
            raise ValueError("handle_unknown='ignore' is not supported for"
                             " encoding='ordinal'")

        X = check_array(X, dtype=np.object, accept_sparse='csc', copy=True)
        n_samples, n_features = X.shape

        self._label_encoders_ = [LabelEncoder() for _ in range(n_features)]

        for i in range(n_features):
            le = self._label_encoders_[i]
            Xi = X[:, i]
            if self.categories == 'auto':
                le.fit(Xi)
            else:
                valid_mask = np.in1d(Xi, self.categories[i])
                if not np.all(valid_mask):
                    if self.handle_unknown == 'error':
                        diff = np.unique(Xi[~valid_mask])
                        msg = ("Found unknown categories {0} in column {1}"
                               " during fit".format(diff, i))
                        raise ValueError(msg)
                le.classes_ = np.array(np.sort(self.categories[i]))

        self.categories_ = [le.classes_ for le in self._label_encoders_]

        return self

    def transform(self, X):
        X = check_array(X, accept_sparse='csc', dtype=np.object, copy=True)
        n_samples, n_features = X.shape
        X_int = np.zeros_like(X, dtype=np.int)
        X_mask = np.ones_like(X, dtype=np.bool)

        for i in range(n_features):
            valid_mask = np.in1d(X[:, i], self.categories_[i])

            if not np.all(valid_mask):
                if self.handle_unknown == 'error':
                    diff = np.unique(X[~valid_mask, i])
                    msg = ("Found unknown categories {0} in column {1}"
                           " during transform".format(diff, i))
                    raise ValueError(msg)
                else:
                    X_mask[:, i] = valid_mask
                    X[:, i][~valid_mask] = self.categories_[i][0]
            X_int[:, i] = self._label_encoders_[i].transform(X[:, i])

        if self.encoding == 'ordinal':
            return X_int.astype(self.dtype, copy=False)

        mask = X_mask.ravel()
        n_values = [cats.shape[0] for cats in self.categories_]
        n_values = np.array([0] + n_values)
        indices = np.cumsum(n_values)

        column_indices = (X_int + indices[:-1]).ravel()[mask]
        row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),
                                n_features)[mask]
        data = np.ones(n_samples * n_features)[mask]

        out = sparse.csc_matrix((data, (row_indices, column_indices)),
                                shape=(n_samples, indices[-1]),
                                dtype=self.dtype).tocsr()
        if self.encoding == 'onehot-dense':
            return out.toarray()
        else:
            return out
        
class DataFrameSelector(BaseEstimator,TransformerMixin):
    """Select the given columns from a DataFrame and return them as a NumPy array."""
    def __init__(self,feature_names):
        self.feature_names = feature_names
    def fit(self,X,y=None):
        return self
    def transform(self,X):
        return X[self.feature_names].values
    

# build pipelines
cat_attribs = ['ocean_proximity']
num_attribs = list(housing_num)

num_pipeline = Pipeline([
               ('selector',DataFrameSelector(num_attribs)),      
               ('std_scaler',StandardScaler()), 
                ]) 

# build categorical pipeline
cat_pipeline = Pipeline([
                  ('selector',DataFrameSelector(cat_attribs)),
                  ('cat_encoder',CategoricalEncoder(encoding='onehot-dense')),
              ])


# concatenate all the transforms using "FeatureUnion"
pipelines = FeatureUnion(transformer_list=
                             [ 
                              ('num_pipeline',num_pipeline),
                              ('cat_pipeline',cat_pipeline),
                             ])
In [230]:
X_train_prepared = pipelines.fit_transform(X_train)
In [231]:
X_train_prepared.shape
Out[231]:
(16512, 13)
In [232]:
type(X_train_prepared)
Out[232]:
numpy.ndarray
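
The 13 columns are the 8 scaled numeric attributes followed by one indicator column per ocean_proximity category. A small sketch to list them, assuming cat_encoder was fitted in place by the fit_transform call above:

# recover the prepared column names: numeric attributes + one-hot categories
cat_encoder = cat_pipeline.named_steps['cat_encoder']
prepared_columns = num_attribs + list(cat_encoder.categories_[0])
print(len(prepared_columns), prepared_columns)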

Select and train a model

In [311]:
from sklearn.metrics import mean_absolute_error,mean_squared_error

def model_performace(model,X,y):
    # fit the model and report its RMSE on the training set itself
    # (a single number, so the mean equals the score and the standard deviation is 0)
    model.fit(X,y)
    pred = model.predict(X)
    scores = np.sqrt(mean_squared_error(pred,y))

    print("scores:",scores)
    print("Mean:",scores.mean())
    print("Standard Deviation:",scores.std())
    return model,pred
In [290]:
def plot_pred_true(ypred,ytrue):
    # scatter plot of predicted vs. true house values
    df = pd.DataFrame([ypred,ytrue]).T
    df.columns=['pred','true']
    plt.scatter(df['pred'],df['true'])

Linear regression (prone to underfit the data)

In [291]:
from sklearn.linear_model import LinearRegression
In [313]:
lr = LinearRegression()
lr_model,ypred = model_performace(lr,X_train_prepared, y_train) 
scores: 69050.98178244587
Mean: 69050.98178244587
Standard Deviation: 0.0
In [293]:
plot_pred_true(ypred,y_train)

Support Vector Machine

In [294]:
from sklearn import svm
In [314]:
clf  = svm.SVR(kernel='poly',degree=3)
svr_model,ypred = model_performace(clf,X_train_prepared,y_train)
/anaconda3/lib/python3.6/site-packages/sklearn/svm/base.py:196: FutureWarning:

The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.

scores: 118640.88533698737
Mean: 118640.88533698737
Standard Deviation: 0.0
In [298]:
plot_pred_true(ypred,y_train)
In [285]:
# svm_rmse = 1
# for C in range(2000,100000,20000):
#     for gamma in np.logspace(-8, -6, 3):
#         clf = svm.SVR(kernel='rbf', degree=3, C=C, gamma=gamma).fit(X_train_prepared,y_train)
#         #y_test_pred = clf.predict(X_test)
#         #rmse = np.sqrt(mean_squared_error(y_test_pred,y_test))
#         y_pred = clf.predict(X_train_prepared)
#         rmse = np.sqrt(mean_squared_error(y_pred,y_train))
#         print("C:{}, gamma:{} rmse:{}".format(C,gamma, rmse))

#         if(svm_rmse > rmse):
#             svm_best = clf
#             svm_rmse = rmse

# print("The best score with rmse={}".format(svm_rmse))
C:2000, gamma:1e-08 rmse:118920.42068845521
C:2000, gamma:1e-07 rmse:118918.23139190683
C:2000, gamma:1e-06 rmse:118896.2100453728
C:22000, gamma:1e-08 rmse:118917.98801324304
C:22000, gamma:1e-07 rmse:118893.77790163875
C:22000, gamma:1e-06 rmse:118584.33740604384
C:42000, gamma:1e-08 rmse:118915.54295616361
C:42000, gamma:1e-07 rmse:118869.34364625966
C:42000, gamma:1e-06 rmse:118280.53161633229
C:62000, gamma:1e-08 rmse:118913.09809162146
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-285-a4fc5562af5a> in <module>()
      5         #y_test_pred = clf.predict(X_test)
      6         #rmse = np.sqrt(mean_squared_error(y_test_pred,y_test))
----> 7         y_pred = clf.predict(X_train_prepared)
      8         rmse = np.sqrt(mean_squared_error(y_pred,y_train))
      9         print("C:{}, gamma:{} rmse:{}".format(C,gamma, rmse))

/anaconda3/lib/python3.6/site-packages/sklearn/svm/base.py in predict(self, X)
    325         X = self._validate_for_predict(X)
    326         predict = self._sparse_predict if self._sparse else self._dense_predict
--> 327         return predict(X)
    328 
    329     def _dense_predict(self, X):

/anaconda3/lib/python3.6/site-packages/sklearn/svm/base.py in _dense_predict(self, X)
    348             self.probA_, self.probB_, svm_type=svm_type, kernel=kernel,
    349             degree=self.degree, coef0=self.coef0, gamma=self._gamma,
--> 350             cache_size=self.cache_size)
    351 
    352     def _sparse_predict(self, X):

KeyboardInterrupt: 

Decision Tree (prone to overfit the data)

In [245]:
from sklearn.tree import DecisionTreeRegressor
In [315]:
tree = DecisionTreeRegressor() 
tree_model,ypred= model_performace(tree,X_train_prepared, y_train) 
scores: 0.0
Mean: 0.0
Standard Deviation: 0.0
In [300]:
plot_pred_true(ypred,y_train)

Cross-Validation

A training-set RMSE of 0 means the decision tree has memorized the training data rather than learned a generalizable model, so its error is estimated with 10-fold cross-validation instead.

In [301]:
from sklearn.model_selection import cross_val_score
In [302]:
scores = cross_val_score(tree,X_train_prepared,y_train, scoring="neg_mean_squared_error",cv=10)
In [303]:
tree_rmse_score=np.sqrt(-scores)
In [304]:
def display_scores(scores):
    print("scores:",scores)
    print("Mean:",scores.mean())
    print("Standard Deviation:",scores.std())
In [256]:
display_scores(tree_rmse_score)
scores: [66855.08817709 65281.38390775 72588.16228709 69803.9966613
 65897.83253408 76413.5136803  67201.27627995 70327.10820383
 68426.17405539 67685.95748111]
Mean: 69048.04932678945
Standard Deviation: 3218.1239792468496

Random Forest

In [305]:
from sklearn.ensemble import RandomForestRegressor
In [316]:
forest = RandomForestRegressor()
forest_model,y_pred = model_performace(forest,X_train_prepared,y_train)
/anaconda3/lib/python3.6/site-packages/sklearn/ensemble/forest.py:246: FutureWarning:

The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.

scores: 22100.794740299967
Mean: 22100.794740299967
Standard Deviation: 0.0
In [307]:
plot_pred_true(y_pred,y_train)
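
The forest's training-set RMSE above is still optimistic; for a comparison on the same footing as the decision tree, the 10-fold cross-validation from the previous section could be repeated (a sketch, not run in the original notebook):

# 10-fold cross-validated RMSE for the random forest
forest_scores = cross_val_score(forest, X_train_prepared, y_train,
                                scoring="neg_mean_squared_error", cv=10)
display_scores(np.sqrt(-forest_scores))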

Test dataset

Transform the test data with the pipeline already fitted on the training set (transform only, not fit_transform)

In [321]:
X_test_prepared = pipelines.transform(X_test)
In [324]:
models = [lr_model,svr_model,tree_model,forest_model]

for model in models:
    test_pred = model.predict(X_test_prepared) 
    plot_pred_true(test_pred,y_test)
    plt.show()
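
The scatter plots above are only qualitative; a numeric comparison can be added by computing each model's RMSE on the test set (a sketch using the mean_squared_error imported earlier):

# test-set RMSE for each fitted model
for model in models:
    test_pred = model.predict(X_test_prepared)
    rmse = np.sqrt(mean_squared_error(y_test, test_pred))
    print(type(model).__name__, "test RMSE:", round(rmse, 1))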

Save model

In [276]:
from sklearn.externals import joblib   # in newer scikit-learn versions: import joblib
In [ ]:
# joblib.dump(model, "my_model.pkl")

# my_model_load = joblib.load("my_model.pkl")