In [1]:
from IPython.display import HTML

Introduction to Machine Learning

Machine learning versus statistics

  • "Statistics draws population inferences from a sample, and machine learning finds generalizable predictive patterns."

  • "Statistical methods have a long-standing focus on inference, which is achieved through the creation and fitting of a project-specific probability model. The model allows us to compute a quantitative measure of confidence that a discovered relationship describes a 'true' effect that is unlikely to result from noise. Furthermore, if enough data are available, we can explicitly verify assumptions (e.g., equal variance) and refine the specified model, if needed."

  • "By contrast, ML concentrates on prediction by using general-purpose learning algorithms to find patterns in often rich and unwieldy data. ML methods are particularly helpful when one is dealing with 'wide data', where the number of input variables exceeds the number of subjects, in contrast to 'long data', where the number of subjects is greater than that of input variables. ML makes minimal assumptions about the data-generating systems; they can be effective even when the data are gathered without a carefully controlled experimental design and in the presence of complicated nonlinear interactions."

Danilo Bzdok, Naomi Altman & Martin Krzywinski, Nature Methods volume 15, pages 233–234 (2018).

Supervised Machine Learning

  • classification (predicting classes)
  • regression (predicting values)

Classification

Binary classification

  • target $y$: two classes (e.g., 0 or 1)
In [2]:
from sklearn.metrics import confusion_matrix, precision_score, accuracy_score, f1_score

Model performance measurements: confusion matrix

  • Accuracy

$$ accuracy = \dfrac{\rm TP + TN}{\rm Total (TP+TN+FP+FN)} $$

  • precision

$$ precision = \dfrac{\rm TP}{\rm Predicted~Positive (TP+FP)} $$

  • recall/sensitivity/True Positive Rate(TPR) $$ recall = \dfrac{\rm TP}{\rm Real~Positive (TP+FN)} $$

  • F1 score

$$ F_1 = \dfrac{2}{\dfrac{1}{precision} + \dfrac{1}{recall}} = 2\times \dfrac{precision*recall}{precision + recall} $$

where

  • TP: true positive = (predict positive| real positive)
  • TN: true negative = (predict negative| real negative)
  • FP: false positive = (predict positive| real negative)
  • FN: false negative = (predict negative| real positive)

Note: For an imbalanced dataset, accuracy can be misleading; recall (together with precision, e.g. via the F1 score) is a more informative measure.
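A minimal sketch of these metrics on hypothetical labels (using the functions imported above plus recall_score; the label arrays are made up for illustration):

from sklearn.metrics import recall_score   # the other metrics were imported above

y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]    # hypothetical ground-truth labels
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]    # hypothetical model predictions

print(confusion_matrix(y_true, y_pred))    # rows: real class, columns: predicted class
print(accuracy_score(y_true, y_pred))      # (TP+TN)/Total
print(precision_score(y_true, y_pred))     # TP/(TP+FP)
print(recall_score(y_true, y_pred))        # TP/(TP+FN)
print(f1_score(y_true, y_pred))            # harmonic mean of precision and recall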

  • Receiver operating characteristic (ROC) curve

True Positive Rate (recall/sensitivity) vs. False Positive Rate (1 - Specificity)

where

  • False Positive Rate (FPR)

$$ FPR = \dfrac{\rm FP}{\rm Real~Negative} = 1 - Specificity $$

  • Specificity/True Negative Rate (TNR) $$ Specificity = \dfrac{\rm TN}{\rm Real~Negative} $$

A better classifier yields a larger AUC (area under the curve) score.

In [3]:
import numpy as np
from IPython.display import Image
import matplotlib.pyplot as plt

Image("images/ROC_curves.jpg")
Out[3]:
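A minimal sketch of computing the ROC curve and AUC with scikit-learn, on hypothetical labels and scores (roc_curve and roc_auc_score live in sklearn.metrics):

from sklearn.metrics import roc_curve, roc_auc_score

y_true   = np.array([0, 0, 1, 1, 0, 1, 1, 0])                    # hypothetical labels
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3])   # hypothetical predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)    # FPR and TPR for every decision threshold
plt.plot(fpr, tpr, label='ROC')
plt.plot([0, 1], [0, 1], 'k--', label='random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (recall)')
plt.legend()
print(roc_auc_score(y_true, y_scores))                # area under the ROC curve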

Multiclass classification

  • One versus the rest (OvR): reduces the problem to a set of binary classifications
  • One versus one (OvO): a binary classifier for each pair of classes

Regression

Linear regression

  • Linear regression computes a weighted sum of the input features (plus a bias term) and outputs this result.

  • The prediction $\hat y^{(m)}$ for the $m$-th sample is parameterized as a linear function of the feature values $x^{(m)}_i$ (N features) of the $m$-th sample :

$$ \widehat y^{(m)} = h_\theta(\mathbf{X^{(m)}}) = \sum^N_{i=1} \theta_i x^{(m)}_i + \theta_0 = \mathbf{\theta}^T\cdot \mathbf{X}^{(m)} $$

where

  1. $h_\theta(\mathbf{X})$ is a hypothesis (linear) function.
  2. $\theta_i$ and $\theta_0$ are model parameters; $\theta_0$ is the bias term. They are shared by all samples.
  3. vectorization of the relation: $$ \mathbf{\widehat y} =\left[\widehat y^{(1)}, \widehat y^{(2)}, \cdots, \widehat y^{(M)}\right] =\left[\theta_1, \theta_2, \cdots, \theta_N \right]\cdot \begin{bmatrix} x^{(1)}_1, & x^{(2)}_1, & \cdots, & x^{(M)}_1 \\ x^{(1)}_2, & x^{(2)}_2, & \cdots, & x^{(M)}_2 \\ \cdots, & \cdots, & \cdots, & \cdots \\ x^{(1)}_N, & x^{(2)}_N, & \cdots, &x^{(M)}_N \\ \end{bmatrix} =\mathbf{\theta}^T\cdot \mathbf{X} $$
  • Mean-Square-Error (MSE): $$ MSE (\mathbf{\theta}) = \dfrac{1}{M} \sum^M_{m=1} (y^{(m)} - \widehat y^{(m)})^2. $$

  • The model parameter $\mathbf{\theta}$ is determined by minimizing the MSE and is given by the solution of the following Normal Equation $$ \mathbf{\theta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbf{y} ={\cal X}^+ \mathbf{y}, $$ where

  • ${\cal X}^+ \equiv (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the so-called pseudo-inverse of $\mathbf{X}$, which can be computed with np.linalg.pinv(X).

Note: using the SVD-based pseudo-inverse avoids the singularity problem when $\mathbf{X}^T\mathbf{X}$ is not invertible.

In [4]:
m=100
X = 6*np.random.rand(m,1)
y = 3*X + 4 + np.random.rand(m,1) # plus noise
In [5]:
X_plus_b = np.c_[np.ones((100,1)), X]  # add x0=1 to each instance
In [6]:
theta_inv = np.linalg.inv(X_plus_b.T.dot(X_plus_b)).dot(X_plus_b.T).dot(y)
theta_inv
Out[6]:
array([[4.50285414],
       [3.001809  ]])
In [7]:
theta_svd = np.linalg.pinv(X_plus_b).dot(y)
In [8]:
theta_svd
Out[8]:
array([[4.50285414],
       [3.001809  ]])
  • Regularization (Ridge/L2):

    1. Adding a term onto the cost function $J(\mathbf{\theta})$: $$ J(\mathbf{\theta}) = MSE (\mathbf{\theta}) + \dfrac{\alpha}{2} \mathbf{\theta}^T\mathbf{\theta}. $$

    2. The model paramter $\mathbf{\theta}$ is given by $$ \mathbf{\theta} = (\mathbf{X}^T\mathbf{X}+\alpha \mathbf{I})^{-1}\mathbf{X}^T \mathbf{y}, $$ where $\mathbf{I}$ is the Identity matrix, $\alpha$ contronls the size of the regularization term.

Ridge (L2) regression is sensitive to the scale of the features, so the data should be scaled before fitting.
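A minimal Ridge sketch on the X, y generated above (the value of alpha is arbitrary here):

from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=1.0)          # alpha controls the regularization strength
ridge_reg.fit(X, y)
ridge_reg.intercept_, ridge_reg.coef_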

  • Regularization (Lasso/L1):

$$ J(\mathbf{\theta}) = MSE (\mathbf{\theta}) + \alpha \sum^N_{i=1} \vert\theta_i\vert $$

Features of L1 Lasso regression:

1. tends to eliminate the least important features.
2. automatically performs feature selection and outputs a sparse model.
3. the slope changes abruptly at $\theta_i=0$, since the L1 penalty is not differentiable there.
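A minimal Lasso sketch on the same X, y (alpha is again an arbitrary illustration; with a single feature there is little to zero out, but the API is the same):

from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)          # larger alpha drives more coefficients exactly to zero
lasso_reg.fit(X, y)
lasso_reg.intercept_, lasso_reg.coef_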



  • Elastic Net (between L1 and L2):

$$ J(\mathbf{\theta}) = MSE (\mathbf{\theta}) + r \alpha\sum^N_{i=1} \vert\theta_i\vert + \dfrac{1-r}{2}\alpha\, \mathbf{\theta}^T\mathbf{\theta} $$

Note: First try **Ridge** as a default. If you suspect only a few features are important, then try **Lasso** or **Elastic Net**.

In [10]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
In [11]:
m=20
X = 6*np.random.rand(m,1)
y = 3*X + 4 + np.random.rand(m,1) # plus noise
In [12]:
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X,y)
y_pred = elastic_net.predict(X)
In [13]:
plt.scatter(X,y.squeeze())
plt.plot(X,y_pred,'r')
Out[13]:
[<matplotlib.lines.Line2D at 0x12246c2b0>]
In [14]:
from scipy import stats
In [15]:
slope, intercept, r_value, p_value, std_err = stats.linregress(X.reshape(1,-1), y.reshape(1,-1))  # passing (1, m)-shaped arrays confuses the sample count and triggers the sqrt warnings below; scipy expects 1D input (see the Note)
plt.plot(X, y, 'o', label='original data')
plt.plot(X, intercept + slope*X, 'r', label='fitted line')
/anaconda3/lib/python3.6/site-packages/scipy/stats/_stats_mstats_common.py:125: RuntimeWarning: invalid value encountered in sqrt
  t = r * np.sqrt(df / ((1.0 - r + TINY)*(1.0 + r + TINY)))
/anaconda3/lib/python3.6/site-packages/scipy/stats/_stats_mstats_common.py:127: RuntimeWarning: invalid value encountered in sqrt
  sterrest = np.sqrt((1 - r**2) * ssym / ssxm / df)
Out[15]:
[<matplotlib.lines.Line2D at 0x1225309e8>]
In [16]:
print("The r-value is {}".format(r_value))
The r-value is 0.9983031755104109

The statistical model also provides a quantitative measure of confidence (the p-value), which is not the case for most ML models.

Note: Pay attention to the shapes of $X, y$:

  • sklearn: the feature array must be at least 2D (n_samples, n_features)
  • scipy: the inputs should be 1D

Polynomial regression

The polynomial function of $X$ can be rewritten as a linear function of new basis functions $\phi(X)$:

$$ \begin{align} \widehat y &= \theta_0 + \theta_1 X + \theta_2 X^2 + \theta_3 X^3 + \cdots \nonumber\\ &= \theta_0 + \theta_1 \phi_1(X) + \theta_2 \phi_2(X) + \theta_3 \phi_3(X) + \cdots \nonumber\\ &= \mathbf{\theta}^T \cdot\mathbf{\phi(X)}, \end{align} $$ where $\phi_i(X)=X^i$ for $i=1,2,3,\cdots$.

The basis functions play the role of transforming the original features into an expanded feature space:

$$ X \to \phi(X) $$

In [17]:
from sklearn.preprocessing import PolynomialFeatures
In [18]:
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
In [19]:
elastic_net =ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X_poly, y)
elastic_net.intercept_, elastic_net.coef_
Out[19]:
(array([5.0389554]), array([2.35301464, 0.11935787]))

The model estimate (matching the output above): $$ \widehat y \approx 0.12 x^2 + 2.35 x + 5.04 $$

In [20]:
plt.scatter(X,elastic_net.predict(X_poly),c='r',label='prediction')
plt.scatter(X,y.squeeze(),c='b',label='data')
plt.legend()
Out[20]:
<matplotlib.legend.Legend at 0x1225dc3c8>

Note: Consider the case with $n$ features ($X_1, X_2, \cdots, X_n$) and ${\rm degree}=d$. The number of new features,

$$ {\rm No.~of~} X^a_1X^b_2 \cdots X^c_n {\rm~terms}\sim \dfrac{(n+d)!}{d!\,n!}, $$

grows combinatorially with $n$ and $d$, which makes the computation heavy.
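A quick check of this growth (X_demo is a hypothetical array with n=3 features; with the default include_bias=True the generated feature count equals (n+d)!/(d! n!)):

from math import factorial
from sklearn.preprocessing import PolynomialFeatures

n = 3
X_demo = np.random.rand(5, n)          # hypothetical data with n=3 features
for d in (2, 3, 4):
    n_new = PolynomialFeatures(degree=d).fit_transform(X_demo).shape[1]
    print(d, n_new, factorial(n + d) // (factorial(d) * factorial(n)))   # the two counts agree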


Logistic regression

  • predicts the probability $\widehat p$ that an instance belongs to a particular class:

    1. the instance belongs to the class if $\widehat p>0.5$
    2. it does not belong to the class if $\widehat p<0.5$
  • Logistic regression computes the sum of the weighted input features (plus a bias) and outputs the logistic (sigmoid) of this result.

  • Cost function for a single training instance/sample

$$ C(\theta) = \begin{cases} -\log(\widehat p), {\rm if}~ y=1 ({\rm positive ~instance}),\\ -\log(1- \widehat p), {\rm if}~ y=0 ({\rm negative~ instance}),\\ \end{cases} $$ which can be rewritten as one expression $$ C^{(m)}(\theta) = -y^{(m)}\log(\widehat p^{(m)}) -(1-y^{(m)})\log(1- \widehat p^{(m)}). $$ Therefore, the cost function of the whole training set is

$$ J(\theta) = -\dfrac{1}{M} \sum^M_m \left[ y^{(m)}\log \widehat p^{(m)} +(1- y^{(m)})\log (1 - \widehat p^{(m)} )\right], $$

where

  1. $y^{(m)}$ is the target value of the $m$-th instance/sample,
  2. while $\widehat p^{(m)}$ is the predicted probability that the instance belonging to the class, given by $$ \widehat p^{(m)} = \sigma(\mathbf{\theta}^T \cdot \mathbf{X}^{(m)}) = \dfrac{1}{1+ e^{-\mathbf{\theta}^T \cdot \mathbf{X}^{(m)}}} =\dfrac{e^{\mathbf{\theta}^T \cdot \mathbf{X}^{(m)}}}{ e^{\mathbf{\theta}^T \cdot \mathbf{X}^{(m)}}+ e^0} $$

Note: the goal of training is to find the $\theta$ so that the model estimates high probability for positive instances and low probability for negative instances.
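A minimal sketch of this cost function evaluated by hand on hypothetical labels and predicted probabilities (sklearn's log_loss computes the same average):

from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])             # hypothetical targets
p_hat  = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # hypothetical predicted probabilities

J_manual = -np.mean(y_true*np.log(p_hat) + (1 - y_true)*np.log(1 - p_hat))
print(J_manual, log_loss(y_true, p_hat))       # the two values agree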

In [22]:
plt.subplot(1,2,1)
x1=np.linspace(0.0001,0.9999)
plt.plot(x1,-np.log(x1),c='r')
plt.text(0.25,8,r'$y^{(m)}=1$')
plt.arrow(0.25,2,0.2,-0.5, head_length = 0.17, head_width = 0.05, length_includes_head = True)
plt.xlabel(r'$\widehat p^{(m)}$',fontsize=20)
plt.ylabel(r'$C^{(m)}(\theta)$',fontsize=20)

plt.subplot(1,2,2)
x0=np.linspace(0.0001,0.9999)
plt.plot(x0,-np.log(1-x0),c='r')
plt.text(0.25,8,r'$y^{(m)}=0$')
plt.arrow(0.75,2,-0.2,-0.5, head_length = 0.17, head_width = 0.05, length_includes_head = True)
plt.xlabel(r'$\widehat p^{(m)}$',fontsize=20) 
Out[22]:
Text(0.5,0,'$\\widehat p^{(m)}$')

Note: The cost function is convex! The model parameter $\theta$ is determined by minimizing the cost function $J(\theta)$, which drives each $\widehat p^{(m)}$ towards the value of $y^{(m)}$.

In [23]:
def sigmoid(t):
    return 1/(1+np.exp(-t))
In [24]:
x = np.linspace(-50,50,200) 
x = np.array(sorted(x))
In [25]:
plt.plot(x,sigmoid(x),c='r')
plt.xlim(-15,15)
plt.axhline(0.5,linestyle=':',c='k')
plt.axvline(0,linestyle=':',c='k')
plt.ylabel('$\hat p^{(m)}$', fontsize=20)
plt.xlabel(r'$\theta^T\cdot \mathbf{X}^{(m)}$', fontsize=20)
Out[25]:
Text(0.5,0,'$\\theta^T\\cdot \\mathbf{X}^{(m)}$')
In [26]:
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
In [27]:
iris = datasets.load_iris()
list(iris)
Out[27]:
['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename']
In [28]:
X= iris["data"][:,3:] # petal width
y=(iris["target"]==2).astype(np.int)
In [29]:
log_reg = LogisticRegression()
log_reg.fit(X,y)
/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Out[29]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
In [30]:
X_new = np.linspace(0,3,1000).reshape(-1,1)
y_proba = log_reg.predict_proba(X_new)
plt.plot(X_new,y_proba[:,1],label='Iris-Virginica')
plt.plot(X_new,y_proba[:,0],label='Not Iris-Virginica')
plt.legend(loc=1,fontsize=16)
plt.ylabel('Probability', fontsize=16)
plt.xlabel('Petal width (cm)', fontsize=16)
Out[30]:
Text(0.5,0,'Petal width (cm)')

Softmax regression

  • also called Multinomial Logistic Regression
  • The softmax regression computes a score $s_k(\mathbf{X})$ for each class $k$, given an instance $\mathbf{X}$,

$$ s_k(\mathbf{X}) = \mathbf{\theta_{(k)}}^T\cdot \mathbf{X} $$

and then estimates the probability $\widehat p_k$ of each class by applying the softmax function to the scores

$$ \widehat p_k =\sigma(s(\mathbf{X}))_k = \dfrac{ e^{s_k(\mathbf{X}) }}{\sum^K_{j=1} e^{s_j(\mathbf{X})}}. $$

The denominator is a normalization factor, $$ \sum^K_{k=1}\widehat p_k(\mathbf{X})=1. $$

  1. $K$: the number of classes
  2. $s(\mathbf{X})$: a vector containing the scores of each class for the instance $\mathbf{X}$.
  3. $\widehat p_k(\mathbf{X})$ is the predicted probability of the instance $\mathbf{X}$ belonging to the class $k$.
  • The softmax regression classifier prediction is the value of $k$ that maximizes $\sigma(s(\mathbf{X}))_k$, or equivalently the score $s_k(\mathbf{X})$:

$$ y = \underset{k}{argmax}~\sigma(s(\mathbf{X}))_k = \underset{k}{argmax} \left(\theta_{(k)}^T\cdot \mathbf{X}\right) $$

  • Cost function is the cross entropy (for the case with $M$ instances/samples and $K$ classes):

$$ J(\Theta)= -\dfrac{1}{M} \sum^M_{m=1} \sum^K_{k=1}\left[ y^{(m)}_k\log \widehat p^{(m)}_k\right], $$

where $y^{(m)}_k$ is either 1 or 0, depending on whether the $m$-th instance belongs to class $k$ or not.

  • Gradient of the cost function:

$$ \nabla_{\theta_{(k)}} J(\Theta) = \dfrac{1}{M} \sum^M_{m=1} \left(\widehat p^{(m)}_k - y^{(m)}_k\right) \mathbf{X}^{(m)}. $$

Note: In the deep-learning course by A. Ng, the symbol $a^{(m)}$ is introduced in place of $\widehat p^{(m)}$.
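A minimal numpy sketch of the softmax probabilities for a hypothetical score vector $s$ with $K=3$ classes:

def softmax(s):
    exp_s = np.exp(s - np.max(s))   # subtracting the max improves numerical stability without changing the result
    return exp_s / exp_s.sum()

s = np.array([2.0, 1.0, 0.1])       # hypothetical scores s_k
p_hat = softmax(s)
print(p_hat, p_hat.sum())           # the probabilities sum to 1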


Binary classification is a special case of softmax regression with $K=2$: $$y^{(m)}_{k=2}=1-y^{(m)}_{k=1}, \quad \widehat p^{(m)}_2 = 1 - \widehat p^{(m)}_1,$$

and the cost function is simplified as (set $y^{(m)}_{k=1}=y^{(m)}, \widehat p^{(m)}_1=\widehat p^{(m)}$)

$$ \begin{align} J(\Theta) &= -\dfrac{1}{M} \sum^M_m \sum^2_{k=1}\left[ y^{(m)}_k\log \widehat p^{(m)}_k\right] \nonumber\\ &= -\dfrac{1}{M} \sum^M_m \left[ y^{(m)}_1\log \widehat p^{(m)}_1 + y^{(m)}_2\log \widehat p^{(m)}_2\right]\nonumber\\ &= -\dfrac{1}{M} \sum^M_m \left[ y^{(m)} \log \widehat p^{(m)} + (1- y^{(m)})\log (1- \widehat p^{(m)})\right]. \end{align} $$

In [32]:
X = iris["data"][:,(2,3)] # petal length and width
y = iris["target"]
y
Out[32]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
In [33]:
soft_reg = LogisticRegression(multi_class="multinomial",solver="lbfgs",C=10)
soft_reg.fit(X,y)
Out[33]:
LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
In [34]:
X_test = [5,2]
soft_reg.predict([X_test]), soft_reg.predict_proba([X_test])
Out[34]:
(array([2]), array([[6.38014896e-07, 5.74929995e-02, 9.42506362e-01]]))
In [35]:
classes = ['class 0', 'class 1', 'class 2'] 
In [36]:
prob = np.array(soft_reg.predict_proba([X_test])).squeeze() 
In [37]:
plt.bar(classes,prob)
plt.ylabel("Probability")
Out[37]:
Text(0,0.5,'Probability')

Support Vector Machines

  • The instances located on the edge of the street are called support vectors.

Training objective/cost function

$$ J(\Theta) = \dfrac{1}{2} \theta^T\theta $$ with the linear constraint $$ t^{(m)} (\theta^T \mathbf{X}^{(m)}+b) \ge 1 $$

where $t^{(m)}$ is 1 for the positive instance (corresponding to $y^{(m)}=1$) and -1 for the negative instance (corresponding to $y^{(m)}=0$).

One can use the following cost function to take the constraint into account as a penalty term

$$ J(\Theta, b) = \dfrac{1}{2} \theta^T\theta + C\sum^M_{m=1} max\left(0, 1- t^{(m)} (\theta^T \mathbf{X}^{(m)}+b)\right). $$

  1. The first term pushes the model towards the smallest possible $\Vert\theta\Vert$, namely, the largest margin/street.
  2. The second term (called the Hinge Loss) keeps the margin violations (each proportional to the distance to the correct side of the street) as small and as few as possible; a short numerical sketch follows below.
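A minimal numpy sketch of the hinge-loss term for hypothetical targets $t^{(m)}\in\{-1,+1\}$ and decision values $\theta^T\mathbf{X}^{(m)}+b$:

t = np.array([1, -1, 1, -1])                   # hypothetical targets
decision = np.array([1.5, -0.3, 0.2, -2.0])    # hypothetical values of theta^T X + b
hinge = np.maximum(0, 1 - t*decision)          # zero when the instance is off the street, on the correct side
print(hinge, hinge.sum())                      # the second term of J is C times this sum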

Decision function

  • Linear SVM classification

$$ \widehat y = \begin{cases} 0, & \quad \theta^T \mathbf{X}+b<0\\ 1, & \quad \theta^T \mathbf{X}+b >0 \end{cases} $$

SVM Classification

Soft marginal classification

  • hard margin classification: requires all instances to be off the street and on the correct side. It is sensitive to outliers.

  • soft margin classification: finds a good balance between keeping the street as wide as possible and limiting the margin violations (the number of instances that end up in the middle of the street or on the wrong side).

Nonlinear SVM classifications

  • add polynomial features implicitly by using a polynomial kernel.

For a degree-2 polynomial mapping: $$ \phi(\mathbf{X}) =\phi\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} =\begin{pmatrix} x^2_1 \\ \sqrt{2}x_1x_2\\ x^2_2 \end{pmatrix} $$

and the dot product

$$ \phi(\mathbf{X})^T\phi(\mathbf{Y}) =\begin{pmatrix} x^2_1 \\ \sqrt{2}x_1x_2\\ x^2_2 \end{pmatrix}^T \begin{pmatrix} y^2_1 \\ \sqrt{2}y_1y_2\\ y^2_2 \end{pmatrix} =(\mathbf{X}^T\mathbf{Y})^2 \equiv K(\mathbf{X},\mathbf{Y}), $$ where $K(\mathbf{X},\mathbf{Y})$ is called a $2$nd-degree polynomial kernel.
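A small numerical check of this kernel identity (x and z are hypothetical 2-D feature vectors):

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

phi = lambda v: np.array([v[0]**2, np.sqrt(2)*v[0]*v[1], v[1]**2])   # explicit degree-2 mapping
print(phi(x).dot(phi(z)), x.dot(z)**2)   # both sides of the identity give the same number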

  • add similarity features by using a Gaussian RBF kernel

For a given landmark $\mathbf{\ell}$, one can map the feature $\mathbf{X}$ to a new feature using a Gaussian function which measures the similarity between the feature $\mathbf{X}$ and the landmark:

$$ K(\mathbf{X}, \mathbf{\ell}) \equiv \phi_\gamma(\mathbf{X}, \mathbf{\ell}) = \exp(-\gamma \vert\vert \mathbf{X} - \mathbf{\ell}\vert\vert^2) $$

  • The landmark $\mathbf{\ell}$ is usually taken at the location of each and every instance in the dataset.

Therefore, if there are $m$ instances with $n$ features in the dataset, taking every instance as a landmark introduces $m$ Gaussian similarity features, so the transformed training set grows from $m\times n$ to $m\times m$:

$$ O(m\times n) \to O(m\times m), $$

which becomes expensive for large training sets.
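A minimal sketch of the RBF similarity features for one instance and two hypothetical landmarks (the value of gamma is arbitrary):

def rbf_similarity(x, landmark, gamma=0.3):
    return np.exp(-gamma * np.sum((x - landmark)**2))

x = np.array([1.0, 2.0])
landmarks = np.array([[1.0, 2.0], [4.0, 0.0]])       # e.g. two training instances used as landmarks
print([rbf_similarity(x, l) for l in landmarks])     # 1.0 at the landmark itself, smaller further away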

In [38]:
X = iris["data"][:, (2,3)]
y = (iris["target"]==2).astype(np.float64)
In [39]:
def plot_decision_boundary(model, X, y):
    
    # Set min and max values and give it some padding
    x_min, x_max = X[:,0].min() - 1, X[:,0].max() + 1   # x1
    y_min, y_max = X[:,1].min() - 1, X[:,1].max() + 1   # x2
    h = 0.01  
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    
    # Predict the function value for the whole grid
    Z = model(np.c_[xx.ravel(), yy.ravel()])  # numpy.ravel() == numpy.reshape(-1); np.c_: Stack 1-D arrays as columns into a 2-D array
    Z = Z.reshape(xx.shape)

    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, cmap='binary')
    plt.ylabel('x2')
    plt.xlabel('x1')
    plt.scatter(X[:,0], X[:,1], c=y.ravel(), cmap=plt.cm.Spectral)
    
In [40]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC,SVC
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

polynomial_svm_clf = Pipeline(
[("poly_features", PolynomialFeatures(degree=3)),
("scaler",StandardScaler()),
("svm_clf",LinearSVC(C=10,loss="hinge"))
])

polynomial_svm_clf_poly3 = Pipeline(
[("poly_features", PolynomialFeatures(degree=3)),
("scaler",StandardScaler()),
("svm_clf",SVC(kernel="poly", degree=3,coef0=1,C=5))
])

polynomial_svm_clf_rbf = Pipeline(
[("poly_features", PolynomialFeatures(degree=3)),
("scaler",StandardScaler()),
("svm_clf",SVC(kernel="rbf", degree=3,coef0=1,C=5))
])

polynomial_svm_clf.fit(X,y)
Out[40]:
Pipeline(memory=None,
     steps=[('poly_features', PolynomialFeatures(degree=3, include_bias=True, interaction_only=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', LinearSVC(C=10, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))])
In [41]:
X_test = np.array([5,2]).reshape(1,-1)
polynomial_svm_clf.predict(X_test)
plot_decision_boundary(lambda x: polynomial_svm_clf.predict(x), X, y)
In [42]:
polynomial_svm_clf_poly3.fit(X,y)
plot_decision_boundary(lambda x: polynomial_svm_clf_poly3.predict(x), X, y)
/anaconda3/lib/python3.6/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
In [43]:
polynomial_svm_clf_rbf.fit(X,y)
plot_decision_boundary(lambda x: polynomial_svm_clf_rbf.predict(x), X, y)
/anaconda3/lib/python3.6/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)

SVM Regression

  • for both linear and non-linear regressions

  • the street width is controlled by the parameter $\epsilon$

In [44]:
X = np.random.rand(100,1)*5
y = X**3 + 10*X**2 + X + 4
In [45]:
X_plus_b = np.c_[np.ones((100,1)), X]  # add x0=1 to each instance
In [46]:
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
In [47]:
svm_poly_reg = SVR(kernel="poly", degree=3, C=100, epsilon=0.1)
svm_poly_reg.fit(X_plus_b,y)
/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py:761: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/anaconda3/lib/python3.6/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Out[47]:
SVR(C=100, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
  gamma='auto_deprecated', kernel='poly', max_iter=-1, shrinking=True,
  tol=0.001, verbose=False)
In [48]:
y_pred = svm_poly_reg.predict(X_plus_b)
plt.plot(y,y_pred)
Out[48]:
[<matplotlib.lines.Line2D at 0x1227b80b8>]
In [49]:
mean_squared_error(y,y_pred)
Out[49]:
0.004644341964169357

Decision Trees

  • Classification
  • Regression
In [50]:
X=iris["data"][:,2:]
y=iris["target"]
In [51]:
from sklearn.tree import DecisionTreeClassifier
In [52]:
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X,y)
Out[52]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [53]:
from sklearn.tree import export_graphviz
export_graphviz(tree_clf,
               out_file=("iris_tree.dot"),
               feature_names=iris.feature_names[2:],
               class_names=iris.target_names,
               rounded=True,
               filled=True)
In [54]:
! dot -Tpng iris_tree.dot -o iris_tree.png
In [55]:
Image("iris_tree.png")
Out[55]:
  • Gini Impurity

$$ G_{i} = 1 - \sum^K_{k=1} p^2_{i,k} $$ where $p_{i,k}$ is the ratio of the class $k$ instances among the training instances in the $i$-th node:

For example: for the right-child node in the second layer (depth-1), samples = 100, value =[0,50,50]. The Gini score is given by

$$ G_i = 1 - (0/100)^2 - (50/100)^2 - (50/100)^2 = 0.5 $$
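The same value can be reproduced directly from the class counts quoted above:

def gini(counts):
    p = np.array(counts) / np.sum(counts)   # class ratios p_{i,k}
    return 1 - np.sum(p**2)

print(gini([0, 50, 50]))   # 0.5, as computed above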

  • Training algorithm and cost function for classification:

The Classification and Regression Tree (CART) algorithm trains a Decision Tree by splitting the training set into two subsets using a single feature $k$ and a threshold $t_k$. The parameters $k$ and $t_k$ are chosen such that they produce the purest subsets, weighted by their size, $$ J(k, t_k) = \dfrac{M_{\rm left}}{M} G_{\rm left} + \dfrac{M_{\rm right}}{M} G_{\rm right} $$ where $M_{\rm left/right}$ is the number of instances in the left/right node/subset.

Alternatively, one can also use the entropy impurity measure:

$$ H_i = -\sum^K_{k=1, p_{i,k}\neq 0} p_{i, k} \log_2(p_{i,k}) $$

Decision Tree versus Linear models:

  1. A Decision Tree does not assume that the data is linear.
  2. It belongs to the class of non-parametric models.
  3. It is prone to overfitting the data.
  4. Use max_depth, max_leaf_nodes and min_samples_leaf to regularize the model.
  5. It is simple to understand and interpret.
  6. It is sensitive to small variations in the training data.

    Linear models belong to the class of parametric models. The number of model parameters is predetermined, which reduces the risk of overfitting to some extent.

  • Cost function for regression:

$$ J(k, t_k) = \dfrac{M_{\rm left}}{M} {\rm MSE}_{\rm left} + \dfrac{M_{\rm right}}{M} {\rm MSE}_{\rm right} $$ where

$$ {\rm MSE}_{\rm node} = \sum_{m\in node} (\widehat y_{node} - y^{(m)})^2, \quad \widehat y_{node}=\dfrac{1}{M_{node}} \sum_{m \in node} y^{(m)} $$
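A minimal regression-tree sketch on synthetic 1-D data (the data and max_depth are arbitrary illustrations):

from sklearn.tree import DecisionTreeRegressor

X_r = np.random.rand(200, 1) * 5                        # hypothetical 1-D feature
y_r = np.sin(X_r).ravel() + 0.1*np.random.randn(200)    # hypothetical noisy target

tree_reg = DecisionTreeRegressor(max_depth=3)           # max_depth regularizes the tree
tree_reg.fit(X_r, y_r)
tree_reg.predict([[2.5]])                               # the prediction is the mean target of the matching leaf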

Ensemble and Random Forest

Voting

  • Train a few classifiers (based on different algorithms), aggregate the predictions of each classifier, and predict the class that gets the most votes.

  • The ensemble can do much better than each individual classifier, provided that the classifiers are independent.

In [56]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
In [57]:
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
estimators=[
    ('lr',log_clf),
    ('rf',rnd_clf),
    ('svc',svm_clf)],
voting = 'hard')

voting_clf.fit(X,y)
/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:460: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
/anaconda3/lib/python3.6/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
/anaconda3/lib/python3.6/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Out[57]:
VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)), ('rf', RandomFo...f', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))],
         flatten_transform=None, n_jobs=None, voting='hard', weights=None)
In [58]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X,y)
    y_pred = clf.predict(X)
    print(clf.__class__.__name__,accuracy_score(y,y_pred))
LogisticRegression 0.8733333333333333
RandomForestClassifier 0.9933333333333333
SVC 0.9666666666666667
VotingClassifier 0.9733333333333334
/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:460: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
/anaconda3/lib/python3.6/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
/anaconda3/lib/python3.6/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:460: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
/anaconda3/lib/python3.6/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
In [59]:
plot_decision_boundary(lambda x:  clf.predict(x), X, y)

Bagging and Pasting

  • Use the same training algorithm for every predictor, but train them on different random subsets of the training set.

  • Bagging: sampling is performed with replacement (bootstrap=True)

  • Pasting: sampling is performed without replacement

  • Aggregate the predictions of all predictors, and take either the most frequent prediction for classification or the average for regression.

  • Aggregation reduces both bias and variance. Generally, the ensemble has a similar bias, but a lower variance, than a single predictor.

  • Bagging ends up with a slightly higher bias than Pasting.

In [60]:
Image("images/bagging.png", width=800)
Out[60]:
In [61]:
X=iris["data"][:,2:]
y=iris["target"]
In [62]:
X.shape,y.shape
Out[62]:
((150, 2), (150,))
In [63]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier()
tree_clf.fit(X,y)

bag_clf = BaggingClassifier(
        DecisionTreeClassifier(), 
        n_estimators=500, 
        max_samples=100, 
        bootstrap=True, 
        n_jobs=-1,
        oob_score=True)

bag_clf.fit(X,y)
y_pred = bag_clf.predict(X)
In [64]:
plot_decision_boundary(lambda x:  bag_clf.predict(x), X, y)
bag_clf.oob_score_
Out[64]:
0.9666666666666667
In [65]:
plot_decision_boundary(lambda x:  tree_clf.predict(x), X, y)

Random Forest

  • An ensemble of Decision Trees
  • Trained via the Bagging method

  • provides a measure of feature importance

In [66]:
from sklearn.ensemble import RandomForestClassifier
In [67]:
rnd_clf = RandomForestClassifier(n_estimators=500,max_leaf_nodes=16,n_jobs=-1)
rnd_clf.fit(X,y)
Out[67]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=16,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [68]:
for name,score in zip(iris['feature_names'][2:],rnd_clf.feature_importances_):  # X holds only the petal features (columns 2:)
    print(name,score)
petal length (cm) 0.48803363807696976
petal width (cm) 0.5119663619230304
In [69]:
plot_decision_boundary(lambda x:  rnd_clf.predict(x), X, y)

Comparison with Bagging based on DecisionTree

In [70]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(
            splitter='random',
            max_leaf_nodes=16),
    n_estimators=500,
    max_samples=1.0,
    bootstrap=True,
    n_jobs=-1)
In [71]:
bag_clf.fit(X,y)
Out[71]:
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=16,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='random'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=500, n_jobs=-1, oob_score=False,
         random_state=None, verbose=0, warm_start=False)
In [72]:
plot_decision_boundary(lambda x:  bag_clf.predict(x), X, y)

Boosting

  • Train predictors sequentially, each trying to correct its predecessor.
  • An ensemble method that can combine several weak learners into a strong learner.

Common training algorithms: AdaBoost (sequentially re-weighting the training instances) and Gradient Boosting (a form of gradient descent in function space, fitting each new predictor to the residual errors of its predecessor).
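A minimal AdaBoost sketch on the iris features used above (the hyperparameters are arbitrary; AdaBoost is only one of several boosting implementations):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),   # weak learner: a decision stump
    n_estimators=200,
    learning_rate=0.5)
ada_clf.fit(X, y)
ada_clf.score(X, y)                        # training accuracy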

Unsupervised Machine Learning

Reference

  • Hands-On Machine Learning with Scikit-Learn and TensorFlow, by Aurélien Géron