from IPython.display import HTML
"Statistics draws population inferences from a sample, and machine learning finds generalizable predictive patterns."
"Statistical methods have a long-standing focus on inference, which is achieved through the creation and fitting of a project-specific probability model. The model allows us to compute a quantitative measure of confidence that a discovered relationship describes a 'true' effect that is unlikely to result from noise. Furthermore, if enough data are available, we can explicitly verify assumptions (e.g., equal variance) and refine the specified model, if needed."
"By contrast, ML concentrates on prediction by using general-purpose learning algorithms to find patterns in often rich and unwieldy data. ML methods are particularly helpful when one is dealing with 'wide data', where the number of input variables exceeds the number of subjects, in contrast to 'long data', where the number of subjects is greater than that of input variables. ML makes minimal assumptions about the data-generating systems; they can be effective even when the data are gathered without a carefully controlled experimental design and in the presence of complicated nonlinear interactions."
Danilo Bzdok, Naomi Altman & Martin Krzywinski, Nature Methods volume 15, pages 233–234 (2018).
from sklearn.metrics import confusion_matrix, precision_score, accuracy_score, f1_score
model performance measurements: confusion matrix
$$ accuracy = \dfrac{\rm TP + TN}{\rm Total (TP+TN+FP+FN)} $$
$$ precision = \dfrac{\rm TP}{\rm Predicted~Positive (TP+FP)} $$
recall/sensitivity/True Positive Rate(TPR) $$ recall = \dfrac{\rm TP}{\rm Real~Positive (TP+FN)} $$
F1 score
$$ F_1 = \dfrac{2}{\dfrac{1}{precision} + \dfrac{1}{recall}} = 2\times \dfrac{precision*recall}{precision + recall} $$
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
Note: For an imbalanced dataset, accuracy can be misleading; recall (together with precision) is the more informative metric.
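A minimal sketch (with made-up labels y_true, y_pred) showing how the sklearn metrics map onto the formulas above:
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)
# hypothetical labels for illustration only
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])
print(confusion_matrix(y_true, y_pred))                # layout: [[TN, FP], [FN, TP]]
print("accuracy :", accuracy_score(y_true, y_pred))    # (TP+TN)/total
print("precision:", precision_score(y_true, y_pred))   # TP/(TP+FP)
print("recall   :", recall_score(y_true, y_pred))      # TP/(TP+FN)
print("F1       :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall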
ROC curve: True Positive Rate (recall/sensitivity) vs. False Positive Rate (1 - Specificity),
where
$$ FPR = \dfrac{\rm FP}{\rm Real~Negative (FP+TN)} = 1 - Specificity. $$
A better algorithm gives a larger AUC (area under the ROC curve) score.
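A minimal sketch of an ROC curve and its AUC on hypothetical labels and scores (y_true, y_score are made up for illustration):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
# hypothetical true labels and predicted scores
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])
fpr, tpr, thresholds = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label="AUC = {:.2f}".format(roc_auc_score(y_true, y_score)))
plt.plot([0, 1], [0, 1], 'k:')   # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (recall)")
plt.legend()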
import numpy as np
from IPython.display import Image
import matplotlib.pyplot as plt
Image("images/ROC_curves.jpg")
Linear regression computes a weighted sum of the input features (plus a bias term) and outputs this result.
The prediction $\hat y^{(m)}$ for the $m$-th sample is parameterized as a linear function of the feature values $x^{(m)}_i$ ($N$ features) of the $m$-th sample:
$$ \widehat y^{(m)} = h_\theta(\mathbf{X^{(m)}}) = \sum^N_{i=1} \theta_i x^{(m)}_i + \theta_0 = \mathbf{\theta}^T\cdot \mathbf{X}^{(m)} $$
where $\theta_0$ is the bias term, $\theta_1,\ldots,\theta_N$ are the feature weights, and $\mathbf{X}^{(m)}$ is the feature vector of the $m$-th sample (with $x^{(m)}_0=1$).
Mean-Square-Error (MSE): $$ MSE (\mathbf{\theta}) = \dfrac{1}{M} \sum^M_{m=1} (y^{(m)} - \widehat y^{(m)})^2. $$
The model parameter $\mathbf{\theta}$ is determined by minimization of the MSE and is given by the solution of the following Normal Equation $$ \mathbf{\theta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbf{y} ={\cal X}^+ \mathbf{y}, $$ where
${\cal X}^+$ is the pseudo-inverse of $\mathbf{X}$, computed in NumPy with np.linalg.pinv(X).
Note: using the SVD-based pseudo-inverse avoids the singularity problem when $\mathbf{X}^T\mathbf{X}$ is not invertible.
m=100
X = 6*np.random.rand(m,1)
y = 3*X + 4 + np.random.rand(m,1) # plus noise
X_plus_b = np.c_[np.ones((m,1)), X] # add x0=1 to each instance
theta_inv = np.linalg.inv(X_plus_b.T.dot(X_plus_b)).dot(X_plus_b.T).dot(y)
theta_inv
theta_svd = np.linalg.pinv(X_plus_b).dot(y)
theta_svd
Regularization (Ridge/L2):
Add a regularization term to the cost function $J(\mathbf{\theta})$: $$ J(\mathbf{\theta}) = MSE (\mathbf{\theta}) + \dfrac{\alpha}{2} \mathbf{\theta}^T\mathbf{\theta}. $$
The model parameter $\mathbf{\theta}$ is given by $$ \mathbf{\theta} = (\mathbf{X}^T\mathbf{X}+\alpha \mathbf{I})^{-1}\mathbf{X}^T \mathbf{y}, $$ where $\mathbf{I}$ is the identity matrix and $\alpha$ controls the size of the regularization term.
Ridge (L2) regression is sensitive to the scale of the features, so scale the features before fitting.
Regularization (Lasso/L1):
$$ J(\mathbf{\theta}) = MSE (\mathbf{\theta}) + \alpha \sum^N_{i=1} \vert\theta_i\vert $$
Features of L1 Lasso regression:
1. tends to eliminate the weights of the least important features (sets them to zero).
2. automatically performs feature selection and outputs a sparse model.
3. the slope of the penalty changes abruptly at $\theta_i=0$ (the L1 term is not differentiable there).
Elastic Net combines both penalties with mix ratio $r$:
$$ J(\mathbf{\theta}) = MSE (\mathbf{\theta}) + r \alpha\sum^N_{i=1} \vert\theta_i\vert + \dfrac{1-r}{2}\alpha\, \mathbf{\theta}^T\mathbf{\theta} $$
Note: First try **Ridge**. If only a few features are important, then try **Elastic Net**.
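A short sketch on hypothetical random data (5 features, only the first is informative; the alpha values are arbitrary) comparing how Ridge and Lasso shrink the coefficients:
import numpy as np
from sklearn.linear_model import Ridge, Lasso
np.random.seed(0)                              # for reproducibility
X_demo = np.random.rand(50, 5)                 # 5 features, only the first matters
y_demo = 3*X_demo[:, 0] + 4 + 0.1*np.random.randn(50)
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
print("Ridge coefficients:", ridge.coef_)      # shrunk but nonzero
print("Lasso coefficients:", lasso.coef_)      # irrelevant features driven towards (often exactly) zero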
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
m=20
X = 6*np.random.rand(m,1)
y = 3*X + 4 + np.random.rand(m,1) # plus noise
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X,y)
y_pred = elastic_net.predict(X)
plt.scatter(X,y.squeeze())
plt.plot(X,y_pred,'r')
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(X.ravel(), y.ravel())  # linregress expects 1-D arrays
plt.plot(X, y, 'o', label='original data')
plt.plot(X, intercept + slope*X, 'r', label='fitted line')
print("The r-value is {}".format(r_value))
The statistic model also provides the quatitative measure of the confidence using the p-value, which is not the case for ML models.
Note: Pay attention to the shapes of the $X, y$:
The polynomial function of $X$ can be rewritten as a linear function of new basis functions $\phi_i(X)$:
$$ \begin{align} \widehat y &= \theta_0 + \theta_1 X + \theta_2 X^2 + \theta_3 X^3 + \cdots \nonumber\\ &= \theta_0 + \theta_1 \phi_1(X) + \theta_2 \phi_2(X) + \theta_3 \phi_3(X) + \cdots \nonumber\\ &= \mathbf{\theta}^T \cdot\mathbf{\phi(X)}, \end{align} $$ where $\phi_i(X)=X^i$ for $i=1,2,3,\ldots$ and $\phi_0(X)=1$.
The basis functions play the role of transforming (expanding) the features:
$$ X \to \phi(X) $$
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
elastic_net =ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X_poly, y)
elastic_net.intercept_, elastic_net.coef_
The fitted model is approximately $$ \widehat y \approx 0.57 x^2 + 0.25 x + 1.70 $$ (the exact coefficients depend on the random data).
plt.scatter(X,elastic_net.predict(X_poly),c='r',label='prediction')
plt.scatter(X,y.squeeze(),c='b',label='data')
plt.legend()
Note: Let's consider the case with $n$ features ($X_1, X_2, \cdots, X_n$) and degree $d$. The number of new features,
$$ {\rm No.~of~} X^a_1X^b_2 \cdots X^c_n {\rm~terms}\sim \dfrac{(n+d)!}{d!\,n!}, $$
grows combinatorially, which makes the computation expensive.
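A quick check of this count for hypothetical values n = 10 and d = 3: the number of columns produced by PolynomialFeatures matches the binomial coefficient above.
import numpy as np
from scipy.special import comb
from sklearn.preprocessing import PolynomialFeatures
n, d = 10, 3
X_demo = np.random.rand(5, n)                            # 5 samples, n features
X_demo_poly = PolynomialFeatures(degree=d).fit_transform(X_demo)
print(X_demo_poly.shape[1], comb(n + d, d, exact=True))  # both equal (n+d)!/(d! n!)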
style = " <style> div.highlight { background-color: #E0FFFF; border-color: #dFb5b4; border-left: 5px solid #dfb5b4; padding: 0.5em;}</style>"
HTML(style)
Logistic regression is used to predict the probability $\widehat p$ that an instance belongs to a particular class.
Logistic regression computes a weighted sum of the input features (plus a bias term) and outputs the logistic (sigmoid) of this result: $\widehat p = \sigma(\mathbf{\theta}^T\cdot \mathbf{X})$.
Cost function for a single training instance/sample
$$ C(\theta) = \begin{cases} -\log(\widehat p), {\rm if}~ y=1 ({\rm positive ~instance}),\\ -\log(1- \widehat p), {\rm if}~ y=0 ({\rm negative~ instance}),\\ \end{cases} $$ which can be rewritten as one expression $$ C^{(m)}(\theta) = -y^{(m)}\log(\widehat p^{(m)}) -(1-y^{(m)})\log(1- \widehat p^{(m)}). $$ Therefore, the cost function of the whole training set is $$ J(\theta) = \dfrac{1}{M}\sum^M_{m=1} C^{(m)}(\theta) = -\dfrac{1}{M}\sum^M_{m=1}\left[ y^{(m)}\log(\widehat p^{(m)}) + (1-y^{(m)})\log(1- \widehat p^{(m)})\right]. $$
Note: the goal of training is to find the $\theta$ so that the model estimates high probability for positive instances and low probability for negative instances.
plt.subplot(1,2,1)
x1=np.linspace(0.0001,0.9999)
plt.plot(x1,-np.log(x1),c='r')
plt.text(0.25,8,r'$y^{(m)}=1$')
plt.arrow(0.25,2,0.2,-0.5, head_length = 0.17, head_width = 0.05, length_includes_head = True)
plt.xlabel(r'$\widehat p^{(m)}$',fontsize=20)
plt.ylabel(r'$C^{(m)}(\theta)$',fontsize=20)
plt.subplot(1,2,2)
x0=np.linspace(0.0001,0.9999)
plt.plot(x0,-np.log(1-x0),c='r')
plt.text(0.25,8,r'$y^{(m)}=0$')
plt.arrow(0.75,2,-0.2,-0.5, head_length = 0.17, head_width = 0.05, length_includes_head = True)
plt.xlabel(r'$\widehat p^{(m)}$',fontsize=20)
Note: The cost function is convex! The model parameter $\theta$ is determined by minimizing the cost function $J(\theta)$, which drives each $\widehat p^{(m)}$ towards the value of $y^{(m)}$.
def sigmoid(t):
return 1/(1+np.exp(-t))
x = np.linspace(-50,50,200)  # already sorted
plt.plot(x,sigmoid(x),c='r')
plt.xlim(-15,15)
plt.axhline(0.5,linestyle=':',c='k')
plt.axvline(0,linestyle=':',c='k')
plt.ylabel(r'$\widehat p^{(m)}$', fontsize=20)
plt.xlabel(r'$\theta^T\cdot \mathbf{X}^{(m)}$', fontsize=20)
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
iris = datasets.load_iris()
list(iris)
X= iris["data"][:,3:] # pental width
y=(iris["target"]==2).astype(np.int)
log_reg = LogisticRegression()
log_reg.fit(X,y)
X_new = np.linspace(0,3,1000).reshape(-1,1)
y_proba = log_reg.predict_proba(X_new)
plt.plot(X_new,y_proba[:,1],label='Iris-Virginica')
plt.plot(X_new,y_proba[:,0],label='Not Iris-Virginica')
plt.legend(loc=1,fontsize=16)
plt.ylabel('Probability', fontsize=16)
plt.xlabel('Petal width (cm)', fontsize=16)
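The classifier predicts the positive class when $\widehat p \ge 0.5$. A quick check with the log_reg fitted above (the test values 1.5 cm and 1.7 cm are chosen arbitrarily, roughly on either side of the boundary):
# predicted classes and probabilities near the decision boundary
print(log_reg.predict([[1.5], [1.7]]))
print(log_reg.predict_proba([[1.5], [1.7]]))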
Softmax (multinomial logistic) regression first computes a score $s_k(\mathbf{X})$ for each class $k$,
$$ s_k(\mathbf{X}) = \mathbf{\theta_{(k)}}^T\cdot \mathbf{X}, $$
and then estimates the probability $\widehat p_k$ of each class by applying the softmax function to the scores:
$$ \widehat p_k =\sigma(s(\mathbf{X}))_k = \dfrac{ e^{s_k(\mathbf{X}) }}{\sum^K_{j=1} e^{s_j(\mathbf{X})}}. $$
The denominator is a normalization factor, ensuring $$ \sum^K_{k=1}\widehat p_k(\mathbf{X})=1. $$
The predicted class is the one with the highest estimated probability (i.e., the highest score):
$$ y = \underset{k}{argmax}~\sigma(s(\mathbf{X}))_k = \underset{k}{argmax} \left(\theta_{(k)}^T\cdot \mathbf{X}\right) $$
The cross-entropy cost function is
$$ J(\Theta)= -\dfrac{1}{M} \sum^M_{m=1} \sum^K_{k=1}\left[ y^{(m)}_k\log \widehat p^{(m)}_k\right], $$
where $y^{(m)}_k$ is either 1 or 0, depending on whether the $m$-th instance belongs to class $k$ or not.
Its gradient with respect to $\theta_{(k)}$ is
$$ \nabla_{\theta_{(k)}} J(\Theta) = \dfrac{1}{M} \sum^M_{m=1} \left(\widehat p^{(m)}_k - y^{(m)}_k\right) \mathbf{X}^{(m)}. $$
Note: In the deep-learning course by A. Ng, the symbol $a^{(m)}$ is introduced to replace $\widehat p^{(m)}$.
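A minimal NumPy sketch of batch gradient descent for Softmax regression using the gradient above (the learning rate eta, the iteration count, and the one-hot encoding are assumptions of this sketch, not taken from the text):
import numpy as np

def softmax(S):
    exp_S = np.exp(S - S.max(axis=1, keepdims=True))   # subtract max for numerical stability
    return exp_S / exp_S.sum(axis=1, keepdims=True)

def softmax_gd(X, y, K, eta=0.1, n_iter=1000):
    M, N = X.shape
    Xb = np.c_[np.ones((M, 1)), X]                     # add bias feature x0 = 1
    Y = np.eye(K)[y]                                   # one-hot targets y_k^(m)
    Theta = np.zeros((N + 1, K))
    for _ in range(n_iter):
        P = softmax(Xb.dot(Theta))                     # estimated probabilities p_k^(m)
        grad = Xb.T.dot(P - Y) / M                     # gradient of J(Theta), one column per class
        Theta -= eta * grad
    return Theta

# usage on the iris features defined above (petal length & width):
# Theta = softmax_gd(iris["data"][:, (2, 3)], iris["target"], K=3)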
style1 = " <style> div.warn { background-color: #fcf2f2; border-color: #dFb5b4; border-left: 5px solid #dfb5b4; padding: 0.5em;}</style>"
HTML(style1)
For $K=2$ (binary classification), the cost function simplifies (set $y^{(m)}_1=y^{(m)}$ and $\widehat p^{(m)}_1=\widehat p^{(m)}$, so that $y^{(m)}_2=1-y^{(m)}$ and $\widehat p^{(m)}_2=1-\widehat p^{(m)}$):
$$ \begin{align} J(\Theta) &= -\dfrac{1}{M} \sum^M_m \sum^2_{k=1}\left[ y^{(m)}_k\log \widehat p^{(m)}_k\right] \nonumber\\ &= -\dfrac{1}{M} \sum^M_m \left[ y^{(m)}_1\log \widehat p^{(m)}_1 + y^{(m)}_2\log \widehat p^{(m)}_2\right]\nonumber\\ &= -\dfrac{1}{M} \sum^M_m \left[ y^{(m)} \log \widehat p^{(m)} + (1- y^{(m)})\log (1- \widehat p^{(m)})\right]. \end{align} $$
X = iris["data"][:,(2,3)] # petal length and width
y = iris["target"]
y
soft_reg = LogisticRegression(multi_class="multinomial",solver="lbfgs",C=10)
soft_reg.fit(X,y)
X_test = [5,2]
soft_reg.predict([X_test]), soft_reg.predict_proba([X_test])
classes = ['class 0', 'class 1', 'class 2']
prob = np.array(soft_reg.predict_proba([X_test])).squeeze()
plt.bar(classes,prob)
plt.ylabel("Probability")
A linear SVM classifier (hard margin) minimizes
$$ J(\mathbf{\theta}) = \dfrac{1}{2} \mathbf{\theta}^T\mathbf{\theta} $$
subject to the linear constraints
$$ t^{(m)} (\mathbf{\theta}^T \mathbf{X}^{(m)}+b) \ge 1, $$
where $t^{(m)}$ is $1$ for a positive instance (corresponding to $y^{(m)}=1$) and $-1$ for a negative instance (corresponding to $y^{(m)}=0$).
For soft-margin classification, one can instead minimize the following cost function, which penalizes margin violations through the hinge-loss term:
$$ J(\Theta, b) = \dfrac{1}{2} \theta^T\theta + C\sum^M_{m=1} max\left(0, 1- t^{(m)} (\theta^T \mathbf{X}^{(m)}+b)\right). $$
$$ \widehat y = \begin{cases} 0, & \quad \theta^T \mathbf{X}+b<0\\ 1, & \quad \theta^T \mathbf{X}+b >0 \end{cases} $$
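A small NumPy sketch evaluating the soft-margin cost above for hypothetical parameters $\theta$, $b$ and labels $t^{(m)}=\pm 1$ (the data and parameter values below are made up for illustration):
import numpy as np

def svm_soft_margin_cost(theta, b, X, t, C=1.0):
    # t must be +/-1; the hinge loss is max(0, 1 - t*(theta^T x + b))
    margins = t * (X.dot(theta) + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * theta.dot(theta) + C * hinge.sum()

# hypothetical 2-feature example
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0]])
t_demo = np.array([1, 1, -1])                  # +1 positive, -1 negative
print(svm_soft_margin_cost(np.array([0.5, 0.5]), -0.2, X_demo, t_demo))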
hard margin classification: requires all instances to be off the street (outside the margin); it is sensitive to outliers.
soft margin classification: finds a good balance between keeping the street as wide as possible and limiting margin violations (the number of instances that end up in the middle of the street or on the wrong side).
For a 2nd-degree polynomial mapping: $$ \phi(\mathbf{X}) =\phi\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} =\begin{pmatrix} x^2_1 \\ \sqrt{2}x_1x_2\\ x^2_2 \end{pmatrix} $$
and the dot product
$$ \phi(\mathbf{X})^T\phi(\mathbf{Y}) =\begin{pmatrix} x^2_1 \\ \sqrt{2}x_1x_2\\ x^2_2 \end{pmatrix}^T \begin{pmatrix} y^2_1 \\ \sqrt{2}y_1y_2\\ y^2_2 \end{pmatrix} =(\mathbf{X}^T\mathbf{Y})^2 \equiv K(\mathbf{X},\mathbf{Y}), $$ where $K(\mathbf{X},\mathbf{Y})$ is called a $2$nd-degree polynomial kernel.
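A quick numerical sanity check of this kernel identity on two random 2-D vectors (not part of the original derivation):
import numpy as np

def phi(v):                                   # explicit 2nd-degree mapping
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2)*x1*x2, x2**2])

a, b = np.random.rand(2), np.random.rand(2)
print(phi(a).dot(phi(b)), (a.dot(b))**2)      # the two numbers agree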
For a given landmark $\mathbf{\ell}$, one can map the feature $\mathbf{X}$ to a new feature using a Gaussian function which measures the similarity between the feature $\mathbf{X}$ and the landmark:
$$ K(\mathbf{X}, \mathbf{\ell}) \equiv \phi_\gamma(\mathbf{X}, \mathbf{\ell}) = \exp(-\gamma \vert\vert \mathbf{X} - \mathbf{\ell}\vert\vert^2) $$
Therefore, if there are $m$ instances with $n$ features in the dataset and a landmark is placed at each instance, the transformation introduces $m$ Gaussian similarity features, so the size of the transformed training set grows as
$$ O(m\times n) \to O(m\times m), $$
which becomes expensive for large training sets.
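A short sketch of the similarity-feature transformation, placing a landmark at every instance (gamma = 0.3 and the random data are arbitrary choices); rbf_kernel computes exactly $\exp(-\gamma \vert\vert \mathbf{X} - \mathbf{\ell}\vert\vert^2)$:
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
X_demo = np.random.rand(100, 2)                # m = 100 instances, n = 2 features
landmarks = X_demo                             # one landmark per instance
X_similarity = rbf_kernel(X_demo, landmarks, gamma=0.3)
print(X_demo.shape, "->", X_similarity.shape)  # (100, 2) -> (100, 100)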
X = iris["data"][:, (2,3)]
y = (iris["target"]==2).astype(np.float64)
def plot_decision_boundary(model, X, y):
# Set min and max values and give it some padding
x_min, x_max = X[:,0].min() - 1, X[:,0].max() + 1 # x1
y_min, y_max = X[:,1].min() - 1, X[:,1].max() + 1 # x2
h = 0.01
# Generate a grid of points with distance h between them
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Predict the function value for the whole grid
Z = model(np.c_[xx.ravel(), yy.ravel()]) # numpy.ravel() == numpy.reshape(-1); np.c_: Stack 1-D arrays as columns into a 2-D array
Z = Z.reshape(xx.shape)
# Plot the contour and training examples
plt.contourf(xx, yy, Z, cmap='binary')
plt.ylabel('x2')
plt.xlabel('x1')
plt.scatter(X[:,0], X[:,1], c=y.ravel(), cmap=plt.cm.Spectral)
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC,SVC
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
polynomial_svm_clf = Pipeline(
[("poly_features", PolynomialFeatures(degree=3)),
("scaler",StandardScaler()),
("svm_clf",LinearSVC(C=10,loss="hinge"))
])
polynomial_svm_clf_poly3 = Pipeline(
    [("scaler", StandardScaler()),
     # the kernel trick makes an explicit PolynomialFeatures step unnecessary here
     ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
    ])
polynomial_svm_clf_rbf = Pipeline(
    [("scaler", StandardScaler()),
     # degree and coef0 are ignored by the RBF kernel
     ("svm_clf", SVC(kernel="rbf", C=5))
    ])
polynomial_svm_clf.fit(X,y)
X_test = np.array([5,2]).reshape(1,-1)
polynomial_svm_clf.predict(X_test)
plot_decision_boundary(lambda x: polynomial_svm_clf.predict(x), X, y)
polynomial_svm_clf_poly3.fit(X,y)
plot_decision_boundary(lambda x: polynomial_svm_clf_poly3.predict(x), X, y)
polynomial_svm_clf_rbf.fit(X,y)
plot_decision_boundary(lambda x: polynomial_svm_clf_rbf.predict(x), X, y)
SVM regression works for both linear and nonlinear regression.
The street width is controlled by the hyperparameter $\epsilon$.
X = np.random.rand(100,1)*5
y = X**3 + 10*X**2 + X + 4
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
svm_poly_reg = SVR(kernel="poly", degree=3, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y.ravel())   # SVR fits the intercept itself; y must be 1-D
y_pred = svm_poly_reg.predict(X)
plt.scatter(y, y_pred)           # predicted vs. true values
plt.xlabel('y'); plt.ylabel('y_pred')
mean_squared_error(y, y_pred)
X=iris["data"][:,2:]
y=iris["target"]
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X,y)
from sklearn.tree import export_graphviz
export_graphviz(tree_clf,
out_file=("iris_tree.dot"),
feature_names=iris.feature_names[2:],
class_names=iris.target_names,
rounded=True,
filled=True)
! dot -Tpng iris_tree.dot -o iris_tree.png
Image("iris_tree.png")
$$ G_{i} = 1 - \sum^K_{k=1} p^2_{i,k} $$ where $p_{i,k}$ is the ratio of the class $k$ instances among the training instances in the $i$-th node:
For example: for the right-child node in the second layer (depth-1), samples = 100, value =[0,50,50]. The Gini score is given by
$$ G_i = 1 - (0/100)^2 - (50/100)^2 - (50/100)^2 = 0.5 $$
The Classification and Regression Tree (CART) algorithm trains a Decision Tree by splitting the training set into two subsets using a single feature $k$ and a threshold $t_k$. The parameters $k$ and $t_k$ are chosen so that the split produces the purest subsets, weighted by their size: $$ J(k, t_k) = \dfrac{M_{\rm left}}{M} G_{\rm left} + \dfrac{M_{\rm right}}{M} G_{\rm right}, $$ where $M_{\rm left/right}$ is the number of instances in the left/right node/subset.
Alternatively, one can use the entropy impurity measure:
$$ H_i = -\sum^K_{k=1, p_{i,k}\neq 0} p_{i, k} \log_2(p_{i,k}) $$
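A quick numerical check of the Gini and entropy impurities for the depth-1 right node above (value = [0, 50, 50]):
import numpy as np
p = np.array([0, 50, 50]) / 100                   # class ratios p_{i,k}
gini = 1 - np.sum(p**2)                           # = 0.5
entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))   # = 1.0
print(gini, entropy)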
Decision Trees versus Linear models:
Decision Trees are nonparametric and sensitive to small variations in the training data.
Linear models are parametric: the number of model parameters is fixed beforehand, which reduces the risk of overfitting to some extent.
For regression trees, the CART cost function minimizes the MSE instead of the impurity: $$ J(k, t_k) = \dfrac{M_{\rm left}}{M} {\rm MSE}_{\rm left} + \dfrac{M_{\rm right}}{M} {\rm MSE}_{\rm right} $$ where
$$ {\rm MSE}_{\rm node} = \sum_{m\in node} (\widehat y_{node} - y^{(m)})^2, \quad \widehat y_{node}=\dfrac{1}{M_{node}} \sum_{m \in node} y^{(m)} $$
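A minimal regression-tree sketch on hypothetical cubic data (max_depth=3 is an arbitrary choice), analogous to the classifier above: each leaf predicts the mean target value $\widehat y_{node}$ of its instances.
from sklearn.tree import DecisionTreeRegressor
import numpy as np
import matplotlib.pyplot as plt
X_reg = np.random.rand(100, 1) * 5
y_reg = X_reg**3 + 10*X_reg**2 + X_reg + 4
tree_reg = DecisionTreeRegressor(max_depth=3)
tree_reg.fit(X_reg, y_reg.ravel())
X_line = np.linspace(0, 5, 200).reshape(-1, 1)
plt.scatter(X_reg, y_reg, s=10, label="data")
plt.plot(X_line, tree_reg.predict(X_line), 'r', label="tree prediction (piecewise constant)")
plt.legend()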
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()
voting_clf = VotingClassifier(
estimators=[
('lr',log_clf),
('rf',rnd_clf),
('svc',svm_clf)],
voting = 'hard')
voting_clf.fit(X,y)
from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X,y)
    y_pred = clf.predict(X)
    # note: accuracy here is measured on the training set (no train/test split)
    print(clf.__class__.__name__, accuracy_score(y,y_pred))
    plot_decision_boundary(lambda x: clf.predict(x), X, y)
Use the same training algorithm for every predictor, but train each predictor on a different random subset of the training set.
Bagging: sampling is performed with replacement (bootstrap=True)
Pasting: sampling is performed without replacement
Aggregate the predictions of all predictors, taking the most frequent prediction (majority vote) for classification or the average for regression.
Aggregation reduces both bias and variance. Generally, the ensemble has a similar bias but a lower variance than a single predictor trained on the whole training set.
Bagging ends up with a slightly higher bias than Pasting, but the predictors are less correlated, so the ensemble variance is lower.
Image("images/bagging.png", width=800)
X=iris["data"][:,2:]
y=iris["target"]
X.shape,y.shape
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X,y)
bag_clf = BaggingClassifier(
DecisionTreeClassifier(),
n_estimators=500,
max_samples=100,
bootstrap=True,
n_jobs=-1,
oob_score=True)
bag_clf.fit(X,y)
y_pred = bag_clf.predict(X)
plot_decision_boundary(lambda x: bag_clf.predict(x), X, y)
bag_clf.oob_score_
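For comparison, a Pasting ensemble uses the same estimator with bootstrap=False (sampling without replacement); the oob score is only defined when bootstrapping, so it is omitted here. A minimal sketch reusing the X, y defined above:
past_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=False,   # Pasting: sampling without replacement
    n_jobs=-1)
past_clf.fit(X, y)
past_clf.score(X, y)   # accuracy on the training set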
plot_decision_boundary(lambda x: tree_clf.predict(x), X, y)
A Random Forest is an ensemble of Decision Trees trained via the Bagging method.
It also provides a measure of feature importance.
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=500,max_leaf_nodes=16,n_jobs=-1)
rnd_clf.fit(X,y)
for name, score in zip(iris['feature_names'][2:], rnd_clf.feature_importances_):  # X uses only petal length & width
print(name,score)
plot_decision_boundary(lambda x: rnd_clf.predict(x), X, y)
Comparison: a BaggingClassifier of Decision Trees with splitter='random' is roughly equivalent to a Random Forest.
bag_clf = BaggingClassifier(
DecisionTreeClassifier(
splitter='random',
max_leaf_nodes=16),
n_estimators=500,
max_samples=1.0,
bootstrap=True,
n_jobs=-1)
bag_clf.fit(X,y)
plot_decision_boundary(lambda x: bag_clf.predict(x), X, y)