from IPython.display import HTML
"Statistics draws population inferences from a sample, and machine learning finds generalizable predictive patterns."
"Statistical methods have a long-standing focus on inference, which is achieved through the creation and fitting of a project-specific probability model. The model allows us to compute a quantitative measure of confidence that a discovered relationship describes a 'true' effect that is unlikely to result from noise. Furthermore, if enough data are available, we can explicitly verify assumptions (e.g., equal variance) and refine the specified model, if needed."
"By contrast, ML concentrates on prediction by using general-purpose learning algorithms to find patterns in often rich and unwieldy data. ML methods are particularly helpful when one is dealing with 'wide data', where the number of input variables exceeds the number of subjects, in contrast to 'long data', where the number of subjects is greater than that of input variables. ML makes minimal assumptions about the data-generating systems; they can be effective even when the data are gathered without a carefully controlled experimental design and in the presence of complicated nonlinear interactions."
Danilo Bzdok, Naomi Altman & Martin Krzywinski, Nature Methods volume 15, pages 233–234 (2018).
from sklearn.metrics import confusion_matrix, precision_score, accuracy_score, f1_score
model performance measurements: confusion matrix
$$ accuracy = \dfrac{\rm TP + TN}{\rm Total (TP+TN+FP+FN)} $$
$$ precision = \dfrac{\rm TP}{\rm Predicted~Positive (TP+FP)} $$
recall/sensitivity/True Positive Rate(TPR) $$ recall = \dfrac{\rm TP}{\rm Real~Positive (TP+FN)} $$
F1 score
$$ F_1 = \dfrac{2}{\dfrac{1}{precision} + \dfrac{1}{recall}} = 2\times \dfrac{precision*recall}{precision + recall} $$
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
Note: For an imbalanced dataset, accuracy can be misleading; recall (together with precision) is the more informative metric.
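A minimal sketch (with made-up labels y_true, y_pred) showing how the sklearn metrics map onto the formulas above:
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)
# hypothetical labels for illustration only
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])
print(confusion_matrix(y_true, y_pred))                # layout: [[TN, FP], [FN, TP]]
print("accuracy :", accuracy_score(y_true, y_pred))    # (TP+TN)/total
print("precision:", precision_score(y_true, y_pred))   # TP/(TP+FP)
print("recall   :", recall_score(y_true, y_pred))      # TP/(TP+FN)
print("F1       :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall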
ROC curve: True Positive Rate (recall/sensitivity) vs. False Positive Rate (1 - Specificity),
where
$$ FPR = \dfrac{\rm FP}{\rm Real~Negative (FP+TN)} = 1 - Specificity. $$
A better algorithm gives a larger AUC (area under the ROC curve) score.
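A minimal sketch of an ROC curve and its AUC on hypothetical labels and scores (y_true, y_score are made up for illustration):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
# hypothetical true labels and predicted scores
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])
fpr, tpr, thresholds = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label="AUC = {:.2f}".format(roc_auc_score(y_true, y_score)))
plt.plot([0, 1], [0, 1], 'k:')   # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (recall)")
plt.legend()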
import numpy as np
from IPython.display import Image
import matplotlib.pyplot as plt
Image("images/ROC_curves.jpg")
Linear regression computes a weighted sum of the input features (plus a bias term) and outputs this result.
The prediction $\hat y^{(m)}$ for the $m$-th sample is parameterized as a linear function of the feature values $x^{(m)}_i$ ($N$ features) of the $m$-th sample:
$$ \widehat y^{(m)} = h_\theta(\mathbf{X^{(m)}}) = \sum^N_{i=1} \theta_i x^{(m)}_i + \theta_0 = \mathbf{\theta}^T\cdot \mathbf{X}^{(m)} $$
where $\theta_0$ is the bias term, $\theta_1,\ldots,\theta_N$ are the feature weights, and $\mathbf{X}^{(m)}$ is the feature vector of the $m$-th sample (with $x^{(m)}_0=1$).
Mean-Square-Error (MSE): $$ MSE (\mathbf{\theta}) = \dfrac{1}{M} \sum^M_{m=1} (y^{(m)} - \widehat y^{(m)})^2. $$
The model parameter $\mathbf{\theta}$ is determined by minimization of the MSE and is given by the solution of the following Normal Equation $$ \mathbf{\theta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbf{y} ={\cal X}^+ \mathbf{y}, $$ where
${\cal X}^+$ is the pseudo-inverse of $\mathbf{X}$, computed in NumPy with np.linalg.pinv(X).
Note: using the SVD-based pseudo-inverse avoids the singularity problem when $\mathbf{X}^T\mathbf{X}$ is not invertible.
m=100
X = 6*np.random.rand(m,1)
y = 3*X + 4 + np.random.rand(m,1) # plus noise
X_plus_b = np.c_[np.ones((m,1)), X] # add x0=1 to each instance
theta_inv = np.linalg.inv(X_plus_b.T.dot(X_plus_b)).dot(X_plus_b.T).dot(y)
theta_inv
theta_svd = np.linalg.pinv(X_plus_b).dot(y)
theta_svd
Regularization (Ridge/L2):
Add a regularization term to the cost function $J(\mathbf{\theta})$: $$ J(\mathbf{\theta}) = MSE (\mathbf{\theta}) + \dfrac{\alpha}{2} \mathbf{\theta}^T\mathbf{\theta}. $$
The model parameter $\mathbf{\theta}$ is given by $$ \mathbf{\theta} = (\mathbf{X}^T\mathbf{X}+\alpha \mathbf{I})^{-1}\mathbf{X}^T \mathbf{y}, $$ where $\mathbf{I}$ is the identity matrix and $\alpha$ controls the size of the regularization term.
Ridge (L2) regression is sensitive to the scale of the features, so scale the features before fitting.
Regularization (Lasso/L1):
$$ J(\mathbf{\theta}) = MSE (\mathbf{\theta}) + \alpha \sum^N_{i=1} \vert\theta_i\vert $$
Features of L1 Lasso regression:
1. tends to eliminate the weights of the least important features (sets them to zero).
2. automatically performs feature selection and outputs a sparse model.
3. the slope of the penalty changes abruptly at $\theta_i=0$ (the L1 term is not differentiable there).
Elastic Net combines both penalties with mix ratio $r$:
$$ J(\mathbf{\theta}) = MSE (\mathbf{\theta}) + r \alpha\sum^N_{i=1} \vert\theta_i\vert + \dfrac{1-r}{2}\alpha\, \mathbf{\theta}^T\mathbf{\theta} $$
Note: First try **Ridge**. If only a few features are important, then try **Elastic Net**.
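A short sketch on hypothetical random data (5 features, only the first is informative; the alpha values are arbitrary) comparing how Ridge and Lasso shrink the coefficients:
import numpy as np
from sklearn.linear_model import Ridge, Lasso
np.random.seed(0)                              # for reproducibility
X_demo = np.random.rand(50, 5)                 # 5 features, only the first matters
y_demo = 3*X_demo[:, 0] + 4 + 0.1*np.random.randn(50)
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
print("Ridge coefficients:", ridge.coef_)      # shrunk but nonzero
print("Lasso coefficients:", lasso.coef_)      # irrelevant features driven towards (often exactly) zero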
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
m=20
X = 6*np.random.rand(m,1)
y = 3*X + 4 + np.random.rand(m,1) # plus noise
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X,y)
y_pred = elastic_net.predict(X)
plt.scatter(X,y.squeeze())
plt.plot(X,y_pred,'r')
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(X.ravel(), y.ravel())  # linregress expects 1-D arrays
plt.plot(X, y, 'o', label='original data')
plt.plot(X, intercept + slope*X, 'r', label='fitted line')
print("The r-value is {}".format(r_value))
The statistic model also provides the quatitative measure of the confidence using the p-value, which is not the case for ML models.
Note: Pay attention to the shapes of the $X, y$:
The polynomial function of $X$ can be rewritten as a linear function of new basis functions $\phi_i(X)$:
$$ \begin{align} \widehat y &= \theta_0 + \theta_1 X + \theta_2 X^2 + \theta_3 X^3 + \cdots \nonumber\\ &= \theta_0 + \theta_1 \phi_1(X) + \theta_2 \phi_2(X) + \theta_3 \phi_3(X) + \cdots \nonumber\\ &= \mathbf{\theta}^T \cdot\mathbf{\phi(X)}, \end{align} $$ where $\phi_i(X)=X^i$ for $i=1,2,3,\ldots$ and $\phi_0(X)=1$.
The basis functions play the role of transforming (expanding) the features:
$$ X \to \phi(X) $$
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
elastic_net =ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X_poly, y)
elastic_net.intercept_, elastic_net.coef_
The fitted model is approximately $$ \widehat y \approx 0.57 x^2 + 0.25 x + 1.70 $$ (the exact coefficients depend on the random data).
plt.scatter(X,elastic_net.predict(X_poly),c='r',label='prediction')
plt.scatter(X,y.squeeze(),c='b',label='data')
plt.legend()
Note: Let's consider the case with $n$ features ($X_1, X_2, \cdots, X_n$) and degree $d$. The number of new features,
$$ {\rm No.~of~} X^a_1X^b_2 \cdots X^c_n {\rm~terms}\sim \dfrac{(n+d)!}{d!\,n!}, $$
grows combinatorially, which makes the computation expensive.
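A quick check of this count for hypothetical values n = 10 and d = 3: the number of columns produced by PolynomialFeatures matches the binomial coefficient above.
import numpy as np
from scipy.special import comb
from sklearn.preprocessing import PolynomialFeatures
n, d = 10, 3
X_demo = np.random.rand(5, n)                            # 5 samples, n features
X_demo_poly = PolynomialFeatures(degree=d).fit_transform(X_demo)
print(X_demo_poly.shape[1], comb(n + d, d, exact=True))  # both equal (n+d)!/(d! n!)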
style = " <style> div.highlight { background-color: #E0FFFF; border-color: #dFb5b4; border-left: 5px solid #dfb5b4; padding: 0.5em;}</style>"
HTML(style)
Logistic regression is used to predict the probability $\widehat p$ that an instance belongs to a particular class.
Logistic regression computes a weighted sum of the input features (plus a bias term) and outputs the logistic (sigmoid) of this result: $\widehat p = \sigma(\mathbf{\theta}^T\cdot \mathbf{X})$.
Cost function for a single training instance/sample
$$ C(\theta) = \begin{cases} -\log(\widehat p), {\rm if}~ y=1 ({\rm positive ~instance}),\\ -\log(1- \widehat p), {\rm if}~ y=0 ({\rm negative~ instance}),\\ \end{cases} $$ which can be rewritten as one expression $$ C^{(m)}(\theta) = -y^{(m)}\log(\widehat p^{(m)}) -(1-y^{(m)})\log(1- \widehat p^{(m)}). $$ Therefore, the cost function of the whole training set is $$ J(\theta) = \dfrac{1}{M}\sum^M_{m=1} C^{(m)}(\theta) = -\dfrac{1}{M}\sum^M_{m=1}\left[ y^{(m)}\log(\widehat p^{(m)}) + (1-y^{(m)})\log(1- \widehat p^{(m)})\right]. $$
Note: the goal of training is to find the $\theta$ so that the model estimates high probability for positive instances and low probability for negative instances.
plt.subplot(1,2,1)
x1=np.linspace(0.0001,0.9999)
plt.plot(x1,-np.log(x1),c='r')
plt.text(0.25,8,r'$y^{(m)}=1$')
plt.arrow(0.25,2,0.2,-0.5, head_length = 0.17, head_width = 0.05, length_includes_head = True)
plt.xlabel(r'$\widehat p^{(m)}$',fontsize=20)
plt.ylabel(r'$C^{(m)}(\theta)$',fontsize=20)
plt.subplot(1,2,2)
x0=np.linspace(0.0001,0.9999)
plt.plot(x0,-np.log(1-x0),c='r')
plt.text(0.25,8,r'$y^{(m)}=0$')
plt.arrow(0.75,2,-0.2,-0.5, head_length = 0.17, head_width = 0.05, length_includes_head = True)
plt.xlabel(r'$\widehat p^{(m)}$',fontsize=20)
Note: The cost function is convex! The model parameter $\theta$ is determined by minimizing the cost function $J(\theta)$, which drives each $\widehat p^{(m)}$ towards the value of $y^{(m)}$.
def sigmoid(t):
return 1/(1+np.exp(-t))
x = np.linspace(-50,50,200)  # already sorted
plt.plot(x,sigmoid(x),c='r')
plt.xlim(-15,15)
plt.axhline(0.5,linestyle=':',c='k')
plt.axvline(0,linestyle=':',c='k')
plt.ylabel(r'$\widehat p^{(m)}$', fontsize=20)
plt.xlabel(r'$\theta^T\cdot \mathbf{X}^{(m)}$', fontsize=20)
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
iris = datasets.load_iris()
list(iris)
X= iris["data"][:,3:] # pental width
y=(iris["target"]==2).astype(np.int)
log_reg = LogisticRegression()
log_reg.fit(X,y)
X_new = np.linspace(0,3,1000).reshape(-1,1)
y_proba = log_reg.predict_proba(X_new)
plt.plot(X_new,y_proba[:,1],label='Iris-Virginica')
plt.plot(X_new,y_proba[:,0],label='Not Iris-Virginica')
plt.legend(loc=1,fontsize=16)
plt.ylabel('Probability', fontsize=16)
plt.xlabel('Petal width (cm)', fontsize=16)
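The classifier predicts the positive class when $\widehat p \ge 0.5$. A quick check with the log_reg fitted above (the test values 1.5 cm and 1.7 cm are chosen arbitrarily, roughly on either side of the boundary):
# predicted classes and probabilities near the decision boundary
print(log_reg.predict([[1.5], [1.7]]))
print(log_reg.predict_proba([[1.5], [1.7]]))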
Softmax (multinomial logistic) regression first computes a score $s_k(\mathbf{X})$ for each class $k$,
$$ s_k(\mathbf{X}) = \mathbf{\theta_{(k)}}^T\cdot \mathbf{X}, $$
and then estimates the probability $\widehat p_k$ of each class by applying the softmax function to the scores:
$$ \widehat p_k =\sigma(s(\mathbf{X}))_k = \dfrac{ e^{s_k(\mathbf{X}) }}{\sum^K_{j=1} e^{s_j(\mathbf{X})}}. $$
The denominator is a normalization factor, ensuring $$ \sum^K_{k=1}\widehat p_k(\mathbf{X})=1. $$
The predicted class is the one with the highest estimated probability (i.e., the highest score):
$$ y = \underset{k}{argmax}~\sigma(s(\mathbf{X}))_k = \underset{k}{argmax} \left(\theta_{(k)}^T\cdot \mathbf{X}\right) $$
The cross-entropy cost function is
$$ J(\Theta)= -\dfrac{1}{M} \sum^M_{m=1} \sum^K_{k=1}\left[ y^{(m)}_k\log \widehat p^{(m)}_k\right], $$
where $y^{(m)}_k$ is either 1 or 0, depending on whether the $m$-th instance belongs to class $k$ or not.
Its gradient with respect to $\theta_{(k)}$ is
$$ \nabla_{\theta_{(k)}} J(\Theta) = \dfrac{1}{M} \sum^M_{m=1} \left(\widehat p^{(m)}_k - y^{(m)}_k\right) \mathbf{X}^{(m)}. $$
Note: In the deep-learning course by A. Ng, the symbol $a^{(m)}$ is introduced to replace $\widehat p^{(m)}$.
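A minimal NumPy sketch of batch gradient descent for Softmax regression using the gradient above (the learning rate eta, the iteration count, and the one-hot encoding are assumptions of this sketch, not taken from the text):
import numpy as np

def softmax(S):
    exp_S = np.exp(S - S.max(axis=1, keepdims=True))   # subtract max for numerical stability
    return exp_S / exp_S.sum(axis=1, keepdims=True)

def softmax_gd(X, y, K, eta=0.1, n_iter=1000):
    M, N = X.shape
    Xb = np.c_[np.ones((M, 1)), X]                     # add bias feature x0 = 1
    Y = np.eye(K)[y]                                   # one-hot targets y_k^(m)
    Theta = np.zeros((N + 1, K))
    for _ in range(n_iter):
        P = softmax(Xb.dot(Theta))                     # estimated probabilities p_k^(m)
        grad = Xb.T.dot(P - Y) / M                     # gradient of J(Theta), one column per class
        Theta -= eta * grad
    return Theta

# usage on the iris features defined above (petal length & width):
# Theta = softmax_gd(iris["data"][:, (2, 3)], iris["target"], K=3)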
style1 = " <style> div.warn { background-color: #fcf2f2; border-color: #dFb5b4; border-left: 5px solid #dfb5b4; padding: 0.5em;}</style>"
HTML(style1)
For $K=2$ (binary classification), the cost function simplifies (set $y^{(m)}_1=y^{(m)}$ and $\widehat p^{(m)}_1=\widehat p^{(m)}$, so that $y^{(m)}_2=1-y^{(m)}$ and $\widehat p^{(m)}_2=1-\widehat p^{(m)}$):
$$ \begin{align} J(\Theta) &= -\dfrac{1}{M} \sum^M_m \sum^2_{k=1}\left[ y^{(m)}_k\log \widehat p^{(m)}_k\right] \nonumber\\ &= -\dfrac{1}{M} \sum^M_m \left[ y^{(m)}_1\log \widehat p^{(m)}_1 + y^{(m)}_2\log \widehat p^{(m)}_2\right]\nonumber\\ &= -\dfrac{1}{M} \sum^M_m \left[ y^{(m)} \log \widehat p^{(m)} + (1- y^{(m)})\log (1- \widehat p^{(m)})\right]. \end{align} $$
X = iris["data"][:,(2,3)] # petal length and width
y = iris["target"]
y
soft_reg = LogisticRegression(multi_class="multinomial",solver="lbfgs",C=10)
soft_reg.fit(X,y)
X_test = [5,2]
soft_reg.predict([X_test]), soft_reg.predict_proba([X_test])
classes = ['class 0', 'class 1', 'class 2']
prob = np.array(soft_reg.predict_proba([X_test])).squeeze()
plt.bar(classes,prob)
plt.ylabel("Probability")
A linear SVM classifier (hard margin) minimizes
$$ J(\mathbf{\theta}) = \dfrac{1}{2} \mathbf{\theta}^T\mathbf{\theta} $$
subject to the linear constraints
$$ t^{(m)} (\mathbf{\theta}^T \mathbf{X}^{(m)}+b) \ge 1, $$
where $t^{(m)}$ is $1$ for a positive instance (corresponding to $y^{(m)}=1$) and $-1$ for a negative instance (corresponding to $y^{(m)}=0$).
For soft-margin classification, one can instead minimize the following cost function, which penalizes margin violations through the hinge-loss term:
$$ J(\Theta, b) = \dfrac{1}{2} \theta^T\theta + C\sum^M_{m=1} max\left(0, 1- t^{(m)} (\theta^T \mathbf{X}^{(m)}+b)\right). $$
$$ \widehat y = \begin{cases} 0, & \quad \theta^T \mathbf{X}+b<0\\ 1, & \quad \theta^T \mathbf{X}+b >0 \end{cases} $$
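A small NumPy sketch evaluating the soft-margin cost above for hypothetical parameters $\theta$, $b$ and labels $t^{(m)}=\pm 1$ (the data and parameter values below are made up for illustration):
import numpy as np

def svm_soft_margin_cost(theta, b, X, t, C=1.0):
    # t must be +/-1; the hinge loss is max(0, 1 - t*(theta^T x + b))
    margins = t * (X.dot(theta) + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * theta.dot(theta) + C * hinge.sum()

# hypothetical 2-feature example
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0]])
t_demo = np.array([1, 1, -1])                  # +1 positive, -1 negative
print(svm_soft_margin_cost(np.array([0.5, 0.5]), -0.2, X_demo, t_demo))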
hard margin classification: requires all instances to be off the street (outside the margin); it is sensitive to outliers.
soft margin classification: finds a good balance between keeping the street as wide as possible and limiting margin violations (the number of instances that end up in the middle of the street or on the wrong side).
For a 2nd-degree polynomial mapping: $$ \phi(\mathbf{X}) =\phi\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} =\begin{pmatrix} x^2_1 \\ \sqrt{2}x_1x_2\\ x^2_2 \end{pmatrix} $$
and the dot product
$$ \phi(\mathbf{X})^T\phi(\mathbf{Y}) =\begin{pmatrix} x^2_1 \\ \sqrt{2}x_1x_2\\ x^2_2 \end{pmatrix}^T \begin{pmatrix} y^2_1 \\ \sqrt{2}y_1y_2\\ y^2_2 \end{pmatrix} =(\mathbf{X}^T\mathbf{Y})^2 \equiv K(\mathbf{X},\mathbf{Y}), $$ where $K(\mathbf{X},\mathbf{Y})$ is called a $2$nd-degree polynomial kernel.
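A quick numerical sanity check of this kernel identity on two random 2-D vectors (not part of the original derivation):
import numpy as np

def phi(v):                                   # explicit 2nd-degree mapping
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2)*x1*x2, x2**2])

a, b = np.random.rand(2), np.random.rand(2)
print(phi(a).dot(phi(b)), (a.dot(b))**2)      # the two numbers agree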
For a given landmark $\mathbf{\ell}$, one can map the feature $\mathbf{X}$ to a new feature using a Gaussian function which measures the similarity between the feature $\mathbf{X}$ and the landmark:
$$ K(\mathbf{X}, \mathbf{\ell}) \equiv \phi_\gamma(\mathbf{X}, \mathbf{\ell}) = \exp(-\gamma \vert\vert \mathbf{X} - \mathbf{\ell}\vert\vert^2) $$
Therefore, if there are $m$ instances with $n$ features in the dataset and a landmark is placed at each instance, the transformation introduces $m$ Gaussian similarity features, so the size of the transformed training set grows as
$$ O(m\times n) \to O(m\times m), $$
which becomes expensive for large training sets.
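A short sketch of the similarity-feature transformation, placing a landmark at every instance (gamma = 0.3 and the random data are arbitrary choices); rbf_kernel computes exactly $\exp(-\gamma \vert\vert \mathbf{X} - \mathbf{\ell}\vert\vert^2)$:
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
X_demo = np.random.rand(100, 2)                # m = 100 instances, n = 2 features
landmarks = X_demo                             # one landmark per instance
X_similarity = rbf_kernel(X_demo, landmarks, gamma=0.3)
print(X_demo.shape, "->", X_similarity.shape)  # (100, 2) -> (100, 100)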
X = iris["data"][:, (2,3)]
y = (iris["target"]==2).astype(np.float64)
def plot_decision_boundary(model, X, y):
# Set min and max values and give it some padding
x_min, x_max = X[:,0].min() - 1, X[:,0].max() + 1 # x1
y_min, y_max = X[:,1].min() - 1, X[:,1].max() + 1 # x2
h = 0.01
# Generate a grid of points with distance h between them
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Predict the function value for the whole grid
Z = model(np.c_[xx.ravel(), yy.ravel()]) # numpy.ravel() == numpy.reshape(-1); np.c_: Stack 1-D arrays as columns into a 2-D array
Z = Z.reshape(xx.shape)
# Plot the contour and training examples
plt.contourf(xx, yy, Z, cmap='binary')
plt.ylabel('x2')
plt.xlabel('x1')
plt.scatter(X[:,0], X[:,1], c=y.ravel(), cmap=plt.cm.Spectral)
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC,SVC
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
polynomial_svm_clf = Pipeline(
[("poly_features", PolynomialFeatures(degree=3)),
("scaler",StandardScaler()),
("svm_clf",LinearSVC(C=10,loss="hinge"))
])
polynomial_svm_clf_poly3 = Pipeline(
    [("scaler", StandardScaler()),
     # the kernel trick makes an explicit PolynomialFeatures step unnecessary here
     ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
    ])
polynomial_svm_clf_rbf = Pipeline(
    [("scaler", StandardScaler()),
     # degree and coef0 are ignored by the RBF kernel
     ("svm_clf", SVC(kernel="rbf", C=5))
    ])
polynomial_svm_clf.fit(X,y)
X_test = np.array([5,2]).reshape(1,-1)
polynomial_svm_clf.predict(X_test)
plot_decision_boundary(lambda x: polynomial_svm_clf.predict(x), X, y)
polynomial_svm_clf_poly3.fit(X,y)
plot_decision_boundary(lambda x: polynomial_svm_clf_poly3.predict(x), X, y)
polynomial_svm_clf_rbf.fit(X,y)
plot_decision_boundary(lambda x: polynomial_svm_clf_rbf.predict(x), X, y)
SVM regression works for both linear and nonlinear regression.
The street width is controlled by the hyperparameter $\epsilon$.
X = np.random.rand(100,1)*5
y = X**3 + 10*X**2 + X + 4
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
svm_poly_reg = SVR(kernel="poly", degree=3, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y.ravel())   # SVR fits the intercept itself; y must be 1-D
y_pred = svm_poly_reg.predict(X)
plt.scatter(y, y_pred)           # predicted vs. true values
plt.xlabel('y'); plt.ylabel('y_pred')
mean_squared_error(y, y_pred)
X=iris["data"][:,2:]
y=iris["target"]
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X,y)
from sklearn.tree import export_graphviz
export_graphviz(tree_clf,
out_file=("iris_tree.dot"),
feature_names=iris.feature_names[2:],
class_names=iris.target_names,
rounded=True,
filled=True)
! dot -Tpng iris_tree.dot -o iris_tree.png
Image("iris_tree.png")
$$ G_{i} = 1 - \sum^K_{k=1} p^2_{i,k} $$ where $p_{i,k}$ is the ratio of the class $k$ instances among the training instances in the $i$-th node:
For example: for the right-child node in the second layer (depth-1), samples = 100, value =[0,50,50]. The Gini score is given by
$$ G_i = 1 - (0/100)^2 - (50/100)^2 - (50/100)^2 = 0.5 $$
The Classification and Regression Tree (CART) algorithm trains a Decision Tree by splitting the training set into two subsets using a single feature $k$ and a threshold $t_k$. The parameters $k$ and $t_k$ are chosen so that the split produces the purest subsets, weighted by their size: $$ J(k, t_k) = \dfrac{M_{\rm left}}{M} G_{\rm left} + \dfrac{M_{\rm right}}{M} G_{\rm right}, $$ where $M_{\rm left/right}$ is the number of instances in the left/right node/subset.
Alternatively, one can use the entropy impurity measure:
$$ H_i = -\sum^K_{k=1, p_{i,k}\neq 0} p_{i, k} \log_2(p_{i,k}) $$
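A quick numerical check of the Gini and entropy impurities for the depth-1 right node above (value = [0, 50, 50]):
import numpy as np
p = np.array([0, 50, 50]) / 100                   # class ratios p_{i,k}
gini = 1 - np.sum(p**2)                           # = 0.5
entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))   # = 1.0
print(gini, entropy)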
Decision Trees versus Linear models:
Decision Trees are nonparametric and sensitive to small variations in the training data.
Linear models are parametric: the number of model parameters is fixed beforehand, which reduces the risk of overfitting to some extent.
For regression trees, the CART cost function minimizes the MSE instead of the impurity: $$ J(k, t_k) = \dfrac{M_{\rm left}}{M} {\rm MSE}_{\rm left} + \dfrac{M_{\rm right}}{M} {\rm MSE}_{\rm right} $$ where
$$ {\rm MSE}_{\rm node} = \sum_{m\in node} (\widehat y_{node} - y^{(m)})^2, \quad \widehat y_{node}=\dfrac{1}{M_{node}} \sum_{m \in node} y^{(m)} $$
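A minimal regression-tree sketch on hypothetical cubic data (max_depth=3 is an arbitrary choice), analogous to the classifier above: each leaf predicts the mean target value $\widehat y_{node}$ of its instances.
from sklearn.tree import DecisionTreeRegressor
import numpy as np
import matplotlib.pyplot as plt
X_reg = np.random.rand(100, 1) * 5
y_reg = X_reg**3 + 10*X_reg**2 + X_reg + 4
tree_reg = DecisionTreeRegressor(max_depth=3)
tree_reg.fit(X_reg, y_reg.ravel())
X_line = np.linspace(0, 5, 200).reshape(-1, 1)
plt.scatter(X_reg, y_reg, s=10, label="data")
plt.plot(X_line, tree_reg.predict(X_line), 'r', label="tree prediction (piecewise constant)")
plt.legend()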
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()
voting_clf = VotingClassifier(
estimators=[
('lr',log_clf),
('rf',rnd_clf),
('svc',svm_clf)],
voting = 'hard')
voting_clf.fit(X,y)
from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X,y)
    y_pred = clf.predict(X)
    # note: accuracy here is measured on the training set (no train/test split)
    print(clf.__class__.__name__, accuracy_score(y,y_pred))
    plot_decision_boundary(lambda x: clf.predict(x), X, y)
Use the same training algorithm for every predictor, but train each predictor on a different random subset of the training set.
Bagging: sampling is performed with replacement (bootstrap=True)
Pasting: sampling is performed without replacement
Aggregate the predictions of all predictors, taking the most frequent prediction (majority vote) for classification or the average for regression.
Aggregation reduces both bias and variance. Generally, the ensemble has a similar bias but a lower variance than a single predictor trained on the whole training set.
Bagging ends up with a slightly higher bias than Pasting, but the predictors are less correlated, so the ensemble variance is lower.
Image("images/bagging.png", width=800)
X=iris["data"][:,2:]
y=iris["target"]
X.shape,y.shape
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X,y)
bag_clf = BaggingClassifier(
DecisionTreeClassifier(),
n_estimators=500,
max_samples=100,
bootstrap=True,
n_jobs=-1,
oob_score=True)
bag_clf.fit(X,y)
y_pred = bag_clf.predict(X)
plot_decision_boundary(lambda x: bag_clf.predict(x), X, y)
bag_clf.oob_score_
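For comparison, a Pasting ensemble uses the same estimator with bootstrap=False (sampling without replacement); the oob score is only defined when bootstrapping, so it is omitted here. A minimal sketch reusing the X, y defined above:
past_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=False,   # Pasting: sampling without replacement
    n_jobs=-1)
past_clf.fit(X, y)
past_clf.score(X, y)   # accuracy on the training set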
plot_decision_boundary(lambda x: tree_clf.predict(x), X, y)
A Random Forest is an ensemble of Decision Trees trained via the Bagging method.
It also provides a measure of feature importance.
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=500,max_leaf_nodes=16,n_jobs=-1)
rnd_clf.fit(X,y)
for name, score in zip(iris['feature_names'][2:], rnd_clf.feature_importances_):  # X uses only petal length & width
print(name,score)
plot_decision_boundary(lambda x: rnd_clf.predict(x), X, y)
Comparison: a BaggingClassifier of Decision Trees with splitter='random' is roughly equivalent to a Random Forest.
bag_clf = BaggingClassifier(
DecisionTreeClassifier(
splitter='random',
max_leaf_nodes=16),
n_estimators=500,
max_samples=1.0,
bootstrap=True,
n_jobs=-1)
bag_clf.fit(X,y)
plot_decision_boundary(lambda x: bag_clf.predict(x), X, y)