 - 9 mins

# Intro

I though I would give some introduction on basic machine learning techniques and include some basic implementation with the explanations. I started with the topic below. Hope you will enjoy reading.

## Logistic regression

The main goal is to maximize the joint probability of the output label given the input data - in layman’s terms - make the predictions correct most of the time.

$p(y | x) = \hat{y}^{y}(1-\hat{y})^{(1-y)}$

Since this equation gives:

$p(y=1 | x) = \hat{y}, \quad p(y=0 | x) = (1- \hat{y})$

and we want the predictions to represent the probability of a binary classification problem. In logistic regression we apply a sigmoid function to the regression term in order to achieve [0, 1] range and make it a normalized probability:

$\hat{y} = sigmoid(a) = sigmoid(w^{T}x + b)$

Where x is a data point and w, b are the parameters of our regression model. In order to maximize the joint probability we take its logarithmic mapping since the logarithm is a continuous and monotonic function therefore for a data and label pair (x, y) our loss function is:

$- L(y, \hat{y}) = y\ln\hat{y} + (1 - y)\ln(1 - \hat{y}) = \ln p(y|x)$

We want to optimize this by stochastic gradient descent so we will need the gradients. For a dataset of

$\{\vec{x}^{(i)}, y^{(i)}\}_{i = 1}^{N}$

the computations are the following:

The i-th activation is:

$a^{(i)} = \vec{w}^{T}\vec{x}^{(i)} + b$

And the i-th prediction:

$\hat{y}^{(i)} = sigmoid(a^{(i)}) = \frac{1}{1 + e^{-a^{(i)}}}$

The i-th loss therefore:

$L(y^{(i)}, \hat{y}^{(i)}) = - \big(y\ln\hat{y}^{(i)} + (1 - y^{(i)})\ln(1 - \hat{y}^{(i)})\big)$

We still have to define the cost function which is the sum of the losses. If we do stochastic gradient descent we sample a batch of the dataset randomly and calculate the loss for

$1 \leq m < N$

samples.

$J(\vec{w}, b) = \frac{1}{m}\sum_{j = 1}^{m}L(y^{(i)}, \hat{y}^{(i)})$

The derivatives are the following:

$\frac{\partial L}{\partial \hat{y}^{(i)}} = \frac{1 - y^{(i)}}{1- \hat{y}^{(i)}} - \frac{y^{(i)}}{\hat{y}^{(i)}} \equiv d\hat{y}^{(i)}$

Since we know the chain rule we know that in order to get

$\frac{\partial L}{\partial a^{(i)}}$

we need:

$\frac{\partial \hat{y}^{(j)}}{\partial a^{(i)}} = sigmoid(-a^{(j)})sigmoid(a^{(j)})\delta_{ji} = (1 - \hat{y}^{(j)})\hat{y}^{(j)}\delta_{ij}$

Therefore:

$\frac{\partial L}{\partial a^{(i)}} = \sum_{j}\frac{\partial L}{\partial \hat{y}^{(j)}}\frac{\partial \hat{y}^{(j)}}{\partial a^{(i)}} = \hat{y}^{(i)} - y^{(i)} \equiv da^{(i)}$

Moving on the actual learnable parameters

$\frac{\partial L}{\partial w_{p}} = \sum_{q}\frac{\partial L}{\partial a^{(q)}}\frac{\partial a^{(q)}}{\partial w_{p}} = \sum_{q} ( \hat{y}^{(q)} - y^{(q)} ) x^{(q)}_{p} \equiv dw_{p}$ $\frac{\partial L}{\partial b} = \sum_{q}\frac{\partial L}{\partial a^{(q)}}\frac{\partial a^{(q)}}{\partial b} = \sum_{q} ( \hat{y}^{(q)} - y^{(q)} ) \cdot 1 \equiv db$

## Logistic regression in code with L2 weight regularization

Now the same in code:

import numpy as np
from sklearn.metrics import accuracy_score


Using sklearn a default binary classification dataset can be loaded and a metric can be used to test accuracy.

X, y = load_breast_cancer(return_X_y=True)

y.reshape(y.shape, 1)

print('Data shape : ', X.shape, '\t labels shape : ', y.shape)

Data shape :  (569, 30) 	 labels shape :  (569,)

n_samples = X.shape

test_split = np.random.choice(n_samples, size=n_samples // 10, replace=False)
assert np.unique(test_split).size == n_samples // 10, 'Test split size constraint failed'

train_split = list(set(range(0, n_samples)) - set(test_split))
assert np.unique(train_split).size == n_samples - n_samples // 10, 'Train split size constraint failed'

X_train, y_train, X_test, y_test = X[train_split], y[train_split], X[test_split], y[test_split]
y_train = y_train.reshape(n_samples - n_samples // 10, 1)
y_test = y_test.reshape(n_samples // 10, 1)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((513, 30), (513, 1), (56, 30), (56, 1))


Here I implemented some custom splitting technique with numpy only but the default train_test_split method can be used from sklearn as well. Normalizing the dataset is also important to ensure numerical stability.

train_std, train_mean = np.std(X_train, axis=0), np.mean(X_train, axis=0)
test_std, test_mean = np.std(X_test, axis=0), np.mean(X_test, axis=0)

X_train -= train_mean
X_test -= test_mean
X_train /= train_std
X_test /= test_std

X_train.shape, X_test.shape

((513, 30), (56, 30))

def sigmoid(x):
return 1. / (1. + np.exp(-x))

sigmoid = np.vectorize(sigmoid)

learning_rate = 5e-4
m = 750
max_iter = 5_000

# regularization
alpha = 10e-3

w = np.random.randn(1, X_train.shape)
b = np.random.randn()

for _iter in range(max_iter):
batch_indices = np.random.choice(X_train.shape, size = m, replace = True)
batch = X_train[batch_indices].T
_y = y_train[batch_indices].reshape(1, m)
assert batch.shape == (X_train.shape, m), 'Batch shape constraint failed'
activation = np.dot(w, batch) + b
assert activation.shape == (1, m), 'Activation shape constraint failed'
y_hat = sigmoid(activation)
assert y_hat.shape == activation.shape, 'Prediction shape constraint failed'

if _iter % 500 == 0:
# Calculate current accuracy
y_pred_test = np.dot(w, X_test.T) + b
y_pred_test[y_pred_test >= 0.5] = 1
y_pred_test[y_pred_test < 0.5] = 0
print('Iter : %d | Current accuracy on test set %.2f%%: ' % (_iter,
100 * accuracy_score(y_test, y_pred_test.T)))

da = y_hat - _y
assert da.shape == activation.shape, 'Activation derivative shape constraint failed'
dw = np.matmul(batch, da.T).T
assert dw.shape == w.shape, 'Weights derivative shape constraint failed'
db = np.sum(da)
# Update parameters
w -= learning_rate * ( dw + 2. * alpha * np.linalg.norm(dw) )
b -= learning_rate * ( db + 2. * alpha * np.abs(db) )

y_pred_test = np.dot(w, X_test.T) + b
y_pred_test[y_pred_test >= 0.5] = 1
y_pred_test[y_pred_test < 0.5] = 0
print('Final accuracy : %.2f%%' % (100 * accuracy_score(y_test, y_pred_test.T)))

Iter : 0 | Current accuracy on test set 16.07%:
Iter : 500 | Current accuracy on test set 98.21%:
Iter : 1000 | Current accuracy on test set 96.43%:
Iter : 1500 | Current accuracy on test set 96.43%:
Iter : 2000 | Current accuracy on test set 96.43%:
Iter : 2500 | Current accuracy on test set 98.21%:
Iter : 3000 | Current accuracy on test set 98.21%:
Iter : 3500 | Current accuracy on test set 98.21%:
Iter : 4000 | Current accuracy on test set 98.21%:
Iter : 4500 | Current accuracy on test set 98.21%:
Final accuracy : 98.21%


Here the final accuracy is 98.21%. Which seems to be pretty amazing for such a simple algorithm.