# Logistic regression

- 9 mins# Intro

I though I would give some introduction on basic machine learning techniques and include some basic implementation with the explanations. I started with the topic below. Hope you will enjoy reading.

## Logistic regression

The main goal is to maximize the joint probability of the output label given the input data - in layman’s terms - make the predictions correct most of the time.

\[p(y | x) = \hat{y}^{y}(1-\hat{y})^{(1-y)}\]Since this equation gives:

\[p(y=1 | x) = \hat{y}, \quad p(y=0 | x) = (1- \hat{y})\]and we want the predictions to represent the probability of a binary classification problem. In logistic regression we apply a sigmoid function to the regression term in order to achieve [0, 1] range and make it a normalized probability:

\[\hat{y} = sigmoid(a) = sigmoid(w^{T}x + b)\]Where ** x** is a data point and

**,**

*w***are the parameters of our regression model. In order to maximize the joint probability we take its logarithmic mapping since the logarithm is a continuous and monotonic function therefore for a data and label pair (x, y) our loss function is:**

*b*We want to optimize this by stochastic gradient descent so we will need the gradients. For a dataset of

\[\{\vec{x}^{(i)}, y^{(i)}\}_{i = 1}^{N}\]the computations are the following:

The ** i**-th activation is:

And the ** i**-th prediction:

The ** i**-th loss therefore:

We still have to define the cost function which is the sum of the losses. If we do stochastic gradient descent we sample a batch of the dataset randomly and calculate the loss for

\[1 \leq m < N\]samples.

\[J(\vec{w}, b) = \frac{1}{m}\sum_{j = 1}^{m}L(y^{(i)}, \hat{y}^{(i)})\]The derivatives are the following:

\[\frac{\partial L}{\partial \hat{y}^{(i)}} = \frac{1 - y^{(i)}}{1- \hat{y}^{(i)}} - \frac{y^{(i)}}{\hat{y}^{(i)}} \equiv d\hat{y}^{(i)}\]Since we know the chain rule we know that in order to get

\[\frac{\partial L}{\partial a^{(i)}}\]we need:

\[\frac{\partial \hat{y}^{(j)}}{\partial a^{(i)}} = sigmoid(-a^{(j)})sigmoid(a^{(j)})\delta_{ji} = (1 - \hat{y}^{(j)})\hat{y}^{(j)}\delta_{ij}\]Therefore:

\[\frac{\partial L}{\partial a^{(i)}} = \sum_{j}\frac{\partial L}{\partial \hat{y}^{(j)}}\frac{\partial \hat{y}^{(j)}}{\partial a^{(i)}} = \hat{y}^{(i)} - y^{(i)} \equiv da^{(i)}\]Moving on the actual learnable parameters

\[\frac{\partial L}{\partial w_{p}} = \sum_{q}\frac{\partial L}{\partial a^{(q)}}\frac{\partial a^{(q)}}{\partial w_{p}} = \sum_{q} ( \hat{y}^{(q)} - y^{(q)} ) x^{(q)}_{p} \equiv dw_{p}\] \[\frac{\partial L}{\partial b} = \sum_{q}\frac{\partial L}{\partial a^{(q)}}\frac{\partial a^{(q)}}{\partial b} = \sum_{q} ( \hat{y}^{(q)} - y^{(q)} ) \cdot 1 \equiv db\]## Logistic regression in code with L2 weight regularization

Now the same in code:

```
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
```

Using `sklearn`

a default binary classification dataset can be loaded and a metric can be used to test accuracy.

```
X, y = load_breast_cancer(return_X_y=True)
y.reshape(y.shape[0], 1)
print('Data shape : ', X.shape, '\t labels shape : ', y.shape)
```

```
Data shape : (569, 30) labels shape : (569,)
```

```
n_samples = X.shape[0]
test_split = np.random.choice(n_samples, size=n_samples // 10, replace=False)
assert np.unique(test_split).size == n_samples // 10, 'Test split size constraint failed'
train_split = list(set(range(0, n_samples)) - set(test_split))
assert np.unique(train_split).size == n_samples - n_samples // 10, 'Train split size constraint failed'
X_train, y_train, X_test, y_test = X[train_split], y[train_split], X[test_split], y[test_split]
y_train = y_train.reshape(n_samples - n_samples // 10, 1)
y_test = y_test.reshape(n_samples // 10, 1)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
```

```
((513, 30), (513, 1), (56, 30), (56, 1))
```

Here I implemented some custom splitting technique with `numpy`

only but the default `train_test_split`

method can be used from `sklearn`

as well. Normalizing the dataset is also important to ensure numerical stability.

```
train_std, train_mean = np.std(X_train, axis=0), np.mean(X_train, axis=0)
test_std, test_mean = np.std(X_test, axis=0), np.mean(X_test, axis=0)
X_train -= train_mean
X_test -= test_mean
X_train /= train_std
X_test /= test_std
X_train.shape, X_test.shape
```

```
((513, 30), (56, 30))
```

```
def sigmoid(x):
return 1. / (1. + np.exp(-x))
sigmoid = np.vectorize(sigmoid)
learning_rate = 5e-4
m = 750
max_iter = 5_000
# regularization
alpha = 10e-3
w = np.random.randn(1, X_train.shape[1])
b = np.random.randn()
for _iter in range(max_iter):
batch_indices = np.random.choice(X_train.shape[0], size = m, replace = True)
batch = X_train[batch_indices].T
_y = y_train[batch_indices].reshape(1, m)
assert batch.shape == (X_train.shape[1], m), 'Batch shape constraint failed'
activation = np.dot(w, batch) + b
assert activation.shape == (1, m), 'Activation shape constraint failed'
y_hat = sigmoid(activation)
assert y_hat.shape == activation.shape, 'Prediction shape constraint failed'
if _iter % 500 == 0:
# Calculate current accuracy
y_pred_test = np.dot(w, X_test.T) + b
y_pred_test[y_pred_test >= 0.5] = 1
y_pred_test[y_pred_test < 0.5] = 0
print('Iter : %d | Current accuracy on test set %.2f%%: ' % (_iter,
100 * accuracy_score(y_test, y_pred_test.T)))
da = y_hat - _y
assert da.shape == activation.shape, 'Activation derivative shape constraint failed'
dw = np.matmul(batch, da.T).T
assert dw.shape == w.shape, 'Weights derivative shape constraint failed'
db = np.sum(da)
# Update parameters
w -= learning_rate * ( dw + 2. * alpha * np.linalg.norm(dw) )
b -= learning_rate * ( db + 2. * alpha * np.abs(db) )
y_pred_test = np.dot(w, X_test.T) + b
y_pred_test[y_pred_test >= 0.5] = 1
y_pred_test[y_pred_test < 0.5] = 0
print('Final accuracy : %.2f%%' % (100 * accuracy_score(y_test, y_pred_test.T)))
```

```
Iter : 0 | Current accuracy on test set 16.07%:
Iter : 500 | Current accuracy on test set 98.21%:
Iter : 1000 | Current accuracy on test set 96.43%:
Iter : 1500 | Current accuracy on test set 96.43%:
Iter : 2000 | Current accuracy on test set 96.43%:
Iter : 2500 | Current accuracy on test set 98.21%:
Iter : 3000 | Current accuracy on test set 98.21%:
Iter : 3500 | Current accuracy on test set 98.21%:
Iter : 4000 | Current accuracy on test set 98.21%:
Iter : 4500 | Current accuracy on test set 98.21%:
Final accuracy : 98.21%
```

Here the final accuracy is 98.21%. Which seems to be pretty amazing for such a simple algorithm.