DeepLearning Basics

1 Gradient Descent

What is Gradient Descent?

Gradient Descent Process

At each step, the parameters move in the direction of steepest descent (the negative gradient), eventually arriving at the point with the lowest loss.
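As a reminder, each gradient descent step moves the parameters $\theta$ against the gradient of the loss $\mathcal{L}$, scaled by a learning rate $\eta$ (notation introduced here just for illustration):
$$
\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}(\theta)
$$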

2 Linear model

2.1 Linear regression

  • We will learn an optimization algorithm, gradient descent, to optimize this model.

  • Linear regression is a very simple supervised learning model, and gradient descent is the most widely used optimization algorithm in deep learning; the model we will fit is written out just below.
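The univariate linear model predicts

$$
\hat{y}_i = w x_i + b
$$

and we will use gradient descent to find the $w$ and $b$ that minimize the mean squared error defined in Section 2.1.3.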

import torch
import numpy as np
# IMPORT Variable
from torch.autograd import Variable 
# Sets the seed for generating random numbers. 
torch.manual_seed(2017)
<torch._C.Generator at 0x112959630>

manual_seed(seed) → Generator

Sets the seed for generating random numbers. Returns a torch.Generator object. It is recommended to set a large seed, i.e. a number that has a good balance of 0 and 1 bits. Avoid having many 0 bits in the seed.
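A quick way to see what the seed does (a minimal sketch, not part of the original notebook): re-seeding the generator makes random draws reproducible.

torch.manual_seed(2017)
a = torch.randn(2)        # some random numbers
torch.manual_seed(2017)
b = torch.randn(2)        # re-seeding reproduces the same numbers
print(torch.equal(a, b))  # True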

Linear Regression - Gradient Descent

2.1.1 Input Data & Preprocessing

# put in data x & y
x_train = np.array([[3.3], [4.4], [5.5], [6.71], [6.93], [4.168],
                    [9.779], [6.182], [7.59], [2.167], [7.042],
                    [10.791], [5.313], [7.997], [3.1]], dtype=np.float32)

y_train = np.array([[1.7], [2.76], [2.09], [3.19], [1.694], [1.573],
                    [3.366], [2.596], [2.53], [1.221], [2.827],
                    [3.465], [1.65], [2.904], [1.3]], dtype=np.float32)
# draw the graph
import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(x_train, y_train, 'bo')
[<matplotlib.lines.Line2D at 0x11565e588>]

output

# CONVERT to Tensor
x_train = torch.from_numpy(x_train) # torch.from_numpy()
y_train = torch.from_numpy(y_train)

2.1.2 Define Params & Model

# Define parameters w & b
w = Variable(torch.randn(1), requires_grad=True) # Random initialization
b = Variable(torch.zeros(1), requires_grad=True) # Initialize with 0
x_train = Variable(x_train)
y_train = Variable(y_train)
# Define Model
def linear_model(x):
    return x * w + b # return prediction
y_ = linear_model(x_train) # y_ --> Prediction

After the above steps, we have defined the model. Before updating the parameters, let's first look at what the model's output looks like.

plt.plot(x_train.data.numpy(), y_train.data.numpy(), 'bo', label='real')
plt.plot(x_train.data.numpy(), y_.data.numpy(), 'ro', label='estimated')
plt.legend()
<matplotlib.legend.Legend at 0x11577da58>

output

2.1.3 Define Loss Function

Now we need to compute our loss function, the mean squared error:
$$
\frac{1}{n} \sum_{i=1}^n(\hat{y}_i - y_i)^2
$$

# Calculate the loss
# NOTE: 'y_' is the prediction, 'y' is the real value
def get_loss(y_, y):
    return torch.mean((y_ - y) ** 2) # (prediction - real value)^2

loss = get_loss(y_, y_train) # pass in the real values
# Print to see the size of loss
print(loss)
Variable containing:
 153.3520
[torch.FloatTensor of size 1]

2.1.4 BackPropagation & Get Gradient

The error function is defined. Next we need to compute the gradients of $w$ and $b$. Thanks to PyTorch's automatic differentiation, we do not need to calculate them by hand. The gradients of $w$ and $b$ are:

$$
\frac{\partial}{\partial w} = \frac{2}{n} \sum_{i=1}^n x_i(w x_i + b - y_i)
$$

$$
\frac{\partial}{\partial b} = \frac{2}{n} \sum_{i=1}^n (w x_i + b - y_i)
$$
# Automatic derivation
loss.backward()
# View the gradients of w and b
print(w.grad)
print(b.grad)
Variable containing:
 161.0043
[torch.FloatTensor of size 1]

Variable containing:
 22.8730
[torch.FloatTensor of size 1]

2.1.5 Update Model Parameters

# Update parameters once
w.data = w.data - 1e-2 * w.grad.data # ATTENTION!
b.data = b.data - 1e-2 * b.grad.data # 1e-2 Learning Rate

After updating the parameters once, let's take a look at the model output again.

y_ = linear_model(x_train)
plt.plot(x_train.data.numpy(), y_train.data.numpy(), 'bo', label='real')
plt.plot(x_train.data.numpy(), y_.data.numpy(), 'ro', label='estimated')
plt.legend()
<matplotlib.legend.Legend at 0x11588b358>

output

As the plot shows, after a single update the red predictions have shifted but still lie below the blue points; the fit to the true values is not yet good, so we need to update several more times.

2.1.6 Repeat the Parameter Updates

for e in range(10): # perform 10 updates
    y_ = linear_model(x_train) # prediction
    loss = get_loss(y_, y_train)

    w.grad.zero_() # remember to zero the gradients
    b.grad.zero_() # remember to zero the gradients
    loss.backward()
    # UPDATE
    w.data = w.data - 1e-2 * w.grad.data # update w
    b.data = b.data - 1e-2 * b.grad.data # update b
    # PRINT output
    print('epoch: {}, loss: {}'.format(e, loss.data[0]))
epoch: 0, loss: 3.1357719898223877
epoch: 1, loss: 0.3550889194011688
epoch: 2, loss: 0.30295443534851074
epoch: 3, loss: 0.30131956934928894
epoch: 4, loss: 0.3006229102611542
epoch: 5, loss: 0.29994693398475647
epoch: 6, loss: 0.299274742603302
epoch: 7, loss: 0.2986060082912445
epoch: 8, loss: 0.2979407012462616
epoch: 9, loss: 0.29727882146835327
y_ = linear_model(x_train)
plt.plot(x_train.data.numpy(), y_train.data.numpy(), 'bo', label='real')
plt.plot(x_train.data.numpy(), y_.data.numpy(), 'ro', label='estimated')
plt.legend()
<matplotlib.legend.Legend at 0x11598e0f0>

output

After 10 updates, the red predictions fit the blue true values much better.

2.2 Polynomial Linear Regression

First, let's define the target function to be fitted: a cubic polynomial.

# Define a multivariate function

w_target = np.array([0.5, 3, 2.4]) # define parameter
b_target = np.array([0.9]) # define parameter

f_des = 'y = {:.2f} + {:.2f} * x + {:.2f} * x^2 + {:.2f} * x^3'.format(
    b_target[0], w_target[0], w_target[1], w_target[2]) # Print out the expression of the function

print(f_des)
y = 0.90 + 0.50 * x + 3.00 * x^2 + 2.40 * x^3

We can first plot this polynomial.

# Draw the curve of this function
x_sample = np.arange(-3, 3.1, 0.1)
y_sample = b_target[0] + w_target[0] * x_sample + w_target[1] * x_sample ** 2 + w_target[2] * x_sample ** 3

plt.plot(x_sample, y_sample, label='real curve')
plt.legend()
<matplotlib.legend.Legend at 0x1158f5e48>

output

Next we build the dataset. We need x and y; since the target is a cubic polynomial, the input features are $x,\ x^2,\ x^3$.

# build data x & y
# x is a Matrix [x, x^2, x^3]
# y is the result of the function [y]

x_train = np.stack([x_sample ** i for i in range(1, 4)], axis=1)
x_train = torch.from_numpy(x_train).float() # CONVERT to float tensor

y_train = torch.from_numpy(y_sample).float().unsqueeze(1) # CONVERT to float tensor 

Then we define the parameters to be optimized: the coefficients $w_i$ and the bias $b$ of the function above.

# define parameter & model
w = Variable(torch.randn(3, 1), requires_grad=True)
b = Variable(torch.zeros(1), requires_grad=True)

# Convert x and y to Variable
x_train = Variable(x_train)
y_train = Variable(y_train)

def multi_linear(x):
    return torch.mm(x, w) + b

We can plot the model before any updates against the real curve for comparison.

# Draw the model before the update
y_pred = multi_linear(x_train)

plt.plot(x_train.data.numpy()[:, 0], y_pred.data.numpy(), label='fitting curve', color='r')
plt.plot(x_train.data.numpy()[:, 0], y_sample, label='real curve', color='b')
plt.legend()
<matplotlib.legend.Legend at 0x115b8cc50>

output

There is a clear difference between the two curves, so let's compute the error between them.

# Calculate the error; this is the same MSE as in the univariate linear model (get_loss was defined earlier).
loss = get_loss(y_pred, y_train)
print(loss)
Variable containing:
 413.9843
[torch.FloatTensor of size 1]
# Automatic derivation
loss.backward()
# Take a look at the gradients of w and b
print(w.grad)
print(b.grad)
Variable containing:
 -34.1391
-146.6133
-215.9148
[torch.FloatTensor of size 3x1]

Variable containing:
-27.0838
[torch.FloatTensor of size 1]
# update the parameter
w.data = w.data - 0.001 * w.grad.data
b.data = b.data - 0.001 * b.grad.data
# Draw the model after updating once
y_pred = multi_linear(x_train)

plt.plot(x_train.data.numpy()[:, 0], y_pred.data.numpy(), label='fitting curve', color='r')
plt.plot(x_train.data.numpy()[:, 0], y_sample, label='real curve', color='b')
plt.legend()
<matplotlib.legend.Legend at 0x1164c6d30>

output

Since only one update has been made, the two curves still differ; let's run 100 iterations.

# Perform 100 parameter updates
for e in range(100):
    y_pred = multi_linear(x_train)
    loss = get_loss(y_pred, y_train)

    w.grad.data.zero_()
    b.grad.data.zero_()
    loss.backward()

    # update the parameters
    w.data = w.data - 0.001 * w.grad.data
    b.data = b.data - 0.001 * b.grad.data
    if (e + 1) % 20 == 0:
        print('epoch {}, Loss: {:.5f}'.format(e+1, loss.data[0]))
epoch 20, Loss: 73.67840
epoch 40, Loss: 17.97097
epoch 60, Loss: 4.94101
epoch 80, Loss: 1.87171
epoch 100, Loss: 1.12812

You can see that the loss is already quite small after the updates. Let's plot the curves again to compare.

# Draw the results after the update
y_pred = multi_linear(x_train)

plt.plot(x_train.data.numpy()[:, 0], y_pred.data.numpy(), label='fitting curve', color='r')
plt.plot(x_train.data.numpy()[:, 0], y_sample, label='real curve', color='b')
plt.legend()
<matplotlib.legend.Legend at 0x1164e8278>

output

After 100 updates, the fitted curve and the real curve almost completely coincide.

2.3 Logistic Regression

2.3.1 Sigmoid Function

Sigmoid Func

2.3.2 Logistic Regression

Logistic Regression
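The figure above summarizes the model. In short (using the same notation as the code later in this section), logistic regression feeds a linear model through the Sigmoid function:

$$
\hat{y} = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}
$$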

2.3.3 PyTorch Implementation

import torch
from torch.autograd import Variable
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Set the random seed
torch.manual_seed(2017)
<torch._C.Generator at 0x108f3c5f0>

1. Input Data & Preprocessing

After reading in the data points, we split them into red and blue according to their labels and plot them.

# READ FROM data.txt
with open('./data.txt', 'r') as f:
    data_list = [i.split('\n')[0].split(',') for i in f.readlines()]
    data = [(float(i[0]), float(i[1]), float(i[2])) for i in data_list]

# Standardize each feature by its maximum
x0_max = max([i[0] for i in data])
x1_max = max([i[1] for i in data])
data = [(i[0]/x0_max, i[1]/x1_max, i[2]) for i in data]

x0 = list(filter(lambda x: x[-1] == 0.0, data)) # points of the first class
x1 = list(filter(lambda x: x[-1] == 1.0, data)) # points of the second class

plot_x0 = [i[0] for i in x0]
plot_y0 = [i[1] for i in x0]
plot_x1 = [i[0] for i in x1]
plot_y1 = [i[1] for i in x1]

plt.plot(plot_x0, plot_y0, 'ro', label='x_0')
plt.plot(plot_x1, plot_y1, 'bo', label='x_1')
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x108137c50>

output

Next we convert the data to a NumPy array, then to Tensors, to prepare for the training that follows.

np_data = np.array(data, dtype='float32') # CONVERT to numpy array
x_data = torch.from_numpy(np_data[:, 0:2]) # CONVERT TO Tensor, SIZE [100, 2]
y_data = torch.from_numpy(np_data[:, -1]).unsqueeze(1) # CONVERT to Tensor,SIZE [100, 1]

2. Define Params & Model

Pytorch-sigmoid

Let's first implement the Sigmoid function ourselves. Its formula is
$$
f(x) = \frac{1}{1 + e^{-x}}
$$

# DEFINE sigmoid FUNC
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

Plotting the Sigmoid function, you can see that the larger the input, the closer the output is to 1, and the smaller the input, the closer it is to 0.

# DRAW sigmoid graph

plot_x = np.arange(-10, 10.01, 0.01)
plot_y = sigmoid(plot_x)

plt.plot(plot_x, plot_y, 'r')
[<matplotlib.lines.Line2D at 0x10be61908>]

output

x_data = Variable(x_data)
y_data = Variable(y_data)

In PyTorch, we don't need to write the Sigmoid function ourselves. PyTorch provides many commonly used functions implemented in the underlying C++ code, which is not only convenient but also faster and more stable than our own implementation.

These functions are accessed through torch.nn.functional; the usage is shown below.

import torch.nn.functional as F
# DEFINE logistic regression Model
w = Variable(torch.randn(2, 1), requires_grad=True) 
b = Variable(torch.zeros(1), requires_grad=True)

def logistic_regression(x):
    return F.sigmoid(torch.mm(x, w) + b)

Before any updates, we can plot the classification result.

# Plot the result before updating the parameters
w0 = w[0].data[0]
w1 = w[1].data[0]
b0 = b.data[0]

plot_x = np.arange(0.2, 1, 0.01)
plot_y = (-w0 * plot_x - b0) / w1

plt.plot(plot_x, plot_y, 'g', label='cutting line')
plt.plot(plot_x0, plot_y0, 'ro', label='x_0')
plt.plot(plot_x1, plot_y1, 'bo', label='x_1')
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x10bf66c18>

output

3. Define Loss Function

You can see that the classification is essentially random at this point.

Pytorch-sigmoid - 2

Let's compute the loss; the formula is as follows:
$$
loss = -\big(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\big)
$$

# calculate the loss
def binary_loss(y_pred, y):
    logits = (y * y_pred.clamp(1e-12).log() + (1 - y) * (1 - y_pred).clamp(1e-12).log()).mean()
    return -logits

Clamp all elements in input into the range [min, max] and return a resulting Tensor.

>>> a = torch.randn(4)
>>> a

 1.3869
 0.3912
-0.8634
-0.5468
[torch.FloatTensor of size 4]

>>> torch.clamp(a, min=-0.5, max=0.5)

 0.5000
 0.3912
-0.5000
-0.5000
[torch.FloatTensor of size 4]

Tip: look at a plot of the log function: log(x) diverges to negative infinity as x approaches 0, which is why we clamp the predictions.
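A quick sketch of that plot (not from the original notebook; it just reuses the NumPy and matplotlib imports above):

plot_x = np.arange(0.001, 1.0, 0.001)
plt.plot(plot_x, np.log(plot_x), 'b') # log(x) -> -inf as x -> 0, so predictions must be clamped away from 0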

y_pred = logistic_regression(x_data)
loss = binary_loss(y_pred, y_data)
print(loss)
Variable containing:
 0.6412
[torch.FloatTensor of size 1]

After obtaining the loss, we again use gradient descent to update the parameters. Here automatic differentiation gives us the gradients of the parameters directly; interested readers can derive the gradient formula by hand.
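For reference (a by-hand derivation under the mean binary cross-entropy loss above, not shown in the original), the gradients work out to:

$$
\frac{\partial \mathcal{L}}{\partial w} = \frac{1}{n}\sum_{i=1}^n (\hat{y}_i - y_i)\, x_i, \qquad \frac{\partial \mathcal{L}}{\partial b} = \frac{1}{n}\sum_{i=1}^n (\hat{y}_i - y_i)
$$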

4. BackPropagation & Get Gradient

# automatic derivation
loss.backward()
w.data = w.data - 0.1 * w.grad.data
b.data = b.data - 0.1 * b.grad.data

# Calculate the loss after an update
y_pred = logistic_regression(x_data)
loss = binary_loss(y_pred, y_data)
print(loss)
Variable containing:
 0.6407
[torch.FloatTensor of size 1]

The parameter update above is a tedious, repetitive operation. If we had many parameters, say 100, we would need 100 lines just to update them. For convenience we could wrap the update in a function; in fact, PyTorch already provides one for us: the optimizer, torch.optim.

Using torch.optim requires another data type, nn.Parameter, which is essentially the same as Variable, except that nn.Parameter requires gradients by default, whereas Variable does not.
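A minimal check of those defaults (a sketch, not from the original notebook; it uses the 0.3-era Variable API from above):

from torch import nn
p = nn.Parameter(torch.randn(2, 1))
v = Variable(torch.randn(2, 1))
print(p.requires_grad) # True  -- nn.Parameter requires gradients by default
print(v.requires_grad) # False -- Variable does not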


5. Update Model Parameters

torch.optim.SGD updates the parameters with gradient descent; PyTorch's optimizer module offers many more optimization algorithms, which I will cover in more detail later in this chapter.

After passing the parameters w and b to torch.optim.SGD and specifying the learning rate, we can call optimizer.step() to update them. Below we pass the parameters to the optimizer with the learning rate set to 1.0.


# Update parameters with torch.optim
from torch import nn
w = nn.Parameter(torch.randn(2, 1))
b = nn.Parameter(torch.zeros(1))

def logistic_regression(x):
    return F.sigmoid(torch.mm(x, w) + b)

optimizer = torch.optim.SGD([w, b], lr=1.)
# update 1000 times
import time

start = time.time()
for e in range(1000):
    # forward propagation
    y_pred = logistic_regression(x_data)
    loss = binary_loss(y_pred, y_data) # get loss
    # backpropagation
    optimizer.zero_grad() # use the optimizer to zero the gradients
    loss.backward()
    optimizer.step() # use the optimizer to update the parameters
    # Calculation accuracy
    mask = y_pred.ge(0.5).float()
    acc = (mask == y_data).sum().data[0] / y_data.shape[0]
    if (e + 1) % 200 == 0:
        print('epoch: {}, Loss: {:.5f}, Acc: {:.5f}'.format(e+1, loss.data[0], acc))
during = time.time() - start
print()
print('During Time: {:.3f} s'.format(during))
epoch: 200, Loss: 0.39730, Acc: 0.92000
epoch: 400, Loss: 0.32458, Acc: 0.92000
epoch: 600, Loss: 0.29065, Acc: 0.91000
epoch: 800, Loss: 0.27077, Acc: 0.91000
epoch: 1000, Loss: 0.25765, Acc: 0.90000

During Time: 0.595 s

You can see that updating parameters with the optimizer is very simple: just call optimizer.zero_grad() to zero the gradients before automatic differentiation, then call optimizer.step() to update the parameters.

After 1000 updates, the loss has also dropped quite low.

Below we plot the result after the updates.

# Plot the result after the updates
w0 = w[0].data[0]
w1 = w[1].data[0]
b0 = b.data[0]

plot_x = np.arange(0.2, 1, 0.01)
plot_y = (-w0 * plot_x - b0) / w1

plt.plot(plot_x, plot_y, 'g', label='cutting line')
plt.plot(plot_x0, plot_y0, 'ro', label='x_0')
plt.plot(plot_x1, plot_y1, 'bo', label='x_1')
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x10c08ec50>

output

6. Built-in Loss

You can see that after the updates the model can basically separate the two classes of points.

Earlier we used a loss we wrote ourselves. In fact, PyTorch provides many common losses: for example, the loss for linear regression is nn.MSELoss(), and the binary classification loss for logistic regression is nn.BCEWithLogitsLoss(). For more losses, see the documentation.
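For instance, the hand-written MSE from Section 2.1.3 could be replaced by the built-in version (a minimal sketch, reusing the y_ and y_train names from that section):

criterion_mse = nn.MSELoss()
loss = criterion_mse(y_, y_train) # equivalent to our hand-written get_loss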

The loss functions PyTorch implements for us have two advantages. First, they are convenient to use, so we don't have to reinvent the wheel. Second, they are implemented in the underlying C++ code, so they are faster and more stable than our own implementations.

In addition, for numerical stability PyTorch fuses the model's Sigmoid operation and the loss inside nn.BCEWithLogitsLoss(), so when using this built-in loss we must not apply the Sigmoid in the model ourselves.
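As a quick sanity check (a sketch, not from the original post), BCEWithLogitsLoss on raw logits should agree with applying Sigmoid and then the plain BCE loss:

logits = Variable(torch.randn(5, 1))
targets = Variable(torch.rand(5, 1).round())
loss_fused = nn.BCEWithLogitsLoss()(logits, targets)
loss_plain = nn.BCELoss()(F.sigmoid(logits), targets)
print(loss_fused.data[0], loss_plain.data[0]) # the two values agree up to floating point error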


# use built-in loss
criterion = nn.BCEWithLogitsLoss() # Sigmoid and the loss are fused in one layer for better speed and stability

w = nn.Parameter(torch.randn(2, 1))
b = nn.Parameter(torch.zeros(1))

def logistic_reg(x):
    return torch.mm(x, w) + b

optimizer = torch.optim.SGD([w, b], 1.)
y_pred = logistic_reg(x_data)
loss = criterion(y_pred, y_data)
print(loss.data)
 0.6363
[torch.FloatTensor of size 1]
# updates 1000 times

start = time.time()
for e in range(1000):
    # forwardPropagation
    y_pred = logistic_reg(x_data)
    loss = criterion(y_pred, y_data)
    # backwardPropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # calculate the Accuracy
    mask = y_pred.ge(0.).float() # y_pred are raw logits here, so threshold at 0 (i.e. sigmoid(y_pred) >= 0.5)
    acc = (mask == y_data).sum().data[0] / y_data.shape[0]
    if (e + 1) % 200 == 0:
        print('epoch: {}, Loss: {:.5f}, Acc: {:.5f}'.format(e+1, loss.data[0], acc))

during = time.time() - start
print()
print('During Time: {:.3f} s'.format(during))
epoch: 200, Loss: 0.39538, Acc: 0.88000
epoch: 400, Loss: 0.32407, Acc: 0.87000
epoch: 600, Loss: 0.29039, Acc: 0.87000
epoch: 800, Loss: 0.27061, Acc: 0.87000
epoch: 1000, Loss: 0.25753, Acc: 0.88000

During Time: 0.527 s

You can see that using PyTorch's built-in loss is a bit faster. The improvement looks small here because this is only a tiny network; for large networks, the built-in losses are a qualitative leap in both stability and speed, and they save us from reinventing the wheel.

2.3.4 More Loss Functions

More Loss Function
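A few other commonly used built-in losses (a brief sketch, not from the original post; see the PyTorch documentation for the full list):

criterion_ce = nn.CrossEntropyLoss() # multi-class classification, takes raw logits
criterion_l1 = nn.L1Loss()           # mean absolute error, for regression
criterion_nll = nn.NLLLoss()         # negative log-likelihood, takes log-probabilities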

3 Optimizer

3.1 Overview

Optimizer
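As a preview (a minimal sketch, not from the original post; the learning rate here is an arbitrary illustration), every optimizer in torch.optim exposes the same zero_grad()/step() interface we used with SGD above:

optimizer = torch.optim.Adam([w, b], lr=1e-2) # drop-in replacement for torch.optim.SGD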




《DeepLearning Basics》 by David Qiao is licensed under a Creative Commons Attribution 4.0 International License