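Linear Regression

This section covers:

The basic elements of linear regression
Implementing a linear regression model from scratch
A concise implementation of linear regression with PyTorch

The Basic Elements of Linear Regression

Model

To keep things simple, assume that a house's price depends on only two factors: its area (in square meters) and its age (in years). We want to find out how the price relates to these two factors. Linear regression assumes that the output is a linear function of each input: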

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$
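As a quick illustration of the model, the sketch below evaluates the formula for one house. The weights, bias, and house values are made-up numbers chosen only for the example, and `predict_price` is a hypothetical helper that does not appear in the original text.

```python
def predict_price(area, age, w_area, w_age, b):
    """Linear model: price = w_area * area + w_age * age + b."""
    return w_area * area + w_age * age + b

# Hypothetical parameters (assumptions for illustration only)
w_area, w_age, b = 3.0, -0.5, 10.0

# A 100 m^2 house that is 20 years old
price = predict_price(area=100.0, age=20.0, w_area=w_area, w_age=w_age, b=b)
print(price)  # 3.0 * 100 - 0.5 * 20 + 10 = 300.0
```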

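Dataset

We usually collect a set of real data, for example the actual sale prices of many houses together with their areas and ages. We then look for model parameters that minimize the error between the predicted and actual prices on this data. In machine learning terminology, this dataset is called the training dataset or training set, one house is a sample, its actual sale price is the label, and the two factors used to predict the label are features. Features characterize a sample.

Loss Function

During training we need to measure the error between the predicted price and the true price. We usually choose a non-negative number as the error, where a smaller value means a smaller error. A common choice is the squared error. For the sample with index $i$, it is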

$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2,$$

where $\hat{y}^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b$ is the prediction for sample $i$. Averaging over all $n$ training samples gives the training loss:

$$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
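The two formulas can be checked on a handful of numbers. The sketch below computes the per-sample loss and its average for three made-up predictions and labels; the values are assumptions for illustration only.

```python
def squared_loss(y_hat, y):
    """Per-sample squared error with the 1/2 factor used in the text."""
    return 0.5 * (y_hat - y) ** 2

# Made-up predictions and labels for three samples (illustrative values)
y_hat = [2.0, 4.0, 7.0]
y = [1.0, 4.0, 5.0]

per_sample = [squared_loss(p, t) for p, t in zip(y_hat, y)]
mean_loss = sum(per_sample) / len(per_sample)

print(per_sample)  # [0.5, 0.0, 2.0]
print(mean_loss)   # 2.5 / 3, roughly 0.8333
```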

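Optimization Algorithm - Stochastic Gradient Descent

When the model and the loss function have simple forms, the solution to the error minimization problem above can be written as a formula. Such a solution is called an analytical solution. Linear regression with squared error, as used in this section, falls into this category. Most deep learning models, however, have no analytical solution, and the loss can only be reduced by updating the model parameters over a finite number of iterations of an optimization algorithm. Such a solution is called a numerical solution.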
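For reference, the analytical solution for linear regression can be written down explicitly. This is the standard least-squares result rather than something derived in the original text. Assume the bias is absorbed into the weights by appending a constant 1 to every feature vector, let $\mathbf{X}$ be the matrix whose rows are these augmented feature vectors, let $\mathbf{y}$ be the vector of labels, and assume $\mathbf{X}^\top \mathbf{X}$ is invertible:

```latex
\mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{2n} \lVert \mathbf{X}\mathbf{w} - \mathbf{y} \rVert^2
             = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y}
```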

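Among the optimization algorithms that produce numerical solutions, mini-batch stochastic gradient descent is widely used in deep learning. The algorithm is simple. First choose initial values for the model parameters, for example at random. Then update the parameters over many iterations, so that each iteration is likely to reduce the loss. In each iteration, uniformly sample a mini-batch $\mathcal{B}$ consisting of a fixed number of training samples, compute the derivative (gradient) of the average loss over the mini-batch with respect to the model parameters, and subtract the product of this gradient and a preset positive number from the parameters.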

$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b)$$

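Learning rate: $\eta$ is the step size taken in each update.
Batch size: $|\mathcal{B}|$ is the number of samples in each mini-batch $\mathcal{B}$.

To summarize, the optimization procedure has two steps:

(i) Initialize the model parameters, usually at random.
(ii) Iterate over the data many times, updating each parameter by moving it in the direction of the negative gradient.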
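The update rule can be traced by hand on a tiny example. The sketch below performs one mini-batch step for a model with a single feature, using the gradients of the squared loss, $\partial l^{(i)}/\partial w = (\hat{y}^{(i)} - y^{(i)})\,x^{(i)}$ and $\partial l^{(i)}/\partial b = \hat{y}^{(i)} - y^{(i)}$. The data, starting parameters, and learning rate are made-up values, and `sgd_step` is a hypothetical helper.

```python
def sgd_step(w, b, batch, lr):
    """One mini-batch SGD step for y_hat = w * x + b with squared loss."""
    grad_w = 0.0
    grad_b = 0.0
    for x, y in batch:
        err = (w * x + b) - y      # y_hat - y
        grad_w += err * x          # d l / d w for this sample
        grad_b += err              # d l / d b for this sample
    n = len(batch)
    # Subtract the learning rate times the average gradient over the batch
    return w - lr * grad_w / n, b - lr * grad_b / n

# Illustrative mini-batch of (x, y) pairs generated by y = 2x + 1
batch = [(1.0, 3.0), (2.0, 5.0)]
w, b = sgd_step(w=0.0, b=0.0, batch=batch, lr=0.1)
print(w, b)  # approximately 0.65 and 0.4
```

Starting from zero, both parameters move toward the values 2 and 1 that generated the data; repeating the step over many mini-batches is what the training loops later in this section do.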

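Vectorized Computation

During training or prediction we often process many samples at once, which calls for vectorized computation. Before introducing the vectorized expression for linear regression, consider two ways of adding two vectors.

One way is to add the two vectors element by element with scalar additions.

The other way is to add the two vectors directly with a single vector addition.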

import torch
import time

# initialize a and b as 1000-dimensional vectors
n = 1000
a = torch.ones(n)
b = torch.ones(n)  # 1-D tensors, i.e. vectors

# define a timer class to record running times
class Timer(object):
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        # start the timer
        self.start_time = time.time()

    def stop(self):
        # stop the timer and record time into a list
        self.times.append(time.time() - self.start_time)
        return self.times[-1]

    def avg(self):
        # calculate the average and return
        return sum(self.times)/len(self.times)

    def sum(self):
        # return the sum of recorded time
        return sum(self.times)

Now we can run the test. First, add the two vectors element by element with a for loop.

timer = Timer()
c = torch.zeros(n)  # 1-D tensor of length n
for i in range(n):
    c[i] = a[i] + b[i]
'%.5f sec' % timer.stop()
'0.01136 sec'

Next, add the two vectors directly with a single torch vector addition:

timer.start()
d = a + b
d
'%.5f sec' % timer.stop()
'0.00031 sec'

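The result is clear: the vector addition is much faster than the loop (about 0.0003 s versus 0.011 s in the run above). We should therefore use vectorized computation wherever possible to improve efficiency.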
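The same idea applies to the linear regression model itself. Stacking the $n$ feature vectors as the rows of a matrix $\mathbf{X}$ gives all predictions at once as $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b$. The sketch below checks, on made-up values, that this matrix form matches a per-sample loop; the tensors and their values are assumptions chosen for illustration.

```python
import torch

# Illustrative data: 3 samples with 2 features each
X = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
w = torch.tensor([2.0, -1.0])
b = 0.5

# Per-sample loop: y_hat[i] = w . x_i + b
y_loop = torch.zeros(X.shape[0])
for i in range(X.shape[0]):
    y_loop[i] = torch.dot(X[i], w) + b

# Vectorized form: y_hat = X w + b (matrix-vector product)
y_vec = torch.mv(X, w) + b

print(y_vec)                          # tensor([0.5000, 2.5000, 4.5000])
print(torch.allclose(y_loop, y_vec))  # True
```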

Implementing Linear Regression from Scratch

# import packages and modules
%matplotlib inline
import torch
from IPython import display
from matplotlib import pyplot as plt
import numpy as np
import random
# print the installed PyTorch version
print(torch.__version__)
1.3.0

Generating the Dataset

We use a linear model to generate a dataset of 1000 samples. The labels follow the linear relationship below, plus Gaussian noise with a standard deviation of 0.01:

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$

# set input feature number 
num_inputs = 2
# set example number
num_examples = 1000
# set true weight and bias in order to generate corresponded label
true_w = [2, -3.4]
true_b = 4.2
# torch.randn(*sizes) -> Tensor; here the result is a 1000 x 2 feature matrix
features = torch.randn(num_examples, num_inputs,
                      dtype=torch.float32)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()),
                       dtype=torch.float32)
# np.random.normal arguments: loc (float) is the mean, scale (float) is the standard deviation, size is the output shape
# print(labels.size())  # torch.Size([1000])

Plotting the Generated Data

plt.scatter(features[:, 1].numpy(), labels.numpy(), 1)
# .numpy() converts the tensors to NumPy arrays so that matplotlib can plot them
# print(features[:, 1].numpy())
<matplotlib.collections.PathCollection at 0x7f749bc92e48>

Reading the Dataset

def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # shuffle so that samples are read in random order
    for i in range(0, num_examples, batch_size):
        # the last batch may contain fewer than batch_size samples
        j = torch.LongTensor(indices[i: min(i + batch_size, num_examples)])
        # print(indices[i: min(i + batch_size, num_examples)])
        # torch.index_select(input, dim, index, out=None) -> Tensor
        #   input (Tensor): the input tensor
        #   dim (int): the dimension to index along
        #   index (LongTensor): 1-D tensor containing the indices to select
        #   out (Tensor, optional): the output tensor
        yield features.index_select(0, j), labels.index_select(0, j)
batch_size = 10
for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break
tensor([[ 0.5027,  0.1907],
        [ 0.1581,  2.0837],
        [ 0.0971, -0.7907],
        [ 1.0616,  1.2568],
        [-0.6036, -0.8940],
        [-0.6406,  0.0366],
        [ 1.3812, -1.4876],
        [ 1.2584,  0.6306],
        [ 1.1232, -0.5262],
        [-1.2096, -1.6601]]) 
 tensor([ 4.5702, -2.5926,  7.0957,  2.0412,  6.0298,  2.7910, 12.0099,  4.5726,
         8.2394,  7.4195])
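The batching logic in `data_iter` does not depend on PyTorch. The sketch below repeats the same index arithmetic on plain Python lists to show that every sample is visited exactly once per pass and that the final batch is shorter when the dataset size is not a multiple of the batch size. `batch_indices` is a hypothetical helper, and the sizes 10 and 4 are arbitrary.

```python
import random

def batch_indices(num_examples, batch_size):
    """Yield shuffled index batches, mirroring the slicing in data_iter."""
    indices = list(range(num_examples))
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        yield indices[i: min(i + batch_size, num_examples)]

batches = list(batch_indices(num_examples=10, batch_size=4))
batch_sizes = [len(batch) for batch in batches]
visited = sorted(i for batch in batches for i in batch)

print(batch_sizes)  # [4, 4, 2]
print(visited)      # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```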

Initializing Model Parameters

w = torch.tensor(np.random.normal(0, 0.01, (num_inputs, 1)), dtype=torch.float32)
# num_inputs = 2
# print(w)
b = torch.zeros(1, dtype=torch.float32)
w.requires_grad_(requires_grad=True)  # leaf tensor; enable gradient tracking
b.requires_grad_(requires_grad=True)
tensor([0.], requires_grad=True)

Defining the Model

Define the model whose parameters we will train:

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$

def linreg(X, w, b):
    # torch.mm performs matrix multiplication
    return torch.mm(X, w) + b

Defining the Loss Function

We use the squared error loss with a factor of 1/2:

$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.$$

def squared_loss(y_hat, y):
    # view reshapes y to the shape of y_hat (similar to numpy's reshape), which avoids unintended broadcasting
    return (y_hat - y.view(y_hat.size())) ** 2 / 2
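The `view` call matters because `y_hat` has shape `(batch_size, 1)` while `y` has shape `(batch_size,)`. Subtracting the two directly would broadcast to a `(batch_size, batch_size)` matrix rather than raise an error. A small check with made-up values:

```python
import torch

y_hat = torch.tensor([[1.0], [2.0], [3.0]])  # shape (3, 1), like the output of linreg
y = torch.tensor([1.0, 2.0, 3.0])            # shape (3,), like a batch of labels

wrong = y_hat - y                      # broadcasts to shape (3, 3)
right = y_hat - y.view(y_hat.size())   # element-wise, shape (3, 1)

print(wrong.shape)               # torch.Size([3, 3])
print(right.shape)               # torch.Size([3, 1])
print(right.abs().sum().item())  # 0.0
```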

Defining the Optimization Algorithm

Here we use mini-batch stochastic gradient descent:

def sgd(params, lr, batch_size): 
    for param in params:
        param.data -= lr * param.grad / batch_size  # use .data to update param without gradient tracking
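In more recent PyTorch code, the same in-place update is commonly written inside a `torch.no_grad()` block instead of going through `.data`. The sketch below is an alternative under that assumption, not the original author's code; `sgd_no_grad` is a hypothetical name and the parameter values are made up.

```python
import torch

def sgd_no_grad(params, lr, batch_size):
    """Mini-batch SGD update performed with gradient tracking disabled."""
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size

# Illustrative check: loss = sum(3 * w) gives a gradient of 3 for every entry
w = torch.tensor([1.0, 2.0], requires_grad=True)
(3 * w).sum().backward()
sgd_no_grad([w], lr=0.1, batch_size=1)
print(w)  # approximately tensor([0.7000, 1.7000], requires_grad=True)
```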

Training

With the dataset, model, loss function, and optimization algorithm defined, we are ready to train the model.

# hyperparameter initialization
lr = 0.03
num_epochs = 5
net = linreg  # the linear model defined above
loss = squared_loss
# training
for epoch in range(num_epochs):  # training repeats num_epochs times
    # in each epoch, all the samples in dataset will be used once
    # X is the feature and y is the label of a batch sample
    for X, y in data_iter(batch_size, features, labels):
        # print(X)
        l = loss(net(X, w, b), y).sum()
        # calculate the gradient of batch sample loss
        l.backward()
        # Notes on autograd:
        # - When creating a Tensor, the requires_grad argument specifies whether operations on it
        #   are recorded so that backward() can later compute gradients.
        # - A Tensor's requires_grad attribute stores whether it records operations for gradient computation.
        # - requires_grad_() changes a Tensor's requires_grad attribute in place.
        # - A Tensor created by an operation automatically gets a grad_fn attribute (its gradient function).
        # - Calling backward() on the final Tensor stores the gradients of the leaf Tensors that took part
        #   in computing it in their grad attributes. backward() accepts a gradient argument whose shape
        #   should match (or broadcast to) the Tensor it is called on; for a scalar it defaults to torch.tensor(1).
        # - By default, backward() can be called only once on the result of a given computation;
        #   a Tensor produced by running the computation again can call backward() again.
        # - When several Tensors are computed from the same source Tensor, their backward() calls
        #   accumulate into the source Tensor's grad attribute.
        # update the model parameters with mini-batch SGD
        sgd([w, b], lr, batch_size)
        # reset parameter gradient
        w.grad.data.zero_()
        # print(w)
        b.grad.data.zero_()
        # gradients accumulate in PyTorch, so they must be cleared after every update step
    train_l = loss(net(features, w, b), labels)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().item()))
epoch 1, loss 0.000049
epoch 2, loss 0.000049
epoch 3, loss 0.000049
epoch 4, loss 0.000049
epoch 5, loss 0.000049
w, true_w, b, true_b
(tensor([[ 2.0000],
         [-3.4001]], requires_grad=True),
 [2, -3.4],
 tensor([4.1995], requires_grad=True),
 4.2)
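The gradient reset in the training loop is needed because `backward()` adds to the existing `.grad` values instead of overwriting them. A minimal check, using a made-up one-element parameter:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(2 * w).sum().backward()
first = w.grad.clone()     # gradient of 2 * w with respect to w is 2

(2 * w).sum().backward()   # a second backward pass without zeroing
second = w.grad.clone()    # accumulated: 2 + 2

w.grad.zero_()             # what the training loop does after each step
print(first, second, w.grad)  # tensor([2.]) tensor([4.]) tensor([0.])
```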

Concise Implementation of Linear Regression with PyTorch

import torch
from torch import nn
import numpy as np
torch.manual_seed(1)  # fix PyTorch's random seed so that results are reproducible
print(torch.__version__)
torch.set_default_tensor_type('torch.FloatTensor')  # set the default tensor type
1.3.0

Generating the Dataset

The dataset is generated in the same way as in the from-scratch implementation.

num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
features = torch.tensor(np.random.normal(0, 1, (num_examples, num_inputs)), dtype=torch.float)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float)

Reading the Dataset

import torch.utils.data as Data
batch_size = 10
# combine features and labels of dataset
dataset = Data.TensorDataset(features, labels)
# TensorDataset wraps the tensors so that they can be indexed together
# put dataset into DataLoader
data_iter = Data.DataLoader(
    dataset=dataset,            # torch TensorDataset format
    batch_size=batch_size,      # mini batch size
    shuffle=True,               # whether shuffle the data or not
    num_workers=2,              # number of worker subprocesses used to load data
)
# the DataLoader yields shuffled mini-batches
for X, y in data_iter:
    print(X, '\n', y)
    break
tensor([[ 0.6484,  1.2504],
        [ 1.1520, -0.7892],
        [-0.2521,  0.1240],
        [-0.6498,  0.0227],
        [ 0.6085, -0.7356],
        [-1.2659,  1.4959],
        [ 1.2577,  0.4130],
        [-0.4149,  0.6235],
        [ 0.3223,  0.3474],
        [-0.7551,  0.4452]]) 
 tensor([ 1.2478,  9.1838,  3.2643,  2.8207,  7.9096, -3.4437,  5.3291,  1.2519,
         3.6509,  1.1668])
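A `DataLoader` also reports how many mini-batches one pass produces, which is the dataset size divided by the batch size, rounded up. The sketch below uses made-up tensors with 25 samples and `num_workers=0` (loading in the main process) to stay self-contained; these values are assumptions for illustration.

```python
import torch
import torch.utils.data as Data

features = torch.randn(25, 2)   # 25 illustrative samples with 2 features
labels = torch.randn(25)

dataset = Data.TensorDataset(features, labels)
loader = Data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=0)

sizes = [X.shape[0] for X, y in loader]
print(len(dataset), len(loader))  # 25 3
print(sizes)                      # [10, 10, 5]
```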

Defining the Model

class LinearNet(nn.Module):
    def __init__(self, n_feature):
        super(LinearNet, self).__init__()      # call the parent class constructor
        self.linear = nn.Linear(n_feature, 1)  # function prototype: `torch.nn.Linear(in_features, out_features, bias=True)`
    def forward(self, x):
        y = self.linear(x)
        return y
net = LinearNet(num_inputs)
# num_inputs = 2
print(net)
LinearNet(
  (linear): Linear(in_features=2, out_features=1, bias=True)
)
# ways to build a multilayer network, useful later for stacking layers such as a multilayer perceptron
# method one
net = nn.Sequential(
    nn.Linear(num_inputs, 1)
    # other layers can be added here
    )
# method two
net = nn.Sequential()
net.add_module('linear', nn.Linear(num_inputs, 1))
# net.add_module ......
# method three
from collections import OrderedDict
net = nn.Sequential(OrderedDict([
          ('linear', nn.Linear(num_inputs, 1))
          # OrderedDict: an ordered dictionary of (name, layer) pairs
        ]))
print(net)
print(net[0])
Sequential(
  (linear): Linear(in_features=2, out_features=1, bias=True)
)
Linear(in_features=2, out_features=1, bias=True)

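Initializing Model Parameters

from torch.nn import init
init.normal_(net[0].weight, mean=0.0, std=0.01)
# fills the input tensor with values drawn from the normal distribution with the given mean and standard deviation
# tensor: an n-dimensional torch.Tensor
# mean: the mean of the normal distribution
# std: the standard deviation of the normal distribution
init.constant_(net[0].bias, val=0.0)
# or you can use `net[0].bias.data.fill_(0)` to modify it directly
# fills the input tensor with the value val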
for param in net.parameters():
    print(param)
Parameter containing:
tensor([[0.5347, 0.7057]], requires_grad=True)
Parameter containing:
tensor([0.6873], requires_grad=True)

Defining the Loss Function

loss = nn.MSELoss()    # nn built-in squared loss function
                       # function prototype: `torch.nn.MSELoss(size_average=None, reduce=None, reduction='mean')`
print(net.parameters())
<generator object Module.parameters at 0x7f76de70c9a8>
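Note that `nn.MSELoss` with the default `reduction='mean'` averages $(\hat{y} - y)^2$ without the factor $\frac{1}{2}$ used in the from-scratch `squared_loss`, so for the same predictions it returns twice the mean of that loss. A quick check on made-up values:

```python
import torch
from torch import nn

y_hat = torch.tensor([[2.0], [4.0], [7.0]])
y = torch.tensor([[1.0], [4.0], [5.0]])

mse = nn.MSELoss()(y_hat, y)                  # mean of (y_hat - y)^2
half_squared = ((y_hat - y) ** 2 / 2).mean()  # mean of the from-scratch loss

print(mse.item())           # (1 + 0 + 4) / 3, roughly 1.6667
print(half_squared.item())  # half of the value above
```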

Defining the Optimization Algorithm

import torch.optim as optim
optimizer = optim.SGD(net.parameters(), lr=0.03)   # built-in stochastic gradient descent optimizer
print(optimizer)  # function prototype: `torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)`
SGD (
Parameter Group 0
    dampening: 0
    lr: 0.03
    momentum: 0
    nesterov: False
    weight_decay: 0
)

Training

num_epochs = 3
for epoch in range(1, num_epochs + 1):
    for X, y in data_iter:
        output = net(X)
        l = loss(output, y.view(-1, 1))
        optimizer.zero_grad()  # reset gradients; equivalent to net.zero_grad() here
        l.backward()
        optimizer.step()  # update the parameters
    print('epoch %d, loss: %f' % (epoch, l.item()))
epoch 1, loss: 0.000283
epoch 2, loss: 0.000035
epoch 3, loss: 0.000079
# result comparison
dense = net[0]
print(true_w, dense.weight.data)
print(true_b, dense.bias.data)
[2, -3.4] tensor([[ 2.0014, -3.4004]])
4.2 tensor([4.1997])

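Comparing the Two Implementations

Implementation from scratch (recommended for learning):
Gives a better understanding of the model and of the underlying principles of neural networks.

Concise implementation with PyTorch:
Makes it faster to design and implement a model.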

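Summary

torch.ones() / torch.zeros(): similar to ones/zeros in MATLAB. They create a tensor filled with ones or with zeros.
Uniform distribution: torch.rand(*sizes, out=None) → Tensor returns a tensor of random numbers drawn from the uniform distribution on [0, 1). The shape of the tensor is given by sizes.
Standard normal distribution: torch.randn(*sizes, out=None) → Tensor returns a tensor of random numbers drawn from the standard normal distribution (mean 0, variance 1, i.e. Gaussian white noise). The shape of the tensor is given by sizes.
torch.mul(a, b) multiplies a and b element-wise. Their shapes must match or be broadcastable; for example, if a has shape (1, 2) and b has shape (1, 2), the result also has shape (1, 2).
torch.mm(a, b) is the matrix product of a and b; for example, if a has shape (1, 2) and b has shape (2, 3), the result has shape (1, 3).
torch.Tensor is a multi-dimensional matrix containing elements of a single data type. Older PyTorch documentation lists 7 CPU tensor types and 8 GPU tensor types; the numbers differ between versions.
random.shuffle(a) shuffles the elements of a list in place. shuffle() is not a built-in function; import the random module and call it as random.shuffle.
backward() is provided by PyTorch and works together with requires_grad:

1. Every tensor has a .requires_grad attribute, which can be set at creation: x = torch.ones(2, 4, requires_grad=True)

2. To change the attribute afterwards, call the requires_grad_() method: x.requires_grad_(False)
tensor.detach() returns a new tensor that is detached from the current computation graph but still shares storage with the original. Its requires_grad is False, so no gradient flows back through it to the original graph, even if its requires_grad is later set to True again. In-place changes to it that would invalidate a gradient computation are reported as an error by backward(), which makes it the safer choice.
.data also returns a tensor that shares storage with the original, but changes made through .data are not tracked by autograd. backward() then raises no error and may silently return a wrong gradient.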
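The difference between `.detach()` and `.data` can be observed directly. The sketch below follows a commonly used demonstration: `sigmoid` saves its output for the backward pass, so overwriting that output in place corrupts the gradient. The tensor values are made up for illustration.

```python
import torch

# Case 1: modify the result through .detach(); autograd notices the change
a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = a.sigmoid()
c = out.detach()
c.zero_()                  # also zeroes out, since the two share storage
try:
    out.sum().backward()
    detach_raised = False
except RuntimeError:
    detach_raised = True
print(detach_raised)       # True

# Case 2: modify the result through .data; no error, but a wrong gradient
a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = a.sigmoid()
c = out.data
c.zero_()
out.sum().backward()
print(a.grad)              # tensor([0., 0., 0.]) instead of the true sigmoid gradient
```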