TensorFlow Study Notes - 蒲公英云

1 Environment issues
1.1 RuntimeError: Error copying tensor to device

RuntimeError: Error copying tensor to device: /job:localhost/replica:0/task:0/device:GPU:0. 
/job:localhost/replica:0/task:0/device:GPU:0 unknown device.

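A quick check shows that the GPU libraries cannot be loaded: the CUDA runtime libraries and cuDNN, which TensorFlow uses for GPU acceleration, are missing.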

>>> import tensorflow as tf
>>> tf.test.is_gpu_available()
2019-11-20 10:50:16.311650: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-20 10:50:16.327690: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-11-20 10:50:16.390645: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-20 10:50:16.391072: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3ea93a0 executing computations on platform CUDA. Devices:
2019-11-20 10:50:16.391087: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2019-11-20 10:50:16.412600: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3192000000 Hz
2019-11-20 10:50:16.413479: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x42911b0 executing computations on platform Host. Devices:
2019-11-20 10:50:16.413505: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-11-20 10:50:16.413605: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-20 10:50:16.413929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-20 10:50:16.414056: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2019-11-20 10:50:16.414088: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2019-11-20 10:50:16.414116: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2019-11-20 10:50:16.414144: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2019-11-20 10:50:16.414172: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2019-11-20 10:50:16.414200: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2019-11-20 10:50:16.414229: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2019-11-20 10:50:16.414235: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2019-11-20 10:50:16.414244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-20 10:50:16.414269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-11-20 10:50:16.414273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
False
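
The lines that matter in the log above are the `Could not dlopen library` ones. As a minimal sketch, the helper below pulls those library names out of a log so it is easy to see what still has to be installed or linked. The name `missing_libraries` is my own and not part of TensorFlow.

```python
import re

# Matches the dso_loader lines shown in the log above, e.g.
#   Could not dlopen library 'libcudart.so.10.0'; dlerror: ...
_DLOPEN_FAILURE = re.compile(r"Could not dlopen library '([^']+)'")


def missing_libraries(log_text):
    """Return the shared libraries TensorFlow failed to load, in log order."""
    seen = []
    for name in _DLOPEN_FAILURE.findall(log_text):
        if name not in seen:  # report each library only once
            seen.append(name)
    return seen


if __name__ == "__main__":
    # Two lines shortened from the log above.
    sample = (
        "dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; "
        "dlerror: libcudart.so.10.0: cannot open shared object file\n"
        "dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; "
        "dlerror: libcudnn.so.7: cannot open shared object file\n"
    )
    print(missing_libraries(sample))
```

Applied to the full log above, this lists six CUDA 10.0 libraries in addition to libcudnn.so.7.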

I ran dpkg -i libcudnn7_7.6.5.32-1+cuda10.1_amd64.deb, but the problem remained, so I checked the installed package:

root@test-To-be-filled-by-O-E-M:/home/test/download# dpkg -l | grep cudnn
ii  libcudnn7                                  7.6.5.32-1+cuda10.1                          amd64        cuDNN runtime libraries

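[Solved] The article "The new cuDNN Deb package cannot find cudnn.h" points out that the "cuDNN Library for Linux" archive should be downloaded instead of the deb package. To match the CUDA version, I then also ran sh cuda_10.1.243_418.87.00_linux.run.
Running the cuDNN sample then reports: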

./mnistCUDNN: error while loading shared libraries: libcudart.so.10.1: cannot open shared object file: No such file or directory

Solution

cd /etc/ld.so.conf.d
echo "/usr/local/cuda/lib64" >> cuda-10-1.conf
# Apply the configuration
sudo ldconfig

Then run the commands below to set the environment variables and create the symlinks

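# If CUDA was installed as root but other users also need to run it, add the same settings to those users' environment as well
# Single quotes keep $PATH and $LD_LIBRARY_PATH unexpanded until .bashrc is read
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
# Check which library versions are actually installed locally before creating the links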
sudo ln -s /usr/local/cuda/lib64/libcudart.so.10.1.243 /usr/local/cuda/lib64/libcudart.so.10.0
sudo ln -s /usr/local/cuda/lib64/libcufft.so.10.1.1.243 /usr/local/cuda/lib64/libcufft.so.10.0
sudo ln -s /usr/local/cuda/lib64/libcurand.so.10.1.1.243 /usr/local/cuda/lib64/libcurand.so.10.0
sudo ln -s /usr/local/cuda/lib64/libcusolver.so.10.2.0.243 /usr/local/cuda/lib64/libcusolver.so.10.0
sudo ln -s /usr/local/cuda/lib64/libcusparse.so.10.3.0.243 /usr/local/cuda/lib64/libcusparse.so.10.0
sudo ln -s /usr/lib/x86_64-linux-gnu/libcublas.so.10.2.1.243 /usr/local/cuda-10.1/lib64/libcublas.so.10.0
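
Because the exact version suffixes differ between installations, it can help to generate the link commands from what is actually on disk. The following is only a sketch under my own assumptions: `plan_symlinks` is a hypothetical helper, it looks for regular files named `<stem>.so.10.*`, and it prints the commands instead of running them.

```python
from pathlib import Path


def plan_symlinks(lib_dir, stems, wanted_suffix=".so.10.0"):
    """Return (target, link) pairs for libraries found in lib_dir.

    For each stem such as "libcudart", pick the installed regular file whose
    name starts with "<stem>.so.10." and pair it with "<stem><wanted_suffix>".
    Stems with no match, or whose link name already exists, are skipped.
    """
    lib_dir = Path(lib_dir)
    plan = []
    for stem in stems:
        link = lib_dir / (stem + wanted_suffix)
        if link.exists():
            continue
        candidates = sorted(
            p for p in lib_dir.glob(stem + ".so.10.*")
            if p.is_file() and not p.is_symlink()
        )
        if candidates:
            plan.append((candidates[-1], link))
    return plan


if __name__ == "__main__":
    stems = ["libcudart", "libcufft", "libcurand", "libcusolver", "libcusparse"]
    for target, link in plan_symlinks("/usr/local/cuda/lib64", stems):
        # Only print the command; run it by hand after checking it.
        print("sudo ln -s {} {}".format(target, link))
```

Linking a 10.1 library under a 10.0 name is a workaround that only holds as long as the two versions happen to be compatible; installing the CUDA version that the TensorFlow build expects avoids the links altogether.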

1.2 '1type' as a synonym of type is deprecated

>>> import tensorflow as tf
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])

See "tensorflow + numpy compatibility?". The numpy version installed for the test user was too new; downgrading it to 1.16.4 fixes the warning.
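
To see whether an environment is newer than the version that worked here, a small comparison like the one below can be used. It is a sketch with the standard library only; `version_tuple` and `needs_downgrade` are hypothetical helper names, and 1.16.4 is simply the version that worked for me, not an official compatibility limit.

```python
def version_tuple(version):
    """Turn '1.16.4' into (1, 16, 4); non-numeric suffixes are ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def needs_downgrade(installed, known_good="1.16.4"):
    """True if the installed numpy is newer than the known-good version."""
    return version_tuple(installed) > version_tuple(known_good)


if __name__ == "__main__":
    try:
        import numpy
        print(numpy.__version__, needs_downgrade(numpy.__version__))
    except ImportError:
        print("numpy is not installed")
```

If it reports True, pip3 install numpy==1.16.4 --user pins the version again.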

1.3 Font family ['STKaiti'] not found
The font is missing on Ubuntu; see "Installing fonts on Ubuntu".

mkdir -p /usr/share/fonts/zh_CN
sudo chmod 777 *
sudo mkfontscale
sudo mkfontdir
sudo fc-cache
reboot
/home/test/.local/lib/python3.6/site-packages/ipykernel_launcher.py:68: UserWarning: Attempted to set non-positive bottom ylim on a log-scaled axis.
Invalid limit will be ignored.
findfont: Font family ['STKaiti'] not found. Falling back to DejaVu Sans.

Running fc-list :lang=zh shows that the font is already installed.
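
Instead of reading the fc-list output by eye, the family names can be extracted with a few lines of code. This sketch assumes the default fc-list line format (`path: Family A,Family B:style=...`); `families_from_fc_list` is a name I made up, and the sample line in the demo is invented for illustration.

```python
def families_from_fc_list(output):
    """Collect the font family names from default-format `fc-list` output.

    Each line is expected to look like
        /path/to/font.ttf: Family A,Family B:style=Regular
    Lines in any other format are skipped or only partly understood.
    """
    families = set()
    for line in output.splitlines():
        fields = line.split(":")
        if len(fields) < 2:
            continue
        for family in fields[1].split(","):
            family = family.strip()
            if family:
                families.add(family)
    return families


if __name__ == "__main__":
    # Invented sample line in the default fc-list format.
    sample = "/usr/share/fonts/zh_CN/STKAITI.TTF: STKaiti,华文楷体:style=Regular\n"
    print("STKaiti" in families_from_fc_list(sample))
```

A font that fontconfig knows about can still be unknown to matplotlib, because matplotlib keeps its own font cache.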

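So why does the warning still appear? According to "[Matplotlib] Common errors and fixes when changing fonts in Matplotlib (change font)", the cache also has to be deleted with rm -rf ~/.cache/matplotlib. That led to a new problem: Chinese text in the interface was garbled.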

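The matplotlib directory was not regenerated under the cache directory.
Files deleted by mistake can be recovered with extundelete, but files deleted through WinSCP cannot. In that case the only option is to reinstall matplotlib.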

sudo apt-get install extundelete
root@test-To-be-filled-by-O-E-M:~# date -d "2019-11-21 11:15:00" +%s
1574306100
sudo extundelete /dev/sda8 --after 1574306100 --restore-all
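
The number passed to --after is a Unix timestamp. As a cross-check on the date command, the same value can be computed in Python. This is a small sketch; `epoch_seconds` is my own helper name, and the UTC+8 offset is an assumption that happens to reproduce the value printed above.

```python
from datetime import datetime, timedelta, timezone


def epoch_seconds(year, month, day, hour, minute, second=0, utc_offset_hours=8):
    """Seconds since the Unix epoch for a wall-clock time at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    moment = datetime(year, month, day, hour, minute, second, tzinfo=tz)
    return int(moment.timestamp())


if __name__ == "__main__":
    print(epoch_seconds(2019, 11, 21, 11, 15))  # 1574306100
```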

Note that installing matplotlib automatically upgrades numpy.
If other warnings still appear, add the following statements.

import warnings
warnings.filterwarnings('ignore')

1.4 wrapper (from tensorflow.python.ops.array_ops) is deprecated
This warning can be ignored.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:1205: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

1.5 This is probably because cuDNN failed to initialize

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D]

Running the test program reports a cuDNN failure:

root@test-To-be-filled-by-O-E-M:/home/test/cudnn_samples_v7/mnistCUDNN# ./mnistCUDNN
cudnnGetVersion() : 7605 , CUDNN_VERSION from cudnn.h : 7605 (7.6.5)
Host compiler version : GCC 7.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 36  Capabilities 7.5, SmClock 1710.0 Mhz, MemSize (Mb) 7982, MemClock 7001.0 Mhz, Ecc=0, boardGroupID=0
Using device 0
Testing single precision
CUDNN failure
Error: CUDNN_STATUS_INTERNAL_ERROR
mnistCUDNN.cpp:394
Aborting...

The screenshot showed the cause: the GPU memory had run out.

1.6 tensorboard
tensorboard depends on numpy, and installing it upgrades numpy automatically, which brings back the problem from section 1.2.

# Install tensorboard
pip3 install tensorboard --user -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
# Any directory name will do
mkdir -p /home/test/workspace/netlog
# --host=0.0.0.0 allows remote access
tensorboard --logdir=/home/test/workspace/netlog --host=0.0.0.0

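I use tf.summary to visualize the model from the "ResNet residual network notes". The visualization code is adapted, with some changes, from "TensorFlow 2.0 stochastic gradient descent: TensorBoard visualization".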

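import datetime
import io

import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import optimizers

# resnet18, train_data and test_data are assumed to be defined as in the ResNet notes.

def plot_to_image(figure):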
    """Converts the matplotlib plot specified by 'figure' to a PNG image and
    returns it. The supplied figure is closed and inaccessible after this call."""
    # Save the plot to a PNG in memory.
    buf = io.BytesIO()
    plt.savefig(buf, format='png')
    # Closing the figure prevents it from being displayed directly inside
    # the notebook.
    plt.close(figure)
    buf.seek(0)
    # Convert PNG buffer to TF image
    image = tf.image.decode_png(buf.getvalue(), channels=4)
    # Add the batch dimension
    image = tf.expand_dims(image, 0)
    return image
def image_grid(images):
    """Return a 5x5 grid of the MNIST images as a matplotlib figure."""
    # Create a figure to contain the plot.
    figure = plt.figure(figsize=(10, 10))
    for i in range(25):
        # Start next subplot.
        plt.subplot(5, 5, i + 1, title='name')
        plt.xticks([])
        plt.yticks([])
        plt.grid(False)
        plt.imshow(images[i], cmap=plt.cm.binary)
    return figure
# Network model
model=resnet18()
model.build(input_shape=(None,32,32,3))
model.summary()
optimizer=optimizers.Adam(learning_rate=1e-4)
# Log directory and summary writer, one directory per run
current_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
model_name = 'resnet18_{}'.format(current_time)
log_dir = 'logs/' + model_name
summary_writer = tf.summary.create_file_writer(log_dir)
# Log one batch of training images
sample=next(iter(train_data))
print('sample.shape is ',sample[0].shape)
sample_img = sample[0]
sample_img = tf.reshape(sample_img, [-1,32, 32, 1])
with summary_writer.as_default():
    tf.summary.image("Training sample:", sample_img, step=0)
def main():
    for epoch in range(100):
        for step,(x,y) in enumerate(train_data):
            with tf.GradientTape() as tape:
                logits=model(x)
                y_onehot=tf.one_hot(y,depth=10)
                loss=tf.losses.categorical_crossentropy(y_onehot,logits,from_logits=True)
                loss=tf.reduce_mean(loss)
            grads=tape.gradient(loss,model.trainable_variables)
            optimizer.apply_gradients(zip(grads,model.trainable_variables))
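            if step%50==0:
                print(epoch,step,'loss',float(loss))
                with summary_writer.as_default():
                    # optimizer.iterations counts update steps, so every logged point gets its own x value
                    tf.summary.scalar('train-loss', float(loss), step=optimizer.iterations)
        # Accuracy on the test set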
        total_num=0
        total_correct=0
        for x,y in test_data:
            logits=model(x)
            prob=tf.nn.softmax(logits,axis=1)
            pred=tf.argmax(prob,axis=1)
            pred=tf.cast(pred,dtype=tf.int32)
            correct=tf.cast(tf.equal(pred,y),dtype=tf.int32)
            correct=tf.reduce_sum(correct)
            total_num+=x.shape[0]
            total_correct+=int(correct)
        acc=total_correct/total_num
        print(epoch,'acc:',acc)
        # Log the accuracy and some images from the last test batch
        val_images = x[:25]
        val_images = tf.reshape(val_images, [-1, 32, 32, 1])
        with summary_writer.as_default():
            tf.summary.scalar(
                'test-acc',
                acc,
                step=epoch)
            tf.summary.image(
                "val-onebyone-images:",
                val_images,
                max_outputs=25,
                step=epoch)
            val_images = tf.reshape(val_images, [-1, 32, 32])
            figure = image_grid(val_images)
            tf.summary.image('val-images:', plot_to_image(figure), step=epoch)
    print('Training finished')

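As the chart below shows, several training runs can be selected and compared. For example, I changed batch_size from 512 to 64, left the other parameters unchanged, and observed the difference: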

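The chart above shows resnet18 trained on the CIFAR-10 training set reaching 71.49% after 100 epochs. Following the article "ResNet18 in PyTorch (94% classification accuracy on CIFAR-10)", I then adjusted the learning rate, batch size and number of epochs and ran two different configurations for comparison. There are only a few hyperparameters, but that article does not give the best values.
With a learning rate of 0.1 the accuracy stays at 0.1 and the loss oscillates too much to be useful, so that run should be stopped early.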
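
An accuracy of 0.1 on a balanced 10-class dataset is what random guessing gives, so a run stuck there is not learning. Stopping such a run early can be automated with a simple patience rule. The sketch below is my own illustration, not part of the training code above; `should_stop`, `patience` and `min_delta` are hypothetical names, and the accuracy values in the demo are made up.

```python
def should_stop(accuracies, patience=5, min_delta=0.001):
    """Return True when the best accuracy has not improved for `patience` epochs.

    accuracies: test accuracy per epoch, oldest first.
    An improvement must exceed the previous best by more than min_delta.
    """
    if len(accuracies) <= patience:
        return False
    best_before = max(accuracies[:-patience])
    best_recent = max(accuracies[-patience:])
    return best_recent <= best_before + min_delta


if __name__ == "__main__":
    history = []
    # Made-up accuracies for a run that never gets past chance level.
    for epoch, acc in enumerate([0.10, 0.10, 0.11, 0.10, 0.10, 0.10, 0.10, 0.10]):
        history.append(acc)
        if should_stop(history, patience=5):
            print("stopping after epoch", epoch)
            break
```

In the training loop this check would go right after the accuracy for the epoch is computed.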

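2 TensorFlow basics
2.1 One-hot encoding
See "Machine learning: one-hot encoding in data preprocessing". Regression, classification and clustering compute feature similarity or distance in Euclidean space, and one-hot encoding maps the values of a discrete feature into that space.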
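
A small worked example makes the distance argument concrete. It uses plain Python lists rather than tf.one_hot (which the training code above calls with depth=10), and the helper names `one_hot` and `euclidean` are my own.

```python
import math


def one_hot(index, depth):
    """Return a list of length `depth` with 1.0 at `index` and 0.0 elsewhere."""
    if not 0 <= index < depth:
        raise ValueError("index out of range")
    return [1.0 if i == index else 0.0 for i in range(depth)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    # As raw integers, class 0 looks much farther from class 9 than from class 1.
    print(abs(0 - 1), abs(0 - 9))
    # After one-hot encoding, every pair of distinct classes is equally far apart.
    print(euclidean(one_hot(0, 10), one_hot(1, 10)))
    print(euclidean(one_hot(0, 10), one_hot(9, 10)))
```

Both distances come out as the square root of 2, so the encoding no longer implies an ordering between class labels.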
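
2.2 Batches and epochs
Running one batch of the training set through the network and updating the weights once is called a step; one full pass over all samples in the training set is called an epoch.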

# Shuffle, preprocess, and batch
train_data = train_data.shuffle(1000).map(preprocess).batch(512)
# With this setup the loop below trains on two batches of data
for step,(x,y) in enumerate(train_data):
    pass  # training step goes here
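
The relation between batch size, steps and epochs is simple arithmetic. The sketch below computes the number of steps per epoch for the two batch sizes used in these notes, assuming the 50000 training images of CIFAR-10 and that the last, smaller batch is kept (the default behaviour of batch()). `steps_per_epoch` is a helper name of my own.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches (steps) needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


if __name__ == "__main__":
    # CIFAR-10 has 50000 training images.
    for batch_size in (512, 64):
        steps = steps_per_epoch(50000, batch_size)
        # batch size, steps in one epoch, steps in 100 epochs
        print(batch_size, steps, steps * 100)
```

With batch size 512 an epoch takes 98 steps, and with batch size 64 it takes 782, so the smaller batch size means roughly eight times as many weight updates per epoch.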

See also "Three concepts in convolutional neural network training (epoch, number of iterations, batch size)".

3 Network issues
3.1 incompatible with the layer
This exception was raised while I was building ResNet+FPN:

self.latlayer = Sequential([
            layers.Conv2D(filters=256, kernel_size=1, strides=1, padding='valid'),
            layers.BatchNormalization(),
            layers.Activation('relu')
        ])
File "D:\Python36\lib\site-packages\tensorflow_core\python\keras\engine\input_spec.py", line 213, in assert_input_compatibility
    ' but received input with shape ' + str(shape))
ValueError: Input 0 of layer sequential_10 is incompatible with the layer: expected axis -1 of input shape to have value 256 but received input with shape [None, 28, 28, 128]

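I wondered whether the reason was that TensorFlow does not need to be told the input dimensions (see "TensorFlow 2.0 basic operations: the Broadcasting mechanism"), whereas PyTorch requires them explicitly. The problem is in the Sequential. The PyTorch FPN architecture I was following defines a separate lateral layer for each input: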

self.latlayer1 = nn.Conv2d(1024, 256, kernel_size=1, stride=1, padding=0)
        self.latlayer1_bn = nn.BatchNorm2d(256)
        self.latlayer1_relu = nn.ReLU(inplace=True)
        self.latlayer2 = nn.Conv2d(512,  256, kernel_size=1, stride=1, padding=0)
        self.latlayer2_bn = nn.BatchNorm2d(256)
        self.latlayer2_relu = nn.ReLU(inplace=True)
        self.latlayer3 = nn.Conv2d(256,  256, kernel_size=1, stride=1, padding=0)
        self.latlayer3_bn = nn.BatchNorm2d(256)
        self.latlayer3_relu = nn.ReLU(inplace=True)

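Defining a separate Sequential for each lateral connection, as below, solves the problem. A Keras layer creates its weights for the channel count it sees on its first call, so a single Sequential cannot be reused on feature maps with different channel counts.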

self.latlayer4 = Sequential([
            layers.Conv2D(filters=256, kernel_size=1, strides=1, padding='valid'),
            layers.BatchNormalization(),
            layers.Activation('relu')
        ])
        self.latlayer3 = Sequential([
            layers.Conv2D(filters=256, kernel_size=1, strides=1, padding='valid'),
            layers.BatchNormalization(),
            layers.Activation('relu')
        ])
        self.latlayer2 = Sequential([
            layers.Conv2D(filters=256, kernel_size=1, strides=1, padding='valid'),
            layers.BatchNormalization(),
            layers.Activation('relu')
        ])
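
The behaviour behind this error can be imitated without TensorFlow. The class below is a toy stand-in that only tracks channel counts, not real Keras code; `LazyConv1x1` is a made-up name, and the (None, 14, 14, 256) shape is a hypothetical first input chosen to match the "expected 256" part of the message.

```python
class LazyConv1x1:
    """Toy stand-in for a layer that is built on its first call.

    It only tracks channel counts; no real convolution is performed.
    """

    def __init__(self, filters):
        self.filters = filters
        self.in_channels = None  # unknown until the first call

    def __call__(self, shape):
        channels = shape[-1]
        if self.in_channels is None:
            self.in_channels = channels  # "build" for this channel count
        elif channels != self.in_channels:
            raise ValueError(
                "expected axis -1 of input shape to have value {} "
                "but received input with shape {}".format(self.in_channels, list(shape))
            )
        return shape[:-1] + (self.filters,)


if __name__ == "__main__":
    shared = LazyConv1x1(256)
    print(shared((None, 14, 14, 256)))  # first call fixes in_channels to 256
    try:
        shared((None, 28, 28, 128))     # reusing it on a 128-channel map fails
    except ValueError as err:
        print("ValueError:", err)

    # One layer per feature map, as in the fix above, works.
    lat3, lat2 = LazyConv1x1(256), LazyConv1x1(256)
    print(lat3((None, 14, 14, 256)), lat2((None, 28, 28, 128)))
```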

3.2 WARNING: Entity <bound method BasicBlock
The cause is a gast version that is too new. Running pip3 install -U gast==0.2.2 --user fixes it.

WARNING:tensorflow:Entity <bound method BasicBlock.call of <__main__.BasicBlock object at 0x7f61cc54ad30>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method BasicBlock.call of <__main__.BasicBlock object at 0x7f61cc54ad30>>: AssertionError: Bad argument number for Name: 3, expecting 4

3.3 GPU memory exhaustion
Following "Training on data with Resnet18 in Tensorflow2.0", I ran training and testing with FPN, and shortly after startup got:

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     65     else:
     66       message = e.message
---> 67     six.raise_from(core._status_to_exception(e.code, message), None)
     68   except TypeError as e:
     69     if any(ops._is_keras_symbolic_tensor(x) for x in inputs):
~/.local/lib/python3.6/site-packages/six.py in raise_from(value, from_value)
UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D]

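By default TensorFlow claims almost all available GPU memory to avoid the performance cost of memory fragmentation, so running out of memory is not surprising.

Set a fixed GPU memory limit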

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices(device_type='GPU')

# Cap TensorFlow at 3072 MB on the first GPU
tf.config.experimental.set_virtual_device_configuration(
    gpus[0],
    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=3072)]
)

Allocate on demand
As seen below, the memory allocated on demand is still fairly large.

for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

3.4 logits and labels must be broadcastable
To get around the problem in 3.3, I added os.environ["CUDA_VISIBLE_DEVICES"] = "-1" to the script so that it runs on the CPU, and got:

InvalidArgumentError: logits and labels must be broadcastable: logits_size=[65536,7] labels_size=[64,10] [Op:SoftmaxCrossEntropyWithLogits] name: softmax_cross_entropy_with_logits/
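
The two sizes in the message already hint at what is wrong. The arithmetic below is only my reading of the numbers, since the model code is not shown here: with 64 labels per batch, 65536 logits rows means 1024 rows per image, which equals 32 x 32. That would fit logits that are still a 32x32 map per image rather than one vector per image, and the last axis has 7 outputs while the labels have 10 classes. `rows_per_sample` is a hypothetical helper name.

```python
def rows_per_sample(logits_size, labels_size):
    """How many logits rows there are for each label row, or None if uneven."""
    logit_rows, label_rows = logits_size[0], labels_size[0]
    if label_rows == 0 or logit_rows % label_rows:
        return None
    return logit_rows // label_rows


if __name__ == "__main__":
    logits_size, labels_size = [65536, 7], [64, 10]
    per_sample = rows_per_sample(logits_size, labels_size)
    print(per_sample)             # 1024
    print(per_sample == 32 * 32)  # True
    # 7 output channels versus 10 classes in the one-hot labels
    print(logits_size[1], labels_size[1])
```

If this reading is right, the output would need to be reduced to shape [batch, 10] before the loss, for example by pooling over the spatial axes and ending with a 10-unit layer.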