TensorFlow Study Notes

缺乏、安全感 2023-06-14 14:57

1 Environment Issues
1.1 RuntimeError: Error copying tensor to device

  RuntimeError: Error copying tensor to device: /job:localhost/replica:0/task:0/device:GPU:0.
  /job:localhost/replica:0/task:0/device:GPU:0 unknown device.

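The check below showed that TensorFlow could not load the CUDA 10.0 runtime libraries or cuDNN (libcudnn.so.7), which it needs for GPU acceleration.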

  >>> import tensorflow as tf
  >>> tf.test.is_gpu_available()
  2019-11-20 10:50:16.311650: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
  2019-11-20 10:50:16.327690: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
  2019-11-20 10:50:16.390645: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
  2019-11-20 10:50:16.391072: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3ea93a0 executing computations on platform CUDA. Devices:
  2019-11-20 10:50:16.391087: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
  2019-11-20 10:50:16.412600: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3192000000 Hz
  2019-11-20 10:50:16.413479: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x42911b0 executing computations on platform Host. Devices:
  2019-11-20 10:50:16.413505: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
  2019-11-20 10:50:16.413605: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
  2019-11-20 10:50:16.413929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
  name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.71
  pciBusID: 0000:01:00.0
  2019-11-20 10:50:16.414056: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
  2019-11-20 10:50:16.414088: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
  2019-11-20 10:50:16.414116: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
  2019-11-20 10:50:16.414144: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
  2019-11-20 10:50:16.414172: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
  2019-11-20 10:50:16.414200: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
  2019-11-20 10:50:16.414229: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
  2019-11-20 10:50:16.414235: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
  2019-11-20 10:50:16.414244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
  2019-11-20 10:50:16.414269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
  2019-11-20 10:50:16.414273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
  False

I installed cuDNN with dpkg -i libcudnn7_7.6.5.32-1+cuda10.1_amd64.deb, but the problem persisted, so I checked whether the package was registered:

  root@test-To-be-filled-by-O-E-M:/home/test/download# dpkg -l | grep cudnn
  ii libcudnn7 7.6.5.32-1+cuda10.1 amd64 cuDNN runtime libraries

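[Solution] The article "新版cudnn Deb安装版本 找不到cudnn.h" (the new cuDNN Deb package cannot find cudnn.h) points out that you should download the "cuDNN Library for Linux" package rather than the deb package. To match the CUDA version, I then ran sh cuda_10.1.243_418.87.00_linux.run.
Running the cuDNN sample afterwards reported: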

  ./mnistCUDNN: error while loading shared libraries: libcudart.so.10.1: cannot open shared object file: No such file or directory

Solution:

  cd /etc/ld.so.conf.d
  echo "/usr/local/cuda/lib64" >> cuda-10-1.conf
  # apply the configuration
  sudo ldconfig

Next, run the commands below to set the environment variables and create the symbolic links:

  # If CUDA was installed as root but also has to work for other users,
  # add the settings below to each of those users' environment as well.
  # Single quotes keep $PATH and $LD_LIBRARY_PATH unexpanded until .bashrc is sourced.
  echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
  echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
  source ~/.bashrc
  # Check the exact library versions on your own machine before creating the links
  sudo ln -s /usr/local/cuda/lib64/libcudart.so.10.1.243 /usr/local/cuda/lib64/libcudart.so.10.0
  sudo ln -s /usr/local/cuda/lib64/libcufft.so.10.1.1.243 /usr/local/cuda/lib64/libcufft.so.10.0
  sudo ln -s /usr/local/cuda/lib64/libcurand.so.10.1.1.243 /usr/local/cuda/lib64/libcurand.so.10.0
  sudo ln -s /usr/local/cuda/lib64/libcusolver.so.10.2.0.243 /usr/local/cuda/lib64/libcusolver.so.10.0
  sudo ln -s /usr/local/cuda/lib64/libcusparse.so.10.3.0.243 /usr/local/cuda/lib64/libcusparse.so.10.0
  sudo ln -s /usr/lib/x86_64-linux-gnu/libcublas.so.10.2.1.243 /usr/local/cuda-10.1/lib64/libcublas.so.10.0
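To confirm that the links took effect without starting TensorFlow, one option is to try loading each library by name. The sketch below is only an illustration: the list of names is copied from the dlopen messages in the log above, and the helper name check_libs is hypothetical.

```python
import ctypes

# Library names copied from the "Could not dlopen library" lines in the log above.
REQUIRED_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
    "libcudnn.so.7",
]


def check_libs(names):
    """Return {name: bool}, True when the dynamic loader can open the library."""
    result = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # uses the same search path as ld.so / LD_LIBRARY_PATH
            result[name] = True
        except OSError:
            result[name] = False
    return result


if __name__ == "__main__":
    for name, ok in check_libs(REQUIRED_LIBS).items():
        print(("OK       " if ok else "MISSING  ") + name)
```

Any name reported as MISSING still needs a symbolic link or a search-path entry before TensorFlow can register the GPU.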

1.2 '1type' as a synonym of type is deprecated

  >>> import tensorflow as tf
  /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])

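See "tensorflow + numpy compatibility?": the numpy version installed for the test user was too new. Downgrading it to 1.16.4 removes the warning.
1.3 Font family ['STKaiti'] not found
The font is missing on Ubuntu; see "ubuntu字体安装" (installing fonts on Ubuntu).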

  mkdir -p /usr/share/fonts/zh_CN
  # (the font files are assumed to have been copied into this directory, and the
  # commands below to be run from inside it)
  sudo chmod 777 *
  sudo mkfontscale
  sudo mkfontdir
  sudo fc-cache
  reboot

The warning in question:

  /home/test/.local/lib/python3.6/site-packages/ipykernel_launcher.py:68: UserWarning: Attempted to set non-positive bottom ylim on a log-scaled axis.
  Invalid limit will be ignored.
  findfont: Font family ['STKaiti'] not found. Falling back to DejaVu Sans.

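Running fc-list :lang=zh shows that the font is already installed.
So why does the warning still appear? According to "【Matplotlib】Matplotlib之更换字体常见错误及修正方式(change font)", the Matplotlib cache also has to be deleted with rm -rf ~/.cache/matplotlib. That led to a new problem: Chinese text in the interface became garbled, and the matplotlib directory was not regenerated under the cache directory.
Files deleted by mistake can be recovered with extundelete (see below), but files deleted through WinSCP cannot. In that case the only option is to reinstall matplotlib.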

  sudo apt-get install extundelete
  root@test-To-be-filled-by-O-E-M:~# date -d "2019-11-21 11:15:00" +%s
  1574306100
  sudo extundelete /dev/sda8 --after 1574306100 --restore-all

Note that installing matplotlib automatically upgrades numpy.
If other warnings remain, adding the following statements suppresses them.

  import warnings
  warnings.filterwarnings('ignore')
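Ignoring every warning can also hide ones that matter. A narrower alternative, sketched here with the standard library only, is to ignore a single category. The function noisy and its two messages are made up for illustration; they only mimic the FutureWarning from section 1.2 and the UserWarning from this section.

```python
import warnings


def noisy():
    """Emit one FutureWarning and one UserWarning (illustrative messages)."""
    warnings.warn("'1type' as a synonym of type is deprecated", FutureWarning)
    warnings.warn("Attempted to set non-positive bottom ylim", UserWarning)
    return "done"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                             # report everything...
    warnings.filterwarnings("ignore", category=FutureWarning)   # ...except FutureWarning
    noisy()

# Names of the warning categories that were still reported
remaining = [w.category.__name__ for w in caught]
print(remaining)
```

Only the UserWarning is recorded, so the numpy deprecation noise disappears while other warnings stay visible.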

1.4 wrapper (from tensorflow.python.ops.array_ops) is deprecated
This warning can be ignored.

  WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:1205: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
  Instructions for updating:
  Use tf.where in 2.0, which has the same broadcast rule as np.where

1.5 This is probably because cuDNN failed to initialize

  UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D]

Running the cuDNN sample program also reports a cuDNN failure:

  root@test-To-be-filled-by-O-E-M:/home/test/cudnn_samples_v7/mnistCUDNN# ./mnistCUDNN
  cudnnGetVersion() : 7605 , CUDNN_VERSION from cudnn.h : 7605 (7.6.5)
  Host compiler version : GCC 7.4.0
  There are 1 CUDA capable devices on your machine :
  device 0 : sms 36 Capabilities 7.5, SmClock 1710.0 Mhz, MemSize (Mb) 7982, MemClock 7001.0 Mhz, Ecc=0, boardGroupID=0
  Using device 0
  Testing single precision
  CUDNN failure
  Error: CUDNN_STATUS_INTERNAL_ERROR
  mnistCUDNN.cpp:394
  Aborting...

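A screenshot (not preserved) showed that the real cause was that GPU memory had been exhausted.
1.6 tensorboard
tensorboard depends on numpy, so installing it upgrades numpy automatically, which brings back the problem from section 1.2.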

  # install tensorboard
  pip3 install tensorboard --user -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
  # any directory name will do
  mkdir -p /home/test/workspace/netlog
  # --host=0.0.0.0 allows remote access
  tensorboard --logdir=/home/test/workspace/netlog --host=0.0.0.0

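I used tf.summary to visualize the network from my ResNet notes ("Resnet残差网络笔记"). The visualization code is adapted, with some changes, from "tensorflow 2.0 随机梯度下降 之 tensorboard可视化".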

  import datetime
  import io

  import matplotlib.pyplot as plt
  import tensorflow as tf
  from tensorflow.keras import optimizers

  # resnet18, train_data and test_data are not defined here; they are assumed
  # to come from the ResNet notes mentioned above.


  def plot_to_image(figure):
      """Converts the matplotlib plot specified by 'figure' to a PNG image and
      returns it. The supplied figure is closed and inaccessible after this call."""
      # Save the plot to a PNG in memory.
      buf = io.BytesIO()
      plt.savefig(buf, format='png')
      # Closing the figure prevents it from being displayed directly inside
      # the notebook.
      plt.close(figure)
      buf.seek(0)
      # Convert PNG buffer to TF image
      image = tf.image.decode_png(buf.getvalue(), channels=4)
      # Add the batch dimension
      image = tf.expand_dims(image, 0)
      return image


  def image_grid(images):
      """Return a 5x5 grid of the images as a matplotlib figure."""
      # Create a figure to contain the plot.
      figure = plt.figure(figsize=(10, 10))
      for i in range(25):
          # Start next subplot.
          plt.subplot(5, 5, i + 1, title='name')
          plt.xticks([])
          plt.yticks([])
          plt.grid(False)
          plt.imshow(images[i], cmap=plt.cm.binary)
      return figure


  # Network model
  model = resnet18()
  model.build(input_shape=(None, 32, 32, 3))
  model.summary()
  optimizer = optimizers.Adam(learning_rate=1e-4)

  # One log directory per run, so that runs can be compared in TensorBoard
  current_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
  model_name = 'resnet18_{}'.format(current_time)
  log_dir = 'logs/' + model_name
  summary_writer = tf.summary.create_file_writer(log_dir)

  # Log one batch of training samples as images.
  # The single-channel reshape is kept as in the original notes.
  sample = next(iter(train_data))
  print('sample.shape is ', sample[0].shape)
  sample_img = sample[0]
  sample_img = tf.reshape(sample_img, [-1, 32, 32, 1])
  with summary_writer.as_default():
      tf.summary.image("Training sample:", sample_img, step=0)


  def main():
      for epoch in range(100):
          for step, (x, y) in enumerate(train_data):
              with tf.GradientTape() as tape:
                  logits = model(x)
                  y_onehot = tf.one_hot(y, depth=10)
                  loss = tf.losses.categorical_crossentropy(y_onehot, logits, from_logits=True)
                  loss = tf.reduce_mean(loss)
              grads = tape.gradient(loss, model.trainable_variables)
              optimizer.apply_gradients(zip(grads, model.trainable_variables))
              if step % 50 == 0:
                  print(epoch, step, 'loss', float(loss))
                  with summary_writer.as_default():
                      tf.summary.scalar('train-loss', float(loss), step=epoch)
          # Accuracy on the test set
          total_num = 0
          total_correct = 0
          for x, y in test_data:
              logits = model(x)
              prob = tf.nn.softmax(logits, axis=1)
              pred = tf.argmax(prob, axis=1)
              pred = tf.cast(pred, dtype=tf.int32)
              correct = tf.cast(tf.equal(pred, y), dtype=tf.int32)
              correct = tf.reduce_sum(correct)
              total_num += x.shape[0]
              total_correct += int(correct)
          acc = total_correct / total_num
          print(epoch, 'acc:', acc)
          # Log the accuracy and some images from the last test batch
          val_images = x[:25]
          val_images = tf.reshape(val_images, [-1, 32, 32, 1])
          with summary_writer.as_default():
              tf.summary.scalar(
                  'test-acc',
                  acc,
                  step=epoch)
              tf.summary.image(
                  "val-onebyone-images:",
                  val_images,
                  max_outputs=25,
                  step=epoch)
              val_images = tf.reshape(val_images, [-1, 32, 32])
              figure = image_grid(val_images)
              tf.summary.image('val-images:', plot_to_image(figure), step=epoch)
      print('Training finished')


  if __name__ == '__main__':
      main()

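In TensorBoard you can select several training runs and compare them. For example, I changed batch_size from 512 to 64, left every other parameter unchanged, and observed the difference.
With resnet18 on the CIFAR-10 training set, accuracy reached 71.49% after 100 epochs. Following "pytorch之ResNet18(对cifar10数据进行分类准确度达到94%)", I then adjusted the learning rate, batch size, and number of epochs, and ran two different parameter settings for comparison. There are only a few hyperparameters, but that article does not give the best values for them.
With a learning rate of 0.1, accuracy stays at 0.1 (chance level for 10 classes) and the loss fluctuates too wildly to be meaningful; a run like that should be stopped early.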
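Stopping a hopeless run early can be automated with a patience counter. The sketch below is a generic, framework-free illustration, not the code used for the runs above; the class name EarlyStopping, the patience value, and the loss values are all made up.

```python
class EarlyStopping:
    """Signal a stop when the loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, loss):
        if loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative per-epoch losses for a run that is not converging
losses = [2.30, 2.31, 2.35, 2.29, 2.40, 2.33, 2.31, 2.50]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print("stopped at epoch", stopped_at)
```

In a training loop like main() above, the check would go after the per-epoch evaluation, with a break out of the epoch loop when should_stop returns True.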
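2 TensorFlow Basics
2.1 One-hot encoding
See "机器学习:数据预处理之独热编码(One-Hot)". In regression, classification, and clustering, feature similarity or distance is computed in Euclidean space. One-hot encoding maps the values of a discrete feature into Euclidean space, so those computations become meaningful for categorical data.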
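A small worked example may make the distance argument concrete. It uses three hypothetical classes labelled 0, 1 and 2, and plain Python instead of tf.one_hot so that it runs anywhere.

```python
import math


def one_hot(index, depth):
    """Return a list of length `depth` with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# With raw integer labels, class 2 looks twice as far from class 0 as class 1 does
d_int_01 = abs(0 - 1)
d_int_02 = abs(0 - 2)

# With one-hot vectors, every pair of distinct classes is equally far apart
d_hot_01 = euclidean(one_hot(0, 3), one_hot(1, 3))
d_hot_02 = euclidean(one_hot(0, 3), one_hot(2, 3))

print(d_int_01, d_int_02)
print(d_hot_01, d_hot_02)
```

The integer encoding imposes an ordering that the classes do not have, while the one-hot encoding places all distinct classes at the same distance from each other.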
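2.2 Batches and iterations
One parameter update computed on one batch of the training set is called a step; one full pass over all samples in the training set is called an epoch.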

  # shuffle, preprocess, and batch
  train_data = train_data.shuffle(1000).map(preprocess).batch(512)
  # the loop below then trains on the data in two batches
  for step, (x, y) in enumerate(train_data):
      pass  # one training step per batch goes here

See also "卷积神经网络训练三个概念(epoch,迭代次数,batchsize)" (three concepts in CNN training: epoch, number of iterations, batch size).
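The relation between the two terms is simple arithmetic: the number of steps in one epoch is the number of samples divided by the batch size, rounded up. The sketch below assumes the full CIFAR-10 training set of 50000 images, which may not match the dataset behind the snippet above, and uses the two batch sizes mentioned in section 1.6.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)


NUM_TRAIN = 50000  # assumption: the full CIFAR-10 training set

steps_512 = steps_per_epoch(NUM_TRAIN, 512)
steps_64 = steps_per_epoch(NUM_TRAIN, 64)
print(steps_512, steps_64)
```

A smaller batch size therefore means many more parameter updates per epoch, which is one reason the two runs compared earlier behave differently.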
3 Network Issues
3.1 incompatible with the layer
This exception was raised while I was building ResNet+FPN:

  self.latlayer = Sequential([
      layers.Conv2D(filters=256, kernel_size=1, strides=1, padding='valid'),
      layers.BatchNormalization(),
      layers.Activation('relu')
  ])

  File "D:\Python36\lib\site-packages\tensorflow_core\python\keras\engine\input_spec.py", line 213, in assert_input_compatibility
  ' but received input with shape ' + str(shape))
  ValueError: Input 0 of layer sequential_10 is incompatible with the layer: expected axis -1 of input shape to have value 256 but received input with shape [None, 28, 28, 128]

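I wondered whether the cause was that TensorFlow does not need to be told the input dimension (see "tensorflow 2.0 基础操作 之 Broadcasting机制"), whereas PyTorch requires it explicitly. The problem turned out to be in the Sequential. In the PyTorch FPN architecture, each lateral layer is defined separately: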

  self.latlayer1 = nn.Conv2d(1024, 256, kernel_size=1, stride=1, padding=0)
  self.latlayer1_bn = nn.BatchNorm2d(256)
  self.latlayer1_relu = nn.ReLU(inplace=True)
  self.latlayer2 = nn.Conv2d(512, 256, kernel_size=1, stride=1, padding=0)
  self.latlayer2_bn = nn.BatchNorm2d(256)
  self.latlayer2_relu = nn.ReLU(inplace=True)
  self.latlayer3 = nn.Conv2d(256, 256, kernel_size=1, stride=1, padding=0)
  self.latlayer3_bn = nn.BatchNorm2d(256)
  self.latlayer3_relu = nn.ReLU(inplace=True)

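A Keras layer fixes its input channel count the first time it is called, so a single Sequential cannot be reused for feature maps with different channel counts. Defining a separate Sequential for each lateral connection solved the problem.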

  self.latlayer4 = Sequential([
      layers.Conv2D(filters=256, kernel_size=1, strides=1, padding='valid'),
      layers.BatchNormalization(),
      layers.Activation('relu')
  ])
  self.latlayer3 = Sequential([
      layers.Conv2D(filters=256, kernel_size=1, strides=1, padding='valid'),
      layers.BatchNormalization(),
      layers.Activation('relu')
  ])
  self.latlayer2 = Sequential([
      layers.Conv2D(filters=256, kernel_size=1, strides=1, padding='valid'),
      layers.BatchNormalization(),
      layers.Activation('relu')
  ])
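The mechanism behind the error can be imitated without TensorFlow. The class below is a toy stand-in, not Keras code: LazyConv1x1 is a hypothetical name, it tracks shapes only, and the first input shape (None, 14, 14, 256) is an assumption chosen to match the "expected ... 256" part of the error message.

```python
class LazyConv1x1:
    """Toy layer that infers its input channel count on the first call."""

    def __init__(self, filters):
        self.filters = filters
        self.in_channels = None  # unknown until the layer is first called

    def __call__(self, shape):
        channels = shape[-1]
        if self.in_channels is None:
            self.in_channels = channels  # the "build" step happens here
        elif channels != self.in_channels:
            raise ValueError(
                "expected axis -1 of input shape to have value %d "
                "but received input with shape %s" % (self.in_channels, list(shape))
            )
        # A 1x1 convolution keeps the spatial size and changes only the channels
        return shape[:-1] + (self.filters,)


# Reusing one layer for two feature maps with different channel counts fails
shared = LazyConv1x1(256)
shared((None, 14, 14, 256))
try:
    shared((None, 28, 28, 128))
    reuse_failed = False
except ValueError as err:
    reuse_failed = True
    print(err)

# One layer per feature map works
lat3, lat2 = LazyConv1x1(256), LazyConv1x1(256)
out3 = lat3((None, 14, 14, 256))
out2 = lat2((None, 28, 28, 128))
print(out3, out2)
```

This mirrors why separate latlayer2, latlayer3 and latlayer4 objects are needed even though their definitions look identical.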

3.2 WARNING:tensorflow:Entity <bound method BasicBlock
The cause is a gast version that is too new; running pip3 install -U gast==0.2.2 --user fixes it.

  WARNING:tensorflow:Entity <bound method BasicBlock.call of <__main__.BasicBlock object at 0x7f61cc54ad30>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method BasicBlock.call of <__main__.BasicBlock object at 0x7f61cc54ad30>>: AssertionError: Bad argument number for Name: 3, expecting 4

3.3 GPU memory exhaustion
Following "Tensorflow2.0使用Resnet18进行数据训练", I trained and tested with ResNet18 and FPN. Shortly after starting, this error appeared:

  /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
  65 else:
  66 message = e.message
  ---> 67 six.raise_from(core._status_to_exception(e.code, message), None)
  68 except TypeError as e:
  69 if any(ops._is_keras_symbolic_tensor(x) for x in inputs):
  ~/.local/lib/python3.6/site-packages/six.py in raise_from(value, from_value)
  UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D]

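By default, TensorFlow claims nearly all available GPU memory to avoid the performance cost of memory fragmentation, so running out of GPU memory is not surprising. There are two ways to limit it: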

  • Set a fixed GPU memory limit

    import tensorflow as tf

    gpus = tf.config.experimental.list_physical_devices(device_type='GPU')

    # cap the first GPU at 3072 MB
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=3072)]
    )

  • Allocate on demand
    Even with on-demand allocation, the memory used turned out to be fairly large.

    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)

3.4 logits and labels must be broadcastable
To work around the problem in 3.3, I added os.environ["CUDA_VISIBLE_DEVICES"] = "-1" to the script so that it runs on the CPU, which produced:

  InvalidArgumentError: logits and labels must be broadcastable: logits_size=[65536,7] labels_size=[64,10] [Op:SoftmaxCrossEntropyWithLogits] name: softmax_cross_entropy_with_logits/
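The two sizes in the message can be compared with a little arithmetic. The reading below is only a hypothesis, since the notes do not show the model output: it assumes the 64 in labels_size is the batch size and checks how many logit rows that leaves per sample.

```python
import math

# Sizes copied from the error message
logits_rows, logits_classes = 65536, 7
labels_rows, labels_classes = 64, 10

# If 64 is the batch size, each sample contributes this many logit rows
rows_per_sample = logits_rows // labels_rows
divides_evenly = logits_rows % labels_rows == 0

# Check whether that count is a perfect square, i.e. a square spatial grid
side = math.isqrt(rows_per_sample)
is_square_grid = side * side == rows_per_sample

print(rows_per_sample, side, is_square_grid)
```

Under that assumption the logits look like a [64, 32, 32, 7] feature map flattened to [65536, 7], so the network output would need to be reduced to one 10-way vector per sample before it can be compared with the [64, 10] one-hot labels.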
