Hi, I’ve been having problems running TensorFlow on the GPU for about a week now, so here I am, looking for ideas on how to fix it. I’m running TensorFlow 2.8.0 with CUDA 11.6 on an Ubuntu 20.04 virtual machine, with PCI passthrough for the two GPUs I have.
When I run this object detection project: https://github.com/nicknochnack/TFODCourse, everything works, but the training step doesn’t seem to use any GPU power. Looking at nvidia-smi during training:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro K2200        On   | 00000000:07:00.0 Off |                  N/A |
| 42%   26C    P8     1W /  39W |   3543MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro K2200        On   | 00000000:08:00.0 Off |                  N/A |
| 42%   21C    P8     1W /  39W |   3543MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1313      G   /usr/lib/xorg/Xorg                  2MiB |
|    0   N/A  N/A      2149      C   python                           3535MiB |
|    1   N/A  N/A      1313      G   /usr/lib/xorg/Xorg                  2MiB |
|    1   N/A  N/A      2149      C   python                           3535MiB |
+-----------------------------------------------------------------------------+
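For context, the 0% utilization isn’t just a one-off snapshot; it stays there for the whole training run. A rough way to keep an eye on it is a small polling loop around nvidia-smi (nothing TensorFlow-specific here), something like:

import subprocess
import time

# Sample GPU index, utilization and memory use once per second while training
# runs in another terminal; equivalent to re-running nvidia-smi by hand.
while True:
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout.strip())
    time.sleep(1)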
Also, when I run tf.test.is_gpu_available(), I get True.
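For what it’s worth, tf.test.is_gpu_available() is deprecated in TF 2.x; the equivalent check with the current API would look roughly like this:

import tensorflow as tf

# Non-deprecated replacement for tf.test.is_gpu_available(); should list both
# Quadro K2200s if TensorFlow can see them.
gpus = tf.config.list_physical_devices("GPU")
print(gpus)
print("GPU available:", len(gpus) > 0)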
Finally, here’s the result from device_lib.list_local_devices():
2022-05-01 07:45:35.910286: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-05-01 07:45:35.910656: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /device:GPU:0 with 124 MB memory: -> device: 0, name: Quadro K2200, pci bus id: 0000:07:00.0, compute capability: 5.0
2022-05-01 07:45:35.910778: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-05-01 07:45:35.911123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /device:GPU:1 with 124 MB memory: -> device: 1, name: Quadro K2200, pci bus id: 0000:08:00.0, compute capability: 5.0
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 11170602163960232220
xla_global_id: -1
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 130678784
locality {
  bus_id: 1
  links {
  }
}
incarnation: 3083526702988944120
physical_device_desc: "device: 0, name: Quadro K2200, pci bus id: 0000:07:00.0, compute capability: 5.0"
xla_global_id: 416903419
, name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 130678784
locality {
  bus_id: 1
  links {
  }
}
incarnation: 4460464810151506589
physical_device_desc: "device: 1, name: Quadro K2200, pci bus id: 0000:08:00.0, compute capability: 5.0"
xla_global_id: 2144165316
]
So it seems like TensorFlow can see my GPUs; when I was missing libraries I couldn’t even get this far, but now I’m stuck.
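If it helps narrow things down, this is the kind of sanity check I could run next to see whether individual ops actually execute on the GPU (set_log_device_placement is the standard TF switch for this; the matmul is just a dummy op):

import tensorflow as tf

# Log the device every op is placed on; a working setup should print something
# like "Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0".
tf.debugging.set_log_device_placement(True)

a = tf.random.uniform((1024, 1024))
b = tf.random.uniform((1024, 1024))
c = tf.matmul(a, b)
print(c.device)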
Oh, and here’s what I get when trying to train:
2022-04-30 21:53:01.961464: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:01.962055: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:02.071860: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:02.072626: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:02.073121: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:02.073627: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:02.407004: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:02.407485: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:02.407840: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:02.408204: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:02.408536: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:02.408865: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:04.298247: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:04.298711: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:04.299062: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:04.299421: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:04.299801: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:04.300924: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3385 MB memory: -> device: 0, name: Quadro K2200, pci bus id: 0000:07:00.0, compute capability: 5.0
2022-04-30 21:53:04.303496: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-30 21:53:04.303793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 3385 MB memory: -> device: 1, name: Quadro K2200, pci bus id: 0000:08:00.0, compute capability: 5.0
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
I0430 21:53:04.485582 140329512097600 mirrored_strategy.py:374] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
INFO:tensorflow:Maybe overwriting train_steps: 2000
I0430 21:53:04.494590 140329512097600 config_util.py:552] Maybe overwriting train_steps: 2000
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0430 21:53:04.494734 140329512097600 config_util.py:552] Maybe overwriting use_bfloat16: False
WARNING:tensorflow:From /home/oscar/Tensorflow/tfod/lib/python3.8/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W0430 21:53:04.534977 140329512097600 deprecation.py:337] From /home/oscar/Tensorflow/tfod/lib/python3.8/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['Tensorflow/workspace/annotations/train.record']
I0430 21:53:04.547105 140329512097600 dataset_builder.py:162] Reading unweighted datasets: ['Tensorflow/workspace/annotations/train.record']
INFO:tensorflow:Reading record datasets for input file: ['Tensorflow/workspace/annotations/train.record']
I0430 21:53:04.547340 140329512097600 dataset_builder.py:79] Reading record datasets for input file: ['Tensorflow/workspace/annotations/train.record']
INFO:tensorflow:Number of filenames to read: 1
I0430 21:53:04.547466 140329512097600 dataset_builder.py:80] Number of filenames to read: 1
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W0430 21:53:04.547594 140329512097600 dataset_builder.py:86] num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /home/oscar/Tensorflow/tfod/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.deterministic`.
W0430 21:53:04.553132 140329512097600 deprecation.py:337] From /home/oscar/Tensorflow/tfod/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.deterministic`.
WARNING:tensorflow:From /home/oscar/Tensorflow/tfod/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
W0430 21:53:04.594555 140329512097600 deprecation.py:337] From /home/oscar/Tensorflow/tfod/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
WARNING:tensorflow:From /home/oscar/Tensorflow/tfod/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W0430 21:53:11.867036 140329512097600 deprecation.py:337] From /home/oscar/Tensorflow/tfod/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From /home/oscar/Tensorflow/tfod/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
`seed2` arg is deprecated.Use sample_distorted_bounding_box_v2 instead.
W0430 21:53:15.011741 140329512097600 deprecation.py:337] From /home/oscar/Tensorflow/tfod/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
`seed2` arg is deprecated.Use sample_distorted_bounding_box_v2 instead.
WARNING:tensorflow:From /home/oscar/Tensorflow/tfod/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W0430 21:53:16.702110 140329512097600 deprecation.py:337] From /home/oscar/Tensorflow/tfod/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.