PBS allows a TensorFlow job to run on one node but terminates it when it runs on another node


#1

Hello Contributors,

I submitted a TensorFlow job on my HPC server and observed the following: when the job runs on compute node 1, it gets terminated, but when it runs on any other node it executes successfully. Is the problem due to the scheduler or an issue with the compute node itself?
Below is the error message:

WARNING:tensorflow:From inception_reimplement_v2.py:236: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See @{tf.nn.softmax_cross_entropy_with_logits_v2}.

2018-12-10 11:27:39.691795: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-12-10 11:27:41.422854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:02:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-12-10 11:27:41.731735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 1 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:82:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-12-10 11:27:41.731816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1
2018-12-10 11:27:42.881204: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-10 11:27:42.881258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0 1
2018-12-10 11:27:42.881269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N N
2018-12-10 11:27:42.881273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1: N N
2018-12-10 11:27:42.883945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15127 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:02:00.0, compute capability: 6.0)
2018-12-10 11:27:43.023229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 15127 MB memory) -> physical GPU (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0, compute capability: 6.0)
2018-12-10 11:27:45.151367: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-12-10 11:27:45.194114: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-12-10 11:27:45.222603: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-12-10 11:27:45.250662: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-12-10 11:27:45.279407: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-12-10 11:27:45.307738: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-12-10 11:27:45.882243: E tensorflow/stream_executor/cuda/cuda_dnn.cc:455] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2018-12-10 11:27:45.882846: W ./tensorflow/stream_executor/stream.h:2023] attempting to perform DNN operation using StreamExecutor without DNN support
2018-12-10 11:27:46.007526: E tensorflow/stream_executor/cuda/cuda_dnn.cc:455] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2018-12-10 11:27:46.007594: F tensorflow/core/kernels/conv_ops.cc:713] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms)
/var/spool/pbs/mom_priv/jobs/80112.master1.local.SC: line 14: 57413 Aborted python3.5 inception_reimplement_v2.py


#2

Could you please describe your setup, and also share the output of the commands below:

  1. tracejob
  2. qstat -fx

<job_id>.<server_name>.SC -> the job script, if one was provided at qsub
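For example, using the job ID that appears in the error output above (80112.master1), the commands would look something like this (run on the PBS server host; adjust the job ID for the failing job in question):

```shell
# Scheduler-side event history for the failing job
tracejob 80112.master1

# Full job attributes (execution host, exit status, resources);
# -x also includes jobs that have already finished
qstat -fx 80112.master1
```

The `qstat -fx` output in particular shows `exec_host`, which confirms whether the failures really correlate with compute node 1.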

It seems the job script executing on the node aborted due to an issue at line 14 of the script, which (from the error output above) is the python3.5 invocation itself.
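As a side note, the CUBLAS_STATUS_NOT_INITIALIZED and CUDNN_STATUS_INTERNAL_ERROR failures in the log are often seen when the GPUs on a node already have memory held by another (possibly stale) process, or are set to an exclusive compute mode. Running `nvidia-smi` on compute node 1 before the job starts would show any leftover processes. Below is a sketch of a common TF 1.x session-config workaround that avoids reserving nearly all GPU memory at startup; it is not confirmed as the fix for this particular node:

```python
import tensorflow as tf

# Sketch only: allow_growth makes TensorFlow allocate GPU memory on
# demand instead of reserving almost all of it up front, which can
# avoid cuBLAS/cuDNN handle-creation failures on a node whose GPU
# memory is already partially in use.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    pass  # build and run the graph here
```

If the error persists even with this setting, it points more strongly at a problem with that node's GPU state or driver rather than the TensorFlow program.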