Is there any way to avoid executing jobs on an unstable node?
We recently had a hardware fault in a GPU that causes any job executed on that node to die with a segmentation fault.
After a job fails with exit_status=139, PBS starts another job on that node, which fails in the same way, and so on.
As a result, a single unstable node acts like a black hole that swallows queued jobs one after another, making the whole complex useless.
I guess this kind of situation occurs frequently on large systems, and that people have established best practices for dealing with it.
Could someone suggest a smart way to handle this situation with PBS Pro?
I would be glad to learn from those who have dealt with this before.