How to clean up orphaned MPI processes

Hi there,

I have a user application that sometimes deadlocks and does not respond to SIGTERM. Only SIGKILL (kill -9) can stop it. However, after “qdel”, PBS simply assumes the job has exited and assigns new jobs to the node, effectively oversubscribing it.

Is there a reliable way to kill the job? Or do I need to attach an execjob_end hook? I see there is a pbs.pid attribute, but I don’t know what it contains.

Regards,
Chen

Please add the line below to $PBS_HOME/mom_priv/config and restart the pbs_mom service.
$restrict_user True

This would remove all the orphaned MPI processes.

The MoM now actively removes a user’s processes if none of their jobs is assigned to the node.
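
As a rough illustration of that behaviour (this is not pbs_mom’s code, only a sketch of the policy as I understand it), the selection logic could look like the Python below. The names Proc, orphaned_pids and max_system_uid are hypothetical, and the cutoff that exempts system accounts is an assumption on my part; the real MoM may decide differently.

```python
from typing import Iterable, List, NamedTuple, Set


class Proc(NamedTuple):
    """Hypothetical record for one process on the node."""
    pid: int
    uid: int


def orphaned_pids(
    procs: Iterable[Proc],
    uids_with_jobs: Set[int],
    max_system_uid: int = 999,  # assumed cutoff for exempt system accounts
) -> List[int]:
    """Return PIDs owned by ordinary users who have no job on this node."""
    return sorted(
        p.pid
        for p in procs
        # Skip system accounts, and skip users who still have a job here.
        if p.uid > max_system_uid and p.uid not in uids_with_jobs
    )


if __name__ == "__main__":
    procs = [Proc(1, 0), Proc(4210, 1001), Proc(4211, 1001), Proc(5300, 1002)]
    # Only uid 1002 still has a job on the node, so uid 1001's
    # leftover processes are the ones selected for removal.
    print(orphaned_pids(procs, uids_with_jobs={1002}))  # [4210, 4211]
```

Note that a policy like this only helps once the user has no other job on the node. If the same user still has another job running there, the leftover processes would not be selected.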

Seems promising. It may well solve most of the problem.
