Well, the state of jobs and nodes is already what you capture with pbsnodes, qmgr, and qstat, so the db is not going to contain anything different.
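One example of redundancy is a product bug that makes a binary dump core. Consider a test suite with 10 test cases, each of which hits the bug and fails. If we store PBS_HOME on every failure and the earlier core files are still there, the first tarball holds 1 core file, the second holds 2, and so on. By the end of the test suite we end up with tarballs containing a total of 55 core files (1 + 2 + ... + 10), of which only 10 are distinct.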
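To make the arithmetic concrete, here is a small sketch of how the count grows. It assumes every test case fails, each failure leaves one new core file in PBS_HOME, and nothing is cleaned up between test cases; the function name is made up for illustration.

```python
def cores_archived(num_tests, cores_per_failure=1):
    """Count core files across all snapshots when every test fails
    and old core files are never removed from PBS_HOME."""
    total_archived = 0
    cores_on_disk = 0
    for _ in range(num_tests):
        cores_on_disk += cores_per_failure   # this failure adds a new core
        total_archived += cores_on_disk      # snapshot copies every core present
    return total_archived


print(cores_archived(10))  # 55 archived, though only 10 are distinct
```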
As for how to check for failure, I’d assume that it is the test case that fails. So when you know a test case has failed, you store PBS_HOME from all the machines it was running on. If those machines are hard to identify, then just collect logs from all the moms when the test suite finishes.
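A rough sketch of the "store PBS_HOME when a test case fails" idea, using only the Python standard library. It only handles the local host; copying from remote machines is left out. The helper names, the tarball naming, and treating an AssertionError as the failure signal are all assumptions for illustration, not something the test framework is known to provide.

```python
import os
import tarfile


def snapshot_pbs_home(pbs_home, dest_dir, test_name):
    """Archive pbs_home into dest_dir/<test_name>.tar.gz and return the path."""
    os.makedirs(dest_dir, exist_ok=True)
    archive = os.path.join(dest_dir, test_name + ".tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the PBS_HOME directory name
        tar.add(pbs_home, arcname=os.path.basename(pbs_home.rstrip(os.sep)))
    return archive


def run_and_collect(test_name, test_func, pbs_home, dest_dir):
    """Run one test case; snapshot PBS_HOME only if it fails.

    Returns the archive path on failure, None on success.
    """
    try:
        test_func()
    except AssertionError:
        return snapshot_pbs_home(pbs_home, dest_dir, test_name)
    return None
```

The caller would pass whatever PBS_HOME is on that host (commonly /var/spool/pbs, but that depends on the installation).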