PP-302: Implement save of PBS data for post-run analysis


#1

Hi @developers,

A new design document is available for review:
https://pbspro.atlassian.net/wiki/display/PD/PP-302%3A+Implement+save+of+PBS+data+for+post-run+analysis

Please have a look and share your comments/suggestions with the community.

Thanks,
Hiren


#2

Hiren,

Having the core file for post analysis may also help.

Thanks,
Prakash

PS: Please add a link on the design page to the community discussion.


#3

@prakashcv13 AFAIK the core file will be created in a PBS_HOME/*_priv directory, and PBS_HOME is already part of the post-analysis data.
I think I have already added a link to the design page in this discussion; see my first post.


#4

The request is for a link from the design page to this page, not vice versa :slight_smile:
And you are right about the PBS_HOME having the core file.

Thanks,
Prakash


#5

@prakashcv13 Done. I have added a link to this discussion in the design doc.


#6

Looks good. This data will be useful.


#7

@hirenvadalia If PTL does not reset the log files present in PBS_HOME for each test case failure then wouldn’t it be sufficient to just have one copy of “PBS_<hostname>.tar.gz” for each test suite?


#8

@arungrover Yes, PTL does not reset the log files, and initially I had the same thought of copying the PBS_HOME tarball at the test suite level. But then I realized that almost every test case makes changes to the PBS configuration (e.g. one test case enables RR in sched config while another disables it). In that case, how would we know what the content of sched config was from the post-analysis data if the data is stored at the test suite level? Also, a test suite may contain single-node as well as multi-node test cases, so a single-node test case will have only one tarball while a multi-node test case will have multiple tarballs, depending on the number of nodes. After considering these cases, I chose to copy the PBS_HOME tarball at the test case level instead of the test suite level.


#9

Well, a change in the content of files like sched_config can easily be judged by looking at the failing test case and checking what modification it made to the config file.
Regarding taking backups for multi-node test cases… why don’t we consider the worst-case scenario and store one backup from all the machines where these test cases might have run? If we do this, we can still relate the failing test case to the logs.

What do you think?


#10

@arungrover Well, we can judge by looking at the failing test cases, but not always and not for all files, specifically not for the database directory inside PBS_HOME.

Why would we want to store data from a machine where the failing test case has not run?

I would prefer saving data at the test case level instead of the test suite level.


#11

Well, since the database and other files there ideally do not change from case to case in a suite, I feel we will end up with a lot of redundant data.

Now if we can design something which can smartly collect PBS_HOME from only the machines where a test case has failed, then that would be great. It would be ironic if we agree to make a copy of PBS_HOME for every failed test case but disagree on taking a copy of PBS_HOME from every machine if a test suite encounters any failure.


#12

@arungrover Yes, the database will definitely be different from case to case, as every test case has a different configuration (jobs, resvs, nodes, etc.), so considering the database directory I don’t see any redundant data.

AFAIK PTL currently does not have any interface that can tell exactly on which machine a test case failed. (I’m not sure, but it may be possible to implement this; as far as I know PTL, though, that would mean totally refactoring it.) If you have any idea how to implement this without refactoring PTL, please do suggest it; I would love to implement that.


#13

Well, the state of jobs/nodes is already what you store using pbsnodes, qmgr and qstat, so the db is not going to have anything different.
One example of redundancy would be a product bug that makes a binary dump core. Now consider a test suite with 10 test cases. On failure of each test case we will store PBS_HOME, and at the end of the test suite we will end up with tarballs containing a total of 55 core files.
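
The figure of 55 can be checked with a small sketch. It assumes (this is not stated explicitly above) that every test case triggers one new core dump and that PBS_HOME is not cleaned between test cases, so the tarball saved after the k-th failure contains k core files. The names used are purely illustrative.

```python
# Assumption: each test case produces one new core file, and core files
# accumulate in PBS_HOME because nothing cleans them up between test cases.
NUM_TEST_CASES = 10


def cores_in_tarball(case_index):
    """Core files present in PBS_HOME when test case number case_index fails."""
    return case_index  # one core per test case run so far


# Sum over all per-test-case tarballs: 1 + 2 + ... + 10
total_cores = sum(cores_in_tarball(k) for k in range(1, NUM_TEST_CASES + 1))
print(total_cores)  # 55
```

With a single tarball per test suite, the same assumptions would give only 10 stored core files.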

About the way to check for failure, I’d assume that it is the test case that fails. So when you know a test case has failed, you store PBS_HOME from all the machines it was running on. If that is hard to identify, then just collect logs from all the moms when the test suite finishes.


#14

While writing my previous comment I realized that we don’t collect anything for reservations. I think “pbs_rstat -f” should be sufficient to collect all reservation-specific information.


#15

I agree with Arun here that taking a datastore backup for every test case may not be worth it, as you already have the configuration, node, job and reservation details captured using commands. The datastore will have the same data. Also, when logs and configurations are readily available, I would prefer to refer to them before getting into the DB and running queries.


#16

@arungrover Good catch. I will update the design document for “pbs_rstat -f” as well as for “qmgr -c ‘p h’” and “qmgr -c ‘p pbshook’” (thanks to @ashwathraop for catching this).
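
For illustration, here is a minimal sketch of how the outputs of these three commands could be captured on failure. It is not how PTL does it; the helper names and output file names are hypothetical, and only the command strings come from this thread. The runner is injectable so the logic can be tried on a machine without PBS installed.

```python
import shlex
import subprocess

# Commands named in this thread; the output file names are hypothetical.
COMMANDS = {
    "pbs_rstat_f.out": "pbs_rstat -f",
    "qmgr_p_h.out": "qmgr -c 'p h'",
    "qmgr_p_pbshook.out": "qmgr -c 'p pbshook'",
}


def run_command(cmd):
    """Run one command and return its stdout and stderr combined as text."""
    try:
        proc = subprocess.run(
            shlex.split(cmd),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        return proc.stdout
    except OSError as exc:
        # For example, the binary is not installed on this host.
        return "could not run %r: %s\n" % (cmd, exc)


def collect_outputs(commands=COMMANDS, runner=run_command):
    """Return a dict mapping output file name to the captured command output."""
    return {name: runner(cmd) for name, cmd in commands.items()}
```

Calling `collect_outputs()` after a failure would give one text blob per command, which could then be written next to the other post-analysis data.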


#17

I have updated the design document as per my last comment.


#18

@hirenvadalia: Overall the design looks good.

  • PBS logs are what eat up disk space, so I would prefer to save logs only once per test suite instead of at the test case level.

  • For cpuset systems we need to collect cpuset info (cpuset -s / -r) in case of failure.

  • In some cases we need the job’s error file, which exists in the user’s home dir.

  • Instead of running a list of PBS commands, can we make use of the pbs_diag command?

  • If the test case fails at the setup/teardown level, do you save the post-analysis data?


#19

@arungrover: We can make out configuration changes from the logs and the test script file, but I feel that is an overhead. If we save the configuration, it will be easier to debug failures.


#20

Well, I think people who will be working on these issues reported by PTL can easily look at the test case. But if you still think it makes things easier, then just copy the config files for each test case and have one single tar of PBS_HOME per test suite.
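
A rough sketch of that hybrid, under stated assumptions: the function names, the destination layout and the list of config paths are hypothetical and not part of PTL or the design document. Only the “PBS_<hostname>.tar.gz” naming comes from earlier in this thread.

```python
import os
import shutil
import tarfile


def save_configs_for_test_case(pbs_home, dest_dir, test_case, config_paths):
    """Copy selected config files (paths relative to pbs_home) for one failed test case."""
    case_dir = os.path.join(dest_dir, test_case)
    copied = []
    for rel in config_paths:
        src = os.path.join(pbs_home, rel)
        if not os.path.isfile(src):
            continue  # skip files that do not exist on this host
        dst = os.path.join(case_dir, rel)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied


def save_pbs_home_for_suite(pbs_home, dest_dir, hostname):
    """Create one PBS_<hostname>.tar.gz of pbs_home for the whole test suite."""
    os.makedirs(dest_dir, exist_ok=True)
    tar_path = os.path.join(dest_dir, "PBS_%s.tar.gz" % hostname)
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(pbs_home, arcname="PBS_HOME")
    return tar_path
```

The idea would be to call `save_configs_for_test_case` from each failing test case (small, cheap copies) and `save_pbs_home_for_suite` once per host when the suite finishes, so logs and core files are stored only once.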