PP-758: Add pbs_snapshot tool to capture state & logs from PBS


#61

Looks great (version v.39)…

Regarding one of the latest changes (adding a default for the --map= option): I suggest putting the default map file at the same level as the snapshot directory, not inside it. Keeping the map file inside the snapshot invites the unintended inclusion of unobfuscated data in otherwise cleansed snapshots: the reasonable step of simply making a tar file of the whole snapshot directory would pick up the unobfuscated map file, rendering the obfuscation moot.
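
A minimal sketch of the concern, not the tool's actual code. The helper names, the `.map` suffix, and the file names are all made up for illustration. It shows that when the map file is a sibling of the snapshot directory, archiving the directory leaves the map out:

```python
import tarfile
import tempfile
from pathlib import Path


def default_map_path(snapshot_dir):
    """Hypothetical default: place the map file next to the snapshot directory."""
    return snapshot_dir.parent / (snapshot_dir.name + ".map")


def archive_members(snapshot_dir):
    """Tar the whole snapshot directory and return the archived member names."""
    tar_path = snapshot_dir.parent / (snapshot_dir.name + ".tar")
    with tarfile.open(tar_path, "w") as tar:
        tar.add(snapshot_dir, arcname=snapshot_dir.name)
    with tarfile.open(tar_path) as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in snapshot with one obfuscated output file.
    snap = Path(tmp) / "snapshot_example"
    (snap / "server").mkdir(parents=True)
    (snap / "server" / "example.out").write_text("obfuscated output\n")

    # The map (real name -> obfuscated name) lives beside the snapshot.
    map_file = default_map_path(snap)
    map_file.write_text("real_user obfuscated_user\n")

    members = archive_members(snap)
```

With this placement, the map only ends up in an archive if someone adds it deliberately.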


#62

Thanks for pointing that out. I’ve made the change (40+ revisions now, woohoo :)). Please let me know if anything else seems off.


#63

To support automated testing, you still need an option to save the current day and previous day only. This is for tests run in “batch mode” that fail after crossing a day boundary.


#64

Not sure I understand completely, but couldn’t you accomplish that using --service-logs=2? That would capture logs of the present day + the previous day.
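
To illustrate the counting, here is a rough sketch. It assumes daemon log files are named by date as YYYYMMDD and that the present day counts as day one, as described above. The function name is hypothetical and not part of pbs_snapshot:

```python
from datetime import date, timedelta


def log_names_for_last_days(num_days, today=None):
    """Return date-stamped log file names, newest first, for the last num_days days."""
    today = today or date.today()
    return [(today - timedelta(days=offset)).strftime("%Y%m%d")
            for offset in range(num_days)]
```

So a value of 2 selects today's file and yesterday's file, which covers a test run that crosses a single day boundary.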


#65

Just to put another nail in the sudo coffin, not all systems have sudo installed or configured. In fact, many customers use only su and other tools such as dzdo or some such.


#66

Ravi, that seems to do it.


#67

Alright, thanks for all the valuable inputs guys. Please provide your sign-off if the doc looks good now. Thanks!


#68

Will it be possible to reconfigure a system based on the contents of an existing snapshot? Seems like that might be a useful feature for testing and reproducing errors.


#69

Hey Mike,

That would be a pretty useful facility. I think pbs_config already has the ability to ingest a pbs_diag output to recreate the environment, so once pbs_snapshot replaces pbs_diag, I think we’ll modify pbs_config as well to support recreating environments using snapshots. I’ve filed a ticket for it here: https://pbspro.atlassian.net/browse/PP-787

Thanks!
Ravi


#70

I wish I had commented on this earlier, but I have a slight organizational change to propose. It seems to me that the contents of the top level queue/ and resource/ directories are directly applicable to server/, so they could be moved into that directory, which would clean up the top level a bit.

I have the same thought about moving reservation/ to scheduler/, but one could also argue that it belongs in server/ which means it is probably correct for it to be separate.

What do you think?


#71

Thanks Scott. I like the idea of moving queue/ and resource/ to server/ to tidy up the top level.

I personally think that reservations should be kept separate. It wouldn’t be obvious to me to look for them under a scheduler/ directory.

I wanted to clarify something about the --obfuscate flag: right now ACLs are listed under the items to remove, not obfuscate. I was thinking maybe we should just obfuscate them; I’m not sure why we’d want to delete them. What do you think?


#72

From the troubleshooting aspect of using this tool I don’t think it is a huge deal either way. ACL configurations are self contained, and if the problem at hand is related to ACLs (or manager/operator privileges) it is usually pretty obvious and can be discussed directly. Put another way, ACL configuration is unlikely to be the problematic needle in the pbs_snapshot haystack.

On the other hand, since we are already obfuscating euser and egroup and putting them in the map (which we absolutely need to do), is it really that much more effort to go through the ACLs as well and obfuscate them under the same mapping? Same with group_list, managers, and operators?
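
As a rough sketch of what reusing the same mapping could look like: this assumes ACL values are comma-separated entries of the form `user`, `user@host`, `+user@host`, or `-user@host`, and that the map is a plain dictionary from real names to obfuscated names. The function name is hypothetical, and hostnames are deliberately left untouched here:

```python
def obfuscate_acl(acl_value, name_map):
    """Replace the user or group part of each ACL entry using an existing mapping.

    Names missing from the mapping are left as they are; a real tool would
    probably add them to the map instead.
    """
    result = []
    for entry in acl_value.split(","):
        entry = entry.strip()
        # Keep an optional leading "+" or "-" (allow/deny) marker intact.
        prefix = entry[0] if entry[:1] in ("+", "-") else ""
        name, sep, host = entry[len(prefix):].partition("@")
        result.append(prefix + name_map.get(name, name) + sep + host)
    return ",".join(result)
```

The same function could be pointed at group_list, managers, and operators, since they share the name-plus-optional-host shape.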


#73

Can we grab “ps -leaf” alongside “ps -aux”?
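
For what it’s worth, capturing one more command is cheap. A minimal sketch, assuming each command’s output simply goes to its own file (the helper and the output file names below are hypothetical, not the tool’s):

```python
import subprocess
from pathlib import Path


def capture_command(cmd, out_file):
    """Run cmd and write its stdout and stderr to out_file; return the exit code."""
    out_file = Path(out_file)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    with out_file.open("wb") as fh:
        # check=False: a failing command should not abort the whole capture.
        proc = subprocess.run(cmd, stdout=fh, stderr=subprocess.STDOUT, check=False)
    return proc.returncode
```

Usage would be along the lines of `capture_command(["ps", "-leaf"], "system/ps_leaf.out")` next to the existing `ps -aux` capture.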


#74

Hey Scott and others,

I’ve changed the layout of the snapshot as follows:

  • Moved the queue/ and resource/ information inside server/ itself.
  • Moved the *_priv/ and *_logs/ directories to the top level to be consistent with PBS_HOME layout.
  • Moved ‘hook/’ to mom_priv/hooks/ to again be consistent with PBS_HOME
  • Added ps -leaf output.

I sort of feel like we should have a top level directory called ‘cmd_output’ or something and put all command related outputs there. Right now it looks kind of messy to me. But do let me know.

Thanks,
Ravi


#75

Great, thanks.

[quote=“agrawalravi90, post:74, topic:520”]
Moved ‘hook/’ to mom_priv/hooks/ to again be consistent with PBS_HOME
[/quote]

I don’t think this is correct. The output of the commands listed contains all hooks, server and mom. I think either the top level hooks/ directory should remain with qmgr_ph_default.out and qmgr_lpbshook.out in it, or those two files could be moved to server/ (because mom hooks are configured via the server). I prefer the top level hooks/ directory.

As for the actual hooks directories in server_priv and mom_priv, for mom_priv why not just follow how server_priv is currently defined in the EDD, since as it is written already hooks/ will be copied in?

[quote=“agrawalravi90, post:74, topic:520”]
Added ps -leaf output.
[/quote]

Great, thanks.

As for a top level ‘cmd_output’ directory, I think the layout is fine as it is. Adding another layer just means more typing to get at the same information without adding to the overall organization. Others may of course disagree, and if so please speak up.

One more thought is that currently we have qmgr_lsched.out at the top level rather than in a scheduler/ directory. That makes some amount of sense because currently that’s the only relevant command output. Once PP-748 goes in, will we need more, like “list fs_group”, “list policy”, “list time_window”? (That may be a question for @jon and @arungrover, and what gets listed or not listed when doing “list sched” should probably be added to the WIP PP-748 EDD.) I would suggest creating a scheduler/ directory now and putting qmgr_lsched.out in it, and we can add to it later as needed.


#76

Ok, hook is back on the top level and I’ve added back a scheduler directory. If there’s a chance that more information will be collected from the scheduler in the future, then it’s better to have it.

Please let me know if things look fine now.

Thanks for all the help!


#77

Thanks Ravi, inching towards the finish line here! What about this:

As the EDD currently stands there will be no hooks/ directory containing the actual contents of the directory from mom_priv/, and while that SHOULD not be needed since in theory that all matches what is in qmgr_ph_default.out, I think it is worth collecting for troubleshooting purposes.


#78

Oh, sorry about that. In the code I’ve already implemented it that way; I just updated the design doc to match. Please let me know if things look fine now. Thanks!


#79

Actually, I just made one more change: I renamed “service logs” to “daemon logs” since that seemed to be more popular.

Please let me know if any other changes are needed.

Thanks!


#80

I am happy! I’d be interested to hear from @sgombosi or @brianl (or anyone really) to see if they have any lingering concerns from the support perspective.