Capture "sub" snapshots for pbs_snapshot --additional-hosts


#1

Hi,

I’m proposing a design change for pbs_snapshot’s “--additional-hosts” option. Right now, when given a list of hosts, pbs_snapshot captures things like comm_logs, mom_logs, mom_priv and the output of some system commands from each of those hosts (whatever can be captured). This has two disadvantages:

  1. It causes a lot of data to be copied over the network from the various hosts to the machine where pbs_snapshot was run.
  2. Copying protected data (e.g., mom_priv) from remote hosts is very tricky when running pbs_snapshot as a non-root user with the new “--with-sudo” option.

So, I propose the following:

  • When capturing data from remote hosts, pbs_snapshot will run pbs_snapshot directly on the remote hosts.
  • Once the “sub” snapshots are captured completely, the main pbs_snapshot program will copy over the compressed snapshot tarballs as _snapshot.tgz and keep them in the first level of the main snapshot that’s being captured.
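
A rough sketch of the flow in the two bullets above, not the actual implementation. The names capture_sub_snapshots, sub_snapshot_dest, run_remote and fetch are hypothetical, and so is the exact "<host>_snapshot.tgz" file name used here. Skipping hosts that fail is also an assumption, in the spirit of "whatever can be captured".

```python
import os


def sub_snapshot_dest(main_snapshot_dir, host):
    """Local path for one host's "sub" snapshot tarball.

    Hypothetical naming: the tarball sits in the first level of the
    main snapshot and carries the host name.
    """
    return os.path.join(main_snapshot_dir, f"{host}_snapshot.tgz")


def capture_sub_snapshots(hosts, main_snapshot_dir, run_remote, fetch):
    """Run a snapshot on every host and copy each tarball back.

    run_remote(host) -> path of the compressed snapshot on that host
    fetch(host, remote_path, local_path) copies the tarball over
    Both callables are placeholders for whatever remote execution and
    copy mechanism is actually used.
    """
    captured = {}
    for host in hosts:
        local_path = sub_snapshot_dest(main_snapshot_dir, host)
        try:
            remote_tarball = run_remote(host)
            fetch(host, remote_tarball, local_path)
            captured[host] = local_path
        except Exception:
            # Assumption: an unreachable host should not abort the
            # main snapshot, so record the failure and move on.
            captured[host] = None
    return captured
```

Only one compressed file per host crosses the network this way, and the protected files are read by the pbs_snapshot process running on the remote host itself.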

Please let me know what you guys think. @scc, requesting you specifically to provide some feedback.

Thanks!


#2

Gentle reminder, @scc could you please review this proposal?

@bhroam and @arungrover, requesting you guys to also review this since we came up with this together. Thanks!


#3

Sounds good to me @agrawalravi90


#4

@agrawalravi90, I like this idea, but have a few questions:

  1. The current pbs_snapshot EDD does not define what the snapshot will contain if pbs_snapshot is run directly on an execution host, and in fact snapshot tries to collect all of the same information as though it had been run on the server itself. This is fine for when it is invoked normally, but would lead to lots of redundant server queries (pbsnodes, qstat, etc.) that are already being collected by the top level snapshot invocation. Did you all discuss introducing a new mode of operation that would skip these queries and only collect the relevant local information (logs and all v1 and v2 configuration files, essentially)?

  2. What will the snapshots be named? I think the “sub” snapshots should contain the host name from which they came in the .tgz filename.


#5

Thanks for your feedback, Scott. We didn’t discuss adding a separate mode for capturing data from non-PBS server nodes. I was just thinking that we’d call it with -H <hostname> set to the mom host, so it would issue qstat etc. but wouldn’t get anything back, since the host is a mom/comm. But we should probably add a new mode to pbs_snapshot instead, as that’ll be more elegant and reliable.

About what the “sub” snapshots will be named, yes, we were thinking the same thing, to name them by their host’s hostname.


#6

I created a design document for this: https://pbspro.atlassian.net/wiki/spaces/PD/pages/718635011/Enhance+pbs+snapshot+to+capture+remote+host+data+as+sub+snapshots

Please review it and provide feedback. Thanks!


#7

Thanks @agrawalravi90, looks good!


#8

@agrawalravi90 I think we should introduce a new argument to snapshot for capturing only mom-related data. There could be a use case where one would want to capture a snapshot from a mom node only (along with server queries). If you change ‘-H’ so that the hostname must match the server name before the additional server queries are issued, then one has to run the snapshot from the server itself to get server-related data.

How about this: when additional hosts are specified, we run snapshot on those hosts with this new argument, which captures information only from that host?


#9

I don’t think I understood you completely. If the -H argument is a valid PBS server, then all of the PBS commands will be executed; if it is not a valid PBS server, they won’t be. So even with this change, somebody can still run pbs_snapshot remotely; they won’t have to run it from the server node, right?
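
To make that rule concrete, here is a minimal sketch. The function name, the SERVER_QUERIES list and the is_valid_server callback are all hypothetical; only qstat and pbsnodes are named in this thread, and how pbs_snapshot really decides that a host is a valid PBS server is not shown here.

```python
# Server-level queries mentioned earlier in the thread ("pbsnodes, qstat, etc.").
SERVER_QUERIES = ["qstat", "pbsnodes"]


def server_queries_for(target_host, is_valid_server):
    """Return the server queries to issue for the host given to -H.

    is_valid_server(host) -> bool is a placeholder for the real check.
    A mom/comm-only host yields no server queries at all.
    """
    if is_valid_server(target_host):
        return list(SERVER_QUERIES)
    return []
```

With a check that recognizes only the server host, a sub snapshot taken on a mom host would skip the redundant server queries, while a run from a client host with -H pointing at the server would still issue them.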


#10

Ah, got it. I didn’t realize it takes the server name as an argument. I guess what you have will work.
Sorry about the confusion.


#11

Design doc looks good to me. Thanks for coming up with that.


#12

Thanks guys, seems like there’s enough consensus, so I’ll go ahead and implement the code.


#13

Hey guys, while implementing the code for this, I realized that I had to make another, hopefully smallish, interface change. I’ve updated the EDD to reflect it, but here’s the text for quick reference:

Primary host captured will now be local host by default:

  • Earlier, pbs_snapshot would actually parse the pbs.conf on the local host, find the pbspro server host and capture that. Now, pbs_snapshot will capture the local host by default. The -H option should be used to point to the remote pbs server if pbs_snapshot is invoked from a client host. This is needed to prevent the child pbs_snapshot invocations from capturing data from the main pbs server host.
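
A small sketch of the new default, for illustration only. The parse_target_host function and the use of argparse and socket.gethostname() are assumptions, not the actual pbs_snapshot code; only the -H option and the "local host by default" behavior come from the text above.

```python
import argparse
import socket


def parse_target_host(argv):
    """Pick the primary host to capture.

    -H wins when given; otherwise fall back to the local host rather
    than the server named in pbs.conf.
    """
    parser = argparse.ArgumentParser(prog="pbs_snapshot_sketch")
    parser.add_argument("-H", dest="host", default=None,
                        help="primary host to capture")
    args = parser.parse_args(argv)
    return args.host or socket.gethostname()
```

For example, parse_target_host(["-H", "pbs-server"]) returns "pbs-server", while parse_target_host([]) returns the name of the machine it runs on, which is what a child invocation on a mom host needs.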

Please let me know if you guys have any concerns regarding this change. Thanks and sorry that I didn’t think about this before.


#14

btw, here’s the PR: https://github.com/PBSPro/pbspro/pull/825