Alternate security model to CLUSTERADDR2

Hi All,

This proposal is to eliminate the use of IS_CLUSTERADDR because of its scalability issues and to use an alternate scheme instead.

You can find my initial solution on this page:

https://pbspro.atlassian.net/wiki/spaces/PD/pages/1320878087/Alternate+security+model+to+CLUSTERADDR2

Let me know your comments.

Could you please describe any plans for backward compatibility and how a site would migrate from the old mechanism? Will they need to drain the cluster, or can this be done while jobs are running?

The packets have a fixed content for a given IP address, making spoofing easier. Just wait for a node to go down (e.g., for provisioning) and grab its IP address.

I think it would be better to generate a packet containing several bytes of random data, followed by a timestamp, then the IP address. Take this and encrypt it with the shared key and send that.

The receiver decrypts it using the shared key, ignores the random data, and validates the timestamp and IP address.

Replays are caught by the timestamp. And every packet is unique, making partial-known-plaintext attacks harder.
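
A minimal sketch of the packet layout described above (random bytes, then timestamp, then IP address), in Python. One loud assumption: the standard library has no symmetric cipher, so this sketch swaps in an HMAC-SHA256 tag over the packet instead of encrypting it. That authenticates the contents with the shared key but does not hide them. The names `build_packet`, `verify_packet`, `NONCE_LEN` and `MAX_SKEW` are hypothetical and not part of PBS.

```python
import hashlib
import hmac
import ipaddress
import os
import struct
import time
from typing import Optional

NONCE_LEN = 8   # assumed number of random bytes at the front of the packet
MAX_SKEW = 30   # assumed clock tolerance in seconds
TAG_LEN = 32    # size of a SHA-256 digest


def build_packet(key: bytes, ip: str, now: Optional[float] = None) -> bytes:
    """Return nonce || timestamp || IPv4 address || HMAC tag."""
    nonce = os.urandom(NONCE_LEN)
    ts = int(time.time() if now is None else now)
    addr = ipaddress.IPv4Address(ip).packed
    body = nonce + struct.pack("!Q", ts) + addr
    # HMAC stands in for "encrypt with the shared key" (assumption, see above).
    tag = hmac.new(key, body, hashlib.sha256).digest()
    return body + tag


def verify_packet(key: bytes, packet: bytes, sender_ip: str,
                  now: Optional[float] = None) -> bool:
    """Check the tag, then the timestamp and IP; the random bytes are ignored."""
    if len(packet) != NONCE_LEN + 8 + 4 + TAG_LEN:
        return False
    body, tag = packet[:-TAG_LEN], packet[-TAG_LEN:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return False
    (ts,) = struct.unpack("!Q", body[NONCE_LEN:NONCE_LEN + 8])
    addr = ipaddress.IPv4Address(body[NONCE_LEN + 8:])
    current = time.time() if now is None else now
    if abs(current - ts) > MAX_SKEW:
        return False  # too old or too far in the future: treat as a replay
    return addr == ipaddress.IPv4Address(sender_ip)
```

Because of the random prefix, two packets built for the same IP address in the same second still differ, which is the property the fixed-content packets lack.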

Hey @mkaro,

This is only the proposal for replacing the CLUSTERADDR functionality. At the time of implementation we can circle back to what is required. Technically, it is possible to support both protocols, the old and the new (since we will version this as well, as part of the new PING protocol).

Hi @dtalcott,

Yes, you are right about the possibility of IP spoofing. However, IP spoofing from outside a subnet is somewhat difficult: you can send a message to the target, but the reply will usually make its way back to the host that holds the original IP (or to nobody), since the routers will do their jobs (unless you have access to the routers themselves). Nevertheless, it is possible.

Our view was that this is only to replace the CLUSTERADDR2 message. The overall IM exchange sequence is protected by other authentication mechanisms, like munge (and, in future, TLS etc.). This is, therefore, not a replacement for authentication, just a replacement for the CLUSTERADDR2 message.

That said, we can introduce a timestamp; our feeling was that this would need the clocks to be fairly well synchronized. Is that something that can be enforced? Of course, I think munge mandates that anyway…

If mandating that clocks be synchronized is not an issue, then we can definitely use a timestamp.

Of course, we can also store the last packet's timestamp for each sender and check that the incoming timestamp is greater than the stored one; that does not need the clocks to be synced.
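
A rough sketch of that last idea, assuming the receiver keeps one timestamp per sender IP in memory. The class name `LastSeen` and its method are hypothetical. The comparison only ever uses the sender's own clock, so no synchronization between hosts is needed.

```python
class LastSeen:
    """Remember the newest timestamp accepted from each sender."""

    def __init__(self):
        self._last = {}  # sender IP -> last accepted timestamp

    def accept(self, sender_ip: str, ts: int) -> bool:
        """Accept a packet only if its timestamp is strictly newer."""
        prev = self._last.get(sender_ip)
        if prev is not None and ts <= prev:
            return False  # same or older timestamp: replayed or reordered
        self._last[sender_ip] = ts
        return True
```

Two caveats with this sketch: the table is lost when the receiver restarts, and with one-second timestamps a sender cannot send two valid packets within the same second, so a finer resolution or a counter would probably be needed.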

I’ve thought about this more and have decided the original proposal is adequate. The only purpose of the CLUSTERADDR2 message is to add an IP address to a MOM’s list of in-the-same-cluster hosts. It does not guarantee that the (current) holder of that IP address is not a rogue of some kind.

Note this new method does not protect a Mom from a Mom that was removed from the server via qmgr. The old Mom still has the secret, so it can send valid CLUSTERADDR2 msgs. Under the old method, the server would push out a new list with that old Mom removed.

In a similar vein, is this a risk in a cloud environment where the same image might be used for multiple instances? Thus, same secret. So a Mom from one instance could add itself to the Moms in another instance if it knew their IP addresses?

Yes, you are right.

The shared key should be removed from the mom once the node is removed from the cluster, to prevent it from falling into the wrong hands.
In a cloud environment, the key should be added and removed on the fly; making the key part of the cloud image is insecure.

Exchange with Ben Matthews (NCAR):

Sorry about sitting on this for a few days – it’s been a busy week. Without thinking about it a ton, I do have a few thoughts:

  • Wouldn’t it be easier to just standardize on munge and only share one key around?

[samg] This isn’t a replacement for job authentication, it’s just for the moms to recognize each other.

Sure, but the MoMs need munge (or pbs_ifl or whatever it is, but, for the sake of argument, munge because I understand it better) to authenticate the job anyway. You already have a pre-shared key and the assumptions about consistent user ids from munge. My understanding is that Munge is vulnerable to replay attacks for a configurable short period of time unless there is some sort of message cache (I believe Slurm keeps one; as far as I know, PBS does not), but even used naively it is significantly better than nothing. Most of the work is already done, so why re-invent the wheel? Maybe just sign everything with munge as the PBS user or root.

I’m not too familiar with what the MoMs might tell each other. I’m assuming that this is security sensitive since the MoM might run arbitrary things as anyone, but that doesn’t necessarily have to be the case.

  • I don’t think I agree that IP spoofing is difficult on a lot of modern clusters. Why wouldn’t you mitigate replay attacks by including a sequence number and/or timestamp, and probably a cache of past messages? Mellanox has a nice interface for RDMAing raw ethernet frames to/from their NICs, which I’m pretty sure works as a regular user on many HPC setups.

[samg] Our view was that this is only to replace the CLUSTERADDR2 message. The overall IM exchange sequence is protected by other authentication mechanisms, like munge (and, in future, TLS etc.). This is, therefore, not a replacement for authentication, just a replacement for the CLUSTERADDR2 message.

That said, we can introduce a timestamp; our feeling was that this would need the clocks to be fairly well synchronized. Is that something that can be enforced? Of course, I think munge mandates that anyway…

munge does, most parallel filesystems do, etc. I don’t think this is unreasonable, at least in HPC. We once had a super fun multi-day outage that ended up being GPFS (silently!) refusing to mount due to clock drift. If nothing else, you need reasonable clocks to make any sense out of any node-local logs.

Make the tolerance configurable for situations where you have bad system clocks and are ok with the reduced security?
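
To make the timestamp, message cache and configurable tolerance ideas above concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the `ReplayCache` class, the `tolerance` parameter and the use of an opaque `msg_id` (for example a sequence number or the random bytes of a packet) are hypothetical and not existing PBS or munge interfaces.

```python
import time
from typing import Optional


class ReplayCache:
    """Reject messages that are outside the time window or already seen."""

    def __init__(self, tolerance: float = 30.0):
        self.tolerance = tolerance  # site-configurable clock tolerance (seconds)
        self._seen = {}             # message id -> message timestamp

    def check(self, msg_id: bytes, ts: float,
              now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # Forget entries that the time window would reject anyway.
        cutoff = now - self.tolerance
        self._seen = {m: t for m, t in self._seen.items() if t >= cutoff}
        if abs(now - ts) > self.tolerance:
            return False  # outside the configured window
        if msg_id in self._seen:
            return False  # replay inside the window
        self._seen[msg_id] = ts
        return True
```

A larger `tolerance` copes with worse clocks but keeps more entries in the cache and widens the window in which a captured message is still worth replaying, which is the reduced-security trade-off mentioned above.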

  • It seems like this scheme wouldn’t work across NAT or through any kind of middle box that messes with the IP headers. I don’t think we directly care right now, but in a universe where we had a second PBS cluster, we would very likely want to be able to exchange work between them, maybe even be able to run both on one of the servers while the other is down.

  • What if each host has a bunch of IPs?

  • What happens if IPs change? SGI’s management software would like to change IPs when hardware gets replaced. We don’t allow it because GPFS doesn’t tolerate it, but not everyone runs GPFS.

[samg] The shared key should be removed from the mom once the node is removed from the cluster, to prevent it from falling into the wrong hands. How this would happen is open for discussion…

The node IP can change without it ever being removed from the cluster (certainly without PBS getting told about it). This used to (13.x, 14.x) be quite problematic for PBS, but last time it happened things only broke a little (things get stuck in “various” state) and there were extra vnodes to delete.

Here’s another fun security thought exercise: You have a traditional stateless cluster – pxeboot (dhcp, tftp something suggested by the tftp response, potentially wget a fatter image and chroot into it). What prevents an end-user from querying the management server and obtaining a copy of the image (including your munge or pbs key)? HP’s management system actually has some mitigation for this and our (custom) viz cluster has pretty good mitigation (firewall by originating user), but a lot of clusters don’t. What happens when two (or N) instances of MoM have the same key and IP?

Of course, munge can be vulnerable to this attack too and given how it’s used, is an easier path to root.

-Ben