We recently had a very weird issue concerning HA config sync and failover. The failover was not working when we set the slave's priority higher than the master's, and we found configuration sync issues. This was all done on FortiOS 5.2, so I can't say anything about later versions, but I think this still applies.
By the way, all the commands here are in my little cheat sheet as well. You can find that on a GitHub repo I have set up for cheat sheets like this.
We could see that the config was not the same on both cluster members. You can check this with a simple command: `diag sys ha cluster-csum`, which shows the configuration checksums. In our case it showed something like the following (example taken from the Fortinet page about config sync issues):
```
================== FG100D3G13xxxxxx =================
is_manage_master()=0, is_root_master()=0
debugzone
global: 89 f2 f0 0b e8 eb 0d ee f8 55 8b 47 27 7a 27 1e
root: cf 85 55 fe a7 e5 7c 6f a6 88 e5 a9 ea 26 e6 92
all: f4 62 b2 ce 81 9a c9 04 8f 67 07 ec a7 44 60 1f

checksum
global: 89 f2 f0 0b e8 eb 0d ee f8 55 8b 47 27 7a 27 1e
root: cf 85 55 fe a7 e5 7c 6f a6 88 e5 a9 ea 26 e6 92
all: f4 62 b2 ce 81 9a c9 04 8f 67 07 ec a7 44 60 1f

================== FG100D3G12xxxxxx ==================
is_manage_master()=1, is_root_master()=1
debugzone
global: 89 f2 f0 0b e8 eb 0d ee f8 55 8b 47 27 7a 27 1e
root: d8 f5 57 46 f0 b8 45 1e 00 be 45 92 a2 07 14 90
all: a7 8d cc c7 32 b5 81 a2 55 49 52 21 57 f9 3c 3b

checksum
global: 89 f2 f0 0b e8 eb 0d ee f8 55 8b 47 27 7a 27 1e
root: d8 f5 57 46 f0 b8 45 1e 00 be 45 92 a2 07 14 90
all: a7 8d cc c7 32 b5 81 a2 55 49 52 21 57 f9 3c 3b
```
As you can see, the cluster members differ in the `root` VDOM and therefore also in the `all` view. As described on that page, you can drill down through the various levels with `diagnose sys ha showcsum <level>`. However, as far as I remember, there were no differences at any level. A colleague then diff-ed the whole config and we found no differences – except in the `config system ha` part. As you know, Fortinet does not sync everything, since there obviously are parts that must not have the same settings on both cluster members. That is the case in this part too, since values like the priority should obviously differ.
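For reference, the drill-down looks roughly like this; the exact level argument and output shape are from my memory of 5.2 and may differ slightly on your build:

```
diagnose sys ha showcsum
diagnose sys ha showcsum 1
```

The first command shows the per-VDOM and global checksums, and adding a level argument drills into the checksums of the individual config objects, which lets you narrow down where two members diverge.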
However, it seems that while that whole part is not synced, some settings still go into that checksum calculation!
We then changed another parameter: `set override enable` on the slave (this was already set on the master). This changes the master election algorithm to favor priority over uptime: with `override` enabled, a higher priority wins; without it, a higher uptime wins. In simple terms, it allows overriding the master election via the priority.
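For completeness, `override` lives under `config system ha`; a sketch of what we set on the slave (5.2 syntax):

```
config system ha
    set override enable
end
```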
After setting that configuration, we set the slave's priority higher than the master's and got an immediate failover (as we wanted). We then checked the checksums again, and everything was fine! Setting the slave's priority back below the master's triggered another failover, again as intended.
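Changing the priority is done in the same config section; the value 200 here is just an illustrative number, not what we actually used:

```
config system ha
    set priority 200
end
```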
What we found out: some config parts are not synced, but they still go into the configuration checksum calculation. Keep that in mind when you run into similar issues.
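If the checksums stay out of sync, it may also be worth forcing a manual resynchronization; if I remember correctly, on 5.2 this is run on the slave (the exact command may vary by version):

```
execute ha synchronize start
```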
UPDATE: we checked it in the lab and apparently, `override` is not part of the checksum calculation. So we can't really pin down what actually fixed it, but everything is fine now.