Survey answered -- I would be more than happy to talk with you about how we have this deployed.
Here are my thoughts on the whole failover engine using NeverFail: this is a sledgehammer to crack a grape.
It's really good for file-based replication, e.g. voicemail systems where each message is stored in the filesystem. It's not so good for application failover.
So, taking a deep breath, here are my thoughts that go deeper into the product architecture, rather than messing around trying to get NeverFail to be the solution:
Where I am using the failover engine to provide failover for an additional poller, this could instead be provided by a passive polling engine. Each node could then have a primary and a secondary polling engine. If the primary polling engine doesn't poll a node within a certain time, the secondary poller does it automatically. You might even think about providing N+1 redundancy schemes for those people with a single site who only need to cope with a single server failure.
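To make the primary/secondary idea concrete, here is a minimal sketch of the takeover rule. It is not how Orion works today; the names (`NodeAssignment`, `engine_that_should_poll`, `grace`) are hypothetical, and it assumes every engine can see when a node was last polled (for example, from the shared database).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeAssignment:
    """Hypothetical record: which engines are responsible for one node."""
    node: str
    primary: str          # engine that normally polls the node
    secondary: str        # passive engine that takes over on a missed poll
    last_polled: float    # timestamp (seconds) of the last successful poll


def engine_that_should_poll(a: NodeAssignment, now: float,
                            interval: float, grace: float) -> Optional[str]:
    """Return the engine that should poll the node now, or None if no poll is due.

    The primary gets the first chance once `interval` has elapsed. If the poll
    is still outstanding after a further `grace` seconds, the secondary steps in.
    """
    age = now - a.last_polled
    if age < interval:
        return None            # polled recently enough, nothing to do
    if age < interval + grace:
        return a.primary       # poll is due, primary's turn
    return a.secondary         # primary missed its window, fail over


a = NodeAssignment("router1", "poller-a", "poller-b", last_polled=1000.0)
print(engine_that_should_poll(a, now=1100.0, interval=60.0, grace=30.0))
```

Because the rule only looks at the age of the last poll, the secondary hands the node back on its own as soon as the primary starts polling again; no separate failback step is needed.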
Eliminate the difference between the application server and the additional poller so all polling engines regardless of where they are running are equivalent.
So, a legacy application server = 1 web server + 1 polling engine.
If an install needs more web users or redundancy of the web UI then allow N additional web servers.
Provide poller upgrade packages that can be deployed from a central location, so we can upgrade the whole infrastructure from one point rather than, as now, having to do at least two installs per package. (Admittedly I am now so fast at this that I can completely reinstall the NPM app server and four additional pollers on all primary and secondary servers before I get off hold for tech support.)
Load all of the configuration files (except the database connection information) into the database so there are no files that need to be replicated anywhere -- when a node makes contact with the database on initialization, it exports the config blobs so the components can start up.
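A rough sketch of what that startup export could look like. Everything here is an assumption for illustration: the `config_blobs` table, the function names, and the use of SQLite as a stand-in for the real database.

```python
import sqlite3
from pathlib import Path


def store_config(conn: sqlite3.Connection, name: str, content: bytes) -> None:
    """Save one configuration file into the (hypothetical) config_blobs table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS config_blobs "
        "(name TEXT PRIMARY KEY, content BLOB)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO config_blobs (name, content) VALUES (?, ?)",
        (name, content),
    )
    conn.commit()


def export_configs(conn: sqlite3.Connection, target_dir: str) -> list:
    """On node initialization, write every stored blob out as a local file.

    Returns the file names written, in name order.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    rows = conn.execute("SELECT name, content FROM config_blobs ORDER BY name")
    for name, content in rows:
        # Keep only the final path component so a stored name
        # cannot place a file outside target_dir.
        path = target / Path(name).name
        path.write_bytes(content)
        written.append(path.name)
    return written
```

With this shape the only thing a new or rebuilt node needs locally is the database connection information; the rest is pulled down before the components start.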
Provide a Kiwi Syslog-like node that centralizes all of the logfiles from all of the nodes in the Orion cluster into one place, instead of leaving them on every poller.
This would give me an install that is significantly easier to manage, and the complexity of the install would scale at less than O(2n).
/RjL