Both the server and the library are native code, so a global try/catch around the library call is not really an option. So this calls for a dispatcher/worker architecture; the service receives requests and routes them to worker processes, one request at a time. If a worker crashes, the service will know and act accordingly.
One of the projects where I had to deal with this was on Linux; that's a story for another day. The other one was on Windows, and that's what I would like to discuss.
So, dispatcher/worker communication in Windows. It all hinges on the choice of an interprocess communication mechanism. I'd like an IPC that:
- Reliably detects server crashes
- Is message-based as opposed to stream-based
- Has a built-in datatype marshaling logic
Component Object Model (COM) comes to mind. The worker program would be the COM server with a single object, the dispatcher would instantiate the server object and call its methods. Each request translates into one or more COM method calls. Server crash detection - check. Built-in marshaling - check. But there's a wrinkle. A COM out-of-process server is not supposed to run multiple instances. Here's how COM usually works:
- A server executable is listed in the registry under the CLSID
- A client calls CoCreateInstance() with that CLSID
- The run-time starts the server executable
- The server executable calls CoRegisterClassObject() for the CLSID
- The run-time calls the object factory
If a subsequent call for the same CLSID comes, the run-time would reuse the same server process rather than starting a new one.
Also, the loosely coupled nature of COM is a bit of an overkill for my scenario. I never meant to expose my worker program to clients other than my dispatcher. The whole COM machinery for making servers exposed and user friendly to third party clients is irrelevant to my case.
So, how can we have a COM client creating multiple, identical COM objects running in different processes? Running Object Table (ROT) to the rescue. COM servers can publish their objects in a global repository, identified by arbitrary monikers. So the idea is:
- The dispatcher starts multiple worker processes
- Each gets a unique integer parameter (a cookie) via the command line
- The worker registers an object in the ROT, identified by the cookie
- The dispatcher retrieves that object
This is different from the regular object creation protocol. The worker program has no object factory (since it's only running exactly one object). There's no need to register the server in the registry. The object needs no CLSID. As for the interface, in my case, I'd use raw IDispatch, so there's no need for any marshaling code, either. My dispatcher/worker exchange protocol can be perfectly served by passing an array of VARIANTs both ways.
The only addition to that protocol is that the dispatcher needs to know once the worker's COM object is available in the ROT. I did that with a named event object, where the name contains the cookie. Once the worker starts up and registers its object, it would set the event. Maybe a short sleep on the dispatcher side would accomplish the same, but this is both faster and safer.
In my case, the worker can only process one request at a time, so the worker doesn't need to be multithreaded. So the CoInitialize() call in the worker would specify COINIT_APARTMENTTHREADED, and then the worker's WinMain() would have to run a message loop.
Now, dealing with the worker process crashes. There can be three kinds:
- During process startup
- During the request
- Between the requests
The first one is rather easy. I mentioned that the dispatcher starts the worker and waits for the "I'm ready" event. Make that a wait for two objects, the event handle and the worker process handle, and see if the process terminates before the event is set. If it does, that means a startup crash.
If the process crashes during the request, COM will report an error. The question is, which one? After some extensive testing with numerous crashes, I've identified the following HRESULT values:
Either of those means the server process terminated during the COM call, one way or another.
If the process terminates between requests, the next COM call would return HRESULT_FROM_WIN32(RPC_S_SERVER_UNAVAILABLE).
What are the quit conditions for the worker? One can implement a "please quit" method, and tell all workers to quit during dispatcher shutdown. A worker might quit when the client disconnects (e. g. the main object is released to zero). It may quit after a timeout of inactivity. In my implementation, I'd pass the PID of the dispatcher to the workers, and made them quit if the dispatcher terminates. The message loop becomes a MsgWaitForMultipleObjects() loop, with the objects being the parent process handle.
Rather than post the code here, I've published it as a Gist. The gist contains a sample worker, a sample dispatcher that creates several threads and calls the worker on each of them. The worker crashes at random with an access violation, but the dispatcher is handling that gracefully.
There's no ATL dependency in the project. There's a Native COM reference in the dispatcher, but that can be easily avoided, if dependencies are a problem. I've compiled it with Visual Studio 2017. Dispatcher is meant to be a console project (it prints some lines), while Worker is a Windows GUI one (it has a message loop).
To summarize, this is a fun little way of making COM dance. No registration, no typelibs, no proxy-stub machinery, no ref counting. Just the bits that we like - reliable interprocess communication, cross-process error handling, friendly passing of simple-typed parameters as VARIANTs.