Rather Technical: I want to break free

Sometimes, one has release tasks in Azure DevOps (formerly known as TFS) that don't complete synchronously.

In my case, it was Active Directory driven TLS certificate issuance. Active Directory has an interface where certificate authorities (CAs) can advertise themselves, and clients in want of certs can send signing requests to those. In our software shop, we have this kind of setup, but the way the CA is organized, our cert requests are not served immediately; they have to sit there until a human approves them. This has to do with a policy I don't have the political clout to amend.

But back to scripted, AzDevOps driven releases. Setting up TLS as a part of the initial release of a Web-based system is a natural thing, but what to do about the fact that the certificate, once requested, may not become available for up to several hours? When the CA administrators approve it, the cert gets pushed to the requesting machine via a Group Policy update, but that never happens immediately.

Enter the idea of background polling. We fire a cert request and have a process run in the background, checking if the cert has dropped every 10 minutes or so, and once it does, the script follows up - binds it to the IIS site.
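The polling idea itself is simple enough to sketch. The snippet below is a minimal illustration in Python, not the script I actually ran; poll_until and cert_has_arrived are made-up names, and the "check" here is a stand-in that succeeds on the third attempt rather than a real certificate store lookup.

```python
import time
from typing import Callable

def poll_until(check: Callable[[], bool], interval_s: float, timeout_s: float) -> bool:
    """Call check() every interval_s seconds until it returns True or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True          # the thing we were waiting for has shown up
        if time.monotonic() >= deadline:
            return False         # give up; the caller decides what to do next
        time.sleep(interval_s)

# Stand-in for "has the cert dropped yet?" (hypothetical): true on the 3rd call.
attempts = []

def cert_has_arrived() -> bool:
    attempts.append(1)
    return len(attempts) >= 3

found = poll_until(cert_has_arrived, interval_s=0.01, timeout_s=5)
print(found, len(attempts))  # True 3
```

In the real scenario the interval would be on the order of 600 seconds, the timeout several hours, and the follow-up (binding the cert to the IIS site) would run once the function returns True.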

Now, here is the snag: if you run a process from an AzDevOps release, the process quits when the release job quits. That was a big surprise to me at the time, but I've isolated the behavior to the bare minimum - it's clearly there. The nature of the spawning task and/or the process being run doesn't matter - custom or built-in, PowerShell, Node, native, managed - they all meet the same fate. The depth of the process tree doesn't matter either, subprocesses of subprocesses are forced to quit also.

How does AzDevOps do that, was my next question. Once we know how, we can hopefully subvert it. I was not even sure whether this is AzDevOps proper, or Windows has some logic in place to kill the whole process tree when the parent process quits. As of AzDevOps agent 2.210.1 (which comes with AzDevOps 2022), each job (formerly known as phase) is executed by a dedicated process, Agent.Worker.exe. So the processes that are spawned by the job all have a common parent process, the job worker process.

Windows kind of has the logic for terminating subprocess trees - in the form of kernel job objects (distinct from AzDevOps jobs). Process Explorer to the rescue - no kernel jobs here. Other than those, I could not find a built in provision to terminate a process tree. Console processes that share the same console window probably can't outlive the console, but I didn't quite go there.

Is there a way to escape the default subprocess tree somehow? In other words, can one start a process with the parent that is distinct from its creator? Turns out that we can. The idea is:

1. Create/initialize a PROC_THREAD_ATTRIBUTE_LIST object with 1 attribute
2. Put a PROC_THREAD_ATTRIBUTE_PARENT_PROCESS there, specifying a process handle to a fake parent
3. Put said object into lpAttributeList of a STARTUPINFOEX structure
4. Create a child process with dwCreationFlags EXTENDED_STARTUPINFO_PRESENT, passing said structure

Needless to say, this deep magic is not particularly exposed to shells on top of CreateProcess(). So I've slapped together an executable in C++ that does that, tried under AzDevOps - no, the process that is spawned that way still quits when the AzDevOps job quits.

At that point I thought, enough with the blackbox approach, let's take a look at the cleanup logic of the job worker - break open the ILSpy. The executable is right there, in the bin subfolder under the AzDevOps agent's root folder. Surprise - ILSpy says Agent.Worker.exe contains no managed code. That was unexpected. Let's connect to it with the Visual Studio debugger, decided I next. And there managed code was - according to the "Attach to process" window in Visual Studio, Agent.Worker.exe did too contain managed code, specifically the .NET Core 3 kind. The module list immediately revealed that all the managed code was in Agent.Worker.dll. I sicced ILSpy on that DLL, and there all the logic was, in all its decompiled glory.

The primary logic of job execution is in the method RunAsync() in class JobRunner in namespace Microsoft.VisualStudio.Services.Agent.Worker. For the cleanup after a job, it calls into FinalizeJob() in class JobExtension. And there you have it - halfway through the method, there is a trace message: "Start cleaning up orphan processes".

The logic of orphan cleanup is:

1. Go over the whole list of processes (as obtained via Process.GetProcesses())
2. Retrieve the environment variables of each found process
3. If the variable VSTS_PROCESS_LOOKUP_ID has the value "vsts_{GUID}", kill the process

As an aside, all this talk of killing orphans is why the IT industry has a bit of a reputation.

The GUID in the third step is a class static, initialized in place and never reset - effectively, a per-run unique identifier.

The design becomes clear enough. The worker sets up this environment variable, giving it a well known name and a value that is unique to the act of job execution. All spawned processes inherit the environment block (unless explicitly told otherwise). There is no process tree traversal to speak of - it's environment inheritance that marks the job spawned processes. And with that knowledge, the way to escape the sandbox becomes clear - make sure the environment block of the background process doesn't contain the environment variable VSTS_PROCESS_LOOKUP_ID.
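To make the marking scheme concrete, here is a toy model of it in Python. This is not the agent's code (that is C#, and it talks to real processes); ProcSnapshot and select_orphans are hypothetical names, the process list is faked, and I am assuming an exact string comparison on the value.

```python
from typing import Dict, List, NamedTuple

MARKER = "VSTS_PROCESS_LOOKUP_ID"

class ProcSnapshot(NamedTuple):
    """A fake process: just a PID and a copy of its environment."""
    pid: int
    env: Dict[str, str]

def select_orphans(snapshots: List[ProcSnapshot], lookup_id: str) -> List[int]:
    """Return the PIDs whose environment carries this job's marker value."""
    return [p.pid for p in snapshots if p.env.get(MARKER) == lookup_id]

job_id = "vsts_00000000-0000-0000-0000-000000000000"  # placeholder, not a real ID

snapshots = [
    ProcSnapshot(100, {"PATH": r"C:\Windows", MARKER: job_id}),        # inherited the marker
    ProcSnapshot(200, {"PATH": r"C:\Windows"}),                        # marker was removed
    ProcSnapshot(300, {"PATH": r"C:\Windows", MARKER: "vsts_other"}),  # some other job's child
]

print(select_orphans(snapshots, job_id))  # [100]
```

Process 200 is the one we want to be: same machine, same parentage, but invisible to the cleanup because the marker is gone from its environment.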

One other caveat has to do with the working folder of the job. By default, the current folder of all tasks of the job is the scratch folder under _work that AzDevOps creates for the release to operate in. AzDevOps deletes and recreates that folder on every run. If a background process has that scratch folder as its current, Windows places a lock on it (as if it was an open file), so it can't be deleted, and subsequent runs of the same release pipeline would fail.

Armed with this knowledge, a real true background process can be started with the following command line:

cd /d c:\
set VSTS_PROCESS_LOOKUP_ID=
start powershell.exe c:\Path\MyBackgroundScript.ps1

The "start" command is there to make sure the batch execution logic doesn't wait for the PowerShell process to complete. The "cd" command is there to make sure the release work folder is not locked.
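If the launching task happens to be a script rather than a batch file, the same three ingredients (neutral current folder, no marker variable, no waiting) carry over. Below is a rough Python equivalent. I have not run this particular variant under an agent, so treat it as a sketch; scrubbed_env and launch_background are names I made up, and using the drive root as the working folder is my assumption of a safe default.

```python
import os
import subprocess
import sys

MARKER = "VSTS_PROCESS_LOOKUP_ID"

def scrubbed_env(env=None):
    """Copy an environment, dropping the agent's marker (in any letter case)."""
    source = os.environ if env is None else env
    return {k: v for k, v in source.items() if k.upper() != MARKER}

def launch_background(cmd, cwd=None):
    """Start cmd without waiting for it, outside the job's scratch folder."""
    flags = 0
    if sys.platform == "win32":
        # Give the child its own console, roughly what cmd's "start" does.
        flags = subprocess.CREATE_NEW_CONSOLE
    return subprocess.Popen(
        cmd,
        env=scrubbed_env(),                    # no VSTS_PROCESS_LOOKUP_ID inside
        cwd=cwd or os.path.abspath(os.sep),    # drive root instead of the _work folder
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        creationflags=flags,
    )
```

The call that mirrors the batch snippet would be launch_background(["powershell.exe", r"c:\Path\MyBackgroundScript.ps1"]). The returned Popen object is deliberately not waited on.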

The PowerShell 7 version of the Start-Process cmdlet has a switch to explicitly set the environment of the new process from an arbitrary dictionary, but the PowerShell 5 version does not; it only has the -UseNewEnvironment switch, which merely resets the spawned process' environment to the machine default, which is somewhat deficient.

I mentioned that the worker checks the environment of the discovered processes; the way it does so (on Windows) is quite fascinating. There is no straightforward way to do that. What they do is:

1. Use NtQueryInformationProcess() to retrieve PROCESS_BASIC_INFORMATION
2. From its PebBaseAddress, get to the Process Environment Block (PEB)
3. From its ProcessParameters, get to the RTL_USER_PROCESS_PARAMETERS structure
4. Get the pointer at offset 128 of that struct (for 64-bit processes)
5. Get the memory block that it points at
6. Parse as if it was the environment block (a doubly null terminated set of null terminated Unicode strings)

Where do we start with what is wrong with this? NtQueryInformationProcess() is documented, but it has a scary disclaimer:

NtQueryInformationProcess may be altered or unavailable in future versions of Windows. Applications should use the alternate functions listed in this topic

PEB and RTL_USER_PROCESS_PARAMETERS have similar disclaimers. And then we get to the part where they pull the pointer from the RTL_USER_PROCESS_PARAMETERS at offset 128. According to the docs, as of April 2024, there is nothing of interest in that structure past the CommandLine field. There is evidence that once the definition of the structure was documented differently and there was indeed the environment block pointer after the command line field, but it's no longer official. And if you check the contents of that pointer in a sample process, it doesn't quite point at the environment. It points at something that looks like a null terminated Unicode string that goes, in my case, "=::=::\", and then the environment.

So parsing the contents of this memory kind of works, but I wonder how fragile that is.
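For the curious, the last step (parsing the block) is the easy part. Here is a minimal sketch in Python, assuming the bytes have already been read out of the target process somehow; parse_env_block is a made-up name and this is not a port of the agent's code. One design choice worth noting: the "=" separator is searched for starting at the second character, so an entry like "=::=::\" ends up as an odd-looking variable named "=::" instead of breaking the parse. As far as I know, that is how such leading-"=" entries are conventionally treated, but take it as an assumption.

```python
def parse_env_block(raw: bytes) -> dict:
    """Parse a Windows-style environment block: UTF-16LE strings, each
    terminated by a null, with an empty string marking the end."""
    usable = len(raw) - len(raw) % 2           # ignore a trailing odd byte, if any
    text = raw[:usable].decode("utf-16-le", errors="replace")
    env = {}
    for entry in text.split("\0"):
        if not entry:
            break                              # empty string: end of the block
        idx = entry.find("=", 1)               # skip a possible leading "="
        if idx == -1:
            continue                           # not NAME=VALUE; ignore it
        env[entry[:idx]] = entry[idx + 1:]
    return env

# A hand-made sample block, modeled on what I saw in a real process.
sample = (
    "=::=::\\\0"
    "PATH=C:\\Windows\0"
    "VSTS_PROCESS_LOOKUP_ID=vsts_1234\0"
    "\0"
).encode("utf-16-le")

env = parse_env_block(sample)
print(env["VSTS_PROCESS_LOOKUP_ID"])  # vsts_1234
```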

Saturday, April 13, 2024
