Injecting from Ring-0: Kernel APC DLL Injection

This post covers how Peregrine’s kernel driver injects a monitoring DLL into target processes using kernel-mode Asynchronous Procedure Calls. The injection is timed to the moment kernel32.dll loads in the target, before any application code executes, and the entire pipeline runs autonomously from ring 0 without any userland orchestrator.

Why Kernel-Mode Injection

The standard userland injection pattern (OpenProcess, VirtualAllocEx, WriteProcessMemory, CreateRemoteThread) is visible to everything inside the target process [1]. An inline hook on LdrInitializeThunk, a thread-creation callback, or a scan of the thread list can catch it. It also requires a userland process to orchestrate the injection, which is one more component that can be tampered with.

Injecting from the kernel driver eliminates the orchestrator entirely. The driver decides which processes to inject, watches for the precise moment the process is ready, allocates and writes the payload directly, and queues the APC. No handles leak into the userland handle table. No remote thread appears in the thread list. The target process’s own loader thread wakes up and calls LdrLoadDll as if it decided to do so itself.

Three Notify Routines at Driver Load

Peregrine registers three notify routines at driver load via registerNotifyRoutine. From NotifyRoutine.c:

// From: PeregrineKernelComponent/NotifyRoutine.c
NTSTATUS registerNotifyRoutine() {
    NTSTATUS status = PsSetCreateProcessNotifyRoutineEx(
        CreateProcessNotifyRoutineEx, FALSE);
    // ...
    status = PsSetCreateThreadNotifyRoutine(CreateThreadNotifyRoutine);
    // ...
    status = PsSetLoadImageNotifyRoutine(LoadImageNotifyRoutine);
    // ...
    return status;
}

PsSetCreateProcessNotifyRoutineEx fires when any process is created or exits [2]. PsSetCreateThreadNotifyRoutine fires on thread creation and destruction [3]. PsSetLoadImageNotifyRoutine fires whenever a PE image is mapped into any process [4]. The image-load routine is the critical one for injection timing.

Matching Target Processes on Creation

When the process-creation callback fires, InjOnProcessCreate checks whether the new process matches any target name in the injection target list. If it does, the process ID goes into a pending array. From ApcInjection.c:

// From: PeregrineKernelComponent/ApcInjection.c
VOID InjOnProcessCreate(HANDLE ProcessId, PPS_CREATE_NOTIFY_INFO CreateInfo)
{
    if (!CreateInfo || !CreateInfo->ImageFileName) return;

    KIRQL irql;
    KeAcquireSpinLock(&g_Inj.Lock, &irql);
    BOOLEAN go = g_Inj.Enabled && g_Inj.TargetCount > 0;
    KeReleaseSpinLock(&g_Inj.Lock, irql);
    if (!go) return;

    // ... convert ImageFileName to ANSI, extract base name ...

    if (MatchesAnyTarget(buf)) {
        PendingAdd(ProcessId);
    }
}

Target matching uses case-insensitive string comparison under a spinlock. MatchesAnyTarget strips the path down to the base filename, then compares against each entry in the Targets array:

// From: PeregrineKernelComponent/ApcInjection.c
static BOOLEAN MatchesAnyTarget(_In_ const CHAR* ImageName)
{
    if (!ImageName || !*ImageName) return FALSE;
    const CHAR* fn = ImageName;
    for (const CHAR* p = ImageName; *p; p++)
        if (*p == '\\' || *p == '/') fn = p + 1;

    KIRQL irql;
    KeAcquireSpinLock(&g_Inj.Lock, &irql);
    BOOLEAN hit = FALSE;
    for (ULONG i = 0; i < g_Inj.TargetCount; i++) {
        if (_stricmp(fn, g_Inj.Targets[i]) == 0) { hit = TRUE; break; }
    }
    KeReleaseSpinLock(&g_Inj.Lock, irql);
    return hit;
}

At this point the process exists, but the NT loader has not initialized. LdrLoadDll is not callable yet. The driver waits for kernel32.dll to appear.

Triggering Injection When kernel32.dll Loads

The image-load notify routine sees every DLL mapped into every process. Two things happen in InjOnImageLoad. First, it captures ntdll’s base address the first time it appears (both native and WoW64 variants), because the driver needs it to resolve LdrLoadDll from ntdll’s export table:

// From: PeregrineKernelComponent/ApcInjection.c
if (UStrEndsWith(FullImageName, L"ntdll.dll")) {
    if (UStrContainsI(FullImageName, L"syswow64")) {
        if (g_NtdllBaseX86 == 0)
            InterlockedExchange(&g_NtdllBaseX86,
                (LONG)(ULONG)(ULONG_PTR)ImageInfo->ImageBase);
    } else {
        if (g_NtdllBaseX64 == 0)
            InterlockedExchangePointer(
                (PVOID*)&g_NtdllBaseX64, ImageInfo->ImageBase);
    }
    return;
}

Second, when kernel32.dll loads in a pending process, the PID is pulled from the pending list and injection happens immediately:

// From: PeregrineKernelComponent/ApcInjection.c
if (!UStrEndsWith(FullImageName, L"kernel32.dll")) return;

BOOLEAN isPending = PendingRemove(ProcessId);
if (!isPending) return;

BOOLEAN isWow64 = UStrContainsI(FullImageName, L"syswow64");
NTSTATUS st = DoInjectInContext(isWow64);

The timing is deliberate. By the time kernel32.dll is mapped, ntdll is already present and the NT loader is initialized enough to handle a LdrLoadDll call [5]. Application code has not started executing yet: the thread is still inside the loader's process initialization, mapping the initial set of DLLs. This makes the kernel32.dll load the earliest point at which the driver can safely inject a DLL.
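The suffix checks in both snippets rely on UStrEndsWith, whose body is not shown in this post. A minimal user-mode sketch of such a check follows, assuming a counted UTF-16 string and ASCII-only case folding. The CountedStr type and the ends_with_i name are hypothetical stand-ins, not Peregrine's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <uchar.h>

/* Hypothetical stand-in for the kernel's counted UNICODE_STRING. */
typedef struct {
    uint16_t        length_bytes;  /* length in bytes, terminator excluded */
    const char16_t *buffer;
} CountedStr;

/* ASCII-only case folding; sufficient for system DLL names. */
static char16_t fold(char16_t c)
{
    return (c >= u'A' && c <= u'Z') ? (char16_t)(c + 32) : c;
}

static size_t len16(const char16_t *s)
{
    size_t n = 0;
    while (s[n]) n++;
    return n;
}

/* Returns 1 when `s` ends with `suffix`, ignoring ASCII case. */
static int ends_with_i(const CountedStr *s, const char16_t *suffix)
{
    assert(s && suffix);
    size_t n = s->length_bytes / sizeof(char16_t);
    size_t m = len16(suffix);
    if (!s->buffer || m > n) return 0;
    const char16_t *tail = s->buffer + (n - m);
    for (size_t i = 0; i < m; i++)
        if (fold(tail[i]) != fold(suffix[i])) return 0;
    return 1;
}
```

A plain suffix test also accepts a name such as xkernel32.dll, so a stricter variant would additionally require the character before the suffix to be a backslash.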

Resolving LdrLoadDll by Walking the PE Export Table

GetProcAddress is not available from kernel mode. Instead, ParseExports walks the PE export table of ntdll manually while running in the target process context. From ApcInjection.c:

// From: PeregrineKernelComponent/ApcInjection.c
static ULONG_PTR ParseExports(_In_ PVOID ImageBase, _In_ BOOLEAN Is32)
{
    ULONG_PTR result = 0;
    __try {
        PUCHAR base = (PUCHAR)ImageBase;
        if (*(PUSHORT)base != 0x5A4D) return 0;            /* MZ */

        LONG lfanew = *(PLONG)(base + 0x3C);
        if (*(PULONG)(base + lfanew) != 0x00004550) return 0; /* PE\0\0 */

        PUCHAR opt = base + lfanew + 24;
        ULONG  expRva;
        if (Is32)
            expRva = *(PULONG)(opt + 96);                   /* DataDir[0].VA */
        else
            expRva = *(PULONG)(opt + 112);

        if (expRva == 0) return 0;

        PUCHAR ed  = base + expRva;
        ULONG  nNames    = *(PULONG)(ed + 24);
        ULONG  fnRva     = *(PULONG)(ed + 28);
        ULONG  nameRva   = *(PULONG)(ed + 32);
        ULONG  ordRva    = *(PULONG)(ed + 36);

        PULONG  names    = (PULONG)(base + nameRva);
        PUSHORT ords     = (PUSHORT)(base + ordRva);
        PULONG  funcs    = (PULONG)(base + fnRva);

        for (ULONG i = 0; i < nNames; i++) {
            if (strcmp((const CHAR*)(base + names[i]), "LdrLoadDll") == 0) {
                result = (ULONG_PTR)base + funcs[ords[i]];
                break;
            }
        }
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        result = 0;
    }
    return result;
}

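This is standard PE export directory traversal [6]: MZ check, PE signature check, walk to the optional header, read the export directory RVA, iterate the name table until LdrLoadDll is found, then resolve its address via the ordinal table and function table. The __try/__except block is necessary because the code dereferences user-mode addresses from kernel context. An ordinary page fault on a paged-out page is resolved transparently, since load-image callbacks run at PASSIVE_LEVEL, but a malformed header or an unmapped address would raise an access violation, and the handler turns that into a failed lookup instead of a bugcheck. Note that the walk trusts the RVAs and counts it reads; it performs no bounds checks of its own.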
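For comparison, here is a user-mode sketch of the same traversal with explicit bounds checks, run against a mapped image of known size. It is an illustration under stated assumptions, not the driver's code: find_export_rva, rd16, and rd32 are hypothetical names, the host is assumed little-endian, forwarded exports are not handled, and the PE32 vs PE32+ decision is read from the optional header Magic field (0x10B or 0x20B) rather than passed in as a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bounds-checked reads that make no alignment assumptions. */
static int rd32(const uint8_t *img, size_t size, size_t off, uint32_t *out)
{
    if (off > size || size - off < 4) return 0;
    memcpy(out, img + off, 4);
    return 1;
}
static int rd16(const uint8_t *img, size_t size, size_t off, uint16_t *out)
{
    if (off > size || size - off < 2) return 0;
    memcpy(out, img + off, 2);
    return 1;
}

/* Returns the RVA of `name` in a mapped image of `size` bytes, or 0. */
static uint32_t find_export_rva(const uint8_t *img, size_t size, const char *name)
{
    assert(img && name);
    uint16_t mz, magic, ord;
    uint32_t lfanew, sig, expRva, nFuncs, nNames, fnRva, nameRva, ordRva;

    if (!rd16(img, size, 0, &mz) || mz != 0x5A4D) return 0;          /* MZ */
    if (!rd32(img, size, 0x3C, &lfanew)) return 0;
    if (!rd32(img, size, lfanew, &sig) || sig != 0x00004550) return 0;

    size_t opt = (size_t)lfanew + 24;     /* signature(4) + file header(20) */
    if (!rd16(img, size, opt, &magic)) return 0;
    size_t dirOff;
    if (magic == 0x10B)      dirOff = opt + 96;     /* PE32  */
    else if (magic == 0x20B) dirOff = opt + 112;    /* PE32+ */
    else return 0;
    if (!rd32(img, size, dirOff, &expRva) || expRva == 0) return 0;

    if (!rd32(img, size, (size_t)expRva + 20, &nFuncs) ||
        !rd32(img, size, (size_t)expRva + 24, &nNames) ||
        !rd32(img, size, (size_t)expRva + 28, &fnRva) ||
        !rd32(img, size, (size_t)expRva + 32, &nameRva) ||
        !rd32(img, size, (size_t)expRva + 36, &ordRva)) return 0;

    size_t want = strlen(name) + 1;       /* compare including terminator */
    for (uint32_t i = 0; i < nNames; i++) {
        uint32_t strRva, fn;
        if (!rd32(img, size, (size_t)nameRva + 4u * (size_t)i, &strRva)) return 0;
        if (strRva > size || size - strRva < want) continue;
        if (memcmp(img + strRva, name, want) != 0) continue;
        if (!rd16(img, size, (size_t)ordRva + 2u * (size_t)i, &ord)) return 0;
        if (ord >= nFuncs) return 0;      /* ordinal outside function table */
        if (!rd32(img, size, (size_t)fnRva + 4u * (size_t)ord, &fn)) return 0;
        return fn;
    }
    return 0;
}
```

The caller adds the returned RVA to the image base to get the absolute address, which is what ParseExports does in its final step.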

The resolved address is cached in a global (g_LdrLoadDllX64 / g_LdrLoadDllX86) using interlocked operations, so the driver only parses exports once per boot.
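The set-once behavior can be illustrated in portable C11 with a compare-and-swap, which closes the small window that a separate check followed by an exchange leaves open. This is a user-mode sketch, not the driver's code: cache_once and g_cached_addr are hypothetical names, and in a driver the equivalent primitive would be InterlockedCompareExchangePointer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uintptr_t g_cached_addr;   /* 0 means "not resolved yet" */

/* Publishes `addr` only if nothing is cached; returns the winning value. */
static uintptr_t cache_once(_Atomic uintptr_t *slot, uintptr_t addr)
{
    assert(slot && addr != 0);
    uintptr_t expected = 0;
    if (atomic_compare_exchange_strong(slot, &expected, addr))
        return addr;        /* this caller stored the first value */
    return expected;        /* another caller won; keep its value */
}
```

Every caller ends up with the same address regardless of who resolved it first, so a second resolution racing with the first is harmless.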

The x64 and x86 Shellcode Stubs

The entire injection payload is a small stub that calls LdrLoadDll(NULL, NULL, &UnicodeString, &hMod). The x64 version is 35 bytes. From ApcInjection.c:

// From: PeregrineKernelComponent/ApcInjection.c
static const UCHAR g_ShellcodeX64[] = {
    0x48, 0xB8,                                     /* mov rax, imm64       */
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* <LdrLoadDll addr>   */
    0x48, 0x83, 0xEC, 0x28,                         /* sub rsp, 0x28        */
    0x49, 0x89, 0xC8,                               /* mov r8,  rcx         */
    0x4C, 0x8D, 0x4C, 0x24, 0x20,                   /* lea r9,  [rsp+0x20]  */
    0x48, 0x31, 0xC9,                               /* xor rcx, rcx         */
    0x48, 0x31, 0xD2,                               /* xor rdx, rdx         */
    0xFF, 0xD0,                                     /* call rax             */
    0x48, 0x83, 0xC4, 0x28,                         /* add rsp, 0x28        */
    0xC3                                            /* ret                  */
};

When the APC fires, the NormalRoutine is the shellcode and NormalContext is the pointer to the UNICODE_STRING. The APC delivery mechanism passes NormalContext in rcx (first argument per the x64 calling convention) [7]. The shellcode then:

  1. Loads the absolute address of LdrLoadDll into rax (patched at injection time at offset 2).
  2. Allocates stack space (sub rsp, 0x28): 0x20 bytes of shadow space plus 0x8 bytes that restore 16-byte alignment and double as the HMODULE slot at [rsp+0x20].
  3. Moves rcx (the UNICODE_STRING pointer) into r8 (third argument to LdrLoadDll).
  4. Points r9 at stack scratch space for the output HMODULE pointer (fourth argument).
  5. Zeros rcx and rdx (first and second arguments: PathToFile and Flags, both NULL/0).
  6. Calls rax.
  7. Cleans up the stack and returns.
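The patch step from item 1 can be sketched in user-mode C. The byte sequence is copied from the array above; kStubX64, STUB_X64_PATCH_OFFSET, and make_stub_x64 are hypothetical names, and the sketch assumes a little-endian host. The compile-time check pins the stub length (35 bytes for the array as listed) so that a later edit to the bytes cannot silently shift the layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static const uint8_t kStubX64[] = {
    0x48, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0,   /* mov rax, imm64 (patched) */
    0x48, 0x83, 0xEC, 0x28,               /* sub rsp, 0x28            */
    0x49, 0x89, 0xC8,                     /* mov r8, rcx              */
    0x4C, 0x8D, 0x4C, 0x24, 0x20,         /* lea r9, [rsp+0x20]       */
    0x48, 0x31, 0xC9,                     /* xor rcx, rcx             */
    0x48, 0x31, 0xD2,                     /* xor rdx, rdx             */
    0xFF, 0xD0,                           /* call rax                 */
    0x48, 0x83, 0xC4, 0x28,               /* add rsp, 0x28            */
    0xC3                                  /* ret                      */
};
enum { STUB_X64_PATCH_OFFSET = 2 };       /* imm64 follows the 2-byte opcode */

_Static_assert(sizeof(kStubX64) == 35, "stub length changed");

/* Copies the template into `out` and patches in the call target. */
static void make_stub_x64(uint8_t *out, uint64_t target)
{
    assert(out);
    memcpy(out, kStubX64, sizeof(kStubX64));
    memcpy(out + STUB_X64_PATCH_OFFSET, &target, sizeof(target));
}
```

Using memcpy for the patch avoids the unaligned 8-byte store that a pointer cast at offset 2 would perform.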

The x86 variant is 23 bytes and passes all arguments on the stack, as the 32-bit stdcall convention used by LdrLoadDll requires:

// From: PeregrineKernelComponent/ApcInjection.c
static const UCHAR g_ShellcodeX86[] = {
    0xB8,                                           /* mov eax, imm32       */
    0x00, 0x00, 0x00, 0x00,                         /* <LdrLoadDll addr>   */
    0x8B, 0x4C, 0x24, 0x04,                         /* mov ecx, [esp+4]    */
    0x6A, 0x00,                                     /* push 0   (hMod out) */
    0x54,                                           /* push esp (&hMod)    */
    0x51,                                           /* push ecx (UStr*)    */
    0x6A, 0x00,                                     /* push 0   (Chars)    */
    0x6A, 0x00,                                     /* push 0   (Path)     */
    0xFF, 0xD0,                                     /* call eax            */
    0x83, 0xC4, 0x04,                               /* add esp, 4          */
    0xC3                                            /* ret                 */
};

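Same logic, different calling convention. The UNICODE_STRING pointer comes from [esp+4] (the APC context pushed by the dispatcher), and all four arguments to LdrLoadDll are pushed right-to-left onto the stack. LdrLoadDll pops its own arguments on return, so the trailing add esp, 4 only discards the hMod scratch slot.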

Allocating and Writing the Payload in Target Memory

DoInjectInContext runs inside the target process context because the image-load callback executes in the context of the process loading the image [4]. This means ZwAllocateVirtualMemory with ZwCurrentProcess() allocates directly in the target’s address space, with no cross-process attachment needed [8]:

// From: PeregrineKernelComponent/ApcInjection.c
status = ZwAllocateVirtualMemory(ZwCurrentProcess(), &alloc, 0,
    &allocSz, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);

The allocation holds three things laid out contiguously: the shellcode, a UNICODE_STRING struct, and the DLL path string. The shellcode is copied first and the LdrLoadDll address is patched in:

// From: PeregrineKernelComponent/ApcInjection.c
RtlCopyMemory(dst, sc, scSz);
if (IsWow64)
    *(PULONG)(dst + patch)     = (ULONG)ldrAddr;
else
    *(PULONG_PTR)(dst + patch) = ldrAddr;

Then the UNICODE_STRING and path are written immediately after:

// From: PeregrineKernelComponent/ApcInjection.c
PUCHAR ustrA = dst + scSz;
PUCHAR pathA = ustrA + ustrSz;

RtlCopyMemory(pathA, dllPath, dllBytes);
*(PWCHAR)(pathA + dllBytes) = L'\0';

// x64 UNICODE_STRING: Length(2) + MaxLength(2) + pad(4) + Buffer(8) = 16 bytes
*(PUSHORT)  (ustrA + 0) = dllBytes;
*(PUSHORT)  (ustrA + 2) = dllBytes + (USHORT)sizeof(WCHAR);
*(PULONG)   (ustrA + 4) = 0;
*(PULONG_PTR)(ustrA + 8) = (ULONG_PTR)pathA;

The entire payload (shellcode, string descriptor, and path) lives in a single contiguous RWX allocation. One allocation, one free. No loose pointers.
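The snippet above shows the 64-bit descriptor. A WoW64 target reads the 32-bit form of UNICODE_STRING, which is 8 bytes: two USHORTs followed by a 4-byte Buffer. The user-mode sketch below plans offsets for both cases. PayloadLayout, plan_payload, and write_ustr are hypothetical names, and rounding the descriptor offset up to 8 bytes is an assumption of this sketch (the excerpt places it directly after the stub).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    size_t ustr_off;   /* offset of the UNICODE_STRING descriptor */
    size_t path_off;   /* offset of the UTF-16 path               */
    size_t total;      /* bytes needed for the whole payload      */
} PayloadLayout;

static size_t align_up(size_t v, size_t a) { return (v + a - 1) & ~(a - 1); }

/* Computes where each piece lands; `is32` selects the WoW64 descriptor. */
static PayloadLayout plan_payload(size_t stub_size, uint16_t path_bytes, int is32)
{
    PayloadLayout l;
    size_t ustr_size = is32 ? 8 : 16;
    l.ustr_off = align_up(stub_size, 8);
    l.path_off = l.ustr_off + ustr_size;
    l.total    = l.path_off + path_bytes + sizeof(uint16_t);   /* + NUL */
    return l;
}

/* Writes the descriptor into `buf`, with Buffer expressed as an address
 * relative to `remote_base`, the allocation's address in the target. */
static void write_ustr(uint8_t *buf, const PayloadLayout *l,
                       uint64_t remote_base, uint16_t path_bytes, int is32)
{
    assert(buf && l);
    uint8_t *u = buf + l->ustr_off;
    uint16_t max_len = (uint16_t)(path_bytes + sizeof(uint16_t));
    uint64_t path_va = remote_base + l->path_off;
    memcpy(u + 0, &path_bytes, 2);           /* Length        */
    memcpy(u + 2, &max_len, 2);              /* MaximumLength */
    if (is32) {
        uint32_t va32 = (uint32_t)path_va;   /* 4-byte Buffer at offset 4 */
        memcpy(u + 4, &va32, 4);
    } else {
        memset(u + 4, 0, 4);                 /* padding                   */
        memcpy(u + 8, &path_va, 8);          /* 8-byte Buffer at offset 8 */
    }
}
```

With the 35-byte x64 stub and a 20-byte path, the descriptor lands at offset 40, the path at 56, and the payload needs 78 bytes in total.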

Queueing the Kernel APC

With the payload written, the driver allocates a KAPC structure from nonpaged pool and initializes it:

// From: PeregrineKernelComponent/ApcInjection.c
PRKAPC execApc = (PRKAPC)ExAllocatePool2(
    POOL_FLAG_NON_PAGED, sizeof(KAPC), INJ_POOL_TAG);

KeInitializeApc(execApc, curThread, OriginalApcEnvironment,
    (PKKERNEL_ROUTINE)KernelRoutineCb, (PKRUNDOWN_ROUTINE)RundownCb,
    (PKNORMAL_ROUTINE)alloc, UserMode, ustrA);

KeInsertQueueApc(execApc, NULL, NULL, 0);

The NormalRoutine is the shellcode (alloc). The NormalContext is the UNICODE_STRING pointer (ustrA). When the APC fires in user mode, the kernel transitions to the normal routine with the context in rcx, exactly what the shellcode expects [7].

The KernelRoutine callback runs first, in kernel mode, right before user-mode APC delivery. This is where WoW64 adjustment happens:

// From: PeregrineKernelComponent/ApcInjection.c
static VOID NTAPI KernelRoutineCb(
    PRKAPC Apc, PKNORMAL_ROUTINE* NormalRoutine,
    PVOID* NormalContext, PVOID* Sys1, PVOID* Sys2)
{
    if (PsGetProcessWow64Process(PsGetCurrentProcess()) != NULL)
        PsWrapApcWow64Thread(NormalContext, (PVOID*)NormalRoutine);
    ExFreePoolWithTag(Apc, INJ_POOL_TAG);
}

PsWrapApcWow64Thread adjusts the routine and context pointers so the APC fires correctly in a 32-bit process running under WoW64 [9]. Without this call, the APC dispatcher would try to execute x86 shellcode in a 64-bit context, causing an immediate crash.

Forcing APC Delivery with KeTestAlertThread

A user-mode APC is not delivered just because it is queued: the thread must either enter an alertable wait or be marked as having a user APC pending when it returns to user mode [7]. Since the code is inside the image-load callback (the thread is still in kernel mode processing the load), a nudge is needed:

// From: PeregrineKernelComponent/ApcInjection.c
KeTestAlertThread(UserMode);

KeTestAlertThread marks the current thread as having a user-mode APC pending when its user APC queue is non-empty, which forces the queue to drain on the next kernel-to-user transition. Since the image-load callback returns to user mode almost immediately (the loader continues loading DLLs), the APC fires promptly. The thread calls the shellcode, which calls LdrLoadDll, which loads Peregrine’s monitoring DLL, and the process continues executing with the hooks already installed.

The INJ_STATE Structure

The INJ_STATE structure tracks everything under a spinlock. From ApcInjection.c:

// From: PeregrineKernelComponent/ApcInjection.c
typedef struct _INJ_STATE {
    KSPIN_LOCK Lock;
    BOOLEAN    Enabled;

    WCHAR  DllPathX64[INJ_MAX_PATH];
    USHORT DllPathX64Bytes;
    WCHAR  DllPathX86[INJ_MAX_PATH];
    USHORT DllPathX86Bytes;

    CHAR   Targets[INJ_MAX_TARGETS][INJ_MAX_NAME];
    ULONG  TargetCount;

    HANDLE Pending[INJ_MAX_PENDING];
    ULONG  PendingCount;
} INJ_STATE;

Targets holds up to 16 process names to inject into (INJ_MAX_TARGETS). Pending holds up to 32 PIDs (INJ_MAX_PENDING) that have been created and match a target but have not yet loaded kernel32.dll. The userland service configures DllPathX64/DllPathX86 and the target list via IOCTLs, then sets Enabled to TRUE. From that point, injection is fully autonomous.

Process exit events clean up the pending list (InjOnProcessExit calls PendingRemove) so the driver does not attempt to inject into processes that terminated before kernel32.dll loaded:

// From: PeregrineKernelComponent/ApcInjection.c
VOID InjOnProcessExit(HANDLE ProcessId) { PendingRemove(ProcessId); }
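PendingAdd and PendingRemove are not shown in this post. A fixed-capacity list with swap-remove is one plausible shape for them, sketched here in user-mode C. The names PendingList, pending_add, and pending_remove are hypothetical, and the sketch omits the spinlock that the driver holds around INJ_STATE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 32   /* mirrors the INJ_MAX_PENDING limit described above */

typedef struct {
    uintptr_t ids[MAX_PENDING];
    size_t    count;
} PendingList;

/* Adds pid; returns 0 when the list is full or pid is already present. */
static int pending_add(PendingList *p, uintptr_t pid)
{
    assert(p);
    for (size_t i = 0; i < p->count; i++)
        if (p->ids[i] == pid) return 0;
    if (p->count == MAX_PENDING) return 0;
    p->ids[p->count++] = pid;
    return 1;
}

/* Removes pid; returns 1 if it was pending. Order is not preserved. */
static int pending_remove(PendingList *p, uintptr_t pid)
{
    assert(p);
    for (size_t i = 0; i < p->count; i++) {
        if (p->ids[i] == pid) {
            p->ids[i] = p->ids[--p->count];   /* swap-remove */
            return 1;
        }
    }
    return 0;
}
```

Having removal report whether the PID was present lets one call both test and clear the entry, which matches how the image-load path uses PendingRemove as its gate.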

Injection Result Reporting

After DoInjectInContext returns, the result is reported over the kernel communications channel as a JSON message:

// From: PeregrineKernelComponent/ApcInjection.c
if (NT_SUCCESS(st)) {
    RtlStringCchPrintfA(json, ARRAYSIZE(json),
        "{ \"event\": \"apc_inject\", \"pid\": %lu, \"tid\": %lu, "
        "\"status\": \"success\" }",
        (ULONG)(ULONG_PTR)ProcessId, tid);
} else {
    RtlStringCchPrintfA(json, ARRAYSIZE(json),
        "{ \"event\": \"apc_inject\", \"pid\": %lu, \"tid\": %lu, "
        "\"status\": \"failed\", \"error\": \"0x%08X\" }",
        (ULONG)(ULONG_PTR)ProcessId, tid, st);
}
ComsSendToUser(json, (ULONG)strlen(json));

This lets the Tauri GUI confirm whether injection succeeded or failed for each target process.

Limitations and Fragility

The PE export table parsing to resolve LdrLoadDll relies on hardcoded offsets into the PE optional header (offset 96 for 32-bit, 112 for 64-bit) to locate the export directory RVA. These offsets are defined by the PE/COFF specification [6] and have not changed since the format was introduced, so this is stable in practice.

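The shellcode itself is position-dependent in one sense: the LdrLoadDll address is patched in as an absolute value, and that value is resolved once and reused for every later target. This works because Windows maps ntdll at the same base address in every process of a given architecture for the lifetime of a boot session. If that ever failed to hold (which does not happen in normal operation), the shellcode would call an invalid address. The interlocked caching ensures the value is set once per boot and never updated.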

The PAGE_EXECUTE_READWRITE allocation is a detectable artifact. Any scanner that enumerates RWX regions in the target process could flag the shellcode page. A more hardened implementation would use separate allocations or change page protections after writing.

This post was generated by an LLM based on code from Peregrine Anti-Cheat. All code snippets are from the actual repository. Claims about Windows internals are sourced from Microsoft documentation.

References

[1] Microsoft, “CreateRemoteThread function”, learn.microsoft.com

[2] Microsoft, “PsSetCreateProcessNotifyRoutineEx function”, learn.microsoft.com

[3] Microsoft, “PsSetCreateThreadNotifyRoutine function”, learn.microsoft.com

[4] Microsoft, “PsSetLoadImageNotifyRoutine function”, learn.microsoft.com

[5] Microsoft, “DLL Search Order”, learn.microsoft.com

[6] Microsoft, “PE Format: Export Directory Table”, learn.microsoft.com

[7] Microsoft, “Asynchronous Procedure Calls”, learn.microsoft.com

[8] Microsoft, “ZwAllocateVirtualMemory function”, learn.microsoft.com

[9] Microsoft, “WoW64 Implementation Details”, learn.microsoft.com