Stack Spoofing - Researching new approaches

It has been some time since the last post. If you missed me, I love you, and I hope what I'm about to write will help you somehow (:

Recently I have been studying and researching C2 implant development, specifically old and new evasion techniques. EDRs and defensive tradecraft keep improving, so attackers need to do the same. In this post I want to talk about some known ideas and disclose new ones for call stack spoofing on Windows, with and without CET.

Previous research

The GOAT on this topic, from what I have been able to find, is KlezVirus. He is behind the well-known SilentMoonwalk and extensive research on CET-compliant spoofing, which we will look at in a few minutes.

Another goated resource is DarkVortex, the creator of Brute Ratel C2, who shows how to abuse callbacks and tail calls to hide the original caller from the stack trace.

The last one I feel like mentioning is LoudSunRun, an amazing piece of work that simplifies the quite complex SilentMoonwalk implementation.

Basics

Every function (except leaf functions) needs to allocate some memory on the stack in order to operate. Each value stored in that memory can be accessed using a static offset from RSP or, if the function's UNWIND_INFO declares a frame pointer, from RBP.

What starts a function's use of the stack is the prologue; every non-leaf function has one.

Common prologue in Windows x64


    ; maybe push some regs (eg: push rbp, r15, r14...)
    sub rsp, N
    ... ; function body

What ends the usage of the stack for a function is the epilogue:

Common epilogue in Windows x64


    ... ; function body
    add rsp, N
    ; maybe pop some regs, in reverse order of the pushes (eg: pop r14, r15, rbp...)
    ret

So each function allocates the space it needs on the stack; this is what we call the frame size of the function.

Every call also costs at least 8 bytes: the return address, which is where execution continues after the ret. The CALL instruction pushes it, so during the execution of the function it always lives above the function's own frame. When the function executes ret, the instruction pops the 8-byte value RSP points to into RIP and jumps to it.

Returning from a function


0x1 call func_x
        |
        |-> push ret addr 0x6 at [RSP]
            func_x body execute
            ...
            ret (under the hood pops [rsp] (0x6) into rip)
             |
             V
0x6    [func_x will ret here]

Each return address resides on the stack until it is popped and executed. This is the call stack: just a sequence of "living" return addresses.
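To make the CALL/RET mechanics above concrete, here is a minimal toy model. It is only a sketch: `ToyCpu`, `toy_init`, `toy_call` and `toy_ret` are hypothetical names I made up for illustration, and the model ignores everything a real x64 CPU does besides pushing and popping the return address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_STACK_SLOTS 16

/* Hypothetical toy model of the CPU state, not a real emulator. */
typedef struct {
    uint64_t rip;                    /* instruction pointer */
    size_t   rsp;                    /* index of the top-of-stack slot */
    uint64_t stack[TOY_STACK_SLOTS]; /* 8-byte slots, the stack grows down */
} ToyCpu;

static void toy_init(ToyCpu *cpu, uint64_t entry) {
    cpu->rip = entry;
    cpu->rsp = TOY_STACK_SLOTS; /* empty stack */
    for (size_t i = 0; i < TOY_STACK_SLOTS; i++) cpu->stack[i] = 0;
}

/* CALL: push the address of the next instruction, then jump to the target. */
static void toy_call(ToyCpu *cpu, uint64_t target, uint64_t call_len) {
    assert(cpu->rsp > 0); /* no room left would be a stack overflow */
    cpu->rsp -= 1;
    cpu->stack[cpu->rsp] = cpu->rip + call_len;
    cpu->rip = target;
}

/* RET: pop the slot RSP points to into RIP. */
static void toy_ret(ToyCpu *cpu) {
    assert(cpu->rsp < TOY_STACK_SLOTS); /* there must be something to pop */
    cpu->rip = cpu->stack[cpu->rsp];
    cpu->rsp += 1;
}
```

Starting at 0x1 and calling with a 5-byte CALL stores 0x6 in the top slot, and `toy_ret` brings RIP back to 0x6, which is the same flow as the diagram above.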

This is a normal looking call stack:

Each thread in the process has its own call stack, and they all start with the same 2 frames: BaseThreadInitThunk (BTIT) and RtlUserThreadStart (RUTS). Every entry in the stack has its own allocated frame and comes from a legitimate, file-backed DLL.

What makes a call stack look bad and suspicious?

As you can see, we have unbacked shellcode frames. Those are bad because if our shellcode needs to call some sensitive API (eg: LoadLibrary, VirtualAlloc...) an EDR could unwind the call stack and directly see frames coming from a shellcode memory region. At the same time the BTIT and RUTS frames have disappeared, because the unwinder doesn't know how to unwind the shellcode frames and completely messes up the appearance of the stack (more on that later).

Ideally, proper spoofing conceals those frames while still executing normally, without crashing or losing anything.

Synthetic Frames Spoofing

I started by looking into the techniques that are not CET compliant, specifically SilentMoonwalk and LoudSunRun.

As a brief introduction, we can place fake frames on the call stack before tail-jumping to the sensitive API. This way we are creating a 100% artificial call stack. The fake call stack starts when we do a push 0: this causes the call stack to be "cut", and all of the previous frames (loader/shellcode) will be gone.

What fake frames do we need to place?

RUTS
BTIT
Thread Start Address
Desync JOP gadget

As we discussed earlier, the RUTS and BTIT frames are always present; a call stack without them would be anomalous. The thread start address is something not a lot of people consider... If you look at any of your call stacks and compare the 3rd entry to the Thread Start Address of the thread hosting that call stack, you will notice they always share the same address.

We will search for the desync gadget inside external DLLs such as kernel32 or ntdll, or by manually loading a DLL we know provides the right gadget. The purpose of the desync gadget is to return the execution flow to the main program. Most of the time it has this layout: JMP [NON_VOL], and we place the address of a fixup routine inside NON_VOL. We call it a desync gadget because the JOP gadget desyncs what the CPU actually executes from what the unwinder sees.

And this is how the call stack appears:

This technique is reliable and fully functional but very well known. Some EDRs are writing rules specifically to detect this type of spoofing.

As an example, this Elastic detection rule checks whether we are loading an internet module via Synthetic Stack Spoofing by specifically searching for well-known kernel32 functions which contain a JOP gadget.

This blog goes really in depth on synthetic stack spoof detections and evasion tradecraft.

A brief list of the strong IoCs of this technique:

Desync gadget not preceded by a CALL instruction
Frame actually containing a JOP gadget with a non-volatile register

All IoCs could be used as a detection mechanism, but we as attackers have the advantage, because at the end of the day all of these detections rely on monitoring the very call stack we are spoofing.

How can we address those IoCs? The first one is quite simple: we search for a CALL-prefixed JOP gadget and use that as the desync frame. In our code we just go +0x5 (the size of the CALL instruction) so the frame starts at the actual JOP instruction. This way an EDR looking at that frame will see a CALL instruction before the JOP gadget and treat that frame as a valid return address area. Use this tool to search the DLLs of a directory for specific JOP gadgets.

The second one is more subtle and is where I had the most trouble. It's simple to write a rule that detects a JOP gadget using a non-volatile register, so we must change this architecture somehow.

After some research I was able to find an actual gadget that could help address the second IoC. The thing is that an EDR would inspect the call stack at the time a suspicious API is called, not after it returns. So the idea is to somehow hide the desync gadget during that time and execute it after the API returns.

A push non_vol; ret could work: non_vol holds the address of the desync gadget and the ret executes it. This successfully hides the JOP gadget by replacing it with a ROP one (a less scrutinized gadget and host DLL). This architecture can still be improved by not using a non-volatile register. However, doing so requires building a chain that brings the pointer to the fixup routine into the volatile register (hard to do, especially with ROP + call stack spoofing or a full JOP chain).

These are examples of good candidates for what we want to do:

Unwinding internals

Let's talk about how the stack and unwinding work on Windows x64.

When we talk about unwinding we refer to the mechanism the OS uses to handle exceptions. When an exception is raised the OS needs to see where it happened, and for that it uses the call stack, walking it backward frame by frame, starting from the faulting function, until it finds a handler.

Example of call stack


    main()
    └─ func_1()
        └─ func_2()
                └─ VirtualAlloc()   
                        goes into kbase, ntdll...

Each frame here stores a return address at a fixed offset from the current RSP. This offset is what we call the frame size: it is how much the prologue of the function has allocated. Knowing the frame size of each frame, we can unwind the call stack cleanly.

On Windows there are 2 critical APIs that handle unwinding: RtlLookupFunctionEntry and RtlVirtualUnwind. You can read an open source implementation in the ReactOS source code.

Every PE file has a .pdata section. The compiler puts the Exception directory inside it and fills it with RUNTIME_FUNCTION structures, one for each non-leaf function in the PE.

RUNTIME_FUNCTION


    // Source - https://stackoverflow.com/a/55896238
    // Posted by Lewis Kelsey, modified by community. See post 'Timeline' for change history

    typedef struct _RUNTIME_FUNCTION {
        ULONG BeginAddress;
        ULONG EndAddress;
        ULONG UnwindData;
    } RUNTIME_FUNCTION, *PRUNTIME_FUNCTION;

BeginAddress points at the start of the function and EndAddress at its end, both as offsets relative to the image base (RVAs). UnwindData points to an UNWIND_INFO structure, also as an RVA.
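As a rough illustration of how such a table is searched, here is a minimal sketch of a lookup by RVA. It assumes the table is sorted by BeginAddress (the PE format requires this for the exception directory) and that EndAddress is exclusive. `ToyRuntimeFunction` and `toy_lookup` are hypothetical names; the real RtlLookupFunctionEntry does more than this (for example it also consults dynamically registered tables).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same three fields as RUNTIME_FUNCTION, redeclared under a hypothetical
   name so the sketch builds without any Windows headers. */
typedef struct {
    uint32_t BeginAddress; /* RVA of the first byte of the function */
    uint32_t EndAddress;   /* RVA one past the last byte (assumed exclusive) */
    uint32_t UnwindData;   /* RVA of the UNWIND_INFO */
} ToyRuntimeFunction;

/* Binary search over a table sorted by BeginAddress.
   Returns the entry covering rva, or NULL when no entry covers it
   (an unwinder would then treat the code as a leaf function). */
static const ToyRuntimeFunction *toy_lookup(const ToyRuntimeFunction *table,
                                            size_t count, uint32_t rva) {
    size_t lo = 0, hi = count;
    assert(table != NULL || count == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (rva < table[mid].BeginAddress) {
            hi = mid;           /* target is in the lower half */
        } else if (rva >= table[mid].EndAddress) {
            lo = mid + 1;       /* target is in the upper half */
        } else {
            return &table[mid]; /* BeginAddress <= rva < EndAddress */
        }
    }
    return NULL;
}
```

The caller is expected to convert an absolute address into an RVA first, by subtracting the image base of the module that contains it.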

UNWIND_INFO


// Source - https://stackoverflow.com/a/55896238
// Posted by Lewis Kelsey, modified by community. See post 'Timeline' for change history

typedef struct _UNWIND_INFO {
    UBYTE Version         : 3;
    UBYTE Flags           : 5;
    UBYTE SizeOfProlog;
    UBYTE CountOfCodes;  //so the beginning of ExceptionData is known as they're both FAMs
    UBYTE FrameRegister  : 4;
    UBYTE FrameOffset    : 4;
    UNWIND_CODE UnwindCode[1];
    union {
        //
        // If (Flags & UNW_FLAG_EHANDLER)
        //
        OPTIONAL ULONG ExceptionHandler;
        //
        // Else if (Flags & UNW_FLAG_CHAININFO)
        //
        OPTIONAL ULONG FunctionEntry;
    };
    //
    // If (Flags & UNW_FLAG_EHANDLER)
    //
    OPTIONAL ULONG ExceptionData[]; 
} UNWIND_INFO, *PUNWIND_INFO;

The UnwindCode array represents the actions that the prologue of the corresponding function performs. Some examples:

UWOP_SET_FPREG - Frame pointer (RBP) is used
UWOP_ALLOC_SMALL / UWOP_ALLOC_LARGE - Stack frame size
UWOP_PUSH_NONVOL - Register was pushed
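To show how these codes translate into a frame size, here is a minimal sketch that sums the stack space described by an array of unwind codes. The opcode numbers and slot counts follow Microsoft's x64 exception handling documentation as far as I know them; `ToyUnwindCode`, the `TOY_UWOP_*` names and `toy_frame_size` are hypothetical and only for illustration. The sketch does not handle version 2 epilog codes or chained unwind info, and for functions using UWOP_SET_FPREG it only raises a flag, since the fixed size may not be the whole story there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 2-byte unwind code slot: prologue offset, then op (low nibble)
   and op info (high nibble). Hypothetical redeclaration. */
typedef struct { uint8_t code_offset; uint8_t op_and_info; } ToyUnwindCode;

enum {
    TOY_UWOP_PUSH_NONVOL = 0, TOY_UWOP_ALLOC_LARGE = 1,
    TOY_UWOP_ALLOC_SMALL = 2, TOY_UWOP_SET_FPREG = 3,
    TOY_UWOP_SAVE_NONVOL = 4, TOY_UWOP_SAVE_NONVOL_FAR = 5,
    TOY_UWOP_SAVE_XMM128 = 8, TOY_UWOP_SAVE_XMM128_FAR = 9,
    TOY_UWOP_PUSH_MACHFRAME = 10
};

/* Read a slot as the raw little-endian 16-bit value it also can be. */
static uint32_t toy_slot_u16(ToyUnwindCode c) {
    return (uint32_t)c.code_offset | ((uint32_t)c.op_and_info << 8);
}

/* Returns the bytes allocated by the prologue (return address excluded),
   or -1 for truncated or unsupported input. */
static int64_t toy_frame_size(const ToyUnwindCode *codes, size_t count,
                              int *uses_fpreg) {
    int64_t size = 0;
    size_t i = 0;
    assert(codes != NULL || count == 0);
    if (uses_fpreg) *uses_fpreg = 0;
    while (i < count) {
        uint8_t op = codes[i].op_and_info & 0x0F;
        uint8_t info = codes[i].op_and_info >> 4;
        size_t slots = 1; /* how many slots this operation occupies */
        if (op == TOY_UWOP_ALLOC_LARGE) {
            if (info > 1) return -1;
            slots = (info == 0) ? 2 : 3;
        } else if (op == TOY_UWOP_SAVE_NONVOL || op == TOY_UWOP_SAVE_XMM128) {
            slots = 2;
        } else if (op == TOY_UWOP_SAVE_NONVOL_FAR ||
                   op == TOY_UWOP_SAVE_XMM128_FAR) {
            slots = 3;
        }
        if (slots > count - i) return -1; /* operands run past the array */
        switch (op) {
        case TOY_UWOP_PUSH_NONVOL: size += 8; break;
        case TOY_UWOP_ALLOC_SMALL: size += (int64_t)info * 8 + 8; break;
        case TOY_UWOP_ALLOC_LARGE:
            if (info == 0) /* next slot holds size / 8 */
                size += (int64_t)toy_slot_u16(codes[i + 1]) * 8;
            else           /* next two slots hold the unscaled 32-bit size */
                size += (int64_t)(toy_slot_u16(codes[i + 1]) |
                                  (toy_slot_u16(codes[i + 2]) << 16));
            break;
        case TOY_UWOP_SET_FPREG: if (uses_fpreg) *uses_fpreg = 1; break;
        case TOY_UWOP_PUSH_MACHFRAME: size += info ? 48 : 40; break;
        case TOY_UWOP_SAVE_NONVOL: case TOY_UWOP_SAVE_NONVOL_FAR:
        case TOY_UWOP_SAVE_XMM128: case TOY_UWOP_SAVE_XMM128_FAR:
            break; /* stores into already allocated space */
        default: return -1;
        }
        i += slots;
    }
    return size;
}
```

For a prologue like push rbp; push rbx; sub rsp, 0x28 the codes are one UWOP_ALLOC_SMALL with info 4 and two UWOP_PUSH_NONVOL, so the result is 40 + 8 + 8 = 56 bytes. For a function without a frame pointer, that is the distance from RSP (after the prologue) to the return address.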

At a high level, unwinding works like this:

1. RtlLookupFunctionEntry(ctx.Rip, &ImageBase, NULL) - Scans the module's .pdata and finds the RUNTIME_FUNCTION covering ctx.Rip
2. RtlVirtualUnwind - Reads the UNWIND_INFO and interprets the unwind codes
3. The unwind codes are applied to compute the caller's RSP, and the process repeats for the next frame

CET enforcement

CET is a mitigation mostly against Return Oriented Programming and other control flow hijack attacks (COP, JOP). CET enforcement has 2 parts:

Shadow Stack
Indirect Branch Tracking (IBT)

Shadow Stack

Every stack inside a process has an alter ego, a corresponding shadow stack. The shadow stack cannot be modified by ordinary writes and is protected by the CPU and the kernel. Whenever the CPU executes a CALL, the return address is pushed onto both the regular stack and the shadow stack; on a RET, both copies are popped and compared. When a return address mismatch occurs, the system triggers a crash in order to prevent the exploit or malware from working.
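The comparison described above can be modelled in a few lines. This is just a toy sketch of the idea, with hypothetical names (`ToyStacks`, `toy_ss_call`, `toy_ss_ret`); the real mechanism lives in the CPU and the kernel, and a mismatch raises a control protection fault instead of returning 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DEPTH 8

typedef struct {
    uint64_t data[TOY_DEPTH];   /* regular stack, writable by the program */
    uint64_t shadow[TOY_DEPTH]; /* shadow stack, only CALL/RET touch it */
    size_t   depth;             /* number of live return addresses */
} ToyStacks;

/* CALL: the return address is recorded on both stacks. */
static void toy_ss_call(ToyStacks *s, uint64_t ret_addr) {
    assert(s->depth < TOY_DEPTH);
    s->data[s->depth] = ret_addr;
    s->shadow[s->depth] = ret_addr;
    s->depth += 1;
}

/* RET: pop both copies and compare them.
   Returns 1 and writes *target when they match, 0 on a mismatch
   (this is where real hardware would fault). */
static int toy_ss_ret(ToyStacks *s, uint64_t *target) {
    assert(s->depth > 0);
    s->depth -= 1;
    if (s->data[s->depth] != s->shadow[s->depth]) return 0;
    *target = s->data[s->depth];
    return 1;
}
```

A normal call followed by a return passes the check. If the entry on the regular stack is changed between the call and the return, `toy_ss_ret` reports the mismatch, because the shadow copy still holds the original return address.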

Photo from KlezVirus post

As you may have already understood, this is a problem for the desync gadget implementation: by the time the program returns into that synthetically placed gadget, the shadow stack check fails because the return address doesn't match what the shadow stack contains.

You can read more about CET here and here

Indirect Branch Tracking (IBT)

This is less relevant to what we are talking about but worth a look. IBT operates via a processor state machine that monitors indirect CALL and JMP instructions. Upon encountering such an instruction, the state machine enters a "WAIT_FOR_ENDBRANCH" mode, requiring the subsequent instruction to be an ENDBRANCH (either ENDBR32 or ENDBR64), which marks valid targets.
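The state machine described above is small enough to sketch. This is a simplified model under my own assumptions: instructions are reduced to three hypothetical kinds (`ToyInsn`), and `toy_ibt_step` returns 0 where real hardware would raise a control protection fault.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TOY_IDLE, TOY_WAIT_FOR_ENDBRANCH } ToyIbtState;

/* Hypothetical instruction classes, just enough for the model. */
typedef enum {
    TOY_INSN_OTHER,           /* anything that is not listed below */
    TOY_INSN_INDIRECT_BRANCH, /* indirect CALL or JMP */
    TOY_INSN_ENDBR64          /* valid indirect branch target marker */
} ToyInsn;

/* Feed the next executed instruction to the tracker.
   Returns 1 when execution may continue, 0 when the instruction right
   after an indirect branch is not an ENDBR64. */
static int toy_ibt_step(ToyIbtState *state, ToyInsn insn) {
    assert(state != NULL);
    if (*state == TOY_WAIT_FOR_ENDBRANCH && insn != TOY_INSN_ENDBR64) {
        return 0;
    }
    /* Only an indirect branch arms the tracker; everything else,
       including the ENDBR64 itself, leaves it idle. */
    *state = (insn == TOY_INSN_INDIRECT_BRANCH) ? TOY_WAIT_FOR_ENDBRANCH
                                                : TOY_IDLE;
    return 1;
}
```

An indirect branch followed by ENDBR64 passes, while an indirect branch that lands on any other instruction is rejected.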

CET Compliant spoofing

With CET out of the way, we can talk about how to make stack spoofing compliant with it. The main idea is that stack spoofing doesn't aim to hijack the control flow of the program, but just to present a plausible call stack to evade EDR inspection when an API is called.

An implant will usually have some extension modules. They work like small external programs that extend the basic capabilities of the implant. We call those BOFs (a name that comes from Cobalt Strike) and they perform stuff like lsass dumping, privesc, loading other payloads...

Those modules are often PIC shellcode retrieved and loaded by the main beacon, or they can be normal COFF objects (.o / .obj). When using COFF objects the beacon needs to handle loading them in memory, which means working with the sections of the COFF object and its relocations, exactly as a linker would do.

COFF loading and the COFF structure are outside the scope of this discussion; at the end you will find many good resources.

I have talked about this because the idea is that the beacon will have a COFF loader (this one, slightly modified). This COFF loader will retrieve and load the extension by using module stomping on a sacrificial DLL. After that we fix the call stack of the extension by registering it in a RUNTIME_FUNCTION table inside the .pdata of the hollowed DLL. In theory this will spoof the extension's call stack.

What I'm trying to accomplish here is very similar to what DreamWalkers tries to do, go check it out.

We are doing module stomping in order to have frames originating from a legitimate DLL. Module stomping by itself adds some IoCs, so to make it FUD you should research how to evade those and apply that to this technique.

Why do we need to register a RUNTIME_FUNCTION to the stomped extension?

This is the call stack with simple module stomping. As you can see it's quite broken: we have 2 random frames at the bottom and the RUTS / BTIT frames are missing. Why is this happening?

The hollowed DLL (windows.storage.dll) doesn't have unwind information for the stomped memory region (the .text section of the DLL)... Actually, that's not quite right: it does have unwind information for that region, but it describes the previous functions that lived there. This mismatch causes the unwinder to struggle with those frames, and this is the result.

So, in order to register new unwind information corresponding to the loaded extension, we can use the RtlAddFunctionTable API.

By registering the unwind information we get this:

Now the unwinder can correctly unwind the stack, but we still have the loader frames inside the stack (in a real scenario those frames come from a beacon shellcode), and we don't want them. The Synthetic approach eliminates those by cutting the stack with the push 0 instruction and performing the tail-jmp to the API.

The idea for removing those frames is to use proxy functions (callbacks). A proxy function is a function that takes another function (the callback) as a parameter and executes it in a dedicated thread.

For the proxy function I opted for a simple CreateThreadpoolWork, giving as callback a tail-jmp trampoline that will jmp inside the stomped DLL.

A tail-jmp trampoline won't push a return address onto the call stack, and it's CET compliant.

This is what we get out from it:

This is looking great. By using a dedicated thread with the callback and trampoline, the call stack containing the target API no longer has the loader frames.

But this implementation has a well-known IoC: the Tpp* entries. An EDR walking this stack could see those entries and flag the whole execution chain as malicious, or scan the originating thread and find the loader frames anyway.

Like the module stomping IoCs, there are ways to evade even this. However, I will leave this as an exercise for the reader (:

Further ideas

Amazing work on novel ways to do call stack spoofing in the CET era was done by KlezVirus here. The main idea is to play with unwind information in order to conceal specific frames from the unwinder.

Something I would look into more is undocumented Windows APIs such as RtlCreateUserStack.

Conclusion

Both PoCs are available in this repo. Thank you for reading this far.

Useful links:

https://klezvirus.github.io/
https://offsec.almond.consulting/evading-elastic-callstack-signatures.html
https://www.elastic.co/security-labs/doubling-down-etw-callstacks
https://www.elastic.co/security-labs/call-stacks-no-more-free-passes-for-malware
https://dtsec.us/2023-09-15-StackSpoofin/
https://www.elastic.co/security-labs/finding-truth-in-the-shadows
https://www.unknowncheats.me/forum/index.php
https://0xdarkvortex.dev/hiding-in-plainsight/