Disabling function alignment in MSVC

coree · 15.08.2024

Good day. I've run into a problem: MSVC compiles code with functions padded for alignment using 0xCC bytes. I tried building with various code-size optimizations and with CFG turned off; it didn't help, and I consistently see a trailing run of 0xCC bytes after the function. Can anyone tell me what causes this behavior? By the way, there are no such surprises when compiling with clang-cl.

DildoFagins · 15.08.2024

coree wrote:

Can anyone tell me what causes this behavior?

So what alignment is it? 16 bytes? If so, then as far as I remember it has to do with how the processor fetches instructions on x86_64: it reads instructions in 16-byte blocks, and if the start of a function is aligned to 16 bytes, the code runs faster. Correct me if I, like a proper LLM, just hallucinated that.
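
To put numbers on the 16-byte case, here is a rough sketch of how many filler bytes have to follow a function so that the next one starts on a boundary. The name `padding_to_boundary` and the sample addresses are made up for illustration; this is only the arithmetic, not a claim about what MSVC does internally.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes of filler needed so that the next function starts on an
   `align`-byte boundary. `align` must be a power of two.
   Hypothetical helper, for illustration only. */
size_t padding_to_boundary(uintptr_t end_of_function, size_t align)
{
    uintptr_t mask = (uintptr_t)align - 1;
    /* Round up to the next multiple of `align`, then take the difference. */
    uintptr_t next_start = (end_of_function + mask) & ~mask;
    return (size_t)(next_start - end_of_function);
}
```

A function that ends at 0x100d would get 3 filler bytes; one that ends exactly on a boundary gets none.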

coree · 15.08.2024

DildoFagins wrote:

the code runs faster.

And with which optimizations does clang-cl decide this isn't needed? Or is it that MSVC wants to speed up execution despite the flag that forces smaller code? But yes, code aligned to 16 bytes does run faster.

shrekushka · 21.08.2024

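0xCC fill is standard MSVC debug-build behavior; it comes from Basic Runtime Checks (C/C++ > Code Generation). /RTCs fills each stack frame with 0xCC and /RTCu adds uninitialized-variable checks, so you get rid of it by building without /RTC, not by adding /RTCu. That fill is in stack memory, though; the 0xCC bytes after a function body are int3 padding between functions, which is a separate thing.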
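
If you want to measure how much of a function is just filler, here is a small sketch that counts the run of 0xCC bytes at the end of a byte buffer. It assumes you have already dumped the function's bytes yourself; `count_trailing_int3` is a made-up helper name. Keep in mind that 0xCC can also appear as an immediate or displacement byte inside a real instruction, so treat the count as a heuristic.

```c
#include <stddef.h>

/* Count consecutive 0xCC (int3) bytes at the end of `code`.
   Hypothetical helper; heuristic only. */
size_t count_trailing_int3(const unsigned char *code, size_t len)
{
    size_t n = 0;
    while (n < len && code[len - 1 - n] == 0xCC)
        n++;
    return n;
}
```

For the bytes 48 89 C8 C3 CC CC CC (mov rax, rcx; ret; then filler) it reports 3.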

Now, regarding alignment itself, there are two sides to it: alignment of individual data types and alignment of stack frames. For a naturally aligned value, the address must be evenly divisible by the type's alignment, which for scalar types is usually sizeof(type). That way the CPU can fetch the value in one bus (memory) cycle instead of several slow ones, as in the old days. That was not the only reason for our grandfathers; it also kept the hardware simpler. If you got it wrong, the CPU would raise a bus error (misalignment fault). AMD64 supports unaligned accesses for ordinary loads and stores, and in most cases they are no longer slow. SPARC and Itanium still require strict alignment. The same goes for SIMD: SSE operates on 16B xmm registers, AVX on 32B ymm registers, and the aligned load/store forms require matching alignment. Of course, there are unaligned vector loads such as MOVUPS as well. The performance penalty is another topic, and a pain in the ass; we are going off track.
For stack frames there is one more reason besides these: the ABI. The System V AMD64 ABI, thank god, requires the stack pointer to be 16B-aligned before a call (regardless of whether the ISA needs it). This is the standard on most UNIX-like systems. Windows x86-64 has its own calling convention but also keeps the stack 16B-aligned, while Windows x86 uses 4B boundaries. For all the grandfathers who remember: Linux on x86 originally used 4B alignment, and it was gcc that changed it to 16B) gcc relies on this 16B assumption to use aligned SSE instructions.
Look:
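
A small sketch of the data-alignment point in portable C: checking whether an address meets an alignment requirement, and reading a value from an address that does not. `is_aligned` and `load_u32_unaligned` are illustrative names, not from any library.

```c
#include <stdint.h>
#include <string.h>

/* True if `p` is a multiple of `align` (a power of two). */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Read a 32-bit value from any address. memcpy avoids the undefined
   behavior of dereferencing a misaligned uint32_t pointer; on x86-64
   compilers typically turn it into a single plain mov. */
uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

On strict-alignment targets the same memcpy gets lowered to byte-wise loads, which is why it is the portable way to write this.
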
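
As a sketch of what "aligned to 16B" means numerically: code that realigns the stack usually does it with something like `and rsp, -16`, which rounds the pointer down. The helper name `align_stack_down` is invented here; the sample value in the usage note is the rsp from the register dump further down in this post.

```c
#include <stdint.h>

/* Round a stack pointer down to a 16-byte boundary, the C equivalent
   of `and rsp, -16`. Down, not up: the x86 stack grows toward lower
   addresses, so rounding down only reserves extra space and never
   overlaps live data. */
uint64_t align_stack_down(uint64_t sp)
{
    return sp & ~(uint64_t)15;
}
```

0x7fffffffdfb0 is already aligned and comes back unchanged; 0x7fffffffdfb8 would be rounded down to 0x7fffffffdfb0.
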

C:

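// excerpt from main(); needs <stdlib.h> and <xmmintrin.h>
float *aligned = (float *)malloc(16 * sizeof(float));   // glibc on x86-64 returns 16B-aligned blocks
float *unaligned = (float *)((char *)aligned + 4);      // 4-byte offset, so no longer 16B-aligned
__m128 trial1 = _mm_load_ps(aligned);      // OK
__m128 trial2 = _mm_load_ps(unaligned);    // SIGSEGV: movaps needs a 16B-aligned address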

Code:

(gdb) run
Starting program: /home/:t/temp/3
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00005555555551b4 in main ()
(gdb) backtrace
#0  0x00005555555551b4 in main ()
(gdb) info registers
rax            0x5555555592a4      93824992252580
rbx            0x7fffffffe0f8      140737488347384
rcx            0x5555555592a0      93824992252576
rdx            0x0                 0
rsi            0x40                64
rdi            0x5555555592a0      93824992252576
rbp            0x7fffffffdfe0      0x7fffffffdfe0
rsp            0x7fffffffdfb0      0x7fffffffdfb0
r8             0x7ffff7fa5c78      140737353768056
r9             0x21001             135169
r10            0x7ffff7de5db0      140737351933360
r11            0x50                80
r12            0x0                 0
r13            0x7fffffffe108      140737488347400
r14            0x555555557dd8      93824992247256
r15            0x7ffff7ffd020      140737354125344
rip            0x5555555551b4      0x5555555551b4 <main+75>
eflags         0x10202             [ IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0

Code:

(gdb) set disassembly-flavor intel
(gdb) disassemble
Dump of assembler code for function main:
   0x0000555555555169 <+0>:     push   rbp
   0x000055555555516a <+1>:     mov    rbp,rsp
   0x000055555555516d <+4>:     sub    rsp,0x30
   0x0000555555555171 <+8>:     mov    edi,0x40
   0x0000555555555176 <+13>:    call   0x555555555050 <malloc@plt>
   0x000055555555517b <+18>:    mov    QWORD PTR [rbp-0x8],rax
   0x000055555555517f <+22>:    cmp    QWORD PTR [rbp-0x8],0x0
   0x0000555555555184 <+27>:    jne    0x55555555519c <main+51>
   0x0000555555555186 <+29>:    lea    rax,[rip+0xe77]        # 0x555555556004
   0x000055555555518d <+36>:    mov    rdi,rax
   0x0000555555555190 <+39>:    call   0x555555555060 <perror@plt>
   0x0000555555555195 <+44>:    mov    eax,0x1
   0x000055555555519a <+49>:    jmp    0x5555555551f7 <main+142>
   0x000055555555519c <+51>:    mov    rax,QWORD PTR [rbp-0x8]
   0x00005555555551a0 <+55>:    add    rax,0x4
   0x00005555555551a4 <+59>:    mov    QWORD PTR [rbp-0x10],rax
   0x00005555555551a8 <+63>:    mov    rax,QWORD PTR [rbp-0x10]
   0x00005555555551ac <+67>:    mov    QWORD PTR [rbp-0x18],rax
   0x00005555555551b0 <+71>:    mov    rax,QWORD PTR [rbp-0x18]
=> 0x00005555555551b4 <+75>:    movaps xmm0,XMMWORD PTR [rax]
   0x00005555555551b7 <+78>:    movaps XMMWORD PTR [rbp-0x30],xmm0
   0x00005555555551bb <+82>:    movss  xmm0,DWORD PTR [rbp-0x30]
   0x00005555555551c0 <+87>:    pxor   xmm1,xmm1
   0x00005555555551c4 <+91>:    cvtss2sd xmm1,xmm0
   0x00005555555551c8 <+95>:    movq   rax,xmm1
   0x00005555555551cd <+100>:   movq   xmm0,rax
   0x00005555555551d2 <+105>:   lea    rax,[rip+0xe39]        # 0x555555556012
   0x00005555555551d9 <+112>:   mov    rdi,rax
   0x00005555555551dc <+115>:   mov    eax,0x1
   0x00005555555551e1 <+120>:   call   0x555555555040 <printf@plt>
   0x00005555555551e6 <+125>:   mov    rax,QWORD PTR [rbp-0x8]
   0x00005555555551ea <+129>:   mov    rdi,rax
   0x00005555555551ed <+132>:   call   0x555555555030 <free@plt>
   0x00005555555551f2 <+137>:   mov    eax,0x0
   0x00005555555551f7 <+142>:   leave
   0x00005555555551f8 <+143>:   ret
End of assembler dump.

It crashes because the movaps at 0x5555555551b4 tries to load from a misaligned address (0x5555555592a4, in rax): 0x5555555592a4 % 16 == 4.
Now try float *unaligned = (float *)((char *)aligned + 16); and it will work)
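
If the pointer genuinely cannot be aligned, the unaligned load intrinsic is the usual way out. A minimal sketch, assuming an x86 target with SSE and a compiler that ships `<xmmintrin.h>`; `sum4_any_alignment` is a made-up name. `_mm_loadu_ps` corresponds to movups, which accepts any address, whereas `_mm_load_ps` corresponds to movaps and faults as in the gdb session above.

```c
#include <xmmintrin.h>

/* Sum four consecutive floats starting at `p`, whatever its alignment. */
float sum4_any_alignment(const float *p)
{
    __m128 v = _mm_loadu_ps(p);   /* movups: no alignment requirement */
    float out[4];
    _mm_storeu_ps(out, v);        /* unaligned store into a plain array */
    return out[0] + out[1] + out[2] + out[3];
}
```

Calling it with a pointer 4 bytes past a 16B boundary, the same offset as in the crashing example, works fine.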

Forgive me for not posting screenshots of the disassembly. I use i3, and there is no way to screenshot the terminal without installing scrot or flameshot and adding a scrot -s key binding to the i3 config; besides, the output has already scrolled out of the terminal history buffer and the binary is deleted. I am lazy right now) If anyone knows something more convenient than scrot/flameshot, please let me know)

There is no problem if you write your own compiler that sets up stack frames without 16B alignment, as long as everything stays purely in that language; you can even save some of the memory otherwise lost to padding. But once you make calls out to other languages or runtime libraries, you get all sorts of obvious pain.
There are other reasons too, like cache line splits, but that is far from the first reason for alignment)
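
For the cache-line point, a sketch of the check: an access splits when its first and last byte fall into different lines. The 64-byte line size is an assumption (typical for current x86-64 parts, but not guaranteed), and `crosses_cache_line` is an illustrative name.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed line size; check your CPU */

/* True if an access of `size` bytes at `addr` touches two cache lines. */
int crosses_cache_line(uintptr_t addr, size_t size)
{
    if (size == 0)
        return 0;
    return (addr / CACHE_LINE) != ((addr + size - 1) / CACHE_LINE);
}
```

A 16-byte access at a 16B-aligned address can never split, because 64 is a multiple of 16; the same access at offset 0x38 within a line does.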

shrekushka · 31.08.2024

native binaries crash in top-level exception handler · Issue #5700 · ocaml/ocaml

Original bug ID: 5700 Reporter: @avsm Status: closed (set by @xavierleroy on 2015-12-11T18:25:33Z) Resolution: fixed Priority: high Severity: crash OS: MacOS X OS Version: 10.8 Version: 4.00.0+beta...

github.com

The stack pointer was misaligned when OCaml programs were compiled to native binaries (ocamlopt) and run with OCAMLRUNPARAM=b (print backtrace). The C compiler on OS X 10.8 at the time was strict about 16B boundaries. For those not familiar with OCaml, the caml_stash_backtrace function lets the runtime capture a backtrace. This particular trigger was the lazy initialization of the frame table (which is used for backtraces) coinciding with an exception being raised. The misaligned stack pointer caused problems when caml_stash_backtrace called into libc functions that relied on 16B stack alignment.

I read two patches:
1. caml_raise_exception was modified to ensure the stack is aligned before calling caml_stash_backtrace.
2. I didn't understand the details of the second patch, but it adds an extra alignment adjustment, subq $8, %rsp, before the call in one specific location. The reason is the same again: the stack has to be brought to a 16B boundary one way or another if the code wants to sit comfortably with the native runtime.
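
An 8-byte adjustment like that subq is the usual fix-up: right after a call, the pushed return address leaves rsp 8 bytes off a 16-byte boundary, so 8 more bytes are needed before the next call. A toy model of that arithmetic follows; `padding_before_call` is a hypothetical helper and the addresses are sample values, while the actual fix is in assembly, not C.

```c
#include <stdint.h>

/* Bytes to subtract from `sp` so that it sits on a 16-byte boundary
   at the point of a call. The stack grows down, so the distance to
   the next lower boundary is simply sp mod 16. */
unsigned padding_before_call(uint64_t sp)
{
    return (unsigned)(sp % 16);
}
```

If rsp was aligned at the call site, it is 8 off on entry to the callee, and the helper returns 8, matching the subq $8.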

Let's thank the System V ABI:

https://wiki.osdev.org/System_V_ABI

https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf
