I'm aligning branch targets with NOPs, and sometimes the CPU executes these NOPs, up to 15 of them. How many 1-byte NOPs can Skylake execute in one cycle? What about other Intel-compatible processors, like AMD? I'm interested not only in Skylake but in other microarchitectures as well. How many cycles may it take to execute a sequence of 15 NOPs? I want to know whether the extra code size and extra execution time of adding these NOPs is worth the price. It is not me who is adding these NOPs but an assembler, automatically, whenever I write an
Update: I have managed it to automatically insert multibyte
See also Cody's answer for lots of good stuff I'm leaving out because he covered it already.
Never use multiple 1-byte NOPs. All assemblers have ways to get long NOPs; see below.
15 NOPs take 3.75c to issue at the usual 4 per clock, but might not slow down your code at all if it was bottlenecked on a long dependency chain at that point. They do take up space in the ROB all the way until retirement. The only thing they don't do is use an execution port. The point is, CPU performance isn't additive. You can't just say "this takes 5 cycles and this takes 3, so together they will take 8". The point of out-of-order execution is to overlap with surrounding code.
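To make the 3.75c figure concrete, here's a rough sketch of the front-end arithmetic in Python. It assumes a 4-wide issue stage and nothing else limiting throughput. The names `nops_needed` and `issue_cycles` are made up for illustration, and the 8-byte maximum NOP length is just an example value, not a recommendation. It models issue bandwidth only, so it says nothing about how much of that cost is hidden by overlap with surrounding code.

```python
from math import ceil

ISSUE_WIDTH = 4  # uops issued per clock (assumption: "the usual 4 per clock")

def nops_needed(pad_bytes, max_nop_len):
    """Number of NOP instructions needed to fill pad_bytes of padding."""
    return ceil(pad_bytes / max_nop_len)

def issue_cycles(uops, width=ISSUE_WIDTH):
    """Issue bandwidth consumed by `uops`, expressed in cycles."""
    return uops / width

single = issue_cycles(nops_needed(15, 1))  # fifteen 1-byte NOPs
long_ = issue_cycles(nops_needed(15, 8))   # hypothetical 8-byte max NOP length
print(single, long_)  # 3.75 0.5
```

Same 15 bytes of padding, but far fewer issue slots and ROB entries when the assembler emits long NOPs.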
The worst effect of many 1-byte short NOPs on SnB-family is that they tend to overflow the uop-cache limit of 3 lines per aligned 32B chunk of x86 code. This would mean that the whole 32B block always has to run from the decoders, not the uop cache or loop buffer. (The loop buffer only works for loops that have all their uops in the uop cache.)
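As a back-of-the-envelope check on that limit, here's a small Python sketch. It assumes each uop-cache line holds up to 6 uops, and it ignores the other packing rules the hardware has, so it gives a best case: real code can need more lines than this predicts. The function names and the "5 other uops" in the example are hypothetical.

```python
from math import ceil

UOPS_PER_LINE = 6        # assumed uop-cache line capacity on SnB-family
MAX_LINES_PER_CHUNK = 3  # limit per aligned 32B chunk of x86 code

def lines_needed(uop_count):
    """Best-case number of uop-cache lines for one 32B chunk."""
    return ceil(uop_count / UOPS_PER_LINE)

def fits_in_uop_cache(uop_count):
    """True if the chunk's uops could fit within the 3-line limit."""
    return lines_needed(uop_count) <= MAX_LINES_PER_CHUNK

# A chunk with 15 one-byte NOPs plus 5 other uops (hypothetical mix):
print(fits_in_uop_cache(15 + 5))  # False
# The same padding done with 2 long NOPs instead:
print(fits_in_uop_cache(2 + 5))   # True
```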
You should only ever have at most 2 NOPs in a row that actually execute, and then only if you need to pad by more than 10B or 15B or something. (Some CPUs do very badly when decoding instructions with very many prefixes, so for NOPs that actually execute it's probably best not to repeat prefixes out to 15B, the max x86 instruction length.)
YASM defaults to making long NOPs. For NASM, use the smartalign standard macro package, which isn't enabled by default. It forces you to pick a NOP strategy.
    %use smartalign
    ALIGNMODE p6, 32   ; p6 NOP strategy, and jump over the NOPs only if they're 32B or larger.
IDK if 32 is optimal. Also, beware that the longest NOPs might use a lot of prefixes and decode slowly on Silvermont, or on AMD. Check the NASM manual for other modes.
The GNU assembler's .p2align directive gives you some conditional behaviour: .p2align 4,,10 will align to 16 (1<<4), but only if that skips 10 bytes or fewer. (The empty 2nd arg means the filler is NOPs, and the power-of-2 align name is because plain .align is power-of-2 on some platforms but byte-count on others.) gcc often emits this before the top of loops:
        .p2align 4,,10
        .p2align 3
    .L7:
So you always get 8-byte alignment (unconditional .p2align 3), but maybe also 16 unless that would waste more than 10B. Putting the larger alignment first is important to avoid getting e.g. a 1-byte NOP and then an 8-byte NOP instead of a single 9-byte NOP.
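Here's a minimal Python model of that pair of directives, in case the ordering argument isn't obvious. It only computes how many padding bytes each directive would request at a given address. How the assembler splits each pad into actual NOP instructions isn't modelled. `p2align_pad`, `pad_sequence` and the example addresses are all made up for illustration.

```python
def p2align_pad(addr, power, max_skip=None):
    """Padding bytes a `.p2align power,,max_skip` would request at addr."""
    pad = -addr % (1 << power)
    if max_skip is not None and pad > max_skip:
        return 0  # would skip too many bytes: the alignment is not done at all
    return pad

def pad_sequence(addr, directives):
    """Apply directives in order; return (final address, list of nonzero pads)."""
    pads = []
    for power, max_skip in directives:
        pad = p2align_pad(addr, power, max_skip)
        if pad:
            pads.append(pad)
        addr += pad
    return addr, pads

GCC_ORDER = [(4, 10), (3, None)]  # .p2align 4,,10 then .p2align 3
REVERSED = [(3, None), (4, 10)]   # the order to avoid

print(pad_sequence(0x1007, GCC_ORDER))  # (4112, [9])
print(pad_sequence(0x1007, REVERSED))   # (4112, [1, 8])
print(pad_sequence(0x1003, GCC_ORDER))  # (4104, [5])
```

Both orders land on the same address from 0x1007, but the reversed order asks for two separate pads (1 byte, then 8) where gcc's order asks for one 9-byte pad. From 0x1003, reaching 16 would cost 13 bytes, so only the 8-byte alignment happens.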
It's probably possible to implement this functionality with a NASM macro.
Missing features no assembler has (AFAIK):
It's a good thing alignment for decode bottlenecks isn't usually very important anymore, because tweaking it usually involves manual assemble/disassemble/edit cycles, and has to get looked at again if the preceding code changes.
Especially if you have the luxury of tuning for a limited set of CPUs, test and don't pad if you don't find a perf benefit. In a lot of cases, especially for CPUs with a uop cache and/or loop buffer, it's ok not to align branch targets within functions, even loops.
Some of the performance-variation due to varying alignment is that it makes different branches alias each other in the branch-prediction caches. This secondary subtle effect is still present even when the uop cache works perfectly and there are no front-end bottlenecks from fetching mostly-empty lines from the uop cache.