[J-core] Fun little story about gcc code generation for opterons.

Wed Sep 7 02:12:33 EDT 2016

On 9/6/2016 1:21 PM, Cedric BAIL wrote:
> On Sun, Sep 4, 2016 at 1:23 PM, Rob Landley <rob at landley.net> wrote:
>> Fun story Izabera posted to the #musl IRC channel:
>>
>>    http://repzret.org/p/repzret/
>>
>> tl;dr: for K8 AMD convinced gcc to stop producing single byte return
>> instructions because they didn't have enough branch predictor selectors,
>> and then for K10 they wanted them to go to 3 byte return instructions.
>> (Then the requirement went away again, but gcc's still doing it.)
> Very interesting read.

meanwhile, I am still using a K10 (a Phenom II 955), but mostly due to 
"it works" and "don't have money for a new PC". I realized that, sadly, 
despite it being a 7-year-old chip (which I got used), it would still 
take a bit of money to get something solidly better.

meanwhile, entry-level PCs are now coming out with CPUs barely faster 
than what they were coming out with in 2007 (my parents got a newer PC 
fairly recently, and looking at the HW stats, I was not impressed).

>> It would be nice to work into some future talk how pipelines getting
>> deeper sometimes requires the machine code to get larger in order to
>> optimize for it.
>>
>> So far, we're keeping both the assembly code and the transistor count as
>> small as possible in j-core, so we haven't got deep pipelines and are
>> thus are avoiding the need for stuff like branch prediction. The
>> question is how far we can go there, and if there's a persistent sweet
>> spot we can camp.
>>
>> Branch prediction (and speculative execution) has a fundamental problem
>> that it sometimes does unnecessary work which has to be discarded, which
>> reduces power efficiency. In order to maintain multiple contexts you do
>> register renaming, which means you have multiple copies of circuitry
>> drawing power in parallel. Great when running from wall current with a
>> CPU fan, not so great on battery power with a passive heat sink.
> I have never though about this in this term, but indeed, doing
> unecessary work is obviously going to consume more energy that not
> doing it.

though not exactly the same thing, the cost of branching also comes up 
in trace-based interpreters and JITs.
you don't want to branch, because this means ending the trace, which 
means overhead (particularly if running in a trampoline loop, *1).

likewise, in the case of an unconditional branch, it is possible to tack 
the logic for the following trace(s) onto the end of the current trace. 
my SH2 interpreter doesn't currently do this, though it is possible. 
it would also be possible to infer that "MOV.L @(PC+disp), Rn" means 
"load a constant" or similar, but this would need a good heuristic for 
"the source address represents a constant value".
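
to make the "load a constant" idea a bit more concrete, here is a rough 
sketch in C, not taken from any actual interpreter. it decodes the 
PC-relative long load (written "MOV.L @(disp,PC),Rn" in the SH manuals, 
encoding 1101nnnn dddddddd) and works out the address of the literal. 
the "is it really a constant" check is purely an assumption for 
illustration: it just asks whether the literal lies inside a read-only 
region given by the hypothetical rom_base/rom_size parameters. a real 
heuristic would likely need more than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* decoded form of a PC-relative long load (names are illustrative) */
typedef struct {
    unsigned rn;    /* destination register number, 0..15 */
    uint32_t addr;  /* effective address of the 32-bit literal */
} PcRelLoad;

/* returns true and fills *out if `insn`, located at address `pc`,
 * is MOV.L @(disp,PC),Rn (encoding 1101nnnn dddddddd). */
bool decode_movl_pcrel(uint16_t insn, uint32_t pc, PcRelLoad *out)
{
    if ((insn & 0xF000u) != 0xD000u)
        return false;
    out->rn = (insn >> 8) & 0xFu;
    /* base is (instruction address + 4) rounded down to a multiple
     * of 4; the 8-bit displacement is scaled by 4 */
    out->addr = ((pc + 4u) & ~(uint32_t)3u)
              + ((uint32_t)(insn & 0xFFu) << 2);
    return true;
}

/* ASSUMPTION: a stand-in heuristic. treat the literal as a constant
 * only if all 4 bytes fall inside a region known to be read-only. */
bool literal_is_constant(const PcRelLoad *ld,
                         uint32_t rom_base, uint32_t rom_size)
{
    return rom_size >= 4u
        && ld->addr >= rom_base
        && ld->addr - rom_base <= rom_size - 4u;
}
```

if both checks pass, the trace builder could read the literal once and 
emit a "load immediate" operation instead of a memory access.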

*1: for a JIT there is the possible option of inlining some of the 
trampoline logic onto the end of each trace, turning the common case 
into traces effectively tail-calling to the next trace without passing 
control back to the trampoline (this can't really be done 
safely/portably in C, as the C standard doesn't guarantee tail-call 
optimization; where available, it is a compiler-specific behavior).

the problem with a trampoline loop is that it executes calls which can't 
really be predicted (since each iteration through the loop, it calls to 
a different place). this is admittedly one of the reasons I stopped 
using while/switch interpreters, as even in the best case performance 
basically "eats it" at the "switch()", and unrolled loops of function 
pointers can go significantly faster (though, still not as fast as 
direct call instructions or JIT compilation).
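
for reference, a minimal sketch of what such a trampoline loop can look 
like in C. this is illustrative only: the VmState and Trace types and 
the tiny "sum 1..n" program are made up for the example, and a real 
trace would hold a run of decoded instructions rather than one 
hard-coded body.

```c
#include <stddef.h>

typedef struct VmState VmState;
typedef struct Trace Trace;

/* a trace: a straight-line run of guest code ending in a branch */
struct Trace {
    Trace *(*run)(VmState *vm, Trace *self); /* returns the next trace */
    Trace *next;   /* successor if the ending branch is not taken */
    Trace *taken;  /* successor if the ending branch is taken */
};

struct VmState {
    int counter;
    int acc;
};

/* trace body: acc += counter; counter--; loop while counter > 0 */
static Trace *trace_loop_body(VmState *vm, Trace *self)
{
    vm->acc += vm->counter;
    vm->counter--;
    return (vm->counter > 0) ? self->taken : self->next;
}

/* final trace: returning NULL tells the trampoline to stop */
static Trace *trace_halt(VmState *vm, Trace *self)
{
    (void)vm;
    (void)self;
    return NULL;
}

/* the trampoline: one indirect call per trace. the call target
 * changes from one iteration to the next, which is what makes it
 * hard for the CPU to predict. */
static void run_trampoline(VmState *vm, Trace *tr)
{
    while (tr)
        tr = tr->run(vm, tr);
}

/* builds a two-trace program that sums 1..n and runs it */
int sum_to_n(int n)
{
    Trace halt = { trace_halt, NULL, NULL };
    Trace body = { trace_loop_body, &halt, NULL };
    VmState vm = { n, 0 };

    body.taken = &body;  /* the backward branch targets this trace */
    run_trampoline(&vm, &body);
    return vm.acc;
}
```

here sum_to_n(4) bounces through the trampoline five times (four loop 
iterations plus the halt trace) and returns 10.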

but, a trampoline is cheaper, as it is only hit once per trace, rather 
than once for every instruction.
a lot of this stuff does come at a big cost, though: memory; and, by 
extension, how effectively the interpreted code and VM state can fit 
into cache.

I have observed that the relative costs of a lot of this are a lot lower 
on ARM hardware, where the performance cost of branches is a lot lower 
than on x86 systems (well, at least on ARM11; I haven't really tested 
this on newer hardware).

but, generally, the performance cost of the trampoline is offset by the 
ability to use green-threads, and it allows for much lower scheduler 
latency, though at the cost of using a non-standard ABI.

I am left to wonder about the viability (for embedded) of using a C 
compiler which produces entirely trampolined code, which would allow 
both threading and low-latency scheduling without the usual limitations 
of an interrupt-driven preemptive scheduler (holding the CPU for too 
long or spinning in an infinite loop would probably be an ABI fault).

for example, one could compile code in such a way that the scheduler 
could run at around 250 kHz or 1 MHz or so (assuming a reasonably fast 
CPU). events could then be handled without having to resort to doing 
everything in an explicitly event-driven style, and it would simplify 
enforcing CPU usage quotas and reliably servicing events in a timely 
manner.
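
a rough sketch of the idea in C, with everything here (Task, Cont, 
task_step, run_scheduler, the quota parameter) being hypothetical names 
for illustration rather than any existing ABI. each task is broken into 
short steps that return "what to run next" to the trampoline, so the 
scheduler gets control back after every step and can switch tasks or 
enforce a quota without needing a timer interrupt.

```c
#include <stddef.h>

typedef struct Task Task;

/* a continuation: the next piece of code a task wants to run.
 * wrapped in a struct because a C function type can't name itself
 * as its own return type. */
typedef struct Cont {
    struct Cont (*fn)(Task *t);  /* NULL means the task has finished */
} Cont;

struct Task {
    Cont next;       /* where to resume on the next bounce */
    int  remaining;  /* units of work left (stand-in for real code) */
    int  work_done;  /* units of work completed so far */
};

/* one short, bounded step; control returns to the trampoline after
 * every unit of work */
static Cont task_step(Task *t)
{
    Cont c = { NULL };
    t->work_done++;
    if (--t->remaining > 0)
        c.fn = task_step;
    return c;
}

void task_init(Task *t, int units)
{
    t->next.fn = task_step;
    t->remaining = units;
    t->work_done = 0;
}

/* round-robin scheduler folded into the trampoline: each live task
 * gets at most `quota` bounces per time slice. returns the number
 * of time slices handed out. */
int run_scheduler(Task *tasks, int ntasks, int quota)
{
    int slices = 0;
    int live;

    do {
        live = 0;
        for (int i = 0; i < ntasks; i++) {
            Task *t = &tasks[i];
            if (!t->next.fn)
                continue;  /* already finished */
            for (int q = 0; q < quota && t->next.fn; q++)
                t->next = t->next.fn(t);
            slices++;
            if (t->next.fn)
                live++;
        }
    } while (live > 0);
    return slices;
}
```

note that this only works if every step returns promptly, which is why 
a step that spins or holds the CPU too long would have to count as a 
fault.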

but, granted, some may object to the implied performance cost of making 
global use of trampolines a part of the C ABI.

note that this would be separate from the use of a VM, as the trampoline 
could instead exist as an extension of the OS's scheduler.

or such...