[J-core] Fun little story about gcc code generation for opterons.
cr88192 at gmail.com
Wed Sep 7 02:12:33 EDT 2016
On 9/6/2016 1:21 PM, Cedric BAIL wrote:
> On Sun, Sep 4, 2016 at 1:23 PM, Rob Landley <rob at landley.net> wrote:
>> Fun story Izabera posted to the #musl IRC channel:
>> tl;dr: for K8 AMD convinced gcc to stop producing single byte return
>> instructions because they didn't have enough branch predictor selectors,
>> and then for K10 they wanted them to go to 3 byte return instructions.
>> (Then the requirement went away again, but gcc's still doing it.)
> Very interesting read.
meanwhile, I am still using a K10 (a Phenom II 955), but mostly due to
"it works" and "don't have money for a new PC". I realized that, sadly,
despite being a 7 year old chip (which I got used), it would still take
a bit of money to get something solidly better.
meanwhile, entry-level PCs are now coming out with CPUs barely faster
than what they were coming out with in 2007 (my parents got a newer PC
fairly recently, and looking at the HW stats, I was not impressed).
>> It would be nice to work into some future talk how pipelines getting
>> deeper sometimes requires the machine code to get larger in order to
>> optimize for it.
>> So far, we're keeping both the assembly code and the transistor count as
>> small as possible in j-core, so we haven't got deep pipelines and are
>> thus avoiding the need for stuff like branch prediction. The
>> question is how far we can go there, and if there's a persistent sweet
>> spot we can camp.
>> Branch prediction (and speculative execution) has a fundamental problem
>> that it sometimes does unnecessary work which has to be discarded, which
>> reduces power efficiency. In order to maintain multiple contexts you do
>> register renaming, which means you have multiple copies of circuitry
>> drawing power in parallel. Great when running from wall current with a
>> CPU fan, not so great on battery power with a passive heat sink.
> I have never thought about this in these terms, but indeed, doing
> unnecessary work is obviously going to consume more energy than not
> doing it.
though not exactly the same thing, matters of branching also come up in
trace-based interpreters and JITs.
you don't want to branch, because this means ending the trace, which
means overhead (particularly if running in a trampoline loop, *1).
likewise, in the case of an unconditional branch, it is possible to tack
the logic for the following trace(s) onto the end of the current trace.
my SH2 interpreter doesn't currently do this, though it is possible,
along with potentially treating "MOV.L @(PC+disp), Rn" as "load a
constant" or similar (this would need a good heuristic for "the source
address represents a constant value").
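as a rough sketch of what recognizing that instruction could look like
(not what my interpreter does): the decode below follows the usual SH
encoding for the PC-relative MOV.L (top nibble 0xD, then Rn, then an
8-bit displacement scaled by 4). the names decode_movl_pcrel and
looks_constant are made up for illustration, and the "address lies in a
known ROM range" test is only an assumed stand-in for a real heuristic.
it also ignores the delay-slot case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Result of decoding a PC-relative MOV.L (hypothetical helper type). */
typedef struct {
    unsigned rn;    /* destination register number, 0..15 */
    uint32_t addr;  /* effective address of the 32-bit literal */
} PcRelLoad;

/* Returns true if `opcode` is MOV.L @(disp,PC),Rn (pattern 1101nnnndddddddd).
 * `insn_addr` is the address of the instruction itself; the hardware uses
 * that address plus 4, with the low two bits masked off, as the base.
 * Delay-slot placement is not handled in this sketch. */
bool decode_movl_pcrel(uint16_t opcode, uint32_t insn_addr, PcRelLoad *out)
{
    if ((opcode & 0xF000u) != 0xD000u)
        return false;
    out->rn = (opcode >> 8) & 0xFu;
    out->addr = ((insn_addr + 4u) & ~3u) + ((uint32_t)(opcode & 0xFFu) << 2);
    return true;
}

/* Assumed heuristic: only trust the literal if all 4 bytes sit inside a
 * region the caller knows to be read-only (e.g. ROM). */
bool looks_constant(uint32_t addr, uint32_t rom_base, uint32_t rom_size)
{
    uint32_t off;
    if (addr < rom_base)
        return false;
    off = addr - rom_base;
    return off < rom_size && rom_size - off >= 4u;
}

/* Small self-check: 0xD102 at 0x1002 loads R1 from 0x100C. */
uint32_t pcrel_demo(void)
{
    PcRelLoad ld;
    bool ok = decode_movl_pcrel(0xD102u, 0x1002u, &ld);
    assert(ok && ld.rn == 1u);
    return ld.addr;
}
```

if both checks pass, a trace builder could replace the load with an
immediate move; if not, it falls back to an ordinary memory load.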
*1: for a JIT there is the possible option of inlining some of the
trampoline logic onto the end of each trace, turning the common case
into traces effectively tail-calling to the next trace without passing
control back to the trampoline (can't really be safely/portably done in
C, as tail-call optimization is a nonstandard extension).
the problem with a trampoline loop is that it executes calls which can't
really be predicted (since each iteration through the loop, it calls to
a different place). this is admittedly one of the reasons I stopped
using while/switch interpreters, as even in the best case performance
basically "eats it" at the "switch()", and unrolled loops of function
pointers can go significantly faster (though, still not as fast as
direct call instructions or JIT compilation).
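to illustrate the "unrolled loops of function pointers" idea, here is a
minimal sketch, assuming instructions have already been decoded into an
array of small structs. the Op layout, the three handlers, and run_ops
are all invented for this example and are not taken from any particular
interpreter.

```c
#include <assert.h>
#include <stdint.h>

typedef struct Ctx { int32_t r[8]; } Ctx;   /* toy register file */

typedef struct Op Op;
struct Op {
    void (*fn)(Ctx *, const Op *);  /* handler chosen once, at decode time */
    uint8_t d, s;                   /* destination / source register */
    int32_t imm;                    /* immediate operand, if any */
};

static void op_movi(Ctx *c, const Op *o) { c->r[o->d] = o->imm; }
static void op_add(Ctx *c, const Op *o)  { c->r[o->d] += c->r[o->s]; }
static void op_addi(Ctx *c, const Op *o) { c->r[o->d] += o->imm; }

/* Run n pre-decoded ops. The main loop is unrolled by 4, so there is no
 * switch() and only one loop branch per four handler calls. */
void run_ops(Ctx *c, const Op *op, int n)
{
    while (n >= 4) {
        op[0].fn(c, &op[0]);
        op[1].fn(c, &op[1]);
        op[2].fn(c, &op[2]);
        op[3].fn(c, &op[3]);
        op += 4;
        n -= 4;
    }
    while (n-- > 0) {       /* leftover ops */
        op->fn(c, op);
        op++;
    }
}

int32_t dispatch_demo(void)
{
    static const Op prog[] = {
        { op_movi, 0, 0, 5 },   /* r0 = 5   */
        { op_movi, 1, 0, 3 },   /* r1 = 3   */
        { op_add,  0, 1, 0 },   /* r0 += r1 */
        { op_add,  0, 0, 0 },   /* r0 += r0 */
        { op_addi, 0, 0, 1 },   /* r0 += 1  */
    };
    Ctx c = { {0} };
    run_ops(&c, prog, 5);
    assert(c.r[1] == 3);
    return c.r[0];
}
```

the decode cost is paid once per instruction rather than on every
execution, which is where most of the gain over while/switch comes from.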
but, a trampoline is cheaper, as it is only seen "once every so often",
rather than hit once for every instruction.
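a minimal sketch of such a trampoline loop, assuming each trace is a
function that returns a pointer to the next trace to run (NULL to stop).
the Trace and VMState structs and all function names here are
hypothetical, chosen just to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct VMState VMState;
typedef struct Trace Trace;

struct Trace {
    /* Runs the trace body; returns the next trace, or NULL to stop. */
    Trace *(*run)(VMState *vm, Trace *self);
    Trace *next_taken;  /* successor if the ending branch is taken */
    Trace *next_fall;   /* successor on fall-through */
};

struct VMState {
    uint32_t r[16];
    unsigned long dispatches;   /* trips through the trampoline */
};

/* Trace body: r0 += r1; r2 -= 1; branch back while r2 != 0. */
static Trace *trace_loop_body(VMState *vm, Trace *self)
{
    vm->r[0] += vm->r[1];
    vm->r[2] -= 1;
    return vm->r[2] != 0 ? self->next_taken : self->next_fall;
}

static Trace *trace_exit(VMState *vm, Trace *self)
{
    (void)vm;
    (void)self;
    return NULL;
}

/* The trampoline: one hard-to-predict indirect call per trace, rather
 * than one dispatch per interpreted instruction. */
void vm_run(VMState *vm, Trace *tr)
{
    while (tr != NULL) {
        vm->dispatches++;
        tr = tr->run(vm, tr);
    }
}

/* Adds 7 to r0 five times, then exits. */
uint32_t trampoline_demo(void)
{
    Trace exit_tr = { trace_exit, NULL, NULL };
    Trace loop_tr = { trace_loop_body, NULL, &exit_tr };
    VMState vm = { {0}, 0 };

    loop_tr.next_taken = &loop_tr;  /* the loop branches back to itself */
    vm.r[1] = 7;
    vm.r[2] = 5;
    vm_run(&vm, &loop_tr);
    assert(vm.dispatches == 6);     /* 5 loop iterations + 1 exit trace */
    return vm.r[0];
}
```

a JIT could instead end each trace with a direct jump to its successor;
in portable C the return-to-loop form is about as close as one can get.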
but, a lot of this stuff does come at a big cost: memory; and, by
extension, how effectively the interpreted code and VM state can fit
I have observed that the relative costs of a lot of this are a lot lower
on ARM hardware, where the performance cost of branches is a lot lower
than on x86 systems (well, at least on ARM11; I have not really tested
this on newer hardware).
but, generally, the performance cost of the trampoline is offset by the
ability to use green-threads and by much lower scheduler latency, though
at the cost of using a non-standard ABI.
I am left to wonder some about the viability of (for embedded) making
use of a C compiler which produces entirely trampolined code, and thus
allowing both threading and low-latency scheduling without the usual
limitations of an interrupt-driven preemptive scheduler (probably
holding the CPU for too long or spinning in an infinite loop would be an
for example, one could compile code in such a way that they could run
the scheduler at around 250 kHz or 1 MHz or so (assuming a reasonably
fast CPU), and handle events without having to resort to doing
everything in an explicitly event-driven style, and simplify enforcing
CPU usage quotas and reliably servicing events in a timely manner.
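as a rough sketch of the idea (not an actual compiler ABI): if every
piece of code is a short step that returns its continuation, a scheduler
can sit in the trampoline and enforce a per-task step quota without any
timer interrupt. everything below (Step, Task, sched_run, the quota
parameter) is hypothetical and hand-written; a real system would have
the compiler emit the steps.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Task Task;
typedef struct Step Step;

/* A continuation: the next step to run, wrapped in a struct so that a
 * step function can return "another step". fn == NULL means finished. */
struct Step { Step (*fn)(Task *); };

struct Task {
    Step next;          /* where this task resumes */
    unsigned counter;   /* toy workload state */
    unsigned limit;
    unsigned steps_run; /* accounting, usable for CPU quotas */
};

/* One short, bounded unit of work; it always returns to the trampoline. */
static Step step_count(Task *t)
{
    Step s;
    t->counter++;
    s.fn = (t->counter < t->limit) ? step_count : NULL;
    return s;
}

/* Round-robin scheduler living inside the trampoline. Each task runs at
 * most `quota` steps before the next task gets the CPU.
 * Returns the number of passes over the task list. */
unsigned sched_run(Task *tasks, size_t ntasks, unsigned quota)
{
    unsigned passes = 0;
    for (;;) {
        size_t live = 0;
        size_t i;
        for (i = 0; i < ntasks; i++) {
            Task *t = &tasks[i];
            unsigned budget = quota;
            while (t->next.fn != NULL && budget-- > 0) {
                t->next = t->next.fn(t);
                t->steps_run++;
            }
            if (t->next.fn != NULL)
                live++;
        }
        passes++;
        if (live == 0)
            break;
    }
    return passes;
}

unsigned sched_demo(void)
{
    Task tasks[2] = {
        { { step_count }, 0, 10, 0 },   /* needs 10 steps */
        { { step_count }, 0, 3, 0 },    /* needs 3 steps  */
    };
    unsigned passes = sched_run(tasks, 2, 4);
    assert(tasks[0].counter == 10 && tasks[1].counter == 3);
    return passes;
}
```

since no step can run longer than its own body, the worst-case latency
before the scheduler regains control is bounded by the longest step.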
but, granted, some may object to the implied performance cost of making
global use of trampolines a part of the C ABI.
note that this would be separate from the use of a VM, as the trampoline
could instead exist as an extension of the OS's scheduler.