[J-core] Fun little story about gcc code generation for opterons.

Sun Sep 4 16:23:35 EDT 2016

Fun story Izabera posted to the #musl IRC channel:

  http://repzret.org/p/repzret/

tl;dr: for K8 AMD convinced gcc to stop producing single byte return
instructions because they didn't have enough branch predictor selectors,
and then for K10 they wanted them to go to 3 byte return instructions.
(Then the requirement went away again, but gcc's still doing it.)
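
For reference, the three return encodings involved can be laid out as a small sketch. The opcode bytes are the standard x86 ones. Pairing the 3-byte form with `ret 0` is my reading of the linked article rather than something stated above, and the names `RET_ENCODINGS` and `encoding_sizes` are made up for illustration:

```python
# x86 near-return encodings relevant to the story.
# The dictionary and helper names are hypothetical, for illustration only.
RET_ENCODINGS = {
    "ret": bytes([0xC3]),                # plain single-byte near return
    "rep ret": bytes([0xF3, 0xC3]),      # REP prefix + ret (2 bytes)
    "ret 0": bytes([0xC2, 0x00, 0x00]),  # ret imm16 with a zero immediate (3 bytes)
}

def encoding_sizes(encodings=RET_ENCODINGS):
    """Map each mnemonic to the length of its encoding in bytes."""
    return {name: len(code) for name, code in encodings.items()}

for name, code in RET_ENCODINGS.items():
    hex_bytes = " ".join(f"{b:02x}" for b in code)
    print(f"{name:8} {hex_bytes:10} ({len(code)} bytes)")
```

All three do the same thing architecturally, the only difference is how many bytes the return occupies.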

It would be nice to work into some future talk how pipelines getting
deeper sometimes requires the machine code to get larger in order to
optimize for it.

So far, we're keeping both the assembly code and the transistor count as
small as possible in j-core, so we haven't got deep pipelines and are
thus avoiding the need for stuff like branch prediction. The
question is how far we can go there, and if there's a persistent sweet
spot we can camp.

Branch prediction (and speculative execution) has a fundamental problem
that it sometimes does unnecessary work which has to be discarded, which
reduces power efficiency. In order to maintain multiple contexts you do
register renaming, which means you have multiple copies of circuitry
drawing power in parallel. Great when running from wall current with a
CPU fan, not so great on battery power with a passive heat sink.

The ~j4 plan adds multi-issue, but only 2-way. We'll probably bring the
memory prefetcher back, and might add simple rules like "when execution
advances into the last few bytes of a cache line, request the next one".
But these are compile time options, and the big question is how far to go
down that path: _where_ is the sweet spot? (At what point does the extra
performance beat out losses from things like transistor leakage current,
where going more slowly means paying overhead costs for longer, so that
racing to quiescence and then powering down further wins? And what's the
latency of recovering from powering down further...)
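
A rough sketch of the "last few bytes of a cache line" rule, modeled in software. This is not how j-core implements anything. The 32 byte line size, the 4 byte tail, and the function names are all assumptions picked for illustration:

```python
LINE_SIZE = 32   # assumed cache line size in bytes (hypothetical)
TAIL_BYTES = 4   # assumed meaning of "the last few bytes" (hypothetical)

def next_line_to_prefetch(pc, line_size=LINE_SIZE, tail=TAIL_BYTES):
    """Return the address of the next cache line if pc is in the tail
    of its current line, otherwise None."""
    offset = pc % line_size
    if offset >= line_size - tail:
        return (pc // line_size + 1) * line_size
    return None

def trace_prefetches(pcs):
    """Walk a sequence of program counter values and collect each
    distinct line address the rule would request, in order."""
    requested = []
    for pc in pcs:
        line = next_line_to_prefetch(pc)
        if line is not None and line not in requested:
            requested.append(line)
    return requested

# Straight-line code made of 2-byte instructions, covering two lines.
print(trace_prefetches(range(0, 64, 2)))
```

With straight-line code each line triggers exactly one request for its successor. A taken branch out of the tail would make that request wasted work, which is the same tradeoff as above in miniature.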

One problem Intel and AMD had is that if they DON'T increase their
transistor counts, they get so tiny on modern fabrication processes that
A) they haven't got the surface area to connect up enough I/O pins to
them, B) unless they pad their active circuitry with less-active
circuitry, they can't cool the hotspots in their chips.

Of course the REAL problem Intel and AMD have had for about the past
decade is that if they don't do something shiny, nobody needs to buy the
NEW chips; they can happily keep using the old ones...

Die size shrinks haven't translated to CPU MHz increases at the original
Moore's Law rate in 16 years. (Intel's 1.13 GHz Pentium III recall was
August 2000, it's been _ten_ Moore's Law doubling periods since then and
modern processors are clocked maybe three times that rate at the high
end.)  The switch to 64 bits, and then to SMP, gave customers several
years of Reasons To Buy New Chips. And Itanic and Pentium 4 were such a
DISASTER that ripping them out and backing up to a cleaned up Pentium III
with a lot more L2 cache (I.E. the Pentium M, which the Core and Core 2
descend from) was a huge performance boost by removing the "being stupid"
penalty.
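
To put numbers on the clock speed gap, here's the back-of-envelope arithmetic as a sketch. It takes the recalled Pentium III as 1.13 GHz and uses the ten doublings and "maybe three times" figures from above. The variable and function names are just for illustration:

```python
BASE_GHZ = 1.13       # the Pentium III recalled in August 2000
DOUBLINGS = 10        # doubling periods counted above
ACTUAL_MULTIPLE = 3   # "maybe three times that rate" at the high end

def projected_ghz(base=BASE_GHZ, doublings=DOUBLINGS):
    """Clock rate if frequency had doubled once per period."""
    return base * 2 ** doublings

def shortfall(doublings=DOUBLINGS, actual_multiple=ACTUAL_MULTIPLE):
    """How many times short of the doubling curve the actual clocks are."""
    return 2 ** doublings / actual_multiple

print(f"projected: {projected_ghz():.0f} GHz")
print(f"actual:    {BASE_GHZ * ACTUAL_MULTIPLE:.2f} GHz")
print(f"shortfall: about {shortfall():.0f}x")
```

Ten doublings is a factor of 1024, so the projected clock would be over a terahertz, against the three-and-a-bit GHz we actually got.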

What I'm wondering is, how many of the current pipeline tricks are
"going down the P4 rathole again"?

I don't have as much visibility on the ARM side of things, but the
current chips are NOT as power efficient as the older ones. They have
way bigger batteries (with exciting new chemistries which explode
spectacularly if pierced), but how many phone apps and games "eat your
battery"? I play a game on my phone that can run through the whole
battery from 100% charge to powering itself off in half an hour. (And
the phone gets HOT in my hand when this happens.)

Translation: these chips are only power efficient when essentially idle;
if you _use_ the performance they can provide, it's power expensive. How
much of the power efficiency is software and marketing, not really the
processor?

Rob