[J-core] Fun little story about gcc code generation for opterons.

Tue Sep 6 14:21:46 EDT 2016

On Sun, Sep 4, 2016 at 1:23 PM, Rob Landley <rob at landley.net> wrote:
> Fun story Izabera posted to the #musl IRC channel:
>
>   http://repzret.org/p/repzret/
>
> tl;dr: for K8 AMD convinced gcc to stop producing single byte return
> instructions because they didn't have enough branch predictor selectors,
> and then for K10 they wanted them to go to 3 byte return instructions.
> (Then the requirement went away again, but gcc's still doing it.)

Very interesting read.

> It would be nice to work into some future talk how pipelines getting
> deeper sometimes requires the machine code to get larger in order to
> optimize for it.
>
> So far, we're keeping both the assembly code and the transistor count as
> small as possible in j-core, so we haven't got deep pipelines and are
> thus avoiding the need for stuff like branch prediction. The
> question is how far we can go there, and if there's a persistent sweet
> spot we can camp.
>
> Branch prediction (and speculative execution) has a fundamental problem
> that it sometimes does unnecessary work which has to be discarded, which
> reduces power efficiency. In order to maintain multiple contexts you do
> register renaming, which means you have multiple copies of circuitry
> drawing power in parallel. Great when running from wall current with a
> CPU fan, not so great on battery power with a passive heat sink.

I have never thought about it in these terms, but indeed, doing
unnecessary work is obviously going to consume more energy than not
doing it.

> The ~j4 plan adds multi-issue, but only 2-way. We'll probably bring the
> memory prefetcher back, and might add simple rules like "when execution
> advances into the last few bytes of a cache line, request the next one".
> But these are compile time options, and the big question is how far to go
> down that path: _where_ is the sweet spot? (At what point does the extra
> performance beat out losses from things like transistor leakage current
> where going more slowly charges you overhead costs longer so the
> race-to-quiescence and then power down further wins? And what's the
> latency of recovering from powering down further...)

Quick question here: do you think the j4 will fit in a not too
expensive FPGA board? If yes, this opens the opportunity to benchmark
different solutions in real-life scenarios.

> One problem Intel and AMD had is that if they DON'T increase their
> transistor counts, they get so tiny on modern fabrication processes that
> A) they haven't got the surface area to connect up enough I/O pins to
> them, B) unless they pad their active circuitry with less-active
> circuitry, they can't cool the hotspots in their chips.
>
> Of course the REAL problem Intel and AMD have had for about the past
> decade is if they don't do something shiny nobody needs to buy the NEW
> chips, they can happily keep using the old ones...
>
> Die size shrinks haven't translated to CPU MHz increases at the original
> Moore's Law rate in 16 years. (Intel's 1.13 GHz Pentium III recall was
> August 2000, it's been _ten_ Moore's Law doubling periods since then and
> modern processors are clocked maybe three times that rate at the high
> end.)  The switch to 64 bits, and then to SMP, gave customers several
> years of Reasons To Buy New Chips. And the fact that Itanic and Pentium
> 4 was such a DISASTER that ripping them out and backing up to a cleaned
> up Pentium III with a lot more L2 cache (I.E. the Pentium M, which the
> core and core II descend from) was a huge performance boost by removing
> the "being stupid" penalty.
>
> What I'm wondering is, how much of the current pipeline tricks are
> "going down the P4 rathole again"?
>
> I don't have as much visibility on the arm side of things, but the
> current chips are NOT as power efficient as the older ones. They have
> way bigger batteries (with exciting new chemistries which explode
> spectacularly if pierced), but how many phone apps and games "eat your
> battery"? I play a game on my phone that can run through the whole
> battery from 100% charge to powering itself off in half an hour. (And
> the phone gets HOT in my hand when this happens.)

I think that ARM is in a tighter spot than Intel and AMD on this
subject, mostly because ARM needs to consume less energy in its market
and still deliver good peak performance. They have pushed interesting
hardware solutions like big.LITTLE, but the software still has to catch
up, all the way up the stack.

First, the Linux kernel scheduler is really not helping with managing
frequency, idle and scheduling when you want to be energy efficient.
Work is being done to address that, but it is still not there, and
likely won't be for another year or two. But that is just the first
change. The kernel only knows what a task has done in the past and
bases its decisions on that. Most applications mix all their
operations into one thread. It is pretty visible on interactive UX
tasks, which spend some time IO bound, followed by CPU-bound and
finally memory-bound operations. The kernel is basically never going to
set things right for that kind of process. Each load needs to be moved
into its own thread, even if they are activated sequentially, as this is
the only way for the kernel to learn the pattern. It is particularly
visible in big.LITTLE configurations, but any dynamic frequency change
and CPU idle change will also show this problem. Android tried to work
around the issue by having some custom schedulers, but they don't
really help. So hardware manufacturers rely for the moment on user
space daemons watching the processes that are running and trying to
tune the system for best performance (at the price of a bad
energy/performance ratio).
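
To make the "one thread per kind of load" idea concrete, here is a
minimal sketch in Python. It is only an illustration, not code from any
real application: load(), decode() and layout() are hypothetical
stand-ins for the IO-bound, CPU-bound and memory-bound steps, and
stage() and run_pipeline() are names I made up. Each step lives in its
own OS thread, so the history the kernel keeps for each thread covers a
single kind of work.

```python
import queue
import threading

STOP = object()  # sentinel telling a stage to shut down


def stage(name, work, inbox, outbox):
    """Run one kind of work in its own thread and forward the results."""
    def loop():
        while True:
            item = inbox.get()
            if item is STOP:
                outbox.put(STOP)  # propagate shutdown to the next stage
                return
            outbox.put(work(item))

    thread = threading.Thread(target=loop, name=name)
    thread.start()
    return thread


# Hypothetical stand-ins for the three kinds of load.
def load(path):
    # Would be IO bound in a real application (file or network read).
    return list(range(1000))


def decode(data):
    # CPU-bound step.
    return [x * x for x in data]


def layout(data):
    # Would be memory bound in a real application.
    return sum(data)


def run_pipeline(paths):
    q_in, q_io, q_cpu, q_out = (queue.Queue() for _ in range(4))
    threads = [
        stage("io", load, q_in, q_io),
        stage("cpu", decode, q_io, q_cpu),
        stage("mem", layout, q_cpu, q_out),
    ]
    for path in paths:
        q_in.put(path)
    q_in.put(STOP)

    results = []
    while True:
        result = q_out.get()
        if result is STOP:
            break
        results.append(result)
    for thread in threads:
        thread.join()
    return results
```

The steps are still activated one after the other for a given item; the
only change is that each kind of work always runs in the same thread.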

Another reason why your phone can get hot and see its battery drained
is OpenGL. OpenGL is almost impossible to use efficiently on a
multi-threaded system. You mostly end up with one core running at
100%... which is the worst-case scenario for this kind of system. With
Vulkan it should be possible to have better threading usage, so less
complex games could actually consume less energy by spreading the load
over multiple cores more efficiently, and better caching of results,
which should also lead to improved energy usage (this is still
theoretical for me as I haven't played with it yet).

Oh, and something people sometimes also forget is that wireless data
is really expensive. The more you try to send, the more you will need
to retry in case of packet loss. This applies to the entire stack.
Your 4G layer will do some retransmission, as will your wifi layer,
and your TCP connection will do so too. It is a real problem on
Android especially, as applications stay in the background and
generate a lot of network traffic. Optimization there is also quite
tricky. There is a sweet spot between using a human-readable protocol
with compression and using a binary protocol. It requires experiments
to actually find the best solution for each use case.
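
As a starting point for that kind of experiment, here is a rough sketch
in Python that encodes the same data three ways and reports the byte
counts. The record layout (a timestamp "t" and a small value "v") and
all the function names are assumptions made up for the example, and
byte count is only a proxy: it says nothing about the CPU cost of
encoding or about how the radio behaves.

```python
import json
import struct
import zlib


def encode_text(samples):
    """Human-readable encoding: plain JSON."""
    return json.dumps(samples).encode("utf-8")


def encode_text_compressed(samples):
    """The same JSON, compressed with zlib."""
    return zlib.compress(encode_text(samples))


def encode_binary(samples):
    """Fixed binary layout per sample (assumed for this example):
    little-endian uint32 timestamp followed by an int16 value."""
    return b"".join(struct.pack("<Ih", s["t"], s["v"]) for s in samples)


def sizes(samples):
    """Return the size in bytes of each encoding."""
    return {
        "text": len(encode_text(samples)),
        "text_zlib": len(encode_text_compressed(samples)),
        "binary": len(encode_binary(samples)),
    }


if __name__ == "__main__":
    # Made-up payload: 200 samples with increasing timestamps.
    payload = [{"t": 1473186106 + i, "v": i % 100} for i in range(200)]
    print(sizes(payload))
```

Which encoding wins depends on the payload, so it has to be run against
data that looks like what the application really sends.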

Of course the problem is that migrating the entire stack to a more
energy efficient architecture and software stack takes a lot of time
and so money. Most companies involved in building these devices don't
have an incentive to improve things. If you sell a game for $1 or even
nothing, you clearly have no spare money to invest in a more energy
efficient solution. You rely heavily on the SDK and the work of others
for that purpose. General knowledge about optimizing for battery usage
is also not very common. People may optimize for performance,
sometimes for memory (if they hit a limit somewhere) and even more
rarely for battery. The tools for measuring battery usage are also not
very precise, so it requires a lot of trial and error before you
figure out the right configuration. Caching computation results is a
good example of this problem. If you cache data, you are likely going
to be faster the next time you require the same computation, but now
you use more memory. On most systems, it won't be a real problem, but
actually accessing the main memory more often does consume more energy
than if you fit into the CPU cache. It becomes a tradeoff between CPU,
memory and energy usage.
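
A small Python sketch of the knob involved, using the standard
functools.lru_cache. The function expensive(), the helper run() and the
bound of 64 entries are arbitrary choices for illustration. The cache
size caps the memory used, and the hit and miss counters show how much
computation it saves for a given workload; neither measures energy, so
the right bound still has to be found by measuring on the device.

```python
from functools import lru_cache


@lru_cache(maxsize=64)  # arbitrary bound: caps memory spent on cached results
def expensive(n):
    """Stand-in for a costly computation."""
    return sum(i * i for i in range(n))


def run(workload):
    """Replay a workload and report (hits, misses, entries kept)."""
    expensive.cache_clear()
    for n in workload:
        expensive(n)
    info = expensive.cache_info()
    return info.hits, info.misses, info.currsize
```

For example, run([10, 20, 10, 10]) gives 2 hits and 2 misses with 2
entries kept, while 100 distinct inputs give no hits at all and the
cache stays capped at 64 entries.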

Also, most big names haven't invested much in energy efficiency. Tizen
usually uses 30% less battery than Android on the same hardware for
the base applications. Microsoft has teased Google about the energy
efficiency of Edge and it seems like they are starting to invest
seriously in it (well, web technologies have a really long road to
energy efficiency anyway). Today I get one hour more than what Dell
advertises for the XPS 13, just through a proper software setup of my
laptop.

In general, software is still catching up with hardware. As another
example of hardware providing features that software has yet to
leverage, a lot of ARM SoCs have hardware capable of decoding JPEG.
Still, most software does not use it, even though it is more energy
efficient than decoding in software. It is my opinion, but hardware
has advanced too fast for software and there is a lot of work to do
just to leverage what we have today in an efficient way.

> Translation: these chips are only power efficient when essentially idle,
> if you _use_ the performance they can provide, it's power expensive. How
> much of the power efficiency is software and marketing, not really the
> processor?

How much is just marketing today? :-)
-- 
Cedric BAIL