[J-core] Hi - more info on Super-H advantages

Rob Landley rob at landley.net
Thu Jan 11 15:17:09 EST 2018


On 01/11/2018 05:06 AM, Dru Nelson wrote:
> Given your access to the original designers (or not), what are the other
> clever design choices that this CPU pioneered?

Jeff's the guy you want to talk to, but I can provide a little industry
context. (Probably the biggest single insight experience gives you is
that the industry advances when patents EXPIRE, not when they're awarded. :)

A few years ago I wrote up a very basic history of the last few decades
of processor design:

  http://landley.net/aboriginal/architectures.html#when

And before that I wrote a slightly more detailed one which is mirrored at:

  http://landley.net/writing/mirror/fool/todo/rulemaker000222.htm
  http://landley.net/writing/mirror/fool/todo/rulemaker000223.htm
  http://landley.net/writing/mirror/fool/todo/rulemaker000224.htm
  http://landley.net/writing/mirror/fool/todo/rulemaker000225.htm

(It's a bit Intel-centric because I was writing it in the context of a
stock market investment column tracking Intel, but it gets the general
idea across. It was written in 2000, which is right at the tail end of
sh2/3/4 coming out and the handoff to Renesas that stopped interesting
new development on the line.)

But to understand any of that you need to understand die size shrinks
first (which is what drove Moore's Law through its whole ~50-year
s-curve from about 1960 to 2010), which I wrote about in my old "what
Intel does for a living" series at:

  http://landley.net/writing/mirror/fool/CashKingPort980421.html

Here's a quick attempt to summarize:

Back in the 1970s, CISC designs had variable-length instructions that
took a variable number of clock cycles to execute, which acted as a form
of data compression, making your instruction set small and expressive so
you could fit more of it in cache and send less data across the memory
bus. The downside was that it was complicated, full of special cases
such as instructions crossing cache line boundaries or combinations of
prefix and opcode the designers never anticipated (a la the F00F bug).

The x86 architecture wound up king of the hill and is still with us
today, with variants of backwards compatibility going all the way back
to the 8080 processor in Space Invaders coin-operated arcade machines.

In the '80s the first generation of RISC happened, with fixed-length
instructions that each ran in one clock cycle. (The processor was
generally breaking the CISC instructions down into "microcode" anyway,
which was an array of simple instructions that told the processor what
to do on each clock cycle of the CISC instruction. RISC sort of used the
microcode as the instruction set, each instruction fixed length and
taking one clock cycle.)
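
(To make the "microcode as an array" idea concrete, here's a minimal C
sketch; the micro-op names and the expansion are mine, purely for
illustration, not any real chip's microcode:)

  #include <stdio.h>

  /* Purely illustrative: a CISC-style "add [mem],reg" expands into a
   * little array of simple micro-operations, one per clock cycle.
   * First generation RISC roughly exposed that kind of micro-op as the
   * instruction set itself: fixed length, one clock each. */
  enum uop { UOP_LOAD, UOP_ADD, UOP_STORE };

  static const enum uop add_mem_reg[] = { UOP_LOAD, UOP_ADD, UOP_STORE };

  int main(void)
  {
      printf("add [mem],reg expands to %d micro-ops (clock cycles)\n",
             (int)(sizeof(add_mem_reg) / sizeof(add_mem_reg[0])));
      return 0;
  }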

The initial purpose of RISC was simplifying the processor (because that
makes it smaller, and signals take less time to travel down shorter
wires, so you can clock it faster). But pretty soon they went "we know
where the next instruction starts, why not have basically a second
processor look over the shoulder of the first and execute the next
instruction in the same clock cycle when it doesn't interfere with the
first one by using any of the same registers or memory locations? And a
NOP when it would, of course."

Alas, doing that reintroduced buckets of complexity: your compiler had
to reorder the instructions so they didn't interfere (otherwise the
second execution core ran a NOP lots of the time), and there was
nonsense like "branch delay slots" where the instruction after a "jump"
got executed because hey, same clock cycle... But in theory your
processor could go up to twice as fast!
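
(Here's a tiny C sketch of the scheduling problem; the examples are
mine, not from any particular compiler manual:)

  #include <stdio.h>

  /* In f() each statement needs the previous result, so a second
   * execution unit has nothing useful to do in that cycle; in g() the
   * two adds share no registers, so a dual-issue core (or a compiler
   * scheduling for one) can in principle run them in the same clock. */
  static int f(int a, int b, int c)
  {
      int x = a + b;   /* the next instruction needs x... */
      int y = x + c;   /* ...so these two can't pair up   */
      return y;
  }

  static int g(int a, int b, int c, int d)
  {
      int x = a + b;   /* independent of the next add, */
      int y = c + d;   /* so both can issue together   */
      return x + y;
  }

  int main(void)
  {
      printf("%d %d\n", f(1, 2, 3), g(1, 2, 3, 4));
      return 0;
  }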

Of course this made the compiler do a lot more work, and when your
compiler didn't get it right (which nobody managed for the first decade)
your second execution unit sat idle a lot, AND you'd lost the data
compression advantage of CISC, so you were thrashing your cache and
saturating your memory bus and your RISC chips weren't really faster
than CISC. But people were sure this would get fixed any day now and
RISC was the future, and a dozen different architectures sprang up
(MIPS, SPARC, Alpha, PowerPC) determined to displace the PC and own the
future! (A positive gold rush, circa 1990 or so.)

Except as die sizes shrank, allowing clock speeds to go faster (signal
takes less time to travel down shorter wires), processors wound up
"clock doubled" and then "clock tripled", i.e. the processor ran at a
multiple of the motherboard speed. This was fine as long as you were
running code out of CPU cache, but when you were waiting for the cache
to fill? A processor rapidly twiddling its thumbs is not making
progress. I remember a couple of processors going at 20x the system bus
speed and spending LOTS of their time waiting for memory. I.e. memory
bus bandwidth and how much code you could fit in cache became a BIG
DEAL, which made CISC's variable-length instruction data compression
strategy (originally done because memory was tiny and REALLY EXPENSIVE
back in the 1970s) shine again, and first generation RISC's verbosity
hurt badly.
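
(Back-of-the-envelope arithmetic, with numbers made up purely for
illustration: at 20x the bus clock, a cache miss that keeps the memory
bus busy for 10 bus cycles stalls the core for about 200 core cycles,
so even a small miss rate dominates. A quick C sketch of that textbook
effective-cycles-per-instruction calculation:)

  #include <stdio.h>

  /* Illustrative numbers only, not measurements from any real chip. */
  int main(void)
  {
      double base_cpi = 1.0;          /* ideal: 1 instruction per core cycle */
      double miss_rate = 0.02;        /* assumed cache misses per instruction */
      double clock_multiplier = 20.0; /* core clock / memory bus clock */
      double bus_cycles_per_miss = 10.0;

      /* each miss costs this many *core* cycles of thumb-twiddling */
      double penalty = clock_multiplier * bus_cycles_per_miss;
      double cpi = base_cpi + miss_rate * penalty;

      printf("miss penalty: %.0f core cycles\n", penalty);        /* 200 */
      printf("effective CPI: %.1f (ideal %.1f)\n", cpi, base_cpi); /* 5.0 */
      return 0;
  }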

Then the x86/CISC guys went "hey, let's pipeline the microcode", i.e.
have a translation front-end converting the CISC instructions into RISC
instructions, but have it do lots of prefetch and lookahead so you get
an array of instructions. Then do the reordering (to avoid conflicts) in
hardware instead of the compiler, and feed them into multiple execution
units in parallel, just like RISC does when the compiler gets it right.

The result was BUCKETS of complexity, but it got some of the advantages
of RISC and some of the advantages of CISC, and it was backwards
compatible with the installed base of PC software, and die size shrinks
were increasing the transistor budget anyway, and it cut off the air
supply to the first generation RISC gold rush.

So that was the Pentium, which spent the first half of the 1990s
getting a lot of stuff wrong and then wiped the floor with everybody
else starting with the Pentium II.

And THAT's about when SuperH happened. It was a new chip design that
learned from the mistakes of the CISC designs _and_ from the first
generation of RISC designs, and the clever things the CISC people did to
counter it, and Hitachi went "let's make a microcoded RISC processor".
(Um... what?) No really, let's do a RISC instruction set where MOST of
the instructions take 1 clock cycle (in which case the microcode
translation is 1:1), but some instructions expand to lots of microcode.
That way we get some of the data compression advantages of CISC; why
waste bytes describing those extra clock cycles when we don't need to?
And our "fixed length" can be just 16 bits, we'll pack the heck out of
our instruction coding, as tight as we can, but still make it fixed length!
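
(To show how little room 16 bits leaves, here's a C sketch that packs
the SH-style "MOV #imm,Rn" format, 4 bits of opcode, 4 bits of register,
8 bits of signed immediate, as I remember it from the manual; treat the
details as illustrative rather than authoritative:)

  #include <stdint.h>
  #include <stdio.h>

  /* Pack MOV #imm,Rn as 1110 nnnn iiiiiiii (per my reading of the SH-2
   * manual); returns 0 when the operands don't fit, which is when the
   * compiler falls back to a PC-relative load from a literal pool. */
  static uint16_t encode_mov_imm(int rn, int imm)
  {
      if (rn < 0 || rn > 15 || imm < -128 || imm > 127)
          return 0;                     /* needs a literal pool instead */
      return (uint16_t)(0xE000 | (rn << 8) | (imm & 0xFF));
  }

  int main(void)
  {
      printf("mov #100,r3  -> 0x%04X\n", (unsigned)encode_mov_imm(3, 100));
      printf("mov #1000,r3 -> 0x%04X (doesn't fit in one instruction)\n",
             (unsigned)encode_mov_imm(3, 1000));
      return 0;
  }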

And then, rather than try to modify compilers to work with their
processor design like the first generation RISC guys did, the Hitachi
guys instead studied what compilers were outputting and designed their
instruction set to represent that as compactly as possible (optimizing
for the cache and memory bus bottlenecks).

Later stuff happened later, of course. AMD scavenged the DEC Alpha
design team from the corpse of DEC after Compaq was done with it, and
those guys did the Athlon, an x86 design with THREE execution cores. And
then Intel copied it but couldn't keep the extra execution cores busy,
so it added hyper-threading, having them cherry-pick the next
instructions from _two_ register profiles simultaneously. And years
later the Hexagon guys did something clever I blogged about at the time
(http://landley.net/notes-2012.html#24-02-2012), which has both upsides
and downsides but which is totally hamstrung by Qualcomm's lawyers for
another decade anyway. Oh, and along the way Transmeta
moved the CISC translation layer from hardware to software and everybody
else went "you know, bytecode engines are cool, you can fit a lot of
bytecode in cache" and Java tried to pretend it had invented it but
Pascal p-code was from like the '70s and everything from Python to
JavaScript had a bytecode engine under the covers anyway and...

Ahem. Stuff happened. It was fun.

The SuperH design hit a historical sweet spot: enough history had passed
that we finally had a simple and powerful design, but die shrinks hadn't
yet advanced to the point of pad limiting and thermal throttling, where
people were making enormous unrolled pipelines and such, desperately
trying to soak up the transistor budget and turn it into performance,
and spread the power consumption out so the heat sinks didn't melt.
We're cycling back around: we've learned the lessons of the Itanic and
the Pentium 4 and are returning to OLDER designs (like the Pentium M did
circa 2005), because we didn't know where the local peak was until we'd
overshot it...

Rob

P.S. Ask me about VLIW if that's not already covered in the stuff I
linked to; that's probably worth understanding.

Ah, you have a PS:

> BTW, I also perused the programmers guide, and one thing struck me.
> The assembly code for the SH looks a bit like 68K assembly. They even hint
> at this with mentions of "Other CPU" in the beginning of the document.
> The fact that it is 16 registers and uses a 16 bit instruction format
> furthers that notion.
> Yet, I couldn't find a mention of that anywhere.

I vaguely recall there was some sort of politics or procurement thing
where Hitachi engineers were going to make a product using a 68k at one
point, and then didn't because of lawyers or royalties or somebody
insulted somebody's dog or whatever reason, and they did their own
processor instead? (What I've heard thirdhand because engineers said
things to other people over beer and what I can officially source are
two different things.)

Keep in mind Motorola's offensive lawyers are the reason IBM didn't go
with Motorola for the PC either. I might or might not have covered that
in these two computer history blog entries years ago:

  http://landley.net/notes-2010.html#17-07-2010
  http://landley.net/notes-2010.html#19-07-2010

If not, there's probably a mention of it somewhere under:

  http://landley.net/history/mirror/

(I _know_ it's covered in "triumph of the nerds", which was on youtube
last I checked.)

It reminds me of how SEI made a go of Leon Sparc first and then decided
_not_ to tape out an ASIC of it (because of the code density issue) and
instead did a new design from scratch (in this case also compatible with
an existing instruction set, just a different one). In both cases,
"there's already something that works, but we can do a better job"...

Rob

