[J-core] [RFC] SIMD extension for J-Core

Fri Oct 27 23:16:30 EDT 2017

On 10/27/2017 7:48 PM, Cedric BAIL wrote:
> On Fri, Oct 27, 2017 at 3:37 PM, Rob Landley <rob at landley.net> wrote:
>> On 10/26/2017 08:00 PM, Cedric BAIL wrote:
>>>      > SIMD Configuration Instructions:
>>>      >
>>>      > SIMD.IMODE - This configures the Integer Math mode of the Integer SIMD
>>>      > Operations. The accepted modes should include the following:
>>>      > * Integer Carry Mode - See ADDC and SUBC for example.
>>>      > * Value UnderFlow and OverFlow Mode - See ADDV and SUBV  for examples.
>>>      > * Integer Type: Signed and Unsigned values with sizes of 8-bit,
>>>      > 16-bit, 32-bit, and 64-bits.
>>>
>>>      We're really short on instruction space. We fit j64 in, but had to
>>>      repurpose several existing instructions in 64 bit mode to do it.
>>>
>>> For the needed additions, as they will not be used very often, having a
>>> prefix that temporarily enables a new instruction set would limit the
>>> problem. Even then, the instruction space is quite limited and adding an
>>> instruction requires careful thinking and testing of different scenarios.
>> Currently there are no prefixes. All the instructions are self-contained
>> and fixed length. That's one of the main advantages of the instruction
>> set. The way we do mode shifts is with control register bits.
> Well, a prefix can be of fixed length too. Like one 16-bit instruction
> that switches the cpu into a specific instruction space (either for just
> one instruction or multiple). I guess it would work the same if there was
> a control register to go into that specific instruction space. It is
> just a matter of how costly it is to go in and out and if you can
> pipeline it nicely.

this is closer to how I originally imagined my variable length 
instructions to work:
the 8Exx would be a prefix word modifying the behavior of the following 
instruction word.

in the emulator, they ended up being decoded as a whole separate 
instruction space.
in the partial CPU core I was working on, it was sort of in-between.

handling it purely as a prefix could potentially allow going back to a 
slightly shorter/simpler pipeline; though this would come at a cost of 
reduced throughput (since now a clock cycle would be spent handling the 
prefix word).
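
to make the pure-prefix idea a bit more concrete, here is a rough C sketch 
of a decoder that spends one step per 16-bit word, so a prefixed 
instruction costs two steps. this is only an illustration, not the actual 
emulator or core: the decoded_op record, the is_prefix and decode_stream 
names, and the choice of treating the low byte of the 8Exx word as an 
opaque "modifier" are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded-op record; the field names are illustrative only. */
typedef struct {
    uint16_t opcode;     /* the base 16-bit instruction word */
    uint8_t  modifier;   /* low byte of a preceding 8Exx word, 0 if none */
    int      has_prefix; /* nonzero if an 8Exx word preceded the opcode */
} decoded_op;

/* Assumption: any word whose high byte is 0x8E acts as a prefix. */
static int is_prefix(uint16_t w)
{
    return (w & 0xFF00u) == 0x8E00u;
}

/*
 * Decode a stream one word per step, as a pure-prefix pipeline would.
 * Returns the number of steps consumed (one per word, prefix or not);
 * *n_ops receives the number of ops written to out.
 * The word right after a prefix is always taken as the instruction word,
 * and a trailing prefix with nothing after it produces no op.
 * out must have room for n_words entries.
 */
static size_t decode_stream(const uint16_t *words, size_t n_words,
                            decoded_op *out, size_t *n_ops)
{
    size_t steps = 0, ops = 0;
    int pending = 0;      /* a prefix is waiting for its instruction word */
    uint8_t mod = 0;

    assert(words != NULL && out != NULL && n_ops != NULL);

    for (size_t i = 0; i < n_words; i++) {
        steps++;          /* the prefix word costs a step of its own */
        if (!pending && is_prefix(words[i])) {
            pending = 1;
            mod = (uint8_t)(words[i] & 0xFFu);
            continue;
        }
        out[ops].opcode = words[i];
        out[ops].modifier = pending ? mod : 0;
        out[ops].has_prefix = pending;
        ops++;
        pending = 0;
    }
    *n_ops = ops;
    return steps;
}
```

feeding it a prefix word followed by two ordinary words gives two ops in 
three steps, which is the throughput cost mentioned above; decoding 
prefix+opcode as one wider instruction would avoid the extra step at the 
price of a more complicated decoder.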

I posted an image earlier showing a small piece of the current ISA 
in action (it shows the Q_memcpy function, which is basically a 
loop copying memory via integers):
https://twitter.com/cr88192/status/924060656646672384

in this version, it generally moves 4 integers at a time and then loops.
never mind a few cases where the compiler is doing something stupid here.
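
for anyone who can't see the image, the shape of that loop is roughly as 
in the C sketch below. this is not the real Q_memcpy source or the 
compiler's output; copy_words_x4 is a made-up name, and the fixed 32-bit 
word type and the tail loop are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy n 32-bit words from src to dst, four words per loop iteration.
 * Hypothetical helper for illustration; the buffers must not overlap.
 */
static void copy_words_x4(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);

    /* main loop: load four words, then store four words */
    for (; i + 4 <= n; i += 4) {
        uint32_t a = src[i];
        uint32_t b = src[i + 1];
        uint32_t c = src[i + 2];
        uint32_t d = src[i + 3];
        dst[i]     = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }

    /* tail: copy the remaining 0..3 words one at a time */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

grouping the four loads ahead of the four stores is just one way to write 
it; the point is that the loop overhead is paid once per four words 
rather than once per word.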

>> In theory we could have a transient control register mask that gets
>> xored with the persistent control register but all its bits are zeroed
>> after some event (one instruction, next jump, etc). So you could have
>> something prefix-like, although it would probably make multi-issue more
>> complicated. But that's something that would have to be designed and
>> justified and isn't part of the current stuff so there's backwards
>> compatibility issues...
>>
>> I.E. that's j64 territory. Right now j3 and j4 are _mostly_ specced as
>> following the path laid down by Hitachi 20 years ago. Two of the 3 "new"
>> j-core instructions are backported from j3. We regression test against
>> old sh2 binaries. We care that this is part of a _family_. (The "born
>> out of wedlock" jokes are left as an exercise for the reader.)
> From reading you, it seems that j3 and j4 are more specced than I had
> imagined. My assumption so far was that j3 would be j2 + MMU, with j4
> being kind of a question mark. Superscalar? FPU?

as I understand it, J4 is meant to be a mostly canonical SH4.

information on the SH3 is sparse, but from what I can gather it is more 
of an intermediate stage:
     either essentially SH2 plus an MMU, or SH4 minus the FPU.

some details are unclear; for example, SH2 and SH4 apparently use 
different ISR mechanisms:
     SH2 (or was this a J2 tweak?) uses VBR essentially as an array 
of ISR entry points.
     SH4 uses several computed entry points relative to VBR.
         so, an interrupt happens and it branches to VBR+0x100 or similar.

some of these cases leave a bit of ambiguity.
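
as a rough C sketch of the difference as I understand it: in the 
table style the handler address is loaded from memory at VBR plus 4 times 
the vector number, while in the computed style the CPU branches to VBR 
plus a fixed offset and the handler sorts out the cause itself. the 
function and enum names are made up for the example, memory is modeled as 
a plain array of words starting at address 0, and the offsets (0x100 
general exception, 0x400 TLB miss, 0x600 interrupt) are what I recall from 
the SH-4 documentation, so check them against the manual before relying 
on them.

```c
#include <assert.h>
#include <stdint.h>

/* Event classes for the computed-entry style (names are illustrative). */
typedef enum {
    EV_GENERAL = 0,   /* general exception */
    EV_TLB_MISS,      /* TLB miss */
    EV_INTERRUPT      /* external/peripheral interrupt */
} sh4_event;

/*
 * Table style: VBR points at an array of 32-bit handler addresses,
 * so the entry point is the word stored at VBR + 4 * vector.
 * mem_words models memory as 32-bit words, index 0 == address 0;
 * vbr is assumed to be 4-byte aligned.
 */
static uint32_t sh2_style_entry(const uint32_t *mem_words,
                                uint32_t vbr, unsigned vector)
{
    assert(mem_words != 0 && (vbr & 3u) == 0);
    return mem_words[(vbr >> 2) + vector];
}

/*
 * Computed style: no table lookup; the entry point is VBR plus a
 * fixed per-class offset (assumed values, see the note above).
 */
static uint32_t sh4_style_entry(uint32_t vbr, sh4_event ev)
{
    static const uint32_t offset[] = { 0x100u, 0x400u, 0x600u };
    return vbr + offset[ev];
}
```

so with VBR = 0x8C000000 the computed style gives 0x8C000100 for a general 
exception, whereas the table style needs a memory read before it knows 
where to branch.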

>> After sh4 Hitachi handed off the technology to a new company but kept
>> the design engineers, and the new team did a brand new sh5 instruction
>> set that nobody was interested in, and that's the point where we need to
>> forge our own path. So j64 introduces new design elements, although our
>> model is "what x86-64 did to x86, we're trying to do to shmobile".
> So my understanding here is that you do not want to take the road of a
> family with instructions being optional. Instead you have clearly
> defined incremental steps. j64 would be the time when you introduce
> SIMD. So j64 would be j2 + MMU + FPU + SIMD + 64 bits. Not something
> where you have a j-core with configurable options and the
> possibility to do j2 + SIMD or j2 + 64 bits, for example. It does sound
> like it will reduce the complexity of the source code and is better
> for long term maintenance.

sounds about like what I have been hearing.

FWIW: I also decided to blow off SH5, as their solution was basically to 
glue on a completely different ISA.
like, if I wanted something like this, could probably just go implement 
the SPARC ISA or something...
and/or try to make a case for jumping ship over to RISC-V; ...

>> So we _were_ discussing SIMD in the context of j64. We're not sure whether
>> we need to move it sooner, or just bear down and get j64 out ASAP. (We'd
>> love to do the latter, we're working out how to clear engineering time
>> for it...)
> I am playing with SIMD ideas in the j2 case. It takes time (on my
> spare time) to build an assembler and after that an emulator that
> allows rapid prototyping of it. Still, the reason why I am playing in
> that land and think there is potential for it is that I think it
> would be able to drive eink devices for a large variety of uses. This
> would replace the usually terrible chips that come with them these days.

I also did an emulator and assembler; later did a C compiler.
of these, the C compiler is probably one of the harder parts.

trying to implement a core is also hard; I can't clearly say which of 
the two is harder, though.

> Now, going with a j64 for that scenario would open up a lot more use
> cases. Instead of just driving the screen, you could start looking at
> webpage rendering (which is sadly the format of a lot of ebooks). It
> would clearly be overkill for simple use cases, like electronic tags,
> but the question is how much of a waste is it? How much bigger would
> it be compared to a j2?

?...

>> j64 has a control register bit to switch between 32 bit and 64 bit
>> modes. 32 bit mode is scrupulously compatible with j4. 64 bit mode is
>> designed around the idea you're probably going to drop back to 32 bit
>> mode to run 32 bit code, which needs to preserve the high bits of the
>> registers unchanged and so on...
>>
>> We're going to try to put together a better j64 proposal in November,
>> with actual details. (I haven't been back to Tokyo since last November,
>> but my next flight there leaves Tuesday morning. There's a backlog of
>> sitting down with people and writing up documentation...)
> That is way earlier than I had expected. Pretty big news that
> you are casually dropping here :-) Looking forward to it.

I would also like to know the specifics of the J64 ISA design.
if I like the design, might also try to do a prototype of it as well.

from what I can gather, the direction it is going is a little different 
from mine, but by how much exactly, I don't know...