[J-core] [RFC] SIMD extention for J-Core

Sat Oct 28 23:38:41 EDT 2017

On 10/27/2017 10:16 PM, BGB wrote:
>>>> For the needed addition, as they will not be used very often, having a
>>>> prefix that enable a new instruction set temporary would limit the
>>>> problem. Even then, the instruction space is quite limited and
>>>> adding an
>>>> instruction require careful thinking and testing different scenario.
>>> Currently there are no prefixes. All the instructions are self-contained
>>> and fixed length. That's one of the main advantages of the instruction
>>> set. The way we do mode shifts is with control register bits.
>> Well, prefix can be of fixed length too. Like one 16bits instruction
>> that turn the cpu in a specific instruction space (Either for just one
>> instruction or multiple). I guess it would work the same if there was
>> a control register to go into that specific instruction space. It just
>> is a matter of how costly it is to go in an out and if you can
>> pipeline it nicely.
> 
> this is closer to how I originally imagined my variable length
> instructions to work:
> the 8Exx would be a prefix word modifying the behavior of the following
> instruction word.
> 
> in the emulator, they ended up being decoded as a whole separate
> instruction space.
> in the partial CPU core I was working on, it was sort of in-between.

Good luck with that, we'll be over here.

> handling it purely as a prefix could potentially allow going back to a
> slightly shorter/simpler pipeline; though this would come at a cost of
> reduced throughput (since now a clock cycle would be spent handling the
> prefix word).

Not interested, thanks.

> did earlier post an image showing a small piece of the current ISA
> in-action (this was showing the Q_memcpy function, which is basically a
> loop copying memory via integers):
> https://twitter.com/cr88192/status/924060656646672384

Our current problem copying memory is cache aliasing. Page aligned
transfers hit a pathological case where the write evicts the cache line
used by the read, and vice versa. Rich proposed a method of reading 8
input words into 8 registers and writing them out again, but it doesn't
help for loops and I don't think he ever put it in musl (or the kernel).

The proper fix is to adjust the cache logic to be 2-way indexed.
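
To illustrate the aliasing case described above, here is a minimal sketch in C. The line size and cache size are hypothetical, chosen so the cache is the same size as a 4k page; they are not the real j-core parameters, and the function names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical geometry for illustration only; not the real j-core values. */
#define LINE_BYTES  32u
#define CACHE_BYTES 4096u

/* Set index an address maps to, for a cache with the given number of ways. */
static unsigned cache_set(uint32_t addr, unsigned ways)
{
    unsigned sets = CACHE_BYTES / (LINE_BYTES * ways);
    return (addr / LINE_BYTES) % sets;
}

/* True if the read stream and the write stream land in the same set. */
static bool same_set(uint32_t src, uint32_t dst, unsigned ways)
{
    return cache_set(src, ways) == cache_set(dst, ways);
}

/*
 * With one way per set, two streams sharing a set evict each other on
 * every access. With two ways, the set can hold both lines at once.
 */
static bool copy_thrashes(uint32_t src, uint32_t dst, unsigned ways)
{
    return ways < 2 && same_set(src, dst, ways);
}
```

With these assumed sizes, two page-aligned buffers such as 0x10000000 and 0x10008000 map to the same set. copy_thrashes() reports true for ways = 1 and false for ways = 2, even though the two buffers still share a set in the 2-way case.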

We're also proposing reviving the prefetch unit, or something similar.
We replaced the prefetch unit with l1 cache, but now we're not doing any
prefetch at all. We should have 2-way aliased data cache and instruction
cache that does prefetches when it starts executing the second half of a
cache line whose successor is not in the l1 yet. All this has been on
the internal roadmap since last November, we just got to the point where
our stuff was good enough to implement a real product and all our
engineers got dragged into product-land. We're trying to get back out to
advance the processor roadmap again (my trip to Japan next week is
related to that), and also hit an upcoming ASIC shuttle for the existing
stuff...
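
The prefetch trigger described above could be sketched like this in C. This is only a software model of the stated rule (prefetch the next line once execution enters the second half of the current line and the successor is not cached yet). The line size and all names here are assumptions for illustration, not the actual design.

```c
#include <stdbool.h>
#include <stdint.h>

#define ILINE_BYTES 32u  /* hypothetical instruction cache line size */

/* Address of the cache line following the one that contains pc. */
static uint32_t next_line_addr(uint32_t pc)
{
    return (pc & ~(ILINE_BYTES - 1u)) + ILINE_BYTES;
}

/* True once pc has moved into the second half of its cache line. */
static bool in_second_half(uint32_t pc)
{
    return (pc & (ILINE_BYTES - 1u)) >= ILINE_BYTES / 2u;
}

/*
 * Decide whether to start a prefetch of next_line_addr(pc).
 * next_line_cached is the result of an L1 lookup on that address.
 */
static bool want_prefetch(uint32_t pc, bool next_line_cached)
{
    return in_second_half(pc) && !next_line_cached;
}
```

The point of waiting for the second half is that a branch out of the first half never triggers a fetch for a line that would not have been used.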

>>> I.E. that's j64 territory. Right now j3 and j4 are _mostly_ specced as
>>> following the path laid down by Hitachi 20 years ago. Two of the 3 "new"
>>> j-core instructions are backported from j3. We regression test against
>>> old sh2 binaries. We care that this is part of a _family_. (The "born
>>> out of wedlock" jokes are left as an exercise for the reader.)
>> It seems by reading you, that j3 and j4 are more specced than I had
>> imagine. My assumption so far was j3 would be j2 + MMU. j4 being kind
>> of question mark. Superscalar ? FPU ?
> 
> as I understand it, J4 is meant to be a mostly canonical SH4.

More or less. We can deviate, but have to justify the deviations. Each
one comes with stuff like toolchain and kernel work. (Although we've
already got to do that for j3 since we rejected their mmu design.)

> information on the SH3 is sparse, but from what I can gather it is a
> more intermediate stage:
>     either essentially SH2+MMU, or SH4-FPU.

We were thinking of skipping over j3 and going straight to j4, but if we
have to implement a new mmu design (with corresponding kernel and qemu
work, and maybe binutils) having a release with just that seems warranted.

> some details are unclear, for example SH2 and SH4 use apparently
> different ISR mechanisms:

Yeah, that's a Rich question. He may have posted about it here last
year. Basically he adjusted the kernel to handle both kinds, and then we
adjusted j2 to do the new kind I think? (QEMU still hasn't got a turtle
board emulation so it didn't matter either way.)

The old one might have interfered with smp or something, I'd have to
dig through back email to find it. (And when I asked some retired Hitachi
devs I think they said Microsoft's Windows CE was to blame? It was 2
years ago, I don't remember the details and no longer have their contact
info. The j3 team was largely different people than j2, but still inside
Hitachi. Some day I should buy them all beer and do a proper history
writeup...)

>     SH2 (or was this a J2 tweak?) was using VBR essentially as a array
> of ISR entry-points.
>     SH4 uses several computed entry points relative to VBR.
>         so, interrupt happens and it branches to VBR+0x100 or similar.
> 
> some of these cases leave a bit of ambiguity.

At least three people in SEI know this stuff in excruciating detail. I
am not one of them.
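
For readers following along, the two dispatch styles in the quoted text can be sketched roughly as below. This is a sketch under stated assumptions, not a description of what j-core does: the 4-byte table entries and the 0x400 and 0x600 offsets come from the published SH-2 and SH-4 hardware manuals rather than from this thread, and the function and enum names are invented for the example.

```c
#include <stdint.h>

/*
 * SH2 style: VBR points at a table of 32-bit handler addresses.
 * The CPU loads the new PC from the slot for the vector number.
 */
static uint32_t sh2_vector_slot(uint32_t vbr, unsigned vector)
{
    return vbr + 4u * vector;  /* address of the table entry, not the handler */
}

/* SH4 style: a few fixed entry points at computed offsets from VBR. */
enum sh4_event { SH4_GENERAL_EXCEPTION, SH4_TLB_MISS, SH4_INTERRUPT };

static uint32_t sh4_entry_point(uint32_t vbr, enum sh4_event ev)
{
    switch (ev) {
    case SH4_GENERAL_EXCEPTION: return vbr + 0x100u;
    case SH4_TLB_MISS:          return vbr + 0x400u;
    case SH4_INTERRUPT:         return vbr + 0x600u;
    }
    return vbr + 0x100u;  /* unreachable for valid enum values */
}
```

In the first style the hardware picks the handler; in the second, the code at the entry point has to read a cause register and dispatch in software, which is why a kernel needs separate entry code for each.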

>>> After sh4 Hitachi handed off the technology to a new company but kept
>>> the design engineers, and the new team did a brand new sh5 instruction
>>> set that nobody was interested in, and that's the point where we need to
>>> forge our own path. So j64 introduces new design elements, although our
>>> model is "what x86-64 did to x86, we're trying to do to shmobile".
>> So my understanding here is that you do not want to take the road of a
>> family with instructions being optional.

We'd like something you can actually program, yes.

>> Instead you have clearly
>> incremental defined step. j64 would be the time when you introduce
>> SIMD. So j64 would be j2 + MMU + FPU + SIMD + 64bits. Not something
>> where you had a j-core with configurable option and you would have the
>> possibility to do j2 + SIMD or j2 + 64bits for example. It does sound
>> like it will reduce the complexity of the source code and is better
>> for long term maintenance.
> 
> sounds about like what I have been hearing.

SIMD is a bit like FPU: there were historical 486sx/dx versions, and I
could see with and without SIMD as build options.

Whether or not Jeff thinks that's worth doing is another matter.

> FWIW: I also decided to blow off SH5, as their solution was basically to
> glue on a completely different ISA.

Yeah, that was the Itanium approach, not the x86-64 approach.

> like, if I wanted something like this, could probably just go implement
> the SPARC ISA or something...
> and/or try to make a case for jumping ship over to RISC-V; ...

SEI tried Leon Sparc before deciding to implement its own processor.
Performance was badly constrained by memory bus bandwidth and cache
size, the CPU spent all its time waiting for instructions.

This is not a problem with j-core, there's memory bus bandwidth left
over to do plenty of other DMA stuff. Turtle's even displaying an HDMI
bitmap out of it. (Not at 1080p, but there's patents there. There's a
lower resolution you can do and avoid patents, again back email or Jeff
question...)

>> Now, going with a j64 for that scenario, would open a lot more use
>> case. Instead of just driving the screen, you could start looking at
>> webpage rendering (which is sadly the format of a lot of ebook). It
>> would clearly be an overkill for simple usecase, like electronic tag,
>> but the question is how much of a waste is it ? How much bigger would
>> it be compared to a j2 ?
> 
> ?...

We're trying to work that out. What is and isn't optional will be
affected by that question.

>>> We're going to try to put together a better j64 proposal in November,
>>> with actual details. (I haven't been back to tokyo since last November,
>>> but my next flight there leaves tuesday morning. There's a backlog of
>>> sitting down with people and writing up documentation...)
>> That is so way more early than I had expected. Pretty big news that
>> you are casually dropping here :-) Looking forward to it.
> 
> I would also like to know the specifics of the J64 ISA design.

I'm scheduled to fly back December 5, so we need to get this done and
posted by then.

> if I like the design, might also try to do a prototype of it as well.
> 
> from what I can gather, the direction it is going is a little different
> from mine, but by how much exactly, I don't know...

I continue not to care about your fork, sorry.

Rob