[J-core] [RFC] SIMD extension for J-Core

Sun Aug 27 01:48:59 EDT 2017

Hello,

With this email I would like to explain why I am personally
interested in this project and what I am working on in my spare time
in relation to it. I am just a software engineer working on 2D user
interfaces, and this has been my field of interest for a long time
now. More than a decade ago, I participated in an interesting project
named F-CPU, which was, I think, way too early and way too big in
scope, but it was a very interesting learning experience in hardware,
software and community.

Today, my main interest in the J-Core comes from its instruction
density, which should lower the memory bandwidth needed for a task;
memory bandwidth is today the main limit on 2D user interface
rendering. Also, the core being patent free and small, and reusing an
existing compiler and kernel, makes it a very interesting, pragmatic
platform to develop and test on. Clearly the missing bits for this are
a git repository and the turtle board to allow external contributions
;-)

So with that in mind, my current interest is in proposing a
SIMD/vector instruction set, as this may be within my skills and
should already be useful in itself. I have been researching all the
SIMD, vector computer and GPU designs that were published in the 90's.
The nice part of this is that we now have 20 years of feedback on what
works, and compilers have improved a lot at taking advantage of these
instructions, especially over the last decade.

With this email, I want to share the direction I am currently heading
in, to make sure that I am not missing anything obvious before I start
writing too much code around it. It is a long way from being a formal
proposal; this is really closer to a "please share your opinion on the
matter".

What I can summarize here is that the main difference between SIMD,
vector computer and GPU instruction sets seems to be mostly in the
load/store unit and the different memory access modes. By focusing
just on SIMD, I avoid any change to the memory subsystem, which
reduces the complexity of this discussion. The second piece of
feedback on SIMD instruction sets is that what has been a killer is
their usual lack of completeness (they tend to be an assemblage of
specialized/dedicated instructions, a la Intel); the lack of
horizontal operations goes in that same bucket. Finally, another
mistake, which mostly impacted SIMD instructions on RISC designs, was
the huge cost of moving data between the vector registers and the
scalar/memory system (some were quite bad, like VIS, where 40% of the
instructions in any kernel using it were dedicated to moving data
around).

On the side of things that did work and especially helped compilers,
there seems to be a need for predicates (which seem to be something
that CPU architects dislike, but I don't really understand why, and it
seems to me it should be fine, as the instruction set already reads
the destination register in many cases). The Altivec instruction set,
especially its permutation instructions, seems to have attracted quite
some praise too, as has the side idea of providing an #include
<altivec.h> in C with functions that match one to one with the
instruction set. Well, except for the idea of naming their data type
vector, which collides with C++. Nothing is perfect! Still, I would
like to defer the discussion about the permutation unit and
instructions to a later thread.

My goal is also to have a solution that can scale from the lower end
of the J-Core family up to, potentially, a GPU using J-Core (implying
superscalar and SMT designs) and that doesn't require too much
additional hardware to already give a benefit. Ideally, the
instruction count should come as close as possible to being divided by
the number of items that fit in a vector.

That leads me to the proposal of adding just a few instructions to
switch the CPU between a normal scalar mode and a vector mode that
works by dividing the vector into 32-bit steps of operation. There
would still be 16 registers, with the lower part of each being the
scalar register when the CPU is in scalar mode. Most instructions of
the scalar mode will be available in vector mode, sometimes slightly
adapted to be useful in that mode. A few instructions for shuffling
vectors around will be added to deal with the lack of a wide range of
addressing modes in vector mode.

The main instruction I am thinking of will be :

{p}vector{32,64,128,256}.{b,w,l} #Imm3[, Rn]

The p (predicate) flag defines whether Rn will be used as the source
for the predicate (maybe it could actually use T for that purpose and
avoid specifying a register here). The size flag (32, 64, 128 or 256)
defines the size of the vector registers to be used for the next #Imm3
vector instructions. The b, w and l flags define the word size of the
SIMD operations to be used during computation for the next #Imm3
vector instructions.
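
To make the intended behaviour concrete, here is a minimal Python
model of the mode switch. It is only a sketch: the names VectorMode,
enter_vector_mode and decode_next are made up for illustration, and
it assumes that #Imm3 is a plain 3-bit count (0 to 7) of the vector
instructions that follow, which is not settled above.

```python
from dataclasses import dataclass

# Word size selected by the .b/.w/.l suffix, in bits.
LANE_BITS = {"b": 8, "w": 16, "l": 32}


@dataclass
class VectorMode:
    """Hypothetical decoder state; field names are illustrative only."""
    remaining: int = 0       # vector instructions left before scalar mode resumes
    vector_bits: int = 32    # 32, 64, 128 or 256
    lane_bits: int = 32      # 8 (.b), 16 (.w) or 32 (.l)
    predicated: bool = False


def enter_vector_mode(vector_bits, suffix, count, predicated=False):
    """Model of '{p}vector{size}.{b,w,l} #Imm3' (assumed semantics)."""
    if vector_bits not in (32, 64, 128, 256):
        raise ValueError("unsupported vector size")
    if not 0 <= count <= 7:
        raise ValueError("#Imm3 must fit in 3 bits")
    return VectorMode(count, vector_bits, LANE_BITS[suffix], predicated)


def decode_next(mode):
    """Say which decode table the next instruction uses, then tick the counter."""
    if mode.remaining > 0:
        mode.remaining -= 1
        return "vector"
    return "scalar"
```

For instance, after enter_vector_mode(128, "b", 3) the next three
calls to decode_next return "vector" and every later call returns
"scalar", with 128 / 8 = 16 lanes per register in between.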

An alternative solution would be to have a stop instruction instead of
using an immediate as above. My guess is that it is highly unlikely
that a kernel using vector instructions would last for so long that it
would be a win against using one instruction. That is something that
could require further experiments.

Some modes are slightly silly, like vector32.l. Maybe the entire
32-bit mode is not really necessary. I need to write more kernels
using this and see how a compiler can use it.

Two more instructions will be needed to allow interrupts during
vector mode. I expect that we don't want interrupt handlers to have
to deal with the complexity of not knowing which mode they are in, so
they should always start in scalar mode. They will then need a way to
save and restore the vector mode, something along these lines:

move V, Rn // Would save the state (predicate on? vector size? word
           // size? instructions left?)
move P, Rn // Would save the predicate that is currently in use
move Rn, V
move Rn, P
vector.restore P, V // After this instruction, if V indicates that
                    // there are still instructions left in vector
                    // mode, the CPU will switch to vector mode for
                    // that many instructions.
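
As a rough illustration of how little state V would have to carry,
the four fields listed in the comment above can be packed into a
single byte. The bit layout below is purely an assumption made for
this sketch (it is not part of the proposal), and pack_v / unpack_v
are made-up helper names.

```python
SIZES = (32, 64, 128, 256)   # vector size, encoded on 2 bits
LANES = (8, 16, 32)          # .b, .w, .l word size, encoded on 2 bits


def pack_v(predicated, vector_bits, lane_bits, remaining):
    """Pack the vector mode state into an integer (hypothetical layout).

    bits 0-2: instructions left, bits 3-4: vector size,
    bits 5-6: word size, bit 7: predicate on.
    """
    if not 0 <= remaining <= 7:
        raise ValueError("instructions left must fit in 3 bits")
    return (remaining
            | SIZES.index(vector_bits) << 3
            | LANES.index(lane_bits) << 5
            | int(predicated) << 7)


def unpack_v(v):
    """Inverse of pack_v: (predicated, vector_bits, lane_bits, remaining)."""
    return (bool(v >> 7 & 1), SIZES[v >> 3 & 3], LANES[v >> 5 & 3], v & 7)
```

With this layout the whole state fits in 8 bits, so saving it with a
single move to a 32-bit scalar register leaves plenty of room.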

I am guessing that the delay slot of an interrupt could be used to
call vector.restore. Also, if the interrupt code has no need to use
vectors or schedule another task, it won't need to save V and P. I
don't think that any of V, P and vector.restore should require any
privilege to use. Also, since P, and predicates in general, need to
fit in one 32-bit register, I settled on 256 bits as the maximum size
for a vector.
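
The 256-bit limit follows from the smallest word size: 256 bits of
8-bit words is 32 lanes, so one predicate bit per lane fills a 32-bit
register exactly. The sketch below checks that arithmetic and shows
what a predicated lane-wise add could look like. It assumes that bit
i of the predicate controls lane i and that disabled lanes keep the
old destination value; both are my assumptions for illustration, and
lane_count / predicated_add are made-up names.

```python
def lane_count(vector_bits, lane_bits):
    """Number of lanes, which is also the number of predicate bits needed."""
    return vector_bits // lane_bits


def predicated_add(dst, src, predicate, lane_bits=8):
    """Lane-wise dst += src with wraparound.

    Lane i is updated only if bit i of 'predicate' is set; other lanes
    keep their previous destination value (assumed semantics).
    """
    mask = (1 << lane_bits) - 1
    return [
        (d + s) & mask if (predicate >> i) & 1 else d
        for i, (d, s) in enumerate(zip(dst, src))
    ]
```

Keeping disabled lanes unchanged is why the destination register has
to be read as well as written, which a two-operand instruction set
already does for most arithmetic.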

This is all that needs to be added to the SH instruction set. The
rest of the new instructions needed will only be available in vector
mode, logically reusing instruction slots that don't make sense in
that mode (to avoid future collisions with, for example, FPU
instructions).

I have not been able to find any prior art for such an idea, nor any
system that implements this kind of concept. It does worry me a bit,
as I feel like I am missing a point if I am inventing something new
here. I would feel better if it wasn't all new. The closest
instruction I could find to my proposal is the IT (If-Then)
instruction of Thumb-2, which is actually very close to pvector32.l.
That might be another reason to just not bother with the 32-bit mode.

In vector mode, each instruction needs to be reviewed to see whether
it still works. For example, mac with direct memory access isn't that
useful, as it is common to have a scatter/gather pattern or reverse
order. It would be better to just do mac on registers directly.

There would be a need for more load/store operations, but SIMD CPU
instruction sets have lived without them for a long time (GPUs and
vector computers do benefit from them). It would be way beyond my
capability to touch this at this point, and there is enough work on
the existing modes to make things work that I feel this could be kept
for the future. If one day you want to do a GPU based on the J-Core,
you would likely have to improve the memory subsystem, and that would
be part of it. So I don't want to bring that into the discussion yet.

There are instructions that, if added, could be useful, like complex
multiplication, which is basically a multiply and subtract operation.
This is useful for pretty much every image and video codec, and for
pretty much anything that does radio signal processing, but again I
would like to push that to another thread.

A few instructions to shuffle bits around would also be quite useful,
if not actually necessary. As we are working with blocks of 32 bits
here, something to rotate a vector by one block, and to pop and push
a 32-bit scalar into a vector, seems useful and makes sense. This
wouldn't necessarily require a new execution unit, as each 32-bit
column of a vector register could actually live in a separate
register bank, so it could be implemented as just a move from one
bank to another. They would look something like this:

rotblock{l,r} #Imm, Rn
popblock Rn, Rm
pushblock Rn, Rm
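
One possible reading of these three instructions, modelled in Python
on a vector held as a list of 32-bit blocks. The exact semantics are
not pinned down above, so everything here is an assumption for
discussion: "left" moves blocks toward index 0, pushblock inserts the
scalar at index 0 and drops the last block, and popblock removes
index 0 and fills the freed block with zero.

```python
MASK32 = 0xFFFFFFFF


def rotblock_left(blocks, n):
    """Rotate whole 32-bit blocks toward index 0 by n positions."""
    n %= len(blocks)
    return blocks[n:] + blocks[:n]


def rotblock_right(blocks, n):
    """Rotate whole 32-bit blocks away from index 0 by n positions."""
    return rotblock_left(blocks, -n)


def pushblock(blocks, scalar):
    """Insert a 32-bit scalar at index 0; the last block falls off."""
    return [scalar & MASK32] + blocks[:-1]


def popblock(blocks):
    """Return (block 0, remaining vector shifted down and zero filled)."""
    return blocks[0], blocks[1:] + [0]
```

Because each operation only moves whole blocks, none of them needs an
ALU: in a design with one register bank per 32-bit column they reduce
to bank-to-bank moves.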

The shift/rotation unit can also be reused for rotation inside a
block, reducing the need to switch to the vector.l mode to do
shifts/rotates (this looks useful for crypto-related operations, but
needs some experimenting to see if it is really that useful). This
could look like:

rot{l,r}row{1,2,3} Rn
rot{l,r}row @Rn, Rm // Get the rotation for each block from memory;
                    // the most significant byte defines the most
                    // significant block move.
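
A small Python model of the in-block rotation, under the assumption
that the {1,2,3} suffix counts bytes within each 32-bit block (four
bytes per block leaves exactly three non-trivial rotations). That
reading, and the helper names, are mine and would need confirming.

```python
MASK32 = 0xFFFFFFFF


def rotl32_bytes(block, nbytes):
    """Rotate one 32-bit block left by a whole number of bytes."""
    shift = (nbytes % 4) * 8
    return ((block << shift) | (block >> (32 - shift))) & MASK32


def rotlrow(blocks, amounts):
    """Rotate each block left by its own byte count.

    With the immediate form every entry of 'amounts' is the same; with
    the memory form each block would get its own amount.
    """
    return [rotl32_bytes(b, a) for b, a in zip(blocks, amounts)]
```

For example, rotating 0x11223344 left by one byte gives 0x22334411.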

On the hardware side, I only have a distant view, as I am just
starting to learn VHDL, but my guess is that the decoder will need to
be modified to look up either the vector mode microcode or the normal
one depending on a counter that decreases during vector mode, the
execution unit will need to handle all the smaller operations, and a
few register banks will need to be added. Overall, I hope this won't
increase the size of the J-Core too much (is it possible to get an
idea of how much that would require at this point of the design?).

On the software side there is quite a lot of work. First, checking
that assembler kernels do work out (ideally you want the number of
instructions to shrink by a factor of the vector length: 8-bit code
moved from scalar to vector registers should reduce the code size by
as close to 32 times as possible when using the 256-bit vector mode,
for example). I am going to play with matrix multiplication, blending
RGBA scanlines, blurring scanlines, color conversion, various
turbojpeg optimizations, and maybe some crypto primitives (AES,
ChaCha, ...). For this I will need at least an assembler; that is
where the previous JSON work I have done should come in handy, as it
should enable autogeneration of these instructions and let me adjust
things as I go with little work on the assembler.

If that looks good, there is a lot of work to be done after that:
adding support to the Linux kernel, the J-Core and also the compiler,
maybe even up to loop auto-vectorization. That is quite a massive
amount of work in itself, and I bet it is going to take a long time
before it reaches the point where it is usable, but well, I find it
fun. I will try to keep everything accessible in a GitHub repository
and ask for feedback there from time to time. I will likely focus
first on the software side, as there is a lot to do there and I will
progress faster since it is the domain I know best; still, I would
like to start playing with VHDL later on.

Ah, and I have only played with matrix multiplication so far, and a
4x4 8-bit matrix multiplication takes 23 instructions instead of 361
(using 128-bit registers, some permutation instructions and no change
to the ABI), which I think is encouraging, as it is very close to the
ideal target number and could likely be even lower with a change of
ABI (by passing matrices directly in registers). Anyway, what do you
think?
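
For reference, the "ideal target" can be checked with a few lines of
Python. The only inputs are the figures quoted above (361 scalar
instructions, 23 vector instructions, 128-bit registers, 8-bit
elements); ideal_factor is just a made-up helper name.

```python
def ideal_factor(vector_bits, element_bits):
    """Best-case instruction count reduction: one op per full vector."""
    return vector_bits // element_bits


scalar_count = 361   # 4x4 8-bit matrix multiply, scalar code
vector_count = 23    # same kernel using 128-bit vector registers

measured = scalar_count / vector_count
ideal = ideal_factor(128, 8)

print(f"measured {measured:.1f}x, ideal {ideal}x, "
      f"{measured / ideal:.0%} of ideal")
```

That is a reduction of roughly 15.7 times against an ideal of 16 for
this mode; the same helper gives the factor of 32 for 8-bit elements
in 256-bit mode.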

Best
-- 
Cedric BAIL