[J-core] [RFC] SIMD extension for J-Core

Sun Aug 27 04:58:16 EDT 2017

Here are the thoughts of a novice.

>Today, my main interest in the J-Core is due to the instruction
>density that should permit to lower the memory bandwidth needed for a
>task, which is today the main limit on 2D user interface rendering.

That high code density is possible through having few instructions.

>So with that in mind, my current interest is in proposing a
>SIMD/Vector instructions set as this is maybe in my skill and should
>already be useful in itself. I have been researching on all the SIMD,
>vector computer and GPU design that were published in the 90's. The
>nice part of this is that we now have 20 years of feedback on what
>work and compiler have improved a lot to take advantage of this
>instructions, especially over the last decade.

Do you argue that an instruction set designed today is clearly superior to one from the past?

>By this email, I want to share the direction I am currently heading
>to, to make sure that I am not missing anything obvious before
>starting to write too much code around it. It is a long way from being
>a formal proposal, this is really closer to a please share your
>opinion on the matter.
>
>What I can summarize here is that the main difference between SIMD,
>vector computer and GPU instructions set seems to be mostly on the
>load/store unit and the different memory access mode. By focusing just
>on SIMD, I am removing any change on the memory sub-system which does
>reduce the complexity of this discussion. 
Do you have some third-party source to back this up?

>The second set of feedback on SIMD instruction sets is that what has been a killer is their usual
>lack of being complete (instead of an assemblage of
>specialized/dedicated one, a la Intel), lacking horizontal operation
>is to put in that same bucket. Finally another set of mistake that
>mostly impacted SIMD instructions on RISC design, was the huge cost of
>moving data between the vector register and the scalar/memory system
>(Some were quite bad like the VIS which resulted in 40% of the
>instruction in any kernel using it to be dedicated to moving data
>around).
>
>On the side of things that did work and especially help compiler,
>there seems to be a need for predicate (which seems to be something
>that CPU architects dislike, but I don't really understand why and it
>seems it should be fine to me as the instruction set already read the
>destination register in many case).
I do not understand that paragraph.

>Altivec instruction set, especially the permutation instructions seems to have attracted quite
>some praise also. 
Citation please?
> instructions <altivec.h> in C with function that do match one to one with their
> set. Well, except the idea of naming their data type
>vector which collide with C++. Nothing is perfect ! Still I would like
>to defer the discussion about the permutation unit and instructions to
>another later thread.
Please try to make your point clearer then.
>
>My goal is also to have a solution that potentially scale from the
>lower end of the J-Core family to potentially a GPU using J-Core
>(Implying superscalar and SMT design) and doesn't require too much
>additional hardware to already give benefit. Ideally the focus is on
>having as close as possible of a instruction ratio being divided by
>the size of the item in a vector.
>
>That lead me to the proposal of adding just a few instructions to
>switch the CPU between a normal scalar mode and a vector mode that
>work by dividing the vector in 32bits step of operation. There would
>still be 16 registers with the lower part of it being the scalar
>register when the cpu is in scalar mode. Most instruction of the
>scalar mode will be available while in vector mode, sometime slightly
>adapted to be useful in that mode. A few instructions related to
>shuffling vector around will be added to deal with the lack of a wide
>range of address mode in vector mode.
>
>The main instruction I am thinking of will be :
>
>{p}vector{32,64,128,256}.{b,w,l} #Imm3[, Rn]
>
>The p-redicate flag define if Rn will be used as a source for the
>predicate (Maybe it could actually use T for that purpose and avoid
>specifying a register here). The size flag (32, 64, 128 and 256)
>define the size of the vector register to be used for the next #Imm3
>vector instructions. The b,w and l flags define the word size of the
>SIMD operation to be used during computation for the next #Imm3 vector
>instructions.
>
>An alternate solution would be to have a stop instruction instead of
>using a immediate as above. My guess is that it is highly unlikely
>that the kernel that use vector instruction would last for so long
>that it would be a win against using one instruction. That is
>something could require further experiment.
>
How would you perform those experiments?

>Some mode are slightly silly, like vector32.l. Maybe the entire 32bits
>mode is not really necessary. I need to write more kernel using this
>and see how a compiler can use it.

That approach seems wasteful. Before putting it in a proposal, you should be sure your ideas work.
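
One cheap way to spot the silly modes before writing kernels is to list how many elements each mode handles per instruction. This is only a rough Python sketch: I am assuming that .b, .w and .l mean 8, 16 and 32 bit elements, and the helper names are mine, not part of the proposal.

```python
# Hypothetical helper: lanes per mode for {p}vector{32,64,128,256}.{b,w,l}.
# Assumption: .b/.w/.l are 8/16/32-bit elements.
VECTOR_BITS = (32, 64, 128, 256)
ELEMENT_BITS = {"b": 8, "w": 16, "l": 32}


def lanes(vector_bits, suffix):
    """Number of SIMD elements processed by one vector instruction."""
    return vector_bits // ELEMENT_BITS[suffix]


def mode_table():
    """Map every mode name to its lane count."""
    return {
        f"vector{v}.{s}": lanes(v, s)
        for v in VECTOR_BITS
        for s in ELEMENT_BITS
    }


def degenerate_modes():
    """Modes with a single lane, which behave like plain scalar code."""
    return [name for name, n in mode_table().items() if n == 1]
```

Under these assumptions only vector32.l has a single lane, so it is the one mode that gains nothing over scalar code.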

>There will be a need to have two more instructions to allow
>interruption during the vector mode. I am expecting that we don't want
>interruption to have to deal with the complexity of not knowing in
>which mode they are. So they should always start in scalar mode. Now
>they will need a way to save and restore the vector mode, something
>along that line :
>
>move V, Rn // Would save the state (predicate on ? vector size ? word
>size ? Instructions left ?)
>move P, Rn // Would save the predicate that is currently in use
>move Rn, V
>move Rn, P
>vector.restore P, V // After this instruction if V does indicate that
>there is still instruction left in vector mode, the cpu will switch
>for that amount of instruction in vector mode.

Keep in mind that more registers are expensive.
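
To get a feeling for how much state V would really hold, here is a small Python sketch that packs the fields you list into one word. The bit layout is purely my assumption (the proposal only names the fields), and I am assuming the "instructions left" count is the 3-bit #Imm3 value.

```python
# Hypothetical layout of the V state word, high bit to low bit:
#   [7] predicate on  [6:5] vector size  [4:3] word size  [2:0] instructions left
SIZES = (32, 64, 128, 256)                  # 2-bit size code
WORDS = ("b", "w", "l")                     # 2-bit word code
ELEMENT_BITS = {"b": 8, "w": 16, "l": 32}   # assumed element widths


def pack_v(predicated, vector_bits, word, left):
    """Pack the vector-mode state into a small integer."""
    if not 0 <= left <= 7:
        raise ValueError("instructions left must fit the 3-bit #Imm3 field")
    return (
        (int(predicated) << 7)
        | (SIZES.index(vector_bits) << 5)
        | (WORDS.index(word) << 3)
        | left
    )


def unpack_v(v):
    """Inverse of pack_v for values produced by pack_v."""
    return bool((v >> 7) & 1), SIZES[(v >> 5) & 3], WORDS[(v >> 3) & 3], v & 7


def predicate_bits(vector_bits, word):
    """One predicate bit per element."""
    return vector_bits // ELEMENT_BITS[word]
```

With this made-up layout V needs only 8 bits, and predicate_bits(256, "b") gives 32, which matches your reason for stopping at 256 bits.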

>
>I am guessing that the delay slot of an interruption could be used to
>call vector.restore. Also if there is no need for the interruption
>code to use vector or schedule another task, it won't need to save the
>V and P vector. I don't think that any of V, P and vector.restore
>should be requiring any privilege to be used. Also as P, and
>predicate in general, need to fit in one 32bits register, this is why
>I settle with 256bits as the max size for a vector.
>
>This is all that needs to be added to the SH instructions set. The
>rest of the new instruction needed will be just available in vector
>mode, logically reusing instruction slot that don't make sense in that
>mode (To avoid future collision with for example FPU instruction).
>
>I have not been able to find any prior art on such an idea nor any
>system that would implement this kind of concept. It does worry me a
>bit, as I feel like I am missing a point if I am inventing something
>new here. Would feel better if it wasn't all new. The closer
>instruction I could find that look like my proposal is the IT (IfThen)
>instruction of Thumb2 which is very close to pvector32.l actually.
>That might be another reason to just not bother with the 32bits mode.

I am not expert enough to see problems with your proposal, but I share the same worries.

>In the vector mode, there is a need for reviewing each instruction and
>see if they do work. For example mac with direct memory access isn't
>that useful as it is a common use case to have scatter/gather pattern
>or reverse order. It would be better to just do mac on the register
>directly.
>
>There would be a need for more load/store operation, but SIMD CPU
>instructions set have lived without it for a long time (GPU and vector
>computer do win from it). It would be way above my capability to touch
>this at this point and there is enough work on the existing mode to
>make things work that I feel like this could be kept for the future.
>If one day you want to do a GPU based on the J-Core, you would likely
>have to improve the memory subsystem and that would be part of it. So
>I don't want to bring that into the discussion yet.

I do not see what sense it would make to build a GPU on a CPU instruction set.
If the GPU were a superset of the CPU instructions, the resulting GPU would surely be inefficient. You would basically have an accelerator card with 512 or more CPUs on it. I do not see much use in that.
If the GPU instruction set is not a superset, what motivation is there to make it close to a CPU instruction set?
Can you please argue why a J-Core based GPU would be useful?

>They are instruction that added, could be useful. Like complex
>multiplication, which are basically a multiply and subtract
>operation. This is useful for every image and video codec pretty much.
>Pretty much anything that does radio signal processing, but again
>would like to push that to another thread.
>
>Then a few instructions to shuffle bits around would be also quite
>useful, if not necessary actually. As we are working with block of 32
>bits here, something to rotate a vector by one block, pop and push a
>32bits scalar into a vector seems to be useful and make sense. This
>wouldn't necessarily require a new execution unit as each column of
>32bits in a vector register could actually be in a separated register
>bank. So could be implemented as just a move from one bank to the
>other. They would look like something like this :
>
>rotblock{l,r} #Imm, Rn
>popblock Rn, Rm
>pushblock Rn, Rm
>
>The shift/rotation unit can also be reused for rotation inside a block
>and reduce the need to switch to the vector.l mode to do shift/rotate
>(This seems to look useful for crypto related operation, but need some
>experimenting to see if it is really that useful). This could be
>looking like :
>
>rot{l,r}row{1,2,3} Rn
>rot{l,r}row @Rn, Rm // Get the rotation for each block from memory,
>most significant byte define the most significant block move.
>
>On the hardware, I have a distant view as I am just starting to learn
>VHDL, but my guess is that there will be need to modify the decoder to
>lookup in the vector mode micro code or the normal one depending on
>some counter that decrease during vector mode, modify the execution
>unit to handle all the smaller operation and add a few register bank.
>Overall, I hope this won't increase too much the size of the J-Core

*hope*

>(Is it possible to have an idea of how much that would require at this
>point of the design idea ?).
>
>For the software side there is quite a lot of work there. Checking if
>assembler kernel do work out (Ideally you would want a reduction of
>the number of instruction to be a factor of the length of the vector,
>8bits code from scalar to register should be reducing the code size by
>as close as 32 times as possible when using 256bits vector mode for
>example). I am going to play with matrix multiplication, blending RGBA
>scanline, blurring scanline, color conversion, various turbojpeg
>optimization, maybe some crypto primitive (AES, ChaCha, ...). 

Just do not come up with so many instructions that they won't fit into the J-Core.

>For this I will need to go with at least an assembler, that is where the
>previous JSON work I have done should become handy as it should enable
>autogeneration of this instruction and allow to adjust things as I go
>with little work on the assembler.
>
>If that does looks good, there is a lot of work to be done after that.
>Adding support to the linux kernel, the J-Core and also to the
>compiler. Maybe even up to the auto vectorization loop. Quite a
>massive amount of work in itself and I bet it is going to take a long
>time before it reach the point where it is usable, but well, I find it
>fun. I will try to keep everything accessible in a github repository
>and ask from time to time for feedback there. 

Good

>I will likely focus first on the software side as there is a lot to do there and I will
>progress faster as it is a domain I know best, still I would like to
>start playing VHDL later on.

That approach might be dangerous as your use cases may not be a good representation of all the use cases. But try it out.

>Ah, and I have played with just matrix multiplication at the moment,
>and a 4x4 8bits matrix multiplication take 23 instructions instead of
>361 (Using 128bits register, some permutation instructions and no
>change in the ABI) which I think is encouraging as it is very close to
>the ideal target number and could likely be even lower with a change
>of ABI (By passing matrix directly in register). Anyway, what do you
>think ?

It would also be interesting to compare not only the instruction count but also the memory bandwidth used.
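
As a rough first estimate, here is a Python sketch that turns your two instruction counts into bytes moved. It rests on several assumptions of mine: the counts are executed instructions, every instruction is a 16-bit SH opcode fetched exactly once, and the data traffic is just two 4x4 8-bit input matrices read plus one result written, with no caches.

```python
INSN_BYTES = 2  # SH instructions are a fixed 16 bits wide


def fetch_bytes(instructions):
    """Bytes of instruction fetch, assuming each opcode is fetched once."""
    return instructions * INSN_BYTES


def data_bytes(n=4, element_bytes=1):
    """Two n x n input matrices read and one n x n result written."""
    return 3 * n * n * element_bytes


def total_bytes(instructions):
    """Instruction fetch plus matrix data for one multiplication."""
    return fetch_bytes(instructions) + data_bytes()


def bandwidth_ratio(scalar_insns=361, vector_insns=23):
    """How many times less memory traffic the vector version needs."""
    return total_bytes(scalar_insns) / total_bytes(vector_insns)
```

Under these assumptions the instruction count drops by about 15.7x (close to the ideal 16x for 8-bit elements in a 128-bit register), but total traffic only drops by about 8x, because the 48 bytes of matrix data do not shrink.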

>
>Best
>-- 
>Cedric BAIL