[J-core] [RFC] SIMD extension for J-Core

Cedric BAIL cedric.bail at free.fr
Sun Aug 27 19:30:14 EDT 2017


On Sun, Aug 27, 2017 at 1:58 AM, Joh-Tob Schäg <johtobsch at gmail.com> wrote:
> Here are the thoughts of a novice.
>
>>Today, my main interest in the J-Core is due to the instruction
>>density, which should permit lowering the memory bandwidth needed for
>>a task; that is today the main limit on 2D user interface rendering.
>
> That high code density is possible through having few instructions.

Exactly the goal here, indeed.

>>So with that in mind, my current interest is in proposing a
>>SIMD/vector instruction set, as this is maybe within my skills and
>>should already be useful in itself. I have been researching all the
>>SIMD, vector computer, and GPU designs that were published in the
>>'90s. The nice part of this is that we now have 20 years of feedback
>>on what works, and compilers have improved a lot at taking advantage
>>of these instructions, especially over the last decade.
>
> Do you argue that an instruction set designed today is clearly superior to one from the past?

Not necessarily, but if you look at which decisions worked best in the
past, you might avoid those pitfalls. Arguably, you might fall into
new ones :-) Ideally, I would have preferred not to come up with new
ideas.

>>By this email, I want to share the direction I am currently heading
>>in, to make sure that I am not missing anything obvious before
>>starting to write too much code around it. It is a long way from
>>being a formal proposal; this is really closer to a "please share
>>your opinion on the matter".
>>
>>What I can summarize here is that the main difference between SIMD,
>>vector computer, and GPU instruction sets seems to be mostly in the
>>load/store unit and the different memory access modes. By focusing
>>just on SIMD, I am removing any change to the memory subsystem, which
>>does reduce the complexity of this discussion.

> Do you have some third-party source to back this up?

Sure, I would recommend reading the latest edition of Computer
Architecture: A Quantitative Approach. It has a good chapter on
GPUs.

http://www.tandon-books.com/Computer%20Science/CS6143%20-%20Computer%20Architecture%20II/(CS6143)%20Computer%20Architecture%20-%20A%20Quantitative%20Approach%205e.pdf

This one is a bit old, but it nicely covers the design evolution of GPUs up to 2011:

http://cours.do.am/ParadigmeAgent/final.pdf

This is obviously not everything, but I feel it is a good starting point.

>>The second set of feedback on SIMD instruction sets is that the
>>killer has been their usual lack of completeness (being instead an
>>assemblage of specialized/dedicated instructions, a la Intel); the
>>lack of horizontal operations belongs in that same bucket. Finally,
>>another set of mistakes, which mostly impacted SIMD instructions on
>>RISC designs, was the huge cost of moving data between the vector
>>registers and the scalar/memory system (some were quite bad, like
>>VIS, where 40% of the instructions in any kernel using it ended up
>>dedicated to moving data around).
>>
>>On the side of things that did work, and especially helped compilers,
>>there seems to be a need for predicates (which seems to be something
>>that CPU architects dislike, but I don't really understand why; it
>>seems it should be fine to me, as the instruction set already reads
>>the destination register in many cases).
> I do not understand that paragraph
>
>>The Altivec instruction set, especially its permutation instructions,
>>seems to have attracted quite some praise too.

> Citation please?

Not too sure what would be of interest for you here, but I would
really like to avoid having the discussion about permutation
instructions in this thread.

>>Altivec also exposed its instructions through <altivec.h> in C, with
>>functions that match one to one with the instruction set. Well,
>>except for the idea of naming their data type "vector", which
>>collides with C++. Nothing is perfect! Still, I would like to defer
>>the discussion about the permutation unit and instructions to
>>another, later thread.

> Please try to make your point more clear then.

I guess my statement should have been: permutations are mandatory for
a higher-performance SIMD instruction set, but as this requires an
additional hardware unit, I would like to defer this discussion to
another email thread.

>>My goal is also to have a solution that potentially scales from the
>>lower end of the J-Core family up to a GPU using J-Core (implying a
>>superscalar and SMT design) and doesn't require too much additional
>>hardware to already give a benefit. Ideally the focus is on having
>>the instruction count divided by as close as possible to the number
>>of items in a vector.
>>
>>That leads me to the proposal of adding just a few instructions to
>>switch the CPU between a normal scalar mode and a vector mode that
>>works by dividing the vector into 32-bit steps of operation. There
>>would still be 16 registers, with the lower part of them being the
>>scalar registers when the CPU is in scalar mode. Most instructions of
>>the scalar mode will be available while in vector mode, sometimes
>>slightly adapted to be useful in that mode. A few instructions
>>related to shuffling vectors around will be added to deal with the
>>lack of a wide range of addressing modes in vector mode.
>>
>>The main instruction I am thinking of would be:
>>
>>{p}vector{32,64,128,256}.{b,w,l} #Imm3[, Rn]
>>
>>The p (predicate) flag defines whether Rn will be used as a source
>>for the predicate (maybe it could actually use T for that purpose and
>>avoid specifying a register here). The size flag (32, 64, 128, or
>>256) defines the size of the vector register to be used for the next
>>#Imm3 vector instructions. The b, w, and l flags define the word size
>>of the SIMD operation to be used during computation for the next
>>#Imm3 vector instructions.
>>
>>An alternate solution would be to have a stop instruction instead of
>>using an immediate as above. My guess is that it is highly unlikely
>>that a kernel using vector instructions would last so long that it
>>would be a win over using one instruction. That is something that
>>could require further experiment.
>>
> How would you want to perform those?

You would have a vector.end instruction that would switch the CPU
back to scalar mode.
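To make the proposed mode switch concrete, here is a rough sketch in hypothetical SH-style assembly (everything beyond the syntax proposed above, including the register choices, is only illustrative and not part of any existing J-Core instruction set):

```
	vector128.b  #3     ! next 3 instructions run in 128-bit, byte-wise vector mode
	mov.l  @r4+, r1     ! vector load: 16 bytes into vector register r1
	add    r1, r2       ! 16 parallel byte additions
	mov.l  r2, @r5      ! vector store of the 16-byte result
	                    ! counter reached 0: the CPU is back in scalar mode here

	! Alternate form, with an explicit stop instead of the #Imm3 count:
	vector128.b
	...                 ! vector instructions
	vector.end          ! switch back to scalar mode
```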

>>Some modes are slightly silly, like vector32.l. Maybe the entire
>>32-bit mode is not really necessary. I need to write more kernels
>>using this and see how a compiler can use it.
>
> That approach seems wasteful; before putting it in a proposal, you should be sure your ideas work.

There are use cases for the 32-bit mode; the question is how useful
they are. For example, when manipulating pixels, you may have a
warm-up phase before you are able to process 256 bits at a time, and
being able to process just one pixel might be more efficient than
using a predicate to disable the rest of the vector. It would be
possible to have a lot more logic around predicates so that nothing is
executed when a 32-bit word is completely disabled, but that might
increase the complexity of the hardware, while it feels not so
problematic in software. But as we both said, this requires further
experiment.
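To make that warm-up phase concrete, here is a minimal plain-C sketch (no J-Core instructions involved; the function, the saturating-add operation, and the 32-byte alignment target are all just assumptions for illustration) of a scanline loop with a scalar peel before the 256-bit-wide body, and a scalar tail standing in for what a predicate would otherwise mask off:

```c
#include <stdint.h>
#include <stddef.h>

/* Saturating brightness add over a scanline of bytes. */
static void brighten(uint8_t *px, size_t n, uint8_t add)
{
    size_t i = 0;

    /* Warm-up: one byte at a time until the pointer is 32-byte aligned.
     * This is where a one-item 32-bit vector mode could beat predicating
     * off most of the lanes of a 256-bit vector. */
    while (i < n && ((uintptr_t)(px + i) & 31) != 0) {
        unsigned v = px[i] + add;
        px[i++] = v > 255 ? 255 : (uint8_t)v;
    }

    /* Main body: stands in for a 256-bit, byte-wise vector loop. */
    for (; i + 32 <= n; i += 32)
        for (size_t j = 0; j < 32; j++) {
            unsigned v = px[i + j] + add;
            px[i + j] = v > 255 ? 255 : (uint8_t)v;
        }

    /* Tail: what a predicate would disable in vector mode. */
    for (; i < n; i++) {
        unsigned v = px[i] + add;
        px[i] = v > 255 ? 255 : (uint8_t)v;
    }
}
```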

>>There will be a need for two more instructions to allow interrupts
>>during vector mode. I am expecting that we don't want interrupts to
>>have to deal with the complexity of not knowing which mode they are
>>in, so they should always start in scalar mode. Then they will need a
>>way to save and restore the vector mode, something along these
>>lines:
>>
>>move V, Rn // Would save the state (predicate on? vector size? word
>>size? instructions left?)
>>move P, Rn // Would save the predicate that is currently in use
>>move Rn, V
>>move Rn, P
>>vector.restore P, V // After this instruction, if V indicates that
>>there are still instructions left in vector mode, the CPU will switch
>>to vector mode for that number of instructions.
>
> Keep in mind that more registers are expensive.

Well, V and P are small, 32-bit registers that are part of what is
needed internally to run in a vector compute mode. Arguably, they are
really small compared to the additional bank of 8 registers that a
256-bit vector computer implies, which is where the cost goes here, I
think.
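For a sense of scale, the whole V state fits comfortably in one 32-bit word. The C sketch below packs and unpacks it; the exact field layout is a hypothetical choice of mine, not something specified in the proposal:

```c
#include <stdint.h>

/* Hypothetical packing of the V state into a 32-bit word:
 * bit 0      predicate enabled
 * bits 1-2   vector size as log2(size/32)  (0=32, 1=64, 2=128, 3=256)
 * bits 3-4   word size                     (0=b, 1=w, 2=l)
 * bits 5-7   vector instructions left      (#Imm3 countdown) */
static uint32_t v_pack(int pred_on, int size_log2, int wordsz, int insns_left)
{
    return (uint32_t)(pred_on & 1)
         | (uint32_t)(size_log2 & 3) << 1
         | (uint32_t)(wordsz & 3) << 3
         | (uint32_t)(insns_left & 7) << 5;
}

static void v_unpack(uint32_t v, int *pred_on, int *size_log2,
                     int *wordsz, int *insns_left)
{
    *pred_on    = (int)(v & 1);
    *size_log2  = (int)((v >> 1) & 3);
    *wordsz     = (int)((v >> 3) & 3);
    *insns_left = (int)((v >> 5) & 7);
}
```

The P register needs no packing at all: with a 256-bit vector and 8-bit lanes there are at most 32 lanes, so one predicate bit per lane fits exactly in a 32-bit register, which matches the argument below for capping vectors at 256 bits.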

>>I am guessing that the delay slot of an interrupt could be used to
>>call vector.restore. Also, if there is no need for the interrupt
>>code to use vectors or schedule another task, it won't need to save
>>the V and P registers. I don't think that any of V, P, and
>>vector.restore should require any privilege to be used. Also, as P,
>>and predicates in general, need to fit in one 32-bit register, this
>>is why I settled on 256 bits as the maximum size for a vector.
>>
>>This is all that needs to be added to the SH instruction set. The
>>rest of the new instructions needed will only be available in vector
>>mode, logically reusing instruction slots that don't make sense in
>>that mode (to avoid future collisions with, for example, FPU
>>instructions).
>>
>>I have not been able to find any prior art on such an idea, nor any
>>system that implements this kind of concept. It does worry me a bit,
>>as I feel like I am missing a point if I am inventing something new
>>here. I would feel better if it wasn't all new. The closest
>>instruction I could find that looks like my proposal is the IT
>>(If-Then) instruction of Thumb-2, which is actually very close to
>>pvector32.l. That might be another reason to just not bother with the
>>32-bit mode.
>
> I am not expert enough to see problems with your proposal, but I share the same worries.
>
>>In vector mode, there is a need to review each instruction and see
>>whether it works. For example, mac with direct memory access isn't
>>that useful, as it is a common use case to have scatter/gather
>>patterns or reverse order. It would be better to just do mac on the
>>registers directly.
>>
>>There would be a need for more load/store operations, but SIMD CPU
>>instruction sets have lived without them for a long time (GPUs and
>>vector computers do win from them). It would be way above my
>>capability to touch this at this point, and there is enough work on
>>the existing modes to make things work that I feel this could be kept
>>for the future. If one day you want to do a GPU based on the J-Core,
>>you would likely have to improve the memory subsystem, and that would
>>be part of it. So I don't want to bring that into the discussion yet.
>
> I do not see what sense it would make to build a GPU on a CPU instruction set.
> If the GPU were a superset of the CPU instructions, the resulting GPU would surely be inefficient. You would basically have an accelerator card with 512 or more CPUs on it. I do not see much use in it.
> If the GPU instruction set is not a superset, what motivation is there to make it close to a CPU instruction set?

That is maybe where we will disagree. GPU instruction sets are, from
my point of view, indeed a superset of a CPU one. They usually carry
some scatter/gather memory access instructions, a few more
not-too-precise FPU-related instructions (like cos, sin), and
sometimes dedicated 2D operations like blending. The last generation
is also starting to integrate operations dedicated to faster matrix
multiplication.

> Can you please argue why a J-Core based GPU would be useful?

I do feel that J-Core is a good starting point, as the instruction
set is dense, leading to low memory bandwidth usage for code. It is
also fully open source, patent free, and a modern RISC design, which
should make it possible to slowly go in that direction. Of course,
for a 3D GPU you will need an FPU, but for just 2D purposes you
don't. Some GPUs do have special instructions for 2D purposes, like
blending or color conversion for example, but this is not something
that would be mandatory or necessary at this stage. Something to see
whether it is worth it once you start running real workloads
(arguably it would be nice to have a way to extract statistics from a
program at run time; again, that sounds like a discussion for another
email thread).

Finally, another argument for why I feel it is interesting to
leverage J-Core for this kind of use case: after a discussion with
Rob, we realized that instead of having proprietary blobs to upload
into unknown chips that drive ethernet, sound, graphics, ... we could
actually leverage J-Core for that purpose. This means that the driver
and the code running on the other side of the pipe can be compiled by
the same compiler and, after some linker tricks, might even be part
of the same kernel. This would open a lot of potential on the
software side.

>>There are instructions that, if added, could be useful. Like complex
>>multiplication, which is basically a multiply-and-subtract operation.
>>This is useful for pretty much every image and video codec, and
>>pretty much anything that does radio signal processing, but again I
>>would like to push that to another thread.
>>
>>Then a few instructions to shuffle bits around would also be quite
>>useful, if not actually necessary. As we are working with blocks of
>>32 bits here, something to rotate a vector by one block, and to pop
>>and push a 32-bit scalar into a vector, seems useful and makes
>>sense. This wouldn't necessarily require a new execution unit, as
>>each column of 32 bits in a vector register could actually live in a
>>separate register bank, so it could be implemented as just a move
>>from one bank to the other. They would look something like this:
>>
>>rotblock{l,r} #Imm, Rn
>>popblock Rn, Rm
>>pushblock Rn, Rm
>>
>>The shift/rotation unit can also be reused for rotations inside a
>>block, reducing the need to switch to vector.l mode to do
>>shifts/rotates (this looks useful for crypto-related operations, but
>>needs some experimenting to see if it is really that useful). It
>>could look like:
>>
>>rot{l,r}row{1,2,3} Rn
>>rot{l,r}row @Rn, Rm // Get the rotation for each block from memory;
>>the most significant byte defines the move of the most significant
>>block.
>>
>>On the hardware side, I have a distant view, as I am just starting
>>to learn VHDL, but my guess is that there will be a need to modify
>>the decoder to look up either the vector-mode microcode or the
>>normal one, depending on some counter that decreases during vector
>>mode, to modify the execution units to handle all the smaller
>>operations, and to add a few register banks. Overall, I hope this
>>won't increase the size of the J-Core too much
>
> *hope"

At this stage, I don't have much idea of how to measure that. Until
we start adding this to the VHDL, I am not too sure how we can get a
valuable metric. If you have ideas there, it would be interesting.

>>(Is it possible to have an idea of how much that would require at
>>this point in the design?).
>>
>>For the software side, there is quite a lot of work there. Checking
>>whether assembly kernels work out (ideally you would want the
>>reduction in instruction count to be a factor of the length of the
>>vector; 8-bit code going from scalar to vector registers should
>>reduce the code size by as close to 32 times as possible when using
>>the 256-bit vector mode, for example). I am going to play with
>>matrix multiplication, blending RGBA scanlines, blurring scanlines,
>>color conversion, various turbojpeg optimizations, and maybe some
>>crypto primitives (AES, ChaCha, ...).
>
> Just do not come up with so many instructions that they won't fit into the J-Core.

Hum, maybe there is a misunderstanding here. So far, I have proposed
only 6 new instructions in the J-Core scalar instruction set and an
additional 5 in the vector instruction set. I do not plan on adding a
matrix multiplication unit, and I don't think it would pay off like
other new units would; what I am talking about is trying to write the
assembly code that corresponds to the above tasks using the added
instructions described here. I hope this clarifies it.
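For reference, this is the kind of kernel meant by the matrix multiplication task: a plain-C scalar baseline of a 4x4 8-bit matrix multiply (row-major layout and 32-bit accumulators are my assumptions), roughly the code that a 128-bit vector mode with permutation instructions is supposed to collapse to a couple of dozen instructions:

```c
#include <stdint.h>

/* Scalar 4x4 matrix multiply on 8-bit elements, accumulating in 32 bits
 * so that no intermediate product overflows. out = a * b, row-major. */
static void matmul4x4_u8(const uint8_t a[16], const uint8_t b[16],
                         uint32_t out[16])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            uint32_t acc = 0;
            for (int k = 0; k < 4; k++)
                acc += (uint32_t)a[i * 4 + k] * b[k * 4 + j];
            out[i * 4 + j] = acc;
        }
}
```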

>>For this I will need at least an assembler; that is where the
>>previous JSON work I have done should come in handy, as it should
>>enable autogeneration of these instructions and allow adjusting
>>things as I go with little work on the assembler.
>>
>>If that looks good, there is a lot of work to be done after that:
>>adding support to the Linux kernel, to the J-Core, and also to the
>>compiler, maybe even up to loop auto-vectorization. Quite a massive
>>amount of work in itself, and I bet it is going to take a long time
>>before it reaches the point where it is usable, but well, I find it
>>fun. I will try to keep everything accessible in a GitHub repository
>>and ask for feedback there from time to time.
>
> Good
>
>>I will likely focus first on the software side, as there is a lot to
>>do there and I will progress faster since it is the domain I know
>>best; still, I would like to start playing with VHDL later on.
>
> That approach might be dangerous as your use cases may not be a good representation of all the use cases. But try it out.

What would be your approach? Why do you think it is dangerous? Or
what are the dangers you are seeing? Please share, that's the
interesting bit!

>>Ah, and I have only played with matrix multiplication at the moment:
>>a 4x4 8-bit matrix multiplication takes 23 instructions instead of
>>361 (using 128-bit registers, some permutation instructions, and no
>>change to the ABI), which I think is encouraging, as it is very
>>close to the ideal target number and could likely be even lower with
>>a change of ABI (by passing matrices directly in registers). Anyway,
>>what do you think?
>
> It would also be interesting to compare not only instruction count but also memory bandwidth used.

I am guessing you would like to know how often you need to spill to
memory due to register pressure? In the example above, there is no
unnecessary memory access: the input matrices are read once, and the
output matrix is written back in one instruction, with no step in
between that requires any memory access. If that is what we are
talking about, then we agree that we need to keep the amount of
unnecessary memory access as low as possible, as this is the main
concern.
-- 
Cedric BAIL

