[J-core] [RFC] SIMD extension for J-Core
cr88192 at gmail.com
Fri Oct 27 14:02:52 EDT 2017
On 10/26/2017 8:58 PM, Ken Phillis Jr wrote:
> On Thu, Oct 26, 2017 at 6:14 PM, Rob Landley <rob at landley.net> wrote:
>> On 10/25/2017 06:59 PM, Ken Phillis Jr wrote:
>>> I figure I will include my Idea for an easy to use SIMD Extension for
>>> the J-Core. In general this extension will be used for Three Types of
>>> number formats...
>> One of the blue sky J64 proposals Jeff mentioned (let me see if I can
>> remember the details) was 2 control bits per register letting you
>> specify that register contains 8/16/32/64 bit SIMD, meaning a single 32
>> bit control register could control SIMD state for 16 general purpose
>> registers. Then you use the normal operations to deal with
>> signed/unsigned, multiply, and so on.
>> The next question was how to deal with source/target size mismatches: in
>> theory a 64 bit source iterating over 8 bit targets could apply a single
>> 64 bit source value to all 8 targets, a 32 bit source could go
>> 1-2-3-4-1-2-3-4, etc. In practice what that makes the ALU look like is a
>> question that needs some code prototyping I think...
> I forgot that the SIMD.IMODE and SIMD.FMODE Instructions are to avoid
> the added complexity of managing this information on a per-register basis.
> In general I think a single Instruction for converting between
> same-size types is more than enough....
> SIMD.TypeConv - This instruction is a bidirectional conversion to and
> from data types of the same size. A few example conversions are as follows:
> * Signed Integer and Unsigned Integer
> * Integers and Floats
> * Fixed Point value(s) and floating point.
I suspect a modal design may be preferable to per-register type-tagging
or similar, as the latter seems likely to work out more complicated to
implement.
while modal isn't really "ideal" per se, if there isn't too much rapid
switching between sub-modes, its added cost isn't too drastic.
using FPSCR as an example:
usually a function is much more dominated by one mode or another
(ex: Float or Double);
the exceptions tend to be clustered (say, a small island of Double
ops in an otherwise Float function).
in a compiler, it is possible to count, say, how many variable accesses
are to float, and how many are to double, and set this as the "default"
mode within a given function (and/or, more cleverly, to determine this
per-label); then when a branch occurs, this mode is set (if, say, at the
branch things are in Double mode for a Float function or label, then it
can set the state back to Float).
granted, there are cases where a branch may be followed by immediately
switching to a different mode, but this doesn't really seem to be a big
issue in practice.
if SIMD is added mostly via control flags, it seems like a similar sort
of strategy could apply.
the advantage is that modes may allow reusing more of the 16-bit coding
space;
the disadvantage is that, with modal ops one may have instructions whose
full nature isn't really known until afterwards (though, for sanity, one
can require that this be consistent within a static control-flow graph;
no two branches should be able to reach a given label with
inconsistent operating modes, nor with a mode inconsistent with the
label at its point of declaration).
there may be other trade-offs though with how easy one wants to make it
on assemblers/disassemblers/..., vs the goal of generating compact and
efficient code.
admittedly, I was initially a bit displeased with most of the FPU ops
having different operations depending on bits in FPSCR, but after
messing with it, the basic idea works well enough.
slightly more drastic though is going modal with larger parts of the
ISA, which is more how my design attempt had handled the matter of doing
a 64-bit ISA (I briefly considered dropping this modal aspect in favor
of "just use 32-bit I-forms for all 64-bit integer ops", but backtracked
here; and ended up with many of the 32-bit I-forms being modal as well,
following the same basic rules as their 16-bit siblings).
granted, there are pros/cons to the use of escaped longer
(variable-length) I-forms:
pro, have a lot more encoding space available;
pro, can improve code-density by helping some messy/costly and
semi-common edge cases
(most have to do with memory addressing, branch-distances, and similar);
con, larger ISA means more complex ISA
(need to try to minimize adding too much complexity here);
con, a HW impl requires a longer pipeline (so, branches are more
expensive)
(though, solutions and alternatives exist, with other tradeoffs).
working on the Verilog version more solidly established "no
I-forms beyond 32-bits" for now though.
the compiler still needs to typically prioritize use of the 16-bit
I-forms for best code-density.
the ops can't get too fancy (need to keep complexity down and
maintain throughput, ...)
all of this is eating up a lot of time/effort.
sadly, haven't even gotten to testing my fancier/crazier ideas yet.
it is taking a lot just to try to get everything to a "sane baseline".
trying to get plausible prototypes of both the 32-bit and
64-bit ISA variants.
also want to get "plausible" prototype CPU cores (in Verilog),
but this is its own set of issues.
need to also design many internal mechanisms, beyond just
the main ISA.