[J-core] [RFC] SIMD extension for J-Core
cr88192 at gmail.com
Fri Oct 27 14:02:52 EDT 2017
On 10/26/2017 8:58 PM, Ken Phillis Jr wrote:
> On Thu, Oct 26, 2017 at 6:14 PM, Rob Landley <rob at landley.net> wrote:
>> On 10/25/2017 06:59 PM, Ken Phillis Jr wrote:
>>> I figure I will include my Idea for an easy to use SIMD Extension for
>>> the J-Core. In general this extension will be used for Three Types of
>>> number formats...
>> One of the blue sky J64 proposals Jeff mentioned (let me see if I can
>> remember the details) was 2 control bits per register letting you
>> specify that register contains 8/16/32/64 bit SIMD, meaning a single 32
>> bit control register could control SIMD state for 16 general purpose
>> registers. Then you use the normal operations to deal with
>> signed/unsigned, multiply, and so on.
>> The next question was how to deal with source/target size mismatches: in
>> theory a 64 bit source iterating over 8 bit targets could apply a single
>> 64 bit source value to all 8 targets, a 32 bit source could go
>> 1-2-3-4-1-2-3-4, etc. In practice what that makes the ALU look like is a
>> question that needs some code prototyping I think...
> I forgot that the SIMD.IMODE and SIMD.FMODE Instructions are to avoid
> the added complexity of managing this information on a per-register basis.
> In general I think a single Instruction for converting between
> same-size types is more than enough....
> SIMD.TypeConv - This instruction is a bidirectional conversion to and
> from data types of the same size. A few example conversions are as follows:
> * Signed Integer and Unsigned Integer
> * Integers and Floats
> * Fixed Point value(s) and floating point.
I suspect a modal design may be preferable to per-register type-tagging
or similar, as the latter seems likely to work out more complicated to
implement.
while modal isn't really "ideal" per se, if there isn't too much rapid
switching between sub-modes, its added cost isn't too drastic.
using FPSCR as an example:
usually a function is much more dominated by one mode or another
(ex: Float or Double);
the exceptions tend to be clustered (say, a small island of Double
ops in an otherwise Float function).
in a compiler, it is possible to count, say, how many variable accesses
are to float, and how many are to double, and set this as the "default"
mode within a given function (and/or, more cleverly, to determine this
per-label); then when a branch occurs, this mode is set (if, say, at the
branch things are in Double mode for a Float function or label, then it
can set the state back to Float).
granted, there are cases where a branch may be followed by immediately
switching to a different mode, but this doesn't really seem to be a big
issue in practice.
if SIMD is added mostly via control flags, it seems like a similar sort
of strategy could apply.
the advantage is that modes may allow reusing more of the 16-bit coding
space;
the disadvantage is that, with modal ops one may have instructions whose
full nature isn't really known until afterwards (though, for sanity, one
can require that this be consistent within a static control-flow graph;
no two branches should be able to reach a given label with
inconsistent operating modes, nor with a mode inconsistent with the
label at its point of declaration).
there may be other trade-offs though with how easy one wants to make it
on assemblers/disassemblers/..., vs the goal of generating compact and
efficient code.
admittedly, I was initially a bit displeased with most of the FPU ops
having different operations depending on bits in FPSCR, but after
messing with it, the basic idea works well enough.
slightly more drastic though is going modal with larger parts of the
ISA, which is more how my design attempt had handled the matter of doing
a 64-bit ISA (I briefly considered dropping this modal aspect in favor
of "just use 32-bit I-forms for all 64-bit integer ops", but backtracked
here; and ended up with many of the 32-bit I-forms being modal as well,
following the same basic rules as their 16-bit siblings).
granted, there are pros/cons to the use of escaped longer
(variable-length) I-forms:
pro, have a lot more encoding space available;
pro, can improve code-density by helping some messy/costly and
semi-common edge cases
(most have to do with memory addressing, branch-distances, and similar);
con, larger ISA means more complex ISA
(need to try to minimize adding too much complexity here);
con, a HW impl requires a longer pipeline (so, branches are more
expensive)
(though, solutions and alternatives exist, with other tradeoffs).
working on the Verilog version more solidly established "no
I-forms beyond 32-bits" for now though.
the compiler still needs to typically prioritize use of the 16-bit
I-forms for best code-density.
the ops can't get too fancy (need to keep complexity down and
maintain throughput, ...)
all of this is eating up a lot of time/effort.
sadly, haven't even gotten to testing my fancier/crazier ideas yet.
it is taking a lot just to try to get everything to a "sane baseline".
trying to get plausible prototypes of both the 32-bit and
64-bit ISA variants.
also want to get "plausible" prototype CPU cores (in Verilog),
but this is its own set of issues.
need to also design many internal mechanisms, beyond just
the main ISA.