[J-core] [RFC] SIMD extension for J-Core
cr88192 at gmail.com
Thu Oct 26 00:08:30 EDT 2017
On 10/25/2017 6:59 PM, Ken Phillis Jr wrote:
> I figure I will include my Idea for an easy to use SIMD Extension for
> the J-Core. In general this extension will be used for Three Types of
> number formats...
> Format 1: Signed, and Unsigned Integers
> Format 2: Floating Point Values
> Format 3: Fixed Point Values - This is low priority, and most likely
> will be dropped unless sample applications requiring this
> functionality can be found. Also, in many cases, this can be simulated
> using existing floating point and integer math routines.
errm, usually fixed-point *is* just the packed integer operations.
there are edge cases; for example, a fixed-point multiply is usually done
as ((X*Y)>>SH); sometimes the shift is omitted or delayed depending on how
the result is used, or the multiply produces a wider result and only the
high-order bits are kept (typically when working with unit-scale values).
a possibility could be an op that does, say:
c[i] = clamp((a[i]*b[i]) >> shift, min, max);
which could more easily allow some things which can't readily be done
with packed multiply ops that keep only the low or high bits.
such an operation would probably require a 32-bit instruction format
to represent effectively, though.
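in rough C, the per-lane behavior might look something like this (purely
a sketch; the 16-bit lane width, the function names, and the way
shift/min/max are supplied are all illustrative, not part of any actual
encoding):

    #include <stdint.h>

    /* sketch: packed multiply, shift, clamp, per 16-bit lane */
    static inline int16_t clamp16(int32_t v, int32_t lo, int32_t hi)
    {
        return (int16_t)(v < lo ? lo : (v > hi ? hi : v));
    }

    /* c[i] = clamp((a[i]*b[i]) >> shift, min, max), 8 lanes of 16 bits */
    void pmul_shift_clamp(int16_t c[8], const int16_t a[8],
                          const int16_t b[8], int shift,
                          int32_t min, int32_t max)
    {
        for (int i = 0; i < 8; i++) {
            /* widen to 32 bits so the product can't overflow; the >> on
               a signed value is assumed to be an arithmetic shift here */
            int32_t wide = (int32_t)a[i] * (int32_t)b[i];
            c[i] = clamp16(wide >> shift, min, max);
        }
    }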
> New CPUID Flags:
> New Registers: simd0 to simd15
> These Registers are 128-bits in size, and are used to perform a bulk
> of the SIMD Math.
well, you can maybe get one or two of those.
my proposal is that SIMD would use the same register space as the FPU.
it makes a lot more sense (considering the design of the SH FPU) than it
did for x87/MMX (where basically the two subsystems were at odds with
each other and effectively couldn't be used simultaneously).
granted, the actual register ordering gets a little convoluted if one
maps the SIMD registers onto the paired FR/XF banks this way:
* my designs had defined XR0=(FR1,FR0,XF1,XF0); XR2=(FR3,FR2,XF3,XF2), ...
* I had considered a full 16x SIMD registers, but am now leaning towards 8.
** size of the register space can itself be a problem.
** the memory blocks needed for registers are generally more scarce than
those for normal memory;
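to make the pairing concrete, the overlay I had in mind looks roughly
like the following (illustrative C only; the struct layout and names are
just for exposition, not how the hardware banks would actually be wired):

    #include <stdint.h>

    typedef union {       /* one 128-bit SIMD register (4x 32-bit lanes) */
        float    f32[4];
        uint32_t u32[4];
        uint64_t u64[2];
    } xreg_t;

    typedef struct {      /* SH4-style FPU register file, both banks */
        float fr[16];     /* FR0..FR15 (front bank) */
        float xf[16];     /* XF0..XF15 (back bank)  */
    } fpu_regs_t;

    /* XR0=(FR1,FR0,XF1,XF0), XR2=(FR3,FR2,XF3,XF2), ...  (n even) */
    static xreg_t read_xreg(const fpu_regs_t *r, int n)
    {
        xreg_t x;
        x.f32[0] = r->xf[n + 0];   /* XFn,   lowest lane  */
        x.f32[1] = r->xf[n + 1];   /* XFn+1               */
        x.f32[2] = r->fr[n + 0];   /* FRn                 */
        x.f32[3] = r->fr[n + 1];   /* FRn+1, highest lane */
        return x;
    }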
> SIMD Configuration Instructions:
> SIMD.IMODE - This configures the Integer Math mode of the Integer SIMD
> Operations. The accepted modes should include the following:
> * Integer Carry Mode - See ADDC and SUBC for example.
> * Value UnderFlow and OverFlow Mode - See ADDV and SUBV for examples.
> * Integer Type: Signed and Unsigned values with sizes of 8-bit,
> 16-bit, 32-bit, and 64-bits.
> SIMD.FMODE - This configures the Floating Point Mode. This can change
> the Operation Data size, and In particular the SIMD FPU should support
> Half-Precision ( 16-bit), Single Precision (32-bit), and Double
> Precision (64-bit) Floating Point numbers. This also has configuration
> settings for what happens when Floating point errors occur.
packed double is probably seriously overkill at this stage, and would be
rather expensive (unless emulated using scalar operations).
assuming this sticks with the same basic register space as the SH4 FPU,
then presumably, if SIMD and FPU coexist on the same core, float and
double could be done with scalar ops (and/or, if packed float/double ops
exist, they may internally "decay" into several scalar ops executed in
sequence).
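as a trivial illustration of what "decaying" into scalar ops means here
(a sketch of the observable behavior only, not how it would be
implemented):

    /* a packed single-precision add over 4 lanes behaving as four
       scalar FADDs issued back-to-back: */
    void simd_fadd_as_scalars(float dst[4], const float src[4])
    {
        for (int i = 0; i < 4; i++)
            dst[i] += src[i];   /* each iteration ~ one scalar FADD */
    }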
> Data Manipulation Instructions:
> SIMD.MOV - The MOV Instruction, but for the SIMD Registers. There is
> no need for a SIMD.FMOV Instruction since there is only one set of
> SIMD registers.
> SIMD.SHUFFLE - This instruction allows for Free form byte reordering.
> A close example of an existing instruction is the SWAP instruction.
> SIMD.ROTL - Left Binary Shift
> SIMD.ROTR - Right Binary Shift
> Bitwise Operations:
> SIMD.AND - SIMD Variation of the AND instruction.
> SIMD.NOT - SIMD Variation of the NOT instruction.
> SIMD.OR - SIMD Variation of the OR instruction.
> SIMD.XOR - SIMD Variation of the XOR instruction
> Integer Arithmetic:
> SIMD.ADD - See: ADD Instruction.
> SIMD.SUB - See: SUB Instruction
> SIMD.MUL - Integer Multiplication.
> SIMD.DIV - Integer Division.
> Integer Composite Instructions:
> SIMD.ABS - Absolute Value.
> SIMD.MIN - Minimum Value
> SIMD.MAX - Maximum Value
> Integer Comparison:
> SIMD.CMP/EQ - Data Parallel version of CMP/EQ
> SIMD.CMP/GT - Data Parallel version of CMP/GT
> Floating Point Arithmetic:
> SIMD.FADD - See: FADD Instruction.
> SIMD.FSUB - See: FSUB Instruction.
> SIMD.FMUL - See: FMUL Instruction.
> SIMD.FDIV - See: FDIV Instruction.
> Floating Point Composite Instructions:
> SIMD.FABS - Floating Point Absolute Value.
> SIMD.FMIN - Floating Point Minimum Value.
> SIMD.FMAX - Floating Point Maximum Value.
> SIMD.FSQRT - Floating Point Square Root.
> Floating Point Comparisons:
> SIMD.FCMP/EQ - Data Parallel version of FCMP/EQ
> SIMD.FCMP/GT - Data Parallel version of FCMP/GT
> Data Loading/Conversion Instructions:
> Bulk Conversion From Integers to Floats, and Floats to integers is a
> must. That said, I'm not exactly sure how many Instructions are needed
> for this, but It would be reasonable to say that four to seven
> instructions may be required.
the issue isn't so much the number of instructions, but how they are encoded;
not all bit patterns are created equal, and so:
* different sets of operations can coexist if they have mutually
exclusive operating modes;
* operations with fewer/smaller operands are cheaper;
in terms of space for 2-register 16-bit operations:
* there is pretty much no space left for these;
* there is a little more space for ops with single (4-bit) operands.
if done simply with mode bits, overlapping the FPU, you basically have
the Fxxx space (about a 12-bit space), which needs to encode all
possible SIMD ops (and their operands).
otherwise, 16-bit land is basically already pretty full.
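as a rough sanity check on the numbers (assuming the usual 4-bit Rm/Rn
register fields): the Fxxx block gives 2^12 = 4096 encodings total; a
two-register op burns 2^8 = 256 of them per opcode, so even a completely
empty Fxxx block would only hold 16 such ops, and the existing FPU
instructions already occupy most of it.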
( decided to omit some stuff )
in my case, I am using some parts of the 16-bit space for other things:
82xx, reserved for now (probably for more opcode space);
83xx, "MOV.L @(SP, disp)" (stretches locals up to 31, sometimes useful)
there are a lot of functions with a lot of locals; if at least the
32 most-common locals have 16-bit forms, this saves a bit of space
vs just 16.
displaces 2A's "JSR/N @(TBR,disp)", but TBR doesn't exist in my designs.
86xx, "FMOV.S @(SP, disp)" (disp=0..15)
87xx, "FMOV.S @(SP, disp)" (disp=16..31)
these also save some space.
these displace some SH2A bit-twiddly operations (which seemed less useful);
these could be dropped if something better comes along.
8Axx, "MOV #imm24, R0"
actually pretty useful for things like relative addressing, and is big
enough to handle pretty much all intra-binary displacements in most cases
(a signed 24-bit immediate covers roughly +/-8MB).
some other ranges are used for extended opcode blocks, which hold the
vast majority of my extended ISA ops. this space is pretty big vs the
16-bit space, but sadly still not exactly infinite.
some blocks only exist in certain operating modes:
CCxx (ops which only exist in 64-bit mode, and uROM mode)
CDxx (more 64b and uROM ops)
CExx (64-bit mode, possible 64-bit instruction forms, TBD, 1)
CFxx (64-bit mode, possible 48-bit instruction forms, TBD, 1)
in the normal 32-bit mode, these ranges encode some "@(GBR, R0)"
instructions which are otherwise basically unused (and in some of my
stuff GBR is re-purposed to serve a similar role to FS or GS in x86-64).
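to illustrate the GBR-as-FS/GS idea (hedged C sketch; the struct and
names are made up, and the comment shows the kind of instruction that
would result):

    /* GBR holds a per-thread base pointer, so thread-local accesses
       become plain @(GBR, disp) loads/stores, much like fs:/gs:-relative
       accesses on x86-64. */
    struct tls_block { int foo; int bar; };
    extern struct tls_block *tls_base;   /* stand-in for GBR */

    int get_bar(void)
    {
        return tls_base->bar;   /* -> roughly MOV.L @(GBR, 4), R0 */
    }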
1: I had previously decided that no larger I-forms would exist in 32-bit
mode, but may consider them for 64-bit mode (mostly for large immediate
values). this is mostly due to matters of how the I-cache works
(limiting opcodes to 32 bits allows for a somewhat cheaper I-cache, but
also hinders efficient handling of large constants).
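as a concrete (and purely illustrative) example of the constant problem:
with opcodes capped at 32 bits, no single instruction can carry a large
immediate, so a 64-bit constant ends up as either a memory load or a
chain of load/shift/or steps, roughly:

    #include <stdint.h>

    /* C-level equivalent of the emitted multi-op sequence; the 24/24/16
       split is just an assumption based on the imm24 form above */
    uint64_t build_const64(uint32_t hi24, uint32_t mid24, uint32_t lo16)
    {
        uint64_t v = 0;
        v |= (uint64_t)hi24  << 40;   /* op 1: load the high bits       */
        v |= (uint64_t)mid24 << 16;   /* op 2: OR in the middle bits    */
        v |= lo16;                    /* op 3: OR in the low bits       */
        return v;                     /* vs. one wide I-form with imm64 */
    }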
some other scattered ops were used for other instructions, a few of
which have turned out pretty useful.
eg: PC-relative load/store ops to reduce the cost of accessing globals;
these mostly function as combiner ops to form a larger "pseudo-instruction".