[J-core] [RFC] SIMD extention for J-Core

Thu Oct 26 17:12:02 EDT 2017

On  Wed, 25 Oct 2017 23:08:30 -0500,   BGB  wrote:
> On 10/25/2017 6:59 PM, Ken Phillis Jr wrote:
>> I figure I will include my Idea for an easy to use SIMD Extension for
>> the J-Core. In general this extension will be used for Three Types of
>> number formats...
>>
>> Format 1: Signed, and Unsigned Integers
>> Format 2: Floating Point Values
>> Format 3: Fixed Point Values - This is low priority, and most likely
>> will be dropped unless sample applications requiring this
>> functionality can be found. Also, in many cases, this can be simulated
>> using existing floating point and integer math routines.
>
> errm, usually the fixed-point *is* the packed integer operations.
>
> there are edge-cases, for example a fixed-point multiply is usually done
> as (X*Y>>SH); sometimes the shift is omitted or delayed depending on how
> it is used; or will produce a wider result and only keep the high order
> bits (typically if working with unit-scale values).
>
> possibly there could be an op that does, say:
>      c[i]=clamp((a[i]*b[i])>>shift, min, max);
>
> which could more easily allow some things which can't readily be done
> with packed multiply ops which keep only the low or high bits.
>
> such an operation would probably require a 32-bit instruction-format
> though to represent effectively.
>
>

I know this, and the SIMD Instructions probably should be expanded to
allow for Fixed point math.
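
For reference, here is a rough sketch in C of what the clamped
fixed-point multiply quoted above would compute per lane. The lane
count (eight 16-bit lanes in a 128-bit register) and the names
packed_fixmul_clamp and clamp_i32 are only my assumptions for
illustration, not a proposed encoding or instruction name.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8  /* assumption: eight 16-bit lanes per 128-bit register */

/* Clamp a 32-bit intermediate into [lo, hi]. */
static int32_t clamp_i32(int32_t v, int32_t lo, int32_t hi)
{
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

/*
 * c[i] = clamp((a[i] * b[i]) >> shift, lo, hi) for every lane.
 * The product is formed at 32 bits so nothing is lost before the shift.
 * Note: >> on a negative value is implementation-defined in C; this
 * sketch assumes the usual arithmetic (sign-preserving) shift.
 */
static void packed_fixmul_clamp(int16_t c[LANES],
                                const int16_t a[LANES],
                                const int16_t b[LANES],
                                unsigned shift, int16_t lo, int16_t hi)
{
    assert(shift < 32);
    for (int i = 0; i < LANES; i++) {
        int32_t wide = (int32_t)a[i] * (int32_t)b[i];
        c[i] = (int16_t)clamp_i32(wide >> shift, lo, hi);
    }
}
```

With 8.8 fixed-point values and shift=8, 1.5 * 2.0 (384 * 512) comes
out as 768, which is 3.0, and a product too large for the lane is
clamped to the maximum instead of wrapping.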

>> New CPUID Flags:
>> SIMD_INTEGER8
>> SIMD_INTEGER16
>> SIMD_INTEGER32
>> SIMD_INTEGER64
>> SIMD_HALF_PRECISION_FLOAT
>> SIMD_SINGLE_PRECISION_FLOAT
>> SIMD_DOUBLE_PRECISION_FLOAT
>>
>>
>> New Registers: simd0 to simd15
>> These Registers are 128-bits in size, and are used to perform a bulk
>> of the SIMD Math.
>
> well, you can maybe get one or two of those.
>
> my proposal is that SIMD would use the same register space as the FPU.
> it makes a lot more sense (considering the design of the SH FPU) than it
> did for x87/MMX (where basically the two subsystems were at odds with
> each other and effectively couldn't be used simultaneously).
>
> granted, the actual register ordering gets a little convoluted if one
> considers vectors:
> * my designs had defined XR0=(FR1,FR0,XF1,XF0); XR2=(FR3,FR2,XF3,XF2), ...
> * I had considered a full 16x SIMD registers, but am now leaning towards 8.
> ** size of the register space can itself be a problem.
> ** the memory blocks needed for registers are generally more scarce than
> those for normal memory;
> ** ...
>

I did not know about the register space constraints. To save both
register space and instruction space, the SIMD instructions I
recommended can completely replace the old floating-point unit. This
means that FPR0_bank0 to FPR15_bank0 and FPR0_bank1 to FPR15_bank1
would become SIMD0_bank0 to SIMD15_bank0 and SIMD0_bank1 to
SIMD15_bank1, each 128 bits in size. Also, to save space, one bank of
the SIMD registers can be removed.
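
To make the cost concrete, here is a small C model of that register
file. The type names simd_reg_t and simd_regfile_t are hypothetical,
and the union is only a way to show the different lane views of one
128-bit register, not a statement about how the hardware would be
built. Half-precision values have no standard C type, so they would
travel as raw bits in the u16 lanes.

```c
#include <stdint.h>

/* One 128-bit SIMD register, viewed as any of the proposed lane types. */
typedef union {
    uint8_t  u8[16];
    uint16_t u16[8];   /* also carries half-precision bit patterns */
    uint32_t u32[4];
    uint64_t u64[2];
    float    f32[4];
    double   f64[2];
} simd_reg_t;

/* Two banks of sixteen registers, as in the proposal above. */
typedef struct {
    simd_reg_t bank[2][16];
} simd_regfile_t;
```

Two banks come to 2 * 16 * 128 = 4096 bits (512 bytes), and dropping
one bank halves that to 2048 bits. If the existing FPU registers are
32 bits each as on SH4, the current two banks hold 1024 bits, so even
a single SIMD bank is double the storage.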

>
>> SIMD Configuration Instructions:
>>
>> SIMD.IMODE - This configures the Integer Math mode of the Integer SIMD
>> Operations. The accepted modes should include the following:
>> * Integer Carry Mode - See ADDC and SUBC for example.
>> * Value UnderFlow and OverFlow Mode - See ADDV and SUBV  for examples.
>> * Integer Type: Signed and Unsigned values with sizes of 8-bit,
>> 16-bit, 32-bit, and 64-bits.
>>
>> SIMD.FMODE - This configures the Floating Point Mode. This can change
>> the Operation Data size, and In particular the SIMD FPU should support
>> Half-Precision ( 16-bit), Single Precision (32-bit), and Double
>> Precision (64-bit) Floating Point numbers. This also has configuration
>> settings for what happens when Floating point errors occur.
>
> packed double is probably seriously overkill at this stage, and would be
> rather expensive (unless emulated using scalar operations).
>
> assuming this sticks with the same basic register space as the SH4 FPU,
> then presumably, if SIMD and FPU coexist on the same core, float and
> double could be done with scalar ops (and/or if packed float/double ops
> exist they may internally "decay" into several scalar ops executed in
> series).
>
>

SIMD.FMODE is meant to replace configuring the FPU through the FPSCR
register. As far as double-precision floating point is concerned, a
viable solution would be to accept reduced performance on these
values: instructions in this mode would have only a single true
64-bit floating-point unit available at this time.

Now, as far as implementation is concerned, I know of a few existing
open source floating-point cores, and they are already licensed
under an ISC-styled license...
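
A minimal sketch of that idea in C, assuming a 128-bit register holds
two doubles and that the one 64-bit unit handles a single lane per
issue. The names fpu64_add, simd_fadd_f64, and fpu64_issue_count are
made up for the example; the counter only stands in for "cycles spent
in the shared unit".

```c
/* Counts how many times the single 64-bit FPU was used (illustration only). */
static unsigned fpu64_issue_count = 0;

/* Hypothetical scalar 64-bit FPU: one add per issue. */
static double fpu64_add(double x, double y)
{
    fpu64_issue_count++;
    return x + y;
}

/*
 * Packed double add on a 128-bit register (two lanes).
 * The lanes are pushed through the one scalar unit in series,
 * so the packed op costs two issues instead of one.
 */
static void simd_fadd_f64(double dst[2], const double a[2], const double b[2])
{
    for (int lane = 0; lane < 2; lane++)
        dst[lane] = fpu64_add(a[lane], b[lane]);
}
```

The result is the same as a true two-wide unit would give; the only
difference in this model is that each packed operation takes two
passes through the scalar unit.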
32-bit Floats: https://opencores.org/project,fpu
32-bit Floats: https://opencores.org/project,fpu100
64-bit Floats: https://opencores.org/project,fpu_double - This Core
may be of great use, and I wonder how much logic space is used by
this.

>
>> Data Manipulation Instructions:
>> SIMD.MOV - The MOV Instruction, but for the SIMD Registers. There is
>> no need for a SIMD.FMOV Instruction since there is only one set of
>> registers.
>> SIMD.SHUFFLE - This instruction allows for Free form byte reordering.
>> A close example of an existing instruction is the SWAP instruction.
>> SIMD.ROTL - Left Binary Shift
>> SIMD.ROTR - Right Binary Shift
>>
>> Bitwise Operations:
>> SIMD.AND - SIMD Variation of the AND instruction.
>> SIMD.NOT - SIMD Variation of the NOT instruction.
>> SIMD.OR  - SIMD Variation of the OR instruction.
>> SIMD.XOR - SIMD Variation of the XOR instruction
>
> yes, probably.
>
>
>> Integer Arithmetic:
>> SIMD.ADD - See: ADD Instruction.
>> SIMD.SUB - See: SUB Instruction
>> SIMD.MUL - Integer Multiplication.
>> SIMD.DIV - Integer Division.
>>
>> Integer Composite Instructions:
>> SIMD.ABS - Absolute Value.
>> SIMD.MIN - Minimum Value
>> SIMD.MAX - Maximum Value
>>
>> Integer Comparison:
>> SIMD.CMP/EQ - Data Parallel version of CMP/EQ
>> SIMD.CMP/GT - Data Parallel version of CMP/GT
>>
>> Floating Point Arithmetic:
>> SIMD.FADD - See: FADD Instruction.
>> SIMD.FSUB - See: FSUB Instruction.
>> SIMD.FMUL - See: FMUL Instruction.
>> SIMD.FDIV - See: FDIV Instruction.
>>
>> Floating Point Composite Instructions:
>> SIMD.FABS - Floating Point Absolute Value.
>> SIMD.FMIN - Floating Point Minimum Value.
>> SIMD.FMAX - Floating Point Maximum Value.
>> SIMD.FSQRT - Floating Point Square Root.
>>
>> Floating Point Comparisons:
>> SIMD.FCMP/EQ - Data Parallel version of FCMP/EQ
>> SIMD.FCMP/GT - Data Parallel version of FCMP/GT
>
> plausible enough.
>
>
>> Data Loading/Conversion Instructions:
>> Bulk Conversion From Integers to Floats, and Floats to integers is a
>> must. That said, I'm not exactly sure how many Instructions are needed
>> for this, but It would be reasonable to say that four to seven
>> instructions may be required.
>
> issue isn't so much number of instructions, but how they are encoded;
> not all bit patterns are created equal, and so:
> * different sets of operations can coexist if they have mutually
> exclusive operating modes;
> * operations with fewer/smaller operands are cheaper;
> * ...
>
> in terms of space for 2-register 16-bit operations:
> * there is pretty much no space left for these;
> * there is a little more space for ops with single (4-bit) operands.
> * ...
>
> if done simply with mode bits, and overlaps the FPU, you basically have
> the Fxxx space (or about a 12 bit space), which needs to encode all
> possible SIMD ops (and their operands).
>
> otherwise, 16-bit land is basically already pretty full.
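
To put rough numbers on the Fxxx space described above, here is a
short C sketch. It assumes the field layout the existing SH FPU
instructions use (top nibble F, then Rn, Rm, and a 4-bit sub-opcode,
as in FADD FRm,FRn = 1111nnnnmmmm0000). The struct and function names
are hypothetical.

```c
#include <stdint.h>

/* Fields of a 16-bit Fxxx instruction with two register operands. */
typedef struct {
    unsigned rn;  /* bits 11..8: destination register */
    unsigned rm;  /* bits 7..4:  source register      */
    unsigned op;  /* bits 3..0:  sub-opcode           */
} fxxx_fields_t;

enum {
    FXXX_FREE_BITS        = 12,  /* 16 bits minus the fixed 'F' nibble   */
    FXXX_REG_BITS         = 4,   /* 16 registers need 4 bits per operand */
    /* Two register fields leave 4 bits: 16 two-register ops in total.   */
    FXXX_TWO_REG_SLOTS    = 1 << (FXXX_FREE_BITS - 2 * FXXX_REG_BITS),
    /* Giving up one two-register slot frees its Rm field: 16 one-register ops. */
    FXXX_ONE_REG_PER_SLOT = 1 << FXXX_REG_BITS
};

/* True if the instruction word lies in the Fxxx block. */
static int is_fxxx(uint16_t insn)
{
    return (insn >> 12) == 0xF;
}

/* Split an Fxxx instruction word into its three 4-bit fields. */
static fxxx_fields_t decode_fxxx(uint16_t insn)
{
    fxxx_fields_t f;
    f.rn = (insn >> 8) & 0xF;
    f.rm = (insn >> 4) & 0xF;
    f.op = insn & 0xF;
    return f;
}
```

So within one operating mode there is room for only 16 two-register
operations in that block, which is why a list as long as the one I
proposed would have to lean on mode bits or a wider encoding.
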
>
>
> ( decided to omit some stuff )
>
> in my case, I am using some parts of the 16-bit space for other things:
>      82xx, reserved for now (probably for more opcode space);
>      83xx, "MOV.L @(SP, disp)" (stretches locals up to 31, sometimes useful)
>          there are a lot of functions with a lot of locals basically
>          "register thrashing".
>          if at least the 32-most common locals have 16-bit forms, this
>          saves a bit of space vs just 16.
>          displaces 2A's "JSR/N @(TBR,disp)", but TBR doesn't exist in my
>          ISA.
>      86xx, "FMOV.S @(SP, disp)" (disp=0..15)
>      87xx, "FMOV.S @(SP, disp)" (disp=16..31)
>          these also save some space.
>          these displace some SH2A bit-twiddly operations (which seemed
>          overly niche).
>          these could be dropped if something better comes along.
>      8Axx, "MOV #imm24, R0"
>          actually pretty useful for things like relative addressing and
>          calls/branches.
>          is big enough to handle pretty much all intra-binary
>          displacements in most cases.
>      8Cxx
>          used for some extended opcode blocks.
>      8Exx
>          used for the vast majority of my extended ISA ops.
>          this space is pretty big vs the 16-bit space, but sadly still
>          not exactly infinite.
>
> some blocks only exist in certain operating modes:
>      CCxx (ops which only exist in 64-bit mode, and uROM mode)
>      CDxx (more 64b and uROM ops)
>      CExx (64-bit mode, possible 64-bit instruction forms, TBD, 1)
>      CFxx (64-bit mode, possible 48-bit instruction forms, TBD, 1)
>
> in the normal 32-bit mode, these ranges encode some "@(GBR, R0)"
> instructions which are otherwise basically unused (and in some of my
> stuff GBR is re-purposed to serve a similar role to FS or GS in x86-64).
>
>
> 1: I had previously decided that no larger I-forms will exist in 32-bit
> mode, but may consider them for 64-bit mode (mostly for large immediate
> values). this is mostly due to matters of how the I-cache works
> (limiting opcodes to 32 bits allows for a somewhat cheaper I-cache; but,
> also hinders efficient handling of large constants).
>
> some other scattered ops were used for other instructions, a few of
> which have turned out pretty useful.
>      eg: PC-relative load/store ops to reduce the cost of accessing globals.
>          mostly function as combiner ops to form a larger "pseudo
>          instruction".
>
>
>