[J-core] [RFC] SIMD extension for J-Core
Ken Phillis Jr
kphillisjr at gmail.com
Thu Oct 26 17:12:02 EDT 2017
On Wed, 25 Oct 2017 23:08:30 -0500, BGB wrote:
> On 10/25/2017 6:59 PM, Ken Phillis Jr wrote:
>> I figure I will include my Idea for an easy to use SIMD Extension for
>> the J-Core. In general this extension will be used for Three Types of
>> number formats...
>> Format 1: Signed, and Unsigned Integers
>> Format 2: Floating Point Values
>> Format 3: Fixed Point Values - This is low priority, and most likely
>> will be dropped unless sample applications requiring this
>> functionality can be found. Also, in many cases, this can be simulated
>> using existing floating point and integer math routines.
> errm, usually the fixed-point *is* the packed integer operations.
> there are edge-cases, for example a fixed-point multiply is usually done
> as (X*Y>>SH); sometimes the shift is omitted or delayed depending on how
> it is used; or will produce a wider result and only keep the high order
> bits (typically if working with unit-scale values).
> possibly there could be an op that does, say:
> c[i]=clamp((a[i]*b[i])>>shift, min, max);
> which could more easily allow some things which can't readily be done
> with packed multiply ops which keep only the low or high bits.
> such an operation would probably require a 32-bit instruction-format
> though to represent effectively.
I know this, and the SIMD Instructions probably should be expanded to
allow for Fixed point math.
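
For reference, the clamped fixed-point multiply described above can be
sketched in plain C. This is only an illustration under assumptions: the
function name fx_mul_clamp, the 4 x 16-bit lane layout, and passing
shift/min/max as arguments are all hypothetical, not part of any
proposed encoding.

```c
#include <stdint.h>

/* Clamp a 32-bit intermediate into [lo, hi]. */
static int32_t clamp_i32(int32_t v, int32_t lo, int32_t hi)
{
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

/*
 * Hypothetical 4-lane fixed-point multiply on 16-bit lanes:
 *   c[i] = clamp((a[i] * b[i]) >> shift, lo, hi)
 * The product is formed at 32 bits so no bits are lost before the
 * shift. Note: >> on a negative value is implementation-defined in C;
 * this sketch assumes the usual arithmetic shift.
 */
static void fx_mul_clamp(int16_t c[4], const int16_t a[4],
                         const int16_t b[4], unsigned shift,
                         int32_t lo, int32_t hi)
{
    for (int i = 0; i < 4; i++) {
        int32_t wide = (int32_t)a[i] * (int32_t)b[i];
        c[i] = (int16_t)clamp_i32(wide >> shift, lo, hi);
    }
}
```

With shift = 8 (a Q8.8 format), 384 * 512 (that is, 1.5 * 2.0) gives
768 (3.0), while 32767 * 32767 overflows the lane and is clamped to the
supplied maximum instead of wrapping.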
>> New CPUID Flags:
>> New Registers: simd0 to simd15
>> These Registers are 128-bits in size, and are used to perform a bulk
>> of the SIMD Math.
> well, you can maybe get one or two of those.
> my proposal is that SIMD would use the same register space as the FPU.
> it makes a lot more sense (considering the design of the SH FPU) than it
> did for x87/MMX (where basically the two subsystems were at odds with
> each other and effectively couldn't be used simultaneously).
> granted, the actual register ordering gets a little convoluted if one
> considers vectors:
> * my designs had defined XR0=(FR1,FR0,XF1,XF0); XR2=(FR3,FR2,XF3,XF2), ...
> * I had considered a full 16x SIMD registers, but am now leaning towards 8.
> ** size of the register space can itself be a problem.
> ** the memory blocks needed for registers are generally more scarce than
> those for normal memory;
> ** ...
I did not know about the register space constraints. To save on both
register space and instruction space, the SIMD instructions I
recommended could completely replace the old floating point
unit. This means that FPR0_bank0 to FPR15_bank0 and FPR0_bank1 to
FPR15_bank1 would become the following registers:
SIMD0_bank0 to SIMD15_bank0 and SIMD0_bank1 to SIMD15_bank1, all
128 bits in size. Also, to save space, one bank of the SIMD registers
could be removed.
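
To make the size of that register file concrete, here is a rough C
model of one 128-bit register and one bank of sixteen. The type names
simd_reg and simd_bank and the choice of lane views are assumptions
made for illustration; nothing here is a defined J-core layout.

```c
#include <stdint.h>

/* Hypothetical view of one 128-bit SIMD register as different lanes. */
typedef union {
    uint8_t  u8[16];   /* 16 x 8-bit integer lanes                  */
    uint16_t u16[8];   /* 8 x 16-bit lanes (also half-float storage) */
    uint32_t u32[4];   /* 4 x 32-bit integer lanes                  */
    uint64_t u64[2];   /* 2 x 64-bit integer lanes                  */
    float    f32[4];   /* 4 x single precision (assumes 32-bit float) */
    double   f64[2];   /* 2 x double precision (assumes 64-bit double) */
} simd_reg;

/* One bank of sixteen registers, as in the proposal above. */
typedef struct {
    simd_reg r[16];
} simd_bank;
```

One such bank is 16 x 128 = 2048 bits, four times the 512 bits of a
bank of sixteen 32-bit FP registers, which is why dropping the second
bank matters for the register space concern raised above.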
>> SIMD Configuration Instructions:
>> SIMD.IMODE - This configures the Integer Math mode of the Integer SIMD
>> Operations. The accepted modes should include the following:
>> * Integer Carry Mode - See ADDC and SUBC for example.
>> * Value UnderFlow and OverFlow Mode - See ADDV and SUBV for examples.
>> * Integer Type: Signed and Unsigned values with sizes of 8-bit,
>> 16-bit, 32-bit, and 64-bits.
>> SIMD.FMODE - This configures the Floating Point Mode. This can change
>> the Operation Data size, and In particular the SIMD FPU should support
>> Half-Precision ( 16-bit), Single Precision (32-bit), and Double
>> Precision (64-bit) Floating Point numbers. This also has configuration
>> settings for what happens when Floating point errors occur.
> packed double is probably seriously overkill at this stage, and would be
> rather expensive (unless emulated using scalar operations).
> assuming this sticks with the same basic register space as the SH4 FPU,
> then presumably, if SIMD and FPU coexist on the same core, float and
> double could be done with scalar ops (and/or if packed float/double ops
> exist they may internally "decay" into several scalar ops executed in
SIMD.FMODE is meant to be a replacement for the FPSCR register.
As far as double precision floating point is concerned, a viable
solution would be to accept reduced performance on these values:
instructions in this mode would have only a single true 64-bit floating
point unit available at this time.
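
As a sketch of what "a single 64-bit unit" would mean for a packed
double operation, the C below runs both lanes through one scalar adder
in sequence. The names fpu64_add and packed_fadd_f64 are hypothetical,
and the pass counter only stands in for the extra issue slots; real
timing would depend on the pipeline.

```c
/* Stand-in for the single 64-bit floating point adder. */
static double fpu64_add(double x, double y)
{
    return x + y;
}

/*
 * Hypothetical packed double add on a 128-bit register (2 lanes).
 * With only one 64-bit unit, the lanes are processed one after the
 * other, so the operation costs two passes instead of one.
 * Returns the number of passes through the unit.
 */
static unsigned packed_fadd_f64(double dst[2], const double a[2],
                                const double b[2])
{
    unsigned passes = 0;
    for (int lane = 0; lane < 2; lane++) {
        dst[lane] = fpu64_add(a[lane], b[lane]);
        passes++;
    }
    return passes;
}
```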
Now, as far as implementation is concerned, I know a few open source
floating point cores already exist, and they are licensed under an
ISC-style license:
32-bit Floats: https://opencores.org/project,fpu
32-bit Floats: https://opencores.org/project,fpu100
64-bit Floats: https://opencores.org/project,fpu_double - This Core
may be of great use, and I wonder how much logic space is used by
>> Data Manipulation Instructions:
>> SIMD.MOV - The MOV Instruction, but for the SIMD Registers. There is
>> no need for a SIMD.FMOV Instruction since there is only one set of
>> SIMD.SHUFFLE - This instruction allows for Free form byte reordering.
>> A close example of an existing instruction is the SWAP instruction.
>> SIMD.ROTL - Left Binary Shift
>> SIMD.ROTR - Right Binary Shift
>> Bitwise Operations:
>> SIMD.AND - SIMD Variation of the AND instruction.
>> SIMD.NOT - SIMD Variation of the NOT instruction.
>> SIMD.OR - SIMD Variation of the OR instruction.
>> SIMD.XOR - SIMD Variation of the XOR instruction
> yes, probably.
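
For the SIMD.SHUFFLE idea listed above, one common way to express
free-form byte reordering is a control vector that names the source
byte for each destination byte. The sketch below assumes that form;
the function name and the use of a 16-byte control operand are
illustrative assumptions, not something the proposal specifies.

```c
#include <stdint.h>

/*
 * Hypothetical byte shuffle on a 128-bit register:
 *   dst[i] = src[ctl[i] & 15]
 * Only the low 4 bits of each control byte are used. A temporary is
 * used so that dst may alias src.
 */
static void simd_shuffle(uint8_t dst[16], const uint8_t src[16],
                         const uint8_t ctl[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = src[ctl[i] & 0x0F];
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];
}
```

A control vector of 15, 14, ..., 0 reverses the register, and a control
vector that swaps adjacent indices gives a per-halfword byte swap.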
>> Integer Arithmetic:
>> SIMD.ADD - See: ADD Instruction.
>> SIMD.SUB - See: SUB Instruction
>> SIMD.MUL - Integer Multiplication.
>> SIMD.DIV - Integer Division.
>> Integer Composite Instructions:
>> SIMD.ABS - Absolute Value.
>> SIMD.MIN - Minimum Value
>> SIMD.MAX - Maximum Value
>> Integer Comparison:
>> SIMD.CMP/EQ - Data Parallel version of CMP/EQ
>> SIMD.CMP/GT - Data Parallel version of CMP/GT
>> Floating Point Arithmetic:
>> SIMD.FADD - See: FADD Instruction.
>> SIMD.FSUB - See: FSUB Instruction.
>> SIMD.FMUL - See: FMUL Instruction.
>> SIMD.FDIV - See: FDIV Instruction.
>> Floating Point Composite Instructions:
>> SIMD.FABS - Floating Point Absolute Value.
>> SIMD.FMIN - Floating Point Minimum Value.
>> SIMD.FMAX - Floating Point Maximum Value.
>> SIMD.FSQRT - Floating Point Square Root.
>> Floating Point Comparisons:
>> SIMD.FCMP/EQ - Data Parallel version of FCMP/EQ
>> SIMD.FCMP/GT - Data Parallel version of FCMP/GT
> plausible enough.
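
One open point with the data parallel compares is where the result
goes, since the scalar CMP/EQ and CMP/GT only set the single T bit. A
common approach, assumed here purely for illustration, is to write an
all-ones or all-zeros mask per lane, which the bitwise ops can then use
for selection. Function names and the 4 x 32-bit layout are
hypothetical.

```c
#include <stdint.h>

/* Hypothetical per-lane signed compare: mask lane = ~0 if a > b. */
static void simd_cmp_gt_i32(uint32_t mask[4], const int32_t a[4],
                            const int32_t b[4])
{
    for (int i = 0; i < 4; i++)
        mask[i] = (a[i] > b[i]) ? 0xFFFFFFFFu : 0u;
}

/*
 * Per-lane maximum built from the compare mask plus AND/NOT/OR:
 *   max = (a & m) | (b & ~m)
 * The casts go through uint32_t so the bitwise ops are well defined.
 */
static void simd_max_i32(int32_t dst[4], const int32_t a[4],
                         const int32_t b[4])
{
    uint32_t m[4];
    simd_cmp_gt_i32(m, a, b);
    for (int i = 0; i < 4; i++) {
        uint32_t sel = ((uint32_t)a[i] & m[i]) | ((uint32_t)b[i] & ~m[i]);
        dst[i] = (int32_t)sel;
    }
}
```

If masks work this way, SIMD.MIN and SIMD.MAX could be composed from
the compare and bitwise instructions rather than needing encodings of
their own.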
>> Data Loading/Conversion Instructions:
>> Bulk Conversion From Integers to Floats, and Floats to integers is a
>> must. That said, I'm not exactly sure how many Instructions are needed
>> for this, but It would be reasonable to say that four to seven
>> instructions may be required.
> issue isn't so much number of instructions, but how they are encoded;
> not all bit patterns are created equal, and so:
> * different sets of operations can coexist if they have mutually
> exclusive operating modes;
> * operations with fewer/smaller operands are cheaper;
> * ...
> in terms of space for 2-register 16-bit operations:
> * there is pretty much no space left for these;
> * there is a little more space for ops with single (4-bit) operands.
> * ...
> if done simply with mode bits, and overlaps the FPU, you basically have
> the Fxxx space (or about a 12 bit space), which needs to encode all
> possible SIMD ops (and their operands).
> otherwise, 16-bit land is basically already pretty full.
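
To put numbers on the Fxxx point: after the leading F nibble there are
12 bits, and two 4-bit register fields leave only 4 bits, or 16
two-register operations per mode. The decode sketch below assumes the
field order used by the existing SH FPU two-register ops (n, then m,
then a 4-bit opcode); the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Fields of a 16-bit instruction of the form 1111 nnnn mmmm xxxx. */
typedef struct {
    unsigned n;   /* destination register field */
    unsigned m;   /* source register field      */
    unsigned op;  /* 4-bit operation selector   */
} fxxx_fields;

/*
 * Split an instruction word in the Fxxx block into its fields.
 * Returns 1 if the word is in the Fxxx block, 0 otherwise.
 */
static int decode_fxxx(uint16_t insn, fxxx_fields *out)
{
    if ((insn >> 12) != 0xF)
        return 0;
    out->n  = (insn >> 8) & 0xF;
    out->m  = (insn >> 4) & 0xF;
    out->op = insn & 0xF;
    return 1;
}
```

With only 16 values of op available, the list of SIMD instructions
proposed earlier would not fit in a single mode, so some combination of
mode bits or 32-bit encodings would be needed.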
> ( decided to omit some stuff )
> in my case, I am using some parts of the 16-bit space for other things:
> 82xx, reserved for now (probably for more opcode space);
> 83xx, "MOV.L @(SP, disp)" (stretches locals up to 31, sometimes useful)
> there are a lot of functions with a lot of locals basically
> "register thrashing".
> if at least the 32-most common locals have 16-bit forms, this
> saves a bit of space vs just 16.
> displaces 2A's "JSR/N @(TBR,disp)", but TBR doesn't exist in my
> 86xx, "FMOV.S @(SP, disp)" (disp=0..15)
> 87xx, "FMOV.S @(SP, disp)" (disp=16..31)
> these also save some space.
> these displace some SH2A bit-twiddly operations (which seemed
> overly niche).
> these could be dropped if something better comes along.
> 8Axx, "MOV #imm24, R0"
> actually pretty useful for things like relative addressing and
> is big enough to handle pretty much all intra-binary
> displacements in most cases.
> used for some extended opcode blocks.
> used for the vast majority of my extended ISA ops.
> this space is pretty big vs the 16-bit space, but sadly still
> not exactly infinite.
> some blocks only exist in certain operating modes:
> CCxx (ops which only exist in 64-bit mode, and uROM mode)
> CDxx (more 64b and uROM ops)
> CExx (64-bit mode, possible 64-bit instruction forms, TBD, 1)
> CFxx (64-bit mode, possible 48-bit instruction forms, TBD, 1)
> in the normal 32-bit mode, these ranges encode some "@(GBR, R0)"
> instructions which are otherwise basically unused (and in some of my
> stuff GBR is re-purposed to serve a similar role to FS or GS in x86-64).
> 1: I had previously decided that no larger I-forms will exist in 32-bit
> mode, but may consider them for 64-bit mode (mostly for large immediate
> values). this is mostly due to matters of how the I-cache works
> (limiting opcodes to 32 bits allows for a somewhat cheaper I-cache; but,
> also hinders efficient handling of large constants).
> some other scattered ops were used for other instructions, a few of
> which have turned out pretty useful.
> eg: PC-relative load/store ops to reduce the cost of accessing globals.
> mostly function as combiner ops to form a larger "pseudo