[J-core] [RFC] SIMD extension for J-Core

Thu Oct 26 10:50:32 EDT 2017

On 10/26/2017 8:59 AM, Christopher Friedt wrote:
>
>
> On Oct 26, 2017 9:04 AM, "emanuel stiebler" <emu at e-bbes.com 
> <mailto:emu at e-bbes.com>> wrote:
>
>     On 2017-10-25 17:59, Ken Phillis Jr wrote:
>
>         New CPUID Flags:
>         SIMD_INTEGER8
>         SIMD_INTEGER16
>         SIMD_INTEGER32
>         SIMD_INTEGER64
>         SIMD_HALF_PRECISION_FLOAT
>         SIMD_SINGLE_PRECISION_FLOAT
>         SIMD_DOUBLE_PRECISION_FLOAT
>
>
>     Just a short one,
>     for the floating points, I would prefer
>
>     SIMD_FLOAT16
>     SIMD_FLOAT32
>     SIMD_FLOAT64
>     SIMD_FLOAT128
>
>
> I would suggest shortening those further.
>
> simd.u8
> simd.s16
> simd.f32
> simd.f16
>

yeah, we don't need Double/Float64 SIMD.

Float128 would be pretty absurd given that hardly any hardware currently 
does it natively (also it would be crazy expensive, a full-width MAC unit 
quite possibly wouldn't fit into the FPGA...).

> * dot notation is a bit nicer
> * u := unsigned
> * s := signed
> * f32 := IEEE 754 32-bit float
> * f16 := half precision
> * similarly, f64, u64, ...
>
> BGB - could you mention on the list how the FPU design differs between 
> SH and x86/mmx ?
>

x87 had 8x 80-bit FPU registers organized as a stack, so you couldn't 
perform operations between arbitrary registers; instead, operations went 
through the top of the stack, with a lot of push/pop/swap/...

MMX allowed using them instead as 8x 64-bit registers, but doing so 
required entering/exiting MMX mode, which would basically trash the FPU 
registers.

SSE basically cleaned up this mess by adding its own separate registers, 
and since then compilers have mostly abandoned x87.

the SH style FPU uses two banks of 16x 32-bit registers (labeled 
FR0..FR15 and XF0..XF15; though some SH variants only have a single bank);
operations on these registers are basically direct register operations 
(more like GPRs, or like the more modern x86-64 strategy of doing all 
the FPU stuff using SSE);
the SH FPU also implements Double operations by working on pairs of 
registers (kind of funky, but works).
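
The register pairing can be sketched in C. This is only a minimal model 
under stated assumptions: fpu_bank, read_dr and write_dr are hypothetical 
names, the even register of each pair is assumed to hold the high 32 bits 
(as I recall the SH-4 convention), and the host is assumed to use IEEE 754 
64-bit doubles.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one FPU bank: 16 x 32-bit registers (raw bits). */
typedef struct {
    uint32_t fr[16];   /* FR0..FR15 */
} fpu_bank;

/* Read Double register n (0..7) from the pair FR(2n), FR(2n+1).
   Assumption: the even register holds the high 32 bits. */
static double read_dr(const fpu_bank *b, unsigned n)
{
    uint64_t bits;
    double d;
    assert(n < 8);
    bits = ((uint64_t)b->fr[2 * n] << 32) | b->fr[2 * n + 1];
    memcpy(&d, &bits, sizeof d);   /* reinterpret bits, no conversion */
    return d;
}

/* Write Double register n by splitting the value across the pair. */
static void write_dr(fpu_bank *b, unsigned n, double d)
{
    uint64_t bits;
    assert(n < 8);
    memcpy(&bits, &d, sizeof bits);
    b->fr[2 * n]     = (uint32_t)(bits >> 32);
    b->fr[2 * n + 1] = (uint32_t)bits;
}
```

In this model a Double is just a view onto two Float registers, so 
writing one changes what the two Float registers hold.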

so, unlike x87, there is no metadata and no stack, and likewise no need 
to clear the registers if moving between SIMD mode and FPU mode (if 
desired).
likewise, by sharing the register state, it would make it possible to do 
arbitrary scalar operations within vector elements, which is not 
something readily supported by SSE.
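
A rough C sketch of what the shared register state buys: a vector op and 
a plain scalar op touching the same storage with no mode switch. All 
names here (fpu_bank, vec4_add, fmul) are hypothetical, and the mapping 
of a 4x f32 vector onto four consecutive Float registers is an assumption 
made for illustration, not the actual encoding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: one bank of 16 x 32-bit FPU registers (raw bits). */
typedef struct {
    uint32_t fr[16];
} fpu_bank;

static float get_fr(const fpu_bank *b, unsigned n)
{
    float f;
    assert(n < 16);
    memcpy(&f, &b->fr[n], sizeof f);
    return f;
}

static void set_fr(fpu_bank *b, unsigned n, float f)
{
    assert(n < 16);
    memcpy(&b->fr[n], &f, sizeof f);
}

/* Assumed mapping: 4 x f32 vector Vn occupies FR(4n)..FR(4n+3). */
static void vec4_add(fpu_bank *b, unsigned vd, unsigned vs)
{
    unsigned i;
    assert(vd < 4 && vs < 4);
    for (i = 0; i < 4; i++)
        set_fr(b, 4 * vd + i,
               get_fr(b, 4 * vd + i) + get_fr(b, 4 * vs + i));
}

/* Ordinary scalar multiply FRn *= FRm; works on any vector element,
   because vectors and scalars live in the same registers. */
static void fmul(fpu_bank *b, unsigned n, unsigned m)
{
    set_fr(b, n, get_fr(b, n) * get_fr(b, m));
}
```

For example, after vec4_add on V0, calling fmul(&bank, 2, 2) squares 
only element 2 of V0 and leaves the other three elements alone.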

one partial limitation, though, comes from how the "bank" mechanism 
works: with 128-bit vectors, it would hinder operations between the high 
and low halves of a vector when using 16-bit instructions (my SIMD 
extensions would remedy this by allowing 32-bit FPU I-forms to access 
all 32 float registers, rather than only banks of 16 registers; with 
8x SIMD registers this basically means direct access to the entire space).

likewise, the space can also be used as 16x 64-bit vectors, given the 
vectors themselves are implemented via register pairs (and in this case 
are equivalent to the Double registers).

a minor extension to the FPU is that Double operations may access all 
16x Double registers at the same time;
likewise, I "borrowed" the feature allowing both SZ and PR bits to be set, 
allowing moving Doubles to/from memory in the correct order (usually 
these require doing the loads/stores as pairs of instructions).

sadly also, the FPU design requires frequently toggling bits in the FPSCR 
register (to move between Float/Double), but I added an op to load the 
relevant bits in the register directly (vs, say, needing to do a 
constant load and then shove the desired FPU state into this register).
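
The mode switch boils down to clearing and setting two FPSCR bits. A 
minimal sketch in C, assuming the SH-4 bit layout (PR at bit 19, SZ at 
bit 20); fpscr_set_mode is a hypothetical helper that only models the 
effect on the register value, not the new op or its encoding.

```c
#include <stdint.h>

/* Assumed SH-4 FPSCR bit positions. */
#define FPSCR_PR (1u << 19)  /* precision: 0 = Float, 1 = Double     */
#define FPSCR_SZ (1u << 20)  /* transfer size: 0 = 32-bit, 1 = 64-bit */

/* Return fpscr with PR and SZ replaced by the requested values;
   all other bits (rounding mode, flags, ...) are left untouched. */
static uint32_t fpscr_set_mode(uint32_t fpscr, int pr, int sz)
{
    fpscr &= ~(FPSCR_PR | FPSCR_SZ);
    if (pr)
        fpscr |= FPSCR_PR;
    if (sz)
        fpscr |= FPSCR_SZ;
    return fpscr;
}
```

Without a direct op, the same effect needs the new register value to be 
built in a GPR first and then moved into FPSCR.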

> Having ported FFTW over to ARM neon I'm extensively familiar with it 
> and know it is strikingly similar to mmx. I know that (at least on 
> ARM) the contention you've mentioned is quite significant. Pipeline 
> stalls must be precisely inserted to ensure correct results for simd 
> instructions are obtained at the correct times, etc. Vector loads and 
> stores, cache-prefetching, and i/o alignment were critical.
>
> For A8 that meant fine-tuning the instructions generated by the 
> compiler. There were also some memory barriers, iirc. Mostly 
> hand-written assembly. Intrinsics were ~ meh.
>
> I did work on it before the A9 OOO pipeline was introduced, but also 
> know that having an out-of-order unit helped simd on ARM.
>
> Again, I'm really curious how the FPU design differs, because if SH / 
> J-Core can avoid that mess, it would be better off.
>

sadly, basically out of time, so no response.

