[J-core] 2D unit

Cedric BAIL cedric.bail at free.fr
Mon Feb 27 19:33:39 EST 2017

On Mon, Feb 27, 2017 at 3:41 PM, BGB <cr88192 at gmail.com> wrote:
> On 2/27/2017 3:35 PM, Cedric BAIL wrote:
>> On Sun, Feb 26, 2017 at 9:35 PM, BGB <cr88192 at gmail.com> wrote:
>>> On 2/26/2017 7:29 PM, Cedric BAIL wrote:
>>>> We had Thursday a nice meetup after ELC and we came up with some
>>>> ideas. Not too sure if they are good, as the food and drink may
>>>> have helped :-)
>>>> One of the ideas we discussed in some depth was how to design a
>>>> nice 2D unit that would go with the turtle. The main issue, as
>>>> discussed before on this mailing list, is that most toolkits won't
>>>> care about anything but OpenGL and Vulkan these days. Obviously
>>>> these standards are way too big and complex for us at this stage.
>>> OpenGL isn't actually all that complicated per-se (yes, granted, it looks
>>> pretty scary up-front), but what makes things more complicated here is
>>> mostly trying to make it fast enough to be useful.
>> Trying to just follow the standard properly is already a daunting
>> task. Look at a very good example: Intel took years to reach a point
>> where it was covering the standard properly, and they are only now
>> catching up on speed too. The amount of work to be done on the
>> software side is multiple years before you can claim compliance with
>> OpenGL. Vulkan may be slightly easier there, but I haven't looked at
>> it enough to say so (and you may be able to implement OpenGL on top
>> of Vulkan, but not OpenCL for now, and you also can't do compositing
>> with Vulkan at the moment).
>> I am talking here about the OpenGL used today, the one with shaders
>> everywhere. That is pretty demanding both on the software stack (to
>> generate them efficiently) and on the GPU side, which needs all the
>> necessary hardware and cycles to do something useful. I think you
>> are underestimating the work needed to get from "I got something on
>> screen" to "Qt and Steam can use it without crashing".
> I got Quake 3 Arena to work mostly ok with a custom-written OpenGL.
> granted, this was a subset of OpenGL 1.x, as it mostly focused on the
> features that Quake 3 used, and skipped out on stuff that wasn't used (such
> as display lists, etc...).

OpenGL 1.0 was released in 1992 and the last release of the 1.x branch
came in 2003. None of them had a programmable pipeline, which is what
every toolkit and game in use today relies on. Personally, I see no
interest in anything below Vulkan, as most toolkits and game engines
will have moved to it, or to a fully programmable pipeline, by the
time we are done building any hardware.


> another SW GL implementation (Mesa's LLVMpipe) had used LLVM basically to
> dynamically compile parts of the software-rendering pipeline, which is a
> fair bit fancier (than just doing a plain-C software rasterizer).

Yes, but it is still super damn slow and would require a massive
rewrite to target a non-classic CPU architecture. Also, LLVM is a
pretty bad solution for GPU compilation, as the goals of LLVM and of a
shader compiler are not aligned, which is why Mesa has its own IR and
its own optimization passes on it.


>>>> I do like this idea a lot, as it opens the possibility for a lot
>>>> of hacking here. You could maybe manage to generate a JIT version
>>>> per frame instead of relying on fixed functions, and manage a
>>>> larger number of "hardware" planes. Implementing a mini scenegraph
>>>> would make it possible to correctly detect when not to composite
>>>> to a surface and reduce the bandwidth needed. That is the most
>>>> obvious idea; opening up firmware development is, I am sure, going
>>>> to lead to interesting ideas.
>>> FWIW: it is worth noting that a lot of mobile GPUs are effectively
>>> multicore processors with proprietary SIMD-oriented ISAs (just
>>> usually the ISA is held under NDAs and under the sole control of
>>> the GPU driver, which is often given out as "binary blobs" or
>>> similar).
>>> an open GPU could be nice.
>>> SIMD could be pretty useful for performance (particularly
>>> packed-byte and packed-word), but probably isn't critical (it is
>>> sort-of possible to approximate SIMD using plain integer ops via
>>> some additional bit-twiddly).
>> Well, if I were to design a GPU, I would start by designing a
>> proper vector extension. A typical trick that I think GPUs actually
>> do internally (I have no way to know if that is true) is to handle
>> variable-length requests: something where you can rerun a previous
>> instruction with a different mask. Also being able to query the
>> number of available vector units could be neat (with a given
>> minimum, like all your vector registers being a multiple of 32 or
>> 64 bits). Pseudo code would be:
>> start:
>> previous_state = vector_process_start(&data_count);
>> v1 = load_mem(@somewhere);
>> v2 = load_mem(@somewhere_else);
>> v3 = load_mem(@somewhere_different);
>> v4 = !v1;
>> v5 = v4 ? v2 + v3 : v5;
>> vector_process_end(previous_state);
>> if (data_count) goto start;
>> I have no idea if any previous vector-processing CPU followed such a
>> design. Still, from a software point of view this has really a lot
>> of benefits. No pre- or post-processing loops are necessary. As
>> every instruction can be marked with a test, you can linearly
>> explore both branches of an if inside a loop without jumping or
>> stopping processing in the vector unit. As I said, this would
>> require some prior-art research.
>> This combined with a J2 could be a nice base core for a GPU (if the
>> vector unit has 16-, 32- and 64-bit FPUs). You would assemble a
>> bunch of them in parallel (one core would then be dedicated to one
>> tile). If you want to optimize for energy consumption, you could
>> turn off cores and reduce the size of the vectors you are
>> processing (could be fun to see if the software can figure out in
>> advance how much data it has to process and reduce the number of
>> units running ahead of time).
>> Anyway, if you were to follow such a design, adding a vector
>> coprocessor to the j-core would be a first needed step. It would be
>> interesting to see if there is useful prior art here. I would have
>> loved to ideally start from Mesa, take a few games and libraries,
>> and see what they use the most out of shaders to make the vector
>> unit fit its use better, but I am afraid this would lead to a
>> patent mess.
> dunno. earlier I wrote up an idea for a fairly simplistic SIMD extension for
> SH, which would basically reuse the FPU registers and opcodes (via currently
> reserved FPSCR bits).

The main issue with SIMD extensions is the pre- and post-loops, which
consume a fair amount of code and massively reduce the benefit. Most C
code is not properly annotated to let the compiler remove them. This
leads to situations where, as far as the compiler knows, the cost of
running a pre/post loop is higher than going with scalar code, so it
abandons the SIMD path, even though it would have been perfectly
viable had the compiler known the data was aligned and of the proper
size. The only solution around that is to find a technical design
where pre- and post-loops are irrelevant whatever the compiler knows
about the code. This way, the compiler will have more opportunities to
use vector operations.

Compilers today are pretty good at finding what to parallelize, but
CPU instruction sets aren't helping much. Having a vector unit that
existing compilers can't take much advantage of would be pretty
problematic for the j-core, as there is little chance of having enough
resources to go over all the assembly-optimized source code out there
and add a new variant to support it. Basically, I think designing the
instruction set around what GCC and LLVM can exploit, rather than
around what a human can do, would fit this project better.

> https://github.com/cr88192/bgbtech_shxemu/wiki/SHx-Mini-SIMD
> supporting arbitrary shader code would be harder though, as one can't just
> casually give them limited-range fixed-point and expect them to work.
> probably at a minimum would need 16-bit half-floats.
> possibly there could be an alternate SIMD mode which supports packed
> half-floats (to be friendlier to GLSL).

GLSL and Vulkan shaders would require support for 16-, 32- and 64-bit
floats.

> could still beat on the idea more.
>>>> We can discuss the benefit of having a specific instruction to do
>>>> blending, or of enabling the full-sized multiply, but I think that
>>>> is something we can experiment with later with the turtle to see
>>>> what works best. So let's just focus for now on whether this is a
>>>> good idea, and maybe on whether we can apply the same concept to
>>>> other units (network, audio, tpm?). Is it not going to consume too
>>>> much space on the FPGA? Do we really need all that flexibility?
>>>> So what do you think of this idea?
>>> I generally agree with the idea, but can't say much for how much
>>> one can fit on an FPGA.
>>> as I understand it, it shouldn't be too hard to modify an ALU to
>>> be able to do packed ADD/SUB (mostly need to disable some internal
>>> carry bits in the adder, ...).
>>> one probably will also need an operation to multiply packed bytes
>>> or words and only keep the high order bits (as this comes up a lot
>>> in blending and is problematic to implement efficiently using
>>> normal integer operations).
>>> PADDB Rm, Rn    //packed add bytes
>>> PADDW Rm, Rn    //packed add words
>>> PMULHW Rm, Rn   //multiply packed words and keep high 16 bits of
>>> intermediate results.
>>> ...
>>> for packed-word, could be nice to be able to also operate on
>>> pairs, ex:
>>> PMULHW VR1, VR3 //does the same as: "PMULHW R2, R6; PMULHW R3, R7"
>>> ex, the VRn registers could be 64-bit pairs of 32-bit registers.
>>> maybe they could be done with FRn registers (to avoid taking up
>>> GPR space), though ideally in this case there would be a way to
>>> move values more directly between Rn and FRn registers (without
>>> needing to go through FPUL or similar).
>>> PADDW FRm, FRn
>>> and:
>>> PADDW DRm, DRn
>>> hell, maybe FPSCR.PR and FPSCR.SZ control the behavior of SIMD
>>> operations or something.
>>> just a random idea, little to say it is actually viable.
>> As for a blend instruction, I was more thinking about just taking
>> two colors in a premultiplied RGBA world and doing the 8-bit
>> multiplications there, which looks like:
>> Dest[{R,G,B}] = Source[{R,G,B}] + Dest[{R,G,B}] * (1 - Source[A]);
>> Dest[A] = Source[A] + Dest[A] * (1 - Source[A]);
>> Maybe a color interpolation instruction could be useful too, but
>> anyway, the idea is to keep it simple and dedicated to the task at
>> hand for a 2D unit, not to design a generic vector unit. And even
>> then, I would benchmark this against using just 32-bit
>> multiplication with bitmasks and shifts, as that might be enough,
>> and increasing the complexity of the hardware may not even be
>> needed.
> I suspect by the time one has the gates needed to do a blend function in
> hardware, they probably already have the gates needed to do packed SIMD
> operations (packed-integer SIMD should actually need a lot fewer gates I
> think than a full-precision scalar FPU; so it could be cheaper I think to do
> some cores lacking a conventional FPU, and just doing narrower SIMD types),
> but dunno...
> but, yeah, it is probably sane to compare against trying to do fake SIMD via
> integer ops or similar.

A blend and an interpolate function would only act on a 32-bit
register holding the 8-bit components of an RGBA pixel (the 4 muls, 4
adds and 1 sub would only operate on 8 bits). It is like a custom SIMD
instruction working on the same registers as any other instruction,
but only useful for blending, as that would be the main task of that
2D unit. Maybe it could be implemented as a micro-coded operation that
reuses the existing mul, add and sub units. I have no clue what the
best solution would be there, but clearly having 2 dedicated
instructions seems to me far more limited, yet simpler to implement
and use, for this use case.
Cedric BAIL
