[J-core] 2D unit

Mon Feb 27 16:35:41 EST 2017

On Sun, Feb 26, 2017 at 9:35 PM, BGB <cr88192 at gmail.com> wrote:
> On 2/26/2017 7:29 PM, Cedric BAIL wrote:
>> We had a nice meetup on Thursday after ELC and came up with some
>> ideas. Not too sure if they are good, as the food and drink may have
>> helped :-)
>>
>> One of the ideas we discussed in some depth was how to design a nice
>> 2D unit that would go with the turtle. The main issue, as discussed
>> before on this mailing list, is that most toolkits don't care about
>> anything but OpenGL and Vulkan these days. Obviously these standards
>> are way too big and complex for us at this stage.
>
> OpenGL isn't actually all that complicated per-se (yes, granted, it looks
> pretty scary up-front), but what makes things more complicated here is
> mostly trying to make it fast enough to be useful.

Just following the standard properly is already a daunting task.
Intel is a very good example: it took them years to reach the point
where they covered the standard properly, and they are only now
catching up on speed too. The software side alone is a multi-year
effort before you can claim compliance with OpenGL. Vulkan may be
slightly easier there, but I haven't looked at it enough to say so
(and you may be able to implement OpenGL on top of Vulkan, but not
OpenCL for now, and you also can't do compositing with Vulkan at the
moment).

I am talking here about the OpenGL in use today, the one with shaders
everywhere, which is pretty demanding both on the software stack (to
generate them efficiently) and on the GPU side, which needs all the
required hardware and cycles to do something useful. I think you are
underestimating the work needed to go from "I got something on
screen" to "Qt and Steam can use it without crashing".

<snip>

>> I do like this idea a lot, as it opens the possibility for a lot of
>> hacking here. You could maybe manage to generate a JIT version per
>> frame instead of relying on functions, and manage a larger number of
>> "hardware" planes. Implementing a mini scenegraph would make it
>> possible to correctly detect the cases where there is no need to
>> composite to a surface, and reduce the bandwidth needed. Those are
>> the most casual ideas; opening up firmware development is, I am
>> sure, going to lead to interesting ideas.
>
> FWIW: it is worth noting that a lot of mobile GPUs are effectively multicore
> processors with proprietary SIMD-oriented ISA's (just usually the ISA is
> held under NDA's and under the sole control of the GPU driver, which is
> often given out as "binary blobs" or similar).
>
> an open GPU could be nice.
>
> SIMD could be pretty useful for performance (particularly packed-byte and
> packed-word), but probably isn't critical (it is sort-of possible to
> approximate SIMD using plain integer ops via some additional bit-twiddly).

Well, if I were to design a GPU, I would start by designing a proper
vector extension. A typical trick that I think GPUs actually use
internally (I have no way to know if that is true) is to handle
variable-length requests: something where you can rerun a previous
instruction with a different mask. Being able to query the number of
available vector units could also be neat (with a guaranteed minimum,
e.g. all your vector registers are a multiple of 32 or 64 bits).
Pseudo code would be:

start:
previous_state = vector_process_start(&data_count);
v1 = load_mem(@somewhere);
v2 = load_mem(@somewhere_else);
v3 = load_mem(@somewhere_different);
v4 = !v1;
v5 = v4 ? v2 + v3 : v5;
vector_process_end(previous_state);
if (data_count) goto start;

I have no idea whether any earlier vector processing CPU followed
such a design. Still, from a software point of view it has a lot of
benefits. No pre- or post-processing loops are necessary. As every
instruction could be predicated on a test, you can walk linearly
through both branches of an if inside a loop without jumping or
stalling the vector unit. As I said, this would require some prior art
research.
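
To make the pseudo code above a bit more concrete, here is a minimal
sketch that emulates the same loop in plain C. Everything in it is an
assumption made for illustration: LANES stands in for the number of
vector units the hardware would report, claim_lanes() plays the role
of vector_process_start(), and masked_add() is a hypothetical name.
The inner loop is what the vector unit would do in one go, with the
condition applied as a mask rather than a branch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lane count; the imagined hardware would report this
 * at run time instead of fixing it at compile time. */
#define LANES 4

/* Stand-in for vector_process_start(): claims up to LANES elements
 * from *remaining and returns how many lanes are active this pass. */
static size_t claim_lanes(size_t *remaining)
{
    size_t active = *remaining < LANES ? *remaining : LANES;
    *remaining -= active;
    return active;
}

/* For every i: if a[i] == 0 then out[i] = b[i] + c[i],
 * otherwise out[i] is left untouched. */
void masked_add(const uint32_t *a, const uint32_t *b, const uint32_t *c,
                uint32_t *out, size_t count)
{
    size_t remaining = count;
    size_t base = 0;

    while (remaining) {
        size_t active = claim_lanes(&remaining);

        /* One "vector instruction" worth of work. The last pass may
         * use fewer lanes, so no separate tail loop is needed. */
        for (size_t lane = 0; lane < active; lane++) {
            size_t i = base + lane;
            /* All ones when a[i] == 0, all zeros otherwise. */
            uint32_t mask = (uint32_t)0 - (uint32_t)(a[i] == 0);
            uint32_t sum = b[i] + c[i];
            out[i] = (sum & mask) | (out[i] & ~mask);
        }
        base += active;
    }
}
```

With 6 elements and 4 lanes the first pass handles 4 elements and the
second pass the remaining 2, which is the "no pre or post processing
loop" property described above.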

This combined with a J2 could be a nice base core for a GPU (if the
vector unit has a 16, 32 and 64-bit FPU). You would assemble a bunch
of them in parallel (one core would then be dedicated to one tile). If
you want to optimize for energy consumption, you could turn off cores
and reduce the size of the vectors you are processing (it could be fun
to see if the software can figure out in advance how much data it has
to process and reduce the number of running units ahead of time).

Anyway, if you were to follow such a design, adding a vector
coprocessor to the j-core would be the first step. It would be
interesting to see if there is useful prior art here. Ideally I would
have loved to start from Mesa and take a few games and libraries to
see what they use most in shaders, to make the vector unit fit that
use better, but I am afraid this would lead to a patent mess.

>> We can discuss the benefit of having a specific instruction to do
>> blending or enable the full-sized multiply, but I think that is
>> something we can experiment with later on the turtle to see what
>> works best. So let's just focus on whether this is a good idea for
>> now, and maybe whether we can apply the same concept to other units
>> (network, audio, TPM?). Is it not going to consume too much space on
>> the FPGA? Do we really need all that flexibility?
>>
>> So what do you think of this idea ?
>
> I generally agree with the idea, but can't say much for how much one can fit
> on an FPGA.
>
> as I understand it, it shouldn't be too hard to modify an ALU to be able to
> do packed ADD/SUB (mostly need to disable some internal carry bits in the
> adder, ...).
>
> one probably will also need an operation to multiply packed bytes or words
> and only keep the high order bits (as this comes up a lot in blending and is
> problematic to implement efficiently using normal integer operations).
>
> PADDB Rm, Rn    //packed add bytes
> PADDW Rm, Rn    //packed add words
> PMULHW Rm, Rn    //multiply packed words and keep high 16 bits of
>                  //intermediate results.
> ...
>
> for packed-word, could be nice to be able to also operate on pairs, ex:
> PMULHW/V VR1, VR3
>     does the same as: "PMULHW R2, R6; PMULHW R3, R7"
>
> ex, the VRn registers could be 64-bit pairs of 32-bit registers.
>
> maybe they could be done with FRn registers (to avoid taking up GPR space),
> though ideally in this case there would be a way to move values more
> directly between Rn and FRn registers (without needing to go through FPUL or
> similar).
>
> PADDW FRm, FRn
> and:
> PADDW DRm, DRn
>
> hell, maybe FPSCR.PR and FPSCR.SZ control the behavior of SIMD operations or
> something.
>
> just a random idea, little to say it is actually viable.

As for a blend instruction, I was thinking more of just taking two
colors in a premultiplied RGBA world and doing 8-bit multiplications
there, which looks like:
Dest[{R,G,B}] = Source[{R,G,B}] + Dest[{R,G,B}] * (1 - Source[A]);
Dest[A] = Source[A] + Dest[A] * (1 - Source[A]);
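
As a reference point, the two formulas above can be sketched per
channel in C as below. The rgba8 type and the function names are
hypothetical, and mapping "1" to 255 with a rounded division by 255 is
my assumption about how the 8-bit case would be handled. The clamp
only matters for inputs that are not properly premultiplied.

```c
#include <stdint.h>

/* Premultiplied 8-bit color; hypothetical type for this sketch. */
typedef struct { uint8_t r, g, b, a; } rgba8;

/* round(x * y / 255) without a division. */
static uint8_t mul_div255(uint8_t x, uint8_t y)
{
    uint32_t t = (uint32_t)x * y + 128;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* dst = src + dst * (1 - src_a), with 1.0 represented as 255. */
static uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t src_a)
{
    uint32_t v = (uint32_t)src + mul_div255(dst, (uint8_t)(255 - src_a));
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Applies the same formula to all four channels, alpha included. */
rgba8 blend_over(rgba8 src, rgba8 dst)
{
    rgba8 out;
    out.r = blend_channel(src.r, dst.r, src.a);
    out.g = blend_channel(src.g, dst.g, src.a);
    out.b = blend_channel(src.b, dst.b, src.a);
    out.a = blend_channel(src.a, dst.a, src.a);
    return out;
}
```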

Maybe a color interpolation instruction could be useful too, but
anyway, the idea is to keep it simple and dedicated to the task at
hand for a 2D unit, not to design a generic vector unit. And even
then, I would benchmark this against just using the 32-bit
multiplication with bitmasks and shifts, as that might be enough and
the extra hardware complexity may not even be needed.
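
For the benchmark baseline, a minimal sketch of the blend done with
plain 32-bit multiplies, bitmasks and shifts could look like this. The
0xAARRGGBB pixel layout and the name blend_over_swar() are assumptions
for illustration. Two channels are processed per multiply by spreading
them into the 16-bit halves of a word; since 255 * 255 + 128 + 254 is
still below 65536, nothing spills from one half into the other.

```c
#include <stdint.h>

/* Source-over blend of two premultiplied 0xAARRGGBB pixels using only
 * 32-bit integer operations. Assumes both inputs are valid
 * premultiplied colors (each channel <= its alpha), so the final
 * add cannot carry from one byte into the next. */
uint32_t blend_over_swar(uint32_t src, uint32_t dst)
{
    const uint32_t mask = 0x00FF00FFu;
    uint32_t inv = 255u - (src >> 24);      /* (1 - Source[A]) * 255 */

    /* Red and blue sit in bits 16..23 and 0..7. */
    uint32_t rb = (dst & mask) * inv + 0x00800080u;
    rb = ((rb + ((rb >> 8) & mask)) >> 8) & mask;

    /* Alpha and green, shifted down into the same two positions. */
    uint32_t ag = ((dst >> 8) & mask) * inv + 0x00800080u;
    ag = ((ag + ((ag >> 8) & mask)) >> 8) & mask;

    return src + (rb | (ag << 8));
}
```

Each pixel costs two multiplies here, which is the number a dedicated
blend instruction would have to beat.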
-- 
Cedric BAIL