[J-core] 2D unit

BGB cr88192 at gmail.com
Mon Feb 27 18:41:53 EST 2017


On 2/27/2017 3:35 PM, Cedric BAIL wrote:
> On Sun, Feb 26, 2017 at 9:35 PM, BGB <cr88192 at gmail.com> wrote:
>> On 2/26/2017 7:29 PM, Cedric BAIL wrote:
>>> We had Thursday a nice meetup after ELC and we came up with some
>>> ideas. Not too sure if they are good, as the food and drink may have
>>> helped :-)
>>>
>>> One of the ideas we discussed quite in depth was how to design a nice
>>> 2D unit that would go with the turtle. The main issue, as discussed
>>> before on this mailing list, is that most toolkits won't care about
>>> anything but OpenGL and Vulkan these days. Obviously these standards
>>> are way too big and complex for us at this stage.
>> OpenGL isn't actually all that complicated per se (yes, granted, it looks
>> pretty scary up-front), but what makes things more complicated here is
>> mostly trying to make it fast enough to be useful.
> Trying to just follow the standard properly is already a daunting task.
> Look at a very good example: Intel took years to reach a point where
> it was covering the standard properly, and they are only now catching
> up on speed too. The amount of work to be done on the software side is
> multiple years before you can claim compliance with OpenGL. Vulkan may
> be slightly easier there, but I haven't looked at it enough to say so
> (and you may be able to implement OpenGL on top of Vulkan, but not
> OpenCL for now, and you also can't do compositing with Vulkan at the
> moment).
>
> I am talking here about the OpenGL in use today, the one with shaders
> everywhere. That is pretty demanding on both the software stack (to
> generate them efficiently) and on the GPU side, which needs all the
> hardware and cycles to do something useful. I think you are
> underestimating the work needed to go from "I got something on screen"
> to "Qt and Steam can use it without crashing".

I got Quake 3 Arena to work mostly OK with a custom-written OpenGL 
implementation.

granted, this was a subset of OpenGL 1.x, as it mostly focused on the 
features that Quake 3 used, and skipped out on stuff that wasn't used 
(such as display lists, etc...).

granted, yes, no shaders in this case; and a few things in Quake 3 
didn't display correctly for some undetermined reason. Quake 1 and Quake 
2 also worked with it, as did my own 3D engine at the time (which had a 
GL 1.x friendly mode). (while Quake 3 uses "shaders", they are more in 
the sense of allowing textures to be layered and supporting user defined 
blend-modes and similar).

I have a few videos of it I did at one point:
https://www.youtube.com/watch?v=uZcnL3jwi2c

laggier, but at a higher resolution and with everything enabled:
https://www.youtube.com/watch?v=EjOMPcionqY


some other people had written "better" software GLs, but they didn't 
work on my PC. most depended on SSE4.2, which the PC I was using at the 
time didn't support. mine was mostly plain C scalar code (with some 
pseudo-SIMD), but near the end I added some use of SSE2 intrinsics 
(mostly for things like trilinear texture filtering and similar).
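
a rough sketch of the sort of SSE2 intrinsic use in question (not the 
actual code from that renderer, just illustrating the idea): averaging 
two rows of RGBA8888 texels, one step of a bilinear/trilinear filter.

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* average two rows of 4 RGBA8888 texels (per-byte rounded average),
   roughly one step of a bilinear/trilinear filter */
static inline __m128i avg_texel_rows(const uint32_t *row0,
                                     const uint32_t *row1)
{
    __m128i a = _mm_loadu_si128((const __m128i *)row0);
    __m128i b = _mm_loadu_si128((const __m128i *)row1);
    return _mm_avg_epu8(a, b);
}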

another SW GL implementation (Mesa's LLVMpipe) had used LLVM basically 
to dynamically compile parts of the software-rendering pipeline, which 
is a fair bit fancier (than just doing a plain-C software rasterizer).


shaders would probably require a JIT, as I don't expect they could be 
made sufficiently fast with an interpreter.

I at one point considered possibly using a fork of my BS2 VM as a GLSL 
compiler (and working towards GL 2.x support and newer), but never did so.


but, FWIW, something being able to run shaders (at some semblance of 
acceptable speed) would already beat out my laptop. it is Vista-era with 
an Intel GMA X3100, and is basically limited to GL 1.x and 
fixed-function (anything beyond that is too slow to really be usable).


> <snip>
>
>>> I do like this idea a lot as it opens the possibility for a lot of
>>> hacking here. You could maybe manage to generate a JIT version per
>>> frame instead of relying on functions, and manage a larger number of
>>> "hardware" planes. Implementing a mini scenegraph would make it
>>> possible to correctly detect the cases when not to composite to a
>>> surface and reduce the bandwidth needed. That is the most casual
>>> idea; opening up firmware development is, I am sure, going to lead to
>>> interesting ideas.
>> FWIW: it is worth noting that a lot of mobile GPUs are effectively multicore
>> processors with proprietary SIMD-oriented ISA's (just usually the ISA is
>> held under NDA's and under the sole control of the GPU driver, which is
>> often given out as "binary blobs" or similar).
>>
>> an open GPU could be nice.
>>
>> SIMD could be pretty useful for performance (particularly packed-byte and
>> packed-word), but probably isn't critical (it is sort-of possible to
>> approximate SIMD using plain integer ops via some additional bit-twiddling).
> Well, if I were to design a GPU, I would start first by designing a
> proper vector extension. A typical trick that I think GPUs actually do
> internally (I have no way to know if that is true) is to handle
> variable-length requests, something where you can rerun a previous
> instruction with a different mask. Also being able to get the number
> of available vector units could be neat (with a given minimum, like all
> your vector registers are a multiple of 32 or 64 bits). Pseudo-code
> would be:
>
> start:
> previous_state = vector_process_start(&data_count);
> v1 = load_mem(@somewhere);
> v2 = load_mem(@somewhere_else);
> v3 = load_mem(@somewhere_different);
> v4 = !v1;
> v5 = v4 ? v2 + v3 : v5;
> vector_process_end(previous_state);
> if (data_count) goto start;
>
> I have no idea if there is any previous vector-processing CPU that
> followed such a design. Still, from a software point of view this has
> a lot of benefits. No pre- or post-processing loops are necessary.
> As every instruction could be marked with a test, you can explore
> both branches of an if inside a loop linearly, without jumping or
> stopping processing in the vector unit. As I said, it would require
> some prior-art research.
>
> This, combined with a J2, could be a nice base core for a GPU (if the
> vector unit has 16-, 32- and 64-bit FPU support). You would assemble a
> bunch of them in parallel (one core would then be dedicated to one
> tile). If you want to optimize for energy consumption, you could turn
> off cores and reduce the size of the vectors you are processing (could
> be fun to see if the software can figure out in advance how much data
> it has to process and reduce the number of units running ahead of time).
>
> Anyway, if you were to follow such a design, adding a vector
> coprocessor to the j-core would be a needed first step. It would be
> interesting to see if there is prior art here that is useful. I
> would ideally have loved to start from Mesa, take a few games and
> libraries, and see what they use most out of shaders, to make the
> vector unit fit its use better, but I am afraid this would lead to a
> patent mess.

dunno. earlier I wrote up an idea for a fairly simplistic SIMD extension 
for SH, which would basically reuse the FPU registers and opcodes (via 
currently reserved FPSCR bits).

https://github.com/cr88192/bgbtech_shxemu/wiki/SHx-Mini-SIMD

supporting arbitrary shader code would be harder though, as one can't 
just casually give them limited-range fixed-point and expect them to 
work. at a minimum, it would probably need 16-bit half-floats.

possibly there could be an alternate SIMD mode which supports packed 
half-floats (to be friendlier to GLSL).
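
for reference, a rough sketch (simplified; flushes denormals to zero) of 
what unpacking a 16-bit half-float to single precision involves:

#include <stdint.h>
#include <string.h>

/* simplified binary16 -> float conversion (sketch only):
   handles normals, zeros, Inf/NaN; flushes half denormals to zero */
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t frac = h & 0x3FF;
    uint32_t bits;

    if (exp == 0)
        bits = sign;                               /* +/-0, denormals flushed */
    else if (exp == 0x1F)
        bits = sign | 0x7F800000 | (frac << 13);   /* Inf/NaN */
    else
        bits = sign | ((exp + 112) << 23) | (frac << 13);  /* 112 = 127-15 */

    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}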

could still beat on the idea more.


>>> We can discuss the benefit of having a specific instruction to do
>>> blending or to enable the full-sized multiply, but I think that is
>>> something we can experiment with later on the turtle and see what
>>> works best. So let's just focus on whether this is a good idea for
>>> now, and maybe whether we can apply the same concept to other units
>>> (network, audio, TPM?). Is it not going to consume too much space on
>>> the FPGA? Do we really need all that flexibility?
>>>
>>> So what do you think of this idea ?
>> I generally agree with the idea, but can't say much for how much one can fit
>> on an FPGA.
>>
>> as I understand it, it shouldn't be too hard to modify an ALU to be able to
>> do packed ADD/SUB (mostly need to disable some internal carry bits in the
>> adder, ...).
>>
>> one probably will also need an operation to multiply packed bytes or words
>> and only keep the high order bits (as this comes up a lot in blending and is
>> problematic to implement efficiently using normal integer operations).
>>
>> PADDB Rm, Rn    //packed add bytes
>> PADDW Rm, Rn    //packed add words
>> PMULHW Rm, Rn    //multiply packed words and keep high 16 bits of
>> intermediate results.
>> ...
>>
>> for packed-word, could be nice to be able to also operate on pairs, ex:
>> PMULHW/V VR1, VR3
>>      does the same as: "PMULHW R2, R6; PMULHW R3, R7"
>>
>> ex, the VRn registers could be 64-bit pairs of 32-bit registers.
>>
>> maybe they could be done with FRn registers (to avoid taking up GPR space),
>> though ideally in this case there would be a way to move values more
>> directly between Rn and FRn registers (without needing to go through FPUL or
>> similar).
>>
>> PADDW FRm, FRn
>> and:
>> PADDW DRm, DRn
>>
>> hell, maybe FPSCR.PR and FPSCR.SZ control the behavior of SIMD operations or
>> something.
>>
>> just a random idea, little to say it is actually viable.
> As for a blend instruction, I was thinking more about just taking two
> colors in a premultiplied RGBA world and doing 8-bit multiplications
> there, which looks like:
> Dest[{R,G,B}] = Source[{R,G,B}] + Dest[{R,G,B}] * (1 - Source[A]);
> Dest[A] = Source[A] + Dest[A] * (1 - Source[A]);
>
> Maybe a color interpolation instruction could be useful too, but
> anyway, the idea is to keep it simple and dedicated to the task at
> hand for a 2D unit, not to design a generic vector unit. And even then,
> I would benchmark this against just using 32-bit multiplications
> with bitmasks and shifts, as that might be enough and increasing the
> complexity of the hardware might not even be needed.

I suspect that by the time one has the gates needed to do a blend 
function in hardware, they probably already have the gates needed to do 
packed SIMD operations (packed-integer SIMD should actually need a lot 
fewer gates than a full-precision scalar FPU; so it could be cheaper, I 
think, to do some cores lacking a conventional FPU and just doing 
narrower SIMD types), but dunno...

but, yeah, it is probably sane to compare against trying to do fake SIMD 
via integer ops or similar.
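
as a rough sketch (purely illustrative, not code from any of the 
projects above): the premultiplied source-over blend from the formula 
quoted earlier, done on ARGB8888 with only plain 32-bit integer ops, 
splitting the channels into two 16-bit pairs so one multiply covers two 
channels at once.

#include <stdint.h>

/* premultiplied source-over blend, ARGB8888, plain 32-bit integer ops;
   uses the usual approximation x*inv/255 ~= (x*inv)>>8 */
static uint32_t blend_over(uint32_t src, uint32_t dst)
{
    uint32_t inv = 255 - (src >> 24);        /* (1 - Source[A]), 0..255 */

    uint32_t rb = dst & 0x00FF00FF;          /* red and blue channels */
    uint32_t ag = (dst >> 8) & 0x00FF00FF;   /* alpha and green channels */

    rb = ((rb * inv) >> 8) & 0x00FF00FF;
    ag = (ag * inv) & 0xFF00FF00;            /* >>8 then <<8 folded together */

    return src + (ag | rb);
}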



