[J-core] 2D unit

Cedric BAIL cedric.bail at free.fr
Tue Feb 28 01:25:11 EST 2017


On Mon, Feb 27, 2017 at 7:27 PM, BGB <cr88192 at gmail.com> wrote:
> On 2/27/2017 6:33 PM, Cedric BAIL wrote:
>> On Mon, Feb 27, 2017 at 3:41 PM, BGB <cr88192 at gmail.com> wrote:
>>> On 2/27/2017 3:35 PM, Cedric BAIL wrote:
>>>> On Sun, Feb 26, 2017 at 9:35 PM, BGB <cr88192 at gmail.com> wrote:
>>>>> On 2/26/2017 7:29 PM, Cedric BAIL wrote:
>>>>>> We had a nice meetup Thursday after ELC and we came up with some
>>>>>> ideas. Not too sure if they are good, as the food and drink may have
>>>>>> helped :-)
>>>>>>
>>>>>> One of the ideas we discussed quite in depth was how to design a nice
>>>>>> 2D unit that would go with the turtle. The main issue, as discussed
>>>>>> before on this mailing list, is that most toolkits won't care about
>>>>>> anything but OpenGL and Vulkan these days. Obviously these standards
>>>>>> are way too big and complex for us at this stage.
>>>>>
>>>>> OpenGL isn't actually all that complicated per se (yes, granted, it
>>>>> looks pretty scary up-front), but what makes things more complicated
>>>>> here is mostly trying to make it fast enough to be useful.
>>>>
>>>> Trying to just follow the standard properly is already a daunting task.
>>>> Look at a very good example: Intel took years to reach a point where
>>>> it was covering the standard properly, and they are only now catching
>>>> up on speed too. The amount of work to be done on the software side is
>>>> multi-year before you can claim compliance with OpenGL. Vulkan may be
>>>> slightly easier there, but I haven't looked at it enough to say so
>>>> (and you may be able to implement OpenGL on top of Vulkan, but not
>>>> OpenCL for now, and you also can't do compositing with Vulkan at the
>>>> moment).
>>>>
>>>> I am talking here about the OpenGL used today, the one with shaders
>>>> everywhere. Which is pretty demanding on both the software stack (to
>>>> generate them efficiently) and on the GPU side, which needs all the
>>>> hardware and cycles to do something useful. I think you are
>>>> underestimating the work needed to go from "I got something on screen"
>>>> to "Qt and Steam can use it without crashing".
>>>
>>> I got Quake 3 Arena to work mostly OK with a custom-written OpenGL
>>> implementation.
>>>
>>> granted, this was a subset of OpenGL 1.x, as it mostly focused on the
>>> features that Quake 3 used, and skipped the stuff that wasn't used
>>> (such as display lists, etc.).
>>
>> OpenGL 1.0 was released in 1992 and the last release of the 1.x branch
>> in 2003. None of them had a programmable pipeline, which is what every
>> toolkit and game in use today relies on. I personally see no interest
>> in anything below Vulkan, as most toolkits and game engines will have
>> moved to it, or to a fully programmable pipeline, by the time we are
>> done doing any hardware.
>
>
> but, can we make something newer (even GLES2) usable with SH cores or
> (lower-end) FPGAs?...
>
> I have doubts, hence why I'm looking into stuff mostly from 15-20 years
> ago.
>
> it is a little easier to make stuff from that era work, and "I can
> basically do this 20-year-old thing" may be better than "it can't be
> done at all".

But then you have no software that is using it. There is no toolkit,
web browser or compositor that is maintained for such a target. OpenGL
1.x is pretty much dead today. Old games are pretty much the only
things that could take advantage of it.

<snip>

>>>>>> I do like this idea a lot, as it opens the possibility for a lot of
>>>>>> hacking here. You could maybe manage to generate a JIT version per
>>>>>> frame instead of relying on functions, and manage a larger number of
>>>>>> "hardware" planes. Implementing a mini scene graph would make it
>>>>>> possible to correctly detect the cases when not to composite to a
>>>>>> surface and reduce the bandwidth needed. That is the most casual
>>>>>> idea; opening up firmware development is, I am sure, going to lead
>>>>>> to interesting ideas.
>>>>>
>>>>> FWIW: it is worth noting that a lot of mobile GPUs are effectively
>>>>> multicore processors with proprietary SIMD-oriented ISAs (just
>>>>> usually the ISA is held under NDAs and under the sole control of the
>>>>> GPU driver, which is often given out as "binary blobs" or similar).
>>>>>
>>>>> an open GPU could be nice.
>>>>>
>>>>> SIMD could be pretty useful for performance (particularly packed-byte
>>>>> and packed-word), but probably isn't critical (it is sort-of possible
>>>>> to approximate SIMD using plain integer ops via some additional
>>>>> bit-twiddling).
>>>>
>>>> Well, if I was to design a GPU, I would start by designing a
>>>> proper vector extension. A typical trick that I think GPUs actually do
>>>> internally (I have no way to know if that is true) is to handle
>>>> variable-length requests: something where you can rerun a previous
>>>> instruction with a different mask. Also being able to get the number
>>>> of available vector units could be neat (with a given minimum, like
>>>> all your vector registers being a multiple of 32 or 64 bits). Pseudo
>>>> code would be:
>>>>
>>>> start:
>>>> previous_state = vector_process_start(&data_count);
>>>> v1 = load_mem(@somewhere);
>>>> v2 = load_mem(@somewhere_else);
>>>> v3 = load_mem(@somewhere_different);
>>>> v4 = !v1;
>>>> v5 = v4 ? v2 + v3 : v5;
>>>> vector_process_end(previous_state);
>>>> if (data_count) goto start;
>>>>
>>>> I have no idea if there is any previous vector-processing CPU that
>>>> followed such a design. Still, from a software point of view this has
>>>> really a lot of benefits. No pre- or post-processing loops are
>>>> necessary. As every instruction would be able to be marked with a
>>>> test, you can explore both branches of an if inside a loop linearly
>>>> without jumping or stopping processing in the vector unit. As I said,
>>>> this would require some prior art research.
>>>>
>>>> This combined with a J2 could be a nice base core for a GPU (if the
>>>> vector unit has a 16, 32 and 64-bit FPU). You would assemble a bunch
>>>> of them in parallel (one core would then be dedicated to one tile). If
>>>> you want to optimize for energy consumption, you could turn off cores
>>>> and reduce the size of the vectors you are processing (could be fun to
>>>> see if the software can figure out in advance how much data it has to
>>>> process and reduce the number of units running ahead).
>>>>
>>>> Anyway, if you were to follow such a design, adding a vector
>>>> coprocessor to the j-core would be a first needed step. It would be
>>>> interesting to see if there is prior art here that is useful. I would
>>>> ideally have loved to start from Mesa, taking a few games and
>>>> libraries to see what they use the most out of shaders, to make the
>>>> vector unit fit its use better, but I am afraid this would lead to a
>>>> patent mess.
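
(As an editorial aside: the masked-execution idea in the pseudo code above
can be modelled in scalar C, one lane at a time. Everything below is an
illustrative sketch only; the function, VLEN and all names are invented
for this example and correspond to no real ISA.)

```c
#include <stdint.h>

/* Scalar model of the masked vector loop sketched above: each lane
 * carries a predicate (v4 = !v1), so both sides of a branch can be
 * walked linearly with no jumps inside the "vector unit". The outer
 * while models rerunning the instruction on variable-length chunks. */
#define VLEN 8

void masked_add(const int32_t *v1, const int32_t *v2,
                const int32_t *v3, int32_t *v5, int n)
{
    while (n > 0) {
        int chunk = n < VLEN ? n : VLEN;     /* variable-length request */
        for (int lane = 0; lane < chunk; lane++) {
            int v4 = !v1[lane];              /* per-lane mask */
            if (v4)
                v5[lane] = v2[lane] + v3[lane]; /* v5 = v4 ? v2+v3 : v5 */
            /* else: lane left untouched, as in the pseudo code */
        }
        v1 += chunk; v2 += chunk; v3 += chunk; v5 += chunk;
        n -= chunk;
    }
}
```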
>>>
>>> dunno. earlier I wrote up an idea for a fairly simplistic SIMD
>>> extension for SH, which would basically reuse the FPU registers and
>>> opcodes (via currently reserved FPSCR bits).
>>
>> The main issue with SIMD extensions is the pre and post loops, which
>> consume a fair amount of code and really massively reduce the benefit.
>> Most C code is not properly annotated to let the compiler remove them.
>> This leads to situations where, given what the compiler knows, the cost
>> of running a pre/post loop is higher than going with scalar code, so it
>> abandons the SIMD path, which would have been perfectly viable if the
>> compiler had known that the data was aligned and of the proper size.
>> The only solution around that is to find a technical solution where pre
>> and post loops are irrelevant whatever the compiler knows about the
>> code. This way, the compiler will have more opportunities to use vector
>> operations.
>>
>> Today compilers are pretty good at finding what to parallelize, but CPU
>> instruction sets aren't helping much. Having a vector unit that
>> existing compilers can't take much advantage of would be pretty
>> problematic for the j-core, as there is little chance there are enough
>> resources to go over all the assembly-optimized source code out there
>> to add a new variant to support it. Basically, developing the
>> instruction set around GCC and LLVM's capabilities, more than around
>> what a human can do, would fit this project better, I think.
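
(For illustration, the pre/post-loop pattern described above is roughly
what an auto-vectorizer has to emit when it cannot prove length or
alignment; a sketch in scalar C, with a 4-wide unrolled block standing in
for one SIMD operation. The function name and width are invented for the
example.)

```c
#include <stddef.h>
#include <stdint.h>

/* What auto-vectorized code conceptually expands to: a 4-wide main
 * loop (stand-in for one SIMD op) plus a scalar post loop mopping up
 * the remainder. With a masked / variable-length vector ISA, the
 * scalar tail disappears. */
void add_arrays(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {        /* "vector" main loop */
        dst[i+0] = a[i+0] + b[i+0];
        dst[i+1] = a[i+1] + b[i+1];
        dst[i+2] = a[i+2] + b[i+2];
        dst[i+3] = a[i+3] + b[i+3];
    }
    for (; i < n; i++)                  /* scalar post loop: the overhead */
        dst[i] = a[i] + b[i];
}
```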
>
>
> probably, the core (fixed function) rasterizer loops could be written in ASM
> or similar (or maybe GLSL).
> typically this is where most of the time goes IME.

I am not sure which loops you are talking about here. On the GPU, in
the 2D unit, or on the CPU?

<snip>

>>> could still beat on the idea more.
>>>
>>>>>> We can discuss the benefit of having a specific instruction to do
>>>>>> blending or enable the full-sized multiply, but I think that is
>>>>>> something we can experiment with later on the turtle to see what
>>>>>> works best. So let's just focus on whether this is a good idea for
>>>>>> now, and maybe whether we can apply the same concept to other units
>>>>>> (network, audio, TPM?). Is it not going to consume too much space
>>>>>> on the FPGA? Do we really need all that flexibility?
>>>>>>
>>>>>> So what do you think of this idea ?
>>>>>
>>>>> I generally agree with the idea, but can't say much for how much one
>>>>> can fit on an FPGA.
>>>>>
>>>>> as I understand it, it shouldn't be too hard to modify an ALU to be
>>>>> able to do packed ADD/SUB (mostly need to disable some internal carry
>>>>> bits in the adder, ...).
>>>>>
>>>>> one probably will also need an operation to multiply packed bytes or
>>>>> words and only keep the high-order bits (as this comes up a lot in
>>>>> blending and is problematic to implement efficiently using normal
>>>>> integer operations).
>>>>>
>>>>> PADDB Rm, Rn    //packed add bytes
>>>>> PADDW Rm, Rn    //packed add words
>>>>> PMULHW Rm, Rn    //multiply packed words and keep high 16 bits of
>>>>> intermediate results.
>>>>> ...
>>>>>
>>>>> for packed-word, could be nice to be able to also operate on pairs,
>>>>> ex:
>>>>> PMULHW/V VR1, VR3
>>>>>       does the same as: "PMULHW R2, R6; PMULHW R3, R7"
>>>>>
>>>>> ex, the VRn registers could be 64-bit pairs of 32-bit registers.
>>>>>
>>>>> maybe they could be done with FRn registers (to avoid taking up GPR
>>>>> space), though ideally in this case there would be a way to move
>>>>> values more directly between Rn and FRn registers (without needing
>>>>> to go through FPUL or similar).
>>>>>
>>>>> PADDW FRm, FRn
>>>>> and:
>>>>> PADDW DRm, DRn
>>>>>
>>>>> hell, maybe FPSCR.PR and FPSCR.SZ control the behavior of SIMD
>>>>> operations or something.
>>>>>
>>>>> just a random idea, little to say it is actually viable.
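
(Editorial aside: the "disable the internal carry bits" trick above has a
pure-software analogue, the SWAR bit-twiddling mentioned earlier in the
thread. Packed byte addition can be approximated with plain 32-bit integer
ops; the function name below is invented for the example.)

```c
#include <stdint.h>

/* SWAR emulation of a PADDB-style packed byte add (wrapping):
 * add the low 7 bits of each byte with the inter-byte carry chain
 * masked off, then patch the top bit of each byte back in via XOR. */
uint32_t paddb_swar(uint32_t a, uint32_t b)
{
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    return low ^ ((a ^ b) & 0x80808080u);
}
```

This is exactly what a packed adder does in hardware, minus the masking:
the gates saved by cutting the carries at byte boundaries are what makes
packed ADD/SUB nearly free in an ALU.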
>>>>
>>>> As for a blend instruction, I was more thinking about just taking two
>>>> colors in an RGBA premultiplied world and just doing the 8-bit
>>>> multiplications there, which looks like:
>>>> Dest[{R,G,B}] = Source[{R,G,B}] + Dest[{R,G,B}] * (1 - Source[A]);
>>>> Dest[A] = Source[A] + Dest[A] * (1 - Source[A]);
>>>>
>>>> Maybe a color interpolation instruction could be useful too, but
>>>> anyway, the idea is to keep it simple and dedicated to the task at
>>>> hand for a 2D unit, not to design a generic vector unit. And even
>>>> then, I would benchmark this against just using the 32-bit
>>>> multiplication with bitmask and shift, as that might be enough and
>>>> increasing the complexity of the hardware might not even be needed.
>>>
>>> I suspect that by the time one has the gates needed to do a blend
>>> function in hardware, they probably already have the gates needed to
>>> do packed SIMD operations (packed-integer SIMD should actually need a
>>> lot fewer gates than a full-precision scalar FPU; so it could be
>>> cheaper, I think, to do some cores lacking a conventional FPU and just
>>> doing narrower SIMD types), but dunno...
>>>
>>> but, yeah, it is probably sane to compare against trying to do fake
>>> SIMD via integer ops or similar.
>>
>> A blend and an interpolate function would only act on 32-bit registers
>> holding the 8-bit components of an RGBA pixel (the 4 muls, 4 adds and
>> 1 sub would only operate on 8 bits). It is like a custom SIMD
>> instruction working on the same registers as any other instruction,
>> but only useful for blending, as that would be the main task of that
>> 2D unit. Maybe it could be implemented as a microcoded operation that
>> reuses existing mul, add and sub units. I have no clue what the best
>> solution would be there, but clearly having just 2 instructions seems
>> to me way more limited, and simpler to implement and use, for this use
>> case.
>
> but, I have doubts it will save all that much, and as presented it
> would only be applicable to an extremely narrow set of use-cases (it is
> not even the most commonly used form of alpha blending IME).
>
> if it were: "Srgba*Sa+Drgba*(1-Sa)", this is at least more commonly
> used.

This is not the more common function, as it leads to more visually
incorrect results. Premultiplied surfaces are used for this purpose
(even in hardware); please check
https://en.wikipedia.org/wiki/Alpha_compositing for more information.
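
(For reference, the premultiplied "over" operator from the equations
earlier in the thread looks like this on a packed pixel; scalar C sketch,
with the 0xAARRGGBB channel layout and the divide-by-255 rounding trick
being illustrative choices, not anything mandated by the proposal.)

```c
#include <stdint.h>

/* Premultiplied-alpha "over":
 *   Dest = Source + Dest * (1 - Source.A)
 * applied to all four 8-bit channels of a 0xAARRGGBB pixel.
 * (t + (t >> 8)) >> 8 with t pre-biased by 128 is the usual exact
 * round(x / 255) trick for x in [0, 255*255]. */
uint32_t blend_premul_over(uint32_t src, uint32_t dst)
{
    uint32_t inv_a = 255u - (src >> 24);
    uint32_t out = 0;

    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xFFu;
        uint32_t d = (dst >> shift) & 0xFFu;
        uint32_t t = d * inv_a + 128u;
        t = (t + (t >> 8)) >> 8;        /* round(d * inv_a / 255) */
        out |= ((s + t) & 0xFFu) << shift;
    }
    return out;
}
```

Note that for valid premultiplied inputs (each color channel <= alpha)
the per-channel sum never exceeds 255, so no clamp is needed; that is
part of what makes this form cheaper to put in hardware than the
non-premultiplied one.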

> but, if it can do packed byte and word ops, it can support all of the
> blending modes, and also things like color gradients, ...
>
> being able to do color gradients is a big place where packed word ops are
> helpful, and drawing primitives with interpolated vertex colors is also
> fairly common (and trying to directly use bytes here typically results in
> some pretty nasty looking artifacts).
>
> related is interpolating raster or texture coordinates, which can also
> benefit from packed-word.

I have excluded discussion of any of these topics, as my sole interest
is in enabling existing software with hardware acceleration. Porting
libraries specifically to this unit is highly time consuming and
difficult to maintain, which is why I referred to kms/drm as being the
only goal. This means a very basic scene graph of buffers being blended
together. As there might be some color/alpha interpolation included,
that is the only second operation that might be useful.

I understand your interest in trying to make it as general purpose
as possible, but that is, I think, only useful if we can make it a
modern GPU (Vulkan capable); otherwise sticking with optimizing for
kms/drm would be enough in my opinion. I do not see how to provide
meaningful and broad acceleration to the existing software ecosystem
with what you have in mind.
-- 
Cedric BAIL
