[J-core] 2D unit

BGB cr88192 at gmail.com
Mon Feb 27 22:27:39 EST 2017


On 2/27/2017 6:33 PM, Cedric BAIL wrote:
> On Mon, Feb 27, 2017 at 3:41 PM, BGB <cr88192 at gmail.com> wrote:
>> On 2/27/2017 3:35 PM, Cedric BAIL wrote:
>>> On Sun, Feb 26, 2017 at 9:35 PM, BGB <cr88192 at gmail.com> wrote:
>>>> On 2/26/2017 7:29 PM, Cedric BAIL wrote:
>>>>> We had Thursday a nice meetup after ELC and we came up with some
>>>>> ideas. Not too sure if they are good, as the food and drink may have
>>>>> helped :-)
>>>>>
>>>>> One of the ideas we discussed quite in depth was how to design a nice
>>>>> 2D unit that would go with the turtle. The main issue, as discussed
>>>>> before on this mailing list, is that most toolkits won't care about
>>>>> anything but OpenGL and Vulkan these days. Obviously these standards
>>>>> are way too big and complex for us at this stage.
>>>> OpenGL isn't actually all that complicated per-se (yes, granted, it looks
>>>> pretty scary up-front), but what makes things more complicated here is
>>>> mostly trying to make it fast enough to be useful.
>>> Trying to just follow the standard properly is already a daunting task.
>>> Look at a very good example: Intel took years to reach a point where
>>> it was covering the standard properly, and they are only now catching
>>> up on speed too. The amount of work to be done on the software side is
>>> multi-year before you can claim compliance with OpenGL. Vulkan may be
>>> slightly easier there, but I haven't looked at it enough to say so
>>> (and you may be able to implement OpenGL on top of Vulkan, but not
>>> OpenCL for now, and you also can't do compositing with Vulkan at the
>>> moment).
>>>
>>> I am talking here about the OpenGL used today, the one with shaders
>>> everywhere. Which is pretty demanding on both the software stack (to
>>> generate them efficiently) and on the GPU side, which needs all the
>>> hardware and cycles to do something useful. I think you are
>>> underestimating the work needed to get from "I got something on screen"
>>> to "Qt and Steam can use it without crashing".
>> I got Quake 3 Arena to work mostly ok with a custom-written OpenGL
>> implementation.
>>
>> granted, this was a subset of OpenGL 1.x, as it mostly focused on the
>> features that Quake 3 used, and skipped out on stuff that wasn't used (such
>> as display lists, etc...).
> OpenGL 1.0 was released in 1992 and the last release of the 1.x branch
> in 2003. None of them had a programmable pipeline, which is what every
> toolkit and game in use today relies on. I personally see no interest
> in anything below Vulkan, as most toolkits and game engines will have
> moved to it or a fully programmable pipeline by the time we are done
> doing any hardware.

but, can we make something newer (even GLES2) usable with SH cores or 
(lower-end) FPGAs?...

I have doubts, hence why looking into stuff mostly from 15-20 years ago.

it is a little easier to make stuff from that era work, and "I can 
basically do this 20 year old thing" may be better than "it can't be 
done at all".


> <snip>
>
>> another SW GL implementation (Mesa's LLVMpipe) had used LLVM basically to
>> dynamically compile parts of the software-rendering pipeline, which is a
>> fair bit fancier (than just doing a plain-C software rasterizer).
> Yes, but still it is super damn slow and would require a massive rewrite
> to match a non-classic CPU architecture. Also LLVM is a pretty bad
> solution for GPU compilation, as the goals of LLVM and of a shader
> compiler are not aligned, which is why Mesa has its own IR and
> optimization passes on it.

yeah, LLVM is a bit of a monster, so I don't really use it either (in 
contrast, I was recently able to do a basically functional JIT in only a 
few kLOC).


as for speed of LLVMpipe, well, it can run 15-20 year old games pretty ok.
as noted before, it is possible to roughly emulate early-2000s GPUs on a 
modern CPU; one is rather harder-pressed to emulate a more modern GPU though.


> <snip>
>
>>>>> I do like this idea a lot as it open the possibility for a lot of
>>>>> hacking here. You could maybe manage to generate a JIT version per
>>>>> frame instead of relying on function and manage a larger number of
>>>>> "hardware" plane. Implementing a mini scenegraph would enable the
>>>>> possibility to correctly detect the case when not to composite to
>>>>> surface and reduce the bandwidth need. That is for the most casual
>>>>> idea, opening firmware development is I am sure going to lead to
>>>>> interesting idea.
>>>> FWIW: it is worth noting that a lot of mobile GPUs are effectively
>>>> multicore
>>>> processors with proprietary SIMD-oriented ISA's (just usually the ISA is
>>>> held under NDA's and under the sole control of the GPU driver, which is
>>>> often given out as "binary blobs" or similar).
>>>>
>>>> an open GPU could be nice.
>>>>
>>>> SIMD could be pretty useful for performance (particularly packed-byte and
>>>> packed-word), but probably isn't critical (it is sort-of possible to
>>>> approximate SIMD using plain integer ops via some additional
>>>> bit-twiddly).
>>> Well, if I was to design a GPU, I would start first by designing a
>>> proper vector extension. A typical trick that I think GPUs actually do
>>> internally (I have no way to know if that is true) is to handle
>>> variable-length requests. Something where you can rerun a previous
>>> instruction with a different mask. Also being able to get the number
>>> of available vector units could be neat (with a given minimum, like all
>>> your vector registers being a multiple of 32 or 64 bits). Pseudo code
>>> would be:
>>>
>>> start:
>>> previous_state = vector_process_start(&data_count);
>>> v1 = load_mem(@somewhere);
>>> v2 = load_mem(@somewhere_else);
>>> v3 = load_mem(@somewhere_different);
>>> v4 = !v1;
>>> v5 = v4 ? v2 + v3 : v5;
>>> vector_process_end(previous_state);
>>> if (data_count) goto start;
>>>
>>> I have no idea if there is any previous vector processing CPU that
>>> followed such a design. Still, from a software point of view this has
>>> really a lot of benefits. No pre- or post-processing loops are necessary.
>>> As every instruction would be able to be marked with a test, you can
>>> explore linearly both branches of the if inside a loop without jumping
>>> or stopping processing in the vector unit. As I said, this would
>>> require some prior art research.
>>>
>>> This combined with a J2 could be a nice base core for a GPU (If the
>>> vector unit has a 16-, 32- and 64-bit FPU). You would assemble a bunch
>>> of them in parallel (One core would then be dedicated to one tile). If
>>> you want to optimize for energy consumption, you could turn off cores
>>> and reduce the size of the vector you are processing (Could be fun to
>>> see if the software can figure out in advance how much data it has to
>>> process and reduce the number of units running ahead).
>>>
>>> Anyway, if you were to follow such a design, adding a vector
>>> coprocessor to the j-core would be a first needed step. It would
>>> be interesting to see if there is prior art here that is useful. I
>>> would have loved to ideally start from Mesa, take a few games and
>>> libraries to see what they use the most out of shaders to make the
>>> vector unit fit its use better, but I am afraid this would lead to a
>>> patent mess.
>> dunno. earlier I wrote up an idea for a fairly simplistic SIMD extension for
>> SH, which would basically reuse the FPU registers and opcodes (via currently
>> reserved FPSCR bits).
> The main issue with SIMD extensions is the pre and post loops that
> consume a fair amount of code and massively reduce the benefit
> of it. Most C code is not properly annotated to let the compiler
> remove them. This leads to situations where, for what the compiler
> knows, the cost of running a pre/post loop is higher than going with
> scalar code, so it abandons the SIMD path, which would have been
> perfectly viable if the compiler knew the data was aligned and of the
> proper size. The only solution around that is to find a technical
> solution where pre and post loops are irrelevant whatever the compiler
> knows about the code. This way, the compiler will have more
> opportunities to use vector operations.
>
> Today compilers are pretty good at finding what to parallelize, but CPU
> instruction sets aren't helping much. Having a vector unit that
> existing compilers can't take much advantage of would be pretty
> problematic for the j-core, as there is little chance that there are
> enough resources to go over all the assembly-optimized source code to
> add a new variant to support it. Basically, developing the
> instruction set around GCC and LLVM capabilities, more than around what
> a human can do, would fit this project better I think.

probably, the core (fixed-function) rasterizer loops could be written in 
ASM or similar (or maybe GLSL); typically this is where most of the time 
goes IME.

shaders are harder, but luckily GLSL tends to use vectors as native types.
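to make the above concrete, here is a rough sketch of the kind of inner loop I mean (names, pixel format, and 16.16 fixed-point layout are all illustrative, not from any real rasterizer):

```c
#include <stdint.h>

/* Minimal sketch of a fixed-function span-fill inner loop, the kind of
 * hot path that would likely be hand-written in ASM. Each color channel
 * is interpolated in 16.16 fixed point across the span and packed as
 * RGB565 on store. */
void fill_span_rgb565(uint16_t *dst, int len,
                      int32_t r, int32_t g, int32_t b,    /* 16.16 start values */
                      int32_t dr, int32_t dg, int32_t db) /* 16.16 per-pixel steps */
{
    for (int i = 0; i < len; i++) {
        /* take the integer part of each channel and pack as 5:6:5 */
        dst[i] = (uint16_t)((((r >> 16) & 0x1F) << 11) |
                            (((g >> 16) & 0x3F) << 5) |
                            ((b >> 16) & 0x1F));
        r += dr; g += dg; b += db;
    }
}
```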


>> https://github.com/cr88192/bgbtech_shxemu/wiki/SHx-Mini-SIMD
>>
>> supporting arbitrary shader code would be harder though, as one can't just
>> casually give them limited-range fixed-point and expect them to work.
>> probably at a minimum would need 16-bit half-floats.
>>
>> possibly there could be an alternate SIMD mode which supports packed
>> half-floats (to be friendlier to GLSL).
> GLSL and Vulkan shaders would require supporting 16-, 32- and 64-bit floats.

IME, GLSL tends to actually be a bit lax about how it is processed and 
can vary a fair bit between implementations; being limited to half-float 
wouldn't really be anything all that unusual.

though, one could still provide a full FPU, so that vector types could be 
half-float while scalar types are float (as the SIMD extension doesn't 
preclude the use of an FPU). double is probably overkill though.


>> could still beat on the idea more.
>>
>>>>> We can discuss the benefit of having a specific instruction to do
>>>>> blending or enable the full-sized multiply, but I think that is
>>>>> something we can experiment with later with the turtle and see what
>>>>> works best. So let's just focus on "is this a good idea" for now,
>>>>> and maybe can we apply the same concept to other units (network,
>>>>> audio, tpm?)? Is it not going to consume too much space on the FPGA?
>>>>> Do we really need all that flexibility?
>>>>>
>>>>> So what do you think of this idea ?
>>>> I generally agree with the idea, but can't say much for how much one
>>>> can fit on an FPGA.
>>>>
>>>> as I understand it, it shouldn't be too hard to modify an ALU to be
>>>> able to do packed ADD/SUB (mostly need to disable some internal carry
>>>> bits in the adder, ...).
>>>>
>>>> one probably will also need an operation to multiply packed bytes or
>>>> words and only keep the high-order bits (as this comes up a lot in
>>>> blending and is problematic to implement efficiently using normal
>>>> integer operations).
>>>>
>>>> PADDB Rm, Rn    //packed add bytes
>>>> PADDW Rm, Rn    //packed add words
>>>> PMULHW Rm, Rn    //multiply packed words and keep high 16 bits of
>>>> intermediate results.
>>>> ...
>>>>
>>>> for packed-word, could be nice to be able to also operate on pairs, ex:
>>>> PMULHW/V VR1, VR3
>>>>       does the same as: "PMULHW R2, R6; PMULHW R3, R7"
>>>>
>>>> ex, the VRn registers could be 64-bit pairs of 32-bit registers.
>>>>
>>>> maybe they could be done with FRn registers (to avoid taking up GPR
>>>> space),
>>>> though ideally in this case there would be a way to move values more
>>>> directly between Rn and FRn registers (without needing to go through FPUL
>>>> or
>>>> similar).
>>>>
>>>> PADDW FRm, FRn
>>>> and:
>>>> PADDW DRm, DRn
>>>>
>>>> hell, maybe FPSCR.PR and FPSCR.SZ control the behavior of SIMD operations
>>>> or
>>>> something.
>>>>
>>>> just a random idea, little to say it is actually viable.
>>> As for a blend instruction, I was more thinking about just taking two
>>> colors in an RGBA pre-multiplied world and just doing 8-bit
>>> multiplications there, which looks like:
>>> Dest[{R,G,B}] = Source[{R,G,B}] + Dest[{R,G,B}] * (1 - Source[A]);
>>> Dest[A] = Source[A] + Dest[A] * (1 - Source[A]);
>>>
>>> Maybe a color interpolation instruction could be useful too, but
>>> anyway, the idea is to keep it simple and dedicated to the task at
>>> hand for a 2D unit, not to design a generic vector unit. And even
>>> then, I would benchmark this against just using the 32-bit
>>> multiplication with bitmask and shift, as it might be enough, and
>>> increasing the complexity of the hardware may not even be needed.
>> I suspect by the time one has the gates needed to do a blend function in
>> hardware, they probably already have the gates needed to do packed SIMD
>> operations (packed-integer SIMD should actually need a lot fewer gates I
>> think than a full-precision scalar FPU; so it could be cheaper I think to do
>> some cores lacking a conventional FPU, and just doing narrower SIMD types),
>> but dunno...
>>
>> but, yeah, it is probably sane to compare against trying to do fake SIMD via
>> integer ops or similar.
> A blend and an interpolate function would only act on 32-bit registers
> operating with 8-bit components of an RGBA pixel (the 4 muls,
> 4 adds and 1 sub would only be operating on 8 bits). It is like a
> custom SIMD instruction working on the same registers as any
> instruction, but just useful for blending, as that would be the main
> task of that 2D unit. Maybe it could be implemented as a micro-coded
> operation that reuses existing mul, add and sub units. I have no clue
> what would be the best solution there, but clearly having 2
> instructions seems to me way more limited and simpler to implement and
> use for this use case.

but, I have doubts it will save all that much, and as presented it would 
only be applicable in an extremely narrow set of use-cases (it is not 
even the most commonly used form of alpha blending IME).

if it were "Srgba*Sa + Drgba*(1-Sa)", this would at least be the more 
commonly used form.
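that form can also be faked with plain 32-bit integer ops, two 8-bit channels at a time; a rough sketch (the 0xAARRGGBB layout and the exact-divide-by-255 rounding trick are my choices here, not from the thread):

```c
#include <stdint.h>

/* Blend src over dst per "S*Sa + D*(1-Sa)" using only 32-bit integer
 * ops. The A,G and R,B channel pairs are processed together as two
 * 8-bit lanes spaced 16 bits apart; since a + na == 255, each lane's
 * intermediate stays below 2^16, so lanes never bleed into each other.
 * Pixels are 0xAARRGGBB. */
uint32_t blend_argb(uint32_t src, uint32_t dst)
{
    uint32_t a  = src >> 24;       /* source alpha, 0..255 */
    uint32_t na = 255u - a;

    uint32_t s_rb = src & 0x00FF00FFu, s_ag = (src >> 8) & 0x00FF00FFu;
    uint32_t d_rb = dst & 0x00FF00FFu, d_ag = (dst >> 8) & 0x00FF00FFu;

    /* weighted sum per lane, plus 128 for rounding */
    uint32_t rb = s_rb * a + d_rb * na + 0x00800080u;
    uint32_t ag = s_ag * a + d_ag * na + 0x00800080u;

    /* exact per-lane divide by 255: (t + (t >> 8)) >> 8 */
    rb = ((rb + ((rb >> 8) & 0x00FF00FFu)) >> 8) & 0x00FF00FFu;
    ag = ((ag + ((ag >> 8) & 0x00FF00FFu)) >> 8) & 0x00FF00FFu;

    return (ag << 8) | rb;
}
```

this is roughly what a PMULHW-style packed op would replace: the blend itself is cheap, it is the lane splitting and masking overhead that real SIMD removes.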


but, if it can do packed byte and word ops, it can support all of the 
blending modes, and also things like color gradients, ...

being able to do color gradients is a big place where packed word ops 
are helpful, and drawing primitives with interpolated vertex colors is 
also fairly common (and trying to directly use bytes here typically 
results in some pretty nasty looking artifacts).

related is interpolating raster or texture coordinates, which can also 
benefit from packed-word.


