[J-core] 2D unit

Mon Feb 27 00:35:32 EST 2017

On 2/26/2017 7:29 PM, Cedric BAIL wrote:
> Hello,
>
> We had Thursday a nice meetup after ELC and we came up with some
> ideas. Not to sure if they are good as the food and drink may have
> helped :-)
>
> One of the idea we discussed quite in depth was how to design a nice
> 2D unit that would go with the turtle. The main issue as discussed
> before on this mailing list is that most toolkit won't care about
> anything, but OpenGL and Vulkan this days. Obviously this standard are
> way to big and complex for us at this stage.

OpenGL isn't actually all that complicated per se (granted, it looks 
pretty scary up front); what makes things more complicated here is 
mostly trying to make it fast enough to be useful.

I suspect speed is likely to be a much bigger issue for this on an FPGA 
(which can neither do high clock speeds nor handle throwing a lot of 
cores at it).

(By contrast, on a modern desktop PC it is basically doable to get 
speeds on par with early-2000s GPUs while doing it all on the CPU.)

> There is a limited alternative which is the KMS/DRM API which provide
> a way to provide a list of buffer to display at a specific position
> per frame usually named hardware plane. This is used by a certain
> numbers of Wayland server to accelerate the movement of mouse pointer
> and allow the active window to be a zero compositing case (Not just
> when watching a movie, just about any application).
>
> To implement such functionnality the usual way in all hardware I know,
> is to have a completely dedicated black box. More often than not, this
> block are actually running some kind of firmware. This gave us an
> idea, what if we use a J1 with a small amount of dedicated SRAM (2 *
> 8KB ?) accessible from the main cpu, a dedicated access to the DMA
> engine, an interrupt line to the main CPU, control over the HDMI
> output and maybe a few special instruction to do blending operation.
>
> The kernel driver could actually contain the exact source code of the
> firmware run on that J1. The driver would use some linker trick to
> select the function and the data that need to be copied from its own
> code (As J1 and J2 have compatible instruction set, same compiler can
> be used). The boot loader could do likely the same trick, with a very
> simple implementation that handle just a terminal at 640x480.

I actually imagined something similar to this recently for a "Soft GPU": 
the GPU would be a few cores running with some local RAM and access 
mostly to VRAM, and the host driver would basically give them code for 
how to draw stuff and then start feeding in commands.

I started recently on an early software GPU (in the context of my 
personal emulator project), in which the main program communicates with 
the "GPU" via a ring buffer.

The host processor writes commands into the ring buffer and the GPU 
reads them back out, with some basic flow-control mechanisms (so 
hopefully things don't stomp on each other).
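
A minimal sketch of that kind of host-to-GPU ring buffer, in C. This is 
not the code from my emulator; the names (gpu_cmd, gpu_ring, ring_push, 
ring_pop, CMD_TRIANGLE), the slot count, and the command layout are all 
made up for illustration. It assumes one producer (the host) and one 
consumer (the GPU), and leaves out the memory barriers real hardware 
would need.

```c
#include <stdint.h>
#include <stdbool.h>

#define RING_SIZE    64u   /* assumed slot count; must be a power of two */
#define CMD_TRIANGLE 1u    /* hypothetical opcode */

typedef struct {
    uint32_t op;           /* which primitive to draw */
    int16_t  x[3], y[3];   /* screen-space vertex coordinates */
    uint32_t color;        /* flat color, for simplicity */
} gpu_cmd;

typedef struct {
    gpu_cmd slots[RING_SIZE];
    volatile uint32_t head;   /* only the host writes this */
    volatile uint32_t tail;   /* only the GPU writes this */
} gpu_ring;

/* Host side: returns false when the ring is full (flow control). */
bool ring_push(gpu_ring *r, const gpu_cmd *c)
{
    if (r->head - r->tail == RING_SIZE)
        return false;
    r->slots[r->head & (RING_SIZE - 1u)] = *c;
    r->head = r->head + 1u;   /* publish only after the slot is filled */
    return true;
}

/* GPU side: returns false when there is nothing to read. */
bool ring_pop(gpu_ring *r, gpu_cmd *out)
{
    if (r->head == r->tail)
        return false;
    *out = r->slots[r->tail & (RING_SIZE - 1u)];
    r->tail = r->tail + 1u;   /* free the slot for the host */
    return true;
}
```

Since head and tail are free-running counters and each side only writes 
its own, "full" and "empty" can be told apart without a lock or a wasted 
slot.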

Commands would mostly be things like "draw a screen-space triangle with 
these parameters".
The coords are given in screen space (so, from the GPU's POV, it is 
drawing 2D primitives either way). Things like textures and drawing 
surfaces were mostly handled as pointers within VRAM.

I was also taking some design ideas from the S3 ViRGE (which basically 
did 2D + Z-buffer), though my design differs considerably in using a 
ring buffer for submitting primitives (and in borrowing the PVR2's 2D 
framebuffer interface).

In this case, for both 2D and 3D, the GPU would only use 2D drawing 
operations (warping distortions during rendering can be reduced by 
subdividing large triangles during projection, and things like 
mip-mapping can be handled per-primitive, ...).

Thus far I have tested it for drawing triangles and similar, but a bit 
more work is needed for it to be more useful.

If it works, experiments running it in a more constrained setup could be 
possible (ex: throttling the speeds of the emulated CPU and GPU down to 
something an FPGA version could probably manage).

> I do like this idea a lot as it open the possibility for a lot of
> hacking here. You could maybe manage to generate a JIT version per
> frame instead of relying on function and manage a larger number of
> "hardware" plane. Implementing a mini scenegraph would enable the
> possibility to correctly detect the case when not to composite to
> surface and reduce the bandwidth need. That is for the most casual
> idea, opening firmware development is I am sure going to lead to
> interesting idea.

FWIW: a lot of mobile GPUs are effectively multicore processors with 
proprietary SIMD-oriented ISAs (usually the ISA is held under NDA and 
is under the sole control of the GPU driver, which is often given out 
as "binary blobs" or similar).

An open GPU could be nice.

SIMD could be pretty useful for performance (particularly packed-byte 
and packed-word), but probably isn't critical (it is sort-of possible to 
approximate SIMD using plain integer ops via some additional 
bit-twiddling).
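
As an example of the bit-twiddling meant here, a packed-byte add can be 
approximated on a plain 32-bit integer unit roughly as below. This is a 
sketch of the general trick; the function name paddb_swar is made up.

```c
#include <stdint.h>

/* Add four packed bytes with wraparound, using only ordinary 32-bit ops.
 * The low 7 bits of each byte are added first, so a carry can never
 * cross into the neighboring byte; the top bit of each byte is then
 * patched in with XOR, which adds it without producing a carry-out. */
uint32_t paddb_swar(uint32_t a, uint32_t b)
{
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    return low7 ^ ((a ^ b) & 0x80808080u);
}
```

That is a handful of integer ops for what a packed-add instruction does 
in one, which is why SIMD helps but isn't strictly required.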

> We can discuss the benefit of having a specific instruction to do
> blending or enable the full sized multiply, but I think that something
> we can experiment later with the turtle and see what work best. So
> let's just focus on is this a good idea for now and maybe can we apply
> the same concept to other unit (network, audio, tpm ?) ? Is it not
> going to consume to much space on the fpga ? Do we really need all
> that flexibility ?
>
> So what do you think of this idea ?

I generally agree with the idea, but can't say much about how much one 
can fit on an FPGA.

As I understand it, it shouldn't be too hard to modify an ALU to do 
packed ADD/SUB (mostly it needs to disable some internal carry bits in 
the adder, ...).
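
To illustrate what disabling the carry bits amounts to, here is a 
bit-level C model of a ripple adder whose carry chain is cut at lane 
boundaries. It is only a behavioral sketch, not HDL for the actual ALU, 
and the name packed_add and its lane_bits parameter are assumptions.

```c
#include <stdint.h>

/* Model of a 32-bit ripple-carry adder with a selectable lane width.
 * lane_bits is assumed to be 8, 16, or 32; with 32 it behaves as a
 * normal ADD, with 8 or 16 as a packed byte/word add. */
uint32_t packed_add(uint32_t a, uint32_t b, unsigned lane_bits)
{
    uint32_t sum = 0, carry = 0;
    for (unsigned i = 0; i < 32; i++) {
        if (i % lane_bits == 0)
            carry = 0;                   /* carry killed at lane boundary */
        uint32_t x = (a >> i) & 1u;
        uint32_t y = (b >> i) & 1u;
        sum |= (x ^ y ^ carry) << i;     /* full-adder sum bit */
        carry = (x & y) | (carry & (x ^ y));
    }
    return sum;
}
```

The only difference between the three modes is which carries get forced 
to zero, so in hardware it should come down to a few gates on the carry 
chain plus a mode select.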

One will probably also need an operation to multiply packed bytes or 
words and keep only the high-order bits (as this comes up a lot in 
blending and is problematic to implement efficiently using normal 
integer operations).

PADDB Rm, Rn    //packed add bytes
PADDW Rm, Rn    //packed add words
PMULHW Rm, Rn    //multiply packed words and keep high 16 bits of
                 //intermediate results
...
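
For reference, a C model of what the proposed PMULHW would compute on a 
32-bit register holding two 16-bit lanes. Treating the lanes as unsigned 
is an assumption here (the signedness isn't pinned down above), and the 
name pmulhw_u is made up.

```c
#include <stdint.h>

/* Multiply each 16-bit lane of a by the matching lane of b and keep the
 * high 16 bits of each 32-bit product. With b holding a 0.16 fixed-point
 * factor, this scales each lane of a by that factor, which is the core
 * step of an alpha blend. */
uint32_t pmulhw_u(uint32_t a, uint32_t b)
{
    uint32_t lo = ((a & 0xFFFFu) * (b & 0xFFFFu)) >> 16;
    uint32_t hi = ((a >> 16) * (b >> 16)) >> 16;
    return (hi << 16) | lo;
}
```

Ex: pmulhw_u(0x00FF0080, 0x80008000) scales the lanes 0xFF and 0x80 by 
0.5 and gives 0x007F0040. Without such an instruction, each lane has to 
be unpacked, multiplied, shifted, and repacked separately, as above.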

For packed-word, it could be nice to also be able to operate on pairs, ex:
PMULHW/V VR1, VR3
     does the same as: "PMULHW R2, R6; PMULHW R3, R7"

Here the VRn registers would be 64-bit pairs of 32-bit registers.

Maybe they could be done with FRn registers (to avoid taking up GPR 
space), though ideally in this case there would be a way to move values 
more directly between Rn and FRn registers (without needing to go 
through FPUL or similar).

PADDW FRm, FRn
and:
PADDW DRm, DRn

Hell, maybe FPSCR.PR and FPSCR.SZ could control the behavior of SIMD 
operations or something.

Just a random idea; no telling whether it is actually viable.