[J-core] 2D unit
BGB
cr88192 at gmail.com
Mon Feb 27 00:35:32 EST 2017
On 2/26/2017 7:29 PM, Cedric BAIL wrote:
> Hello,
>
> We had a nice meetup on Thursday after ELC and we came up with some
> ideas. Not too sure if they are good, as the food and drink may have
> helped :-)
>
> One of the ideas we discussed in quite some depth was how to design a
> nice 2D unit that would go with the turtle. The main issue, as discussed
> before on this mailing list, is that most toolkits don't care about
> anything but OpenGL and Vulkan these days. Obviously these standards are
> way too big and complex for us at this stage.
OpenGL isn't actually all that complicated per se (yes, granted, it
looks pretty scary up-front); what makes things more complicated
here is mostly trying to make it fast enough to be useful.
I suspect speed is likely to be a much bigger issue on an FPGA, which
can neither reach high clock speeds nor fit a lot of cores to throw at
the problem.
(By contrast, on a modern desktop PC it is basically doable to get
speeds on par with early-2000s GPUs just doing it on the CPU.)
> There is a limited alternative, the KMS/DRM API, which provides
> a way to supply a list of buffers to display at specific positions
> per frame, usually called hardware planes. This is used by a certain
> number of Wayland servers to accelerate the movement of the mouse pointer
> and to allow the active window to be a zero-compositing case (not just
> when watching a movie, but for just about any application).
>
> To implement such functionality, the usual way in all the hardware I know
> is to have a completely dedicated black box. More often than not, these
> blocks are actually running some kind of firmware. This gave us an
> idea: what if we use a J1 with a small amount of dedicated SRAM (2 *
> 8KB ?) accessible from the main CPU, dedicated access to the DMA
> engine, an interrupt line to the main CPU, control over the HDMI
> output and maybe a few special instructions to do blending operations?
>
> The kernel driver could actually contain the exact source code of the
> firmware run on that J1. The driver would use some linker trick to
> select the functions and the data that need to be copied from its own
> code (as J1 and J2 have compatible instruction sets, the same compiler
> can be used). The boot loader could likely do the same trick, with a very
> simple implementation that handles just a terminal at 640x480.
I actually imagined something similar to this recently for a "Soft GPU",
namely that the GPU would be a few cores running with some local RAM
and with access mostly to VRAM; the host driver basically gives them
the code for how to draw stuff, and then starts feeding in commands.
I recently started on an early software GPU (in the context of my
personal emulator project), in which the main program communicates with
the "GPU" via a ring buffer.
The host processor writes commands into the ring buffer and the GPU
reads commands from it, with some basic flow-control mechanisms (so
hopefully things don't stomp on each other).
In this case, commands would mostly be things like "draw a screen-space
triangle with these parameters".
The coords are given in screen space (so, from the GPU's POV, it is
drawing 2D primitives either way). Things like textures and drawing
surfaces are mostly handled as pointers within VRAM.
I was also taking some design ideas from the S3 ViRGE (which basically
did 2D + Z-buffer), though my design differs considerably in using a
ring buffer for submitting primitives (and in borrowing the PVR2's 2D
framebuffer interface).
Here, for both 2D and 3D, the GPU would only use 2D drawing operations
(warping distortions during rendering can be reduced by subdividing
large triangles during projection, and things like mip-mapping can be
handled per-primitive, ...).
Thus far I have tested it for drawing triangles and similar, but a
bit more work would be needed for it to be more useful.
If it works, experiments running it under tighter constraints could be
possible (ex: throttling the speeds of the emulated CPU and GPU to
something an FPGA version could probably do).
> I do like this idea a lot as it opens the possibility for a lot of
> hacking here. You could maybe manage to generate a JIT version per
> frame instead of relying on functions, and manage a larger number of
> "hardware" planes. Implementing a mini scenegraph would make it
> possible to correctly detect the cases when not to composite to a
> surface and reduce the bandwidth needed. Those are the most casual
> ideas; opening firmware development is, I am sure, going to lead to
> interesting ideas.
FWIW: a lot of mobile GPUs are effectively multicore processors with
proprietary SIMD-oriented ISAs (it's just that the ISA is usually held
under NDA and under the sole control of the GPU driver, which is often
given out as "binary blobs" or similar).
An open GPU could be nice.
SIMD could be pretty useful for performance (particularly packed-byte
and packed-word), but probably isn't critical (it is sort-of possible to
approximate SIMD using plain integer ops via some additional
bit-twiddling).
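As an illustration of that bit-twiddling, here is a small C sketch of packed adds done with ordinary 32-bit integer ops (often called SWAR). The function names are made up for this example. The trick is to add the low bits of each lane with the top bit masked off, so no carry can cross into the next lane, and then patch the top bits back in with XOR.

```c
#include <assert.h>
#include <stdint.h>

/* Add four packed bytes lane by lane (wrapping, no carry between lanes). */
static uint32_t paddb_swar(uint32_t a, uint32_t b)
{
    const uint32_t H = 0x80808080u;          /* top bit of every byte lane */
    uint32_t low = (a & ~H) + (b & ~H);      /* 7-bit adds; carries stop at bit 7 */
    return low ^ ((a ^ b) & H);              /* fold in the top bits, carry-free */
}

/* Same idea for two packed 16-bit words. */
static uint32_t paddw_swar(uint32_t a, uint32_t b)
{
    const uint32_t H = 0x80008000u;          /* top bit of every word lane */
    uint32_t low = (a & ~H) + (b & ~H);
    return low ^ ((a ^ b) & H);
}

/* Quick self-check on hand-worked values. */
static void swar_self_check(void)
{
    /* bytes: 01+01=02, FF+01=00 (wraps), 7F+7F=FE, 80+80=00 (wraps) */
    assert(paddb_swar(0x01FF7F80u, 0x01017F80u) == 0x0200FE00u);
    /* words: FFFF+0001=0000 (wraps), 0001+0001=0002 */
    assert(paddw_swar(0xFFFF0001u, 0x00010001u) == 0x00000002u);
}
```

This costs several plain ALU ops per packed add, which is the overhead a dedicated packed-add instruction would remove.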
> We can discuss the benefit of having a specific instruction to do
> blending or enabling the full-sized multiply, but I think that is
> something we can experiment with later on the turtle to see what works
> best. So let's just focus on whether this is a good idea for now, and
> maybe whether we can apply the same concept to other units (network,
> audio, TPM?). Is it not going to consume too much space on the FPGA?
> Do we really need all that flexibility?
>
> So what do you think of this idea ?
I generally agree with the idea, but can't say much about how much one
can fit on an FPGA.
As I understand it, it shouldn't be too hard to modify an ALU to be able
to do packed ADD/SUB (mostly one needs to disable some internal carry
bits in the adder, ...).
One will probably also need an operation to multiply packed bytes or
words and keep only the high-order bits (as this comes up a lot in
blending and is problematic to implement efficiently using normal
integer operations).
  PADDB Rm, Rn    //packed add bytes
  PADDW Rm, Rn    //packed add words
  PMULHW Rm, Rn   //multiply packed words and keep high 16 bits of
                  //intermediate results.
  ...
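To make the intended PMULHW behaviour concrete, here is a C sketch of the semantics for a 32-bit register holding two 16-bit lanes. It assumes the lanes are unsigned, which suits blending with 0.16 fixed-point weights; a signed variant is equally plausible and the proposal above does not say which is meant. The name pmulhw_u is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two pairs of unsigned 16-bit lanes and keep only the high
 * 16 bits of each 32-bit intermediate product (assumed semantics). */
static uint32_t pmulhw_u(uint32_t a, uint32_t b)
{
    uint32_t lo = ((a & 0xFFFFu) * (b & 0xFFFFu)) >> 16;   /* low lanes  */
    uint32_t hi = ((a >> 16) * (b >> 16)) >> 16;           /* high lanes */
    return (hi << 16) | lo;
}

/* Quick self-check on hand-worked values. */
static void pmulhw_self_check(void)
{
    /* Reading lanes as 0.16 fractions: 0.5*0.5 = 0.25 and 0.25*0.5 = 0.125 */
    assert(pmulhw_u(0x80004000u, 0x80008000u) == 0x40002000u);
    /* Largest inputs: 0xFFFF * 0xFFFF = 0xFFFE0001, high half is 0xFFFE */
    assert(pmulhw_u(0xFFFFFFFFu, 0xFFFFFFFFu) == 0xFFFEFFFEu);
}
```

In a blend, one operand would hold colour channels and the other the alpha weights, so a single such op scales two channels at once; done with plain integer ops it takes the masks, shifts and two multiplies shown here.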
For packed-word, it could be nice to also be able to operate on pairs, ex:
  PMULHW/V VR1, VR3
does the same as: "PMULHW R2, R6; PMULHW R3, R7"
Here, the VRn registers would be 64-bit pairs of 32-bit registers.
Maybe they could be done with FRn registers (to avoid taking up GPR
space), though ideally in this case there would be a way to move values
more directly between Rn and FRn registers (without needing to go
through FPUL or similar).
PADDW FRm, FRn
and:
PADDW DRm, DRn
Hell, maybe FPSCR.PR and FPSCR.SZ could control the behavior of SIMD
operations or something.
Just a random idea; there is nothing to say it is actually viable.