[J-core] Video output on Turtle

Sat Jan 25 19:54:22 UTC 2020

On Friday, January 24, 2020 4:23 AM, Rob Landley <rob at landley.net> wrote:
> On 1/24/20 1:23 AM, Cedric Bail wrote:
> > On Thursday, January 23, 2020 7:30 PM, Rob Landley rob at landley.net wrote:
> > > On 1/23/20 8:15 PM, Cedric Bail wrote:
> > > > it might already be challenging without any other hardware help to handle
> > > > this resolution.
> > 

> > > Oh our hardware could generate a much faster output, the problem is memory
> > > bandwidth. We have one DRAM controller and are constantly streaming from it to
> > > refresh the display. I think it eats something like 20% of the bandwidth? (We
> > > have the numbers somewhere.)
> 

> Hmmm, did the 800x600 eat 20% and 640x480 eat less? (I originally wanted
> 1024x768x24 bit color and that was a "nope"...)

This doesn't surprise me :-)

> This was something like August, it's been a while and I forget the numbers. :(
> 

> > Oh! I wasn't expecting that the DRAM controller would be that much of a
> > bottleneck. Arguably memory bandwidth is almost always the limiting factor. This
> > sounds like the total memory bandwidth would be something around 150MB/s. Would
> > you have that number somewhere?
> 

> I don't have the notes in front of me, but I believe we chose 16 bit color depth
> (so the framebuffer is an array of unsigned, I forget which color has 6 bits and
> the other 2 have 5), which means each pass on the screen reads 614400 bytes
> (which is 19200 cache lines), and then it has to refresh the screen just under
> 60 times/second (twice the NTSC refresh rate if we're generating both signals at
> the same time without reading the framebuffer data even MORE times), which means
> it's reading 36.8 megabytes/second from DRAM.

I am guessing you had the usual RGB565 layout in mind. I did my math at 30 FPS with
32-bit color, so we end up with the same number, around 40MB/s.
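
To double check both sets of numbers, here is a small sketch of the arithmetic. It
assumes a 640x480 mode (my inference from the 614400 byte figure at 16 bits per pixel)
and 32-byte cache lines (implied by 614400 bytes being 19200 cache lines); the function
name is just for illustration.

```python
def scanout_bandwidth(width, height, bytes_per_pixel, refreshes_per_second):
    """Bytes per second read from DRAM just to keep the display refreshed."""
    return width * height * bytes_per_pixel * refreshes_per_second

CACHE_LINE_BYTES = 32  # assumption: 614400 bytes / 19200 cache lines

frame_bytes = 640 * 480 * 2                       # one RGB565 frame
frame_cache_lines = frame_bytes // CACHE_LINE_BYTES

rgb565_60hz = scanout_bandwidth(640, 480, 2, 60)  # Rob's case
xrgb32_30hz = scanout_bandwidth(640, 480, 4, 30)  # my case

print(frame_bytes, frame_cache_lines)
print(rgb565_60hz / 1e6, xrgb32_30hz / 1e6)       # megabytes per second
```

Halving the refresh rate while doubling the pixel size gives the same traffic, which is
why both estimates agree.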

> (The obvious way to generate the NTSC signal at the same time is to make your
> sram buffer be two scanlines long (640*2*2 = 2560 bytes, an even 80 cache
> lines), and have the DVI output emit the whole thing but have the NTSC do the
> left side at half speed each time, then do the right side at half speed each
> time. (There's a funky half line at top and bottom handoff but it works out ok.)
> Then you have the HDMI side schedule a cache line fetch each time it's output 32
> bytes of data (which spaces them out nicely so the CPU is never starved, it gets
> 3+ cache line fetches to each one of yours), and the data the NTSC output is
> using from the same buffer is refreshed automatically at the right time. And
> since you don't need the cache line fetch request results until it's gone all
> the way around the buffer again, you don't even need to worry about I/O
> priority. (Making the video higher priority than the processor does spread the
> latency evenly though.)
> 
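
A rough sketch of the two-scanline buffer described above, as I understand it. The
640 pixel width, 2 bytes per pixel and 32-byte cache lines are taken from the numbers
in this thread; `slot_to_refill` is a hypothetical helper, not anything from the actual
design.

```python
WIDTH_PIXELS = 640
BYTES_PER_PIXEL = 2        # 16-bit color
SCANLINES_BUFFERED = 2
CACHE_LINE_BYTES = 32

buffer_bytes = WIDTH_PIXELS * BYTES_PER_PIXEL * SCANLINES_BUFFERED
buffer_slots = buffer_bytes // CACHE_LINE_BYTES
pixels_per_fetch = CACHE_LINE_BYTES // BYTES_PER_PIXEL

def slot_to_refill(bytes_output):
    """Slot that was just fully consumed, or None in the middle of a cache line.

    One fetch is scheduled per 32 bytes of output, and the result is not
    needed until the output has gone all the way around the ring again.
    """
    if bytes_output == 0 or bytes_output % CACHE_LINE_BYTES:
        return None
    return (bytes_output // CACHE_LINE_BYTES - 1) % buffer_slots

print(buffer_bytes, buffer_slots, pixels_per_fetch)
```

Each refill has a full trip around the 80 slots before its data is consumed, which is
why fetch priority matters so little here.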

> > > > An ARM at 500MHz with NEON is what Qt would require to
> > > > handle this resolution at 24fps.
> > 

> > > Nah, it's circuitry, not processor doing it. And we can scroll the screen
> > > vertically for free by treating the bitmap as a ring buffer and have the
> > > starting location in a register.
> > 

> > Yes, that is an old trick for handling scrolling, especially in space shoot 'em
> > up style games. The console can also use it to implement its scrolling.
> > I am wondering if we can use the kernel KMS API to express this kind of
> > scrolling in a way that the console can use right away.
> 

> I think it's built into the framebuffer driver somewhere already. That's more of
> a Rich question, though. :)

I think it was there before, but they dropped it for a less complex code base, since
you can now write your console entirely in userspace, which is always an option.
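
For reference, the ring-buffer scrolling trick discussed above can be sketched like
this. The class and its fields are hypothetical stand-ins for the start-location
register Rob mentions, and the 640x480 RGB565 geometry is assumed.

```python
class RingFramebuffer:
    """Vertical scrolling by moving a start row instead of copying pixels."""

    def __init__(self, height=480, stride=640 * 2):
        self.height = height
        self.stride = stride     # bytes per scanline
        self.start_row = 0       # stands in for the hardware start register

    def scroll(self, rows):
        # Scrolling only updates the register; no pixel data moves.
        self.start_row = (self.start_row + rows) % self.height

    def row_offset(self, display_row):
        """Byte offset in memory of the scanline shown at `display_row`."""
        return ((self.start_row + display_row) % self.height) * self.stride

fb = RingFramebuffer()
fb.scroll(16)                    # e.g. one 16-pixel-high text line
print(fb.row_offset(0), fb.row_offset(479))
```

After scrolling, the console only has to redraw the rows that wrapped around, which
is what makes the scroll essentially free.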

> > Anyway, I would not recommend
> > spending too much effort on a 2D acceleration unit, as most software has completely
> > given up on supporting anything but KMS/DRM and OpenGL.
> 

> X11 hasn't. Dragging windows around is still a thing, and their terminal
> window scrolling is gonna move a rectangle, not the entire screen. (Hence bitblt.)

Do you have the name of the X11 subsystem that would do so? I was under the impression
that DRI is just for providing access to GL these days, with glamor on top being used
to speed up all the 2D rendering tasks of X11 (when there is no compositor involved).

> > KMS/DRM allows for
> > multiple layers to be composited in hardware, and software usually expects a small
> > number of them with some constraints
> 

> Compositing takes even more memory. We haven't even done double buffering.

Well, every user interface is the result of compositing different elements; it is not
just windows. The question is when you merge the composition into its own buffer and
when you don't. The cases where you usually don't are scrolling list/grid elements and
dynamic/animated elements.

Considering how you describe the rendering logic, supporting double buffering or even
triple buffering should be doable without much change to the hardware design. You do not
seem to have dedicated video RAM in mind, apart from the two SRAM scanlines. I am
expecting that you have a pointer to the framebuffer that is provided to the video
system. If you have a way to get an event when that pointer has been read, and another
event when the buffer it points to is done being read, you can implement any kind of
n-buffering strategy you want. I would strongly recommend having that capability very
early on, as it is useful, easy to use, and most software is ready for this kind of
interface.
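
To make the two events concrete, here is a toy model of how software could build
n-buffering on top of them. Everything here (the `FlipQueue` class, its method names,
numbering buffers from 0) is hypothetical; it only assumes the pointer register and
the two events described above.

```python
from collections import deque

class FlipQueue:
    """Toy model of n-buffering driven by two hardware events."""

    def __init__(self, n_buffers):
        self.pointer = 0        # framebuffer pointer register
        self.scanning = 0       # buffer latched for the current frame
        self.scan_done = False  # has the current frame been fully read?
        self.free = deque(range(1, n_buffers))

    def acquire(self):
        """Take a buffer to render into, or None if all are busy."""
        return self.free.popleft() if self.free else None

    def flip(self, buf):
        """Software writes a finished buffer to the pointer register."""
        old, self.pointer = self.pointer, buf
        # The old value is reusable if it was never latched, or if the
        # hardware has already finished reading it.
        if old != self.scanning or self.scan_done:
            self.free.append(old)

    def on_pointer_latched(self):
        """Event 1: the video system has read the pointer register."""
        self.scanning = self.pointer
        self.scan_done = False

    def on_scanout_done(self):
        """Event 2: the latched buffer is done being read."""
        self.scan_done = True
        if self.pointer != self.scanning:
            self.free.append(self.scanning)

q = FlipQueue(3)               # triple buffering
first = q.acquire()
q.flip(first)
second = q.acquire()
exhausted = q.acquire()        # nothing left until the hardware moves on
q.on_scanout_done()            # buffer 0 was superseded, so it is released
q.on_pointer_latched()
recycled = q.acquire()
```

With `FlipQueue(2)` the same code gives plain double buffering; the hardware side does
not change, only the number of buffers software allocates.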

> > (like size and color space). Having support
> > for hardware planes with different colorspaces would allow for "compression"
> > and reduced memory bandwidth when compositing in some use cases (a YUV background
> > with a palettized buffer on top, for example).
> 

> While Jeff knows how to make a 3D engine, we're not going there just now?

This is not necessarily for a 3D engine, but for simple 2D interfaces. A lot of UIs are
just gray with a few colored elements for things like highlights (which fit inside a
256-entry palette). The only bits that usually don't fit are pictures/photos, and they
are usually JPEG, so you can easily get them as a YUV buffer. Especially on a constrained
platform, selecting UI design elements that help the software is really useful for
achieving a nice interface that looks snappy. The Material Design trend really helps in
that regard.
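
As a rough illustration of the sizes involved, here is the per-frame cost of each plane
format. The 640x480 resolution and 4:2:0 chroma subsampling for the YUV plane are my
assumptions; the names are only for this sketch.

```python
WIDTH, HEIGHT = 640, 480       # assumed resolution

# Average bytes per pixel for each plane format.
BYTES_PER_PIXEL = {
    "rgb565": 2.0,
    "palette8": 1.0,           # one index byte per pixel, 256-entry palette
    "yuv420": 1.5,             # assumed 4:2:0 subsampling
}

def plane_bytes(fmt, width=WIDTH, height=HEIGHT):
    """Bytes read per refresh for a plane of the given format and size."""
    return int(width * height * BYTES_PER_PIXEL[fmt])

for fmt in BYTES_PER_PIXEL:
    print(fmt, plane_bytes(fmt))
```

In this simple model a full-screen palettized plane costs half of what RGB565 does, but
two stacked full-screen planes cost the sum of both, so the saving depends on how much
of the screen each plane actually covers.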

> > The memory bandwidth constraint is going to be where the challenge and
> > interesting techniques are. SIMD would most likely barely have any meaningful impact,
> > from what I understand. Even if it might slightly reduce the amount of code being
> > processed, the framebuffer will still be the main contender for bandwidth.
> 

> Did you miss the "via DMA engine" above?
>
> DMA is needed for bitblt, for USB, for sdcard, for ethernet... Dunno if we'll
> have it in the first bitstream (it's not a blocker in our release path), but it
> makes a big performance difference.

I tend to dismiss DMA engines for bitblt, as every time I have had to use one the CPU was
more efficient and faster, but since this one doesn't exist yet, sharing that experience
might help.

Most of these engines are hard to access safely from user space, which increases the
cost of using them. They are also usually not good with small operations (not just
because of the need to go to kernel space and all, but because setting them up is
costly). The larger the operation the better, but then you are forced to lose bandwidth
on things that are not necessary. They are also not very flexible with colorspace,
clipping and stretching. Sometimes they do not allow starting from a previous buffer,
forcing unnecessary screen updates. Some work with a command queue, but the logic to
detect when it is full or to prepare the next list of commands is really costly. Those
are the issues I have seen with them, off the top of my head.
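
The small-operation problem can be shown with a very simple cost model. All three
constants below are made-up placeholders chosen only to show the shape of the
trade-off; they are not measurements of any hardware, existing or planned.

```python
# Hypothetical costs, in nanoseconds. Placeholders, not measurements.
DMA_SETUP_NS = 20000       # syscall plus descriptor setup per operation
DMA_NS_PER_BYTE = 5
CPU_NS_PER_BYTE = 20

def cpu_copy_ns(n_bytes):
    return n_bytes * CPU_NS_PER_BYTE

def dma_copy_ns(n_bytes):
    return DMA_SETUP_NS + n_bytes * DMA_NS_PER_BYTE

# Size above which the DMA engine starts to win under this model.
break_even_bytes = DMA_SETUP_NS / (CPU_NS_PER_BYTE - DMA_NS_PER_BYTE)

glyph = 16 * 16 * 2        # a 16x16 RGB565 tile, typical of text rendering
frame = 640 * 480 * 2      # a whole RGB565 frame

print(break_even_bytes)
print(cpu_copy_ns(glyph), dma_copy_ns(glyph))
print(cpu_copy_ns(frame), dma_copy_ns(frame))
```

Whatever the real constants turn out to be, a fixed setup cost means there is always
some size below which the CPU wins, and a lot of UI drawing is made of small pieces.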

I would think this is the experience of everyone writing a 2D UI framework, as none
of the open source 2D rendering libraries support anything other than GL and software. It
might also be why hardware manufacturers seem to have given up on 2D accelerators
and only provide compositor acceleration. It doesn't mean it is impossible to write a
good, useful one, just that I haven't seen one :-) I may have dismissed the DMA bitblt
engine solution due to past experience, but maybe it is possible to actually get a good
one. Maybe a solution would be to integrate a J1 to drive the DMA engine independently
and give the flexibility required. I don't know. This is not a simple piece of hardware
to make useful. A discussion to start again when you are looking at implementing it, I
guess.

Cedric