[J-core] Video output on Turtle

Rob Landley rob at landley.net
Fri Jan 24 12:27:59 UTC 2020

On 1/24/20 1:23 AM, Cedric Bail wrote:
> Moving our discussion regarding Turtle design and constraint on the graphics part.
> On Thursday, January 23, 2020 7:30 PM, Rob Landley <rob at landley.net> wrote:
>> On 1/23/20 8:15 PM, Cedric Bail wrote:
>>> Supporting 640x480 in digital only is ok, but if you are going
>>> analog, maybe supporting 576i and 480i would make sense
>> We picked 640x480 because it's the closest VGA size that maps down to analog
>> easily. (Horizontal resolution's sort of a handwave based on clock speed, but
>> vertical is a 1-1 mapping.)
> I see.

Hmmm... Actually I vaguely remember us fiddling with the vertical resolution
slightly to make the NTSC side easier. (My notes on that are in Austin. I
_really_ didn't want the HDMI side to be gratuitously interlaced, so we worked
out how to do interlaced and double speed progressive output from the same
memory buffer.)

But DVI is pretty darn flexible about output resolution (it takes whatever you
feed it), and we're the ones dividing the block of memory into rows and columns...

>> We're using DVI, not HDMI, because it's out of patent. (HDMI is backwards
>> compatible with DVI signals.)
> Makes sense, and you get the DVI signal over the HDMI connector. From what I remember
> this is valid all the way up to 1080p, so quite enough margin while using
> a non-patented technology.


>>> (So 720x576 at 50 and 720x480 at 60) with the rest of the space being padded
>>> when generating the analog signal. 576i is ~30% bigger than 640x480,
>> 640x480 is 80x60 characters in an 8x8 font.

I remember looking at 800x600 and it was too much memory bandwidth for comfort.
(Would work fine in black and white though.)

>>> it might already be challenging without any other hardware help to handle
>>> this resolution.
>> Oh our hardware could generate a much faster output, the problem is memory
>> bandwidth. We have one DRAM controller and are constantly streaming from it to
>> refresh the display. I think it eats something like 20% of the bandwidth? (We
>> have the numbers somewhere.)

Hmmm, did the 800x600 eat 20% and 640x480 eat less? (I originally wanted
1024x768x24 bit color and that was a "nope"...)

This was something like August, it's been a while and I forget the numbers. :(

> Oh! I wasn't expecting that the DRAM controller would be that much of a
> bottleneck. Arguably memory bandwidth is almost always the limiting factor. This
> sounds like the total memory bandwidth would be something around 150MB/s. Would
> you have that number somewhere?

I don't have the notes in front of me, but I believe we chose 16 bit color depth
(so the framebuffer is an array of unsigned shorts; I forget which color has 6 bits
and the other two have 5), which means each pass over the screen reads 614400 bytes
(19200 cache lines), and then it has to _refresh_ the screen just under
60 times/second (twice the NTSC refresh rate, if we're generating both signals at
the same time without reading the framebuffer data even MORE times), which means
it's reading 36.8 megabytes/second from DRAM.
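The arithmetic above checks out; a few lines of Python make it explicit (the 32-byte cache line size and 60 Hz refresh are taken from the description above):

```python
# Framebuffer bandwidth for the configuration described above:
# 640x480 at 16 bits per pixel, ~60 refreshes/second, 32-byte cache lines.
WIDTH, HEIGHT = 640, 480
BYTES_PER_PIXEL = 2      # 16-bit color (one channel gets 6 bits, two get 5)
CACHE_LINE = 32
REFRESH_HZ = 60

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
cache_lines = frame_bytes // CACHE_LINE
bytes_per_second = frame_bytes * REFRESH_HZ

print(frame_bytes)       # 614400 bytes per screen pass
print(cache_lines)       # 19200 cache lines
print(bytes_per_second)  # 36864000 bytes/second, i.e. ~36.8 MB/s
```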

(The obvious way to generate the NTSC signal at the same time is to make your
SRAM buffer _two_ scanlines long (640*2*2 = 2560 bytes, an even 80 cache
lines), and have the DVI output emit the whole thing but have the NTSC side do
the left half at half speed each time, then do the right half at half speed each
time. (There's a funky half-line handoff at the top and bottom, but it works out ok.)
Then you have the HDMI side schedule a cache line fetch each time it's output 32
bytes of data (which spaces them out nicely so the CPU is never starved; it gets
3+ cache line fetches to each one of yours), and the data the NTSC output is
using from the same buffer is refreshed automatically at the right time. And
since you don't need the cache line fetch request's results until it's gone all
the way around the buffer again, you don't even need to worry about I/O
priority. (Making the video higher priority than the processor does spread the
latency evenly, though.)
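The sizes in that scheme work out as follows (a sketch of the arithmetic, not the actual RTL; the 2 bytes/pixel and 32-byte cache line figures come from the paragraphs above):

```python
# Sizes for the shared two-scanline SRAM buffer sketched above,
# assuming 640 pixels/line at 2 bytes/pixel and 32-byte cache lines.
LINE_BYTES = 640 * 2
BUFFER_BYTES = LINE_BYTES * 2            # two scanlines
BUFFER_CACHE_LINES = BUFFER_BYTES // 32

print(BUFFER_BYTES)        # 2560 bytes
print(BUFFER_CACHE_LINES)  # 80 cache lines

# The DVI side schedules one cache-line fetch per 32 bytes it emits;
# at 2 bytes/pixel that's one fetch every 16 pixels, so each fetch has
# 16 pixel clocks of slack before its slot comes around again.
pixels_between_fetches = 32 // 2
print(pixels_between_fetches)  # 16
```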

>>> An ARM at 500Mhz with NEON is what Qt would require to
>>> handle this resolution at 24fps.
>> Nah, it's circuitry, not processor doing it. And we can scroll the screen
>> vertically for _free_ by treating the bitmap as a ring buffer and have the
>> starting location in a register.
> Yes, that is an old trick for handling scrolling, especially in space shoot-em-up
> style games. Consoles also use it to implement their scrolling abilities.
> I am wondering if we can use the kernel KMS API to express this kind of
> scrolling in a way that the console can use right away.

I think it's built into the framebuffer driver somewhere already. That's more of
a Rich question, though. :)
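The ring-buffer scroll trick can be modeled in a few lines (the names are illustrative; in hardware this is just a start-address register being bumped, with no memory traffic at all):

```python
ROWS = 480          # framebuffer rows, treated as a ring
start_row = 0       # models the "starting location" register

def scroll_lines(n):
    """Scroll vertically by bumping the register -- no memory copy."""
    global start_row
    start_row = (start_row + n) % ROWS

def physical_row(display_row):
    """Which framebuffer row holds this display row."""
    return (start_row + display_row) % ROWS

scroll_lines(8)             # scroll up one 8x8 text line
print(physical_row(0))      # 8: old row 8 is now at the top
print(physical_row(479))    # 7: the old top rows wrapped to the bottom
```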

>>> EFL could do slightly better, but overall without SIMD it would be very
>>> hard. Ideally, having faster memset and memcpy would also be an improvement.
>>> It really depends if the display system is important to the project or not.
>> It is, but the concerns you're raising seem completely unrelated? We may
>> implement 2D acceleration at some point (bitblt via dma), but our first pass is
>> just "raw linux framebuffer driver".
> Indeed, I was expecting the CPU to be a bottleneck in its current state because I was
> expecting memory bandwidth to be higher (I think I made a newbie mistake when reading
> the LX45 specs and read MB/s when it was Mb/s).

lpddr2 was designed to be low power and cheap. If you need more bandwidth you
either switch up to ddr3 or install multiple memory busses in parallel, but
turtle's just got the one.

> Anyway, I would not recommend
> spending too much effort on a 2D acceleration unit, as most software has completely
> given up on supporting anything but KMS/DRM and OpenGL.

X11 hasn't. Dragging windows around is still a thing, and _their_ terminal
window scrolling is gonna move a rectangle, not the entire screen. (Hence bitblt.)
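A toy software bitblt shows what that rectangle move looks like, and what a hypothetical DMA engine would do per row instead of the CPU (naive version, ignoring overlapping source/destination regions):

```python
def bitblt(fb, pitch, sx, sy, dx, dy, w, h):
    """Copy a w x h rectangle within a row-major framebuffer.

    Naive top-down row copy; a real engine would also handle
    overlap direction and move each row via DMA, not the CPU.
    """
    for row in range(h):
        src = (sy + row) * pitch + sx
        dst = (dy + row) * pitch + dx
        fb[dst:dst + w] = fb[src:src + w]

# Tiny 8x8 "framebuffer" with each pixel's index as its value.
pitch = 8
fb = list(range(8 * 8))
bitblt(fb, pitch, 0, 0, 4, 4, 2, 2)   # move a 2x2 block to (4,4)
print(fb[4 * 8 + 4], fb[4 * 8 + 5])   # 0 1
```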

> The KMS/DRM API allows
> multiple layers to be composited in hardware, and software usually expects a small
> amount of them with some constraints

Compositing takes even more memory. We haven't even done double buffering.

> (like size and color space). Having support
> for hardware planes with different colorspaces would allow for "compression"
> and reduced memory bandwidth when compositing some use cases (a YUV background with a
> palettised buffer on top, for example).

While Jeff knows how to make a 3D engine, we're not going there just now?

>    The memory bandwidth constraint is going to be where the challenge and
> interesting techniques are. SIMD would most likely barely have any meaningful impact
> from what I understand. Even if it might reduce the amount of code being
> processed a bit, the framebuffer will still be the main contender for bandwidth.

Did you miss the "via DMA engine" above?

DMA is needed for bitblt, for USB, for sdcard, for ethernet... Dunno if we'll
have it in the first bitstream (it's not a blocker in our release path), but it
makes a big performance difference.

> Cedric


More information about the J-core mailing list