[J-core] How crazy would a nommu debian port be?

Cedric BAIL cedric.bail at free.fr
Tue Aug 23 18:25:29 EDT 2016

On Tue, Aug 23, 2016 at 1:39 PM, Rich Felker <dalias at libc.org> wrote:
> On Tue, Aug 23, 2016 at 11:50:58AM -0700, Cedric BAIL wrote:
>> >> > The issue is drawing into the buffer in the first place. Rich did
>> >> > some benchmarking on memory copy, and this is a similar problem
>> >> > set. With some clever programming or hardware, one could get DMA
>> >> > assist on block copy.
>> >>
>> >> Yes, I am expecting that. memcpy and memset are usually the limit
>> >> on every hardware when doing graphics anyway. That's why the most
>> >> important technique is not to draw anything if possible :-) Partial
>> >> updates and hardware planes are usually the best helpers there. If
>> >> you look at your screen, only a few pixels really change at once,
>> >> and those are the only thing your CPU really needs to work on. With
>> >> clever UI design and proper cut-off of useless drawing, a J2 should
>> >> provide enough possibility, but you are indeed pointing to an
>> >> interesting point. Would it be possible to speed up memcpy and
>> >> memset with some DMA assist? It is usually not possible, as the
>> >> cost of going into the kernel and setting up the MMU destroys any
>> >> possible gain, but maybe on a J2 it makes sense.
>> >
>> > Yes, but for it to be architecturally reasonable (and future-proof to
>> > J3/J4 with mmu) for userspace Linux binaries to use dma memcpy, we'd
>> > need to add a dma_memcpy syscall. There's precedent for this on
>> > blackfin so I think it may be reasonable. But it would only make sense
>> > for large, well-aligned copies.
>> As memcpy is part of the libc, doesn't that relax the constraint a
>> bit?
> Yes and no. To do it in userspace, the libc memcpy would have to be
> aware of the specific dma controller available, where its registers
> are mapped, which channel it's permitted to use, how to negotiate
> access to that dma channel and to the dma controller registers with
> other processes, etc. It could be done if the kernel somehow provided
> the information to the process in an appropriate form, but this is all
> way outside the scope of what belongs in userspace, unless perhaps the
> kernel simply provided it as a black box memcpy function in the vdso.

I see your point. It seems hard to do in a vdso, though, no? At least
in the case where you have an MMU and proper separation between user
space and kernel. I am not a kernel developer nor a libc developer, so
I might be wrong.

>> I can imagine an inline function version for the small-alignment
>> copies that switches to a function call for the larger ones. In that
>> case we can likely use a runtime initialization check to get both
>> the length at which it makes sense to switch to the DMA on that cpu
>> and to select the function pointer (either using a syscall or
>> directly accessing the hardware). This would work for all J-core
>> versions and provide a speedup for every application. Binaries
>> wouldn't be impacted and wouldn't even require any change. It should
>> be easy to test and experiment on an FPGA whether this kind of
>> improvement would pay off.
> The syscall approach would be for libc to have some way to probe if
> the syscall is available (maybe just try it the first time and
> remember failure) and use it for copies above a certain size threshold
> for which the syscall cycle overhead is tolerable (small relative to
> the time that would be spent in memcpy).

Yes, I was thinking of something along those lines.
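To make the idea concrete, here is a minimal sketch of what such a
probe-and-fallback wrapper could look like in C. The syscall number,
the wrapper name and the threshold are all made up for illustration;
on any current kernel the call fails with ENOSYS, which is exactly
what the one-time probe relies on to fall back to the plain memcpy:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical syscall number for the proposed dma_memcpy; a real
 * port would define its own. Current kernels return ENOSYS for it. */
#define NR_dma_memcpy_hypothetical 999

/* Below this size the syscall round trip costs more than it saves;
 * the real threshold would be measured on the target cpu. */
#define DMA_THRESHOLD (64 * 1024)

static int dma_state; /* 0 = untried, 1 = available, -1 = absent */

void *memcpy_with_dma(void *dst, const void *src, size_t n)
{
    if (n >= DMA_THRESHOLD && dma_state >= 0) {
        if (syscall(NR_dma_memcpy_hypothetical, dst, src, n) == 0) {
            dma_state = 1; /* dma path works, keep using it */
            return dst;
        }
        if (errno == ENOSYS)
            dma_state = -1; /* remember the failure, never retry */
        /* any other error: fall through to the software copy */
    }
    return memcpy(dst, src, n);
}
```

Small copies never touch the syscall at all, so the existing fast path
stays untouched.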

> But vdso might be a better approach.

My understanding of vdso is that it also solves the probing at link
time, so there is no first-time cost. Add to that a lower cost than
doing a syscall, and it would cover a larger share of memcpy calls. I
see the appeal, but wouldn't that impose harder requirements on future
systems to provide safe/secure operation per process, once you have an
MMU and kernel/userspace separation?

>> >> > But the real win would come from hardware assist on 2D primitives,
>> >> > which X or anything else could use. That would make for an
>> >> > interesting project, but not something our team here will be able
>> >> > to spend time on, or even look into...
>> >>
>> >> Hum, I don't know. Today, none of the major toolkits use X
>> >> primitives, and they are badly supported by all major GPU drivers.
>> >> I don't even know if a 2D raster accelerator (something that would
>> >> allow arbitrary line compositing) would be beneficial, as the
>> >> maintenance and porting of pixman and cairo will be quite tricky.
>> >
>> > If nothing else, accelerated blit via dma would be a useful driver
>> > feature and less invasive than dma memcpy for userspace.
>> Not sure what you mean by an accelerated blitter here. Maintaining
> I meant a "2D accel" driver that doesn't have any accelerated
> operations except block fills and block copies.
>> pixman and cairo is a lot of work and quite invasive too. The
>> Raspberry Pi has had such a 2D unit for years; it never took off,
>> and they have a very large user base with constrained hardware.
>> Arguably people may have settled on their OpenGL core as it requires
>> less work to get going. Still, I think maintaining a custom 2D
>> engine in any library is going to be a hard task.
> Outside of the J-Core project, efficient 2D UI rendering, including
> modern (multilingual, high-res, antialiased) text rendering, with very
> small code and hardware requirements, is a major interest of mine, but
> I don't know if/when I'll get a chance to work on it.

:-) Time :-) I have been very surprised by FreeType's ability to
generate scan lines very fast (faster than cairo). We are using it for
vector rendering in EFL: basically leveraging FreeType to convert SVG
shapes into scan lines, then using EFL's existing optimized blitting
code for raster graphics. The reason I am pointing this out is that I
think you would want to go in steps if you want to do a 2D UI
rendering core. I think you would first rely on generating the scan
lines in software and use a "multi"-line blitter to push the content
to the screen or to a buffer. This would be a simpler step and useful
to many (a blitter can be used right away in pixman, for example).
Still, it is a lot of work to support the minimum set of colorspaces
and filling scenarios, and to optimize user space to take advantage of
it.
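To illustrate the split I have in mind, here is a tiny sketch (all
names invented for the example) of the blitter half: a rasterizer,
FreeType-style, produces coverage spans on the CPU, and the span
blitter, which is the part a 2D engine could take over, composites
them into the target buffer:

```c
#include <stddef.h>
#include <stdint.h>

/* A coverage span as a scan-line rasterizer would emit it: a run of
 * `len` pixels on row `y`, starting at `x`, with 0..255 coverage. */
typedef struct { int x, y, len; uint8_t coverage; } span_t;

/* Blend a solid 8-bit gray color onto a grayscale framebuffer, using
 * each span's coverage as alpha. A 2D blit engine would do exactly
 * this run of multiply-adds per span; here it is plain C. */
static void blit_spans(uint8_t *fb, int stride,
                       const span_t *spans, size_t nspans, uint8_t color)
{
    for (size_t i = 0; i < nspans; i++) {
        uint8_t *row = fb + spans[i].y * stride + spans[i].x;
        unsigned a = spans[i].coverage;
        for (int j = 0; j < spans[i].len; j++)
            row[j] = (uint8_t)((color * a + row[j] * (255 - a)) / 255);
    }
}
```

The scan-line generation stays in software either way; only this inner
loop is a candidate for offload.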

Also, all rendering APIs are synchronous (pixman and cairo, for
example), or leave very little room for the fire, queue and notify
design that you would want for a 2D blitter. Some tasks, like glyph
rendering under a certain size, are really too small to offload to a
hardware unit. In OpenGL, it pays off past a certain glyph size;
otherwise just rendering to a buffer and uploading it is faster (of
course, the complexity of such a shader is one reason for the
problem). And with the J2 we are definitely going to stick to small
font sizes; I don't think we are going to be close to HiDPI rendering
:-)
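For what it's worth, the fire, queue and notify shape I mean could be
as small as this (entirely illustrative, none of these names exist in
any real API): callers enqueue blit commands without blocking, and a
completion callback plays the role of the interrupt a real engine
would raise:

```c
#include <stddef.h>

/* One rectangular copy for the (imaginary) 2D engine. */
typedef struct { int src_x, src_y, dst_x, dst_y, w, h; } blit_cmd;

typedef struct blit_queue {
    blit_cmd cmds[32];
    size_t n;
    void (*done)(void *user); /* the "notify" part */
    void *user;
} blit_queue;

/* Fire: record the command and return immediately, no waiting. */
static int blit_submit(blit_queue *q, const blit_cmd *c)
{
    if (q->n == 32)
        return -1; /* queue full; caller may flush and retry */
    q->cmds[q->n++] = *c;
    return 0;
}

/* On real hardware this would kick the DMA engine and return, with
 * the notify coming later from an interrupt handler. Here the batch
 * is pretended to complete instantly. */
static void blit_flush(blit_queue *q)
{
    /* ... program the engine with q->cmds[0 .. q->n) ... */
    q->n = 0;
    if (q->done)
        q->done(q->user);
}

/* Example notify callback: count completed batches. */
static void count_batches(void *user) { (*(int *)user)++; }
```

The point is that pixman- and cairo-style APIs give the caller no
place to put that callback, which is why retrofitting one is invasive.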

Just a quick link for people to understand what I am talking about and
follow the discussion: in my talk,
http://elinux.org/images/a/a1/ELCE-2015-EFL-Cedric-Bail.pdf, page 26,
all the compositing (green) could be pushed to a 2D hardware blitter
if it were there, while the scan line computation (yellow) would
remain on the CPU. I have no presentation yet on the subject of using
hardware layers in the most efficient way, as we are not there yet in
our code. Hopefully by the end of next year we will have full support
for hardware layers on the application side. It is way more tricky to
get right than it looks.

>> A simple support for a fixed number of hardware plane with various
>> color space support, clipping and color multiplier could be exposed as
>> a kms/drm/dmabuf output and would require no change in userspace to
>> leverage it.
> Oh yeah, maybe the kernel dmabuf stuff already allows software to take
> advantage of dma copies/fills.

Yes, to a limited extent; it should allow for scrolling at a minimum.
I don't know if it yet allows for clipping and color merging too.
Hopefully it does, but even if it doesn't, it should be doable to
extend it. Most UIs being static most of the time, with very few
pixels changing (like just a cursor blinking), using hardware planes
should pay off the most and be the easiest to roll out. It is also
portable as an API, available on both x86 and ARM hardware (mostly to
improve battery life, as performance is less of a concern for them).

It is the same API on all hardware and pays off everywhere (reducing
battery usage and increasing performance), so it is easier to support
and maintain over time. Weston, even if it is just a toy, and
Enlightenment, still the most likely to run on J2 hardware, should
already support hardware planes to some extent (the focused top
application and the mouse cursor, if I remember correctly). Improving
toolkits to take advantage of it is the next step.

>> Arguably it won't be as nice and powerful as a fully accelerated
>> blitter, but it will require no specific software except for the
>> kernel driver. Is that what you had in mind when you said
>> accelerated blit? If so, yes, I agree, it should indeed be even less
>> invasive than a dma memcpy as described above.
> Yes.
>> Do you think there will be enough space on the Turtle to actually have
>> SMP, HDMI, ethernet and both DMA engines as described in this email?
> Yes. I'm not sure if the final Turtle will use an LX25 or LX45, but
> given that everything in the current release fits in an LX9, even if
> it were using the whole LX9 and even if SMP doubled the FPGA space
> needed, there'd still be almost 1/3 of an LX25 (about the space of a
> whole LX9) free for HDMI, ethernet, DMA, USB, etc. The prototype I
> have with LX25 has SMP and ethernet already.

That sounds great! I really hope to see that hardware, as I am really
starting to have ideas of things to meddle with now :-)

Cedric BAIL
