[J-core] How crazy would a nommu debian port be?

Tue Aug 23 16:39:48 EDT 2016

On Tue, Aug 23, 2016 at 11:50:58AM -0700, Cedric BAIL wrote:
> >> > The issue is drawing into the buffer in the first place. Rich did
> >> > some benchmarking on memory copy, and this is a similar problem
> >> > set. With some clever programming or hardware, one could get DMA
> >> > assist on block copy.
> >>
> >> Yes, I am expecting that. memcpy and memset are usually the limit on
> >> every hardware when doing graphics anyway. That's why the most
> >> important technique is to not draw anything if possible :-) Partial
> >> updates and hardware planes are usually the best helpers there. If you
> >> look at your screen, how many pixels really change at once? That's
> >> the only thing your CPU really needs to work on. With clever UI design
> >> and proper cut off on the useless drawing, a J2 should provide enough
> >> possibility, but you are indeed pointing to an interesting point.
> >> Would it be possible to speedup memcpy and memset with some DMA assist
> >> ? It is usually not possible as the cost of going into the kernel and
> >> setting up MMU destroys all possible gain, but maybe on a J2 it makes
> >> sense.
> >
> > Yes, but for it to be architecturally reasonable (and future-proof to
> > J3/J4 with mmu) for userspace Linux binaries to use dma memcpy, we'd
> > need to add a dma_memcpy syscall. There's precedent for this on
> > blackfin so I think it may be reasonable. But it would only make sense
> > for large, well-aligned copies.
> 
> As memcpy is part of the libc, doesn't that relax a bit the constraint
> ?

Yes and no. To do it in userspace, the libc memcpy would have to be
aware of the specific dma controller available, where its registers
are mapped, which channel it's permitted to use, how to negotiate
access to that dma channel and to the dma controller registers with
other processes, etc. It could be done if the kernel somehow provided
the information to the process in an appropriate form, but this is all
way outside the scope of what belongs in userspace, unless perhaps the
kernel simply provided it as a black box memcpy function in the vdso.

> I can imagine that it is an inline function version for the small
> alignment copy and then switch to a function for the larger one. In
> which case we can likely use a runtime initialization check to get
> both the length size that it makes sense to switch to use the dma on
> that cpu and select the function pointer (either using a syscall or
> directly accessing the hardware). This would work for all J core
> version and provide a speedup for every application. Binaries wouldn't
> be impacted and wouldn't even require any change. It should be easy to
> test and experiment with a FPGA if this kind of improvements would pay
> off.

The syscall approach would be for libc to have some way to probe if
the syscall is available (maybe just try it the first time and
remember failure) and use it for copies above a certain size threshold
for which the syscall cycle overhead is tolerable (small relative to
the time that would be spent in memcpy). But vdso might be a better
approach.
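
A minimal sketch of that probe-and-remember logic, assuming a
hypothetical dma_memcpy syscall (stubbed out here so that it always
fails with ENOSYS, as on a kernel that lacks it) and an arbitrary
4096-byte threshold. The names copy_with_dma_fallback,
dma_memcpy_syscall and DMA_COPY_THRESHOLD are made up for illustration
and do not come from any real kernel or libc:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff: below this size the syscall overhead would
 * likely outweigh any gain from the dma engine. */
#define DMA_COPY_THRESHOLD 4096

/* Stand-in for the proposed (hypothetical) dma_memcpy syscall. This
 * stub always fails with ENOSYS, like a kernel without the syscall. */
static long dma_memcpy_syscall(void *dst, const void *src, size_t n)
{
    (void)dst; (void)src; (void)n;
    errno = ENOSYS;
    return -1;
}

/* 0 = not probed yet, 1 = syscall works, -1 = syscall unavailable.
 * Not thread-safe; a real libc would need an atomic here. */
static int dma_state;
static unsigned long dma_attempts; /* how often the syscall was tried */

void *copy_with_dma_fallback(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    if (n >= DMA_COPY_THRESHOLD && dma_state >= 0) {
        dma_attempts++;
        if (dma_memcpy_syscall(dst, src, n) == 0) {
            dma_state = 1;
            return dst;
        }
        if (errno == ENOSYS)
            dma_state = -1; /* remember failure, never probe again */
    }
    /* Small copy, or dma path unavailable or failed: plain memcpy. */
    return memcpy(dst, src, n);
}
```

With the stub above, the first large copy probes once, records the
failure, and every later call goes straight to the ordinary memcpy.
A real version would also want to check alignment before trying dma.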

> >> > But the real win would come from hardware assist on 2D primitives,
> >> > which X or anything else could use. That would make for an
> >> > interesting project, but not something our team here will be able
> >> > to spend time on, or even look into...
> >>
> >> Hum, I don't know. Today, none of the major toolkits use X primitives
> >> and they are badly supported by all major GPU drivers. I don't even
> >> know if a 2D raster accelerator (something that would allow arbitrary
> >> line compositing) would be beneficial, as the maintenance and porting
> >> of pixman and cairo will be quite tricky.
> >
> > If nothing else, accelerated blit via dma would be a useful driver
> > feature and less invasive than dma memcpy for userspace.
> 
> Not sure what you mean by an accelerated blitter here. Maintaining

I meant a "2D accel" driver that doesn't have any accelerated
operations except block fills and block copies.

> pixman and cairo is a lot of work and quite invasive too. The
> Raspberry Pi had this 2D unit for years, it never took off and they
> have a very large user base with a constrained hardware. Arguably
> people may have settled for their OpenGL core as it requires less work
> to get that one going. Still, I think maintaining a custom 2D engine
> in any libraries is going to be a hard task.

Outside of the J-Core project, efficient 2D UI rendering, including
modern (multilingual, high-res, antialiased) text rendering, with very
small code and hardware requirements, is a major interest of mine, but
I don't know if/when I'll get a chance to work on it.

> A simple support for a fixed number of hardware planes with various
> color space support, clipping and color multiplier could be exposed as
> a kms/drm/dmabuf output and would require no change in userspace to
> leverage it.

Oh yeah, maybe the kernel dmabuf stuff already allows software to take
advantage of dma copies/fills.

> Arguably it won't be as nice and powerful as a full
> accelerated blitter, but it will require no specific software except
> for the kernel driver. Was that what you had in mind when you say
> accelerated blit ? If so, yes, I agree, should indeed be even less
> invasive than a dma memcpy even as described above.

Yes.

> Do you think there will be enough space on the Turtle to actually have
> SMP, HDMI, ethernet and both DMA as described in this email ?

Yes. I'm not sure if the final Turtle will use an LX25 or LX45, but
given that everything in the current release fits in an LX9, even if
it were using the whole LX9 and even if SMP doubled the FPGA space
needed, there'd still be almost 1/3 of an LX25 (about the space of a
whole LX9) free for HDMI, ethernet, DMA, USB, etc. The prototype I
have with LX25 has SMP and ethernet already.

Rich