[J-core] How crazy would a nommu debian port be?

Tue Aug 23 14:50:58 EDT 2016

On Tue, Aug 23, 2016 at 9:22 AM, Rich Felker <dalias at libc.org> wrote:
> On Sun, Aug 21, 2016 at 02:17:49PM -0700, Cedric BAIL wrote:
>> On Fri, Aug 19, 2016 at 10:02 PM, D. Jeff Dionne
>> <Jeff at se-instruments.co.jp> wrote:
>> > Turtle has as noted below HDMI, and I've tested the hardware.
>> > Turtle also has composite video, which I have not tested. Making a
>> > quick circuit that generates timing for VGA, HDMI or composite
>> > video is very simple (a few 10s lines of RTL), the question is
>> > really if you want to be able to alter that timing or if it is
>> > acceptable to have one or a small number of fixed resolutions (at
>> > first, this could be fine).
>>
>> Well, HDMI is kind of overkill, but if it is there, it is
>> definitely enough :-)
>
> From a standpoint of physical hardware on the board I suspect it's
> actually simpler/cheaper, because you don't need a DAC. I assume this
> is why the Numato board has low-quality 2/3/2 bit RGB on its VGA port
> -- to avoid the cost of a proper DAC.

I see. It does indeed make a lot of sense, thanks for the clarification.

>> As for a small number of fixed resolutions,
>> does that mean we could change the supported resolution in the VHDL
>> and just rebuild it to match the output? Or would it require more
>> work? If it is the first case, I believe that would be
>> perfectly fine in my opinion.
>
> I'm not clear on whether there's any reason to prefer fixed
> resolutions. It's probably just a matter of whether the support for
> programmable clock multiplier/divider is too costly for a small fpga.
> I'm not really qualified to judge that but my guess would be not.

Sounds good.

>> > The issue is drawing into the buffer in the first place. Rich did
>> > some benchmarking on memory copy, and this is a similar problem
>> > set. With some clever programming or hardware, one could get DMA
>> > assist on block copy.
>>
>> Yes, I am expecting that. memcpy and memset are usually the limit on
>> every hardware when doing graphics anyway. That's why the most
>> important technique is to not draw anything if possible :-) Partial
>> updates and hardware planes are usually the best helpers there. If
>> you look at your screen, how many pixels really change at once?
>> Those are the only thing your CPU really needs to work on. With
>> clever UI design and proper culling of useless drawing, a J2 should
>> be capable enough, but you are indeed raising an interesting point.
>> Would it be possible to speed up memcpy and memset with some DMA
>> assist? It is usually not possible, as the cost of going into the
>> kernel and setting up the MMU destroys all possible gain, but maybe
>> on a J2 it makes sense.
>
> Yes, but for it to be architecturally reasonable (and future-proof to
> J3/J4 with mmu) for userspace Linux binaries to use dma memcpy, we'd
> need to add a dma_memcpy syscall. There's precedent for this on
> blackfin so I think it may be reasonable. But it would only make sense
> for large, well-aligned copies.

As memcpy is part of the libc, doesn't that relax the constraint a
bit? I can imagine an inline version for small or unaligned copies
that switches to a function call for larger ones. In that case we
could use a runtime initialization check to determine both the length
above which it makes sense to use DMA on that CPU and the function
pointer to select (either using a syscall or directly accessing the
hardware). This would work for all J-core versions and provide a
speedup for every application. Binaries wouldn't be impacted and
wouldn't require any change. It should be easy to test and experiment
on an FPGA to see whether this kind of improvement pays off.
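
A minimal sketch of that runtime selection, under stated assumptions:
copy_init, copy_dispatch, dma_copy_stub, DMA_ALIGN and the threshold
value are all hypothetical names and numbers, not anything that exists
in a libc or in the J-core tree. The DMA backend is a stub that only
counts calls and falls back to memcpy, so the sketch runs anywhere; a
real backend would issue a dma_memcpy-style syscall or program the
hardware directly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: alignment (in bytes) the DMA engine is assumed to need. */
#define DMA_ALIGN 4u

/* Signature shared by every copy backend. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Counts how often the (stubbed) DMA path was taken. */
static unsigned long dma_copy_calls;

/* Plain CPU copy. */
static void *cpu_copy(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Stand-in for a DMA-backed copy; it only records the call and then
 * copies with the CPU. */
static void *dma_copy_stub(void *dst, const void *src, size_t n)
{
    dma_copy_calls++;
    return memcpy(dst, src, n);
}

/* Selected once at startup. Until then, never take the DMA path. */
static copy_fn large_copy = cpu_copy;
static size_t dma_threshold = SIZE_MAX;

/* Runtime initialization: pick the backend and the length above which
 * it is assumed to pay off on this CPU. */
void copy_init(int have_dma, size_t threshold)
{
    if (have_dma) {
        large_copy = dma_copy_stub;
        dma_threshold = threshold;
    } else {
        large_copy = cpu_copy;
        dma_threshold = SIZE_MAX;
    }
}

/* Small or misaligned copies stay on the CPU; large, well-aligned ones
 * go through the selected function pointer. */
void *copy_dispatch(void *dst, const void *src, size_t n)
{
    uintptr_t bits = (uintptr_t)dst | (uintptr_t)src;

    if (n >= dma_threshold && (bits & (DMA_ALIGN - 1u)) == 0)
        return large_copy(dst, src, n);
    return cpu_copy(dst, src, n);
}
```

The threshold would have to come from measurement on the actual
hardware; the point here is only that callers never see which path was
taken.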

>> > But the real win would come from hardware assist on 2D primitives,
>> > which X or anything else could use. That would make for an
>> > interesting project, but not something our team here will be able
>> > to spend time on, or even look into...
>>
>> Hum, I don't know. Today, none of the major toolkits use X
>> primitives, and they are badly supported by all major GPU drivers.
>> I don't even know if a 2D raster accelerator (something that would
>> allow arbitrary line compositing) would be beneficial, as the
>> maintenance and porting of pixman and cairo will be quite tricky.
>
> If nothing else, accelerated blit via dma would be a useful driver
> feature and less invasive than dma memcpy for userspace.

Not sure what you mean by an accelerated blitter here. Maintaining
pixman and cairo is a lot of work and quite invasive too. The
Raspberry Pi has had such a 2D unit for years and it never took off,
even though they have a very large user base on constrained hardware.
Arguably people may have settled for its OpenGL core, as it requires
less work to get that one going. Still, I think maintaining a custom
2D engine in any library is going to be a hard task.

Simple support for a fixed number of hardware planes with various
color spaces, clipping and a color multiplier could be exposed as a
kms/drm/dmabuf output and would require no change in userspace to
leverage it. Arguably it won't be as nice and powerful as a fully
accelerated blitter, but it will require no specific software except
for the kernel driver. Was that what you had in mind when you said
accelerated blit? If so, yes, I agree, it should indeed be even less
invasive than a dma memcpy, even as described above.
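
To make the two per-plane operations concrete, here is a small
software model of what such a plane would do in hardware. It is only a
sketch: struct rect, rect_clip and channel_mul are hypothetical names,
and the rounding used for the color multiplier is an assumption, not a
description of any existing J-core block or DRM interface.

```c
#include <stdint.h>

/* Hypothetical plane or screen rectangle, in pixels. */
struct rect {
    int x, y, w, h;
};

/* Clip r against bounds in place. Returns 1 if some part of r is
 * still visible, 0 if the plane falls entirely outside. */
int rect_clip(struct rect *r, const struct rect *bounds)
{
    int x0 = r->x > bounds->x ? r->x : bounds->x;
    int y0 = r->y > bounds->y ? r->y : bounds->y;
    int rx1 = r->x + r->w, bx1 = bounds->x + bounds->w;
    int ry1 = r->y + r->h, by1 = bounds->y + bounds->h;
    int x1 = rx1 < bx1 ? rx1 : bx1;
    int y1 = ry1 < by1 ? ry1 : by1;

    if (x1 <= x0 || y1 <= y0)
        return 0;

    r->x = x0;
    r->y = y0;
    r->w = x1 - x0;
    r->h = y1 - y0;
    return 1;
}

/* Color multiplier on one 8-bit channel, rounded to nearest.
 * mul == 255 leaves the channel unchanged, mul == 0 blacks it out. */
uint8_t channel_mul(uint8_t c, uint8_t mul)
{
    return (uint8_t)((c * mul + 127) / 255);
}
```

In hardware, both operations would happen at scanout, so the CPU never
touches the pixels for a move, a fade or a partially off-screen plane.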

Do you think there will be enough space on the Turtle to actually have
SMP, HDMI, ethernet and both DMA features described in this email?

Regards,
-- 
Cedric BAIL