[J-core] Pondering j-core kickstarter ideas.

Mon Oct 23 21:52:20 EDT 2017

On Mon, Oct 23, 2017 at 6:34 PM, Ken Phillis Jr <kphillisjr at gmail.com> wrote:
> On Mon, Oct 23, 2017 at 6:38 PM, Cedric BAIL <cedric.bail at free.fr> wrote:
>> On Wed, Oct 18, 2017 at 5:45 PM, Ken Phillis Jr <kphillisjr at gmail.com> wrote:
>> > On Wed, Oct 18, 2017 at 3:58 PM, Rob Landley <rob at landley.net> wrote:
>> >> On 10/16/2017 12:18 PM, Ken Phillis Jr wrote:
>> > I honestly believe the current J2 Core lacks hardware Floating Point, but
>> > that is not a major problem... I was thinking the fastest solution for this
>> > would be to add Instructions for Fixed Point math, and have this task broken
>> > down into four major stages/tasks...
>>
>> What is the reasoning for dropping to fixed point? I am asking
>> because most modern 3D APIs are definitely built around floating
>> point. I am guessing it is related to your idea of doing an emulator
>> rather than running modern workloads. Am I right?
>
> The reason for fixed point is faster computation. I know it's a
> lot easier to get integer-based math running fast than floating
> point math. As far as modern graphics hardware goes, floating point
> rasterization must be avoided until the patent covering
> GL_ARB_texture_float [1] expires.
>
> [1] https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_half_float_pixel.txt

I believe this only applies to the output buffer (and not to the
intermediate steps). Most shaders I know of and have seen do all their
math in float. A half-float buffer is useful for handling HDR output
(and is a real blocker there).
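
To make the fixed-point side of this discussion concrete, here is a
minimal sketch of the arithmetic involved, in plain C. It assumes a
Q16.16 layout (the same layout GLfixed uses in OpenGL ES 1.x); the
fx16 type and the fx_* helper names are made up for illustration and
are not part of any J-Core proposal. It also assumes the compiler uses
an arithmetic right shift for negative values, which C leaves
implementation-defined.

```c
#include <stdint.h>

/* Q16.16 fixed point: 16 integer bits, 16 fractional bits.
 * Type and helper names are hypothetical, for illustration only. */
typedef int32_t fx16;

#define FX_ONE ((fx16)1 << 16)

static inline fx16 fx_from_int(int v)
{
    return (fx16)(v * 65536);
}

/* Multiply: widen to 64 bits so the intermediate product cannot
 * overflow, then drop the 16 extra fractional bits. */
static inline fx16 fx_mul(fx16 a, fx16 b)
{
    return (fx16)(((int64_t)a * b) >> 16);
}

/* Divide: pre-scale the dividend so the quotient keeps 16
 * fractional bits. The caller must ensure b != 0. */
static inline fx16 fx_div(fx16 a, fx16 b)
{
    return (fx16)(((int64_t)a * 65536) / b);
}

/* Linear interpolation a + (b - a) * t, with t in [0, FX_ONE]. */
static inline fx16 fx_lerp(fx16 a, fx16 b, fx16 t)
{
    return a + fx_mul(b - a, t);
}
```

For example, fx_mul(fx_from_int(3), FX_ONE / 2) gives 98304, which is
1.5 in Q16.16. Everything here is integer multiplies, shifts and one
divide, which is the reason it is cheap on a core without an FPU.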

<snip>

>> > Stage 3: Implement Texture Block...
>> > This is mainly 4 sub-Tasks...
>> > Task 1: Implement Texture Lookup for 1D Textures, 2D Texture (U, V), 3D
>> > Texture (U, V, W). In general this is well defined in the OpenGL
>> > specification. The biggest change is from the addition of GL_EXT_texture3D
>> > which was released in 1996.
>>
>> These are mostly specific load/store instructions with a memory unit
>> that could optimize fetching by prefetching memory for a stream in
>> advance (hardware prefetch). I am not sure at this stage whether
>> there is a real need for dedicated instructions or whether slightly
>> longer code to fetch memory is enough. This requires looking at the
>> code generated by the compiler for this case.
>>
>> > Task 2: Implement basic Texture Filtering. This is mainly texture
>> > Interpolation, and most 3D Engines make use of this in some form or another.
>> > The filters for this are as follows....
>> > * Nearest-neighbor interpolation
>> > * Bilinear filtering
>> > * Trilinear filtering
>> > * Optional: anisotropic filtering - This is an extremely expensive
>> > operation, which will probably need to wait until the cores support ddr3 or
>> > ddr4.
>>
>> Most of these operations are very simple to implement on the math
>> side with a few vector instructions and do not justify a special unit
>> in my opinion. The pressure is almost entirely on the memory and how
>> to fetch it efficiently. That's where, for example, the ability to
>> map memory in tiles (useful when fetching multiple pixels next to
>> each other for interpolation) or to have compressed memory access
>> would pay off (I would think it is best to combine compression with
>> tiling, so that we can easily find where a line starts and there
>> isn't too much jumping around to find the pixel that needs to be
>> fetched).
>>
>
> Being able to do these filters in software would be extremely useful
> since it would show the overall performance of the processor extremely
> well.

I guess when I play with my SIMD proposal, I should take that as one
of the examples.
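
As a reference point for what such a software filter looks like, here
is a minimal scalar sketch of bilinear filtering in C. The texture8
struct, the single 8-bit channel, the row-major layout, the
clamp-to-edge behaviour and the 24.8 fixed-point texel coordinates are
all assumptions chosen to keep the example short; a real
implementation would handle several channels, wrap modes and negative
coordinates.

```c
#include <stdint.h>

/* Hypothetical single-channel, 8-bit, row-major texture. */
typedef struct {
    const uint8_t *texels;
    int width;
    int height;
} texture8;

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Fetch one texel, clamping coordinates to the texture edge. */
static uint8_t fetch(const texture8 *t, int x, int y)
{
    x = clampi(x, 0, t->width - 1);
    y = clampi(y, 0, t->height - 1);
    return t->texels[y * t->width + x];
}

/* u and v are non-negative texel coordinates in 24.8 fixed point:
 * the upper bits select the texel, the low 8 bits are the blend
 * weight toward the next texel. */
static uint8_t sample_bilinear(const texture8 *t, int32_t u, int32_t v)
{
    int x0 = u >> 8;
    int y0 = v >> 8;
    uint32_t fx = (uint32_t)u & 0xFF;
    uint32_t fy = (uint32_t)v & 0xFF;

    /* The four memory fetches: this is where the pressure is. */
    uint32_t t00 = fetch(t, x0,     y0);
    uint32_t t10 = fetch(t, x0 + 1, y0);
    uint32_t t01 = fetch(t, x0,     y0 + 1);
    uint32_t t11 = fetch(t, x0 + 1, y0 + 1);

    /* Blend horizontally (results carry 8 fractional bits)... */
    uint32_t top = t00 * (256 - fx) + t10 * fx;
    uint32_t bot = t01 * (256 - fx) + t11 * fx;

    /* ...then vertically, rounding away the 16 fractional bits. */
    return (uint8_t)((top * (256 - fy) + bot * fy + (1u << 15)) >> 16);
}
```

The math is three interpolations made of multiplies, adds and shifts,
which maps naturally onto a few vector instructions. The four
neighbouring fetches per sample are the part that a tiled memory
layout would help with.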

>> > Task 3: Implement Alpha Blending/Compositing engine - This is mostly
>> > hardware implementation of the math functions outlined in:
>> > https://www.w3.org/TR/compositing/
>>
>> I have been thinking about this, and I think this is the first special
>> unit that could be done. Still, instead of doing it fully in hw, I
>> would make it a simple J1 with a vector extension, direct access to
>> the DMA engine, a small SRAM and access to the video output. This
>> would allow the kernel to share the same code as the hw compositor
>> firmware and allow for on-the-fly code changes. Imagine that you can
>> directly unroll and inline all the functions needed to fetch each
>> pixel, convert the color space, composite the result and send back
>> the result. In the future, that J1 driving the video output could also
>> take care of herding J-Cores oriented toward GPU work (which would
>> likely have some SMT, their own cache, ...).
>>
>> I am guessing that you would not get performance/watt as good as a
>> dedicated hw unit, but over the lifecycle of such hw you would have
>> gained so much flexibility. It might even be a good idea for other
>> IO, like ethernet.
>
> This is an interesting approach. The CPUID Function can help with this
> since this would mean allowing the chip to offer SIMD instructions
> without having any hardware floating point. I figure this type of
> offering is especially useful for applications like Network Storage,
> Internet Routers, etc.

Yes. You might even be able to move some of the kernel functions to
that J1 core just by playing tricks with the linker. Would be fun.
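
For a sense of how small the compositing code that kernel and firmware
would share can be, here is a minimal sketch of the "source-over"
operator from the W3C compositing document quoted above, in C. It
assumes premultiplied ARGB8888 pixels packed in a uint32_t; the
function names are hypothetical, and this is only one of the operators
such a firmware would need.

```c
#include <stdint.h>

/* Approximates x * a / 255 with rounding, for x and a in [0, 255],
 * using only a multiply, adds and shifts. */
static inline uint32_t mul_div255(uint32_t x, uint32_t a)
{
    uint32_t t = x * a + 128;
    return (t + (t >> 8)) >> 8;
}

/* Porter-Duff source-over on premultiplied ARGB8888 pixels:
 *   out = src + dst * (1 - src_alpha)
 * applied to all four channels, alpha included. */
static uint32_t blend_source_over(uint32_t src, uint32_t dst)
{
    uint32_t inv_alpha = 255 - (src >> 24);
    uint32_t out = 0;

    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xFF;
        uint32_t d = (dst >> shift) & 0xFF;
        uint32_t c = s + mul_div255(d, inv_alpha);

        if (c > 255)
            c = 255; /* guards against input that is not premultiplied */
        out |= c << shift;
    }
    return out;
}
```

An opaque source replaces the destination and a fully transparent
source leaves it untouched. The per-channel loop is the obvious
candidate for a vector extension, since all four channels go through
the same multiply and add.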

>> > Task 4: Texture data is probably going to use dimensions that are
>> > not powers of two.
>> >
>> > Stage 4: Implement OpenGL ES 1.1 with this functionality. The big part is
>> > including the GL_OES_fixed_point extension.
>>
>> How much software would be able to leverage GL ES 1.1? Systems built
>> for Vulkan are not that far from a general-purpose CPU, just with
>> fewer constraints on memory synchronisation and a lot of core
>> parallelism (using SMT and SIMD). I have actually started to wonder
>> whether a language like Rust would allow creating memory barriers in
>> software and offloading all the memory access synchronisation
>> complexity directly into the language. I haven't read any research on
>> the topic, but if a task could run in a special mode that disables
>> cache synchronisation, because the software is smart enough to do the
>> synchronisation manually, maybe we do not need dedicated cores for a
>> GPU...
>
> The availability of OpenGL ES 1.1 is mostly an artificial limitation
> until patents covering the newer specifications expire. Believe it or
> not, there are still more than a dozen patents covering Vulkan, OpenGL
> ES 2, OpenGL ES 3, and OpenGL 4.x.

Only a dozen patents? :-) I think most of the landmines for a GPU are
not really in the instruction set, as this has some history, but more
in all the optimizations needed to be fast regarding memory access
(tiles, compression, stream detection, memory barriers, ...) and
instruction scheduling (SMT, OOO, ...). And those are harder to find
if you start by implementing head-on instead of looking for similar
tricks played more than 20 years ago. This is one of the concerns
here: how do you find patents or historical implementations that would
make it safe to implement?
-- 
Cedric BAIL