[J-core] Pondering j-core kickstarter ideas.

Mon Oct 23 19:38:09 EDT 2017

On Wed, Oct 18, 2017 at 5:45 PM, Ken Phillis Jr <kphillisjr at gmail.com> wrote:
> On Wed, Oct 18, 2017 at 3:58 PM, Rob Landley <rob at landley.net> wrote:
>> On 10/16/2017 12:18 PM, Ken Phillis Jr wrote:
> I honestly believe the current J2 Core lacks hardware Floating Point, but
> that is not a major problem... I was thinking the fastest solution for this
> would be to add Instructions for Fixed Point math, and have this task broken
> down into four major stages/tasks...

What is your reasoning for dropping to fixed point? I am asking
because most modern 3D APIs are definitely built around floating
point. I am guessing it is related to your idea of doing an emulator
more than running modern workloads. Am I right?

> Stage 1: Add hardware Accelerated Fixed Point math - Luckily this has two
> main wikipedia articles, and some example C/C++ code exists. Either way,
> this can be used to help coax the Integer unit of the device to speed up 3D
> math.
>
> Wikipedia: Fixed Point Arithmetic:
> https://en.wikipedia.org/wiki/Fixed-point_arithmetic
> Wikipedia: Q Number Format: https://en.wikipedia.org/wiki/Q_(number_format)
> Github: Libfixmath - the fixed point math library:
> https://github.com/PetteriAimonen/libfixmath
> Github: LibfixMatrix - Fixed Point matrix math library:
> https://github.com/PetteriAimonen/libfixmatrix
>
>
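
For reference, the core of what Stage 1 would accelerate can be
sketched in plain C. This is only a minimal sketch assuming the Q16.16
format (the one libfixmath uses); the names q16_t, q16_mul and q16_div
are made up for illustration and are not taken from any library.

```c
#include <stdint.h>

/* Q16.16: 16 integer bits, 16 fractional bits, stored in an int32_t.
 * The type and function names here are hypothetical. */
typedef int32_t q16_t;
#define Q16_ONE ((q16_t)1 << 16)

static inline q16_t q16_from_int(int32_t i)
{
    return (q16_t)(i * Q16_ONE);
}

/* Multiply: widen to 64 bits, then drop the extra 16 fractional bits,
 * rounding to nearest. Assumes >> on a negative value is an arithmetic
 * shift, which holds on common compilers. No overflow saturation. */
static inline q16_t q16_mul(q16_t a, q16_t b)
{
    int64_t p = (int64_t)a * (int64_t)b;
    return (q16_t)((p + (1 << 15)) >> 16);
}

/* Divide: scale the dividend by 2^16 first so the quotient keeps 16
 * fractional bits. Truncates toward zero; the caller must ensure
 * b != 0. */
static inline q16_t q16_div(q16_t a, q16_t b)
{
    return (q16_t)(((int64_t)a * 65536) / b);
}
```

The 32x32->64 multiply followed by a shift is the part that a
dedicated instruction would fold into one operation.
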
> Stage 2: Add SIMD Instructions for integer math (Signed, Unsigned, and fixed
> Point)
>
> This is cloning the expanded ALU to handle running various math functions in
> parallel... In general this should cover the following Functions:
> * Parallel instructions for basic ALU Functions - Think Addition,
> Subtraction, multiplication, division, etc.
> * Vector Instructions - This is stuff like Dot Product, Cross Product, and
> Linear Interpolation.
> * Vector Interpolation - this is Linear interpolation between two values
> with a values.

I started putting together a proposal for SIMD a few months back, if
you have time to review it
(http://lists.j-core.org/pipermail/j-core/2017-August/000650.html).
One thing not to forget is that we should preserve code density and
make sure that each added instruction is useful and justifies the loss
of encoding space available for future instructions.
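
As a baseline for judging whether a vector instruction earns its
encoding space, here is roughly what the dot product and linear
interpolation from Stage 2 cost in scalar C today. It is a sketch
assuming Q16.16 values; q16_dot3 and q16_lerp are hypothetical names.

```c
#include <stdint.h>

typedef int32_t q16_t;               /* Q16.16 fixed point (assumed) */
#define Q16_ONE ((q16_t)1 << 16)

/* 3-component dot product: accumulate the full 64-bit products and
 * round once at the end, so precision is lost only in one place. */
static inline q16_t q16_dot3(const q16_t a[3], const q16_t b[3])
{
    int64_t acc = 0;
    for (int i = 0; i < 3; i++)
        acc += (int64_t)a[i] * b[i];
    return (q16_t)((acc + (1 << 15)) >> 16);
}

/* Linear interpolation a + t * (b - a), with t in [0, Q16_ONE].
 * Assumes >> on a negative value is an arithmetic shift. */
static inline q16_t q16_lerp(q16_t a, q16_t b, q16_t t)
{
    int64_t d = (int64_t)b - a;
    return (q16_t)(a + ((d * t + (1 << 15)) >> 16));
}
```

Comparing the compiler output for these against a candidate
instruction is one way to measure the density gain.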

> Stage 3: Implement Texture Block...
> This is mainly 4 sub-Tasks...
> Task 1: Implement Texture Lookup for 1D Textures, 2D Texture (U, V), 3D
> Texture (U, V, W). In general this is well defined in the OpenGL
> specification. The biggest change is from the addition of GL_EXT_texture3D
> which is released in 1996.

These are mostly specific load/store instructions with a memory unit
that could fetch memory for a stream in advance (hardware prefetch).
I am not sure at this stage whether there is a real need for dedicated
instructions or whether just longer code to fetch memory is enough.
This requires looking at the code generated by the compiler for this
case.

> Task 2: Implement basic Texture Filtering. This is mainly texture
> Interpolation, and most 3D Engines make use of this in some form or another.
> The filters for this are as follows....
> * Nearest-neighbor interpolation
> * Bilinear filtering
> * Trilinear filtering
> * Optional: anisotropic filtering - This is an extremely expensive
> operation, which will probably need to wait until the cores support ddr3 or
> ddr4.

Most of these operations are very simple to implement on the math side
with a few vector instructions and do not justify a special unit in my
opinion. The pressure is almost entirely on the memory and how to
fetch it efficiently. That's where, for example, the ability to map
memory in tiles (useful when fetching multiple pixels next to each
other for interpolation) or to have compressed memory access would pay
off (I would think it is best to combine compression with tiling so
that we can easily find where a line starts and there isn't too much
jumping around to find the pixel that needs to be fetched).
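
To make the tiling point concrete, here is a small sketch of a bilinear
fetch from a tiled texture. Everything about the layout is an
assumption for illustration: 4x4 texel tiles, one 8-bit channel, a
width that is a multiple of the tile size, and 24.8 fixed-point texel
coordinates. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define TILE 4  /* hypothetical 4x4 texel tiles */

/* Offset of texel (x, y) in a tiled layout: tiles are stored one after
 * another, row-major, and texels are row-major inside each tile.
 * width_tiles is the texture width divided by TILE. */
static inline size_t tiled_offset(unsigned x, unsigned y,
                                  unsigned width_tiles)
{
    unsigned tx = x / TILE, ty = y / TILE;   /* which tile */
    unsigned ix = x % TILE, iy = y % TILE;   /* position inside it */
    return ((size_t)ty * width_tiles + tx) * (TILE * TILE)
           + iy * TILE + ix;
}

/* Bilinear sample of an 8-bit single-channel texture. u and v are
 * texel coordinates in 24.8 fixed point. The caller must keep the
 * 2x2 footprint inside the texture (no wrap or clamp here). */
static inline uint8_t sample_bilinear(const uint8_t *tex,
                                      unsigned width_tiles,
                                      uint32_t u, uint32_t v)
{
    unsigned x0 = u >> 8, y0 = v >> 8;
    unsigned fx = u & 0xFF, fy = v & 0xFF;   /* fractional weights */

    unsigned t00 = tex[tiled_offset(x0,     y0,     width_tiles)];
    unsigned t10 = tex[tiled_offset(x0 + 1, y0,     width_tiles)];
    unsigned t01 = tex[tiled_offset(x0,     y0 + 1, width_tiles)];
    unsigned t11 = tex[tiled_offset(x0 + 1, y0 + 1, width_tiles)];

    unsigned top = t00 * (256 - fx) + t10 * fx;   /* 8.8 result */
    unsigned bot = t01 * (256 - fx) + t11 * fx;
    return (uint8_t)((top * (256 - fy) + bot * fy + (1u << 15)) >> 16);
}
```

The arithmetic is a handful of multiplies and adds; the four loads are
the expensive part, and with tiling they usually land in the same
16-byte tile instead of two separate scanlines.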

> Task 3: Implement Alpha Blending/Compositing engine - This is mostly
> hardware implementation of the math functions outlined in:
> https://www.w3.org/TR/compositing/

I have been thinking about this, and I think this is the first special
unit that could be done. Still, instead of doing it fully in hw, I
would make it a simple J1 with a vector extension, direct access to
the DMA engine, a small SRAM and access to the video output. This
would allow the kernel to share the same code as the hw compositor
firmware and allow for on-the-fly code changes. Imagine that you can
directly unroll and inline all the functions needed to fetch each
pixel, convert the color space, composite the result and send it
back. In the future that J1 driving the video output could also take
care of herding J-Cores oriented toward GPU work (which would likely
have some SMT, their own cache, ...).

I am guessing that you would not get performance/watt as good as a
dedicated hw unit, but over the lifecycle of such hw you would have
gained so much flexibility. It might even be a good idea for other
IO, like ethernet.
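
To give an idea of the inner loop such a compositor firmware would run,
here is a sketch of the "source-over" operator from the W3C compositing
document, written in scalar C. It assumes premultiplied 0xAARRGGBB
pixels; the function names are hypothetical, and a real version would
use whatever vector extension the core ends up with.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounded x / 255 without a divide, for x up to 255 * 255. */
static inline uint32_t div255(uint32_t x)
{
    x += 128;
    return (x + (x >> 8)) >> 8;
}

/* Source-over on premultiplied pixels, applied to all four channels:
 *   out = src + dst * (1 - src_alpha) */
static inline uint32_t over_premul(uint32_t src, uint32_t dst)
{
    uint32_t inv_a = 255 - (src >> 24);
    uint32_t out = 0;

    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xFF;
        uint32_t d = (dst >> shift) & 0xFF;
        uint32_t c = s + div255(d * inv_a);
        if (c > 255)
            c = 255;  /* guards against input that is not premultiplied */
        out |= c << shift;
    }
    return out;
}

/* Composite a span of n source pixels over the destination in place. */
static void composite_span(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = over_premul(src[i], dst[i]);
}
```

Because this is ordinary code, the same function could be built for the
kernel and for the compositor core, which is the sharing I have in
mind.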

> Task 4: Texture data is probably going to use options that are not in the
> Powers of Two.
>
> Stage 4: Implement OpenGL ES 1.1 with this functionality. The big part is
> include the GL_OES_fixed_point extension.

How much software would be able to leverage GL ES 1.1? Systems built
for Vulkan are not that far from a general purpose CPU, just with
fewer constraints on memory synchronisation and a lot of core
parallelism (using SMT and SIMD). I have actually started to wonder
whether a language like Rust would allow creating memory barriers in
software and offloading all the memory access synchronisation
complexity directly into the language. I haven't read any research on
the topic, but if a task could run in a special mode that disables
cache synchronisation, because the software is smart enough to do the
synchronisation manually, maybe we do not need dedicated cores for a
GPU...
-- 
Cedric BAIL