[J-core] Pondering j-core kickstarter ideas.

Mon Oct 23 21:34:44 EDT 2017

On Mon, Oct 23, 2017 at 6:38 PM, Cedric BAIL <cedric.bail at free.fr> wrote:
>
> On Wed, Oct 18, 2017 at 5:45 PM, Ken Phillis Jr <kphillisjr at gmail.com> wrote:
> > On Wed, Oct 18, 2017 at 3:58 PM, Rob Landley <rob at landley.net> wrote:
> >> On 10/16/2017 12:18 PM, Ken Phillis Jr wrote:
> > I honestly believe the current J2 Core lacks hardware Floating Point, but
> > that is not a major problem... I was thinking the fastest solution for this
> > would be to add Instructions for Fixed Point math, and have this task broken
> > down into four major stages/tasks...
>
> What is your reasoning for dropping to fixed point? I am asking
> because most modern 3D APIs are definitely built around floating
> point. I am guessing it is related to your idea of doing an emulator
> more than running modern workloads. Am I right?
>

The reason for fixed point is faster computation: it is a lot easier
to get integer-based math running fast than floating point math. As
far as modern graphics hardware goes, floating point rasterization
must be avoided until the patent covering GL_ARB_texture_float [1]
expires.

[1] https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_half_float_pixel.txt
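
As a rough illustration of why fixed point maps well onto an integer
unit, here is a minimal Q16.16 sketch in C. The names (q16_16, q_mul,
q_div, and so on) are made up for this example; they are not taken
from libfixmath or from any J-core code. Each operation is only an
integer add, multiply, or divide plus a rescale.

```c
#include <assert.h>
#include <stdint.h>

/* Q16.16: 16 integer bits and 16 fractional bits in a 32-bit integer. */
typedef int32_t q16_16;

#define Q_FRAC_BITS 16
#define Q_ONE ((q16_16)1 << Q_FRAC_BITS)   /* the value 1.0 */

/* Convert a small integer (|n| <= 32767) to Q16.16. */
static q16_16 q_from_int(int32_t n) { return (q16_16)(n * Q_ONE); }

/* Addition and subtraction are plain integer operations. */
static q16_16 q_add(q16_16 a, q16_16 b) { return a + b; }
static q16_16 q_sub(q16_16 a, q16_16 b) { return a - b; }

/* Multiply in 64 bits, then rescale; the result truncates toward zero. */
static q16_16 q_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) / Q_ONE);
}

/* Pre-scale the dividend so the quotient keeps its 16 fractional bits. */
static q16_16 q_div(q16_16 a, q16_16 b)
{
    assert(b != 0);
    return (q16_16)(((int64_t)a * Q_ONE) / b);
}
```

For example, q_mul(q_from_int(3), Q_ONE / 2) gives the Q16.16 encoding
of 1.5. This sketch does no overflow or saturation handling.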

>
> > Stage 1: Add hardware Accelerated Fixed Point math - Luckily this has two
> > main wikipedia articles, and some example C/C++ code exists. Either way,
> > this can be used to help coax the Integer unit of the device to speed up 3D
> > math.
> >
> > Wikipedia: Fixed Point Arithmetic:
> > https://en.wikipedia.org/wiki/Fixed-point_arithmetic
> > Wikipedia: Q Number Format: https://en.wikipedia.org/wiki/Q_(number_format)
> > Github: Libfixmath - the fixed point math library:
> > https://github.com/PetteriAimonen/libfixmath
> > Github: LibfixMatrix - Fixed Point matrix math library:
> > https://github.com/PetteriAimonen/libfixmatrix
> >
> >
> > Stage 2: Add SIMD Instructions for integer math (Signed, Unsigned, and fixed
> > Point)
> >
> > This is cloning the expanded ALU to handle running various math functions in
> > parallel... In general this should cover the following Functions:
> > * Parallel instructions for basic ALU Functions - Think Addition,
> > Subtraction, multiplication, division, etc.
> > * Vector Instructions - This is stuff like Dot Product, Cross Product, and
> > Linear Interpolation.
> > * Vector Interpolation - this is Linear interpolation between two values
> > with a values.
>
> I started putting together a proposal for SIMD a few months back, if you
> have time to review it
> (http://lists.j-core.org/pipermail/j-core/2017-August/000650.html).
> One thing not to forget is that we should preserve code density and
> make sure that each added instruction is useful and justifies the loss
> of available space for future instructions.
>
I agree that code density is important, and it is an excellent reason
to have a CPUID function, since software can then check which
instructions a given core implements. Believe it or not, x86
processors have seen their fair share of instruction removal. The
largest removal I currently know of was done by AMD, when they dropped
the 3DNow! instructions [1] from their processors starting in 2011.
These instructions were originally introduced in 1998 with the AMD
K6-2 processor [2], and were ultimately removed with the release of
processors based on the Bulldozer (2011) [3], Bobcat (2011) [4], and
Zen (2017) [5] microarchitectures.

[1] Wikipedia: 3DNow!: https://en.wikipedia.org/wiki/3DNow!
[2] Wikipedia: AMD K6-2: https://en.wikipedia.org/wiki/AMD_K6-2
[3] Wikipedia: Bulldozer Microarchitecture:
https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)
[4] Wikipedia: Bobcat Microarchitecture:
https://en.wikipedia.org/wiki/Bobcat_(microarchitecture)
[5] Wikipedia: Zen Microarchitecture:
https://en.wikipedia.org/wiki/Zen_(microarchitecture)

> > Stage 3: Implement Texture Block...
> > This is mainly 4 sub-Tasks...
> > Task 1: Implement Texture Lookup for 1D Textures, 2D Texture (U, V), 3D
> > Texture (U, V, W). In general this is well defined in the OpenGL
> > specification. The biggest change is from the addition of GL_EXT_texture3D
> > which was released in 1996.
>
> These are mostly specific load/store instructions with a memory unit
> that could optimize by fetching memory for a stream in advance
> (hardware prefetch). I am not sure at this stage whether there is a real
> need for dedicated instructions or whether just longer code to fetch
> memory is enough. This requires looking at the code generated by the
> compiler for this case.
>
> > Task 2: Implement basic Texture Filtering. This is mainly texture
> > Interpolation, and most 3D Engines make use of this in some form or another.
> > The filters for this are as follows....
> > * Nearest-neighbor interpolation
> > * Bilinear filtering
> > * Trilinear filtering
> > * Optional: anisotropic filtering - This is an extremely expensive
> > operation, which will probably need to wait until the cores support ddr3 or
> > ddr4.
>
> Most of these operations are very simple to implement on the math side
> with a few vector instructions and do not justify a special unit in my
> opinion. The pressure is almost entirely on the memory and how to
> fetch it efficiently. That's where, for example, having the ability to
> map memory in tiles (useful when fetching multiple pixels next to each
> other for interpolation) or to have compressed memory access would pay
> off (I would think that it is best to combine that with tiling so that
> we can easily find where a line starts and there isn't too much jumping
> around to find the pixel that needs to be fetched).
>

Being able to run these filters in software would be extremely useful,
since they would show the overall performance of the processor very
well.
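
To give an idea of what such a software filter looks like with integer
math only, here is a minimal bilinear sampling sketch in C. Everything
in it is an assumption made for illustration: an 8-bit grayscale
texture, Q16.16 texel coordinates, clamp-to-edge addressing, and the
names lerp_u8 and sample_bilinear. The arithmetic is three linear
interpolations; the rest is memory fetches.

```c
#include <assert.h>
#include <stdint.h>

/* Blend a and b with an 8-bit weight t (0 selects a, 255 is almost b). */
static uint8_t lerp_u8(uint8_t a, uint8_t b, uint32_t t)
{
    return (uint8_t)(((uint32_t)a * (256u - t) + (uint32_t)b * t) >> 8);
}

/* Sample an 8-bit grayscale texture at (u, v), given in Q16.16 texels. */
static uint8_t sample_bilinear(const uint8_t *tex, int width, int height,
                               uint32_t u, uint32_t v)
{
    int x0 = (int)(u >> 16);               /* integer texel column */
    int y0 = (int)(v >> 16);               /* integer texel row */
    assert(x0 < width && y0 < height);

    int x1 = (x0 + 1 < width) ? x0 + 1 : x0;    /* clamp to edge */
    int y1 = (y0 + 1 < height) ? y0 + 1 : y0;

    uint32_t fx = (u >> 8) & 0xffu;        /* top 8 bits of the fraction */
    uint32_t fy = (v >> 8) & 0xffu;

    /* Interpolate along x on both rows, then along y between the rows. */
    uint8_t top = lerp_u8(tex[y0 * width + x0], tex[y0 * width + x1], fx);
    uint8_t bot = lerp_u8(tex[y1 * width + x0], tex[y1 * width + x1], fx);
    return lerp_u8(top, bot, fy);
}
```

Nearest-neighbor is the same code with fx and fy forced to zero, and
trilinear adds one more lerp_u8 between two mipmap levels.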

>
> > Task 3: Implement Alpha Blending/Compositing engine - This is mostly
> > hardware implementation of the math functions outlined in:
> > https://www.w3.org/TR/compositing/
>
> I have been thinking about this, and I think this is the first special
> unit that could be done. Still, instead of doing it fully in hw, I
> would make it a simple J1 with a vector extension, direct access to
> the DMA engine, a small SRAM and access to the video output. This
> would allow the kernel to share the same code as the hw compositor
> firmware and allow for on-the-fly code changes. Imagine that you can
> directly unroll and inline all the functions needed to fetch each
> pixel, convert the color space, composite the result and send back
> the result. In the future that J1 driving the video output could also
> take care of herding J-Cores oriented toward GPU work (which would
> likely have some SMT, their own cache, ...).
>
> I am guessing that you would not have performance/watt as good as a
> dedicated hw unit, but over the lifecycle of such hw you will have
> gained so much flexibility. It might even be a good idea for other
> IO, like ethernet.
>

This is an interesting approach. A CPUID function can help here, since
it would allow a chip to advertise SIMD instructions without having
any hardware floating point. I figure this kind of configuration is
especially useful for applications like network storage, Internet
routers, etc.
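
To make the idea concrete, here is a small sketch in C of how software
could pick a code path from a feature word. This is purely
hypothetical: J-core defines no CPUID function today, and the bit
assignments and names (JCORE_FEAT_*, has_features, pick_math_path) are
invented for this example.

```c
#include <stdint.h>

/* Hypothetical feature bits; no such J-core register exists today. */
#define JCORE_FEAT_FIXED_POINT (1u << 0)
#define JCORE_FEAT_INT_SIMD    (1u << 1)
#define JCORE_FEAT_FPU         (1u << 2)

enum math_path { PATH_SCALAR_INT, PATH_SIMD_INT, PATH_FPU };

/* Nonzero when every bit in 'mask' is set in 'features'. */
static int has_features(uint32_t features, uint32_t mask)
{
    return (features & mask) == mask;
}

/* Pick the most capable math path the core reports. */
static enum math_path pick_math_path(uint32_t features)
{
    if (has_features(features, JCORE_FEAT_FPU))
        return PATH_FPU;
    if (has_features(features, JCORE_FEAT_INT_SIMD))
        return PATH_SIMD_INT;
    return PATH_SCALAR_INT;
}
```

A router or storage chip built with integer SIMD but no FPU would
report JCORE_FEAT_INT_SIMD without JCORE_FEAT_FPU, and the same binary
would take the PATH_SIMD_INT branch.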

> > Task 4: Texture data is probably going to use dimensions that are not
> > powers of two.
> >
> > Stage 4: Implement OpenGL ES 1.1 with this functionality. The big part is
> > include the GL_OES_fixed_point extension.
>
> How much software would be able to leverage GL ES 1.1? Systems built
> for Vulkan are not that far from a general purpose CPU, just with fewer
> constraints on memory synchronisation and a lot of core parallelism
> (using SMT and SIMD). I have actually started to wonder if a language
> like Rust would allow creating memory barriers in software and
> offloading all the memory access synchronisation complexity
> directly into the language. I haven't read any research on the topic,
> but if a task could run in a special mode that disables cache
> synchronisation, because the software is smart enough to do the
> synchronisation manually, maybe we do not need dedicated cores for a
> GPU...
> --
> Cedric BAIL

The restriction to OpenGL ES 1.1 is mostly an artificial limitation
until the patents covering the newer specifications expire. Believe it
or not, there are still more than a dozen patents covering Vulkan,
OpenGL ES 2, OpenGL ES 3, and OpenGL 4.x.