[J-core] Pondering j-core kickstarter ideas.

Mon Oct 23 23:43:47 EDT 2017

On 10/23/2017 6:38 PM, Cedric BAIL wrote:
> On Wed, Oct 18, 2017 at 5:45 PM, Ken Phillis Jr <kphillisjr at gmail.com> wrote:
>> On Wed, Oct 18, 2017 at 3:58 PM, Rob Landley <rob at landley.net> wrote:
>>> On 10/16/2017 12:18 PM, Ken Phillis Jr wrote:
>> I honestly believe the current J2 Core lacks hardware Floating Point, but
>> that is not a major problem... I was thinking the fastest solution for this
>> would be to add Instructions for Fixed Point math, and have this task broken
>> down into four major stages/tasks...
> What are the reasonning for you to drop to fixed point ? I am asking
> because most of the modern API for 3D are definitively around floating
> point. I am guessing it is related to your idea of doing an emulator
> more than playing modern work load. Am I right ?

I am thinking it probably makes more sense for the main core to have an FPU;
as can be noted, most games and APIs sort of assume use of an FPU.

OTOH, on the backend (eg: triangle rasterization, etc), then FPU is 
basically unnecessary unless you want to support fragment shaders, but 
then you basically need to throw some pretty absurd amounts of 
computational power at the problem (relative to what is required for a 
typical software rasterizer or fixed-function pipeline).

even then, fixed function is still pretty expensive vs a typical 
software renderer (due mostly to GL apps/... tending to use blending 
everywhere even when it is basically unnecessary; emulating GL in 
software sort-of requires a lot of work to infer when all this 
blending is unnecessary).

so, likely (on main core):
     input comes in as floating point (or whatever else);
     possibly normalized either to float32 or fixed16.16;
     vertices go through projection process;
         multiply vertex with modelview_projection;
         divide XYZ by W.
     may also do frustum and backface culling here.
     result is converted to narrower 16-bit fixed-point screen-space vectors (eg: 12.4).
     texture coordinates could be partially-normalized, ex, 4.12 or so.
         an integer part is needed for things like texture wrapping.
     triangles are handed off to rasterizer cores
         likely divided up by screen regions or similar (or duplicated if crossing a boundary).
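
as a rough sketch of the projection and 12.4 conversion steps above (not code from any actual implementation): the type names, the row-major matrix layout, the viewport mapping, and the choice to reject rather than clip out-of-range vertices are all assumptions made for illustration, and Z is left out to keep it short.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical types, for illustration only. */
typedef struct { float x, y, z, w; } vec4f;
typedef struct { int16_t x, y; } screen_12_4;   /* 12.4 fixed point */

/* Round to nearest without needing <math.h>. */
static int32_t round_to_i32(float f)
{
    return (int32_t)(f < 0.0f ? f - 0.5f : f + 0.5f);
}

/* Project one vertex with a row-major 4x4 modelview_projection matrix,
 * divide by W, map to a width x height viewport, and quantize to 12.4.
 * Returns 0 if the vertex can't be represented (behind the eye, or outside
 * the 12.4 range); a real pipeline would clip instead of rejecting. */
static int project_vertex(const float m[16], vec4f v,
                          int width, int height, screen_12_4 *out)
{
    assert(width > 0 && height > 0);

    float cx = m[0]*v.x  + m[1]*v.y  + m[2]*v.z  + m[3]*v.w;
    float cy = m[4]*v.x  + m[5]*v.y  + m[6]*v.z  + m[7]*v.w;
    float cw = m[12]*v.x + m[13]*v.y + m[14]*v.z + m[15]*v.w;
    if (cw <= 0.0f)
        return 0;

    /* Perspective divide, then NDC [-1,1] to pixels (Y pointing down). */
    float sx = (cx / cw * 0.5f + 0.5f) * (float)width;
    float sy = (0.5f - cy / cw * 0.5f) * (float)height;
    if (sx < -2048.0f || sx > 2047.0f || sy < -2048.0f || sy > 2047.0f)
        return 0;

    out->x = (int16_t)round_to_i32(sx * 16.0f);   /* 4 fractional bits */
    out->y = (int16_t)round_to_i32(sy * 16.0f);
    return 1;
}
```

with an identity matrix and a 640x480 viewport, the origin lands at (320, 240), which is stored as (5120, 3840) in 12.4.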

on rasterizer cores (integer only ISA, maybe SIMD):
     walk lists of triangles;
     walk edges and emit lists of spans (into buckets per scanline);
     walk scanlines, and draw lists of spans.
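
the edge-walking step might look roughly like the following integer-only sketch. the struct names, the fixed bucket size, and the use of whole-pixel vertex coordinates (rather than 12.4) are assumptions to keep it short, and it ignores fill rules and sub-pixel accuracy.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SPANS_PER_LINE 16   /* arbitrary bucket size for this sketch */

typedef struct { int x, y; } vert;                       /* integer pixel coords */
typedef struct { int16_t x0, x1; uint16_t tri; } span;   /* covers x0 <= x < x1 */
typedef struct { span s[MAX_SPANS_PER_LINE]; int n; } bucket;

static void swap_vert(vert *a, vert *b) { vert t = *a; *a = *b; *b = t; }

/* X step per scanline along an edge, in 16.16 fixed point. */
static int32_t edge_step(vert a, vert b)
{
    return (b.y == a.y) ? 0 : ((b.x - a.x) * 65536) / (b.y - a.y);
}

/* Walk the edges of one triangle and append one span per covered scanline
 * to that scanline's bucket. */
static void emit_spans(bucket *lines, int height, uint16_t tri,
                       vert a, vert b, vert c)
{
    /* Keep coordinates non-negative and small so 16.16 math can't overflow. */
    assert(a.x >= 0 && b.x >= 0 && c.x >= 0);
    assert(a.x < 32768 && b.x < 32768 && c.x < 32768);

    /* Sort by Y so that a is the top vertex and c the bottom one. */
    if (a.y > b.y) swap_vert(&a, &b);
    if (b.y > c.y) swap_vert(&b, &c);
    if (a.y > b.y) swap_vert(&a, &b);

    int32_t xl = a.x * 65536, dl = edge_step(a, c);   /* long edge a->c  */
    int32_t xs = a.x * 65536, ds = edge_step(a, b);   /* short edge a->b */

    for (int y = a.y; y < c.y; y++) {
        if (y == b.y) {                 /* switch to the lower short edge */
            xs = b.x * 65536;
            ds = edge_step(b, c);
        }
        int x0 = xl / 65536, x1 = xs / 65536;
        if (x0 > x1) { int t = x0; x0 = x1; x1 = t; }

        if (y >= 0 && y < height && x0 < x1 &&
            lines[y].n < MAX_SPANS_PER_LINE) {
            span *sp = &lines[y].s[lines[y].n++];
            sp->x0 = (int16_t)x0;
            sp->x1 = (int16_t)x1;
            sp->tri = tri;
        }
        xl += dl;
        xs += ds;
    }
}
```

the drawing pass would then walk each scanline's bucket in order, which is what keeps the framebuffer accesses in raster order.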

there is a tradeoff with when/how to early Z-cull spans:
     in a software renderer, it may be slightly faster to probe triangle corners and span-extents early.
         if all 3 vertices Z fail, specially check if any part of the triangle will be visible.
         if, say, a bounding-box check Z fails, discard whole triangle.
     per-span, if both the start and end Z-fail, check the span for Z fail early.

however, the drawback is that, because these probes don't happen in raster 
order, the Z-buffer accesses are much more expensive (lots of cache misses). 
then again, when walking the spans-list, there tend to be lots of cache 
misses when accessing each span, so it is a tradeoff (Z buffer misses vs 
misses accessing spans which will not be visible).

>> Stage 1: Add hardware Accelerated Fixed Point math - Luckily this has two
>> main wikipedia articles, and some example C/C++ code exists. Either way,
>> this can be used to help coax the Integer unit of the device to speed up 3D
>> math.
>>
>> Wikipedia: Fixed Point Arithmetic:
>> https://en.wikipedia.org/wiki/Fixed-point_arithmetic
>> Wikipedia: Q Number Format: https://en.wikipedia.org/wiki/Q_(number_format)
>> Github: Libfixmath - the fixed point math library:
>> https://github.com/PetteriAimonen/libfixmath
>> Github: LibfixMatrix - Fixed Point matrix math library:
>> https://github.com/PetteriAimonen/libfixmatrix
>>
>>
>> Stage 2: Add SIMD Instructions for integer math (Signed, Unsigned, and fixed
>> Point)
>>
>> This is cloning the expanded ALU to handle running various math functions in
>> parallel... In general this should cover the following Functions:
>> * Parallel instructions for basic ALU Functions - Think Addition,
>> Subtraction, multiplication, division, etc.
>> * Vector Instructions - This is stuff like Dot Product, Cross Product, and
>> Linear Interpolation.
>> * Vector Interpolation - this is Linear interpolation between two values
>> with a values.
> I have started to put a proposal for SIMD a few months back, if you
> have time to review
> (http://lists.j-core.org/pipermail/j-core/2017-August/000650.html).
> One thing to not forget, is that we should keep the code density and
> make sure that adding an instruction is useful and justifiy the loss
> available space for future instruction.

my attempt at SIMD extensions almost completely overlaps with the normal 
16-bit FPU ops, for the parts of the SIMD ISA which use 16-bit instructions.

I did add a few new instructions, but mostly to make it cheaper to 
reload the bits in FPSCR:
     the existing bit-toggle ops (though, from SH-2A and SH-4A) simply inverted the bits;
     coming from an unknown state, it is most useful to reload these bits directly;
     this was also helpful for more normal FPU intensive code.
     the main op here uses a 4 bit immed:
         PR, SZ, FR, and a bit for whether to zero everything else or leave as-is.
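
to illustrate the difference between toggling and reloading, here is a small C model of the two behaviours. the FPSCR bit positions follow the SH-4 layout; the order of the bits inside the 4-bit immediate and all of the names are assumptions, since no encoding is given above.

```c
#include <assert.h>
#include <stdint.h>

/* FPSCR mode bits, using the SH-4 bit positions. */
#define FPSCR_PR (1u << 19)   /* precision mode             */
#define FPSCR_SZ (1u << 20)   /* FMOV transfer size         */
#define FPSCR_FR (1u << 21)   /* floating-point register bank */
#define FPSCR_MODE_MASK (FPSCR_PR | FPSCR_SZ | FPSCR_FR)

/* Hypothetical immediate layout; the bit order is an assumption. */
#define IMM_PR    0x1u
#define IMM_SZ    0x2u
#define IMM_FR    0x4u
#define IMM_CLEAR 0x8u        /* zero every other FPSCR bit */

/* Toggle-style op: the result depends on the (possibly unknown) old state. */
static uint32_t fpscr_toggle(uint32_t fpscr, uint32_t bit)
{
    return fpscr ^ bit;
}

/* Reload-style op: PR/SZ/FR are set directly from the immediate, and the
 * remaining bits are either kept or zeroed. */
static uint32_t fpscr_reload(uint32_t fpscr, unsigned imm4)
{
    assert(imm4 < 16);

    uint32_t mode = 0;
    if (imm4 & IMM_PR) mode |= FPSCR_PR;
    if (imm4 & IMM_SZ) mode |= FPSCR_SZ;
    if (imm4 & IMM_FR) mode |= FPSCR_FR;

    uint32_t rest = (imm4 & IMM_CLEAR) ? 0 : (fpscr & ~FPSCR_MODE_MASK);
    return rest | mode;
}
```

the point is that fpscr_reload() gives the same mode bits no matter what the old value was, whereas fpscr_toggle() only helps if the caller already knows the current state.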

though, chances are a "realistic" implementation would probably only do 
4x packed-word (too much beyond this is a bit expensive, as is 
supporting multiple vector formats). also some parts of FPGAs seem 
specially built for working with 16-bit data values.

I suspect a full SIMD (128-bit 4x float) may be far too expensive to be 
realistic (at least based on my prior attempts; but it is also possible 
my Verilog skills are basically crap...); but I was mostly trying to 
design a core I could hopefully fit on a 25k LUT FPGA (then getting 
discouraged by how much things cost, ...).

though, I suspect as-is, something with graphics cores, ... would 
probably require a somewhat bigger FPGA than this.

>> Stage 3: Implement Texture Block...
>> This is mainly 4 sub-Tasks...
>> Task 1: Implement Texture Lookup for 1D Textures, 2D Texture (U, V), 3D
>> Texture (U, V, W). In general this is well defined in the OpenGL
>> specification. The biggest change is from the addition of GL_EXT_texture3D
>> which is released in 1996.
> This are mostly specific load/store instruction with a memory unit
> that could optimize for fetching memory for stream in advance
> (hardware prefetch). I am not sure at this stage if there is a real
> need to have dedicated instruction or if just longer code to fetch
> memory is enough. This require looking at the code generated by
> compiler for this case.

agreed, these probably don't belong in the ISA.

for GPU texture tasks, a few operators could be useful:
     a dedicated "SHLR4" or "SHAR4" operator;
     possibly some sort of MMIO peripheral to aid with DXTn block decoding.
         say, write block to an address and read back the decoded pixels.
         it is worth noting that rasterization tends to be somewhat memory bound;
         so, the cost of twiddly/DXTn decoding could be less than the memory bandwidth cost.
         luckily, the S3TC patents have apparently now (finally) died.
         ( though, can probably be done affordably enough without it )
     ...

accessing a texture is mostly shifts, masking, and addition (and maybe 
hashing the index into the texture block cache or similar).
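
a minimal sketch of what "shifts, masking, and addition" means for a nearest-texel fetch, assuming 4.12 texture coordinates, power-of-two texture sizes, 16-bit texels and repeat wrapping. the texture struct and function name are made up for the example, and the block-cache hashing is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical texture descriptor: power-of-two size, 16-bit texels. */
typedef struct {
    const uint16_t *texels;
    unsigned log2_w, log2_h;    /* width = 1 << log2_w, height = 1 << log2_h */
} texture;

/* u and v are 4.12 fixed point. Masking off the integer part gives repeat
 * wrapping; scaling the fraction by the texture size gives the texel index. */
static uint16_t fetch_nearest(const texture *t, uint16_t u, uint16_t v)
{
    assert(t->log2_w <= 12 && t->log2_h <= 12);

    unsigned x = ((u & 0x0FFFu) << t->log2_w) >> 12;
    unsigned y = ((v & 0x0FFFu) << t->log2_h) >> 12;
    return t->texels[(y << t->log2_w) + x];
}
```

for a 4x4 texture, (u, v) = (0.5, 0.25) and (1.5, 2.25) both resolve to texel (2, 1), since the integer part only matters for wrapping.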

in a recent software renderer (intended mostly to run on my Vista-era 
laptop), I was also using mostly 16-bit pixels for the texture cache:
     0aaa-gggg-rrrr-bbbb
     1ggg-ggrr-rrrb-bbbb

this at least sort-of reduces memory bandwidth for the textures, and is 
not "too" horrible.
the funky component ordering was mostly to make it cheaper to 
(semi-accurately) compare pixels by luma.
     if(pxa&pxb&0x8000)
         if(pxa>pxb) ...    //typically also holds true for actual pixel luma
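
for reference, packing into the two layouts and doing the compare could look something like this. the function names and the 8-bit-per-channel inputs are assumptions; only the bit layouts and the compare itself come from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* 1ggg-ggrr-rrrb-bbbb: opaque, 5 bits per channel, green in the top bits. */
static uint16_t pack_opaque(unsigned r, unsigned g, unsigned b)
{
    assert(r < 256 && g < 256 && b < 256);
    return (uint16_t)(0x8000u | ((g >> 3) << 10) | ((r >> 3) << 5) | (b >> 3));
}

/* 0aaa-gggg-rrrr-bbbb: 3 bits of alpha, 4 bits per color channel. */
static uint16_t pack_alpha(unsigned r, unsigned g, unsigned b, unsigned a)
{
    assert(r < 256 && g < 256 && b < 256 && a < 256);
    return (uint16_t)(((a >> 5) << 12) | ((g >> 4) << 8) |
                      ((r >> 4) << 4) | (b >> 4));
}

/* Approximate "pxa is brighter than pxb". Only meaningful when both pixels
 * use the opaque layout, hence the check on bit 15 of both. */
static int brighter_opaque(uint16_t pxa, uint16_t pxb)
{
    return (pxa & pxb & 0x8000u) && pxa > pxb;
}
```

since green contributes the most to luma, then red, then blue, putting them in that order from the high bits down means a plain integer compare mostly agrees with a luma compare.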

>> Task 2: Implement basic Texture Filtering. This is mainly texture
>> Interpolation, and most 3D Engines make use of this in some form or another.
>> The filters for this are as follows....
>> * Nearest-neighbor interpolation
>> * Bilinear filtering
>> * Trilinear filtering
>> * Optional: anisotropic filtering - This is an extremely expensive
>> operation, which will probably need to wait until the cores support ddr3 or
>> ddr4.
> Most of this operation are very simple to implement on the math side
> with a few vector instruction and do not justify a special unit in my
> opinion. The pressure is almost entirely on the memory and how to
> efficiently fetch it. That's where for example having the ability to
> map memory in tile (Useful when fetching multiple pixels next to each
> other for interpolation) or have compressed memory access would pay
> off (I would think that it is best to combine that with a tile so that
> we can easily find where a line start and there isn't to much to jump
> around to find the pixel that needs to be fetched).

though, bilinear and trilinear are still computationally expensive.

for software rasterizers, there is a cheaper alternative, say:
     minification always behaves as NEAREST_MIPMAP_NEAREST;
     magnification only does LINEAR if beyond a certain amount (say, 4x).

the loss in perceived quality is modest, but the advantage is that the 
renderer is (at least slightly) less likely to have molasses-like 
framerates.
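
a sketch of that selection rule, assuming the rasterizer already has a texels-per-pixel ratio in 16.16 fixed point. the enum and function names are invented for the example, and the mip level picker takes floor(log2) rather than properly rounding to the nearest level, which is a simplification.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    FILTER_NEAREST_MIPMAP_NEAREST,
    FILTER_NEAREST,
    FILTER_LINEAR
} filter_mode;

#define ONE_16_16 0x10000u

/* texels_per_pixel is 16.16: above 1.0 is minification, below 1.0 is
 * magnification. LINEAR only kicks in past 4x magnification. */
static filter_mode choose_filter(uint32_t texels_per_pixel)
{
    if (texels_per_pixel > ONE_16_16)
        return FILTER_NEAREST_MIPMAP_NEAREST;
    if (texels_per_pixel < ONE_16_16 / 4)
        return FILTER_LINEAR;
    return FILTER_NEAREST;
}

/* Pick a mip level as floor(log2(texels_per_pixel)), clamped to the
 * levels that exist. */
static int choose_mip_level(uint32_t texels_per_pixel, int num_levels)
{
    assert(num_levels > 0);

    int level = 0;
    while (texels_per_pixel >= 2 * ONE_16_16 && level < num_levels - 1) {
        texels_per_pixel >>= 1;
        level++;
    }
    return level;
}
```

the 4x threshold is just the example figure from above; in practice it would be tuned against how blocky the magnified textures look.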

( though, given it isn't exactly dead-simple even on a decently powerful 
desktop CPU; performance would be a pretty serious concern. )

>> Task 3: Implement Alpha Blending/Compositing engine - This is mostly
>> hardware implementation of the math functions outlined in:
>> https://www.w3.org/TR/compositing/
> I have been thinking about this, and I think this is the first special
> unit that could be done. Still instead of doing it fully in hw, I
> would make it a simple J1 with a vector extention, a direct access to
> the DMA engine, a small SRAM and access to the video output. This
> would allow the kernel to share the same code as the hw compositor
> firmware and allow for on the fly code change. Imagine that you can
> directly unroll and inline all the functions needed to fetch each
> pixels, convert the color space, composite the result and send back
> the result. In the future that J1 driving the video output could also
> take care of a herding J-Core oriented toward GPU (which would have
> likely some SMT, their own cache, ...).
>
> I am guessing that you would not have a performance/watt as good as a
> dedicated hw unit, but over the lifecycle of such hw you will have
> guained so much flexibility. It might even be a good idea for other
> IO, like ethernet.

alpha blending is sort of a necessary evil FWIW.

luckily, it can be eliminated from most of the geometry:
     if the vertex color is opaque, the texture-block opaque, ... just do 'replace'.
     if vertex color is opaque, and texture is DXT1, draw via alpha testing.
     ...
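
roughly, the per-primitive decision could be sketched like this. the enums and names are made up, and it assumes the app is using ordinary source-alpha blending; with other blend functions an opaque source does not necessarily reduce to a plain replace.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification of what alpha a texture (block) can hold. */
typedef enum { TEX_OPAQUE, TEX_DXT1_PUNCHTHROUGH, TEX_TRANSLUCENT } tex_alpha;

typedef enum { DRAW_REPLACE, DRAW_ALPHA_TEST, DRAW_BLEND } draw_mode;

/* Pick the cheapest span-drawing path that still gives the same result as
 * full blending, assuming source-alpha style blending was requested. */
static draw_mode pick_draw_mode(uint8_t vertex_alpha, tex_alpha tex)
{
    if (vertex_alpha != 255)
        return DRAW_BLEND;              /* translucent vertex color */

    switch (tex) {
    case TEX_OPAQUE:            return DRAW_REPLACE;
    case TEX_DXT1_PUNCHTHROUGH: return DRAW_ALPHA_TEST;
    case TEX_TRANSLUCENT:       return DRAW_BLEND;
    }
    assert(!"unknown texture alpha class");
    return DRAW_BLEND;
}
```

only the last case pays for a read-modify-write of the framebuffer on every pixel.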

>> Task 4: Texture data is probably going to use options that are not in the
>> Powers of Two.
>>
>> Stage 4: Implement OpenGL ES 1.1 with this functionality. The big part is
>> include the GL_OES_fixed_point extension.
> How much software would be able to leverage GL ES 1.1 ? System build
> for Vulkan are not that far from a general purpose CPU, just with less
> constraint on memory synchronisation and a lot of core parallelism
> (Using SMT and SIMD). I have actually started to wonder if language
> like rust would not allow to create memory barrier in software and
> allow to offset all the memory access synchronisation complexity
> directly into the language. I haven't read any research on the topic,
> but if a task could run in a special mode that disable cache
> synchronisation, because the software is smart enough to do the
> synchronisation manually, maybe we do not need dedicated core for a
> GPU...

my leaning personally is mostly for OpenGL 1.x (with a few scattered 2.x 
features, and some omissions).

partly it is because GL1.x can be used for "real work" and is more 
widely supported than GLES, and most things which expect newer would 
probably be too expensive to be usable anyways (even some GL 1.x era 
software, like Doom 3, would be unrealistically expensive to try to 
support).

even running Quake 3 Arena would probably be a pretty solid achievement.

though, probably VertexArrays or VBOs would be the main interface, with 
glBegin/glEnd/... faked via wrappers.
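
as an illustration of faking immediate mode on top of an array interface, something like the following would do. every name here (the im_* functions, the backend_draw_arrays stand-in, the buffer size) is hypothetical and not an actual GL entry point; the backend just counts what it is handed.

```c
#include <assert.h>

#define IM_MAX_VERTS 256   /* arbitrary buffer size for this sketch */

typedef struct { float x, y, z, r, g, b, a; } im_vertex;

static im_vertex im_buf[IM_MAX_VERTS];
static int       im_count;
static int       im_prim;
static im_vertex im_cur = { 0, 0, 0, 1, 1, 1, 1 };   /* current color: white */

/* Stand-in for the real array-based draw path; it only records totals. */
static int drawn_batches, drawn_verts;
static void backend_draw_arrays(int prim, const im_vertex *v, int n)
{
    (void)prim; (void)v;
    drawn_batches++;
    drawn_verts += n;
}

static void im_begin(int prim) { im_prim = prim; im_count = 0; }

/* Color is latched as state and copied into each following vertex. */
static void im_color4f(float r, float g, float b, float a)
{
    im_cur.r = r; im_cur.g = g; im_cur.b = b; im_cur.a = a;
}

static void im_vertex3f(float x, float y, float z)
{
    assert(im_count < IM_MAX_VERTS);
    im_cur.x = x; im_cur.y = y; im_cur.z = z;
    im_buf[im_count++] = im_cur;
}

/* The whole batch goes to the array path in a single call. */
static void im_end(void)
{
    if (im_count > 0)
        backend_draw_arrays(im_prim, im_buf, im_count);
    im_count = 0;
}
```

this way the renderer proper only ever sees arrays of vertices, and the immediate-mode calls are just a thin front end.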