[J-core] [RFC] SIMD extension for J-Core

Thu Oct 26 20:47:02 EDT 2017

On 10/26/2017 04:12 PM, Ken Phillis Jr wrote:
> On  Wed, 25 Oct 2017 23:08:30 -0500,   BGB  wrote:
>> On 10/25/2017 6:59 PM, Ken Phillis Jr wrote:
>>> I figure I will include my Idea for an easy to use SIMD Extension for
>>> the J-Core. In general this extension will be used for Three Types of
>>> number formats...
>>>
>>> Format 1: Signed, and Unsigned Integers
>>> Format 2: Floating Point Values
>>> Format 3: Fixed Point Values - This is low priority, and most likely
>>> will be dropped unless sample applications requiring this
>>> functionality can be found. Also, in many cases, this can be simulated
>>> using existing floating point and integer math routines.
>>
>> errm, usually the fixed-point *is* the packed integer operations.
>>
>> there are edge-cases, for example a fixed-point multiply is usually done
>> as (X*Y>>SH); sometimes the shift is omitted or delayed depending on how
>> it is used; or will produce a wider result and only keep the high order
>> bits (typically if working with unit-scale values).

A while back Jeff made a really simple "fixed.h" (I can't post it here,
but Jeff might).

It's 99 lines, half of which is static inline functions to convert from
float to fixed and back. (The reason it's so long is we have 32 and 64
bit versions of those conversion functions, because 64 bit ints on a 32
bit processor are more than twice as expensive.) Then there's a
saturate() function, and fixed point sqrti() and arctan2() functions.

The fixed point representation is just ints. (There's sfixed_t and
lsfixed_t typedefs but it's just int32_t and uint64_t.) The programmer
tracks how shifted they are, is responsible for making sure the
precision matches up when adding/subtracting, and when you multiply or
divide you're responsible for doing the shifting yourself to put the
precision back and make sure it doesn't overflow. (Shift before
dividing, shift after multiplying.) Calling saturate() caps the range so
you can detect overflows, but mostly you're expected to know your input
data.
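
To make the "shift before dividing, shift after multiplying" rule
concrete, here is a minimal sketch. It is NOT the real fixed.h (which
isn't posted here): the sfixed_t typedef matches the description above,
but every function name, every signature, and the 16-bit shift used in
the comments are assumptions made up for illustration.

```c
#include <stdint.h>

/* Hypothetical sketch, not the actual fixed.h from this thread. */
typedef int32_t sfixed_t;   /* plain int; the caller tracks the shift */

/* Convert float to fixed point with 'shift' fractional bits. */
static inline sfixed_t float_to_fixed(float f, int shift)
{
    return (sfixed_t)(f * (float)((int32_t)1 << shift));
}

/* Convert fixed point with 'shift' fractional bits back to float. */
static inline float fixed_to_float(sfixed_t x, int shift)
{
    return (float)x / (float)((int32_t)1 << shift);
}

/* Cap a 64-bit intermediate to the 32-bit range (assumed signature). */
static inline sfixed_t saturate(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (sfixed_t)v;
}

/* Shift AFTER multiplying: the raw product carries 2*shift fractional
 * bits, so widen to 64 bits and shift back down. Right-shifting a
 * negative value is implementation-defined in C; this assumes the
 * usual arithmetic shift. */
static inline sfixed_t fixed_mul(sfixed_t a, sfixed_t b, int shift)
{
    return saturate(((int64_t)a * b) >> shift);
}

/* Shift BEFORE dividing: scale the dividend up first so the quotient
 * keeps 'shift' fractional bits. Multiplying by a power of two avoids
 * left-shifting a negative number. The caller must ensure b != 0. */
static inline sfixed_t fixed_div(sfixed_t a, sfixed_t b, int shift)
{
    return saturate(((int64_t)a * ((int64_t)1 << shift)) / b);
}
```

With a 16-bit shift, 1.5 is 98304 and 2.25 is 147456, so
fixed_mul(98304, 147456, 16) gives 221184, which is 3.375, and
fixed_div(221184, 98304, 16) gives 147456 back. Adding or subtracting
needs no helper at all, as long as both operands use the same shift.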

We implemented a GPS process pair (realtime half and nonrealtime half,
talking through an mmap() ring buffer) that processes the input signal
with this. Tracking 5 satellites and getting fixes takes up less than
half of one processor, and we can probably get it down from there. Yeah
the FPGA correlators are doing a lot of the heavy lifting, but still. We
did the initial implementation pinning it with CPU affinity and doing
chrt (each correlator output's replaced a millisecond later if we didn't
fetch it yet, and the phase locked loop has something like 4 ms of leeway
before Doppler and codephase move out of tracking range, so dropouts are
bad mm'kay. That's true for each of 5 satellites, and 1/5 of a
millisecond is 200 microseconds, on a processor running at less than 100
MHz, so yeah we care about the realtime part. Now that it's working we've
started moving some of the plumbing into an interrupt routine...)
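
The timing budget above works out as follows. This is only a
back-of-the-envelope sketch: the helper names are made up, and 100 MHz
is used as a ceiling because the actual clock is only stated as "less
than 100 MHz".

```c
#include <stdint.h>

/* Hypothetical helpers for the arithmetic only, not project code. */

/* Microseconds available per channel when 'channels' share one period. */
static inline uint32_t budget_us(uint32_t period_us, uint32_t channels)
{
    return period_us / channels;
}

/* CPU cycles in a window of 'us' microseconds at 'clock_mhz' MHz
 * (1 MHz is one cycle per microsecond). */
static inline uint32_t budget_cycles(uint32_t us, uint32_t clock_mhz)
{
    return us * clock_mhz;
}
```

budget_us(1000, 5) is 200, and budget_cycles(200, 100) is 20000, so
each satellite gets fewer than 20,000 cycles per millisecond update.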

tl;dr we didn't add fixed point support to the processor because we
didn't _need_ to.

Rob