[J-core] Threading idea.

Thu Oct 17 20:15:51 UTC 2019

Jeff and I have discussed adding thread support with register banks to j-core,
where "which thread I'm running" is just a couple unused bits out of the status
register, and each time you change those bits the next instruction that gets run
is from a different set of registers including a different PC. :)

Most of that's easy to do because registers 0-15 are in SRAM, and it's
_actually_ more like registers 0-20 because register "16" is GBR, 17 is VBR, 18
is PR, and then there's a TEMP1 and TEMP2 used internally by some of the
instruction implementations... The easy thing to do is "current register set
starts at 32*thread in the SRAM bank". We'd still need to flush and reload the
registers that AREN'T stored in SRAM but are instead special circuitry (PC,
MACH, MACL, SR, and we have a special cached copy of R0 because reasons), but
that's the TODO part of the design. :)
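
To make the "32*thread" idea concrete, here's a toy software model of a banked
register file. It's a sketch, not the j-core VHDL: the class and constant names
are made up, and NUM_THREADS = 4 assumes "a couple unused bits" means two. It
only covers the SRAM-backed registers, not the PC/MACH/MACL/SR flush and reload.

```python
# Toy model of a banked register file. Names and sizes are illustrative
# assumptions, not taken from the j-core source.
BANK_SIZE = 32      # SRAM slots reserved per thread (a power of 2)
BANK_SHIFT = 5      # log2(BANK_SIZE)
NUM_THREADS = 4     # assumption: two thread-select bits in the status register


class BankedRegisters:
    def __init__(self):
        self.sram = [0] * (BANK_SIZE * NUM_THREADS)
        self.thread = 0  # stands in for the thread bits of SR

    def index(self, reg):
        # Thread bits form the high address bits and the register number the
        # low bits, which is the same as BANK_SIZE * thread + reg.
        assert 0 <= reg < BANK_SIZE
        return (self.thread << BANK_SHIFT) | reg

    def read(self, reg):
        return self.sram[self.index(reg)]

    def write(self, reg, value):
        self.sram[self.index(reg)] = value & 0xFFFFFFFF


regs = BankedRegisters()
regs.write(3, 0x1234)   # thread 0 writes its R3
regs.thread = 1         # change the thread bits: now using bank 1
regs.write(3, 0xBEEF)   # thread 1's R3 lands in a different SRAM slot
regs.thread = 0         # back to thread 0, whose R3 is untouched
```

Switching threads here is nothing but changing `regs.thread`; no register
contents get copied anywhere.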

Actually, we'll probably reorder the registers above 15 in a future cleanup
anyway, since there's a lot of "MACH, MACL, PR, SR, GBR, VBR" ordering inherent
in the instruction set, so if 16 and 17 are temp1 and temp2, PR could be 18,
skip 19 or find another use for it, then GBR is 20, and VBR is 21 and suddenly
the offsets come out of the bits of the instruction word. :)

And we've got a DMA engine idea that might use another 6 registers, and *32
gives us more than 6 spares anyway. (You want it to be a power of 2 because the
circuitry is SO much smaller then. We might be able to pull off 16+8=24 if we
need to, but we're not short on SRAM space.)
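
A quick sketch of why the power of 2 matters, written as plain arithmetic
rather than circuitry. The function names are hypothetical; the register counts
come from the numbers above (registers 0-20 in use, 6 more for DMA).

```python
def index_stride32(thread, reg):
    # Power-of-2 stride: no arithmetic at all, the thread bits are simply
    # wired in above the 5 register-number bits.
    return (thread << 5) | reg


def index_stride24(thread, reg):
    # 24 = 16 + 8, so the multiply reduces to two shifts, but the results
    # still have to go through real adders in the address path.
    return (thread << 4) + (thread << 3) + reg


REGS_IN_USE = 21    # R0-R15, GBR, VBR, PR, TEMP1, TEMP2 (registers 0-20)
DMA_REGS = 6        # the "might use another 6" estimate
spare_slots_with_32 = 32 - REGS_IN_USE   # slots left before adding DMA regs
```

With a stride of 32 there are 11 unused slots per bank, so the 6 DMA registers
fit with room left over.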

The motivation for all this is we've designed a page faulter in the ice40
version that reads spi flash data into blocks of sram and feeds it into the
jcore instruction bus. It's _basically_ a small physically indexed cache that
populates itself from spi flash instead of DRAM. (The fiddly bit is the
bitstream has to be at the start of the SPI flash, because the FPGA's attached
circuitry initializes the FPGA from a bitstream at the start of the flash; we
can't even stick a jump before that, it _has_ to start at zero. So we have to
either move the mapping down with a (slow) adder in the address line resolution,
or convince the processor to start at an address other than zero: second option
seems easier. :)

Anyway, this means the flash is mapped through ~4 cache lines (probably 512
bytes each?), and each time you fault one in, the processor stalls for over a
microsecond. Which is bad if you're doing something realtime like GPS tracking,
and in our _first_ stab at this the sucker ping-ponged because text and rodata
aliased (meaning we need separate I and D caches...)
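
The ping-pong is easy to reproduce in a toy model. This sketch ASSUMES a
direct-mapped cache (the description above only says "physically indexed"),
uses the 4 lines of 512 bytes guessed at above, and picks two hypothetical
addresses that happen to land in the same slot.

```python
LINE_SIZE = 512   # bytes per line (the "probably 512" guess)
NUM_LINES = 4     # the "~4 cache lines" figure


class DirectMappedCache:
    """Assumed direct-mapped; the real design may differ."""

    def __init__(self):
        self.tags = [None] * NUM_LINES
        self.misses = 0

    def access(self, addr):
        line = addr // LINE_SIZE
        slot = line % NUM_LINES
        if self.tags[slot] != line:
            # Miss: this is where the real hardware would stall on SPI flash.
            self.tags[slot] = line
            self.misses += 1


TEXT = 0x0000     # hypothetical address of a loop in .text
RODATA = 0x0800   # hypothetical table in .rodata, 4 lines later: same slot


def run(icache, dcache, iterations):
    # Each loop iteration fetches one instruction and reads one constant.
    for _ in range(iterations):
        icache.access(TEXT)
        dcache.access(RODATA)


shared = DirectMappedCache()
run(shared, shared, 100)            # one cache serving both: every access misses

icache, dcache = DirectMappedCache(), DirectMappedCache()
run(icache, dcache, 100)            # split I and D: only the two cold misses
split_misses = icache.misses + dcache.misses
```

In this model the shared cache takes 200 misses over 100 iterations, while the
split caches take 2.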

(Someday Jeff may post this to github, but not until development on it is long
dead and it's been polished to the point you can't see how we thought of it.)

BUT we also have some sram mapped (the flash addresses aren't writeable), and if
you stick code in the sram it can run realtime without ever having those latency
spikes. And when the other code DOES have a latency spike, all you have to do is
switch to one of the other threads (each with its own register bank) and it can
just continue on until the realtime code lets go of the processor. (It _also_
means things like exception handling become much easier to implement: just have
a "system thread" with its own stack and everything, and your exceptions switch
to it.)

Oh, we've also got a tiny DMA engine (basically free from a circuitry
perspective) proposed that's just "the processor runs a copy instruction
periodically". See the system_op() stuff at line 171 of decode_core.vhd: we'd
basically add a "DMA_I when (whatever dma ready means)" stanza to the staircase
already there, and make sure it never triggers more often than every other
instruction (so it can't starve code execution).
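
Here's the "never more often than every other instruction" rule as a small
software sketch. It is not the VHDL staircase: `issue_stream` and the
`dma_ready` callback are hypothetical stand-ins, and only the DMA_I name comes
from the description above.

```python
def issue_stream(fetched, dma_ready):
    """Yield the ops the decoder would issue, one per slot.

    fetched:   iterable of instructions coming from the instruction bus
    dma_ready: callable(slot) -> bool, stands in for "whatever dma ready means"
    """
    fetched = iter(fetched)
    last_was_dma = False
    slot = 0
    while True:
        if dma_ready(slot) and not last_was_dma:
            # Inject the internally generated copy instruction.
            op = "DMA_I"
            last_was_dma = True
        else:
            # Normal case, or DMA held off because the previous slot was DMA.
            try:
                op = next(fetched)
            except StopIteration:
                return
            last_was_dma = False
        yield op
        slot += 1


# Worst case: the device is ready on every single slot.
trace = list(issue_stream(["a", "b", "c"], lambda slot: True))
```

Even with the device permanently ready, `trace` alternates between DMA_I and
real instructions, so the code still gets at least every other slot.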

If you look at decode_table_simple.vhd line 3084 that's currently the first
"system plane" instruction (the top bit is 1 instead of 0). Notice how the
instruction patterns it's matching are _17_ bits, when the instructions it's
fetching from the instruction bus are 16 bits. Translation: we have a whole
second set of instructions the processor can generate internally to do things
like reset itself (line 3268, which is what we have to modify to start someplace
other than 0). So adding DMA means adding another "system plane" instruction to
do it, hooking it up in decode_core, and figuring out what control signals it
needs. (You need the memory address, length, which I/O port, whether it's read
or write, what IOREADY line/condition to monitor, and what DONE exception to
raise when you're done. That's maybe 6 registers, except direction and DONE
definitely fit in the same register, probably with space left over for an
IOREADY mask of some kind...)
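
A sketch of the 17-bit trick: the decoder matches on one extra bit above the
16-bit instruction word, and only the core itself can set that bit. The helper
names and the DMA code value below are made up for illustration; only "top bit
is 1 for the system plane" comes from the description above.

```python
SYSTEM_PLANE = 1 << 16   # bit 16: set only for internally generated ops


def fetched_op(instr16):
    # What arrives from the instruction bus: 16 bits, plane bit is 0.
    return instr16 & 0xFFFF


def make_system_op(code16):
    # What the core injects on its own (reset today, maybe DMA later).
    return SYSTEM_PLANE | (code16 & 0xFFFF)


def is_system_plane(op17):
    return bool(op17 & SYSTEM_PLANE)


NOP = 0x0009                  # the SH NOP encoding
HYPOTHETICAL_DMA = 0x0001     # made-up code, not a real j-core encoding

user = fetched_op(NOP)
internal = make_system_op(HYPOTHETICAL_DMA)
```

Because the instruction bus is only 16 bits wide, nothing a program fetches can
ever match a system plane pattern, so the new DMA instruction can't collide
with the existing instruction set.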

Anyway, we need something like that for Turtle (which has USB, Audio, ethernet,
sdcard, and maybe GPS devices that really want to be DMA-driven). The existing
DMA engine we have in our source tree is WAY bigger and more complicated than
what we actually want to use for turtle, and it'd never fit in ice40. (The use
we were putting it to before had 64 DMA channels instantiated which made it
bigger than the CPU, and it turns out to be hard enough to pare back down that
it might be easier to start over. But reusing the CPU circuitry for this is
trivial, the busses and everything are already there.)

I dunno if we can get it clever enough to do "read into TEMP3 register when the
memory address and data bus is idle anyway this cycle, write from TEMP3 whenever
whatever output bus we need to use for that is unused this cycle". That's more a
hack of the _execution_ stage than the instruction decoder stage, and we're not
even trying that in time for turtle shipping. But long-term, there's lots of
optimizations we can do, and you can always upgrade the turtle bitstream after
the fact. :)

Rob