[J-core] Threading idea.

Fri Oct 18 23:47:02 UTC 2019

On 10/17/19 4:39 PM, Joh-Tob Schäg wrote:
> On Thu, 17 Oct 2019 at 22:14, Rob Landley <rob at landley.net> wrote:
>>
>> Jeff and I have discussed adding thread support with register banks to j-core,
>> where "which thread I'm running" is just a couple unused bits out of the status
>> register, and each time you change those bits the next instruction that gets run
>> is from a different set of registers including a different PC. :)
> 
> This implementation seems to differ from my understanding of the SH3's
> register banks.

Which J32 implements. But this is much smaller and simpler and could fit in the
ice40.

> They had a single flag which switched out the lowest 8
> general purpose registers for another set of 8 registers.

Yes, it was a special purpose thing to allow the page fault handler to run with
minimal overhead. This is a more generic solution that can do more.

> The
> non-active 8 general purpose registers were still available via
> register-to-register copy instructions. Although there might be some
> interaction with the cpu-mode bit.

*shrug* Feel free to ignore the new thing, but don't think we aren't _aware_ of
what SuperH did. Several of the Hitachi engineers who implemented SuperH worked
for us on j-core for years.

> Is that on purpose?

Yes, the different new design is on purpose. J32 _already_ implements the
non-threading register bank.

> I feel like swapping out 32 registers is a big latency spike,

It's not swapping. It's using the thread number as the high bits of the register
address, so that thread 0's R2 is @2, thread 1's R2 is @34, thread 2's R2 is
@66...
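
A rough sketch of that addressing in C, since nothing is implemented yet. The
32-slot stride is taken from the @2 / @34 / @66 numbers above; the 2-bit thread
field and the name regfile_index are assumptions made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes, chosen to match the @2 / @34 / @66 example. */
#define REGS_PER_THREAD 32u                 /* register slots per thread */
#define THREAD_BITS      2u                 /* hypothetical: "a couple" SR bits */
#define NUM_THREADS     (1u << THREAD_BITS)

/* Thread number forms the high bits, register number the low 5 bits. */
static inline uint32_t regfile_index(uint32_t thread, uint32_t reg)
{
    assert(thread < NUM_THREADS);
    assert(reg < REGS_PER_THREAD);
    return (thread << 5) | reg;             /* same as thread * 32 + reg */
}
```

Nothing gets copied on a thread switch in this model: changing the thread
number just changes which slice of the same SRAM the register number indexes.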

The "we'd need to flush these" remark only applies to registers that _don't_
live in SRAM.

> which is
> why the register banks were introduced in the first place (fast
> interrupt processing without touching the stack).
> Does it cause much latency in the current implementation?

A) We haven't done it yet,

B) A thread switch, like a jump instruction, must take at least 2 clock cycles
due to reloading PC and having to run the new data through the Instruction Fetch
stage before it gets back to Instruction Decode.

(This is more or less why branch delay slots exist: You've already LOADED the
next instruction, you might as well execute it rather than twiddle your thumbs
for a clock waiting for data from the new location to get to you through the
pipeline.)

That said, the hardware only has to swap SR and PC. MACH and MACL could maybe be
flushed by the thread you wake up (or the relevant exception handler etc. can
just avoid using them)...

As I said, we haven't implemented it yet. The driving idea was running realtime
code on ice40 (out of SRAM) while _also_ being able to run code in the memory
mapped SPI flash (with a 1 microsecond latency spike each time it has to fetch a
new cache line).
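
To put that spike in clock cycles, here is a small arithmetic sketch. The
function name stall_cycles and any clock frequency plugged into it are
assumptions for illustration; only the 1 microsecond figure comes from the text
above:

```c
#include <assert.h>
#include <stdint.h>

/* Clock cycles covered by a stall of latency_ns at clock_hz, rounded up. */
static inline uint64_t stall_cycles(uint64_t latency_ns, uint64_t clock_hz)
{
    assert(clock_hz > 0);
    return (latency_ns * clock_hz + 999999999ull) / 1000000000ull;
}
```

At a hypothetical 12 MHz clock, stall_cycles(1000, 12000000) comes to 12
cycles per cache line fetch, which is time a second thread running out of SRAM
could use instead of the whole core waiting on the flash.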

> Is it going to cause problems in ASIC?

Not really. SRAM sizes are already powers of two (hard to make them _not_ be)
and moving from 32 to 64 or 128 bytes of SRAM is not a big lift. (And in ice40
the _smallest_ SRAM is 512 bytes so we're basically proposing using the wasted
space there. The reason we use so many is generally I/O port bottlenecks: I
think each Lattice SRAM allows one 32 bit read and one 32 bit write per cycle.)
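
A back-of-the-envelope sketch of how many per-thread register banks fit in one
such 512 byte block. The slots-per-thread count is left as a parameter because
it is an assumption, and banks_per_block is a made-up name:

```c
#include <assert.h>
#include <stdint.h>

#define SRAM_BLOCK_BYTES 512u  /* smallest ice40 SRAM, per the text above */
#define BYTES_PER_REG      4u  /* 32-bit registers */

/* How many per-thread register banks fit in one SRAM block.
 * slots_per_thread is an assumption supplied by the caller. */
static inline uint32_t banks_per_block(uint32_t slots_per_thread)
{
    assert(slots_per_thread > 0);
    return SRAM_BLOCK_BYTES / (slots_per_thread * BYTES_PER_REG);
}
```

With 16 slots per thread that works out to 8 banks, and with 32 slots to 4.
This only counts capacity and ignores the port-bandwidth reasons for spreading
registers across several blocks.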

Rob