[J-core] Porting J1 to LiteX

Tue Jun 15 21:45:26 UTC 2021

Hi Jeff,

Good to hear from you again!

> But does it fit reasonably?  I had found that J1 was impractical in 
HX, and so did not investigate further.

I guess it depends on what you mean by "reasonably": LUT usage or RAM usage.

The entire SoC uses ~5300 LUTs on HX8K vs ~4300 LUTs on UP5K. This 
leaves room for a few peripherals (LEDs, GPIO, UART), XIP from SPI 
flash, and a cache.

However, since the HX8K doesn't have SPRAM, I reduced the amount of 
"bulk" RAM to 1Kb. This is still enough to run the bootrom, and there's 
still a good 25% of the EBR (4kB) left unused. The smallest ARM 
microcontrollers, e.g. the LPC810, had about 4kB of flash and 1kB of 
RAM, and I am partial to msp430 (some of them have 128 bytes of RAM). 
So I'm very tolerant of microcontrollers w/ limited memory :D!

Based on prior experience targeting Micropython to 8k parts, I think 8k 
support is worth exploring at this point. Using j1 to its full power w/ 
e.g. Micropython on HX8K will most likely require a small icache and SPI 
Flash XIP. This configuration w/ lm32 on TinyFPGA halves execution time 
compared to without; we can probably go without a dcache. And even if 
it's not ideal, we can use j1 until j0 is ready :D!

If you want to duplicate my results and see LUT/EBR usage, use my copy 
of j1 (https://github.com/cr1901/jcore-j1-ghdl/tree/hx8k) and run "make 
TARGET=ice40hx8k_b_evn". LiteX is using my own copy of j1 for now, just 
in case I need to make changes and experiment; this is temporary.

> for now another place to look is here: 
https://github.com/j-core/j-core-ice40/tree/master/testrom which while 
less clean, is closer to what you'll need.

Ack. Where is the script/program you use to convert the testrom to a 
VHDL array? I think I'd rather reuse yours for now than write my own.
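
In case it helps to pin down what I mean by "convert the testrom to a VHDL array", here is a rough Python sketch. This is not the j-core script. The function name `rom_to_vhdl`, the constant name `rom_init`, the 32-bit word width, and the big-endian byte order are all my own assumptions for illustration.

```python
def rom_to_vhdl(data: bytes, name: str = "rom_init", word_bytes: int = 4) -> str:
    """Render raw ROM bytes as a VHDL constant array of fixed-width words.

    Hypothetical helper: the names, word width, and byte order are assumptions.
    """
    if not data:
        raise ValueError("ROM image is empty")

    # Zero-pad so the image is a whole number of words.
    data = data + b"\x00" * ((-len(data)) % word_bytes)

    # Assumption: words are stored big-endian in the image.
    words = [
        int.from_bytes(data[i:i + word_bytes], "big")
        for i in range(0, len(data), word_bytes)
    ]

    hex_digits = word_bytes * 2
    # Named association (index => value) also works for a one-word ROM,
    # where a positional aggregate would not be legal VHDL.
    entries = [f'    {i} => x"{w:0{hex_digits}X}"' for i, w in enumerate(words)]

    return "\n".join([
        f"type {name}_t is array (0 to {len(words) - 1}) "
        f"of std_logic_vector({word_bytes * 8 - 1} downto 0);",
        f"constant {name} : {name}_t := (",
        ",\n".join(entries),
        ");",
    ])
```

Something like `print(rom_to_vhdl(open("testrom.bin", "rb").read()))` would then emit the array, where `testrom.bin` is a placeholder file name.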

> keep in mind that J1 is still a full Harvard machine, so you'll need 
to mux it down to a single master.

LiteX provides its own mux on the Wishbone bus, so I would adapt both 
the D and I buses to Wishbone before the mux.
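
To make the idea concrete, here is a small behavioural model in plain Python of two masters (I and D) sharing one bus. It is only a sketch of the arbitration concept and not LiteX gateware. The names `Request`, `SharedBus`, `arbitrate`, and `step` are made up, and letting the data bus win ties is an assumption on my part.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Request:
    """One bus request from a master (hypothetical model, word-addressed)."""
    addr: int
    we: bool = False   # write enable
    data: int = 0      # write data, ignored on reads


class SharedBus:
    """Single-master bus backed by a list acting as memory."""

    def __init__(self, mem: List[int]):
        self.mem = mem

    def access(self, req: Request) -> int:
        # A write stores the data and echoes it; a read returns the stored word.
        if req.we:
            self.mem[req.addr] = req.data
            return req.data
        return self.mem[req.addr]


def arbitrate(ibus: Optional[Request], dbus: Optional[Request]) -> Optional[str]:
    """Pick the bus owner for this cycle. Assumption: data wins ties."""
    if dbus is not None:
        return "d"
    if ibus is not None:
        return "i"
    return None


def step(bus: SharedBus, ibus: Optional[Request],
         dbus: Optional[Request]) -> Tuple[Optional[str], Optional[int]]:
    """Run one cycle; the master that is not granted must retry later."""
    grant = arbitrate(ibus, dbus)
    if grant == "d":
        return grant, bus.access(dbus)
    if grant == "i":
        return grant, bus.access(ibus)
    return None, None
```

With both masters requesting in the same cycle, only the D access completes and the instruction fetch stalls for a cycle, which is the cost of collapsing a Harvard machine onto one bus.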

> Yes, there are a few proposals... just microcode, only SH1 
instructions (no MAC or other DSP functions).  We've also just not had 
the focus.

Ack. This is a proof-of-concept for now. Once the CPU is in LiteX and 
working, the hardest part is done, and we can iterate to make the 
integration better.

> I don't think timing closure, on FPGA multipliers tend to be very 
fast.  There are a few critical paths, the one that erks me is the T bit 
feeding into the microcode sequencer.  But when I wrote the MAC unit, it 
was very clean, even if it's picked up a bit of cruft since.

Ack. One thing I'd like to add: nextpnr has trouble routing the up5k 
version of the SoC, even with the DSPs and SPRAM relieving about 1k LUTs 
for other use. nextpnr can take upwards of 5 minutes to route on up5k, 
and by changing the PCF, I could get nextpnr to take over 10 minutes to 
route before I cancelled it. I'll ask one of the nextpnr devs for some 
insight.

> IIRC, some are instruction chewers.  J1 is a highly encoded and more 
complex operation per instruction ISA, and pipelined machine with 
parallel ALU, MAC and shift units.  The throughput might be comparable, 
even at a slower clock :)

The time it takes to checksum the main payload in LiteX BIOS may be a 
good benchmark.
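
For reference, the kind of workload I have in mind looks like the bitwise CRC below. It is a Python sketch of a standard reflected CRC-32 (polynomial 0xEDB88320). I am assuming a checksum of this style; the real benchmark would be the BIOS's own C routine running on the target CPU, and `crc32_bitwise` is just an illustrative name.

```python
import zlib


def crc32_bitwise(data: bytes) -> int:
    """Bit-at-a-time reflected CRC-32, the same result as zlib.crc32."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        # Eight shift/conditional-XOR steps per byte: lots of shifts,
        # tests, and branches, which is why it makes a simple CPU benchmark.
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    payload = bytes(range(256)) * 16  # placeholder payload, not a real image
    assert crc32_bitwise(payload) == zlib.crc32(payload)
    print(hex(crc32_bitwise(payload)))
```

The inner loop is shift, test, and XOR, so it mostly exercises the ALU, the shifter, and the T-bit branch path rather than the multiplier.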

Sincerely,

-- 
William D. Jones
wjones at wdj-consulting.com