[J-core] Pondering j-core kickstarter ideas.

Tue Apr 26 02:18:45 EDT 2016

On 04/25/2016 11:26 AM, Christopher Friedt wrote:
> Hi Rob,
> 
> On Sun, Apr 24, 2016 at 2:50 PM, Rob Landley <rob at landley.net> wrote:
>>   http://www.youtube.com/watch?v=-vSZHVXp5Cw
> 
> That's a *very* interesting video that you shared, by the way. I was
> working on something similar for about 1.5 years until my company
> decided to shelve the project. I have reserved various judgements
> about those management decisions...
> 
> In any case,
> 
>> the result would be around 36,000 chips running around 250 mhz.
> 
> 250 MHz is plenty for this sort of device, particularly with the SH2 /
> J2 ISA being so dense (i.e. like Thumb / Thumb2).

250 MHz is for the 150 nanometer process. At 45 nanometers you get maybe
450 MHz? (Jeff ran the numbers, we should post them somewhere.)

We're not doing branch prediction and register renaming and such to keep
multiple execution units full. (Instead we explicitly pair the suckers,
which gives us the branch delay slot. Basically a 2-issue VLIW! Woo!)

I'm sure I blathered about that before... Yes I did:

http://www.fool.com/portfolios/rulemaker/2000/rulemaker000222.htm
http://www.fool.com/portfolios/rulemaker/2000/rulemaker000223.htm
http://www.fool.com/portfolios/rulemaker/2000/rulemaker000224.htm
http://www.fool.com/portfolios/rulemaker/2000/rulemaker000225.htm

Anyway, the ADVANTAGE of not doing things like speculative execution is
we never waste work. If we calculate a value it's because we're using
the result. This optimizes for power consumption to performance ratio
instead of absolute performance. So the chip runs cool and your battery
lasts. Also our pipelines are short (currently 5 stage) so we don't get
long-lived bubbles from cache misses and such.

>> What would it take to make an actual, minimal hobbyist _device_?
> 
> Since nobody else has ponied up to give some feedback, I would
> probably say that the best features to add are on-chip peripherals
> for short-range communication:
> 
> * i2c
> * spi

We have that in the works. (Is it not in the new VHDL source drop?)

> * (full) uart(s)

There's a (sad) 16550a implementation in the tree (did we release
that?), but uartlite has its advantages. (Maybe Geoff could say what
they are, other than fewer transistors.)

> * obviously a GPIO controller

You mean in the kernel? That's a Rich question.

> * A2D / DAC

Another Geoff question. :)

> * as large a supported memory area as possible

I asked Niishi-san to describe our memory architecture to us, and right now
the SOC supports 256 megs. Well, the one we released supports 128, but
256 is an easy tweak. We might eventually be able to expand it to 512M,
but doing so would be tricky. And going beyond that is really tricky.

The current physical memory layout is:

0x00000000-0x0FFFFFFF ROM/SRAM
0x10000000-0x1FFFFFFF DRAM
0xABCD0000-??????????? I/O memory

That ROM/SRAM space is a repeated mapping of our 32K ROM/SRAM block. (I
believe it's 30k of ROM and 2K of SRAM, ignoring address bits 15-27 so
you see the same 32k over and over again through that space.)
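
As a rough illustration of the layout above (a sketch, not the actual VHDL
address decoder), here is how a physical address might be classified. It
assumes DRAM occupies 0x10000000-0x1FFFFFFF and that anything between the
end of DRAM and the I/O base is unmapped; the name decode() and the
"unmapped" region are made up for this example.

```python
ROM_SRAM_SIZE = 32 * 1024        # 32K of real ROM/SRAM behind the window
DRAM_BASE = 0x10000000           # assumed start of the 256 meg DRAM area
DRAM_SIZE = 256 * 1024 * 1024
IO_BASE = 0xABCD0000             # I/O memory start; upper bound not known

def decode(addr):
    """Return (region, offset) for a 32-bit physical address."""
    if addr < DRAM_BASE:
        # Upper address bits are ignored, so the same 32K block repeats.
        return ("rom/sram", addr & (ROM_SRAM_SIZE - 1))
    if addr < DRAM_BASE + DRAM_SIZE:
        return ("dram", addr - DRAM_BASE)
    if addr >= IO_BASE:
        return ("io", addr - IO_BASE)
    return ("unmapped", None)    # assumption: nothing responds here
```

With this model, decode(0x1234) and decode(0x9234) land on the same
ROM/SRAM offset, which is the "same 32k over and over" effect.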

The DRAM area is 256 megs, we'll come back to this.

The I/O memory starting location is idiosyncratic and historical, and
will probably be changed as soon as Rich Felker finishes converting the
remaining kernel drivers over to device tree. (Then we can just change
the mappings in the device tree to match where the VHDL responds to I/O
memory addresses for our SOC peripherals in a given build.)

Our DRAM area is 256 megs, and although there's plenty of physical
address space above it we have several bottlenecks to actually using that.

Our first DRAM size bottleneck is our very simple cache architecture.
Our cache tags are 16 bits wide and cache lines are 32 bytes long.
There's 8k of instruction cache and 8k of data cache, each of which is a
set of 256 32-byte cache lines. Each cache also has 512 bytes of
associated cache tag memory, which is an array of 16 bit integers. Each
cache tag is the high bits of the memory location that cache line was
loaded from, right shifted 13 bits (I.E. by the 8k cache size).
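
To make the geometry concrete, here is a small sketch of how an address
would split into tag, line index, and byte offset under the numbers above.
The function name and the exact bit slicing are assumptions for
illustration; the real logic lives in the VHDL.

```python
CACHE_SIZE = 8 * 1024                 # 8k per cache (I and D are separate)
LINE_SIZE = 32                        # bytes per cache line
NUM_LINES = CACHE_SIZE // LINE_SIZE   # 256 lines
TAG_BYTES = NUM_LINES * 2             # one 16-bit tag per line = 512 bytes

def split_address(addr):
    """Split a physical address into (tag, line index, byte offset)."""
    offset = addr & (LINE_SIZE - 1)          # low 5 bits: byte within line
    index = (addr >> 5) & (NUM_LINES - 1)    # next 8 bits: which cache line
    tag = (addr >> 13) & 0xFFFF              # 16 bits above the 8k boundary
    return tag, index, offset
```

For example, split_address(0x10001230) gives tag 0x8000, line 0x91, and
offset 0x10.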

Our cache is _also_ repeatedly mapped, just like the 32k ROM/SRAM above.
By which I mean our cache lines alias every 8k: if you load 0x10001230
and then read from 0x10003230, the second cache line evicts the first,
and when it flushes the old data and loads the new data it'll write a
new high 16 bits into that cache line's tag.
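
The aliasing can be seen in a toy model of a direct-mapped cache. This is
only a sketch of the behavior described, tracking tags and no data; the
class name is invented, and the two addresses are simply a pair exactly 8k
apart (assuming DRAM starts at 0x10000000).

```python
LINE_SIZE = 32
NUM_LINES = 256
CACHE_SIZE = LINE_SIZE * NUM_LINES   # 8k

class DirectMappedCache:
    def __init__(self):
        self.tags = [None] * NUM_LINES   # None means the line is empty

    def access(self, addr):
        """Return 'hit', 'miss' (empty line filled) or 'evict' (line replaced)."""
        index = (addr // LINE_SIZE) % NUM_LINES
        tag = addr // CACHE_SIZE
        old = self.tags[index]
        if old == tag:
            return "hit"
        self.tags[index] = tag           # new high bits go into the tag
        return "miss" if old is None else "evict"

cache = DirectMappedCache()
results = [cache.access(a)
           for a in (0x10001230, 0x10001234, 0x10003230, 0x10001230)]
```

Here results comes out as miss, hit, evict, evict: the two addresses 8k
apart keep kicking each other out of the same line.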

So the bottom 13 address bits (8k cache size) plus 16 high bits of cache
tag gives us 29 bits, for a maximum of 512 megs of DRAM that our current
cache infrastructure knows how to deal with.
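
The same arithmetic as a tiny helper, in case anyone wants to play with
the numbers. The function name is made up, and the 17-bit case is purely
hypothetical (nobody has said wider tags are planned).

```python
MEG = 1024 * 1024

def max_cacheable_bytes(cache_size_bytes, tag_bits):
    """Address range a direct-mapped cache can tag, for power-of-two sizes."""
    low_bits = cache_size_bytes.bit_length() - 1   # log2(8k) = 13
    return 1 << (low_bits + tag_bits)

current = max_cacheable_bytes(8 * 1024, 16) // MEG        # 13 + 16 = 29 bits
hypothetical = max_cacheable_bytes(8 * 1024, 17) // MEG   # one more tag bit
```

With 16-bit tags this gives 512 megs; each extra tag bit would double it.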

The other problem is the type of DRAM our controller drives. Between the
"DDR" and "DDR2" generations, there was something called LPDDR, which is
what we're currently using. DDR was 2.5 volts, and DDR2 was 1.8 volts,
and LPDDR is also 1.8 volts but has some of its own built-in refresh
circuitry and such. It's nice and compact, and the largest you can get
is 2 gigabit, which is 256 megs. (Our current in-house boards have 1
gigabit versions, which is 128 megs. If we go to full DDR2 chips we
might find larger ones but then we need more external refresh circuitry.)

In the current source, see the table in components/ddr2/ddrc_fsm.vhd
starting at line 139.

Wiring up 2 DRAM chips starts getting a lot more complicated. (More I/O
pins, more flimsy connections to make your SOC less reliable, etc.) We
have not opened that can of worms yet. I can ask somebody to write more
about it if you're interested.

> * naturally on-chip clocks and timers are of course very useful

That's not a thing you can do in an FPGA: clocks require a timing
crystal so you can't implement them with just gates. Your FPGA has to
provide one of those as existing circuits that the VHDL compiler knows how
to connect your circuitry up to (these are called "libraries" in
VHDL-speak).

(There's generally a bunch of those, things like phase locked loops and
stuff. Half the porting from one FPGA type to another is figuring out
what libraries you've got and how to plug them into your design. There's
also a VHDL spec with a bunch of "standard" libraries like floating
point registers and such you can just go "gimme" and link against, but
the docs we were using for those were on vhdl.org and that went down a
couple months ago because the company that ran it started redirecting
to their corporate website. I'll have to ask Jeff where to find it these
days.)

> For some of the above, we can naturally bit-bang those protocols,
> but doing so would mean that CPU time for the application would be
> spread thin. Having integrated controllers means less board space
> occupied by external devices, but yes, it does mean that the analog
> I/O stage must also be considered (i2c clock stretching, 3.3 / 5V I/O
> voltage tolerance, for example).

Niishi-san is writing a better DMA engine for SD cards and such. I'll
let you know when we have something to post on that.
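
For a feel of why bit-banging spreads CPU time thin, here is a minimal
sketch of shifting one byte out over SPI (mode 0, MSB first) by hand.
write_pin() is a hypothetical GPIO hook, not a real API on any of our
boards; the point is just that every bit costs the CPU several pin writes
that a hardware controller or DMA engine would otherwise absorb.

```python
def spi_write_byte(value, write_pin):
    """Shift one byte out MSB-first, SPI mode 0 (clock idles low).

    write_pin(name, level) is a hypothetical GPIO hook supplied by the caller.
    Returns the number of GPIO writes the CPU had to perform.
    """
    writes = 0
    for bit in range(7, -1, -1):
        write_pin("mosi", (value >> bit) & 1)  # set data while clock is low
        write_pin("sck", 1)                    # rising edge: peripheral samples
        write_pin("sck", 0)                    # falling edge: ready for next bit
        writes += 3
    return writes
```

That is 24 pin writes per byte before counting any delays needed to meet
the peripheral's timing.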

> For some things, e.g. ethernet, there are a number of discrete
> ethernet MAC / PHY chips available too [1][2]. Notice that SPI is a
> very common interface for ethernet & wireless comms.

We have an ethernet implementation in VHDL, it just needs work. (It was
a bad first pass and Jeff decided on a better way to do it.)

> The more integrated peripherals the better (particularly embedded bus
> peripherals), because that just lets people do more.

We have plans for VHDL to drive all the peripherals in the turtle board.
That kickstarter should be up sometime this coming month.

> Possibly not for the initial fab, but future (more powerful) chips,
> one consideration could be implementing peripheral controllers
> themselves as embedded J2 cores (not in an SMP configuration) as an
> alternative to e.g. Tensilica or opencores controllers.

And/or replacing all the mips and arm chips embedded in everything in
the world out there today...

> If we chose
> J2, the firmware could be updatable rather than hard-coded in ROM, and
> that would address some of the security issues brought up at ELC.
> 
> [1] http://www.microchip.com/wwwproducts/en/en022889
> [2] https://www.maximintegrated.com/en/products/interface/transceivers/78Q8430.html
> 

Rob