[J-core] Roadmap question (SMP)

Tue Sep 6 17:00:57 EDT 2016

On Tue, Sep 6, 2016 at 10:53 AM, Rob Landley <rob at landley.net> wrote:
> What would be involved in implementing higher-order SMP: 4-way, 8-way,
> or even 16-way? And is doing so interesting to do down the road, or
> should we just try to throw DSPs at everything?

I haven't seen much information on the DSP you are working on. Is
there any documentation for it? Or, as a shorter question: would it be
possible to convert SPIR-V (the Vulkan intermediate representation)
into something the DSP can run? SPIR-V is a GPGPU intermediate
representation, and OpenCL/CUDA-style languages can be built on top of
it. It actually has the notion of a compute node. The main reason I am
asking this specifically is that porting and maintaining software for
a DSP is usually quite costly. DSPs end up being used for only a very
small portion of the workload, and most open source software doesn't
even care about them. If you had a SPIR-V frontend, it would let you
reuse the work done on other platforms, and in my opinion that would
obviously affect the answer to your question a lot.

> At the moment we've got a couple bottlenecks I'm aware of:
>
> 1) Cache coherency.
>
> The dual-processor we've got implements cache coherency without a _bus_
> per se, but instead by directly connecting the caches. For cache
> coherency among more processors we'd need something more like the old
> Alpha bus (which the Opteron guys happily recycled when AMD acquired
> that part of DEC's corpse in 1996).
>
> I note Intel went down the cache-connection route for a while, which
> maxed out at 4x and their first x8 chips were sort of two 4x chips doing
> a semi-NUMA thing.
>
> Possibly the right thing here is to have individual ASICs only have 2-4
> processors, and if you want more SMP than that have the ASICs connected
> by a fast bus doing NUMA (which Linux happily supports).
>
> 2) Memory size.
>
> Right now our memory controller drives a type of memory that maxes out
> at 256 megs. In theory we could have multiple instances of said memory
> controller, but we haven't gone there yet. Even 2x is a bit cramped in
> 256 megs, and 4x would probably be pretty unhappy. (I'm sure Rich will
> bring up threading here.)

What about memory bandwidth?

> 3) FPGA size.
>
> An LX25 has 2.7 times the resources of an lx9, so we _might_ be able to
> fit 4-way in there if we can get sh2 squeezed down a bit. The J1 stuff
> would presumably fit 4 instances easily.
>
> An LX45 has 5 times the resources of an LX9, so we could definitely fit
> 4-way in there. 8-way would be a stretch (ala the 4-way on lx25 squeeze),
> and 16 is probably off the table.
>
> My understanding at the moment is we're basically not worrying about
> this while there are still low-hanging fruit to be had in
> single-processor performance. (Multi-issue isn't exactly low-hanging
> fruit, but it's an achievable goal.) But I'm wondering if anybody's put
> any thought into this yet, or is it too far down the road?

What kind of multi-issue do you have in mind: SIMD, VLIW, or
superscalar? Also, wouldn't an SMT setup scale better, since you could
share the caches (and maybe increase their size) and almost all of the
pipeline stages (only the registers and a little state would be
duplicated)? Then 1 and 3 wouldn't be a problem in the SMT case, no?
And wouldn't SMT also allow a deeper pipeline without increasing
energy consumption too much compared to other improvements?
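
To make the area side of that SMT question concrete, here is a rough
back-of-envelope sketch. It is only an illustration: the FPGA budgets
come from the ratios quoted above (LX9 = 100, LX25 = 270, LX45 = 500),
but the core, uncore, and per-thread sizes are placeholder assumptions
of mine, not measurements of J-core, and the function names are
hypothetical.

```python
# Back-of-envelope FPGA area model, in percent of an LX9.
# CORE, UNCORE and PER_THREAD are made-up placeholders (assumptions),
# chosen only to illustrate the argument; they are not J-core figures.

BUDGET = {"LX9": 100, "LX25": 270, "LX45": 500}  # ratios from the thread

CORE = 40        # assumed: one pipeline plus its caches
UNCORE = 20      # assumed: memory controller, peripherals, glue
PER_THREAD = 6   # assumed: extra registers/PC/state per SMT thread


def smp_area(n):
    """N complete cores, each with its own pipeline and caches."""
    return n * CORE + UNCORE


def smt_area(n):
    """One core; only per-thread state is replicated for extra threads."""
    return CORE + (n - 1) * PER_THREAD + UNCORE


def max_ways(budget, area_fn, limit=16):
    """Largest power-of-two n (up to limit) that fits; 0 if none fits."""
    best, n = 0, 1
    while n <= limit and area_fn(n) <= budget:
        best, n = n, n * 2
    return best


if __name__ == "__main__":
    for part, budget in BUDGET.items():
        print(part, "SMP:", max_ways(budget, smp_area),
              "SMT:", max_ways(budget, smt_area))
```

With these placeholder numbers the SMP count grows with the part size
while the SMT thread count is barely limited by area at all. That is
only the area half of the story: SMT threads still share a single
pipeline's issue slots, so fitting N threads does not mean getting
N-way throughput.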

Best,
-- 
Cedric BAIL