[J-core] Roadmap question (SMP)

Tue Sep 6 18:55:05 EDT 2016

On 09/06/2016 04:00 PM, Cedric BAIL wrote:
> On Tue, Sep 6, 2016 at 10:53 AM, Rob Landley <rob at landley.net> wrote:
>> What would be involved in implementing higher-order SMP: 4-way, 8-way,
>> or even 16-way? And is doing so interesting to do down the road, or
>> should we just try to throw DSPs at everything?
> 
> I haven't seen much information on the DSP you are working on.

It's not out yet.

> Is there any documentation on it ?

No, and that's part of the _reason_ it's not out yet. We need to figure
out how to make it less impossible to program.

> Or shorter question, would it be
> possible to convert SPIR-V (Vulkan intermediate representation) to
> something manageable by the DSP. SPIR-V is a GPGPU intermediate
> representation and OpenCL/CUDA kind of language can be built on top of
> it.

You know how the Cell processor had 8 DSPs glued to the sides, and the
problem turned out to be nobody knew how to program them?

  http://blog.regehr.org/archives/1049

As far as I can tell, they had it easy.

> It actually has the notion of compute node. The main reason I am
> asking this specifically is that it is quite costly usually to port
> and maintain software for DSP.

Yes. Especially since your average DSP can't access main memory, and
basically has a small SRAM cache it twiddles data in and external
circuitry pumps data into and out of it. (And has no interrupt
controller, just a CPU cycle counter you can use to schedule the next
exit from your event loop.)
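
A rough sketch of that programming model, simulated in Python rather than
written for any real DSP. Everything here is made up for illustration: the
name run_event_loop, the task names, and the periods are assumptions, and
the work itself is modeled as taking zero cycles. The point is only that
with no interrupt controller, the loop itself has to watch the cycle
counter and decide when to run each piece of work.

```python
import heapq

def run_event_loop(tasks, total_cycles):
    """Simulate a polled event loop driven only by a cycle counter.

    tasks is a list of (period_in_cycles, name) pairs (hypothetical).
    Returns the (cycle, name) pairs in the order the work would run.
    """
    # Min-heap of (next_deadline, tie_breaker, period, name).
    queue = [(period, i, period, name)
             for i, (period, name) in enumerate(tasks)]
    heapq.heapify(queue)

    cycle = 0  # stands in for the hardware cycle counter
    log = []
    while queue and queue[0][0] <= total_cycles:
        deadline, seq, period, name = heapq.heappop(queue)
        # No interrupts: a real loop would spin reading the counter
        # until the deadline; the simulation just jumps ahead.
        cycle = max(cycle, deadline)
        log.append((cycle, name))
        # Reschedule the same work one period later.
        heapq.heappush(queue, (deadline + period, seq, period, name))
    return log

if __name__ == "__main__":
    for cycle, name in run_event_loop([(3, "fill_sram"), (5, "drain_sram")], 10):
        print(cycle, name)
```

If any piece of work overruns its slot, every later deadline slips and
nothing preempts it, which is a big part of why this is painful to program.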

> They end up being used for only a very
> small portion of the work load and most open source software don't
> even care about them. If you could have a SPIR-V frontend, this would
> enable reusing the work done on other platform and obviously affect
> the answer to your question a lot in my opinion.

My understanding is "not even remotely", but I'll leave it to Jeff to
figure out what details he wants to talk about there.

Our DSPs are enormously powerful, and very much not the same as SMP.

> What kind of multi-issue do you have in mind ? SIMD, VLIW or
> superscalar ?

Approximately the kind sh4 was doing?

> Also wouldn't a SMT setup scale better as you could
> share the cache (maybe increase their size) and almost all the stage
> in the pipeline (only the register and a few state would be
> duplicated) ?

Hyper-threading you mean? We're proposing 2 execution units, not 3. Thus
we haven't got the problem of keeping the third one busy.

SMT not only dedicates one of the three execution units to each thread
and lets them fight over the middle one, but it also leverages the
register renaming infrastructure you've already _got_ for branch
prediction and speculative execution. (All that speculation is
ping-ponging back and forth between register profiles, SMT is basically
just pinning different ones into different execution units to advance
totally unrelated contexts in parallel.)

If you're not already doing all that speculation, you haven't got that
infrastructure to leverage (and your chip's probably half as big).

As far as I can tell, rather a lot of the past 20 years of processor
design has been about Moore's Law relentlessly increasing transistor
counts and engineers scrambling to consume them ala Lucille Ball in the
chocolate factory. The doubling period's been increasing but it's still
been way faster than a linear increase. You know how corporate
departments have to spend their budget by the end of the fiscal
year or they'll get less money next year? Processor engineers have
been trying very hard to turn a fixed transistor allotment into even
tiny improvements in performance. (Why fixed? They need this many I/O
pads and this much distance between thermal hotspots from the massive
overclocking and can only cut the wafer THIS small, so you've got
this much real estate and you either put transistors there or leave the
silicon surface area blank, and it costs about the same to manufacture
either way. There's probably an element of Gates' law where software
expands to fill available memory largely because it's there, so nobody
does the extra work to slim it down.)

We're not trying to find a way to justify running through the next
iteration of an ever-increasing budget. We're treating circuitry as an
expense, which is only worth adding if the returns justify it. That's
why we fit in an lx9 and most of the others don't.

> So 1 and 3 wouldn't be a problem for a SMT case, no ?
> Wouldn't also SMT allow for deeper pipeline without increasing energy
> consumption too much compared to other improvement ?

You don't always _want_ a deeper pipeline. (Pentium 4, case in point.) A
deeper pipeline is a _price_ you pay for keeping the individual stages
small and thus letting you clock your chip faster.
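
A toy model of that tradeoff, with every number invented for illustration
(these are not j2 or Pentium 4 figures, and time_per_instruction_ns is a
hypothetical name). It assumes the total logic delay is split evenly across
stages, each stage adds a fixed latch overhead, and each mispredicted
branch costs roughly a pipeline refill.

```python
def cycle_time_ns(stages, logic_delay_ns=10.0, latch_overhead_ns=0.2):
    """Clock period: a slice of the logic plus a fixed per-stage latch cost."""
    return logic_delay_ns / stages + latch_overhead_ns

def time_per_instruction_ns(stages, branch_fraction=0.2, mispredict_rate=0.1):
    """Average time per instruction, counting pipeline refills on mispredicts.

    All parameters are assumed values, chosen only to show the shape of the
    curve.
    """
    # Each mispredicted branch throws away roughly (stages - 1) cycles.
    cycles_per_instruction = 1.0 + branch_fraction * mispredict_rate * (stages - 1)
    return cycle_time_ns(stages) * cycles_per_instruction

if __name__ == "__main__":
    for stages in (5, 10, 20, 40, 100):
        print(stages, round(cycle_time_ns(stages), 3),
              round(time_per_instruction_ns(stages), 3))
```

Under these assumptions the clock period keeps shrinking as stages are
added, but time per instruction bottoms out and then gets worse, because
the refill penalty grows with depth while the latch overhead never goes
away.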

Our stages are already pretty small.

> Best,

Rob

P.S. Disclaimer: I'm mostly a software guy. I wrote stuff like
http://www.fool.com/portfolios/rulemaker/2000/rulemaker000901.htm years
ago (check the dates), but am by no means an expert here. I didn't
design any of j2, I've just been trying to understand it for the past
couple years, been in a bunch of meetings about it, and pestered people
with questions. If you think I'm wrong, shout out...