[J-core] Roadmap question (SMP)

Wed Sep 7 13:44:24 EDT 2016

On Tue, Sep 6, 2016 at 3:55 PM, Rob Landley <rob at landley.net> wrote:
> On 09/06/2016 04:00 PM, Cedric BAIL wrote:
>> On Tue, Sep 6, 2016 at 10:53 AM, Rob Landley <rob at landley.net> wrote:
>>> What would be involved in implementing higher-order SMP: 4-way, 8-way,
>>> or even 16-way? And is doing so interesting to do down the road, or
>>> should we just try to throw DSPs at everything?
>>
>> I haven't seen much information on the DSP you are working on.
>
> It's not out yet.
>
>> Is there any documentation on it ?
>
> No, and that's part of the _reason_ it's not out yet. We need to figure
> out how to make it less impossible to program.

Hehe, so maybe this discussion could in fact be about how to make the
DSP more useful.

>> Or shorter question, would it be
>> possible to convert SPIR-V (Vulkan intermediate representation) to
>> something manageable by the DSP. SPIR-V is a GPGPU intermediate
>> representation and OpenCL/CUDA kind of language can be built on top of
>> it.
>
> You know how the Cell processor had 8 DSPs glued to the sides, and the
> problem turned out to be nobody knew how to program them?
>
>   http://blog.regehr.org/archives/1049

Yes, that is exactly my point. A DSP that is not easy to program, and
whose code cannot be recycled for something else, will have a hard
time getting used.

> As far as I can tell, they had it easy.

>> It actually has the notion of compute node. The main reason I am
>> asking this specifically is that it is quite costly usually to port
>> and maintain software for DSP.
>
> Yes. Especially since your average DSP can't access main memory, and
> basically has a small SRAM cache it twiddles data in and external
> circuitry pumps data into and out of it. (And has no interrupt
> controller, just a CPU cycle counter you can use to schedule the next
> exit from your event loop.)

This is very close to what a GPU core is: an I-cache, a D-cache, a
local SRAM scratchpad, a vector unit and a way to transfer data between
the inside and the outside. Oh, and usually a not too good way to do
flow control and to synchronise work with other cores.

>> They end up being used for only a very
>> small portion of the work load and most open source software don't
>> even care about them. If you could have a SPIR-V frontend, this would
>> enable reusing the work done on other platform and obviously affect
>> the answer to your question a lot in my opinion.
>
> My understanding is "not even remotely", but I'll leave it to Jeff to
> figure out what details he wants to talk about there.

As you seem to be trying to figure out how to make it easier to
program, can I recommend that you seriously look at SPIR-V? It is a
child of OpenCL SPIR (which is based on the LLVM intermediate
representation). The TI DSP is a possible target of OpenCL, and I guess
they should provide a SPIR-V frontend for it at some point, if they
have not already. With some additions it should be possible to support
SPIR-V on a DSP, and that would make this DSP a lot more useful.

There is a SPIR-V to NIR translator in Mesa that I guess could be a
good spot to start a compute node backend from. I have never looked at
the Mesa code, so this is just an educated guess.

SPIR-V is interesting as it is likely to become the target of a lot of
future development related to GPGPU and graphics in general. We
should see domain-specific languages and open source libraries start to
use it. I believe it should be possible to use it to offload some parts
of JPEG decompression/compression (patching libjpeg-turbo to do so
should be useful for many users). In general, some parts of the audio
and video codecs could be offloaded, and SPIR-V would provide the
infrastructure to have something reusable. I can also see other parts
of the stack that could get offloaded, like the span line generation
of FreeType getting accelerated by it at some point. There are also
applications that use OpenCL today (I doubt they would work any time
soon on a J2, but well, on an SH4 at a higher frequency, maybe?), like
darktable and LibreOffice.
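
To give an idea of the shape of code that offloads well, here is a
minimal sketch in plain Python. It is not SPIR-V and not libjpeg-turbo
code: the names `ycbcr_to_rgb_kernel` and `dispatch` are made up for
illustration, and colour conversion is only my assumption of a JPEG
stage one would pick first. The point is that the kernel handles one
pixel and depends only on its index, which is what a compute node wants.

```python
def clamp_u8(x):
    """Round and clamp a float to the 0..255 range of an 8-bit sample."""
    return max(0, min(255, int(round(x))))


def ycbcr_to_rgb_kernel(i, y, cb, cr, out):
    """Per-work-item kernel: convert pixel i using the JFIF YCbCr equations."""
    r = y[i] + 1.402 * (cr[i] - 128)
    g = y[i] - 0.344136 * (cb[i] - 128) - 0.714136 * (cr[i] - 128)
    b = y[i] + 1.772 * (cb[i] - 128)
    out[i] = (clamp_u8(r), clamp_u8(g), clamp_u8(b))


def dispatch(kernel, n, *buffers):
    """Stand-in for a compute dispatch: run the kernel once per index.

    A real runtime would spread these iterations over DSP or GPU lanes;
    here they simply run one after the other.
    """
    for i in range(n):
        kernel(i, *buffers)


# Hypothetical 4-pixel image, one list per plane.
y = [128, 255, 0, 255]
cb = [128, 128, 128, 128]
cr = [128, 128, 128, 255]
rgb = [None] * len(y)
dispatch(ycbcr_to_rgb_kernel, len(y), y, cb, cr, rgb)
```

Because no iteration reads what another one writes, the loop in
`dispatch` can be split across any number of cores without changing
the result.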

My point is that if you manage to enable a SPIR-V target, you will
benefit from optimizations done for other platforms, making this DSP
and this platform really interesting. That would be my answer to your
earlier question of whether you should work on more SMP cores or more
DSPs.

> Our DSPs are enormously powerful, and very much not the same as SMP.
>
>> What kind of multi-issue do you have in mind ? SIMD, VLIW or
>> superscalar ?
>
> Approximately the kind sh4 was doing?

If I remember correctly, the SH4 was a 2-way superscalar CPU with a
vector unit, but I think you do not plan to implement this vector
unit, or maybe plan something different? I don't remember exactly what
was said on the subject, or where.

>> Also wouldn't an SMT setup scale better as you could
>> share the caches (maybe increase their size) and almost all the stages
>> in the pipeline (only the registers and a few bits of state would be
>> duplicated) ?
>
> Hyper-threading you mean? We're proposing 2 execution units, not 3. Thus
> we haven't got the problem of keeping the third one busy.

Yes, some form of hyper-threading.

> SMT not only dedicates one of the three execution units to each thread
> and lets them fight over the middle one, but it also leverages the
> register renaming infrastructure you've already _got_ for branch
> prediction and speculative execution. (All that speculation is
> ping-ponging back and forth between register profiles, SMT is basically
> just pinning different ones into different execution units to advance
> totally unrelated contexts in parallel.)
>
> If you're not already doing all that speculation, you haven't got that
> infrastructure to leverage (and your chip's probably half as big).

Well, I was thinking of a simpler solution here: just adding a
register bank without duplicating the execution unit. The instruction
fetch and decode would switch from one "processor" context to the other
on each cycle, and tag each instruction with a bit saying which
register bank/"processor" context it operates on. If an error happens,
the exception would only affect the context the instruction is linked
with, and nothing should need to be rolled back or cancelled in the
other context.

This is from a software developer's point of view, a few miles away
from how things are done in hardware, so I could clearly be missing
something here. But the idea would be that there is an increase in
silicon use, sure, but a small one: just another register bank and a
little more for handling the two "contexts" and the deeper pipeline
(for higher frequency), but without the complexity, described in a
previous email, of cancelling work that shouldn't be done.
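
The interleaving idea above can be written down as a toy Python model.
This is only a sketch of the scheme as I describe it, not of anything
in J-core: the three-instruction mini ISA (`li`, `add`, `div`) and the
names `Context`, `step` and `run` are invented for the example. There
is one shared `step` function (the execution unit) and one register
bank per context, and a fault only marks the context that caused it.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """One "processor" context: its own register bank, PC and fault flag."""
    program: list
    regs: list = field(default_factory=lambda: [0] * 4)
    pc: int = 0
    faulted: bool = False

    def done(self):
        return self.faulted or self.pc >= len(self.program)


def step(ctx):
    """Shared execution unit: run one instruction against ctx's registers."""
    op, dst, a, b = ctx.program[ctx.pc]
    try:
        if op == "li":            # load immediate: regs[dst] = a
            ctx.regs[dst] = a
        elif op == "add":
            ctx.regs[dst] = ctx.regs[a] + ctx.regs[b]
        elif op == "div":
            ctx.regs[dst] = ctx.regs[a] // ctx.regs[b]
        else:
            raise ValueError(op)
        ctx.pc += 1
    except (ZeroDivisionError, ValueError):
        ctx.faulted = True        # the exception stays inside this context


def run(contexts):
    """Fetch alternates between contexts, one per cycle (round-robin)."""
    trace = []                    # which context issued on each busy cycle
    cycle = 0
    while not all(c.done() for c in contexts):
        idx = cycle % len(contexts)
        cycle += 1
        if contexts[idx].done():
            continue              # idle slot: this context has nothing left
        trace.append(idx)
        step(contexts[idx])
    return trace


a = Context([("li", 0, 6, 0), ("li", 1, 7, 0), ("add", 2, 0, 1)])
b = Context([("li", 0, 1, 0), ("div", 1, 0, 2), ("li", 3, 9, 0)])
trace = run([a, b])
```

Context `b` divides by zero on its second instruction and stops, while
context `a` runs to completion with its registers untouched. Nothing
has to be rolled back because the two contexts never share a register.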

> As far as I can tell, rather a lot of the past 20 years of processor
> design has been about Moore's Law relentlessly increasing transistor
> counts and engineers scrambling to consume them ala Lucille Ball in the
> chocolate factory. The doubling period's been increasing but it's still
> been way faster than a linear increase. You know how corporate
> departments that have to spend their budget by the end of the fiscal
> year or they'll get less money next year? Processor engineers have
> been trying very hard to turn a fixed transistor allotment into even
> tiny improvements in performance. (Why fixed? They need this many I/O
> pads and this much distance between thermal hotspots from the massive
> overclocking and can only cut the wafer THIS small, so you've got
> this much real estate and you either put transistors there or leave the
> silicon surface area blank, and it costs about the same to manufacture
> either way. There's probably an element of gates' law where software
> expands to fill available memory largely because it's there, so nobody
> does the extra work to slim it down.)
>
> We're not trying to find a way to justify running through the next
> iteration of an ever-increasing budget. We're treating circuitry as an
> expense, which is only worth adding if the returns justify it. That's
> why we fit in an lx9 and most of the others don't.

Yes, and this is what makes this project so interesting. I am just
throwing ideas out here to see if there is a way to do this without the
classic cost paid by architectures that have added it on top of their
already complex designs. Also, at the same time I am learning
interesting stuff :-)

>> So 1 and 3 wouldn't be a problem for a SMT case, no ?
>> Wouldn't SMT also allow for a deeper pipeline without increasing
>> energy consumption too much compared to other improvements ?
>
> You don't always _want_ a deeper pipeline. (Pentium 4, case in point.) A
> deeper pipeline is a _price_ you pay for keeping the individual stages
> small and thus letting you clock your chip faster.

Oh, yes, a deep pipeline for a single stream of instructions has been
shown to be very difficult to use efficiently. Apparently it is hard to
keep a core busy because of memory latency and a limited set of
registers :-)
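
A back-of-the-envelope way to see why extra contexts help here is
below. This is a first-order textbook-style model and not a
measurement of any real core; the function name and the cycle counts
in the example are assumptions. Each thread runs for `run_cycles`, then
waits `stall_cycles` on memory, and other threads fill the gap.

```python
def utilization(threads, run_cycles, stall_cycles):
    """Fraction of cycles the execution unit is busy, first-order model.

    Each thread alternates run_cycles of work with stall_cycles of
    waiting; switching between threads is assumed to cost nothing.
    """
    if threads < 1 or run_cycles <= 0 or stall_cycles < 0:
        raise ValueError("need threads >= 1, run_cycles > 0, stall_cycles >= 0")
    return min(1.0, threads * run_cycles / (run_cycles + stall_cycles))


# Hypothetical numbers: 4 cycles of work per 12-cycle memory stall.
one = utilization(1, 4, 12)
two = utilization(2, 4, 12)
four = utilization(4, 4, 12)
```

With those made-up numbers a single instruction stream keeps the core
busy a quarter of the time, two contexts half of the time, and four
contexts saturate it; adding more after that buys nothing.
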
-- 
Cedric BAIL