[J-core] Synthesis for Silicon?

Tue Sep 10 23:20:16 EDT 2019

Nice to see the project is still ongoing...

On 9/10/2019 9:02 PM, D. Jeff Dionne wrote:
> On Sep 10, 2019, at 5:43, Joh-Tob Schäg <johtobsch at gmail.com> wrote:
>
>>>> Can you tell me and the community more about this?
>>> J2 was designed for a mixed signal grid monitor chip, on a TSMC 152nm process.  It came in at about 40k gates, and about 0.45mm^2 in this process.
>> This is just the J2 or including the DSP?
> This is just the J2 CPU core, not an SoC.
>
>> Any estimates for the J1 already?
> Probably around 15-18k gates.
>
>>>> How expensive are the fixed and per chip costs?
>>> Depends on if it's a multi project wafer or a full mask set.  180nm full mask set is about $100k US.
>> I have no idea what the difference between multi project wafer or a full mask set is. Can someone explain?
> An MPW, also called a shuttle run, is a shared wafer.  The mask set (that is, the tooling) and the wafer cost are shared between us and a bunch of other (unrelated) customers.  A full mask set is the up front tooling to produce full wafers full of chips.
>
>>>> What clock rates are possible at the targeted node?
>>> We're still optimizing.  We can for sure hit 125MHz in 180nm, and probably 500MHz in 45nm.
>> Hmm, I looked at some processors I could find at the same node.
>> Wikipedia mentions
>> - the PowerPC 7455 “Apollo 6” was produced with Motorola's 0.18 µm
>> (180 nm) HiPerMOS and it is said to clock >1GHz.
>> - the second generation Athlon, the Thunderbird, clocked between 600
>> MHz to 1.4 GHz and was made using 180nm
>> - the Celeron Willamette was clocked at 1.7 GHz also produced at 180nm
>> - the PS2 emotion engine at 294 MHz being a vector processor
> These are all so called ‘full custom’ layouts, not synthesizable soft cores.  A good comparison is ARM M4, where we are roughly equivalent in equivalent process.

So, I am guessing this makes a big difference in terms of what is 
achievable?...

>> For 45nm i could find several x86 chips in the 2~3GHz, a PowerPC at
>> 1.243 GHz and for ARM the OMAP4 at 1.2 and a Samsung Exynos 3110 at
>> 1.0~1.2 GHz. There was also an outlier the z196 which clocked up to
>> 5.2 GHz
> We are closer here.  J2 is a 5 stage pipeline.  These are significantly deeper, and therefore the clock speed is faster… each stage is simpler.

In my prototypes I have mostly ended up with a 6-stage pipeline (with 2 
decode and 2 execute stages).
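
As a rough illustration of the point above about deeper pipelines clocking 
faster: if the total logic delay is split evenly across N stages and each 
stage pays a fixed register overhead, the clock period shrinks as N grows. 
The delay figures below are made-up placeholders (picked so that 5 stages 
lands at 125 MHz), not measurements of J2 or of my core.

```python
# Toy model: clock period = (total logic delay / stages) + per-stage register overhead.
# Both delay figures are hypothetical placeholders, not measurements of any real core.

def max_clock_mhz(total_logic_ns, stages, reg_overhead_ns):
    """Estimate the maximum clock (MHz) for an evenly balanced pipeline."""
    if stages < 1:
        raise ValueError("need at least one pipeline stage")
    period_ns = total_logic_ns / stages + reg_overhead_ns
    return 1000.0 / period_ns


if __name__ == "__main__":
    # Assumed: 35 ns of total logic, 1 ns of flip-flop setup/clk-to-q per stage.
    for stages in (5, 6, 10, 20):
        print(stages, "stages:", round(max_clock_mhz(35.0, stages, 1.0), 1), "MHz")
```

Real designs do not balance this neatly, and the register overhead puts a 
ceiling on how much extra depth helps.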

I have since moved away from SH to a more heavily customized design (more 
focused on performance), though not much of it is running on actual 
hardware yet; mostly I have been doing a lot of simulation and testing, 
along with ISA-level emulation.

Also now have a job (running an industrial waterjet machine), so the 
amount of time available for working on this stuff has been 
significantly reduced.

>> Why is J-Core clocked so much slower at the same node size than
>> commercial chips which aimed for performance or performance per Watt?
>> Is it in the ISA or just in your design which does not prioritize performance?
> It’s not the ISA, it’s the choice of implementation methodology.  Squeezing 1GHz out of 180nm is just not necessary (or done) anymore.  We are not slower than other soft cores, synthesized to standard cell.  The ISA does make a difference though, we perform better at a given clock speed.
>
>>> Not right now, no.  The chip SEI was doing was about 1 to 2W power budget.  But it had 16 DSP cores and 32 hardware accelerators for power system measurements, and 6 high speed LVDS SERDES.
>> Are you looking into releasing the DSP design too?
>> Will it be under the J-Core name too?
> Yes.  It is tentatively called S-Core DSP.  It’s a very traditional DSP, and so the audience is slightly different.  The design is 2 stage, 5 way issue, 18bit, X Y P Harvard memory space design, with 4 way multithreading… which is necessarily a bear to program.  But if you need to do signal processing on chip, it’s really something :)

What I have currently ended up with is sort of intermediate:
   6 stage pipeline (IF ID1 ID2 EX1 EX2 WB)
   1-3 execute lanes (explicitly parallel, via tagging rather than VLIW)
     Design allows binary compatibility between different profiles.
     Though, running 3-wide code on a 2-wide core will fall back to scalar.
     Wider is possible, but no current plans to do so.
   32x 64-bit GPRs
   FPU operates using GPRs (double precision)
     Basic support for packed-integer SIMD
   Modified Harvard (L1 I$ and D$, larger shared L2 cache)
     Currently: 2K+2K L1, 32K..128K L2
   MMU (optional) uses a 64x 4-way TLB
     Currently, physical addresses are 32-bit, virtual may be 32 or 48.
     With 32-bit VA's, it typically also uses 32-bit pointers.
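
The lane fallback in the list above can be sketched roughly as follows. 
The function name is hypothetical, and treating every "code wider than the 
core" case as a scalar fallback is an assumption generalized from the 
3-wide-on-2-wide case; this is a model of the behavior, not the actual 
decode logic.

```python
def effective_issue_width(code_width, core_width):
    """Lanes actually used when code tagged for `code_width` lanes runs on a
    core with `core_width` execute lanes (hypothetical helper)."""
    if code_width < 1 or core_width < 1:
        raise ValueError("widths must be at least 1")
    if code_width <= core_width:
        # The core has enough lanes: run the code as tagged.
        return code_width
    # Assumption: any code wider than the core falls back to scalar,
    # as described for 3-wide code on a 2-wide core.
    return 1


if __name__ == "__main__":
    for code_w, core_w in ((3, 3), (2, 3), (3, 2), (3, 1)):
        print(code_w, "wide code on", core_w, "wide core ->",
              effective_issue_width(code_w, core_w))
```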

Currently targeting XC7S25, XC7S50, and XC7A100.
   On XC7S25, need to use a scalar core, and 32K L2 (direct-mapped).
   On XC7A100, can do 3-wide and use 128K of L2 (2-way).

For an XC7S25, a case could be made for using a lighter-weight 32-bit 
ISA, but was able to get it to fit so mostly called it "good enough".

It is possible to optimize for either code density or performance.
The ISA allows 16, 32, and 48 bit instruction forms (variable length).
However, the 16 and 48 bit forms are scalar-only, and support for 48 bit 
forms is optional.
The fixed-length subset uses only 32-bit instructions.

At 50 MHz, it can currently play Doom reasonably well (though, Doom 
spends a lot of time doing things which aren't terribly easy to extract 
much usable ILP from).

However, for things like color-cell encoding, was able to get it pretty 
fast.
Not really tested, but it should also do pretty well for things like 
encoding or decoding JPEGs.
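
For anyone unfamiliar with color-cell encoding, here is a minimal sketch of 
the classic scheme: each 4x4 block is reduced to two representative values 
plus a 16-bit selector mask. This is the generic textbook form working on 
single-channel values, not necessarily the format or algorithm used here; 
the function names are made up for the example.

```python
def encode_cell(pixels):
    """Encode 16 values (one 4x4 cell, row-major) as (color_lo, color_hi, mask).

    Bit i of mask is set when pixel i is above the cell mean, meaning it
    should be drawn with color_hi; otherwise it uses color_lo.
    """
    if len(pixels) != 16:
        raise ValueError("expected 16 pixels for a 4x4 cell")
    mean = sum(pixels) / 16.0
    mask = 0
    lo, hi = [], []
    for i, p in enumerate(pixels):
        if p > mean:
            mask |= 1 << i
            hi.append(p)
        else:
            lo.append(p)
    # lo is never empty, since the smallest pixel is always <= the mean.
    color_lo = round(sum(lo) / len(lo))
    # hi is empty only for a flat cell; reuse color_lo in that case.
    color_hi = round(sum(hi) / len(hi)) if hi else color_lo
    return color_lo, color_hi, mask


def decode_cell(color_lo, color_hi, mask):
    """Expand an encoded cell back to 16 values."""
    return [color_hi if (mask >> i) & 1 else color_lo for i in range(16)]
```

A real encoder would work on RGB (usually splitting on luma) and pack the 
two colors and the mask into a fixed-size cell.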

Trying to get Quake to perform acceptably at 50MHz is still "easier said 
than done" (and, it was undue pain trying to keep things passing timing 
at 100MHz; touch much of anything and timing fails...).

It doesn't help that, as is, my C compiler is still pretty bad at 
extracting much ILP; I can get somewhat better results with 
hand-written ASM, though. I haven't yet resorted to rewriting the inner 
core of Quake's renderer in ASM.

For display hardware, mostly doing a color-cell display with the video 
memory in Block-RAM.
On the XC7A100, there is enough BRAM in the FPGA to be able to afford a 
128K VRAM space, which is enough for some bitmapped modes 
(320x200x16bpp, or 640x400 16-color).

It can still do a color-cell mode good enough to make Doom and Quake 
look "passable" with 32K of VRAM, though the 128K bitmap mode looks a 
little nicer. There is also support for a 64K 320x200 8bpp (256-color) 
mode.
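
The VRAM figures for the bitmapped modes follow from simple arithmetic, 
sketched below. The helper name is made up, and it assumes packed pixels 
with no padding (16-color taken as 4 bits per pixel).

```python
KIB = 1024


def framebuffer_bytes(width, height, bits_per_pixel):
    """Bytes needed for a packed framebuffer with no row padding (assumed)."""
    return width * height * bits_per_pixel // 8


# (width, height, bits per pixel, VRAM budget in bytes)
MODES = {
    "320x200 16bpp": (320, 200, 16, 128 * KIB),
    "640x400 16-color": (640, 400, 4, 128 * KIB),
    "320x200 8bpp": (320, 200, 8, 64 * KIB),
}

if __name__ == "__main__":
    for name, (w, h, bpp, budget) in MODES.items():
        need = framebuffer_bytes(w, h, bpp)
        print(name, "needs", need, "bytes of", budget, "fits:", need <= budget)
```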

>>> Cheers.  A little vague, but we're getting there.
>> Any details get the community excited. It is probably good that not all information is released at once anyway.
> Not meaning to drip it out… we’re just running flat out on making things happen, technical and non technical :)
> J.

Yep.

>> Cheers Johann-Tobias
> _______________________________________________
> J-core mailing list
> J-core at lists.j-core.org
> http://lists.j-core.org/mailman/listinfo/j-core