[J-core] Roadmap (was Re: j2 llvm repo?)

BGB cr88192 at gmail.com
Wed Feb 8 12:43:29 EST 2017

On 2/8/2017 3:20 AM, Rob Landley wrote:
> On 02/07/2017 03:29 PM, John Paul Adrian Glaubitz wrote:
>> On 02/07/2017 10:26 PM, Rob Landley wrote:
>>> Whatever method you're using to push to your repository seems to be
>>> forcing me to delete my copy and re-clone it each time, which is kind of
>>> inconvenient...
>> That normally happens when changes are force-pushed into the repository.
>>>> It hasn't been tested that much and still remains an educational-purpose
>>>> backend.
>>> Let me know when you're ready for outside testing, I look forward to
>>> trying it...
>> How much is actually missing for sh4? I would love to have this in Debian.
> I'm fairly certain Jeff could do a better explanation of all of this,
> but here's a stab at it (none of which is canon, this is a software guy
> talking about what hardware guys said over dinner months ago, not merely
> half-remembered but half-understood in the first place):
> My understanding is it's not the amount of development that's the
> problem, it's that the back of the envelope calculations say we probably
> can't fit j3 into an LX9 FPGA (at least not the first couple versions),
> so it's been punted until Turtle is out so people have something
> reasonable to run it on.
> There's three main things in j3: adding an MMU, floating point support,
> and some misc new instructions. (There's other stuff like multiple DRAM
> controller instances for >256M memory, making use of the DMA engines,
> and so on. But that's SOC not processor.) The new instructions aren't a
> big deal, the FPU is a largeish transistor count (especially to do
> double precision), and the MMU isn't actually that _complicated_ but it
> adds extra signal routing that takes up a lot more FPGA cells than the
> increase in gate count would suggest. (And all the timing closures have
> to be rethought. Solvable, but nontrivial. I think the xilinx tools did
> something obviously non-optimal and it could benefit from layout work,
> or something?)
> But the main reason we haven't gotten to most of this yet is the
> hardware engineers got buried in 6 months of $DAYJOB customer
> productization stuff based on the existing j2 SOC design that _should_
> be working their way through soon? (The definition of "soon" is a
> management question, not an engineering question.) There's been some
> quick "Is this trivial to do? No? Back on the todo list" stabs at
> non-customer stuff, but mostly small things. We should have cycles to
> properly advance the roadmap again in "a few weeks". (I've asked: "few"
> is like "soon".)
> I believe the main reason we've split "j3" from "j4" is that j4 is
> multi-issue (executing two instructions per clock cycle), which is a big
> deal at the design level and we _know_ that's a lot of work. (The Japan
> contingent seems excited about getting started on it, but we haven't
> opened that can of worms yet beyond a couple whiteboard sessions to
> scope it.)
> What gets moved between j3 and j4 otherwise is a judgement call, we
> could sequence the todo list several ways. The roadmap and the todo list
> are at a different granularity: we could maybe implement the MMU without
> the FPU or vice versa and do a release that way. (And MAYBE it's
> possible to fit an MMU without FPU into an LX9, but we haven't managed
> it yet? I think you'd have to sacrifice the cache to make space, and our
> first stab at an MMU design was integrated with the cache, or something?
> I'm still not a domain expert here, I just ask everybody questions about
> what they're doing when I get the chance...)
> We also don't want too many instruction set versions floating around out
> there confusing the compiler people, if we can help it. We have
> --with-cpu=j2 right now, we'd like to keep versions to a minimum. Should
> the FPU have separate 32 bit and 64 bit modes: from a VHDL build
> perspective and fitting into less FPGA space, sure why not? From a
> toolchain/standardized instruction set perspective: ick, pick one. So
> what configuration granularity level we implement beyond the first
> release is a judgement call we haven't had to make yet. (There's been
> talk of menuconfig in the VHDL build, which would itself be rather a lot
> of work to implement...)

I guess it might be possible to add support for a subset of the 
FPU, and then use traps (and emulation) for the rest?

say, for example, the FPU has registers FR0-FR15 (XF0-XF15 can probably 
safely be omitted, as the compiler doesn't seem to use them, *1). ISA 
support would include FMOV, FADD/FSUB, and FMUL, with the other 
operations generating traps.

the compiler then generates code assuming a full FPU, and the rest is 
emulated via firmware or similar.

*1: or at least, I now have Quake working in my emulator (in its 
pseudo-SH4 mode) using a subset of the FPU (which omits XF/XD registers 
and several of the FPU instructions).

from what I can tell, the compiler (GCC) doesn't use FTRV, FIPR, or FMAC.

while the compiler also doesn't appear to use FSQRT, I used it manually 
(via an ASM stub replacing the C library's "sqrt()"), partly because 
SQRT is used fairly extensively by Quake 1 and 2 (by Quake 3, a trick 
had been devised to approximate "1.0/sqrt(x)" via bit-twiddling; I could 
try back-porting it).
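
for reference, the Quake 3 trick looks like this (the magic constant is 
from the released Quake 3 source; a union is used here instead of the 
original's pointer cast to avoid strict-aliasing trouble):

```c
#include <stdint.h>

/* Quake 3's fast approximate 1.0/sqrt(x): a bit-level initial guess
   refined by one Newton-Raphson step. Accurate to roughly 0.2%. */
float q_rsqrt(float x)
{
    union { float f; uint32_t i; } u;
    u.f = x;
    u.i = 0x5f3759df - (u.i >> 1);              /* magic initial estimate */
    u.f = u.f * (1.5f - 0.5f * x * u.f * u.f);  /* one NR iteration */
    return u.f;
}
```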

another interesting behavior is that GCC seems to often demote double 
calculations to float internally: for example, given a function declared 
as returning double whose result is cast to float at every use, GCC 
seems to compile it using float instructions instead.
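
e.g. something along these lines (a made-up example, but it matches the 
pattern):

```c
/* Declared double, but every use of the result is immediately
   narrowed to float; per the observation above, GCC may compile
   the math using single-precision instructions in this situation. */
double half(double x) { return x * 0.5; }

float use_it(float v)
{
    return (float)half(v);   /* result only ever consumed as float */
}
```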

FLDI0/FLDI1 seem like they would be useful, but for whatever reason GCC 
prefers to load these values from memory instead.

while not particularly fast or good, the C library provides math 
functions by implementing them in C (though, in the modified PDPCLIB I 
am using, I ended up having to modify them to keep them from sometimes 
getting stuck in infinite loops).

the functions were written to terminate on an exact result rather than 
after a fixed/bounded number of iteration steps, but the iterations 
don't always converge on an exact value (noisy low-order bits are more 
useful than a math function which sometimes never terminates).
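
i.e. something like this bounded version (a sketch, not the actual 
PDPCLIB code): the early-out fires on an exact fixed point, but the 
iteration cap guarantees termination even when the update ping-pongs 
between two neighboring floats forever.

```c
/* Newton-Raphson square root with a hard iteration cap.
   Without the cap, testing only for an exact fixed point can
   loop forever when the iterate oscillates between two adjacent
   float values instead of settling. */
float my_sqrtf(float x)
{
    if (x <= 0.0f)
        return 0.0f;
    float g = x > 1.0f ? x : 1.0f;       /* crude initial guess */
    for (int i = 0; i < 32; i++) {       /* bounded: always terminates */
        float next = 0.5f * (g + x / g);
        if (next == g)                   /* converged exactly */
            break;
        g = next;
    }
    return g;
}
```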

never mind that Taylor series, Newton-Raphson, and similar are basically 
overkill for something like Quake.
one could probably do just as well, and faster, using math functions 
which "fake it" via lookup tables and interpolation.

as for the Quake port: it's basically doing 320x240 and software rendering.
it seems to hold up OK and give usable framerates at this, but the 
emulator currently isn't really fast enough to go to much higher 
resolutions (it's still a plain-C interpreter).

I could almost try to do an SH2 build of Quake, but it depends pretty 
heavily on the FPU, so that won't really work.

as for the MMU in the emulator, I'm not really using it all that much yet.
it's still the lazy / faked MMU, which uses an x86/ARM-style page table 
(and stubs over the other parts) and hopes the OS doesn't notice the 
difference (theoretically this should work with Linux and BSD, though I 
can't confirm it yet).

     a page-table goes in TTB (and assume only using 4kB pages and similar);
     check relevant bits in MMUCR, ...;
     pretty much everything else is no-op.
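
for illustration, the faked-MMU translation boils down to a two-level 
table walk along these lines (the field layout here is made up / 
x86-style: 10-bit directory index, 10-bit table index, 4kB pages; `ttb` 
holds what TTB would, and `mem` stands in for guest physical memory):

```c
#include <stdint.h>

#define PTE_VALID  1u
#define PAGE_MASK  0xFFFu
#define FAULT      0xFFFFFFFFu

/* Two-level page-table walk (assumed layout):
   vaddr = [10-bit dir index | 10-bit table index | 12-bit offset].
   mem[] is guest physical memory, indexed in 32-bit words. */
static uint32_t translate(const uint32_t *mem, uint32_t ttb, uint32_t vaddr)
{
    uint32_t pde = mem[(ttb >> 2) + (vaddr >> 22)];
    if (!(pde & PTE_VALID))
        return FAULT;                   /* would raise a page fault */
    uint32_t pte = mem[((pde & ~PAGE_MASK) >> 2) + ((vaddr >> 12) & 0x3FF)];
    if (!(pte & PTE_VALID))
        return FAULT;
    return (pte & ~PAGE_MASK) | (vaddr & PAGE_MASK);
}
```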

not sure which strategy is simpler/cheaper for hardware, but this is 
simpler/cheaper for an emulator.

if needed, I may do a full/proper MMU at some point; that work would 
likely be triggered by the emulator encountering something unexpected 
(say, the OS appears to be doing something other than just using a page 
table).

if doing the HW this way, it would probably make sense to document it 
and/or have a disclaimer.

but, a chip with both a subset of the FPU and a more minimalist MMU 
could probably still be useful (if the alternative is one with neither).

> The other thing is we want to get our repository properly up on github
> before doing too much more major design work. The questions in the last
> paragraph are something the community should be able to weigh in on, and
> doing development behind closed doors is not where we want to be long-term.
> My own work is mostly on the software side, and my open-source-side todo
> list right now is getting a toybox release out, working with Rich to get
> a musl-cross-make release out, flushing my kernel patch queue this merge
> window (have initramfs honor CONFIG_DEVTMPFS_MOUNT and so on), getting a
> mkroot release out (which involves moving the kernel build and qemu boot
> script stuff Aboriginal Linux used to do into a mkroot-style script),
> present on all of this at ELC later this month so I can point people at
> a video
> (https://openiotelcna2017.sched.com/event/9Ith/tutorial-building-the-simplest-possible-linux-system-rob-landley-se-instrumentscom)
> and then update the j-core website to have new build instructions using
> current versions of all of the above.
> The previous paragraph is one big dependency chain where I can't do a
> mkroot release until I have the toybox and musl-cross-make releases it
> uses, and really don't want to update website documentation to tell
> people "grab a git snapshot du jour" instead of a release version.
> (Releases are important. Trying out the release version first means you
> mostly hit _known_ bugs...)
> *shrug* Working on it...

could be useful to have newer / nicer things to test with.

I did my video playback tests, and did a Quake port, mostly for lack of 
much else to test with.

a video of the video playback test:

I haven't yet made a video of running the Quake engine (it's on my to-do 
list; so far the effort was mostly making it work, plus lots of 
debugging and similar).

I don't expect I will be able to run the Quake engine inside the 
emulator inside a web browser, as running in the browser adds roughly a 
10-14x performance penalty (over running it as native code), and the 
approx 6 MIPS I am getting in-browser isn't nearly enough to run Quake 
(which seems, as-is, to need around 60-70 MIPS to get usable framerates).

6 MIPS is basically enough to sort-of do 320x180 video playback (though 
best results are closer to 7.5 MIPS), but not much beyond this.

or such...
