[J-core] correction (Linux is now working) (Re: Achievable clock speed, bottlenecks?)
BGB
cr88192 at gmail.com
Sun Oct 2 01:25:09 EDT 2016
On 10/1/2016 2:23 PM, D. Jeff Dionne wrote:
> On Oct 1, 2016, at 14:53, BGB <cr88192 at gmail.com> wrote:
>> emulator source
> Just a clarification for casual list readers: this is Linux running in BGB's emulator. Linux does and has run on j-core RTL (j-core was designed to run Linux) for 4-5 years.
>
> Glad to see an emulator coming along well also!!!
I wasn't aware J-core was around that long; I had only heard of it
fairly recently.
But, yeah, this is a personal emulator project. Source for the emulator is here:
https://github.com/cr88192/bgbtech_shxemu
It is still a bit rough, and could use more work in terms of
organization and general cleanup.
Recap: I started working on the emulator around a month ago, initially
thinking it would be something quick and dirty (like a prior MSP430
emulator). I have since noted that SuperH, in combination with running
Linux, is rather more complicated than emulating small programs on an
MSP430.
Likewise, there are edge cases which are a little confusing and which
the existing specifications don't address sufficiently well
(delay-slot behavior being an example).
My initial motivation was mostly that I thought SuperH was, overall, a
pretty nicely designed ISA, so I felt like doing an emulator for it
(never mind that some things are not immediately obvious from looking
at the instruction listings).
It also somewhat resembled some of my own (hypothetical) designs for
machine ISAs (though most of mine had used variable-length 16/32 or
16/32/64 bit instructions). My ideas were mostly for "modestly low
cost" manycore processors: I was imagining something like a Xeon Phi,
but preferably much cheaper, usable in embedded projects such as
robots, and preferably better at branch-intensive, chaotic workloads
than GPUs. A big uncertainty, and the thing people had used to beat
down the idea, was memory bandwidth: my idea had been a sort of
hierarchical NUMA (RAM kept local to small groups of cores), but
others had asserted that such a thing was non-viable (unless there is
a big fat pipe to a large external RAM space, as opposed to lots of
smaller internal RAM spaces organized in a 2D grid and relying
primarily on locality and adjacency).
probably to-do for the emulator:
  clean up the handling of peripheral hardware slightly;
    not just doing it all as blobs of code in "main";
    try to figure out more of what these various HW registers do.
  more test code; verify that the SD/MMC interface works correctly
  (read/write/...);
    still need to verify that other OSs (ex: Linux) can read FAT
    volumes generated by my code.
  maybe implement the emaclite interface,
    and try to get a Linux kernel built which has network support
    and can read SD cards...

likely also:
  continue working on the SH-4 target;
    Linux as built for SH-4 seems to expect a rather different
    hardware profile though; need to investigate this more.
  may also consider later adding SH-2A support.
  maybe add support for a fixed virtual clock speed;
    would need to figure out approximate cycle timings for the
    various ops to make this "accurate";
    this would be, say, throttling the virtual SoC to 66 MHz or
    similar.
  maybe figure out how multicore works on SH-2/J-2 and add this,
    either purely in software or via OS-provided multi-threading.
  ...
other possibilities:
  write a JIT compiler and have the emulator JIT-compile to native
  code (x86-64);
    would add a secondary static trace cache and some basic
    "hot-spot" analysis;
    this could potentially improve the MIPS value significantly;
    unclear if worthwhile ATM, as it would also have a significant
    kLOC cost;
    vague estimate: could probably get somewhere between 400 and
    800 MIPS, though getting much past this would likely require a
    more complex JIT (1).
  write an alternate assembler/linker for SH;
    might want to later do some things which can't really be done
    with GAS/LD (such as JIT).
  possibly experiment with an SH port of my script VM,
    which currently works on x86, x86-64, and ARM (RasPi,
    interpreter-only for now).

could go into more speculative stuff, but I will probably leave it at
this for now.
1: though not entirely free, a modest JIT can be done at a
"reasonable" code cost, with the limitation that its "cleverness" is
fairly limited and its code generation fairly naive. Both code quality
and speed are typically similar to those of C compiled with
optimizations disabled. Getting better/faster code generally requires
a significantly more complex codegen, and at some point "just use
LLVM" becomes the more viable option (though, compared with an
interpreter or naive JIT, throwing LLVM at it is like trying to hit a
nail with a sledgehammer...).
Generally, for a small JIT or compiler, one of the bigger costs (in
terms of code) is the assembler and linker, with a codegen that
basically maps the VM operations onto mostly pre-formed blobs of ASM.
Technically, it is possible to bypass the assembler/linker and
directly craft blobs of raw machine code (for example, Quake3 Arena
did its QVM JIT this way), but IMO the added pain of working with raw
machine code generally offsets any real savings in code size (working
with ASM is nicer, and easier to debug).
More information about the J-core mailing list