[J-core] correction (Linux is now working) (Re: Achievable clock speed, bottlenecks?)
BGB
cr88192 at gmail.com
Sun Oct 2 01:25:09 EDT 2016
On 10/1/2016 2:23 PM, D. Jeff Dionne wrote:
> On Oct 1, 2016, at 14:53, BGB <cr88192 at gmail.com> wrote:
>> emulator source
> Just a clarification for casual list readers: this is Linux running in BGB's emulator. Linux does and has run on j-core RTL (j-core was designed to run Linux) for 4-5 years.
>
> Glad to see an emulator coming along well also!!!
I wasn't aware J-core was around that long; I had only heard of it
fairly recently.
But, yeah, this is a personal emulator project. Source for the emulator is here:
https://github.com/cr88192/bgbtech_shxemu
It is still a bit rough, and could use more work in terms of
organization and general cleanup.
Recap: I started working on the emulator around a month ago, initially
thinking it would be something quick and dirty (like a prior MSP430
emulator). I have since noted that SuperH, in combination with running
Linux, is rather more complicated than emulating small programs on an
MSP430.
Likewise, there are edge cases which are a little confusing and which
the existing specifications don't address sufficiently well
(delay-slot behavior being an example).
My initial motivation was mostly that I thought SuperH was, overall, a
pretty nicely designed ISA, so I felt like doing an emulator for it
(never mind that some things are not immediately obvious from looking
at the instruction listings).
It also somewhat resembled some of my own (hypothetical) designs for
machine ISAs (though most of mine had used variable-length 16/32 or
16/32/64 bit instructions). My ideas were mostly for "modestly low
cost" manycore processors: I was imagining something like a Xeon Phi,
but preferably much cheaper, usable in embedded projects such as
robots, and preferably better at branch-intensive, chaotic workloads
than GPUs. A big uncertainty, and the thing people had used to beat
down the idea, was memory bandwidth: my idea had been a sort of
hierarchical NUMA (RAM kept local to small groups of cores), but
others had asserted that such a thing was non-viable (unless there is
a big fat pipe to a large external RAM space, as opposed to lots of
smaller internal RAM spaces organized in a 2D grid and relying
primarily on locality and adjacency).
probably to-do for the emulator:
  clean up the handling of peripheral hardware slightly;
    not just doing it all as blobs of code in "main";
    try to figure out more of what these various HW registers do.
  more test code; verify that the SD/MMC interface works correctly
  (read/write/...);
    still need to verify that other OSs (ex: Linux) can read FAT
    volumes generated by my code.
  maybe implement the emaclite interface,
    and try to get a Linux kernel built which has network support
    and can read SD cards...

likely also:
  continue working on the SH-4 target;
    Linux as built for SH-4 seems to expect a rather different
    hardware profile though; need to investigate this more.
  may also consider later adding SH-2A support.
  maybe add support for a fixed virtual clock speed;
    would need to figure out approximate cycle timings for the
    various ops to make this "accurate";
    this would be, say, throttling the virtual SoC to 66 MHz or
    similar.
  maybe figure out how multicore works on SH-2/J-2 and add this,
    either purely in software or via OS-provided multi-threading.
  ...
other possibilities:
  write a JIT compiler and have the emulator JIT-compile to native
  code (x86-64);
    would add a secondary static trace cache and some basic
    "hot-spot" analysis;
    this could potentially improve the MIPS value significantly;
    unclear if worthwhile ATM, as it would also have a significant
    kLOC cost;
    vague estimate: could probably get somewhere between 400 and
    800 MIPS, though getting much past this would likely require a
    more complex JIT (1).
  write an alternate assembler/linker for SH;
    might want to later do some things which can't really be done
    with GAS/LD (such as JIT).
  possibly experiment with an SH port of my script VM,
    which currently works on x86, x86-64, and ARM (RasPi,
    interpreter-only for now).

could go into more speculative stuff, but I will probably leave it at
this for now.
1: though not entirely free, a modest JIT can be done at a
"reasonable" code cost, with the limitation that its "cleverness" is
fairly limited and its code generation fairly naive. Both code quality
and speed are typically similar to those of C compiled with
optimizations disabled. Getting better/faster code generally requires
a significantly more complex codegen, and at some point "just use
LLVM" becomes the more viable option (though, compared with an
interpreter or naive JIT, throwing LLVM at it is like trying to hit a
nail with a sledgehammer...).
Generally, for a small JIT or compiler, one of the bigger costs (in
terms of code) is the assembler and linker, with a codegen that
basically maps the VM operations onto mostly pre-formed blobs of ASM.
Technically, it is possible to bypass the assembler/linker and
directly craft blobs of raw machine code (for example, Quake3 Arena
did its QVM JIT this way), but IMO the added pain of working with raw
machine code generally offsets any real savings in code size (working
with ASM is nicer, and easier to debug).
More information about the J-core mailing list