[J-core] working on SH-2 emulator.
cr88192 at gmail.com
Sun Sep 4 19:41:04 EDT 2016
so, in case anyone cares, I have been working on an SH-2 emulator
(hacked together over the past few days).
not actually sure what I will do with it if/when I have it fully working.
I may make it available, probably under MIT terms or similar.
the core ISA is basically implemented apart from a few misc instructions.
I didn't use the boot ROM for now, instead I have been loading the
"vmlinux" image directly.
status thus far is that I don't have Linux booting up yet: it seems to
start booting and then goes into an infinite loop, where it appears to be
(I guess) reading/comparing values from memory (but the values don't
change, so it loops indefinitely).
I guess it may be expecting a timer interrupt, given it appears to have
set up VBR by this point. not sure ATM about what frequency it is
expecting for the clock.
I am basically going in blind here, I don't really know much more about
the expected HW configuration than what I can infer from looking at the
"boot" code and "board.h" and similar (and just guessing that when it
gets further along it will start to try accessing the UART at 0xABCD0000
or similar, then I will know it is time to implement this...).
( ok, adding in clock interrupts, I seem to get a fault within the VM
(trying to access 095D0174, nothing is mapped here), but well, this is
different from getting stuck in an infinite loop... )
note that there are still very likely bugs in the instruction
decoder/interpreter, so this part can't be ruled out yet:
spent a while trying to figure out how much to offset PC for load
operations, best I can tell pretty much everything seems to expect an
offset of 4 bytes, with the (~3) mask for 32-bit loads, but could be wrong.
likewise, seems to be (for "MOV.W"):
I fiddled with it, and this seemed to be the general answer that doesn't
result in stuff immediately faulting.
it also appears that an offset of '4' is the magic number for branch
offsets as well, where PC here is the address of the instruction for
which the PC-relative address is being calculated.
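as a rough sketch of the address calculations described above (not the actual emulator code; the function names are made up for illustration, and "pc" is assumed to be the address of the instruction itself, outside of a delay slot), the "+4" rule, the ~3 mask for 32-bit loads, and the displacement scaling would look something like this in C:

```c
#include <stdint.h>

/* All helpers take pc = address of the instruction being executed.
 * Names are hypothetical; the delay-slot case is not handled here. */

/* MOV.L @(disp,PC),Rn: PC+4 is rounded down to a 4-byte boundary,
 * and the unsigned 8-bit displacement is scaled by 4. */
static uint32_t pcrel_addr_movl(uint32_t pc, uint32_t disp8)
{
    return ((pc + 4) & ~3u) + ((disp8 & 0xFF) << 2);
}

/* MOV.W @(disp,PC),Rn: no mask, unsigned 8-bit displacement scaled by 2. */
static uint32_t pcrel_addr_movw(uint32_t pc, uint32_t disp8)
{
    return (pc + 4) + ((disp8 & 0xFF) << 1);
}

/* BRA/BSR: signed 12-bit displacement, scaled by 2, relative to PC+4. */
static uint32_t branch_target_bra(uint32_t pc, uint32_t disp12)
{
    int32_t d = (int32_t)(disp12 & 0xFFF);
    if (d & 0x800)
        d -= 0x1000;            /* sign-extend from 12 bits */
    return pc + 4 + (uint32_t)(d * 2);
}

/* BT/BF: signed 8-bit displacement, scaled by 2, relative to PC+4. */
static uint32_t branch_target_bt(uint32_t pc, uint32_t disp8)
{
    int32_t d = (int32_t)(disp8 & 0xFF);
    if (d & 0x80)
        d -= 0x100;             /* sign-extend from 8 bits */
    return pc + 4 + (uint32_t)(d * 2);
}
```

for example, a MOV.L at 0x1002 with disp=1 would load from 0x1008, since 0x1006 gets masked down to 0x1004 before the scaled displacement is added.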
the spec seemed to be a little ambiguous on some of this (often
referencing the SH-4 spec as it is less ambiguous).
side note: the handling for "delay slot" operations was basically to
swap the operations in the trace decoder, so that they execute in the
on my PC (AMD Phenom II X4 955 running Windows 10), I am currently
getting interpreter speeds of around 130 to 160 MIPS (unsure of exact
MHz equivalent, would require accounting for cycle-timings).
on average, my PC gets similar single-thread performance in benchmarks
(sometimes faster or slower) vs a 2.3 GHz Intel Xeon E5410 (the Xeon
generally beats my main PC in multi-threaded tests, as well as having
faster memcpy speeds).
I also get slightly faster speeds if built using GCC in WSL (as opposed
to building natively in VS2015).
note that this is for a single-threaded plain C interpreter (no ASM or
JIT at present).
this is using a 2-way associative trace-cache, where instruction traces
are decoded and kept in a hash table, with 2 spots for each hash-key.
this isn't the fastest possible interpreter design, but tends to be
better behaved in CPU emulators (the cache-size is bounded, and it is
better behaved in the face of SMC).
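to make the idea a bit more concrete, here is a minimal sketch of a 2-way trace cache along these lines. this is not the emulator's real code: the struct and function names, the table size, the hash constant, and the "most recent trace goes in slot 0" replacement policy are all assumptions for illustration, and trace_decode() is only a stub standing in for the real decoder.

```c
#include <stdint.h>
#include <string.h>

#define TRACE_HASH_BITS 10                      /* assumed table size */
#define TRACE_HASH_SIZE (1u << TRACE_HASH_BITS)

typedef struct Trace {
    uint32_t start_pc;  /* guest address of the first instruction */
    int      valid;     /* nonzero if this slot holds a decoded trace */
    int      n_ops;     /* number of decoded ops (none in this stub) */
} Trace;

typedef struct TraceCache {
    Trace set[TRACE_HASH_SIZE][2];  /* 2 slots per hash key */
} TraceCache;

static int trace_decode_count;      /* counts misses, for illustration */

/* Hash the start address; bit 0 is dropped since ops are 16 bits wide. */
static uint32_t trace_hash(uint32_t pc)
{
    uint32_t h = (uint32_t)((pc >> 1) * 2654435761u);
    return h >> (32 - TRACE_HASH_BITS);
}

/* Stub decoder: a real one would decode ops up to the next branch. */
static void trace_decode(Trace *tr, uint32_t pc)
{
    tr->start_pc = pc;
    tr->valid = 1;
    tr->n_ops = 0;
    trace_decode_count++;
}

/* Look up the trace starting at pc, decoding it on a miss. */
static Trace *trace_lookup(TraceCache *tc, uint32_t pc)
{
    Trace *set = tc->set[trace_hash(pc)];
    Trace tmp;

    if (set[0].valid && set[0].start_pc == pc)
        return &set[0];
    if (set[1].valid && set[1].start_pc == pc) {
        /* hit in slot 1: swap so the most recent trace sits in slot 0 */
        tmp = set[0]; set[0] = set[1]; set[1] = tmp;
        return &set[0];
    }
    /* miss: old slot 1 is dropped, slot 0 moves down, decode into slot 0 */
    set[1] = set[0];
    trace_decode(&set[0], pc);
    return &set[0];
}

/* Crude handling of self-modifying code: throw every trace away. */
static void trace_flush_all(TraceCache *tc)
{
    memset(tc, 0, sizeof(*tc));
}
```

the table never grows, so memory use stays bounded, and invalidating after a write to code memory is just a flush (a real implementation would more likely invalidate only the affected traces).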
currently, it doesn't have an MMU.
could probably add some stuff from SH-4, but would likely skip a few
things (exposed cache and TLB stuff, unless really necessary), as these
could be fairly costly to emulate.
similarly, have noted that some things in SH-4 (interrupt handling,
memory map, ...) look a bit different from what I have been able to
infer from SH-2 and J-2.
I have a separate CPU context from the memory map, so conceivably it
should be possible to allow multiple logical threads to share the same
physical memory map. if added, would probably have the MMU tied to the
logical cores (so each can have their own page-tables).
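a rough sketch of what that split might look like, purely as an illustration (the struct layouts, names, the flat span list, and the big-endian byte order are assumptions here, not the actual code): each CPU context only holds a pointer to a memory map it does not own, so two contexts can point at the same one.

```c
#include <stdint.h>

/* One mapped region of guest physical memory (hypothetical layout). */
typedef struct MemSpan {
    uint32_t base;      /* first guest address covered */
    uint32_t size;      /* size in bytes */
    uint8_t *data;      /* host backing store */
} MemSpan;

typedef struct MemMap {
    MemSpan span[8];
    int     n_span;
} MemMap;

/* Per-core state; the memory map is shared, not owned. */
typedef struct CpuCtx {
    uint32_t r[16];
    uint32_t pc, pr, sr, gbr, vbr, mach, macl;
    MemMap  *mem;
} CpuCtx;

/* Find the host pointer for a 4-byte access, or 0 if nothing is mapped. */
static uint8_t *mem_ptr32(MemMap *mm, uint32_t addr)
{
    int i;
    for (i = 0; i < mm->n_span; i++) {
        MemSpan *s = &mm->span[i];
        uint32_t off = addr - s->base;  /* wraps to a huge value if addr < base */
        if (s->size >= 4 && off <= s->size - 4)
            return s->data + off;
    }
    return 0;
}

/* Big-endian 32-bit read; returns -1 on an unmapped address. */
static int mem_read32(MemMap *mm, uint32_t addr, uint32_t *out)
{
    uint8_t *p = mem_ptr32(mm, addr);
    if (!p)
        return -1;
    *out = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    return 0;
}

/* Big-endian 32-bit write; returns -1 on an unmapped address. */
static int mem_write32(MemMap *mm, uint32_t addr, uint32_t val)
{
    uint8_t *p = mem_ptr32(mm, addr);
    if (!p)
        return -1;
    p[0] = (uint8_t)(val >> 24);
    p[1] = (uint8_t)(val >> 16);
    p[2] = (uint8_t)(val >> 8);
    p[3] = (uint8_t)val;
    return 0;
}
```

with this arrangement, a store done through one context's map pointer is visible to a load through another context's pointer, and an access to an unmapped address comes back as an error that the interpreter can turn into a fault. a per-core MMU would then sit in CpuCtx, translating addresses before they reach the shared map.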
any thoughts / comments?