[J-core] working on SH-2 emulator.

Sun Sep 4 19:41:04 EDT 2016

so, in case anyone cares, I have been working on an SH-2 emulator 
(hacked together over the past few days).
not actually sure what I will do with it if/when I have it fully working.
I may make it available, probably under MIT terms or similar.

the core ISA is basically implemented apart from a few misc instructions.
I didn't use the boot ROM for now, instead I have been loading the 
"vmlinux" image directly.

status thus far is that I don't have Linux booting yet: it seems to 
start up and then goes into an infinite loop, where it appears to be (I 
guess) reading/comparing values from memory (but the values don't 
change, so it loops indefinitely).

I guess it may be expecting a timer interrupt, given it appears to have 
set up VBR by this point. not sure ATM about what frequency it is 
expecting for the clock.

I am basically going in blind here, I don't really know much more about 
the expected HW configuration than what I can infer from looking at the 
"boot" code and "board.h" and similar (and just guessing that when it 
gets further along it will start to try accessing the UART at 0xABCD0000 
or similar, then I will know it is time to implement this...).

( ok, adding in clock interrupts, I seem to get a fault within the VM 
(trying to access 095D0174, nothing is mapped here), but well, this is 
different from getting stuck in an infinite loop... )

note that there are still very likely bugs in the instruction 
decoder/interpreter, so this part can't be ruled out yet:
spent a while trying to figure out how much to offset PC for load 
operations, best I can tell pretty much everything seems to expect an 
offset of 4 bytes, with the (~3) mask for 32-bit loads, but could be wrong.
     ex: "Addr=((PC+4)&(~3))+(i*4);"
likewise, seems to be (for "MOV.W"):
     "Addr=(PC+4)+(i*2);"

I fiddled with it, and this seemed to be the general answer that doesn't 
result in stuff immediately faulting.
it also appears that an offset of '4' is the magic number for branch 
offsets as well, where PC here is the address of the instruction for 
which the PC-relative address is being calculated.
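
as a minimal sketch of the address rules described above (the function 
names are made up for illustration, and the field widths of 8 bits for 
the load displacements and 12 bits for BRA/BSR are taken from the SH-2 
encoding rather than from the emulator source), with "pc" being the 
address of the instruction itself:

```c
#include <stdint.h>

/* MOV.L @(disp,PC),Rn: unsigned 8-bit disp scaled by 4,
 * base is (PC+4) aligned down to a 4-byte boundary. */
static uint32_t pcrel_movl(uint32_t pc, uint32_t disp8)
{
    return ((pc + 4) & ~3u) + (disp8 & 0xFF) * 4;
}

/* MOV.W @(disp,PC),Rn: unsigned 8-bit disp scaled by 2,
 * base is (PC+4) with no alignment mask. */
static uint32_t pcrel_movw(uint32_t pc, uint32_t disp8)
{
    return (pc + 4) + (disp8 & 0xFF) * 2;
}

/* BRA/BSR: signed 12-bit disp scaled by 2, relative to PC+4. */
static uint32_t pcrel_bra(uint32_t pc, uint32_t disp12)
{
    int32_t d = (int32_t)(disp12 & 0xFFF);
    if (d & 0x800)
        d -= 0x1000;            /* sign-extend from 12 bits */
    return pc + 4 + (uint32_t)(d * 2);
}
```

for example, a MOV.L at 0x1002 with disp=1 loads from 0x1008, and a BRA 
at 0x1000 with disp=0xFFE (-2) branches back to 0x1000.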

the spec seemed to be a little ambiguous on some of this (I often ended 
up referencing the SH-4 spec instead, as it is less ambiguous).

side note: the handling for "delay slot" operations was basically to 
swap the operations in the trace decoder, so that they execute in the 
expected order.
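
a rough sketch of what such a swap could look like (this is not the 
actual decoder; "TraceOp", "has_delay_slot" and "decode_trace" are 
hypothetical names, and the sketch only ends a trace at delayed 
branches). each op keeps its original address so that PC-relative 
targets are still computed from the branch's own PC. note that a plain 
swap presumably needs extra care when the slot instruction writes a 
register that the branch reads (ex: "JMP @Rn"):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t pc;      /* original address of the instruction */
    uint16_t opcode;  /* raw 16-bit opcode word */
} TraceOp;

/* Returns nonzero for SH-2 branches that have a delay slot. */
static int has_delay_slot(uint16_t op)
{
    if ((op & 0xF000) == 0xA000) return 1;  /* BRA  */
    if ((op & 0xF000) == 0xB000) return 1;  /* BSR  */
    if ((op & 0xFF00) == 0x8D00) return 1;  /* BT/S */
    if ((op & 0xFF00) == 0x8F00) return 1;  /* BF/S */
    if ((op & 0xF0FF) == 0x402B) return 1;  /* JMP @Rn */
    if ((op & 0xF0FF) == 0x400B) return 1;  /* JSR @Rn */
    if ((op & 0xF0FF) == 0x0023) return 1;  /* BRAF */
    if ((op & 0xF0FF) == 0x0003) return 1;  /* BSRF */
    if (op == 0x000B) return 1;             /* RTS  */
    if (op == 0x002B) return 1;             /* RTE  */
    return 0;
}

/* Decode a trace starting at base_pc. A delayed branch ends the trace,
 * and its slot instruction is emitted before the branch itself. */
static size_t decode_trace(const uint16_t *code, size_t n_words,
                           uint32_t base_pc, TraceOp *out, size_t max_ops)
{
    size_t i = 0, n = 0;
    while (i < n_words && n < max_ops) {
        TraceOp op = { base_pc + (uint32_t)(i * 2), code[i] };
        if (has_delay_slot(op.opcode)) {
            if (i + 1 >= n_words || n + 2 > max_ops)
                break;          /* no room: end the trace before the branch */
            out[n].pc = op.pc + 2;          /* slot instruction goes first */
            out[n].opcode = code[i + 1];
            n++;
            out[n++] = op;                  /* then the branch */
            break;
        }
        out[n++] = op;
        i++;
    }
    return n;
}
```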

on my PC (AMD Phenom II X4 955 running Windows 10), I am currently 
getting interpreter speeds of around 130 to 160 MIPS (unsure of exact 
MHz equivalent, would require accounting for cycle-timings).
on average, my PC gets similar single-thread performance in benchmarks 
(sometimes faster or slower) vs a 2.3 GHz Intel Xeon E5410 (the Xeon 
generally beats my main PC in multi-threaded tests, as well as having 
faster memcpy speeds).

I also get slightly faster speeds if built using GCC in WSL (as opposed 
to building natively in VS2015).

note that this is for a single-threaded plain C interpreter (no ASM or 
JIT at present).

this is using a 2-way associative trace-cache, where instruction traces 
are decoded and kept in a hash table, with 2 spots for each hash-key. 
this isn't the fastest possible interpreter design, but tends to be 
better behaved in CPU emulators (the cache-size is bounded, and it is 
better behaved in the face of SMC).
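
to illustrate the general idea, here is a minimal sketch of a 2-way 
trace lookup. the names, the set count, the hash function and the 
replacement policy (most recent trace in way 0, older one in way 1) are 
all assumptions for the example, not details of the actual emulator:

```c
#include <stdint.h>
#include <string.h>

#define TCACHE_SETS 1024        /* assumed size; must be a power of 2 */

typedef struct {
    uint32_t pc;                /* start address of the trace */
    int valid;
    /* decoded ops for the trace would live here */
} Trace;

typedef struct {
    Trace way[TCACHE_SETS][2];  /* 2 spots per hash key */
} TraceCache;

static unsigned tcache_hash(uint32_t pc)
{
    /* instructions are 2-byte aligned, so drop the low bit first */
    return (((pc >> 1) * 2654435761u) >> 16) & (TCACHE_SETS - 1);
}

/* Look up the trace for pc, (re)decoding on a miss.
 * *was_hit is set to 1 on a hit and 0 on a miss. */
static Trace *tcache_get(TraceCache *tc, uint32_t pc, int *was_hit)
{
    Trace *set = tc->way[tcache_hash(pc)];

    *was_hit = 1;
    if (set[0].valid && set[0].pc == pc)
        return &set[0];
    if (set[1].valid && set[1].pc == pc) {
        Trace tmp = set[0];     /* promote the hit to way 0 */
        set[0] = set[1];
        set[1] = tmp;
        return &set[0];
    }

    *was_hit = 0;
    set[1] = set[0];            /* old way 1 is evicted */
    set[0].pc = pc;
    set[0].valid = 1;
    /* ...decode instructions starting at pc into set[0] here... */
    return &set[0];
}

/* Crude SMC handling: drop every trace after a write to code memory. */
static void tcache_flush(TraceCache *tc)
{
    memset(tc, 0, sizeof(*tc));
}
```

since the table has a fixed number of slots, stale traces can only take 
up a bounded amount of memory, and flushing after self-modifying code is 
a single memset.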

currently, it doesn't have an MMU.
could probably add some stuff from SH-4, but would likely skip a few 
things (exposed cache and TLB stuff, unless really necessary), as these 
could be fairly costly to emulate.

similarly, have noted that some things in SH-4 (interrupt handling, 
memory map, ...) look a bit different from what I have been able to 
infer from SH-2 and J-2.

I have a separate CPU context from the memory map, so conceivably it 
should be possible to allow multiple logical threads to share the same 
physical memory map. if added, would probably have the MMU tied to the 
logical cores (so each can have their own page-tables).
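
a small sketch of that split, assuming a single flat RAM region and 
big-endian byte order (the struct layouts and function names here are 
hypothetical, and a per-core MMU would sit between the CPU context and 
the shared map):

```c
#include <stdint.h>

/* Shared physical memory: one flat RAM region for this sketch. */
typedef struct {
    uint32_t base;              /* first mapped address */
    uint32_t size;              /* size in bytes */
    uint8_t *data;
} MemMap;

/* Per-core state; several contexts may point at the same MemMap. */
typedef struct {
    uint32_t r[16];
    uint32_t pc, pr, sr, gbr, vbr, mach, macl;
    MemMap *mem;
} CpuCtx;

/* Returns 0 on success, -1 if the access is outside the mapped region. */
static int mem_read32(const MemMap *m, uint32_t addr, uint32_t *out)
{
    uint32_t off = addr - m->base;      /* wraps to a huge value if addr < base */
    if (off >= m->size || m->size - off < 4)
        return -1;
    const uint8_t *p = m->data + off;
    *out = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    return 0;
}

static int mem_write32(MemMap *m, uint32_t addr, uint32_t val)
{
    uint32_t off = addr - m->base;
    if (off >= m->size || m->size - off < 4)
        return -1;
    uint8_t *p = m->data + off;
    p[0] = (uint8_t)(val >> 24);
    p[1] = (uint8_t)(val >> 16);
    p[2] = (uint8_t)(val >> 8);
    p[3] = (uint8_t)val;
    return 0;
}

/* CPU-side accessors; address translation would be added here. */
static int cpu_read32(CpuCtx *c, uint32_t addr, uint32_t *out)
{
    return mem_read32(c->mem, addr, out);
}

static int cpu_write32(CpuCtx *c, uint32_t addr, uint32_t val)
{
    return mem_write32(c->mem, addr, val);
}
```

with this layout, a write through one context is visible to a read 
through another context that shares the same map, and an access to an 
unmapped address comes back as an error that the caller can turn into a 
fault.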

any thoughts / comments?