[J-core] status: emulator speedup (Re: Roadmap)

Wed Feb 22 21:56:49 EST 2017

On 2/8/2017 7:30 PM, BGB wrote:

> though, as-is, the majority of the running time (in multiple tests) is 
> going mostly into things like the emulated MMU and trace-dispatch 
> logic (ex: lots of memory loads/stores and branches).
>
> though a lot of this isn't really because the logic is all that 
> complicated, but more the "little things" that start to become an 
> issue when a function is called millions of times per second.
>

found and used some workarounds, and now have a considerable speedup 
(around 7x from when I started optimizing stuff, though mostly via a 
JIT compiler).

code still here:
https://github.com/cr88192/bgbtech_shxemu

(though, a lot of the newer stuff is currently Win64 only).

>
> ( then again, it is currently sufficient to run Quake at around 15-20 
> fps (320x240) and 20-25 fps (320x200), so could probably be worse... )
>
> meanwhile, the Dreamcast has a claimed CPU performance of 360 mips, 
> which implies a sustained average performance of around 2 instructions 
> per clock (well, assuming this isn't an overly optimistic value or 
> something).

I have reached, and now exceeded this target (on my FX-8350).
can now run Quake 1 in 640x480 at ~ 30-60 fps.

speeds are about 4 to 5x slower than a native x86 version of Quake (in 
terms of frame-times with matched resolution and vantage point).

some other benchmarks fall a bit short though: around 13x slower than 
native for N-Primes, and still around 35x slower than native for 
Dhrystone (in both cases with the emulator pulling off around 600 MIPS).

I had been testing to verify it could still boot the Linux image, though 
not all the optimizations apply to SH2 (a lot of them currently assume 
the SH4 LE AT=0 case). ( though, it still does boot very fast as-is. )

> though, this is when getting ~ 93 MIPS in the VM, implying the SH-4 
> instructions are worth slightly less than a DMIPS "instruction".

current status:
     generally now getting speeds typically in the range of 500 to 700
     MIPS (with a JIT).
         highest speeds have currently been seen in Quake, but this is
         mostly what I have optimized with.
         lowest MIPS is seen when decoding CRAM video, but still pushing
         ~ 70 megapixels/second, so...
         this is about 6x slower than a native-code CRAM decoder (~ 400
         megapixels/second).
     speeds are around 100-150 MIPS with just the interpreter.
         and speeds in a web-browser are still pretty terrible.
     I don't know if there are many likely major optimizations remaining.
         I have doubts it is readily possible to reach 1000 MIPS, but
         OTOH, I didn't expect to get this far.
         I previously thought I was at the limit at ~400, but came up
         with a few nifty hacks and got it faster.
             I suspect I would need a little more "cleverness" from the
             JIT for this.
                 as-is, still don't have complete coverage of the ISA
                 though.
                 likewise, some parts are still load/modify/store (vs
                 using allocated registers).
                 ...

currently the JIT is only for Win64, but I may also do a version for 
Linux x86-64.
     mostly dealing with ABI differences.
I have decided in this case I will probably not do a JIT for 32-bit x86.
     not much immediate need.
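
for reference, a rough sketch of what the integer-register side of the 
Win64 vs SysV (Linux x86-64) ABI difference looks like if put in a table. 
the names here (Abi, AbiInfo, abi_int_arg_reg, ...) are made up for 
illustration and are not taken from the emulator; the register 
assignments themselves are per the two published calling conventions. 
XMM registers and the SysV red zone are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* x86-64 general-purpose register numbers (hardware encoding). */
enum {
    X64_RAX = 0, X64_RCX = 1, X64_RDX = 2,  X64_RBX = 3,
    X64_RSP = 4, X64_RBP = 5, X64_RSI = 6,  X64_RDI = 7,
    X64_R8  = 8, X64_R9  = 9, X64_R10 = 10, X64_R11 = 11,
    X64_R12 = 12, X64_R13 = 13, X64_R14 = 14, X64_R15 = 15
};

/* Hypothetical names, not from the emulator source. */
typedef enum { ABI_WIN64 = 0, ABI_SYSV = 1 } Abi;

typedef struct {
    int      n_int_args;    /* integer args passed in registers */
    int      int_args[6];   /* which registers, in argument order */
    int      shadow_bytes;  /* stack space the caller reserves for the callee */
    unsigned callee_saved;  /* bitmask of registers the callee must preserve */
} AbiInfo;

#define X64_BIT(r) (1u << (r))

static const AbiInfo abi_table[2] = {
    /* Win64: 4 register args, 32 bytes of shadow space, RSI/RDI preserved. */
    [ABI_WIN64] = { 4, { X64_RCX, X64_RDX, X64_R8, X64_R9, -1, -1 }, 32,
        X64_BIT(X64_RBX) | X64_BIT(X64_RBP) | X64_BIT(X64_RSP) |
        X64_BIT(X64_RSI) | X64_BIT(X64_RDI) |
        X64_BIT(X64_R12) | X64_BIT(X64_R13) |
        X64_BIT(X64_R14) | X64_BIT(X64_R15) },
    /* SysV: 6 register args, no shadow space, RSI/RDI are scratch. */
    [ABI_SYSV] = { 6, { X64_RDI, X64_RSI, X64_RDX, X64_RCX, X64_R8, X64_R9 }, 0,
        X64_BIT(X64_RBX) | X64_BIT(X64_RBP) | X64_BIT(X64_RSP) |
        X64_BIT(X64_R12) | X64_BIT(X64_R13) |
        X64_BIT(X64_R14) | X64_BIT(X64_R15) },
};

static const AbiInfo *abi_info(Abi abi) {
    assert(abi == ABI_WIN64 || abi == ABI_SYSV);
    return &abi_table[abi];
}

/* Register holding integer argument idx, or -1 if it goes on the stack. */
static int abi_int_arg_reg(Abi abi, int idx) {
    const AbiInfo *info = abi_info(abi);
    return (idx >= 0 && idx < info->n_int_args) ? info->int_args[idx] : -1;
}

/* True if generated code must save/restore reg before clobbering it. */
static bool abi_is_callee_saved(Abi abi, int reg) {
    assert(reg >= 0 && reg < 16);
    return (abi_info(abi)->callee_saved & X64_BIT(reg)) != 0;
}
```

a code generator written against a table like this would mostly need 
to change which registers it uses for helper-call arguments, whether 
it reserves shadow space, and which registers it is allowed to keep 
emulated state in across calls.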

note that it is very new and likely also very buggy.

some specifics to-be-decided, but I may try to add a "software GPU" and 
see if I can make Quake 3 Arena work with it. current leaning is for a 
custom S3-like GPU (likely simpler than trying to add PVR).

current leaning is along the lines of stuffing screen-space triangles 
into a ring-buffer, and the "GPU" would run along drawing these into the 
framebuffer. would simplify from the S3 by only doing triangles (quad or 
poly means multiple triangles), potentially with triangle subdivision to 
limit warping.
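
a minimal sketch of the ring-buffer idea, assuming a fixed-size ring of 
flat-colored screen-space triangles. none of this is implemented yet; 
the struct layout, the ring size, and all names (Tri, TriRing, 
ring_push_quad, ...) are hypothetical. it is single-threaded as written; 
running the "GPU" on another thread would need atomics or barriers on 
head/tail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 256u   /* must be a power of two (assumed size) */

/* One screen-space triangle; a real entry would also carry UVs, Z, etc. */
typedef struct {
    int16_t  x[3], y[3];
    uint32_t color;
} Tri;

typedef struct {
    Tri      slot[RING_SIZE];
    uint32_t head;   /* advanced by the producer (emulated CPU side) */
    uint32_t tail;   /* advanced by the consumer (the "GPU") */
} TriRing;

/* Entries currently queued; unsigned wraparound keeps this correct. */
static uint32_t ring_used(const TriRing *r) {
    return r->head - r->tail;
}

/* Queue one triangle; returns false if the ring is full. */
static bool ring_push(TriRing *r, const Tri *t) {
    assert(r && t);
    if (ring_used(r) == RING_SIZE)
        return false;
    r->slot[r->head & (RING_SIZE - 1u)] = *t;
    r->head++;
    return true;
}

/* Dequeue the oldest triangle; returns false if the ring is empty. */
static bool ring_pop(TriRing *r, Tri *out) {
    assert(r && out);
    if (r->head == r->tail)
        return false;
    *out = r->slot[r->tail & (RING_SIZE - 1u)];
    r->tail++;
    return true;
}

/* A quad (vertices 0..3 in order) is queued as triangles 0-1-2 and 0-2-3.
   Either both triangles are queued or neither is. */
static bool ring_push_quad(TriRing *r, const int16_t x[4],
                           const int16_t y[4], uint32_t color) {
    static const int idx[2][3] = { { 0, 1, 2 }, { 0, 2, 3 } };
    assert(r && x && y);
    if (RING_SIZE - ring_used(r) < 2u)
        return false;
    for (int k = 0; k < 2; k++) {
        Tri t;
        for (int v = 0; v < 3; v++) {
            t.x[v] = x[idx[k][v]];
            t.y[v] = y[idx[k][v]];
        }
        t.color = color;
        ring_push(r, &t);
    }
    return true;
}
```

the drawing side would then just loop on ring_pop() and rasterize each 
triangle into the framebuffer; subdivision could happen either before 
the push or inside that loop.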

near the bottom of here, have spec'ed some possible ISA extensions:
https://github.com/cr88192/bgbtech_shxemu/wiki/SH-ISA

they could help with performance, assuming a compiler capable of using 
them (may experiment with this at some point). a lot are for memory 
loads/stores, as this appears to be a problem area for SH (lots of 
operations, mostly for doing address calculation and similar).

still don't have a working version of SH4 Linux though (maybe need a 
known good Linux image, and a known hardware configuration to clone).

still mostly just screwing around on my end.

any thoughts?...