[J-core] Jcore mailing list and turtle board

Sun Jul 9 13:23:57 EDT 2017

On 7/9/2017 4:42 AM, Rob Landley wrote:
> On 07/07/2017 06:06 AM, D. Jeff Dionne wrote:
>> Also keep in mind that as soon as you have 32 bit instruction words,
>> your clean RISC (like) architecture starts to 'degrade' (maybe, my
>> opinion ;).  Much better to have a mode bit in the status register,
>> for instance, and avoid variable length instructions... those can
>> also double your external memory instruction bandwidth (which is
>> important, see above).
> For example, the j64 we roughed out still uses 16 bit instructions, it
> just has an x86-64 style mode bit that interprets some of those
> instructions differently, and has 64 bit registers (the top 32 bits of
> which are masked out and not modified in 32 bit mode).
>
> The different instructions are things like turning "load/store 16 bits
> of memory" into "load/store 64 bits of memory" (meaning a 16 bit store
> in 64 bit mode needs to become two 8 bit stores, but there are only so
> many encodings so you gotta trade something off).

mine can work basically like this, but using 2 bits:
     SR.JQ, which enables 64-bit addressing (and is basically an "enable 
64-bit mode" flag).
     SR.DQ, which toggles between 32-bit and 64-bit arithmetic and 
word/quad loads/stores.

SR.DQ=0:
     basically, still has all the MOV.W and friends.
     all the arithmetic operates on the low 32 bits leaving the high 
bits unchanged.
SR.DQ=1:
     replaces 16-bit MOV.W forms with MOV.Q
     does 64-bit arithmetic (vs 32-bit).
     currently ignored unless JQ is also set (but could change if needed).
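
as a rough sketch of the DQ behavior above, this is roughly how an emulator-style ADD could look in C. the helper name alu_add is made up for illustration, the SR bit values are the ones listed later in this post, and the whole thing is a guess at the semantics rather than the actual VM code:

```c
#include <stdint.h>

#define SR_DQ 0x00001000u  /* Data Quad, bit value from the SR list in this post */
#define SR_JQ 0x80000000u  /* 64-bit mode, bit value from the SR list in this post */

/* Hypothetical helper: result of "ADD Rm, Rn" under the given SR. */
static uint64_t alu_add(uint64_t sr, uint64_t rn, uint64_t rm)
{
    /* DQ is currently ignored unless JQ is also set. */
    int quad = (sr & SR_JQ) && (sr & SR_DQ);

    if (quad)
        return rn + rm;                     /* full 64-bit arithmetic */

    /* 32-bit arithmetic on the low halves; the high 32 bits of the
       destination register are left unchanged. */
    uint32_t lo = (uint32_t)rn + (uint32_t)rm;
    return (rn & 0xFFFFFFFF00000000ull) | lo;
}
```

with DQ clear, a carry out of bit 31 is simply dropped and the upper half of Rn survives; with JQ and DQ both set the same inputs carry into the upper half.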

I ended up fudging CLRS/SETS to have alternate mode-set forms:
***        0048 CLRS                //Clear SR.S
****        0148 ICLRMD.DQ        //Clear SR.DQ
****        0248 ICLRMD.JQ        //Clear SR.JQ
****        0348 ICLRMD.JDQ        //Clear SR.JQ and SR.DQ
***        0058 SETS                //Set SR.S
****        0158 ISETMD.DQ        //Set SR.DQ
****        0258 ISETMD.JQ        //Set SR.JQ
****        0358 ISETMD.JDQ        //Set SR.JQ and SR.DQ

mostly as the only other current way to set these bits is via a memory load.
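
reading the listing above, bit 4 of the opcode picks clear vs set, and bits 8..9 pick which of DQ/JQ are affected (with 0 falling back to the plain CLRS/SETS). a minimal C sketch of that decode follows; is_mode_op and apply_mode_op are hypothetical names, and the position of SR.S (bit 1) is assumed from stock SH rather than stated in this post:

```c
#include <stdint.h>

#define SR_S  0x00000002u  /* ASSUMPTION: S is bit 1, as on stock SH */
#define SR_DQ 0x00001000u  /* from the SR bit list in this post */
#define SR_JQ 0x80000000u  /* from the SR bit list in this post */

/* True for the eight forms 0x0n48 / 0x0n58 with n in 0..3. */
static int is_mode_op(uint16_t op)
{
    return (op & 0xFCEF) == 0x0048;
}

/* Returns the new SR; opcodes outside the family leave SR unchanged. */
static uint64_t apply_mode_op(uint16_t op, uint64_t sr)
{
    if (!is_mode_op(op))
        return sr;

    unsigned n = (op >> 8) & 3;      /* 0: S, 1: DQ, 2: JQ, 3: JQ and DQ */
    uint64_t mask = 0;
    if (n == 0) mask = SR_S;
    if (n & 1)  mask |= SR_DQ;
    if (n & 2)  mask |= SR_JQ;

    /* bit 4 set: 0x0n58 (set form); clear: 0x0n48 (clear form) */
    return (op & 0x0010) ? (sr | mask) : (sr & ~mask);
}
```

for example, apply_mode_op(0x0358, 0) gives 0x80001000 (both JQ and DQ set), and a following 0x0148 drops DQ again.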

some of the high 32 bits of SR indicate the desired state of JQ and DQ 
in the event of an interrupt.
     this is mostly so that a user-process could run in 32-bit mode 
while having a 64-bit kernel.
     though, granted, an ISR could manually mode-toggle easily enough.
         but, it is relevant to VBR-register width and how to interpret 
the memory addresses.
     these bits also effectively indicate the "master operating mode".
         basically means whether to dump registers as 32 or 64-bits.

the SR bits:
*    0000_1000: DQ        //Data Quad
*    8000_0000: JQ        //Operate in 64-bit mode

SRH Bits:
*    0000_1000: TR_DQ    //Trap: Data Quad
*    8000_0000: TR_JQ    //Trap: 64-bit Mode
     (these give the state of DQ and JQ in the event of an interrupt).
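
a small C sketch of how the TR_ bits could be applied on interrupt entry. this assumes SRH is simply the high 32 bits of a 64-bit SR (as the text above suggests) and that TR_JQ is what decides the "master operating mode"; the function names are made up:

```c
#include <stdint.h>

#define SR_DQ 0x00001000u
#define SR_JQ 0x80000000u
#define SR_MODE_MASK ((uint64_t)(SR_DQ | SR_JQ))

/* On interrupt entry, copy TR_DQ/TR_JQ (in SRH) down into DQ/JQ.
   The TR_ bits sit at the same positions within SRH as DQ/JQ do
   within the low word, so a plain shift and mask is enough. */
static uint64_t sr_mode_on_interrupt(uint64_t sr)
{
    uint64_t srh = sr >> 32;
    uint64_t trap_mode = srh & SR_MODE_MASK;
    return (sr & ~SR_MODE_MASK) | trap_mode;
}

/* ASSUMPTION: TR_JQ selects whether registers get dumped as 32 or 64 bits. */
static int master_reg_bits(uint64_t sr)
{
    return ((sr >> 32) & SR_JQ) ? 64 : 32;
}
```

so a 32-bit user process (JQ=0, DQ=0) under a 64-bit kernel (TR_JQ=1, TR_DQ=1) would land in the ISR with both bits already set.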

as noted, there are multiple possible ways to approach things:
     toggle DQ frequently, like is common with the FPU.
     leave DQ as 1, sign/zero-extend arithmetic operations as needed.
         mainly, extending is needed in the case of operations which may 
produce different results.
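
to illustrate the second option (leave DQ=1 and extend as needed), here is a C sketch of what the compiler would effectively have to emit. the helper names are hypothetical; registers are modeled as uint64_t:

```c
#include <stdint.h>

/* Sign-extend / zero-extend the low 32 bits of a 64-bit register. */
static uint64_t sext32(uint64_t x)
{
    x &= 0xFFFFFFFFull;
    return (x ^ 0x80000000ull) - 0x80000000ull;
}

static uint64_t zext32(uint64_t x)
{
    return x & 0xFFFFFFFFull;
}

/* 32-bit "int" add done with the 64-bit adder: the low 32 bits of the
   sum are already correct, so one sign-extend at the end is enough. */
static uint64_t add_i32(uint64_t a, uint64_t b)
{
    return sext32(a + b);
}

/* 32-bit logical shift right: stale high bits would be shifted down
   into the result, so the input has to be zero-extended first. */
static uint64_t shr_u32(uint64_t a, unsigned n)
{
    return zext32(a) >> (n & 31);
}
```

add/sub/and/or only need the result fixed up (if at all), whereas shifts right, divides, and compares are the cases where the 64-bit op on unextended inputs gives a different answer than the 32-bit op would.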

tradeoffs exist.

it is likely what is used would be based on a combination of what is 
available on the processor and compiler flags.

keeping a 16-bit operations only mode is still pretty doable; as the 
current design doesn't entirely rely on the existence of the 32-bit 
instruction forms.

one big difference between 32-bit and 64-bit mode is in the 
interpretation of addresses:
     0x00000000..0x7FFFFFFF: user address (with MMU, or 29 bit physical 
address)
     0x80000000..0xDFFFFFFF: system address ranges
     0xE0000000..0xFFFFFFFF: memory mapped registers/etc
or:
     0x0000_00000000..0x7FFF_FFFFFFFF: user address range (MMU)
     0x8000_00000000..0x8FFF_FFFFFFFF: memory-mapped registers / etc
     0x9000_00000000..0xFFFF_FFFFFFFF: system address range (MMU)
     (addresses are sign-extended to 64 bits, but thinking is only 48 
bits would be used for now).
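
the two address maps above can be written down as a pair of classifier functions. this is only a sketch in C with made-up names (addr_class, classify32, classify64, sext48); the range boundaries are taken directly from the two tables:

```c
#include <stdint.h>

typedef enum { ADDR_USER, ADDR_SYSTEM, ADDR_MMIO } addr_class;

/* 32-bit mode (SR.JQ=0) */
static addr_class classify32(uint32_t a)
{
    if (a < 0x80000000u) return ADDR_USER;    /* 0x00000000..0x7FFFFFFF */
    if (a < 0xE0000000u) return ADDR_SYSTEM;  /* 0x80000000..0xDFFFFFFF */
    return ADDR_MMIO;                         /* 0xE0000000..0xFFFFFFFF */
}

/* 64-bit mode (SR.JQ=1); only the low 48 bits are looked at. */
static addr_class classify64(uint64_t a)
{
    uint64_t a48 = a & 0xFFFFFFFFFFFFull;
    if (a48 < 0x800000000000ull) return ADDR_USER;   /* 0x0000_.. to 0x7FFF_.. */
    if (a48 < 0x900000000000ull) return ADDR_MMIO;   /* 0x8000_.. to 0x8FFF_.. */
    return ADDR_SYSTEM;                              /* 0x9000_.. to 0xFFFF_.. */
}

/* Sign-extend a 48-bit address to the full 64-bit register width. */
static uint64_t sext48(uint64_t a)
{
    a &= 0xFFFFFFFFFFFFull;
    return (a ^ 0x800000000000ull) - 0x800000000000ull;
}
```

note that the memory-mapped register window moves from the top of the space in 32-bit mode to just above the user range in 64-bit mode.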

setting/clearing JQ changes the interpretation of addresses, and in my 
current VM needs to be handled similar to a branch operation, which also 
invokes some MMU setup stuff, ...

I debated some whether to internally have separate MMU interfaces for 32 
and 64-bit addresses, and whether to actually use 64-bit addresses in 
the "physical" memory interface (vs internally using an 8086-like 
segmented model).

ended up for now using 64-bit flat addressing for both (with a 
"reconfigure some stuff on mode-change" strategy).

note that when SR.JQ=1 and MMUCR.AT=0, 0000_00000000..0000_1FFFFFFF 
currently maps to the same address range as 00000000..1FFFFFFF.

setting/clearing JQ would likely require either being in an area where 
PC maps equivalently between both modes; or possibly doing a combination 
mode-change + jump via an RTE instruction or similar.

> It is very much an x86-64 approach of mostly the same instructions in
> the different modes (share as much circuitry as possible and make it all
> _conceptually_ the same too), _NOT_ the Itanic approach of "let's glue a
> couple of completely different processors together and have a toggle
> indicating which one to use now".

mine was also about like x86-64.
     possibly more so, given it also borrows x86's scaled-index memory 
addressing and similar...

the 32-bit I-forms are escape-coded forms, not some completely 
different/parallel ISA glued on via mode-changing or misaligned jumps or 
some-such (and probably the majority of the instructions in-use would 
remain 16-bit either way; see note 1 below).

granted, they do imply going from strictly 16-bit ops, to variable-width 
16/32 ops.

1: in my tests, the main operations to make inroads into displacing 
16-bit I-forms were mostly:
     the SH2A MOV I-forms (stack frames and structs);
     branch I-forms with 16/20 bit displacements (mostly "BT/BF disp16");
     3-address integer ops (with "OP Rm, imm, Rn" forms being dominant);
     LEA ops (pointer arithmetic);
     scaled-index MOV ops (mostly array operations);
     ...
( roughly in descending order )

in disassembly, the bulk of instructions tends to remain 16-bit either way.

though, granted, the number of glued on instruction-forms could still be 
an issue (I spec'ed and was testing a fair amount of them and "seeing 
what sticks"...).

> The core of the x86-64 design team was the DEC Alpha design team. When
> Compaq bought the corpse of DEC in 1996 they didn't do chip design so
> didn't hire those guys, and AMD snapped them up and asked "if you were
> going to do an x86 chip, what would it look like" and that was the
> Athlon, then asked "if you were to expand it to 64 bits..." and they did
> their first implementation (SledgeHammer) with just 10% more circuitry
> than Athlon (part of which was adding 8 more general purpose registers,
> but j-core already has plenty of those).

as noted before, in my design the main GPRs/... were extended to 64-bits.
     primarily, they would be seen as 16x 64-bit GPRs.
     they could also be viewed potentially as 32x 32-bit registers.
         if, albeit, only by a subset of the ISA.
         64-bit integer ops would work on both parts as a single unit.

> We used that as a frame of reference: those guys did their first 64 bit
> implementation in about 10% more chip real estate on top of their
> existing 32 bit design. It can be done, because it _has_ been done, so
> that's the sort of thing we should be aiming for.

yeah.

I was thinking also about code footprint and operation counts.

may need to prototype both strategies and see how they compare.
     as is, this would be mostly about what sort of output the compiler 
spits out.

will need to work more on my compiler and similar before 64-bit 
support is complete enough to compare them and have an accurate measure 
(my off-hand estimate is probably somewhere around a 20-30% increase in 
code footprint vs 32-bit, but I could be wrong here).

>> - Radical ideas about register files are generally a no-no.  SPARC
>> tried register windows (fail) SH2A tried register page ideas (fail)...
> And sh2a was not done by Hitachi. When Renesas spun off after Y2K it
> inherited the superh intellectual property but the engineers who'd
> created superh _before_ y2k all stayed at Hitachi. The sh2a, sh4a, and
> sh5 designs came about years later, done by a completely different team.
> So far j-core development has only tried for compatibility with the
> Hitachi stuff.

FWIW:
x86 also tried this sort of thing in the form of the TSS mechanism.
OS's mostly ignored TSS's as the cost of swapping registers via the TSS 
was higher than that of saving/restoring registers manually (but they 
are still needed for some system-level functionality; typically with a 
single TSS used for the whole OS or similar).

about the only things my design had borrowed directly from 2A were the 
32-bit MOV.B/W/L and MOVI20/MOVI20S instructions.
     well, also "MOV.B/W/L Rm, @Rn+" and similar.

most of the rest I left out, seeing little obvious use.
     ex: discarded most of 2A's system-level features.
     also skipped bit set/clear ops, ..., as probably too specialized to 
be generally useful.

the 32-bit MOV ops can save a few percent off the code size (including 
in 32-bit mode), mostly by largely eliminating the use of constant-loads 
for accessing local variables (for functions with frames too large to be 
accessed directly). this is on top of an earlier optimization which 
organizes the frame to minimize the distance to local variables.

my design is more directly based on SH4.
     it also uses SH4's exception handling mechanism
         IOW: setting TRA/INTEVT/EXPEVT/... and jumping to a fixed 
offset relative to VBR.
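
for reference, a stripped-down C sketch of SH4-style exception entry as I understand it from the SH-4 documentation (32-bit state only; it ignores INTEVT/TRA and the 64-bit extensions discussed above). the struct layout and the name enter_exception are invented for the example:

```c
#include <stdint.h>

#define SR_BL 0x10000000u  /* block further exceptions */
#define SR_RB 0x20000000u  /* switch to register bank 1 */
#define SR_MD 0x40000000u  /* privileged mode */

/* Fixed handler offsets relative to VBR. */
enum { VEC_GENERAL = 0x100, VEC_TLB_MISS = 0x400, VEC_INTERRUPT = 0x600 };

typedef struct {
    uint32_t pc, sr, r15, vbr;   /* live state */
    uint32_t spc, ssr, sgr;      /* saved PC / SR / R15 */
    uint32_t expevt;             /* exception event code */
} cpu_state;

/* Returns the CPU state after taking an exception. Here c.pc is taken
   to already hold the address execution should resume at. */
static cpu_state enter_exception(cpu_state c, uint32_t code, uint32_t vec)
{
    c.spc = c.pc;                       /* save return address */
    c.ssr = c.sr;                       /* save status register */
    c.sgr = c.r15;                      /* save stack pointer */
    c.expevt = code;                    /* tell the handler what happened */
    c.sr |= SR_MD | SR_RB | SR_BL;      /* privileged, bank 1, blocked */
    c.pc = c.vbr + vec;                 /* jump to fixed offset from VBR */
    return c;
}
```

the 0x100 / 0x400 / 0x600 spacing is where the padding between handlers comes from: each entry point has to sit at its exact offset from VBR.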

I have mixed feelings about the mechanism, in the sense that getting 
alignment right is an issue (and I end up with several kB of mostly 
padding as a result). I think the idea is that ISR code would be put in 
these spaces, but this is made harder by GAS syntax seeming to lack 
any concept of "fill relative to this offset from this label" (and 
manual aligning is too much effort otherwise).

I previously considered a different strategy:
     VBR would be interpreted as a vector table (more like in SH2).
     interrupts would actively swap SR and R15.
         vs save/restore using SPC/SSR/SGR, ...
     ...

but for now am basically reusing SH4's exception mechanism.