[J-core] 64 bit ISA (Re: Jcore mailing list and tutrle board)

BGB cr88192 at gmail.com
Fri Jul 7 14:25:59 EDT 2017


On 7/7/2017 6:06 AM, D. Jeff Dionne wrote:
> On Jul 7, 2017, at 3:28, BGB <cr88192 at gmail.com> wrote:
>
> Hi BGB,
>
> It seems like what you are working with is quite a redesign of the ISA.  I wouldn't do that, such approaches usually wind up as personal experiments.  Backward compatibility is the order of the day for an instruction set architecture upgrade.  Something completely different is going to run into adoption problems, and as such not be of much interest to anyone except as an exercise (unless it catches on, and then it's a different ISA).  One thing that is important for any potential 64bit extension is the 32bit mode should be fully compatible with the existing toolchains and OS, meaning the mode switch is well defined and on the fly switching as seamless as possible.

it is backwards compatible, and builds directly on the existing 
instruction encoding.

there are limits though, namely that if it needs to deal with 64-bit 
registers & addresses, then normal 32-bit code won't be able to work 
correctly in 64-bit mode (it will either truncate addresses or otherwise 
mess up with mismatched data sizes or similar).


the arch would be able to run 64-bit kernel code with 32-bit userland 
code (as is typical in Windows), though the reverse probably won't be 
true. the CPU would boot in 32-bit mode, so it is mostly up to the OS to 
enable 64-bit operation and similar (otherwise, it should act basically 
about like an SH4).


but, if one wants quadwords in 32-bit code, this is easier, and can be 
done while still in 32-bit mode.
I was mostly just considering it "non-canonical" for various reasons 
(mostly that a 32-bit OS, and the existing C ABI, would not preserve 
registers correctly, and also to avoid making it an issue for "32-bit 
only" cores).

say, we have a 32 and a 64 bit core; ideally code compiled in 32-bit 
mode should work equivalently on both cores.

it could also be possible to do mixed 32/64 code via thunking, but this 
would add a lot of hair.


> One main reason we started with SHcompact as the ISA is the memory footprint, and the external bus access patterns (not to ignore of all the toolchain and OS infrastructure that was mostly there already).  In embedded systems, memory access patterns and bandwidth is of critical importance.  We have, for instance, massive data flows through the DMA engines in our products and the CPUs have to co-exist usefully using the same DDR memory.

this is still based on SHcompact.

I didn't abandon it for an SHmedia like approach or something (or switch 
to an entirely different ISA design).


> Also keep in mind that as soon as you have 32 bit instruction words, your clean RISC (like) architecture starts to 'degrade' (maybe, my opinion ;).  Much better to have a mode bit in the status register, for instance, and avoid variable length instructions... those can also double your external memory instruction bandwidth (which is important, see above).

yeah, this falls more into the variable-length-instruction camp (sort of 
like SH-2A and SH-DSP).

as noted, there is hardly any space left for new 16-bit instructions, so 
I opted instead to escape-code into 32-bit instruction forms, where 
there is a bit more space left for adding instructions.

this allows being more sparing with the remaining opcode space for 
16-bit ops.


there are a pair of bits SR.JQ and SR.DQ, which are used to control the 
operating mode. SR.JQ means it is in 64-bit mode, and SR.DQ may be used 
to set the current working size for 16-bit integer instructions, as 
needed. if SR.JQ is clear then currently DQ is ignored.
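
a minimal sketch of how the two mode bits could combine for the 16-bit integer instruction forms. the polarity of DQ (set = quadword, clear = dword) is an assumption made for illustration, as are the function and parameter names; only "JQ clear means DQ is ignored" comes from the description above.

```python
def int_op_size(jq: bool, dq: bool) -> int:
    """Working size (in bits) of a 16-bit integer instruction form."""
    if not jq:
        # SR.JQ clear: plain 32-bit mode, SR.DQ is ignored
        return 32
    # assumption: in 64-bit mode, DQ set selects quadword, DQ clear selects dword
    return 64 if dq else 32


for jq in (False, True):
    for dq in (False, True):
        print(f"JQ={int(jq)} DQ={int(dq)} -> {int_op_size(jq, dq)}-bit ops")
```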

the use of 32-bit instruction forms with explicit sizes avoids the need 
to endlessly toggle the DQ bit (which can eat more space than using the 
larger instruction forms in many cases).


most instructions are 16-bit, but a subset of the experimental design is 
32-bit instructions.
     as-is, these are mostly prefixed with 8Exx and 8Cxx
     but I was also testing with some of the SH2A MOV forms (0xx0, 0xx1, 
3xx1)
         they are effective, but I may still omit them if needed.

for HW, I had imagined it could be possible to use a multi-stage decode 
for them:
     first cycle, reads the prefix, does nothing (sees prefix and sets 
some internal state)
         ex: first cycle only does something for 16-bit instructions.
     next cycle, reads the second word, does something.
         probably switches to alternate instruction decoding based on 
the first word.

charts for SH2A seemed to imply they were decoded all at once though, vs 
needing multiple clocks, so dunno.
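
a rough software model of the two-step decode idea: one word is consumed per step, a prefix word only sets some state, and the following word completes the instruction. it only treats 8Exx and 8Cxx as prefixes (the SH2A-style MOV forms are left out), and the names here are hypothetical; it says nothing about how real hardware would pipeline this.

```python
PREFIX_HIGH_BYTES = {0x8E, 0x8C}  # high byte of the first word of a 32-bit form


def split_instructions(words):
    """Group 16-bit words into 1-word or 2-word instructions."""
    out = []
    pending = None  # prefix word seen on the previous step, if any
    for w in words:
        if pending is not None:
            # second step: this word is decoded based on the saved prefix
            out.append((pending, w))
            pending = None
        elif (w >> 8) in PREFIX_HIGH_BYTES:
            # first step: remember the prefix, do nothing else
            pending = w
        else:
            # ordinary 16-bit instruction
            out.append((w,))
    if pending is not None:
        raise ValueError("stream ends after a prefix word")
    return out


# NOP (0009), the 8C3C-0D57 example form, then RTS (000B)
for insn in split_instructions([0x0009, 0x8C3C, 0x0D57, 0x000B]):
    print("-".join(f"{w:04X}" for w in insn))
```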


in earlier ideas (and as a vestige in the existing spec) there were 
64-bit and some 48-bit instruction forms, but I have basically dropped 
these (limiting everything to 32 bits), as they would make everything 
unnecessarily complicated (and for very little gain).


> So in sum:
> - Keep the 32bit mode as is.  Any proposed additions to the ISA that get accepted will be by hard won, on the merits, battles ;)

yes, that is what I was thinking, given I mostly failed at 
(significantly) improving on it.

32-bit code will be able to run as-is, or run inside a 64-bit kernel.


> - Radical ideas about register files are generally a no-no.  SPARC tried register windows (fail) SH2A tried register page ideas (fail)...

it is not doing anything like this.


the design follows mostly from expanding the current register set to 64 
bits:
     R0..R15, if expanded to 64 bits, have another 32 bits which still 
exist somewhere.
         if I have an expanded instruction space, why not make them 
accessible?...
     SH4 has an alternate set of R0..R7 which are swapped out for 
interrupt-handler code.
         as-is, mine keeps the same basic interrupt-handling mechanism 
as SH4, so they remain.

as for where my 96 dwords estimate came from:
     R0..R15, 16 dwords
     R0B..R7B, 8 dwords
     MACL, MACH, PR, ..., MMUCR, ... another ~20 dwords.
     so, we have ~ 48 dwords (rounded up) for the 32-bit register-state 
of an SH4-like ISA.
     now, what if we add high halves so each can be 64 bits?
         suddenly 96 registers.

granted, the actual number could be lower (say, around 80), as we don't 
necessarily need to extend some of them (only the ones holding addresses).
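
the estimate above, written out as arithmetic. the count of 20 system registers is the approximate figure from the list, and rounding up to a multiple of 8 is an assumption that happens to reproduce the "~48" number.

```python
def round_up(n, step):
    """Round n up to the next multiple of step."""
    return -(-n // step) * step


gprs = 16         # R0..R15
banked = 8        # R0B..R7B
system_regs = 20  # MACL, MACH, PR, ..., MMUCR, ... (approximate)

raw_state = gprs + banked + system_regs  # dwords of 32-bit state
rounded_state = round_up(raw_state, 8)   # assumed rounding granularity
extended_state = rounded_state * 2       # add a high half to each dword

print(raw_state, rounded_state, extended_state)
```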


I ended up using a pretty big chunk of encoding space for 3-address ops.
     8C3x-xxxx    (so, 20 bits)
         mostly to encode "ADD Rs, Rt, Rn" and the like
         which for 32-bit words can access the registers as 32 x 32 bits.
     for 64-bit quadword ops, there are only 16 registers.
         so, the extra bits were used mostly to encode immediate cases.
         quadword gets an imm6, and dword gets an imm5 case.
             "ADD.Q R7, 5, R13" -> 8C3C-0D57

it isn't fully orthogonal though as much of the rest of the ISA is still 
limited to 16 GPRs, but the high halves may be used separately for 
32-bit arithmetic or similar (and a few MOV/MOV.L forms and similar).


as for FPU, there are effectively 32 FPU registers in SH4:
     FR0..FR15, XF0..XF15
     and if FPSCR.SZ was set, for FMOV there were 16 doubles.
         DR0, XD0, DR2, XD2, ...

if I pair them into 128-bit SIMD vectors, then there are 8 vectors (or 
16 if 64-bit vectors).

what if I expand the vectors to 16 vectors?
     then we have 32 more floats (for a total of 64).
     I termed the additional registers YFn and ZFn.
     also the doubles are YDn and ZDn.
         the added registers follow the same pattern as the first set of 
registers.

I was noting in the 32-bit instruction-encoding intended for SIMD, I had 
a few bits left over, so expanded the number of addressable registers 
for most of these ops from 16 to 32 (with "FMOV FRm,FRn" having access 
to the full 64).
     for these reasons, YDn and ZDn are inaccessible to the 16-bit 
instruction forms.
     for 16-bit I-forms, a bit in FPSCR selects between FRn and XFn 
registers (as in SH4).
         for 32-bit I-forms, any of FRn or XFn may be accessed at any time.
         likewise goes for having 32 doubles.
     unlike SH4, FPSCR.PR allows direct access to 16 doubles for 16-bit 
I-forms.

most of this will depend though on whether the core has SIMD capabilities.
if it only has a normal FPU, then most of the added stuff will be absent.


but, anyways, 96+64=160 dwords, 640 bytes, or 5.12 kbits...
128 dwords / 512 bytes would be enough for a 64-bit core with a more 
basic FPU.
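
a quick check of the FPU and total register-state numbers, using only the counts given above (the variable names are just for illustration).

```python
FLOAT_BITS = 32
DWORD_BYTES = 4

sh4_floats = 16 + 16    # FR0..FR15 + XF0..XF15
added_floats = 16 + 16  # YFn + ZFn
fp_dwords = sh4_floats + added_floats

sh4_vec128 = sh4_floats * FLOAT_BITS // 128  # SH4 set paired into 128-bit vectors
all_vec128 = fp_dwords * FLOAT_BITS // 128   # after adding YFn/ZFn
all_doubles = fp_dwords * FLOAT_BITS // 64

int_dwords = 96                              # GPRs + system registers, with high halves
total_dwords = int_dwords + fp_dwords
total_bytes = total_dwords * DWORD_BYTES
total_bits = total_bytes * 8

print(sh4_vec128, all_vec128, all_doubles)
print(total_dwords, total_bytes, total_bits / 1000)
```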


> - A portable implementation that is also space efficient restricts things to register files of r read ports and w=1 write port.  The area scales roughly with r*w, but also you need last written tag bits and other nastiness when w > 1.  Most modern FPGAs allow turning the basic nLUT logic element into a 1bit x 2^n async RAM primitive.  You can build a lot of good things efficiently with those.  A lot of Open RTL we find doesn't fit that model, and ends up using fabric flops.  But there are only so many of those LUTs in any case.  You don't want to use Block RAM for your reg files in FPGA if you can help it.  It's not impossible, but since they are sync read, it affects the design of your pipeline.

I am not sure here.

it is a bit fuzzy how much space is available to work with for 
registers, or how this stuff really works at this level.


> - A 64 bit mode absolutely needs to be designed to have a nice even flow of instructions through the memory interface and cache.

I am trying here.


> - The reason ditching MOV.W @(PC,disp) etc had an adverse effect is because it's fundamental to constant loading on the SH.  The ISA is designed around that (and other) construct(s)... it was an early design decision, likely before any instructions were defined at all.

there was still MOV.L @(PC,disp) and similar, but forcing everything 
from MOV.W over to MOV.L or even MOVI20 resulted in a fairly 
significant increase in code footprint.

my original idea was to replace "MOV.W @(PC,disp)" with "MOV.Q 
@(PC,disp)" in 64-bit mode; but initial testing showed that this was a 
bad idea, and it would be better space-wise to use a 32-bit instruction 
form for "MOV.Q @(PC,disp)" and leave the 16-bit MOV.W form as it is (or 
at least make it depend on DQ rather than JQ).

the current MOV.Q instruction form is:
     8Eed-Dndd  MOV.Q @(PC,disp13), Rn
     otherwise, this may also encode a MOV.W or MOV.L with a 13-bit 
displacement.
     supports a 32 or 64-bit destination register (may sign or zero 
extend to 64 bits).


> - So the best 64bit approach is 'mundane', and predictable, meaning that,
>
> - The fundamental principles of the ISA (e.g. constants are loaded relative to PC, just one such thing) remain the same.  Then things like LEA instructions are not a good fit... since they don't match the methodology.

the LEA instructions mostly provide the ability to do "Reg+Reg*Sc+Disp" 
and similar, which if done directly using arithmetic and 32-bit 
instruction-forms (if needed for quadword ops), could do bad things to 
the code footprint (and using 16-bit instruction forms still isn't so 
great here).

ex:
     LEA.L @(R10, R11, 20), R12
works out to:
     MOV R11, R12
     ADD #5, R12
     SHLL2 R12
     ADD R10, R12

so, 4 bytes vs 8 bytes (for 32-bit mode w/ 16-bit ops).
if all quadword ops need 32 bits, then it is 4 vs 16.
if there is no displacement, granted, it is 4 vs 6 (32-bit) or 4 vs 12 
(64-bit).
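
a small model of the example, to check that the 4-op sequence computes the same value as the LEA and to spell out the byte counts. it assumes LEA.L scales the index by 4 and takes the displacement in bytes, which is what the expansion above implies; the function names are made up for the sketch.

```python
MASK32 = 0xFFFFFFFF


def lea_l(base, index, disp):
    """LEA.L @(Rm, Ri, disp), Rn modelled as base + index*4 + disp."""
    return (base + index * 4 + disp) & MASK32


def expanded(r10, r11):
    """The equivalent sequence built from plain 16-bit ops."""
    r12 = r11                    # MOV   R11, R12
    r12 = (r12 + 5) & MASK32     # ADD   #5, R12
    r12 = (r12 << 2) & MASK32    # SHLL2 R12
    r12 = (r12 + r10) & MASK32   # ADD   R10, R12
    return r12


lea_bytes = 4
with_disp_16 = 4 * 2  # four 16-bit ops
with_disp_32 = 4 * 4  # four 32-bit ops
no_disp_16 = 3 * 2    # same sequence without the ADD #5
no_disp_32 = 3 * 4

print(hex(lea_l(0x1000, 3, 20)), hex(expanded(0x1000, 3)))
print(lea_bytes, with_disp_16, with_disp_32, no_disp_16, no_disp_32)
```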

these constructs come up a lot when dealing with things like arrays and 
structs.

there is also the simpler case:
     LEA.L @(Rm, disp8s), Rn
     adds a scaled displacement to Rm and stores result in Rn.
     still useful in 64-bit mode because there isn't otherwise 
a good/cheap way to do this.


the alternative is having the 16-bit instruction forms use 64-bit 
arithmetic, which runs into other issues:
     need to mode-change between working with integer math and doing 
pointer operations
         this is more messy and would add mode-changing related costs.
         this is a problem as code often tends to rapidly alternate 
between integer and pointer ops.
             even a mode-toggle op would still be expensive in these cases.
     or, would need to regularly sign and zero extend arithmetic so that 
it behaves as expected.
         this is an issue mostly because integer arithmetic is a lot 
more common than pointer arithmetic.
         a lot of existing code will not work correctly unless integer 
operations sign/zero extend.
         could work out more expensive overall than the mode-toggling 
approach.

this is closer to my original design, but as-noted, seems to be more 
expensive than a "just use LEA and similar" approach (which mostly 
avoids both sets of issues).
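
to illustrate the sign-extension issue: if a 32-bit ADD is simply carried out as a 64-bit ADD, overflow no longer wraps the way 32-bit code expects, unless the result is sign extended afterwards. this is only a sketch of the arithmetic, with helper names invented for the example.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def signed64(x):
    """Interpret a 64-bit register pattern as a signed value."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def sext32_to_64(x):
    """Sign extend the low 32 bits into a 64-bit register pattern."""
    x &= MASK32
    return (x | 0xFFFFFFFF00000000) if x >> 31 else x


def add64(a, b):
    """A plain 64-bit ADD with no fixup."""
    return (a + b) & MASK64


int_max = 0x7FFFFFFF
raw = add64(int_max, 1)    # what a 64-bit ADD leaves in the register
fixed = sext32_to_64(raw)  # what 32-bit code expects to see

print(signed64(raw), signed64(fixed))
```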


likewise, having nearly all quadword ops use 32-bit instructions is 
simpler than mode-toggling.
     but, now "MOV.Q Rm, @-R15" would need 32 bits
     hence, this is where 16-bit PUSH.Q/POP.Q ops come from
         IOW: mostly to limit an otherwise significant expansion of 
function prolog/epilog sequences.
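
a back-of-the-envelope comparison of the save/restore cost. the number of saved registers is a made-up example; the only inputs taken from the text are 2 bytes per 16-bit PUSH.Q/POP.Q and 4 bytes per 32-bit MOV.Q form.

```python
def save_restore_bytes(nregs, bytes_per_op):
    """Bytes spent saving nregs in the prolog and restoring them in the epilog."""
    return 2 * nregs * bytes_per_op


nregs = 6  # hypothetical number of callee-saved registers
with_push_pop = save_restore_bytes(nregs, 2)  # 16-bit PUSH.Q / POP.Q
with_mov_q = save_restore_bytes(nregs, 4)     # 32-bit MOV.Q forms

print(with_push_pop, with_mov_q)
```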


so, mostly looking for low-cost solutions here.

it is possible that 16-bit-ops-only mode-toggling operation could still 
be supported.


> And above all, it has to be efficiently implementable, in real hardware, of course :)

I am trying to make something which could hopefully be viable.

I am trying to keep things conservative where possible, but also trying 
to avoid the code footprint getting significantly bigger in 64-bit mode 
than in 32-bit mode (and testing out some other ideas to see what does 
or doesn't work here).


> Cheers,
> J.
>
>
>> On 7/6/2017 8:11 AM, D. Jeff Dionne wrote:
>>> On Jul 6, 2017, at 5:58, Rob Landley <rob at landley.net> wrote:
>>>
>>> David,
>>>
>>> Pls see inline.
>>>
>>> J.
>>>> On 07/04/2017 02:30 PM, David Summers wrote:
>>>>> On 04/07/17 09:35, Rob Landley wrote:
>>>>>> On 07/03/2017 02:41 PM, David Summers wrote:
>>>>>>> Hi Rob,
>>>> ...
>>>>>>> Anyway, you probably know the answer to what I was planning to ask
>>>>>>> anyway. Why on the turtle boards did you fix on the Spartan FPGA? The
>>>>>>> newer Artix is similar price and newer, e.g.:
>>> The highest volume product Xilinx ship right now is still S6 in the low cost area, not Artix (I have it on good authority).
>>>
>>> We have a bunch of products using S6 on the commercial side, we don't want A7 right now.
>>>
>>> Once we make a change of tools from ISE to Vivado, we can make a compatible upgrade to either A7 or S7.
>> seems cool.
>>
>>>>>>> https://shop.trenz-electronic.de/en/TE0725-03-100-2C-FPGA-Module-with-Xilinx-Artix-7-XC7A100T-2CSG324C-2-x-50-Pin-with-2.54-mm-pitch
>>>>>>>
>>>>>>> a 100k gate FPGA for just over a $100.
>>> And that is another reason why.  $100 is another class of product.  Turtle design goals had a BoM cost of about $50.
>> I am half wondering here if the current FPGAs could handle a version of the ISA with ~5 kbits of register state (~ 160 DWORDs), and a somewhat expanded ISA.
>>
>> or:
>>     96 dwords for GPRs + system registers;
>>         existing registers + high words.
>>     64 dwords for FPRs.
>>         expanded some to allow 16x128 bit SIMD, also usable as 32 double registers.
>>         though, only about 1/2 this space is accessible as 32-bit floats for many operations.
>>             unlike SSE, most of the space is usable as independent scalar registers.
>>
>> ATM, this is about what the 64-bit version of my ISA is looking like.
>>
>> I have recently been working with trying to get a "working prototype" of sorts into working order, but some things are still flexible in the design.
>>
>>
>> I have generally determined that trying to make the 32-bit ISA's memory-footprint smaller by adding instructions is of fairly limited effectiveness; so (probably) not a worthwhile tradeoff in the face of expanding the complexity of the instruction set (over the existing SH ISA).
>>
>> it is possible to reduce the number of logical operations (by around 30%), though most of this is by replacing multiple 16-bit I-forms with fewer 32-bit I-forms so the overall memory footprint remains pretty similar.
>>
>> for the 64-bit ISA, they become a bit more useful, mostly:
>>     to be able to actually do 64-bit stuff with the way I am currently implementing it;
>>         most 64-bit operations will require dedicated quadword forms, ex: "ADD.Q" / etc.
>>     to mostly avoid a significant expansion of the code footprint.
>>         as-is, it looks like the footprint expansion from 32-bit to 64-bit will be fairly modest.
>>             at-least, if most pointer arithmetic is done with dedicated LEA instructions.
>>
>> some changes (of the 64-bit ISA) from the prior design:
>>     most of the 16-bit instruction forms behave as they do in 32-bit mode.
>>         exceptions are LDC.L/STC.L/... which expand to 64-bit forms.
>>             also MOV.{B/W/L} ops expand to using 64-bit addresses.
>>         determined that full-scale extension to 64-bit arithmetic would be detrimental.
>>             there is far more 32-bit integer arithmetic than pointer arithmetic / etc.
>>     16-bit MOV.W forms are left as-is.
>>         most MOV.Q variants will require 32-bit I-forms.
>>         losing "MOV.W @(PC,disp)" turned out to have more of an adverse effect than expected.
>>     added 16-bit PUSH/POP ops.
>>         but, dropped a few previously spec'ed ops as they became unnecessary/redundant.
>>     the number of usable 32-bit GPRs is expanded to 32 (by using the low/high halves separately).
>>         in addition, there are some 3-address arithmetic instructions and similar.
>>         ...
>>
>> the 64-bit ISA still isn't really done yet, and my early testing thus far has been in sort of a hybrid mode (some quadword stuff but still otherwise using a 32-bit address space).
>>
>> also, note that 64 bit arithmetic/registers would be accessible by 32-bit code, but I will probably define doing this as "non-canonical".
>>
>> but, it goes on... (and hopefully the design is viable).
>>
>> <snip rest, not much to add>
>>
>> _______________________________________________
>> J-core mailing list
>> J-core at lists.j-core.org
>> http://lists.j-core.org/mailman/listinfo/j-core



