[J-core] PC-relative loads and delay slots

Tue Jul 19 11:01:01 EDT 2016

On Tue, Jul 19, 2016 at 01:02:24PM +0900, D. Jeff Dionne wrote:
> On Jul 19, 2016, at 12:23 PM, Rich Felker <dalias at libc.org> wrote:
> > On Mon, Jul 18, 2016 at 09:52:16AM -0400, Rich Felker wrote:
> >> On Mon, Jul 18, 2016 at 02:37:55AM -0700, Robert Ou wrote:
> >>> Hi,
> >>> 
> >>> What is the correct behavior of PC-relative instructions such as
> >>> "mov.l @(disp, PC), Rn" in a branch delay slot? Is this even allowed?
> >>> From my testing, GAS seems to think it is "disp is multiplied by 4 and
> >>> added to the address of the mov.l opcode + 2" but J-core seems to
> >>> execute it as "disp is multiplied by 4 and added to the address of the
> >>> branch target + 2". I discovered this while working on my MyHDL
> >>> demonstration, and you can compare the difference in my demonstration
> >>> by running the master branch and the branch_delay_test branch.
> >> 
> >> If true I think it's a bug. The original SH ISA documentation
> >> specifies the behavior for "PC-relative" mov instructions as:
> >> 
> >>  "The PC points to the starting address of the second instruction
> >>  after this MOV instruction"
> >> 
> >> (as opposed to the actual current value of the program counter). This
> >> text is found on page 202 of document REJ09B0171-0500O.
> >> 
> >> I'm quite surprised we haven't run into this bug, since I would expect
> >> gcc to generate code with immediate loads in branch delay slots (e.g.
> >> when making function calls with constant arguments).
> > 
> > Some further info:
> > 
> > mova is documented to produce a result relative to the branch
> > destination, but pc-relative mov.l seems to be documented to behave as
> > I described above.
> 
> After some internal discussion, I had a look at the SH3 manual,
> which a member/associate of that design team seems to remember had
> the same behaviour as the SH1/2:
> 
> REJ09B0317-0400 PG216
> "When this MOV instruction is placed immediately after a delayed
> branch instruction, the PC points to an address specified by (the
> starting address of the branch destination) + 2."

In the SH1/2 manual (REJ09B0171) I found that text for the mova
instruction but not for mov.l/mov.w. For the latter, the manual
contains the text:

"The PC points to the starting address of the second instruction after
this MOV instruction." (mov.w)

and:

"The PC points to the starting address of the second instruction after
this MOV instruction, but the lowest two bits of the PC are corrected
to B'00." (mov.l)

Both quotes are from page 202 of REJ09B0171-0500O. Branch delay slots
are not mentioned there, which might be a mistake or omission.
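
To make the two quoted rules concrete, here is a minimal C sketch of
the effective-address calculation as I read them. The function names
are hypothetical (not from J-core or any toolchain), and it assumes
16-bit instructions, so "the second instruction after this MOV" is the
instruction's own address + 4, with an 8-bit zero-extended disp scaled
by the operand size:

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/* mov.w @(disp,PC),Rn: PC is taken as insn_addr + 4 (assumed: the
 * second instruction after the mov), disp is scaled by 2. */
static uint32_t movw_pcrel_ea(uint32_t insn_addr, uint8_t disp)
{
    uint32_t pc = insn_addr + 4;
    return pc + ((uint32_t)disp << 1);
}

/* mov.l @(disp,PC),Rn: same PC, but the lowest two bits are
 * cleared before disp (scaled by 4) is added. */
static uint32_t movl_pcrel_ea(uint32_t insn_addr, uint8_t disp)
{
    uint32_t pc = (insn_addr + 4) & ~UINT32_C(3);
    return pc + ((uint32_t)disp << 2);
}
```

Neither function says anything about delay slots; that is exactly the
case the quoted text leaves open.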

The documentation for mova (page 213) also contains equivalent text:

"The PC is the address four bytes after this instruction, but the
lowest two bits of the PC are corrected to B'00."

followed by the apparently contradictory:

"Note: If this instruction is placed immediately after a delayed
branch instruction, the PC must point to an address specified by (the
starting address of the branch destination) +2."
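
As a rough illustration of how far the two readings diverge, the C
sketch below computes the mova result both ways. All names are
hypothetical, and applying the "lowest two bits corrected to B'00"
step in the delay-slot case is my assumption; the note does not say
either way:

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/* Reading A: PC is derived from the mova's own address, as in the
 * general rule ("four bytes after this instruction"). */
static uint32_t pc_from_own_address(uint32_t insn_addr)
{
    return insn_addr + 4;
}

/* Reading B: PC is derived from the branch destination, as in the
 * note for a mova placed immediately after a delayed branch. */
static uint32_t pc_from_branch_target(uint32_t target_addr)
{
    return target_addr + 2;
}

/* mova @(disp,PC),R0: low two bits of PC cleared, disp scaled by 4.
 * Masking in the delay-slot case is an assumption. */
static uint32_t mova_result(uint32_t pc, uint8_t disp)
{
    return (pc & ~UINT32_C(3)) + ((uint32_t)disp << 2);
}
```

For a mova at 0x1002 in the delay slot of a branch to 0x2000 with
disp = 2, reading A gives 0x100C and reading B gives 0x2008, so a
literal pool placed after the branch would simply be missed under
reading B.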

> If I’m reading this correctly, the implementation of J-Core is
> correct, but violates the principle of least surprise. It would be
> very inconvenient to have it point where intuitively it would,
> because that PC value actually doesn’t exist in the pipeline at the
> correct time...

Since the manual does not cover the issue unambiguously, I think the
only real way to resolve the question would be to test on an actual
SH-2. However, as we've already determined that gcc will not generate
mova or pc-relative mov.w/mov.l in branch delay slots, any discrepancy
seems unimportant; it would only happen in (poorly) hand-written asm.

> > However, this is only valid for sh1/2/3. On sh4,
> > both mova and pc-relative mov.l (and mov.w) are illegal in delay slots
> > and result in a trap (so the kernel can emulate them very slowly if
> > you really want them). This is presumably why gcc never generates
> > the pc-relative mov.l in delay slots and thus why the bug has never
> > affected us.
> 
> I personally think the SH4 behaviour is correct, and I think that
> because if the pipeline were made multi issue, etc, even keeping the
> current (non intuitive) behaviour might be difficult. Much better to
> have consistent slot illegal than execution differences. Maybe there
> should be a generic to make these instructions trap...

I agree and I would assume this was the motivation for the sh4 change.
This is a good time to mention that it would be really nice if illegal
slot instruction and undefined instruction exceptions actually worked.
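
For reference, the check such a trap option would need is small. A
minimal C sketch follows; the function names are hypothetical, and
this is not how the J-core decoder is actually structured, just the
opcode patterns for the three instructions discussed here:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate: true for the PC-relative instructions
 * discussed in this thread. */
static bool is_pcrel_load_or_mova(uint16_t opcode)
{
    if ((opcode & 0xF000) == 0x9000)
        return true;            /* mov.w @(disp,PC),Rn */
    if ((opcode & 0xF000) == 0xD000)
        return true;            /* mov.l @(disp,PC),Rn */
    if ((opcode & 0xFF00) == 0xC700)
        return true;            /* mova @(disp,PC),R0 */
    return false;
}

/* Hypothetical policy: treat those instructions as slot-illegal
 * when they sit in a branch delay slot, as described for sh4. */
static bool slot_illegal_pcrel(uint16_t opcode, bool in_delay_slot)
{
    return in_delay_slot && is_pcrel_load_or_mova(opcode);
}
```

This only covers the PC-relative cases from this thread, not the full
set of instructions that are slot-illegal on sh4.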

Rich