[J-core] Sorry for the radio silence.

Sun May 14 19:12:58 EDT 2017

On 5/14/2017 2:36 PM, Rob Landley wrote:
>
> On 05/11/2017 09:04 PM, BGB wrote:
>> On 5/11/2017 11:34 AM, Rob Landley wrote:
>>> On 05/10/2017 08:46 AM, Kieran Bingham wrote:
>>>> What SoC are you using for the Turtle?
>>> We aren't. The brain of the thing is a Xilinx Spartan 6 FPGA. There is
>>> no SOC unless you load a bitstream into that FPGA.
>>>
>>> There is an 8 bit atmel microcontroller that loads the FPGA from spi
>>> flash at power-on, and then switches itself off once the FPGA starts
>>> running the bitstream, but it's a brainless little thing running from
>>> something like 8k of its own flash. (The other thing it can do is run a
>>> little program to reflash the spi from usb if you flip the flash/run
>>> switch to flash position; that way the board's a lot harder to brick.)
>> nifty.
>>
>> I had apparently been under the impression that FPGAs were non-volatile
>> things which were flashed similar to an MCU (with the flashing rewiring
>> the gates, but with a small/finite number of write-cycles); apparently
>> not...
> Not the ones we're using.
>
> My understanding is an FPGA is piles of generic circuitry arranged in
> repeating "cells" with wires connecting them up in lots of different
> ways, and transistors on those wires acting as switches. So you feed in
> a bitstream controlling those transistors, indicating which wires should
> go through and which ones should stay unplugged, and that's how you
> program your FPGA. (I think each bitstream bit controlling one of these
> switches has/is its own sram cell?)
>
> And there's wires between cells you can plug into. Some cells have extra
> resources like attached sram, some on the edge of the board can plug
> wires into special circuits providing things like timing crystals you
> can't just whip up from a bunch of NOR gates. Your layout program has to
> group stuff physically together and know how long the wires are between
> them (because signal propagation delay is a big input into your timing
> score, how fast you can clock the result). You can run _out_ of wires
> connecting cells together (plenty of circuitry but you can't get a wire
> from here to there without going the long way around and screwing up
> your timing...)
>
> There's a whole art to this. The great thing about VHDL is you can get
> the tools to do a whole lot of it for you.

yeah, looking into it, it apparently is pretty variable (depends on the 
type of FPGA and the vendor).

a lot of them use SRAM (apparently the wiring is done mostly with flip-flops).

others use FGMOS (floating-gate MOSFETs, as used in Flash), where the
charge held on the floating gate controls whether a connection is
open/closed (though changing the state of the MOSFET requires driving
an elevated voltage).
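
as a toy model of the SRAM-configured case (a bitstream of bits, each one
opening or closing a switch between wires), something like the following
sketch may help. everything in it is hypothetical: the names
`load_bitstream` and `drive`, the flat row-major bit layout, and the "one
switch per input/output pair" grid are assumptions made for illustration,
and do not reflect how any real FPGA or vendor bitstream format is laid out.

```python
def load_bitstream(bits, n_in, n_out):
    """Reshape a flat list of config bits into an n_out x n_in switch grid.

    switches[o][i] == 1 means "input wire i is connected to output wire o".
    (Hypothetical layout, chosen only to keep the example small.)
    """
    if len(bits) != n_in * n_out:
        raise ValueError("bitstream length does not match the switch grid")
    return [list(bits[o * n_in:(o + 1) * n_in]) for o in range(n_out)]


def drive(switches, inputs):
    """Return the value seen on each output wire for the given input values.

    An output with no closed switch is left floating (None); closing two
    switches onto one output would short two drivers, so it is rejected.
    """
    outputs = []
    for row in switches:
        closed = [i for i, bit in enumerate(row) if bit]
        if len(closed) > 1:
            raise ValueError("two drivers shorted onto one wire")
        outputs.append(inputs[closed[0]] if closed else None)
    return outputs


# 3 input wires, 2 output wires: out0 <- in0, out1 <- in2
grid = load_bitstream([1, 0, 0,
                       0, 0, 1], n_in=3, n_out=2)
print(drive(grid, [True, False, False]))  # [True, False]
```

in this model the "program" is only the list of bits; losing the bits (as
with SRAM at power-off) loses the wiring, which is why a board like the one
described needs something to reload the bitstream from flash at power-on.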

>>>> Has the boat set sail already?
>>>>
>>>> http://www.cnx-software.com/2017/05/10/meet-zynqberry-a-xilinx-zynq-fpga-board-with-raspberry-pi-23-form-factor/
>>>>
>>> The FPGA in turtle is not a peripheral bolted to the "real" processor,
>>> no. All the I/O devices on turtle are routed to the FPGA pins, not to an
>>> existing ARM SOC ASIC. (Unless you count the atmel, which can talk to
>>> the usb serial and the mmc bus to load/flash the bitstream to spi flash.
>>> I'm not sure how martin wired that up because it doesn't come up when
>>> the board is running.)
>> FWIW: it seems like it would sort-of defeat the point if there were ARM
>> cores and various other peripheral HW in-use...
> You also need to be able to plug the I/O devices directly into the FPGA
> pins, not into some ARM core you're not using.

yeah.

>> I am off doing some of my own stuff, working on a C compiler and
>> intending to experiment with some ISA extensions, but am much less
>> certain if a significantly extended ISA would be viable for a limited
>> transistor budget (or for use on an FPGA).
> Inventing your own instruction set in software is easy. Java bytecode,
> python bytecode... I believe the spidermonkey and v8 engines use
> _different_ bytecodes for mozilla/chrome javascript. Heck, I did my own
> bytecode for a bulletin board system I wrote when I was 19. I was so
> proud. Then I found out about pascal p-code from 1973 and BCPL's o-code
> in 1966 (https://en.wikipedia.org/wiki/O-code).

yeah, I have done plenty with bytecode formats as well...

though, in this case, what I am doing isn't really a bytecode format (*).

even with a bytecode VM, it requires some care in the design to be able 
to get good performance (ex: trying to at least be sort-of competitive 
with native C code; at least without trying to invest a mountain of time 
and effort into "making a pig fly" like with JVM or JS).

*: or, at least, not for the final compiler output. static libraries and
"object files" may use a stack-based bytecode as an intermediate step,
but this isn't really intended to be deployed. rather, it is so that
the backend can do all of the code-generation at once, and thus avoid
some of the limitations associated with a separate linking stage.

currently its main final output format is statically linked ELF.

I went with using a stack-based bytecode rather than serializing the 
3AC/SSA IR from the backend mostly on the basis that I realized that 
serializing and reloading the 3AC IR would lead to a lot of hair and 
complexity vs using a stack-machine (and still serves the same basic 
purpose; the stack code is translated into 3AC as it is loaded).
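
the "stack code is translated into 3AC as it is loaded" step can be
sketched roughly as below. this is a minimal illustration of the general
technique (simulate the operand stack with temporaries while reading the
stack ops), not the actual compiler's loader: the op names (`load`,
`const`, `add`, `sub`, `mul`, `store`), the tuple format, and the
`stack_to_3ac` function are all made up for the example.

```python
from itertools import count


def stack_to_3ac(ops):
    """Translate a toy stack bytecode into three-address-code tuples.

    The operand stack is simulated at load time: instead of values, it
    holds the *names* of the variables/temporaries that would hold them.
    """
    temps = count()          # generator of fresh temporary numbers
    stack = []               # names of operands currently "on the stack"
    out = []                 # emitted 3AC: (op, dst, src1[, src2])
    for op in ops:
        kind = op[0]
        if kind == "load":               # push a named variable
            stack.append(op[1])
        elif kind == "const":            # push a literal
            stack.append(str(op[1]))
        elif kind in ("add", "sub", "mul"):
            rhs = stack.pop()            # right operand was pushed last
            lhs = stack.pop()
            dst = f"t{next(temps)}"
            out.append((kind, dst, lhs, rhs))
            stack.append(dst)            # the result replaces both operands
        elif kind == "store":            # pop into a named variable
            out.append(("mov", op[1], stack.pop()))
        else:
            raise ValueError(f"unknown stack op: {kind}")
    return out


# x = (a + b) * 2
code = [("load", "a"), ("load", "b"), ("add",),
        ("const", 2), ("mul",), ("store", "x")]
for insn in stack_to_3ac(code):
    print(insn)
# ('add', 't0', 'a', 'b')
# ('mul', 't1', 't0', '2')
# ('mov', 'x', 't1')
```

the appeal of this direction is that the serialized form needs no
temporaries, operand references, or SSA bookkeeping in the file itself;
all of that gets regenerated by the loader.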

as can be noted, unlike GCC, it is basically a monolithic compiler process.
it is also capable of loading textual ASM though, and uses basically the 
same ASM syntax as GAS in this case (ASM modules and inline ASM are 
basically passed-through the IR as big text-blobs, with an assembler 
being present in the backend).

> Doing a good hardware instruction set is a significantly harder problem.
> The main reason we haven't finalized and published our proposed 64-bit
> instruction set for j-core is that until we implement it in hardware (a
> ways down the todo list) it's subject to change. (Last year Jeff and I
> printed out the instruction set list and worked out that there _is_
> enough space to do a 64-bit implementation. I believe Jeff has those
> pages, but we could do it again if we need to.)

as noted, my C compiler is currently targeting the baseline SH ISA.

I decided to try to get it initially working well for the baseline ISA 
before moving on to extensions; in order to provide a more effective 
baseline (and because I don't really want to require the use of an 
extended ISA with it).

the extended ISA would thus be a superset of SH, and not a completely 
separate ISA.
it is also intended (in 32-bit mode) to be binary backwards compatible 
with SH4.

the C compiler is also intended to be able to produce normal SH code.

though, it still needs a fair bit of work (ex: lots more 
testing/debugging/...) before it is likely to be ready for general use 
(and moving on to testing larger programs, ...).

for the most part, most of my ISA extensions were added using 8Exx as a 
prefix, forming some new 32-bit I-Forms (but mostly following the 
baseline ISA opcode-space layout); the vast majority of these cases 
either add or extend an immediate field (though there is partial overlap 
with functionality which would be provided by the 32-bit I-Forms from SH2A).

another big/ugly/scary extension was for SIMD.
but, I have doubts a SIMD vector unit would really be viable for a core 
running in an FPGA.

as noted, the basic idea for the SIMD unit is based on a seriously 
hacked version of the SH4 FPU (and is based mostly on twiddling bits in 
FPSCR).

the 8Exx prefix, in the case of SIMD/FPU, mostly allows directly 
encoding the operation-mode, in most cases explicitly giving the bits 
which would otherwise be found in FPSCR. the remaining bits are reserved 
mostly for sake of adding more opcodes (ex: "8Ev0-Fnm0" is still an 
FADD, but "8Ex1-Fnm0" might be "something different").
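
to make the prefix idea concrete, here is a rough sketch of how a decoder
might split an instruction stream into 16-bit forms and 8Exx-prefixed
32-bit forms, and pull the nibbles out of an "8Ev0-Fnm0"-style pair. the
only things taken from the description above are the 8Exx prefix byte and
the nibble pattern; the function names, the field names (`mode_nibble`,
`ext_nibble`), and the treatment of the stream as a list of 16-bit words
are assumptions for illustration. what the mode bits actually mean is not
specified here, so the sketch does not interpret them.

```python
PREFIX_BYTE = 0x8E  # high byte of the first word of a 32-bit I-Form


def split_instructions(words):
    """Group 16-bit words into instructions.

    A word of the form 0x8Exx consumes the following word as well,
    giving a 32-bit form; anything else is an ordinary 16-bit form.
    """
    out = []
    i = 0
    while i < len(words):
        w = words[i]
        if (w >> 8) == PREFIX_BYTE:
            if i + 1 >= len(words):
                raise ValueError("prefix word with nothing after it")
            out.append((w, words[i + 1]))
            i += 2
        else:
            out.append((w,))
            i += 1
    return out


def decode_prefixed_fpu(prefix, op):
    """Extract the nibbles of an 8Ev?-Fnm? pair (field names hypothetical)."""
    if (prefix >> 8) != PREFIX_BYTE or (op >> 12) != 0xF:
        raise ValueError("not an 8Exx-Fxxx pair")
    return {
        "mode_nibble": (prefix >> 4) & 0xF,  # the 'v' in 8Ev0
        "ext_nibble": prefix & 0xF,          # 0 in 8Ev0, 1 in 8Ex1
        "n": (op >> 8) & 0xF,                # destination register field
        "m": (op >> 4) & 0xF,                # source register field
        "low": op & 0xF,                     # 0 for the FADD pattern Fnm0
    }


stream = [0x0009, 0x8E30, 0xF120, 0x000B]
print(split_instructions(stream))
# [(9,), (36400, 61728), (11,)]
print(decode_prefixed_fpu(0x8E30, 0xF120))
```

the second word is left in the same Fnm0 layout as the baseline FADD
encoding, which matches the "mostly following the baseline ISA
opcode-space layout" approach: the prefix adds bits, the base opcode
keeps its shape.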

some other I-forms were added mostly for sake of updating the relevant 
bits in FPSCR (to reduce the relative cost of switching between 
operating modes).