[J-core] [musl] Re: Aligned copies and cacheline conflicts?

Sat Sep 17 04:08:54 EDT 2016

On 9/16/2016 9:17 PM, Rich Felker wrote:

snip, not much comment.

>> And of course gcc inlines its own version unless you hit it with a brick
>> anyway, so solving it in musl is of questionable utility.
> Only for cases where the size is constant and various other conditions
> are met. The fact that it does this even for large sizes is a bug and
> it will be fixed. Almost all archs admit faster bulk memcpy that what
> you can reasonably inline so gcc simply should not be doing this, and
> fixing it is necessary anyway if we want to allow vdso-provided dma
> memcpy.

I am left to wonder how much of a factor this is (in general).

for x86 and x86-64, for larger copies, it seems like there is a limit 
that, once one hits it, the memcpy (or whatever else) doesn't really get 
any faster no matter what else one does (for algos basically operating 
on data at memcpy-like speeds, this seemingly being a PC-specific 
constant value).

usually about the only real way I know of to make things faster is to 
try to reduce the memory footprint (ex: try to keep the working data 
packed fairly densely in memory), and to avoid touching as much memory.

for smaller copies, where higher speeds are possible, I have noted that 
large blocks aren't always an advantage (generally, IME, moving 128 or 
64 bits at a time is often faster than moving 256 bits).

granted, there seems to be a lot of hardware-dependent behavior here, 
and I have seen somewhat different performance behavior on different 
hardware.

likewise, I had mostly been doing specialized memory copies for 
compressors, which often need more specific behaviors, and where the 
normal C library provided memcpy's typically had a fairly high constant 
overhead (especially for smallish copies).

>> I'm not sure you're focusing on the right problem?
> Yes, for sh/j2, but there are other open problems for improving musl's
> generic memcpy now that we have a wider variety of archs.
>
> Regarding the barrier that makes memcpy safe against direct-mapped
> caches like the j2 one, It could be turned on/off with a simple macro
> definition provided by the arch files if it harms performance on other
> archs, but I don't think it does.

can't comment much.

I am left wondering what prevents using a hash in the cache lookup, 
where presumably a simplistic (presumably XOR based, 1) hash isn't that 
hard to pull off in VHDL? granted, I don't really know the specifics here.

1: because presumably XOR'ing bits is cheaper in terms of gates than 
doing the equivalent of multiplication by a prime number (though 
generally multiplication by a prime and taking the high-order bits has 
fewer collisions IME).

> We could likewise add such a macro for "arch allows misaligned
> loads/stores" to optimize out the shift-into-place code on archs that
> don't need it and allow a faster direct load/store (the "mod4=0" case
> in the current code) to be used for all alignments on such archs.

seems like probably a good idea, given processors which support cheap 
misaligned access aren't exactly a rarity at this point.

though, even as such, I am left with a bit of mixed feelings with GCC's 
behavior of apparently only making 64-bit types 32-bit aligned on SH 
(this just seems a little wrong, even if technically there is little 
technical reason per-se to make them be 64-bit aligned).

> One of the changes made in this update, moving the shifted block copy
> into its own (inline) function, makes it easier to move to using
> native wordsize rather than hard-coding 32-bit. The main reason the
> current code in musl hard-coded 32-bit was that I got tired of writing
> the (written out in duplicate) cases for 8 different alignments mod 8.
> But now it's just a few lines per case with little opportunity for
> introducing errors. In terms of code size, "twice as many cases but
> half the number of loads/stores per case" should be fairly neutral.

admittedly, if it were me I would have probably just had some number of 
per-target #ifdef protected versions with a generic-case fallback.

FWIW, I have mixed feelings of recently just using a naive linked-list 
allocator in my "testkern" thing running in my emulator, even if this 
isn't really meant to be "good".

my thinking was that for heaps in the lower MB range, a linked-list 
based malloc seemed like a reasonable bet.
a straightforward adaptation of my other MM designs (larger chunks of 
memory with allocations managed by bitmaps) would have a pretty steep 
cost for a relatively small address space.

left uncertain what the "best" design is for this size range, thinking 
something like:
     under 256B, uses slabs of small granular-sized objects (say, 
multiples of 8B or 16B);
     medium (up to maybe 256kB to 1MB), uses linked-lists of memory blocks
         allocated via first-fit / free-lists
             if found block is bigger than needed, it is split apart
             if a block is freed, and its siblings are also free, then 
they are merged together.
     larger (presumably rare), probably allocates spans of pages.

normally the medium range would be done with arrays of fixed-size cells 
and allocated by marking spans of bits in a bitmap, but with a smaller 
RAM space and heap, the question is how to best do this without 
introducing fragmentation and memory-waste issues (given allocating 
memory in chunks of multiple MB each isn't really a viable option here).

I also partially considered an AVL tree based allocator, but noted that 
the per-object overhead would be a lot higher at this object size range 
than the use of linear linked lists.

so, it is a case of hrrm... not entirely sure what to make of this (not 
thought much about this particular size-range recently).

another related uncertainty is whether to use a FAT image stored in a 
file as a virtual SD card, or to synthesize the FAT filesystem at 
startup from a list of files (was working some on implementing a SPI/MMC 
virtual device).

likewise, whether to actually bother trying to make testkern "good" (an 
abstraction layer over block-devices and filesystems, and dealing with 
event-driven IO), or just do the minimalist thing (no abstraction and 
blocking IO, given after all it is more just meant to test stuff / try 
to help find emulator bugs, than to be viable as an OS).

likewise, probably want to avoid spreading my efforts too thin.

there was something else I felt like I wanted to mention, but I forget 
right now what it was.
well, beyond feeling uncertain if there is any point in what I am 
working on, or if I am just annoying the people here with being 
off-topic and irrelevant.

also partially debating if I should go write a C compiler for this 
target which isn't huge, but this can wait as GCC works. this would only 
really make sense if I wanted to implement my 64-bit thingy, since 
granted, GCC support would be pretty much non-existent (and it is 
doubtful it would ever amount to much...).

or such...