[J-core] [musl] Re: Aligned copies and cacheline conflicts?
BGB
cr88192 at gmail.com
Sat Sep 17 04:08:54 EDT 2016
On 9/16/2016 9:17 PM, Rich Felker wrote:
snip, not much comment.
>> And of course gcc inlines its own version unless you hit it with a brick
>> anyway, so solving it in musl is of questionable utility.
> Only for cases where the size is constant and various other conditions
> are met. The fact that it does this even for large sizes is a bug and
> it will be fixed. Almost all archs admit faster bulk memcpy than what
> you can reasonably inline so gcc simply should not be doing this, and
> fixing it is necessary anyway if we want to allow vdso-provided dma
> memcpy.
I am left to wonder how much of a factor this is (in general).
for x86 and x86-64, for larger copies, there seems to be a hard ceiling:
once one hits it, the memcpy (or any algorithm basically operating on
data at memcpy-like speeds) doesn't get any faster no matter what else
one does, the ceiling seemingly being a constant specific to the machine
(effectively its memory bandwidth).
usually the only real way I know of to make things faster is to reduce
the memory footprint (e.g., try to keep the working data packed fairly
densely in memory) and to touch less memory overall.
for smaller copies, where higher speeds are possible, I have noted that
large block moves aren't always a win (generally, IME, moving 128 or 64
bits at a time is often faster than moving 256 bits at a time).
granted, there seems to be a lot of hardware-dependent behavior here,
and I have seen somewhat different performance characteristics on
different machines.
likewise, I had mostly been writing specialized memory copies for
compressors, which often need more specific behaviors, and where the C
library's normal memcpy typically had a fairly high constant overhead
(especially for smallish copies).
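
as a rough sketch of the kind of small-copy loop I mean (illustrative
only, not musl's code; the memcpy() calls for the loads/stores are just
so the compiler can emit plain unaligned MOVs on x86):

#include <stdint.h>
#include <string.h>

/* illustrative: move 64 bits at a time, byte-copy the tail */
static void copy64(unsigned char *d, const unsigned char *s, size_t n)
{
    uint64_t t;
    for (; n >= 8; d += 8, s += 8, n -= 8) {
        memcpy(&t, s, 8);  /* one unaligned load on x86-64 */
        memcpy(d, &t, 8);  /* one unaligned store */
    }
    while (n--) *d++ = *s++;
}

the constant overhead here is basically one size check, vs. whatever
dispatch logic a general-purpose memcpy front-end does first.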
>> I'm not sure you're focusing on the right problem?
> Yes, for sh/j2, but there are other open problems for improving musl's
> generic memcpy now that we have a wider variety of archs.
>
> Regarding the barrier that makes memcpy safe against direct-mapped
> caches like the j2 one, it could be turned on/off with a simple macro
> definition provided by the arch files if it harms performance on other
> archs, but I don't think it does.
can't comment much.
I am left wondering what prevents using a hash in the cache index
lookup, where presumably a simplistic XOR-based hash (1) isn't that hard
to pull off in VHDL? granted, I don't really know the specifics here.
1: because XOR'ing bits together is presumably cheaper in terms of gates
than the equivalent of multiplying by a prime number (though multiplying
by a prime and taking the high-order bits generally has fewer
collisions, IME).
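
in software terms, the sort of index hash I mean would be something like
this (a sketch of the bit-level idea only; the field widths are made up,
and the real thing would be a few XOR gates in VHDL, not C):

/* hypothetical: 256 sets, 32B lines; XOR-fold higher address bits
   into the set index instead of using the low index bits directly */
static inline unsigned cache_set(unsigned long addr)
{
    unsigned idx = (addr >> 5) & 0xFF;  /* plain direct-mapped index */
    idx ^= (addr >> 13) & 0xFF;         /* fold in bits above the index */
    return idx;
}

the point being that two addresses differing only in bits above the
index (the conflict case for a plain direct-mapped cache) would then
land in different sets.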
> We could likewise add such a macro for "arch allows misaligned
> loads/stores" to optimize out the shift-into-place code on archs that
> don't need it and allow a faster direct load/store (the "mod4=0" case
> in the current code) to be used for all alignments on such archs.
seems like probably a good idea, given that processors supporting cheap
misaligned access aren't exactly rare at this point.
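
a minimal sketch of what the macro-gated fast path might look like
(ARCH_ALLOWS_MISALIGNED is a hypothetical name, not something musl
defines; the cast-based access is only valid on archs that actually
tolerate it, which is the premise here):

#include <stddef.h>

#ifdef ARCH_ALLOWS_MISALIGNED
/* direct word loads/stores for any alignment: the "mod4=0" case
   applied unconditionally, skipping the shift-into-place code */
static void copy_fwd(unsigned char *d, const unsigned char *s, size_t n)
{
    for (; n >= sizeof(size_t); n -= sizeof(size_t)) {
        *(size_t *)(void *)d = *(const size_t *)(const void *)s;
        d += sizeof(size_t); s += sizeof(size_t);
    }
    for (; n; n--) *d++ = *s++;
}
#endif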
though, even so, I have somewhat mixed feelings about GCC's behavior of
apparently giving 64-bit types only 32-bit alignment on SH (this just
seems a little wrong, even if there is little technical reason per se to
require 64-bit alignment).
> One of the changes made in this update, moving the shifted block copy
> into its own (inline) function, makes it easier to move to using
> native wordsize rather than hard-coding 32-bit. The main reason the
> current code in musl hard-coded 32-bit was that I got tired of writing
> the (written out in duplicate) cases for 8 different alignments mod 8.
> But now it's just a few lines per case with little opportunity for
> introducing errors. In terms of code size, "twice as many cases but
> half the number of loads/stores per case" should be fairly neutral.
admittedly, if it were me, I would probably just have some number of
per-target #ifdef-protected versions with a generic-case fallback.
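
for reference, the shift-into-place cases being talked about look
roughly like this (just one case, src misaligned by 1 byte from a 32-bit
word boundary, little-endian assumed; the real code has one such case
per alignment mod the word size):

#include <stdint.h>
#include <stddef.h>

/* sketch: dest word-aligned, src misaligned by 1; each output word
   is stitched together from two aligned source loads */
static void copy_shift1(uint32_t *d, const unsigned char *s, size_t nwords)
{
    const uint32_t *ws = (const uint32_t *)(const void *)(s - 1);
    uint32_t lo = *ws++;
    while (nwords--) {
        uint32_t hi = *ws++;
        *d++ = (lo >> 8) | (hi << 24);
        lo = hi;
    }
}

note the final load reads a few bytes past the requested range, but
stays within an aligned word that contains needed bytes, so it can't
cross into an unmapped page.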
FWIW, I have mixed feelings about recently just using a naive
linked-list allocator in my "testkern" thing running in my emulator,
even if it isn't really meant to be "good".
my thinking was that for heaps in the lower-MB range, a linked-list
based malloc seemed like a reasonable bet.
a straightforward adaptation of my other MM designs (larger chunks of
memory with allocations managed by bitmaps) would have a pretty steep
cost for a relatively small address space.
left uncertain what the "best" design is for this size range; thinking
something like:
  under 256B: slabs of small granular-sized objects (say, multiples
    of 8B or 16B);
  medium (up to maybe 256kB to 1MB): linked lists of memory blocks
    allocated first-fit via free-lists; if the found block is bigger
    than needed it is split apart, and if a block is freed and its
    siblings are also free, they are merged back together (sketched
    in code further below);
  larger (presumably rare): probably allocates spans of pages.
normally I would do the medium range with arrays of fixed-size cells
allocated by marking spans of bits in a bitmap, but with a smaller RAM
space and heap, the question is how best to do this without introducing
fragmentation and memory-waste issues (given that allocating memory in
chunks of multiple MB each isn't really viable here).
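
the medium-range part, as sketched code (first-fit with splitting;
coalescing on free and the actual heap setup are omitted, and all the
names are made up):

#include <stddef.h>

typedef struct block {
    size_t size;          /* payload size in bytes */
    struct block *next;   /* next free block, when on the free list */
} block;

static block *free_list;

static void *ll_alloc(size_t n)
{
    block **p, *b;
    n = (n + 15) & ~(size_t)15;              /* 16B granularity */
    for (p = &free_list; (b = *p); p = &b->next) {
        if (b->size < n) continue;
        if (b->size >= n + sizeof(block) + 16) {
            /* found block is bigger than needed: split it */
            block *rest = (block *)((char *)(b + 1) + n);
            rest->size = b->size - n - sizeof(block);
            rest->next = b->next;
            b->size = n;
            *p = rest;
        } else {
            *p = b->next;                    /* take it whole */
        }
        return b + 1;
    }
    return 0;                                /* out of memory */
}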
I also partially considered an AVL-tree based allocator, but noted that
the per-object overhead would be a lot higher in this object-size range
than with plain linked lists.
so, hrrm... not entirely sure what to make of this (I haven't thought
much about this particular size range recently).
another related uncertainty is whether to use a FAT image stored in a
file as a virtual SD card, or to synthesize the FAT filesystem at
startup from a list of files (I was working some on implementing a
virtual SPI/MMC device).
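
for the synthesize-at-startup option, the starting point would just be
filling in a boot sector/BPB in memory; a fragment of the FAT16 layout
(standard field offsets, values would be placeholders):

#include <stdint.h>

struct fat16_bpb {
    uint8_t  jmp[3];              /* 0x00: x86 jump, ignored here */
    char     oem[8];              /* 0x03: OEM name */
    uint16_t bytes_per_sector;    /* 0x0B: typically 512 */
    uint8_t  sectors_per_cluster; /* 0x0D */
    uint16_t reserved_sectors;    /* 0x0E */
    uint8_t  num_fats;            /* 0x10: usually 2 */
    uint16_t root_entries;        /* 0x11 */
    uint16_t total_sectors_16;    /* 0x13 */
    uint8_t  media;               /* 0x15: 0xF8 = fixed disk */
    uint16_t fat_size_16;         /* 0x16: sectors per FAT */
    /* ... geometry, hidden sectors, extended boot fields ... */
} __attribute__((packed));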
likewise, whether to actually bother trying to make testkern "good" (an
abstraction layer over block devices and filesystems, dealing with
event-driven IO), or just do the minimalist thing (no abstraction,
blocking IO), given it is meant more to test stuff and help find
emulator bugs than to be viable as an OS.
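
if I did go the abstraction route, the minimal version is just a struct
of function pointers per block device (names made up, blocking
semantics):

#include <stdint.h>
#include <stddef.h>

typedef struct blkdev blkdev;
struct blkdev {
    /* return 0 on success; blocking IO for now */
    int (*read)(blkdev *dev, uint32_t lba, void *buf, size_t nsec);
    int (*write)(blkdev *dev, uint32_t lba, const void *buf, size_t nsec);
    uint32_t nsectors;
    void    *state;  /* driver-private, e.g. SPI/MMC controller regs */
};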
likewise, probably want to avoid spreading my efforts too thin.
there was something else I felt like I wanted to mention, but I forget
right now what it was.
well, beyond feeling uncertain whether there is any point in what I am
working on, or whether I am just annoying the people here by being
off-topic and irrelevant.
also partially debating whether I should go write a (non-huge) C
compiler for this target, but this can wait, as GCC works. it would only
really make sense if I wanted to implement my 64-bit thingy, since GCC
support for that would be pretty much non-existent (and it is doubtful
it would ever amount to much...).
or such...