Glibc PowerPC optimizations


Benchmark results of different memcpy routines on PowerPC


I've benchmarked about a dozen different memcpy algorithms on PowerPC.
On this page, we will focus on a few popular examples.

Summary:
There is much room for improvement for Linux on PowerPC.
Glibc memcpy often reaches only 30-60% of the performance that is possible on PowerPC.


Routines benchmarked:

  • glibc
    Linux glibc memcpy. Sadly, its performance on PowerPC is very bad.
    Currently, Linux does a very poor job of optimizing for PowerPC.
    The Mac OS X memcpy is much better than the one used by Linux.

  • MySQL bmove512
    Internal MySQL function used to copy big blocks of memory fast.
    Using this function instead of the glibc memcpy does not improve performance at all.
    In fact, on Mac OS X, where the normal memcpy is PowerPC-aware and fully optimized, using the internal MySQL function results in a big performance loss.

  • cpy8 Byte_Copy
    A simple loop copying the memory byte by byte.
    This function is the worst possible implementation.
    It is worth keeping in mind, as applications often process data bytewise unnecessarily.

  • cpy64 "STREAM"
    The standard STREAM benchmark for measuring memory throughput.
    It's a simple loop copying the memory 8 bytes at a time using a float register.
    The simple STREAM memcpy function does not utilize the PowerPC's potential.


  • CPY 32 CP
    A simple loop copying the memory 4 bytes at a time using a normal 32-bit integer register.
    This copy routine gets good performance by using a dcbt instruction to prefetch the next cache line in the background while processing the current one.

  • CPY 64x4 CP
    A very simple copy routine that uses not AltiVec registers but the normal 64-bit float registers, which are available on every PowerPC CPU.
    This copy routine is written similarly to STREAM, but the loop gets good speed on the G4 by using dcbt instructions to prefetch the next cache line in the background while processing the current one.

  • Freescale's recommended AltiVec memcpy
    Freescale's routine clearly shows what performance a PowerPC normally delivers.

  • Apple memcpy
    Apple's memcpy is excellently optimized and uses optimal functions for G3/G4 and G5 CPUs. On Linux, the performance of Apple's function is comparable to CPY 64x4 CP and Freescale's Libmotovec. The chart below shows the performance that this function gets on Apple Mac OS. Apple Mac OS sets up the whole machine better, so all functions run a bit faster.
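
To make the difference between the two simplest routines above concrete, here is a minimal sketch of a byte-wise loop and a STREAM-style 8-byte loop. The function names cpy8 and cpy64 are borrowed from the list, but the bodies are illustrative assumptions, not the code that was actually benchmarked.

```c
#include <stddef.h>

/* cpy8-style loop: moves one byte per iteration. */
void cpy8(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* cpy64-style loop in the manner of the STREAM copy kernel:
 * moves one 8-byte double per iteration, so the compiler can
 * use a floating-point register for each load/store pair. */
void cpy64(double *dst, const double *src, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = src[i];
}
```

Both loops copy the same data; the second simply needs one eighth of the iterations for the same number of bytes. Neither uses any prefetching.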

The benchmarks were conducted on a Genesi Pegasos and Apple Power Macs.


Memory throughput. Copying a block of 80MB from a to b.
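
A throughput figure of this kind can be obtained by timing one copy of a large block and dividing the size by the elapsed time. The following is only a sketch of that idea, not the harness used for these charts: the function names are hypothetical, 1 MB is assumed to mean 1024*1024 bytes, and the standard C clock() (CPU time) stands in for whatever timer the original benchmark used.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Convert a byte count and an elapsed time into MB/s.
 * Assumption: 1 MB = 1024 * 1024 bytes. */
double throughput_mb_per_s(size_t bytes, double seconds)
{
    return ((double)bytes / (1024.0 * 1024.0)) / seconds;
}

/* Time a single memcpy of `bytes` bytes from a to b.
 * Returns MB/s, or -1.0 if allocation fails or the copy was
 * too short for clock() to resolve. */
double measure_copy_mb_per_s(size_t bytes)
{
    unsigned char *a = malloc(bytes);
    unsigned char *b = malloc(bytes);
    if (a == NULL || b == NULL || bytes == 0) {
        free(a);
        free(b);
        return -1.0;
    }

    /* Touch both buffers so the pages are mapped before timing. */
    memset(a, 0x5a, bytes);
    memset(b, 0, bytes);

    clock_t t0 = clock();
    memcpy(b, a, bytes);
    clock_t t1 = clock();

    /* Read the result so the copy cannot be optimized away. */
    volatile unsigned char sink = b[bytes - 1];
    (void)sink;

    free(a);
    free(b);

    double seconds = (double)(t1 - t0) / (double)CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return -1.0;
    return throughput_mb_per_s(bytes, seconds);
}
```

For an 80 MB block one would call measure_copy_mb_per_s((size_t)80 * 1024 * 1024); repeating the measurement and keeping the best run is a common way to reduce noise.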



Comparing the glibc memcpy with the optimal implementation on different CPU architectures.


Comparing the performance of Linux Glibc with Apple OS X implementations on the same hardware.


Tests have shown that memory read performance with normal C routines reaches less than 50% of what is achievable. By using data cache prefetching/streaming instructions, the performance of all memory operations involving reads can be increased considerably.
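
As an illustration of the prefetching idea, here is a minimal sketch of a 4-byte copy loop that requests the next cache line before copying the current one. It is not the benchmarked routine: instead of a hand-written dcbt it uses the GCC/Clang builtin __builtin_prefetch, which those compilers typically lower to a prefetch instruction such as dcbt on PowerPC. The function name and the CACHE_LINE value (32 bytes, as on G3/G4; the G5 uses 128) are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes: 32 on G3/G4, 128 on G5. */
#define CACHE_LINE 32

/* Copy `words` 32-bit words, one cache line at a time.
 * Before each line is copied, the following line of the source
 * is requested so it can arrive while the current one is processed. */
void copy32_prefetch(uint32_t *dst, const uint32_t *src, size_t words)
{
    const size_t per_line = CACHE_LINE / sizeof(uint32_t);
    size_t i = 0;

    while (i < words) {
        size_t end = (i + per_line < words) ? i + per_line : words;

        /* Hint only when a next line exists inside the buffer.
         * Arguments: address, 0 = read access, 0 = low temporal locality. */
        if (end < words)
            __builtin_prefetch(&src[end], 0, 0);

        for (; i < end; i++)
            dst[i] = src[i];
    }
}
```

The prefetch is only a hint, so the copied data is identical with or without it; any gain depends on the CPU, the memory bus, and how far ahead the hint is issued.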


The light-colored bars show the performance of read/write/cmp and copy using normal C routines. The darker-colored bars show the performance achieved using streaming data prefetches.


Detailed results Legend



G5 Systems

Detailed test results Dual 970FX 2300 MHz clock, Dual DDR 3200 Memory
stream_2x970FX.txt
stream_2x970MP.txt


G4 Systems

Detailed test results iBook G4 1420 MHz clock, 144 MHz Memory bus
stream_ibook_G4_1420.txt

Detailed test results Pegasos2 G4 1000 MHz clock, 133 MHz Memory bus

Detailed test results Pegasos2 G4 1000 MHz clock, using a Firmware version
stream_ODW_G4_1000.txt
stream_ODW_G4_1000.txt_2

Detailed test results Pegasos2 G4 1000 MHz clock, using Firmware 1.2 (20050602)

Detailed test results Amiga One G4 800 MHz clock, 133 MHz Memory bus
stream_AONE_G4_800.txt
stream_AONE_G4_800_2.txt

Detailed test results Amiga One G4 933 MHz clock, 133 MHz Memory bus
stream_AONE_G4_933.txt


G3 Systems

Detailed test results Pegasos1 G3 600 MHz clock, 100 MHz Memory bus

Detailed test results Amiga One G3 800 MHz clock, 133 MHz Memory bus