Benchmark results of different memcpy routines on PowerPC
I've benchmarked some dozen different memcpy algorithms on PowerPC.
On this page, we will focus on a few popular examples.
Summary:
There is much room for improvement for Linux on PowerPC.
Glibc memcpy performance is often only 30-60% of what is possible on PowerPC.
Routines benchmarked:
- glibc
Linux glibc memcpy. Sadly, its performance is very bad on PowerPC.
Currently, Linux does a very bad job of optimizing for PowerPC.
The Mac OS X memcpy is much better than the one used by Linux.
- MySQL bmove512
Internal MySQL function used to copy big blocks of memory fast.
Using this function instead of the glibc memcpy does not improve performance at all.
In fact, on Mac OS X, where the normal memcpy is PowerPC-aware and fully optimized,
using the internal MySQL function results in a big performance loss.
- cpy8 Byte_Copy
A simple loop copying the memory byte by byte.
This function is the worst possible implementation.
It is worth keeping this in mind, as applications often process data byte-wise unnecessarily.
- cpy64 "STREAM"
The standard STREAM benchmark to measure memory throughput.
It is a simple loop copying the memory 8 bytes at a time using a float register.
The simple STREAM memcpy function does not utilize the PowerPC's potential.
- CPY 32 CP
A simple loop copying the memory 4 bytes at a time using a normal 32-bit integer register.
This copy routine gets good performance by using a dcbt instruction to prefetch the next cache line in the background while processing the current one.
- CPY 64x4 CP
A very simple copy routine that uses not AltiVec registers but normal
64-bit float registers, which are available on every PowerPC CPU.
This copy routine is written similarly to STREAM, but the loop gets good speed on the G4 by using dcbt instructions to prefetch the next cache line in the background while processing the current one.
- Freescale's recommended Altivec memcpy
Freescale's routine clearly shows what performance can normally be expected from a PowerPC.
- Apple memcpy
Apple's memcpy is excellently optimized and uses optimal functions for
G3/G4 and G5 CPUs. On Linux, the performance of Apple's function is comparable to CPY 64x4 CP and Freescale's libmotovec. The chart below shows the performance that this function gets on Apple Mac OS. Apple Mac OS sets up the whole machine better, so all functions run a bit faster there.
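The simple loops described in the list above can be sketched in C roughly as follows. This is only an illustration and not the benchmarked code: the function names are made up, `__builtin_prefetch` (a GCC/Clang builtin that is typically compiled to dcbt on PowerPC) stands in for a hand-written dcbt, and the 32-byte cache line is an assumption that fits the G3/G4 but not the G5, which uses 128-byte lines.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 32-byte cache lines (G3/G4). The G5 uses 128-byte lines. */
#define CACHE_LINE_BYTES 32

/* Byte-by-byte copy in the spirit of cpy8: one load and one store per byte. */
static void copy_bytewise(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* 8 bytes at a time through a double, in the spirit of cpy64 "STREAM". */
static void copy_doubles(double *dst, const double *src, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = src[i];
}

/* Same copy, one cache line (four doubles) per step, with a prefetch hint
   for the following line. Hypothetical stand-in for the dcbt-based loops. */
static void copy_doubles_prefetch(double *dst, const double *src, size_t count)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof(double);
    size_t i = 0;

    assert(per_line > 0);
    while (i + per_line <= count) {
        /* Hint: the next line will be read soon (read access, no reuse). */
        __builtin_prefetch(&src[i + per_line], 0, 0);
        for (size_t k = 0; k < per_line; k++)
            dst[i + k] = src[i + k];
        i += per_line;
    }
    for (; i < count; i++)      /* copy the tail shorter than one line */
        dst[i] = src[i];
}
```

All three functions produce the same result; only the access width and the prefetch hint differ, which is exactly the difference the benchmark measures.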
The benchmarks were conducted on a Genesi Pegasos and Apple Power Macs.
Memory throughput. Copying a block of 80MB from a to b.
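A throughput measurement of this kind could look roughly like the sketch below. It is not the harness used for these results: the function name `copy_throughput_mb_s`, the `copy_fn` type, and the choice of `clock()` (CPU time, which is close to elapsed time for a single-threaded copy) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any routine with the memcpy signature can be plugged in here. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Copies `bytes` bytes from a to b `repeats` times and returns the best
   rate in MB/s (1 MB = 10^6 bytes). Returns -1.0 on allocation failure or
   a wrong copy, and 0.0 if every run was too short to be timed. */
static double copy_throughput_mb_s(copy_fn fn, size_t bytes, int repeats)
{
    assert(bytes > 0 && repeats > 0);

    unsigned char *a = malloc(bytes);
    unsigned char *b = malloc(bytes);
    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1.0;
    }

    /* Touch both blocks so all pages are mapped before timing starts. */
    memset(a, 0x5A, bytes);
    memset(b, 0, bytes);

    double best = 0.0;
    for (int r = 0; r < repeats; r++) {
        clock_t t0 = clock();
        fn(b, a, bytes);
        clock_t t1 = clock();

        double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (secs > 0.0) {
            double rate = ((double)bytes / 1e6) / secs;
            if (rate > best)
                best = rate;
        }
    }

    int ok = (memcmp(a, b, bytes) == 0);   /* check the copy is correct */
    free(a);
    free(b);
    return ok ? best : -1.0;
}
```

Calling `copy_throughput_mb_s(memcpy, 80u * 1000u * 1000u, 5)` would time the library memcpy on a block of roughly 80 MB. The block has to be much larger than the caches, otherwise the result reflects cache speed rather than memory throughput.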

Comparing the glibc memcpy with the optimal implementation on different CPU architectures.

Comparing the performance of Linux Glibc with Apple OS X implementations on the same hardware.

Tests have shown that memory read performance with normal C routines is less than 50% of what is achievable. By using data cache prefetching/streaming instructions, the performance of all memory operations involving reads can be increased considerably.

The light-colored bars show the performance of read, write, cmp and copy using normal C routines. The darker bars show the performance achieved using streaming data prefetches.
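As a rough illustration of the two variants compared for the read test, the sketch below sums an array once with a plain loop and once with a prefetch hint issued a few cache lines ahead. The function names, the 32-byte line size, and the prefetch distance are assumptions chosen for the example, and `__builtin_prefetch` (GCC/Clang) is used in place of the PowerPC-specific prefetch and stream instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 32            /* assumption: G3/G4 cache line size */
#define PREFETCH_LINES_AHEAD 4   /* hypothetical tuning value */

/* Plain read: sums the array with no prefetch hints. */
static uint64_t read_plain(const uint32_t *p, size_t count)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += p[i];
    return sum;
}

/* Same read, but at the start of every cache line a hint is issued for a
   line a few lines further on, so the data is on its way before it is used. */
static uint64_t read_prefetch(const uint32_t *p, size_t count)
{
    const size_t per_line = LINE_BYTES / sizeof(uint32_t);
    const size_t ahead = per_line * PREFETCH_LINES_AHEAD;
    uint64_t sum = 0;

    assert(per_line > 0);
    for (size_t i = 0; i < count; i++) {
        /* Only hint for addresses that are still inside the array. */
        if (i % per_line == 0 && i + ahead < count)
            __builtin_prefetch(&p[i + ahead], 0, 0);
        sum += p[i];
    }
    return sum;
}
```

Both functions return the same sum. Whether the second one is faster, and which prefetch distance works best, depends on the CPU and memory system and has to be measured.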
Detailed results Legend

G5 Systems
Detailed test results Dual 970FX 2300 MHz clock, Dual DDR 3200 Memory

stream_2x970FX.txt
stream_2x970MP.txt
G4 Systems
Detailed test results iBook G4 1420 MHz clock, 144 MHz Memory bus

stream_ibook_G4_1420.txt
Detailed test results Pegasos2 G4 1000 MHz clock, 133 MHz Memory bus

Detailed test results Pegasos2 G4 1000 MHz clock, using a Firmware version

stream_ODW_G4_1000.txt
stream_ODW_G4_1000.txt_2
Detailed test results Pegasos2 G4 1000 MHz clock, using Firmware 1.2 (20050602)

Detailed test results Amiga One G4 800 MHz clock, 133 MHz Memory bus

stream_AONE_G4_800.txt
stream_AONE_G4_800_2.txt
Detailed test results Amiga One G4 933 MHz clock, 133 MHz Memory bus

stream_AONE_G4_933.txt
G3 Systems
Detailed test results Pegasos1 G3 600 MHz clock, 100 MHz Memory bus

Detailed test results Amiga One G3 800 MHz clock, 133 MHz Memory bus
