Xbench is useful not only for comparing the relative speeds of two different Macintoshes, but also for optimizing performance on a single machine. Xbench is accompanied by a website that allows graphical side-by-side comparison of any of the thousands of submitted benchmarks.

* Corrected a mistake that caused the AltiVec test to be turned off on PowerPC machines
* Turned off coalesced graphics updates for all platforms on Mac OS 10.4.4 and higher
* Switched compiler to GCC 4.0 on PowerPC. This provides some boost to floating point and AltiVec scores, and these have been recalibrated accordingly. This also raises Xbench's system requirements to 10.3.9 or higher.
* Revised machine database to include the MacBook, Intel iMac and several other models
* Added code to dynamically load machine database on launch from the Xbench website
* Added support for temporarily turning off beam sync on Tiger while running graphics tests
* Fixed an issue that causes Xbench to fail to launch on Leopard
* Built as a Universal Binary to run on both PowerPC and x86 Macs
* Re-calibrated 100 point baseline to a 2.0 GHz G5 running Tiger
* Altered graphics code to flush only every 1/60th of a second, in order to cooperate with Tiger beam syncing
* Built Intel code on GCC 4.0, PowerPC remains on GCC 3.3 for 10.

Meet the G5 processor, which is in fact IBM's PowerPC 970FX processor. The RISC ISA, which is quite complex and can hardly be called "Reduced" (the R of RISC), provides 32 architectural registers. Architectural registers are the registers that are visible to the programmer, mostly the compiler programmer. These are the registers that can be used to program the calculations in the binary (and assembler) code.

Compilers for the PowerPC 970FX should thus be able to produce code that is cleaner, with less shuffling of data between the L1 cache, "secret" rename registers and architectural registers. The end result is better performance, thanks to less "bookkeeping".

Insiders say that the PowerPC 970FX has less "register pressure" than, for example, the EM64T and AMD64 CPUs (16 registers), which in turn have less register pressure than the "older" 32 bit x86 CPUs with only 8 architectural registers. Remember that most of the performance boost (10-30%) noticed in x86 64 bit programs came from the 8 extra registers available in "pure" 64 bit mode.

The 970FX is deeply pipelined, quite a bit deeper than the Athlon 64 or Opteron. While the Opteron has a 12 stage pipeline for integer calculations, the 970FX goes deeper and ends up with 16 stages. Floating point is handled through 21 stages, whereas the Opteron only needs 17. 21 stages might make you think that the 970FX is close to a Pentium 4 Northwood, but you should remember that the Pentium 4 also had 8 stages in front of the trace cache. The Pentium 4's 20 stages were counted from the trace cache. So, the Pentium 4 has to do less work in those 20 stages than what the 970FX performs in those 16 or 21 stages.

When it comes to branch prediction penalties, the 970FX penalty will be closer to the Pentium 4 (Northwood). But when it comes to frequency headroom, the 970FX should do, in theory, better than the Opteron, but does not come close to the "old" Pentium 4.

The design philosophy of the 970FX is very aggressive. It is not only a deeply pipelined processor, but it is also a very wide superscalar CPU that can theoretically sustain up to 5 instructions (4 + 1 branch) per clock cycle. The Opteron can sustain 3 at most; the Pentium 4's trace cache bandwidth "limits" the P4 to about 2 x86 instructions per clock cycle.

The 970FX works out of order and up to 200 instructions can be kept in flight, compared to 126 in the Pentium 4. The rate at which instructions are fetched will not limit the issue rate either. The PowerPC 970FX fetches up to 8 instructions per cycle from the L1 and can decode at the same rate of 8 instructions per cycle.

So, is the 970FX the ultimate out-of-order CPU? While 200 instructions in flight are impressive, there is a catch. If there were no limitation except die size, CPUs would probably keep thousands of instructions in flight. However, the scheduler has to be able to pick out independent instructions (instructions that do not rely on the outcome of a previous one) from those buffers. And searching and analysing the buffers takes time, and time is very limited at clock speeds of 2.5 GHz and more. Although it is true that the bigger the buffers, the better, the number of instructions that can be tracked and analysed per clock cycle is very limited.

The buffer in front of the execution units is about 100 instructions big, still respectable compared to the Athlon 64's reorder buffer of 72 instructions, divided into 24 groups of 3 instructions. The same grouping also happens on the 970FX or G5. But the grouping is a little coarser here, with 5 instructions in one group.
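As a rough illustration of the group-based tracking quoted for the Athlon 64 (24 groups of 3 instructions) and the 970FX (groups of 5), the sketch below models a reorder window as a number of fixed-size groups. The `ReorderWindow` class, its field names, and the figure of 20 groups for the 970FX (derived here from "about 100" entries divided by 5) are assumptions for illustration, not something the text states. It also shows one plausible cost of coarser grouping: when groups are only partly filled, fewer instructions are tracked than the nominal capacity suggests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReorderWindow:
    """Hypothetical model: a reorder buffer tracked as fixed-size groups."""
    name: str
    groups: int       # number of group entries that can be tracked
    group_size: int   # instruction slots per group

    @property
    def capacity(self) -> int:
        # Nominal capacity if every slot in every group is used.
        return self.groups * self.group_size

    def tracked(self, avg_fill: float) -> float:
        # Instructions actually tracked when each group holds
        # avg_fill instructions on average (0 <= avg_fill <= group_size).
        if not 0 <= avg_fill <= self.group_size:
            raise ValueError("avg_fill must lie between 0 and group_size")
        return self.groups * avg_fill


# Figures from the text: 72 entries as 24 groups of 3.
athlon64 = ReorderWindow("Athlon 64", groups=24, group_size=3)

# Assumption: "about 100" entries in groups of 5 implies roughly 20 groups.
ppc970fx = ReorderWindow("970FX", groups=20, group_size=5)

if __name__ == "__main__":
    for cpu in (athlon64, ppc970fx):
        print(f"{cpu.name}: {cpu.groups} x {cpu.group_size} = {cpu.capacity}")
    # Illustrative fill level only; 3.5 is not a measured value.
    print("970FX at 3.5 instructions per group:", ppc970fx.tracked(3.5))
```

Under these assumptions the wider groups give the 970FX the larger nominal window, but the advantage shrinks if the dispatcher cannot fill all five slots of each group.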