AMD x86 Service Manual

256 pages, 2.99 MB

Summary
  • AMD x86 - page 1

    AMD Athlon™ Processor x86 Code Optimization Guide ...

  • AMD x86 - page 2

    Trademarks: AMD, the AMD logo, AMD Athlon, K6, 3DNow!, and combinations thereof, K86, and Super7 are trademarks, and AMD-K6 is a registered trademark of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation. MMX is a trademark and Pentium is a registere ...

  • AMD x86 - page 3

    22007E/0, November 1999. AMD Athlon™ Processor x86 Code Optimization. Contents: Revision History, xv; 1 Introduction, 1; About this Document, 1; AMD Athlon™ Processor F ...

  • AMD x86 - page 4

    Contents: Switch Statement Usage, 21; Optimize Switch Statements, 21; Use Prototypes for All Functions ...

  • AMD x86 - page 5

    Contents: Use 8-Bit Sign-Extended Displacements, 39; Code Padding Using Neutral Code Fillers, 39; Recommendations for the AMD Athlon Processor, 40; Recommendations f ...

  • AMD x86 - page 6

    Contents: 7 Scheduling Optimizations, 67; Schedule Instructions According to their Latency, 67; Unrolling Loops, 67; Complete Loop Unrolling ...

  • AMD x86 - page 7

    Contents: Signed Derivation for Algorithm, Multiplier, and Shift Factor, 95; 9 Floating-Point Optimizations, 97; Ensure All FPU Data is Aligned ...

  • AMD x86 - page 8

    Contents: Fast Conversion of Signed Words to Floating-Point, 113; Use MMX PXOR to Negate 3DNow! Data, 113; Use MMX PCMP Instead of 3DNow! PFCMP, 114; Use MMX Instruct ...

  • AMD x86 - page 9

    Contents: Integer Execution Unit, 135; Floating-Point Scheduler, 136; Floating-Point Execution Unit, 137; Loa ...

  • AMD x86 - page 10

    Contents: PerfCtr[3:0] MSRs (MSR Addresses C001_0004h–C001_0007h), 167; Starting and Stopping the Performance-Monitoring Counters, 168; Event and Time-Stamp ...

  • AMD x86 - page 11

    List of Figures: Figure 1. AMD Athlon™ Processor Block Diagram, 131; Figure 2. Integer Execution Pipeline, 135; Figure 3. Floating-Point Unit Block Diagram ...

  • AMD x86 - page 12

    List of Figures ...

  • AMD x86 - page 13

    List of Tables: Table 1. Latency of Repeated String Instructions, 84; Table 2. Integer Pipeline Operation Types, 149; Table 3. Integer Decode Types ...

  • AMD x86 - page 14

    List of Tables: Table 29. VectorPath Integer Instructions, 231; Table 30. VectorPath MMX Instructions, 234; Table 31. VectorPath MMX Extensions, 234 ...

  • AMD x86 - page 15

    Revision History. Date: Nov. 1999. Rev: E. Description: Added “About this Document” on page 1. Further clarification of “Consider the Sign of Integer Operands” on page 14. Added the optimization, “Use Array Style Instead of Pointer Style Cod ...

  • AMD x86 - page 16

    Revision History ...

  • AMD x86 - page 17

    1 Introduction. The AMD Athlon™ processor is the newest microprocessor in the AMD K86™ family of microprocessors. The advances in the AMD Athlon processor take superscalar operation and out-of-order execution to a new lev ...

  • AMD x86 - page 18

    previous-generation processors and describes how those optimizations are applicable to the AMD Athlon processor. This guide contains the following chapters: Chapter 1: Introduction. Outlines the material covered in this document. Summ ...

  • AMD x86 - page 19

    Appendix B: Pipeline and Execution Unit Resources Overview. Describes in detail the execution units and their relation to the instruction pipeline. Appendix C: Implementation of Write Combining. Describes the algorithm used by the ...

  • AMD x86 - page 20

    AMD Athlon™ Processor Microarchitecture Summary. The AMD Athlon processor brings superscalar performance and high operating frequency to PC systems running industry-standard x86 software. A brief summ ...

  • AMD x86 - page 21

    AMD Athlon execution core to achieve and sustain maximum performance. As a decoupled decode/execution processor, the AMD Athlon processor makes use of a proprietary microarchitecture, which defines the ...

  • AMD x86 - page 22

    The coding techniques for achieving peak performance on the AMD Athlon processor include, but are not limited to, those for the AMD-K6, AMD-K6-2, Pentium®, Pentium Pro, and Pentium II processors. Ho ...

  • AMD x86 - page 23

    2 Top Optimizations. This chapter contains concise descriptions of the best optimizations for improving the performance of the AMD Athlon™ processor. Subsequent chapters contain more detailed descriptions of these and other optimizations. ...

  • AMD x86 - page 24

    ■ Avoid Placing Code and Data in the Same 64-Byte Cache Line. Optimization Star: The top optimizations described in this chapter are flagged with a star. In addition, the star appears beside the more detailed descriptions found in subsequent ...

  • AMD x86 - page 25

    anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.). Use the following formula to determine prefetch distance: Prefetch Length = 200 (DS / C) ■ Round up to the nearest cache line. ■ DS i ...
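
The prefetch-distance formula can be sketched in C as follows. This is only an illustration: the excerpt cuts off before the guide defines DS and C, so the sketch assumes DS is the data stride in bytes per loop iteration and C is the number of cycles per iteration. The function name `prefetch_distance` is hypothetical; the 64-byte cache line is the size the guide states for the AMD Athlon processor.

```c
#include <stddef.h>

/* Cache-line size of the AMD Athlon processor, as stated in the guide. */
#define CACHE_LINE_BYTES 64u

/*
 * Hypothetical helper: evaluates Prefetch Length = 200 * (DS / C) and rounds
 * the result up to the nearest cache line.
 * ASSUMPTION: ds_bytes is the data stride in bytes per loop iteration and
 * cycles_per_iter is the cycle count of one iteration; the excerpt truncates
 * the guide's own definitions of DS and C.
 */
size_t prefetch_distance(size_t ds_bytes, size_t cycles_per_iter)
{
    if (cycles_per_iter == 0)
        return 0; /* no meaningful distance; also avoids dividing by zero */

    /* Multiply before dividing (rounding up) so the fraction is not lost early. */
    size_t raw = (200u * ds_bytes + cycles_per_iter - 1) / cycles_per_iter;

    /* Round up to a whole number of cache lines. */
    return (raw + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}
```

Under these assumptions, a loop that consumes 8 bytes every 3 cycles gives 200 * 8 / 3, about 534 bytes, which rounds up to 576 bytes (nine cache lines).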

  • AMD x86 - page 26

    Avoid Load-Execute Floating-Point Instructions with Integer Operands. Do not use load-execute floating-point instructions with integer operands. The floating-point load-execute instructions with integer operand ...

  • AMD x86 - page 27

    Avoid Placing Code and Data in the Same 64-Byte Cache Line. Consider that the AMD Athlon processor cache line is twice the size of previous processors. Code and data should not be shared in the same 64-byt ...

  • AMD x86 - page 28

    Group II Optimizations: Secondary Optimizations ...

  • AMD x86 - page 29

    3 C Source Level Optimizations. This chapter details C programming practices for optimizing code for the AMD Athlon™ processor. Guidelines are listed in order of importance. Ensure Fl ...

  • AMD x86 - page 30

    Consider the Sign of Integer Operands. In many cases, the data stored in integer variables determines whether a signed or an unsigned integer type is appropriate. For example, to record the weight of a person in pounds, ...

  • AMD x86 - page 31

    Example (Avoid): C source: int i; i = i / 4; ====> generated code: MOV EAX, i; CDQ; AND EDX, 3; ADD EAX, EDX; SAR EAX, 2; MOV i, EAX. Example (Preferred): C source: unsigned int i; i = i / 4; ====> generated code: SHR i, 2. In summary: Use unsigned types for: ■ Division ...
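
The reason the signed case needs the extra CDQ/AND/ADD sequence can be checked with a small C sketch. The function names are hypothetical. C division truncates toward zero, so a negative dividend needs a correction before an arithmetic shift, while unsigned division by 4 is exactly a logical shift right by 2.

```c
/* Signed division by 4: C (C99 and later) truncates toward zero, so -7 / 4 is -1. */
int div4_signed(int i)
{
    return i / 4;
}

/* Unsigned division by 4. */
unsigned int div4_unsigned(unsigned int i)
{
    return i / 4;
}

/* Logical shift right by 2: always equal to div4_unsigned for unsigned input. */
unsigned int shr2_unsigned(unsigned int i)
{
    return i >> 2;
}
```

Because `div4_unsigned` and `shr2_unsigned` agree for every input, a compiler can emit the single shift shown in the preferred example.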

  • AMD x86 - page 32

    Note that source code transformations will interact with a compiler’s code generator and that it is difficult to control the generated machine code from the source level. It is even possible that source c ...

  • AMD x86 - page 33

    *res++ = dp; /* write transformed z */ dp = vv->x * *m++; dp += vv->y * *m++; dp += vv->z * *m++; dp += vv->w * *m++; *res++ = dp; /* write transformed w */ ++vv; /* next input vertex */ m -= 16; /* reset to s ...

  • AMD x86 - page 34

    Completely Unroll Small Loops. Take advantage of the AMD Athlon processor’s large, 64-Kbyte instruction cache and completely unroll small loops. Unrolling loops can be beneficial to performance, especially if the loop b ...

  • AMD x86 - page 35

    code in a way that avoids the store-to-load dependency. In some instances the language definition may prohibit the compiler from using code transformations that would remove the store-to-load dependency. I ...

  • AMD x86 - page 36

    Consider Expression Order in Compound Branch Conditions. Branch conditions in C programs are often compound conditions consisting of multiple boolean expressions joined by the boolean operators && a ...
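
Operand order matters because C evaluates `&&` and `||` left to right and stops as soon as the result is known. The sketch below makes that visible with a counter. All names are hypothetical and the thresholds are arbitrary; it illustrates standard C short-circuit behavior, not the guide's own example, which is cut off in this excerpt.

```c
/* Counts how often the second operand is actually evaluated. */
static int second_operand_evals = 0;

static int is_small(int x)
{
    ++second_operand_evals;
    return x < 1000;
}

/* && stops at the first false operand: is_small() runs only when x > 0. */
int is_small_positive(int x)
{
    return (x > 0) && is_small(x);
}

/* Accessor so callers can observe the short-circuit effect. */
int second_operand_evaluations(void)
{
    return second_operand_evals;
}
```

Calling `is_small_positive(-5)` leaves the counter at 0, since the first operand already decides the result; only a positive argument causes the second test to run.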

  • AMD x86 - page 37

    Switch Statement Usage. Optimize Switch Statements. Switch statements are translated using a variety of algorithms. The most common of these are jump tables and comparison chains/trees. It is recommended to sort the cases of a swit ...

  • AMD x86 - page 38

    Use Const Type Qualifier. Use the “const” type qualifier as much as possible. This optimization makes code more robust and may enable higher performance code to be generated due to the additional information available t ...

  • AMD x86 - page 39

    Generalization for Multiple Constant Control Code. To generalize this further for multiple constant control code some more work may have to be done to create the proper outer loop. Enumeration of the constant cases will reduce this ...

  • AMD x86 - page 40

    case combine( 1, 1 ): for( i ... ) { DoWork1( i ); DoWork3( i ); } break; default: break; } The trick here is that there is some up-front work involved in generating all the combinations for the switch constant and the total ...

  • AMD x86 - page 41

    which might inhibit certain optimizations with some compilers, for example, aggressive inlining. Dynamic Memory Allocation Consideration. Dynamic memory allocation (‘malloc’ in C language) should always retu ...

  • AMD x86 - page 42

    lead to unexpected results. Fortunately, in the vast majority of cases, the final result will differ only in the least significant bits. Example 1 (Avoid): double a[100],sum; int i; sum = 0.0f; for (i=0; i<100; i++) ...

  • AMD x86 - page 43

    Example 1. Avoid: double a,b,c,d,e,f; e = b*c/d; f = b/d*a; Preferred: double a,b,c,d,e,f,t; t = b/d; e = c*t; f = a*t; Example 2. Avoid: double a,b,c,e,f; e = a/c; f = b/c; Preferred: double a,b,c,e,f,t; t = 1/c; e = a*t; f ...
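
Example 2 can be written as two small C functions for comparison. This is a sketch with hypothetical function names: the second version computes the reciprocal once and replaces both divisions with multiplications. In general the two versions may differ in the least significant bits, because `a * (1/c)` is rounded twice.

```c
/* Avoid: two divisions by the same divisor. */
void scale_two_divides(double a, double b, double c, double *e, double *f)
{
    *e = a / c;
    *f = b / c;
}

/* Preferred: one division, then two multiplications by the shared reciprocal. */
void scale_shared_reciprocal(double a, double b, double c, double *e, double *f)
{
    double t = 1.0 / c; /* common subexpression extracted by hand */
    *e = a * t;
    *f = b * t;
}
```

A compiler is usually not allowed to make this change on its own under strict floating-point rules, which is why the extraction is done by hand in the source.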

  • AMD x86 - page 44

    Pad by Multiple of Largest Base Type Size. Pad the structure to a multiple of the largest base type size of any member. In this fashion, if the first member of a structure is naturally aligned, all other ...

  • AMD x86 - page 45

    quadword alignment), so that quadword operands might be misaligned, even if this technique is used and the compiler does allocate variables in the order they are declared. The following example demonstr ...

  • AMD x86 - page 46

    necessary for the currently selected precision. This means that setting precision control to single precision (versus Win32 default of double precision) lowers the latency of those operations. The ...

  • AMD x86 - page 47

    Avoid Unnecessary Integer Division. Integer division is the slowest of all integer arithmetic operations and should be avoided wherever possible. One possibility for reducing the number of integer divisions is multip ...

  • AMD x86 - page 48

    Example 1 (Avoid): //assumes pointers are different and q!=r void isqrt ( unsigned long a, unsigned long *q, unsigned long *r) { *q = a; if (a > 0) { while (*q > (*r = a / *q)) { *q = (*q + ...
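
The remedy for code like Example 1 is to copy frequently dereferenced pointer arguments into local variables, which the compiler can keep in registers because no other pointer can alias them. The sketch below shows one way to do that. The loop body and the final remainder computation are a standard Newton-style integer square root filled in as an assumption, since the excerpt truncates the original listing; the name `isqrt_locals` is hypothetical.

```c
/*
 * Integer square root with remainder, using locals instead of *q and *r
 * inside the loop. ASSUMPTION: the iteration step and the remainder formula
 * are the conventional ones; the guide's own listing is cut off here.
 */
void isqrt_locals(unsigned long a, unsigned long *q, unsigned long *r)
{
    unsigned long qq = a; /* local copy of *q */
    unsigned long rr = 0; /* local copy of *r */

    if (a > 0) {
        while (qq > (rr = a / qq)) {
            qq = (qq + rr) >> 1; /* average of the estimate and a / estimate */
        }
    }
    rr = a - qq * qq; /* remainder */

    /* Write the results back through the pointers exactly once. */
    *q = qq;
    *r = rr;
}
```

For `a = 10` this yields a root of 3 and a remainder of 1. Unlike the pointer version, it also behaves correctly when `q` and `r` happen to point to the same object only in the sense that no intermediate state is corrupted; the final stores still overwrite each other.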

  • AMD x86 - page 49

    4 Instruction Decoding Optimizations. This chapter discusses ways to maximize the number of instructions decoded by the instruction decoders in the AMD Athlon™ processor. Guidelines are listed in order of importance. Overview. The AMD Athlon pro ...

  • AMD x86 - page 50

    Select DirectPath Over VectorPath Instructions. Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized for decode and execute efficiently by minimizing the ...

  • AMD x86 - page 51

    Use Load-Execute Floating-Point Instructions with Floating-Point Operands. When operating on single-precision or double-precision floating-point data, wherever possible use floating-point load-execute instructions to increa ...

  • AMD x86 - page 52

    Example 1 (Avoid): FLD QWORD PTR [foo] FIMUL DWORD PTR [bar] FIADD DWORD PTR [baz] Example 2 (Preferred): FILD DWORD PTR [bar] FILD DWORD PTR [baz] FLD QWORD PTR [foo] FMULP ST(2), ST FADDP ST(1), ST Align Branch Targets i ...

  • AMD x86 - page 53

    Example 2 (Preferred): 05 78 56 34 12 add eax, 12345678h ;uses single byte opcode form 83 C3 FB add ebx, -5 ;uses 8-bit sign extended immediate 74 05 jz $label1 ;uses 1-byte opcode, 8-bit immediate Avoid Partial Registe ...

  • AMD x86 - page 54

    Replace Certain SHLD Instructions with Alternative Code. Certain instances of the SHLD instruction can be replaced by alternative code using SHR and LEA. The alternative code has lower latency and requir ...

  • AMD x86 - page 55

    Use 8-Bit Sign-Extended Displacements. Use 8-bit sign-extended displacements for conditional branches. Using short, 8-bit sign-extended displacements for conditional branches improves code density with no negative ef ...

  • AMD x86 - page 56

    Recommendations for the AMD Athlon™ Processor. For code that is optimized specifically for the AMD Athlon processor, the optimal code fillers are NOP instructions (opcode 0x90) with up to two REP prefixes (0xF ...

  • AMD x86 - page 57

    Recommendations for AMD-K6® Family and AMD Athlon™ Processor Blended Code. On x86 processors other than the AMD Athlon processor (including the AMD-K6 family of processors), the REP prefix and especially m ...

  • AMD x86 - page 58

    NOP3_ECX TEXTEQU <DB 08Dh,00Ch,021h> ;lea ecx, [ecx] NOP3_EDX TEXTEQU <DB 08Dh,014h,022h> ;lea edx, [edx] NOP3_ESI TEXTEQU <DB 08Dh,024h,024h> ;lea esi, [esi] NOP3_EDI TEXTEQU <DB 08Dh,034h,026h> ;lea ed ...

  • AMD x86 - page 59

    ;lea edi,[edi+00000000] NOP6_EDI TEXTEQU <DB 08Dh,0BFh,0,0,0,0> ;lea ebp,[ebp+00000000] NOP6_EBP TEXTEQU <DB 08Dh,0ADh,0,0,0,0> ;lea eax,[eax*1+00000000] NOP7_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0> ;lea ebx ...

  • AMD x86 - page 60

    Code Padding Using Neutral Code Fillers ...

  • AMD x86 - page 61

    5 Cache and Memory Optimizations. This chapter describes code optimization techniques that take advantage of the large L1 caches and high-bandwidth buses of the AMD Athlon™ processor. Guidelines are listed in order of imp ...

  • AMD x86 - page 62

    Align Data Where Possible. In general, avoid misaligned data references. All data whose size is a power of 2 is considered aligned if it is naturally aligned. For example: ■ QWORD accesses are aligned if th ...

  • AMD x86 - page 63

    PREFETCH/W versus PREFETCHNTA/T0/T1/T2. The PREFETCHNTA/T0/T1/T2 instructions in the MMX extensions are processor implementation dependent. To maintain compatibility with the 25 million AMD- ...

  • AMD x86 - page 64

    MOV ECX, (-LARGE_NUM) ;used biased index MOV EAX, OFFSET array_a ;get address of array_a MOV EDX, OFFSET array_b ;get address of array_b MOV ECX, OFFSET array_c ;get address of array_c $loop: PREFETCHW [EAX+196] ;two cac ...

  • AMD x86 - page 65

    The following optimization rules were applied to this example. ■ Loops should be unrolled to make sure that the data stride per loop iteration is equal to the length of a cache line. This avoi ...

  • AMD x86 - page 66

    Take Advantage of Write Combining. Operating system and device driver programmers should take advantage of the write-combining capabilities of the AMD Athlon processor. The AMD Athlon processor has a very aggres ...

  • AMD x86 - page 67

    Store-to-Load Forwarding Restrictions. Store-to-load forwarding refers to the process of a load reading (forwarding) data from the store buffer (LS2). There are instances in the AMD Athlon processor load/store arc ...

  • AMD x86 - page 68

    Narrow-to-Wide Store-Buffer Data Forwarding Restriction. If the following conditions are present, there is a narrow-to-wide store-buffer data forwarding restriction: ■ The operand size of the store data is sma ...

  • AMD x86 - page 69

    Example 5 (Preferred): MOVD [foo], MM1 ;store lower half PUNPCKHDQ MM1, MM1 ;get upper half into lower half MOVD [foo+4], MM1 ;store lower half ... ADD EAX, [foo] ;fine ADD EDX, [foo+4] ;fine Misaligned Store-Buffer Data Forward ...

  • AMD x86 - page 70

    One Supported Store-to-Load Forwarding Case. There is one case of a mismatched store-to-load forwarding that is supported by the AMD Athlon processor. The lower 32 bits from an aligned QWORD write feeding into a DWOR ...

  • AMD x86 - page 71

    Example (Preferred): Prolog: PUSH EBP MOV EBP, ESP SUB ESP, SIZE_OF_LOCALS ;size of local variables AND ESP, -8 ;push registers that need to be preserved Epilog: ;pop register that needed to be preserved MOV ESP, ...

  • AMD x86 - page 72

    Example: struct { char a[5]; long k; double x; } baz; The structure components should be allocated (lowest to highest address) as follows: x, k, a[4], a[3], a[2], a[1], a[0], padbyte6, ..., padbyte0 See “C Langua ...

  • AMD x86 - page 73

    6 Branch Optimizations. While the AMD Athlon™ processor contains a very sophisticated branch unit, certain optimizations increase the effectiveness of the branch prediction unit. This chapter discusses rules tha ...

  • AMD x86 - page 74

    AMD Athlon™ Processor Specific Code. Example 1 (signed integer ABS function, X = labs(X)): MOV ECX, [X] ;load value MOV EBX, ECX ;save value NEG ECX ;-value CMOVS ECX, EBX ;if -value is negative, select value MO ...
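
The same select-instead-of-branch idea from Example 1 can be expressed at the C level. This is a sketch with a hypothetical function name; whether a compiler actually turns the conditional expression into a CMOV depends on the compiler and its target options.

```c
/*
 * Mirrors the NEG / CMOVS sequence: compute -value, and if -value is
 * negative keep the original value instead.
 * Like labs(), the result is undefined for LONG_MIN, whose negation
 * does not fit in a long.
 */
long labs_select(long value)
{
    long negated = -value;
    return (negated < 0) ? value : negated;
}
```

The conditional expression has no side effects on either arm, which is what makes it a candidate for a conditional move rather than a jump.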

  • AMD x86 - page 75

    Example 6 (Increment Ring Buffer Offset): //C Code char buf[BUFSIZE]; int a; if (a < (BUFSIZE-1)) { a++; } else { a = 0; } ;------------- ;Assembly Code MOV EAX, [a] ; old offset CMP EAX, (BUFSIZE-1) ; a < (BUFSIZE-1) ? CF : NC INC ...
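
A branch-free C version of the ring-buffer increment can be sketched with a mask. This is one possible formulation, not necessarily the instruction sequence the guide uses, which is cut off in this excerpt. `BUFSIZE` is set to an arbitrary value for illustration and `next_offset` is a hypothetical name.

```c
#define BUFSIZE 16 /* hypothetical buffer size, for illustration only */

/*
 * Returns a + 1 while a < BUFSIZE - 1, otherwise wraps to 0.
 * The comparison yields 1 or 0; negating it gives an all-ones or all-zeros
 * mask (two's complement), which either keeps or clears the incremented value.
 */
int next_offset(int a)
{
    int mask = -(a < (BUFSIZE - 1));
    return (a + 1) & mask;
}
```

With `BUFSIZE` equal to 16, offsets 0 through 14 advance by one and offset 15 wraps back to 0, matching the if/else in the C code of Example 6.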

  • AMD x86 - page 76

    Replace Branches with Computation in 3DNow!™ Code. Branches negatively impact the performance of 3DNow! code. Branches can operate only on one data item at a time, i.e., they are inherently scalar ...

  • AMD x86 - page 77

    Example 2 (Preferred): ; r = (x < y) ? a : b ; ; in: mm0 a ; mm1 b ; mm2 x ; mm3 y ; out: mm1 r PCMPGTD MM3, MM2 ; y > x ? 0xffffffff : 0 PAND MM1, MM3 ; y > x ? b : 0 PANDN MM3, MM0 ; y > x ? 0 : a ...
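
A scalar C analogue of the compare, AND, AND-NOT, OR pattern may make the technique easier to follow. It is a sketch only: it implements the selection stated in the example's header comment, r = (x < y) ? a : b, on one 32-bit element rather than on packed MMX data, and the function name is hypothetical.

```c
#include <stdint.h>

/* Branch-free select: returns a when x < y, otherwise b. */
uint32_t select_if_less(int32_t x, int32_t y, uint32_t a, uint32_t b)
{
    /* Comparison mask: all ones when y > x, all zeros otherwise. */
    uint32_t mask = (uint32_t)0 - (uint32_t)(y > x);

    /* Keep a where the mask is set, b where it is clear, then merge. */
    return (a & mask) | (b & ~mask);
}
```

The packed instructions apply the same three logical steps to every element of the register at once, which is why no per-element branch is needed.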

  • AMD x86 - page 78

    Example 2: C code: float x,z; z = abs(x); if (z >= 1) { z = 1/z; } 3DNow! code: ;in: MM0 = x ;out: MM0 = z MOVQ MM5, mabs ;0x7fffffff PAND MM0, MM5 ;z=abs(x) PFRCP MM2, MM0 ;1/z approx MOVQ MM1, MM0 ;save z PFRC ...

  • AMD x86 - page 79

    Example 4: C code: #define PI 3.14159265358979323 float x,z,r,res; /* 0 <= r <= PI/4 */ z = abs(x); if (z < 1) { res = r; } else { res = PI/2-r; } 3DNow! code: ;in: MM0 = x ; MM1 = r ;out: MM1 = res MOVQ MM ...

  • AMD x86 - page 80

    Example 5: C code: #define PI 3.14159265358979323 float x,y,xa,ya,r,res; int xs,df; xs = x < 0 ? 1 : 0; xa = fabs(x); ya = fabs(y); df = (xa < ya); if (xs && df) { res = PI/2 + r; } else if (xs) { res ...

  • AMD x86 - page 81

    Avoid the Loop Instruction. The LOOP instruction in the AMD Athlon processor requires eight cycles to execute. Use the preferred code shown below: Example 1 (Avoid): LOOP LABEL Example 2 (Preferred): DEC ECX JNZ LABEL Avoid Far Con ...

  • AMD x86 - page 82

    Avoid Recursive Functions. Avoid recursive functions due to the danger of overflowing the return address stack. Convert end-recursive functions to iterative code. An end-recursive function is when the function call ...
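
Converting an end-recursive function to a loop can be illustrated with Euclid's algorithm. The example is chosen here for illustration and is not taken from the guide. In the recursive form the call to itself is the last action, so the arguments of that call can simply become the new loop variables.

```c
/* End-recursive form: the recursive call is the final operation. */
unsigned int gcd_recursive(unsigned int a, unsigned int b)
{
    return (b == 0) ? a : gcd_recursive(b, a % b);
}

/* Iterative form: the same argument update, expressed as a loop. */
unsigned int gcd_iterative(unsigned int a, unsigned int b)
{
    while (b != 0) {
        unsigned int t = a % b;
        a = b;
        b = t;
    }
    return a;
}
```

The iterative version issues no CALL or RET in its body, so its depth cannot exceed the capacity of the return address stack regardless of the inputs.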

  • AMD x86 - page 83

    7 Scheduling Optimizations. This chapter describes how to code instructions for efficient scheduling. Guidelines are listed in order of importance. Schedule Instructions According to their Latency. The AM ...

  • AMD x86 - page 84

    unrolling reduces register pressure by removing the loop counter. To completely unroll a loop, remove the loop control and replicate the loop body N times. In addition, completely unrolling a loop increases scheduling opportunities. ...
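
Complete unrolling as described here, removing the loop control and replicating the body N times, might look like this in C for N = 4. The functions are hypothetical illustrations using a four-element integer dot product.

```c
/* Rolled form: loop counter, compare, and branch on every iteration. */
int dot4_loop(const int a[4], const int b[4])
{
    int sum = 0;
    for (int i = 0; i < 4; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Completely unrolled: the body appears four times and no loop control remains. */
int dot4_unrolled(const int a[4], const int b[4])
{
    return a[0] * b[0]
         + a[1] * b[1]
         + a[2] * b[2]
         + a[3] * b[3];
}
```

With the counter gone, the four multiplications are independent of any loop-carried index, which gives a scheduler more freedom to overlap them.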

  • AMD x86 - page 85

    Without Loop Unrolling: MOV ECX, MAX_LENGTH MOV EAX, OFFSET A MOV EBX, OFFSET B $add_loop: FLD QWORD PTR [EAX] FADD QWORD PTR [EBX] FSTP QWORD PTR [EAX] ADD EAX, 8 ADD EBX, 8 DEC ECX JNZ $add_loop The loop consists of seven instructions. The AMD At ...

  • AMD x86 - page 86

    no faster than three iterations in 10 cycles, or 6/10 floating-point adds per cycle, or 1.4 times as fast as the original loop. Deriving Loop Control For Partially Unrolled Loops. A frequently used loop construct is a counting loop. In a typic ...

  • AMD x86 - page 87

    Use Function Inlining. Overview. Make use of the AMD Athlon processor’s large 64-Kbyte instruction cache by inlining small routines to avoid procedure-call overhead. Consider the cost of possible increased register usage, whic ...

  • AMD x86 - page 88

    Always Inline Functions if Called from One Site. A function should always be inlined if it can be established that it is called from just one site in the code. For the C language, determination of this characteristic is made ...

  • AMD x86 - page 89

    Example 1 (Avoid): ADD EBX, ECX ;inst 1 MOV EAX, DWORD PTR [10h] ;inst 2 (fast address calc.) MOV ECX, DWORD PTR [EAX+EBX] ;inst 3 (slow address calc.) MOV EDX, DWORD PTR [24h] ;this load is stalled from accessing data cache due to long laten ...

  • AMD x86 - page 90

    Example 1 (Avoid):
        int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
        for (i=0; i < MAXSIZE; i++) {
            c[i] = a[i] + b[i];
        }

        MOV ECX, MAXSIZE  ;initialize loop counter
        XOR ESI, ESI      ;initialize offset into array a
        XOR EDI, EDI      ;initializ ...

  • AMD x86 - page 91

    variable that starts with a negative value and reaches zero when the loop expires. Note that if the base addresses are held in registers (e.g., when the base addresses are passed as arguments of a function) biasing the base add ...
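A small C sketch of the idea described here: the base addresses are biased to the end of each array, and a single index runs from a negative value up to zero, so the index also serves as the loop counter. The function name and signature are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* c[i] = a[i] + b[i] for n elements, using one index variable that
   starts at -n and counts up to zero (hypothetical helper). */
void add_arrays_biased(int *c, const int *a, const int *b, ptrdiff_t n)
{
    assert(n >= 0);
    /* Bias each base address to one past the last element. */
    const int *a_end = a + n;
    const int *b_end = b + n;
    int *c_end = c + n;

    /* The index is the only value updated per iteration, and the
       comparison against zero ends the loop. */
    for (ptrdiff_t i = -n; i != 0; i++)
        c_end[i] = a_end[i] + b_end[i];
}
```

Whether a compiler turns this into the single-increment loop shown in the guide depends on the compiler and its settings.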

  • AMD x86 - page 92


  • AMD x86 - page 93

    8 Integer Optimizations
    This chapter describes ways to improve integer performance through optimized programming techniques. The guidelines are listed in order of importance.
    Replace Divides with Multiplies
    Replace inte ...

  • AMD x86 - page 94

    Signed Division Utility
    In the opt_utilities directory of the AMD documentation CDROM, run sdiv.exe in a DOS window to find the fastest code for signed division by a constant. The utility displays the code after the user en ...

  • AMD x86 - page 95

    Example 1:
        ;In:  EAX = dividend
        ;Out: EDX = quotient
        XOR EDX, EDX  ;0
        CMP EAX, d    ;CF = (dividend < divisor) ? 1 : 0
        SBB EDX, -1   ;quotient = 0+1-CF = (dividend < divisor) ? 0 : 1
    In cases where the dividend does not need to be pre ...
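The comments in Example 1 reduce the division to a single comparison. That works when the quotient can only be 0 or 1, which is the case for an unsigned 32-bit divisor of at least 2^31. A C sketch of the same computation follows; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 32-bit division by a divisor d >= 2^31. The quotient is
   1 - CF, where CF = (dividend < d), mirroring CMP followed by SBB. */
uint32_t div_by_large(uint32_t dividend, uint32_t d)
{
    assert(d >= UINT32_C(0x80000000));
    return (uint32_t)(dividend >= d);
}
```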

  • AMD x86 - page 96

        ;algorithm 1
        MOV EAX, m
        MOV EDX, dividend
        MOV ECX, EDX
        IMUL EDX
        ADD EDX, ECX
        SHR ECX, 31
        SAR EDX, s
        ADD EDX, ECX  ;quotient in EDX
    Derivation for a, m, s
    The derivation for the algorithm (a), multiplier (m), and shift count (s), ...
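The instruction sequence above can be followed step by step in C. The sketch below divides by 7 using m = 92492493h and s = 2; these are widely published constants for signed division by 7, not values produced by the guide's utility, so treat them as an assumption. It also assumes that right-shifting a negative signed value is an arithmetic shift, as on common compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Signed division by 7 following the multiply/add/shift sequence.
   m and s are assumed constants for d = 7 (see lead-in). */
int32_t sdiv7(int32_t n)
{
    const int32_t m = INT32_C(-1840700269);  /* 92492493h as a signed value */
    const int s = 2;

    int64_t prod = (int64_t)m * n;   /* IMUL EDX: 64-bit signed product   */
    int64_t hi = prod >> 32;         /* EDX = upper half (arithmetic)     */
    hi += n;                         /* ADD EDX, ECX                      */
    hi >>= s;                        /* SAR EDX, s                        */
    hi += (uint32_t)n >> 31;         /* SHR ECX, 31 ; ADD EDX, ECX        */

    assert(hi >= INT32_MIN && hi <= INT32_MAX);
    return (int32_t)hi;
}
```

The final addition of the sign bit rounds negative quotients toward zero, matching C's `/` operator.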

  • AMD x86 - page 97

    Remainder of Signed Integer 2^n or -(2^n)
        ;IN:  EAX = dividend
        ;OUT: EAX = remainder
        CDQ               ;Sign extend into EDX
        AND EDX, (2^n-1)  ;Mask correction (abs(divisor)-1)
        ADD EAX, EDX      ;Apply pre-correction
        AND EAX, (2^n ...
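A C sketch of the signed remainder by 2^n. The listing above is cut off after the second AND, so the closing subtraction of the correction is an assumption based on the usual form of this technique. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Remainder of x divided by 2^n, with the sign of the dividend
   (same result as C's x % (1 << n)). */
int32_t rem_pow2(int32_t x, unsigned n)
{
    assert(n >= 1 && n <= 30);
    const uint32_t mask = (UINT32_C(1) << n) - 1;     /* 2^n - 1            */
    const uint32_t corr = (x < 0) ? mask : 0;         /* CDQ ; AND EDX,mask */
    const uint32_t t = ((uint32_t)x + corr) & mask;   /* ADD ; AND          */
    return (int32_t)t - (int32_t)corr;                /* assumed final SUB  */
}
```

For example, x = -7 and n = 2 give corr = 3, t = (-7 + 3) & 3 = 0, and a result of -3.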

  • AMD x86 - page 98

    by 11: LEA REG2, [REG1*8+REG1]  ;3 cycles
           ADD REG1, REG1
           ADD REG1, REG2
    by 12: SHL REG1, 2
           LEA REG1, [REG1*2+REG1]  ;3 cycles
    by 13: LEA REG2, [REG1*2+REG1]  ;3 cycles
           SHL REG1, 4
           SUB REG1, REG2
    by 14: LEA REG2, [REG1* ...

  • AMD x86 - page 99

    by 26: use IMUL
    by 27: LEA REG2, [REG1*4+REG1]  ;3 cycles
           SHL REG1, 5
           SUB REG1, REG2
    by 28: MOV REG2, REG1           ;3 cycles
           SHL REG1, 3
           SUB REG1, REG2
           SHL REG1, 2
    by 29: LEA REG2, [REG1*2+REG1]  ;3 cycles
           SHL REG1, 5
           SUB REG1, RE ...
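The shift/add/subtract decompositions for 11, 13, 27, and 28 can be checked in C. Each function below mirrors one of the listed instruction sequences, with `t` playing the role of REG2; the function names are illustrative. Unsigned arithmetic is used so that wraparound matches the 32-bit register behavior.

```c
#include <assert.h>
#include <stdint.h>

/* by 11: t = 9x ; x = 2x ; x + t */
uint32_t mul11(uint32_t x) { uint32_t t = x * 8 + x; x = x + x; return x + t; }

/* by 13: t = 3x ; x = 16x ; x - t */
uint32_t mul13(uint32_t x) { uint32_t t = x * 2 + x; x <<= 4; return x - t; }

/* by 27: t = 5x ; x = 32x ; x - t */
uint32_t mul27(uint32_t x) { uint32_t t = x * 4 + x; x <<= 5; return x - t; }

/* by 28: t = x ; x = 8x ; x = 7x ; x = 28x */
uint32_t mul28(uint32_t x) { uint32_t t = x; x <<= 3; x -= t; return x << 2; }

/* Compare each decomposition against a plain multiply. */
void check_mul(uint32_t x)
{
    assert(mul11(x) == 11u * x);
    assert(mul13(x) == 13u * x);
    assert(mul27(x) == 27u * x);
    assert(mul28(x) == 28u * x);
}
```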

  • AMD x86 - page 100

    In addition, using MMX instructions increases the available parallelism. The AMD Athlon processor can issue three integer OPs and two MMX OPs per cycle.
    Repeated String Instruction Usage
    Latency of Repeated String Instructi ...

  • AMD x86 - page 101

    Ensure DF=0 (UP)
    Always make sure that DF = 0 (UP) (after execution of CLD) for REP MOVS and REP STOS. DF = 1 (DOWN) is only needed for certain cases of overlapping REP MOVS (for example, so ...

  • AMD x86 - page 102

    Use XOR Instruction to Clear Integer Registers
    To clear an integer register to all 0s, use "XOR reg, reg". The AMD Athlon processor is able to avoid the false read dependency on the XOR instructi ...

  • AMD x86 - page 103

    Example 4 (Left shift):
        ;shift operand in EDX:EAX left, shift count in ECX (count
        ; applied modulo 64)
        SHLD EDX, EAX, CL  ;first apply shift count
        SHL EAX, CL        ; mod 32 to EDX:EAX
        TEST ECX, 32       ;need to shift by another 32?
        JZ $lshi ...

  • AMD x86 - page 104

    Example 7 (Division):
        ;_ulldiv divides two unsigned 64-bit integers, and returns
        ; the quotient.
        ;
        ;INPUT:    [ESP+8]:[ESP+4]   dividend
        ;          [ESP+16]:[ESP+12] divisor
        ;
        ;OUTPUT:   EDX:EAX quotient of division
        ;
        ;DESTROYS: EAX,ECX,EDX,EFlags ...

  • AMD x86 - page 105

        MOV ECX, EAX            ;save quotient
        IMUL EDI, EAX           ;quotient * divisor hi-word
                                ; (low only)
        MUL DWORD PTR [ESP+20]  ;quotient * divisor lo-word
        ADD EDX, EDI            ;EDX:EAX = quotient * divisor
        SUB EBX, EAX            ;dividend_lo - (quot.*divisor)_lo
        MOV EA ...

  • AMD x86 - page 106

    $r_two_divs:
        MOV ECX, EAX  ;save dividend_lo in ECX
        MOV EAX, EDX  ;get dividend_hi
        XOR EDX, EDX  ;zero extend it into EDX:EAX
        DIV EBX       ;EAX = quotient_hi, EDX = intermediate
                      ; remainder
        MOV EAX, ECX  ;EAX = dividend_lo
        DIV EBX       ;EAX = qu ...

  • AMD x86 - page 107

    Efficient Implementation of Population Count Function
    Population count is an operation that determines the number of set bits in a bit string. For example, this can be used to determine the cardinality of a ...

  • AMD x86 - page 108

    Step 3
    For the first time, the value in each k-bit field is small enough that adding two k-bit fields results in a value that still fits in the k-bit field. Thus the following computation is performed: y ...

  • AMD x86 - page 109

        ADD EAX, EDX         ;x = (w & 0x33333333) + ((w >> 2) &
                             ; 0x33333333)
        MOV EDX, EAX         ;x
        SHR EAX, 4           ;x >> 4
        ADD EAX, EDX         ;x + (x >> 4)
        AND EAX, 00F0F0F0Fh  ;y = (x + (x >> 4) & 0 ...
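The partial-sum steps named in the comments (w, x, y) can be written directly in C. The listing is cut off before the final step, so the closing multiply by 01010101h and shift by 24 is an assumption based on the usual form of this algorithm. The function names are illustrative, and a bit-by-bit reference is included for comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Parallel population count of a 32-bit value. */
uint32_t popcount32(uint32_t v)
{
    uint32_t w = v - ((v >> 1) & 0x55555555u);                  /* 2-bit sums */
    uint32_t x = (w & 0x33333333u) + ((w >> 2) & 0x33333333u);  /* 4-bit sums */
    uint32_t y = (x + (x >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit sums */
    return (y * 0x01010101u) >> 24;   /* assumed final step: add the bytes */
}

/* Straightforward reference: test each bit in turn. */
uint32_t popcount_naive(uint32_t v)
{
    uint32_t count = 0;
    for (; v != 0; v >>= 1)
        count += v & 1u;
    return count;
}

/* Compare the two implementations for one input. */
void check_popcount(uint32_t v)
{
    assert(popcount32(v) == popcount_naive(v));
}
```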

  • AMD x86 - page 110

        ;algorithm 1
        MOV EDX, dividend
        MOV EAX, m
        MUL EDX
        ADD EAX, m
        ADC EDX, 0
        SHR EDX, s  ;EDX=quotient
    */
    typedef unsigned __int64 U64;
    typedef unsigned long U32;
    U32 d, l, s, m, a, r;
    U64 m_low, m_high, j, k;
    U32 log2 ...

  • AMD x86 - page 111

    /* Generate m, s for algorithm 1. Based on: Magenheimer, D.J.; et al: "Integer Multiplication and Division on the HP Precision Architecture". IEEE Transactions on Computers, Vol 37, No. 8, August 198 ...

  • AMD x86 - page 112

        ;algorithm 1
        MOV EAX, m
        MOV EDX, dividend
        MOV ECX, EDX
        IMUL EDX
        ADD EDX, ECX
        SHR ECX, 31
        SAR EDX, s
        ADD EDX, ECX  ; quotient in EDX
    */
    typedef unsigned __int64 U64;
    typedef unsigned long U32;
    U32 log2 (U32 i) { U32 ...

  • AMD x86 - page 113

    9 Floating-Point Optimizations
    This chapter details the methods used to optimize floating-point code to the pipelined floating-point unit (FPU). Guidelines are listed in order of importance.
    Ensure All FPU Data is Aligned
    As di ...

  • AMD x86 - page 114

    Use FFREEP Macro to Pop One Register from the FPU Stack
    In FPU intensive code, frequently accessed data is often pre-loaded at the bottom of the FPU stack before processing floating-point data. A ...

  • AMD x86 - page 115

    These instructions are much faster than the classical approach using FSTSW, because FSTSW is essentially a serializing instruction on the AMD Athlon processor. When FSTSW cannot be avoided (for example, ...

  • AMD x86 - page 116

    Minimize Floating-Point-to-Integer Conversions
    C++, C, and Fortran define floating-point-to-integer conversions as truncating. This creates a problem because the active rounding mode in an application i ...

  • AMD x86 - page 117

    FPU into truncating mode, and performing all of the conversions before restoring the original control word. The speed of the above code is somewhat dependent on the nature of the code surrounding it. For appl ...

  • AMD x86 - page 118

    Example 3 (Potentially faster):
        MOV ECX, DWORD PTR[X+4]  ;get upper 32 bits of double
        XOR EDX, EDX             ;i = 0
        MOV EAX, ECX             ;save sign bit
        AND ECX, 07FF00000h      ;isolate exponent field
        CMP ECX, 03FF00000h      ;if abs(x) < 1.0 ...

  • AMD x86 - page 119

    Floating-Point Subexpression Elimination
    There are cases which do not require an FXCH instruction after every instruction to allow access to two new stack entries. In the cases where two instructions share a s ...

  • AMD x86 - page 120

    If an "argument out of range" is detected, a range reduction subroutine is invoked which reduces the argument to less than 2^63 before the instruction is attempted again. While an argument > ...

  • AMD x86 - page 121

    Since out-of-range arguments are extremely uncommon, the conditional branch will be perfectly predicted, and the other instructions used to guard the trigonometric instruction can execute in parallel to it.
    Take Ad ...

  • AMD x86 - page 122


  • AMD x86 - page 123

    10 3DNow!™ and MMX™ Optimizations
    This chapter describes 3DNow! and MMX code optimization techniques for the AMD Athlon™ processor. Guidelines are listed in order of importance. 3DNow! porting guidelines can be foun ...

  • AMD x86 - page 124

    FEMMS instruction is supported for backward compatibility with AMD-K6 family processors, and is aliased to the EMMS instruction. 3DNow! and MMX instructions are designed to be used concurrently with no sw ...

  • AMD x86 - page 125

    Pipelined Pair of 24-Bit Precision Divides
    This divide operation executes with a total latency of 21 cycles, assuming that the program hides the latency of the first MOVD/MOVQ instructions within prec ...

  • AMD x86 - page 126

    Use 3DNow!™ Instructions for Fast Square Root and Reciprocal Square Root
    3DNow! instructions can be used to compute a very fast, highly accurate square root and reciprocal square ...

  • AMD x86 - page 127

    Newton-Raphson Reciprocal Square Root
    The general Newton-Raphson reciprocal square root recurrence is:
        Z_(i+1) = 1/2 * Z_i * (3 - b * Z_i^2)
    To reduce the number of i ...
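A scalar C sketch of the recurrence, assuming single-precision floats and a caller-supplied starting estimate. On the processor the starting estimate would come from a 3DNow! instruction; that part is not modeled here, and the function names are illustrative.

```c
#include <assert.h>

/* One Newton-Raphson step toward 1/sqrt(b):
   Z_(i+1) = 1/2 * Z_i * (3 - b * Z_i^2) */
float rsqrt_step(float b, float z)
{
    return 0.5f * z * (3.0f - b * z * z);
}

/* Apply the step a fixed number of times, starting from z0. */
float rsqrt_refine(float b, float z0, int steps)
{
    assert(b > 0.0f && steps >= 0);
    float z = z0;
    for (int i = 0; i < steps; i++)
        z = rsqrt_step(b, z);
    return z;
}
```

With b = 4 and a starting estimate of 0.4, successive steps move toward 0.5; each step roughly doubles the number of correct bits once the estimate is close.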

  • AMD x86 - page 128

    Example:
        PXOR MM2, MM2       ; 0 | 0
        MOVD MM0, [ab]      ; 0 0 | b a
        MOVD MM1, [cd]      ; 0 0 | d c
        PUNPCKLWD MM0, MM2  ; 0 b | 0 a
        PUNPCKLWD MM1, MM2  ; 0 d | 0 c
        PMADDWD MM0, MM1    ; b*d | a*c
    3DNow!™ and MMX™ Intra-Operand Swa ...

  • AMD x86 - page 129

    Fast Conversion of Signed Words to Floating-Point
    In many applications there is a need to quickly convert data consisting of packed 16-bit signed integers into floating-point numbers. The following two e ...

  • AMD x86 - page 130

    cycle bypassing penalty, and another one cycle penalty if the result goes to a 3DNow! operation. The PFMUL execution latency is four; therefore, in the worst case, the PXOR and PMUL instructions are the ...

  • AMD x86 - page 131

    Use MMX™ Instructions for Block Copies and Block Fills
    For moving or filling small blocks of data (e.g., less than 512 bytes) between cacheable memory areas, the REP MOVS and REP STOS families ...

  • AMD x86 - page 132

    $xfer:
        movq mm0, [eax]
        add edx, 64
        movq mm1, [eax+8]
        add eax, 64
        movq mm2, [eax-48]
        movq [edx-64], mm0
        movq mm0, [eax-40]
        movq [edx-56], mm1
        movq mm1, [eax-32]
        movq [edx-48], mm2
        movq mm2, [eax-24]
        movq [edx ...

  • AMD x86 - page 133

    AMD Athlon™ Processor Specific Code
    The following example code, written for the inline assembler of Microsoft Visual C, is suitable for moving/filling a quadword aligned block of data in the f ...

  • AMD x86 - page 134

    /* block fill (destination QWORD aligned) */
    __asm {
        mov edx, [dst_ptr]
        mov ecx, [blk_size]
        shr ecx, 6
        movq mm0, [fill_data]
        align 16
    $fill_nc:
        movntq [edx], mm0
        movntq [edx+8], mm0
        movntq [edx+16], mm0
        mov ...

  • AMD x86 - page 135

    Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register
    To set all the bits in an MMX register to one, use:
        PCMPEQD MMreg, MMreg
    Note that PCMPEQD MMreg, MMreg is dependent on previous wri ...

  • AMD x86 - page 136

    /* Function XForm performs a fully generalized 3D transform on an array of vertices pointed to by "v" and stores the transformed vertices in the location pointed to by "res". Each vertex consists of four floats. T ...

  • AMD x86 - page 137

    $$xform:
        ADD EBX, 16                    ;res++
        MOVQ MM0, QWORD PTR [EDX]      ;v->y | v->x
        MOVQ MM1, QWORD PTR [EDX+8]    ;v->w | v->z
        ADD EDX, 16                    ;v++
        MOVQ MM2, MM0                  ;v->y | v->x
        MOVQ MM3, QWORD PTR [EAX+M00]  ;m[0][1] | m[0][0]
        PUNPCKLDQ MM0, ...

  • AMD x86 - page 138

    Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions
    Clipping is one of the major activities occurring in a 3D graphics pipeline. In many instances, this activity is split into two parts which do ...

  • AMD x86 - page 139

        ;;
        ;; DESTROYS MM0,MM1,MM2,MM3,MM4
        PXOR MM0, MM0       ; 0 | 0
        MOVQ MM1, MM6       ; w | z
        MOVQ MM4, MM5       ; y | x
        PUNPCKHDQ MM1, MM1  ; w | w
        MOVQ MM3, MM6       ; w | z
        MOVQ MM2, MM5       ; y | x
        PFSUBR MM3, MM0     ; -w | -z
        PFSUBR MM2, MM ...

  • AMD x86 - page 140

    Example 1 (Avoid):
        MOV ESI, DWORD PTR Src_MB
        MOV EDI, DWORD PTR Dst_MB
        MOV EDX, DWORD PTR SrcStride
        MOV EBX, DWORD PTR DstStride
        MOVQ MM7, QWORD PTR [ConstFEFE]
        MOVQ MM6, QWORD PTR [Const0101]
        MOV ECX, 16
    L1:
        MOVQ MM0, ...

  • AMD x86 - page 141

    The following code fragment uses the 3DNow! PAVGUSB instruction to perform averaging between the source macroblock and destination macroblock:
    Example 2 (Preferred):
        MOV EAX, DWORD PTR Src_MB
        MOV EDI, DWORD PTR Dst_M ...

  • AMD x86 - page 142

    Complex Number Arithmetic
    Complex numbers have a "real" part and an "imaginary" part. Multiplying complex numbers (ex. 3 + 4i) is an integral part of many algorithms such as Discrete Fourier Transform (DFT) and ...
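For reference, the scalar arithmetic behind a complex multiply is (a + bi)(c + di) = (ac - bd) + (ad + bc)i. The C sketch below spells this out; the `cplx` type and function names are hypothetical, and the guide's 3DNow! version is not reproduced here.

```c
#include <assert.h>

/* A complex number stored as a real/imaginary pair (illustrative type). */
typedef struct {
    float re;
    float im;
} cplx;

/* (a.re + a.im*i) * (b.re + b.im*i) */
cplx cmul(cplx a, cplx b)
{
    cplx p;
    p.re = a.re * b.re - a.im * b.im;
    p.im = a.re * b.im + a.im * b.re;
    return p;
}

/* Worked example: (3 + 4i)(1 + 2i) = (3 - 8) + (6 + 4)i = -5 + 10i. */
void cmul_example(void)
{
    cplx a = {3.0f, 4.0f};
    cplx b = {1.0f, 2.0f};
    cplx p = cmul(a, b);
    assert(p.re == -5.0f && p.im == 10.0f);
}
```

Each product needs four multiplies and two add/subtract operations, which is why packed two-element instructions map onto it well.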

  • AMD x86 - page 143

    11 General x86 Optimization Guidelines
    This chapter describes general code optimization techniques specific to superscalar processors (that is, techniques common to the AMD-K6® processor, AMD Athlon™ processor, and Pentium® fami ...

  • AMD x86 - page 144

    Dependencies
    Spread out true dependencies to increase the opportunities for parallel execution. Anti-dependencies and output dependencies do not impact performance.
    Register Operands
    Maintain frequently used values in registers rather than i ...

  • AMD x86 - page 145

    Appendix A AMD Athlon™ Processor Microarchitecture
    Introduction
    When discussing processor design, it is important to understand the following terms: architecture, microarchitecture, and design implementation. The term architecture r ...

  • AMD x86 - page 146

    AMD Athlon™ Processor Microarchitecture
    The innovative AMD Athlon processor microarchitecture approach implements the x86 instruction set by processing simpler operations (OPs) instead of complex x86 instruct ...

  • AMD x86 - page 147

    Figure 1. AMD Athlon™ Processor Block Diagram
    Instruction Cache
    The out-of-order execute engine of the AMD Athlon processor contains a very large 64-Kbyte L1 instruction cache. The L1 instruction cache is ...

  • AMD x86 - page 148

    replacement is based on a least-recently used (LRU) replacement algorithm. The L1 instruction cache has an associated two-level translation look-aside buffer (TLB) structure. The first-level TLB is fully ...

  • AMD x86 - page 149

    return stack. Subsequent RETs pop a predicted return address off the top of the stack.
    Early Decoding
    The DirectPath and VectorPath decoders perform early-decoding of instructions into MacroOPs. A MacroOP ...

  • AMD x86 - page 150

    Instruction Control Unit
    The instruction control unit (ICU) is the control center for the AMD Athlon processor. The ICU controls the following resources: the centralized in-flight reorder buffer, the integer ...

  • AMD x86 - page 151

    Integer Scheduler
    The integer scheduler is based on a three-wide queuing system (also known as a reservation station) that feeds three integer execution positions or pipes. The reservation stations are six ...

  • AMD x86 - page 152

    Each of the three IEUs is general purpose in that each performs logic functions, arithmetic functions, conditional functions, divide step functions, status flag multiplexing, and branch resolutions. The AGUs calcu ...

  • AMD x86 - page 153

    Floating-Point Execution Unit
    The floating-point execution unit (FPU) is implemented as a coprocessor that has its own out-of-order control in addition to the data path. The FPU handles all register operations ...

  • AMD x86 - page 154

    Load-Store Unit (LSU)
    The load-store unit (LSU) manages data load and store accesses to the L1 data cache and, if required, to the backside L2 cache or system memory. The 44-entry LSU provides a data interface ...

  • AMD x86 - page 155

    L2 Cache Controller
    The AMD Athlon processor contains a very flexible onboard L2 controller. It uses an independent backside bus to access up to 8-Mbytes of industry-standard SRAMs. There are full on-chip tags ...

  • AMD x86 - page 156


  • AMD x86 - page 157

    Appendix B Pipeline and Execution Unit Resources
    Overview
    The AMD Athlon™ processor contains two independent execution pipelines: one for integer operations and one for floating-point operations. The integer pip ...

  • AMD x86 - page 158

    Figure 5. Fetch/Scan/Align/Decode Pipeline Hardware
    The most common x86 instructions flow through the DirectPath pipeline stages and are decoded by hardware. The less common instructions, which require microcode ass ...

  • AMD x86 - page 159

    Cycle 1 – FETCH
    The FETCH pipeline stage calculates the address of the next x86 instruction window to fetch from the processor caches or system memory.
    Cycle 2 – SCAN
    SCAN determines the start and end pointers of instr ...

  • AMD x86 - page 160

    operands mapped to registers. Both integer and floating-point MacroOPs are placed into the ICU.
    Integer Pipeline Stages
    The integer execution pipeline consists of four or more stages for scheduling and execution and, if necessary, ...

  • AMD x86 - page 161

    Cycle 7 – SCHED
    In the scheduler (SCHED) pipeline stage, the scheduler buffers can contain MacroOPs that are waiting for integer operands from the ICU or the IEU result bus. When all operands are received, SCHED schedules ...

  • AMD x86 - page 162

    Floating-Point Pipeline Stages
    The floating-point unit (FPU) is implemented as a coprocessor that has its own out-of-order control in addition to the data path. The FPU handles all register operations for x87 instructi ...

  • AMD x86 - page 163

    Cycle 7 – STKREN
    The stack rename (STKREN) pipeline stage in cycle 7 receives up to three MacroOPs from IDEC and maps stack-relative register tags to virtual register tags.
    Cycle 8 – REGREN
    The register re ...

  • AMD x86 - page 164

    Execution Unit Resources
    Terminology
    The execution units operate with two types of register values: operands and results. There are three operand types and two result types, which are described in this section. Oper ...

  • AMD x86 - page 165

    Integer Pipeline Operations
    Table 2 shows the category or type of operations handled by the integer pipeline. Table 3 shows examples of the decode type. As shown in Table 2, the MOV instruction early decodes in the DirectPat ...

  • AMD x86 - page 166

    Floating-Point Pipeline Operations
    Table 4 shows the category or type of operations handled by the floating-point execution units. Table 5 shows examples of the decode types. As shown in Table 4, the FADD register-to-regist ...

  • AMD x86 - page 167

    Load/Store Pipeline Operations
    The AMD Athlon processor decodes any instruction that references memory into primitive load/store operations. For example, consider the following code sample:
        MOV AX, [EBX]  ;1 load MacroOP
        PUSH ...

  • AMD x86 - page 168

    Code Sample Analysis
    The samples in Table 7 on page 153 and Table 8 on page 154 show the execution behavior of several series of instructions as a function of decode constraints, dependencies, and execution resource constraints. ...

  • AMD x86 - page 169

    Table 7. Sample 1 – Integer Register Operations
    Instruction Number | Instruction | Decode Pipe | Decode Type | Clocks (1-8)
    1 | IMUL EAX, ECX   | 0 | VP | D I M M M M
    2 | INC ESI         | 0 | DP | D I E
    3 | MOV EDI, 0x07F4 | 1 | DP | D I E
    4 | AD ...

  • AMD x86 - page 170

    Table 8. Sample 2 – Integer Register and Memory Load Operations
    Instruction Number | Instruction | Decode Pipe | Decode Type | Clocks (1-12)
    1 | DEC EDX        | 0 | DP | D I E
    2 | MOV EDI, [ECX] | 1 | DP | D I &/S A $
    3 | SUB EAX, [ ...

  • AMD x86 - page 171

    Appendix C Implementation of Write Combining
    Introduction
    This appendix describes the memory write-combining feature as implemented in the AMD Athlon™ processor family. The AMD Athlon processor supports the memory type and range regis ...

  • AMD x86 - page 172

    Write-Combining Definitions and Abbreviations
    This appendix uses the following definitions and abbreviations:
    ■ UC: Uncacheable memory type
    ■ WC: Write-combining memory type
    ■ WT: Writethrough ...

  • AMD x86 - page 173

    signature in register EAX, where EAX[11–8] contains the instruction family code. For the AMD Athlon processor, the instruction family code is six.
    2. In addition, the presence of the MTRRs is indicated by bit 12 and the pr ...
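A C sketch of the two bit-field checks described here, operating on register values the caller has already obtained from CPUID. The helper names are hypothetical. Reading bit 12 from the standard feature flags returned in EDX by CPUID function 1 reflects general x86 convention rather than the truncated text above, so it is labeled as an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Family code from the processor signature: EAX[11:8]. */
unsigned cpu_family(uint32_t signature_eax)
{
    return (unsigned)((signature_eax >> 8) & 0xFu);
}

/* MTRR presence: bit 12 of the feature flags (assumed to be the EDX
   value returned by CPUID function 1). */
int has_mtrr(uint32_t feature_flags_edx)
{
    return (int)((feature_flags_edx >> 12) & 1u);
}

/* True when the signature reports instruction family six. */
int is_family_six(uint32_t signature_eax)
{
    unsigned family = cpu_family(signature_eax);
    assert(family <= 0xFu);
    return family == 6u;
}
```

For example, a signature value of 00000621h has family code 6 in bits 11 through 8.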

  • AMD x86 - page 174

    Table 9. Write Combining Completion Events
    Event | Comment
    Non-WB write outside of current buffer | The first non-WB write to a different cache block address closes combining for previous writes. WB writes do not affect write combining. ...

  • AMD x86 - page 175

    Sending Write-Buffer Data to the System
    Once write combining is closed for a 64-byte write buffer, the contents of the write buffer are eligible to be sent to the system as one or more AMD Athlon system bus commands. Table 10 lists t ...

  • AMD x86 - page 176


  • AMD x86 - page 177

    Appendix D Performance-Monitoring Counters
    This chapter describes how to use the AMD Athlon™ processor performance monitoring counters.
    Overview
    The AMD Athlon processor provides four 48-bit performance counters, which allows four types ...

  • AMD x86 - page 178

    These registers can be read from and written to using the RDMSR and WRMSR instructions, respectively. The PerfEvtSel[3:0] registers are located at MSR locations C001_0000h to C001_0003h. The PerfCtr[3:0] registers are l ...

  • AMD x86 - page 179

    Unit Mask Field (Bits 8–15). These bits are used to further qualify the event selected in the event select field. For example, for some cache events, the mask is used as a MESI-protocol qualifier of cache states. See Tab ...

  • AMD x86 - page 180

    greater than or equal to the counter mask. Otherwise, if this field is zero, then the counter increments by the total number of events. Table 11. Performance-Monitoring Counters. Columns: Event Number | Source Unit | Notes / Unit Mask (bits 1 ...

  • AMD x86 - page 181

    65h | BU | 1xxx_xxxxb = reserved; x1xx_xxxxb = WB; xx1x_xxxxb = WP; xxx1_xxxxb = WT; bits 11–10 = reserved; xxxx_xx1xb = WC; xxxx_xxx1b = UC | System requests with the selected type. 73h | BU | bits 15–11 = reserved; xxxx_x1xxb = L2 (L2 hit and no DC h ...
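
As a rough illustration of how the unit mask for event 65h might be combined with the event number, the sketch below packs both into one integer. The unit mask position (bits 8–15) and the individual mask bits come from the guide. The event select field sitting in bits 0–7 is an assumption, since the visible text does not state it, and `perf_evt_sel` is a hypothetical helper that leaves every other field at zero.

```python
# Unit-mask bits for event 65h, as listed in the table row above
UC = 0b0000_0001
WC = 0b0000_0010
WT = 0b0001_0000
WP = 0b0010_0000
WB = 0b0100_0000

EVENT_SYSTEM_REQUESTS = 0x65


def perf_evt_sel(event: int, unit_mask: int) -> int:
    """Pack an event number and a unit mask into a PerfEvtSel-style value.

    Assumption: the event select field is bits 0-7. The unit mask in
    bits 8-15 is taken from the guide. All other fields stay zero.
    """
    if not 0 <= event <= 0xFF:
        raise ValueError("event must fit in 8 bits")
    if not 0 <= unit_mask <= 0xFF:
        raise ValueError("unit mask must fit in 8 bits")
    return event | (unit_mask << 8)


# Count system requests of type WC or UC
value = perf_evt_sel(EVENT_SYSTEM_REQUESTS, WC | UC)
```

Writing such a value to hardware would still require WRMSR in privileged code, plus the enable and mode bits that this sketch does not model.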

  • AMD x86 - page 182

    7Ah | BU | Cycles that at least one fill request waited to use the L2. 80h | PC | Instruction cache fetches. 81h | PC | Instruction cache misses. 82h | PC | Instruction cache refills from L2. 83h | PC | Instruction cache refills from system. 84h | PC | L1 ITLB m ...

  • AMD x86 - page 183

    PerfCtr[3:0] MSRs (MSR Addresses C001_0004h–C001_0007h). The performance-counter MSRs contain the event or duration counts for the selected events being counted. The RDPMC instruction can be used by programs or pr ...
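
The MSR ranges quoted in this appendix (C001_0000h to C001_0003h for PerfEvtSel[3:0], C001_0004h to C001_0007h for PerfCtr[3:0]) can be turned into a small address helper. The sketch assumes that register n sits at base + n, which the ranges suggest but the visible text does not spell out. The function names are hypothetical.

```python
PERF_EVT_SEL_BASE = 0xC001_0000  # PerfEvtSel0
PERF_CTR_BASE = 0xC001_0004      # PerfCtr0
NUM_COUNTERS = 4


def _check_index(n: int) -> None:
    if not 0 <= n < NUM_COUNTERS:
        raise ValueError("counter index must be 0..3")


def perf_evt_sel_msr(n: int) -> int:
    """MSR address of PerfEvtSel[n], assuming consecutive numbering."""
    _check_index(n)
    return PERF_EVT_SEL_BASE + n


def perf_ctr_msr(n: int) -> int:
    """MSR address of PerfCtr[n], assuming consecutive numbering."""
    _check_index(n)
    return PERF_CTR_BASE + n
```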

  • AMD x86 - page 184

    allows writing both positive and negative values to the performance counters. The performance counters may be initialized using a 64-bit signed integer in the range -2^47 to +2^47. Negative values are usef ...
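
The preview is cut off before it says what negative values are used for. A common use, assumed here rather than quoted from the guide, is to preload a counter with -N so that it wraps to zero after N events. The sketch below models that on a 48-bit counter. `preload_for_events` and `as_counter_bits` are hypothetical helpers.

```python
COUNTER_BITS = 48
COUNTER_MASK = (1 << COUNTER_BITS) - 1
MAX_EVENTS = 1 << 47  # magnitude limit taken from the guide's stated range


def preload_for_events(n: int) -> int:
    """Signed initial value that makes a counter reach zero after n events."""
    if not 1 <= n <= MAX_EVENTS:
        raise ValueError("n must be between 1 and 2**47")
    return -n


def as_counter_bits(value: int) -> int:
    """Two's-complement view of a signed value in a 48-bit counter."""
    return value & COUNTER_MASK


# Preload for an overflow after 1000 events, then simulate the 1000 increments
start = as_counter_bits(preload_for_events(1000))
after = (start + 1000) & COUNTER_MASK
```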

  • AMD x86 - page 185

    The initialization and start counters procedure sets the PerfEvtSel0 and/or PerfEvtSel1 MSRs for the events to be counted and the method used to count them, and initializes the counter MSRs (PerfCtr[3:0]) to starting counts. ...

  • AMD x86 - page 186

    An event monitor application utility or another application program can read the collected performance information of the profiled application. ...

  • AMD x86 - page 187

    Appendix E: Programming the MTRR and PAT. Introduction. The AMD Athlon™ processor includes a set of memory type and range registers (MTRRs) to control cacheability and access to specified memory regions. The processor also includ ...

  • AMD x86 - page 188

    There are two types of address ranges: fixed and variable. (See Figure 12.) For each address range, there is a memory type. For each 4K, 16K, or 64K segment within the first 1 Mbyte of memory, ther ...

  • AMD x86 - page 189

    Figure 12. MTRR Mapping of Physical Memory. Labels recoverable from the figure: address span 0 to FFFFFFFFh; 8 fixed ranges of 64 Kbytes each (512 Kbytes); 16 fixed ranges of 16 Kbytes each (256 Kbytes); 64 fixed ranges of 4 Kbytes each (256 Kbytes); boundaries at 80000h, C0 ...
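
The range counts in Figure 12 (8 × 64 Kbytes, 16 × 16 Kbytes, 64 × 4 Kbytes) add up to the first 1 Mbyte, which allows a small lookup sketch. The ordering of the three regions and the C0000h boundary are assumptions derived from the sizes and the visible 80000h label, since the preview is truncated. `fixed_range_index` is a hypothetical helper.

```python
KB = 1024

# (start, end_exclusive, segment_size), derived from the figure labels
FIXED_REGIONS = [
    (0x00000, 0x80000, 64 * KB),   # 8 ranges of 64 Kbytes
    (0x80000, 0xC0000, 16 * KB),   # 16 ranges of 16 Kbytes
    (0xC0000, 0x100000, 4 * KB),   # 64 ranges of 4 Kbytes
]


def fixed_range_index(addr: int):
    """Return (region, index_in_region, segment_size), or None above 1 Mbyte."""
    for region, (start, end, size) in enumerate(FIXED_REGIONS):
        if start <= addr < end:
            return region, (addr - start) // size, size
    return None
```

For example, `fixed_range_index(0xB8000)` falls in the 16-Kbyte region, while any address at or above 100000h is left to the variable ranges.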

  • AMD x86 - page 190

    Memory Types. Five standard memory types are defined by the AMD Athlon processor: writethrough (WT), writeback (WB), write-protect (WP), write-combining (WC), and uncacheable (UC). These are described in T ...

  • AMD x86 - page 191

    MTRR Default Type Register Format. The MTRR default type register is defined as follows. Figure 14. MTRR Default Type Register Format. E: MTRRs are enabled when set. All MTRRs (both fixed and variable range) ...

  • AMD x86 - page 192

    Note that if two or more variable memory ranges match, then the interactions are defined as follows: 1. If the memory types are identical, then that memory type is used. 2. If one or more of the memory type ...

  • AMD x86 - page 193

    not affected by this issue; only the variable range (and MTRR DefType) registers are affected. Page Attribute Table (PAT). The Page Attribute Table (PAT) is an extension of the page table entry format, which allows t ...

  • AMD x86 - page 194

    Accessing the PAT. A 3-bit index consisting of the PATi, PCD, and PWT bits of the page table entry is used to select one of the eight PAT register fields to acquire the memory type for the desired page (PATi is d ...
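
The 3-bit index can be sketched as below. The bit order (PATi most significant, PWT least) follows the order in which the guide lists the bits and is an assumption. The PTE bit positions used in `pat_index_from_pte_4k` (PWT = bit 3, PCD = bit 4, PATi = bit 7 for 4-Kbyte pages) come from the usual x86 paging layout rather than from the truncated preview. Both function names are hypothetical.

```python
def pat_index(pat_i: int, pcd: int, pwt: int) -> int:
    """Combine the three page-table bits into a PAT field index (0..7)."""
    return ((pat_i & 1) << 2) | ((pcd & 1) << 1) | (pwt & 1)


def pat_index_from_pte_4k(pte: int) -> int:
    """Extract the index from a 4-Kbyte page table entry.

    Assumed bit positions: PWT = bit 3, PCD = bit 4, PATi = bit 7.
    """
    pwt = (pte >> 3) & 1
    pcd = (pte >> 4) & 1
    pat_i = (pte >> 7) & 1
    return pat_index(pat_i, pcd, pwt)
```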

  • AMD x86 - page 195

    Table 15. Effective Memory Type Based on PAT and MTRRs. Columns: PAT Memory Type | MTRR Memory Type | Effective Memory Type. UC- | WB, WT, WP, WC | UC-Page; UC- | UC | UC-MTRR; WC | x | WC; WT | WB, WT | WT; WT | UC | UC; WT | WC | CD; WT | WP | CD; WP | WB, WP | WP; WP | UC | UC-MTRR; WP | WC, WT | CD ...

  • AMD x86 - page 196

    Table 16. Final Output Memory Types. Column headings as extracted: Input Memory Type, Output Memory Type, Note; subheadings RdMem, WrMem, Effective MType, forceCD (note 5), AMD-751, RdMem, WrMem, MemType. ●● UC - ●● UC 1 ●● CD - ●● CD 1 ●● WC - ●● WC 1 ● ...

  • AMD x86 - page 197

    ●● CD - ●● CD ●● WC - ●● WC ●● WT - ●● WT ●● WP - ●● WP ●● WB - ●● WT 4 ●● - ●● ● CD 2 Notes: 1. WP is not functional for RdMem/WrMem. 2. ForceCD must cause the MTRR memory type to be ...

  • AMD x86 - page 198

    MTRR Fixed-Range Register Format. The memory types defined for memory segments defined in each of the MTRR fixed-range registers are defined in Table 17. (Also see “Standard MTRR Types and Properties” on page 176.) ...

  • AMD x86 - page 199

    Variable-Range MTRRs. A variable MTRR can be programmed to start at address 0000_0000h because the fixed MTRRs always override the variable ones. However, it is recommended not to create an overlap. The upper two va ...

  • AMD x86 - page 200

    Figure 17. MTRRphysMaskn Register Format. Note: A software attempt to write to reserved bits will generate a general protection exception. Physical Mask: Specifies a 24-bit mask to determine the range of the region defined in t ...
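
The preview stops before explaining how the 24-bit mask is applied. Under the usual x86 MTRR rule, assumed here and not quoted from the guide, an address belongs to a variable range when it agrees with the base on every bit the mask selects. The sketch also assumes 36-bit physical addresses, that is, 24 mask bits above a 12-bit page offset. The helper names are hypothetical.

```python
PAGE_SHIFT = 12             # assumption: mask covers address bits 35-12
MASK_FIELD = (1 << 24) - 1  # the 24-bit Physical Mask field


def phys_mask_for_size(size: int) -> int:
    """24-bit mask for a naturally aligned power-of-two region of `size` bytes."""
    if size < (1 << PAGE_SHIFT) or size & (size - 1):
        raise ValueError("size must be a power of two of at least 4 Kbytes")
    return (~(size - 1) >> PAGE_SHIFT) & MASK_FIELD


def in_variable_range(addr: int, base: int, mask: int) -> bool:
    """True when addr and base agree on every address bit selected by mask."""
    return ((addr >> PAGE_SHIFT) & mask) == ((base >> PAGE_SHIFT) & mask)


# Hypothetical 4-Mbyte region starting at E000_0000h
mask_4m = phys_mask_for_size(4 * 1024 * 1024)
```

With this mask, `in_variable_range(0xE0123456, 0xE0000000, mask_4m)` holds, while an address 4 Mbytes higher does not match.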

  • AMD x86 - page 201

    MTRR MSR Format. This table defines the model-specific registers related to the memory type range register implementation. All MTRRs are defined to be 64 bits. Table 18. MTRR-Related Model-Specific Register (MSR) Map. Register ...

  • AMD x86 - page 202

    186 Page Attribute Table (PAT), AMD Athlon™ Processor x86 Code Optimization, 22007E/0, November 1999 ...

  • AMD x86 - page 203

    Appendix F: Instruction Dispatch and Execution Resources. This chapter describes the MacroOPs generated by each decoded instruction, along with the relative static execution latencies of these groups of operations. ...

  • AMD x86 - page 204

    ■ disp16/32: 16-bit or 32-bit displacement value ■ disp32/48: 32-bit or 48-bit displacement value ■ eXX: register width depending on the operand size ■ mem32real: 32-bit floating-point value ...

  • AMD x86 - page 205

    ADC mreg8, reg8 10h 11-xxx-xxx DirectPath; ADC mem8, reg8 10h mm-xxx-xxx DirectPath; ADC mreg16/32, reg16/32 11h 11-xxx-xxx DirectPath; ADC mem16/32, reg16/32 11h mm-xxx-xxx DirectPath; ADC reg8, mreg8 12h 11 ...

  • AMD x86 - page 206

    AND mem8, reg8 20h mm-xxx-xxx DirectPath; AND mreg16/32, reg16/32 21h 11-xxx-xxx DirectPath; AND mem16/32, reg16/32 21h mm-xxx-xxx DirectPath; AND reg8, mreg8 22h 11-xxx-xxx DirectPath; AND reg8, mem8 22h mm ...

  • AMD x86 - page 207

    BT mem16/32, imm8 0Fh BAh mm-100-xxx DirectPath; BTC mreg16/32, reg16/32 0Fh BBh 11-xxx-xxx VectorPath; BTC mem16/32, reg16/32 0Fh BBh mm-xxx-xxx VectorPath; BTC mreg16/32, imm8 0Fh BAh 11-111-xxx Vector ...

  • AMD x86 - page 208

    CMOVE/CMOVZ reg16/32, reg16/32 0Fh 44h 11-xxx-xxx DirectPath; CMOVE/CMOVZ reg16/32, mem16/32 0Fh 44h mm-xxx-xxx DirectPath; CMOVG/CMOVNLE reg16/32, reg16/32 0Fh 4Fh 11-xxx-xxx DirectPath; CMOVG/CMOVNLE reg16/3 ...

  • AMD x86 - page 209

    CMP EAX, imm16/32 3Dh DirectPath; CMP mreg8, imm8 80h 11-111-xxx DirectPath; CMP mem8, imm8 80h mm-111-xxx DirectPath; CMP mreg16/32, imm16/32 81h 11-111-xxx DirectPath; CMP mem16/32, imm16/32 81h mm-11 ...

  • AMD x86 - page 210

    DIV EAX, mreg16/32 F7h 11-110-xxx VectorPath; DIV EAX, mem16/32 F7h mm-110-xxx VectorPath; ENTER C8 VectorPath; IDIV mreg8 F6h 11-111-xxx VectorPath; IDIV mem8 F6h mm-111-xxx VectorPath; IDIV EAX, mreg16 ...

  • AMD x86 - page 211

    INC mreg8 FEh 11-000-xxx DirectPath; INC mem8 FEh mm-000-xxx DirectPath; INC mreg16/32 FFh 11-000-xxx DirectPath; INC mem16/32 FFh mm-000-xxx DirectPath; INVD 0Fh 08h VectorPath; INVLPG 0Fh 01h mm-111-xxx VectorPat ...

  • AMD x86 - page 212

    JP/JPE near disp16/32 0Fh 8Ah DirectPath; JNP/JPO near disp16/32 0Fh 8Bh DirectPath; JL/JNGE near disp16/32 0Fh 8Ch DirectPath; JNL/JGE near disp16/32 0Fh 8Dh DirectPath; JLE/JNG near disp16/32 0Fh 8Eh DirectPa ...

  • AMD x86 - page 213

    LOOPE/LOOPZ disp8 E1h VectorPath; LOOPNE/LOOPNZ disp8 E0h VectorPath; LSL reg16/32, mreg16/32 0Fh 03h 11-xxx-xxx VectorPath; LSL reg16/32, mem16/32 0Fh 03h mm-xxx-xxx VectorPath; LSS reg16/32, mem32/48 0Fh ...

  • AMD x86 - page 214

    MOV EDX, imm16/32 BAh DirectPath; MOV EBX, imm16/32 BBh DirectPath; MOV ESP, imm16/32 BCh DirectPath; MOV EBP, imm16/32 BDh DirectPath; MOV ESI, imm16/32 BEh DirectPath; MOV EDI, imm16/32 BFh DirectPath; MOV mre ...

  • AMD x86 - page 215

    NOT mem8 F6h mm-010-xx DirectPath; NOT mreg16/32 F7h 11-010-xxx DirectPath; NOT mem16/32 F7h mm-010-xx DirectPath; OR mreg8, reg8 08h 11-xxx-xxx DirectPath; OR mem8, reg8 08h mm-xxx-xxx DirectPath; OR mreg16/32, ...

  • AMD x86 - page 216

    POP EBX 5Bh VectorPath; POP ESP 5Ch VectorPath; POP EBP 5Dh VectorPath; POP ESI 5Eh VectorPath; POP EDI 5Fh VectorPath; POP mreg16/32 8Fh 11-000-xxx VectorPath; POP mem16/32 8Fh mm-000-xxx VectorPath; POPA/POPA ...

  • AMD x86 - page 217

    RCL mreg8, 1 D0h 11-010-xxx DirectPath; RCL mem8, 1 D0h mm-010-xxx DirectPath; RCL mreg16/32, 1 D1h 11-010-xxx DirectPath; RCL mem16/32, 1 D1h mm-010-xxx DirectPath; RCL mreg8, CL D2h 11-010-xxx Di ...

  • AMD x86 - page 218

    ROL mreg16/32, 1 D1h 11-000-xxx DirectPath; ROL mem16/32, 1 D1h mm-000-xxx DirectPath; ROL mreg8, CL D2h 11-000-xxx DirectPath; ROL mem8, CL D2h mm-000-xxx DirectPath; ROL mreg16/32, CL D3h 11-000-xxx DirectPath; ROL me ...

  • AMD x86 - page 219

    SBB mreg16/32, reg16/32 19h 11-xxx-xxx DirectPath; SBB mem16/32, reg16/32 19h mm-xxx-xxx DirectPath; SBB reg8, mreg8 1Ah 11-xxx-xxx DirectPath; SBB reg8, mem8 1Ah mm-xxx-xxx DirectPath; SBB reg16/32, mreg1 ...

  • AMD x86 - page 220

    SETS mreg8 0Fh 98h 11-xxx-xxx DirectPath; SETS mem8 0Fh 98h mm-xxx-xxx DirectPath; SETNS mreg8 0Fh 99h 11-xxx-xxx DirectPath; SETNS mem8 0Fh 99h mm-xxx-xxx DirectPath; SETP/SETPE mreg8 0Fh 9Ah 11-xxx-xxx Direc ...

  • AMD x86 - page 221

    SHR mem16/32, imm8 C1h mm-101-xxx DirectPath; SHR mreg8, 1 D0h 11-101-xxx DirectPath; SHR mem8, 1 D0h mm-101-xxx DirectPath; SHR mreg16/32, 1 D1h 11-101-xxx DirectPath; SHR mem16/32, 1 D1h mm-101-xxx Dire ...

  • AMD x86 - page 222

    SUB reg8, mreg8 2Ah 11-xxx-xxx DirectPath; SUB reg8, mem8 2Ah mm-xxx-xxx DirectPath; SUB reg16/32, mreg16/32 2Bh 11-xxx-xxx DirectPath; SUB reg16/32, mem16/32 2Bh mm-xxx-xxx DirectPath; SUB AL, imm8 ...

  • AMD x86 - page 223

    XADD mreg8, reg8 0Fh C0h 11-100-xxx VectorPath; XADD mem8, reg8 0Fh C0h mm-100-xxx VectorPath; XADD mreg16/32, reg16/32 0Fh C1h 11-101-xxx VectorPath; XADD mem16/32, reg16/32 0Fh C1h mm-101-xxx VectorP ...

  • AMD x86 - page 224

    Table 20. MMX™ Instructions. Columns: Instruction Mnemonic | Prefix Byte(s) | First Byte | ModR/M Byte | Decode Type | FPU Pipe(s) | Notes. EMMS 0Fh 77h DirectPath FADD/FMUL/FSTORE; MOVD mmreg, reg32 0Fh 6Eh 11-xxx-x ...

  • AMD x86 - page 225

    PANDN mmreg1, mmreg2 0Fh DFh 11-xxx-xxx DirectPath FADD/FMUL; PANDN mmreg, mem64 0Fh DFh mm-xxx-xxx DirectPath FADD/FMUL; PCMPEQB mmreg1, mmreg2 0Fh 74h 11-xxx-xxx DirectPath FADD/FMUL; PCMPEQB mmreg, ...

  • AMD x86 - page 226

    PSRAW mmreg1, mmreg2 0Fh E1h 11-xxx-xxx DirectPath FADD/FMUL; PSRAW mmreg, mem64 0Fh E1h mm-xxx-xxx DirectPath FADD/FMUL; PSRAW mmreg, imm8 0Fh 71h 11-100-xxx DirectPath FADD/FMUL; PSRAD mmreg1, mmreg ...

  • AMD x86 - page 227

    PUNPCKHDQ mmreg1, mmreg2 0Fh 6Ah 11-xxx-xxx DirectPath FADD/FMUL; PUNPCKHDQ mmreg, mem64 0Fh 6Ah mm-xxx-xxx DirectPath FADD/FMUL; PUNPCKHWD mmreg1, mmreg2 0Fh 69h 11-xxx-xxx DirectPath FADD/FMUL; P ...

  • AMD x86 - page 228

    PMINSW mmreg, mem64 0Fh EAh mm-xxx-xxx DirectPath FADD/FMUL; PMINUB mmreg1, mmreg2 0Fh DAh 11-xxx-xxx DirectPath FADD/FMUL; PMINUB mmreg, mem64 0Fh DAh mm-xxx-xxx DirectPath FADD/FMUL; PMOVMSKB reg ...

  • AMD x86 - page 229

    FCMOVB ST(0), ST(i) DAh C0-C7h VectorPath; FCMOVE ST(0), ST(i) DAh C8-CFh VectorPath; FCMOVBE ST(0), ST(i) DAh D0-D7h VectorPath; FCMOVU ST(0), ST(i) DAh D8-DFh VectorPath; FCMOVNB ST(0), ST(i) DBh C0-C7h VectorPa ...

  • AMD x86 - page 230

    FIADD [mem32int] DAh mm-000-xxx VectorPath; FIADD [mem16int] DEh mm-000-xxx VectorPath; FICOM [mem32int] DAh mm-010-xxx VectorPath; FICOM [mem16int] DEh mm-010-xxx VectorPath; FICOMP [mem32int] DAh mm-011 ...

  • AMD x86 - page 231

    FLDCW [mem16] D9h mm-101-xxx VectorPath; FLDENV [mem14byte] D9h mm-100-xxx VectorPath; FLDENV [mem28byte] D9h mm-100-xxx VectorPath; FLDL2E D9h EAh DirectPath FSTORE; FLDL2T D9h E9h DirectPath FSTORE; FLDLG ...

  • AMD x86 - page 232

    FSTCW [mem16] D9h mm-111-xxx VectorPath; FSTENV [mem14byte] D9h mm-110-xxx VectorPath; FSTENV [mem28byte] D9h mm-110-xxx VectorPath; FSTP [mem32real] D9h mm-011-xxx DirectPath FADD/F ...

  • AMD x86 - page 233

    Table 23. 3DNow!™ Instructions. Columns: Instruction Mnemonic | Prefix Byte(s) | imm8 | ModR/M Byte | Decode Type | FPU Pipe(s) | Note. FEMMS 0Fh 0Eh DirectPath FADD/FMUL/FSTORE 2; PAVGUSB mmreg1, mmreg2 0Fh, 0Fh BF ...

  • AMD x86 - page 234

    PFRSQRT mmreg, mem64 0Fh, 0Fh 97h mm-xxx-xxx DirectPath FMUL; PFSUB mmreg1, mmreg2 0Fh, 0Fh 9Ah 11-xxx-xxx DirectPath FADD; PFSUB mmreg, mem64 0Fh, 0Fh 9Ah mm-xxx-xxx DirectPath FADD; PFSUBR mmreg1, mmr ...

  • AMD x86 - page 235

    Appendix G: DirectPath versus VectorPath Instructions. Select DirectPath Over VectorPath Instructions. Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimiz ...

  • AMD x86 - page 236

    Table 25. DirectPath Integer Instructions. Instruction Mnemonic: ADC mreg8, reg8; ADC mem8, reg8; ADC mreg16/32, reg16/32; ADC mem16/32, reg16/32; ADC reg8, mreg8; ADC reg8, mem8; ADC reg16/32, mreg16/32; ADC reg16/32, mem16/32; ADC AL, imm ...

  • AMD x86 - page 237

    CMOVBE/CMOVNA reg16/32, reg16/32; CMOVBE/CMOVNA reg16/32, mem16/32; CMOVE/CMOVZ reg16/32, reg16/32; CMOVE/CMOVZ reg16/32, mem16/32; CMOVG/CMOVNLE reg16/32, reg16/32; CMOVG/CMOVNLE reg16/32, mem16/32; CMOVGE/CMOVNL reg16/32, reg ...

  • AMD x86 - page 238

    JNO short disp8; JB/JNAE short disp8; JNB/JAE short disp8; JZ/JE short disp8; JNZ/JNE short disp8; JBE/JNA short disp8; JNBE/JA short disp8; JS short disp8; JNS short disp8; JP/JPE short disp8; JNP/JPO short disp8; JL/JNGE short dis ...

  • AMD x86 - page 239

    MOV mem16/32, imm16/32; MOVSX reg16/32, mreg8; MOVSX reg16/32, mem8; MOVSX reg32, mreg16; MOVSX reg32, mem16; MOVZX reg16/32, mreg8; MOVZX reg16/32, mem8; MOVZX reg32, mreg16; MOVZX reg32, mem16; NEG mreg8; NEG mem8; NEG mreg16/32; NEG mem1 ...

  • AMD x86 - page 240

    ROL mreg8, CL; ROL mem8, CL; ROL mreg16/32, CL; ROL mem16/32, CL; ROR mreg8, imm8; ROR mem8, imm8; ROR mreg16/32, imm8; ROR mem16/32, imm8; ROR mreg8, 1; ROR mem8, 1; ROR mreg16/32, 1; ROR mem16/32, 1; ROR mreg8, CL; ROR mem8, CL; ROR mreg16/32, CL ...

  • AMD x86 - page 241

    SETL/SETNGE mreg8; SETL/SETNGE mem8; SETGE/SETNL mreg8; SETGE/SETNL mem8; SETLE/SETNG mreg8; SETLE/SETNG mem8; SETG/SETNLE mreg8; SETG/SETNLE mem8; SHL/SAL mreg8, imm8; SHL/SAL mem8, imm8; SHL/SAL mreg16/32, imm8; SHL/SAL mem1 ...

  • AMD x86 - page 242

    XOR reg16/32, mem16/32; XOR AL, imm8; XOR EAX, imm16/32; XOR mreg8, imm8; XOR mem8, imm8; XOR mreg16/32, imm16/32; XOR mem16/32, imm16/32; XOR mreg16/32, imm8 (sign extended); XOR mem16/32, imm8 (sign extended). Table 25. DirectP ...

  • AMD x86 - page 243

    Table 26. DirectPath MMX™ Instructions. Instruction Mnemonic: EMMS; MOVD mmreg, mem32; MOVD mem32, mmreg; MOVQ mmreg1, mmreg2; MOVQ mmreg, mem64; MOVQ mmreg2, mmreg1; MOVQ mem64, mmreg; PACKSSDW mmreg1, mmreg2; PACKSSDW mmreg, mem64; PACK ...

  • AMD x86 - page 244

    PSRLD mmreg, imm8; PSRLQ mmreg1, mmreg2; PSRLQ mmreg, mem64; PSRLQ mmreg, imm8; PSRLW mmreg1, mmreg2; PSRLW mmreg, mem64; PSRLW mmreg, imm8; PSUBB mmreg1, mmreg2; PSUBB mmreg, mem64; PSUBD mmreg1, mmreg2; PSUBD mmreg ...

  • AMD x86 - page 245

    Table 28. DirectPath Floating-Point Instructions. Instruction Mnemonic: FABS; FADD ST, ST(i); FADD [mem32real]; FADD ST(i), ST; FADD [mem64real]; FADDP ST(i), ST; FCHS; FCOM ST(i); FCOMP ST(i); FCOM [mem32real]; FCOM [mem64real]; FCOMP [mem32 ...

  • AMD x86 - page 246

    FSUB ST(i), ST; FSUBP ST, ST(i); FSUBR [mem32real]; FSUBR [mem64real]; FSUBR ST, ST(i); FSUBR ST(i), ST; FSUBRP ST(i), ST; FTST; FUCOM; FUCOMP; FUCOMPP; FWAIT; FXCH. Table 28. DirectPath Floating-Point Instructions. Instruction Mnem ...

  • AMD x86 - page 247

    VectorPath Instructions. The following tables contain VectorPath instructions, which should be avoided in the AMD Athlon processor: ■ Table 29, “VectorPath Integer Instructions,” on page 231 ■ Table 30 ...

  • AMD x86 - page 248

    DIV EAX, mem16/32; ENTER; IDIV mreg8; IDIV mem8; IDIV EAX, mreg16/32; IDIV EAX, mem16/32; IMUL reg16/32, imm16/32; IMUL reg16/32, mreg16/32, imm16/32; IMUL reg16/32, mem16/32, imm16/32; IMUL reg16/32, imm8 (sign ext ...

  • AMD x86 - page 249

    MUL EAX, mem32; OUT imm8, AL; OUT imm8, AX; OUT imm8, EAX; OUT DX, AL; OUT DX, AX; OUT DX, EAX; POP ES; POP SS; POP DS; POP FS; POP GS; POP EAX; POP ECX; POP EDX; POP EBX; POP ESP; POP EBP; POP ESI; POP EDI; POP mreg16/32; POP mem16/32; POPA/POP ...

  • AMD x86 - page 250

    STI; STOSB mem8, AL; STOSW mem16, AX; STOSD mem32, EAX; STR mreg16; STR mem16; SYSCALL; SYSENTER; SYSEXIT; SYSRET; VERR mreg16; VERR mem16; VERW mreg16; VERW mem16; WBINVD; WRMSR; XADD mreg8, reg8; XADD mem8, reg8; XADD mreg16/32, reg ...

  • AMD x86 - page 251

    Table 32. VectorPath Floating-Point Instructions. Instruction Mnemonic: F2XM1; FBLD [mem80]; FBSTP [mem80]; FCLEX; FCMOVB ST(0), ST(i); FCMOVE ST(0), ST(i); FCMOVBE ST(0), ST(i); FCMOVU ST(0), ST(i); FCMOVNB ST(0), ST(i); FCMOVNE ST(0), ST(i); FC ...

  • AMD x86 - page 252

    236 VectorPath Instructions, AMD Athlon™ Processor x86 Code Optimization, 22007E/0, November 1999 ...

  • AMD x86 - page 253

    Index. Numerics: 3DNow!™ Instructions, 10, 107; 3DNow! and MMX™ Intra-Operand Swapping, 112; Clipping, 122; Fast ...

  • AMD x86 - page 254

    Instruction Cache, 131; Control Unit, 134; Decoding ...

  • AMD x86 - page 255

    T: TBYTE Variables, 55; Trigonometric Instructions, 103. V: VectorPath Decoder ...

  • AMD x86 - page 256

    240 Index, AMD Athlon™ Processor x86 Code Optimization, 22007E/0, November 1999 ...

Manufacturer: AMD. Category: Typewriter

