## Intel MMX, SSE, SSE2, SSE3/SSSE3/SSE4 Architectures

Baha Guclu Dundar, SALUC Lab, *Computer Science and Engineering Department, University of Connecticut*

Slides 1-33 are modified from Computer Organization and Assembly Languages Course By Yung-Yu Chuang



















## Compatibility




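- Intel defends its decision to alias the MMX registers onto the FPU registers as necessary for compatibility, but it was arguably a bad decision: the OS could simply have been updated, or patched with a service pack, to save and restore new registers.
- This is why Intel later introduced SSE with its own registers, without any aliasing.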
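
One practical consequence of the aliasing is that MMX code has to execute EMMS before any x87 floating-point code runs. The sketch below illustrates that pattern with compiler intrinsics; `mmx_add_then_scale` is a hypothetical helper, and the example assumes a compiler that exposes `<mmintrin.h>` (GCC and Clang do; some compilers drop MMX intrinsics for 64-bit targets).

```c
#include <assert.h>
#include <mmintrin.h>   /* MMX intrinsics: __m64, _mm_add_pi16, _mm_empty */
#include <stdint.h>
#include <string.h>

/* Adds four 16-bit lanes with MMX, then does floating-point work.
 * Hypothetical helper for illustration only. */
double mmx_add_then_scale(const int16_t *a, const int16_t *b,
                          int16_t *out, double x)
{
    __m64 va, vb, vsum;
    assert(a && b && out);

    memcpy(&va, a, sizeof va);        /* load 64 bits (four int16 lanes) */
    memcpy(&vb, b, sizeof vb);
    vsum = _mm_add_pi16(va, vb);      /* PADDW: four wraparound 16-bit adds */
    memcpy(out, &vsum, sizeof vsum);  /* store the 64-bit result */

    _mm_empty();                      /* EMMS: mark the aliased x87 registers empty */
    return x * 0.5;                   /* floating-point code placed after EMMS */
}
```

SSE code needs no equivalent of `_mm_empty()`, because the XMM registers are not shared with the x87 stack.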





| Category   | Operation                                                     | Wraparound                                                                     | Signed Saturation                      | Unsigned Saturation                        |
|------------|---------------------------------------------------------------|--------------------------------------------------------------------------------|----------------------------------------|--------------------------------------------|
| Arithmetic | Addition<br>Subtraction<br>Multiplication<br>Multiply and Add | PADDB, PADDW, PADDD<br>PSUBB, PSUBW, PSUBD<br>PMULLW, PMULHW<br>PMADDWD        | PADDSB, PADDSW<br>PSUBSB, PSUBSW       | PADDUSB, PADDUSW<br>PSUBUSB, PSUBUSW       |
| Comparison | Compare for Equal<br>Compare for Greater Than                 | PCMPEQB, PCMPEQW, PCMPEQD<br>PCMPGTB, PCMPGTW, PCMPGTD                         |                                        |                                            |
| Conversion | Pack                                                          |                                                                                | PACKSSWB, PACKSSDW                     | PACKUSWB                                   |
| Unpack     | Unpack High<br>Unpack Low                                     | PUNPCKHBW, PUNPCKHWD, PUNPCKHDQ<br>PUNPCKLBW, PUNPCKLWD, PUNPCKLDQ             |                                        |                                            |
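
The difference between the wraparound and saturating columns is easiest to see on a concrete addition. The sketch below uses the SSE2 intrinsics for the 128-bit forms of PADDB, PADDUSB and PADDSB (the MMX forms behave the same way on 64-bit registers); the three function names are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* PADDB: wraparound add of 16 byte lanes (result taken modulo 256). */
void add_wrap_u8(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}

/* PADDUSB: unsigned saturating add, results clamp at 255. */
void add_sat_u8(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
}

/* PADDSB: signed saturating add, results clamp to [-128, 127]. */
void add_sat_i8(const int8_t *a, const int8_t *b, int8_t *out)
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epi8(va, vb));
}
```

For example, 200 + 100 in an unsigned byte lane gives 44 with wraparound and 255 with unsigned saturation, which is why the saturating forms are usually preferred for pixel arithmetic.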

| Category        | Operation                                                           | Packed                                       | Full Quadword                |
|-----------------|---------------------------------------------------------------------|----------------------------------------------|------------------------------|
| Logical         | And<br>And Not<br>Or<br>Exclusive OR                                |                                              | PAND<br>PANDN<br>POR<br>PXOR |
| Shift           | Shift Left Logical<br>Shift Right Logical<br>Shift Right Arithmetic | PSLLW, PSLLD<br>PSRLW, PSRLD<br>PSRAW, PSRAD | PSLLQ<br>PSRLQ               |
|                 |                                                                     | **Doubleword Transfers**                     | **Quadword Transfers**       |
| Data Transfer   | Register to Register<br>Load from Memory<br>Store to Memory         | MOVD<br>MOVD<br>MOVD                         | MOVQ<br>MOVQ<br>MOVQ         |
| Empty MMX State |                                                                     | EMMS                                         |                              |
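
The logical and arithmetic right shifts in the table differ only in what they shift in at the top: zeros for PSRLW, copies of the sign bit for PSRAW. A minimal sketch using the SSE2 intrinsic forms on eight 16-bit lanes follows; `shift_right_2` is a hypothetical helper.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Shifts eight 16-bit lanes right by 2, once arithmetically and once logically. */
void shift_right_2(const int16_t *in, int16_t *arith, uint16_t *logical)
{
    assert(in && arith && logical);
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* PSRAW: the sign bit is replicated, so negative values stay negative. */
    _mm_storeu_si128((__m128i *)arith, _mm_srai_epi16(v, 2));

    /* PSRLW: zeros are shifted in, so the lane is treated as unsigned. */
    _mm_storeu_si128((__m128i *)logical, _mm_srli_epi16(v, 2));
}
```

With an input lane of -16 (bit pattern 0xFFF0), the arithmetic shift yields -4 while the logical shift yields 0x3FFC.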































## Application: matrix transpose




```
char M1[4][8]; // matrix to be transposed
char M2[8][4]; // transposed matrix
int n = 0;
for (int i = 0; i < 4; i++)
    for (int j = 0; j < 8; j++)
        { M1[i][j] = n; n++; }
__asm{
    // move the 4 rows of M1 into MMX registers
    movq mm1, M1
    movq mm2, M1+8
    movq mm3, M1+16
    movq mm4, M1+24
```
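
The listing above only loads the four rows. As a hedged sketch of how the whole 4x8 transpose can be finished with unpack instructions, the version below uses SSE2 intrinsics instead of MMX inline assembly; `transpose_4x8` is a hypothetical name, both matrices are assumed to be stored row-major in 32 contiguous bytes, and this is not necessarily the instruction sequence the original slides used.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Transposes a row-major 4x8 byte matrix (src) into a row-major
 * 8x4 byte matrix (dst). Hypothetical helper for illustration. */
void transpose_4x8(const uint8_t *src, uint8_t *dst)
{
    assert(src && dst);

    /* Load each 8-byte row into the low half of an XMM register (MOVQ). */
    __m128i r0 = _mm_loadl_epi64((const __m128i *)(src + 0));
    __m128i r1 = _mm_loadl_epi64((const __m128i *)(src + 8));
    __m128i r2 = _mm_loadl_epi64((const __m128i *)(src + 16));
    __m128i r3 = _mm_loadl_epi64((const __m128i *)(src + 24));

    /* PUNPCKLBW: interleave the bytes of rows 0/1 and of rows 2/3. */
    __m128i t01 = _mm_unpacklo_epi8(r0, r1);   /* r0[0] r1[0] r0[1] r1[1] ... */
    __m128i t23 = _mm_unpacklo_epi8(r2, r3);   /* r2[0] r3[0] r2[1] r3[1] ... */

    /* PUNPCKLWD / PUNPCKHWD: interleave 16-bit pairs, so every 4-byte
     * group holds one column of src, which is one row of dst. */
    __m128i lo = _mm_unpacklo_epi16(t01, t23); /* dst rows 0..3 */
    __m128i hi = _mm_unpackhi_epi16(t01, t23); /* dst rows 4..7 */

    _mm_storeu_si128((__m128i *)(dst + 0), lo);
    _mm_storeu_si128((__m128i *)(dst + 16), hi);
}
```

Called as `transpose_4x8(&M1[0][0], &M2[0][0])`, this leaves `M2[j][i] == M1[i][j]` for every element.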





























```
#include <xmmintrin.h>  /* SSE: __m128, _mm_shuffle_ps, _mm_mul_ps, _mm_sub_ps */

/* cross product of the xyz lanes of a and b; lane 3 of the result is a[3]*b[3] - a[3]*b[3] */
__m128 _mm_cross_ps( __m128 a , __m128 b ) {
  __m128 ea , eb;
  // set to a[1][2][0][3] , b[2][0][1][3]
  ea = _mm_shuffle_ps( a, a, _MM_SHUFFLE(3,0,2,1) );
  eb = _mm_shuffle_ps( b, b, _MM_SHUFFLE(3,1,0,2) );
  // multiply: xa = ( a1*b2 , a2*b0 , a0*b1 , a3*b3 )
  __m128 xa = _mm_mul_ps( ea , eb );
  // set to a[2][0][1][3] , b[1][2][0][3]
  a = _mm_shuffle_ps( a, a, _MM_SHUFFLE(3,1,0,2) );
  b = _mm_shuffle_ps( b, b, _MM_SHUFFLE(3,0,2,1) );
  // multiply: xb = ( a2*b1 , a0*b2 , a1*b0 , a3*b3 )
  __m128 xb = _mm_mul_ps( a , b );
  // subtract
  return _mm_sub_ps( xa , xb );
}
```
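
The argument order of `_MM_SHUFFLE` is the part of the cross product that is easiest to misread: the macro lists source indices from the highest destination lane down to lane 0. The small sketch below applies the first shuffle from the routine to a plain array so the lane movement can be checked directly; `shuffle_yzxw` is a hypothetical helper.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Applies _MM_SHUFFLE(3,0,2,1) to four floats:
 * out[0] = in[1], out[1] = in[2], out[2] = in[0], out[3] = in[3]. */
void shuffle_yzxw(const float *in, float *out)
{
    assert(in && out);
    __m128 v = _mm_loadu_ps(in);                         /* lane i holds in[i] */
    v = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 0, 2, 1));   /* SHUFPS */
    _mm_storeu_ps(out, v);
}
```

For an input of (x, y, z, w) the result is (y, z, x, w), matching the `a[1][2][0][3]` comment in the cross product.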
























## SSE2 Instructions

Compare:

- cmppd - Compares 2 pairs of 64-bit doubles.
- cmpsd - Compares bottom 64-bit doubles.
- comisd - Compares bottom 64-bit doubles and stores the result in EFLAGS.
- ucomisd - Compares bottom 64-bit doubles and stores the result in EFLAGS. (QNaNs don't throw exceptions with ucomisd, unlike comisd.)
- pcmpxxb - Compares 16 8-bit integers.
- pcmpxxw - Compares 8 16-bit integers.
- pcmpxxd - Compares 4 32-bit integers.

Compare codes (the xx parts above). The full list applies to the floating-point compares, written cmpxxpd and cmpxxsd; the integer pcmpxx instructions provide only eq and gt (greater than).

- eq - Equal to.
- lt - Less than.
- le - Less than or equal to.
- ne - Not equal.
- nlt - Not less than.
- nle - Not less than or equal to.
- ord - Ordered.
- unord - Unordered.
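
Packed compares do not set flags; each lane of the result becomes all ones when the predicate holds and all zeros otherwise. The sketch below shows that for the lt and unord codes on doubles and for the eq code on bytes, and collapses each result to a bit mask so it can be inspected; the three function names are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Bit i is set when a[i] < b[i] (CMPLTPD, then MOVMSKPD). */
int lt_mask_pd(const double *a, const double *b)
{
    assert(a && b);
    __m128d m = _mm_cmplt_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    return _mm_movemask_pd(m);   /* collects the sign bit of each lane */
}

/* Bit i is set when a[i] or b[i] is NaN (CMPUNORDPD). */
int unord_mask_pd(const double *a, const double *b)
{
    assert(a && b);
    __m128d m = _mm_cmpunord_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    return _mm_movemask_pd(m);
}

/* Bit i is set when bytes[i] == value (PCMPEQB, then PMOVMSKB). */
int eq_mask_epi8(const uint8_t *bytes, uint8_t value)
{
    assert(bytes);
    __m128i v = _mm_loadu_si128((const __m128i *)bytes);
    __m128i m = _mm_cmpeq_epi8(v, _mm_set1_epi8((char)value));
    return _mm_movemask_epi8(m);
}
```

Because any ordered comparison involving a NaN is false, a NaN lane clears its bit in `lt_mask_pd` and sets it in `unord_mask_pd`.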





## SSE2 Instructions

Load/Store: (is "minimize cache pollution" the same as "without using cache"?)

- movq - Moves a 64-bit value, clearing the top 64 bits of an XMM register.
- movsd - Moves a 64-bit double, leaving the top 64 bits unchanged if the move is between two XMM registers.
- movapd - Moves 2 aligned 64-bit doubles.
- movupd - Moves 2 unaligned 64-bit doubles.
- movhpd - Moves top 64-bit value to or from an XMM register.
- movlpd - Moves bottom 64-bit value to or from an XMM register.
- movdq2q - Moves bottom 64-bit value into an MMX register.
- movq2dq - Moves an MMX register value to the bottom of an XMM register. Top is cleared to zero.
- movntpd - Moves a 128-bit value to memory without using the cache. NT is "Non Temporal."
- movntdq - Moves a 128-bit value to memory without using the cache.
- movnti - Moves a 32-bit value without using the cache.
- maskmovdqu - Moves 16 bytes based on sign bits of another XMM register.
- pmovmskb - Generates a 16-bit mask from the sign bits of each byte in an XMM register.
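
As a sketch of how a non-temporal store is typically used, the helper below fills a buffer with MOVNTDQ through the `_mm_stream_si128` intrinsic. `stream_fill` is a hypothetical name and the destination is assumed to be 16-byte aligned. On the question above: the non-temporal forms are best read as a hint that the data will not be reused soon, so the processor may write it around the caches; that is the sense of "minimize cache pollution" rather than a guarantee that the cache is never touched.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Fills n_vectors * 16 bytes at dst with `value` using non-temporal stores.
 * Hypothetical helper; dst must be 16-byte aligned. */
void stream_fill(void *dst, uint8_t value, size_t n_vectors)
{
    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;

    /* MOVNTDQ requires a 16-byte aligned memory operand. */
    assert(((uintptr_t)dst & 15u) == 0);

    for (size_t i = 0; i < n_vectors; i++)
        _mm_stream_si128(p + i, v);   /* MOVNTDQ: store with non-temporal hint */

    _mm_sfence();                     /* order the streamed stores before later stores */
}
```

This pattern is usually reserved for large buffers that will not be read back soon; for small or immediately reused data an ordinary store such as `_mm_store_si128` is generally the better choice.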



























