Clusters
Transcript
Lesson 30 – Clusters
High Performance Computer Architectures
http://www.dii.unisi.it/~giorgi/teaching/hpca2
Roberto Giorgi, Universita' degli Studi di Siena

All copyrighted figures are copyright of the respective authors. Figures may be reproduced only for classroom or personal educational use, and only when the above copyright line is included. They may not be otherwise reproduced, distributed, or incorporated into other works without the prior written consent of the publisher.

EXAMPLE: MATRIX MULTIPLY + SUM

SEQUENTIAL PROGRAM:

  1 sum = 0;
  2 for (i=0; i<N; i++)
  3   for (j=0; j<N; j++){
  4     C[i,j] = 0;
  5     for (k=0; k<N; k++)
  6       C[i,j] = C[i,j] + A[i,k]*B[k,j];
  7     sum += C[i,j];
  8   }

• MULTIPLY MATRICES A[N,N] BY B[N,N] AND STORE THE RESULT IN C[N,N]
• ADD ALL ELEMENTS OF C

INTER-PE COMMUNICATION

• IMPLICITLY, VIA MEMORY
  • PROCESSORS SHARE SOME MEMORY
  • COMMUNICATION IS IMPLICIT THROUGH LOADS AND STORES
  • NEED TO SYNCHRONIZE
  • NEED TO KNOW HOW THE HARDWARE INTERLEAVES ACCESSES FROM DIFFERENT PROCESSORS
• EXPLICITLY, VIA MESSAGES (SENDS AND RECEIVES)
  • NEED TO KNOW THE DESTINATION AND WHAT TO SEND
  • EXPLICIT MESSAGE-PASSING STATEMENTS IN THE CODE
  • CALLED "MESSAGE PASSING"

NO HYPOTHESIS IS MADE ON THE RELATIVE SPEED OF THE PROCESSORS

EXAMPLE: MATRIX MULTIPLY + SUM

SHARED-MEMORY PROGRAM:

  /* A, B, C, BAR, LV and sum are shared */
  /* All other variables are private     */
  1a low = pid*N/nproc;    /* pid = 0 ... nproc-1          */
  1b hi  = low + N/nproc;  /* rows of A for this processor */
  1c mysum = 0; sum = 0;   /* A and B are in shared memory */
  2  for (i=low; i<hi; i++)
  3    for (j=0; j<N; j++){
  4      C[i,j] = 0;
  5      for (k=0; k<N; k++)
  6        C[i,j] = C[i,j] + A[i,k]*B[k,j];
  7      mysum += C[i,j];  /* at the end the whole matrix C is in shared memory */
  8    }
  9  BARRIER(BAR);
 10  LOCK(LV);
 11  sum += mysum;
 12  UNLOCK(LV);

MESSAGE-PASSING: MULTICOMPUTERS

[Figure: processing nodes, each with a processor (P), cache (C), memory (M) and network interface (NI), connected by an interconnection network]

• PROCESSING NODES INTERCONNECTED BY A NETWORK
• COMMUNICATION CARRIED OUT BY MESSAGE EXCHANGES
• SCALES WELL
• HARDWARE IS INEXPENSIVE
• SOFTWARE IS COMPLEX

SYNCHRONOUS MESSAGE-PASSING

CODE FOR THREAD T1:
  A = 10;
  SEND(&A,sizeof(A),T2,SEND_A);
  A = A+1;
  RECV(&C,sizeof(C),T2,SEND_B);
  printf(C);

CODE FOR THREAD T2:
  B = 5;
  RECV(&B,sizeof(B),T1,SEND_A);
  B = B+1;
  SEND(&B,sizeof(B),T1,SEND_B);

• EACH SEND/RECV HAS 4 OPERANDS:
  • STARTING ADDRESS IN MEMORY
  • SIZE OF THE MESSAGE
  • DESTINATION/SOURCE THREAD ID
  • TAG CONNECTING SENDS AND RECEIVES
• IN SYNCHRONOUS M-P THE SENDER BLOCKS UNTIL THE RECV IS COMPLETED AND THE RECEIVER BLOCKS UNTIL THE MESSAGE HAS BEEN SENT
  • NOTE: THIS IS MUCH MORE THAN WAITING FOR MESSAGE PROPAGATION
• QUESTION: WHAT IS THE VALUE PRINTED UNDER SYNCHRONOUS M-P?
  • VALUE 10 IS RECEIVED IN B BY T2; B IS INCREMENTED BY 1
  • THEN THE NEW VALUE OF B (11) IS SENT AND RECEIVED BY T1 INTO C
  • THREAD T1 THEREFORE PRINTS "11"
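The slides use generic SEND/RECV primitives. As a concrete illustration (my own sketch, not code from the course), the same exchange can be written with MPI, where MPI_Ssend does not complete until the matching receive has started (a close analogue of the synchronous semantics above) and ranks 0 and 1 play the roles of T1 and T2:

  /* Hypothetical MPI rendering of the T1/T2 exchange above (illustration only).
   * Tags play the role of SEND_A and SEND_B. */
  #include <mpi.h>
  #include <stdio.h>

  enum { TAG_A = 0, TAG_B = 1 };

  int main(int argc, char **argv) {
      int rank, A, B, C;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {                 /* plays the role of thread T1 */
          A = 10;
          MPI_Ssend(&A, 1, MPI_INT, 1, TAG_A, MPI_COMM_WORLD);
          A = A + 1;
          MPI_Recv(&C, 1, MPI_INT, 1, TAG_B, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("%d\n", C);           /* prints 11, as argued above */
      } else if (rank == 1) {          /* plays the role of thread T2 */
          B = 5;
          MPI_Recv(&B, 1, MPI_INT, 0, TAG_A, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          B = B + 1;
          MPI_Ssend(&B, 1, MPI_INT, 0, TAG_B, MPI_COMM_WORLD);
      }
      MPI_Finalize();
      return 0;
  }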
SYNCHRONOUS MESSAGE-PASSING

• ADVANTAGE: ENFORCES SYNCHRONIZATION
• DISADVANTAGES:
  • PRONE TO DEADLOCK
  • BLOCKS THREADS (NO OVERLAP OF COMMUNICATION WITH COMPUTATION)
• DEADLOCK EXAMPLE:

CODE FOR THREAD T1:
  A = 10;
  SEND(&A,sizeof(A),T2,SEND_A);
  RECV(&C,sizeof(C),T2,SEND_B);

CODE FOR THREAD T2:
  B = 5;
  SEND(&B,sizeof(B),T1,SEND_B);
  RECV(&D,sizeof(D),T1,SEND_A);

• TO ELIMINATE THE DEADLOCK: SWAP THE SEND/RECV PAIR IN T2, OR EMPLOY ASYNCHRONOUS MESSAGE-PASSING

ASYNCHRONOUS MESSAGE-PASSING

CODE FOR THREAD T1:
  A = 10;
  ASEND(&A,sizeof(A),T2,SEND_A);
  <unrelated computation;>
  SRECV(&B,sizeof(B),T2,SEND_B);

CODE FOR THREAD T2:
  B = 5;
  ASEND(&B,sizeof(B),T1,SEND_B);
  <unrelated computation;>
  SRECV(&A,sizeof(A),T1,SEND_A);

• BLOCKING vs NON-BLOCKING MESSAGE PASSING
  • BLOCKING: RESUME ONLY WHEN THE AREA OF MEMORY CAN BE MODIFIED
  • NON-BLOCKING: RESUME EARLY -- RELIES ON PROBE MESSAGES

SYNCHRONOUS MESSAGE-PASSING

• IN GENERAL, MULTIPLE THREADS ARE RUNNING ON EACH CORE
• THIS HANDSHAKE IS FOR SYNCHRONOUS M-P
• IT CAN BE APPLIED TO BLOCKING OR NON-BLOCKING ASYNCHRONOUS M-P
  • THE SENDER CONTINUES AFTER THE SEND (A) AND THE RECEIVER AFTER THE RECV (B)
  • IF BLOCKING, THE SENDER MUST MAKE A COPY OF THE MESSAGE FIRST

SUPPORT FOR MESSAGE-PASSING PROTOCOLS

• DATA MUST BE COPIED FROM/TO MEMORY TO/FROM THE NI
  • THE PROCESSOR COULD DO IT
  • DMA (DIRECT MEMORY ACCESS) CAN SPEED UP MESSAGE TRANSFERS AND OFF-LOAD THE PROCESSOR
  • DMA IS PROGRAMMED BY THE PROCESSOR: START ADDRESS AND SIZE
• DEDICATED MESSAGE PROCESSORS
  • USE A SPECIAL PROCESSOR TO PROCESS MESSAGES ON BOTH ENDS, THUS RELIEVING THE COMPUTE PROCESSOR AND ITS O/S FROM DOING IT
• SUPPORT FOR USER-LEVEL MESSAGES
  • BASIC MESSAGE-PASSING SYSTEMS DRIVE THE DMA ENGINE FROM THE O/S
    • THIS IS NEEDED FOR PROTECTION BETWEEN USERS
    • THE MESSAGE IS FIRST COPIED INTO SYSTEM SPACE AND THEN INTO USER SPACE (RECEIVE)
    • THE MESSAGE IS COPIED FROM USER SPACE TO SYSTEM SPACE (SEND)
  • TAG USER MESSAGES SO THAT THEY ARE PICKED UP AND DELIVERED DIRECTLY IN USER MEMORY SPACE (ZERO COPY)

MESSAGE-PASSING SUPPORT

MMUL IN CILK

• Parallelism is about (10^3)^2 = 10^6

Recursive Matrix Multiply

• Divide and conquer:
  • 8 multiplications of n/2 x n/2 matrices
  • 1 addition of n x n matrices

MMUL in CILK – Divide and conquer

  template <typename T>
  void MMult(T *C, T *A, T *B, int n, int size) {
    T *D = new T[n*n];
    //base case & partition matrices
    cilk_spawn MMult(C11, A11, B11, n/2, size);
    cilk_spawn MMult(C12, A11, B12, n/2, size);
    cilk_spawn MMult(C22, A21, B12, n/2, size);
    cilk_spawn MMult(C21, A21, B11, n/2, size);
    cilk_spawn MMult(D11, A12, B21, n/2, size);
    cilk_spawn MMult(D12, A12, B22, n/2, size);
    cilk_spawn MMult(D22, A22, B22, n/2, size);
    MMult(D21, A22, B21, n/2, size);
    cilk_sync;
    MAdd(C, D, n, size); // C += D;
    delete[] D;
  }
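The slide elides the base case and the computation of the quadrant pointers (C11..C22, A11..A22, B11..B22, D11..D22). A plausible sketch of that elided part follows; it assumes row-major storage with 'size' as the row stride of C, A and B, and that MMult overwrites C with the product (consistent with D being merged in only at the final MAdd). This is an illustration, not the original course code.

  // Hypothetical completion of "//base case & partition matrices" (an assumption,
  // not from the slides). In a real implementation the base-case test would come
  // before the allocation of D, to avoid allocating a temporary that is never used.
  if (n <= 32) {                       // illustrative serial cutoff
    for (int i = 0; i < n; ++i)
      for (int j = 0; j < n; ++j) {
        T s = 0;
        for (int k = 0; k < n; ++k)
          s += A[i*size + k] * B[k*size + j];
        C[i*size + j] = s;             // overwrite: this block of C = A*B
      }
    return;
  }
  int h = n / 2;                       // quadrant pointers, row stride 'size'
  T *A11 = A,          *A12 = A + h,
    *A21 = A + h*size, *A22 = A + h*size + h;
  T *B11 = B,          *B12 = B + h,
    *B21 = B + h*size, *B22 = B + h*size + h;
  T *C11 = C,          *C12 = C + h,
    *C21 = C + h*size, *C22 = C + h*size + h;
  // The temporary D (an n x n buffer) is partitioned the same way, using its own
  // row stride.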
MMUL IN CUDA

Matrix Multiplication Example

• Generalize the adjacent_difference example
• AB = A * B
  • Each element AB(i,j) = dot(row(A,i), col(B,j))
• Parallelization strategy
  • One thread per element AB(i,j)
  • 2D kernel

First Implementation

  __global__ void mat_mul(float *a, float *b, float *ab, int width)
  {
    // calculate the row & col index of the element
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    float result = 0;

    // do dot product between row of a and col of b
    for(int k = 0; k < width; ++k)
      result += a[row*width+k] * b[k*width+col];

    ab[row*width+col] = result;
  }

How will this perform?

  How many loads per term of the dot product?        2 (one from a, one from b) = 8 bytes
  How many floating-point operations per term?       2 (multiply and add)
  Global memory access to flop ratio (GMAC)?         8 bytes / 2 ops = 4 B/op
  Peak fp performance of a GeForce GTX 260?          805 GFLOPS
  Bandwidth required to reach peak fp performance?   GMAC * peak FLOPS = 4 * 805 = 3.2 TB/s
  Actual memory bandwidth of a GeForce GTX 260?      112 GB/s
  Upper bound on the performance of this kernel?     actual BW / GMAC = 112 / 4 = 28 GFLOPS

Idea: Use __shared__ memory to reuse global data

• Each input element is read by 'width' threads
• Load each element into __shared__ memory once and have several threads use the local copy, reducing the required memory bandwidth

Tiled Multiply

• Partition the kernel loop into phases
• In each phase, load one TILE_WIDTH x TILE_WIDTH tile of both matrices into __shared__ memory
• In each phase, each thread computes a partial result

Better Implementation

  __global__ void mat_mul(float *a, float *b, float *ab, int width)
  {
    // shorthand
    int tx = threadIdx.x, ty = threadIdx.y;
    int bx = blockIdx.x,  by = blockIdx.y;

    // allocate tiles in __shared__ memory
    __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
    __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];

    // calculate the row & col index
    int row = by*blockDim.y + ty;
    int col = bx*blockDim.x + tx;
    float result = 0;

    // loop over the tiles of the input in phases
    for(int p = 0; p < width/TILE_WIDTH; ++p)
    {
      // collaboratively load tiles into __shared__
      s_a[ty][tx] = a[row*width + (p*TILE_WIDTH + tx)];
      s_b[ty][tx] = b[(p*TILE_WIDTH + ty)*width + col];
      __syncthreads();

      // dot product between row of s_a and col of s_b
      for(int k = 0; k < TILE_WIDTH; ++k)
        result += s_a[ty][k] * s_b[k][tx];
      __syncthreads();
    }

    ab[row*width+col] = result;
  }

Use of Barriers in mat_mul

• Two barriers per phase:
  • __syncthreads after all data is loaded into __shared__ memory
  • __syncthreads after all data is read from __shared__ memory
  • Note that the second __syncthreads in phase p guards the loads in phase p+1
• Use barriers to guard data:
  • Guard against using uninitialized data
  • Guard against overwriting live data
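The slides show only the kernel. A minimal host-side launch sketch is given below, assuming square matrices whose width is a multiple of TILE_WIDTH and device buffers d_a, d_b, d_ab already allocated and initialized; the helper name and buffer names are illustrative, not from the course:

  #define TILE_WIDTH 16

  void launch_mat_mul(float *d_a, float *d_b, float *d_ab, int width)
  {
    // one thread per output element: 16x16 = 256 threads per block,
    // one block per TILE_WIDTH x TILE_WIDTH tile of the output matrix
    dim3 block(TILE_WIDTH, TILE_WIDTH);
    dim3 grid(width / TILE_WIDTH, width / TILE_WIDTH);
    mat_mul<<<grid, block>>>(d_a, d_b, d_ab, width);
    cudaDeviceSynchronize();   // wait for the kernel to complete
  }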
First Order Size Considerations

• Each thread block should have many threads
  • TILE_WIDTH = 16 gives 16*16 = 256 threads per block
• There should be many thread blocks
  • 1024*1024 matrices give 64*64 = 4096 thread blocks
• TILE_WIDTH = 16 gives each SM 3 blocks, i.e. 768 threads
  • Full occupancy
• Each thread block performs 2 * 256 = 512 32-bit loads for 256 * (2 * 16) = 8,192 fp ops
  • Memory bandwidth is no longer the limiting factor

Optimization Analysis

  Implementation          Original       Improved
  Global loads            2N^3           2N^2 * (N/TILE_WIDTH)
  Throughput              10.7 GFLOPS    183.9 GFLOPS
  SLOCs                   20             44
  Relative improvement    1x             17.2x
  Improvement/SLOC        1x             7.8x

• Experiment performed on a GT200
• This optimization was clearly worth the effort
• Better performance is still possible in theory

TILE_SIZE Effects

[Figure: measured GFLOPS (0-200 scale) versus TILE_SIZE, for the untiled kernel and for 2x2, 4x4, 8x8, 12x12, 14x14, 15x15 and 16x16 tiles]
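As a back-of-the-envelope cross-check (my own arithmetic, not a figure from the slides), the bandwidth analysis used for the first implementation can be repeated for the tiled kernel:

  Loads per thread per phase:        2 (one element of a, one of b) = 8 bytes
  fp ops per thread per phase:       2 * TILE_WIDTH = 32 (for TILE_WIDTH = 16)
  GMAC of the tiled kernel:          8 bytes / 32 ops = 0.25 B/op  (vs 4 B/op untiled)
  Bandwidth-limited upper bound:     112 GB/s / 0.25 B/op = 448 GFLOPS

Since GMAC shrinks as 4/TILE_WIDTH B/op, larger tiles raise the bandwidth ceiling, which is consistent with the rising trend in the TILE_SIZE chart above.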