Clusters

Transcript

Lesson 30: Clusters
High Performance Computer Architectures
http://www.dii.unisi.it/~giorgi/teaching/hpca2

All copyrighted figures are copyright of their respective authors.
Figures may be reproduced for classroom or personal educational use
only when the above copyright line is included. They may not be
otherwise reproduced, distributed, or incorporated into other works
without the prior written consent of the publisher.
EXAMPLE: MATRIX MULTIPLY + SUM
SEQUENTIAL PROGRAM:
1   sum = 0;
2   for (i=0; i<N; i++)
3     for (j=0; j<N; j++) {
4       C[i][j] = 0;
5       for (k=0; k<N; k++)
6         C[i][j] = C[i][j] + A[i][k]*B[k][j];
7       sum += C[i][j];
8     }
• MULTIPLY MATRICES A[N,N] BY B[N,N] AND STORE RESULT IN C[N,N]
• ADD ALL ELEMENTS OF C
INTER-PE COMMUNICATION
• IMPLICITLY VIA MEMORY
  • PROCESSORS SHARE SOME MEMORY
  • COMMUNICATION IS IMPLICIT THROUGH LOADS AND STORES
    • NEED TO SYNCHRONIZE
    • NEED TO KNOW HOW THE HARDWARE INTERLEAVES ACCESSES FROM DIFFERENT PROCESSORS
• EXPLICITLY VIA MESSAGES (SENDS AND RECEIVES)
  • NEED TO KNOW THE DESTINATION AND WHAT TO SEND
  • EXPLICIT MESSAGE-PASSING STATEMENTS IN THE CODE
  • CALLED “MESSAGE PASSING”
NO ASSUMPTION IS MADE ABOUT THE RELATIVE SPEED OF PROCESSORS
EXAMPLE: MATRIX MULTIPLY + SUM
SHARED-MEMORY PROGRAM
/* A, B, C, BAR, LV and sum are shared */
/* All other variables are private     */
1a  low = pid*N/nproc;    // pid = 0 ... nproc-1
1b  hi = low + N/nproc;   // each pid handles N/nproc rows of A
1c  mysum = 0; sum = 0;   // A and B are in shared memory
2   for (i=low; i<hi; i++)
3     for (j=0; j<N; j++) {
4       C[i][j] = 0;
5       for (k=0; k<N; k++)
6         C[i][j] = C[i][j] + A[i][k]*B[k][j];  // at the end, matrix C is complete
7       mysum += C[i][j];                       // C is in shared memory
8     }
9   BARRIER(BAR);
10  LOCK(LV);
11  sum += mysum;
12  UNLOCK(LV);
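The BARRIER/LOCK/UNLOCK primitives above are abstract. As a concrete point of reference, here is a minimal pthreads sketch of the same program (my mapping, not part of the lesson: BARRIER becomes a pthread_barrier_t, LOCK/UNLOCK a pthread_mutex_t; N divisible by nproc is assumed; compile with -pthread):

#include <pthread.h>
#include <stdio.h>

#define N 4
#define NPROC 2

static double A[N][N], B[N][N], C[N][N];          /* shared */
static double sum = 0;                            /* shared */
static pthread_barrier_t BAR;                     /* shared */
static pthread_mutex_t LV = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int pid = *(int *)arg;                        /* private */
    int low = pid * N / NPROC, hi = low + N / NPROC;
    double mysum = 0;
    for (int i = low; i < hi; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
            mysum += C[i][j];
        }
    pthread_barrier_wait(&BAR);                   /* BARRIER(BAR) */
    pthread_mutex_lock(&LV);                      /* LOCK(LV)     */
    sum += mysum;
    pthread_mutex_unlock(&LV);                    /* UNLOCK(LV)   */
    return NULL;
}

int main(void) {
    pthread_t t[NPROC]; int id[NPROC];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1; B[i][j] = 1; }
    pthread_barrier_init(&BAR, NULL, NPROC);
    for (int p = 0; p < NPROC; p++) { id[p] = p; pthread_create(&t[p], NULL, worker, &id[p]); }
    for (int p = 0; p < NPROC; p++) pthread_join(t[p], NULL);
    printf("sum = %f\n", sum);                    /* with all-ones inputs: N*N*N = 64 */
    return 0;
}

With A and B filled with ones, every element of C is N, so the printed sum is N*N*N; this checks the partitioning and the locked reduction.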
MESSAGE-PASSING: MULTICOMPUTERS
[Figure: processing nodes, each with a processor (P), cache (C), memory (M), and network interface (NI), connected by an interconnection network]
• PROCESSING NODES INTERCONNECTED BY A NETWORK
• COMMUNICATION CARRIED OUT BY MESSAGE EXCHANGES
• SCALES WELL
• HARDWARE IS INEXPENSIVE
• SOFTWARE IS COMPLEX
SYNCHRONOUS MESSAGE-PASSING
CODE FOR THREAD T1:
A = 10;
SEND(&A,sizeof(A),T2,SEND_A);
A = A+1;
RECV(&C,sizeof(C),T2,SEND_B);
printf(C);

CODE FOR THREAD T2:
B = 5;
RECV(&B,sizeof(B),T1,SEND_A);
B = B+1;
SEND(&B,sizeof(B),T1,SEND_B);
• EACH SEND/RECV HAS 4 OPERANDS:
  • STARTING ADDRESS IN MEMORY
  • SIZE OF MESSAGE
  • DESTINATION/SOURCE THREAD ID
  • TAG CONNECTING SENDS AND RECEIVES
• IN SYNCHRONOUS M-P SENDER BLOCKS UNTIL RECV IS COMPLETED AND
RECEIVER BLOCKS UNTIL MESSAGE HAS BEEN SENT
• NOTE: THIS IS MUCH MORE THAN WAITING FOR MESSAGE PROPAGATION
• QUESTION: WHAT IS THE VALUE PRINTED UNDER SYNCHRONOUS M-P?
• VALUE 10 IS RECEIVED IN B BY T2; B IS INCREMENTED BY 1
• THEN THE NEW VALUE OF B (11) IS SENT AND RECEIVED BY T1 INTO C
• AND THREAD 1 PRINTS “11”
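For reference, the same exchange in real message-passing code: a hedged MPI sketch (my own illustration, not part of the lesson). MPI_Ssend is MPI's synchronous-mode send, which completes only once the matching receive has started; ranks 0 and 1 play T1 and T2:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    enum { SEND_A = 0, SEND_B = 1 };            /* message tags */
    int rank, A, B, C;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                            /* plays T1 */
        A = 10;
        MPI_Ssend(&A, 1, MPI_INT, 1, SEND_A, MPI_COMM_WORLD);
        A = A + 1;
        MPI_Recv(&C, 1, MPI_INT, 1, SEND_B, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%d\n", C);                      /* prints 11, as argued above */
    } else if (rank == 1) {                     /* plays T2 */
        MPI_Recv(&B, 1, MPI_INT, 0, SEND_A, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        B = B + 1;
        MPI_Ssend(&B, 1, MPI_INT, 0, SEND_B, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}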
SYNCHRONOUS MESSAGE-PASSING
• ADVANTAGE: ENFORCES SYNCHRONIZATION
• DISADVANTAGES:
  • PRONE TO DEADLOCK
  • BLOCKS THREADS (NO OVERLAP OF COMMUNICATION WITH COMPUTATION)
• DEADLOCK EXAMPLE:
CODE FOR THREAD T1:
A = 10;
SEND(&A,sizeof(A),T2,SEND_A);
RECV(&C,sizeof(C),T2,SEND_B);

CODE FOR THREAD T2:
B = 5;
SEND(&B,sizeof(B),T1,SEND_B);
RECV(&D,sizeof(D),T1,SEND_A);

TO ELIMINATE THE DEADLOCK: SWAP THE SEND/RECV PAIR IN T2,
OR EMPLOY ASYNCHRONOUS MESSAGE-PASSING
ASYNCHRONOUS MESSAGE-PASSING
CODE FOR THREAD T1:
A = 10;
ASEND(&A,sizeof(A),T2,SEND_A);
<Unrelated computation;>
SRECV(&B,sizeof(B),T2,SEND_B);
CODE FOR THREAD T2:
B = 5;
ASEND(&B,sizeof(B),T1,SEND_B);
<Unrelated computation;>
SRECV(&A,sizeof(A),T1,SEND_A);
• BLOCKING vs NON-BLOCKING MESSAGE PASSING
  • BLOCKING: RESUME ONLY WHEN THE MESSAGE AREA OF MEMORY CAN BE MODIFIED
  • NON-BLOCKING: RESUME EARLY; RELIES ON PROBE MESSAGES TO DETECT COMPLETION
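A hedged MPI analogue of the asynchronous pattern above (my own illustration, not part of the lesson): ASEND corresponds to the non-blocking MPI_Isend, and the completion check that probe messages provide corresponds to MPI_Test/MPI_Wait, which is what lets communication overlap the unrelated computation:

#include <mpi.h>

/* One side of the exchange; the peer runs the mirror image.
   Must be called between MPI_Init and MPI_Finalize.
   Tags 0 and 1 stand in for SEND_A and SEND_B. */
void t1_side(int peer) {
    int A = 10, B;
    MPI_Request req;
    MPI_Isend(&A, 1, MPI_INT, peer, 0 /* SEND_A */, MPI_COMM_WORLD, &req);
    /* <Unrelated computation;> overlaps with the transfer here */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* only now may A safely be reused */
    MPI_Recv(&B, 1, MPI_INT, peer, 1 /* SEND_B */, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}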
SYNCHRONOUS MESSAGE-PASSING
IN GENERAL, MULTIPLE THREADS ARE RUNNING ON A CORE
[Figure: send/receive handshake; (A) and (B) mark the points where sender and receiver resume]
• THIS HANDSHAKE IS FOR SYNCHRONOUS M-P
• IT CAN BE APPLIED TO BLOCKING OR NON-BLOCKING ASYNCHRONOUS M-P
• THE SENDER CONTINUES AFTER SEND (A) AND THE RECEIVER AFTER RECV (B)
• IF BLOCKING, THE SENDER MUST FIRST MAKE A COPY OF THE MESSAGE
SUPPORT FOR MESSAGE PASSING PROTOCOLS
• DATA MUST BE COPIED FROM/TO MEMORY TO/FROM THE NI
  • THE PROCESSOR COULD DO IT
  • DMA (DIRECT MEMORY ACCESS) CAN SPEED UP MESSAGE TRANSFERS AND OFF-LOAD THE PROCESSOR
    • DMA IS PROGRAMMED BY THE PROCESSOR: START ADDRESS AND SIZE
• DEDICATED MESSAGE PROCESSORS
  • USE A SPECIAL PROCESSOR TO PROCESS MESSAGES ON BOTH ENDS, THUS RELIEVING THE COMPUTE PROCESSOR'S O/S FROM DOING IT
• SUPPORT FOR USER-LEVEL MESSAGES
  • BASIC MESSAGE-PASSING SYSTEMS DRIVE THE DMA ENGINE FROM THE O/S
    • THIS IS NEEDED FOR PROTECTION BETWEEN USERS
    • THE MESSAGE IS FIRST COPIED INTO SYSTEM SPACE AND THEN INTO USER SPACE (RECEIVE)
    • THE MESSAGE IS COPIED FROM USER SPACE TO SYSTEM SPACE (SEND)
  • TAG USER MESSAGES SO THAT THEY ARE PICKED UP AND DELIVERED DIRECTLY IN USER MEMORY SPACE (ZERO COPY)
MESSAGE-PASSING SUPPORT
[Figure: hardware support for message passing]
MMUL IN CILK
• Parallelism is about (10^3)^2 = 10^6
Recursive Matrix Multiply
• Divide and conquer:
  • 8 multiplications of n/2 x n/2 matrices
  • 1 addition of n x n matrices
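As a quick sanity check, the standard divide-and-conquer work/span analysis of this recursion (not stated on the slide; it assumes the final matrix addition is itself done with parallel loops):

  W(n) = 8 W(n/2) + Theta(n^2) = Theta(n^3)       (work: 8 half-size multiplies plus one add)
  S(n) = S(n/2) + Theta(log n) = Theta(log^2 n)   (span: the spawned multiplies run in parallel)
  W(n)/S(n) = Theta(n^3 / log^2 n)                (parallelism)

For n = 10^3 this gives even more parallelism than the 10^6 quoted above for the loop version.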
MMUL in CILK – Divide and conquer
template <typename T>
void MMult(T *C, T *A, T *B, int n, int size) {
T *D = new T[n*n];
//base case & partition matrices
cilk_spawn MMult(C11, A11, B11, n/2, size);
cilk_spawn MMult(C12, A11, B12, n/2, size);
cilk_spawn MMult(C22, A21, B12, n/2, size);
cilk_spawn MMult(C21, A21, B11, n/2, size);
cilk_spawn MMult(D11, A12, B21, n/2, size);
cilk_spawn MMult(D12, A12, B22, n/2, size);
cilk_spawn MMult(D22, A22, B22, n/2, size);
MMult(D21, A22, B21, n/2, size);
cilk_sync;
MAdd(C, D, n, size); // C += D;
delete[] D;
}
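MAdd is called above but not shown on the slide. A minimal sketch of what it could look like (an assumption, not the author's code): C is an n x n view into the full output matrix, so its rows are size apart, while D was allocated densely, so its rows are n apart:

#include <cilk/cilk.h>

// hypothetical MAdd: element-wise C += D over an n x n block,
// parallel over rows; 'size' is C's row stride, 'n' is D's
template <typename T>
void MAdd(T *C, T *D, int n, int size) {
    cilk_for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            C[i*size + j] += D[i*n + j];
}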
MMUL IN CUDA
Matrix Multiplication Example
• Generalize the adjacent_difference example
• AB = A * B
  • Each element AB[i,j] = dot(row(A,i), col(B,j))
• Parallelization strategy
  • One thread computes each element AB[i,j]
  • 2D kernel
First Implementation
__global__ void mat_mul(float *a, float *b,
float *ab, int width)
{
// calculate the row & col index of the element
int row = blockIdx.y*blockDim.y + threadIdx.y;
int col = blockIdx.x*blockDim.x + threadIdx.x;
float result = 0;
// do dot product between row of a and col of b
for(int k = 0; k < width; ++k)
result += a[row*width+k] * b[k*width+col];
ab[row*width+col] = result;
}
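For completeness, a hedged host-side launch sketch for this kernel (the device pointers d_a, d_b, d_ab and the 16x16 block shape are assumptions, not from the slides):

// one 2D thread per output element; assumes width is a multiple of 16
dim3 block(16, 16);
dim3 grid(width/16, width/16);
mat_mul<<<grid, block>>>(d_a, d_b, d_ab, width);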
How will this perform?
• How many loads per term of the dot product?
  2 (one from a, one from b) = 8 bytes
• How many floating-point operations?
  2 (multiply & addition)
• Global memory access to flop ratio (GMAC)?
  8 bytes / 2 ops = 4 B/op
• What is the peak fp performance of the GeForce GTX 260?
  805 GFLOPS
• Lower bound on the bandwidth required to reach peak fp performance?
  GMAC * peak FLOPS = 4 * 805 = 3.2 TB/s
• What is the actual memory bandwidth of the GeForce GTX 260?
  112 GB/s
• Then what is an upper bound on the performance of our implementation?
  Actual BW / GMAC = 112 / 4 = 28 GFLOPS
Idea: Use __shared__ memory to reuse global data
• Each input element is read by width threads
• Load each element into __shared__ memory and have several threads use the local version to reduce the memory bandwidth
Tiled Multiply
• Partition kernel loop into phases
• Load a tile of both matrices into __shared__ memory each phase
• Each phase, each thread computes a partial result
Better Implementation
__global__ void mat_mul(float *a, float *b,
float *ab, int width)
{
// shorthand
int tx = threadIdx.x, ty = threadIdx.y;
int bx = blockIdx.x, by = blockIdx.y;
// allocate tiles in __shared__ memory
__shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
__shared__ float s_b[TILE_WIDTH][TILE_WIDTH];
// calculate the row & col index
int row = by*blockDim.y + ty;
int col = bx*blockDim.x + tx;
float result = 0;
Better Implementation
// loop over the tiles of the input in phases
for(int p = 0; p < width/TILE_WIDTH; ++p)
{
// collaboratively load tiles into __shared__
s_a[ty][tx] = a[row*width + (p*TILE_WIDTH + tx)];
s_b[ty][tx] = b[(p*TILE_WIDTH + ty)*width + col];
__syncthreads();
// dot product between row of s_a and col of s_b
for(int k = 0; k < TILE_WIDTH; ++k)
result += s_a[ty][k] * s_b[k][tx];
__syncthreads();
}
ab[row*width+col] = result;
}
Use of Barriers in mat_mul
• Two barriers per phase:
  • __syncthreads after all data is loaded into __shared__ memory
  • __syncthreads after all data is read from __shared__ memory
  • Note that the second __syncthreads in phase p guards the load in phase p+1
• Use barriers to guard data:
  • Guard against using uninitialized data
  • Guard against overwriting live data
First Order Size Considerations
• Each thread block should have many threads
  • TILE_WIDTH = 16 gives 16*16 = 256 threads per block
• There should be many thread blocks
  • 1024*1024 matrices give 64*64 = 4096 thread blocks
  • TILE_WIDTH = 16 gives each SM 3 blocks = 768 threads: full occupancy
• Each thread block performs 2 * 256 = 512 32-bit loads for 256 * (2 * 16) = 8,192 fp ops
  • Memory bandwidth is no longer the limiting factor
Optimization Analysis
Implementation | Global Loads           | Throughput   | SLOCs | Relative Improvement | Improvement/SLOC
Original       | 2N^3                   | 10.7 GFLOPS  | 20    | 1x                   | 1x
Improved       | 2N^2 * (N/TILE_WIDTH)  | 183.9 GFLOPS | 44    | 17.2x                | 7.8x
• Experiment performed on a GT200
• This optimization was clearly worth the effort
• Better performance still possible in theory
TILE_SIZE Effects
[Figure: measured GFLOPS vs. TILE_SIZE, for TILE_SIZE = untiled, 2x2, 4x4, 8x8, 12x12, 14x14, 15x15, 16x16; y-axis from 0 to 200 GFLOPS]