Clusters

Transcript

Lesson 30: Clusters
High Performance Computer Architectures
http://www.dii.unisi.it/~giorgi/teaching/hpca2

All copyrighted figures are copyright of their respective authors.
Figures may be reproduced for classroom or personal educational use
only when the above copyright line is included. They may not be
otherwise reproduced, distributed, or incorporated into other works
without the prior written consent of the publisher.
EXAMPLE: MATRIX MULTIPLY + SUM
SEQUENTIAL PROGRAM:
1   sum = 0;
2   for (i=0; i<N; i++)
3     for (j=0; j<N; j++) {
4       C[i][j] = 0;
5       for (k=0; k<N; k++)
6         C[i][j] = C[i][j] + A[i][k]*B[k][j];
7       sum += C[i][j];
8     }
• MULTIPLY MATRICES A[N,N] BY B[N,N] AND STORE RESULT IN C[N,N]
• ADD ALL ELEMENTS OF C
INTER-PE COMMUNICATION
• IMPLICITLY VIA MEMORY
  • PROCESSORS SHARE SOME MEMORY
  • COMMUNICATION IS IMPLICIT THROUGH LOADS AND STORES
    • NEED TO SYNCHRONIZE
    • NEED TO KNOW HOW THE HARDWARE INTERLEAVES ACCESSES FROM DIFFERENT PROCESSORS
• EXPLICITLY VIA MESSAGES (SENDS AND RECEIVES)
  • NEED TO KNOW THE DESTINATION AND WHAT TO SEND
  • EXPLICIT MESSAGE-PASSING STATEMENTS IN THE CODE
  • CALLED “MESSAGE PASSING”
NO ASSUMPTION IS MADE ABOUT THE RELATIVE SPEED OF PROCESSORS
EXAMPLE: MATRIX MULTIPLY + SUM
SHARED-MEMORY PROGRAM
/* A, B, C, BAR, LV and sum are shared */
/* All other variables are private     */
1a  low = pid*N/nproc;    // pid = 0 ... nproc-1
1b  hi = low + N/nproc;   // each pid handles N/nproc rows of A
1c  mysum = 0; sum = 0;   // A and B are in shared memory
2   for (i=low; i<hi; i++)
3     for (j=0; j<N; j++) {
4       C[i][j] = 0;
5       for (k=0; k<N; k++)
6         C[i][j] = C[i][j] + A[i][k]*B[k][j];  // at the end, matrix C is complete
7       mysum += C[i][j];                       // C is in shared memory
8     }
9   BARRIER(BAR);
10  LOCK(LV);
11  sum += mysum;
12  UNLOCK(LV);
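The BARRIER/LOCK/UNLOCK primitives above are abstract. As a concrete point of reference, here is a minimal pthreads sketch of the same program (my mapping, not part of the lesson: BARRIER becomes a pthread_barrier_t, LOCK/UNLOCK a pthread_mutex_t; N divisible by nproc is assumed; compile with -pthread):

#include <pthread.h>
#include <stdio.h>

#define N 4
#define NPROC 2

static double A[N][N], B[N][N], C[N][N];          /* shared */
static double sum = 0;                            /* shared */
static pthread_barrier_t BAR;                     /* shared */
static pthread_mutex_t LV = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int pid = *(int *)arg;                        /* private */
    int low = pid * N / NPROC, hi = low + N / NPROC;
    double mysum = 0;
    for (int i = low; i < hi; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
            mysum += C[i][j];
        }
    pthread_barrier_wait(&BAR);                   /* BARRIER(BAR) */
    pthread_mutex_lock(&LV);                      /* LOCK(LV)     */
    sum += mysum;
    pthread_mutex_unlock(&LV);                    /* UNLOCK(LV)   */
    return NULL;
}

int main(void) {
    pthread_t t[NPROC]; int id[NPROC];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1; B[i][j] = 1; }
    pthread_barrier_init(&BAR, NULL, NPROC);
    for (int p = 0; p < NPROC; p++) { id[p] = p; pthread_create(&t[p], NULL, worker, &id[p]); }
    for (int p = 0; p < NPROC; p++) pthread_join(t[p], NULL);
    printf("sum = %f\n", sum);                    /* with all-ones inputs: N*N*N = 64 */
    return 0;
}

With A and B filled with ones, every element of C is N, so the printed sum is N*N*N; this checks the partitioning and the locked reduction.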
MESSAGE-PASSING: MULTICOMPUTERS
[Figure: processing nodes, each with a processor (P), cache (C), memory (M), and network interface (NI), connected by an interconnection network]
• PROCESSING NODES INTERCONNECTED BY A NETWORK
• COMMUNICATION CARRIED OUT BY MESSAGE EXCHANGES
• SCALES WELL
• HARDWARE IS INEXPENSIVE
• SOFTWARE IS COMPLEX
SYNCHRONOUS MESSAGE-PASSING
CODE FOR THREAD T1:
A = 10;
SEND(&A,sizeof(A),T2,SEND_A);
A = A+1;
RECV(&C,sizeof(C),T2,SEND_B);
printf(C);

CODE FOR THREAD T2:
B = 5;
RECV(&B,sizeof(B),T1,SEND_A);
B = B+1;
SEND(&B,sizeof(B),T1,SEND_B);
• EACH SEND/RECV HAS 4 OPERANDS:
  • STARTING ADDRESS IN MEMORY
  • SIZE OF MESSAGE
  • DESTINATION/SOURCE THREAD ID
  • TAG CONNECTING SENDS AND RECEIVES
• IN SYNCHRONOUS M-P SENDER BLOCKS UNTIL RECV IS COMPLETED AND
RECEIVER BLOCKS UNTIL MESSAGE HAS BEEN SENT
• NOTE: THIS IS MUCH MORE THAN WAITING FOR MESSAGE PROPAGATION
• QUESTION: WHAT IS THE VALUE PRINTED UNDER SYNCHRONOUS M-P?
• VALUE 10 IS RECEIVED IN B BY T2; B IS INCREMENTED BY 1
• THEN THE NEW VALUE OF B (11) IS SENT AND RECEIVED BY T1 INTO C
• AND THREAD 1 PRINTS “11”
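For reference, the same exchange in real message-passing code: a hedged MPI sketch (my own illustration, not part of the lesson). MPI_Ssend is MPI's synchronous-mode send, which completes only once the matching receive has started; ranks 0 and 1 play T1 and T2:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    enum { SEND_A = 0, SEND_B = 1 };            /* message tags */
    int rank, A, B, C;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                            /* plays T1 */
        A = 10;
        MPI_Ssend(&A, 1, MPI_INT, 1, SEND_A, MPI_COMM_WORLD);
        A = A + 1;
        MPI_Recv(&C, 1, MPI_INT, 1, SEND_B, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%d\n", C);                      /* prints 11, as argued above */
    } else if (rank == 1) {                     /* plays T2 */
        MPI_Recv(&B, 1, MPI_INT, 0, SEND_A, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        B = B + 1;
        MPI_Ssend(&B, 1, MPI_INT, 0, SEND_B, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}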
SYNCHRONOUS MESSAGE-PASSING
• ADVANTAGE: ENFORCES SYNCHRONIZATION
• DISADVANTAGES:
  • PRONE TO DEADLOCK
  • BLOCKS THREADS (NO OVERLAP OF COMMUNICATION WITH COMPUTATION)
• DEADLOCK EXAMPLE:
CODE FOR THREAD T1:
A = 10;
SEND(&A,sizeof(A),T2,SEND_A);
RECV(&C,sizeof(C),T2,SEND_B);

CODE FOR THREAD T2:
B = 5;
SEND(&B,sizeof(B),T1,SEND_B);
RECV(&D,sizeof(D),T1,SEND_A);

TO ELIMINATE THE DEADLOCK: SWAP THE SEND/RECV PAIR IN T2,
OR EMPLOY ASYNCHRONOUS MESSAGE-PASSING
ASYNCHRONOUS MESSAGE-PASSING
CODE FOR THREAD T1:
A = 10;
ASEND(&A,sizeof(A),T2,SEND_A);
<Unrelated computation;>
SRECV(&B,sizeof(B),T2,SEND_B);
CODE FOR THREAD T2:
B = 5;
ASEND(&B,sizeof(B),T1,SEND_B);
<Unrelated computation;>
SRECV(&A,sizeof(A),T1,SEND_A);
• BLOCKING vs NON-BLOCKING MESSAGE PASSING
  • BLOCKING: RESUME ONLY WHEN THE MESSAGE AREA OF MEMORY CAN BE MODIFIED
  • NON-BLOCKING: RESUME EARLY; RELIES ON PROBE MESSAGES TO DETECT COMPLETION
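A hedged MPI analogue of the asynchronous pattern above (my own illustration, not part of the lesson): ASEND corresponds to the non-blocking MPI_Isend, and the completion check that probe messages provide corresponds to MPI_Test/MPI_Wait, which is what lets communication overlap the unrelated computation:

#include <mpi.h>

/* One side of the exchange; the peer runs the mirror image.
   Must be called between MPI_Init and MPI_Finalize.
   Tags 0 and 1 stand in for SEND_A and SEND_B. */
void t1_side(int peer) {
    int A = 10, B;
    MPI_Request req;
    MPI_Isend(&A, 1, MPI_INT, peer, 0 /* SEND_A */, MPI_COMM_WORLD, &req);
    /* <Unrelated computation;> overlaps with the transfer here */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* only now may A safely be reused */
    MPI_Recv(&B, 1, MPI_INT, peer, 1 /* SEND_B */, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}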
SYNCHRONOUS MESSAGE-PASSING
IN GENERAL, MULTIPLE THREADS ARE RUNNING ON A CORE
[Figure: send/receive handshake; (A) and (B) mark the points where sender and receiver resume]
• THIS HANDSHAKE IS FOR SYNCHRONOUS M-P
• IT CAN BE APPLIED TO BLOCKING OR NON-BLOCKING ASYNCHRONOUS M-P
• THE SENDER CONTINUES AFTER SEND (A) AND THE RECEIVER AFTER RECV (B)
• IF BLOCKING, THE SENDER MUST FIRST MAKE A COPY OF THE MESSAGE
SUPPORT FOR MESSAGE PASSING PROTOCOLS
• DATA MUST BE COPIED FROM/TO MEMORY TO/FROM THE NI
  • THE PROCESSOR COULD DO IT
  • DMA (DIRECT MEMORY ACCESS) CAN SPEED UP MESSAGE TRANSFERS AND OFF-LOAD THE PROCESSOR
    • DMA IS PROGRAMMED BY THE PROCESSOR: START ADDRESS AND SIZE
• DEDICATED MESSAGE PROCESSORS
  • USE A SPECIAL PROCESSOR TO PROCESS MESSAGES ON BOTH ENDS, THUS RELIEVING THE COMPUTE PROCESSOR'S O/S FROM DOING IT
• SUPPORT FOR USER-LEVEL MESSAGES
  • BASIC MESSAGE-PASSING SYSTEMS DRIVE THE DMA ENGINE FROM THE O/S
    • THIS IS NEEDED FOR PROTECTION BETWEEN USERS
    • THE MESSAGE IS FIRST COPIED INTO SYSTEM SPACE AND THEN INTO USER SPACE (RECEIVE)
    • THE MESSAGE IS COPIED FROM USER SPACE TO SYSTEM SPACE (SEND)
  • TAG USER MESSAGES SO THAT THEY ARE PICKED UP AND DELIVERED DIRECTLY IN USER MEMORY SPACE (ZERO COPY)
MESSAGE-PASSING SUPPORT
[Figure: hardware support for message passing]
MMUL IN CILK
• Parallelism is about (10^3)^2 = 10^6
Recursive Matrix Multiply
• Divide and conquer:
  • 8 multiplications of n/2 x n/2 matrices
  • 1 addition of n x n matrices
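As a quick sanity check, the standard divide-and-conquer work/span analysis of this recursion (not stated on the slide; it assumes the final matrix addition is itself done with parallel loops):

  W(n) = 8 W(n/2) + Theta(n^2) = Theta(n^3)       (work: 8 half-size multiplies plus one add)
  S(n) = S(n/2) + Theta(log n) = Theta(log^2 n)   (span: the spawned multiplies run in parallel)
  W(n)/S(n) = Theta(n^3 / log^2 n)                (parallelism)

For n = 10^3 this gives even more parallelism than the 10^6 quoted above for the loop version.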
MMUL in CILK – Divide and conquer
template <typename T>
void MMult(T *C, T *A, T *B, int n, int size) {
T *D = new T[n*n];
//base case & partition matrices
cilk_spawn MMult(C11, A11, B11, n/2, size);
cilk_spawn MMult(C12, A11, B12, n/2, size);
cilk_spawn MMult(C22, A21, B12, n/2, size);
cilk_spawn MMult(C21, A21, B11, n/2, size);
cilk_spawn MMult(D11, A12, B21, n/2, size);
cilk_spawn MMult(D12, A12, B22, n/2, size);
cilk_spawn MMult(D22, A22, B22, n/2, size);
MMult(D21, A22, B21, n/2, size);
cilk_sync;
MAdd(C, D, n, size); // C += D;
delete[] D;
}
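MAdd is called above but not shown on the slide. A minimal sketch of what it could look like (an assumption, not the author's code): C is an n x n view into the full output matrix, so its rows are size apart, while D was allocated densely, so its rows are n apart:

#include <cilk/cilk.h>

// hypothetical MAdd: element-wise C += D over an n x n block,
// parallel over rows; 'size' is C's row stride, 'n' is D's
template <typename T>
void MAdd(T *C, T *D, int n, int size) {
    cilk_for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            C[i*size + j] += D[i*n + j];
}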
MMUL IN CUDA
Matrix Multiplication Example
• Generalize the adjacent_difference example
• AB = A * B
  • Each element AB[i,j] = dot(row(A,i), col(B,j))
• Parallelization strategy
  • One thread computes each element AB[i,j]
  • 2D kernel
First Implementation
__global__ void mat_mul(float *a, float *b,
float *ab, int width)
{
// calculate the row & col index of the element
int row = blockIdx.y*blockDim.y + threadIdx.y;
int col = blockIdx.x*blockDim.x + threadIdx.x;
float result = 0;
// do dot product between row of a and col of b
for(int k = 0; k < width; ++k)
result += a[row*width+k] * b[k*width+col];
ab[row*width+col] = result;
}
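For completeness, a hedged host-side launch sketch for this kernel (the device pointers d_a, d_b, d_ab and the 16x16 block shape are assumptions, not from the slides):

// one 2D thread per output element; assumes width is a multiple of 16
dim3 block(16, 16);
dim3 grid(width/16, width/16);
mat_mul<<<grid, block>>>(d_a, d_b, d_ab, width);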
How will this perform?
• How many loads per term of the dot product?
  2 (one from a, one from b) = 8 bytes
• How many floating-point operations?
  2 (multiply & addition)
• Global memory access to flop ratio (GMAC)?
  8 bytes / 2 ops = 4 B/op
• What is the peak fp performance of the GeForce GTX 260?
  805 GFLOPS
• Lower bound on the bandwidth required to reach peak fp performance?
  GMAC * peak FLOPS = 4 * 805 = 3.2 TB/s
• What is the actual memory bandwidth of the GeForce GTX 260?
  112 GB/s
• Then what is an upper bound on the performance of our implementation?
  Actual BW / GMAC = 112 / 4 = 28 GFLOPS
Idea: Use __shared__ memory to reuse global data
• Each input element is read by width threads
• Load each element into __shared__ memory and have several threads use the local version to reduce the memory bandwidth
Tiled Multiply
• Partition kernel loop into phases
• Load a tile of both matrices into __shared__ memory each phase
• Each phase, each thread computes a partial result
Better Implementation
__global__ void mat_mul(float *a, float *b,
float *ab, int width)
{
// shorthand
int tx = threadIdx.x, ty = threadIdx.y;
int bx = blockIdx.x, by = blockIdx.y;
// allocate tiles in __shared__ memory
__shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
__shared__ float s_b[TILE_WIDTH][TILE_WIDTH];
// calculate the row & col index
int row = by*blockDim.y + ty;
int col = bx*blockDim.x + tx;
float result = 0;
Better Implementation
// loop over the tiles of the input in phases
for(int p = 0; p < width/TILE_WIDTH; ++p)
{
// collaboratively load tiles into __shared__
s_a[ty][tx] = a[row*width + (p*TILE_WIDTH + tx)];
s_b[ty][tx] = b[(p*TILE_WIDTH + ty)*width + col];
__syncthreads();
// dot product between row of s_a and col of s_b
for(int k = 0; k < TILE_WIDTH; ++k)
result += s_a[ty][k] * s_b[k][tx];
__syncthreads();
}
ab[row*width+col] = result;
}
Use of Barriers in mat_mul
• Two barriers per phase:
  • __syncthreads after all data is loaded into __shared__ memory
  • __syncthreads after all data is read from __shared__ memory
  • Note that the second __syncthreads in phase p guards the load in phase p+1
• Use barriers to guard data:
  • Guard against using uninitialized data
  • Guard against overwriting live data
First Order Size Considerations
• Each thread block should have many threads
  • TILE_WIDTH = 16 gives 16*16 = 256 threads per block
• There should be many thread blocks
  • 1024*1024 matrices give 64*64 = 4096 thread blocks
  • TILE_WIDTH = 16 gives each SM 3 blocks = 768 threads: full occupancy
• Each thread block performs 2 * 256 = 512 32-bit loads for 256 * (2 * 16) = 8,192 fp ops
  • Memory bandwidth is no longer the limiting factor
Optimization Analysis
Implementation | Global Loads           | Throughput   | SLOCs | Relative Improvement | Improvement/SLOC
Original       | 2N^3                   | 10.7 GFLOPS  | 20    | 1x                   | 1x
Improved       | 2N^2 * (N/TILE_WIDTH)  | 183.9 GFLOPS | 44    | 17.2x                | 7.8x
• Experiment performed on a GT200
• This optimization was clearly worth the effort
• Better performance still possible in theory
TILE_SIZE Effects
[Figure: measured GFLOPS vs. TILE_SIZE, for TILE_SIZE = untiled, 2x2, 4x4, 8x8, 12x12, 14x14, 15x15, 16x16; y-axis from 0 to 200 GFLOPS]