Sistemi embedded a multiprocessore Motivation
Transcript
Sistemi embedded a multiprocessore Motivation
Sistemi embedded a multiprocessore Massimo Bocchi Architetture Digitali per l’Elaborazione dei Segnali LS A.A. 2003/2004 Motivation Performance increase – Pipelined execution – Parallel execution Cost effectiveness – – Custom uniprocessors are less cost effective Microprocessor design cost can be amortized if the processor core can be reused on the same system Multi processor System On Chip (MPSoC (MPSoC)) – Growing transistor count on a single chip should be exploited – Parallelism becomes effective use of chip density Massimo Bocchi, 20/02/2004 ARCES - University of Bologna 2/23 1 Processor-time diagram sequential execution P2 2-level pipelined execution P2 2-level parallel execution P2 P1 A P1 A P1 A A A A A A A Massimo Bocchi, 20/02/2004 C C C C C A A A B C C C C C C C C C B A A A B A A A B A A A B C C C C C A C C A A A B B B C C C ARCES - University of Bologna C 3/23 Main topics Inter-processor communication and data sharing Inter-processor synchronization Interconnection architecture Massimo Bocchi, 20/02/2004 ARCES - University of Bologna 4/23 2 Multiprocessor classification In 1966 Flynn proposed a multiprocessor classification: SISD (single instruction stream – single data stream): a normal single processor based on the von Neumann model SIMD (single instruction stream – multiple data stream): a single instruction stream is generated by a single control unit but used used by several processors; each processor executes the same instruction stream but acts upon different data, generating multiple data streams MISD (multiple instruction stream – single data stream): no existing implementations can be referred to this model MIMD (multiple instruction stream – multiple data stream): one instruction stream is generated for each computer; each instruction instruction acts upon different data Massimo Bocchi, 20/02/2004 ARCES - University of Bologna 5/23 MIMD multiprocessor classification A MIMD multiprocessor system needs a mechanism to load instruction and data memories and to pass information between the processors. Two main models: shared memory multiprocessor systems – Uniform Memory Access (UMA) multiprocessors or Symmetric Multiprocessors (SMP) – NonNon-Uniform Memory Access (NUMA) multiprocessors or Asymmetric Multiprocessors (AMP) multiprocessor systems without shared memory – dataflow multiprocessor systems – messagemessage-passing multiprocessor systems Massimo Bocchi, 20/02/2004 ARCES - University of Bologna 6/23 3 Shared memory model All the processors have access to a common memory space Cache memories create consistency problems on the shared data Massimo Bocchi, 20/02/2004 Memories Interconnection network Processors ARCES - University of Bologna 7/23 Shared memory multiprocessors Implicit communication performed through shared variables Explicit synchronization based on lock/unlock mechanism (semaphores and monitors) Two types of shared memory systems: – Uniform Memory Access (UMA) – NonNon-Uniform Memory Access (NUMA) Massimo Bocchi, 20/02/2004 ARCES - University of Bologna 8/23 4 UMA Multiprocessor Systems All the memory resources feature the same access latency CPUs can access to RAM via bus or interconnection network Cache memories and local (private) memories can be used to reduce bus contention or network traffic. Straightforward programmability Private Private Memory Memory Private Private Memory Memory CPU CPU cache cache Shared Shared Memory Memory Interconnection network Massimo Bocchi, 20/02/2004 ARCES - University of Bologna 9/23 NUMA Multiprocessor Systems Remote memory resources have higher access latency then local memories The system features a single address space visible to all CPUs Potentially higher performance due to higher scalability Difficult to program Massimo Bocchi, 20/02/2004 Node 1 CPU CPU cache cache M Node n CPU CPU cache cache M Interconnection network ARCES - University of Bologna 10/23 10/23 5 Dataflow model This model directly derives from DFG representations An instruction is executed within a module when the operands required are available The computation is data driven since the data stream activates the nodal operations Massimo Bocchi, 20/02/2004 ARCES - University of Bologna 11/23 11/23 Message-passing model Direct links are used to pass data between the processors There are only local memories; each memory can be accessed only by one processor Massimo Bocchi, 20/02/2004 Interconnection network Processors Memories ARCES - University of Bologna 12/23 12/23 6 Message-passing systems elements Node: an independent subsystem containing a processor and local memories and peripherals Link: a physical communication path between two nodes Channel: a communication path between two processes in one processor or between two processes executing on different processors Massimo Bocchi, 20/02/2004 ARCES - University of Bologna 13/23 13/23 Message Passing Systems communication based on send and receive primitives Implicit synchronization through blocking methods Since each node runs as an independent cluster, these system are also referred as multicomputers. Explicit Massimo Bocchi, 20/02/2004 ARCES - University of Bologna 14/23 14/23 7 Message communication mechanisms Process P1 Process P2 Process P3 receive (m1, P1) send (m1, P2) receive (m1, P1) send (m2, P3) Massimo Bocchi, 20/02/2004 ARCES - University of Bologna 15/23 15/23 Interconnection architectures Bus-based interconnection M P Massimo Bocchi, 20/02/2004 M P M P P ARCES - University of Bologna 16/23 16/23 8 Interconnection architectures (2) Switch-based interconnection (network) – Crossbar switch – Omega switching network P M P P M P P M P P M M Massimo Bocchi, 20/02/2004 M M ARCES - University of Bologna 17/23 17/23 Bus-based multiprocessor systems A typical implementation of a shared memory model consists of a busbus-based multiprocessor system (e.g. an AMBA AHBAHB-based system) These systems are not easily expandable to contain a large number of bus masters Synchronization mechanisms are necessary to maintain consistency on the shared data The shared bus and memory generate contentions between bus masters, thus reducing the system speed and the data throughput Massimo Bocchi, 20/02/2004 ARCES - University of Bologna 18/23 18/23 9 Parallel systems architectures Shared Memory P M P P P M P P M P M M M M P P P P M M M M P P P P P Massimo Bocchi, 20/02/2004 ARCES - University of Bologna Switch-based M M Bus-based M Private Memory 19/23 19/23 How many processors? Category Type Number of processors Communication model Message passing 8-256 Physical connection Massimo Bocchi, 20/02/2004 Shared memory NUMA 8-256 UMA 2-64 Switch-based 8-256 Bus-based 2-32 ARCES - University of Bologna 20/23 20/23 10 MPI standard Message Passing Interface is a standard for writing messagemessage-passing programs Supports an extended messagemessage-passing model It is not targeted to a specific platform Both Fortran 77 and C languages are supported Main MPI features are intended to improve performance on scalable parallel computers with specialized interinter-processor communication hardware MPICH is a free portable implementation of MPI standard Massimo Bocchi, 20/02/2004 ARCES - University of Bologna 21/23 21/23 MPI Programming Single Program Multiple Data (SPMD) model Communication: – PointPoint-toto-point – Collective (broadcast primitives) Synchronization – PointPoint-toto-point synchronization is performed by message passing and blocking primitives – Global synchronization is done by broadcast communication paradigm Massimo Bocchi, 20/02/2004 ARCES - University of Bologna 22/23 22/23 11 SPMD Programming Model The same program is loaded by all the processors Each processor can identify itself Total number of processors is known Execution can be different on each CPU: My_ID = getID(); if (my_ID == 1) execute task1; if (my_ID == 2) execute task2; … if (my_ID == N) execute taskN; Massimo Bocchi, 20/02/2004 ARCES - University of Bologna 23/23 23/23 12