An Ultra-low energy asynchronous processor for Wireless Sensor Networks
L. Necchi, L. Lavagno, D. Pandini, L. Vanzago
Politecnico di Torino, ST Microelectronics
Async’06 - March 13-15 Luca Necchi – Politecnico di Torino

Wireless Sensor Networks
- Ad-hoc wireless networks
- Sensing
- Computation
- Actuation
Application areas:
- Monitoring
- Building automation
- Health care, Medical
- Emergency response
- Automotive

Key WSN Requirements
- Flexibility (general purpose design)
- High energy efficiency (battery powered)
- Extremely wide voltage supply range
  - Exhausted battery or energy scavenging
- Fast and inexpensive wake-up
  - Event-driven power management (not predictable)
- Sporadic high computational load
  - Encryption (security)
  - Aggregation, distributed data processing

Sensor node architecture
Main components of a WSN node:
- Microcontroller (Atmel AVR, TI MSP430)
- Memory
- Radio
- Sensors / Actuators
- Power supply
  - Battery (energy storage)
  - Power scavenging

Circuit-level Power Management
[Table: management technique (Clock Gating, Power Gating, Dynamic Voltage Scaling, Adaptive Body Biasing) vs. energy saved while Idle / Active, with scenarios Idle Time and Long Deadlines; cell marks not recoverable]
DVS can be obtained by:
- Off-line pre-computed voltage/frequency tables
  - High delay margins
- Evaluated on-line: PowerWise, Razor, Asynchronous, De-synchronization

Closed-loop DVS technique
- PowerWise: samples the output of a digital delay line with a high-frequency clock and adjusts the supply voltage to deliver the required performance.
- Razor: detects timing errors by comparing the values stored in duplicated slave latches, the second of which is clocked half a clock cycle later; on an error it restarts the pipeline and adjusts the supply voltage accordingly.
- Asynchronous with Dual-Rail encoding: (quasi) delay-insensitive implementation that guarantees correctness for (almost) every supply voltage and process variation.
- Asynchronous with Bundled Data encoding: the output of a digital delay line is used directly to generate a local clock signal, so the delay period depends directly on the supply voltage.

De-synchronization
[Figure: "Synchronous" circuit with CLK, "Desynchronize" step, "Asynchronous" circuit with CLK]

Design Flow
[Figure: flow with stages HDL, RTL Synthesis & Optimization, Library, Netlist, De-synchronization, Netlist, Physical Design, Layout]
Obtain asynchronous implementation from synchronous specification:
- Think synchronously
- Design synchronously
- De-synchronize (automatically)
- Test synchronously
- Run asynchronously

Synchronous circuit
[Figure: MS flip-flops drawn as pairs of latches (L) labeled 0 and 1, driven by CLK]

De-synchronization
[Figure: the same latches (L) labeled 0 and 1, each driven by a controller (C) instead of CLK]

De-synchronization
- Distributed micropipeline-style controllers substitute the clock network
- The data path remains intact!
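Micropipeline-style controllers such as those named on the slide above are commonly built around Muller C-elements. The following is a minimal behavioral sketch of that building block, not the controller used in the talk; the class name CElement and the input sequence are illustrative assumptions. A C-element changes its output only when both inputs agree and otherwise holds its previous value, which is what lets a handshake wait for both neighbours.

```python
class CElement:
    """Behavioral model of a two-input Muller C-element (illustrative only)."""

    def __init__(self, initial=0):
        # The element is stateful: it remembers its last output.
        self.out = initial

    def update(self, a, b):
        # When both inputs agree, the output follows them.
        # When they differ, the previous output is held.
        if a == b:
            self.out = a
        return self.out


# Hypothetical input sequence: (request from previous stage, acknowledge from next stage)
c = CElement()
inputs = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
trace = [c.update(a, b) for a, b in inputs]
print(trace)  # [0, 0, 1, 1, 0]
```

The output rises only after both inputs are 1 and falls only after both are 0, so a single late input stalls the stage instead of corrupting it.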
Flow equivalence [Guernic, Talpin, Lann, 2003]
[Figure: CLK and the value sequences on latches A and B, shown for the synchronous behavior and for the de-synchronized behavior]
Theorem: The de-synchronization model preserves flow-equivalence

De-synchronization Benefits
For the end user:
- Reduced electromagnetic emission
- Process variation tolerance
  - Enables partial average-case design, wrt process & environment variation (not wrt data-dependent delay)
- The resulting circuit will be:
  - Ready for frequency and voltage scaling
  - Inherently more robust to delay variations
  - Virtually no performance or area overhead wrt synchronous
For the designer:
- Conventional EDA tools and design flow
- Limited design time and effort, fully automated
- Re-use legacy designs

Asynchronous advantages not offered by de-synchronization
- Fine-grained power management
  - The desynchronized circuit inherits the synchronous clock gating
- Fine-grained pipelining
  - The pipeline structure is not changed
- Data-dependent delays
  - Could be exploited by using a datapath with completion detection (work in progress)
- Robustness with respect to uncorrelated local variability
  - Would require completion detection

Synchronous Logic Interfacing
[Figure: de-synchronized stages (CL, L) with controllers (C) interfaced to FAST LOGIC. Labels:]
- Data path (not modified)
- Handshaking line

Synchronous Logic Interfacing
[Figure: de-synchronized stages (CL, L) with controllers (C) interfaced to SLOW LOGIC driven by an External CLK]
- Synchronized with an external slower clock
  - Just low EMI

Synchronous Logic Interfacing
[Figure: de-synchronized stages (CL, L) with controllers (C) interfaced to SELF TIMED LOGIC]
- Example: SRAM with Completion Detection

Sensor node architecture
Main components of a WSN node:
- Microcontroller (Atmel AVR)
- Memory
- Radio
- Sensors / Actuators
- Power supply
  - Battery (energy storage)
  - Power scavenging

Our Case Study
- Application-independent 8-bit CPU architecture:
  - Atmel AVR instruction set (like MICA2, MICAZ), from OpenCores.org, implemented in a 130 nm technology
  - Toolchain and lots of software are ready to use: nesC, TinyOS, TinyDB, Surge, Tossim
- Aggressive energy management enabled by de-synchronization, using:
  - Dynamic Voltage Scaling
  - Zero wake-up time (no CLK, no wait for PLL to restart)

Typical AVR architecture
[Figure: INSTR. Memory, DATA Memory, MEM Access, Instruction FETCH, Instruction DECODE, ALU Execution, Data Path (8 bit), Address bus, External CLK, Clk distribution]

Design Choices
- Main target is energy efficiency (vs speed)
  - Large delay margins (100%) to increase robustness at low voltage supply
- AVR core is really small (~4500 gates), hence we used a single controller
  - Reduced area overhead
  - No electromagnetic emission reduction

De-synchronized AVR
[Figure: INSTR. Memory, DATA Memory, MEM Access, Instruction FETCH, Instruction DECODE, ALU Execution, Data Path, Address bus, a single controller (C), Handshake signal distribution, Delay chain]

Logic and Delay Line Matching
[Figure]

Energy Efficiency
[Figure: Energy per Instruction, Power Consumption, Leakage per instruction and Logic Delay plotted against Voltage Supply [V]]

Some Past Work Comparison
- Philips 80c51 (H. van Gageldonk, 1998): asynchronous bundled-data implementation of the 8051 ISA, general purpose.
- Lutonium (A. Martin et al., 2003): asynchronous QDI implementation of the 8051 ISA.
- Snap/le (V. Ekanayake et al., 2004): asynchronous QDI processor specifically designed for WSN.
- Razor (D. Ernst et al., 2004): synchronous processor that estimates the best Vdd by dynamically monitoring the delay of the logic using a redundant latching scheme.

CONCLUSIONS
- Aggressive energy management using DVS
  - 14 pJ/Instr @ 1.2 V (170 MIPS)
  - 2.7 pJ/Instr @ 0.51 V (48 MIPS)
- Minimal overhead wrt synchronous counterpart
  - +6% area (due to FF->latch conversion)
  - -20% speed (could be improved by reducing margins)
- Future work:
  - Analysis with other "SPICE-like" simulators (Hsim)
  - Statistical simulations to check robustness wrt process variability (Monte Carlo)
  - Fabrication (?)
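The two operating points quoted in the conclusions can be cross-checked with simple arithmetic, since average power is energy per instruction times instruction rate. The sketch below does that and also compares the reported energy ratio with first-order quadratic (C·V²) scaling of dynamic energy. That comparison is a textbook approximation added here as an assumption, not an analysis from the talk, and the function and variable names are illustrative.

```python
# Operating points as quoted on the conclusions slide.
HIGH = {"vdd": 1.2, "pj_per_instr": 14.0, "mips": 170.0}
LOW = {"vdd": 0.51, "pj_per_instr": 2.7, "mips": 48.0}


def average_power_mw(pj_per_instr, mips):
    """Average power in mW: (pJ/instr * 1e-12 J) * (MIPS * 1e6 instr/s) * 1e3 mW/W."""
    return pj_per_instr * mips * 1e-3


# Ratio of energy per instruction between the two operating points.
energy_ratio = HIGH["pj_per_instr"] / LOW["pj_per_instr"]

# First-order assumption: dynamic energy scales with the square of the supply voltage.
cv2_ratio = (HIGH["vdd"] / LOW["vdd"]) ** 2

for point in (HIGH, LOW):
    power = average_power_mw(point["pj_per_instr"], point["mips"])
    print(f'{point["vdd"]:.2f} V: {power:.3f} mW')

print(f"energy ratio: {energy_ratio:.2f}, quadratic-scaling estimate: {cv2_ratio:.2f}")
```

With the slide values this gives roughly 2.38 mW at 1.2 V and 0.13 mW at 0.51 V. The energy per instruction drops by about 5.2x, close to the 5.5x that pure quadratic voltage scaling would predict.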