An Ultra-Low Energy Asynchronous Processor for Wireless Sensor Networks
L. Necchi, L. Lavagno, D. Pandini, L. Vanzago
Politecnico di Torino
STMicroelectronics
Wireless Sensor Networks
- Ad-hoc wireless networks
  - Sensing
  - Computation
  - Actuation
- Application areas:
  - Monitoring
  - Building automation
  - Health care, medical
  - Emergency response
  - Automotive
Async’06 - March 13-15
Luca Necchi – Politecnico di Torino
Key WSN Requirements
- Flexibility (general-purpose design)
- High energy efficiency (battery powered)
- Extremely wide supply voltage range
  - Exhausted battery or energy scavenging
- Fast and inexpensive wake-up
  - Event-driven power management (not predictable)
- Sporadic high computational load
  - Encryption (security)
  - Aggregation, distributed data processing
Sensor node architecture
Main components of a WSN node:
- Microcontroller (Atmel AVR, TI MSP430)
- Memory
- Radio
- Sensors / Actuators
- Power supply
  - Battery (energy storage)
  - Power scavenging
Circuit-level Power Management
[Table: clock gating, power gating, dynamic voltage scaling, and adaptive body biasing compared by when they save energy (idle or active) and by the scenario they suit (long idle time, deadlines)]
DVS can be obtained by:
- Off-line pre-computed voltage/frequency tables
  - High delay margins
- On-line evaluation:
  - PowerWise, Razor, asynchronous, de-synchronization
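The off-line option can be pictured as a simple table lookup. In this sketch the voltage/frequency pairs are invented for illustration (they are not measurements from this work) and `pick_operating_point` is a hypothetical helper; each frequency is assumed to already include the worst-case delay margin fixed at design time, which is the pessimism the on-line techniques try to remove.

```python
# Hypothetical off-line DVS table: (supply voltage in V, guaranteed clock in MHz).
# The numbers are illustrative only; each frequency is assumed to already
# include the worst-case delay margin chosen at design time.
DVS_TABLE = [
    (0.6, 20.0),
    (0.8, 60.0),
    (1.0, 110.0),
    (1.2, 150.0),
]


def pick_operating_point(required_mhz, table=DVS_TABLE):
    """Return the lowest-voltage (Vdd, f) entry whose frequency meets the demand."""
    for vdd, f_mhz in sorted(table):  # sorted by voltage, lowest first
        if f_mhz >= required_mhz:
            return vdd, f_mhz
    raise ValueError(f"no operating point reaches {required_mhz} MHz")


if __name__ == "__main__":
    for demand in (10.0, 50.0, 140.0):
        print(demand, "MHz ->", pick_operating_point(demand))
```

Because every table entry must hold for the slowest chip under the worst conditions, most chips end up running at a higher voltage than they actually need; closed-loop schemes measure the real delay on chip instead.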
Closed-loop DVS Techniques
- PowerWise:
  - Samples the output of a digital delay line with a high-frequency clock and adjusts the supply voltage to deliver the required performance
- Razor:
  - Detects timing errors by comparing the values stored in duplicated slave latches, the second of which is clocked half a clock cycle later; on an error it restarts the pipeline and adjusts the supply voltage accordingly
- Asynchronous with dual-rail encoding:
  - (Quasi-)delay-insensitive implementation that guarantees correctness for (almost) any supply voltage and process variation
- Asynchronous with bundled-data encoding:
  - The output of a digital delay line is used directly to generate a local clock signal, so the delay period depends directly on the supply voltage
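The bundled-data approach relies on the delay line and the logic slowing down together as Vdd drops. A rough way to see this is the alpha-power-law delay model of Sakurai and Newton, delay ~ k * Vdd / (Vdd - Vth)^alpha. In the sketch below the threshold voltage, the alpha values, and the 2x delay-line scaling (a 100% margin) are all assumptions chosen for illustration, and the model is known to be crude near threshold; the point is only that a small mismatch in voltage sensitivity makes the line-to-logic ratio drift with Vdd, and the delay margin has to absorb that drift.

```python
# Alpha-power-law delay model (Sakurai & Newton): delay ~ k * Vdd / (Vdd - Vth)**alpha.
# All parameters below are illustrative assumptions, not values from this design.
VTH = 0.3  # threshold voltage in V (assumed)


def alpha_power_delay(vdd, k, alpha, vth=VTH):
    """Relative delay of a gate chain at supply voltage vdd."""
    if vdd <= vth:
        raise ValueError("the alpha-power model needs Vdd > Vth")
    return k * vdd / (vdd - vth) ** alpha


def logic_delay(vdd):
    # Critical path of the logic (assumed alpha = 1.3).
    return alpha_power_delay(vdd, k=1.0, alpha=1.3)


def delay_line(vdd):
    # Matched delay line sized for a 100% margin (k = 2) and built from
    # slightly different cells (assumed alpha = 1.4).
    return alpha_power_delay(vdd, k=2.0, alpha=1.4)


def margin_ratio(vdd):
    """Delay-line delay divided by logic delay; must stay above 1 for correctness."""
    return delay_line(vdd) / logic_delay(vdd)


if __name__ == "__main__":
    for vdd in (1.2, 1.0, 0.8, 0.6, 0.51):
        print(f"Vdd={vdd:.2f} V  logic={logic_delay(vdd):.3f}  ratio={margin_ratio(vdd):.3f}")
```

With these assumed parameters the ratio rises from about 2.02 at 1.2 V to about 2.34 at 0.51 V, so the line becomes more conservative at low voltage; a different cell mix could make it drift the other way, which is one reason to keep a generous margin.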
De-synchronization
[Figure: a synchronous circuit driven by a global CLK is de-synchronized into an asynchronous circuit]
Design Flow
[Figure: HDL (RTL) -> synthesis & optimization (with cell library) -> netlist -> de-synchronization -> netlist -> physical design -> layout]
Obtain an asynchronous implementation from a synchronous specification:
- Think synchronously
- Design synchronously
- De-synchronize (automatically)
- Test synchronously
- Run asynchronously
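The automatic de-synchronization step can be pictured as a netlist rewrite: each master-slave flip-flop is split into an even latch (L0) and an odd latch (L1) whose enables come from local handshake controllers instead of CLK, while the combinational logic is left untouched. The data classes and net-naming scheme below are hypothetical (they are not the API of any EDA tool); the `single_controller` switch mimics the choice, made for the AVR core, of driving all latches from one controller.

```python
from dataclasses import dataclass


@dataclass
class FlipFlop:
    name: str
    d: str  # net driving the data input
    q: str  # net driven by the output


@dataclass
class Latch:
    name: str
    d: str
    q: str
    parity: int  # 0 = even (master) latch L0, 1 = odd (slave) latch L1
    enable: str  # enable net driven by a handshake controller, not by CLK


def desynchronize(flops, single_controller=False):
    """Replace each flip-flop by an L0/L1 latch pair; logic nets are kept as-is."""
    latches = []
    for ff in flops:
        mid = f"{ff.name}_mid"  # new net between the two latches
        if single_controller:
            en0, en1 = "en_L0", "en_L1"  # one controller drives all latches
        else:
            en0, en1 = f"en_{ff.name}_L0", f"en_{ff.name}_L1"
        latches.append(Latch(f"{ff.name}_L0", ff.d, mid, 0, en0))
        latches.append(Latch(f"{ff.name}_L1", mid, ff.q, 1, en1))
    return latches


if __name__ == "__main__":
    regs = [FlipFlop("pc0", "pc_next0", "pc0_q"), FlipFlop("ir0", "mem_out0", "ir0_q")]
    for latch in desynchronize(regs):
        print(latch)
```

Only the storage elements and their enable nets change, which is consistent with the statement that the data path remains intact and with the small area overhead attributed to the FF -> latch conversion.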
Synchronous circuit
[Figure: pipeline in which each master-slave (MS) flip-flop is drawn as a pair of latches, L0 and L1, all driven by the global CLK]
De-synchronization
[Figure: the same L0/L1 latch pipeline with the CLK replaced by C-element controllers (C)]
De-synchronization
- Distributed micropipeline-style controllers replace the clock network
- The data path remains intact!
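The controllers are marked C in the figures, the usual symbol for a Muller C-element, the basic state-holding gate of micropipeline-style handshake controllers. The slides do not detail the controller circuit, so this is only a behavioral sketch of a two-input C-element: the output copies the inputs when they agree and otherwise keeps its previous value.

```python
class CElement:
    """Behavioral two-input Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        # The output changes only when both inputs agree; otherwise it holds state.
        if a == b:
            self.out = a
        return self.out


if __name__ == "__main__":
    c = CElement()
    for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]:
        print(a, b, "->", c.update(a, b))
```

This wait-for-agreement behavior is what lets a controller combine events from its neighbors (for example a request from one side and an acknowledge from the other) before opening a latch, without a global clock.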
Flow equivalence [Le Guernic, Talpin, Le Lann, 2003]
[Figure: values on signals A and B, comparing the synchronous behavior, where both signals change on CLK edges, with the de-synchronized behavior, where each signal changes at its own time; each signal carries the same sequence of values in both]
Theorem:
The de-synchronization model preserves flow equivalence.
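Flow equivalence compares behaviors signal by signal: two executions are flow-equivalent when every signal carries the same sequence of values, regardless of when each value appears. A minimal checker, assuming traces are lists of (time, signal, value) events (a representation chosen here for illustration, with made-up values), could look like this:

```python
from collections import defaultdict


def flows(trace):
    """Project a timed trace of (time, signal, value) events onto per-signal value sequences."""
    per_signal = defaultdict(list)
    for _, signal, value in sorted(trace, key=lambda event: event[0]):
        per_signal[signal].append(value)
    return dict(per_signal)


def flow_equivalent(trace_a, trace_b):
    """True if both traces produce the same ordered values on every signal."""
    return flows(trace_a) == flows(trace_b)


# Synchronous: A and B both update on every clock edge (t = 0, 1, 2).
sync = [(0, "A", 1), (0, "B", 0), (1, "A", 5), (1, "B", 2), (2, "A", 3), (2, "B", 4)]
# De-synchronized: same values per signal, but at individual, irregular times.
desync = [(0.0, "A", 1), (0.3, "B", 0), (0.7, "A", 5), (1.4, "B", 2), (1.9, "A", 3), (2.2, "B", 4)]
# Broken: B delivers its last two values in the wrong order.
broken = [(0.0, "A", 1), (0.3, "B", 0), (0.7, "A", 5), (1.4, "B", 4), (1.9, "A", 3), (2.2, "B", 2)]

if __name__ == "__main__":
    print(flow_equivalent(sync, desync))
    print(flow_equivalent(sync, broken))
```

Timing is deliberately discarded: the theorem guarantees that de-synchronization may change when values appear, but not which values appear on each signal or in what order.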
De-synchronization Benefits
For the end user:
- Reduced electromagnetic emission
- Process variation tolerance
- Enables partial average-case design, with respect to process and environmental variation (not with respect to data-dependent delay)
- The resulting circuit:
  - is ready for frequency and voltage scaling
  - is inherently more robust to delay variations
  - has virtually no performance or area overhead with respect to the synchronous design
For the designer:
- Conventional EDA tools and design flow
- Limited design time and effort, fully automated
- Re-use of legacy designs
Asynchronous advantages not offered by de-synchronization
- Fine-grained power management
  - The de-synchronized circuit inherits the synchronous clock gating
- Fine-grained pipelining
  - The pipeline structure is not changed
- Data-dependent delays
  - Could be exploited by using a data path with completion detection (work in progress)
- Robustness with respect to uncorrelated local variability
  - Would require completion detection
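Completion detection is typically built on a delay-insensitive code such as dual-rail, one of the closed-loop DVS options listed earlier: each bit uses two wires, (0, 0) is the empty spacer, (1, 0) means 1 and (0, 1) means 0. The slides do not describe the work-in-progress data path, so the following is only a generic behavioral sketch of how a completion detector decides that a whole word has arrived.

```python
SPACER = (0, 0)


def encode_dual_rail(value, width):
    """Encode an unsigned integer as (true_rail, false_rail) pairs, LSB first."""
    bits = [(value >> i) & 1 for i in range(width)]
    return [(b, 1 - b) for b in bits]


def bit_valid(pair):
    """A dual-rail bit is valid when exactly one rail is high."""
    if pair == (1, 1):
        raise ValueError("illegal dual-rail code (1, 1)")
    return pair != SPACER


def complete(word):
    """Completion detection: every bit has left the spacer state."""
    return all(bit_valid(pair) for pair in word)


def decode_dual_rail(word):
    """Recover the integer from a complete dual-rail word."""
    if not complete(word):
        raise ValueError("word not complete yet")
    return sum(t << i for i, (t, _) in enumerate(word))


if __name__ == "__main__":
    word = encode_dual_rail(0b1011, 4)
    partial = word[:2] + [SPACER, SPACER]  # upper bits still in flight
    print(complete(partial), complete(word), decode_dual_rail(word))
```

In hardware the same test is usually an OR of the two rails per bit feeding a C-element tree, which costs extra area and energy compared with a bundled-data delay line.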
Synchronous Logic Interfacing
[Figure: de-synchronized stages (combinational logic CL, latch pairs, C-element controllers) connected to a FAST LOGIC block; the data path is not modified, and the controllers are linked by handshaking lines]
Synchronous Logic Interfacing
[Figure: the same de-synchronized stages connected to a SLOW LOGIC block driven by an external CLK]
- Synchronized with an external, slower clock
  - Only the low-EMI benefit remains
Synchronous Logic Interfacing
[Figure: the same de-synchronized stages connected through the handshake to a SELF TIMED LOGIC block]
- Example: SRAM with completion detection
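With self-timed logic such as an SRAM with completion detection, the handshake replaces the clock: the block raises its acknowledge only once the access has actually finished. The sketch below models one four-phase (return-to-zero) handshake per access; the access-time range, the 0.1 ns handshake step, and the class and function names are hypothetical, and the random access time merely stands in for data- and voltage-dependent delay.

```python
import random


class SelfTimedSRAM:
    """Toy SRAM whose completion detector reports when an access is finished."""

    def __init__(self, size, seed=0):
        self.mem = [0] * size
        self.rng = random.Random(seed)

    def access(self, addr, write_data=None):
        """Perform one access and return (read_value, access_time_ns)."""
        access_time = self.rng.uniform(1.0, 3.0)  # assumed variable access time
        if write_data is not None:
            self.mem[addr] = write_data
        return self.mem[addr], access_time


def four_phase(sram, addr, write_data=None, t=0.0, step=0.1):
    """Run one four-phase handshake with the SRAM and log every transition."""
    log = [(t, "req", 1)]  # 1. requester raises req; address/data are stable
    value, access_time = sram.access(addr, write_data)
    t += access_time
    log.append((t, "ack", 1))  # 2. completion detected: ack rises
    t += step
    log.append((t, "req", 0))  # 3. requester withdraws req
    t += step
    log.append((t, "ack", 0))  # 4. SRAM resets ack, ready for the next access
    return value, log, t


if __name__ == "__main__":
    sram = SelfTimedSRAM(16)
    _, write_log, t = four_phase(sram, 3, write_data=42)
    value, read_log, t = four_phase(sram, 3, t=t)
    print(value, read_log)
```

The requester never assumes a fixed access time; it simply waits for ack, so a slower access (for example at a lower supply voltage) stretches the cycle instead of causing an error.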
Sensor node architecture
Main components of a WSN node:
- Microcontroller (Atmel AVR)
- Memory
- Radio
- Sensors / Actuators
- Power supply
  - Battery (energy storage)
  - Power scavenging
Our Case Study
Application-independent 8-bit CPU architecture:
- Atmel AVR instruction set (as used in MICA2 and MICAz), from OpenCores.org, implemented in a 130 nm technology
- Toolchain and lots of software ready to use: nesC, TinyOS, TinyDB, Surge, TOSSIM
Aggressive energy management enabled by de-synchronization, using:
- Dynamic voltage scaling
- Zero wake-up time (no CLK, no waiting for a PLL to restart)
Typical AVR architecture
[Figure: instruction memory and data memory; instruction fetch, decode, and execution (ALU) stages on an 8-bit data path with pipeline latches; address bus; external CLK and its distribution network]
Design Choices
- Main target is energy efficiency (vs. speed)
  - Large delay margins (100%) to increase robustness at low supply voltage
- The AVR core is very small (~4500 gates), hence we used a single controller
  - Reduced area overhead
  - No electromagnetic emission reduction
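A 100% margin means the matched delay line is sized for twice the critical-path delay. As a back-of-the-envelope sketch (the critical-path and cell delays are invented numbers, and the slides do not give the actual chain length), the number of delay cells and the resulting cycle time can be computed like this:

```python
import math


def delay_cells_needed(critical_path_ns, cell_delay_ns, margin=1.0):
    """Cells so that the chain delay is at least (1 + margin) x the critical path."""
    return math.ceil((1.0 + margin) * critical_path_ns / cell_delay_ns)


if __name__ == "__main__":
    crit, cell = 4.0, 0.15  # hypothetical delays at the same supply voltage, in ns
    for margin in (0.0, 0.5, 1.0):
        n = delay_cells_needed(crit, cell, margin)
        print(f"margin={margin:.0%}: {n} cells, chain delay {n * cell:.2f} ns")
```

Because the cycle time follows the chain length, trimming the margin shortens the cycle directly, which matches the conclusion that the speed penalty could be improved by reducing margins.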
De-synchronized AVR
[Figure: the same AVR architecture with the clock replaced by a C-element controller, a delay chain, and handshake signal distribution]
Logic and Delay Line Matching
Energy Efficiency
[Figure: energy per instruction, power consumption, leakage energy per instruction, and logic delay versus supply voltage (V)]
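The two operating points quoted in the conclusions (14 pJ/instr at 1.2 V and 170 MIPS, 2.7 pJ/instr at 0.51 V and 48 MIPS) can be sanity-checked with the first-order rule that dynamic energy per operation scales with Vdd squared. The sketch below only does that arithmetic; attributing the small excess at 0.51 V to leakage is an interpretation suggested by the leakage-per-instruction curve, not a result stated on the slides.

```python
def scaled_dynamic_energy(e_ref_pj, v_ref, v_new):
    """First-order CV^2 scaling of dynamic energy per instruction."""
    return e_ref_pj * (v_new / v_ref) ** 2


# Operating points from the conclusions slide.
E_HI, V_HI, MIPS_HI = 14.0, 1.2, 170.0
E_LO, V_LO, MIPS_LO = 2.7, 0.51, 48.0

predicted_lo = scaled_dynamic_energy(E_HI, V_HI, V_LO)  # CV^2 prediction at 0.51 V
excess_lo = E_LO - predicted_lo  # part not explained by CV^2 scaling

# Average power = energy per instruction x instruction rate.
# 1 pJ x 1e6 instructions/s = 1 uW, so pJ x MIPS gives microwatts.
power_hi_uw = E_HI * MIPS_HI
power_lo_uw = E_LO * MIPS_LO

if __name__ == "__main__":
    print(f"CV^2 prediction at {V_LO} V: {predicted_lo:.2f} pJ (reported {E_LO} pJ)")
    print(f"excess: {excess_lo:.2f} pJ")
    print(f"power: {power_hi_uw:.0f} uW at {V_HI} V, {power_lo_uw:.1f} uW at {V_LO} V")
```

The energy per instruction drops by roughly 5x and the average power by roughly 18x (about 2.38 mW to about 0.13 mW), while throughput falls by about 3.5x.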
Comparison with Some Past Work
- Philips 80C51 (H. van Gageldonk, 1998)
  - Asynchronous bundled-data implementation of the 8051 ISA, general purpose
- Lutonium (A. Martin et al., 2003)
  - Asynchronous QDI implementation of the 8051 ISA
- SNAP/LE (V. Ekanayake et al., 2004)
  - Asynchronous QDI processor specifically designed for WSN
- Razor (D. Ernst et al., 2004)
  - Synchronous processor that estimates the best Vdd by dynamically monitoring the logic delay with a redundant latching scheme
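The redundant latching idea behind Razor can be illustrated with a behavioral sketch: a main flip-flop samples at the clock edge, a shadow latch samples half a cycle later, and a mismatch flags a timing error that triggers recovery and a supply increase. The timing model, the voltage step, and the limits are simplifying assumptions; real Razor designs also need metastability handling and pipeline recovery logic that are not modeled here.

```python
def razor_sample(logic_delay, period, old_value, new_value):
    """Return (main, shadow, error) for one Razor-style sampling event.

    If the logic is slower than 1.5 x period, both samples are stale and the
    error goes undetected, so the voltage range must be limited to avoid it.
    """
    main = new_value if logic_delay <= period else old_value  # sampled at the clock edge
    shadow = new_value if logic_delay <= 1.5 * period else old_value  # half a cycle later
    return main, shadow, main != shadow


def adjust_vdd(vdd, error, step=0.01, vmin=0.5, vmax=1.2):
    """Lower Vdd while no errors occur; raise it after a detected timing error."""
    vdd = vdd + step if error else vdd - step
    return min(max(vdd, vmin), vmax)


if __name__ == "__main__":
    print(razor_sample(0.9, 1.0, old_value=0, new_value=1))  # on time: (1, 1, False)
    print(razor_sample(1.2, 1.0, old_value=0, new_value=1))  # late: (0, 1, True)
```

Unlike the de-synchronized approach, this keeps a global clock and tunes Vdd by observing occasional errors, rather than letting a delay line set the cycle time.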
CONCLUSIONS
- Aggressive energy management using DVS
  - 14 pJ/instr @ 1.2 V (170 MIPS)
  - 2.7 pJ/instr @ 0.51 V (48 MIPS)
- Minimal overhead with respect to the synchronous counterpart
  - +6% area (due to FF -> latch conversion)
  - -20% speed (could be improved by reducing margins)
- Future work:
  - Analysis with other “SPICE-like” simulators (HSIM)
  - Statistical (Monte Carlo) simulations to check robustness with respect to process variability
  - Fabrication (?)