
How to save 10 mW for free?

Don't trust drivers for power management


Lin Zhong
http://recg.org

Efficiency matters at various scales

[Figure: power scales of 1 mW, 10 mW, 0.1 W, 1 W, and 10 W]

The secret of efficiency

Don't do more than necessary
As slow as possible
Don't overprovision
Don't do useless things
Reuse previous work
... ...

Go to sleep when nothing to do
a.k.a. Power Management (PM)

System vs. Runtime PM


System power management
Put the entire system into a low-power state
Nothing works

Runtime power management


Put an idle component into a low-power state
Other components remain functional

Outline
Background: Runtime PM of mobile SoC
Problems with driver-directed runtime PM
Centralized power management
Future work

A lot of modules on a mobile SoC

[Figure 1-2. OMAP4430 Block Diagram (source: www.ti.com, intro_swpu141-001). Blocks shown include the MPU subsystem (Cortex-A9 CPU0 and CPU1 with Neon and VFPv3, 1 MB L2 cache), DSP subsystem, IVAHD subsystem, ISS, display subsystem, SGX540 graphics, ABE subsystem, Cortex-M3 subsystem, EMIF4D/LPDDR2 memory interfaces, and many peripherals (UART, I2C, MCSPI, GPIO, GPTIMER, HS-MMC, USB), connected through the L3, L4_PER, L4_CFG, L4_WKUP, and L4-ABE interconnects.]

TI OMAP4

Die photo of OMAP4430, red rectangle shows Cortex A9 cores


http://www.ubmtechinsights.com/uploadedFiles/Anatomy-of-a-Tablet-ESC-SV-2011.pdf

Module is the basic PM unit


Software modes
Disabled vs. Enabled
Before disabling, software must save the
execution context

Hardware states
Clock-gated
Powered-off

Modules are organized into domains

[Figure: the SoC is divided into power domains; each power domain contains clock domains; each clock domain contains modules, e.g. I2C and GPIO.]

Power domains have states

[Figure: the SoC hierarchy of power domains, clock domains, and modules, with each power domain in one of three states: ON, RETENTION, or OFF.]

Runtime PM is a collaboration between software & hardware

[Figure: a power domain in the ON state, containing clock domains with the I2C and GPIO modules.]

Software disables modules

[Figure, two steps: inside a power domain that is ON, a software trigger disables the I2C and GPIO modules one at a time.]

Hardware puts domains to sleep

[Figure, three steps: hardware triggers act on the clock domains and then on the power domain, which starts in the ON state.]

Software configures RETENTION or OFF
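
The division of labor on these slides can be sketched as a toy simulation. This is not OMAP code: the class names are made up, and the two hardware rules (a clock domain is gated once all of its modules are disabled; a power domain leaves ON once all of its clock domains are gated) are simplifying assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    enabled: bool = True            # software mode: enabled vs. disabled


@dataclass
class ClockDomain:
    modules: list

    def gated(self) -> bool:
        # Assumed hardware rule: gate the clock once every module is disabled.
        return all(not m.enabled for m in self.modules)


@dataclass
class PowerDomain:
    clock_domains: list
    target: str = "RETENTION"       # configured by software: "RETENTION" or "OFF"

    def state(self) -> str:
        # Assumed hardware rule: leave ON only when all clock domains are gated.
        if all(cd.gated() for cd in self.clock_domains):
            return self.target
        return "ON"


i2c, gpio = Module("I2C"), Module("GPIO")
domain = PowerDomain([ClockDomain([i2c, gpio])], target="OFF")

states = [domain.state()]           # both modules enabled
i2c.enabled = False                 # software disables I2C
states.append(domain.state())       # GPIO still keeps the domain ON
gpio.enabled = False                # software disables GPIO
states.append(domain.state())       # hardware can now drop to the target state
print(states)                       # ['ON', 'ON', 'OFF']
```

The middle entry is the point the later "hierarchical PM" problem turns on: a single module left enabled keeps the whole domain ON.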

Software has three responsibilities


Decide when to disable a module
Track pending tasks

Save the execution context when disabling


Decide the target state of a domain
Performance vs. efficiency


How Linux does it

User programs set QoS requirements

Linux PM_QoS framework
Notify driver about changed QoS requests

Example QoS requirement
Wakeup latency

[Figure: user programs pass a QoS requirement to the Linux PM_QoS framework in the operating system, which invokes a QoS callback in the device drivers.]

How Linux does it

Drivers fulfill QoS requirements
Adjust PM parameters according to QoS requirements

Examples
Idle timeout
Target low-power state

[Figure: user programs pass a QoS requirement to the Linux PM_QoS framework in the operating system, which invokes a QoS callback in the device drivers.]
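
As a rough illustration of "adjust PM parameters according to QoS requirements", the sketch below maps a wakeup-latency bound to a target low-power state. The state table, its numbers, and the function names are all hypothetical; real drivers obtain such figures from platform data, and nothing here is the Linux PM_QoS API.

```python
# Hypothetical low-power states: (name, wakeup latency in microseconds, power in mW).
# The numbers are invented for illustration and come from no datasheet.
STATES = [
    ("ON", 0, 20.0),
    ("RETENTION", 200, 2.0),
    ("OFF", 5000, 0.1),
]


def pick_target_state(max_wakeup_latency_us):
    """Return the lowest-power state whose wakeup latency meets the bound."""
    allowed = [s for s in STATES if s[1] <= max_wakeup_latency_us]
    return min(allowed, key=lambda s: s[2])[0]


def on_qos_change(requests_us):
    """Stand-in for a QoS callback: the tightest outstanding request wins."""
    return pick_target_state(min(requests_us))


print(on_qos_change([10_000]))       # OFF
print(on_qos_change([10_000, 300]))  # RETENTION
print(on_qos_change([50]))           # ON
```

The trade-off named on the earlier slide, performance vs. efficiency, shows up directly: a tighter latency bound forces a shallower, more power-hungry state.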

How Linux does it

Drivers track pending tasks

Linux Runtime PM framework
Reference counting
APIs: pm_runtime_get/put

Driver's role:
Make reference counter = 0 when #pending tasks = 0

[Figure: user programs pass QoS requirements to the Linux PM_QoS framework, which invokes QoS callbacks in the device drivers; the drivers call pm_runtime_get/put into the Linux Runtime PM framework, which enables or disables SoC modules such as the DSP, I2C, display controller, SPI, and EHCI.]
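
The reference-counting contract can be sketched with a toy counter. This is a simplified model of the get/put pattern, not the kernel implementation: the class and method names are made up, and the module is disabled immediately when the count reaches zero, whereas a real framework may defer that step.

```python
class RuntimePM:
    """Toy usage counter following the get/put pattern."""

    def __init__(self, name):
        self.name = name
        self.usage_count = 0
        self.enabled = False
        self.log = []

    def get(self):
        # A task is about to use the module: enable it on the 0 -> 1 transition.
        self.usage_count += 1
        if self.usage_count == 1:
            self.enabled = True
            self.log.append("enable")

    def put(self):
        # A task finished: disable the module once the last user is gone.
        if self.usage_count == 0:
            raise RuntimeError("unbalanced put")
        self.usage_count -= 1
        if self.usage_count == 0:
            self.enabled = False
            self.log.append("disable")


i2c = RuntimePM("I2C")
i2c.get()        # first transfer starts
i2c.get()        # second transfer queued
i2c.put()        # first transfer done, module stays enabled
i2c.put()        # second transfer done, module disabled
print(i2c.log)   # ['enable', 'disable']
```

A driver that forgets a single `put` leaves the counter above zero forever, so the module is never disabled.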

SoC Runtime PM (recap)

Software chooses RET or OFF for the power domain (input: QoS requirements)
Driver triggers enable/disable of modules such as I2C and GPIO (input: pending tasks?)
Hardware triggers clock domain and power domain transitions

Outline
Background: Runtime PM of Mobile SoC
Problems with driver-directed runtime PM
Centralized runtime power management
Future work

Linux driver has three responsibilities

Decide when to disable a module
Using Linux Runtime PM framework

Save the execution context when disabling

Decide the target state of a domain
Using Linux PM QoS framework

Problem #1: Drivers do no runtime PM

Boards: TI OMAP, Samsung Exynos, Freescale i.MX, Nvidia Tegra

Module    Delay (months)
UART      12
WDT       22
USB       45
Keypad    18
USB       14
SD        37
I2C       9
SPI       29
SD*       44

*: still no PM at the time of this presentation

Problem #2: Drivers can be very complex
OMAP display subsystem
565 pages in manual
22,000 lines of code
Tens of callbacks
Several asynchronous executions

Display controller is kept on when screen is on

Problem #3: Hierarchical PM makes it worse
Bad PM for a module => Bad PM for the domain

[Figure: the L4PER power domain contains the L4PER clock domain, which contains GPIO and UART.]

Solutions
Get more and better driver developers
Help driver developers with a tool
Relieve drivers from doing runtime PM


Outline
Background: Runtime PM of Mobile SoC
Problems with driver-directed runtime PM
Centralized runtime power management
Future work

Fundamental information needed for runtime PM
Users' QoS requirements
Wakeup latency from a low power state, etc.
Whether a module has pending tasks

Stock Linux relies on device drivers

Our key insight
Can be inferred without driver assistance

Pending task inference


Software only solution
Hypothesis: When there is a pending task,
there will be frequent register accesses

Hardware-assisted solution


Example: register access indicates pending tasks for I2C controller

[Figure: sequence of register accesses between the CPU (I2C driver) and the I2C controller.]

Configuration
CPU: set I2C message address
CPU: set I2C message length

IRQ handling
I2C controller: interrupt, FIFO drained
CPU: read & ack interrupt
I2C controller: interrupt, message transferred
CPU: read & ack interrupt

Monitor register access with memory exception

Memory-map registers of modules
Supported by all ARM-based SoCs

Periodically (every T_threshold) remove access permission of mapped memory region

Grant permission after first access
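
The monitoring loop can be replayed offline on a list of register-access timestamps. The sketch below is a simulation under stated assumptions, not the kernel module from the talk: the function name, the millisecond timeline, and the rule of disabling the module at the end of the first fault-free period are illustrative choices.

```python
def monitor(access_times_ms, t_threshold_ms, end_ms):
    """Replay register accesses against a periodic permission-revoking monitor.

    Returns (events, faults): events is a list of (time_ms, "disable" | "enable"),
    faults is the number of memory exceptions taken.
    """
    accesses = sorted(access_times_ms)
    events, faults = [], 0
    enabled, idx = True, 0
    for start in range(0, end_ms, t_threshold_ms):
        stop = start + t_threshold_ms
        # Permission is removed at `start`; gather the accesses of this period.
        in_period = []
        while idx < len(accesses) and accesses[idx] < stop:
            in_period.append(accesses[idx])
            idx += 1
        if in_period:
            faults += 1                  # only the first access faults
            if not enabled:
                enabled = True           # new pending task: enable the module
                events.append((in_period[0], "enable"))
        elif enabled:
            enabled = False              # a whole period with no exception
            events.append((stop, "disable"))
    return events, faults


events, faults = monitor([5, 20, 130, 460, 470], t_threshold_ms=100, end_ms=700)
print(events)   # [(300, 'disable'), (460, 'enable'), (600, 'disable')]
print(faults)   # 3
```

Five register accesses cost only three exceptions here, because permission is granted again after the first access in each period.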

No memory exception => No register access in the past period => No pending tasks

[Figure: timeline of the monitor and the driver for an enabled module. The monitor removes read/write permission every T_threshold. A register access by the driver raises a memory exception. After all tasks have finished, a period passes with no register access and no memory exception.]

Disabled module has access permission removed

[Figure: timeline of the monitor and the driver for a disabled module. A register access raises a memory exception, which indicates a new pending task, and the module is enabled.]

A centralized runtime PM architecture

[Figure: the Central PM Agent (a kernel module) contains a Monitor, which determines whether the SoC module has a pending task, and a Controller, which enables or disables the SoC module through the device driver's PM callback.]

Driver provides mechanisms (PM callbacks)
Central PM agent decides when to call which
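
The split between mechanism and policy can be sketched as follows. This is an illustrative model only: the callback names `suspend` and `resume`, the `report` method, and the per-period boolean verdict are assumptions, not the interface of the actual agent.

```python
class Driver:
    """A driver that supplies only mechanisms (PM callbacks), no policy."""

    def __init__(self, name):
        self.name = name
        self.calls = []

    def suspend(self):                  # hypothetical callback name
        self.calls.append("suspend")    # would save context and disable the module

    def resume(self):                   # hypothetical callback name
        self.calls.append("resume")     # would enable the module and restore context


class CentralPMAgent:
    """Controller half of the agent: turns monitor verdicts into callbacks."""

    def __init__(self):
        self.drivers = {}
        self.enabled = {}               # driver name -> module currently enabled?

    def register(self, driver):
        self.drivers[driver.name] = driver
        self.enabled[driver.name] = True

    def report(self, name, pending):
        """Called once per monitoring period with the monitor's verdict."""
        driver = self.drivers[name]
        if self.enabled[name] and not pending:
            driver.suspend()
            self.enabled[name] = False
        elif not self.enabled[name] and pending:
            driver.resume()
            self.enabled[name] = True


agent = CentralPMAgent()
i2c = Driver("I2C")
agent.register(i2c)
for verdict in [True, False, False, True]:
    agent.report("I2C", verdict)
print(i2c.calls)   # ['suspend', 'resume']
```

The driver never decides when to suspend; it only has to implement the two callbacks correctly.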

Evaluation
Pandaboard uses TI OMAP4460
Linaro Android 13.10 release
Kernel version 3.2


Evaluation setup
Tested modules
MMC controller (used by file system on SD card)
I2C, SDIO controllers (used by Wi-Fi NIC)
DISPC (Display Controller, part of display subsystem)

10 minutes of usage trace with 48 user input events
Reading emails with Android email application
Browsing webpages with Android browser

Central PM is as effective as good drivers

[Figure: bar chart of disabled time percentage (%), on a scale of 0 to 100, for MMC, I2C, SDIO, and DISPC, comparing Stock Linux Driver with Central PM Agent (T_threshold = 100 ms).]

Enable runtime PM for bad drivers

[Figure: bar chart of disabled time percentage (%), on a scale of 0 to 100, for MMC, I2C, SDIO, and DISPC, comparing Stock Linux Driver with Central PM Agent (T_threshold = 100 ms).]

Low overhead
Memory exception handling adds 2500 cycles of latency (2.5 µs at a CPU frequency of 1 GHz).
A memory exception occurs for each module at most once every T_threshold = 100 ms.
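
The overhead figures can be checked with a few lines of arithmetic. The three constants come from the slide; treating the 2500-cycle latency as CPU time spent per exception is an assumption made here to get a worst-case fraction.

```python
CYCLES_PER_EXCEPTION = 2500        # from the slide
CPU_FREQ_HZ = 1_000_000_000        # 1 GHz, from the slide
T_THRESHOLD_S = 0.100              # 100 ms, from the slide

latency_s = CYCLES_PER_EXCEPTION / CPU_FREQ_HZ

# At most one exception per module per period (assumes the latency is all CPU time).
worst_case_cpu_fraction = latency_s / T_THRESHOLD_S

print(f"latency per exception: {latency_s * 1e6:.1f} us")
print(f"worst-case CPU time per module: {worst_case_cpu_fraction * 100:.4f} %")
```

Under these assumptions each monitored module costs at most 0.0025 % of one core.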

Negligible performance loss

[Figure: two bar charts comparing Stock Linux with Central PM. (a) SD card throughput (MB/s) for read and write. (b) Wi-Fi throughput (MB/s) for send and receive.]

Extend standby time by 3 hours

SDIO controller: at most 17 mW power saving; extends standby time by 2.4 hours daily*
DISPC: 10 mW power saving; extends standby time by 0.6 hours daily*

*: estimated based on smartphone user study in LiveLab: http://livelab.recg.rice.edu/traces.html

Central PM Agent (recap)
Relieving driver developers from PM
Comparable with hand-tuned PM
No hardware modification

Limitation of software-only approach

Less aggressive: T_threshold is lower bounded
Overhead: periodic memory exceptions
Does not work for all SoC modules

The key hypothesis revisited

If there is a pending task, there must be frequent register access
Intervals between register accesses when working on a task must be bounded

True for all I/O controllers
T_threshold = 100 ms, lower bounded by DMA transaction length
True for some accelerators, e.g., FDIF
Not true for programmable units such as GPU & DSP

Small hardware modifications

A busy/idle register per module
Polling busy/idle register

[Figure: in the Central PM Agent, a HW-assisted Monitor determines whether a task is pending by reading a busy/idle register added to the SoC module; the Controller still enables or disables the module through the device driver's PM callback.]

Busy/Idle register behavior

Set to busy (1): whenever the module starts processing a task
Reset to idle (0): after a read, if the module is idle

Timing info from the register

Busy (1): module has been busy at least once since last check
Idle (0): module has been idle the whole time since last check
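
A behavioral model makes the set and clear rules concrete. This is a software sketch of the register semantics described on these slides, with made-up class and method names; it is not the hardware implementation.

```python
class BusyIdleRegister:
    """Behavioral model of the one-bit busy/idle register."""

    def __init__(self):
        self.value = 0
        self.module_busy = False

    def start_task(self):
        self.module_busy = True
        self.value = 1               # set whenever the module starts a task

    def finish_task(self):
        self.module_busy = False     # the bit stays 1 until it is read

    def read(self):
        observed = self.value
        if not self.module_busy:
            self.value = 0           # cleared after a read, only if idle now
        return observed


reg = BusyIdleRegister()
samples = [reg.read()]               # nothing has happened yet
reg.start_task()
samples.append(reg.read())           # task in progress, bit is not cleared
reg.finish_task()
samples.append(reg.read())           # module was busy since the last check
samples.append(reg.read())           # idle for the whole interval
print(samples)                       # [0, 1, 1, 0]
```

The third read returning 1 is what lets a polling monitor catch a task that started and finished entirely between two polls.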

Busy/Idle register behavior

[Figure: three aligned timelines showing module activity (two tasks), the busy/idle register value, and reads by the HW-assisted monitor. The register starts at 0, reads as 1 after activity, and returns to 0 after a read that finds the module idle. The polling period T can be as small as 1 ms.]

Prototype on Zynq SoC

http://www.bdti.com/InsideDSP/2011/03/30/Xilinx

Add Busy/Idle register to I2C controller

Hardware:
Zedboard uses Zynq-7000 SoC
Freescale MMA8452Q accelerometer as I2C slave device
Software:
Linux version 3.14
Xilinx I2C controller in Verilog

Hardware modification is small

Reusing the existing finite state machine (FSM)
Small number of extra gates (depending on FSM encoding scheme and the size of FSM)

[Figure: the state bits (bit 0 to bit n) of the existing state machine in the IP core, together with the register read ACK, drive the busy/idle register.]

Hardware modification is small

Reusing the existing finite state machine (FSM)
A few lines of Verilog code

reg busyIdle;
always @(posedge clk)
    /* state machine is busy (any state bit set) */
    if (|state)
        busyIdle <= 1'b1;
    /* busy/idle reg was just read, and state machine is idle */
    else if (read)
        busyIdle <= 1'b0;

Small development effort, small hardware overhead

                 Development effort       FPGA resources              ASIC resources
Module           LoC           Time*      LUTs          Registers     Gates
Xilinx I2C       93 (+1.2%)    12         16 (+3.8%)    8 (+2.4%)     N/A**
Opencores SPI    15 (+6%)      5          1 (+1.3%)     1 (+1.5%)     15 (+1.1%)
Opencores I2C    20 (+2%)      10         4 (+1.8%)     1 (+0.6%)     34 (+1.6%)

*: development time is in man-hours
**: Xilinx I2C uses Xilinx-specific FPGA primitives, which cannot be synthesized to ASIC by the tool we used.

Central PM Agent with Busy/Idle Register (recap)
Small effort to add a busy/idle register per module
Enabling aggressive PM; incurring less overhead
Works for all SoC modules

Moving forward
Software-only solution
Distinguish read/write access to registers
Exploit interrupts

Hardware modification
busy/idle register + polling => interrupt

Today's runtime PM relies on power-hungry CPU

[Figure: block diagram of the CPU running Linux runtime PM and the drivers, each with routines for runtime PM, connected to the devices through the interconnect.]

Difficult to move runtime PM to low-power cores

[Figure: block diagram with the CPU, an ultra low-power core, Linux runtime PM, drivers, devices, and the interconnect.]

Much easier to disengage CPU in the centralized architecture

[Figure: block diagram with the CPU, an ultra low-power core, the Monitor, the Controller, drivers, devices, and the interconnect.]

Disengaging CPU: Step 1

[Figure: block diagram with the CPU, an ultra low-power core, the Controller, drivers, devices, the Monitor, and the interconnect.]

Disengaging CPU: Step 2

[Figure: block diagram with the CPU, an ultra low-power core, the Monitor, the Controller, drivers, devices, DSM, and the interconnect.]

Summary
Driver-directed runtime PM harmful
A centralized architecture equally effective
Disengaging CPU possible with the centralized architecture
All examples, source code & traces available at http://www.recg.org
Chao Xu, Xiaozhu Lin, Yuyang Wang, and Lin Zhong, "Automated OS-level device runtime power management," to appear in ACM ASPLOS, March 2015.

Acknowledgement

http://recg.org
