
HARDWARE/SOFTWARE

CO-SYNTHESIS ALGORITHMS
Dr B ABDUL RAHIM
Professor, Dept. of ECE
AITS, Rajampet.
• Hardware-software co-synthesis simultaneously designs the
software architecture of an application and the hardware on which
that software is executed.
• Problem specification
– description of functionality
– Performance goals, i.e., non-functional requirements
• Co-synthesis creates a hardware engine – consists of
– One or more PEs (for functionality execution)
– Application architecture (allocates functionality into PEs in hardware
engine)
• Some functions on CPUs
• Others on ASICs
• Communication channels – HW/SW
• Schedule for execution
• Component mapping & other details
• Co-synthesis allows trade-offs between design
application & HW engine
– Not important in data processing applications
– Very important in embedded computing
• Embedded computing applications almost always have
constraints:
• Cost
• Power
• Weight
• Size
[a change in the application software can force critical
changes in an embedded system]
• Implementing a system in software on an existing
CPU is simpler
(provided the speed & power constraints are met)
• If the CPU does all required jobs within the speed & power
constraints – OK
Otherwise go for custom-designed HW
• We can use a CPU & special-purpose functional
units for complex computations
• Such heterogeneous systems are important design
alternatives
• Why should we ever use more than one CPU or
functional unit?
• Why not use the fastest CPU available to implement a
function?
Ans: CPU cost is an exponential function of performance
The fastest CPUs have very poor cost/performance
• Sol: it is often cheaper to use several small processors to
implement a function,
even when communication overhead is included
– The processors in the multiprocessor need not be CPUs;
they may be special-purpose functional units
– A heterogeneous multiprocessor uses functional units for
some operations and a CPU for the remaining operations
Scheduling overhead further increases the cost of adding
functionality on a single large CPU
• External & internal implementation factors mean it is impossible
to know exactly when data will be ready for a process & how
long it will take to execute
• Data dependencies cause the same code to execute at different
speeds for different data values
• If such data-dependent results feed the next process, jitter
occurs,
because we don't know exactly when the computation will be
ready to run.
The CPU must reserve horsepower for overload/heavy-load conditions,
and extra CPU performance comes at exponentially increased cost
• Job of HW/SW Co-synthesis is to create HW architecture
and SW architecture for functional implementation
and also to meet non-functional goals such as speed,
power consumption etc.
• SW architecture is defined in embedded system by process
structure of the code
– Each process executes sequentially
– So division of functions into processes determines parallelism
– Process structure influences other costs such as data
requirements
• Dividing the functionality into too many processes hurts the design
– Complicates design decisions
– Increases context-switching overhead
– Complicates the implementation of parallelism
Preliminaries
• Co-synthesis systems: single rate behavior and multi rate behaviors
• Single rate system – perform single complex function
• Multirate system – ex: an audio/video decoder
The streams are encoded at different sampling rates, so the specification has multiple
components, each running at its own rate
• Two performance parameters are commonly used:
– The rate at which a behavior is executed is reciprocal of the max. interval between two
successive initiations of the behavior
– The latency of a behavior is the required max. time between starting to execute the
behavior and finish it.
Rate and latency are both required in general because of the way data arrives.
In simpler systems data arrives at a fixed time in each sampling period and goes through a
fixed set of processing steps.
In more complex systems data may arrive at different points in the period, due either
to external causes or to data from other processes.
• Rate specifies – intervals in which data samples may occur
• Latency specifies – time to process the data
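These two constraints can be checked against a trace of initiation and completion times. A minimal sketch (the function and argument names are illustrative, not taken from any particular co-synthesis tool):

```python
def meets_rate_and_latency(starts, finishes, min_rate, max_latency):
    """Check a trace of behavior executions against rate/latency constraints.

    starts[i]/finishes[i]: time of the i-th initiation and its completion.
    min_rate: required rate = 1 / (max interval between successive initiations).
    max_latency: max allowed time from starting an execution to finishing it.
    """
    # Rate: the largest gap between successive initiations bounds the rate.
    max_interval = max(b - a for a, b in zip(starts, starts[1:]))
    # Latency: every execution must finish within max_latency of its start.
    latency_ok = all(f - s <= max_latency for s, f in zip(starts, finishes))
    return (1.0 / max_interval) >= min_rate and latency_ok
```

Note that a trace can satisfy the rate constraint while violating latency (one slow execution), which is why both parameters are needed.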
• A single-rate system can be modeled as a CDFG – control/data flow graph
– Its semantics imply a program counter or system state, which is a natural
model for software or a hardwired controller + datapath.
– The unified system rate makes it difficult to describe multi-rate tasks.
• Multi-rate system modeled as task graph
– The task graph has a structure very similar to a dataflow graph
– Nodes represent larger units of functionality
– Edges represent data communication
– Unlike a CDFG, a task graph can show multiple flows of control
note: the terminology used for the task graph varies from author to
author
– nodes may be called processes or tasks
• Task graph may contain several unconnected
components
• Each component is a subtask
• Subtasks allow the description of multirate behavior,
since we can assign a separate rate to each subtask
• A multirate behavior model particularly well suited to
signal processing systems is the synchronous dataflow
graph (SDFG)
• Nodes represent functions and arcs represent
communication
• A valid SDFG may be cyclic
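In an SDFG each arc additionally carries fixed token production and consumption rates, and a consistent graph has a repetition vector q satisfying q[u]·prod = q[v]·cons on every arc (u, v). A sketch of solving these balance equations, assuming a connected, consistent graph (not from any particular tool):

```python
from fractions import Fraction
from math import lcm

def repetition_vector(arcs, nodes):
    """Solve the SDF balance equations q[u]*prod == q[v]*cons for each arc.

    arcs: list of (u, v, prod, cons) tuples.
    Returns the smallest positive integer repetition vector, assuming the
    graph is connected and consistent (no consistency check is done here).
    """
    q = {nodes[0]: Fraction(1)}       # pin one node's rate, propagate the rest
    changed = True
    while changed:
        changed = False
        for u, v, p, c in arcs:
            if u in q and v not in q:
                q[v] = q[u] * p / c
                changed = True
            elif v in q and u not in q:
                q[u] = q[v] * c / p
                changed = True
    # Scale the rational solution to the smallest integer vector.
    scale = lcm(*(f.denominator for f in q.values()))
    return {n: int(f * scale) for n, f in q.items()}
```

For example, an arc producing 2 tokens consumed 3 at a time yields repetitions {A: 3, B: 2}, so token production and consumption balance over one iteration.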
• A co-design finite state machine (CFSM) is used as a
model of behavior by the POLIS system.
• A CFSM is an event driven state machine
– Transitions between states are triggered by the arrival of
events rather than by a periodic clock signal
Ex: when the machine receives an event, its action depends on
its current state and the defined transitions out of that
state.
Upon taking a transition to a new state, the machine may
emit events, which can go either to the outside world or
other CFSM in the system.
A network of CFSMs can be interpreted as a network of non-
deterministic FSMs
Architectural models
• Architectural models describe the implementation in sufficient
detail for cost estimation during co-synthesis
• The engine itself is generally modeled as graph
– Processing elements-represented by nodes
– Communication channels by edges
Edges can connect only two nodes, so buses are hard to accurately
model
• When pre-designed PEs or communication channels are used to
construct the hardware engine, a component technology library
describes their characteristics.
• The component technology library includes general parameters
such as manufacturing cost, average power consumption etc.
• Also includes information which relates PEs to functional elements
• When ASIC is part of HW engine its clock rate is important part of the
technology model
• Should have detailed cost metrics
– Clock cycles for a particular function
– Determined by synthesizing the function using high-level synthesis
Or
– By estimating the results of that synthesis
– The synthesis system also makes assumptions about the communication mechanism
• Some background in CPU scheduling is useful for co-synthesis because
several components run on CPU
• Units of execution on a CPU are variously called processes, threads or
lightweight processes.
• Processes on a workstation or mainframe generally have separate address
spaces to protect users from each other.
• Lightweight processes or threads generally execute in
a shared memory space to reduce overhead.
• The processes on a CPU must be scheduled for
execution because only one may run at any given time
• Some decision making must be done at runtime to take
into account variabilities in data arrival times,
execution times etc.
• The simplest scheduling mechanism is a first-come-first-served
queue of processes
– A better approach assigns a priority to each process;
prioritized execution gives significantly higher utilization of
the CPU resources.
• In real time systems – execution of processes on CPUs is to
meet deadlines
1. Rate monotonic analysis (RMA)- static priority scheme
2. Earliest Deadline first (EDF)- dynamic priority scheme
• RMA assumes that the deadline for each process is equal
to the end of its period
• EDF does not assume that the deadline equals the period
• RMA – a static prioritization that is optimal among static schemes,
with the highest priority going to the process with the shortest period
• EDF – this scheduling policy gives highest priority to the
active process whose deadline is closest
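The two policies can be sketched briefly: the classic Liu–Layland utilization bound gives a sufficient schedulability test for RMA, and EDF simply runs the ready process with the nearest deadline. An illustrative sketch (not any tool's actual scheduler):

```python
def rma_schedulable(tasks):
    """Liu & Layland sufficient test for rate-monotonic scheduling.

    tasks: list of (execution_time, period); deadline == period, per RMA.
    Returns True if total utilization is within the n*(2^(1/n) - 1) bound.
    """
    n = len(tasks)
    utilization = sum(c / t for c, t in tasks)
    return utilization <= n * (2 ** (1 / n) - 1)

def edf_pick(ready):
    """EDF: among ready processes (name, absolute_deadline), pick the one
    whose deadline is nearest."""
    return min(ready, key=lambda p: p[1])[0]
```

The RMA bound is sufficient but not necessary: a task set that fails it may still be schedulable and would need an exact response-time analysis.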
Hardware software partitioning
• A hardware software partitioning algorithm implements a
specification on some sort of architectural templates for
the multiprocessor, usually a single CPU with one or more
ASICs connected to the CPU bus
• Allocation is the synthesis step that designs the
multiprocessor topology with its PEs & the software architecture
• In most HW/SW partitioning algorithms the type of CPU is
normally given but the ASIC must be synthesized
• HW/SW partitioning problems are usually single rate
synthesis problems, starting from a CDFG specification
• ASICs – architectures are used to accelerate core functions
• CPUs – performs the less computationally intensive tasks
• Classification based on optimization strategy
– Vulcan: starts all functions on ASICs
progressively moves to CPU to reduce
implementation cost
primal
– Cosyma: starts all functions on CPU and
moves to ASIC to meet the
performance goal
dual
Architectural Models
Srivastava and Brodersen developed an early co-synthesis system
based on a hierarchy of templates
• They developed a multi-level set of templates, ranging from the
workstation level to the board architecture.
• Each template included both hardware and software components.
• Relationships between system elements are embodied in buses,
ranging from the workstation system bus to the microprocessor bus
on a single board.
• At each stage of abstraction - a mapping tool allocated components
of the specification onto the elements of the template.
• Once the mapping at a level was complete, the hardware and
software components were generated separately, with no closed
loop optimization.
• The components at one level of abstraction mapped onto
templates at the next lower level of abstraction.
• They used their system to design and construct a multiprocessor
based robot arm controller.
Architectural models
Lin et al. developed an architectural model for heterogeneous
embedded systems.
• They developed a set of parameterizable libraries and CAD tools to
support design in this architectural paradigm.
• Their specifications are constructed from communicating machines.
• Depending on the synchronization regime used within each
component, adapters may be required to mediate communications
between components .
• Modules in this specification may be mapped onto programmable
CPUs or onto ASICs:
– Mapping is supported in both cases by libraries.
– The libraries for CPUs include descriptions of various I/O mechanisms
(memory mapped, programmed, interrupt, DMA)
– Communication primitives are mapped onto software libraries.
– For ASIC components, VHDL libraries are used to describe
communication primitives.
Performance estimation
Hardt and Rosenstiel developed an estimation tool for hardware
/software systems.
• Their architectural model for performance modeling is a co-
processor linked to the CPU through a specialized bus.
• They model communication costs using four types of transfer:
– Global data transfer
– Parameter transfer
– Pointer access and
– Local data transfer
and distinguish between reads and writes.
• They use a combination of static analysis and run-time tracing to
extract performance data.
• They estimate speedup based on the ratio of hardware and software
execution times and the incremental performance change caused
by communication requirements.
Performance estimation
Henkel and Ernst developed a clustering algorithm to try
to optimize the tradeoff between accuracy and
execution time when estimating the performance of a
hardware implementation of a dataflow graph.
They used a heuristic to cluster nodes in the CDFG given
as input to Cosyma.
Clustering optimized the size of the clusters to minimize
the effects of data transfer times and to create clusters
which are of moderate size.
Clustering improved the accuracy of the estimate of
hardware performance obtained by list scheduling the
clusters.
Vulcan
• Gupta and De Micheli's co-synthesis system uses a primal methodology:
start with a performance-feasible all-hardware solution, then move functionality to SW to
reduce cost
• Vulcan emphasizes the analysis of concurrency
– Eliminating unnecessary concurrency,
i.e., such functions are moved to the CPU side
• The system function is written as a HardwareC program.
– HardwareC provides some data types for hardware
– Adds constructs for the description of timing constraints.
– Provides serialization and parallelization constructs to aid data dependencies.
– However, it is represented as a flow graph.
• A flow graph includes a set of nodes representing functions and set of edges specifying
data dependencies.
• The operations represented by the nodes are typically low-level operations: such as
multiplications.
• Each edge has a boolean condition under which the edge is traversed in order to control
the amount of parallelism
Vulcan
• In the first flow graph two assignments can be
executed in parallel
• In the second flow graph the conditional node controls
which one of the two assignments is executed
i.e, the flow of control during the execution of the flow
graph can split and merge, depending on the allowed
parallelism.
• The flow graph is executed repeatedly at some rate.
The designer should specify constraints on
the relative timing of operators and
the rate of execution of an operator
Vulcan
• Vulcan divides the flow graph into threads and allocates
those threads during co-synthesis.
– A thread boundary is always created by a non-deterministic
delay operation in the flow graph,
such as a wait for an external variable.
– Other points may also be chosen to break the graph into
threads.
– A scheduler on the CPU handles the scheduling of all threads,
whether on the CPU or on the co-processor
– Size of a software implementation of a thread can be relatively
straight forwardly estimated.
Biggest challenge is the amount of static storage required.
– Performance of both hardware and software implementations
of threads are estimated from the flow graph and basic
execution times for the operators.
Vulcan
• Partitioning's goal is to allocate threads to one of two partitions,
the co-processor (the set φH) or the CPU (the set φS),
such that the required execution rate is met and
the total implementation cost is minimized.
• Vulcan uses this cost function to drive co-synthesis:
F(w) = c1·Sh(φH) − c2·Ss(φS) + c3·β − c4·P + c5·|m|
where the ci are constants used to weight the components of the cost function,
the functions Sh and Ss measure HW and SW size respectively,
β is the bus utilization,
P is the processor utilization (must be less than 1 = 100% utilization), and
m is the set of variables which must be transferred between
the CPU and the co-processor.
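The cost function can be transcribed directly; the weights and the size/utilization measures are inputs supplied by the designer or by estimators (a sketch, not Vulcan's actual code):

```python
def vulcan_cost(c, hw_size, sw_size, bus_util, cpu_util, n_vars):
    """Vulcan-style weighted cost:
    F = c1*Sh(phiH) - c2*Ss(phiS) + c3*beta - c4*P + c5*|m|.

    c: tuple of the five weights (c1..c5).
    cpu_util (P) must stay below 1.0, i.e., under 100% utilization.
    """
    assert cpu_util < 1.0, "processor utilization must be < 100%"
    c1, c2, c3, c4, c5 = c
    return c1 * hw_size - c2 * sw_size + c3 * bus_util - c4 * cpu_util + c5 * n_vars
```

The signs reward moving work into software (larger Ss lowers cost) while penalizing hardware size, bus traffic, and the volume of variables crossing the CPU/co-processor boundary.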
Vulcan
• The first step in co-synthesis is to create an initial partition:
All the threads are initially placed in the hardware partition φH
• The co-synthesis algorithm then iteratively performs two steps:
1. A group of operations is selected to be moved across the partition
boundary.
The new partition is checked for performance feasibility by computing the
worst case delay through the flow graph given the new thread times.
If feasible, the move is selected
2. The new cost function is incrementally updated to reflect the new partition.
• Once a thread is moved to the software partition, its immediate
successors are placed in the list for evaluation in the next iteration:
. this is a heuristic to minimize communication between the CPU and
coprocessor
• The algorithm does not back track
once a thread is assigned to φs, it stays there.
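The primal iteration above might be sketched as the following greedy loop, with feasibility and cost supplied as caller callbacks (a hedged sketch of the no-backtracking structure, not Vulcan's implementation; the real algorithm also enqueues a moved thread's successors first to limit communication):

```python
def vulcan_partition(threads, feasible, cost):
    """Primal partitioning sketch: start with everything in hardware (phiH),
    greedily move threads to software (phiS) when the move stays feasible
    and lowers cost.  No backtracking: once in phiS, a thread stays there.

    feasible(sw_set) / cost(sw_set): caller-supplied estimators.
    """
    hw, sw = set(threads), set()
    worklist = list(threads)
    while worklist:
        t = worklist.pop(0)
        if t in sw:
            continue
        trial = sw | {t}
        # Accept the move only if the partition remains performance-feasible
        # and the implementation cost decreases.
        if feasible(trial) and cost(trial) < cost(sw):
            hw.discard(t)
            sw = trial
    return hw, sw
```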
Vulcan advantage
• The co-synthesis can produce mixed HW/SW
implementations which are considerably
faster than all-software implementation but
much cheaper than all-hardware designs.
Cosyma
• It uses a dual methodology
• It starts out with the complete system running on the CPU and moves basic
blocks to the ASIC accelerator to meet performance objectives.
• The system is specified in Cx, a superset of the C language with extensions for
specifying time constraints, communication, processes and co-synthesis
directives.
• The Cx compiler translates the source code into an extended syntax graph (ESG)
which uses DFGs to represent basic blocks and CFGs to represent program
structure.
• The ESG is partitioned into components to be implemented on CPU or on ASIC.
• Cosyma allocates functionality to the CPU or ASIC on the basic block level
a basic block is not split between software and custom hardware
implementation.
• Therefore, the major cost analysis is to compare the speed of a basic block as
implemented in software running on the CPU versus in hardware on the ASIC
Cosyma
• When a basic block b is evaluated for reallocation from the CPU to the ASIC, the change
in the basic block's performance can be written as

ΔC(b) = w · (t_HW(b) − t_SW(b) + t_com(Z) − t_com(Z ∪ {b})) × It(b)

where ΔC(b) – estimated decrease in execution time
t_HW(b), t_SW(b) – execution times of the HW & SW implementations of b
t_com(Z) – estimated communication time between CPU & co-processor
Z – set of basic blocks implemented on the ASIC
It(b) – total no. of times that b is executed
w – a weight chosen by the designer
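Transcribed directly (argument names are illustrative):

```python
def cosyma_delta_cost(w, t_hw, t_sw, t_com_before, t_com_after, iterations):
    """Cosyma-style estimated change when moving basic block b to the ASIC:
    dC(b) = w * (t_HW(b) - t_SW(b) + t_com(Z) - t_com(Z u {b})) * It(b).

    t_com_before/t_com_after: communication time before and after the move.
    A strongly negative value means the move is expected to pay off, since
    the hardware time plus added communication beats the software time.
    """
    return w * (t_hw - t_sw + t_com_before - t_com_after) * iterations
```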
Other partitioning systems
• Kalavade & Lee developed an iterative algorithm for co-synthesis of
mixed hardware-software systems which they call a global
criticality/local phase algorithm
• The two aspects of the algorithm are designed to control heuristic
search for performance and implementation cost.
• A global criticality measure estimates the criticality of a node in the
system schedule and drives the selection of nodes to move to the ASIC
• Local phase criterion is used to determine a low cost
implementation for a function.
Distributed system co-synthesis
• Does not use an architectural template to drive co-synthesis

• Instead it creates a multiprocessor architecture for the hardware


engine as part of co-synthesis

• The multiprocessor is usually heterogeneous in both its processing


elements, communication channels and topologies

• Less emphasis on custom ASICs and more on design of


multiprocessor topologies

• ASICs and CPUs in multi/heterogeneous. Low cost


Integer linear programming model
• Prakash and Parker
• First co-synthesis method for distributed computing systems
• For single rate co-synthesis problem
• Uses general ILP solvers to solve equations
• The methodology starts with single rate task graph and a
technology model for PEs, communication channels and processes.
• From these inputs the method creates a set of variables and an
associated set of constraints:
– One set of variables describes the task graph
– Another set of variables describes the technology model
– A set of auxiliary variables defines the implementation
– A set of constraints defines the structure of the system
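The core structural constraint — every process mapped to exactly one PE — can be sketched with 0/1 assignment variables (a toy check; the real formulation adds timing and communication constraints and hands everything to a general ILP solver):

```python
def check_assignment(x, processes, pes):
    """Validate the basic structural constraint of an ILP co-synthesis model.

    x[p][pe]: 0/1 variable, 1 iff process p is allocated to processing
    element pe.  Each process must be mapped to exactly one PE.
    """
    return all(sum(x[p][pe] for pe in pes) == 1 for p in processes)
```

In the full ILP, an objective such as total PE and channel cost is minimized subject to these structural constraints plus deadline constraints on the task graph.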
Performance analysis
• Worst-case execution time calculation
• Scheduling processes
• Multi-task processes on task graphs
• Finding initiation and completion times for each process in the task graph
• Obtaining more accurate estimates
• Performance models for hierarchical memory systems in embedded
multiprocessors
• Caches/buffer memory – reducing misses and increasing hits
Heuristic algorithms
• Allocates a set of heterogeneous PEs and communication channels
between them
– Objective – reduce the cost of PEs, I/O devices & communication channels
• Co-synthesis starts by allocating PEs,
ignoring communication costs;
it later fills in the communication channels & devices required
Steps:
1. Assign each process a separate PE
Perform initial scheduling
2. Reallocate processes to PEs to minimize total PE cost
3. Reallocate processes again to minimize amount of communication
required between PEs
4. Allocate communication channels
5. Allocate I/O devices – adding internally/externally
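Steps 1–2 of this flow might look like the following greedy merge loop, with a caller-supplied can_share() standing in for the schedulability test (an illustrative sketch; steps 3–5, communication and I/O allocation, are omitted):

```python
def heuristic_cosynthesis(processes, pe_cost, can_share):
    """Sketch of the heuristic: start with one PE per process, then greedily
    merge processes onto shared PEs while the schedule still fits, to
    minimize total PE cost.

    can_share(group): caller-supplied check that the group of processes is
    schedulable on a single PE.
    """
    allocation = [[p] for p in processes]          # step 1: one PE each
    merged = True
    while merged:                                  # step 2: reduce PE cost
        merged = False
        for i in range(len(allocation)):
            for j in range(i + 1, len(allocation)):
                group = allocation[i] + allocation[j]
                if can_share(group):               # merge two PEs into one
                    allocation[i] = group
                    del allocation[j]
                    merged = True
                    break
            if merged:
                break
    return allocation, len(allocation) * pe_cost   # PEs used, total PE cost
```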
System partitioning
• Use partitioning algorithms to divide unified behavioral description
– Equivalent to single task
– Into hardware
– Into software components
• The synthesis process uses operator-level partitioning but compares the
HW & SW implementations of each cluster
• Implemented on Co-SAW – the co-design system architect's workbench
– One phase simulates the complete model
– Another phase clusters operators in behavioral description to form
tasks
– Estimates costs of HW & SW implementation of clusters
– Data gathered is used to drive synthesis
Reactive system co-synthesis
• Control-dominated systems are often reactive systems, which react to
external events
– Such systems have control-rich specifications
• Rowson and Sangiovanni-Vincentelli describe an interface-based
design methodology and a simulator to support that methodology
• POLIS is a co-design system – includes both synthesis and
simulation tools
– Uses CFSM model to represent behavior
– Uses zero delay hypothesis
• Events move between communicating CFSMs in zero time
– A partitioning step allocates functions to hardware and software
implementations
– Hardware and software components can be synthesized from state
transition graphs
– Elements of the system can be mapped into microcontroller
peripherals by modeling them with library CFSMs.
Communication modeling and co-synthesis
• Communication is easy to neglect during design but is often a critical
resource in embedded systems
• Communication links can be a significant cost of the total system
implementation cost
• As a result, bandwidth is often costly
– The automotive industry adopted an optical bus
• The optical link uses plastic fiber for low cost and ruggedness
• Its bandwidth is therefore limited
• Proper scheduling and allocation are required for feasibility and low cost
• The CoWare system is a design environment for heterogeneous
systems, which require significant effort in building interfaces between
components.
– It is based on a process model, with processes communicating via ports
– SHOCK is a synthesis subsystem which generates both software (I/O drivers)
and hardware (Interface logic) interfaces to implement required
communication
