CSCE 5610 Project Report - Gozick

Embedded Architecture Comparison: A Smartphone Approach
Brandon Gozick
CSCE 5610 Computer Architecture Project
Dr. Kavi
5/5/2011

With smartphones increasing in popularity each month, we break down the hardware attributes of
the most popular Android-based smartphones of the past two years. After a brief introduction to
the relevant architectures, we implement and present results from multiple benchmarks that aim
to test CPU, I/O, and memory performance. An overall comparison of each phone is given, along
with a concluding evaluation of the best-performing phone, which we then correlate with its
hardware capabilities.
Table of Contents
Introduction
    Relevant Architectures
    ARMv6
    ARMv7
Related Work
Experimental Setup
    Linpack
    Nbench
    Quadrant
    DHTDroid
Results
    Linpack
    Quadrant
    Nbench
    DHTDroid
Problems and Conclusion
References
Appendix A
Introduction
A smartphone is a mobile phone that offers capabilities beyond those of the typical mobile
devices of the past several years. In most cases today, a smartphone offers functionality
similar to that of a personal computer (PC). There is no industry-standard definition of a
smartphone beyond the public expectation that users can accomplish tasks on it that they would
normally perform at a desktop or laptop. A device that is portable and able to handle daily and
even complex tasks is an ideal tool for an always-on-the-go society. For most people, a
smartphone is a phone that runs a complete and efficient operating system, providing an
easy-to-use interface and a dedicated platform that attracts application developers, which in
turn leads to an increased user base [1].
Smartphones today are abundant among the population and only increasing in number. With this
rise in popularity, the demand for these multi-tasking machines has also increased, and at a
dramatic rate. This has resulted in extreme growth in embedded architectures. These enhancements
have improved speed and response time, while the demand for the next best thing is still
conveyed by smartphone users, who make up an ever-increasing percentage of the population
between the ages of 15 and 24. It is reported that 84% of mobile phone users in the United
States between these ages own a smartphone [2]. Currently in the United States, over 86% of all
mobile traffic originates from a smartphone [3]. As this age group gets older, the demand for
more complex processor architectures will continue to grow. This growth, however, has pressure
points that limit the continued success that so many want.
With the effective development of smartphones, devices now incorporate more and more
functionality. The main problem seen by many users is that the more features a smartphone offers
and the user exercises, the more processing chips and processing cycles are required in
hardware. This increase brings an increase in power consumption. In other words, the more
hardware that greater functionality requires, the more battery power is needed. This direct
relationship is one of the greatest concerns when creating an embedded architecture, especially
a processor for a smartphone. It limits the design of an embedded architecture, so we have to
take the purpose of the device into account. The main reason behind these hardware limitations
is the lack of battery efficiency. Insufficient battery power plagues embedded devices,
especially mobile phones and smartphones. To reduce the power drained from the battery, many
manufacturers are forced to underclock the CPUs designed by ARM and others. The Motorola Droid,
for example, has a CPU rated at 600 MHz that is underclocked to 550 MHz for a better experience
for the user. Many software developers have extended this idea by creating switchable CPU speeds
that overclock or underclock the processor on the fly based on the task at hand.
Relevant Architectures
A few main embedded architecture design companies create the blueprints for today’s smartphone
processors. The dominant company today, with its architecture in over 85% of Android
smartphones, is Advanced RISC Machines (ARM). ARM uses a 32-bit reduced instruction set computer
(RISC) instruction set architecture (ISA) for most of its application processors [4].
The two ARM architectures that we focus on are ARMv6 and ARMv7, which underlie the ARM11 and
Cortex processors respectively. Figure 1 illustrates ARM’s processor and architecture layout. It
presents ARM’s processor line for the past few years, featuring classic, application, and
embedded processors. For this project and its smartphone focus, we concentrate on ARM’s Classic
and Application processors, which feature the ARMv6 and ARMv7 architectures respectively. We
give an overview of ARM architectures in general, followed by a more detailed analysis of each
architecture.
Figure 1 – ARM Processor Layout illustrating Classic, Application and Embedded processors. ARM architectures
are seen in the middle in gray text with the processors listed in its respective category above and ARM
architectural details below.
ARM has developed its architecture to be implemented across a very wide range of devices, all
of which have different performance needs and different ways of evaluating that performance.
From DVD players, set-top boxes, and televisions to mobile phones and even laptops, an ARM
processor is present in almost all of the embedded devices we use every day. Although these
embedded devices are very different from power-hungry desktop computers, ARM has found a way to
deliver comparable performance without requiring a powerful power source. ARM’s simple
architectural design has resulted in processors that approach the performance requirements of
most of today’s laptops while consuming very little power. This low power consumption is the key
idea that makes ARM stand out from other design companies, and it is what leads today’s mobile
phone makers to embed processors with a 32-bit ARM architecture. Its RISC design provides a
large, uniform register file. The load/store architecture simplifies data processing because
operations use only register contents rather than operating directly on memory. Addressing modes
are designed to be simple: load and store addresses are determined from register contents and
instruction fields only. The ability to load and store multiple registers with a single
instruction, together with conditional execution of instructions, maximizes data throughput and
execution throughput respectively. Some of the features that can be found in the ARM
architecture are:
• Level 1/2 Instruction Cache
• Level 1/2 Instruction TLB
• Level 1/2 Data Cache Refill
• Level 1/2 Data Cache Access
• Level 1/2 Data TLB Refill
• Level 1/2 Data Cache Write-Back
• Mispredicted Branch Execution
• No Prediction Branch Execution
• Predictable Branch Execution
• Cycle Count
Figure 1 shows the processors under the ARMv6 and ARMv7 architectures. Figure 2, the ARM
processor flow chart, illustrates the increasing performance and capability of the processors.
The ARM11, which uses the older ARMv6 architecture, is listed under the “classic” section, while
the newer “application” processors use the ARMv7 architecture, including the Cortex A8 processor
on which we focus in this project. To understand why the Cortex A8 is used more today than the
ARM11, we compare some aspects of the ARMv6 and ARMv7 architectures.
Figure 2 - ARM Processor Flow Chart illustrating increasing performance and increasing
capabilities of each processor. Most of the phones we use in this study utilize the Cortex A8
processor, which is an outdated processor according to ARM. As newer phones are released with
more capable processors, we see a significantly greater performance level.
ARMv6
The ARMv6 architecture supports data operations based on the Single Instruction Multiple Data
(SIMD) technique. With a SIMD implementation, a level of parallelism is achieved by having
multiple processing elements perform the same instruction simultaneously on multiple data
elements. This technique allowed for performance never before seen on a mobile phone. ARM has
been working to extend this work with a Multiple Instruction Multiple Data (MIMD) architectural
implementation. ARMv6 was designed to combine low cost and high performance in a newly designed
32-bit device. The market at the time was dominated by 8-bit devices, which were no match for an
ARM11 processor. The ARM11 also featured an 8-stage pipeline along with a variable cache and
memory management unit, which helped it make efficient use of memory. We list a few more
characteristics that helped make the ARMv6 architecture a dominant presence in the market.
• 8-stage pipeline
• SIMD capability
• Enhanced DSP instructions for increased performance
• Variable Cache and Memory Management Unit
• Typical DMIPS of 965 at Max
• Utilized at CPU speeds up to 600 MHz
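To make the SIMD idea above concrete, the following is a minimal sketch in Python of a packed byte-wise addition: one operation adds four independent 8-bit lanes held in a single 32-bit word. It is modeled loosely on the behavior of ARMv6 parallel-add instructions such as UADD8, but the function name `packed_add_u8` and the example operands are illustrative assumptions, not code from this project.

```python
def packed_add_u8(a: int, b: int) -> int:
    """Add four unsigned 8-bit lanes packed into two 32-bit words.

    Each lane wraps around modulo 256 and never carries into its neighbour,
    which is what distinguishes a packed add from an ordinary 32-bit add.
    """
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_a = (a >> shift) & 0xFF
        lane_b = (b >> shift) & 0xFF
        result |= ((lane_a + lane_b) & 0xFF) << shift
    return result


# Lanes from high to low: 0x01+0x10, 0x02+0x20, 0xFF+0x01 (wraps to 0x00), 0x04+0x40
print(hex(packed_add_u8(0x0102FF04, 0x10200140)))  # prints 0x11220044
```

In hardware the four lane additions happen in the same cycle; the loop here only models the result, not the timing.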
ARMv7
ARMv7 features more advanced technology than ARMv6, delivering significantly greater processing
performance while using less power than the previous architecture. ARMv7 introduces a new
processor line called Cortex. Cortex uses a mixed 16-bit and 32-bit instruction set, providing
all the advantages of RISC along with the small code size of the 16-bit Thumb instruction set
architecture, which adds over 120 instructions. ARMv7 processors run at speeds ranging from
600 MHz to 1 GHz and target applications requiring around 2000 DMIPS. This low-power design
achieves high performance through the combination of a dual-issue integer pipeline, an
integrated L2 cache, and an efficient 13-stage pipeline. The dual-issue integer pipeline makes
the processor superscalar: it can issue two instructions at the same time. The design also
features a symmetric dual ALU pipeline capable of handling most arithmetic instructions quickly
and efficiently. Accurate branch prediction becomes more important with the 13-stage pipeline,
which operates at a higher frequency than previous architectures. To minimize branch
mispredictions, the Cortex-A8 processor implements a two-level global history branch predictor,
which consists of a Branch Target Buffer (BTB) and a Global History Buffer (GHB). Both
structures can be accessed in parallel with instruction fetches, producing a highly optimized
pipeline cycle. An example of this is shown below in Figure 3.
Figure 3 – Branch Prediction of the ARMv7 architecture which features the instruction fetch, decode, and the
execution of the load and store instruction with both the ALU pipes present.
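The interaction between a BTB and a GHB described above can be sketched in a few lines of Python. This is a simplified illustration only: the table size, the 2-bit saturating counters, and the XOR of the branch address with the history register are assumptions chosen for clarity, not the documented Cortex-A8 design, and the class name `GlobalHistoryPredictor` is hypothetical.

```python
class GlobalHistoryPredictor:
    """Toy two-level predictor: a history-indexed counter table plus a BTB."""

    def __init__(self, history_bits: int = 8):
        self.mask = (1 << history_bits) - 1
        self.history = 0                            # recent outcomes, 1 = taken
        self.counters = [1] * (1 << history_bits)   # GHB: 2-bit counters, weakly not-taken
        self.btb = {}                               # BTB: branch address -> last target

    def _index(self, pc: int) -> int:
        # Assumed indexing scheme: combine branch address and global history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc: int):
        taken = self.counters[self._index(pc)] >= 2
        target = self.btb.get(pc)
        # Without a BTB entry there is no address to fetch from, so fall through.
        if taken and target is not None:
            return True, target
        return False, None

    def update(self, pc: int, taken: bool, target: int) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
            self.btb[pc] = target
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(pattern, repeats=50):
    """Replay one branch with a repeating taken/not-taken pattern."""
    predictor = GlobalHistoryPredictor()
    correct = total = 0
    for _ in range(repeats):
        for taken in pattern:
            predicted, _target = predictor.predict(0x1000)
            correct += predicted == taken
            total += 1
            predictor.update(0x1000, taken, 0x0F00)
    return correct / total


# A loop branch: taken seven times, then falls through once.
print(f"loop branch accuracy: {accuracy([True] * 7 + [False]):.2f}")
```

Because the history register records the last eight outcomes, the predictor can learn the loop exit as well as the loop body, which a single per-branch counter cannot do.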
ARMv7 has a single-cycle load-use penalty for fast access to the level 1 cache, which is
configurable as 16 KB or 32 KB and is 4-way set associative. The level 1 data cache is
write-back with no write allocation. ARMv7 features an integrated level 2 cache, giving it
dedicated low-latency, high-bandwidth access when interacting with the level 1 cache. The level
2 cache is 8-way set associative with a size of 64 KB. Having it dedicated improves both power
efficiency and speed. The Cortex A8 processor implements an advanced virtual memory system
architecture on an improved memory management unit, along with an advanced hardware floating
point unit that allows for higher-precision operations. These features explain why most phones
today use the combination of the ARMv7 architecture and the Cortex A8, which we discuss later
and present in Table 1.
• 13-stage pipeline
• 4-way set associative 16K or 32K Level 1 Cache
• 8-way set associative 64K dedicated Level 2 Cache
• Variable Cache and Memory Management Unit
• Typical DMIPS of 2000 at Max
• Utilized at CPU speeds up to 1 GHz
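The cache parameters listed above determine how an address is split into tag, set index, and line offset. The sketch below works this out in Python for the stated sizes and associativities. The 64-byte line size is an assumption, since the text does not give one, and the function names are hypothetical.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int = 64):
    """Return (number of sets, offset bits, index bits) for a set-associative cache.

    line_bytes=64 is an assumed line size; sizes are assumed to be powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr: int, size_bytes: int, ways: int, line_bytes: int = 64):
    """Split an address into (tag, set index, byte offset within the line)."""
    sets, offset_bits, index_bits = cache_geometry(size_bytes, ways, line_bytes)
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


KB = 1024
print("L1 32 KB, 4-way:", cache_geometry(32 * KB, 4))   # (128, 6, 7)
print("L1 16 KB, 4-way:", cache_geometry(16 * KB, 4))   # (64, 6, 6)
print("L2 64 KB, 8-way:", cache_geometry(64 * KB, 8))   # (128, 6, 7)
print("0x12345678 in the 32 KB L1:", split_address(0x12345678, 32 * KB, 4))  # (37282, 89, 56)
```

Under the assumed line size, the 32 KB level 1 cache and the 64 KB level 2 cache have the same number of sets; the level 2 cache simply holds twice as many lines per set.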
Related Work
Some prior work has compared ARM architectures against previous implementations, and some has
examined the effect of the architecture on the operating system and on phone/user interaction.
Some of these tests were performed on the CPU, memory, and battery. We describe a few papers
that have produced relevant data when comparing the ARM architecture in smartphones, and we
introduce benchmarking software written by DHTechnologies, a software company based in Austin,
Texas. We use this benchmarking software as part of our ARM architecture comparison between
different smartphones.
GreenDroid [5] is a project at the University of California that focuses on innovating and
expanding on microprocessor technology. Believing that this is where the future is headed, the
authors try to improve today’s architecture so that it remains useful in a dual-core and even
smaller environment as hardware sizes shrink. They address the silicon infrastructure of
processors to produce an economical, performance-oriented processor that uses dark silicon to
build specialized cores they dub conservation cores. These conservation cores are extremely
energy efficient compared to the microprocessors produced today. GreenDroid is an actual
prototype 45 nm processor built with these conservation cores, which tries to attack the
problems of Moore’s Law scaling and leakage. The GreenDroid architecture is shown in Figure 4.
Figure 4 – The GreenDroid Architecture. GreenDroid is a multicore mobile application processor
made up of (a) 16 non-identical tiles, each of which holds (b) components common to every tile,
such as the CPU, on-chip network, and a shared L1 data cache; (c) shows the connections between
these components and the conservation cores.
Freescale [6], a semiconductor company, researched the architectures needed to deliver a
product that would appeal to today’s market. They analyzed different embedded products,
including a mobile phone, focusing on the growth of the industry, the demand for performance,
and the limitations of today’s architectures in order to produce their own design. They
extended the ARM architecture with their own improved architecture, called Mobile Extreme
Convergence (MXC), which is said to reduce power consumption, improve memory access times, and
reach CPU speeds of around 500 MHz. They ran performance tests on their dedicated application
processor, which features a 128 KB on-chip L2 cache and improved on past designs. We present
some of their cache hit rate results in Figure 5, comparing their on-chip L2 cache with a system
with no L2 cache.
Figure 5 – L2 cache performance comparison of an HTC G1 smartphone containing an ARM11 processor
utilizing the ARMv6 architecture. (a) On the left is the L2 cache hit rate with no flash memory;
(b) shows the L2 cache hit rate with flash memory. The box highlights the typical operating
range. Results show that an on-chip L2 cache performs better in both scenarios.
DHTDroid [7] is a benchmarking tool-set created to show how the Android operating system
interacts with the ARM architecture. The goal of the project was to implement a set of
Android-based system benchmarks that generate an operating system and hardware abstraction
vector that can be compared across multiple smartphones with different hardware. The DHTDroid
tool-set consists of 12 individual macro-benchmarks that stress test the CPU, the TLB, the
cache, the memory, the I/O, and the network capabilities. We can use this tool-set to identify
potential hardware and operating system issues, which arise often because the Android OS is
updated every couple of months. We can also run a performance test, store the results, and later
compare them to see whether a hardware issue is decreasing system performance. We discuss the
tool-set further in the Experimental Setup section and its results in the Results section.
Experimental Setup
With all these advancements in embedded architecture, a comparison of the performance of each
processor and architecture type is needed. With it, we can try to gauge how large the system
enhancements are and where they are found. To do this, we first need a sample set of several
smartphones, all running Android, with different hardware implementations for a complete
comparison. Figure 6 shows five of the six phones that were used to evaluate the hardware. From
left to right in the figure, the phones are the HTC G1, HTC Hero, HTC Nexus One, Motorola Droid,
and Samsung Nexus S. The HTC EVO is not featured in this picture but is used in this project,
and its specifications are shown in Table 1.
Figure 6 - Five Android Smartphones used to obtain CPU and memory benchmarks. From left to right, HTC G1,
HTC Hero, HTC Nexus One, Motorola Droid, Samsung Nexus S. HTC EVO not present in the picture.
Since each phone has different hardware, we first have to recognize this and analyze how they
differ. Each of these phones uses a processor designed by ARM, but some feature the more
advanced ARMv7 architecture rather than ARMv6, which should result in increased performance, as
stated above in the Relevant Architectures section. Table 1 shows each phone and its release
date, processor type, instruction set (architecture), CPU speed, and internal memory
specifications (RAM and ROM). These phones were released from late 2008 to the end of 2010.
Within this two-year span, there has been significant growth in embedded architecture,
specifically by ARM with the upgrade from ARMv6 to ARMv7.
Table 1 – Android smartphone specifications which were used to test performance by benchmark analysis.
Phone            Year               Processor (CPU Core)                       Instruction Set  Max Clock Speed  Internal RAM  Internal ROM
HTC G1           October 22, 2008   Qualcomm MSM7201A ARM11                    ARMv6            528 MHz          192 MB        256 MB
HTC Hero         October 11, 2009   Qualcomm MSM7600A ARM11                    ARMv6            600 MHz          288 MB        512 MB
Motorola Droid   October 17, 2009   TI OMAP 3430 ARM Cortex A8                 ARMv7            550 MHz          256 MB        512 MB
HTC Nexus One    March 16, 2010     Qualcomm QSD8250 Snapdragon ARM            ARMv7            1 GHz            512 MB        512 MB
HTC EVO          June 4, 2010       Qualcomm QSD8650 Snapdragon                ARMv7            1 GHz            512 MB        1 GB
Samsung Nexus S  December 16, 2010  Samsung Hummingbird S5PC110 ARM Cortex A8  ARMv7            1 GHz            512 MB        16 GB
Multiple benchmark programs were run on each phone to get an idea of which phone is better in
each area. We performed an MFLOPS analysis, a CPU analysis, a memory analysis, and an overall
system benchmark analysis. We give an overview of each tool here and then present the results of
each in the next section.
Linpack
Linpack benchmarks [8] have been in use since the late 1970s and early 1980s. Linpack was
designed for performance tests on supercomputers, but since then it has become a scale used to
grade computer system performance and a standard test for the TOP500 list, which ranks the
world’s most powerful computer systems. “The Linpack benchmark is a measure of a system’s
floating point computing power,” and it has recently been ported to the Android OS, where it can
be used to evaluate the performance of each update to the operating system [9]. It measures how
fast a computer solves a dense N by N system of linear equations, Ax=b, which is a very common
task in engineering. The system is solved by Gaussian elimination with partial pivoting, using
2/3·N^3 + 2·N^2 floating point operations. The end result is a number expressing millions of
floating point operations per second (MFLOPS). An example screenshot showing Linpack for Android
results in MFLOPS is given in Figure 7.
Figure 7 – Example simulation screenshot of Linpack for Android running on the Nexus S. Benchmark results
represent the MFLOPS and the time taken to execute.
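The way a Linpack-style MFLOPS figure is derived can be sketched as follows. This is a minimal pure-Python illustration, not the Linpack for Android code: it solves a random dense system by Gaussian elimination with partial pivoting, times the solve, and divides the nominal operation count 2/3·N^3 + 2·N^2 by the elapsed time. The problem size, random seed, and function names are assumptions.

```python
import random
import time


def solve(a, b):
    """Solve Ax = b in place by Gaussian elimination with partial pivoting."""
    n = len(a)
    for k in range(n):
        # Partial pivoting: move the largest remaining entry of column k to the diagonal.
        p = max(range(k, n), key=lambda r: abs(a[r][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def linpack_mflops(n=100, seed=0):
    """Return (MFLOPS, max error) for one timed solve of an n by n system."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]          # chosen so the exact solution is all ones
    start = time.perf_counter()
    x = solve(a, b)
    elapsed = time.perf_counter() - start
    flops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2
    return flops / elapsed / 1e6, max(abs(v - 1.0) for v in x)


mflops, err = linpack_mflops()
print(f"{mflops:.2f} MFLOPS, max error {err:.2e}")
```

The number printed depends heavily on the interpreter and the machine, so it is not comparable with the phone results in this report; the point is only how the operation count and the timing combine into one MFLOPS value.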
Nbench
Nbench is a performance evaluation tool that tests CPU efficiency. It is an old tool,
originally written in 1995 for an old UNIX distribution. Since then, it has been modified and
ported to many operating systems such as Unix/Linux, Windows, ARM Evaluation, and, recently,
Android. Nbench stresses the CPU in a number of areas, including numeric sort, Fourier, and
Huffman tests, and concludes with an integer index and a floating point index. “The benchmark
was designed to expose the capabilities of a system's CPU, FPU, memory and C compiler
performance.” We can use it to compare performance levels across Android smartphones. An example
screenshot of Nbench running on an Android phone is shown in Figure 8, illustrating the relevant
CPU performance output.
Figure 8 – Example screenshot of Nbench running on an Android Device. The resulting output illustrates each of
the CPU tests shown in the middle, while the indices of each value, memory, integer, and floating point, are
shown at the top in yellow.
Quadrant
Quadrant is an independent benchmarking tool created specifically for Android devices. We run
this benchmark to get an idea of total system performance, which includes the CPU, memory, I/O,
and graphics. Because Quadrant combines all of these aspects using its own scoring scale, we
cannot interpret the scores it outputs in absolute terms. As a result, we can only compare each
phone’s Quadrant score and identify which phone performs better than the rest. Table 2 shows the
measurement tests that are performed in each of the four areas: CPU, Memory, I/O, and Graphics.
We also show screenshots of the program running on an Android device in Figure 9.
Table 2 – Quadrant Benchmark Tool showing the area being tested along with the benchmarking measurements
Quadrant Benchmark Tool
Hardware    Measuring Test
CPU         Branch Logic, Integer, Long Int, Short Int, Byte,
            Floating Point, Double Precision, Checksum,
            Compression, XML Parsing, Video Decoding
            (H.264), Audio Decoding (AAC)
Memory      Throughput
I/O         File System Reads, File System Writes,
            Database Reads, Database Writes
Graphics    2D/3D – Frames per Second
Figure 9 – Screenshot examples of the Quadrant Benchmarking Tool running on an Android device.
(a) The tests performed on the device to produce an overall score for Android device comparison.
(b) The final output, which shows the score for each category by color: blue – CPU, red –
Memory, green – I/O, orange – 2D graphics, and yellow – 3D graphics.
DHTDroid
DHTDroid is a UNIX shell executable tool-set written to analyze the hardware functionality of
the embedded architecture present on a device. Specifically, it was custom written for use on
Android smartphones and targets mainly CPU and memory performance. This benchmark tool-set was
ported from Linux and rewritten to meet embedded architecture standards. The tool-set consists
of multiple performance evaluation scripts that are run by the kernel to collect a number of
benchmarks, including:
1) cacheperf – Measures TLB, Cache, and memory performance

2) ctxswtch – Measures CPU and context switch performance
3) memcache – Measures data cache and memory efficiency/performance
4) syscpu – Measures CPU and system call subsystem performance
5) numsim – Measures the efficiency of executing math functions (vector, matrix)
Due to Android file system privileges, only a rooted phone, that is, a phone on which the user
has gained access to the root (administrator) file system, can run these scripts. Only three of
our six phones fell into this category: the Motorola Droid, HTC Nexus One, and HTC EVO. Each of
the five benchmarks above was run on these three phones.
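To give a sense of what a system call micro-benchmark in the spirit of syscpu measures, the sketch below times a cheap system call in a loop and compares it against an empty function call. This is a hypothetical illustration of the method only: it is not DHTDroid code, the function names are invented, and in Python the interpreter overhead is a large part of both numbers.

```python
import os
import time


def time_per_call_ns(func, iterations=200_000):
    """Average wall-clock time of one call to func, in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        func()
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations


def noop():
    """Baseline: a call that never enters the kernel."""
    return None


loop_cost = time_per_call_ns(noop)
syscall_cost = time_per_call_ns(os.getpid)   # os.getpid is used here as a cheap system call
print(f"baseline call: {loop_cost:.0f} ns")
print(f"getpid call:   {syscall_cost:.0f} ns")
print(f"difference:    {syscall_cost - loop_cost:.0f} ns")
```

Subtracting the baseline is a rough way to separate the cost of entering the kernel from the cost of the measurement loop itself; a native benchmark would do the same with far less overhead.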
Results
As we run each benchmark on each phone, we look at each benchmark individually and compare only
what that test analyzes across the six phones. Since the G1 is the oldest and the Nexus S is the
newest, we expect the results to reflect the idea that a newer phone will produce better
benchmarks. However, this may not be the case, since the efficiency of the operating system also
influences the performance of a device. With newer versions of the Android OS present on many of
the devices, we might see benchmark gains that come from better performance in software rather
than hardware, but we do not neglect the fact that there are different and, to an extent, better
hardware upgrades in each phone.
Table 3 – Performance table featuring results from the Linpack benchmark (MFLOPS) and Quadrant.
Each phone’s BOGO-MIPS value and external memory read/write rates are also shown.
Phone            MFLOPS Single    MFLOPS Double    Quadrant  BOGO-MIPS  Memory Read  Memory Write
                 Precision (SP)   Precision (DP)                        (MB/s)       (MB/s)
HTC G1           2.213539         1.5514531        274       383.38     4.0          6.4
HTC Hero         5.9771547        4.1366262        746       599.65     4.3          6.5
Motorola Droid   12.462279        7.048375         1099      615.35     5.5          6.4
HTC Nexus One    16.530212        8.794049         1142      662.40     7.0          23.2
HTC EVO          17.004862        9.604989         1225      780.55     7.5          21.1
Samsung Nexus S  15.497228        11.538064        1407      996.31     8.3          23.9
Linpack
Table 3 shows the performance results of two of the benchmarks we ran, Linpack and Quadrant.
First we focus on the Linpack results, which are shown in the MFLOPS columns of the table. We
measured the floating point operations per second (FLOPS) of each processor, which is one
measure of a computer’s performance. With the single instruction multiple data (SIMD) capability
in the ARM architecture, we can also measure the 64-bit double precision FLOPS (DP) along with
the standard single precision FLOPS (SP). This parallelism allows these phones to achieve better
performance on a single core than a single core without SIMD. The values in Table 3 are
megaFLOPS (MFLOPS), meaning that a computer rated at 5 MFLOPS performs 5 x 10^6 floating point
operations per second. This is an important number, and we can say, to an extent, that a
computer with a high MFLOPS value will have good CPU performance. The results for both single
precision and double precision follow a consistent path: better hardware/software
implementations have better CPU performance. This is shown clearly when comparing the HTC G1
with the HTC EVO. The G1 has an ARMv6 ARM11 processor clocked at 528 MHz, while the EVO has an
ARMv7 Snapdragon processor clocked at 1 GHz. With the increased clock speed and significantly
improved hardware architecture, the MFLOPS measurements illustrate this idea clearly. We can
also see that the single precision MFLOPS value is higher than the double precision value. This
is expected, as a dual-instruction pipeline is eventually slower than a single-instruction one,
but being able to issue two instructions at once yields much better system efficiency. This can
be seen with the Samsung Nexus S, which has a Cortex A8 processor, also clocked at 1 GHz. The
Nexus S, though it has a lower single precision MFLOPS value, has the highest double precision
MFLOPS value, which can also have a great effect on the performance of a machine. Therefore, we
should not simply compare these two numbers and say that a system with a higher single precision
number is faster overall without factoring in what else is going on in the infrastructure. We
must look at what else is happening in the system, such as I/O operations and memory
limitations, both of which have a greater influence on a system.
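One way to read the Linpack numbers is to express each phone relative to the G1 and to normalize by clock speed, which helps separate the effect of a faster clock from the effect of a newer architecture. The sketch below does this in Python using only the MFLOPS values from Table 3 and the maximum clock speeds from Table 1. The normalization itself is an illustrative choice, not an analysis performed in the original study, and the function name `summarize` is hypothetical.

```python
# (single precision MFLOPS, double precision MFLOPS, max clock speed in MHz)
# MFLOPS values are taken from Table 3, clock speeds from Table 1.
results = {
    "HTC G1": (2.213539, 1.5514531, 528),
    "HTC Hero": (5.9771547, 4.1366262, 600),
    "Motorola Droid": (12.462279, 7.048375, 550),
    "HTC Nexus One": (16.530212, 8.794049, 1000),
    "HTC EVO": (17.004862, 9.604989, 1000),
    "Samsung Nexus S": (15.497228, 11.538064, 1000),
}


def summarize(results, baseline="HTC G1"):
    """Return (phone, SP speedup, DP speedup, SP MFLOPS per GHz) for each phone."""
    base_sp, base_dp, _ = results[baseline]
    rows = []
    for phone, (sp, dp, mhz) in results.items():
        rows.append((phone, sp / base_sp, dp / base_dp, sp / mhz * 1000.0))
    return rows


for phone, sp_x, dp_x, sp_per_ghz in summarize(results):
    print(f"{phone:16s} SP x{sp_x:5.2f}  DP x{dp_x:5.2f}  SP MFLOPS/GHz {sp_per_ghz:6.2f}")
```

A per-GHz figure is only a rough indicator, since the phones also differ in Android version and memory system, as noted above.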
Quadrant
Table 3 also shows the Quadrant scores, which try to quantify a system’s overall performance
based on CPU operations, I/O operations, memory, and graphics. We do not have individual scores
for each category, only the final number, which we can use to compare against the other phones
in this project. As with the Linpack results, the HTC G1 performs at the bottom of the list.
Given the architecture and processor limits at the time it was released, these results are not
surprising. The Droid, Nexus One, and EVO all have similarly high Quadrant results, which means
that performance on these phones is good, but not better than the Nexus S. With a score of 1407,
the Nexus S outperforms all the other phones in this project. However, the Quadrant benchmark is
heavily influenced by the GPU hardware, which it exercises through 2D and 3D graphics tests.
When the G1 and other early smartphones were being released, embedded designers were not as
focused on improving phone graphics as they are today. This benchmark should therefore be used
only on recently released phones rather than on an older phone built as an introductory
smartphone.
[Bar chart of Quadrant scores: Nexus S 1407, EVO 1225, Nexus One 1142, Droid 1099, Hero 746, G1 274]
Figure 10 – Quadrant Benchmark Results. This illustrates an overall system benchmark which stresses the CPU,
memory, I/O, and 2D/3D graphics. Since the benchmark includes graphics tests, every phone compared must have
a graphics unit. The Nexus S outperforms all the other phones, but given the limitations of the Quadrant
software used, we do not know in which area. We can only assume that its lead in each area accumulates into
the highest overall score. Since this is an overall benchmark tool, we can only compare the phones' total
scores and cannot make a statement on the contributions of the individual hardware.
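Because only the total score is available, the most that can be done with the Quadrant data is to compare totals. The following is a minimal sketch of that comparison, using the scores from Figure 10; the helper name `relative_to` and the choice of the G1 as the baseline are illustrative assumptions, not part of Quadrant.

```python
# Quadrant totals as reported in Figure 10 of this report.
quadrant_scores = {
    "Nexus S": 1407,
    "EVO": 1225,
    "Nexus One": 1142,
    "Droid": 1099,
    "Hero": 746,
    "G1": 274,
}


def relative_to(scores, baseline):
    """Return each phone's score divided by the baseline phone's score."""
    base = scores[baseline]
    return {phone: score / base for phone, score in scores.items()}


# Hypothetical choice: normalize against the oldest phone in the study.
ratios = relative_to(quadrant_scores, "G1")

for phone, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{phone:<10} {quadrant_scores[phone]:>5}  {ratio:.2f}x G1")
```

Normalized this way, the Nexus S scores roughly five times the G1, but the ratio still says nothing about which subsystem is responsible.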
Nbench
Next, we look at the results given by the performance analysis tool Nbench. We focus here on the
indexes given for Memory, Integer, and Floating Point, shown at the top as a scale in Figure 8. Each
index focuses on a different aspect of the system: the Integer index tests how fast the CPU can
perform integer calculations, the Memory index measures memory performance within the system, and the
Floating Point index also tests the CPU, this time using floating point calculations. The Nbench
results are consistent with those of the two previous benchmark tools. Looking at the Integer index,
the Nexus One, EVO, and Nexus S all top out around an index of 4, while the G1 and Hero max out around
the 1.5 mark. Performing any integer calculations on the newer phones will result in better performance
times. The same can be said about memory performance, with the Nexus S having a greater index than
any of the others. This time the values are not as closely grouped as for the Integer index, but move
up gradually, almost like a step function, from the Droid to the Nexus One, the EVO, and finally the
Nexus S. The Floating Point index is harder to interpret. We are not sure how this index is calculated,
but it is consistent with the previous MFLOP results from Linpack if the index uses a 10^7 scale. The
Nexus S was the fastest with double precision and the EVO with single precision. The Droid, Nexus One,
EVO, and Nexus S all perform well in this area, with results significantly greater than both the Hero
and the G1. A more detailed result for each phone using Nbench is given in Appendix B.
[Grouped bar chart: Nbench Performance Analysis. Vertical axis: Benchmark Value (0 to 4.5). Groups: Integer, Memory, Floating Point. Series: HTC G1, HTC Hero, Motorola Droid, HTC Nexus One, HTC EVO, Samsung Nexus S]
Figure 11 – Nbench results from the three benchmark tests: Integer, Memory, and Floating Point. From this graph
we can see that the G1, an older phone, is not up to the performance levels of the newer phones released
in 2010. These newer phones, the Nexus One, EVO, and Nexus S, all stand out, excelling in each category. Since
the Samsung Nexus S has the newest hardware and software, we expect it to be the best performing phone in
this project, and these results reflect that. A more detailed result for each phone is given in Appendix B.
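The per-test values in the HTC G1 output of the NBench appendix suggest how the composite indexes are formed: each appears to be the geometric mean of the per-test index values in its group. The sketch below checks this against the G1 numbers. The grouping of tests (three floating point tests, three memory tests, the rest integer) is an assumption inferred from the output, not something stated in this report.

```python
import math

# Per-test "Old Index" values (Pentium 90 baseline), copied from the
# HTC G1 output in the NBench appendix.
g1_old_index = {
    "NUMERIC SORT": 2.24, "STRING SORT": 1.74, "BITFIELD": 4.04,
    "FP EMULATION": 4.39, "FOURIER": 0.17, "ASSIGNMENT": 4.01,
    "IDEA": 4.74, "HUFFMAN": 3.97, "NEURAL NET": 0.32,
    "LU DECOMPOSITION": 0.30,
}

# "New Index" values (AMD K6/233 baseline) for the tests ASSUMED to make
# up the memory group.
g1_new_index_memory = {"STRING SORT": 0.27, "BITFIELD": 0.84, "ASSIGNMENT": 1.04}

# ASSUMED floating point group; every other test is treated as integer.
FP_TESTS = ("FOURIER", "NEURAL NET", "LU DECOMPOSITION")


def geometric_mean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


fp_index = geometric_mean(g1_old_index[t] for t in FP_TESTS)
integer_index = geometric_mean(
    v for t, v in g1_old_index.items() if t not in FP_TESTS
)
memory_index = geometric_mean(g1_new_index_memory.values())

print(f"INTEGER INDEX        ~ {integer_index:.3f}  (reported 3.397)")
print(f"FLOATING-POINT INDEX ~ {fp_index:.3f}  (reported 0.254)")
print(f"MEMORY INDEX         ~ {memory_index:.3f}  (reported 0.618)")
```

Small differences from the reported values are expected, since the per-test inputs here are already rounded to two decimals. A geometric mean also explains why a single very slow test, such as FOURIER on the G1, pulls a whole index down.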
DHTDroid
Using this benchmark tool-set, we focused on testing the CPU and the memory. Using 5 different tests,
we examine the performance levels of the Motorola Droid, HTC Nexus One, and HTC EVO. Since the tool-set
can only be used on phones that have root access, these were the only phones tested using DHTDroid.
We look first at Memcache, a memory benchmark that walks through the data cache and the physical memory
subsystems, invoking different levels of the memory hierarchy. Memcache performs read and write
operations in memory and obtains the cache latency times, which are given in nanoseconds. We tested the
1024K memory level using increasing read and write sizes from 4B to 525KB. Figure 12 illustrates the
results from the memcache memory performance test. Each phone starts out on the 1024K level, and access
times vary as the read and write size increases. Latency values less than 100ns reflect high data cache
usage (read/write position), whereas latency values higher than 100ns reflect operations that are mainly
executed at physical memory speeds. The complexity of the memory subsystem, as well as the number of
CPUs present in the phone (in this case only one for each phone), has a significant impact on these
benchmark results. We can see that the Droid has higher latency times than the EVO and Nexus One. The
Nexus One and EVO have different latency times when the cache is used, but equal access times when the
physical memory system is used.
[Line chart: Latency of Read/Write Operations in Memory. Vertical axis: Cache Latency (ns), 0 to 600. Horizontal axis: Memory Hierarchy Size (x 10^5), 0 to 5. Series: Motorola Droid, HTC EVO, HTC Nexus One]
Figure 12 – Results from the memcache performance test. Cache latency times of the 1024K memory hierarchy
subsystem using increasing read/write operations. Lower cache times reflect a high data cache usage while
higher latency times represent operations that are executed at physical memory speeds. The EVO and Nexus
One perform well in this area compared to the Droid, but are equal when analyzing times with high data cache
usage.
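The 100ns rule of thumb used to read Figure 12 can be written down directly. The following is a minimal sketch; the sample points are hypothetical placeholders rather than measurements from this project, and treating a latency of exactly 100ns as physical memory is an arbitrary choice, since the text does not say which side it falls on.

```python
# Threshold quoted in the text for separating cache hits from memory accesses.
CACHE_THRESHOLD_NS = 100.0


def classify_latency(latency_ns, threshold_ns=CACHE_THRESHOLD_NS):
    """Label a measured latency as mostly data cache or mostly physical memory."""
    return "data cache" if latency_ns < threshold_ns else "physical memory"


# HYPOTHETICAL (access size in bytes, latency in ns) samples for illustration
# only; these are not values read from Figure 12.
samples = [(4, 14.0), (4096, 35.0), (262144, 240.0), (524288, 510.0)]

for size_bytes, latency_ns in samples:
    print(f"{size_bytes:>7} B  {latency_ns:6.1f} ns  {classify_latency(latency_ns)}")
```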
CPU Context Switching
For this performance test, we use the ctxswtch program to evaluate the CPU and context switches. The
number of context switches to perform is passed to the program as a parameter.
We evaluated each phone using 4,200,000 context switches on the CPU. The results from this test are
shown in Table 4, which gives the total number of context switches performed and the total amount
of time spent for each phone. The Droid lags in CPU performance here: its total time was
more than twice that of the Nexus One and about triple that of the EVO. This means that
the number of context switches per second sustained by the CPU is lower on the Droid. This lower
number shows the limitations of its CPU, as it takes longer to finish the context switching test, which
the table reflects. We also present the context switch time in microseconds. The EVO produces a very low
switch time of 16.80µs, followed by the Nexus One with 22.71µs and the Droid with almost 50µs. Each
smartphone CPU performed the test at about 50% utilization.
Table 4 – Ctxswtch performance results - CPU and Context Switches

Phone | Total Time (s) | Total Number of Context Switches | Context Switch Frequency (Hz) | Time per Switch (µs)
Motorola Droid | 101.33 | 4195303 | 20155.71 | 49.61
HTC Nexus One | 47.12 | 4195800 | 44031.04 | 22.71
HTC EVO | 33.71 | 4196664 | 59527.84 | 16.80
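The comparisons drawn from Table 4 can be reproduced with a few lines of arithmetic. This sketch computes the Droid's slowdown relative to the other two phones from the total times, and checks that the frequency column appears to be the reciprocal of the time per switch. The function names are illustrative and are not part of ctxswtch.

```python
# Rows copied from Table 4:
# (total time in s, context switches, frequency in Hz, time per switch in µs)
table4 = {
    "Motorola Droid": (101.33, 4195303, 20155.71, 49.61),
    "HTC Nexus One": (47.12, 4195800, 44031.04, 22.71),
    "HTC EVO": (33.71, 4196664, 59527.84, 16.80),
}


def slowdown(table, phone, reference):
    """Ratio of total test time: how many times longer `phone` took."""
    return table[phone][0] / table[reference][0]


def frequency_from_switch_time(time_per_switch_us):
    """Switches per second implied by a per-switch time in microseconds."""
    return 1e6 / time_per_switch_us


print(f"Droid vs Nexus One: {slowdown(table4, 'Motorola Droid', 'HTC Nexus One'):.2f}x")
print(f"Droid vs EVO:       {slowdown(table4, 'Motorola Droid', 'HTC EVO'):.2f}x")

for phone, (_, _, freq_hz, switch_us) in table4.items():
    implied = frequency_from_switch_time(switch_us)
    print(f"{phone:<15} reported {freq_hz:9.2f} Hz, implied {implied:9.2f} Hz")
```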
We next look at the results from the cacheperf benchmark. While running this tool, the Droid
encountered an error and did not output the final results, so we can only report the performance
of the Nexus One and EVO. Cacheperf is a memory performance tool which quantifies the performance
of the cache/memory subsystem. It also measures the CPU and TLB access/latency values.
Table 5 – Cacheperf performance of the cache/memory along with TLB latency values

Nexus One
CPU + L1 Access: 13.95 ns
Cache Subsystems:
Level | Size | Line Size | Cache Miss Latency | Cache Replace Time
1 | 256 KB | 128 bytes | 241.28 ns | 248.00 ns
TLB Subsystem:
Level | Size | Page Size | TLB Miss Latency
1 | 80 | 4 KB | 14.17 ns
2 | 1536 | 8 KB | 78.34 ns

EVO
CPU + L1 Access: 13.95 ns
Cache Subsystems:
Level | Size | Line Size | Cache Miss Latency | Cache Replace Time
1 | 256 KB | 128 bytes | 258.24 ns | 241.54 ns
TLB Subsystem:
Level | Size | Page Size | TLB Miss Latency
1 | 80 | 4 KB | 14.60 ns
2 | 1536 | 8 KB | 77.16 ns
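One way to relate the Table 5 values to the average access time discussed in class is the standard formula: average access time = hit time + miss rate × miss penalty. The sketch below applies it under two assumptions that are not confirmed by the cacheperf output: that "CPU + L1 Access" can stand in for the hit time and "Cache Miss Latency" for the miss penalty. The miss rate is a hypothetical value, since cacheperf does not report one.

```python
def average_access_time(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time: hit time plus the expected miss cost."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time_ns + miss_rate * miss_penalty_ns


# Values copied from Table 5 (identical L1 access time on both phones).
L1_ACCESS_NS = 13.95
MISS_LATENCY_NS = {"Nexus One": 241.28, "EVO": 258.24}

# HYPOTHETICAL miss rate chosen only to illustrate the formula.
ASSUMED_MISS_RATE = 0.05

for phone, miss_ns in MISS_LATENCY_NS.items():
    amat = average_access_time(L1_ACCESS_NS, ASSUMED_MISS_RATE, miss_ns)
    print(f"{phone:<10} average access time ~ {amat:.2f} ns at a 5% miss rate")
```

With the same hit time on both phones, any difference in average access time comes entirely from the miss latency, so the gap between the two phones grows with the miss rate.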
Numsim, a raw floating point performance evaluation of the CPU, was executed on each of the three
phones. The benchmark operates in a closed loop using 7 different matrix/vector scenarios (Bm1-Bm7),
where Bm stands for Benchmark. The results are reported in millions of floating point operations per
second (MFLOPS). The vector equations used in the numsim test are shown below for each benchmark
scenario, while the results of each are shown in Table 6.
Bm1: Vector Copy D[i] = A[i]
Bm2: Vector Add D[i] = A[i] + B[i]
Bm3: Vector Multiply D[i] = A[i] * B[i]
Bm4: Vector Divide D[i] = A[i] / B[i]
Bm5: Vector Add-Multiply D[i] = A[i] + B[i] * C[i]
Bm6: Vector Add-Divide D[i] = A[i] + B[i] / C[i]
Bm7: Matrix Vector Product of a 5-Diagonal Sparse Matrix
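The element-wise scenarios Bm1 through Bm6 can be sketched as follows, together with a simple way to turn a timing into an MFLOPS figure. This is an illustration only, not the numsim implementation: the operation count per element (including counting the Bm1 copy as one operation) is an assumption, Bm7 is omitted because the report does not give its matrix layout, and interpreted Python numbers are not comparable to the native results in this report.

```python
import time

# Element-wise kernels for Bm1-Bm6. The first field is an ASSUMED operation
# count per element; numsim's own accounting is not given in the report.
KERNELS = {
    "Bm1": (1, lambda a, b, c: a),           # Vector Copy
    "Bm2": (1, lambda a, b, c: a + b),       # Vector Add
    "Bm3": (1, lambda a, b, c: a * b),       # Vector Multiply
    "Bm4": (1, lambda a, b, c: a / b),       # Vector Divide
    "Bm5": (2, lambda a, b, c: a + b * c),   # Vector Add-Multiply
    "Bm6": (2, lambda a, b, c: a + b / c),   # Vector Add-Divide
}


def run_kernel(name, A, B, C):
    """Compute D[i] for every element using the named kernel."""
    _, fn = KERNELS[name]
    return [fn(a, b, c) for a, b, c in zip(A, B, C)]


def measure_mflops(name, n=100_000, repeats=5):
    """Time the kernel on n-element vectors and report millions of ops per second."""
    ops_per_element, _ = KERNELS[name]
    A = [1.0 + i for i in range(n)]
    B = [2.0 + i for i in range(n)]
    C = [3.0 + i for i in range(n)]
    start = time.perf_counter()
    for _ in range(repeats):
        run_kernel(name, A, B, C)
    elapsed = time.perf_counter() - start
    return ops_per_element * n * repeats / elapsed / 1e6


if __name__ == "__main__":
    for name in KERNELS:
        print(f"{name}: {measure_mflops(name):.2f} MFLOPS (interpreted Python)")
```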
Table 6 – Numsim performance results for each of the benchmark scenarios testing different vector equations (MFLOPS)

Phone | Bm1 | Bm2 | Bm3 | Bm4 | Bm5 | Bm6 | Bm7
Droid | 11.90 | 4.66 | 4.43 | 0.968 | 5.11 | 1.69 | 5.80
Nexus One | 26.70 | 7.35 | 9.79 | 2.02 | 8.26 | 3.24 | 9.27
EVO | 24.80 | 7.56 | 8.97 | 2.02 | 8.29 | 3.08 | 9.52
From the results in Table 6, we can see that the EVO and Nexus One both perform well in floating
point arithmetic, with very similar results in all seven scenarios. The Droid lags in this area,
reaching between about 45% and 63% of the Nexus One's throughput depending on the scenario, which could
be due to its less capable processor. This matches the other benchmarks, in which the Nexus One and EVO
also performed well.
Problems and Conclusion

Mobile phones continue to be upgraded in both hardware and architecture. The combination results in
greater performance and greater potential for a mobile device to operate more like a desktop computer,
and this seems to be the direction we are headed as more advanced phones continue to be released. For
this project we first looked at benchmarks provided by Android developers, optimized to run on the Java
Virtual Machine as external applications on the Android operating system. Two of these benchmarks,
Linpack and Nbench, are well-known programs that were ported to the Android OS to evaluate both hardware
and software capabilities. We mentioned that benchmarks are influenced not only by hardware and
architecture specifications but also by software. As the Android OS keeps being upgraded, and different
phones run different OS versions, we will see increases and sometimes decreases in performance levels.
These benchmarks can be used for future analysis of mobile devices as hardware requirements increase
significantly. We performed MFLOP tests in multiple ways on multiple benchmarks, which helped us grasp
the capabilities and performance levels of the CPU.

In class, we discussed memory and cache performance in terms of average access time and latency values.
The DHTDroid toolset helped us grasp these ideas by running tests on a familiar embedded device used
daily by most people, a smartphone. The toolset offered a great analysis experience, as we were able to
stress not only the CPU but also the memory and cache, including the TLB. We were able to capture cache
latency values of the memory hierarchy and compare them across a number of different Android
smartphones. Getting the toolset to execute on the phones was originally a problem, because
administrator (root) capabilities weren't available on some of the phones. It took a while to realize
this and to find an appropriate place in the file system with executable directory permissions. Once a
solution was found for these problems, we were able to run the DHTDroid toolset smoothly and analyze the
results for each phone.

When analyzing the specifications of each phone, specifically the ARMv6 and ARMv7 architectures, we can
clearly see a large performance improvement from the ARMv6 G1 and Hero smartphones to the ARMv7 Droid,
Nexus One, EVO, and Nexus S smartphones. In every performance evaluation, the ARMv7 phones outperformed
the ARMv6 phones by a significant margin. This suggests that ARM has made many architectural design
improvements, as we discussed in the relevant architecture section of the Introduction. The Quadrant
results show how far embedded mobile architectures have come, and the design process does not seem to
be slowing down, with increasing processor speeds, dual core processors, and even quad core processors
already in the works. ARM has already released a design of the Cortex A15, a dual core processor
architecture which is said to consume less power and have higher performance levels than the widely used
Cortex A8 processor. Using these benchmark utilities, we were able to compare phones with different
hardware specifications and finalize an analysis of each.
Annotated References
[1] P. Zheng and L. M. Ni, “Spotlight: the rise of the smart phone,” IEEE Distributed Systems Online, vol.
7, no. 3, March 2006.
This paper discusses an overview and introduction of the smartphone and what it takes to classify a
mobile device as a smartphone. A brief discussion about each operating system capable of
optimizing a smartphone is covered as well as hardware requirements and limitations.
[2] A. Gahran, “One-third of US youth have smartphones,” CNN, December 17, 2010. Available:
http://articles.cnn.com/2010-12-17/tech/youth.cellphones.gahran_1_prepaid-phone-plans-teen-texting-mobile-users?_s=PM:TECH
CNN reports on the rise in popularity of the smartphone and the correlation to the young adults in
the United States. Other countries are also compared with the age that teenagers start using a
smartphone and what their main uses are during their daily use of the embedded device. Mainly this
report covers useful statistics covering smartphones and their users.
[3] “Smartphones generate 65 per cent of all mobile traffic worldwide,” Mobile Communications
International Magazine, Informatm, no. 168, p. 11, December 2010.
A brief magazine article discussing cell phone networks and the data traffic caused by smartphones.
These statistics were useful in this report for showing readers the rise in popularity of the
smartphone and the convenience of operating it as a daily device. As smartphones continue to
increase in popularity, data traffic directly reflects this with a continuous increase in network traffic.
[4] “ARM Architecture Reference Manual: Performance Monitors v2 Supplement,” ARM, 2009.
This reference manual gives an introduction to the ARM architecture along with its accessible
functions and performance evaluations of newer and older architectures. We used this to discuss
smartphone architecture and give the reader an idea of what is being used today.
[5] S. Swanson and M. B. Taylor, “GreenDroid: exploring the next evolution in smartphone application
processors,” IEEE Communications Magazine, pp. 112-119, April 2011.
GreenDroid is a project at the University of California that focuses on innovating and expanding on
microprocessor technology. Since the authors believe this is where the future is headed, they try to
improve on the architecture used today so that it remains useful in a dual core and even a smaller
environment as hardware sizes shrink. They try to address the silicon constraints of processors to
produce an economical and performance-oriented processor using dark silicon, which they dub
conservation cores.
[6] “Mobile Extreme Convergence: A streamlined architecture to deliver mass-market converged mobile
devices,” Freescale Semiconductor Inc., rev. 5, 2009.
Freescale is a semiconductor company who did research on the architectures needed to deliver a
product which would appeal to the market today. They analyzed different embedded products
including a mobile phone and focused that on the growth of the industry, the demand of
performance and the limitations present in today’s architectures to produce their own design. They
have extended the ARM architecture with their own to design an improved architecture called
Mobile Extreme Convergence (MXC) which is said to reduce power consumption, improve memory
access times, and reach CPU speeds around that of 500MHz.
[7] D. Heger, “DHTDroid v3.2 Benchmark – Quantifying Android OS & HW Performance,”
DHTechnologies, April 5, 2011.
DHTDroid is a benchmarking utility toolset created to see how the Android operating system
interacts with the ARM architecture. The goal of this project was to implement a set of Android
based system benchmarks that generate an operating system and hardware abstraction vector that
can be compared across multiple smartphones with different hardware. The DHTDroid tool-set
consists of 12 individual macro-benchmarks that stress test the CPU, the TLB, the cache, the
memory, the I/O, and the network capabilities.
[8] Linpack for Android, GreeneComputing. Available: http://www.greenecomputing.com/apps/linpack/.
Linpack benchmarks have been used since the late 1970s and early 1980s. Linpack was designed for
performance tests on supercomputers, but since then it has become a scale used to grade computer
system performance and a standard test for the TOP500 list, which details the
world’s most powerful computer systems. The Linpack benchmark is a measure of a system’s
floating point computing power, and recently it has been ported to the Android OS, where it can
be used to evaluate the performance of each update to the operating system.
[9] S. Weintraub, “Android 2.2 tests reveal stunning speed gains,” CNN Money: Fortune, May 12, 2010.
Available: http://tech.fortune.cnn.com/2010/05/12/android-2-2-demonstrating-incredible-speed-gains/.
This CNN report discusses how smartphone performance increased with an update to the Android
operating system, Android 2.2. Linpack was used to evaluate the new version of Android, showing a
450% increase in performance.
Appendix A
NBench Results
HTC G1
BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)
TEST : Iterations/sec. : Old Index : New Index

: : Pentium 90* : AMD K6/233*
-------------------:------------------:-------------:------------
NUMERIC SORT : 87.416 : 2.24 : 0.74
STRING SORT : 3.8943 : 1.74 : 0.27
BITFIELD : 1.1819e+08 : 4.04 : 0.84
FP EMULATION : 2.3572e+07 : 4.39 : 1.01
FOURIER : 9.1485 : 0.17 : 0.10
ASSIGNMENT : 149.09 : 4.01 : 1.04
IDEA : 1.0527 : 4.74 : 1.41
HUFFMAN : 143.03 : 3.97 : 1.27
NEURAL NET : 0.19878 : 0.32 : 0.13
LU DECOMPOSITION : 5.831 : 0.30 : 0.22
==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX : 3.397
FLOATING-POINT INDEX: 0.254
Baseline (MSDOS*) : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================

CPU : ARMv7 Processor rev 3 (v7l)
L2 Cache :0
OS : Linux version 2.6.27-00393-g6607056 (san@sandroid.corp.google.com) (gcc
version 4.2.1) #1 PREEMPT Mon May 11 10:38:09 PDT 2009
C compiler : arm-eabi-gcc (GCC) 4.4.0
libc : Android Bionic libc
MEMORY INDEX : 0.618
Baseline (LINUX) : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.
HTC Hero

: : Pentium 90* : AMD K6/233*
-------------------:------------------:-------------:------------
NUMERIC SORT : 142.45 : 3.65 : 1.74
STRING SORT : 4.8218 : 2.15 : 0.33
BITFIELD : 3.8314e+07 : 6.57 : 1.37
FP EMULATION : 14.459 : 6.94 : 1.60
FOURIER : 2010.6 : 2.29 : 1.28
ASSIGNMENT : 2.3811 : 9.06 : 2.35
IDEA : 491.44 : 7.52 : 2.23
HUFFMAN : 230.07 : 6.38 : 2.04
NEURAL NET : 0.36901 : 0.59 : 0.25

==============================LINUX DATA BELOW===============================

L2 Cache :0
OS : Linux version 2.6.32.17-g30929af (htc-kernel@and18-2) (gcc version 4.4.0
(GCC) ) #1 PREEMPT Wed Dec 1 15:10:40 CST 2010
Motorola Droid

: : Pentium 90* : AMD K6/233*
-------------------:------------------:-------------:------------
NUMERIC SORT : 316.49 : 8.12 : 2.67
STRING SORT : 28.848 : 12.89 : 2.00
BITFIELD : 1.1819e+08 : 20.27 : 4.23
FP EMULATION : 40.657 : 19.51 : 4.50
FOURIER : 2133 : 2.43 : 1.36
ASSIGNMENT : 7.1511 : 27.21 : 7.06
IDEA : 1232 : 18.84 : 5.59
HUFFMAN : 621.69 : 17.24 : 5.51
NEURAL NET : 0.80774 : 1.30 : 0.55

==============================LINUX DATA BELOW===============================

L2 Cache :0
OS : Linux version 2.6.32.9_rMoD_250-1100_ (corcor67@corcor67-desktop) (gcc
version 4.4.3 (GCC) ) #8 PREEMPT Sun Mar 13 22:03:01 CDT 2011
HTC Nexus One


: : Pentium 90* : AMD K6/233*
-------------------:------------------:-------------:------------
NUMERIC SORT : 319.53 : 8.19 : 2.69
STRING SORT : 12.833 : 5.73 : 0.89
BITFIELD : 1.0573e+08 : 18.14 : 3.79
FP EMULATION : 37.708 : 18.09 : 4.18
FOURIER : 3075.1 : 3.50 : 1.96
ASSIGNMENT : 5.6452 : 21.48 : 5.57
IDEA : 1126.3 : 17.23 : 5.11
HUFFMAN : 495.69 : 13.75 : 4.39
NEURAL NET : 0.77346 : 1.24 : 0.52
==============================LINUX DATA BELOW===============================

L2 Cache :0
OS : Linux version 2.6.32.9-27240-gbca5320 (android-
build@apa26.mtv.corp.google.com) (gcc version 4.4.0 (GCC) ) #1 PREEMPT Tue Aug 10 16:42:38 PDT
2010
HTC EVO

: : Pentium 90* : AMD K6/233*
-------------------:------------------:-------------:------------
NUMERIC SORT : 326.07 : 8.36 : 2.75
STRING SORT : 20.765 : 9.28 : 1.44
BITFIELD : 1.0899e+08 : 18.70 : 3.91
FP EMULATION : 38.539 : 18.49 : 4.27
FOURIER : 3141.2 : 3.57 : 2.01
ASSIGNMENT : 5.7634 : 21.93 : 5.69
IDEA : 1150.3 : 17.59 : 5.22
HUFFMAN : 506.6 : 14.05 : 4.49
NEURAL NET : 0.78505 : 1.26 : 0.53
==============================LINUX DATA BELOW===============================

L2 Cache :0
OS : Linux version 2.6.37.4-cyanogenmod-01295-gdc22375 (shade@toxygene) (gcc
version 4.4.3 (GCC) ) #1 PREEMPT Wed Apr 6 22:14:12 EDT 2011
Samsung Nexus S

: : Pentium 90* : AMD K6/233*
-------------------:------------------:-------------:------------
NUMERIC SORT : 146.29 : 3.75 : 1.23
STRING SORT : 7.7558 : 3.47 : 0.54
BITFIELD : 5.4413e+07 : 9.33 : 1.95
FP EMULATION : 18.394 : 8.83 : 2.04
FOURIER : 1011.8 : 1.15 : 0.65
ASSIGNMENT : 3.7009 : 14.08 : 3.65
IDEA : 580.4 : 8.88 : 2.64
HUFFMAN : 282.19 : 7.83 : 2.50
NEURAL NET : 0.37025 : 0.59 : 0.25
==============================LINUX DATA BELOW===============================

L2 Cache :0
OS : Linux version 2.6.35.7-g7f1638a (android-build@apa28.mtv.corp.google.com)
(gcc version 4.4.3 (GCC) ) #1 PREEMPT Thu Dec 16 21:12:36 PST 2010
FLOATING-POINT INDEX:0.954
