
Project 3: Optimizing the Alpha Blending Code for Neon SIMD Processing

Rishi Hora

001080823

ECE785

Description of Base Case

CPU clock frequency: The clock frequency was set to 300 MHz by default, but because the board was powered over the USB cable, the actual frequency was approximately 297 MHz.

Performance: With these settings, the execution time of the unmodified code was:

beaglebone:/home/rishi/QC_SIMD_Project/base# ./a.out fore.rgba back.rgba out.rgba
Routine took 188066 microseconds

The ‘cpufreq-set -f 1000MHz’ command was used to set the frequency to 1000 MHz, and the governor was set to performance with the ‘cpufreq-set -g performance’ command.

When the frequency was set to 1000MHz the execution time was:

beaglebone:/home/rishi/QC_SIMD_Project/base# ./a.out fore.rgba back.rgba out.rgba
Routine took 59596 microseconds

At a nominal clock speed of 1000 MHz and this execution time, the unoptimized code ran for 59040566 cycles.
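The report does not state how the cycle counts were obtained. The quoted figures are consistent with multiplying the measured time in microseconds by 990.68 (the BogoMIPS value reported by /proc/cpuinfo, taken as the effective clock in MHz) and rounding up. A minimal sketch of that conversion, under this assumption; the function name is hypothetical:

```c
#include <assert.h>

/* Convert an elapsed time in microseconds to a cycle count.
 * clock_mhz is the effective clock rate in MHz, i.e. cycles per microsecond.
 * Assumption: fractional cycles are rounded up, which matches the report's figures. */
long cycles_from_time(long elapsed_us, double clock_mhz)
{
    assert(elapsed_us >= 0 && clock_mhz > 0.0);

    double exact = (double)elapsed_us * clock_mhz;
    long cycles = (long)exact;      /* truncate toward zero */
    if ((double)cycles < exact)
        cycles++;                   /* round up any fractional cycle */
    return cycles;
}
```

For example, cycles_from_time(59596, 990.68) reproduces the base-case figure of 59040566 cycles.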

The actual speed of the processor was found using the ‘cat /proc/cpuinfo’ command:

root@beaglebone:~# cat /proc/cpuinfo
processor       : 0
model name      : ARMv7 Processor rev 2 (v7l)
BogoMIPS        : 990.68
Features        : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x3
CPU part        : 0xc08
CPU revision    : 2
Hardware        : Generic AM33XX (Flattened Device Tree)
Revision        : 0000
Serial          : 0000000000000000

Optimizations

-ftree-vectorize (Optimization #1)

This flag enables auto-vectorization: the compiler vectorizes every loop that it can prove is safe to vectorize. In this case the memory accesses are predictable and the loop body has no conditional branches. The compiler recognizes this and auto-vectorizes wherever possible.
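As an illustration of the kind of loop that suits auto-vectorization, here is a minimal sketch with a fixed stride and a branch-free body. It is not taken from the project; the function name and parameters are hypothetical, and whether a given GCC version actually vectorizes it depends on the flags in use (for example -O3 -ftree-vectorize -mfpu=neon).

```c
#include <assert.h>
#include <stddef.h>

/* Apply gain and offset to every element.
 * The accesses are contiguous and the body contains no branches,
 * so a vectorizing compiler can handle several elements per iteration. */
void scale_add(const int *src, int *dst, size_t n, int gain, int offset)
{
    assert(src != NULL && dst != NULL);

    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * gain + offset;
}
```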

Performance Impact: There was a considerable speedup due to this optimization.

beaglebone:/home/rishi/QC_SIMD_Project# ./project1 fore.rgba back.rgba out.rgba
Routine took 10285 microseconds

Total Cycles: 10189144 cycles

Speedup: (59596 - 10285)/ 59596 * 100 = 82.74%

Analysis: The inner loop was taking too much time to execute; using the vectorization functionality of the NEON unit gave this speedup.

Use of Intrinsic Instructions & arm_neon.h (Optimization #2)

Intrinsics are used where we need to access the NEON unit directly from C. The header file arm_neon.h was also included in the code; it provides the intrinsic data types.

Performance Impact: There was negligible impact on the performance.

Analysis: Since the vectorization was already taking place automatically, it did not add to the performance of the code and hence this optimization was abandoned.
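The intrinsic version was abandoned and is not part of the final listing, so the following is only a sketch of what blending with intrinsics could look like. It assumes the channel values have already been separated into byte arrays; the function name blend8_channel and that data layout are hypothetical. A scalar fallback is included so the code also compiles on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Blend 8 values of one 8-bit channel:
 *   out = (fg * alpha + bg * (255 - alpha)) >> 8
 * The 16-bit intermediate cannot overflow: the sum is at most 255 * 255. */
void blend8_channel(const uint8_t *fg, const uint8_t *bg,
                    const uint8_t *alpha, uint8_t *out)
{
    assert(fg != NULL && bg != NULL && alpha != NULL && out != NULL);

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8_t f  = vld1_u8(fg);                  /* load 8 foreground bytes */
    uint8x8_t b  = vld1_u8(bg);                  /* load 8 background bytes */
    uint8x8_t a  = vld1_u8(alpha);               /* load 8 alpha bytes */
    uint8x8_t ia = vsub_u8(vdup_n_u8(255), a);   /* 255 - alpha */

    uint16x8_t acc = vmull_u8(f, a);             /* fg * alpha, widened to 16 bits */
    acc = vmlal_u8(acc, b, ia);                  /* += bg * (255 - alpha) */
    vst1_u8(out, vshrn_n_u16(acc, 8));           /* >> 8, narrow back to 8 bits */
#else
    /* Scalar fallback with the same arithmetic. */
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)((fg[i] * alpha[i] + bg[i] * (255 - alpha[i])) >> 8);
#endif
}
```

Each call handles eight channel values at once; a full implementation would also need to de-interleave the packed 32-bit pixels, which is omitted here.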

Run Fast Mode (Optimization #3)

In this mode, floating-point instructions can be executed directly on the NEON unit. The mode is enabled with inline assembly (ASM), which is executed on the processor as is.

Performance Impact: The effect on performance was small, but the impact can be seen in the assembly code, where the instructions were inserted directly. This was interesting to observe.

Analysis: Since the vectorization was already taking place automatically, it did not add to the performance of the code and hence this optimization was abandoned.
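The enable_runfast() routine in alpha_time.c reads the FPSCR register, masks it with 0x04086060, and ORs in 0x03000000. The arithmetic can be sketched in portable C as follows. The function and macro names are hypothetical; the bit names follow the ARM FPSCR layout, where bit 24 is flush-to-zero (FZ) and bit 25 is default NaN (DN). Both bits affect floating-point behavior only.

```c
#include <assert.h>
#include <stdint.h>

#define FPSCR_KEEP_MASK 0x04086060u  /* bits kept by the "and" in enable_runfast() */
#define FPSCR_FZ        (1u << 24)   /* flush-to-zero */
#define FPSCR_DN        (1u << 25)   /* default NaN */

/* Compute the FPSCR value that enable_runfast() would write back,
 * given the value it read. Only the arithmetic is modeled here;
 * no hardware register is touched. */
uint32_t runfast_fpscr(uint32_t old_fpscr)
{
    return (old_fpscr & FPSCR_KEEP_MASK) | FPSCR_FZ | FPSCR_DN;
}
```

FPSCR_FZ | FPSCR_DN equals 0x03000000, the constant y in the listing.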

Restrict keyword (Optimization #4)

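With this keyword we promise the compiler that the memory of the output array is not accessed through any of the other array pointers. The compiler can then schedule and vectorize the loads and stores without having to account for possible overlap.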
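A minimal sketch of the keyword in use, assuming a C99 compiler. The function is hypothetical and much simpler than the blending routine; it only shows where the qualifier goes.

```c
#include <assert.h>
#include <stddef.h>

/* 'restrict' on out tells the compiler that, within this function, the
 * elements written through out are not also reached through a or b. */
void add_arrays(const int *a, const int *b, int *restrict out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);

    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

The keyword is a promise made by the programmer, not a check: calling such a function with overlapping input and output arrays is undefined behavior.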

Performance Impact: There was a little speedup due to this optimization.

beaglebone:/home/rishi/QC_SIMD_Project# ./project1 fore.rgba back.rgba out.rgba
Routine took 9895 microseconds

Total Cycles: 9802779 cycles

Speedup: (10285 - 9895)/ 10285 * 100 = 3.8%

Reducing Redundant Multiplications and Divisions (Optimization #5)

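The code contained several redundant operations, most of which the compiler recognized and removed, such as multiplications and divisions by constants. One pair remained: when the dstImage value was regenerated, channel values that had just been divided by 256 (shifted right by 8 bits) were shifted left again into their byte positions, i.e. multiplied by a power of two. These opposing shifts were combined, as the alpha_time.c listing shows: the red sum is shifted left by 8 bits only, and the green sum is only masked.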
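The equivalence behind this optimization can be checked directly. The sketch below assumes the pre-optimization code divided each channel sum by 256 and then shifted the result into its byte position; that form is not shown in the report, so the "divide_first" functions are an assumption. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed original form for red: divide by 256, then move to bits 16..23. */
uint32_t pack_red_divide_first(uint32_t sum)
{
    return 0x00ff0000u & ((sum >> 8) << 16);
}

/* Optimized form from the listing: one left shift by 8, then mask. */
uint32_t pack_red_single_shift(uint32_t sum)
{
    return 0x00ff0000u & (sum << 8);
}

/* Assumed original form for green: divide by 256, then move to bits 8..15. */
uint32_t pack_green_divide_first(uint32_t sum)
{
    return 0x0000ff00u & ((sum >> 8) << 8);
}

/* Optimized form from the listing: the shifts cancel, only the mask remains. */
uint32_t pack_green_mask_only(uint32_t sum)
{
    return 0x0000ff00u & sum;
}
```

The channel sums lie between 0 and 255 * 255 = 65025, and the two forms agree for every value in that range because the mask discards exactly the bits that the right shift would have dropped.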

Performance Impact: There was some speedup due to this optimization.

beaglebone:/home/rishi/QC_SIMD_Project# ./project1 fore.rgba back.rgba out.rgba
Routine took 9692 microseconds

Total Cycles: 9601671 cycles

Speedup: (9895 - 9692)/ 9895 * 100 = 2.1%

Analysis: Some redundant multiplications were removed, but since they were very few in number the speedup was small.

Summary

Overall performance improvement:

Initial Cycles: 59040566 cycles

Final Cycles: 9601671 cycles

Speedup: (59596 - 9692)/59596 * 100 = 83.74%
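The "speedup" percentages in this report are percentage reductions in execution time. A small sketch of that calculation, together with the equivalent speedup ratio, follows; the function names are hypothetical.

```c
#include <assert.h>

/* Percentage reduction in execution time, as used for the report's
 * "Speedup" figures: (before - after) / before * 100. */
double percent_time_reduction(double before_us, double after_us)
{
    assert(before_us > 0.0);
    return (before_us - after_us) / before_us * 100.0;
}

/* Conventional speedup ratio: how many times faster the new code is. */
double speedup_factor(double before_us, double after_us)
{
    assert(after_us > 0.0);
    return before_us / after_us;
}
```

With the measured times of 59596 and 9692 microseconds, the reduction is about 83.74%, which corresponds to the final code running roughly 6.15 times faster than the base case.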

Which single optimization gave the largest improvement?

The optimization that gave the largest improvement was the inclusion of the vectorization flags. It allowed the loop to be vectorized, so the NEON unit could process 4 words simultaneously.

Using the intrinsic instructions wasn’t that useful, as the code was already being vectorized automatically. Hence that optimization was abandoned.

The code was finally sped up by 83.74%.

Makefile

PROJ_NAME = project1
CC = gcc
VECTFLAGS = -ftree-vectorize -ffast-math -fsingle-precision-constant -ftree-vectorizer-verbose=2 -mvectorize-with-neon-quad
CFLAGS = -Wall -O3 -march=armv7-a -mtune=cortex-a8 -mfloat-abi=softfp -mfpu=neon $(VECTFLAGS) -funroll-loops
LIBS = -lm -lrt
OBJFILES := $(patsubst %.c,%.o,$(wildcard *.c))

$(PROJ_NAME): $(OBJFILES)
# echo $(OBJFILES)
	$(CC) -o $(PROJ_NAME) $(OBJFILES) $(LIBS)

%.o: %.c
	$(CC) $(CFLAGS) -c -o $@ $<

%.lst: %.c
	$(CC) $(CFLAGS) -Wa,-adhln $(LIBS) $< > $@

clean:
	rm -f *.o *.lst

alpha_time.c

void alphaBlend_c(int *fgImage, int *bgImage, int *dstImage);

#include <stdio.h>
#include <sys/time.h>
#include <arm_neon.h>

int backImage[512 * 512];
int foreImage[512 * 512];
int newImage[512 * 512];

void enable_runfast()
{
    static const unsigned int x = 0x04086060;
    static const unsigned int y = 0x03000000;
    int r;
    asm volatile (
        "fmrx   %0, fpscr       \n\t"   //r0 = FPSCR
        "and    %0, %0, %1      \n\t"   //r0 = r0 & 0x04086060
        "orr    %0, %0, %2      \n\t"   //r0 = r0 | 0x03000000
        "fmxr   fpscr, %0       \n\t"   //FPSCR = r0
        : "=r"(r)
        : "r"(x), "r"(y)
    );
}

int main(int argc, char **argv)
{
    FILE *fgFile, *bgFile, *outFile;
    int result;
    struct timeval oldTv, newTv;

    //enable_runfast();

    if(argc != 4){
        fprintf(stderr, "Usage:%s foreground background outFile\n", argv[0]);
        return 1;
    }

    fgFile = fopen(argv[1], "rb");
    bgFile = fopen(argv[2], "rb");
    outFile = fopen(argv[3], "wb");

    if(fgFile && bgFile && outFile){
        /* Each image is 512 rows of 512 32-bit pixels. */
        result = fread(backImage, 512*sizeof(int), 512, bgFile);
        if(result != 512){
            fprintf(stderr, "Error with backImage\n");
            return 3;
        }
        result = fread(foreImage, 512*sizeof(int), 512, fgFile);
        if(result != 512){
            fprintf(stderr, "Error with foreImage\n");
            return 4;
        }

        /* Time only the blending routine, not the file I/O. */
        gettimeofday(&oldTv, NULL);
        alphaBlend_c(&foreImage[0], &backImage[0], &newImage[0]);
        gettimeofday(&newTv, NULL);

        /* tv_sec is included so the result stays valid if the run crosses a second boundary. */
        fprintf(stdout, "Routine took %d microseconds\n",
                (int)((newTv.tv_sec - oldTv.tv_sec) * 1000000 +
                      (newTv.tv_usec - oldTv.tv_usec)));

        fwrite(newImage, 512*sizeof(int), 512, outFile);
        fclose(fgFile);
        fclose(bgFile);
        fclose(outFile);
        return 0;
    }

    fprintf(stderr, "Problem opening a file\n");
    return 2;
}

#define A(x) (((x) & 0xff000000) >> 24)
#define R(x) (((x) & 0x00ff0000) >> 16)
#define G(x) (((x) & 0x0000ff00) >> 8)
#define B(x) ((x) & 0x000000ff)

void alphaBlend_c(int *fgImage, int *bgImage, int* restrict dstImage)
{
    int x, pos, y;
    for(y = 0; y < 512; y++){
        for(x = 0; x < 512; x++){
            /*for(xx = 0; xx< 4; xx++)
            {
                pos[xx]= y*512
            }*/
            //pos = (y*512)+x;
            pos = (y*512)+x;
            int a_fg = A(fgImage[pos]);
            int dst_r = ((R(fgImage[pos]) * a_fg) + (R(bgImage[pos]) * (255-a_fg)));
            int dst_g = ((G(fgImage[pos]) * a_fg) + (G(bgImage[pos]) * (255-a_fg)));
            int dst_b = ((B(fgImage[pos]) * a_fg) + (B(bgImage[pos]) * (255-a_fg)))>>8;
            dstImage[pos] = 0xff000000 |
                            (0x00ff0000 & (dst_r << 8)) |
                            (0x0000ff00 & (dst_g)) |
                            (0x000000ff & (dst_b));
        }
    }
}