
LibreOffice Calc

Spreadsheets on the GPU
Michael Meeks <michael.meeks@collabora.com>
mmeeks, #libreoffice-dev, irc.freenode.net
Stand at the crossroads and look; ask for the
ancient paths, ask where the good way is,
and walk in it, and you will find rest for your
souls... - Jeremiah 6:16

Overview

LibreOffice?

A bit about:

GPUs

Spreadsheets

Internal refactoring

OpenCL optimisation

New calc features

XML / load performance

Calc / GPU questions?

Questions?

LibreOffice Project & Software


Open Source / Free Software
One million new unique IPs per week (that we can track)
Double the weekly growth of one year ago.
Tens of millions of users, and growing fast.
Hundred+ contributing coders each month
2500+ commits last month
Around a thousand developers (including QA, translators, UX etc.)
http://www.libreoffice.org/

[Chart: cumulative unique IPs for updates vs. time, not counting any Linux / vendor versions; vertical axis from 0 to 60,000,000.]

Advisory Board Members
This slide's layout is a victim of our success here ...

4 / 41

Event Name | Your Name

Why use the GPU?

APUs: GPU faster than CPU

Tons of unused Compute Units across your APU

Double precision is unreasonably slower

And precision is non-negotiable for spreadsheets: IEEE 754 required.

Better power usage per flop.

Numbers based on a Kaveri 7850K APU and a top-end discrete graphics card.

[Chart: fp64 and fp32 flops for the series CPU flops, GPU flops and FirePro 7990, on an axis running from 1 to 10000.]

Flops: note the log scale ...

Developers behind the calc rework:
Kohei Yoshida:
MDDS maintainer
Heroic calc core re-factorer
Code Ninja etc.
Markus Mohrhard
Calc maintainer,
Chart2 wrestler
Unit tester par excellence
etc.

Jagan Lokanatha
Kismat Singh

Matus Kukan
Data Streamer,
G-builder,
Size optimizer ..

A large OpenCL team, particularly I-Jui (Ray) Sung

Spreadsheet Geometry

An early spreadsheet, c. 3000 BC
Aspect ratio: 8:1
Contents: Victory against every land who giveth all life forever

Excel 2003: 64k x 256, aspect 256:1
Excel 2010: 10^6 x 16k, aspect 64:1

50% of spreadsheets are used to make business decisions.
Columnar data structures

The 'Broom Handle' aspect ratio.

Spreadsheet Core Data Storage

The joy of Object Orientation

ScDocument > ScTable > ScColumn > ScBaseCell

Per cell (ScBaseCell):
Broadcaster (8 bytes)
Text width (2 bytes)
Cell type (1 byte)
Script type (1 byte)

Cell classes: ScValueCell, ScFormulaCell, ScStringCell, ScEditCell, ScNoteCell*


Abstraction of Cell Value Access


ScBaseCell Usage (Before)

ScDocument

Undo / Redo

RTF Filter

Change Tracking

Quattro Pro Filter

Content Rendering

HTML Filter

Excel Filter (xls, xlsx)

External Reference

Document Iterators

CSV Filter

DIF Filter

UNO API Layer

Conditional Formatting

SYLK Filter

VBA API Layer

Chart Data Provider

DBF Filter

ODF Filter

Cell Validation

CppUnit Test

Abstraction of Cell Value Access


ScBaseCell Usage (After)

ScDocument

Document Iterators

Biggest calc core re-factor in a decade+
Dis-infecting the horrible, long-term, inherited structural problems of Calc.
Lots of new unit tests being created for the first time for the calc core.
Moved to using new 'MDDS' data structures.
2x weeks with no compile ...

Before (ScBaseCell)

ScDocument > ScTable > ScColumn > ScBaseCell
Per cell: Broadcaster (8 bytes), Text width (2 bytes), Cell type (1 byte), Script type (1 byte)
Cell classes: ScValueCell, ScFormulaCell, ScStringCell, ScEditCell, ScNoteCell*

Scattered pointer chasing walking cells down a column ...

After (mdds::multi_type_vector)

ScDocument > ScTable > ScColumn

Cell values: svl::SharedString block, double block, EditTextObject block, ScFormulaCell block
Stored separately per column: Broadcasters, Cell notes, Text widths, Script types

Iterating over cells (old way)
Loop down a column; the inner loop:

    double nSum = 0.0;
    ScBaseCell* pCell = pCol->maItems[nColRow].pCell;
    ++nColRow;
    switch (pCell->GetCellType())
    {
        case CELLTYPE_VALUE:
            nSum += ((ScValueCell*)pCell)->GetValue();
            break;
        case CELLTYPE_FORMULA:
            // something worse ...
        case CELLTYPE_STRING:
        case CELLTYPE_EDIT:
        case CELLTYPE_NOTE:
            break;
    }

Iterating over cells (new way)

    double nSum = 0.0;
    for (size_t i = 0; i < nChunkLength; i++)
        nSum += pDoubleChunk[i];
ONO. from a vectoriser ...
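
To make the contrast with the old per-cell switch concrete, here is a minimal sketch of summing a column stored as typed blocks. The `Block` and `BlockType` types are hypothetical stand-ins invented for this illustration; they are not the mdds::multi_type_vector interface, and the real Calc code may be organised quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one typed block of a column (not the mdds API).
enum class BlockType { Double, String };

struct Block
{
    BlockType type;
    std::vector<double> doubles;      // used when type == BlockType::Double
    std::vector<std::string> strings; // used when type == BlockType::String
};

// Sum every numeric cell in a column stored as a sequence of typed blocks.
double sumColumn(const std::vector<Block>& column)
{
    double nSum = 0.0;
    for (const Block& block : column)
    {
        // One type check per block instead of one per cell.
        if (block.type != BlockType::Double)
            continue;
        const double* pDoubleChunk = block.doubles.data();
        const std::size_t nChunkLength = block.doubles.size();
        // Tight loop over contiguous doubles: no pointer chasing, no switch.
        for (std::size_t i = 0; i < nChunkLength; ++i)
            nSum += pDoubleChunk[i];
    }
    return nSum;
}
```

The inner loop is the same shape as the one on the slide: a plain walk over contiguous doubles, which is the form a compiler vectoriser or a GPU kernel can work with.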


Shared Formula

Before

[Diagram: a column of ScFormulaCells, each owning its own ScTokenArray (Tokens, RPN).]

After

[Diagram: a column of ScFormulaCells all referring to one ScFormulaCellGroup, which holds a single ScTokenArray (Tokens, RPN).]
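
The before / after pictures can be sketched in a few lines of C++. Everything here (`TokenArray`, `FormulaCell`, `makeSharedGroup`) is a simplified, hypothetical stand-in for illustration only; the real ScFormulaCell, ScFormulaCellGroup and ScTokenArray classes carry far more state.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a compiled formula.
struct TokenArray
{
    std::vector<std::string> tokens; // parsed formula tokens
    std::vector<std::string> rpn;    // the same formula in RPN order
};

// Hypothetical formula cell: it only points at the shared code.
struct FormulaCell
{
    std::shared_ptr<const TokenArray> code; // one copy for the whole group
    std::size_t row;                        // position, for relative references
};

// Fill rows [firstRow, firstRow + count) with cells sharing one token array.
std::vector<FormulaCell> makeSharedGroup(const TokenArray& formula,
                                         std::size_t firstRow, std::size_t count)
{
    std::shared_ptr<const TokenArray> shared = std::make_shared<TokenArray>(formula);
    std::vector<FormulaCell> cells;
    cells.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        cells.push_back(FormulaCell{shared, firstRow + i});
    return cells;
}
```

With this layout a thousand identical formulae cost one token array plus a thousand small cells, rather than a thousand token arrays.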

Memory usage

[Chart: heap memory size (MB) for three cases (shared formula on, shared formula off, empty document); bar values 372, 259 and 27 on an axis from 0 to 400.]

Test document used:
http://kohei.us/wp-content/uploads/2013/08/shared-formula-memory-test.ods

Shared string rework

String comparisons were slow

Also not tractable for a GPU
Case-insensitive equality is a hard problem: ICU & heavy lifting.

String comparisons happen a lot in functions, and Pivot Tables.

Shared string storage is useful.

So fix it ...

Concept

[Diagram: svl::SharedStringPool holds an original string pool and an upcased string pool; each svl::SharedString refers to entries in the pools.]
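
A minimal sketch of the two-pool idea, assuming each shared string keeps one pointer into the original pool and one into the upcased pool. The class and function names below are hypothetical and do not reproduce the real svl::SharedStringPool interface; the upcasing is ASCII-only, whereas the slides point out that real case folding needs ICU.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>

// Hypothetical shared string: two pointers into the pools below.
struct SharedString
{
    const std::string* original; // entry in the original string pool
    const std::string* upper;    // entry in the upcased string pool
};

class SharedStringPool
{
public:
    // Intern `text`; equal strings always yield the same pool entries.
    SharedString intern(const std::string& text)
    {
        std::string up = text;
        for (char& c : up) // ASCII-only upcasing, an assumption of this sketch
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        // std::set never relocates its elements, so the addresses stay valid.
        const std::string* pOrig = &*maOriginal.insert(text).first;
        const std::string* pUp = &*maUpper.insert(up).first;
        return SharedString{pOrig, pUp};
    }

private:
    std::set<std::string> maOriginal;
    std::set<std::string> maUpper;
};

// Both comparisons are a single pointer compare (same pool assumed).
inline bool equalsCaseSensitive(const SharedString& a, const SharedString& b)
{
    return a.original == b.original;
}

inline bool equalsIgnoreCase(const SharedString& a, const SharedString& b)
{
    return a.upper == b.upper;
}
```

The expensive work happens once, when the string enters the pool; afterwards a comparison reduces to comparing two fixed-size values.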

String comparison (old way)

String comparison (new way)

OpenCL / calculation ...

Why OpenCL & HSA ...

GPU and CPU optimisation

Why write custom SSE2 / SSE3 etc. assembly, detect arch, and select a backend across platforms?
Instead get OpenCL (from APU vendor) to generate the best code ...

Heterogeneous System Architecture rocks:

An AMD64-like innovation:
shared Virtual Memory Address space & pointers: GPU <-> CPU.

Avoid wasteful copies, fast dispatch

Great OpenCL 2.0 support.

Use the right Compute Unit for the job.

Auto-compile Formula OpenCL

Formulae compiled idly / on entry in a thread to hide latency.
Kernel generation thanks to:

    #pragma OPENCL EXTENSION cl_khr_fp64: enable
    int isNan(double a) { return isnan(a); }
    double legalize(double a, double b) { return isNan(a)?b:a;}
    double tmp0_0_fsum(__global double *tmp0_0_0)
    {
        double tmp = 0;
        {
            int i;
            i = 0;
            tmp = legalize(((tmp0_0_0[i])+(tmp)), tmp);
            i = 1;
            tmp = legalize(((tmp0_0_0[i])+(tmp)), tmp);
            i = 2;
            tmp = legalize(((tmp0_0_0[i])+(tmp)), tmp);
        } // to scope the int i declaration
        return tmp;
    }
    double tmp0_nop(__global double *tmp0_0_0)
    {
        double tmp = 0;
        int gid0 = get_global_id(0);
        tmp = tmp0_0_fsum(tmp0_0_0);
        return tmp;
    }
    __kernel void DynamicKernel_nop_fsum(__global double *result, __global double *tmp0_0_0)
    {
        int gid0 = get_global_id(0);
        result[gid0] = tmp0_nop(tmp0_0_0);
    }

The same formula for a longer sum
Compiled from standard formula syntax

    __kernel void
    tmp0_0_0_reduction(__global double* A,
                       __global double *result,
                       int arrayLength, int windowSize)
    {
        double tmp, current_result =0;
        int writePos = get_group_id(1);
        int lidx = get_local_id(0);
        __local double shm_buf[256];
        int offset = 0;
        int end = windowSize;
        end = min(end, arrayLength);
        barrier(CLK_LOCAL_MEM_FENCE);
        int loop = arrayLength/512 + 1;
        for (int l=0; l<loop; l++) {
            tmp = 0;
            int loopOffset = l*512;
            if((loopOffset + lidx + offset + 256) < end) {
                tmp = legalize(((A[loopOffset + lidx + offset])+(tmp)), tmp);
                tmp = legalize(((A[loopOffset + lidx + offset + 256])+(tmp)), tmp);
            } else if ((loopOffset + lidx + offset) < end)
                tmp = legalize(((A[loopOffset + lidx + offset])+(tmp)), tmp);
            shm_buf[lidx] = tmp;
            barrier(CLK_LOCAL_MEM_FENCE);
            for (int i = 128; i >0; i/=2) {
                if (lidx < i)
                    shm_buf[lidx] = ((shm_buf[lidx])+(shm_buf[lidx + i]));
                barrier(CLK_LOCAL_MEM_FENCE);
            }
            if (lidx == 0)
                current_result =((current_result)+(shm_buf[0]));
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lidx == 0)
            result[writePos] = current_result;
    }

    double tmp0_0_fsum(__global double *tmp0_0_0) {
        double tmp = 0;
        int gid0 = get_global_id(0);
        tmp = ((tmp0_0_0[gid0])+(tmp));
        return tmp;
    }
    double tmp0_nop(__global double *tmp0_0_0) {
        double tmp = 0;
        int gid0 = get_global_id(0);
        tmp = tmp0_0_fsum(tmp0_0_0);
        return tmp;
    }
    __kernel void
    DynamicKernel_nop_fsum(__global double *result,
                           __global double *tmp0_0_0)
    {
        int gid0 = get_global_id(0);
        result[gid0] = tmp0_nop(tmp0_0_0);
    }
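
For readers without an OpenCL device to hand, the reduction kernel above can be modelled serially on the CPU. The sketch below is an illustration written for this purpose, not generated LibreOffice code: `reduceLikeWorkGroup` is a hypothetical name, it models a single work-group of 256 work-items, and it takes the window to be the whole array.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mirrors the kernel's legalize(): fall back to b when a is NaN.
inline double legalize(double a, double b) { return std::isnan(a) ? b : a; }

// Serial CPU model of one work-group of 256 work-items summing `data`.
double reduceLikeWorkGroup(const std::vector<double>& data)
{
    const std::size_t kLanes = 256;
    const std::size_t end = data.size();
    std::vector<double> shm_buf(kLanes); // plays the role of __local memory
    double current_result = 0.0;
    const std::size_t loops = end / (2 * kLanes) + 1;
    for (std::size_t l = 0; l < loops; ++l)
    {
        const std::size_t loopOffset = l * 2 * kLanes;
        // Step 1: each lane loads up to two elements, 256 apart.
        for (std::size_t lidx = 0; lidx < kLanes; ++lidx)
        {
            double tmp = 0.0;
            if (loopOffset + lidx < end)
                tmp = legalize(data[loopOffset + lidx] + tmp, tmp);
            if (loopOffset + lidx + kLanes < end)
                tmp = legalize(data[loopOffset + lidx + kLanes] + tmp, tmp);
            shm_buf[lidx] = tmp;
        }
        // Step 2: halve the active lanes until lane 0 holds the chunk's sum.
        for (std::size_t i = kLanes / 2; i > 0; i /= 2)
            for (std::size_t lidx = 0; lidx < i; ++lidx)
                shm_buf[lidx] += shm_buf[lidx + i];
        // Step 3: lane 0 accumulates the partial sum of this 512-element chunk.
        current_result += shm_buf[0];
    }
    return current_result;
}
```

On a GPU the two inner loops over `lidx` run as parallel work-items separated by barriers; the serial order used here gives the same result because each halving pass only reads slots that the same pass does not write.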

Performance numbers for sample sheets.

30x to 500x faster for these samples vs. the legacy software calculation on Kaveri.

[Chart: calculation time, GPU / OpenCL vs. Software, for the sheets min_max_avg_r, destination-workbook, dates-worked, stock-history and ground-water; axis from 10 to 100,000; shorter is better.]

Yet another log plot: milliseconds on the X axis ...

In more detail ...

This is a spreadsheet

Highly spreadsheet-geometry dependent

What do you mean, what is the X factor?
Don't like your X factor? Add more rows, or complexity.

Representative sheets important: some based on real-world madness
Functions:

Research shows the vast majority of distinct formulae have very simple functions: SUM, AVERAGE, SUMIF, VLOOKUP, etc.

We optimise those

We don't do e.g. Text functions like UPPER

How that works in practice:

Enabling Custom Calculation

Turn on OpenCL computation: Tools->Options

Enabling OpenCL goodness

Auto-select the best OpenCL device via a micro-benchmark

Or disable that and explicitly select a device.
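
The auto-selection step amounts to "time a small workload on each candidate and keep the quickest". The sketch below shows only that idea; `Device`, `timeDevice`, `pickFastest` and `selectDevice` are hypothetical names, no OpenCL calls are made, and the benchmark Calc actually runs is not described in these slides.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical compute device plus a way to run a fixed workload on it.
struct Device
{
    std::string name;
    std::function<void()> runBenchmark; // e.g. a small, fixed formula workload
};

// Wall-clock time of one benchmark run, in milliseconds.
double timeDevice(const Device& device)
{
    const auto start = std::chrono::steady_clock::now();
    device.runBenchmark();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Index of the smallest timing; the list must not be empty.
std::size_t pickFastest(const std::vector<double>& timingsMs)
{
    assert(!timingsMs.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < timingsMs.size(); ++i)
        if (timingsMs[i] < timingsMs[best])
            best = i;
    return best;
}

// Benchmark every device once and return the index of the fastest.
std::size_t selectDevice(const std::vector<Device>& devices)
{
    std::vector<double> timingsMs;
    for (const Device& device : devices)
        timingsMs.push_back(timeDevice(device));
    return pickFastest(timingsMs);
}
```

A single timed run is noisy; a more careful version would probably repeat the workload and compare medians, which is left out here to keep the sketch short.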


Big data needs Document Load optimization

Parallelized Loading ...

Desktop CPU cores are often idle.

XML parsing:

The ideal application of parallelism

SAX parsers:
"Sucking icAche eXperience" parsers

read, parse a tiny piece of XML & emit an event
punch that deep into the core of the APP logic, and return ..

Parse another tiny piece of XML.

Better APIs and impl's needed: Tokenizing, Namespace handling etc.

Luckily easy to retrofit threading ...

Dozens of performance wins in XFastParser.
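
One way to picture the threading retrofit: instead of calling deep into the application for every event, the parser thread pushes events onto a queue and a second thread drains it. The sketch below is a generic producer / consumer illustration under that assumption; `Event`, `EventQueue` and `loadSheet` are hypothetical names and none of this is the XFastParser API.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical parser event: one tokenized cell; `last` marks end of stream.
struct Event
{
    std::string cellText;
    bool last;
};

// Minimal blocking FIFO shared by the parser and the consumer.
class EventQueue
{
public:
    void push(Event e)
    {
        {
            std::lock_guard<std::mutex> lock(maMutex);
            maEvents.push(std::move(e));
        }
        maReady.notify_one();
    }

    Event pop()
    {
        std::unique_lock<std::mutex> lock(maMutex);
        maReady.wait(lock, [this] { return !maEvents.empty(); });
        Event e = std::move(maEvents.front());
        maEvents.pop();
        return e;
    }

private:
    std::mutex maMutex;
    std::condition_variable maReady;
    std::queue<Event> maEvents;
};

// Parse on a second thread; populate the "sheet" on the calling thread.
std::vector<std::string> loadSheet(const std::vector<std::string>& rawCells)
{
    EventQueue queue;
    std::thread parser([&queue, &rawCells] {
        for (const std::string& raw : rawCells)
            queue.push(Event{raw, false}); // stands in for parse + tokenize
        queue.push(Event{std::string(), true});
    });
    std::vector<std::string> sheet;
    for (Event e = queue.pop(); !e.last; e = queue.pop())
        sheet.push_back(e.cellText);
    parser.join();
    return sheet;
}
```

Locking once per event would itself be costly at XML-token granularity, so a practical version would likely hand over events in batches; that refinement is omitted here.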

Utilising your 32 core CPU ...
(boxes are threads).

Split XML Parse & Sheet populate

Thread 1: Unzip, XML Parse, Tokenize
Thread 2: Populate Sheet Data Structures.

Parallelised Sheet Loading

A second pair of threads: Unzip, XML Parse, Tokenize / Populate Sheet Data Structures.

Progress bar thread

Parallel to GPU compilation etc.
=COVAR(A1:A300,B1:B300) -> OpenCL code -> Ready to execute kernels

Tools->Options->Advanced->Experimental Mode required for parallel loading
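
Loading several sheets at once can be sketched with one asynchronous task per sheet. The types and functions below (`Sheet`, `loadOneSheet`, `loadAllSheets`) are hypothetical placeholders: a "stream" is just a vector of numbers, and the real unzip, parse and populate stages are reduced to a copy and a checksum.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical result of loading one sheet.
struct Sheet
{
    std::vector<double> cells;
    double checksum;
};

// Stand-in for unzip + XML parse + tokenize + populate of a single sheet.
Sheet loadOneSheet(const std::vector<double>& stream)
{
    Sheet sheet;
    sheet.cells = stream;
    sheet.checksum = std::accumulate(stream.begin(), stream.end(), 0.0);
    return sheet;
}

// Load every sheet on its own task; results come back in sheet order.
std::vector<Sheet> loadAllSheets(const std::vector<std::vector<double>>& streams)
{
    std::vector<std::future<Sheet>> pending;
    for (const std::vector<double>& stream : streams)
        pending.push_back(
            std::async(std::launch::async, loadOneSheet, std::cref(stream)));
    std::vector<Sheet> sheets;
    for (std::future<Sheet>& f : pending)
        sheets.push_back(f.get()); // blocks until that sheet has finished
    return sheets;
}
```

Collecting the futures in order keeps the sheets in document order even though the tasks may finish in any order.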

Does it work? (with GPU enabled)
Wall-clock time to load set of large XLSX spreadsheets: 8 thread Intel machine

[Chart: log time / seconds (axis from 0.1 to 100; shorter is better) for the series Calc 4.1.3, Calc and Reference, loading num-formula-2-sheets-1m.xlsx, numbers-formula-8-sheets-100k.xlsx, numbers-formula-100k.xlsx, numbers-100k.xlsx, sumifs-testsheet.xlsx, stock-history.xlsm, matrix-inverse.xlsx, mandy.xlsm, mandy-no-macro.xlsx, groundwater-daily.xlsm and dates-worked.xlsx.]

Apologies for another log scale: Average 5X vs. 4.1.3

How does that pan out?

Problems^WOpportunities ...

Picking a good OpenCL driver

White / Black / Any-listing of known good / bad / mixed Hardware / Driver / OS

Which core to pick?

fp64 perf etc. Time vs. Power

Currently micro-benchmark time.

HSA rocks

CL_MEM_USE_HOST_PTR is a royal pain:

Alignment issues currently cause lots of copying in several cases.

OpenCL 2.0's Shared Virtual Memory is awesome

Compiler Performance:

Excel -> RPN -> C string -> IR -> GPU

SPIR sounds great if it can be stable.
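
The CL_MEM_USE_HOST_PTR alignment problem mentioned above comes down to a check like the following before handing a host buffer to the device in place. This is a sketch only: `isAligned` and `needsCopy` are hypothetical helpers, and the required alignment and size granule are parameters because the actual values depend on the device and driver, which these slides do not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when `p` sits on an `alignment`-byte boundary.
// `alignment` must be a non-zero power of two.
inline bool isAligned(const void* p, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Decide whether a host buffer could be used in place or must be copied
// into a suitably aligned staging buffer first. Both limits are
// assumptions supplied by the caller, not values taken from OpenCL.
inline bool needsCopy(const void* p, std::size_t sizeBytes,
                      std::size_t requiredAlignment, std::size_t sizeGranule)
{
    return !isAligned(p, requiredAlignment) || (sizeBytes % sizeGranule) != 0;
}
```

A cell range that starts part-way down a column will usually fail such a check, which is one plausible reason for the extra copies the slide complains about.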

Future OpenCL work ...

Volunteers / funders welcome

Kill per-cell dependency graphing

Badly needs to be per-column:

Shrink memory usage, improve load time

Detect independent column calculations

SPIR integration

Enabling parallel execution, wider CSE etc.

Avoid 'NaN' foo by adapting to data shape faster.

Calc as a flow process, 'construct your pipeline in a sheet'

Crazy awesome demos: Mobile vs. PC ...

ZIP LZ77 / OpenCL acceleration or similar

LibreOffice Conclusions

LibreOffice is innovating:

Going interesting places no-one has gone before:

OpenCL in a generic spreadsheet: a first

Why write 5x hand-coded assembler versions and select per platform?

Run your workload on the right Compute Unit to save time & battery.

Refactoring for OpenCL improves performance for all

Faster for CPU and GPU

PCMark 8.2 includes LibreOffice benchmarking.

LibreOffice loves new contributors & features

there is already a tool for that.

Talk to me about getting involved ...

Thanks for all of your help and support!
Oh, that my words were recorded, that they were written on a scroll, that they were
inscribed with an iron tool on lead, or engraved in rock for ever! I know that my Redeemer
lives, and that in the end he will stand upon the earth. And though this body has been
destroyed yet in my flesh I will see God, I myself will see him, with my own eyes - I and not
another. How my heart yearns within me. - Job 19: 23-27
