
Data-centric computing with Netezza Architecture

DISC reading group, September 24, 2007

High Level Points


Supercomputer use model today:
Compile, submit, wait
Does a poor job of taking advantage of human insight available in interactive models

Large datasets can be interactively processed using Netezza

What is Netezza?

Essentially: A big, fast SQL database

What is Netezza?

Frontend provides SQL interface
Backend is a large rack of specialized blades

Custom Backend Blades

Commodity CPU, NIC, disk
Custom FPGA replaces disk interface


Can do basic filtering in hardware, i.e., stream processing before data hits main memory
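
A minimal sketch of the kind of query this helps, assuming a hypothetical papers table: the restriction and projection can be applied as records stream off disk, so discarded rows and columns never reach the SPU's main memory.

    -- Hypothetical table and columns, for illustration only.
    -- The WHERE restriction and the two-column projection are the
    -- parts the FPGA could evaluate as data streams off disk.
    SELECT author, year
    FROM papers
    WHERE year >= 1998;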

Division of Data
Database distributed across multiple (100+) SPUs
Each SPU controls and manages its slice of the DB

No info on data management, replication, etc.
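
As a hedged illustration (not from the slides), Netezza's SQL dialect lets the table creator pick a distribution key, which determines how rows are hashed across the SPUs; table and column names here are hypothetical.

    -- Hypothetical table; DISTRIBUTE ON names the column whose
    -- hash decides which SPU stores each row.
    CREATE TABLE citations (
        citing_id INTEGER,
        cited_id  INTEGER
    )
    DISTRIBUTE ON (citing_id);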

Division of Labor
SPU FPGA handles basic filtering tasks
SPU CPU handles record-level processing: filtering, parsing, projecting, logging, etc.
SPU CPU handles most operations on intermediate results: sorts, joins, aggregates
Frontend CPU handles remaining operations (see the annotated query below)

>>> Processing close to disk
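
A sketch of how a single query might map onto that division of labor, with hypothetical table and column names; the per-stage assignments follow the bullets above, not a documented query plan.

    SELECT venue, COUNT(*) AS n   -- aggregation: SPU CPU
    FROM papers                   -- scan: records streamed off disk
    WHERE year >= 2000            -- basic filtering: SPU FPGA
    GROUP BY venue                -- grouping: SPU CPU
    ORDER BY n DESC;              -- final merge of per-SPU results: frontend CPU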

What can this be used for?


Paper gives 3 examples:
Citation graph processing
Search for particular structure in electrical netlist
Word meaning disambiguation through search of ontology

Citation graph example


Look through large, sparse graph (16 million nodes, 388 million edges)
Find both strong couplings (direct edge) and weak couplings (e.g., two papers cite the same work; SQL sketch below)
Essentially the same code for workstation and Netezza: no need to expose the parallel architecture
Workstation DNF (did not finish); 80-100x speedup on smaller tests
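
A minimal sketch of the weak-coupling query, assuming a hypothetical citations(citing_id, cited_id) table; the paper's actual SQL is not shown in the slides.

    -- Weak coupling: two distinct papers that cite the same work.
    SELECT a.citing_id AS paper1,
           b.citing_id AS paper2
    FROM citations a
    JOIN citations b
      ON a.cited_id = b.cited_id
    WHERE a.citing_id < b.citing_id;  -- report each pair once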

IC netlist example
Flattened netlist of 3.5 million transistors, 10 million wires
Search for AND structure (see the join sketch below)
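
As a hedged sketch of structural matching via joins (not the paper's query), assuming a hypothetical transistors(id, type, gate, source, drain) table: two series NMOS devices, the kind of sub-pattern an AND/NAND search would start from.

    -- Two NMOS transistors in series: the source of one tied to
    -- the drain of the other. A real search would add fanout
    -- constraints to tame the combinatorial explosion.
    SELECT t1.id AS top, t2.id AS bottom
    FROM transistors t1
    JOIN transistors t2
      ON t1.source = t2.drain
    WHERE t1.type = 'NMOS'
      AND t2.type = 'NMOS';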

IC example results
Combinatorial explosion makes directly joining all possibilities for each element impossible
Can constrain better using fanouts of signals internal to the circuit
Individual SQL queries for finding possible matches for the individual transistors took under 10 seconds
Found all uses of the AND macro, as well as many other (1300+) identical structures generated through other means

Ontology example
Expand out all possible interpretations of a phrase
Ontology specifies lexical elements, IS-A relations, concepts, and constraints on concepts
Goal is to search the space, expand concepts to find all matches to given phrase

Ontology results
Partially unfolded ontology
Greatly expands database size, but reduces iterations / recursions

Recoded ontology triples as integers: 5.58 sec. vs. 262 sec. (schema sketch below)
Can pipeline multiple queries
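
A sketch of what the integer recoding might look like, with hypothetical table names and a made-up relation encoding: strings are mapped to integer ids once, so the repeated expansion joins compare integers rather than strings.

    -- Hypothetical schema: terms maps strings to ids; triples
    -- stores (subject, relation, object) as integers.
    CREATE TABLE terms   (term_id INTEGER, term VARCHAR(64));
    CREATE TABLE triples (subj INTEGER, rel INTEGER, obj INTEGER);

    -- One step of IS-A expansion, assuming rel = 1 encodes IS-A:
    SELECT t1.subj, t2.obj
    FROM triples t1
    JOIN triples t2 ON t1.obj = t2.subj
    WHERE t1.rel = 1
      AND t2.rel = 1;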

Issues
Works if you can reduce your problem to SQL queries
All of the problems were based on graph expansion / exploration; how about other domains?
Issues of database partitioning? How does arbitrary slicing across 108 blades affect performance / scalability, esp. for non-sparse problems?
Strawman comparison to workstation-class machine: how does a traditional DB server / storage cluster compare?
