VectorStar - Technology Introduction

VectorSTAR Technology Overview∗
† ‡
Ernesto H. Legorreta, CTO
July 20, 2008
Abstract
VectorSTAR is a high performance columnar RDBMS targeted at
VLDBs in the OLAP, data warehousing, operational BI, nancial en-
gineering, bioinformatics and scientic computation markets. Complex
multi-table associations are computed very eciently by using a vector-
based (instead of set-based) column model. Memory-mapped (instead
of buered) le I/O is used to achieve multiple order of magnitude im-
provements in data loading and query execution times while increasing
reliability and security. Data denition, manipulation and querying is
done with a vectorial SQL dialect (based on function composition) that
provides an interactive querying style supporting exploratory analysis by
end-users, plus a LINQ-style API that simplies database interaction for
application programmers. VectorSTAR uses a hybrid open-source model,
runs on both Linux and Windows 64-bit OS, and supports industry stan-
dard interfaces such as HTTP, ODBC, Microsoft Excel and .NET, and
Sun Java APIs.
∗ Copyright © 2008 by Vectornova SA de CV. All rights reserved.

† ernesto@vectornova.com
‡ Vectornova SAdeCV , Monterrey, NL, MX
1
Contents
1 VectorSTAR 3
1.1 64-bit Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Memory-mapped File I/O . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Data Storage Architecture . . . . . . . . . . . . . . . . . . . . . . 8
Columnar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Schema-oriented . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Information Schema . . . . . . . . . . . . . . . . . . . . . 12
Vector-based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Insertions, Deletions, and Updates . . . . . . . . . . . . . . . . . 14
Mapping mode . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 Failover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2 VectorSQL 18
2.1 Exploratory Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 ANSI SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Stored Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Vectorized Operations . . . . . . . . . . . . . . . . . . . . . . . . 22
3 xSTAR 23
xSTAR and ETL . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Just-in-Time CSV Loader . . . . . . . . . . . . . . . . . . . . . . 24
4 Interfaces 25
User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Bridges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
R statistical programming system . . . . . . . . . . . . . 26
Excel Spreadsheet . . . . . . . . . . . . . . . . . . . . . . 26
5 Future Release Schedule 27

5.1 Taipan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2 Berzerk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2
1 VectorSTAR
VectorSTAR is a high-performance analytic DBMS for enterprise-level appli-
TM
cations on the Linux and Microsoft Windows 64-bit operating systems.
Designed by Vectornova SAdeCV during 2001-2003, it has been continuously
1
developed and eld-tested on large scale applications since its rst deployment
in October 2004. Version 2.0, available since February 2008, is the current
state-of-the-art in high-speed databases world-wide.
VectorSTAR enables users to dene, store, manipulate, share, query, and an-
alyze extremely large amounts of data
2
in a safe, secure, consistent, compatible,
VLDB
and ecient manner. Unlike operational DBMS which usually target OLTP ap-
OLAP
plications that require support for large numbers of concurrent users executing
simple, write-biased, short-lived transactions on relatively small tables (typi-
cally smaller than ten to a hundred million rows), VectorSTAR was designed
for maximum performance in the OLAP, Data Warehousing, Financial Engi-
neering, and Scientic Computation markets, where any number of concurrent
users execute complex, read-biased, potentially long-lived queries, calculations,
and reports on VLDB applications (frequently consisting of billions of rows).
VectorSTAR is the only DBMS specically architected to fulll the most de-
manding requirements of Operational BI and Bioinformatics, two of the fastest
growing emerging markets in IT today.
VectorSTAR is relational. Informally, this means that your data is stored
on rows within tables composed of columns having a specic data type, and is
RDBMS
manipulated and queried using a well-known set of relational operators. More
vs
formally, it means that VectorSTAR adheres to the relational model (as intro-
SQL
duced by E.F. Codd in [Codd70]) as much as any of the major RDBMS in the
market today. Despite a popular misconception, the relational model is not the
same thing as SQL and compatibility with any specic version of SQL is not a
3
requirement for being a RDBMS . VectorSTAR includes an innovative dialect
of SQL which takes advantage of its vectorial architecture while providing an
eective superset of the functionality dened in the SQL2 and SQL3 standards,
including the XML-related functionality dened in SQL2006.
4
VectorSTAR is a columnar DBMS . Roughly, this means that table data
5
is stored in column-major order rather than row-major order . The idea for
columnar RDBMS seems to have arisen simultaneously during the early 1990's
both in industry (SybaseIQ) and the academia (MIT's C-Store). However,
column-based data storage had long been pioneered since the early 1970's by
1 as an embedded high-performance database engine in a family of government information

systems created by Tlallian SAdeCV, a Vectornova OEM partner.
2 25 billion (25 thousand millions) records in one, and 60 terabytes total in another of the
two largest test deployments as of June 2008.
3 In fact, alternate RDBMS languages (such as QUEL and Tutorial-D) have been
strongly advocated for many years now by the industry's chief luminaries, including Codd
himself.
4 Also known as column-based or column-oriented.
5 Row-major order is when the data elements belonging to the same row but dierent
columns are stored contiguously. Column-major order is the opposite.
3
the mainframe-based APL systems. Surprisingly, this simple 90 degree shift in
storage order has been shown to have major performance, functionality, and
ease-of-use implications over the past several years. In consequence, many of
the high-performance DBMS being designed today are columnar. VectorSTAR,
however, diers from most of them in that:
it has been designed specically to take full advantage of today's industry-

standard 64-bit CPU architecture
it takes advantage of the OS virtual memory and MMF facilities to achieve

extremely fast le I/O
it provides a vector-based, functional SQL dialect that is fast, easy to

6
learn, and has great expressive power
it oers a unique, interactive query interface that enables end-users to

easily do sophisticated exploratory data analysis
it uses a vector-based (instead of set-based) column model, oering native

7
multidimensional operations
it has a very small CPU footprint, so it coexists eectively with related

applications running on the same node
it provides a high-speed bulk data loader which can load multiple columns
in parallel and across a grid/cluster
This unique set of state-of-the-art architectural features enables VectorSTAR

to reach extreme levels of performance on industry-standard commodity hard- synergistic
8
ware . Although some of these features are implemented in isolation on other combination
DBMSs, it is their synergistic combination in VectorSTAR that produces a quan-
tum leap in performance: memory-mapped le I/O, for example, becomes very
eective when coupled with an unrestricted address space and an array-based
9
data representation, but is of no major practical use by itself .
In fact, while VectorSTAR oers data compression and indexing capabili-
10
ties , its high performance does not derive from them but from a more advanced
data architecture that simply works better. This is a very important fact, as it
6 VectorSTAR is built as a set of extension libraries on top of XJ, an array-based functional

programming language, which is also used for coding new user-dened functions and stored
procedures.
7 VectorSTAR is a RDBMS built on top of a multidimensional persistence engine (the J
engine); this is in contrast with other multidimensional databases in the market, which are
essentially multidimensional engines built on top of a relational or b-tree persistence engine.
8 Query speed improvements of two and even three orders of magnitude are common when
dealing with complex queries involving multiple join clauses on large databases, for example.
9 IBM's Red Brick (a data warehousing DBMS) achieved at least part of its performance
edge in the late 1990's by making some use of memory-mapped le I/O.
10 Symbolic coding of sparse-valued columns (i.e., when the number of distinct values is
much less than the column cardinality) and rst-level association indices (i.e., without any
kind of secondary indexing method to access the index values themselves).
4
points to a large growth potential from the current level of performance using
11
well known, straightforward techniques , in contrast to other columnar DBMS
which are, arguably, already functioning at their peak architectural capacity
12
today . In summary, for its target application area, VectorSTAR is already
much faster than most competing DBMS (of any kind) and at least as fast as
13
any other current top performer , while still being poised for a signicant in-
crease in performance in its next release. Besides, its high performance benets
come together with unparalleled simplicity and exibility.
1.1 64-bit Architecture
VectorSTAR is a native 64-bit application that has been designed from the
14
ground up to take full advantage of the x86-64 architecture (developed by
15
AMD in the early 2000's and then cloned by Intel ) which has now become the
de facto standard 64-bit architecture in the industry, clearly surpassing the Sun
16 17
SPARC , IBM POWER, and HP/Intel Itanium architectures in momentum,
pace of innovation, adoption rate, and market share.
Among the most powerful new capabilities oered by the x86-64 architec-
ture is its vastly larger address space
18
, which increases the maximum amount
vast
of addressable memory available to the OS (kernel mode) and applications (user
address
space
11 Human compression of variable length values (e.g., text and images), secondary associ-
ation indices, constant-time hash-based search indexes, and massively parallel GPU pipeline
support, for example.
12 Vertica [Vert1], which heavily emphasizes compression as one of the foundations for its
performance model (along with columnar architecture and multiple copies of data columns
sorted in dierent orders [SAB05, SBC07]), may perhaps be an example.
13 which certainly includes KDB [KX1] and perhaps Vertica [Vert1].
14 The x86-64 architecture is now called AMD64 and Intel64 by AMD and Intel, respec-
tively. Intel64 should not be confused with Intel's own IA-64 architecture, implemented on
the Itanium family of CPUs, which is not compatible with the enormous installed base of
x86 applications and has been strongly criticized by some of the most respected names in the
industry, including none other than Donald Knuth.
15 in direct competition to its own previous 64-bit strategy embodied in the Itanium CPU.
16 Sun's Opteron-based x86 servers are now outselling its own SPARC servers by a wide
margin: according to some metrics, up to three quarters of the servers they sell are x86, and
of those, perhaps up to three quarters have Linuxinstead of Solarisspecied as their OS.
17 "The history of the Itanium is mixed at best. Hailed as the x86 killer when launched in
2001, the Itanium never gained a strong following and was being written o in many circles
by mid-decade..." Cole, Arthur, "Waiting for Itanium", IT Business, January 11, 2008
HP withdrew its Itanium-2 based workstations from the market as early as 2004, citing that
...in working with and listening to our high-performance workstation partners and customers,
we have become aware that the focus in this arena is being driven toward 64-bit extension
technology.
SPARC servers out-shipped all Itanium servers by 4.5 times for the during CY07, according
to IDC's Worldwide Quarterly Server Tracker, February 2008.
The Itanium is sometimes jokingly referred to as the "Itanic" among industry analysts.
18 Although previous solutions enlarged the physically addressable memory space up to
64GB by extending the address space to 36-bits, they were architectural patches that did
not make all that memory transparently available to end user applications in a compatible
and straightforward manner.
5
19
mode) from about 2GB each, up to 128TB each . As long as the actual memory
chips and motherboards needed to put signicantly more than 4GB of physi-
cal memory on a system were lacking, this tremendous expansion in memory
addressing was of no direct practical signicance. Today, however, most com-
modity motherboards have the capability to hold 16GB of RAM, and those with
the capability to hold 32-64GB of RAM are available at aordable prices (i.e.,
within the USD$500 through $1'000 range).
Traditional DBMS do not eectively take direct advantage of this order
of magnitude increase in system memory. Essentially, all of the techniques
they employ to eciently move data between the disk subsystem (secondary
memory) and the system RAM (primary memory) were developed and perfected
at a time where 32-bit addressing space was not only a practical, but also a
theoretical constraint, and 4GB of physical RAM was considered prodigious. As
a consequence of this outdated strategy, even when provided with signicantly
more than 4GB of RAM, those DBMS will essentially end up using it solely as
an increased buering space for the system le I/O operations that continuously
move relatively small chunks of data to and from disk, still working under the
assumption that system memory is a scarce resource. The end result is that
performance gains due to increased physical memory are, at best, evolutionary
rather than revolutionary, and usually stop far below the order of magnitude
improvement that is the minimum required to achieve a signicant advance in
dealing with the ever growing amounts of data available today to competitive
20
organizations worldwide .
This failure of traditional RDBMS to take full advantage of the new genera-
tion of hardware that has now become mainstream has fostered the appearance
in-memory
of the in-memory DBMS
21
. The rapid spread of this technology
22
reects a
not
pressing industry need. In general terms, these in-memory DBMS concentrate
scalable
on improving traditional indexing, query optimization, and storage management
techniques by essentially limiting data access to that much which ts in physical
memory. This eectively precludes their utilization in VLDB OLAP and data
warehousing applications (such as those that are common in retail, nancial,
23
scientic, and telecom markets) where a 100 million ceiling on the number of
records would often be unacceptable.
Actually, the practically unlimited memory addressing space oered by 64-
bit CPUs does enable a revolutionary approach to data management that is
new
scalable
19 16TB each on Windows OSs due to a reduced 44-bit addressing space instead of the 48-bit
approach
addressing limit of the current generation of x86-64 CPUs.
20 In some markets, such as retail, network service monitoring, mobile telecom, and world
nancial trading, a performance improvement of at least a 3 orders of magnitude (1000 times)
seems to be required to allow a practical implementation of the new theoretical analysis
techniques discovered within the past two decades.
21 Probably best exemplied by TimesTen, a 1996 HP Labs spin-o acquired by Oracle in
2005 and now marketed as ... the foundation product for real-time data management. [Pro-
viding] application-tier database and transaction management built on a memory-optimized
architecture accessed through industry-standard interfaces.
22 Oracle claims at least 1,500 TimesTen installations worldwide.
23 Assuming 100-byte sized records and 64GB of RAM available.
6
truly scalable: one where application programmers and end users deal with
large arrays of data as if they were in memory, though they're stored as plain
24
sequence-structured binary les (reecting the in-memory structure of the
corresponding arrays) and letting the OS memory virtualization mechanism
transparently page them in and out of physical memory as needed. This is the
approach followed by VectorSTAR.
This next-generation approach to data management not only takes advan-
tage of the large 64-bit memory address space, but actually requires it: if Vec-
torSTAR were a 32-bit application, for example, the 2GB per process memory
address limit would restrict a mapped database to hold a maximum of 2GB of
25
data, clearly too small for enterprise-level databases . This is why no memory-
mapped DBMS such as VectorSTAR could have been practical before the main-
stream availability of 64-bit CPUs.
32-bit Version
Despite the address-space limitations of the 32-bit architecture, a VectorSTAR

version adapted to run on 32-bit architectures is available. It is targeted at per-
sonal DBMS on desktop PCs, local client DBMS on corporate PCs and work-
stations, and mobile DBMS on PocketPC handhelds. The practical maximum
for column cardinality on the 32-bit version is around 10 million rows. It runs
on Linux, Mac OS/X, Windows XP, Windows VISTA, and Windows Mobile.
1.2 Memory-mapped File I/O
VectorSTAR is a memory-mapped database, not an in-memory database. On

in-memory databases, the amount of available physical RAM constrains the memory-mapped
size of the tables that can be handled by the DBMS. This is not the case for vs
memory-mapped databases, such as VectorSTAR, where the OS virtual memory in-memory
mechanism transparently swaps disk pages into memory (called the active page
set ) as they are required by the code executing the DBMS process.
Naturally, if the ratio of real physical memory to the size of the active page
set is too small, the performance of the memory-mapped DBMS will degrade
accordingly. However, typical analytical DBMS usage does not result in a dense
page loading since highly selective where-criteria are applied up-front by most
queries. Some nancial engineering and scientic computations, however, in-
volve queries that result in fully dense page loadings, requiring the use of a
larger physical-to-virtual memory ratio in order to maintain performance. Any-
way, it can be shown that any scenario that results in heavy paging on a memory-
mapped system will also likely result in heavy memory trashing on a buer-based
system.
24 where the le consists of a simple sequence of binary values representing the basic ele-
mental data types, such as integers, oats and characters.
25 A typical sale ticket item record in a retail data warehouse, for example, will likely take
10-20 columns of, say, 4 bytes each, totalling 50-100 bytes. Just 10 million of those could ll
up the entire 2GB addressing space. Common retail data warehouses hold at least hundreds
of millions of sale ticket item records.
7
Furthermore, some important scientic visualization techniques require ac-
cess to a very small, but unpredictable, portion of a very large data set. In this
case, the unpredictability of the data access pattern prevents a buer-based
system (as employed by non-memory-mapped DBMS) from fectching the right
portion of the data set into memory a priori, whereas a memory-mapped system
will behave optimally in this situation. In any case, intelligent memory mapping
can increase IO throughput by orders of magnitude and this is why modern OS
themselves use memory mapping to implement shared libraries and to load and
run executable program les (.DLL and .EXE on Windows OS, for example).
A memory mapped le is added to a process' virtual memory space (known
as the VAS) without actually reading the le into physical memory. The virtual
memory system will transparently read only those portions of the data set ac-
tually referenced by subsequent code; i.e., physical memory acts as a cache for
data on disk, but a cache that is loaded on a reference basis, not according to a
predetermined strategy. Also, application code which accesses memory-mapped
les is identical to code that accesses private in-memory structures (although
performance may dier).
In short, a memory-mapped database trades buer-based le I/O for the
OS virtual-memory mechanism. The latter provides a private virtual address
buer
space for every process using memory-mapped le I/O on the system page le,
vs
the executable and library les, and the mapped data les associated with the
virtual
process. In contrast to buer-based le I/O, with virtual memory-mapped le
I/O there is no need to manage buers or to use any of the traditional lesystem
I/O calls (fopen, fclose, fread, fgets, ...) 26
to access le data : the OS does
this hard work, and does it eciently, transparently, and reliably (it is one of
its main jobs, anyway). Multiple processes can share memory by mapping their
27
virtual address spaces to the same le or to the page le .
An important consequence of the conceptual simplicity of the memory-
mapped le I/O strategy is that the performance prole of the application is
linear
more linear, with less hierarchic buering and cache layers, and with signi-
performance
cantly less degrees of freedom when compared to the multi-hierarchical buered
model
le I/O alternative.
1.3 Data Storage Architecture
Columnar
VectorSTAR is a columnar RDBMS. This means that table data is stored in

column-major order instead of row-major order (which is the case for tradi-
tional RDBMS). Columnar architectures are quickly becoming the new standard
28
for high-performance analytical DBMS of any kind. Alternatively, the adjec-
tives column-based and column-oriented are also used to designate columnar
database architectures. There is nothing in the relational model that says that
26 they are still used to open and close the memory mapped les themselves, of course.
27 interprocess memory sharing is a common reason for mapping to the paging le.
28 Such as SybaseIQ, MonetDB, C-Store/Vertica, and KX/KDB.
8
data must or should be stored by rows rather than by columns. The fact that
this has been so in the vast majority of the RDBMS systems built to date is
an historical accident related to the common way of thinking about records in
29
application programming languages and fostered by the particular constraints
imposed by the highly concurrent, write-biased OLTP environment.
Current usage in academic, industry, and press writings (discarding outlier
cases likely attributable to misconception), indicates that column-major order
columnar
storage is a necessary and sucient condition to be categorized as a columnar
DBMS
DBMS. However, existing columnar DBMS dier signicantly in several other
diversity
important architectural traits. VectorSTAR, for example, stores every column
30
as a separate le consisting of a vector of values , a trait that is shared by
very few other columnar DBMS. The vast diversity and large number of dier-
entiating traits among current columnar DBMS makes it very hard to come up
with sound, useful generalizations which are broadly applicable to the columnar
DBMS market at this moment.
Nevertheless, as is the case with most other true architectural contrasts,
the contrast between column-major and row-major storage has performance
implications which can be eectively harnessed for competitive advantage.
It is true that any advantage so gained is often likely to be also the ulti-
mate source of a disadvantage in the contrasting context. Read optimization,
magnitude
for example, is often in contrast with write optimization, as is encoding space
assymetry
vs. decoding speed, and any architecture that intrinsically favors one will fre-
quently (but not always nor necessarily) be at a disadvantage when dealing
with the opposite. However, in the column-major vs. row-major architecture
dichotomy, column-major storage unilateraly benets from a signicant magni-
tude assymetry in the underlying domain: column cardinality is many orders of
magnitude larger than row cardinality for the vast majority of tables. If the row
and column cardinalities were of similar magnitudes, then it is certainly true that
column-major storage architecture would have the edge in certain contexts while
row-major storage would have it in others. But the cardinalities are nowhere of
comparable magnitude and operations that take advantage of the vastly larger
cardinality of columns easily result in signicant performance gains when com-
pared with their application under the loop-based, one-by-one case required
by the row-oriented approach (which necessarily ends up physically separating
logically-contiguous column values by those values of the intermediary columns
that constitute the same row). Thus, in the end (probably after a degree of
evolutionary technical development similar to the one that traditional DBMS
31
have undergone over the past 15 years ), columnar DBMS should outperform
29 Files have traditionally being thought as sequences of records consisting of n data values
encapsulated into an atomic n-tuple.
30 which are then memory-mapped into the DBMS process using a single, very ecient,
mmap system call.
Compressed columns are stored in dierent formats, according to the compression scheme
used. RLE-compressed columns are stored in a format that keeps track of the count and the
value of each run, for example.
31 Which likely will not take the same amount of time, as the pace of technological develop-
ment grows exponentially with time.
9
row-oriented ones on most counts.
An important pair of contrasting contexts is that of OLTP vs. OLAP ap-
plications. As OLTP emphasizes the concurrent writing of large amounts of
OLTP
small records, traditional DBMS (which grew up in a world where OLTP was
vs
the main driver for DBMS development) were designed on a row-oriented ar-
OLAP
chitecture that favored the atomic retrieval, insertion, and updating of whole
records at a time. As a result, in practically all current row-oriented RDBMS,
whole sets of records are often pre-fetched from disk into memory buers so
that they can be made available quickly to the DBMS process. Ironically, it is
precisely this behavior that gives rise to one of the rst speedup opportunities
for columnar DBMS.
Consider a table of employees consisting of employee name, birthdate, salary,
department, ngerprint, and photo. On a row-oriented DBMS, a query that
selects those employees in department xxx with salaries greater than yyy and
displays their names and birthdates will cause the unnecessary loading into
memory of the data in the ngerprint and photo columns (which are usually
relatively heavy columns), even though they're never referenced neither in the
where-criteria nor in the display-criteria of the query. On a columnar DBMS,
in contrast, the ngerprint and photo columns will not be touched as a result
of executing this query.
VectorSTAR can go further. A query such as the one described above could
be split so that only the where-criteria was executed on a main (fast, expen-
sive) server, producing what is called a result index set (or NDX) that would
then be passed to a (slower, cheaper) secondary server which would then exe-
cute the display-criteria , nally producing what is called the result set that is
returned to the user. This means that the photo and ngerprint columns, which
32
in this particular case would likely never be used within a where-criteria , do
not even need to be stored on the fast server, but solely on one or more sec-
ondary servers which would not be required to execute processor-intensive or
memory-consuming searches but only return the column values (as specied in
the display-criteria) for the indices specied by the NDX. This is only possible
due to the vectorial column representation used by VectorSTAR.
Note that a consequence of column independence is that total aggregate table
size is no longer an adequate indicator of the global "database size" as used
total size
when performing IT infrastructure planning. On truly columnar DBMS, it is
not
eectively replaced by the maximum column cardinality. relevant
Schema-oriented
In VectorStar, the individual data le representing the contents of a single table
column is the basic building block, at the lowest level, of a data storage strat-
egy designed to reect the logical schema of the database on the disk lesystem.
VectorSTAR uses directories to represent the schema objects (databases, tables,
and columns) and les to hold the column values and various metadata. For
32 Although not in a case where VectorSTAR biometric support was used to query based on
face or ngerprint, for example.
10
example, a VectorSTAR installation with two databases (Db1 and Db2) and mul-
tiple tables each, would look like this on disk: (names ending with / represent
directories):
VSTAR/ -- Db1/ -- Tbl1/ -- Col1/ --data

-- Col2/ --data
-- Tbl2/ -- Col1/ --data
-- Db2/ -- Tbl1/ -- Col1/ --data
XSTAR -- Db1/ -- Tbl1/ -- csv source les
-- Db2/ -- Tbl1/ -- csv source les
This illustrates how each VectorSTAR database occupies a single lesystem

directory named as the database itself. Within that database directory, there schema
is one directory for each table within that database, named as the table itself. reected
Within each of those table directories, there is one directory for each column on-disk
within that table, named as the column itself. Finally, within each of those
column directories, there is a single le (named data) containing the values
for the corresponding column.
All the information related to a particular database is stored within its cor-
responding directory, including all its metadata (which is used to populate the
INFORMATION SCHEMA metadata database when the VectorSTAR command SYNC SYS
is executed). This means that a database can always be copied anywhere simply
by copying the contents of its associated directory, using the standard le and
directory manipulation tools provided by the OS, with full condence that one
is taking along everything that is needed to have a full backup or even to bring
up the database on a dierent VectorSTAR node. This is a very powerful and
simple way to manage the backup and distribution of databases. Of course, all
this applies also to specic tables within databases and, furthermore, to specic
columns within tables.
As a consequence of this design, the data and metadata stored in Vec-
torSTAR databases can be easily browsed, manipulated, copied, and even queried
with a large number of dierent tools other than VectorSTAR's own. For ex-
ample, one can always nd out the type of any column Col1 in table Tbl2 within
database Db3 by simply fetching the URL file://VSTAR/Db3/Tbl2/Col1/TYPE from
33
any browser . Similarly, backing up a whole database is simply a matter of
34
copying the database directory to another disk using nothing more than the
standard le and directory copy commands in OS shell or any of the myriad
o-the-shelf tools available for that purpose.
Vertical partitioning of a table is also straightforward to accomplish: simply
locate the desired columns on a dierent disk node and create symbolic links
within the table directory to point to them. Again, VectorSTAR relies on the
standard OS facilities available to accomplish that task. Users are spared the
need to learn yet another partitioning scheme.
33 VectorSTAR supports a RESTful API.

34 the OS itself ensures that its own RAM-to-disk buers are ushed when it is issued the
le copy system call.
11
In summary, VectorSTAR's disk-based architecture is completely open and
very straightforward. A specic example is illustrated below for School, one of
the VectorSTAR tutorial databases:
VSTAR/
|__School/
| |__Student/
| | | |__COUNT: 5
| | | |__DESCRIPTION: A registered student at the university
| | |__Name/
| | | |__DATA: anna kuhn|louis herbert blake|...35
| | | |__TYPE: string
| | | |__DESCRIPTION: first name, optional middle name, last name
| | | |+ ...other metadata for Name column
| | |__Gpa/
| | | |__DATA: 3.9 2.4 ...36
| | | |__TYPE: numeric
| | | |__DESCRIPTION: Most recently calculated GPA
| | | |+ ...other metadata for Gpa column
| | |+ ...other columns in Student table
| | | |__DESCRIPTION: A professor at the university
| |+_Teacher/
| |+ ...other tables in School database
|+_Telco/
|+_Retail/
|+ ...other databases in this VectorSTAR node
Information Schema As a consequence of this schema-oriented storage ar-

chitecture, the database metadata (ANSI information schema ) is always directly
available through standard OS system calls. Of course, to conform with the
Relational Model, VectorSTAR provides a VectorSQL interface to these calls,
making it appear that the information is actually stored in the corresponding
tables within the SYS database (Databases, Tables, Columns, Procedures, etc.) This
arrangement provides multiple benets:
any change to the underlying storage is immediately reected in an ulterior

query for the SYS metadata
the concurrency and consistency of DDL commands is provided by the OS

itself
dierent underlying le systems provide dierent performance models for

VectorSQL DDL commands, enabling a very ne tuning of VectorSTAR
to dierent usage scenarios
database metadata can be easily obtained (and reported) using standard

OS utilities and o-the-shelf tools
35 sequence of null-terminated (indicated by | here) arrays of characters
36 sequence of 64-bit IEEE double precision oating point numbers
12
Vector-based
37
VectorSTAR columns are vectors of values rather than sets of values, both
conceptually and physically
38
. All columns in a table must be of the same
cardinality
39
. The i th element of column Col1 corresponds to the i th element
of column Col2. This avoids the need for indexing when associating the (oth-
erwise independent) columns within a table. Furthermore, it also leads to a
straightforward bit index implementation: VectorSTAR's result index set (NDX)
is simply a bit array of the same cardinality as the associated table. The NDX
represents the current set of selected rows on the table (See Section 2.1).
Array Data Types The datatype for a column is not restricted to scalars:
it can also be a multidimensional array. Thus, you can have a column of type
int(1000) for example. This is not, as some may think, a violation of the re-
lational model's principle of normality. Rather, it is the natural extension of
the SQL string type denition, e.g. char(30), to data types other than charac-
ters. This conceptually simple feature has nonetheless important implications in
performance and code simplicity under a wide variety of information modeling
problems. It is frequently useful when modeling 1:N relationships where N is
xed and invariant as, for example, the set of temperature measures returned
by a xed number of thermometers on a given location at a given time, or the
low, high, and closing price of a stock on a given day, etc.
Cubes The XJ/J engine underneath VectorSTAR can create and manipu-
late multidimensional cubes, both persistent and transient, of practically unlim-
ited size. VectorSTAR supports not only the typical slice and dice operations,
but also provides a large number of vectorized operations that manipulate the
data in the cubes without requiring loops as in traditional programming lan-
guages. Furthermore, all user-dened operations are automatically applicable
across any one dimension and over any dimensional partition of the cube.
For example, a cube with sales information for 20 countries, 50 regions, 1000
salesreps, 75 products, and 366 days in a year would be constructed as follows
(assuming the source data is on a le called salesdata) :
[] Sales =: cube 'country 20, region 50, salesrep 1000, product 75, day 266'
37 The rst element (at index 0) of every column serves both as its null reference (i.e., 0 is
the value used by FKs into this column to signal a null reference) and as the holder for its
null value (i.e., the actual bit value of the null value for a column depends on the column's
data type).
38 VectorSQL conforms to the Relational Model and does not provide commands that depend
on a particular ordering of the rows within a table, but XJ functions see the data in columns
as (possibly multidimensional) arrays. This allows for certain optimizations that are not
possible otherwise as shown in the mixed master+detail table described in the Advanced
Retail Tutorial provided with the VectorSTAR distribution.
39 Columns can be grown at dierent moments and thus have dierent apparent cardinalities.
However, the table itself has a CARDINALITY attribute that represents the maximum cardinality
common to all its columns and this is the value that is used to compute all queries on them.
13
[] 'Sales' READ_CUBE 'salesdata'
=> Loaded: 100'000 cells.
The READ_CUBE operation reports 100'000 cells read: these are the non-sparse
elements of a much bigger cube, whose total number of elements is obtained by
the product of all the dimension cardinalities:
[] print prod dim Sales

=> 27'450'000'000
In spite that the cube consists ofmore than 27 billion cells, none of the following
calculations takes more than a second on an entry level x86-64 CPU machine.
First, calculate the total revenues (i.e., the sum of all cells):
[] print sum all Sales

=> 10'304'029'394
40
Total revenues by country :
[] print 4&rollup Sales

=> 0 | 515'349'201
1 | 235'220'139
2 | 392'482'283
...
19 | 301'328'404
Total revenues by the rst 3 salesreps:
[] print 3 first 2&rolldown 2&rollup Sales

=> 0 | 6'193'490
1 | 13'008'375
2 | 8'203'330
1.4 Insertions, Deletions, and Updates
VectorSTAR is strongly oriented towards OLAP instead of OLTP usage. How-

ever, its insert/update/delete model is very eective, and Taipan, its next
major release, should be able to compete successfully in transaction based in-
41
sert/update/delete contexts, such as the one included in the TPC-H .
Bulk inserts on tables that are not accessed 24/7 should be done simply by
bringing the table oine, doing the bulk insert, then bringing it back online.
Bulk inserts on tables that are continuously accessed should be done using a
40 The code uses the following XJ standard denitions:
'level rollup cube' is 'sum"1 repeat level cube'

'level rolldown cube' is 'sum repeat level cube'
The operations could have been made at an ever higher level using subtotal:
[] print 'country'&subtotal Sales
[] print 3 first 'salesrep'&subtotal Sales
41 rumored to be there only at the instigation of the big traditional RDBMS vendors, perhaps
in an attempt to keep read-biased OLAP dbms competition at bay.
14
two table model: keeping the table version 1 online, using table version 2 to
do the bulk insert, typically of hundreds of thousands to millions of rows, and
then having VectorSTAR switch table version 2 for version 1, using the VERSION
command (this is the reason for the trailing digits name restriction: they're
used to indicate multiple versions of the same table). Once this is done, the
contents of version 2 are copied in the background (perhaps using the OS tools)
to version 1 (either using a delta scheme or simply overwriting, depending on
the context). The next bulk insert will be done on version 1, keeping version 2
online. And so on.
Row inserts/updates/deletes do not aect read performance at all. They
can only clash among themselves.
Insertion is at the column level, meaning that actual insertion to the columns
in a table need not be strictly simultaneous for all its columns. If the values for
a column are not available at the moment of the insert, they can be temporarily
instantiated with NULL if the table allows them (this the default behavior in
VectorSTAR). Later, a BULK UPDATE or ROW UPDATE can be done for the missing
value(s) in that column (bulk updating requires bringing the table oine, but
versioned tables avoid this problem).
Whenever you write a value to an address that is mapped read-write to a le,
the value is committed and ushed to disk transparently, outside of user control.
This usually happens immediately for most practical purposes (however, this
ultimately falls under the OS memory map ush policy).
One should never write directly to a disk le that has been mapped into
42
memory . Data in a memory-mapped le should only be modied through
writing to the corresponding memory addresses instead.
Mapping mode The loader's job (xSTAR) is to transform a text-format source

le into a binary-format data le. The mapping is done by the DBMS pro-
cess (vSTAR), not the loader process. The column data les can be mapped
either READ_ONLY or READ_WRITE. Default is READ_ONLY, to avoid inadvertent over-
writes (which can happen when using the lower-level XJ interface instead of the
controlled one oered by VectorSQL).
Any number of processes can map the column data le both ways. If more
than one process is writing to the same word in the data le, then a write-
conict can occur if they are not coordinated with a semaphore, for example.
VectorSQL does this coordination for the INSERT, UPDATE and DELETE commands.
Write-locking a column data le could potentially be as ne grained as byte-
level for special situations (orchestrated using XJ functions), although the eec-
tive granularity is coarser when using the current implementation of VectorSQL.
Using XJ functions, mapping can also be done at the column segment level, in-
stead of the whole column.
Access mode There are 4 distinct kinds of vSTAR usage:
42 Windows Server OS will in fact prohibit the operation. Linux will let it proceed and crash
the associated application process, however.
15
Read-only mode, Private or Shared
Inserts allowed, Private or Shared
Updates allowed, Private or Shared
Deletes allowed, Private or Shared
Read-only This is the default mode for tables. Inserts are done only in
bulk mode ("bulk inserts"), by bringing the table oine, doing the bulk insert,
then bringing it online again (but see below for versioned tables).
Inserts When inserts are allowed into a table, its columns must be pre-
sized to the maximum space that they will eventually require (somewhat anal-
ogous to what Oracle does in order to maximize performance in certain sit-
uations). Initially, all the "unused" slots are lled with the NULL value (in the
current version, tables that allow Insertion must allow NULL in all columnsexcept
the primary key column, of course, which is automatically initialized to a se-
quence of INT values). The table itself holds its eective cardinality in an internal
variable. Private-insert mode suers no performance penalty at all (compared
to Read-only mode). Shared-insert mode requires a "guardian" process to co-
ordinate, through a semaphore, the writing to the same column by more than
one process.
This coordination is only needed to obtain a valid "row index" (rix) to be
used for the insert. Once gotten hold of a valid rix, a process simply writes the
new values in the columns at that position. The new values are not yet made
available to others until the process does a commit on that rix. At that point,
the table updates its internal SYSTEM VIEW NDX (see below, under Deletes, for
a description) to include that rix (this operation requires locking the NDX le
for writing; readers will not be locked) and increases its eective cardinality
accordingly. At that moment the new value is available to any new reader. If
the process fails to call a commit, that rix will remain unused, as a blank space
in the column (which can later be removed by compacting the table, see below).
If the process call abort, that rix is open for reuse. A table that supports row
inserts has, by default, a column called INSERTID which holds the user identity
of the process doing the insert.
Updates Updates are handled transparently as inserts with the relevant

modied data for the updated columns. They require the table to be marked as
supporting Row-Versioning. In this manner, the newly inserted row is made the
current version of the row, while the previous versions remain available (and
can actually be accessed using the SELECT VERSIONS command, for example). The
performance penalty for a row-versioned table comes from the FILTER operation
that has to be done before each select, to remove non-current versions of rows.
On a table with cardinality of order 10^5 this is practically negligible: less than
half a millisecond.
16
To discard older, non-current versions of rows, the table must be compacted.
The COMPACT command gives the option of transferring the non-current versions
to a backup table. A table that supports row versioning has a column called
VERSION which holds a full timestamp value. Row-level versioning is a powerful
feature in certain contexts (you could, for example, ask for those versions of a
record as they were between May 1st and May 3rd, for example). A table that
supports updates has, by default, a column called UPDATEID which holds the user
identity of the process doing the update.
Deletes Deletes require controlled, concurrent shared access to the ta-

ble's SYSTEM VIEW NDX which could become a bottleneck in a worst-case scenario
where a large number of concurrent users are deleting rows from the same ta-
ble concurrently (this situation very rarely arises in common real-world usage,
however). The delete feature requires that the table be marked as enabled for
Horizontal-ltering (as opposed to Vertical-ltering, which lters which columns
are available for selects).
43
Horizontal ltering works by having each table keep a bit index (bix) of
those rows that can be accessed by the user. Before any select is computed
against the table, the non-accessible rows are removed from the potential result
set. In the case of a Delete-enabled table, this bix always contains 0's in those
positions where the corresponding row has been deleted.
To discard deleted rows, the table must be compacted. The COMPACT command
gives the option of transferring deleted rows to a backup table. A table that
supports deletes has, by default, a column called DELETEID which holds the user
identity of the process doing the update.
Private vs Shared The private version of all this usage modes avoid
any lock overhead and thus function a the same level of the highest read-only
performance.
1.5 Failover
The purpose of the Failover conguration is to provide continued service in

the face of server (either hardware or software) breakdown. It is designed for
a cluster conguration accessed via HTTP, usually through a web server. In
this conguration, two or more nodes in the cluster host identical VectorSTAR
servers. The VectorSTAR HTTP listener processes are hosted by a web server
(which must take care of redundancy for this part of the system). These HTTP
listeners are assigned one of the available dbms nodes randomly (default policy)
for each new query. If the dbms node does not respond in the maximum alloted
time window, the HTTP listener process sends the query again, this time to a
44
dierent, but also randomly selected, VectorSTAR DBMS node .
43 this feature provides functionaliry roughly equivalent to horizontal partitioning in tradi-

tional RDBMS, typically exploited through materialized views.
44 This conguration is only currently available for the internet web access model, not yet
for the intranet client/server access model.
17
2 VectorSQL
VectorSQL is a vectorial dialect of SQL implemented as a library extension
on top of the XJ vector programming language. In contrast to traditional SQL functional
dialects, VectorSQL is interactive, function-based, vector-valued, and extensible: SQL
function based: every command in VectorSQL is a function (traditional

SQL commands are statements). Functions have much simpler syntax
than staments, can be easily and consistently composed to produce larger
semantic units, and can be user dened.
interactive: VectorSQL code is highly synthetical, i.e., it expresses larger

concepts as compositions of smaller ones, and as a result, it can be run
interactively, enabling users to do exploratory what-if data analysis (tra-
ditional SQL code is monolithic and compiled).
vectorized: every VectorSQL function can operate on whole (single or

multidimensional) arrays at once, without having to resort to for or while
loops.
extensible: new functions that you create are rst-class citizens in Vec-
torSQL and perform just as fast as the native ones provided in the core
distribution.
VectorSQL is simple to learn and command of the language can comfortably be

increased by gradually incorporating use of advanced XJ concepts (such as func-
tion rank, function composition, boxing and unboxing, dimension transposing,
attaching, joining, and appending items) into the results of VectorSQL com-
mands. This advanced functionality is useful when dening calculated columns,
for example.
Anything that can be done with ANSI SQL (and this applies also to most
common proprietary SQL vendor extensions) can be done with VectorSQL, fre-
45
quently in a simpler manner. For any signicant amount of SQL code , the
46
equivalent VectorSQL is frequently not only up to 75% shorter (about a quar-
ter of the actual number of LOC), but also simpler in structure (VectorSQL
code is rarely nested) and thus easier to understand.
2.1 Exploratory Analysis
Exploratory Analysis refers to that style of continued interaction between a user

and a DBMS where the results of running a query determines the nature and
details of the next query to be performed. Exploratory analysis imposes, at the
very least, the following requirements on a DBMS:
45 That is, when scripts are longer than about a hundred lines or so.
46 This is a rough statistic based on VectorSQL translations of about a hundred real-world
SQL scripts done over the past four years, and on a translation of the examples in a popular
SQL cookbook, a current work-in-progress at Vectornova.
18
very fast query execution, so that most ad-hoc queries complete in less
than 5 seconds
interactive querying language, so that queries can be comfortably and

eectively entered piece-meal into a console
a way to keep track of previous results and to use them in ulterior queries
Selection Index The row selection index or NDX represents the current
set of selected rows on the table, based on the immediately previous execution
of WHERE and related VectorSQL commands. The NDX is simply a bit array of
the same cardinality as the associated table. VectorSTAR provides the facilities
to save these bit indexes to an ordinary le, which is typically quite small:
the NDX for a dense selection out of a 10 million record table is barely 1.2MB
uncompressed, and usually less than 200KB when compressed. Compression is
done automatically by the SAVE_SELECTION command using a fast RLE algorithm.
Actually, for a sparse selection, the NDX is stored as an index set (rather than
bitset) and the result size in bytes is equal to the number of selected rows (i.e.,
for 10 rows out of 10 million, the NDX occupies a mere 10 bytes). A unique and
very powerful capability of VectorSTAR is that this saved NDX can then be:
imported later into the same session
imported later into a future session by the same user
sent to a colleague (perhaps as a mail attachment?) who imports it into

their session
sent over a grid to a dierent node(s) where the output will be produced
and displayed
sent to a dierent node to be applied to some other (usually conceptually

related) table of the same cardinality
Selecting Rows and the Selection Set

47
Whenever you do a WHERE command on a table , the eect of the command is
that the selection set for the table is modied. Think of the selection set as a
boolean vector with the same cardinality as its associated table. Entries marked
1 (TRUE) are currently selected (by SELECT_ROW commands that you have issued
on the table), those marked 0 (FALSE) are not. The EVERY command resets the
selection set for a table to all 1's.
The selection set is not persistent. It retains its value only while the session
is running. Furthermore, if a user connects to VectorSTAR via a web browser,
the default access mode does not guarantee them the exact same backend session
process for every command they issue, thus preventing them from being able to
refer to previous selection sets.
47 or any of its aliases, including SELECT_ROW, WHEN, etc.
19
A user can save the current selection set to a le (note that this is not
the same as saving the selected rows, it only saves their references) using the
SAVE_SELECTION command. This le can then be sent to other users who can
LOAD_SELECTION from that le and obtain a selection set on their session that is
the same as if they had performed the same queries as the original user who
created the selection set le.
2.2 ANSI SQL

48
The VectorSQL command set provides a nearly one-to-one functionality map-
49 50
ping to the ANSI standard SQL command set . Nevertheless, VectorSQL is
a function-based dialect of SQL and as such, its structure is naturally sequen-
tial instead of nested (a fact that reportedly makes it easier to comprehend on
complex queries) and, more notably, its order of execution is often "backwards"
compared to that of SQL. In the end, although strict compatibility is impossible
by design, functional compatibility is achieved in most cases.
A concrete example should help clarify the issue. Here is the code for calcu-
lating a business metric using a common commercial dialect of SQL:
SQL
INSERT INTO report_output

SELECT &&tech_id, &&datum_id, &&service_id, tmp.region, tmp.amount
FROM (SELECT tmp2.region, sum(tmp2.amount) amount
FROM (SELECT region,
round(sum(total_amnt),2) amount
FROM cdr
WHERE trunc(tx_time) = to_date ('&&rundate','YYYYMMDD')
GROUP BY region
UNION ALL
SELECT region_id region,
round((sum(abs(ADJ_AMNT))+nvl(sum(abs(DED_1_AMNT)),0)
+nvl(sum(abs(DED_2_AMNT)),0)),2) amount
FROM ahr
WHERE trunc(tx_time) = to_date('&&rundate','YYYYMMDD')
AND adj_type in (17,20)
GROUP BY region_id
)tmp2
GROUP BY tmp.region
)tmp;
48 The word command is chosen as a way to avoid having to specify whether the implemen-
tation is through a statement or a function.
49 Actually, there is really no such thing as a standard SQL outside of the denitional papers
presented to the ANSI and ISO as most SQL vendors implement widely varying functionalities
using even more diverse syntax.
50 Transactional inserts and updates, plus other related commands, are only currently avail-
able on the Taipan alpha.
20
Note both the nested structure and the fact that the order of LOC execution is
not the same as their order of appearance. Note also the need for two temporary
tables. Here is the equivalent in VectorSQL:
VectorSQL
FROM 'Cdr'
WHERE 'TxTime in_day ', RunDate
GROUP_BY 'Region'
SELECT INTO 'Tmp' 'group AS Region, round@sum TotalAmnt AS Amount'
FROM 'Ahr'
WHERE 'TxTime in_day ', RunDate
AND 'AdjType in 17, 20'
GROUP_BY 'RegionId'
Adj =. 'sum abs AdjAmnt Ded1Amnt Ded2Amnt'
SELECT UNION 'Tmp' 'group AS Region, round@sum ', Adj, ' AS Amount'
FROM 'Tmp'
GROUP_BY 'Region'
SELECT INTO 'ReportOutput' _
Tech, Datum, Service
group, sum Amount AS Amount
)
Things to note:
VectorSQL is frequently shorter (sometimes signicantly) although this is

mostly related to the fact that some temporary tables that have to be
calculated and stored on traditional RDBMS are not required on Vec-
torSTAR. This is not the case for this query, however.
Notice how the 2-level nesting of the original query is attened by Vec-
torSQL, as the cdr and ahr intermediate results do not need to be cal-
culated from inside a subquery.
Every SQL statement and function has a direct equivalent in VectorSQL.
The nvl function is usually not needed.
Notice the use of the INTO modier to SELECT, which is a shortcut for: INSERT
INTO x ;; SELECT y . The use of _ in the last SELECT means "use the following
lines up to the single ) as argument to SELECT".
Experience with a rather varied set of SQL programmer backgrounds and ca-
pabilities has shown that many SQL users have no trouble understanding the
51
VectorSQL implementation of a SQL stored procedure . Frequently, the com-
monality of the command set is promptly perceived, leaving only the dierent
52
ordering of clauses as a notable distinction .
51 Of course, after a very brief mapping strategy is provided to them.

52 Often the source of comments such as (quoting an actual statement): "I see. They're the
same commands but in reverse order."
21
There is a reason for the apparently "reverse" order of execution. Since
VectorSQL is based on function composition, it has no complex "statements"
and thus the ordering of the component functions (think "substatements", as
in the FROM, WHERE, SORTED BY, etc., pieces of a full SELECT statement) has to
match the required execution ow, whereas in SQL, where SELECT is a complex
statement, the xed syntatic ordering of its subclauses is only really the result of
its original designer's choice. Note also that the ordering is not strictly "reverse"
in many cases, as with the SORT_BY command, which frequently goes at the end
in VectorSQL queries too.
In summary, existing SQL scrips can be reused in VectorSQL in the sense
53
that their underlying logic won't have to be changed , but they will still require
54
some (mostly mechanical) syntax transformation to accomodate the true-to-
execution-ow ordering required by VectorSQL.
2.3 Stored Procedures
VectorSTAR fully supports the functionality provided by stored procedures on

traditional DBMS through user-dened function scripts that are hosted isolated
on their own VectorSTAR process: this provides the assurance that the user
interaction will be strictly limited to that functionality provided by the specic
stored procedure invoked.
2.4 Vectorized Operations

55
Array-based languages generalize operators and functions to work transpar-
ently over multidimensional arrays as well as scalars, thus providing a higher-
level conceptualization of data manipulation than scalar-based languages (which
require loops to iterate over all the individual members of the arrays).
The following XJ function computes the usual arithmetic average:
avg =: sum div num
Note that there are no references to the parammeter names anywhere in the
code: this style of function denition is called function-level programming
56
.
53 Although it may well be the case that some potential peformance gains are missed out
by adhereing strictly to the original SQL detailed logic in some cases, such as those that call
for the creation of a large number of temporary les in situations where a natural VectorSQL
rendering of the general logic would not require it.
54 A SQL-to-VectorSQL translator is certainly feasible.
55 such as VectorSQL, XJ, and J (used in VectorSTAR); as well as APL, A+, K, Q, Nial,
and some versions of Fortran.
56 Function-level programming is not the same as functional programming. The former
produces programs by assembling functions using higher-level functor operators (i.e., operators
that work on functions to produce other functions). The latter consists on dening (and later
calling) functions which produce no side-eects and always produce the same result when
called in the same context. Examples of the former are rare, the most complete being FP/FL
by John Backus (creator of FORTRAN), J by K. Iverson and Roger Hui, plus some
quite uncommon LISP programming styles. Examples of the latter, though, abound: standard
LISP, Scheme, O'Caml, and Haskell being among the best known ones.
22
This avg will work for any number and kind (as long as they're specializations
of numeric) of arguments.
57
A more familiar implementation follows, using functional programming :
avg values is (sum values) div (num values)
Finally, an implementation using imperative programming:
avg values is
sum: 0
for v in values do
sum: sum + v
end
n: num values
return sum div n
)
These examples clearly point out the signicant reduction in code size (mostly
due to spurious complexity) that can be achieved with the vectorized operations
available in VectorSQL and XJ. Vectorized operations are also often optimized
in ways that a function operating on successive scalars inside a loop could not
be subject to, resulting in signicant performance gains.
As a complete example, here is a very concise library of commonly used
58
statistical functions written in J (NB. introduces a comment ):
mean =: sum div num NB. sum div by cardinality

norm =: diff mean NB. differences between values and mean
var =: mean @ sqr @ norm NB. mean of square of norms
stdev =: sqrt @ var NB. square root of var
cov =: mean @ (* & norm) NB. mean of (values times norms)
corr =: cov div (* & stdev) NB. cov div by (values times stdevs)
It is instructive to workout the same code on a traditional scalar programming

language for comparison.
3 xSTAR
xSTAR is VectorSTAR's grid-parallel bulk data loader and is specically de-
signed to take advantage of the unique opportunities presented by VectorSTAR's
parallel
memory-mapped columnar architecture. xSTAR can be aordably congured
data
to provide the highest data-loading performance available today using only in-
loader
dustry standard disk subsystems.
57 functional programming works at the data-level : it species an "assembly line" (made of

functions) through which input data is passed and transformed into output data.
58 In handwritten documents N.B. stands for the latin Nota Bene, and is often used in
mathematical literature to introduce author comments and notes.
23
xSTAR and ETL
xSTAR is not a full-edged, general-purpose ETL application. In a typical Vec-

torSTAR installation, a dedicated ETL product will extract data from OLTP
59
and other legacy LOB systems and produce CSV les, which are then scanned
by xSTAR and transformed into the column binary data les that will be
memory-mapped by the VectorSTAR DBMS. A single CSV input le with rows
consisting of n elds, will be transformed into n dierent binary data les, one
per column.
Other text formats (such as XML, JSON and NetCDF) are supported through
addons to the core VectorSTAR product, but maximum loading speed is achieved
using CSV-formatted text les.
Inmmediate availability
Once xSTAR has converted a le from CSV to binary format, the le's data
is immediately available to any VectorSTAR DBMS with access to it (either
through DAS, or across a SAN).
Independence of VectorSTAR engine
xSTAR loader processes can be running on any node on a network/grid without

requiring VectorSTAR to be installed and running on that node. This enables
the deployment of very scalable and cost-eective bulk loading systems.
Maximum loading peformance When the highest attainable level of per-

formance is required:
xSTAR can read directly from a shared pipe connection established by the
ETL application, so that no actual intermediate data text le needs to be
created on disk
program the ETL process to produce multiple single-column CSV les,

instead of a single multi-column one.
Just-in-Time CSV Loader
Based on a description of the row structure of a CSV data le to be loaded,

VectorSTAR's data loader produces the le's specic loader: a customized
C program that is highly optimized for loading that specic row structure.
Essentially, the loader consists of a loop that is executed once for every row
in the CSV le. It is precompiled so that no function calls are made at all,
and only macros and direct pointer references are used within the tightly bound
code within the loop.
59 Character-separated text les, typically using pipes, commas, colons, tabs, or blank spaces
as eld separators within rows terminated by newline charactersCRLF on Windows, LF on
Linux/Unix, CR on Macs.
24
Binary File Structure
4 Interfaces
User Interface
A VectorSTAR server can be accessed in several ways:
console: monolithically, often directly from the console, where the user
interface (UI) is running in the same process as the database
client-server: from the same or a dierent network node, via sockets
web: from anywhere on the internet/intranet, using the HTTP/HTTPS

protocol
Console Mode DBA The console mode is the most exible of the three
and thus the one often most adequate for doing development on VectorSTAR.
However, it oers little in terms of security and it exposes the whole underly-
ing programming language facilities to the user, so it should not be used for
deployment of end-user applications.
In this mode, both the UI (used to enter VectorSQL commands) and the
actual database code are running on the same OS process, which will show on
your CPU task list as a j.exe process on a Windows OS or its equivalent on a
Linux OS. It is a very lightweight process that res up practically immediately
and typically does not consume more than 500K-750K of RAM initially. This
process is not multithreaded: you run multiple copies of a similar one (without
the UI overhead) to support multiple users in client-server and web congura-
tions. The sheer simplicity of this arrangement means that you can eectively
use the underlying OS process administration facilities to manage your Vec-
torSTAR database. This is in stark contrast to other DBMS systems that re
up so many obscure processes that it becomes very hard to know what each one
does.
After installing VectorSTAR in console mode, you run one of the available
console60 programs that provide a UI to the underlying VectorSTAR process.
API
C, C++, Java, Microsoft CLR languages.

60 either j-console which is a standard J console,nj-console which is a VectorSTAR
version of the J console, or xj-console which is the VectorSTAR console supporting XJ.
The nj-console is a standard J ijx command execution window that has been set up to
automatically load the VectorSQL and other VectorSTAR libraries. It is designed for raw
J input and thus it is not optimized for VectorSQL but, rather, to provide full unimpaired
access to the underlying J multidimensional programming engine. You should only use it if
xj-console.
you require a specic J facility that is not available through the alternative
The plain j-console is would be required even less frequently, mostly to access a J feature
not directly accessible from the nj-console such as the package manager.
The xj-console is built on top of the J console window and is specically designed for the
manipulation of VectorSTAR databases using the VectorSQL language.
25
Bridges
R statistical programming system There are two choices for interacting

with R:
socket-based approach provided by JSoftware with the J programming

environment
memory-mapped le-based approach provided by VectorSTAR
Socket-based R interface
require'R'
'Store' OPEN 'Sale,Product'
(GET 'Sale.Price') R 'sd($1)'

=> 289.49517039958
R_import 'sd($1)'61
sd GET 'Sale.Price'
=> 289.49517039958
JOIN 'What'
GROUP 'Product.Department'
SELECT 'group, sd Qty'
=>
ProductDepartment SdQty
FURNITURE 5.77
APPLIANCES 4.20
CLOTHING 3.41
ELECTRONICS 120.3
Excel Spreadsheet VectorSQL provides the following Excel-related func-

62
tionality :
1. directly importing an XLS le into a VectorSTAR database
2. outputting XLS format as a result of a query
3. live connection to a running Excel spreadsheet so that the results of a

query are immediately reected as a new sheet (usually as a pivot table )
4. using Excel as a command console to VectorSTAR (i.e., the user may
type free-form VectorSQL commands on a cell, and the results will be
shown on the same worksheet immediately)
61 Equivalent to:
'sd' is R with 'sd($1)'
62 (3) and (4) require VectorSTAR running on Windows.
26
5 Future Release Schedule
5.1 Taipan
Taipan is the code name for the next major release of VectorSTAR, due in 1Q09.
It will provide support for:
transaction-based inserts, updates and deletes
triggers
constant-time hash-indexes
transparent data column compression
text Human compression
Support OGC standards for geographic and geometric algorithms
Python programming language
advanced Cryptography Protocol support
OGC geographic and geometric algorithms
2D Geometries Point, LineString, Polygon, MultiPoint, MultiLineString,

MultiPolygon, GeometryCollection.
Predicates: Equals, Disjoint, Intersects, Touches, Crosses, Within, Con-

tains, Overlaps, Covers, CoveredBy, Relate, IsWithinDistance.
Geometric: ConvexHull, Intersection, Union, Dierence, SymDierence.
Linear referencing: LineInterpolatePoint, LineLocatePoint, LineSubstring.
Spatial relationships: Distance, Length, Area, Buer, Centroid, PointOn-

Surface, Simplify, Polygonize, LineMerge.
Coordinate systems: SRID, SetSRID, Transform.
Spatial indexing
GML support
27
5.2 Berzerk
Due out in 2Q09, Berzerk is an experimental VectorSTAR add-on that supports

leading-edge GPU-based massively parallel calculations. Berzerk will support
OpenCL when available.
Berzerk's goal is to achieve a further 500x improvement in performance over
the standard VectorSTAR architecture. Current benchmarks point to a 1 Tops
single precision and 200 Gops double precision 64-bit performance using the
63
latest GPU coprocessor cards . Unlike most other DBMS, VectorSTAR can
directly harness most of this raw performance (as parallel operations performed
on single and double precision numeric data columns) thanks to its memory-
mapped vector column model.
A single GPU card, priced under USD$1'000 and installable on commodity
x86-64 servers via a PCIe interface, could thus provide VectorSTAR with the
power to peform numeric calculations on one billion rows, each 1000 columns
wide, in under one second.
63 Such as the ATI FireStream 9150/9250 and NVIDIA Tesla T10P
28
Nomenclature
.NET CLR programming framework, a Microsoft technology
ANSI American National Standards Institute
API Application Programmer Interface
BI Business Intelligence
C C Programming Language
C++ Programming Language, one big step backward for mankind
CLR Common Language Runtime, a Microsoft technology
CPU Central Processing Unit
CSV Character/Comma Separated Value
DBMS Database Management System
DLL Dynamic Loading Library
ETL Extraction Transformation and Loading
EXE Executable le
FORTRAN A classic programming language
GPU Graphics Processing Unit
HTTP Hyper-Text Transfer Protocol
HTTP HyperText Transfer Protocol
HTTPS Secure HTTP
I/O Input/Ouput
ISO International Standards Organization
IT Information Technology
Java Programming Language, a tumbling half-step forward for mankind
JSON JavaScript Object Notation
LINQ Language Integrated Query, a Microsoft technology
LOB Line Of Business
LOC Lines Of Code
MMF Memory-Mapped File
29
NetCDF Network Common Data Form
OEM Original Equipment Manufacturer
OLAP On-Line Analytical Processing
OLTP On-Line Transaction Processing
OpenCL - Open Computing Language: C-like programming for GPUs
OS Operating System
PCIe PCI Express
QUEL A relational programming language
R Open-source GNU-S (statistical programming), www.r-project.org
RAM Random-Access Memory
RDBMS Relational DBMS
RESTful Implementing the REST interface
RLE Run Length Encoding
SQL Structured Query Language
Tutorial-D A relational programming language
UI User Interface
VLDB Very Large Database
XLS Excel sheet le
xSTAR xSTAR is VectorSTAR's grid-parallel bulk data loader
browser - a program that implements communication via the HTTP protocol

and display of HTML pages.
column cardinality - number of items (rows) in a column
comment - a note, within a program source, that is not to be executed
CPU footprint - the load that a program imposes on a CPU when idle
display-criteria - the elds that make up the SELECT clause in SQL
where-criteria - the conditions that make up the WHERE clause in SQL
cube - a persistent multidimensional array
le - a sequence of bytes stored on disk according to the conventions of a

particular lesystem
30
loop - a series of instructions executed repetitively
operational BI - near real-time BI during on-going business operations
persistent - a value that is stored on disk
record - a set of named values
slice and dice - most common OLAP operations on cubes
transient - a value that is only stored in volatile memory
ops UNIT - Floating Point Operations Per Second
GB UNIT - Giga-Byte = 1024 MB
Gops UNIT - 1 Gigaops = 1000 million ops
TB UNIT - Tera-Byte = 1024 GB
Tops UNIT - 1 Teraops = 1000 Gigaops
31
Index
32-bit, 7 database size, 10
64-bit, 35 decoding speed, 9
dense selection, 19
ANSI, 20 directory, 10
ANSI SQL, 18 disadvantage, 9
APL, 4 disk, 10
arithmetic average, 22 disk subsystem, 23
ATI, 28 disk-based architecture, 12
atomic, 10 display-criteria, 10
DLL, 8
b-tree, 4
double precision, 12
BI, 3
Bioinformatics, 3
encoding space, 9
bit array, 13, 19 end-user, 4
bit index, 13
ETL, 24
buer, 8
Excel, 26
buered le I/O, 8
EXE, 8
bulk data loader, 23
exploratory data analysis, 4
extensible, 18
C, 24, 25
C++, 25
le, 9
cache, 8
lesystem, 10
cardinality, 7, 9, 13, 19
ngerprint, 10
client-server, 25
FireStream, 28
CLR, 25
rst-class citizen, 18
Codd, 3
oating point, 12
code size, 23
FORTRAN, 22
column cardinality, 10
FP/FL, 22
columnar, 3, 8
full backup, 11
command, 18, 20
function composition, 22
comment, 23
function-based, 18
compiled, 18
function-level, 22
concurrent, 10
functional programming, 23
console, 25
constraint, 6 Gops, 28
CPU footprint, 4 GPU, 28
Cryptography, 27 grid, 19
CSV, 24 grid-parallel, 23
cube, 13
Haskell, 22
data, 11 HTTP, 25
data level, 23 HTTPS, 25
data storage, 10 Human compression, 27
data warehousing, 1, 4
32
IEEE, 12 O'Caml, 22
ijx, 25 OLAP, 6
imperative programming, 23 OLTP, 3, 9
in-memory DBMS, 6 OpenCL, 28
interactive, 4, 18 operational BI, 3
intermediate le, 24
internet, 25 page le, 8
interprocess memory sharing, 8 PCIe, 28
intranet, 25 performance, 23
ISO, 20 persistent, 13
IT, 3 photo, 10
IT infrastructure planning, 10 pivot table, 26
Itanium, 5 PocketPC, 7
pointer reference, 24
Java, 25 POWER, 5
JSON, 24 pre-fetch, 10
press, 9
kernel mode, 5 processor-intensive, 10
programming language, 4, 9, 23
legacy, 24
proprietary, 18
library extension, 18
Python, 27
LISP, 22
loading, 23, 24 QUEL, 3
LOC, 18, 21 query, 10
loop, 13, 18, 23, 24
loop-based, 9 R, 26
loops, 22 RAM, 6, 25
Read optimization, 9
Mac OS/X, 7
relational, 3, 8
macro, 24
relational model, 13
mail attachment, 19
relational operator, 3
mainframe, 4
RESTful, 11
memory buer, 10
result index set, 10, 13
memory-consuming, 10
result set, 10
metadata, 11
RLE, 19
mobile, 7
Roger Hui, 22
monolithic, 18
row-oriented, 9, 10
multidimensional, 13
scalar, 22
native, 18
Scheme, 22
NDX, 10, 13, 19
shared pipe, 24
NetCDF, 24
slice and dice, 13
normality, 13
socket, 25
null value, 13
SPARC, 5
null-terminated, 12
sparse selection, 19
NVIDIA, 28
SQL, 3, 4
nvl function, 21
33
SQL2, 3 XML, 3
SQL2006, 3 xSTAR, 23
SQL3, 3
standard, 20
state-of-the-art, 4
statement, 22
statistical, 23
Stored Procedures, 22
subclauses, 22
subquery, 21
synergistic, 4
syntatic ordering, 22
system memory, 6
Taipan, 20
temporary table, 21
temporary tables, 21
Tops, 28
transient, 13
triggers, 27
tutorial, 12
Tutorial D, 3
UI, 25
user interface, 25
user mode, 6
user-dened, 13
VAS, 8
vector-based, 4
vectorized, 18
Vectornova, 3
virtual memory, 4, 8
VLDB, 3
warehousing, 6
web, 25
what-if, 18
where-criteria, 10
Windows Mobile, 7
Windows VISTA, 7
Windows XP, 7
worksheet, 26
write optimization, 9
x86-64, 5, 28
XLS, 26
34
References
[Codd70] E.F. Codd: A Relational Model of Data for Large Shared Data Banks.
1970
[Grae07a] G. Graefe: Ecient columnar storage in B-trees . SIGMOD Record

36(1), Mar 2007
[SMA07] M. Stonebraker, S. Madden, D.J. Abadi, S. Harizopoulos, N. Hachem,

P. Helland: The End of an Architectural Era (It's Time for a Com-
plete Rewrite). Proc. VLDB, 2007
[Abra67] P.S. Abrams: What's wrong with APL? Scientic Timesharing Cor-
poration, 1967
[Iver72] K.E. Iverson: A Programming Language. New York, John Wiley &
Sons, Inc., 1972
[Bone02] P.A. Bonez: Monet: A Next-Generation DBMS Kernel For Query-

Intensive Applications. Ph.D. Thesis, Universiteit van Amsterdam,
Amsterdam, The Netherlands, May 2002
[Date76] C.J. Date: An Architecture for High-Level Language Database Exten-

sions. Proc. SIGMOD, 1976
[Date84] C.J. Date: A critique of the SQL database language. SIGMOD Record
14(3):8-54, Nov 1984
[Hell07] P. Helland: Life beyond Distributed Transactions: an Apostate's

Opinion. Proc. CIDR, 2007
[HM93] M. Herlihy, J.E. Moss: Transactional memory: architectural support
for lock-free data structures. Proc. ISCA, 1993
[KR81] H.T. Kung, J.T. Robinson: On optimistic methods for concurrency
control. ACM Trans. Database Sys. 6(2):213-226, Jun 1981
[LM06] E. Lau, S. Madden: An Integrated Approach to Recovery and High
Availability in an Updatable, Distributed Data Warehouse. Proc.
VLDB, 2006
[RR99] Cache Conscious Indexing for Decision-Support

J. Rao, K.A. Ross:
in Main Memory. Proc. VLDB, 1999
[SBC07] M. Stonebraker, C. Bear, U. Cetintemel, M. Chemiack, T. Ge, N.
One Size
Hachem, S. Harizopoulos, J. Lifter, J. Rogers, S. Zdonik:
Fits All? - Part 2: Benchmarking Results. Proc. CIDR, 2007
[SAB05] M. Stonebraker, D. Abadi, A. Batkin, X. Chen, M. Chemiack, M. Fer-
reira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N.
Tran, S. Zdonik: C-Store: A Column-oriented DBMS. Proc. VLDB
2005
35
[SC05] M. Stonebraker, U. Cetintemel: One Size Fits All: An Idea whose
Time has Come and Gone. Proc. ICDE, 2005
[TPCH08] The Transaction Processing Council: TPC-H Benchmark. Revision,
2008
[ADHW99] A. Ailamaki, D.J. DeWitt, M.D. Hill, D.A. Wood: DBMSs on a

Modern Processor: Where Does Time Go? Proc. VLDB, 1999
[BM72] R. Bayer, E.M. McCreight: Organization and Maintenance of Large

Ordered Indexes. Acta Inf. 1:173-189, 1972
[BU77] R. Bayer, K. Unterauer: Prex B-Trees. ACM TODS 2(1):11-26, 1977
[Grae07b] G. Graefe: Master-detail clustering using merged indexes. Informatik
Forschung und Entwicklung, 2007
[GRS98] J. Goldstein, R. Ramakrishnan, U. Shaft: Compressing Relations and

Indexes. IEEE ICDE, 1998
[HP03] R.A. Hankins, J.M. Patel: Eect of Node Size on the Performance
of cache-conscious B+-trees. SIGMETRICS, 2003
[Lome01] D.B. Lomet: The Evolution of Eective B-tree: Page Organization
and Techniques: A Personal Account. SIGMOD Record 30(3):64-69,
2001
[Neil92] P.E. O'Neil:The SB-Tree: An Index-Sequential Structure for High-

Performance Sequential Access. Acta Inf. 29(3):241-265, 1992
[Ponn01] N. Ponnekanti: Pseudo Column Level Locking. ICDE, 2001
[RR00] J. Rao, K.A. Ross: Making B+-Trees Cache Conscious in Main
Memory. SIGMOD, 2000
[SL76] D.G. Severance, G.M. Lohman: Dierential Files: Their Application
to the Maintenance of Large Databases. ACM TODS 1(3):256-267,
1976
[Jant95] J. Jantzen: Array approach to fuzzy logic. Fuzzy Sets and Systems
70:359-370, 1995
[Mamd77] E.H. Mamdani: Application of fuzzy logic to approximate reason-

ing using linguistic synthesis. IEEE Transactions on Computers C-
26(12):1182-1191, 1977
[Wens80] F. Wenstop: Quantitative analysis with linguistic values. Fuzzy Sets

and Systems 4(2):99-115, 1980
[Zade84] L.A. Zadeh: Making computers think like people. IEEE Spectrum pp.
26-32, 1984
36
[Baye97] R. Bayer: The universal B-Tree for multidimensional Indexing: Gen-
eral Concepts. WWCA, Mar 1997
[CD97] An Overview of Data Warehousing and
S. Chaudhuri, U. Dayal:
OLAP Technology. SIGMOD Record 26(1):65-74, 1997
[CI98] C.Y. Chan, Y.E. Ioannidis: Bitmap Index Design and Evaluation.
SIGMOD, 1998
[CS94] S. Chaudhuri, K. Shim: Including Group-By in Query Optimization.

VLDB, 1994
[DRSN98] P. Deshpande, K. Ramasamy, A. Shukla, J.F. Naughton: Caching

Multidimensional Queies Using Chunks. SIGMOD, 1998
[GBLP96] J. Gray, A. Bosworth, A. Layman, H. Pirahesh: Data Cube: A Rela-
tional Aggregation Operator Generalizing Group-By, Cross-Tab, and
Sub-Total. ICDE, 1996
[GG97] V. Gaede, O. Günther: Multidimensional Access Methods. ACM
Computing Surveys 30(2), 1997
[GHQ95] Aggregate Query Processing in

A. Gupta, V. Harinarayan, D. Quass:
Data Warehousing Environments. VLDB, 1995
[GJ01] C.A. Galindo-Legaria, M. Joshi: Orthogonal Optimization of Sub-
queries and Aggregation. SIGMOD, 2001
[Kimb96] R. Kimball: The Data Warehouse Toolkit. John Wiley & Sons, New
York, 1996
[KR98] An Alternative Storage Organization

Y. Kotidis, N. Roussopoulos:
for ROLAP Aggregate Views Based on Cubetrees. SIGMOD, 1998
[KS01] N. Karayannidis, T. Sellis: SISYPHUS: A Chunk-Based Storage
Manager for OLAP Cubes. DMDW, Jun 2001
[MRB99] V. Markl, F. Ramsak, R. Bayern: Improving OLAP Performance by
Multidimensional Hierarchical Clustering. IDEAS, 1999
[NG95] P.E. O'Neil, G. Graefe: Multi-Table Joins Through Bitmapped Join
Indices. SIGMOD Record 24(3):8-11, 1995
[Same90] H. Samet: The Design and Analysis of Spatial Data Structures. Ad-
dison Wesley, 1990
[Sara97] S. Sarawagi: Indexing OLAP Data. Data Engineering Bulletin

20(1):36-43, 1997
[TS1] Transaction Software: TransBas e(R) Hypercube. www.transaction.de
37
[TT01] D. Theodoratos, A. Tsois: Heuristic Optimization of OLAP Queries
in Multidimensionally Hierarchically Clustered Databases. DOLAP,
2001
[WB98] M.C. Wu, A.P. Buchmann: Encoded Bitmap Indexing for Data Ware-
houses. ICDE, 1998
[WOS01] K. Wu, E. J. Otoo, A. Shoshani: A Performance Comparison of
bitmap indexes. CIKM, 2001
[YL94] W.P. Yan, P.A. Larson: Performing Group-By before Join. ICDE,
1994
[YL95] W.P. Yan, P.A. Larson: Eager Aggregation and Lazy Aggregation.
VLDB, 1995
[ZSL98] C. Zou, B. Salzberg, R. Ladin: Back to the Future: Dynamic Hier-

archical Clustering. ICDE, 1998
[FGR92] J. Feigenbaum, E. Grosse, J.A. Reeds : Cryptographic Protection
of Membership Lists. Newsletter of the International Association of
Cryptologic Research v.9:16-20, 1992
[FLW91] J. Feigenbaum, M.Y. Liverman, R.N. Wright. Cryptographic Protec-

tion of Databases and Software. Distributed Computing and Cryptog-
raphy, J. Feigenbaum and M. Merritt, eds., American Mathematical
Society, pp-161-172, 1991
[Fren95] C.D. French: One Size Fits All Database Architectures Do Not Work
for DSS. Proceedings of SIGMOD, 1995
[Syba1] Sybase: SybaseIQ. www.sybase.com
[KX1] KX Systems: KDB plus. www.kx.com
[WKHM00] T. Westmann, D. Kossmann, Sven Helmer, Guido Moerkotte: The
Implementation and Performance of Compressed Databases. SIG-
MOD Record 29(3), 2000
[West05] P. Westerman: Data Warehousing: Using the Wal-Mart Model.

Morgan-Kaufmann Publishers, 2000
[AT1] Addamark Technologies: Omnisight. www.addamark.com

[Gray97] DataCube: A Relational Aggregation Operator Generaliz-
Gray et al.:
ing Group-By, Cross-Tab, and Sub-Totals. Data Mining and Knowl-
edge Discovery, 1(1), 1997
[Stone86] M. Stonebraker: The Case for Shared Nothing. Database Engineer-

ing, 9(1), 1986
38
[Tand89] Tandem Database Group: NonStop SQL, A Distributed High Per-
formance, High Availability Implementation of SQL. Proceedings of
HPTPS, 1989
[Vert1] Vertica Corporation: Vertica. www.vertica.com
39

VectorStar - Technology Introduction

Încărcat de

Informații document

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

VectorStar - Technology Introduction

Încărcat de

Drepturi de autor:

Formate disponibile

VectorSTAR Technology Overview∗

July 20, 2008

∗ Copyright © 2008 by Vectornova SA de CV. All rights reserved.

5 Future Release Schedule 27

1 as an embedded high-performance database engine in a family of government information

 it has been designed specically to take full advantage of today's industry-

 it takes advantage of the OS virtual memory and MMF facilities to achieve

 it provides a vector-based, functional SQL dialect that is fast, easy to

 it oers a unique, interactive query interface that enables end-users to

 it uses a vector-based (instead of set-based) column model, oering native

 it has a very small CPU footprint, so it coexists eectively with related

This unique set of state-of-the-art architectural features enables VectorSTAR

6 VectorSTAR is built as a set of extension libraries on top of XJ, an array-based functional

1.1 64-bit Architecture

Despite the address-space limitations of the 32-bit architecture, a VectorSTAR

1.2 Memory-mapped File I/O

VectorSTAR is a memory-mapped database, not an in-memory database. On

1.3 Data Storage Architecture

VectorSTAR is a columnar RDBMS. This means that table data is stored in

VSTAR/ -- Db1/ -- Tbl1/ -- Col1/ --data

This illustrates how each VectorSTAR database occupies a single lesystem

33 VectorSTAR supports a RESTful API.

Information Schema As a consequence of this schema-oriented storage ar-

 any change to the underlying storage is immediately reected in an ulterior

 the concurrency and consistency of DDL commands is provided by the OS

 dierent underlying le systems provide dierent performance models for

 database metadata can be easily obtained (and reported) using standard

[] print prod dim Sales

[] print sum all Sales

[] print 4&rollup Sales

Total revenues by the rst 3 salesreps:

[] print 3 first 2&rolldown 2&rollup Sales

1.4 Insertions, Deletions, and Updates

VectorSTAR is strongly oriented towards OLAP instead of OLTP usage. How-

40 The code uses the following XJ standard denitions:

'level rollup cube' is 'sum"1 repeat level cube'

Mapping mode The loader's job (xSTAR) is to transform a text-format source

Access mode There are 4 distinct kinds of vSTAR usage:

 Inserts allowed, Private or Shared

 Updates allowed, Private or Shared

 Deletes allowed, Private or Shared

Updates Updates are handled transparently as inserts with the relevant

Deletes Deletes require controlled, concurrent shared access to the ta-

The purpose of the Failover conguration is to provide continued service in

43 this feature provides functionaliry roughly equivalent to horizontal partitioning in tradi-

 function based: every command in VectorSQL is a function (traditional

 interactive: VectorSQL code is highly synthetical, i.e., it expresses larger

 vectorized: every VectorSQL function can operate on whole (single or

VectorSQL is simple to learn and command of the language can comfortably be

2.1 Exploratory Analysis

Exploratory Analysis refers to that style of continued interaction between a user

 interactive querying language, so that queries can be comfortably and

 imported later into the same session

 imported later into a future session by the same user

 sent to a colleague (perhaps as a mail attachment?) who imports it into

 sent to a dierent node to be applied to some other (usually conceptually

Selecting Rows and the Selection Set

47 or any of its aliases, including SELECT_ROW, WHEN, etc.

2.2 ANSI SQL

INSERT INTO report_output

 VectorSQL is frequently shorter (sometimes signicantly) although this is

 Every SQL statement and function has a direct equivalent in VectorSQL.

 The nvl function is usually not needed.

51 Of course, after a very brief mapping strategy is provided to them.

it has been designed specically to take full advantage of today's industry-

it takes advantage of the OS virtual memory and MMF facilities to achieve

it provides a vector-based, functional SQL dialect that is fast, easy to

it oers a unique, interactive query interface that enables end-users to

it uses a vector-based (instead of set-based) column model, oering native

it has a very small CPU footprint, so it coexists eectively with related

This illustrates how each VectorSTAR database occupies a single lesystem

any change to the underlying storage is immediately reected in an ulterior

the concurrency and consistency of DDL commands is provided by the OS

dierent underlying le systems provide dierent performance models for

database metadata can be easily obtained (and reported) using standard

Total revenues by the rst 3 salesreps:

40 The code uses the following XJ standard denitions:

Inserts allowed, Private or Shared

Updates allowed, Private or Shared

Deletes allowed, Private or Shared

The purpose of the Failover conguration is to provide continued service in

function based: every command in VectorSQL is a function (traditional

interactive: VectorSQL code is highly synthetical, i.e., it expresses larger

vectorized: every VectorSQL function can operate on whole (single or

interactive querying language, so that queries can be comfortably and

imported later into the same session

imported later into a future session by the same user

sent to a colleague (perhaps as a mail attachment?) who imports it into

sent to a dierent node to be applied to some other (usually conceptually

VectorSQL is frequently shorter (sometimes signicantly) although this is

Every SQL statement and function has a direct equivalent in VectorSQL.

The nvl function is usually not needed.

57 functional programming works at the data-level : it species an "assembly line" (made of

xSTAR is not a full-edged, general-purpose ETL application. In a typical Vec-

program the ETL process to produce multiple single-column CSV les,

Based on a description of the row structure of a CSV data le to be loaded,

client-server: from the same or a dierent network node, via sockets

web: from anywhere on the internet/intranet, using the HTTP/HTTPS

socket-based approach provided by JSoftware with the J programming

memory-mapped le-based approach provided by VectorSTAR

1. directly importing an XLS le into a VectorSTAR database

transaction-based inserts, updates and deletes

transparent data column compression

text Human compression

Support OGC standards for geographic and geometric algorithms

Python programming language

advanced Cryptography Protocol support

2D Geometries Point, LineString, Polygon, MultiPoint, MultiLineString,

Predicates: Equals, Disjoint, Intersects, Touches, Crosses, Within, Con-

Geometric: ConvexHull, Intersection, Union, Dierence, SymDierence.

Linear referencing: LineInterpolatePoint, LineLocatePoint, LineSubstring.

Spatial relationships: Distance, Length, Area, Buer, Centroid, PointOn-

Coordinate systems: SRID, SetSRID, Transform.

EXE Executable le