
openSAP

Analyzing Connected Data with SAP HANA Graph


Week 01 Unit 01

00:00:10 Welcome to this openSAP course on analyzing connected data with SAP HANA Graph.
00:00:16 My name is Markus Fath. I work as a manager
00:00:19 in the SAP HANA Development Organization. Here's what we're going to cover
in this one-week nutshell course. It consists of seven units,
00:00:31 so after this introduction, where we will be looking at some use cases, we will cover the basics of
how you organize your data
00:00:40 in nodes and edges and workspaces. We will then look at ways to query your graph,
00:00:46 essentially how to do pattern matching using openCypher. After that, we will dive into some of
those
00:00:53 built-in algorithms that SAP HANA Graph provides, before we look at a programming language
00:01:02 that you can use to develop your own custom graph algorithm. That language is called
GraphScript.
00:01:11 In unit six, we will be looking at a related topic of hierarchies
00:01:17 and ways to query hierarchies from a SQL perspective. The course is concluded by unit seven,
00:01:28 where we will be looking at some adjacent topics like full-text search and spatial analysis.
00:01:37 Let's start with a quick motivation here and I took a chart from db-engines.com.
00:01:45 I chose the trend in popularity for certain types of databases.
00:01:52 And what you see in that chart is that the top line is the trend in popularity for graph databases.
00:02:00 And in that chart there are other database categories. For example, the flat one being
00:02:07 relational database management systems. So obviously, what we do see here is
00:02:14 that people start to get more and more interested in how graph databases work
00:02:21 and what value-add they provide for certain use cases. The one thing that makes graph
databases so popular from my perspective,
00:02:34 is that if you do modeling, if you talk about data,
00:02:40 if you talk about relations, if you talk about ways to analyze data,
00:02:45 people tend to think very much in terms of a graph, in terms of nodes and relationships between
nodes.
00:02:53 So you have a customer and its relationship to a product. Things like that.
00:02:57 So if you go to a whiteboard and talk to your friend about your data and analytics challenges,
00:03:03 you're probably going to draw a graph, essentially nodes and edges.
00:03:07 So graph databases and the way they handle and query data is pretty much whiteboard friendly.
00:03:14 On the other hand, we do see that a lot of value can be obtained
00:03:20 by analyzing relationships, making edges, making relationships a first class citizen
00:03:26 when you do data analysis. And that is also something that fits quite nicely
00:03:31 into the approach of graph database systems. So if we look at some use cases and some data
00:03:42 then we see that essentially, connected data, networks, are everywhere out there.
00:03:49 So we all know social networks like the one that is stored within the Facebook
system.
00:03:58 So there are people, persons, being connected to each other by friendships.
00:04:04 There are other types of very interesting networks or connected data sets.
00:04:09 So think, for example, about a utility grid where you have different assets
00:04:15 which are connected by lines. So a grid itself is some form of graph,
00:04:21 some form of network, so to say, which makes absolute sense to analyze.
00:04:28 Then very important, we do see production chains or supply chains where products are sourced
from suppliers
00:04:37 and raw materials are in a production process, being turned into products, essentially,
00:04:45 which you sell to your customers. So you also have that connectivity,
00:04:50 that type of network, which connects your raw materials to your customers, essentially.
00:04:58 And you might want to be able to analyze that production chain for certain risks or potential
failures,
00:05:05 impact of failures and stuff like that. Then the data that we will be working with
00:05:13 throughout the demos in that course is essentially a citation network.
00:05:19 So we do have scientific research papers which do cite each other.
00:05:25 So we have that citation network and can analyze that network for certain queries.
00:05:31 In the end, what we see is there's a lot of data out there that is organized in a graph-like way, so to
say.
00:05:38 And there are a couple of really value-adding or highly important use cases, which I already
mentioned.
00:05:45 So understanding your customer, having a 360-degree view of your customer,
00:05:49 being able to provide the right product recommendations, that is a problem that can be solved
with graph technology.
00:05:57 Again, if we look at risk analysis in a production network or in a supply chain network,
00:06:04 where you would like to understand which products or which materials are at risk
00:06:09 because you have only a single supplier for them. And you want to evaluate that risk and propagate
that risk
00:06:17 in order to understand which of your end products are at risk in the end. Fraud detection is also a
very important use case
00:06:24 where we see certain requirements to analyze financial transactions,
00:06:31 for example detecting cycles when looking at financial transactions,
00:06:35 which might be an indicator of fraudulent activity. The challenge of course, first of all is,
00:06:44 how do you store, how do you handle potentially large amounts of data which are connected
00:06:50 and essentially, how can you analyze the relationships in your graph in order to extract
meaningful information
00:06:59 that you can infuse into your business processes or use to improve your business processes.
00:07:10 If we look at certain types of those graph analysis patterns where, for example, in a social
network
00:07:18 you would like to understand who are the friends of your friends.
00:07:21 Or in a network of financial institutions or organizations doing financial transactions,
00:07:30 you would like to understand if there were cycles in the transaction system.
00:07:37 We have seen a couple of traditional approaches on how you would like to run those queries,
00:07:43 analyze cycles, for example. And what we do see is, of course, people tend to use
00:07:50 the technology that they have at hand, a relational database management system,
00:07:54 and start to write some graphy application logic that essentially invokes, in a recursive way,
00:08:03 using complex joins, some SQL operations in the database. That can work to some extent
00:08:11 but if you are really going into more sophisticated graph analysis problems, that approach is really
not that beneficial.
00:08:23 You run into all kinds of problems because, essentially, SQL is not made
00:08:27 for analyzing connected data. And you, of course, still work with tables
00:08:34 as your first-class citizen in your database. So you lack a kind of an abstraction
00:08:41 in terms of nodes and edges or nodes and relationships between your data nodes.

00:08:47 However, you of course have a single copy of the data, so you always have a consistent view of your data.
00:08:54 On the other hand, we have seen that customers are starting to bring in graph technology,
00:09:00 for example, a Neo4j database, to handle that graph processing, those graph queries,
00:09:06 in a much better and much more performant way. However, what you usually see is
00:09:11 that you then have a data replication task. So you usually have a primary data store,
00:09:18 which is your relational database management system and you're copying over the data to that
graph database
00:09:23 in order to do some graph processing. And with that approach,
00:09:28 of course, you have the burden of that replication, so you, per se, have synchronization issues
00:09:34 between those two database systems. And furthermore, of course a graph database can handle
00:09:42 the graphy workload, the connection analysis in a much better way
00:09:47 but usually falls behind when it comes to standard analytics or when it comes to providing full-text
search capabilities
00:09:56 so you usually end up with a rather complex architecture of components which you need to
synchronize
00:10:03 in order to provide all the application functionality that you would require.
00:10:10 So SAP HANA provides a combined approach, in the sense that we have a built-in graph engine.
00:10:19 So what we do provide here is native graph abstraction within a relational database management
system.
00:10:27 And with that we have the possibility to provide multi-model processing.
00:10:34 So you can easily combine some graph queries with a full-text search query,
00:10:39 with some standard analytics or data mining algorithms. So that's kind of the sweet spot here,
00:10:46 having the graph engine deeply embedded in a functionally rich data management
platform like SAP HANA.
00:10:58 Last but not least of course, the embedded graph engine inherits
00:11:04 all the enterprise characteristics that you eventually require,
00:11:10 in terms of security, how to model authorization control, how to do backup and restore,
00:11:18 how to design the system for high availability or disaster recovery and things like that.
00:11:25 Because of the fact that the HANA graph engine is deeply embedded into HANA as a platform,
00:11:33 all those operation concepts of course also apply to your graph approach.
00:11:40 If we look at a way how to model data, how you think about your connected data,
00:11:46 and if you look at academic research literature, I mean, there are a couple of different modeling
approaches
00:11:54 to connected data. You might be familiar with RDF,
00:11:59 the Resource Description Framework, so an approach to kind of do annotations
00:12:05 in a semantic web manner. What we support in SAP HANA is a so-called property graph
00:12:12 because we found it's the most general approach to modeling and the most intuitive way to think
about your data.
00:12:20 So what is a property graph? Essentially, we are thinking
00:12:24 in terms of nodes and edges. Or, as a node is sometimes also referred to as a vertex,
00:12:31 or in plural, vertices, we are talking about a vertex-and-edge approach here.
00:12:36 So in this example we have a document node which is connected to an author node.
00:12:43 And of course both nodes, as well as the edge in between connecting these two nodes,
00:12:50 can have an arbitrary number of properties. So Fred, for example, who is of type author
00:12:56 does have a location attribute with it: Fred is located in the USA.
00:13:01 Whereas the document vertex, or the document node, does have a title, for example,
00:13:06 "The Hub and Spoke Paradigm", and things like that.

00:13:10 So one reason why we chose the property graph model is A, because it's, again, whiteboard
friendly,
00:13:19 and B, because usually if you have a relational model in place, it's usually quite easy to kind of
translate
00:13:29 a full-fledged relational model into such a property graph model.
00:13:33 And so we wanted to make it easy for, let's say, traditional relational applications to add a graph
as an additional dimension for analysis. When it comes to analysis, we usually
00:13:49 do see two different types of workload, I would say. On the one hand, we have what we call
pattern matching
00:13:57 where you essentially have a sequence, a pattern, which you are looking for in your data graph.
00:14:05 So think about scientific research papers and authors, you might want to have a pattern where
you say
00:14:12 I have that author Fred and I'm interested in his papers and especially in the papers he cited.
00:14:21 So that's the pattern that we see on the left hand side. There is a declarative query language that
we do support
00:14:29 for these types of queries, it's called openCypher,
00:14:32 which allows you to specify these type of patterns and run that query against your data graph.
00:14:40 On the other hand, we do have the workload of graph analysis.
00:14:46 We are usually looking at larger portions of the graph or the complete graph
00:14:50 and you are evaluating its topology in terms of, you want to understand where are communities
00:14:56 or, for example, you have two sets of nodes, you would like to understand how they are
connected
00:15:02 and essentially, what is the shortest path between those two nodes.
00:15:07 So we do see these types of workloads, where for pattern matching we usually have
00:15:12 a quite declarative approach. These are short, highly selective queries.
00:15:17 On the other hand, we have graph analysis where we do see the need of
00:15:21 a kind of more imperative approach, procedural approach,
00:15:26 providing a programming language so to say in order to encode that graph algorithm logic.
00:15:36 So SAP HANA, as I mentioned, provides an in-database graph engine.
00:15:41 From a functional perspective, we provide a property graph model
00:15:45 which uses standard relational tables as a data store. That comes in quite handy in terms of
providing then,
00:15:56 in the end, real-time capabilities, real-time inserts in your graph data.
00:16:01 We do have a set of built-in functions or algorithms, like, for example, to evaluate the shortest
path
00:16:07 or to understand strongly connected components. As already mentioned, we do support a subset
of openCypher
00:16:15 for pattern matching queries, and probably most important, we have that graph-specific language
00:16:23 for stored procedures in the database which is called GraphScript.
00:16:29 It gives you an abstraction of that graph model and graph-specific operations,
00:16:33 like a neighborhood operation, like a traversal operation,
00:16:37 which makes it quite easy for you to develop your own domain-specific graph algorithms
00:16:43 directly in the database. So the benefits that we do see here
00:16:48 is that we're tightly integrated into our relational model so we provide real-time graph insights into
your data.
00:16:56 We provide the capabilities to use and mix and mingle other functionality in HANA,
00:17:04 again, like full-text search, like our spatial analysis capability
00:17:08 which you can use in combination with graph analysis. And again, from an operation perspective,
00:17:15 you get all the enterprise characteristics that you require in terms of your database operation,

00:17:24 but also security, backup and restore, and stuff like that. So, as graph is just one single
component
00:17:35 within the SAP HANA data management platform, I would like to give you a glimpse of the
breadth
00:17:41 of the functionality. Maybe you've seen the slide already.
00:17:45 So down below you have the database services, so this is where the persistency,
00:17:50 where the tables do live. This is where all the operations do live
00:17:55 in the database core. In the upper left corner you see the application services,
00:18:01 so as you might know, we have an application server here, tightly coupled with SAP HANA.
00:18:09 There is an HTML5 framework which is called SAPUI5,
00:18:14 which you can use to develop modern browser-based application user interfaces.
00:18:21 On the right hand side, you see the integration and quality services.
00:18:25 These are essentially the components that allow you to bring in data from a third-party data
source into HANA
00:18:33 and do certain types of transformations, for example, while you do the replication.
00:18:39 And when it comes to the processing services, so the functionality, essentially,
00:18:43 that is provided on top of the data stored in SAP HANA, we will find things like the spatial engine,
00:18:50 full-text search, text analysis, text mining,
00:18:54 but also data mining and predictive algorithms which are part of the SAP HANA data
management platform.
00:19:01 So that concludes unit one of this nutshell course on analyzing connected data with SAP HANA
Graph.
00:19:11 See you in the next unit, where we will be talking about how we organize the data.
00:19:15 So we will be looking at the basics of nodes and edges and workspaces.
00:19:20 See you in the next unit.

Week 01 Unit 02

00:00:08 Welcome to Unit 2. We will be talking about the basic


00:00:12 data structures, being nodes and edges, and the way to expose your data
00:00:18 to the graph engine, using workspaces. We've already introduced the property graph model.
00:00:26 Just a quick recap, within graph, we are thinking about
00:00:32 vertices and edges. And both the vertices, as well as the edges,
00:00:37 can have certain properties. Throughout the course,
00:00:43 I mostly will refer to a vertex as a node, so you could consider node, or nodes,
00:00:51 being synonymous with vertex or vertices. The data that we will be looking at for demo purposes
00:01:01 consists of authors, who wrote scientific papers,
00:01:09 and those authors are in turn also affiliated with
00:01:17 a certain university. In our case, we will be talking about Fred,
00:01:23 who's the author of a paper called The Hub and Spoke Paradigm,
00:01:27 and the paper, The Hub and Spoke Paradigm, does cite another paper.
00:01:33 We do have a kind of a citation graph. We do have the authors attached to those papers,
00:01:40 and the authors themselves are affiliated to one or multiple organizations or universities.
00:01:48 That's from a logical perspective. From a physical perspective,
00:01:52 your data is organized basically in two structures. The orange one here on the left-hand side
00:02:00 is, in this case, a table that contains data about the authors, so for example,
00:02:06 we see Fred Richardson here, as well as data about the scientific papers,
00:02:12 for example, The Hub and Spoke Paradigm. The blue data structure is the data structure
00:02:20 that describes the relationships, the edges between your nodes.
00:02:27 We do have edges between, for example, the paper and the author, and that edge is of type
00:02:35 isAuthoredBy. Now, what you realize in the data structure
00:02:41 is that both data structures do have an identifying attribute,
00:02:47 so we have a node identifier, as well as an edge identifier.
00:02:52 Furthermore, the edge table, or the edge structure, does contain columns called source and
target.
00:03:00 Those are essentially the start point of the edge and the end point of the edge.
00:03:07 Both data structures, of course, do have additional columns which contain attributes
00:03:13 or properties describing the corresponding instances. Now, once you have your data structures in
place,
00:03:24 what you need to do in SAP HANA Graph, in order to expose your data to the graph engine
00:03:31 and do something meaningful with it, is to create a workspace.
00:03:38 That is, for example, done by a SQL statement, "CREATE GRAPH WORKSPACE", and
essentially,
00:03:45 it points to the nodes, as well as to the edges, and it contains the information
00:03:51 about where the source and the target columns are and which are the identifiers of your data
structures.
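For reference, a minimal sketch of such a statement might look like the following (schema, table, and column names are illustrative, and the exact clause order is described in the SAP HANA Graph reference):

    CREATE GRAPH WORKSPACE "HSGRA"."GRAPH"
      EDGE TABLE "HSGRA"."EDGES"
        SOURCE COLUMN "SOURCE"
        TARGET COLUMN "TARGET"
        KEY COLUMN "ID"
      VERTEX TABLE "HSGRA"."NODES"
        KEY COLUMN "ID";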
00:04:00 With that, let's quickly switch to a demo and look at the data.
00:04:09 The data I loaded is being provided by Yale University
00:04:15 by a project called LILY, and within that project,
00:04:23 they are keeping data on natural language processing academic papers.
00:04:30 For demo purposes, I downloaded the data from this Web site here, and it contains tables
00:04:38 about the papers, about the citations, so which papers cited
00:04:45 which other papers, about the authors, and so I imported the data into SAP HANA,

00:04:55 and I'm using right now SAP HANA studio in order to visualize the data and play with the data.
00:05:05 First of all, the data that we are dealing with
00:05:10 contains the information about the papers, so we do have an ID of that paper.
00:05:16 We do have a title, and we do have an attribute, which is the publication year.
00:05:22 There is a second table that contains data about the authors, in this case, just an author ID
00:05:29 and the author name. Then, quite important, we do have the table
00:05:34 that essentially makes up the citation graph. We do have essentially the edges,
00:05:41 the source and the target information, which links the papers together in terms of
00:05:48 a citation graph, so we have that source and target column both containing the identifying
attribute
00:05:54 of the corresponding papers. Next, we will be working with
00:06:03 the edge type isAuthoredBy. That essentially is derived from that data structure,
00:06:10 where we do have information about which papers have been authored by which authors.
00:06:16 That's the type of the relationship that we see implicitly here in that table.
00:06:23 Then, last but not least, we will be dealing with author affiliations,
00:06:28 which we generated out of that original data source, where we do have for each paper,
00:06:34 the author name and its corresponding affiliation, so the author works for a specific organization
00:06:41 or a university. That's kind of the source data that we have.
00:06:48 From here, we start to populate our nodes and edges data structure.
00:06:53 The first thing I'll do is I'll create a table called Nodes, and I will load certain data,
00:07:01 essentially the papers, into that Nodes table. We loaded roughly 20,000 papers,
00:07:13 and this is how our Nodes table looks. Sure, it contains the identifying attribute.
00:07:18 I added a Type column, which describes the node as being of type paper, and we do have title
00:07:25 and year as the attributes of these nodes. For the edges,
00:07:30 I will create, in a similar way, an edges table, and I will load the citation edges
00:07:38 into my Edges table. Let's also do that quickly.
00:07:42 We have roughly 100,000 citation edges, and the way that looks is pretty simple.
00:07:50 It just contains the source and the target column, and I also added a Type column,
00:07:56 describing, let's say, the semantics of this relationship being a citation.
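As a rough sketch, the two tables used in this demo could be created and populated along these lines (the column types, the source table names PAPERS and CITATIONS, and the type values are illustrative assumptions):

    CREATE COLUMN TABLE "NODES" (
      "ID"    NVARCHAR(100),
      "TYPE"  NVARCHAR(100),
      "TITLE" NVARCHAR(1000),
      "YEAR"  INTEGER
    );
    CREATE COLUMN TABLE "EDGES" (
      "ID"     BIGINT,
      "SOURCE" NVARCHAR(100),
      "TARGET" NVARCHAR(100),
      "TYPE"   NVARCHAR(100)
    );
    -- papers become nodes of type 'paper'
    INSERT INTO "NODES" ("ID", "TYPE", "TITLE", "YEAR")
      SELECT "PAPER_ID", 'paper', "TITLE", "YEAR" FROM "PAPERS";
    -- citation records become edges of type 'cites', with a generated edge identifier
    INSERT INTO "EDGES" ("ID", "SOURCE", "TARGET", "TYPE")
      SELECT ROW_NUMBER() OVER (ORDER BY "SOURCE"), "SOURCE", "TARGET", 'cites' FROM "CITATIONS";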
00:08:04 With that basic graph, we can now go ahead and expose the data to the graph engine.
00:08:12 Before I do that, I'll check the consistency of the data. First of all, I will look at
00:08:20 the uniqueness in my Nodes table. I will look for dangling edges,
00:08:26 so essentially, records in my Edges table, which do not have a corresponding identifier
00:08:34 in the Nodes table. Let me quickly run those two checks.
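Those two checks can be expressed in plain SQL, for example along these lines (a sketch, using the table and column names from this demo):

    -- uniqueness check: node IDs that occur more than once
    SELECT "ID", COUNT(*) FROM "NODES" GROUP BY "ID" HAVING COUNT(*) > 1;

    -- dangling edges: source or target without a matching node
    SELECT e."ID"
      FROM "EDGES" e
     WHERE NOT EXISTS (SELECT 1 FROM "NODES" n WHERE n."ID" = e."SOURCE")
        OR NOT EXISTS (SELECT 1 FROM "NODES" n WHERE n."ID" = e."TARGET");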
00:08:39 First of all, I do see that each identifier in the Nodes table is only occurring once,
00:08:46 so I do have a kind of uniqueness on that ID column. I couldn't identify
00:08:55 any dangling edges in my data. In order to, kind of,
00:09:03 keep that consistency, I will create primary and foreign key constraints
00:09:08 on the Nodes and the Edges tables. A couple of alter statements,
00:09:13 which just kind of help me keep the data in a clean way.
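A sketch of what those alter statements might look like (the constraint names are illustrative):

    ALTER TABLE "NODES" ADD CONSTRAINT "PK_NODES" PRIMARY KEY ("ID");
    ALTER TABLE "EDGES" ADD CONSTRAINT "PK_EDGES" PRIMARY KEY ("ID");
    ALTER TABLE "EDGES" ADD CONSTRAINT "FK_EDGES_SOURCE"
      FOREIGN KEY ("SOURCE") REFERENCES "NODES" ("ID");
    ALTER TABLE "EDGES" ADD CONSTRAINT "FK_EDGES_TARGET"
      FOREIGN KEY ("TARGET") REFERENCES "NODES" ("ID");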
00:09:20 After I've checked the consistency, the next thing I do is I create that graph workspace. It simply points to the Nodes and the Edges tables
tables
00:09:28 which I just created, and identifies the identifier as well as the source and the target columns.
00:09:35 Let's quickly do that. With that, we can do our first exploration.
00:09:43 For that, I will switch to an application, which is called Graph Viewer.
00:09:48 It's a lightweight application that allows me to explore my demo data,

00:09:56 and it can be used for demo purposes or for POC purposes.
00:10:02 I pick the graph workspace which I just created. I can go ahead
00:10:07 and pick a specific attribute. I will simply
00:10:14 start selecting my paper, The Hub and Spoke Paradigm, I'll apply.
00:10:19 Here is my node. I will pick a label, which is the title in this case,
00:10:27 and from here, I can explore along the edges in my citation graph.
00:10:38 So this is what we did. We created the Nodes and Edges tables.
00:10:41 We created a graph workspace, and we used the Graph Viewer to explore the data.
00:10:48 As you've seen, the data is organized in two data structures,
00:10:53 and most of the time, you will use tables to store your nodes and your edges data.
00:11:01 As I described in the consistency checks, the Nodes table must have an identifying attribute,
00:11:09 the Edge table as well - and in addition, the Edge table, of course, requires
00:11:15 some type of source or target column, describing the actual edge.
00:11:22 The data can, and in most of the cases will, be stored in a physical table, but it is absolutely valid
00:11:31 to use views, SQL views, to kind of expose the data,
00:11:36 for example, from an existing relational model into the data structures of nodes and edges.
00:11:43 That kind of allows you to avoid data redundancy,
00:11:49 in case you have your data already, for example, in a relational schema in HANA.
00:11:56 It also allows you to kind of partition your data in a semantic way, so if you have
00:12:03 different kinds of node objects, so for example, papers and authors,
00:12:09 you might very well store them in two distinct tables, and use a view to project the data
00:12:17 into the graph engine. By that, it also allows you to implement
00:12:23 a very simple security model, where you can say, "I have users who are allowed to see the papers,
00:12:30 but they're not allowed to see the authors." So it's a very simple way to enforce
00:12:35 security on your data. Let's take a look at how that might look in practice.
00:12:41 What you see on the lower level here is a kind of a relational model that contains the data that we
are dealing with.
00:12:49 We have a table that contains the papers. We have a table that contains the authors,
00:12:54 and we have a table that contains the organizations or the universities.
00:12:59 In order to project the data from those three tables into that nodes data structure,
00:13:07 you can use a SQL Union View. In a very similar way,
00:13:13 in a relational model, you would have the relations being stored in specific tables.
00:13:20 So for example, you would have a table that stores the paper-to-author relationship.
00:13:25 And in a very similar way, you can use a SQL Union View to kind of grab all of that
00:13:33 relationship data and expose it in an edge data structure, like the one we see on the upper right
side.
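A sketch of such a projection, assuming three source tables called PAPERS, AUTHORS, and ORGANIZATIONS (the table and column names are illustrative):

    CREATE VIEW "V_NODES" AS
      SELECT "PAPER_ID"  AS "ID", 'paper'        AS "TYPE", "TITLE" AS "NAME" FROM "PAPERS"
      UNION ALL
      SELECT "AUTHOR_ID" AS "ID", 'author'       AS "TYPE", "NAME"  AS "NAME" FROM "AUTHORS"
      UNION ALL
      SELECT "ORG_ID"    AS "ID", 'organization' AS "TYPE", "NAME"  AS "NAME" FROM "ORGANIZATIONS";
    -- the edge view is built the same way, with one UNION ALL branch per
    -- relationship table (paper-to-author, author-to-organization, and so on)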
00:13:43 With regards to source data, I already mentioned a relational model, which we see quite often,
00:13:52 which you want to expose to the graph engine, in order to do some graph processing with it.
00:13:59 There are other types of, let's say, data models or source data patterns that we see.
00:14:08 In some cases, the source data already comes with a kind of a "graphy" schema, essentially
00:14:14 being a nodes data structure and an edges data structure. And there, of course, then it's pretty
straightforward
00:14:22 to create a workspace on top of the data. When you load these types of data
00:14:28 from your original source, for example, using a flat file import,
00:14:33 one tip from my side, you could consider a concept, which is called flexible tables
00:14:39 in order to efficiently load the data. If you have a CSV file that contains

00:14:45 a bunch of columns, and you don't want to set the table structure upfront,
00:14:50 you can simply create a minimal flexible table and load the data into the flexible table.
00:14:56 What happens is the flexible table would simply extend, depending on the data it sees in the
source.
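A small sketch of that concept; flexible tables are created with the SCHEMA FLEXIBILITY option (table and column names are illustrative, see the SAP HANA documentation for details):

    CREATE COLUMN TABLE "STAGING_NODES" ("ID" NVARCHAR(100))
      WITH SCHEMA FLEXIBILITY;
    -- inserting into a column that does not exist yet simply adds that column
    INSERT INTO "STAGING_NODES" ("ID", "TITLE")
      VALUES ('P1', 'The Hub and Spoke Paradigm');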
00:15:04 There is another type of data organization, which we see frequently,
00:15:10 which I call node-based. So essentially, you have a single data structure, a single table,
00:15:19 that contains the nodes and their attributes, and the relationship is
00:15:25 kind of laid out in terms of a common attribute. For example,
00:15:30 you would have a common attribute of being an event, so if you have two parties,
00:15:37 which came together in an event, for example two soccer teams playing against each other
00:15:42 in a certain event, then that common attribute might actually be the edge, the relationship.
00:15:48 From that data structure, it's also very easy to expose into a Nodes and Edges table.
00:15:54 Then finally, what we usually see is an edge-based kind of model, where each record essentially
depicts a relationship,
00:16:04 and you have redundant information about nodes and their attributes.
00:16:11 You would use SQL-distinct operations in order to extract the distinct nodes data
00:16:18 into a separate table. With regards to data consistency,
00:16:27 if possible, if you are working with physical tables, our advice is essentially to check the
uniqueness
00:16:36 of your nodes identifier, and if it is unique, you should add a primary key constraint on that table.
00:16:44 And likewise, for the Edges table, if you don't have an identifying attribute,
00:16:50 just create one. There are SQL ways to do that,
00:16:53 for example, to generate an identity, as depicted here. Finally, you should check for dangling
edges.
00:17:02 Either create dummy nodes, in case you have dangling edges,
00:17:07 or remove the dangling edges from your table, or filter them out in a view.
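Two sketches of that advice, one generating an identifying attribute for the edges and one filtering dangling edges out in a view (the column and view names are illustrative):

    -- add a generated identifier if the Edges table has none
    ALTER TABLE "EDGES" ADD ("EDGE_ID" BIGINT GENERATED BY DEFAULT AS IDENTITY);

    -- or keep only edges whose endpoints exist in the Nodes table
    CREATE VIEW "V_EDGES_CLEAN" AS
      SELECT e.* FROM "EDGES" e
       WHERE EXISTS (SELECT 1 FROM "NODES" n WHERE n."ID" = e."SOURCE")
         AND EXISTS (SELECT 1 FROM "NODES" n WHERE n."ID" = e."TARGET");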
00:17:16 Now that we have the data being stored in standard HANA column tables
00:17:23 or other types of tables, how is data manipulation handled?
00:17:27 That is handled via SQL, as you know it, or as you are used to.
00:17:32 Any insertion or deletion or update to your data is handled by a SQL statement.
00:17:38 When you create new edges, which refer to new nodes, you just need to be aware of the
sequence,
00:17:44 so create the nodes data first, and then you can create a new edge
00:17:49 because you might have that foreign key constraint on your data. Very important is, again,
00:18:00 that graph workspace concept. Please note that you can have
00:18:04 as many graph workspaces as you want in SAP HANA.
00:18:10 And I already mentioned that using views is a very good way to avoid data redundancy
00:18:17 and to semantically partition your data. In this case, I have depicted a model,
00:18:23 where you say, "Okay, I've organized the relationships for a citation graph in a different data
structure
00:18:30 than for an author or co-author graph", and it is absolutely valid to use views
00:18:37 in order to project into separate graph workspaces, and by this, help you organize your data
00:18:45 and implement a simple security model, and so on. With that, let me quickly switch to a second demo,
00:18:54 where we essentially use that view concept in order to generate a meta graph.
00:18:59 What I mean by that is, I aggregate my data in the Nodes table by the type,
00:19:06 and I'm counting the number of nodes for its corresponding type, and I'm likewise
00:19:11 doing the same, aggregating the Edges table, by using a SQL view.
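A sketch of such aggregating views: the node type becomes the meta node, and each combination of source type, edge type, and target type becomes a meta edge (the view names are illustrative):

    CREATE VIEW "V_META_NODES" AS
      SELECT "TYPE" AS "ID", COUNT(*) AS "CNT"
        FROM "NODES" GROUP BY "TYPE";

    CREATE VIEW "V_META_EDGES" AS
      SELECT s."TYPE" || '-' || e."TYPE" || '-' || t."TYPE" AS "ID",
             s."TYPE" AS "SOURCE", t."TYPE" AS "TARGET", COUNT(*) AS "CNT"
        FROM "EDGES" e
        JOIN "NODES" s ON s."ID" = e."SOURCE"
        JOIN "NODES" t ON t."ID" = e."TARGET"
       GROUP BY s."TYPE", e."TYPE", t."TYPE";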

00:19:18 With that, I can very well generate that GRAPH_META model workspace,
00:19:25 which then, in my case, looks like this. I just created the graph, so what we have
00:19:33 in our system is essentially one node type, which is paper, and there is one edge
type, which points from the papers to the papers, so a very simple graph model here.
00:19:52 To kind of wrap this up, what we've seen from an architecture perspective
00:19:56 is how we organize our data into nodes and edges, how we can use views to project
00:20:03 into such data structures of nodes and edges, and we have seen how we create a graph workspace
00:20:09 in order to expose the data via the graph workspace. Again, you can have multiple graphs in
your system,
00:20:17 and the graphs can share data, of course. That concludes Unit 2.
00:20:23 In the next unit, we will be talking about pattern matching.

Week 01 Unit 03

00:00:07 Welcome to Unit 3: Pattern Matching. You've already seen that slide describing two different
kinds
00:00:17 of workloads that we do see on graph data. On the one hand, we do have pattern matching
00:00:24 where you essentially look for subgraphs, for patterns, using certain topology constraints,
00:00:32 certain filter conditions, for example on your node attributes and node properties.
00:00:38 And on the other hand, we have that workload type of graph analysis
00:00:42 where we do see topology analysis of the complete graph, doing community detection,
00:00:49 doing our shortest path calculation and so on. So within that unit, we will focus on the left-hand
side,
00:00:56 talking about pattern matching. For pattern matching,
00:01:02 we are using a query language called Cypher, or its open-source alternative, openCypher.
00:01:09 So you might be familiar with Neo4j, which is one of the leading graph databases in the market.
00:01:17 And Neo4j, they pushed forward and invented a query language for pattern matching, which is
called Cypher.
00:01:26 Now, parts of the Cypher language have been open-sourced and are handled in that community,
opencypher.org.
00:01:37 I've attached a screenshot. So SAP HANA supports a subset of that openCypher language
00:01:45 for pattern matching. So what is it?
00:01:49 In its simplest form, a pattern matching query using openCypher
00:01:57 contains three clauses. There is the so-called MATCH clause
00:02:03 which describes topology in terms of nodes and edges.
00:02:10 In this case, for example, I'm matching a subgraph from node A via edge e1 to node P1,
00:02:21 and it's a directed edge. The second clause is the WHERE clause.
00:02:26 It adds additional filter conditions to that subgraph. In my case, for example, the node A should
have the name Fred.
00:02:38 And finally, there is that RETURN clause that projects the results into a table,
00:02:45 into a flat data structure. We simply pick those properties of your results
00:02:51 which should be exposed as a table. A valid example in our scientific research graph,
00:03:02 in our citation graph, might be, I'm looking for a node called Fred.
00:03:07 I'm looking for his papers and I'm looking for the papers that got cited by his papers,
00:03:13 and I'm returning the titles of these citations.
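Written out, that example might look roughly like this (the variable names are illustrative, and the edge directions are an assumption about how authorship and citations are modeled in the data):

    MATCH (a)-[e1]->(p1)-[e2]->(p2)
    WHERE a.NAME = 'Fred'
    RETURN p2.TITLE AS CITED_TITLE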
00:03:22 Now, this slide is kind of for your reference. It describes all the language constructs and concepts that we do support in SAP HANA
00:03:30 with regards to openCypher, so there is a concept of a directed edge
00:03:35 and an undirected edge. There is a variable length path that you can specify,
00:03:40 so where you say, "Okay, between nodes A and B there are up to N hops", things like
that.
00:03:47 As for the WHERE clause, there are certain kinds of predicates that we do support.
00:03:52 Of course, comparison predicates like equals or does not equal, things like that.
00:03:57 There are sub-string predicates being part of openCypher so STARTS WITH, ENDS WITH, and
so on.
00:04:05 And finally, in the RETURN clause, you can use things like an ORDER BY clause or a LIMIT
00:04:12 in order to describe and organize your results. Now, as for
00:04:22 the built-in functions that we do support, there are ways to kind of
00:04:30 specify constraints on the relationship, on the edge types, in the variable length path.
00:04:36 There is also a very powerful full-text search capability which we infused into the openCypher
language.
00:04:46 As you might be aware, SAP HANA provides full-text search capabilities,

00:04:51 so if you have natural language text in your graph - in our example, it's the titles of the scientific
papers
00:04:59 for example - you can use full-text search via that TEXT_CONTAINS predicate in openCypher
00:05:09 in order, for example, to filter your documents by a keyword query.
00:05:14 In this example here, I'm looking for "principal component analysis"
00:05:19 as being part of that abstract column. It's a very powerful way.
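A sketch of such a query (the ABSTRACT property is an assumption about the data model; TEXT_CONTAINS requires a full-text index on the underlying column):

    MATCH (p)
    WHERE TEXT_CONTAINS(p.ABSTRACT, 'principal component analysis')
    RETURN p.TITLE AS TITLE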
00:05:24 Just a little bit of background on the full-text indexing and search capabilities in HANA.
00:05:32 Again, we are organizing our data in relational tables. If you have natural language text, so for
example,
00:05:39 an abstract or a title of a scientific paper, you can create a full-text index on that table column.
00:05:47 That is an index that we will leverage whenever you're using the SQL CONTAINS predicate
00:05:53 in order to do a full-text search on that. Note that the full-text index is created once.
00:05:59 It's kind of bound to the table, HANA manages that for you. And whenever you insert or update
00:06:05 or delete your data in your table, of course, HANA will pick up the changes
00:06:10 and also update the full-text index. Furthermore, the full-text index is also backed up
00:06:17 when you do a backup of your table and it's of course restored, and stuff like that.
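For reference, a minimal sketch of creating and using such an index (the index name is illustrative; the name column is the one used later in this unit):

    CREATE FULLTEXT INDEX "IDX_NODES_NAME" ON "NODES" ("NAME");

    SELECT "ID", "NAME" FROM "NODES"
     WHERE CONTAINS("NAME", 'fred');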
00:06:24 One additional slide on full-text search capabilities in HANA, there is full-text search being part of
the SQL language
00:06:33 where we're using the CONTAINS predicate in order to do a full-text search on that.
00:06:38 That's kind of the plain, vanilla variant of full-text search. SAP HANA provides a so-called
enterprise search stack
00:06:47 where we do have modeling capabilities on top of your physical tables
00:06:54 where you can annotate a search model in terms of certain behaviors when you search on it.
00:07:01 So which columns are relevant, which columns contribute to which extent
00:07:05 to relevance ranking, and so on. And finally, there is an out-of-the-box search UI
00:07:10 that you can leverage in SAP HANA to give your end users a full-text search UI.
00:07:19 Now back to our examples for pattern matching. So what we will be doing in the demo
00:07:24 is we will use pattern matching to find co-authors of Fred. In this query, we essentially have the
filter condition
00:07:34 on one node, where we say, "This should be Fred. We are looking for his papers", and from the
found papers,
00:07:40 we are evaluating the other authors that wrote that paper.
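A sketch of that co-author pattern, assuming the isAuthoredBy edges point from a paper to its authors (the property names follow the demo tables):

    MATCH (a1)<-[e1]-(p)-[e2]->(a2)
    WHERE a1.NAME = 'Fred Richardson'
      AND e1.TYPE = 'isAuthoredBy' AND e2.TYPE = 'isAuthoredBy'
      AND a1.ID <> a2.ID
    RETURN a2.NAME AS COAUTHOR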
So with that, we will be switching to HANA studio again.
00:07:53 And one thing that we will do in order to run our pattern matching query for co-authors
00:07:59 is we will create new data. So I will add author data in my Nodes table
00:08:05 and I will create an additional type of edge, which is "isAuthoredBy"
00:08:11 to connect the papers to the authors. So let's do that and have a look at the data
00:08:19 that we just generated. So my Nodes table now looks like that.
00:08:24 I do have papers in there - that was from the previous unit. Now, I've just added author data to my
Nodes table.
00:08:35 So now we have two types of objects in our Nodes table: papers and authors.
00:08:44 The Edge table looks like that. The citation edges are what we had previously
00:08:49 in our data, and I've just added the "isAuthoredBy" type of edge
00:08:54 which connects a paper to an author. So of course, we do see these two types of edges
00:09:02 and their corresponding numbers. What you've seen is that I've used a name column
00:09:12 for the author names, and I've still got the title column. So I will simply copy over the title of the
papers

00:09:19 into the name column, just to have one consistent label. And I will create a full-text index on that
name column,
00:09:29 which in this case, can be leveraged by that simple CONTAINS predicate
00:09:35 where I'm searching for the term "fred" in all of the columns of my Nodes table.
00:09:41 And it will return some Freds which it found in the author type of nodes.
00:09:49 So from here, we can also refresh our meta model.
00:10:00 Let's call up the GRAPH_META workspace, and what we see here is that in addition to the papers
00:10:06 that we already had in there, we now have an author type of node
00:10:11 and we have a relationship, an edge, between the paper and the author.
00:10:17 Now, finally, let's do some pattern matching. The way I will run pattern matching here
00:10:25 is I will use a database tool. Let me quickly rerun that.
00:10:34 I will use a tool called Database Explorer. And in here, I'm going to my schema where I stored
my data.
00:10:43 I will launch the graph viewer from within Database Explorer. And from here, you have a context
menu
00:10:50 that allows you to add Cypher queries. So I have that text field, I've pasted in a Cypher query,
00:10:57 which essentially looks for a pattern from paper to author. And I'm looking for my Fred Richardson
as being the author.
00:11:06 So the subgraph that I've found here is that single structure.
00:11:11 I will add that label here. So I found Fred Richardson being the author
00:11:16 of that Hub and Spoke Paradigm. A little bit more advanced now, of course,
00:11:22 I'm adding an additional item in that MATCH clause here. So I'm looking now for the co-author.
00:11:28 So I have two topology constraints. I'm looking for Fred Richardson.
00:11:32 Again, finding the Hub and Spoke paper. And from the second part of the MATCH clause,
00:11:39 I find all the co-authors that contributed to the paper. Finally, as a third example,
00:11:46 I will use a similar MATCH structure. But now, I'm looking for papers
00:11:53 which contain the term "language" and which are written by some Frank or Franz.
00:11:59 So that kind of gives me the subgraphs or the patterns
00:12:06 wherever I do have a paper that somehow talks about language, and where one of the authors is
some Franz or Frank,
00:12:14 or something like that. Let's use some color coding here
00:12:20 to understand who are the authors and which are the papers. So here, I have a language paper
00:12:25 which is written by some Frank or Francis here, and here's another example
00:12:31 where I do find "language" in the title of the paper and I do have that Frank being listed.
00:12:40 Go back to slides. So basically, there are three ways
00:12:47 how you can run pattern matching queries using openCypher in SAP HANA.
00:12:52 What we've just seen in that video is we were using the Database Explorer
00:12:57 where you have that specific editor where you can create openCypher queries
00:13:04 and execute them, and see the results in a graphy way. There is another way using calculation
views
00:13:15 and the HANA Web IDE which provides a calculation view modeler.
00:13:21 The calculation views are logical data flows which are used for analytics purposes.
00:13:27 So within such a logical data flow, you usually have projections, aggregation, joins,
00:13:35 and operations like that that kind of expose a logical data view
00:13:40 to SAP HANA where you can run analytics on top of that. Within the calculation view modeler, in
Web IDE,
00:13:49 you have the possibility to create graph nodes. So besides projections and aggregations,

00:13:55 you can insert graph nodes and run an openCypher query as part of that node, and then
post-process
00:14:04 that result of that openCypher query, for example, for certain aggregations,
00:14:09 and expose it to analytic clients. As the third variant, you can use the SQL Console
00:14:17 because essentially, what you can do, you can create so-called calculation scenarios
00:14:23 in order to tell SAP HANA to run openCypher queries. Now, that is a kind of an ugly XML syntax
00:14:32 where you describe the Cypher query and the form of the output.
00:14:38 But it gives you a nice way to handle or to integrate openCypher queries
00:14:44 into the world of SQL. That is the mechanism that we will use
00:14:51 in order to actually create co-author edges. So what we will do, we will use the third mechanism,
00:15:00 "create calculation scenarios". It's documented in our SAP HANA Graph documentation.
00:15:06 And this is kind of an ugly XML syntax where I, essentially, point to my graph.
00:15:15 So it's in schema HSGRA. It's called graph.
00:15:21 Here, I'm adding a variable where I will infuse the openCypher query.
00:15:27 And here's where I define the output structure. So let's create that calculation scenario once.
00:15:35 Now once it's there, we can use that calculation scenario to find co-authors, first of all.
00:15:41 Now previous examples, we were looking for co-authors of Fred.
00:15:45 Now, we're looking for all co-authors in that system. So let me quickly run that query, and this is
the result.
00:15:52 So I will have a source and a target column, essentially pointing from an author to his co-author.
00:15:59 And in that scenario, I'm returning the paper ID which connects these two authors.
00:16:11 That result can be further aggregated. And what I had in mind is,
00:16:16 I will add a numeric attribute that describes how many papers two authors
00:16:24 have been collaborating on. So I'm simply adding an aggregation,
00:16:31 a GROUP BY clause around that openCypher result. And it's running the very same query.
00:16:37 However, what you see here is that I have an aggregated number
00:16:41 which simply counts the number of papers the two authors have been collaborating on.
00:16:47 This is essentially what I will store in my Edges table. So I do an insert on my edges
00:16:55 and I'm running that aggregation query to identify the data...
00:17:04 that is to be stored. So once I've executed that,
00:17:08 I added some data in my Edges table. Let's quickly look at our GRAPH_META model.
00:17:14 Simply refresh that. And now, what you see here is that
00:17:20 from my author type of nodes, I have an edge pointing to itself,
00:17:26 essentially depicting that co-authorship type of relation. So we have used the SQL Console and the
calculation scenarios
00:17:40 in order to run openCypher queries, and we have wrapped SQL around those openCypher queries
00:17:46 doing some aggregations. You can very well do some joins
00:17:50 and we have used that way to create new types of edges and stored them in our tables.
00:17:57 So here, just for your reference, here's how you can use the Database Explorer.
00:18:01 We have seen that pasting openCypher queries or creating openCypher queries and exploring
the results
00:18:07 in a graphy way. This is the calculation view modeler
00:18:11 where you can have openCypher nodes and run your openCypher queries
00:18:16 as part of a larger logical data view for analytical purposes. And finally, we have seen calculation
scenarios
00:18:25 that are described in XML format. It's not all too handy but it does what it does,
00:18:31 and it's a very nice way to kind of wrap SQL queries around these calculation scenarios.

00:18:38 So to wrap this up, what we've seen in the previous unit is how you organize your data
00:18:43 and expose it via graph workspace. And what we've seen in this unit is
00:18:49 how we use SQL or calculation views or the Database Explorer
00:18:56 to run openCypher queries, which directly leverage the data
00:19:02 which is exposed as a graph workspace. That concludes Unit 3.
00:19:08 In the next unit, we will be talking about the built-in algorithms.

Week 01 Unit 04

00:00:08 Welcome to Unit 4, Built-In Algorithms. In the previous unit, we looked at pattern matching
00:00:15 and how to use the openCypher query language for pattern matching. We will now be looking at the
graph analysis side of things.
00:00:26 We're looking at the built-in algorithms that you can leverage in SAP HANA Graph.
00:00:34 At this point, there is not a lot that is built-in out of the box. There are ways to extend the
capabilities
00:00:43 of SAP HANA that we will be talking about in the next unit. As of now, you basically have a
neighborhood search algorithm -
00:00:50 you have a shortest path in two variants, one-to-one and one-to-all. And finally, you have the
strongly connected components
00:00:58 as being built in to SAP HANA Graph natively. Starting with the neighborhood search,
00:01:07 it does what it does. You provide a start node or a set of start nodes,
00:01:14 and you provide a minimum depth and then a maximum depth.
00:01:19 And the neighborhood search algorithm will essentially traverse the graph
00:01:23 from a start node or start nodes into the neighborhood, and then
00:01:28 depending on your depth parameters, up to n hops, for example.
00:01:36 In that graphic here on the right-hand side, you see that if you start at node 0, you will reach the nodes
labeled 1
00:01:44 within one hop distance, or by traversing one edge. Next, there is the shortest path algorithm
being built in,
00:01:56 so here you provide a start and a target node, and SAP HANA will look for the shortest path.
00:02:06 Note that there are two ways to calculate or define distance. By default, SAP HANA Graph is
looking
00:02:15 for the shortest path with regards to hop distance, but you can also provide a weight column
00:02:23 on the edges to calculate a distance. In our example here, looking at hop distance,
00:02:30 you will traverse from start to target via that edge with a weight of 1 down here,
00:02:37 so that will be the hop-distance shortest path. However, if you take a weight attribute into account,
00:02:44 you will find the shortest path traversing this edge to that orange node,
00:02:50 and down here to the target, because 0.8 plus 0.1
00:02:54 is smaller than that weight of 1. Finally, there is the strongly connected components,
00:03:05 and that algorithm will look for sets of nodes, where all nodes are reachable from each other.
00:03:15 In my example here on the right-hand side, I will find two separate sets of nodes,
00:03:21 two strongly connected components, the orange one and the gray one. Again, how you can
00:03:32 invoke those built-in algorithms - there is the SAP HANA Database Explorer again,
00:03:39 so just like we did it for the openCypher queries, you can execute the built-in algorithms
00:03:45 from Database Explorer. Again, Web IDE provides
00:03:50 graph nodes within the calculation view modeler, where you can also invoke built-in algorithms,
00:03:57 for example, for neighborhood search, and calculation scenarios are also supported.
00:04:03 The built-in algorithms are also exposed in our procedural language for graph-specific custom
algorithms,
00:04:13 which is called GraphScript. We will be looking at that in the next unit.
00:04:18 With that, let's quickly dive into the demo. I will be adding some more data to my graph.
00:04:29 Essentially, I will be adding organizations as a third node type between papers and authors.
00:04:40 Authors are affiliated to organizations. Those are the types of edges I will create.
00:04:47 Note that my organization has a geolocation attribute, so a spatial attribute,
00:04:54 assigning a latitude-longitude coordinate to that organization. Let's create that data and quickly
inspect that data.

00:05:07 My Nodes table now looks like that. Besides authors and papers,
00:05:12 I do have the organizations here, and you see essentially two types of spatial attributes
00:05:18 referring to different spatial reference systems, describing the geolocation of that organization.
00:05:26 There are roughly 2,000 organizations in my Nodes table. And with regards to the edges,
00:05:32 I now have a new edge type, which is called "isAffiliatedWith",
00:05:39 which links an author to an organization. That's the type of the data that we will be using.
00:05:50 The first thing I will do is I will run the strongly connected component algorithm
00:05:56 on my co-author graph. Remember, in the previous unit, we've created edges
00:06:01 that connect an author to another author whenever they collaborated on a paper.
00:06:08 For the strongly connected components, I will also use this ugly calculation scenario.
00:06:15 Essentially, now pointing to a subset of my graph, the co-author graph,
00:06:22 and when I create that calculation scenario, I'm using SQL to simply...
00:06:37 There's one thing I forgot to create in the last unit. Let me quickly fix that.
00:06:48 You see, parts of it are a live demo. We created the co-author edges
00:06:54 in the previous unit, and what is missing here is I'm creating additional graphs.
00:06:59 So I have the overall graph, which contains all the information, and from that overall graph,
00:07:05 I am creating a view exposing only the authors and the co-author relationship.
00:07:10 I'm creating a graph workspace directly on that data, kind of semantically partitioning my graph
into a subgraph.
00:07:19 And I'm doing the very same for the citation graph, so simply adding additional views
00:07:25 on top of my Nodes and Edges tables, and adding an additional graph workspace,
00:07:30 which exposes the papers only. Let me quickly add that.
00:07:36 And then going back to my algorithm, strongly connected components,
00:07:41 because it works with the co-author graph that we've just exposed.
00:07:49 Now, this works fine. That strongly connected component algorithm
00:07:54 then is exposed and callable via SQL. Let's simply do that.
00:08:01 This is the output of the algorithm. It assigns to each, in this case author ID,
00:08:07 a component number, so all these authors are part of the strongly connected component with
number 1.
00:08:14 Overall, we have found roughly 7,000 strongly connected components in our data,
00:08:21 and aggregating the size of those strongly connected components,
00:08:27 we, for example, found that component number 243 does contain 18 nodes.
00:08:38 What you can do with the built-in algorithm wrapped in the calculation scenario
00:08:47 is not only can you aggregate your data, but, of course, you can also add joins to it.
00:08:53 That's something that I do here, so adding a little context information
00:08:58 to that result of the strongly connected components, simply adding the attributes of my Nodes
table, in this case.
00:09:08 What I of course can also do, I can now persist the relationship, or that strongly connected
component number,
00:09:17 back in my data table for exploration purposes. I will update my Nodes table, and I will simply
update it
00:09:25 with the results of my strongly connected component algorithm. I will write that component and
component size into my data.
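A sketch of that update, assuming the algorithm's output has been captured in a helper table SCC_RESULT with an ID and a COMPONENT column (the helper table and the two added columns are illustrative):

    ALTER TABLE "NODES" ADD ("COMPONENT" INTEGER, "COMPONENT_SIZE" INTEGER);

    UPDATE "NODES" SET "COMPONENT" =
      (SELECT r."COMPONENT" FROM "SCC_RESULT" r WHERE r."ID" = "NODES"."ID")
     WHERE "TYPE" = 'author';

    UPDATE "NODES" SET "COMPONENT_SIZE" =
      (SELECT COUNT(*) FROM "SCC_RESULT" r WHERE r."COMPONENT" = "NODES"."COMPONENT")
     WHERE "TYPE" = 'author';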
00:09:36 From here, I can first of all recheck my... meta graph again.
00:09:48 Here it is - we are still dealing with papers, authors, and organizations.
00:09:54 From here, I can now use that component number in order to evaluate specific components,

00:10:03 for example, those 18 co-authors that belong to that strongly connected component with the
number 243.
00:10:12 Let's quickly select that, and look at that graph,
00:10:15 maybe add the name label to see who we're talking with. So from here, you see a couple of
authors,
00:10:23 which are all connected or reachable from each other by leveraging that co-author relationship.
00:10:34 The next thing I would like to show you is how you can use, again, the Database Explorer
00:10:42 in order to invoke a shortest path algorithm.
00:10:54 Again, I'm launching that previewing tool on my graph workspace. And okay, I'm doing the same
thing here again,
00:11:03 looking for that component number, 243, visualizing the very same graph also in Database
Explorer,
00:11:11 which we've already seen, and...
00:11:17 so laying it out in a little bit of a nicer way. And then the next thing that we can do,
00:11:25 also using the Database Explorer, is we can go to the algorithm menu,
00:11:30 selecting the shortest path algorithm, and now looking for the shortest path
00:11:35 between, in this case, two authors. The first one is my Fred Richardson,
00:11:40 and the second one is another author in my graph. I'm looking for the shortest path, which looks
at any direction,
00:11:48 and so this is a shortest path which I found in my graph, essentially connecting
00:11:54 Fred Richardson to Stanley Su. As for the last demo, I'm going back to HANA studio,
00:12:04 going back to calculation scenarios. We will create a calculation scenario for our neighborhood
exploration.
00:12:11 It does provide some parameters or expose some parameters,
00:12:17 which we will use, so again creating that calculation scenario once.
00:12:22 Here is how I call my neighborhood search and then populate the parameters
00:12:27 that are part of that calculation scenario definition. First, I'm providing a start vertex,
00:12:34 so again, handing in the node ID of my Fred Richardson. I will traverse in any direction,
00:12:41 and I will look for neighborhoods which are reachable within 1 up to 10 hops.
00:12:47 Let's quickly launch that and look at the results. Not very surprisingly,
00:12:53 the hop 10 distance neighborhood of Fred Richardson does contain a couple of nodes,
00:12:59 and you will find organizations as well as authors in here,
00:13:03 depending on which edge you traversed. You can use the neighborhood
00:13:09 built-in algorithm also in an aggregation query. So what we will do here is now we will simply
aggregate
00:13:15 by the depth, which is the number of hops, and simply count
00:13:22 the number of nodes which are reachable. You find that, starting from Fred Richardson,
00:13:28 within three hops, I reach approximately 8,000 other nodes.
00:13:34 And even within hop 10, I still reach three new nodes here.
00:13:43 Finally, adding some join condition next to the aggregation, I'm still simply adding
00:13:50 additional node attributes to kind of visualize the results and aggregate the results.
00:13:57 In this case, I've just joined the node type, is it an author or an organization,
00:14:02 to kind of aggregate, calculate the number of nodes by type for each depth.
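To illustrate that aggregation step, here is a hedged SQL sketch; it assumes the neighborhood output has been materialized as NEIGHBORS_RESULT(ID, DEPTH) and that the vertex table NODES carries a TYPE column (the calculation scenario invocation itself is left out):

    -- Count reachable nodes per hop distance and node type.
    SELECT nr."DEPTH", n."TYPE", COUNT(*) AS "NODE_COUNT"
    FROM "NEIGHBORS_RESULT" nr
    JOIN "NODES" n ON n."ID" = nr."ID"
    GROUP BY nr."DEPTH", n."TYPE"
    ORDER BY nr."DEPTH", n."TYPE";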
00:14:12 With that, back to the slides. We have seen a couple of those alternatives - Database Explorer
00:14:19 we used to invoke a shortest path algorithm. We have used the calculation scenarios
00:14:25 to use, for example, neighborhood search and do some aggregation and do some joins with it.
00:14:32 This kind of wraps up, from an architectural perspective, what we are seeing. So there are,
besides pattern matching,

00:14:38 those built-in algorithms, which are supported on top of a graph workspace, and we have seen
00:14:42 ways how to invoke those built-in algorithms, both from tooling as well as from SQL.
00:14:49 That concludes this unit. In the next unit, we will be then talking about GraphScript.
00:14:57 See you then.

Week 01 Unit 05

00:00:08 Welcome to Unit 5: GraphScript. GraphScript is our domain-specific language


00:00:16 for writing stored procedures. It is domain-specific in terms of
00:00:24 it helps you to implement custom graph algorithms directly in the database.
00:00:32 So, if you are familiar with SQLScript already, then GraphScript follows kind of a similar approach.
00:00:42 So, you use SQLScript, in SAP HANA, to implement stored procedures on relational data,
00:00:50 using relational operations and functions. So, GraphScript, on the other hand, of course,
00:00:57 is domain-specific, it's a domain-specific language that gives you an abstraction on that
00:01:03 property graph data model. So, GraphScript operates on graph workspaces,
00:01:09 and exposes graph-specific functions and algorithms in its language.
00:01:16 The important thing here to understand is really that you can integrate
00:01:22 SQLScript procedures and GraphScript procedures by, for example, calling out from a SQLScript
procedure
00:01:29 into a GraphScript procedure, doing some graph processing, and then adding, for example,
relational post-processing
00:01:37 to the results. So there is that integration piece,
00:01:41 which for us is really important to bridge the gap between the world of nodes and edges,
00:01:48 the graph, essentially, and the relational world.
00:01:53 Before we start looking at the most important concepts in the GraphScript language,
00:02:00 let's look at this motivational example here. So, there is that idea of an UBO,
00:02:08 an Ultimate Beneficial Owner, if you're looking at companies and their ownership.
00:02:15 So, imagine you have a graph, a network, of companies, and natural persons linked to those
companies.
00:02:23 You might have a relationship between those saying, okay, this person, number one,
00:02:28 owns 25% of that company A. So, an UBO, an ultimate beneficial owner,
00:02:37 is either a natural person, which owns more than 25% of a company directly,
00:02:46 or which owns or controls another company which owns more than 25% of that company A.
00:02:56 So, those types of networks, types of relationships, can be quite complex, and you really need to dig
00:03:04 into those relationships in order to understand if, after some traversals in your company graph,
00:03:12 you will find a natural person which, indirectly, has more than 25%
00:03:18 of the company you're investigating. So, that is a good example of a reason,
00:03:28 or use case, why you're looking at a custom algorithm,
00:03:33 why you want to implement a custom application logic in the database. Because in the end, this is
quite hard to solve
00:03:40 using declarative languages like Cypher, openCypher, for example,
00:03:45 because you really need to evaluate some logical decisions while you're traversing the graph
here.
00:03:52 So, in our example, you will find that person number one is, of course, an UBO,
00:03:58 because it controls, or owns, 75% of company A. But also that person two, down below, here,
00:04:07 in the end, controls a company which owns more than 25% of company A, so person number two
is also an UBO.
00:04:16 So these are the types of the problem classes that you might be facing in real-world scenarios,
00:04:23 where you don't find a generic, out-of-the-box graph algorithm of some type that solves the
problem,
00:04:31 but where you rather would like to implement your own application logic in a very performant, very
efficient way.
00:04:39 And our approach to that is GraphScript. So, one of the core differences

00:04:48 between GraphScript and, for example, SQLScript, or other languages for database-stored
procedures
00:04:55 in the relational world - one important differentiator is really the type system.
00:05:00 So, besides the primitive data types, being integer, or strings, and stuff like that,
00:05:06 GraphScript uses data types which are GRAPH, or VERTEX, or EDGE.
00:05:15 So rather graph-specific data types, here. And you see the core, most important types being listed
here.
00:05:24 So, for example, you instantiate a GRAPH g by calling out that GRAPH function
00:05:30 and providing a schema and a graph workspace name. So then you have that data type GRAPH
instantiated in a variable "G".
00:05:38 In a similar way, you can call out the VERTEX function to identify a vertex, in this case with an
identifier
00:05:47 And for an edge, for example, you call out the EDGE function to instantiate an edge with an
identifier 1.
00:05:56 Then, there are collection data types, so there is MULTISET and there is SEQUENCE.
00:06:05 MULTISET is an unordered collection, SEQUENCE is an ordered collection
00:06:10 of either vertices, nodes, or edges. Other expressions or control structures
00:06:21 being used in GraphScript allow you to, for example, access the attributes
00:06:27 of a node or a vertex. Here, for example, you access the attribute
00:06:33 called WEIGHT of an edge. Then, just to pick one additional example,
00:06:40 there are loop statements, or conditional statements, so we have if-then-else constructs,
00:06:46 and we do support for-each and while loops. Besides the type system, there are of course
00:06:54 graph-specific functions that make up GraphScript. And some of them are depicted here.
00:07:01 For example, SUBGRAPH and INVERSEGRAPH allow you to instantiate a new graph from a
given graph.
00:07:08 So, for example, SUBGRAPH lets you induce a new graph, for example, based on some filtered
edges
00:07:17 out of an original graph. In my example here, I'm filtering the vertices
00:07:25 which are of color blue. And based on that subset of vertices,
00:07:30 I'm inducing that subgraph. In a similar way, you can create an INVERSEGRAPH
00:07:35 with reverse directions with regards to the edges, and there are built-in functions
00:07:42 in order to get to the source, or the target, of an edge. Other important built-in functions are
00:07:50 the NEIGHBORS function, which allows you to traverse from a given start vertex
00:08:00 up to a specified number of hops into the neighborhood, and collect the nodes being reachable
from there.
00:08:09 SHORTEST_PATH is a very important function. It allows you to invoke the shortest path
algorithm
00:08:17 in a similar way that we have seen in the previous unit. So, you provide a source and a target
vertex,
00:08:24 and in this case, it will evaluate the shortest path in terms of hop distance.
00:08:30 The data type that's being created here is a WEIGHTEDPATH,
00:08:35 and WEIGHTEDPATH is a very central thing, especially as it relates to one of the core functions,
00:08:41 being SHORTEST_PATH. So, a WEIGHTEDPATH, of course, specifies a path,
00:08:47 and there are certain operations that you can invoke on a SHORTEST_PATH. You can extract or
project the nodes,
00:08:54 or the edges, that make up the path, or you can calculate the length,
00:09:00 which by default is the hop distance length of that SHORTEST_PATH, or that weight of the path,
in case you
00:09:07 use a custom weight function to calculate the distance for the shortest path.

00:09:16 Some additional information on the collection data types here on the multisets.
00:09:22 Of course, yeah, you can run set operations on those collections,
00:09:26 on those multisets, essentially running a union or intersect of two multisets,
00:09:33 you can count the elements in that collection and you can call a distinct function on them.
00:09:41 In a similar way, on the sequence collection, you can do set operations in terms of
00:09:48 you can do a concatenation of two ordered collections, of two sequences.
00:09:57 And there is one important thing to notice. GraphScript procedures are read-only by default.
00:10:04 However, in certain algorithms, of course, if you're traversing the graph, and you're calculating
some
00:10:12 attributes, some properties for the nodes or the edges, you need to store that data.
00:10:19 And for that, we are introducing temporary attributes. In terms of, when you instantiate a graph,
00:10:25 you can add temporary attributes to that graph, and you can write to these temporary attributes,
00:10:32 and at the end of that GraphScript procedure, you of course use a projection
00:10:38 to kind of create the results that the graph algorithm outputs.
00:10:43 And of course, those temporary attributes can be included in a projection, can be included in a
results definition.
00:10:53 And then there is a very important operator. That is the traversal operator,
00:10:59 which gives you a nice, efficient way to traverse the graph in a breadth-first manner.
00:11:07 So, while invoking that operation, you start from a specific vertex,
00:11:12 and you can now implement so-called hooks, that means custom functions,
00:11:18 while visiting either a vertex or an edge. So, in this very simple case,
00:11:25 I'm writing to a temporary attribute in both the edges as well as the vertices.
00:11:31 Whenever I visit a vertex or an edge, I simply store that level information
00:11:37 to that temporary attribute. So, TRAVERSE BFS, I'll traverse in a breadth-first manner,
00:11:44 that is a very nice, very interesting way to do graph traversals.
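To give a feel for the shape of such a traversal, here is a rough sketch of a TRAVERSE BFS block inside a GraphScript procedure body. The temporary-attribute declaration and the hook syntax are quoted from memory of the GraphScript reference and may need adjusting; schema, workspace, and attribute names are assumptions:

    -- Record the BFS level at which each vertex is first visited (sketch).
    GRAPH g = Graph("MY_SCHEMA", "MY_WORKSPACE")
        WITH TEMPORARY ATTRIBUTES (VERTEX BIGINT "LEVEL" = 0L);
    VERTEX v_start = Vertex(:g, :startVertex);
    TRAVERSE BFS :g FROM :v_start
        ON VISIT VERTEX (VERTEX v, BIGINT lvl) {
            v."LEVEL" = :lvl;   -- hop distance from the start vertex
        };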
00:11:51 With that, let's switch to some demo examples. So, again, I'm going back to my citation graph,
00:12:01 which, by now, has some additional stuff in there. So just to remind you, we have organizations,
00:12:10 we have papers, and we have authors in our nodes. Within our edges table, we do have different
types.
00:12:18 So, we have the edges that make up the citation graph, we have the relationship from the paper
to the author,
00:12:24 we have that co-author relationship, and we have the relationship that relates an author
00:12:31 to an organization. So now let's look at some basic examples here.
00:12:38 First of all, we will be doing some neighborhood traversal in GraphScript. So, you create that
GraphScript procedure,
00:12:45 providing input parameters and output parameters. In my case, as I'm doing a neighborhood
traversal here,
00:12:52 my input parameter is a start vertex, represented by the identifier.
00:12:57 Then two input parameters that control the depth, so the hop distance from a minimum to a
maximum level.
00:13:04 And finally, I'm using a table type in HANA to define the output structure.
00:13:10 So now, the important keyword here, when defining that procedure,
00:13:13 is LANGUAGE GRAPH. So that's our indication that GraphScript,
00:13:20 the graph language, is being used in that procedure. So, the procedure itself is very simple.
00:13:26 I'm instantiating my graph, referring to a schema and a workspace name.
00:13:31 Then, I'm instantiating a vertex by using that input parameter that is an identifier.
00:13:40 Then I'm creating a multiset by simply calling the neighbors function.

00:13:45 And now the neighbors function operates, of course, within a graph. It starts with a start vertex,
00:13:50 and then it has a minimum and maximum depth it traverses to. Finally, I'm simply using a
projection
00:13:57 to kind of create the result set here. And that's the end of the procedure.
00:14:03 So let's simply create that one. And the way you call out to procedures in HANA
00:14:08 is by that CALL statement. So I'm calling the procedure I've just created,
00:14:12 I'm starting from Fred Richardson, and I'm reaching in to its neighborhood,
00:14:16 and I'm interested only in the hop distance three. So I'm reaching from three to three.
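For orientation, this is roughly what such a procedure and its invocation can look like; schema, workspace, and column names are assumptions based on the description, and the output structure is declared inline rather than via a separate table type:

    -- Neighborhood traversal as a GraphScript procedure (sketch).
    CREATE OR REPLACE PROCEDURE "GS_NEIGHBORS"(
        IN  startVertex NVARCHAR(100),
        IN  minDepth    BIGINT,
        IN  maxDepth    BIGINT,
        OUT res         TABLE ("ID" NVARCHAR(100))
    )
    LANGUAGE GRAPH READS SQL DATA AS
    BEGIN
        GRAPH g = Graph("MY_SCHEMA", "MY_WORKSPACE");
        VERTEX v_start = Vertex(:g, :startVertex);
        MULTISET<VERTEX> ms_neighbors = Neighbors(:g, :v_start, :minDepth, :maxDepth);
        res = SELECT :v."ID" FOREACH v IN :ms_neighbors;
    END;

    -- Invocation, e.g. only the nodes exactly three hops away:
    CALL "GS_NEIGHBORS"('<node id of Fred Richardson>', 3, 3, ?);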
00:14:22 And this is the result set in terms of, these are, for example, authors that are reachable from Fred
00:14:30 within three hops of distance. Now, calling out to a GraphScript procedure
00:14:36 via that call statement is of course one option. In order to have a nicer way of SQL consumption,
00:14:45 you can wrap that GraphScript procedure in a SQLScript procedure.
00:14:49 For example, in order to create a table function. That's essentially what we do here.
00:14:57 So, we create a function, and we are using GraphScript as a language here,
00:15:03 and we are simply calling out to our neighborhood GraphScript algorithm, which we have seen
previously.
00:15:09 And we simply return its result. And with that, you can use a simple SQL SELECT statement
00:15:15 to select from that table function. So let's quickly do that.
00:15:19 Creating that table function and using a SQL select to kind of call out to that wrapping SQLScript
procedure
00:15:28 and return the very same result that we have seen previously. So what that allows you to do is to,
for example,
00:15:37 add some post-processing analysis directly in SQL.
00:15:44 So, given the scenario, okay, I'm interested in Fred, I'm interested in its neighborhood,
00:15:52 especially in the organizations he is directly or indirectly related to.
00:15:58 I'm interested in the spatial extent of all the organizations.
00:16:02 And in order to evaluate this spatial extent, I can use the spatial functions.
00:16:06 Here, for example, I'm calculating the concave hull of all the neighbors of Fred that are reachable
00:16:17 within 10 hops of distance. So, I'm calling out to the table function here,
00:16:23 and I'm doing some post-processing by doing the concave hull aggregation function.
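A sketch of that kind of post-processing query: it assumes the GraphScript procedure has been wrapped in a table function GS_NEIGHBORS_FUNC, that the nodes carry an ST_GEOMETRY column LOCATION, and it uses the convex hull aggregate ST_ConvexHullAggr as a stand-in for the concave hull aggregate used in the demo:

    -- Spatial post-processing on top of the graph result (sketch, assumed names).
    SELECT ST_ConvexHullAggr(n."LOCATION").ST_AsGeoJSON() AS "HULL_GEOJSON"
    FROM "GS_NEIGHBORS_FUNC"('<node id of Fred Richardson>', 1, 10) t
    JOIN "NODES" n ON n."ID" = t."ID"
    WHERE n."TYPE" = 'organization';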
00:16:28 So, what that query now returns, it runs the neighborhood algorithm,
00:16:33 it calculates the concave hull. It translates the results into a format
00:16:40 which is called GeoJSON. And one can visualize those results,
00:16:46 and what you see here on that base map is essentially a concave hull
00:16:51 which is calculated based on the geolocations of the organizations that Fred is related to
00:17:00 within up to 10 hops of distance. So this is a kind of a post-processing logic
00:17:05 that you can add to GraphScript algorithms. So let's take a look at the next example,
00:17:18 and that is how you use the SHORTEST_PATH function in GraphScript. So, here, that procedure
takes in a start and a target vertex,
00:17:29 essentially the nodes where we want to calculate the shortest path. And what this returns is the
length of the path,
00:17:35 as well as the nodes and the edges that make up the path. So what we are doing, in a very
similar way,
00:17:42 is we are instantiating a graph here. In this case, we are just looking at the co-author graph.
00:17:50 So just looking at authors and their co-author relationships.
00:17:54 We instantiate the source and the target, or the start and the target vertex,
00:17:59 and then we are calling out the shortest path function, and storing the results in a weighted path
data type.

00:18:06 On that weighted path p, we calculate the length. And in the end, we're also projecting
00:18:12 the nodes and the edges out of that weighted path. So let's quickly instantiate that procedure,
00:18:23 and call it via the CALL function. So what that gives me is a result which depicts the length of the
path.
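The hop-distance variant could be sketched roughly as follows; schema, workspace, and column names are assumptions:

    -- Shortest path by hop distance (GraphScript sketch).
    CREATE OR REPLACE PROCEDURE "GS_SHORTEST_PATH_HOPS"(
        IN  startVertex  NVARCHAR(100),
        IN  targetVertex NVARCHAR(100),
        OUT pathLength   BIGINT,
        OUT pathVertices TABLE ("ID" NVARCHAR(100))
    )
    LANGUAGE GRAPH READS SQL DATA AS
    BEGIN
        GRAPH g = Graph("MY_SCHEMA", "COAUTHOR_WORKSPACE");
        VERTEX v_start  = Vertex(:g, :startVertex);
        VERTEX v_target = Vertex(:g, :targetVertex);
        WEIGHTEDPATH<BIGINT> p = Shortest_Path(:g, :v_start, :v_target);
        pathLength   = Length(:p);
        pathVertices = SELECT :v."ID" FOREACH v IN Vertices(:p);
    END;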
00:18:31 So, there are seven hops between that source and that target vertex, here.
00:18:37 This is essentially the nodes that make up the path, with their ordinality.
00:18:44 So there is a relation from Fred, to Francis, and so on, until I finally reach this author down here.
00:18:50 And in a similar way, representing the very same path, I do have the edges information.
00:18:58 So there is a co-author relationship from this source to this target, and so on,
00:19:03 again, with ordinality information. So this essentially extracts the complete path
00:19:08 out of the shortest path algorithm here. In a similar way, that GraphScript procedure
00:19:13 can be wrapped, again, in a SQLScript procedure, as, for example, a table function,
00:19:18 exposing the edges, we can do some post-processing, some aggregation, where you can join
additional data
00:19:25 that resides outside of your graph workspace. So now, I've just invoked the shortest path algorithm
00:19:33 using the default distance, which is hop distance. So, what shortest path also allows you to do
00:19:40 is to calculate distance based on a custom weight attribute. And in my case, I will use as a
custom weight attribute
00:19:47 the number of collaborations within my author graph. So, remember that two authors collaborate
on several papers,
00:19:56 so I have that count information. And I'm simply using one divided by that count
00:20:02 as a kind of weight function, as kind of a distance. So the more papers two authors collaborated on,
00:20:08 the closer those two authors are. So, for that, I'm creating a very similar procedure,
00:20:18 again, handing in a start and target. But in this case, I'm creating a weighted path.
00:20:26 And what I do over here is, I'm calling out to a specific edge function,
00:20:33 which calculates by that formula down here, one divided by the count,
00:20:39 which is the number of collaborations, and uses that as weight attribute to calculate the distance.
00:20:45 So in this case, I'm not calling out the length function to evaluate the hop distance,
00:20:52 I'm calling out the weight function on that weighted path, which gives me the sum of the weights
over that path.
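The weighted variant only changes the call itself: a custom edge weight function is handed in, and the path weight is read instead of the hop length. A sketch of the relevant lines inside such a procedure, assuming the edge attribute count is stored as a DOUBLE:

    -- Weighted shortest path: 1/count as edge weight, so frequent co-authors are "closer".
    WEIGHTEDPATH<DOUBLE> p = Shortest_Path(:g, :v_start, :v_target,
        (EDGE e) => DOUBLE { RETURN 1.0 / :e."count"; });
    pathWeight = Weight(:p);   -- sum of the edge weights along the path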
00:20:58 So that's the basic difference here. So let's also create that procedure,
00:21:08 and then kind of compare that results. So that was the GraphScript procedure
00:21:12 calculating shortest path on hop distance. And that is the procedure that we just created
00:21:17 using one divided by count as the distance function. So let's quickly look at the results.
00:21:26 So, what you notice here, based on hop distance, I do have five hops
00:21:32 in that path, calculated by hop distance, whereas when I use that custom weight function,
00:21:39 I do have one additional hop. What you would realize is that the count,
00:21:44 the strength of the collaboration, is greater in that path being evaluated here
00:21:51 by using that custom weight function. So two different shortest paths,
00:21:55 one depending on hop distance, one depending on a custom weight function.
00:22:02 So, one nice thing that you can do is, for example, expose the SHORTEST_PATH function
00:22:08 as a scalar, user-defined function. So what I'm doing is I'm simply creating that GraphScript
procedure
00:22:17 and wrapping that in a scalar, user-defined function, simply returning, basically,
00:22:26 the length of the shortest path, given two vertices. So, let's wrap that, or create that SQLScript
function here.
00:22:36 And what that allows you to do is to use that scalar UDF in a SELECT statement.

00:22:43 So, just like we see here, where I calculate the shortest path between two nodes,
00:22:48 and I'm just returning the hop distance as a simple scalar. So, having that function implemented,
00:22:55 you could think about calculating the pairwise distance of two sets of nodes.
00:23:03 So, for example, I've made up two artificial, let's say, communities.
00:23:08 There are the authors called Fred, and there are the authors called Franz.
00:23:15 And I'm calculating the pairwise distance in order to understand what is the greatest distance
00:23:23 between all the Freds and the Franzes in my graph. So, this is by calling out to that scalar function,
00:23:32 and handing in two sets of nodes, which I pairwise then run the scalar UDF.
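That pairwise evaluation is then plain SQL around the scalar UDF; the function name F_HOP_DISTANCE and the NODES columns are assumptions:

    -- Pairwise hop distance between two sets of authors via the scalar UDF (sketch).
    SELECT a."NAME" AS "FRED", b."NAME" AS "FRANZ",
           "F_HOP_DISTANCE"(a."ID", b."ID") AS "HOP_DISTANCE"
    FROM "NODES" a, "NODES" b
    WHERE a."TYPE" = 'author' AND a."NAME" LIKE 'Fred %'
      AND b."TYPE" = 'author' AND b."NAME" LIKE 'Franz %'
    ORDER BY "HOP_DISTANCE" DESC;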
00:23:38 And what we see here, for example, there is a Fred Goodman which is six hops away from that
Franz Beil.
00:23:45 So that is the longest path which you can find between the Franzes and the Freds.
00:23:56 So, as a last example, let's take a look at that traverse BFS operator.
00:24:02 So, what I'm doing here is, again, I'm looking at my citation graph here,
00:24:07 I'm adding a temporary attribute on my vertices which I can write to.
00:24:13 And what I will store in that temporary attribute is simply the level
00:24:18 on which I traversed or found that vertex. So, I'm starting from the given start vertex
00:24:26 and calling the traverse BFS operator on that graph, starting from that vertex.
00:24:31 And I'm simply saying, okay, upon each vertex being visited,
00:24:37 write that level information into that temporary attribute. So pretty straightforward.
00:24:42 And finally, I'm projecting that result. So let's create that procedure, and I'm giving it a call.
00:24:53 So what you expect here is that, given that start vertex down here,
00:24:59 I'm visiting after one hop, so level number one, I'm seeing that other paper in my citation graph.
00:25:08 And similar to the other papers, that is, the information that we see here is,
00:25:14 at which level have I seen that paper in my citation graph.
00:25:21 So, in that previous unit, I mentioned that there are not a lot of built-in functions, currently, in SAP
HANA Graph.
00:25:31 And one reason for this is that in the examples and the use cases we have been discussing
00:25:38 with our customers and internal stakeholders, we see a lot of use cases
00:25:43 that require very, very specific application logic being implemented.
00:25:49 So, we concentrated on GraphScript as our strategic approach here.
00:25:55 And this example shows you how you can implement, for example, standard, general graph
algorithms.
00:26:02 In this example, PageRank using GraphScript as a language. So, PageRank is a recursive
algorithm
00:26:12 that kind of identifies important nodes. And the important nodes, for example,
00:26:16 in a Web graph, where we have Web pages referring to each other via hyperlinks,
00:26:21 you are assigning a certain stored weight to these pages, and then you're distributing that weight
in a proportional way
00:26:29 along the number of outgoing edges from one page to another.
00:26:34 And you're calling that in a recursive manner, then adding a stop condition,
00:26:38 or a convergence condition here. So, this is how PageRank may look,
00:26:45 being implemented in GraphScript. What I'm doing here is, I'm grabbing a graph,
00:26:51 I'm adding two temporary attributes to store the PageRank and the outDegree.
00:26:56 And as a kind of initialization, I'm simply calculating the outDegree for each node
00:27:04 that I have in the graph. And then I'm doing in a while loop,
00:27:08 essentially, from each node, I'm reaching out to its direct neighbors,
00:27:13 and looking for the weights that are then attached to those direct neighbors, and then propagating

00:27:20 back to the node where I came from. I'm doing that for every node,
00:27:23 I'm doing that in an iterative way, so the weight kind of propagates through that graph.
00:27:29 So this is essentially one example of how you could implement general graph algorithms,
00:27:35 which we do not offer out of the box, by simply using SQLScript...
00:27:41 GraphScript, I'm sorry. So this, for example, this PageRank algorithm
00:27:45 provides that information in terms of, for each node, I have PageRank assigned to it.
00:27:51 And in this case, the results table has been sorted in descending order.
00:27:56 So I see that this paper A88 is very prominent, because it's being cited by a lot of other important
papers.
00:28:10 So back to the slides for a second. Returning to our motivational example
00:28:16 of calculating the ultimate beneficial owners of a company. Here you see a sample
implementation using GraphScript.
00:28:25 Essentially, where you have logical conditions depending on which level in your traversal you are.
00:28:31 So if there is a person which is directly owning a certain amount, a certain percentage of a
company,
00:28:39 or if, one or more hops further away from that company A,
00:28:47 you have a natural person where you need to sum up the owning percentages in order to
understand if
00:28:52 this is an UBO or not. So as a last thing, I again want to point out
00:28:58 that integration between SQLScript, the relational world, and GraphScript, the graphy world,
00:29:05 is very important for us. So we have seen that example of spatial post-processing,
00:29:10 where I created a concave hull based on all the reachable organizations
00:29:14 and their geolocations. So this is a very important integration
00:29:18 that allows you, really, to call out from an orchestrating, overarching,
00:29:24 or pre-processing, post-processing kind of SQLScript procedure, into a GraphScript procedure,
00:29:30 doing some recursive operations like neighborhood traversal, shortest path calculation,
00:29:35 and then doing post-processing in SQLScript again, and here, for example, calculating the
pairwise distance
00:29:43 between all the UBOs that I've calculated using GraphScript. So, that concludes this unit on
GraphScript.
00:29:54 In the next unit, we will be talking about hierarchies in SAP HANA.

Week 01 Unit 06

00:00:08 Welcome to Unit 6: SAP HANA Hierarchies. So we're talking about connected data
00:00:16 and for the demo data that we have been using so far, so citation graph and author graph,
00:00:23 we have that general directed graph structure. Sometimes your data is connected,
00:00:31 but it forms kind of a tree-like structure. So think about an organizational hierarchy
00:00:38 that, in most cases, is pretty much a very well-formed tree. So a tree is a kind of special graph.
00:00:47 And as it has some really dedicated use cases for tree-like data structures,
00:00:55 like organizational hierarchies, there is a specific SQL syntax for handling
00:01:03 hierarchical data, handling tree-like data. So you might ask yourself: In the end, if hierarchies
00:01:11 are just a specific type of graph, why are there two, kind of, distinct approaches?
00:01:17 And I mean, from a conceptual point of view, there are really, yeah, some differences in terms of
00:01:26 how the data is stored, essentially. In a tree or in an organizational hierarchy,
00:01:33 you essentially have just one semantic with the types of the edges.
00:01:38 So for example, a parent-child semantic, a parent-child relationship, simply saying,
00:01:43 okay, an employee is connected to its boss. So the boss is the kind of parent of that employee.
00:01:53 There is also the approach in SAP HANA hierarchies to kind of support ad hoc queries, ad hoc
generation
00:02:03 of that tree-like data structure of these hierarchies. This is as opposed to the graph engine,
00:02:10 where we really need to create a graph workspace in order to expose the data to the graph
engine.
00:02:16 And with regards to navigation functions or traversals, there is a limited but important set of functions
00:02:24 that you invoke in hierarchies, essentially doing some ancestor
00:02:28 or descendant traversals and stuff like that. So there is a gray line or a fine line
00:02:37 between those two and there might be use cases that can be solved in either approach, either
using
00:02:44 the HANA graph engine or the hierarchies function. If you think in general, if you have
00:02:51 a tree-like data structure, that's more of a hierarchy, if you have a general graph, it's more for the
graph engine.
00:02:57 That might be a very good first differentiator here. So again, as a motivating example,
00:03:06 if you're thinking about an organizational hierarchy, common questions that you can answer
00:03:10 with the hierarchy functions in SAP HANA are, for example: Which employees are reporting to a
specific manager?
00:03:19 Or if you have two employees: Which manager do they have in common?
00:03:25 Or then in the end, if we're talking about aggregation functions on these hierarchies,
00:03:30 think about financial transactions being recorded on a group level.
00:03:41 Yeah, on a group level, essentially. And at a higher level within that organization,
00:03:45 you ask yourself: Do I have money or don't I have money? So it is about aggregating certain
numeric figures
00:03:56 that are attached to the nodes in the hierarchy up to a certain level in that hierarchy.
00:04:02 So in general, hierarchies are supported via a SQL interface in HANA.
00:04:09 And that SQL interface essentially gives you functions to generate a hierarchy.
00:04:16 For example a parent-child hierarchy but also a leveled hierarchy, a temporal hierarchy are
supported.
00:04:22 And there are functions that let you navigate a hierarchy. So in this case we are looking for
descendants,
00:04:29 ancestors, and siblings, and in a very similar way, there is the notion of aggregation
00:04:36 while looking at the descendants here. So there's the generation part and there is

00:04:40 that navigation part in the hierarchies. Let's look at this example of a parent-child hierarchy.
00:04:48 So this is a way how you could think of it in a tree-like structure.
00:04:53 You can generate that hierarchy based on the data that is depicted here in that table.
00:04:59 So you have a PARENT_ID and a NODE_ID column, and you simply tell the hierarchy generation
function
00:05:08 where that parent is, where that node is. And then internally that hierarchy is created.
00:05:13 It's generated ad hoc and exposed for navigation functions. So before we continue with the slides,
00:05:24 let's see the demo for that. So I tried to stick to that AAN dataset
00:05:31 that I used throughout the demos, so there is a kind of topic hierarchy that is also exposed in the
dataset.
00:05:41 It is a categorization of the papers in terms of the topics they cover.
00:05:47 So what we see in the topics of that AAN dataset is that there is a high-level node called
Introduction and Linguistics.
00:05:58 It has a node ID 1. And down here I have the next level
00:06:02 in that parent-child relationship. So for example, Introduction to NLP,
00:06:07 where the parent ID is essentially that Introduction and Linguistics node that we have already
seen.
00:06:13 So this is a parent-child hierarchy in that category data. So in order to generate a hierarchy,
00:06:23 I'm calling out to that table function called HIERARCHY. And what this takes as a minimum is the
source definition.
00:06:30 So I'm simply pointing to the parent ID and the node ID of the table that we have just seen.
00:06:36 And then, you need to provide an order criterion on how the siblings are ordered in order
00:06:43 to generate consistent results later on. So let's simply call out to the function
00:06:49 and generate that hierarchy on the fly. So what you see here is, of course,
00:06:53 that there are certain columns of my source data, of the source table being reflected here.
00:06:59 So we understand Natural Language Processing being one of the nodes here.
00:07:03 And what you see on the left-hand side are the attributes that are generated by that hierarchy
generation function.
00:07:10 So for example, we see a generated attribute for the hierarchy tree size.
00:07:16 So underneath that node, node ID zero, there are 273 descendants.
00:07:23 It is on level one, it is not a cycle, it's not an orphan. So these are generated attributes that you
can leverage
00:07:31 later on for your hierarchy navigation functions. So the one thing that you might want to do
00:07:44 with your hierarchy definition, if you want to reuse it in order to, kind of, persist the definition,
00:07:52 you can wrap that hierarchy generation in a view. So that's what I'm doing here.
00:07:57 I'm simply creating a view and I'm creating it as AS SELECT * FROM that hierarchy function
00:08:03 that we've already seen. And by wrapping it into a view,
00:08:06 it also allows me to define caching behavior. So if you have a hierarchy where the data
00:08:13 doesn't frequently change, you can very much benefit from caching mechanisms, but there are
also ways to deal
00:08:19 with rather frequently changing data in your hierarchy. So let's simply create that view and simply
do a call out
00:08:27 to that view, which kind of gives me, of course, this same data that we already have seen
00:08:33 while ad hoc creating that hierarchy without a view. So now, once I have created that view,
00:08:43 which encapsulates the hierarchy generation, I can call the hierarchy navigation function.
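As a sketch of the generation part wrapped in a view, assuming a source table TOPICS with columns PARENT_ID, NODE_ID, and NAME (the HIERARCHY function expects the key columns under the names parent_id and node_id):

    -- Generate the parent-child hierarchy and persist its definition in a view (sketch).
    CREATE VIEW "TOPIC_HIERARCHY" AS
    SELECT * FROM HIERARCHY (
        SOURCE (
            SELECT "PARENT_ID" AS parent_id,
                   "NODE_ID"   AS node_id,
                   "NAME"
            FROM "TOPICS"
        )
        SIBLING ORDER BY node_id
    );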
00:08:49 In this case, I'm calling out the hierarchy descendants. Then I'm saying, okay, start at node ID 3
00:08:55 and give me the descendants up to a distance of 1. So invoking that function returns all the
descendants
00:09:06 that are underneath node number 3. So for example, Lexical Semantics has the node ID 31,

00:09:14 and it is a child of the node ID number 3. So these are the kind of results that you get
00:09:21 when calling the navigation functions. In addition to the attributes that we see here
00:09:26 on the left-hand side that we already know, we have hierarchy distance information,
00:09:32 so how far away I am from the start node, and what the rank of the start node is
00:09:37 in a pre-order ranking definition. In a similar way, calling out to hierarchy ancestors
00:09:45 in order to calculate the ancestors of a given node. So I'm starting with node ID 744 and I'm going
up
00:09:53 in the hierarchy until I reach, in this case, the root node being Natural Language Processing.
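Sketches of those two navigation calls on the view created above; the node IDs are the ones mentioned in the demo, the rest follows the generic syntax:

    -- Direct children of node 3 (descendants at distance 1).
    SELECT * FROM HIERARCHY_DESCENDANTS (
        SOURCE "TOPIC_HIERARCHY"
        START WHERE node_id = 3
    ) WHERE hierarchy_distance <= 1;

    -- All ancestors of node 744, up to the root.
    SELECT * FROM HIERARCHY_ANCESTORS (
        SOURCE "TOPIC_HIERARCHY"
        START WHERE node_id = 744
    );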
00:10:06 Then in a similar way, of course, you can call out to the hierarchy siblings, so on the same level,
00:10:11 the nodes which are on the same level as the start node, nothing surprising here.
00:10:16 And as a final example, the aggregation function. So I've attached some arbitrary numeric data
00:10:25 to my topic hierarchy by simply counting the papers that, kind of, share the same ID or a part of
the ID.
00:10:40 So I'm simply taking the topic information and joining my papers data in my nodes table
00:10:47 and simply counting the papers. That is a workaround because I didn't have
00:10:52 the direct relationship between the topics and the papers. It was not in the original data.
00:10:56 So I'm just making up my own numerics here. So what you do with
HIERARCHY_DESCENDANTS_AGGREGATE,
00:11:02 you start, for example, providing the nodes you are interested in, for example,
00:11:08 those ones with a level smaller than or equal to 3, and you are aggregating the count in terms of
00:11:15 you're counting the number of papers here, which are attached to the nodes or their descendants.
00:11:20 So this is a pretty efficient way to do calculations and aggregating in a hierarchical manner.
00:11:29 So for example, you will find that attached to this node, Introduction to NLP, I in the end have
over 3,000 documents
00:11:37 that are attached to that topic, again, using my crude regex type of join condition here.
00:11:45 So this is a very powerful way, if you have numeric data on your nodes level in your hierarchy, but
you need to run
00:11:52 certain aggregations on the ancestor level, so to speak. So with that, going back to the slides,
00:12:04 this is basically, again, for your reference, here the hierarchy generation function.
00:12:10 These are the generated hierarchy attributes that we have seen, so for example,
00:12:14 the tree size for each node in that hierarchy. Here are, for your reference, the syntax elements
00:12:23 that you can use in hierarchy generation. So there are ways to deal with not exactly
00:12:31 tree-like data structures, where you have multiple parents, where you have cycles, where you
have also orphans.
00:12:37 So there are ways to kind of deal with these situations and still use the hierarchy functions.
00:12:44 This is how you wrap the hierarchy generation in a view for reusing that hierarchy definition.
00:12:53 This is what you get by calling HIERARCHY_DESCENDANTS. So starting from node B2,
00:12:58 I'm reaching out to all the descendants. This is the output structure when calling the descendants.
00:13:04 We have seen the distance-to and the rank attribute that are based on the start node in that result
set.
00:13:11 Then we have the hierarchy ancestors. We're starting from node C4;
00:13:16 you evaluate all the ancestors here. And finally, in a similar way,
00:13:20 this is how you calculate the siblings. So starting from node C4, you understand the sibling here.
00:13:27 And finally, this is a visualization of what you do with the hierarchy aggregation functions.
00:13:33 So depending on the definition and your output result, you are doing an aggregation of numeric
data
00:13:41 on all the descendants, on specific nodes or nodes level. And with that, that's all that I would like
to tell you

00:13:51 about SAP HANA hierarchy functions, the SQL syntax you use for navigating hierarchies and generating
hierarchies.
00:13:59 Again, it's a specific form of a graph. It might come in very handy if you really deal
00:14:05 with these tree-like structures. So that concludes this unit.
00:14:12 In the next unit, we will be looking at how you can leverage full text search and additional
00:14:20 spatial capabilities in combination with graph. Thanks.

Week 01 Unit 07

00:00:07 Welcome to the last unit of this course. We will be looking a little beyond Graph, again,
00:00:14 into the area of spatial and full-text search. We have supported spatial analysis,
00:00:24 spatial data types for a while in SAP HANA. In essence, this is about 2D, 3D vector data types
00:00:34 that allow you to store things like points, and lines, and polygons natively in SAP HANA.
00:00:42 There is a set of functions and algorithms that you can use with that spatial data
00:00:48 representing customer locations, highways, or sales areas and stuff like that,
00:00:54 which, for example, lets you evaluate if a point is within a given polygon.
00:01:01 Besides those core technical capabilities of data types and functions,
00:01:06 SAP HANA also supports or comes with content and services. For example, there is a way to do
geocoding
00:01:16 natively in SAP HANA that allows you to turn address information,
00:01:22 so postcode, city, street, house number, into a geolocation.
00:01:28 Essentially a latitude-longitude coordinate. Besides geocoding, there is content
00:01:36 which you can download if you have an SAP HANA license, and that, for example, gives you
access
00:01:44 to generalized administration boundaries and postcode areas which you can use
00:01:50 in your spatial analysis. What we and our customers are doing
00:01:56 with the spatial capabilities in HANA is a lot related to emergency response,
00:02:03 disaster recovery kind of use cases where we try to understand, for example,
00:02:08 the impact of a wildfire, where we try to coordinate resources,
00:02:13 so your fire workers, police, other rescue workers, to optimally support and provide relief in those
scenarios.
00:02:24 Looking at some of the functions or predicates that are supported,
00:02:30 I already mentioned ST_Within as a way to evaluate if a point is within a given polygon.
00:02:38 Just to give you some more examples, there is that Touches predicate,
00:02:42 which allows you to understand if, for example, a highway touches a natural reserve area,
00:02:48 something like that, or in order to understand if two polygons essentially overlap.
00:02:54 The spatial predicates can be used to filter your data, but also to join your data,
00:03:01 for example, to join your customer data, which is geocoded with your sales areas, something like
that.
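A minimal sketch of such a spatial join, with hypothetical tables CUSTOMERS (point column LOCATION) and SALES_AREAS (polygon column SHAPE):

    -- Assign each geocoded customer to the sales area polygon that contains it.
    SELECT c."CUSTOMER_ID", s."AREA_NAME"
    FROM "CUSTOMERS" c
    JOIN "SALES_AREAS" s
      ON c."LOCATION".ST_Within(s."SHAPE") = 1;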
00:03:08 Of course, those spatial functions can be used to evaluate the output of a graph algorithm,
00:03:18 or can be used to create new data for your graph. For example, you could use spatial functions
00:03:27 that let you calculate distance in order to create new edges,
00:03:31 for example, persons who live nearby, something like that. As for the presentation, there are
some spatial functions
00:03:44 that allow you to calculate bounding geometry on a set of points.
00:03:52 In a previous example, I already used concave hull in order to understand the spatial reach of
Fred Richardson.
00:04:00 There are a couple of other spatial methods to calculate geometries out of given ones.
00:04:10 One that is quite often used, for example, is to calculate the envelope
00:04:15 in order to understand, from a UI perspective, to which level do I need to zoom in,
00:04:21 zoom out my base map in order to plot a subgraph on that base map?
00:04:28 For example, a subgraph could be one that describes a part of a utility grid network.
00:04:39 Think about an electricity grid. As a matter of fact, we have run a POC in the past
00:04:44 with a Dutch electricity company where it was about evaluating potential outage scenarios.
00:04:51 There is a transformer going down, and the sub-net is experiencing an outage.

00:04:58 That is essentially a graph algorithm that traverses the top graph,
00:05:03 and evaluates the top graph. In the end, you're using spatial information
00:05:08 assigned to those assets in order to calculate essentially the bounding shape,
00:05:13 the area where you need to zoom to in order to display that potential outage.
00:05:21 Let's take a look at another quick example on how you could use spatial predicates,
00:05:28 in this case, to generate new data. The idea that I had was,
00:05:33 I have those organizations in my graph, so essentially universities, and they carry
00:05:39 a geolocation, a geotag. What I want to create is a new edge type,
00:05:46 a new graph, essentially, that is only based on spatial proximity.
00:05:53 What I'm doing is, first of all, I'm creating a view which gives me
00:06:00 all my nodes of type organization. That's what I'm doing here.
00:06:05 Then I'm creating a new edge type in my Edges where I'm using the ST_WITHINDISTANCE
predicate
00:06:14 in order to evaluate if a university is within one kilometer of distance
00:06:20 to another university, and if so, I'm creating a new edge in my data.
00:06:26 Essentially, I'm giving it the semantics or the type isCloseBy.
00:06:32 I'm using spatial functionality to create new types of data.
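A sketch of that edge generation, with assumed table, view, and column names; ST_WithinDistance returns 1 when the two geometries lie within the given distance (interpreted in the units of the spatial reference system, typically meters for round-earth data):

    -- Derive isCloseBy edges from spatial proximity of organizations (sketch).
    INSERT INTO "EDGES" ("ID", "SOURCE", "TARGET", "TYPE")
    SELECT a."ID" || '->' || b."ID", a."ID", b."ID", 'isCloseBy'
    FROM "ORG_NODES" a
    JOIN "ORG_NODES" b
      ON a."ID" <> b."ID"
     AND a."LOCATION".ST_WithinDistance(b."LOCATION", 1000) = 1;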
00:06:38 Let me create a workspace on that new data. What that, in the end, does, is
00:06:45 give me a way to analyze my graph in a spatial way.
00:06:55 What I'm doing here is, again, I'm currently using the graph viewer
00:06:59 on that workspace I just created, and searching for Boston just to pick a start node.
00:07:05 I'm picking the Teragram Corporation, which is in Boston, and once I have that node
00:07:10 in my graph viewer, I'm using a custom extension to lay it out in a geospatial way.
00:07:17 I'm picking that location attribute in my graph data to plot that Teragram Corporation on a map.
00:07:25 Now, I'm using the isCloseBy relationship in order to explore the neighborhood
00:07:32 in its true spatial sense. This gives me the neighbors
00:07:35 that are within one kilometer of distance, and plots them on a map.
00:07:39 In a similar way, from there, I can extend my neighborhood reach into the next level,
00:07:45 again using the calculated isCloseBy data, which is being calculated by using
ST_WITHINDISTANCE.
00:07:57 Now that, I hope, is somewhat helpful in giving you some ideas on how you can combine,
00:08:07 let's say, spatial data, spatial processing, and graph processing, to some extent.
00:08:13 Next, let's take a brief look at the capabilities of SAP HANA
00:08:18 which are related to natural language text. In essence, we support full-text search,
00:08:23 and we have already seen a glimpse of it. There are, then, the capabilities
00:08:29 of running text analysis natively in the database. Text analysis gives you, in some form,
00:08:34 a way, for example, to extract salient information like named entities out of natural language text.
00:08:41 Things like organizations, or persons, or currencies, or date information.
00:08:47 Last but not least, there is the text-mining engine that is part of the SAP HANA platform.
00:08:54 That allows you, for example, to understand the similarity of documents
00:08:59 in terms of their content. You can use the text-mining engine,
00:09:03 if you have a given document at hand, to understand which documents are similar.
00:09:08 With regards to full-text search, we support natively a wide variety of languages,
00:09:15 and also binary file formats like PDF documents, or Microsoft Office documents.
00:09:24 What sets us apart in our approach to handle search in HANA is our approach to modeling.
00:09:33 If you're coming from the area of business intelligence or analytics,

00:09:38 you are, of course, very familiar with creating OLAP cubes, creating star schemas,
00:09:43 creating calculation views of some type that feed into an analytical client,
00:09:49 or provide a data model to an analytical client. In a very similar way, we are approaching search
00:09:56 by providing modeling capabilities that allow you to define the structure
00:10:01 of your search data, for example, using your paper information, title and abstract,
00:10:07 and your author master data joining it together. And then, on top, add annotations
00:10:12 that control the search behavior of that model. That model is a pure logical one,
00:10:19 so whenever you change that model, for example, including an additional table,
00:10:25 exposing an additional column for search purposes, it's just a change to a virtual model,
00:10:30 so there is no need to re-index your data or whatever. We support an approach to use SQL
00:10:38 and the Contains predicate for full-text search, but there is also a higher-level approach
00:10:43 using or exposing a built-in procedure called ESH_Search which exposes
00:10:51 a higher-level search API. What we see here depicted on the screen
00:10:58 also is our out-of-the-box search UI, which uses SAPUI5, which is our HTML5 UI framework.
00:11:09 That comes along with SAP HANA, and is a generic user interface
00:11:13 for search purposes that you can use out of the box. A quick look at the distinction between
00:11:22 plain vanilla search and our enterprise search stack. Again, I already mentioned,
00:11:26 if you create a full-text index on a table, on a table column, you can use the SQL CONTAINS
predicate
00:11:33 to run a full-text search on a table column or set of columns in a table.
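A minimal sketch of that plain-vanilla variant on the nodes table; the index name and the fuzziness threshold are arbitrary choices:

    -- Full-text index on the NAME column, then an error-tolerant search with a relevance score.
    CREATE FULLTEXT INDEX "IDX_NODES_NAME" ON "NODES"("NAME") FUZZY SEARCH INDEX ON;

    SELECT "ID", "NAME", "TYPE", SCORE() AS "RELEVANCE"
    FROM "NODES"
    WHERE CONTAINS("NAME", 'boston', FUZZY(0.8))
    ORDER BY "RELEVANCE" DESC;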
00:11:38 Then there is that enterprise search stack which, on top of that, gives you a way
00:11:44 to create models using SQL views to define the structure of your search model,
00:11:50 for example, joining your papers and your author data, and then adding search-specific
annotations on top of that,
00:11:57 for example, controlling which columns contribute to a relevance ranking,
00:12:01 which columns should be searched in a very specific, error-tolerant or fuzzy way,
00:12:06 and which columns should go in which area of the UI, so for example,
00:12:11 which column is the title of a paper, and which other columns should be rendered as a facet.
00:12:16 Finally, there is that out-of-the-box search UI which I've just shown you.
00:12:22 How does that relate to Graph? In most cases, one of the first things
00:12:31 a user does while diving into a graph and exploring a graph in an interactive way
00:12:38 is picking a start node, or a set of start nodes. That is usually facilitated
00:12:43 by using full-text search capabilities. Remember me typing in Boston
00:12:47 in order to select the Teragram Corporation as a starting point for my previous demo?
00:12:53 Search is a very integral part of user interaction that, in the end, might focus on Graph.
00:13:02 The search modeling, the search capabilities in HANA are an integral part.
00:13:08 It's an in-database search engine. Once you have your graph data already persisted,
00:13:14 for example, in a nodes table, there is no need to replicate the data
00:13:19 to, for example, an Elasticsearch, or a Lucene Solr system, or something like that.
00:13:24 You can leverage the full-text search capabilities that are built in HANA.
00:13:31 In that architecture diagram, you see here that, on your primary persistency nodes and edges,
00:13:37 it's a quite easy, straightforward way to provide a search model, for example,
00:13:41 on top of your nodes data, and leverage either the search API
00:13:45 or the out-of-the-box search UI in order to search and identify your data
00:13:51 where it resides in your nodes table. With that also, let's change to a quick demo here.
00:14:02 Just to recap, we have, of course, our nodes table, we have in a previous tab

00:14:09 already created a full-text index, a plain vanilla full-text index
00:14:14 on the name column, which contains the titles, or the author names, or the organization names.
00:14:20 What you need to do as a first step to search-enable that is,
00:14:24 you need to create a view on that nodes data. That can be a very simple projection,
00:14:30 or a quite complex one. Here, I'm exposing, for example,
00:14:33 the geolocation as a GeoJSON. Let me quickly create that view.
00:14:38 It's just a view on top of our nodes data. Let's see what's in that view.
00:14:44 I'm just projecting the ID, the name, the type, and the year, and the geolocation
00:14:50 in case there is one. That's, let's say, the structural search model
00:14:55 that I'm working with. One way to add search-specific annotations
00:14:59 to do search modeling on that structure is by using the built-in function ESH_Config.
00:15:08 It takes a CDS, core data services-like syntax to annotate that data structure.
00:15:14 For example, I'm creating a search model on top of that view that I just created.
00:15:21 Next, I'm saying, okay, there is a column called Name, and it should be the default search
element,
00:15:27 and it should go into some UI area specifically. Then, for type, for example,
00:15:34 I'm saying, okay, it should be rendered as a facet. That means, in my search result,
00:15:38 the values in the Type column will be aggregated, and a count will be calculated.
00:15:44 This is the core idea of building a search model. Let me quickly run that.
00:15:55 We now created that search model. Empty response means that the search model
00:15:59 has been created successfully. With that, you can leverage the out-of-the-box search UI.
00:16:04 This is it. I've run a search for "hub and spoke"
00:16:07 and found my Hub and Spoke Paradigm paper. If you are searching for "spoke", for example,
00:16:13 you will find a lot more. Here, you see that we are using fuzzy search capabilities,
00:16:18 because I also found "spoken" in there. For sure, I find my Fred Richardson here.
00:16:29 It should be somewhere down here. Here is Fred Richardson.
00:16:32 Finally, I can search for "boston" in order to understand which organizations
00:16:38 contain the keyword Boston. Let me filter in that facet by that organization.
00:16:45 From here, you can then also switch to an alternative display mode,
00:16:49 bringing the search results on a map because my organizations carry a geolocation tag.
00:17:01 The search capabilities in HANA offer you a very nice, very integrated way
00:17:08 to index and expose your graph data for full-text search purposes.
00:17:14 That allows you to create custom applications that really jump back and forth
00:17:19 between graph processing, interactive exploration, and full-text search in order to understand
00:17:26 and explore your data. Quickly on text analysis, again,
00:17:32 it's very much about extracting salient information from text,
00:17:35 so think about entities like persons or organizations. SAP HANA provides some additional
configurations,
00:17:43 so there is plain vanilla linguistic analysis which gives you word stems
00:17:47 and part of speech tags assigned to it. There is a very sophisticated configuration
00:17:53 for English language which lets you evaluate the grammatical role,
00:17:59 essentially gives you subject, predicate, object triples extracted out of natural language text.
00:18:05 Think about a sentence like, "SAP acquired BusinessObjects."
00:18:09 You have that subject, predicate, object structure which grammatical role analysis
00:18:15 allows you to extract out of text. Of course, it is extensible in terms of,
00:18:20 you can adapt it to your own specific domains. Or if you're working in healthcare,

00:18:24 one of the first things that you might want to do is bring in additional dictionaries
00:18:29 in order to extract things like drugs, or diseases, or symptoms, and things like that.
00:18:35 Last but not least, there is the possibility to add extensions in terms of custom rules
00:18:40 in order to detect custom facts. Things like merger and acquisition events,
00:18:45 "SAP acquired BusinessObjects for..." I don't know, "... one million."
00:18:51 Then you could formulate a pattern, a linguistically aware regular expression, so to speak,
00:18:57 that extracts these kinds of factual patterns out of natural language text.
00:19:03 As for SAP, we are very much active in the healthcare sector,
00:19:07 where we actually provide specific applications that leverage text analysis capabilities
00:19:13 to understand the content of doctors' letters, for example. In the end, what are customers doing
00:19:21 using text analysis? One thing that you can do, for example, is,
00:19:26 if you extracted entities of type, let's say, chemical substance or something like that,
00:19:32 you can use that extracted entity type as a facet in a search UI for filter and drilldown purposes.
00:19:40 You can start implementing your own calculations, doing analytics, so to speak.
00:19:46 For example, looking at entity co-occurrence. You might be able to evaluate which adjectives
00:19:53 go along with the term "Chancellor Angela Merkel", for example.
00:19:58 There are also more complex, more powerful data mining, text mining,
00:20:03 algorithms available in SAP HANA, for example, for topic modeling,
00:20:08 and text analysis results are a very valid input for these types of higher-level text mining
algorithms.
00:20:19 Finally, we see that, once you extracted location entities by using the out-of-the-box
00:20:25 text analysis capabilities to understand cities, or countries, or stuff like that,
00:20:30 you can add a geotagging, geocoding step to it, and for example, understand which geographic
locations
00:20:39 certain news articles relate to. Thinking about, "Chancellor Merkel visits
00:20:45 Vladimir Putin in Moscow," things like that. You understand Moscow is a city.
00:20:50 You can use geocoding to assign a geotag, and then finally plot that news article onto a map.
00:20:57 Last but not least, I mentioned that a triple structure of subject, predicate, object,
00:21:03 that grammatical role analysis will provide on English language, that, of course,
00:21:10 is a data structure that you can store in a graph-like manner for exploration purposes.
00:21:17 Then, for example, understand from news articles, what does Chancellor Angela Merkel
00:21:25 actually do all the time? You're exploring these triples,
00:21:28 you're exploring the relationships, and finally can go into areas like reasoning or inferencing.
00:21:36 That concludes Unit 7. That essentially also is the end of this course,
00:21:44 Analyzing Connected Data with SAP HANA Graph. Thank you for taking the time and watching it.
00:21:52 I hope it was valuable information to you and gives you some ideas on
00:21:58 how you can leverage SAP HANA and especially Graph in a meaningful manner to add, in the
end, business value.
00:22:07 Thanks again for your time and your patience. See you the next time.
00:22:14 I wish you good luck with your assignments. Bye bye.
