Analyzing and improving the performance of Azure Cosmos DB Gremlin queries
Gremlin is one of the most popular query languages for exploring and analyzing data modeled as property graphs. There are many graph-database vendors out there that support Gremlin as their query language, but in this article, we will focus on Azure Cosmos DB which is one of the world’s first self-managed, geo-distributed, multi-master capable graph databases.
To set the expectation, this article is not aimed at teaching Gremlin, rather it should be seen as a self-help article. In that sense, the ultimate goal is to explain the basic constructs and the execution model of gremlin queries, and how understanding those basics can empower one to debug, analyze, and improve Gremlin queries against Azure Cosmos DB.
Now many of you may wonder, and rightly so, that why should one care about the execution model while writing a query. After all, we don’t typically do so while writing a SQL query. The revelation lies in the fact that Gremlin is very different from declarative query languages. With declarative query languages, one defines the intent of the query, and the database tries to pick the best execution plan. But, with Gremlin, which is a functional, data-flow language, we actually lay out the execution plan while writing the query. In simple terms, we are indirectly recommending the database with a sequence of steps to execute. Because of these reasons, improving the performance of a Cosmos DB Gremlin query will be hinged on understanding:
1. The nature of the implicit execution flow that’s embedded in a Gremlin query.
2. How the execution flow would be carried out by the implementation (i.e., by the underlying database, and in this case Azure Cosmos DB)?
3. How can one take feedback from the actual execution (in this case, Azure Cosmos DB Execution profile) to restructure their original queries and/or change the data model for better performance?
PS: So far, I have been vague about the quantification of “performance”. If you are familiar with Azure Cosmos DB, a bad performing query typically means two things, (1) the latency of the query is high, (2) the query costs more request units (RU). In this article, I will use a combination of both the measures as my definition of performance. The reasoning is that, for graph queries, these two measures are strongly correlated for most of the cases. The latency of a query is high, mostly due to the sheer number of vertices and edges that it needs to explore during the execution. And more the number of vertices and edges a query explores, the more will be the RU usage. Now, let's call the vertices and edges accessed during a query as the footprint of the query. With that definition, our goal in this article would be to understand how one can reduce the footprint of a query to improve its performance.
Another point I want to clarify is the need to “change the data model for better performance”. This may seem a bit counter-intuitive but it is a very common practice in databases. Changing a graph data model is very akin to data normalization/denormalization in the relational realm. So, my argument here is that one should really be ready, in fact, I would raise the stake even higher and say that one should be proactive in changing the data model in order to achieve better performance. We will see a couple of examples of this, but in the end, the data model designer will need to rely on their creativity and the knowledge of the workload to come up with the optimal data model.
A Gremlin query is really an execution plan in disguise
A gremlin query is a chain of gremlin steps stringed together sequentially (There are some exceptions, but those are not important for the following discussion). Moving forward, in general, a Gremlin query looks like the following:
g.step1().step2().step3()………………..stepN()
Here, ‘g’ refers to the graph, and the Steps() define operations (TinkerPop community calls ‘g’ the Graph Traversal Source, and the chain of steps the Traversal). Each step can be thought as a tiny Map-Reduce style operator. They take a set of input and convert them to a set of output. The conversion logic is dictated by the semantics of the steps. In most scenarios there are either Vertices, Edges, Properties, or some scalar values. However, in some advanced cases, they can be complex objects as well.
By now, I hope that you have already started to realize how a gremlin query effectively lays out an execution flow. Let’s take couple of examples to understand that in a bit more detail.
Example 1: g.V().out().out()
This query has three steps: V() followed by out() followed by another out(). The V() step takes a graph as input and produces a list of vertices as output, which in this case is the list of all vertices. The out() step takes a set of vertices as input and produces another list/set of vertices as output. Below is tabular presentation of input and output for at the end of each of the steps.
The input and output of the V() step is straightforward. However, the first out() step may look a little tricky, as the output contains two instances of V4. The explanation is that V4 is reachable from both V1 and V2 via out edges. Execution wise, the algorithm that each of the above steps follows are:
1. For each input element compute the corresponding output element. Please see the document here to understand the semantics of different gremlin steps.
2. Combine the output for all the input into a list. Note that the final output is simply a set, and not a multi-set. If it were a multi-set, the output of the first out() would have been {{V3, V4}, {V4}, {}, {V5}, {}}. Note that, thinking about this multi-set makes it more evident why there are two V4s.
How did Cosmos DB act on the implicit execution plan embedded in the Gremlin query
Figure 1 shows the Cosmos Db execution plan for the query. The exact gremlin query to get the execution plan at line shown in the top line. You can get the execution profile from Azure portal or through your favorite Gremlin client. You will see that, the actual execution profile has lot more information, but to keep the discussion focused, I have extracted the information that’s needed for the “footprint analysis”.
The execution profile has been color coded to show how Cosmos DB has executed each of the steps in the Gremlin query. An interesting thing to note here is that the out() step has been phased into two steps GetEdges and GetNeighborVertices. This makes sense, since to get to the outgoing neighbors of a vertex, you need to go through the edges. At this point, it’s easy to see how this matches with the output Column of Table 1.
In that the context the Gremlin query “g.V().outE().inV().outE().inV()” should have the same exact execution profile. However, you will see a different one if you look up the Cosmos DB execution profile. In this case you will see that “V().outE()” has been executed using the “FetchEdgeOperator”. The reason is that Cosmos DB was able to understand that “V().outE()” effectively computes all edges of the graph, and used a different operator to fetch edges from a different index, avoiding the need to access the vertices first. It’s easy to see that how this has helped to reduce the footprint of the query.
Takeaway 1: Whenever there is a Gremlin step that you want to use, please understand the semantics of the step. To be specific, please understand the input and output characteristic of the step. One of the best way to do this is to create a toy graph and write a minimal query to understand its characteristics.
Takeaway 2: Use Cosmos DB execution profile to check if the footprint of the query is in line with the expectation. Note that there is no clear definition of expectation here. It really depends on the algorithm of the graph exploration task. As long as the current footprint is higher than a reasonable expectation there is always a possibility of improving the query, either by changing the query or by changing the data model.
How does one rewrite queries or change the data model for better performance
Needless to say, this is perhaps the trickiest part to discuss. This is not surprising considering the large number of gremlin steps there are, how arbitrarily nested they can be, and how different combination of steps can be analyzed and optimized by Cosmos DB. Even though I can’t give a general algorithm for rewriting Gremlin, what I will do is to discuss a list of cases that I have encountered in the past. And I strongly feel that these cases, along with the discussion in the previous two paragraphs will help one to adopt a systematic approach to tackle low-performing queries.
Before discussing the cases, I would like to mention some of the issues that are orthogonal and can lead low performance. Please take care to ensure that the following checks for these issues are in place.
Check 1: The Gremlin client and the Azure Cosmos DB are in the same region. This is to eliminate any possibility of network delays interfering with the query latency. In my experience, with everything else unchanged, the best performance can be achieved by placing the Gremlin client in an Azure VM that’s located in the same data center as the Azure Cosmos DB instance itself.
Check 2: It is recommended to create a small pool of Websocket connections and use them to make requests (note that different gremlin clients may implement connection pool differently). Cosmos DB Gremlin server talks over WebSocket which has a set -up cost due to authentication and other protocol-related handshakes. You don’t want to create a new WebSocket connection every time pay the set-up cost for each query.
Check 3: Please run the same query multiple times and check if all of them show high latency, as opposed to something like the 95th percentile latency being high. The first instance of a query may be slow due to the setup cost discussed above. The 95th latency may be high due to load variability in the Gremlin server. Rewriting the queries won’t help in this situation. A quick experiment that I highly recommend is to check for the “average latency of accessing a vertex by its id (g.V().has(‘id’, ‘idValue’).has(‘pk’, ‘pkValue’))”. Given Cosmos DB SLA, this number should be below 10 ms. If the number is higher, that means that the client set up is not quite right.
Check 4: Please understand how an unlimited collection (aka partitioned collection) influences Gremlin queries. The key aspect to remember is to provide partition keys in Gremlin queries while accessing a vertex/edge, whenever possible. Otherwise, a query may need to access multiple partitions, and thereby impacting performance. Cosmos DB execution profile (see Figure 4) can tell you, if a gremlin operator indeed visited multiple partitions.
Case 1: Apply filters early and aggressively
Given the graph in Figure 6, let’s say the goal is to find out “all the Engineers who are two-hop away from V1 in the reporting chain”.
Corresponding Gremlin query: g.V().has(‘id’, ‘V1’).out().out().has(‘label’, ‘Engineer’)
Even though the Gremlin query does the job, the footprint (8 vertices and 7 edges) of the query is rather high.
But, you can do a lot better, if you know that “an engineer can only report to a Manager and not to an HR manager”, and use that information to reduce the foot print. The footprint of the new query (see below) is much lower (6 vertices and 5 edges).
g.V().has(‘id’, ‘V1’).out().has(‘label’, ‘Manager’).out().has(‘label’, ‘Engineer’)
The takeaways here are:
- you should really squint at your data and apply any domain expertise that you may have to try and cut down on the number vertices and edges that needs to be explored.
- Apply as many filters as possible to narrow down the search space.
- Consider traversing the graph from reverse direction if that helps reducing the footprint.
Case 2: Label edges if possible
In fact, the query discussed in Case 1, can be optimized even further in terms of footprint by using the domain knowledge discussed before and by re-labeling the edges (see Figure 7).
g.V().has(‘id’, ‘V1’).out(“RB_Manager”).out(“RB_Engineer”)
The footprint of this query is the lowest (4 vertices and 3 edges). And as you can see the idea here is to traverse only the edges that are relevant for the query. Note that, this is case where you are changing the data model to improve query performance.
Case 3: Try to not visit the same vertices multiple times
Let’s say you want to compute all the reports of the vertex V1, irrespective of whether they are 1-hop away or 2-hop away (Graph in Figure 7).
One possible gremlin query : g.V().has(‘id’, ‘V1’).union(out(), out().out())
If you do the footprint analysis of this query, you will find that the footprint is 10 vertices and 9 edges. The reason is that V3 and V4 will be visited twice. Agreed, that the database engine could have optimized this internally, but it’s often hard for the databases to optimize these, especially when the steps are nested. Moreover, this simple case may look to be an obvious candidate for optimization, but in reality with more predicates things could be rather complicated.
The above mentioned issue aside, as a power user you can do much better. Below is a variation of the Gremlin query that reduces the footprint.
g.V().has(‘id’, ‘V1’).out().as(‘one-hop-reports’).union(select(‘one-hop-reports’), select(‘one-hop-reports’).out())
The idea here is to save the one-hop reports under a label using the As step, and then reuse the results whenever necessary. The As step instructs the engine to not revisit those vertices again, and reuse them after their first access.
Case 4: You don’t always have to reach a neighbor through an edge
This is an interesting case. Given the graph in Figure 8, let’s say the goal is to find “all the reports of vertex V1”. A simple Gremlin query would be:
g.V().has(‘id’, ‘V1’).out()
The query has a footprint of 4 vertices and 3 edges.
However, the fact that all the reports of vertex V1 has their manager’s id stored as a property, we don’t need traverse all the edges to reach team. In fact, we don’t even need to access the vertex V1. In that context, the new gremlin query would be:
g.V().has(‘manager’, ‘V1’)
The new query has a footprint of just 3 vertices. The key takeaway here are:
- Please consider whether you really want a physical edge between two vertices, or you want to implicitly design that by adding vertex properties. If you compare it with the relational model, this is nothing but data denormalization. We are sacrificing some extra storage to achieve better query performance. Note that this might require additional checks, depending on your use-cases. For example, you may want to check if the vertex V1, actually exists in the graph, and so on.
- In general, if you notice that you have large number of queries with footprint N vertices and N-1 edges, there is a high chance that this pattern can come in handy.
- You should also consider a hybrid model, where you add explicit edges for certain type of relationships, while the rest can be implicit.
I will stop here today, and will add more cases as I remember or encounter them. Please let me know in the comments, if you have come across any such challenges and solved them with intelligent data modeling and query restructuring.
Case 5: Rewrite queries so that the scope of a predicate is as narrow as possible
The query below is trying to find out all the vertices that doesn’t have any out edges with label ‘label_value’. However, since the entire condition outE(‘label_value’) is inside a not step(), it’s rather hard to optimize this. This is because the condition inside the not() step can be arbitrarily complex, and it’s cumbersome for the optimizer to enumerate all such cases and pick the best plan.
g.V().not(outE(‘label_value’))
Instead, if you rewrite the query above so that the not() predicate is applicable to a narrower scope, you can achieve better performance.
rewrite: g.V().as(‘nodes’).outE().has(‘label’, neq(‘label_value’)).select(‘nodes’)