Getting the best out of Azure Cosmos DB’s scale-out Graph API (a.k.a. partitioned graph containers)

How graphs are distributed with Azure Cosmos DB’s Gremlin API, and how to use that information when designing a scale-out graph application.

Jayanta Mondal
9 min read · Feb 18, 2019

In this article, we will look at one such graph database platform, Azure Cosmos DB’s Gremlin API (a distributed, geo-replicated, fully managed graph database), and at how understanding its operational philosophy can help us make informed design choices while building a scalable graph application.

What we look to cover: We will start with an overview of Cosmos DB partitioned containers, followed by how graphs are distributed and how that affects the cost of various graph operations. Finally, we will look at some important design considerations curated from a set of practical use-cases.

What we will not attempt to cover: For the article to be concise and self-contained, we won’t directly get into the topic of how to model data as graphs, or how to pick a partition key for a graph data-set. The goal of this article is to make developers/architects aware of the various decision points and design alternatives so that they can make the best choices for their application scenarios.

Overview of Cosmos DB’s partitioned container

Cosmos DB’s scale-out story revolves around its partitioned containers. A partitioned container can be seen as a collection of distributed container-lets (a.k.a. physical partitions). At the time of creating a partitioned container, one can select a specific property of the data tuples to be the basis of data placement/distribution among the container-lets. For example, let’s say a recruiting firm is trying to ingest data about a list of potential candidates, and the candidates’ current employer was chosen as the property to distribute the data. What this means is that all the candidates with the same current employer (say Microsoft) will be placed in a single container-let. The property we selected (i.e., current employer) for data distribution is termed the partition key of the partitioned container, and all data that has the same partition key is said to form a logical partition. Data belonging to the same logical partition is guaranteed to be collocated within a single physical partition.
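To make this concrete, here is a minimal sketch (using the Gremlin add-vertex syntax covered later in Q1) of two candidate vertices that share the employer 'Microsoft' and therefore land in the same logical partition; the label 'candidate' and the ids 'c1' and 'c2' are illustrative:

g.addV('candidate').property('id', 'c1').property('employer', 'Microsoft')
g.addV('candidate').property('id', 'c2').property('employer', 'Microsoft')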

  1. As an application developer, however, one doesn’t know if two logical partitions are collocated in the same physical partition. In fact, this information is not static, as, over time, when more data is added, Cosmos DB can add more physical partitions and redistribute the logical partitions.
  2. In Cosmos DB, there is a size limit on how large a logical partition can get. The limit is currently set at 10 GB. A logical partition also defines the transaction scope of Cosmos DB’s stored procedures.

If you wish to get more details, you can follow Cosmos DB’s official documentation on this topic.

How are graphs distributed?

Given the overview, let’s look at how graphs are distributed in Azure Cosmos DB. The assumption is that users have created a graph data model following the TinkerPop property graph standard, and have a partition key in mind. Here are some tips for picking the partition key, in general. However, I would like to urge readers to revisit their partition key choice after we understand the data distribution mechanism and its influence on the overall design.

We will use the above diagram as a running example to understand the distribution mechanism. Additionally, we will adopt a question-and-answer format for this discussion to give it a more interactive feel. As a part of the discussion, we will be writing a few Gremlin queries, and it may be beneficial to read the second section of this article, which outlines the details of the Gremlin execution model. However, even without that, the discussion should be easy to follow.

(Q1) How to add a vertex to a partitioned graph?

Vertices are added by specifying a mandatory partition key property. Below is the Gremlin query for adding vertex V1 (from the example), assuming that the partition key property of the partitioned graph is “/employer”:

g.addV('vertexlabel').property('id', 'v1').property('employer', 'Microsoft').property('prop1', 'propvalue1')

Note that the leading “/”, which indicates the root of the partition key path, needs to be skipped while specifying the partition key property during the add-vertex step. Here the partition key is /employer, and the logical partition that this particular vertex belongs to is Microsoft. Note that you can add multiple vertices under the same logical partition key value (i.e., Microsoft). The 10 GB limit applies to the total amount of data (all vertices, plus all edges originating from those vertices) that can reside under the Microsoft logical partition. How edges are stored is described in Q3.
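For instance, a second vertex can be added under the same Microsoft logical partition; the sketch below mirrors the query above, with the id 'v4' and the property value chosen purely for illustration:

g.addV('vertexlabel').property('id', 'v4').property('employer', 'Microsoft').property('prop1', 'propvalue2')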

(Q2) Can multiple vertices have the same partition key value?

Yes, multiple vertices can have the same partition key property value. In fact, two vertices can have the same ‘id’ as well. It is the (partition key, id) pair that makes each vertex unique. In the example diagram above, vertices V1 and V3 have the same partition key and reside within the same logical partition ‘pkX’.
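As a small sketch (assuming the graph's partition key path is /pk, as in the remaining examples), the following two statements create two distinct vertices even though they share the id 'v9'; the label 'person' and the id are illustrative:

g.addV('person').property('id', 'v9').property('pk', 'pkX')
g.addV('person').property('id', 'v9').property('pk', 'pkY')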

(Q3) How are edges distributed in a partitioned collection?

In a Cosmos DB graph, edge information (including all the properties defined on an edge) is stored in the same logical partition as the source vertex. When we add the edge E12 from V1 to V2, the edge automatically inherits the partition key of the source vertex V1:

g.V('V1').has('employer', 'Microsoft').addE('knows').to(g.V('V2').has('employer', 'Walmart'))

(Q4) How to efficiently fetch a vertex from a partitioned collection?

The standard way to fetch a vertex using Gremlin is:

g.V('V1')

g.V(‘V1’) simply fetches a vertex that has id ‘V1’. However, this is not the most efficient way in a Cosmos DB partitioned graph. The reason is that it’s not possible to uniquely identify the logical partition of the vertex by its id alone. Trying to look up a vertex by its id alone would require scanning all partitions, which is inefficient.

However, if the partition key value is provided along with the vertex id, the search becomes an efficient lookup within a logical partition.

Any of the following variations of the query will be efficient:

1. g.V('V1').has('pk', 'pkX')
2. g.V().has('id', 'V1').has('pk', 'pkX')
3. g.withStrategies(PartitionStrategy.build().partitionKey('pk').readPartitions('pkX').create()).V('V1')

Here, while the first two queries are easy to understand, the 3rd query requires a bit of an explanation. In effect, the 3rd query executes a simple “g.V(‘V1’)”, but with a partition strategy. In this example, the partition strategy instructs the execution engine to limit its search to the logical partition represented by the partition key value ‘pkX’. We will see later how such a partition strategy can come in handy for more complex Gremlin queries.
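As a quick illustration, the same strategy can scope a lookup of several vertices to one partition. Since V1 and V3 both live in the logical partition 'pkX' (see Q2), the following sketch fetches both with a single partition-bound query:

g.withStrategies(PartitionStrategy.build().partitionKey('pk').readPartitions('pkX').create()).V('V1', 'V3')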

(Q5) What’s the most efficient way to fetch the out edges of a vertex?

The Gremlin step to fetch the outgoing edges of a vertex is outE(). Now, since out-edge information is stored in the same logical partition as the source vertex, outE() steps are always efficient. Once we have fetched the source vertex, we know precisely which logical partition the out-edge information belongs to.

In that context, the efficiency of the outE() steps in isolation is the same for both the queries below:

1. g.V('V1').outE()
2. g.V('V1').has('pk', 'pkX').outE()

However, fetching the source vertex, before we execute outE(), is more efficient in the 2nd case.
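The efficient form also extends naturally to a full hop. Here is a sketch that filters the out-edges by the 'knows' label from the Q3 example and then moves on to the destination vertices:

g.V('V1').has('pk', 'pkX').outE('knows').inV()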

(Q6) Are inE() queries equally efficient?

The answer is no. inE() is the Gremlin step to find all the incoming edges of a vertex. Since edge information is not stored with the destination vertex, an inE() step needs to fan out to all partitions in search of the source vertices that hold those edges.

For cost purposes (both latency and RUs), it is recommended to avoid the inE() steps. A few workarounds are:

A. Changing the data model in such a way that edges are always traversed from the source vertices, thereby eliminating all inE() queries.

B. If it is intended to traverse an edge both from its source and destination vertices (perhaps for two different use-cases), one can add two directionally opposite edges, as sketched below. This allows us to rewrite all inE() queries as outE() queries, making them efficient to execute. Typically, this is a good pattern when the edges are read-heavy and not frequently updated.
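Here is a minimal sketch of this pattern for the running example, adding the forward edge E12 and a reverse counterpart E21; the labels 'knows' and 'knownBy' are illustrative:

g.V('V1').has('pk', 'pkX').addE('knows').to(g.V('V2').has('pk', 'pkY'))
g.V('V2').has('pk', 'pkY').addE('knownBy').to(g.V('V1').has('pk', 'pkX'))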

In this context, the preferred path to go to vertex V1 from V2 should be via E21 using:

g.V('V2').has('pk', 'pkY').outE().inV()

instead of via E12:

g.V('V2').has('pk', 'pkY').inE().outV()

C. Finally, if the data model is such that both the source and destination vertices belong to the same logical partition, it is guaranteed that the incoming edge information (for an inE() step) will be in the same logical partition as the destination vertex (also the logical partition of the source vertex). In such a case, it’s safe to use inE() in conjunction with an appropriate partition strategy, so that the execution engine precisely knows which logical partition to restrict the query to.
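In the running example, V1 and V3 share the logical partition 'pkX', so V3's incoming edge E13 is guaranteed to be stored there as well. A sketch of an inE() restricted accordingly:

g.withStrategies(PartitionStrategy.build().partitionKey('pk').readPartitions('pkX').create()).V('V3').inE()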

(Q7) Is going from vertex V1 to V3 more efficient than going from V1 to V2?

The answer here is ‘no’, even though V1 and V3 live in the same logical partition, while V2 does not. Let’s look at the corresponding traversal queries (assuming E12 and E13 have labels ‘e12’ and ‘e13’ respectively):

g.V('V1').has('pk', 'pkX').outE('e13').inV()
g.V('V1').has('pk', 'pkX').outE('e12').inV()

The queries perform exactly the same kind of work until the last inV(). In fact, both queries perform the same work for the last inV() as well, which is looking up a vertex by its id and partition key. At that point, it doesn’t matter whether the vertex we are looking for is in logical partition ‘pkX’ or ‘pkY’; both are lookups, and their performance is the same irrespective of which logical partition the vertex belongs to.

Design Considerations

Now that we have a better understanding of how a Cosmos DB graph is distributed, let’s look at some of the design considerations that are good to be aware of:

Logical partitions include vertices and all their outgoing edges

This is important to keep in mind while picking the partition key of the graph collection. We often focus on partitioning the vertices evenly while ignoring the edges. However, since edges reside in the same logical partition as their source vertex, and the size of a logical partition is limited to 10 GB, the edge distribution of the vertices needs to be taken into account during partition key selection.
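A rough way to gauge how much edge data a particular logical partition carries is to count the out-edges stored under its partition key value; the sketch below (assuming the partition key property 'pk') counts edges rather than bytes, but it can flag heavily skewed partitions:

g.V().has('pk', 'pkX').outE().count()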

Fine-grained partition key can be a good thing

We have seen in Q7 that the performance of a graph traversal doesn’t suffer even if the traversal needs to hop from one logical partition to another, as long as we always traverse outgoing edges. This gives us freedom to choose a more fine-grained partition key that focuses on evenly distributing data, as opposed to combining data into one logical partition in an attempt to collocate it for query execution.
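As a hypothetical sketch, instead of partitioning candidates only by employer, a finer-grained value such as employer plus team spreads the data across more logical partitions; the composite value format is purely illustrative:

g.addV('candidate').property('id', 'c1').property('pk', 'Microsoft')             // coarse: all Microsoft data shares one logical partition
g.addV('candidate').property('id', 'c2').property('pk', 'Microsoft:Azure-Data')  // finer-grained: one logical partition per employer and team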

A partition strategy may come in handy during data modelling

A partition strategy, as discussed in Q4, can be useful in simplifying the data model, especially when the graph consists of a large number of small disconnected graphs. We can ensure that all the vertices within the same connected graph live within a single logical partition by designing them to have the same partition key. If that’s the case, we can use a partition strategy to limit inE() queries to a single logical partition, making them as efficient as outE() queries. This avoids the need to remodel the graph or rewrite all queries using only outE().
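For example, if every vertex of one small disconnected graph carries the same partition key value (say 'graph-42', an illustrative value), an inE() on any of its vertices can be confined to that one partition; a sketch:

g.withStrategies(PartitionStrategy.build().partitionKey('pk').readPartitions('graph-42').create()).V('V7').inE()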

Dropping vertices in bulk can be expensive

Dropping a vertex with TinkerPop involves dropping all its outgoing and incoming edges. This can be an expensive affair, as finding the incoming edges requires a fan-out to all partitions. The problem is aggravated when many vertices are deleted at the same time.

A partition strategy can be useful here (see the sketch after this list), especially if:

  1. The vertices being deleted don’t have any incoming edges, or
  2. It is known that all the incoming edges belong to the same logical partition as the to-be-deleted vertex.
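Here is a sketch of such a scoped drop for the running example. Assuming (per the example diagram) that V3’s incoming edges all originate from V1, which lives in the same logical partition 'pkX', the strategy can safely restrict the drop to that partition:

g.withStrategies(PartitionStrategy.build().partitionKey('pk').readPartitions('pkX').create()).V('V3').drop()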

If the application requires deleting a large number of vertices with incoming edges in other partitions, consider:

  1. Deleting the entire graph and recreating it with the remaining vertices and edges.
  2. Using the BulkExecutor tool to delete vertices and edges.

Large ids and partition keys can cause a performance dip

It is important to pay attention to the size of the id and the partition key. Since vertices and edges are looked up by id and partition key, a query that fetches a lot of them has to send those values over the wire repeatedly; the larger they are, the more data is transferred and the more time is spent on the wire. We want this to be as streamlined and efficient as possible.


Jayanta Mondal

These opinions are my own and not the views of my employer (Microsoft).