Cosmos DB Graph/Gremlin API: Thoughts about transaction support

Jayanta Mondal
Feb 11, 2019

It is often necessary to execute multiple writes via the Gremlin API as an atomic unit (aka transaction support). One example of such a use case is adding a new vertex together with an edge to/from an existing vertex. Another common case is blocking any edge addition to a vertex that is being deleted. In this article, we will look at these examples and explore workarounds in the absence of transaction support on Cosmos DB's end.

Case 1: Adding a new vertex and an edge from that vertex as an atomic unit.

To make the example a bit more concrete, let's say the task at hand is: add a 'tweet' vertex and create an edge between the tweet and the 'user' vertex that created it. The application semantics here are that a tweet vertex must always be connected to the user who created it.

The standard way to express this in Gremlin is the following:

g.addV('tweet')
  .property('id', 'tweetId')
  .property('pk', 'pkValue2')
  .property('content', 'blah')
  .addE('created by')
  .property('id', 'edgeId')
  .to(
    g.V('userId').has('pk', 'pkValue1')
  )

This query, however, suffers from the fact that:

😒 The query can fail after adding the tweet vertex (say, due to a 429, aka RequestRateTooLargeException), breaking the desired application semantics. Note that individual writes are atomic and durable, so even if the query fails after adding the vertex, the vertex is guaranteed to be present.

😒 The query is not retriable, i.e., if the query fails after adding the tweet vertex, retrying the query will be met with an error saying that a 'tweet' vertex with the given id is already present.

Idempotency to the rescue

Now, the key to getting around the problem is to make use of the expressive power of Gremlin and convert the query into an idempotent one, so that the query is infinitely retriable without any side effects. Below is pseudocode describing how we can convert the Gremlin query above into an idempotent form.

The idea is really simple: check whether the vertex/edge exists before trying to create it, and if it is present, use it for creating the edge. Gremlin's coalesce(A, ..) step provides an easy way to express such semantics. The coalesce step takes a list of sub-traversals (aka Gremlin snippets) as parameters, executes them in order, and picks the result of the first sub-traversal that yields any result, ignoring the rest.
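
As a minimal illustration of the pattern (the id, partition key, label, and property values below are just placeholders), the following get-or-create sketch returns the 'tweet' vertex if it already exists, and creates it otherwise:

g.V('tweetId').has('pk', 'pkValue2')
  .fold()
  .coalesce(
    unfold(),
    addV('tweet')
      .property('id', 'tweetId')
      .property('pk', 'pkValue2'))

The fold() step is what makes this work as a standalone query: it always emits exactly one traverser (a list, possibly empty), so coalesce() is always evaluated, and the addV() branch runs only when unfold() produces nothing.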

Pseudocode: executing multiple Gremlin writes as a unit.

while(true)
{
  try {
    g.V('userId').has('pk', 'pkValue1').as('creator')
      .coalesce(
        g.V('tweetId').has('pk', 'pkValue2'),
        addV('tweet')
          .property('id', 'tweetId')
          .property('pk', 'pkValue2')
          .property('content', 'blah')
      ).as('tweet')
      .coalesce(
        g.E().has('id', 'edgeId').has('pk', 'pkValue2'),
        addE('created by')
          .property('id', 'edgeId')
          .to('creator')
      )
    // break out of the loop on success
  }
  catch() {
    // give control back to the while loop for retry
    // or, break if the allowable number of retries has been exceeded
  }
}

Now a few things to keep in mind while adopting this pattern:

  1. Note that the addE() part of the query doesn't necessarily need to be wrapped inside a coalesce; however, it is good practice to do so, as it avoids failures in case of accidental retries.
  2. This is definitely more expensive than the standard version of the query, as we will always be doing a read to check for the existence of the 'tweet' vertex and the 'created by' edge, even when the graph has sufficient RUs provisioned and the query is not expected to fail. However, this is a fair cost to pay: if the underlying system offered a transactional guarantee across multiple writes, it would have to do the same amount of work or more, making the query expensive anyway.
  3. Note that this is not equivalent to full transactional support, simply because the client machine can fail before it successfully retries a failed query (i.e., one that failed after adding the 'tweet' vertex), breaking the application semantics. In fact, the server may also become unavailable. However, the application semantics can be restored if the client is able to retry the query once it is back in action. Also, such failures are expected to be rare.

Now, it's possible that a system can't tolerate such an inconsistency, even for a short period of time. Let's say that newly created tweets are shown on a UI, and the UI is considered broken without the 'created by' edge. One pattern that could be useful here is to assume that a tweet has a state indicating whether it's active or inactive. With that assumption, we set the tweet vertex's status to 'active' only once the 'created by' edge has been added successfully. This can be achieved by appending the step below to the query above and retrying the query until it succeeds (a sketch of the full combined query appears after point 4 below).

.select('tweet').property('status', 'active')

  4. Note that the retriability depends on the fact that in Cosmos DB the pair {partitionKey, id} uniquely identifies a vertex or an edge.
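
Putting points 3 and 4 together, a sketch of the full idempotent query with the status handling appended might look like the following. It mirrors the pseudocode above; the ids, partition keys, and property values are placeholders, and the tweet is created as 'inactive' so that it only becomes 'active' once the edge exists:

g.V('userId').has('pk', 'pkValue1').as('creator')
  .coalesce(
    g.V('tweetId').has('pk', 'pkValue2'),
    addV('tweet')
      .property('id', 'tweetId')
      .property('pk', 'pkValue2')
      .property('content', 'blah')
      .property('status', 'inactive')
  ).as('tweet')
  .coalesce(
    g.E().has('id', 'edgeId').has('pk', 'pkValue2'),
    addE('created by')
      .property('id', 'edgeId')
      .to('creator')
  )
  .select('tweet').property('status', 'active')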

Case 2: If a vertex is being deleted, prevent any new edge from being added to that vertex.

Since Cosmos DB doesn't provide a transactional guarantee for the scope of a Gremlin query, it is possible that concurrently (a) deleting a vertex and (b) adding an edge to the same vertex may lead to 'orphaned edges'.

Specifically, when a drop() step is issued against a vertex, Cosmos DB first tries to delete all the outgoing/incoming edges from/to the vertex, and then deletes the vertex itself. This is done to make sure that drop() operations are retriable in case of a server crash or the request hitting a resource limit at the server (aka a Cosmos DB 429). Now, since the drop() operation may take time, especially if the vertex has a high indegree or outdegree, a new edge addition to that vertex can succeed in the meantime and eventually become an orphaned edge once the drop() step finally deletes the vertex.

One way to reduce the probability of the orphaned-edge problem is to first mark the vertex as a deletion target before starting the delete operation. That way, any new Gremlin query that is trying to add an edge can check whether either the source or the destination vertex of the new edge has been marked for delete (i.e., is in the process of being deleted).

Here is a simple way to mark a vertex for delete before starting the delete operation:

g.V().has('id', 'sourceId').has('pk', 'sourcePK')
.property('locked_for_delete', true)
.drop()

And then, one can use the following check during an edge addition:

g.V()
  .has('id', 'sourceId')
  .has('pk', 'sourcePK')
  .not(has('locked_for_delete'))
  .addE('knows')
  .to(
    g.V()
      .has('id', 'destinationId').has('pk', 'destinationPK')
      .not(has('locked_for_delete'))
  )

Please note that this is nowhere close to providing a transactional guarantee. The above approach can still lead to an orphaned edge if both Gremlin queries happen to read the vertex at the same time (i.e., just before the vertex is marked for delete by the first query). However, the chances of this happening are greatly reduced, since marking a vertex for delete is a fast operation compared to deleting all the incoming/outgoing edges of a vertex.

If the application can't tolerate even this kind of race condition, then additional measures need to be taken. One extreme option is to implement an equivalent of two-phase commit at the client level and route the Gremlin write operations of interest through it.

A simpler option could be to maintain a log of all vertices that have been deleted, and run a daily job that checks for edges to/from those vertices via a Cosmos DB SQL query and deletes them.
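
For illustration, such a cleanup check could look roughly like the sketch below. It assumes the internal document shape Cosmos DB uses to store graph edges, where edge documents carry system properties such as _isEdge, _vertexId (the source vertex id), and _sink (the destination vertex id); these names are an assumption based on the underlying storage format and may vary, and 'deletedVertexId' is a placeholder taken from the deletion log:

SELECT * FROM c
WHERE c._isEdge = true
  AND (c._vertexId = 'deletedVertexId' OR c._sink = 'deletedVertexId')

The matching edge documents can then be deleted through the SQL (Core) API as part of the same daily job.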


These opinions are my own and not the views of my employer (Microsoft).