Review: MongoDB learns cool new tricks

With useful graph search capabilities and important stability improvements, MongoDB 3.4 is a no-brainer upgrade


MongoDB 3.4 continues the trend of databases building out support for a range of conceptual data models over the same underlying data store. This multimodel approach aims to deliver a single database that can be used to store data as documents, tables, and graphs simultaneously. The benefit to the user is a dramatically simplified infrastructure when compared to a polyglot persistence model, which might entail managing three or four separate data stores to satisfy those different use cases.

MongoDB 3.4 introduces the ability to perform recursive graph queries. It was co-released last November with version 2.0 of the BI connector, which provides the ability to query MongoDB using a SQL interface through tools like Tableau and Qlik. Also, aggregation operators were added to greatly enhance the ease and performance of facet-style interactions.

New and improved infrastructure features have been released in addition to this expanded query model. Most notably, updates to the v1 replication protocol have enabled MongoDB to pass previously failing Jepsen tests, meaning that MongoDB 3.4 should prevent stale reads, dirty reads, and lost updates.

Graph search

Recursive searches are now available in the aggregation pipeline, allowing you to perform certain types of graph searches on a collection. The new $graphLookup aggregation operator lets you define a subquery to join against each input document, along with a maximum depth for these recursive joins. For example, imagine a collection that contains documents describing people, where one of the fields in each document is a list of children identified by their Social Security number. One such document might look like this:

{
  _id: 1,
  ssn: "111-22-3333",
  name: "Irene Katsopolis",
  children: ["111-22-3334", "111-22-3335"]
}

We could use the $graphLookup operator within an aggregation pipeline to find all of the children, grandchildren, great-grandchildren, and so on of a certain person. The pipeline would look something like this:

db.people.aggregate([
  { $match: { ssn: "111-22-3333" } },
  { $graphLookup: {
      from: "people",
      startWith: "$children",
      connectFromField: "children",
      connectToField: "ssn",
      as: "descendants"
  } }
]);

The pipeline above specifies the collection on which to perform joins with the from field. In this case, we’re using the same collection as the one we’re aggregating on initially, but that doesn’t have to be the case. The only restriction on the from field is that the target collection cannot be a sharded collection. Then we specify which fields to look at to infer a relationship.

In this case, we try to match the values in the children array with the ssn field in other documents. To start the lookups, we feed in an initial value specified by startWith. If there are matches, those documents are added to a new array field on the input document, named by the as field. The output of this example might look something like:

{
  _id: 1,
  ssn: "111-22-3333",
  name: "Irene Katsopolis",
  children: ["111-22-3334", "111-22-3335"],
  descendants: [
    {
      _id: 3,
      ssn: "111-22-3334",
      name: "Jesse Katsopolis",
      children: ["111-23-3333", "111-23-3334"]
    },
    ...
  ]
}

The new descendants field will contain the flattened result of traversing these inferred relationships. If knowing the depth of the graph lookup for each matched document is important (for example, if you were searching for distant relatives and wanted to measure that distance), you simply specify a depthField on the aggregation operator. When specified, each matching document will have a new field added to it with the depth value.

In this case, the value of the depth field would be 0 for children, 1 for grandchildren, and so on. We can also restrict potential matches with other search criteria using the restrictSearchWithMatch field, which takes standard MongoDB query filter syntax. This would let us support queries such as finding descendant chains where every descendant exhibits a certain genetic trait.
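
To make the traversal semantics concrete, here is a plain JavaScript sketch (not MongoDB's implementation) of the breadth-first walk that $graphLookup performs, including the depth counter. The graphLookup helper and the sample data are made up for illustration; field names mirror the example above.

```javascript
// Toy collection; the real documents live in MongoDB.
const people = [
  { _id: 1, ssn: "111-22-3333", name: "Irene Katsopolis", children: ["111-22-3334"] },
  { _id: 3, ssn: "111-22-3334", name: "Jesse Katsopolis", children: ["111-23-3333"] },
  { _id: 4, ssn: "111-23-3333", name: "Alex Katsopolis", children: [] }
];

// Breadth-first traversal: start from a set of values, join them against
// connectToField, then follow each match's connectFromField values, tagging
// every matched document with the depth at which it was found.
function graphLookup(startValues, from, connectFromField, connectToField, depthField) {
  const results = [];
  const seen = new Set();            // avoid revisiting documents (cycles)
  let frontier = startValues.slice();
  let depth = 0;
  while (frontier.length > 0) {
    const matches = from.filter(
      d => frontier.includes(d[connectToField]) && !seen.has(d[connectToField])
    );
    matches.forEach(d => seen.add(d[connectToField]));
    matches.forEach(d => results.push({ ...d, [depthField]: depth }));
    frontier = matches.flatMap(d => d[connectFromField]);
    depth += 1;
  }
  return results;
}

// Start from Irene's children: children land at depth 0, grandchildren at depth 1.
const descendants = graphLookup(["111-22-3334"], people, "children", "ssn", "numConnections");
console.log(descendants.map(d => `${d.name}:${d.numConnections}`));
```

Note the seen set: like $graphLookup itself, the sketch must guard against cycles in the data, or a loop of references would recurse forever.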

It’s important to note that the $graphLookup operator is more of a recursive join than a full-fledged graph search. This means you can’t do complex pattern matching or otherwise reap the benefits of a graph-optimized underlying data structure, such as what you get with a native graph database like Neo4j. But similar to full-text search in MongoDB, $graphLookup is useful enough to cover some popular use cases. It might buy you time before having to invest in a polyglot persistence model, or it might eliminate the need entirely.

Faceted search

New facet operators are available in the aggregation pipeline as well, providing a mechanism for performing parallel aggregation pipelines on the same set of input documents. This is a huge advantage for anyone using MongoDB to drive a search interface, where faceted interactions have become the standard for quickly and flexibly narrowing down search results.

The general idea is that at a certain stage in the aggregation pipeline, you have a set of input documents that you want to group and count in different ways in order to display multiple facets to the user. If we’re dealing with a restaurant data set, we might want to show restaurants grouped by neighborhood, year opened, and price point. Instead of having to execute three different aggregation queries, we can now use the $facet aggregation operator to do it all at once.

The $facet operator lets you specify any number of aggregation pipelines through which all input documents will pass in parallel. A single document is returned, containing the results of each aggregation pipeline with the names assigned in the $facet operator. Consider the restaurant use case where a document is structured as follows:

{
  _id: 1,
  name: "Au Cheval",
  yearOpened: "2012",
  averageMealPrice: 30,
  neighborhood: "West Loop"
}

The skeleton of the aggregation query using $facet would look like this:

db.restaurants.aggregate([
  {
    $facet: {
      byYearOpened: [ /* aggregation pipeline */ ],
      byAveragePrice: [ /* aggregation pipeline */ ],
      byNeighborhood: [ /* aggregation pipeline */ ]
    }
  }
]);

And the single output document would look like this:

{
  byYearOpened: [ /* pipeline results */ ],
  byAveragePrice: [ /* pipeline results */  ],
  byNeighborhood: [ /* pipeline results */  ]
}

Some operators cannot be used in the aggregation pipelines specified inside $facet, but for the most part they wouldn’t make sense in facet use cases anyway. In practice this shouldn’t be an issue.
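
Conceptually, $facet just runs several independent pipelines over the same input set and collects each result under its own name. The sketch below shows that idea in plain JavaScript; the facet helper and the sample restaurants are hypothetical, with each "pipeline" reduced to a function over the document array.

```javascript
// Toy input set; in MongoDB these would be the documents flowing into $facet.
const restaurants = [
  { name: "Au Cheval", yearOpened: "2012", averageMealPrice: 30, neighborhood: "West Loop" },
  { name: "Example Diner", yearOpened: "1985", averageMealPrice: 15, neighborhood: "West Loop" }
];

// Run every named pipeline over the same input documents and return a
// single object keyed by pipeline name, like $facet's single output document.
function facet(docs, pipelines) {
  const out = {};
  for (const [name, pipeline] of Object.entries(pipelines)) {
    out[name] = pipeline(docs);  // each pipeline sees the full input set
  }
  return out;
}

const result = facet(restaurants, {
  byNeighborhood: docs => docs.map(d => d.neighborhood),
  byYearOpened: docs => docs.map(d => d.yearOpened)
});
console.log(result);
```

The key property is that each pipeline starts from the same input documents; one facet's filtering never affects another's.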

To MongoDB’s large collection of aggregation operators, MongoDB 3.4 adds three new operators to help get the most out of faceting: $bucket, $bucketAuto, and $sortByCount.

The $bucket operator groups documents into a series of explicit value ranges based on a field in each document. For example, we could bucket restaurants by average meal price, where different buckets contain restaurants whose average meals fall between $0 and $15, between $15 and $30, between $30 and $60, and $60 and up. The operator to do this would look something like this:

{
  $bucket: {
    groupBy: "$averageMealPrice",
    boundaries: [ 0, 15, 30, 60 ],
    default: "$60 or more"
  }
}

Notice that buckets are specified as an array of boundaries. A bucket is formed between any two adjacent values in the array, where the left value is inclusive and the right value is exclusive. In this case, I’m using the default bucket to group all values of $60 or greater, but it will also catch restaurants that lack an averageMealPrice field. The output looks something like the following:

[
  {
    _id: 0,
    count: 7
  },
  {
    _id: 15,
    count: 12
  },
  {
    _id: 30,
    count: 22
  },
  {
    _id: "$60 or more",
    count: 4
  }
]

Each resulting document identifies itself by the beginning bounds of its bucket and includes a count of documents in that bucket by default. If more information is desired on the output document, such as average meal price or a list of names of the restaurants in the bucket, you can specify them using the output field on the $bucket operator. The biggest issue with usability here is that it’s impossible to tell what the bucket bounds are given a single document. Either you have to see the subsequent document as well or you have to know the query ahead of time.
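
The boundary rule is easy to get wrong, so here is a plain JavaScript sketch of the grouping logic (not MongoDB's implementation; the bucket helper is hypothetical): each bucket spans [boundaries[i], boundaries[i+1]), with the left edge inclusive and the right edge exclusive, and anything that falls outside every bucket lands in the default bucket.

```javascript
// Count values into buckets defined by an array of boundaries.
// A value v belongs to bucket boundaries[i] when boundaries[i] <= v < boundaries[i+1];
// anything unmatched is counted under defaultId, like $bucket's default field.
function bucket(values, boundaries, defaultId) {
  const counts = new Map();
  for (const v of values) {
    let id = defaultId;
    for (let i = 0; i < boundaries.length - 1; i++) {
      if (v >= boundaries[i] && v < boundaries[i + 1]) {
        id = boundaries[i];  // bucket is identified by its left (inclusive) edge
        break;
      }
    }
    counts.set(id, (counts.get(id) || 0) + 1);
  }
  return counts;
}

const counts = bucket([5, 20, 30, 45, 80], [0, 15, 30, 60], "60 or more");
// 5 -> bucket 0; 20 -> bucket 15; 30 and 45 -> bucket 30; 80 -> default
console.log(counts);
```

Note the edge case: a value of exactly 30 goes into the 30 bucket, not the 15 bucket, because right edges are exclusive.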

While the $bucket operator works well for absolute, explicit bucket ranges, we’ll need to turn to the $bucketAuto operator for more dynamic ranges. $bucketAuto takes a number of buckets and figures out what the bucket boundaries should be for an even distribution of documents across the buckets. Sometimes ensuring an even distribution puts bucket boundaries on undesirable decimal values; an optional granularity field shifts the boundaries to the nearest numbers in a specified series, such as powers of 2.

Unlike the output of $bucket, the output of $bucketAuto includes both sides of the bucket boundary. An example output document would look like this:

{
  _id: { min: 0, max: 15 },
  count: 7
}
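
The even-split idea behind $bucketAuto can be sketched in plain JavaScript: sort the values and cut them into roughly equal groups, reporting min and max per bucket. This bucketAuto helper is hypothetical and simplified; the real operator also adjusts boundaries for the optional granularity series, which is omitted here.

```javascript
// Split sorted values into numBuckets groups of roughly equal size,
// reporting each group's min/max bounds and count, like $bucketAuto's output.
function bucketAuto(values, numBuckets) {
  const sorted = [...values].sort((a, b) => a - b);
  const size = Math.ceil(sorted.length / numBuckets);
  const buckets = [];
  for (let i = 0; i < sorted.length; i += size) {
    const slice = sorted.slice(i, i + size);
    buckets.push({
      _id: { min: slice[0], max: slice[slice.length - 1] },
      count: slice.length
    });
  }
  return buckets;
}

const b = bucketAuto([12, 3, 45, 7, 30, 22], 3);
console.log(b);
```

Because both min and max appear in _id, each output document is self-describing, which is exactly the usability edge $bucketAuto has over $bucket.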

The third new operator released with MongoDB 3.4 to augment the $facet operator is $sortByCount. It’s not new functionality, per se, but shorthand for a fairly common aggregation use case: group by a particular field, count the documents in each group, then sort by that count. Instead of having to write this:

{ $group: { _id: "$neighborhood", count: { $sum: 1 } } },
{ $sort: { count: -1 } }

You can now write this:

{ $sortByCount: "$neighborhood" }
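
The group-count-sort expansion is simple enough to mimic in plain JavaScript; the sortByCount helper and sample documents below are hypothetical, but the logic matches the two-stage pipeline above.

```javascript
// Group documents by a field, count each group, then sort counts descending --
// the same shape as $group + $sort that $sortByCount abbreviates.
function sortByCount(docs, field) {
  const counts = new Map();
  for (const d of docs) {
    const key = d[field];
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([k, count]) => ({ _id: k, count }))
    .sort((a, b) => b.count - a.count);
}

const docs = [
  { neighborhood: "West Loop" },
  { neighborhood: "Logan Square" },
  { neighborhood: "West Loop" }
];
const ranked = sortByCount(docs, "neighborhood");
console.log(ranked);
```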

All together, the new aggregation operators dramatically reduce the number of queries required to support a faceted search interface, resulting in better performance and a smoother user experience.

BI connector

While not technically a part of MongoDB 3.4, version 2.0 of the MongoDB Connector for BI was released at the same time. Striving to cover more and more data modeling paradigms, the BI connector allows users to query data in MongoDB using SQL. Users can connect directly and run ad hoc SQL commands, or the BI connector can be used in conjunction with SQL-compatible BI tools such as Tableau in order to visualize data in dashboards.

The 2.0 release includes performance enhancements, but the most notable update is that it now supports SQL-99 SELECT statements. Instead of having to drop down to MongoDB aggregation syntax when you need things like GROUP BY, it’s now possible to aggregate using the SQL syntax familiar to both developers and business analysts. The BI connector is only available with the Enterprise version of MongoDB.

Preventing data loss

If you’ve been working with distributed systems in the last few years, you are probably familiar with Jepsen tests. Jepsen is a library designed to help verify consistency claims of distributed systems, and it is known for proving that many distributed systems don’t work as advertised.

MongoDB has had some fairly unpleasant run-ins with Jepsen tests in the past, but instead of avoiding the tests, MongoDB funded additional research to help identify potential bugs or architectural problems in the code. The tests exposed architectural issues in the v0 replication protocol that resulted in lost updates, dirty reads, and stale reads. The tests also exposed bugs in the implementation of the v1 replication protocol that resulted in these same issues.

After working with the Jepsen team, the MongoDB team implemented bug fixes, with the result that MongoDB 3.4 now passes those tests using the v1 replication protocol. This doesn’t necessarily mean the database is perfect and will never lose data, but it represents a huge stride forward in the safety and reliability of the system.

MongoDB 3.4 is a substantial improvement over MongoDB 3.2, providing a mix of useful new features and safety improvements without introducing any significant usability irritations. Similar to the full-text search feature, graph lookups and SQL interactions won’t cover all graph or relational use cases, but they’re good enough that you might be able to get by. Additionally, the work done to increase the safety and stability of replication is an effort that some users might never notice, but should increase our confidence in the underlying infrastructure and make us all feel better about choosing MongoDB.

---

Cost: MongoDB Community Server is free open source under the GNU AGPL v3.0 license. MongoDB Enterprise Server is available by subscription under a commercial license, or free of charge for evaluation and development. MongoDB Atlas provides MongoDB as a service on AWS, with free and paid tiers. 

Platforms: MongoDB is available for Windows Server 2008 64-bit and later; Amazon, Debian, RHEL, Suse, and Ubuntu Linux distributions; MacOS; and Solaris. 

InfoWorld Scorecard: MongoDB 3.4
Administration (20%): 9
Ease of use (20%): 9
Scalability (20%): 9
Installation and setup (15%): 9
Documentation (15%): 8
Value (10%): 9
Overall Score (100%): 8.8
At a Glance
  • MongoDB 3.4 extends support for new use cases and improves safety and stability, making it a clear win to upgrade.

    Pros

    • New $graphLookup aggregation operator brings some graph traversal capabilities to the existing query model
    • New $facet aggregation operators make parallel pipeline aggregations possible
    • Increased SQL compatibility in the BI connector provides more flexible integration with external tools
    • Previously known issues with data loss have been resolved

    Cons

    • $graphLookup is limited to fairly simple recursive lookups across homogeneous data
    • $bucket operator output isn't as straightforward to consume as $bucketAuto's, despite otherwise similar behavior

Copyright © 2017 IDG Communications, Inc.