RethinkDB is an open source database for the real-time Web. It has a built-in change notification system that streams live updates to your application. Instead of polling for new data, you can have the database push the latest changes to you. The ability to subscribe to streaming updates from the persistence layer can simplify your application architecture and make it easier to support clients that maintain persistent connections to your back end.
RethinkDB is a schemaless JSON document store, but it also supports relational features like table joins. RethinkDB supports clustering, which makes it easy to scale. You can configure sharding and replication for the cluster through the database's built-in administrative Web interface. The latest release of RethinkDB, version 2.1, brings automatic fail-over to clusters with three or more servers.
The RethinkDB query language, which is called ReQL, embeds natively into the code in which you write your application. If you build your application in Python, for example, you use conventional Python syntax to write ReQL queries. Each ReQL query is composed of functions that the developer chains together to describe a desired database operation.
A brief introduction to ReQL
RethinkDB databases contain tables, which store conventional JSON documents. The JSON object structures can be deeply nested. Every RethinkDB document has a primary key, an id property with a value that is unique within the document's table. You can reference the primary key in a ReQL query to efficiently fetch an individual document.
Writing ReQL queries in an application feels a lot like using a SQL query builder API. The following is a simple ReQL query, written in JavaScript, that finds the number of unique last names in the users table:
r.table("users").pluck("last_name").distinct().count()
In a ReQL query, each function in the chain operates on the output of the previous function. It works a little bit like the pipeline paradigm from Unix shell scripting. Specifically, the functions in the above query perform the following operations:
table: accesses a specific table in the database
pluck: extracts a specific property (or multiple properties) from every record
distinct: eliminates the duplicate values, leaving one instance of each unique value
count: counts the number of items and returns the total
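To make the pipeline idea concrete, here is a rough plain-JavaScript analogy of the same four steps. It is not ReQL: it runs in the application against a small in-memory array, the sample records are made up, and the helper functions are hypothetical stand-ins for what the database does on the server.

```javascript
// Plain JavaScript analogy of table -> pluck -> distinct -> count.
// Sample data is invented for illustration.
var users = [
  { id: 1, first_name: "Ada", last_name: "Lovelace" },
  { id: 2, first_name: "Alan", last_name: "Turing" },
  { id: 3, first_name: "Dermot", last_name: "Turing" }
];

// pluck: keep only the named property of every record
function pluck(rows, field) {
  return rows.map(function (row) {
    var out = {};
    out[field] = row[field];
    return out;
  });
}

// distinct: drop duplicate records (compared here by their JSON text)
function distinct(rows) {
  var seen = {};
  return rows.filter(function (row) {
    var key = JSON.stringify(row);
    if (seen[key]) { return false; }
    seen[key] = true;
    return true;
  });
}

// count: total number of remaining items
function count(rows) {
  return rows.length;
}

// Each step consumes the output of the previous one.
var uniqueLastNames = count(distinct(pluck(users, "last_name")));
console.log(uniqueLastNames); // 2
```

The ReQL version reads left to right instead of inside out, but the data flows through the stages in the same order.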
Conventional CRUD operations are easy to perform. ReQL includes an insert function, which you can use to add new JSON documents to a table:
r.table("fellowship").insert([
{ name: "Frodo", species: "hobbit" },
{ name: "Sam", species: "hobbit" },
{ name: "Merry", species: "hobbit" },
{ name: "Pippin", species: "hobbit" },
{ name: "Gandalf", species: "istar" },
{ name: "Legolas", species: "elf" },
{ name: "Gimli", species: "dwarf" },
{ name: "Aragorn", species: "human" },
{ name: "Boromir", species: "human" }
])
The filter function retrieves documents that match particular parameters:
r.table("fellowship").filter({species: "hobbit"})
You can chain functions like update or delete to the expression if you want to perform those operations on the documents returned by the filter:
r.table("fellowship").filter({species: "hobbit"}).update({species: "halfling"})
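The same chaining works for the other operations mentioned so far. The sketch below wraps two of them in small functions: a filter followed by delete, and a lookup by primary key with get. The function names are hypothetical, and r is assumed to be the object exported by the JavaScript driver, passed in as a parameter so the helpers are easy to test.

```javascript
// `r` is assumed to be the driver object, e.g. require("rethinkdb").

// Build a query that deletes every document of the given species.
function removeSpecies(r, species) {
  return r.table("fellowship").filter({ species: species }).delete();
}

// Build a query that fetches a single document by its primary key.
function findMember(r, id) {
  return r.table("fellowship").get(id);
}
```

Both helpers only build a query expression. Nothing is sent to the database until the expression is executed against a server connection.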
ReQL includes more than 100 functions that you can combine in different ways to achieve your desired result. There are functions for flow control, transforming documents, performing aggregate computations, and writing records. There are also specialized functions for performing common operations on strings, numbers, timestamps, and geospatial coordinates.
There's even an http command that you can use to fetch data from public Web APIs. The following example shows how you can use the http command to fetch posts from Reddit:
r.http("http://www.reddit.com/r/aww.json")("data")("children")("data").orderBy(r.desc("score")).limit(5).pluck("score", "title", "url")
The query retrieves the posts, orders them by score, then displays several properties from the top five entries. Used to its full potential, ReQL makes it possible for developers to perform some fairly sophisticated data manipulation.
How ReQL works
RethinkDB client libraries are responsible for integrating ReQL into the underlying programming language. A complete client library implements functions for all of the query operations supported by the database. Under the hood, ReQL query expressions evaluate into structured objects that look a bit like an abstract syntax tree. To execute a query, the client library translates the query objects into RethinkDB's JSON wire protocol format, which it then transmits to the database server.
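A toy model can help picture how a chain of function calls turns into a tree and then into JSON. The sketch below is not the real driver and not the real wire protocol: the Term constructor, the toy object, and the term labels ("TABLE", "FILTER", "COUNT") are all invented for illustration, and the actual protocol encodes terms differently.

```javascript
// Toy model of a chained query becoming a nested data structure.
// Term names are illustrative labels, not RethinkDB's wire format.
function Term(name, args) {
  this.name = name;
  this.args = args;
}

// Each chained call wraps the previous term as its first argument.
Term.prototype.filter = function (predicate) {
  return new Term("FILTER", [this, predicate]);
};
Term.prototype.count = function () {
  return new Term("COUNT", [this]);
};

// Convert the tree into plain arrays that JSON.stringify can serialize.
Term.prototype.build = function () {
  return [this.name, this.args.map(function (arg) {
    return arg instanceof Term ? arg.build() : arg;
  })];
};

// Hypothetical entry point playing the role of `r`.
var toy = {
  table: function (name) { return new Term("TABLE", [name]); }
};

var query = toy.table("fellowship").filter({ species: "hobbit" }).count();
var wire = JSON.stringify(query.build());
console.log(wire);
// ["COUNT",[["FILTER",[["TABLE",["fellowship"]],{"species":"hobbit"}]]]]
```

The outermost term is the last function in the chain, which is why a query can be built up piece by piece before anything is sent to the server.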
The ReQL run function, chained to the end of a query, translates the query, executes it on the server, and returns the output. You typically provide the run function with a reference to a RethinkDB server connection that it can use to perform the operation. In the official client libraries, connection handling is a manual process. You have to create the connection and close it when you are finished.
The following example shows how to perform a RethinkDB query in Node.js with the JavaScript client driver. The query retrieves all of the halflings from the fellowship table and displays them in the terminal:
var r = require("rethinkdb");

r.connect().then(function(conn) {
  return r.table("fellowship")
    .filter({species: "halfling"}).run(conn)
    .finally(function() { conn.close(); });
})
.then(function(cursor) {
  return cursor.toArray();
})
.then(function(output) {
  console.log("Query output:", output);
});
The rethinkdb module in npm provides RethinkDB's official JavaScript client driver. You can use it in a Node.js application to compose and send queries. The example above uses Promises for asynchronous flow control, but the client library also supports conventional callbacks for users who prefer those.
The connect method establishes the connection, which the run function uses to perform the query. The query itself returns a cursor, which is sort of like an open window into the contents of the database. Cursors support lazy data fetching, offering an efficient way to iterate over a large data set. In the example above, I simply chose to convert the contents of the cursor to an array because the output is relatively small.
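Because connection handling is manual, it can be convenient to wrap the connect, run, and close steps in one place. The following is a minimal sketch of such a wrapper, not part of the driver: fetchAll and buildQuery are hypothetical names, and r is assumed to be the driver object passed in by the caller. It reads the cursor into an array before closing the connection, which may matter when a result set is too large to arrive in a single batch.

```javascript
// Run a query, collect its rows, and close the connection afterwards,
// whether or not the query succeeded.
// `r` is assumed to be the driver object (require("rethinkdb"));
// `buildQuery` is a hypothetical callback that returns a ReQL expression.
function fetchAll(r, buildQuery) {
  return r.connect().then(function (conn) {
    return buildQuery(r).run(conn)
      .then(function (cursor) {
        // Drain the cursor while the connection is still open.
        return cursor.toArray();
      })
      .then(
        function (rows) {
          // Success: close, then hand the rows to the caller.
          return Promise.resolve(conn.close()).then(function () { return rows; });
        },
        function (err) {
          // Failure: close, then rethrow the original error.
          return Promise.resolve(conn.close()).then(function () { throw err; });
        }
      );
  });
}

// Example call (requires a running server):
// fetchAll(require("rethinkdb"), function (r) {
//   return r.table("fellowship").filter({species: "halfling"});
// }).then(function (rows) { console.log(rows); });
```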
Although ReQL queries are written in your application like regular code, they execute on the database server and return their results. The integration is so seamless that new users often experience a bit of initial confusion regarding the boundaries between their application and the database.
ReQL's function chaining and native language integration are really helpful for increasing code reuse and abstracting out frequent operations. Because queries are written in your application's native language, it's easy to encapsulate query subexpressions in functions and variables. For example, the following JavaScript function generalizes pagination, returning a ReQL expression that incorporates the specified values:
function paginate(table, index, limit, last) {
  return (!last ? table : table
    .between(last, null, {leftBound: "open", index: index}))
    .orderBy({index: index}).limit(limit);
}
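As a usage sketch, the helper might be called like this. The paginate function is repeated so the snippet stands alone. The usersPage wrapper, the page size of 20, and the secondary index named last_name on the users table are all assumptions made for illustration.

```javascript
// Copy of the paginate helper, repeated so this snippet stands alone.
function paginate(table, index, limit, last) {
  return (!last ? table : table
    .between(last, null, {leftBound: "open", index: index}))
    .orderBy({index: index}).limit(limit);
}

// Hypothetical wrapper: page through users ordered by a secondary index
// assumed to be named "last_name".
// `r` is assumed to be the driver object; `lastSeen` is the last_name of the
// final row on the previous page, or undefined for the first page.
function usersPage(r, lastSeen) {
  return paginate(r.table("users"), "last_name", 20, lastSeen);
}
```

Calling usersPage(r) builds the query for the first page. Passing the last value from that page builds a query that resumes just after it, because the open left bound excludes the boundary value itself.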
Another noteworthy advantage that ReQL offers over conventional SQL: It is largely immune to conventional injection attacks. You can easily incorporate external data into your queries without the need for risky string concatenation.
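A short sketch of what that looks like in practice: the external value is handed to filter as data inside an object, so it is never spliced into query text. The function name is hypothetical, r is assumed to be the driver object, and the type check is an extra precaution of my own rather than something the driver requires.

```javascript
// Untrusted input is passed to ReQL as a value, not concatenated into a string.
// `r` is assumed to be the driver object (require("rethinkdb")).
function findByLastName(r, lastName) {
  // Precaution (an assumption of this sketch): insist on a plain string so
  // the filter always means "last_name equals this exact text".
  if (typeof lastName !== "string") {
    throw new TypeError("lastName must be a string");
  }
  return r.table("users").filter({ last_name: lastName });
}

// A hostile-looking input such as "x' OR '1'='1" is simply compared against
// the last_name field as literal text.
```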
Many advanced ReQL features are outside the scope of this article, including secondary indexes, table joins, and the use of anonymous functions. You might want to check out the official ReQL API reference to learn more.
Build real-time apps with changefeeds
RethinkDB has a built-in change notification system that developers can use to simplify the development of real-time applications. When you chain the changes function to the end of a ReQL query, it will emit a continuous stream of updates to reflect live changes in the query's result data. The update stream is called a changefeed.
Conventional database queries are a good fit for the Web's traditional request/response model. Polling, however, is not practical for real-time Web applications that use persistent connections or live data streaming. RethinkDB changefeeds provide an alternative to polling, a way to push updated query results to the application.
You can attach a changefeed directly to a table to detect all changes to its contents. You can also use changefeeds with relatively sophisticated queries to track updates on specific data. For example, you could attach a changefeed to a query that uses the orderBy and limit functions to create a real-time leaderboard for a multiplayer game:
r.table("players").orderBy({index: r.desc("score")}).limit(5).changes()
The query orders the players by score, then takes only the first five. Whenever the scores or the composition of the list of top five users changes, the changefeed will give you an update. Even in the case where a player who wasn't previously in the top five displaces one of its members, it will give you the changes you need to properly update the list.
Changefeed updates tell you both the previous value of the record and the new value of the record, allowing you to compare the two to determine the actual change. When existing records are deleted, the new value is null. Similarly, the old value is null when the table receives new records. You can even chain additional query operations after the changes function if you want to use ReQL to manipulate the updates.
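In the JavaScript driver, each update arrives as an object with old_val and new_val fields. The sketch below shows one way an application might interpret those updates and keep an in-memory copy of a result set in sync. The function names are hypothetical, and the code assumes documents use the default id primary key.

```javascript
// Classify a changefeed update by which side is null (or absent).
function changeType(change) {
  if (change.old_val == null) { return "insert"; }
  if (change.new_val == null) { return "delete"; }
  return "update";
}

// Apply one update to an in-memory copy of the result set, matching
// documents by their primary key (`id`). Returns a new array.
function applyChange(rows, change) {
  // Drop the previous version of the document, if there was one.
  var next = rows.filter(function (row) {
    return !change.old_val || row.id !== change.old_val.id;
  });
  // Add the new version, if there is one.
  if (change.new_val) { next.push(change.new_val); }
  return next;
}
```

For a leaderboard like the one above, the application would apply each update and then re-sort its local copy by score before rendering.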
When you execute a query that includes the changes command, the client library returns a cursor that remains open indefinitely. The cursor exposes new updates as they become available. The following example shows how to consume changefeed updates in Node.js:
r.connect().then(function(conn) {
  return r.table("data").changes().run(conn);
})
.then(function(cursor) {
  cursor.each(function(err, item) {
    // Surface feed errors instead of silently logging undefined
    if (err) throw err;
    console.log(item);
  });
});
When you iterate over a changefeed cursor, you typically do it in the background so that it doesn't block your application. In natively asynchronous programming environments like Node.js, you don't have to take any special measures. In other languages, you may need to use asynchronous programming frameworks or manually implement threading. The official Python and Ruby client libraries for RethinkDB support Tornado and EventMachine, popular asynchronous programming frameworks that are widely used in those languages.
The changes command currently works with the following kinds of queries: get, between, filter, map, orderBy, min, and max. Support for additional kinds of queries, such as group operations, is planned for the future.
When you build a practical real-time Web application with RethinkDB, you can use WebSockets to broadcast updates to the front end. A number of popular WebSocket abstraction libraries, such as Socket.io, are easy to use.
Changefeeds are particularly useful in real-time applications that are designed to scale horizontally. When you use sticky load balancing to spread your audience across multiple running instances of the application, you typically have to go with an external mechanism like a message queue or an in-memory database to propagate updates between servers. RethinkDB moves that message propagation functionality to your application's persistence layer, flattening your application's architecture and eliminating the need for additional infrastructure. Each application instance subscribes directly to the database to receive changes. When changes are available, each server broadcasts the updates to its respective WebSocket clients.
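The last step, where each application instance relays changes to its own clients, can be sketched in a few lines. This is an illustration rather than a complete server: broadcastChanges is a hypothetical name, cursor is assumed to be a changefeed cursor obtained from a query that ends in changes, and clients is assumed to be an array or Set of objects with a send method, such as open WebSocket connections. The transport itself is left abstract.

```javascript
// Forward every changefeed update to all clients connected to this instance.
// `cursor`: a changefeed cursor (assumed to come from .changes().run(conn)).
// `clients`: an Array or Set of objects exposing send(string).
// `onError`: hypothetical callback invoked if the feed reports an error.
function broadcastChanges(cursor, clients, onError) {
  cursor.each(function (err, change) {
    if (err) { onError(err); return; }
    // Serialize once, then send the same message to every client.
    var message = JSON.stringify(change);
    clients.forEach(function (client) {
      client.send(message);
    });
  });
}
```

Because every instance runs the same subscription against the database, no instance needs to know about the others.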
In addition to real-time applications, changefeeds can make it easier to implement mobile push notifications and similar event-triggered features. Changefeeds represent an event-driven model of database interaction, which can prove useful in a number of scenarios.
Scale and manage a RethinkDB cluster
RethinkDB is a distributed database, designed for clustering and easy scalability. To add a new server to the cluster, simply launch it from the command line with the --join option and specify the host of an existing server. When you have a cluster with multiple servers, you can configure sharding and replication on a per-table basis. Any feature that works with a single instance of the database will work exactly as expected with a sharded cluster.
The RethinkDB server includes an administrative user interface, which you can access in your Web browser. The admin interface offers a user-friendly way to manage and monitor your database cluster. You can even use it to configure sharding and replication with a few clicks.
RethinkDB also supports a ReQL-based approach to cluster configuration, which is ideal for fine-grained control and automation. ReQL includes a simple reconfigure function that you can chain to a table to set sharding and replication settings. The cluster also exposes much of its internal configuration state through a set of special RethinkDB tables. You can perform queries against the system tables to tweak the configuration or retrieve statistics for monitoring purposes. Virtually all of the functionality provided by the browser-based administrative console is implemented on top of the ReQL-based configuration and monitoring APIs.
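For example, the two approaches might look like the following in JavaScript. The helper names and the shard and replica counts are arbitrary choices for illustration, and r is assumed to be the driver object. The system tables live in a database named rethinkdb, and server_status is one of them.

```javascript
// `r` is assumed to be the driver object (require("rethinkdb")).

// Spread a table over two shards, with three replicas of each shard.
// The numbers are examples; the cluster needs enough servers to honor them.
function reshard(r, tableName) {
  return r.table(tableName).reconfigure({ shards: 2, replicas: 3 });
}

// Read per-server status from the cluster's system tables.
function serverStatus(r) {
  return r.db("rethinkdb").table("server_status");
}
```

Like any other ReQL expression, both need to be run against a connection before they take effect.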