Denormalizing Your Data is Normal

Update (April 2, 2015): While this post still contains some useful and relevant information, we have since released advanced query functionality and added documentation on best practices for structuring data in Firebase.

This is the second in a series of blog posts on Architecting your application with Firebase.

One of the questions users frequently ask us is — “How do I query my Firebase data for X?” — where X often resembles a SQL query. This is a natural question, especially when coming to Firebase from a relational database background.

Firebase essentially has two ways to query for data: by path and by priority. This is more limited than SQL, and there’s a very good reason for that — our API is carefully designed to only allow operations we can guarantee to be fast. Firebase is a real-time and scalable backend, and we want to enable you to build great apps that can serve millions of users without compromising on responsiveness.

This makes it important to think about how you will need to access your data and then structure it accordingly. This blog post will help you do that!

Primitives

Firebase comes with two powerful query mechanisms which you should be familiar with. See our Data Structure and Query docs for details, but as a reminder:

  • You can query data by its location. You can think of the location as being a primary index (In SQL terms, ref.child('/users/{id}/') is roughly equivalent to SELECT * from users WHERE user_id={id}). Firebase automatically sorts data by location, and you can filter the results with startAt, endAt, and limit.

  • You can query data by its priority. Every piece of data in Firebase can be given an arbitrary priority, which you can use for whatever you want. You can think of this as a secondary index. You might, for example, store timestamps representing a user’s date-of-birth as priority and then query for users born after 2000 with ref.child('users').startAt(new Date('1/1/2000').getTime()).

With these two primitives in mind, we can start thinking about how to structure your data.

Structure is important

Before writing any code for a Firebase app, I spend some time thinking about how I’m going to structure the data. The benefit of this is two-fold — first, it makes it much easier to reason about Security and Firebase Rules when the time comes; and second, it forces me to think about the queries my app will need. A well designed tree structure is crucial to an elegant app.

The best way to learn is by doing, so let’s take an example. We’re going to design the data structure for a site that lets you post links and comment on them (think Hacker News or Reddit).

We’ll discuss how this app would be designed if it were backed by a SQL database first and then approach the same problem in a Firebase friendly way. If you aren’t familiar with SQL or just want to know how to structure the data in Firebase, feel free to skip the section on SQL!

In a SQL world

If you were writing this application with a SQL backend, you will probably have a table of users:

CREATE TABLE users (
  uid int auto_increment, name varchar, bio varchar, PRIMARY KEY (uid)
);

a table of posted links:

CREATE TABLE links (
  id int auto_increment, title varchar, href varchar, submitted int,
  PRIMARY KEY (id), FOREIGN KEY (submitted) REFERENCES users(uid)
);

and finally, a table of posted comments:

CREATE TABLE comments (
  id int auto_increment, author int, body varchar, link int,
  PRIMARY KEY (id), FOREIGN KEY (author) REFERENCES users(uid),
  FOREIGN KEY (link) REFERENCES links(id)
);

Rendering the home page of the site would require you to get a list of the latest links submitted:

SELECT * FROM links ORDER BY id DESC LIMIT 20

Displaying all the comments (latest first) for a particular link would be something like:

SELECT * FROM comments WHERE link = {link_id} ORDER BY id DESC

Whenever you want to display user information (on the user profile page, or the user name under a comment), you’ll fetch that from the users table:

SELECT * FROM users WHERE uid = {user_id}

Let’s say you also want to display all the comments made by a particular user on their profile page. That’s easy:

SELECT * FROM comments WHERE author = {user_id}

Notice that we will be accessing comments in two different ways (by link_id and by author). We’ll see the implications this has with Firebase momentarily.

You’d also have a set of INSERT statements whenever a comment is posted, a link is submitted, or a new user signs up. This should cover most aspects of the app (the devil is in the details, you need to sanitize all the incoming data and so on, but let’s ignore all that for now).

In a Firebase world

If I had to write the same app using Firebase, I might start by simply replicating the structure I came up with for the SQL version. Let’s have three top-level keys, one each for users, links and comments:

{
  users: {
    user1: {
      name: "Alice"
    },
    user2: {
      name: "Bob"
    }
  },
  links: {
    link1: {
      title: "Example",
      href: "http://example.org",
      submitted: "user1"
    }
  },
  comments: {
    comment1: {
      link: "link1",
      body: "This is awesome!",
      author: "user2"
    }
  }
}

Rendering the home page is easy enough, you simply fetch the last 20 links submitted using a limitToLast() query:

var ref = new Firebase("https://awesome.firebaseio-demo.com/links");
ref.limitToLast(20).on("child_added", function(snapshot) {
  // Add link to home page.
});
ref.limitToLast(20).on("child_removed", function(snapshot) {
  // Remove link from home page.
});

Notice that we’re already seeing the benefits of using Firebase. By handling child_added and child_removed events, the home page will automatically update in real-time as items are added without the user touching the refresh button!

Now what happens when you want to retrieve all the comments associated with a particular link? In SQL, each comment had a reference to the link it was tied to, and you could use WHERE link = {link_id} to retrieve them, but Firebase has no WHERE query. As structured, we can only access the comment if we know its comment ID, which we do not.

This is the crux of a Firebase-friendly data structure: you sometimes need to denormalize your data. In this case, to enable retrieving the list of comments for a link, I’ll explicitly store that list with each link:

{
  links: {
    link1: {
      title: "Example",
      href: "http://example.org",
      submitted: "user1",
      comments: {
        comment1: true
      }
    }
  }
}

Now you can simply fetch the list of comments for any given link and render them:

var commentsRef =
  new Firebase("https://awesome.firebaseio-demo.com/comments");
var linkRef =
  new Firebase("https://awesome.firebaseio-demo.com/links");
var linkCommentsRef = linkRef.child(LINK_ID).child("comments");
linkCommentsRef.on("child_added", function(snap) {
  commentsRef.child(snap.key()).once("value", function() {
    // Render the comment on the link page.
  ));
});

We also wanted to display a list of comments made by every user on their profile page. So we do something similar:

{
  users: {
    user2: {
      name: "Bob",
      comments: {
        comment1: true
      }
    }
  }
}

Duplicating data like this can be counter-intuitive to many developers. However, in order to build truly scalable applications, denormalization is almost a requirement. Essentially, we are optimizing our data reads by writing extra data at write-time. Consider that disk space is cheap, but a user’s time is not.

Considerations

Let’s discuss some consequences of a structure like this. You will need to ensure that every time some data is created (in this case, a comment) it is put in the right places:

functon onCommentSubmitted(comment) {
  var root = new Firebase("https://awesome.firebaseio-demo.com");
  var id = root.child("/comments").push();
  id.set(comment, function(err) {
    if (!err) {
      var name = id.key();
      root.child("/links/" + comment.link + "/comments/" + name).set(true);
      root.child("/users/" + comment.author + "/comments/" + name).set(true);
    }
  });
}

(If you want to be notified when both the set methods have completed, consider using a library like TameJS to compose multiple asynchronous operations.)

You’ll also need to think about how you want to handle deletion and modification of comments. Modification of comments is easy: just set the value of the comment under /comments to the new content. For deletion, simply delete the comment from /comments — and whenever you come across a comment ID elsewhere in your code that doesn’t exist in /comments, you can assume it was deleted and proceed normally:

function deleteComment(id) {
  var url = "https://awesome.firebaseio-demo.com/comments/";
  new Firebase(url + id).remove();
}

function editComment(id, comment) {
  var url = "https://awesome.firebaseio-demo.com/comments/";
  new Firebase(url + id).set(comment);
}

Firebase does all it can to make these kinds of operations effecient. For example, if you’ve already fetched the content of a comment for a link’s page, and subsequently navigate to the profile page of the user that made the comment, refetching the value will be fast since it’ll be returned from the local cache.

If you’re looking for a full-featured example that applies these techniques, please take a look at Firefeed. Firefeed is a twitter clone that uses denormalization to build a complex application entirely client side.

Please don’t hesitate to get in touch with us if you have any questions or want to discuss the appropriate architecture for your own app. We’re available on Email, Twitter, Facebook, Stack Overflow and Google+.