How This Blog Does Comments

Published: 2018-11-14 21:33:13 -0800

Rather than use a system like Disqus, I wrote my own logic to handle commenting on this blog. Accepting comments means they have to be stored somewhere, so this blog needs to be more than just statically generated content hosted on Github Pages.

DynamoDB Stores The Comments

To store the comments, I use Amazon DynamoDB. I have two tables: users and comments. In general, DynamoDB is “schemaless”: each row is like a JSON blob, and the keys in one row don’t have to match up with the keys in the next row. However, it is typical that all rows follow some common format. Here are the schemas for the two tables:

comments:
  entry_id, # the blog post you are leaving a comment on,
  comment_id, # unique for each comment
  author_name,
  created_at,
  text,

users:
  github_id, # unique for each user
  github_name, # Their name as recorded on Github; not everyone has this
  github_login, # Their Github login name; this can change,
  json_user_payload, # A dump of Github's JSON representation of the user
  secret_code, # An authentication token stored in the users' cookies

Primary Partition vs Primary Sort Keys

DynamoDB has two kinds of key: primary partition key and primary sort key. DynamoDB is a partitioned database, which means it splits up the rows in a table across different DynamoDB machines.

Primary Partition Key

DynamoDB chooses which machine to store a new row on by looking at the primary partition key. For instance: if DynamoDB runs 10 machines for your database, and you insert a row with primary partition key value “Gizmo”, then DynamoDB will store this row on the machine numbered hash("Gizmo") % 10. This should relatively evenly (but deterministically) split up the rows across machines.

It is important that the splitting up is done deterministically, because later, when you look up your data, you will need to specify the primary partition key so that DynamoDB knows where that row is stored.
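To make the idea concrete, here is a toy sketch of hash-and-modulo routing (my own illustration, not DynamoDB’s actual routing code):

import hashlib

NUM_MACHINES = 10

def machine_for(partition_key_value):
    # A stable hash: the same key always maps to the same machine,
    # so later reads know exactly where to look.
    digest = hashlib.md5(partition_key_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_MACHINES

machine_for("Gizmo")  # a deterministic machine number between 0 and 9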

Aside: Partitioning and Scaling Out

In fact DynamoDB does something slightly more complicated than the hashing scheme I described. To see why that strategy might not be the best, consider what happens if we want to increase the number of machines from 10 to 11. Increasing the number of machines to handle more load is called scaling out. Under the scheme I suggested, each row should now live on the machine numbered hash(primary_partition_key_value) % 11. But that means almost every row is going to be shuffled to a new machine.

This is similar to what happens when a hash map resizes and doubles the number of buckets. Increasing the modulus throws everything off, so that almost everything belongs in a different bucket number. For instance, if hash(key) == 50, this belongs in bucket number 0 when there are ten buckets, but bucket number ten when there are twenty buckets.

Returning to the scenario of partitioning, it would be good if, when an 11th machine is added, we did not reshuffle almost all the data. Transmitting data to other machines can be very slow. Instead, the ideal would be for each of the ten machines to give $\frac{1}{11}$ of its data to the 11th server. This is ideal because (1) the rows remain perfectly balanced, and (2) the minimum number of rows are transferred.

There are several ways to approach this ideal, and maybe I can talk about it in a future blog post. One of the innovations of Amazon’s Dynamo white paper is how it solves this issue.
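To give a taste of one common approach (only a rough sketch of the general idea, not what DynamoDB actually implements): consistent hashing places machines on a ring and assigns each key to the next machine clockwise, so adding a machine only moves the keys that fall into its new arc of the ring.

import bisect
import hashlib

def ring_position(s):
    # Map a string to a point on a (very large) circular keyspace.
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, machines):
        # Each machine owns a point on the ring; a key belongs to the first
        # machine at or after the key's own position (wrapping around).
        self.points = sorted((ring_position(m), m) for m in machines)
        self.positions = [p for p, _ in self.points]

    def machine_for(self, key):
        idx = bisect.bisect_right(self.positions, ring_position(key)) % len(self.points)
        return self.points[idx][1]

ring = ConsistentHashRing([f"machine-{i}" for i in range(10)])
ring.machine_for("Gizmo")

# Adding an 11th machine only reassigns the keys that now fall in its arc;
# everything else stays put. (Real systems also give each machine many
# "virtual nodes" so the arcs end up roughly equal in size.)
bigger_ring = ConsistentHashRing([f"machine-{i}" for i in range(11)])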

Primary Sort Key

In my comments table, the primary partition key is entry_id. This might seem somewhat odd, because there can easily be multiple comment records for a single blog post. In a SQL DB, the primary key would probably be the comment_id.

A common query is to fetch all comments for a given blog post. Therefore, it is desirable that all these comments live on the same machine. That means the query can be answered by talking to a single machine. That is why I have selected entry_id as the primary partition key.

To reflect that entry_id is not itself a full primary key, I have also specified a primary sort key, which is comment_id. That means it is allowed for there to be multiple rows with one entry_id, as long as they have different comment_ids.

Note: I cannot easily look up a comment by its comment_id alone. The comment_id does not tell me on which machine that comment lives. I also need the entry_id; without it, DynamoDB wouldn’t know which machine to look in. To look up a single comment I need the pair (entry_id, comment_id).
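I actually set my tables up with a few clicks in the AWS console, but for illustration, the same key schema expressed with boto3 would look roughly like this (the throughput numbers below are placeholders, not what I actually provisioned):

import boto3

dynamodb = boto3.resource('dynamodb')

dynamodb.create_table(
    TableName='comments',
    KeySchema=[
        {'AttributeName': 'entry_id', 'KeyType': 'HASH'},     # primary partition key
        {'AttributeName': 'comment_id', 'KeyType': 'RANGE'},  # primary sort key
    ],
    AttributeDefinitions=[
        {'AttributeName': 'entry_id', 'AttributeType': 'S'},
        {'AttributeName': 'comment_id', 'AttributeType': 'S'},
    ],
    ProvisionedThroughput={'ReadCapacityUnits': 1, 'WriteCapacityUnits': 1},
)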

As a bonus, the comments will be sorted on each machine in order of (entry_id, comment_id). This means that all comments for a given entry_id will live “next” to each other in storage. Reading comments for a single entry should involve reading one disk block.

In my scenario, it’s not a very useful property that, within a given entry_id, comments are further sorted by comment_id. That could help if I wanted to look up a single comment by (entry_id, comment_id): within the block of comments for entry_id, I could do binary search to find the comment with the given comment_id.

But I don’t want to retrieve comments one-by-one this way: I always want to retrieve all the comments. Still, if I allowed you to update a previously posted comment, then it might be nice that the DB can so quickly find a specified comment.
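If I ever did support editing, the lookup would need both halves of the composite key. Here is a hypothetical sketch using the comments_table handle from the Lambda code below (there is no update endpoint on this blog today):

def update_comment_text(entry_id, comment_id, new_text):
    # comment_id alone cannot locate the row; the entry_id (partition key)
    # is required too.
    comments_table.update_item(
        Key={"entry_id": entry_id, "comment_id": comment_id},
        UpdateExpression="SET #t = :new_text",
        ExpressionAttributeNames={"#t": "text"},  # alias, in case "text" collides with a reserved word
        ExpressionAttributeValues={":new_text": new_text},
    )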

Evaluation of DynamoDB

I liked using DynamoDB. You can set up a DynamoDB database with a few clicks. You don’t need to specify a schema; you can change the structure of your stored data at will. Adding tables is easy. I love that I don’t have to do anything to make sure the DB stays online and operational.

DynamoDB costs very little at my usage level. The cost is proportional to the request rate that you want to support. Things that go into the cost calculation are:

  • The number of reads and writes per second.
  • The size of the data you are fetching/storing.

Because this blog is so seldom read, and comments are not that long, I can use the minimum amount of read resources. For me the cost is about $1/month.

AWS Lambda

Now that I have somewhere to store comments, I need to write server logic to do things like:

  • Fetch comments,
  • Store a new comment from a logged-in user,
  • Log in a user.

To do this I use AWS Lambda. Lambda lets you write code (which they call a Lambda function) that gets executed when an incoming event is fired. For instance, events can be fired when someone makes an API request. AWS offers a product called API Gateway which allows you to connect API endpoints to Lambda functions.

(BTW, “Lambda” here doesn’t really share a meaning with the Lambda Calculus or the lambda keyword in Python or Ruby. Sometimes this kind of service is called functions as a service.)

In short, you can treat Lambda functions as a simple place to write backend web request “controller” logic.

The nice thing is that AWS takes care of deploying your Lambda functions. You don’t have to rent a server box on EC2, you don’t have to install any software, you don’t have to configure a web server on the EC2 machine, you don’t have to set up something like Passenger to run your web application…

Also, AWS Lambda will only charge you when your Lambda function is executed. You get 1,000,000 requests per month for free. My blog had 820 requests for comments this past month, so I pay Amazon nothing for Lambda. (Lambda function execution is time limited, so no cheating by doing hours of bitcoin mining for free in a Lambda function!)

So let’s see some Lambda code.

Fetching Comments

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
comments_table = dynamodb.Table('comments')
users_table = dynamodb.Table('users')

def get_comments(event, context):
    entry_id = event["entry_id"]
    response = comments_table.query(
        KeyConditionExpression=Key('entry_id').eq(entry_id)
    )

    comments = response['Items']

    return {
        "statusCode": 200,
        "comments": comments
    }

This code is relatively self-explanatory. boto3 is the library that Amazon provides for interacting with AWS services. At the top I grab handles to the comments table and (for use later, in other event handlers) the users table.

The get_comments function will be fired when someone makes a request for comments. They will specify the entry_id, which will be passed to us inside the event object (which contains the params of the request).

We submit a query to the comments_table, asking for all records that meet the expressed requirement that entry_id equal the specified value.

Last, the function returns a dictionary specifying the JSON response.

Simple!
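One caveat I’m glossing over: a single query call returns at most 1 MB of data, and boto3 signals that there is more via a LastEvaluatedKey in the response. No comment thread on this blog is anywhere near that big, but handling it would look something like this sketch:

def get_all_comments(entry_id):
    # Follow LastEvaluatedKey until the result set is exhausted.
    items = []
    kwargs = {"KeyConditionExpression": Key('entry_id').eq(entry_id)}
    while True:
        response = comments_table.query(**kwargs)
        items.extend(response['Items'])
        if 'LastEvaluatedKey' not in response:
            return items
        kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']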

Creating Comments

Here’s the code for posting a new comment. In a moment I’ll break it down:

import binascii
import os
from datetime import datetime, timezone

def create_comment(event, context):
  entry_id, text, author_github_id, author_secret_code = (
      event["payload"]["entry_id"],
      event["payload"]["comment_text"],
      event["payload"]["author_github_id"],
      event["payload"]["author_secret_code"],
  )

  # .get() returns None when the response has no 'Item' key (i.e., no such user).
  author_user = users_table.get_item(Key={"github_id": author_github_id}).get('Item')
  print(author_user)

  if author_user is None:
      return {
          "statusCode": 404,
          "message": f"No user found for this github_id: {author_github_id}.",
      }

  if author_user['secret_code'] != author_secret_code:
      return {
          "statusCode": 403,
          "message": f"That is the wrong secret_code for this github_id: {author_github_id}.",
          "submittedUserSecretCode": author_secret_code,
      }

  author_github_login, author_github_name = author_user['github_login'], author_user['github_name']
  comment_id = str(binascii.b2a_hex(os.urandom(15)), "ascii")
  created_at = datetime.now(timezone.utc)

  comment = {
      "entry_id": entry_id,
      "comment_id": comment_id,
      "created_at": created_at.isoformat(),
      "author_github_id": author_github_id,
      "author_github_login": author_github_login,
      "author_github_name": author_github_name,
      "text": text,
  }

  comments_table.put_item(Item=comment)

  return {
      "statusCode": 200,
      "comment": comment,
  }

First things first: getting the params of the request:

entry_id, text, author_github_id, author_secret_code = (
  event["payload"]["entry_id"],
  event["payload"]["comment_text"],
  event["payload"]["author_github_id"],
  event["payload"]["author_secret_code"],
)

The user needs to specify what entry they are posting a comment for, and the text of that comment. They need to specify who they are; the primary key for the users table is github_id. A user’s Github ID uniquely and permanently identifies a user.

When a user logs in, I give them a secret_code. I store this in the user’s cookie, and also in their users row in DynamoDB. This is how, when the user makes a request to create a comment with a given github_id, we can verify they are who they say. Let’s do that now:
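The login Lambda is a story for another post, but to give a rough idea, issuing the secret_code might look something like this sketch (the real handler also does the Github OAuth dance, which I’m omitting; remember_logged_in_user and its github_user argument are just illustrative names):

import json
import secrets

def remember_logged_in_user(github_user):
    # github_user: the JSON dict Github's API returns for the authenticated user.
    secret_code = secrets.token_hex(16)
    users_table.put_item(Item={
        "github_id": github_user["id"],
        "github_login": github_user["login"],
        "github_name": github_user.get("name"),
        "json_user_payload": json.dumps(github_user),
        "secret_code": secret_code,  # the frontend also sets this in the user's cookie
    })
    return secret_code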

author_user = users_table.get_item(Key={"github_id": author_github_id}).get('Item')
print(author_user)

if author_user is None:
    return {
        "statusCode": 404,
        "message": f"No user found for this github_id: {author_github_id}.",
    }

if author_user['secret_code'] != author_secret_code:
    return {
        "statusCode": 403,
        "message": f"That is the wrong secret_code for this github_id: {author_github_id}.",
        "submittedUserSecretCode": author_secret_code,
    }

In this code, I look up the user in DynamoDB. Just in case there was never any user stored for the given github_id, I check for that. (get_item’s response only contains an 'Item' key when a matching row exists, which is why I use .get('Item').)

Next, I check that the submitted author_secret_code matches the secret_code I stored in the users table. If not, I tell the user that they are not authorized.

In those anomalous cases, I return a 404 (Not Found) or a 403 (Forbidden) status as appropriate. I only proceed with the rest of the function if all is well.

Storing the Comment

Last up is storing the comment in DynamoDB.

author_github_login, author_github_name = author_user['github_login'], author_user['github_name']
comment_id = str(binascii.b2a_hex(os.urandom(15)), "ascii")
created_at = datetime.now(timezone.utc)

comment = {
    "entry_id": entry_id,
    "comment_id": comment_id,
    "created_at": created_at.isoformat(),
    "author_github_id": author_github_id,
    "author_github_login": author_github_login,
    "author_github_name": author_github_name,
    "text": text,
}

comments_table.put_item(Item=comment)

return {
    "statusCode": 200,
    "comment": comment,
}

I first pull out a couple fields of the user: github_login (their login name), github_name (their human name, if provided).

Next, I randomly generate a hex string for the comment_id value. os.urandom(15) generates 15 bytes of random data, while binascii.b2a_hex converts those bytes to a string of hexadecimal ASCII characters. Each byte translates to two hexadecimal digits, since

$$ \begin{align} 2^8 = 2^4 \cdot 2^4 = 16 \cdot 16 \end{align} $$

The comment_id is selected without first looking at the comment_id values that exist in DynamoDB. Is this safe? What if I accidentally reuse a comment_id that has already been used by a previously saved comment?

If I do reuse a comment_id, this new comment will clobber the old comment. I will lose that old comment. But this is very unlikely given the number of comments I am likely to receive on this blog, and the huge size of the ID space ($2^{120}$ possible ID values). Regardless, the worst case scenario is that I lose one old comment.
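To put a rough number on “very unlikely”: the usual birthday-problem approximation says the chance of any collision among $n$ random 120-bit IDs is about

$$ \begin{align} P(\text{collision}) \approx \frac{n(n-1)}{2 \cdot 2^{120}}, \end{align} $$

so even a wildly optimistic 10,000 comments puts the collision probability on the order of $10^{-29}$.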

Author Data Is Denormalized Into The Comments

You may ask: instead of storing the author_github_name and author_github_login in the comments table, why isn’t it sufficient to store the author_github_id, which is the primary key into the users table? I can always look up a user given an author_github_id.

The reason I have denormalized some of the author’s user data is as follows. DynamoDB is a partitioned database, and as such, you typically want to avoid logic that requires querying data across partitions. For instance, we know that because of the entry_id partition key, all comments for blog entry number 123 are stored on the same DynamoDB machine.

Say there are 100 comments for blog entry number 123. If the comments do not contain the user’s information like login or name, that information will have to be fetched from the users table. But each of the 100 comments for this post could be authored by a different user, each with their own github_id. So the row for each of 100 authors could live on any partition. We might have to query as many as 100 partitions of the users table to get information about 100 users.

This would perform poorly. Thus the denormalization: if all the needed user data is stored directly in the comments rows, then there is no need to write any join logic.
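For the curious, the join I’m avoiding would look roughly like this (a hypothetical sketch, assuming the comments rows only stored author_github_id; this code is not in my Lambda functions):

def get_comments_with_authors(entry_id):
    # One query for the comments (a single partition)...
    response = comments_table.query(
        KeyConditionExpression=Key('entry_id').eq(entry_id)
    )
    comments = response['Items']

    # ...then one read per comment, potentially hitting a different
    # partition of the users table for every author.
    for comment in comments:
        author = users_table.get_item(
            Key={"github_id": comment["author_github_id"]}
        ).get('Item')
        comment["author_github_login"] = author["github_login"]
        comment["author_github_name"] = author["github_name"]
    return comments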

Denormalization Slows Down Name Changes

The downside to this denormalization happens when a user (say, Markov) goes to change their name (say to “Markov The Great”). In a relational database, we would change the name in one location: Markov’s record in the users table. In the denormalized model, we must update Markov’s name in every comment Markov has ever written.

The comments are not partitioned by author_github_id; they are partitioned by entry_id. Therefore the comments written by Markov are likely to be spread across the DynamoDB machines. We would have to scan all the partitions, looking for comments written by Markov, and whenever we find one, update Markov’s name.

Note: not only will we have to scan all partitions, we’ll have to scan every row on every partition, since the rows are not sorted by author_github_id even within one partition!

Now back to this blog. Whenever you log in via Github, I will store the latest version of your name in the users table. However, I will not go back and change your name in old comments. In principle I could, but I am lazy and don’t want to.
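If I ever stop being lazy, the backfill would be an expensive full scan, roughly like this hypothetical sketch (I do not run anything like this):

from boto3.dynamodb.conditions import Attr

def backfill_author_name(github_id, new_name):
    # A Scan reads every row in every partition; the FilterExpression only
    # trims what gets returned, not what gets read.
    kwargs = {"FilterExpression": Attr('author_github_id').eq(github_id)}
    while True:
        response = comments_table.scan(**kwargs)
        for comment in response['Items']:
            comments_table.update_item(
                Key={"entry_id": comment["entry_id"],
                     "comment_id": comment["comment_id"]},
                UpdateExpression="SET author_github_name = :name",
                ExpressionAttributeValues={":name": new_name},
            )
        if 'LastEvaluatedKey' not in response:
            break
        kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']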

AWS API Gateway

I have shown you all these Lambda functions (except the one that logs in a user). How do they ever get invoked?

AWS provides a service called API Gateway. In API Gateway, you can map API routes to Lambda functions. For instance, you can say that GET /comments maps to get_comments, POST /comments maps to create_comment.

AWS will provide you a domain name (mine is https://7n5x5r84kc.execute-api.us-west-2.amazonaws.com/production) where the endpoints live. So you can get this blog entry’s comments by visiting:

https://7n5x5r84kc.execute-api.us-west-2.amazonaws.com/production/comments?entry_id=c86b47d0d8a48f31581744d160b3dbdd

If anyone leaves a comment, you should start seeing data there!
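(If you’d rather poke at the endpoint from Python than from a browser, something like this works:)

import requests

API_BASE = "https://7n5x5r84kc.execute-api.us-west-2.amazonaws.com/production"

response = requests.get(
    f"{API_BASE}/comments",
    params={"entry_id": "c86b47d0d8a48f31581744d160b3dbdd"},
)
print(response.json())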

If you like, you can configure API Gateway to serve your API under a domain name of your own (maybe api.blog.self-loop.com). For simplicity, I just used the default Amazon gave me.

My users make cross-site AJAX requests from blog.self-loop.com to 7n5x5r84kc.execute-api.us-west-2.amazonaws.com, so I also have to configure CORS. Note that I would have to do this even if the API lived at api.blog.self-loop.com, since even a subdomain counts as a different origin. Anyway, API Gateway makes it easy to set up CORS.

I found API Gateway to be a bit of a pain in the ass to configure. The user interface is clunky, and sometimes not all the request parameters I wanted would be passed through properly by API Gateway to the Lambda functions. This was always due to misconfiguration by me, though.

Clunky or not, using API Gateway to fire Lambda events is way simpler than the traditional alternative: rent an EC2 box, run a server, et cetera. In the final analysis, this was a good way to develop.

Future Blog Posts

I still want to talk about the React code I wrote. In the meantime you can read the code on Github.

Also, I wanted to talk a little about how I do Github OAuth. But that will have to wait too! If you want a sneak peek, you can read the relevant Lambda code I wrote on Github.
