Interview with MongoDB developer [podcast recommendation]

On the way to Norfolk today I listened to a podcast I’d not heard of before called Techzing. Episode 39 (the most recent) featured a fantastically geeky interview with one of the 10gen developers who works on MongoDB, which I talked about in my last post.

If you’re interested in web development, scalability, databases and the NoSQL movement then I’d definitely recommend listening if you get a chance.

You can listen to or download the episode here: http://techzinglive.com/?p=192, and this is the iTunes podcast subscription link: http://itunes.apple.com/gb/podcast/techzing/id318567721

Authentication at Lincoln

Planning auth.lincoln.ac.uk

One of the biggest parts of core.lincoln is the authentication API. The current applications used at Lincoln all use either our Windows domain logins (e.g. “abilbie” for my staff account or “06081032” for my student account) or our SafeCom printing ID (a staff member’s employment ID, or for students their eight-digit student number). This presents a small problem because there currently isn’t any sort of service to easily map one identifier to the other and vice versa.

Our plan for the new Lincoln authentication service is to expose authentication logins over a standard known as OAuth, which will hopefully start to bring some consistency to the way we sign in to apps at Lincoln. We’re going to implement the OAuth 1.0a spec, “2 legged” OAuth and desktop PIN OAuth (see http://apiwiki.twitter.com/Authentication). Additionally we’ll have a private SAML and ADFS authentication service for apps, which we’ll talk about in the future.
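Of those flavours, “2 legged” OAuth is the simplest to picture: the consumer signs each request with just its own API key and secret, and no user is involved at all. As a rough sketch using PHP’s OAuth PECL extension (the endpoint URL and credentials below are made-up placeholders, not our real endpoints), a 2-legged call from a consumer might look like this:

```php
<?php
// A minimal "2 legged" OAuth request: the consumer signs the call with
// its own API key and secret only - no user token is involved.
// (The endpoint URL and credentials are illustrative placeholders.)
$oauth = new OAuth('example-api-key', 'example-api-secret',
                   OAUTH_SIG_METHOD_HMACSHA1);

// fetch() signs the request and adds the Authorization header for us
$oauth->fetch('https://auth.lincoln.ac.uk/api/example');
$response = $oauth->getLastResponse();
```

The full flow, where a user grants a consumer access to their data, is described below.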

OAuth authentication works with the concept of a consumer (a web/desktop application that wishes to access user data), a provider (a service which stores user data) and a user (who has an account with the provider). The conversation goes something like this:

Consumer: Hi there, I’d like to access your data on behalf of a user please. Here is my API Key and API Secret.

Service Provider validates the API Key and API Secret.

Service Provider: That’s cool. Here is a Request Token and a Request Secret. Please send the User to this URL and append the Request Token to the query string.

Consumer redirects User to the Service Provider sign-in URL with the Request Token in the query string.

User signs into the Service Provider.

User approves the Consumer to access Protected Resources on their behalf.

Service Provider redirects User back to the Consumer Callback URL.

Consumer: Hello again, I’ve had a User sent back to my Callback URL. Here is the Request Token and Request Secret you gave me earlier; can I please now have access?

Service Provider validates the Request Token and Request Secret to check the User has authorised them.

Service Provider: Sure, here is an Access Token and an Access Secret which you can use to access Protected Resources on behalf of the User.

Consumer sends the Access Token and Access Secret to the Service Provider to establish which User has just authorised access.

Obviously it’s a bit more complicated than that, but that’s roughly how it works.
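To make that dance a bit more concrete, here’s a rough sketch of the consumer’s side of it, again using PHP’s OAuth PECL extension (all of the URLs below are illustrative placeholders rather than our real endpoints):

```php
<?php
// Step 1: exchange our API key + secret for a Request Token + Secret.
$oauth = new OAuth('example-api-key', 'example-api-secret',
                   OAUTH_SIG_METHOD_HMACSHA1);
$request = $oauth->getRequestToken('https://auth.lincoln.ac.uk/oauth/request_token');

// Step 2: redirect the User to the Service Provider's sign-in page,
// with the Request Token appended to the query string.
header('Location: https://auth.lincoln.ac.uk/oauth/authorise?oauth_token='
       . $request['oauth_token']);

// Step 3 (back at our Callback URL): swap the now-authorised Request
// Token + Secret for an Access Token + Secret.
$oauth->setToken($request['oauth_token'], $request['oauth_token_secret']);
$access = $oauth->getAccessToken('https://auth.lincoln.ac.uk/oauth/access_token');
// $access['oauth_token'] and $access['oauth_token_secret'] can now be
// used to sign requests for Protected Resources on the User's behalf.
```

(One of the “more complicated” bits: in the 1.0a spec the Callback URL also receives an oauth_verifier parameter which gets passed along when requesting the Access Token; it’s omitted here for brevity.)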

The OAuth service needs to store information about applications (such as their API tokens and secrets), and also the request and access tokens and secrets. Additionally we’ve decided to write in a permissions layer so that applications can only access certain information: for example, an app can only see basic information about users (such as their name and faculty) unless it has been granted additional access to extended information such as a user’s home address or phone number.

Traditionally we’d just build our app on a [insert your favourite relational database here] database. However, because of the high read and write requirements of an OAuth service – for each application there will potentially be 12,000 (the current number of staff plus students) sets of request tokens and secrets and access tokens and secrets – we’ve decided to take a different approach, and are currently looking at using either Memcached or MongoDB to store these resources.

Memcached

Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.

Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.

http://memcached.org/

MongoDB

MongoDB (from “humongous”) is an open source, scalable, high-performance, schema-free, document-oriented database written in the C++ programming language.

MongoDB is designed for problems without heavy transactional requirements that aren’t easily solved by traditional RDBMSs, including problems which require the database to span many servers.

http://en.wikipedia.org/wiki/MongoDB

The advantage of Memcached for us is that it’s just so fast, because everything is stored in memory. The disadvantage is that we need servers with a beefy amount of RAM, and if the machine should power off for any reason it loses its entire store, meaning that until everyone has used an application again it has to keep hitting the database to look data up. Additionally, Memcached data isn’t replicated across Memcached instances.
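For illustration, storing a request token/secret pair in Memcached from PHP might look something like this (the key scheme and the ten-minute expiry are just assumptions for the example):

```php
<?php
// Connect to a Memcached node. Everything lives in memory only, so a
// restart of the node empties the whole store.
$cache = new Memcached();
$cache->addServer('localhost', 11211);

// Cache a request token/secret pair, expiring after 10 minutes if the
// User never completes the sign-in step.
$cache->set('request:example-request-token', array(
    'secret'  => 'example-request-secret',
    'api_key' => 'example-api-key',
), 600);

// Later: look it back up (returns false if it has expired or the
// cache has been flushed).
$pair = $cache->get('request:example-request-token');
```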

MongoDB on the other hand is a document database. This means that, unlike something like MySQL where you have to define a database schema made up of tables with set columns, with Mongo you are free to define things however you like, creating and deleting fields on the fly (basically, no two documents have to have the same fields if you don’t want them to). Recent builds of Mongo have shown seriously impressive benchmarks:

Storage engine request/second benchmark (image source)

Like Memcached, Mongo is seriously fast, peaking at just under 1,600 requests/second in the benchmark graph above (the server hardware was a 2.2 GHz quad-core AMD Opteron with 2GB of RAM – benchmark source). Mongo can be configured to shard nicely across multiple servers, so we don’t have to worry about the “what ifs” so much in the event a node goes down, and it’s also got a very stable PHP PECL module. Additionally, because it is disk-based, we don’t have to worry about indexes being rebuilt after a restart and such.
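As a quick taste of both the schema-free point and the PECL driver, here’s a sketch of two differently-shaped documents living happily in the same collection (the database, collection and field names are placeholders):

```php
<?php
// Connect with the PECL Mongo driver and pick a database + collection
// (no schema needs defining up front).
$mongo  = new Mongo('mongodb://localhost:27017');
$tokens = $mongo->selectDB('auth')->selectCollection('tokens');

// Two documents in the same collection needn't share the same fields...
$tokens->insert(array('type' => 'request', 'token' => 'abc', 'secret' => 'def'));
$tokens->insert(array('type' => 'access', 'token' => 'ghi',
                      'secret' => 'jkl', 'user' => 'abilbie'));

// ...and lookups are simple key/value queries.
$doc = $tokens->findOne(array('token' => 'ghi'));
```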

We’re thinking at the moment we’re going to have two document stores in Mongo. The first is the apps document, which will hold information about apps such as name, admin contact details, API key + secret, and permissions. The second document will contain request and access tokens, plus a copy of the application permissions the tokens are linked to (to remove the need to perform joins between the two documents).
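Roughly speaking (the field names here are just our current thinking, not a final schema), documents in the two stores might look something like this:

```php
<?php
// A document in the apps store: one per registered application.
$app = array(
    'name'        => 'Posters',
    'admin'       => array('name' => 'A. Admin', 'email' => 'admin@example.com'),
    'api_key'     => 'example-api-key',
    'api_secret'  => 'example-api-secret',
    'permissions' => array('basic'),        // e.g. 'basic' or 'extended'
);

// A document in the tokens store: the application's permissions are
// copied in so a token lookup never needs a join back to the apps store.
$token = array(
    'api_key'     => 'example-api-key',
    'type'        => 'access',              // or 'request'
    'token'       => 'example-token',
    'secret'      => 'example-secret',
    'user'        => 'abilbie',
    'permissions' => array('basic'),
);
```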

Nick and I did some planning the other night (see the first picture in this post) and we think we’ve got a solid internal and external API planned out; it’s really just a case of building it now and convincing the powers that be that this is something worth investing in. The plan is that all the new apps we build will make use of the OAuth authentication process (starting with Posters), which means we can start to implement (to some extent) the beginnings of a single sign-on service (I can just hear all the Lincoln staff and students crying out with joy at this!). Additionally, once we’ve got this authentication service built and tested we can start to expose some of the APIs we’ve been building for consumption by outsiders (so if you’ve got any ideas for apps then please let us know and we’ll do our best to give you the APIs to build them).

So that’s what we’re up to. If you’ve got any ideas or experience with any of the above then please get in touch – this is all very new and any guidance will be welcomed.