#totalrecal: We need faster interwebz!

Now that we’ve got live data being produced from Blackboard and CEMIS, we can start writing scheduled jobs to insert this data into the Total ReCal database. However, in the case of CEMIS we’re having a few problems.

Every day a CSV file is created of all of the timetable events for each student. This file is (currently) 157MB in size and has approximately 1.7 million rows. In his last post, Nick explained that we have now developed an events API for our Nucleus metadata service, which is going to be the repository for all of this time-space data. Currently we’re able to parse each row in this CSV file and insert it into Nucleus over the API at about 0.9s per row. This isn’t good enough. As my tweet the other day shows, we need to significantly speed things up:

“So our timetable import into #totalrecal (1.7m records) will currently take 19 days. Our target is 90 mins. Hmmm”
(Tue Oct 12 17:24:31, via web)

At the moment we’re simply streaming data out of the CSV file line by line (using PHP’s fgets function) and then sending each row to Nucleus over cURL. We have two main problems. First, the CSV file is generated one student at a time, so it ideally needs to be re-ordered to group events by their unique event ID; we could then send each event to Nucleus once, with all of its associated students attached, dramatically reducing the number of API calls (a rough sketch of this idea is below). Second, parsing and building large arrays results in high memory usage, and our server currently only has 2GB of memory to play with. We’ve capped PHP at 1GB for now, but that is going to impact Apache and the other processes running on the server. Ideally we don’t want to just stick more memory into the machine, because that wouldn’t encourage us to fine-tune our code, so it isn’t an option at the moment.
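To make that grouping idea concrete, here’s a rough sketch of the kind of thing we have in mind. The column positions and the endpoint URL are placeholders rather than our real schema, and note that $events is exactly the kind of large in-memory array we need to be careful about:

$handle = fopen('timetable.csv', 'r');
$events = array(); // event ID => list of student IDs

while (($row = fgetcsv($handle)) !== false) {
    // Assumed columns: 0 = event ID, 1 = student ID
    $events[$row[0]][] = $row[1];
}
fclose($handle);

// One API call per event rather than one per row
foreach ($events as $id => $students) {
    $ch = curl_init('http://nucleus.online.lincoln.ac.uk/events/' . $id);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, array('students' => implode(',', $students)));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);
}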

Over the next few days we’re going to explore a number of options, including altering the current script to send batched data using asynchronous cURL requests (sketched below), and re-writing the script in a lower-level language. The second option is going to take a bit of time, as one of us will be learning a new language, but both should hopefully result in significantly improved performance and a decrease in execution time.
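For the batching option, PHP’s curl_multi functions let us fire off a group of requests concurrently and wait for them all to complete, instead of blocking on each request in turn. A minimal sketch, with placeholder payloads and URL:

function send_batch(array $payloads, $url) {
    $mh = curl_multi_init();
    $handles = array();

    foreach ($payloads as $payload) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $payload);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[] = $ch;
    }

    // Run all the handles until every transfer has finished
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);

    foreach ($handles as $ch) {
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
}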

I’ll write another post soon that explains our final solution.

Some geo love

Mashing up single sign-on and the CWD wasn’t enough. I’ve added some new geo APIs to our Nucleus ‘location’ service that will allow us to make location-aware websites and applications.

There is both a server-side version, which can be called via cURL/file_get_contents/etc.:

http://nucleus.online.lincoln.ac.uk/locations/geo/format/xml (your IP address)
http://nucleus.online.lincoln.ac.uk/locations/geo/format/xml?ip=86.6.170.144 (my current IP address)

(if you want JSON/JSONP/CSV then replace format/xml with format/json, format/jsonp or format/csv)

and a JavaScript client-side version:

http://nucleus.online.lincoln.ac.uk/locations/geojs (your IP address)
http://nucleus.online.lincoln.ac.uk/locations/geojs?ip=86.6.170.144 (my current IP address)

Both services return the following information:

  • which campus network the IP address is associated with (or ‘non’ if they aren’t using a campus network)
  • which campus they’re on (e.g. Brayford or Hull, or ‘non’ if they aren’t on a campus)
  • the building (we can only do this currently for wired networks and some wireless networks)
  • the postcode of the building (where possible)
  • latitude and longitude (if the IP address isn’t on a University network then it uses the Maxmind GeoCity database)
  • the closest library to their location (GCW, Holbeach, Hull, or, if they’re on a campus network, the Theology Reading Room)

So how does it work?

I’ve collected a huge list of IP ranges for the wired and wireless networks at the University, and then I use this function to loop over those ranges until it either returns TRUE (the IP address is in a range) or exhausts the list (i.e. they aren’t on a University network).
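As an illustration of the idea (the actual helper and the format we store ranges in may differ), a range check over CIDR-style ranges might look something like this:

// Does $ip fall inside a CIDR range like '10.0.0.0/16'?
function ip_in_range($ip, $cidr) {
    list($subnet, $bits) = explode('/', $cidr);
    $mask = -1 << (32 - (int) $bits);
    return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
}

// Try every known range; fall through to 'non' if nothing matches
function find_zone($ip, array $ranges) {
    foreach ($ranges as $zone => $cidrs) {
        foreach ($cidrs as $cidr) {
            if (ip_in_range($ip, $cidr)) {
                return $zone; // e.g. 'HBW'
            }
        }
    }
    return 'non';
}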

If the IP address does belong to one of our networks, I then look it up in a simple multidimensional array that contains each network’s metadata:

$zones = array(
    'HBW' => array(
        'network'   => 'Holbeach wireless',
        'campus'    => 'Holbeach',
        'postcode'  => 'PE12 7PT',
        'building'  => 'Minerva House',
        'latitude'  => '52.810004',
        'longitude' => '0.01696'
    ),
    ...
);

Finally, it outputs the result in the required format: JSON or XML on the server side, or JavaScript on the client side.

When mashed up with another Nucleus location API, we could easily make a Foursquare-like app that finds buildings around a location:

http://nucleus.online.lincoln.ac.uk/locations/buildings_near/format/xml?lat=53.228482&long=-0.547847&distance=0.5&limit=10
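For example (server side, and assuming the responses expose latitude and longitude fields like those in the zone metadata above), the mashup could be as simple as:

// Where is the caller?
$geo = json_decode(file_get_contents(
    'http://nucleus.online.lincoln.ac.uk/locations/geo/format/json'
), true);

// What buildings are near them?
$near = json_decode(file_get_contents(
    'http://nucleus.online.lincoln.ac.uk/locations/buildings_near/format/json'
    . '?lat=' . $geo['latitude'] . '&long=' . $geo['longitude']
    . '&distance=0.5&limit=10'
), true);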

I’ll make sure that all of these location APIs are properly documented soon so people can go and have fun with them. Also, none of these location APIs require any sort of authentication (though we are rate limiting, so please don’t try to kill our servers!).