I'd originally thought up the concept because my wife is constantly checking the weather. I can understand this; in the UK, the weather seems to be appalling and unpredictable most of the time. It occurred to me that she was always checking the prediction for a point in time (e.g. the weekend), but that the prediction changes as we approach that point in time, as the Met Office revises their data.
Now, you can get hold of prediction data, but you need to store it and process it. Getting it into a database is the first priority, and since there is a lot of it, the process needs to be quick and scalable. I've been using MongoDB for a couple of years, and I can definitely say I've become a fan. It's not without its challenges, though: it hogs quite a lot of memory, and it needs quite a lot of disk space. I guess this is because the documents are not stored in quite such an optimal manner as in a SQL database. Then I came across Digital Ocean, who offer really cheap servers that are easy to scale - much cheaper and easier to scale than Amazon's EC2, I have to say.
From starting to look at the docs to getting a working app running on the server, downloading, parsing and inserting data into MongoDB? Three hours. I think that's a bit of a result.
From 5000 or so locations across the UK, I harvested 200,000 predictions at 3-hour markers over the next 5 days (8 markers a day for 5 days at each location). For all subsequent predictions, I'll append the forecast to an array, giving me all the predictions for a point in time and space in a single handy doc. With a single prediction stored, each doc looks like this:
{
    "loc" : "3220",
    "fcD" : ISODate("2013-05-27T00:00:00Z"),
    "t" : "360",
    "d" : [
        {
            "D" : "NW",
            "F" : "2",
            "G" : "20",
            "H" : "78",
            "Pp" : "6",
            "S" : "11",
            "T" : "5",
            "V" : "EX",
            "W" : "7",
            "U" : "1",
            "pd" : ISODate("2013-05-23T11:00:00Z")
        }
    ],
    "_id" : ObjectId("519e077c8bf02215220014f5")
}
The loc field is the location ID - location information is stored in a separate collection. fcD is the forecast date - the date on which they expect to have this weather. t: 360 means 360 minutes into the day, i.e. 06:00. In the prediction data, 'pd' means 'prediction date' - they predicted at 11am today that the weather would have these characteristics. All other fields are stats like temperature, humidity, etc.
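The append-to-an-array idea could be sketched as an upsert keyed on location, forecast date and time marker. This is only a minimal sketch, not the app's actual code: the helper names `build_upsert` and `forecast_moment` are made up for illustration, and the dates are plain Python datetimes where the shell output shows ISODate.

```python
from datetime import datetime, timedelta


def build_upsert(loc, forecast_date, minutes, reading, predicted_at):
    """Return (filter, update) for one prediction.

    Assumption: one document per (loc, fcD, t), with every prediction
    for that slot pushed onto the "d" array.
    """
    slot = {"loc": loc, "fcD": forecast_date, "t": minutes}
    # Copy the reading so the caller's dict is not mutated.
    entry = dict(reading, pd=predicted_at)
    return slot, {"$push": {"d": entry}}


def forecast_moment(forecast_date, minutes):
    """Turn fcD plus the "t" marker (minutes into the day) into a datetime."""
    return forecast_date + timedelta(minutes=int(minutes))


slot, update = build_upsert(
    "3220",
    datetime(2013, 5, 27),
    "360",
    {"T": "5", "H": "78"},
    datetime(2013, 5, 23, 11),
)
print(forecast_moment(slot["fcD"], slot["t"]))  # 2013-05-27 06:00:00
```

With PyMongo, the pair would be passed as `collection.update_one(slot, update, upsert=True)` (where `collection` is whatever collection holds the forecasts), so the first prediction for a slot creates the doc and later ones extend its array.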
That's all the app does right now, but the next step once I've harvested some data is to start analysing it. Some things I have in mind:
- What is the average delta between first and last predictions for a point in time?
- Graph predictions over time for a point in time, i.e. how much do the predictions for temperature, humidity etc. vary from the first to the last?
- How do different weather stations compare? Are some more accurate than others? This might be regional or down to equipment or personnel at a particular site.
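The first two questions could be approached along these lines. It's a rough sketch under assumptions: it works on one doc shaped like the example above, loaded as a Python dict, the helper names (`temperature_series`, `first_last_delta`) are hypothetical, and the second prediction in the sample is invented to show a revision.

```python
from datetime import datetime, timedelta


def temperature_series(doc):
    """Return (lead_time_hours, temperature) pairs, oldest prediction first."""
    target = doc["fcD"] + timedelta(minutes=int(doc["t"]))
    series = []
    for p in sorted(doc["d"], key=lambda p: p["pd"]):
        lead_hours = (target - p["pd"]).total_seconds() / 3600
        # Values are stored as strings in the example doc, so convert.
        series.append((lead_hours, float(p["T"])))
    return series


def first_last_delta(doc):
    """Last predicted temperature minus the first one."""
    series = temperature_series(doc)
    return series[-1][1] - series[0][1]


# Illustrative document: the first entry matches the example doc,
# the second is invented to show a revision.
doc = {
    "loc": "3220",
    "fcD": datetime(2013, 5, 27),
    "t": "360",
    "d": [
        {"T": "5", "pd": datetime(2013, 5, 23, 11)},
        {"T": "7", "pd": datetime(2013, 5, 26, 11)},
    ],
}
print(temperature_series(doc))  # [(91.0, 5.0), (19.0, 7.0)]
print(first_last_delta(doc))    # 2.0
```

Averaging `first_last_delta` over every doc would give the average delta, and the series is the sort of thing that could be fed straight into a graph.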
There is also other data - what actually happened. It'd be nice to compare this across months in a Circos-style graph, linking each temperature to months that had days with the same max or min temperature.