{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "The atmospheric CO2 content in the year 2030\n", "--------------------------------------------------------------------------------\n", "\n", "Use Gaussian process regression to predict the global atmospheric CO2 concentration in the year 2030. Obviously, this is a question of great relevance. To make your prediction, you have available measurements taken monthly on the vulcano Mauna Loa on Hawai'i as shown below. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(mvtnorm) # multivariate Gaussian\n", "library(repr)\n", "options(repr.plot.width=8, repr.plot.height=5.5) # set plot size" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "options(repr.plot.width=15, repr.plot.height=5.5) \n", "d <- read.table(\"monthly_in_situ_co2_mlo.csv\", skip=57, sep=\",\")\n", "mask <- d$V5 > 0\n", "year <- d$V4[mask]\n", "co2 <- d$V5[mask]\n", "plot(year,co2, main=\"Mauna Loa CO2 measurements\", xlab=\"year\", ylab=\"CO2 concentration [ppm]\", type='l')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Here comes an array of kernel functions to work with**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "k1 <- function(sigf,l,dx) sigf^2*exp(-0.5*(dx/l)^2) # squared exponential, very often first and only trial \n", "kn <- function(sigy,dx) sigy^2*diag(1.0,nrow(dx),ncol(dx)) # for noise contribution\n", "k2 <- function(sigf,l1,l2,dx) sigf^2*exp(-0.5*(dx/l1)^2-2.0*(sin(pi*dx)/l2)^2) # damped periodic kernel \n", " # with period 1\n", "k4 <- function(sigf,l,dx) sigf^2*exp(-2.0*(sin(pi*dx)/l)^2) # periodic kernel with period 1\n", "k3 <- function(sigf,l,alpha,dx) sigf^2*(1+0.5*(dx/l)^2/alpha)^alpha # rational quadratic kernel\n", "k5 <- function(sigf,l,dx) { # Matérn 3/2\n", " d <- sqrt(3)*abs(dx)/l\n", " return(sigf*(1+d)*exp(-d))\n", "}\n", "k6 <- function(sigf,l,dx) { # Matérn 5/2\n", " d <- sqrt(5)*abs(dx)/l\n", " return(sigf*(1+d+(d^2)/3)*exp(-d))\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Here an implementation of Gaussian process regression**\n", " \n", "- GPtrain is an optimizer for the hyperparameters\n", "\n", "- GPpred produces the regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "GPtrain <- function(kernel,theta,x,y,noise=0) {\n", " dx <- outer(x,x,'-')\n", " nx <- length(x)\n", " yq <- y-mean(y)\n", " logl <- function(theta) { \n", " K <- kernel(theta,noise,dx)\n", " Lup <- chol(K)\n", " invKy <- backsolve(Lup,forwardsolve(t(Lup),yq))\n", " return(0.5* (t(yq) %*% invKy + 2.0*sum(log(diag(Lup))) + nx*log(2*pi)))\n", " }\n", " return(optim(theta,logl,method=\"Nelder-Mead\"))\n", "} \n", "\n", "GPpred <- function(kernel,theta,x,y,xs,noise=0) { \n", " ymean <- mean(y)\n", " K <- kernel(theta, noise, dx=outer(x,x,'-'))\n", " Ks <- kernel(theta, 0, dx=outer(xs,x,'-'))\n", " Kss <- kernel(theta, 0, dx=outer(xs,xs,'-'))\n", " Lup <- chol(K)\n", " invKy <- backsolve(Lup,forwardsolve(t(Lup),(y-ymean)))\n", " ys <- drop(Ks %*% invKy) + ymean\n", " v <- apply(Ks, MARGIN=1, FUN=forwardsolve, l=t(Lup))\n", " covys <- Kss - t(v) %*% v\n", " return(list(ys=ys,covys=covys))\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Here comes an example solution that helps but is not the real stuff:** " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "co2_with_noise <- function(theta, noise, dx) {\n", " sigf <- theta[1]\n", " l <- theta[2]\n", " k <- k1(sigf,l,dx)\n", " \n", " if (noise > 0) return(k+kn(noise,dx))\n", " else return(k)\n", "}\n", "\n", "x <- year[550:699]\n", "y <- co2[550:699]\n", "xs <- 2015.0 + seq(0,15,0.1)\n", "sigy <- 1.0\n", "\n", "o <- GPtrain(co2_with_noise,c(50.0, 35.0),x,y,noise=sigy)\n", "cat(\"Optimal hyperparameters: \", o$par, \"\\n\")\n", "cat(\"Counts and convergence:\" , o$counts, o$convergence, \"\\n\")\n", "p <- GPpred(co2_with_noise,o$par,x,y, xs,noise=sigy)\n", "plot(year,co2, main=\"Mauna Loa CO2 measurements\", xlab=\"year\", ylab=\"CO2 concentration [ppm]\", type='l', \n", " xlim=c(2010,2030), ylim=c(380,440))\n", "lines(xs,p$ys,type=\"l\", col=\"green\")\n", "lines(x,y,type=\"l\",col=\"red\")\n", "d <- 1.96*sqrt(abs(diag(p$covys))) # 2*sigma pointwise uncertainties\n", "polygon(c(xs,rev(xs)), c(p$ys-d,rev(p$ys+d)), border=NA, \n", " col = rgb(red=0.2,green=0.8,blue=0,alpha=0.3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predicting the CO2 content of the atmosphere in the year 2030\n", "+ periodic signal needs to be factored in (use k4?, or k2?)\n", "+ work incrementally when increasing kernel complexity\n", " `optim()` with Nelder-Mead fairly simple optimizer\n", "+ could a Markov sampler put to use?\n", "+ taking all data for training the GP produces large matrices slowing down the calculations;\n", " are old data particularly relevant?\n", "+ instead of assuming a noise level introducing the noise as another free parameter\n", "+ first using training data upto - say - 2016 and check for \"prediction\" is perhaps a good way to estimate\n", " the actual capability to make predictions\n", " \n", "In any case, Master Yoda states correctly: \"Hard to see the future is!\"" ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.5.0" } }, "nbformat": 4, "nbformat_minor": 2 }