Version 0.7.0 of the googleVis R package has been released, adding a new function for Gantt charts. Gantt charts are helpful to illustrate a project schedule and its dependencies.
Following the Google documentation, the project has to be broken down into task IDs, task names, resources, start dates, end dates, task duration (in milliseconds), how far the task has been completed (in percent), and finally any dependencies on other task IDs.
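The column layout described above can be sketched as a small data frame. This is only an illustration: the two tasks and all of their values are invented, and the optional call at the end assumes that the new function is `gvisGantt` and takes the column names as arguments.

```r
# Hypothetical two-task project; every value below is invented for illustration.
ms_per_day <- 24 * 60 * 60 * 1000  # milliseconds in one day

tasks <- data.frame(
  taskID = c("research", "write"),
  taskName = c("Find sources", "Write paper"),
  resource = c("research", "writing"),
  start = as.Date(c("2021-01-04", "2021-01-08")),
  end = as.Date(c("2021-01-08", "2021-01-11")),
  percentComplete = c(100, 25),
  dependencies = c(NA, "research"),  # "write" depends on "research"
  stringsAsFactors = FALSE
)

# Task duration in milliseconds, derived from the start and end dates
tasks$duration <- as.numeric(
  difftime(tasks$end, tasks$start, units = "days")
) * ms_per_day

# Build the chart only if a recent enough googleVis is installed
if (requireNamespace("googleVis", quietly = TRUE) &&
    packageVersion("googleVis") >= "0.7.0") {
  gantt <- googleVis::gvisGantt(
    tasks,
    taskID = "taskID", taskName = "taskName", resource = "resource",
    start = "start", end = "end", duration = "duration",
    percentComplete = "percentComplete", dependencies = "dependencies"
  )
  # plot(gantt) would open the chart in a browser
}
```

Storing the duration in milliseconds looks odd in R, but it mirrors the unit the Google documentation asks for.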
On Sunday the men’s 100m sprint final of the Tokyo Olympics will take place. Francesc Montané reminded me in his analysis that 9 years ago I used a simple regression model to predict the winning time for the men’s 100m sprint final of the 2012 Olympics in London. My model predicted a winning time of 9.68s, yet Usain Bolt finished in 9.63s. For this Sunday my prediction is 9.72s, with a 50% credible interval of [9.
Finally, the Insurance Data Science conference was back last week. After last year’s cancellation due to Covid-19, over 250 delegates from around the world came together on-line for the third instalment of the conference.
The event kicked off, or should we say lifted off, with a keynote by Thomas Wiecki, CEO of PyMC Labs, on Wednesday. Thomas explained how probabilistic programming can be used to assess risk and make decisions in the context of insuring rocket launches.
Book your ticket by 9 June 2021
This article illustrates how ordinary differential equations and multivariate observations can be modelled and fitted with the brms package (Bürkner (2017)) in R.
As an example I will use the well-known Lotka-Volterra model (Lotka (1925), Volterra (1926)) that describes the predator-prey behaviour of lynxes and hares. Bob Carpenter published a detailed tutorial to implement and analyse this model in Stan and so did Richard McElreath in Statistical Rethinking 2nd Edition (McElreath (2020)).
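To make the predator-prey dynamics concrete, here is a minimal base R sketch that integrates the Lotka-Volterra equations, dH/dt = (alpha - beta L) H and dL/dt = (-gamma + delta H) L, with a hand-written fourth-order Runge-Kutta step. It does not use brms or Stan, and the parameter values, starting populations and step size are assumptions chosen purely for illustration.

```r
# Right-hand side of the Lotka-Volterra equations; state = c(hare, lynx)
lv_rhs <- function(state, p) {
  hare <- state[[1]]
  lynx <- state[[2]]
  c(
    (p[["alpha"]] - p[["beta"]] * lynx) * hare,   # prey growth minus predation
    (-p[["gamma"]] + p[["delta"]] * hare) * lynx  # predator decline plus feeding
  )
}

# One classical Runge-Kutta (RK4) step of size h
rk4_step <- function(state, p, h) {
  k1 <- lv_rhs(state, p)
  k2 <- lv_rhs(state + h / 2 * k1, p)
  k3 <- lv_rhs(state + h / 2 * k2, p)
  k4 <- lv_rhs(state + h * k3, p)
  state + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
}

# Illustrative (assumed) parameters and initial populations
params <- c(alpha = 0.55, beta = 0.028, gamma = 0.80, delta = 0.024)
h <- 0.01
n_steps <- 2000  # 20 time units in total

path <- matrix(NA_real_, nrow = n_steps + 1, ncol = 2,
               dimnames = list(NULL, c("hare", "lynx")))
path[1, ] <- c(30, 4)
for (i in seq_len(n_steps)) {
  path[i + 1, ] <- rk4_step(path[i, ], params, h)
}

# matplot(seq(0, by = h, length.out = n_steps + 1), path, type = "l")
# would show the familiar out-of-phase cycles of the two populations.
```

A fitted model would estimate the four parameters and the initial state from data, rather than fixing them as done here.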
At the Insurance Data Science conference, both Eric Novik and Paul-Christian Bürkner emphasised in their talks the value of thinking about the data generating process when building Bayesian statistical models. It is also a key step in Michael Betancourt’s Principled Bayesian Workflow.
In this post, I will discuss in more detail how to set priors, and review the prior and posterior parameter distributions, but also the prior predictive distributions with brms (Bürkner (2017)).
The first Insurance Data Science event was held at Cass Business School last week, 16 - 17 July 2018.
The conference followed on from five iterations of the R in Insurance events, which have the aim of bringing together practitioners and academics to discuss and exchange ideas and needs in the sector.
Expanding the remit from R in Insurance to Insurance Data Science has also attracted talks on Python and TensorFlow.
How do you build a model from first principles? Here is a step-by-step guide.
Following on from last week’s post on Principled Bayesian Workflow I want to reflect on how to motivate a model.
The purpose of most models is to understand change, and yet, considering what doesn’t change and should be kept constant can be equally important.
I will go through a couple of models in this post to illustrate this idea.
Insurance Data Science
The abstract submission deadline for the Insurance Data Science conference at Cass Business School on 16 July 2018 is closing soon. You have until the 9th of April to submit your abstract.
Please send your abstract to
[email protected].
We like to see proposals for talks that demonstrate how data science is used in insurance, e.g. in risk assessment, customer analytics, pricing, reserving, capital management, catastrophe and econometric modelling.
This is a follow-up post on hierarchical compartmental reserving models using PK/PD models. It will show how differential equations can be used with Stan/brms and how correlation for the same group level terms can be modelled.
PK/PD is usually short for pharmacokinetic/pharmacodynamic models, but as Eric Novik of Generable pointed out to me, it could also be short for Payment Kinetics/Payment Dynamics Models in the insurance context.
Today, I will sketch out ideas from the Hierarchical Compartmental Models for Loss Reserving paper by Jake Morris, which was published in the summer of 2016 (Morris (2016)). Jake’s model is inspired by PK/PD models (pharmacokinetic/pharmacodynamic models) used in the pharmaceutical industry to describe the time course of effect intensity in response to administration of a drug dose.
The hierarchical compartmental model fits outstanding and paid claims simultaneously, combining ideas of Clark (2003), Quarg and Mack (2004), Miranda, Nielsen, and Verrall (2012), Guszcza (2008) and Zhang, Dukic, and Guszcza (2012).
Following five R in Insurance conferences, we are organising the first Insurance Data Science conference at Cass Business School London, 16 July 2018.
In 2013, we started with the aim to bring practitioners of industry and academia together to discuss and exchange ideas and needs from both sides.
R was and is a perfect glue between the two groups, a tool which both sides embrace and which has fostered the knowledge transfer between the two.
On 23 November Glenn Meyers gave a fascinating talk about The Bayesian Revolution in Stochastic Loss Reserving at the 10th Bayesian Mixer Meetup in London. Glenn worked for many years as a research actuary at Verisk/ISO; he helped to set up the CAS Loss Reserve Database and published a monograph on Stochastic loss reserving using Bayesian MCMC models.
In this blog post I will go through the Correlated Log-normal Chain-Ladder Model from his presentation.
The programme for the 2017 R in Insurance conference in Paris has been published. Talks will discuss new ideas and research with applications in life and general insurance, from network analysis, reserving and pricing to catastrophe modelling, followed by a conference dinner at the Musée d’Orsay. Registration is open until 22 May.
Agenda
9:00 am - 9:10 am Welcome - Julien Pouget (Directeur de l’ENSAE)
9:10 am - 10:00 am Opening Keynote Session: Textual analysis of expert reports to increase knowledge of technological risks - Julie Seguela, Covea
10:00 am - 11:00 am Session 1 - big data
10:00 - 10:20 › Network Analytics in Claims Level Predictive Modelling - Marcela Granados, Ernst & Young
The fifth conference on R in Insurance will be held on 8 June 2017 at ENSAE. ENSAE is the Paris Graduate School for Economics, Statistics and Finance.
The intended audience of the conference includes both academics and practitioners who are active or interested in the applications of R in Insurance.
This one-day conference will focus again on applications in insurance and actuarial science that use R, the lingua franca for statistical computation.
Last Friday the Cologne R user group came together for two talks and a quiz at Eye/o, the company behind Adblock Plus, in Köln-Ehrenfeld. Eye/o were a great host, offering nibbles and drinks to warm up the event and pizza at the end.
Cologne R user meeting at Eye/o
The first talk was given by Jiddu Alexander, a physicist turned freelance data scientist.
The 19th Cologne R user group meeting is scheduled for this Friday, 14 October 2016. We have three talks, followed by networking drinks.
Introduction to the tidyverse tools - Jiddu Alexander
Performance profiling and improvement in R - Nils Glück
Batch processing of R-Scripts with Excel - Klaus Jacobi
Venue: Eyeo GmbH, Lichtstraße 25, 50825 Köln
For further details visit our KölnRUG Meetup site.
Last Tuesday we got together for the 4th Bayesian Mixer Meetup. Product Madness kindly hosted us at their offices in Euston Square. About 50 Bayesians came along, the biggest turnout thus far, including developers of PyMC3 (Peadar Coyle) and Stan (Michael Betancourt).
The agenda had two feature talks by Dominic Steinitz and Volodymyr Kazantsev and a lightning talk by Jon Sedar.
Dominic Steinitz: Hamiltonian and Sequential MC samplers to model ecosystems
Dominic shared with us his experience of using Hamiltonian and Sequential Monte Carlo samplers to model ecosystems.
Last week the French National Institute of Health and Medical Research (Inserm) organised with the Stan Group a training programme on Bayesian Inference with Stan for Pharmacometrics in Paris.
Daniel Lee and Michael Betancourt, who ran the course over three days, are not only members of Stan’s development team, but also excellent teachers. Both were supported by Eric Novik, who gave an Introduction to Stan at the Paris Dataiku User Group last week as well.
We released googleVis version 0.6.1 on CRAN last week. The update fixes issues with setting certain options, following the switch from RJSONIO to jsonlite.
Screen shot of some of the Google Charts
New to googleVis? The package provides an interface between R and the Google Charts Tools, allowing you to create interactive web charts from R without uploading your data to Google.
The 4th R in Insurance conference took place at Cass Business School London on 11 July 2016. This one-day conference focused once more on the wide range of applications of R in insurance, actuarial science and beyond. The conference programme covered topics including reserving, pricing, loss modelling, the use of R in a production environment and much more.
The audience of the conference included both practitioners (c.80%) and academics (c.
Last Thursday the Cologne R user group came together again. This time, our two speakers arrived from Bavaria, to talk about Spark and R Server.
Introduction to Apache Spark
Dubravko Dulic gave an introduction to Apache Spark and why Spark might be of interest to data scientists using R. Spark is designed for cluster computing, i.
Two Bayesian Mixer meet-ups in a row. Can it get any better?
Our third ‘regular’ meeting took place at Cass Business School on 24 June. Big thanks to Pietro and Andreas, who supported us from Cass. The next day, Jon Sedar of Applied AI managed to arrange a special summer PyMC3 event.
3rd Bayesian Mixer meet-up
First up was Luis Usier, who talked about cross validation. Luis is a former student of Andrew Gelman, so, of course, his talk touched on Stan and the ‘loo’ (leave one out) package in R.
Hurry! The early bird registration offer for the 4th R in Insurance conference, 11 July 2016, at Cass Business School closes 30 May.
This one-day conference will focus once more on applications in insurance and actuarial science that use R, the lingua franca for statistical computation. Topics covered include reserving, pricing, loss modelling, the use of R in a production environment, and more.
We have a fantastic programme with international speakers and conference dinner at Ironmongers Hall.
We are delighted to announce that the programme for the 4th R in Insurance conference at Cass Business School in London, 11 July 2016, has been finalised.
Register by the end of May to get the early bird booking fee.
The organisers gratefully acknowledge the sponsorship of Verisk, Mirai Solutions, Applied AI, Studio, CYBAEA and Oasis, without whom the event wouldn’t be possible.
Agenda [09:00 - 10:00] Keynote 1:
R Interface to Google Charts.
Staying on top of new CRAN packages is quite a challenge nowadays. However, thanks to Dirk’s CRANberries service I occasionally spot a new gem, such as wbstats, which appeared on CRAN last week.
Similarly to the WDI package, wbstats offers an interface to the World Bank database.
With the functions of wbstats the World Bank data can be searched and data for several indicators requested. Unlike WDI, the data is returned in a ‘long’ table with one column for all values and a separate column for the indicators.
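The difference between the two layouts can be sketched in base R without calling either package. The countries, indicator names and values below are invented for illustration and are not World Bank data; `reshape` from the stats package turns the ‘long’ table into a ‘wide’ one with one column per indicator.

```r
# Hypothetical 'long' table: one value column, one indicator column
long <- data.frame(
  country = rep(c("A", "B"), each = 2),
  year = 2015,
  indicator = rep(c("gdp", "population"), times = 2),
  value = c(1.2, 3.4, 5.6, 7.8),
  stringsAsFactors = FALSE
)

# The same data in 'wide' form: one column per indicator
wide <- reshape(
  long,
  idvar = c("country", "year"),
  timevar = "indicator",
  direction = "wide"
)
names(wide)  # country, year, value.gdp, value.population
```

The long form is often the more convenient input for grouped plotting and modelling, while the wide form is easier to read as a table.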
Last Friday the 2nd Bayesian Mixer Meetup (@BayesianMixer) took place at Cass Business School, thanks to Pietro Millossovich and Andreas Tsanakas, who helped to organise the event.
Bayesian Mixer at Cass
First up was Davide De March talking about the challenges in biochemistry experimentation, which are often characterised by complex and emerging relations among components.
The scarcity of prior knowledge about the bindings of complex molecules leaves a fertile field for a probabilistic graphical model.
Hurry! The abstract submission deadline for the 4th R in Insurance conference in London, 11 July 2016, is approaching.
You have until the 28th of March to submit a one-page abstract for consideration. Both academic and practitioner proposals related to R are encouraged. Please email your abstract of no more than 300 words (in text or pdf format) to
[email protected].
Invited talks will be given by: Mario V.
Last Friday the Cologne R user group came together for the 17th time. This time, we were in for a special treat, with two talks by psychologists!
But, there was nothing to fear, we were in safe hands, and for the first time, we met at the new Microsoft office in Cologne.
Lecture room at Microsoft, Cologne
First up was Meik Michalke from the University of Düsseldorf presenting the RKWard project.
The 17th Cologne R user group meeting is scheduled for this Friday, 26 February 2016. We have two talks, followed by networking drinks.
Introduction to Bayesian Regression Models using Stan with the brms package - Paul-Christian Bürkner (Uni Münster)
RKWard: A Graphical User Interface and Integrated Development Environment for Statistical Analysis with R - Meik Michalke (Uni Düsseldorf)
Venue: Microsoft Deutschland, Holzmarkt 2a, 50676 Köln
We had our first successful Bayesian Mixer Meetup last Friday night at the Artillery Arms!
We expected about 15 - 20 people to turn up when we booked the function room overlooking Bunhill Cemetery and Bayes’ grave. Now, looking at the photos taken during the evening, it seems that our prior belief was pretty good.
The event started with a talk from my side about some very basic Bayesian models, which I used a while back to get my head around the concepts in an insurance context.
My traditional workflow for embedding R graphics into a blog post has been via PNG files that I upload online. However, when I created a ‘simple’ graphic with only basic curves and triangles for a recent post, I noticed that the PNG output didn’t look as crisp as I expected it to be. So, eventually I used an SVG (scalable vector graphic) instead.
Creating an SVG file with R couldn’t be easier; e.
There is a nice pub between Bunhill Fields and the Royal Statistical Society in London: The Artillery Arms. Clearly, the perfect place to bring people together to talk about Bayesian Statistics. Well, that’s what Jon Sedar (@jonsedar, applied.ai) and I thought.
Source: http://www.artillery-arms.co.uk/
Hence, we’d like to organise a Bayesian Mixer Meetup on Friday, 12 February, 19:00. We booked the upstairs function room at the Artillery Arms and if you look outside the window, you can see Thomas Bayes’ grave.
I have admired the work of the artist Bridget Riley for a long time. She is now in her eighties but, it seems, still very creative and productive. Some of her recent work combines simple triangles in fascinating compositions. The longer I look at them, the more patterns I recognise.
Yet, the actual painting can be explained easily, in a sense of a specification document to reproduce the pattern precisely.
Formatting data for output in a table can be a bit of a pain in R. The package formattable by Kun Ren and Kenton Russell provides some intuitive functions to create good looking tables for the R console or HTML quickly. The package home page demonstrates the functions with illustrative examples nicely.
There are a few points I really like: the functions accounting, currency, percent transform numbers into better human readable output; cells can be highlighted by adding color information; contextual icons can be added, e.
Following the successful 3rd R in Insurance conference in Amsterdam last year, we return to London this year.
The registration for the 4th conference on R in Insurance on Monday 11 July 2016 at Cass Business School has opened.
This one-day conference will focus again on applications in insurance and actuarial science that use R, the lingua franca for statistical computation.
The intended audience of the conference includes both academics and practitioners who are active or interested in the applications of R in insurance.
The 16th Cologne R user group meeting is scheduled for this Friday, 4 December 2015, and we have a great line-up with three talks followed by networking drinks.
Monitoring process change using Bayesian methods (Mick Cooney)
A common business problem is to evaluate the effect of a change of process, and this talk will discuss a straightforward approach to this using conjugate priors.
Editing R files with DataJoy (Dietmar Janetzko)
Brief introduction to the online collaborative data analysis platform DataJoy.
I had the great pleasure to attend the Warsaw R meetup last Thursday. The organisers Olga Mierzwa and Przemyslaw Biecek had put together an event with a focus on R in Insurance (btw, there is a conference with the same name), discussing examples of pricing and reserving in general and life insurance.
Experience vs. Data
I kicked off with some observations of the challenges in insurance pricing.
I continue with the growth curve model for loss reserving from last week’s post. Today, following the ideas of James Guszcza [2], I will add a hierarchical component to the model, by treating the ultimate loss cost of an accident year as a random effect. Initially, I will use the nlme R package, just as James did in his paper, and then move on to Stan/RStan [6], which will allow me to estimate the full distribution of future claims payments.
Last week I posted a biological example of fitting a non-linear growth curve with Stan/RStan. Today, I want to apply a similar approach to insurance data using ideas by David Clark [1] and James Guszcza [2].
Instead of predicting the growth of dugongs (sea cows), I would like to predict the growth of cumulative insurance loss payments over time, originating from different origin years. Loss payments of younger accident years are just like a new generation of dugongs: they will be small in size initially and grow as they get older, until the losses are fully settled.
I suppose the go-to tool for fitting non-linear models in R is nls of the stats package. In this post I will show an alternative approach with Stan/RStan, as illustrated in the example, Dugongs: “nonlinear growth curve”, that is part of Stan’s documentation.
The original example itself is taken from OpenBUGS. The data describes the length and age measurements for 27 captured dugongs (sea cows). Carlin and Gelfand (1991) model the data using a nonlinear growth curve with no inflection point and an asymptote as \(x_i\) tends to infinity:
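For reference, the growth curve of the dugongs example, as I recall it from the BUGS documentation, can be written as follows. The symbols are those of the BUGS example, with \(\tau\) denoting a precision, and may differ from the notation used in the rest of the post.

```latex
\[\begin{aligned}
Y_i & \sim \mbox{Normal}(\mu_i, \tau), \quad i = 1, \dots, 27\\
\mu_i & = \alpha - \beta \gamma^{x_i}, \quad \alpha, \beta > 0, \; 0 < \gamma < 1
\end{aligned}\]
```

Because \(0 < \gamma < 1\), the term \(\beta \gamma^{x_i}\) vanishes as \(x_i\) grows, so \(\alpha\) is the asymptotic length and \(\alpha - \beta\) the expected length at age zero.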
Following the successful 3rd R in Insurance conference in Amsterdam this year, we will return to London next year. We will be back at Cass Business School, 11 July 2016.
The event will focus again on the use of R in insurance, bringing together experts from industry and academia with a diverse background of disciplines, such as actuarial science, catastrophe modelling, finance, statistics and computer science.
We are delighted to announce our keynote speakers already: Dan Murphy and Mario V.
We released version 0.2.2 of ChainLadder a few weeks ago. This version adds back the functionality to estimate the index parameter for the compound Poisson model in glmReserve using the cplm package by Wayne Zhang.
Ok, what does this all mean? I will run through a couple of examples and look behind the scenes of glmReserve. However, the clue is in the name: glmReserve is a function that uses a generalised linear model to estimate future claims, assuming claims follow a Tweedie distribution.
Last Friday the Cologne R user group came together for the 15th time. Since its inception over three years ago the group has evolved from a small gathering in a pub into an active data science community, covering wider topics than just R. Still, R is the link and glue between the different interests. Last Friday’s agenda was a good example of this, with three talks touching on workflow management, web development and risk analysis.
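To give a flavour of the idea, here is a minimal sketch in base R that does not call glmReserve itself. It fits a quasi-Poisson GLM with a log link, which corresponds to a Tweedie variance function \(V(\mu) = \mu^p\) with \(p = 1\), to a hypothetical 4x4 triangle of incremental claims. The triangle values are invented, and the choice of origin and development period as the only factors is an assumption of this sketch.

```r
# Hypothetical incremental claims triangle in long format (invented amounts)
claims <- data.frame(
  origin = factor(c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4)),
  dev    = factor(c(1, 2, 3, 4, 1, 2, 3, 1, 2, 1)),
  value  = c(100, 60, 30, 10, 110, 70, 35, 120, 75, 130)
)

# Over-dispersed Poisson GLM: variance proportional to the mean, log link
fit <- glm(value ~ origin + dev,
           family = quasipoisson(link = "log"),
           data = claims)

# Cells in the lower right of the triangle have not been observed yet
future <- expand.grid(origin = factor(1:4), dev = factor(1:4))
future <- future[as.integer(future$origin) + as.integer(future$dev) > 5, ]

# Expected future incremental claims and the total reserve estimate
future$predicted <- predict(fit, newdata = future, type = "response")
reserve <- sum(future$predicted)
```

Estimating the index parameter \(p\) from the data, rather than fixing it at 1 as done here, is what the compound Poisson functionality mentioned above is about.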
The 15th Cologne R user group meeting is scheduled for this Friday, 18 September 2015 and we have a full agenda with three talks followed by networking drinks.
R in big data pipeline with luigi (Yuki Katoh)
R in big data pipeline: Put your awesome R code into production. Learn how to build a solid big data pipeline around it.
shinyjs (Paul Viefers)
Using JavaScript in shiny, without knowing JavaScript
Experience vs Data (Markus Gesmann)
How to assess risks with small data sets.
It seems the summer is coming to an end in London, so I shall take a final look at my ice cream data that I have been playing around with to predict sales statistics based on temperature for the last couple of weeks [1], [2], [3].
Here I will use the new brms (GitHub, CRAN) package by Paul-Christian Bürkner to derive the 95% prediction credible interval for the four models I introduced in my first post about generalised linear models.
Last week I presented visualisations of theoretical distributions that predict ice cream sales statistics based on linear and generalised linear models, which I introduced in an earlier post.
Theoretical distributions
Today I will take a closer look at the log-transformed linear model and use Stan/rstan, not only to model the sales statistics, but also to generate samples from the posterior predictive distribution.
Two weeks ago I discussed various linear and generalised linear models in R using ice cream sales statistics. The data showed, not surprisingly, that more ice cream was sold at higher temperatures.
icecream <- data.frame(
  temp=c(11.9, 14.2, 15.2, 16.4, 17.2, 18.1, 18.5, 19.4, 22.1, 22.6, 23.4, 25.1),
  units=c(185L, 215L, 332L, 325L, 408L, 421L, 406L, 412L, 522L, 445L, 544L, 614L)
)
I used a linear model, a log-transformed linear model, a Poisson and Binomial generalised linear model to predict sales within and outside the range of data available.
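As a minimal sketch, three of these models can be fitted in base R as shown below. The Binomial model is left out because it needs an assumption about the market size that is not stated here, and the two temperatures used for prediction (0 and 35 degrees) are my own choice to illustrate behaviour outside the observed range.

```r
icecream <- data.frame(
  temp = c(11.9, 14.2, 15.2, 16.4, 17.2, 18.1, 18.5, 19.4, 22.1, 22.6, 23.4, 25.1),
  units = c(185L, 215L, 332L, 325L, 408L, 421L, 406L, 412L, 522L, 445L, 544L, 614L)
)

# Linear model on the original scale
lin_mod <- lm(units ~ temp, data = icecream)

# Linear model on the log scale
log_mod <- lm(log(units) ~ temp, data = icecream)

# Poisson GLM with log link
pois_mod <- glm(units ~ temp, family = poisson(link = "log"), data = icecream)

# Predictions outside the observed temperature range (assumed values)
new_temp <- data.frame(temp = c(0, 35))
pred_lin <- predict(lin_mod, newdata = new_temp)
# Back-transforming gives the median, not the mean, under a log-normal model
pred_log <- exp(predict(log_mod, newdata = new_temp))
pred_pois <- predict(pois_mod, newdata = new_temp, type = "response")
```

Comparing `pred_lin` with the other two shows why the model choice matters outside the data: the straight line is free to go below zero, while the log-scale and Poisson predictions are positive by construction.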
Linear models are the bread and butter of statistics, but there is a lot more to it than taking a ruler and drawing a line through a couple of points.
Some time ago Rasmus Bååth published an insightful blog article about how such models could be described from a distribution centric point of view, instead of the classic error terms convention.
I think the distribution centric view makes generalised linear models (GLM) much easier to understand as well.
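The two ways of writing down a linear model, and the step from there to a GLM, can be sketched as follows. The symbols \(\alpha\), \(\beta\), \(\sigma\) and the Poisson example with a log link are my own choice of notation and illustration.

```latex
\[\begin{aligned}
\mbox{Error terms:} \quad & y_i = \alpha + \beta x_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)\\
\mbox{Distribution centric:} \quad & y_i \sim \mathcal{N}(\mu_i, \sigma^2), \quad \mu_i = \alpha + \beta x_i\\
\mbox{Poisson GLM:} \quad & y_i \sim \mbox{Poisson}(\mu_i), \quad \log(\mu_i) = \alpha + \beta x_i
\end{aligned}\]
```

In the distribution centric form, moving to a GLM only requires swapping the distribution and transforming the mean through a link function; there is no additive error term to reinterpret.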
Photo: Arthur Charpentier
The R in Insurance conference in Amsterdam was a sold-out success! Congratulations to the organising committee at the University of Amsterdam, and many thanks to our sponsors:
Milliman, RStudio, CYBAEA, Deloitte, a.s.r., Triple A Risk Finance, AEGON, Delta Lloyd Amsterdam, QBE Re and APPLIED AI
This one-day conference focused once more on applications in insurance and actuarial science that use R.
Over the weekend we released version 0.2.1 of the ChainLadder package for claims reserving on CRAN.
New Features
New function PaidIncurredChain by Fabio Concina, based on the 2010 Merz & Wüthrich paper Paid-incurred chain claims reserving method.
Functions plot.MackChainLadder and plot.BootChainLadder gained a new argument which, allowing users to specify which sub-plot to display. Thanks to Christophe Dutang for this suggestion.
Output of plot(MackChainLadder(MW2014, est.
I have to admit that I find the plotmath expressions in R a little fiddly to annotate plots with mathematical notation.
Apparently I am not the only one, but Stefano Meschiari actually did something about it. A few days ago his package latex2exp appeared on CRAN.
The package provides the wonderful function latex2exp that translates LaTeX code into plotmath expressions. Brilliant! All I have to remember is to escape the “"
Last Friday the Cologne R user group came together for the 14th time. For the first time we met at Startplatz, a start-up incubator venue. The venue was excellent, not only did they provide us with a much larger room, but also with table-football and drinks. Many thanks to Kirill for organising all of this!
Photo: Günter Faes
We had two excellent advanced talks.
The next Cologne R user group meeting is scheduled for this Friday, 6 June 2015 and we have an exciting agenda with two talks followed by networking drinks.
Data Science at the Commandline (Kirill Pomogajko)
An Introduction to RStan and the Stan Modelling Language (Paul Viefers)
Please note: Our venue changed! We have outgrown the seminar room at the Institute of Sociology and are moving to Startplatz, a start-up incubator venue: Im Mediapark 5, 50670 Köln
Drinks and Networking
The event will be followed by drinks (Kölsch!
I like the Economist theme in the latticeExtra package. It produces nice looking charts that mimic the design of the weekly newspaper, such as in this example:
For some time I wondered how I could put the title of my lattice plots into the top left corner as well (by default titles are centred). Reviewing the code of the theEconomist.theme function by Felix Andrews reveals the trick. It is the setting of par.
The forthcoming R Journal has an interesting article on the showtext package by Yixuan Qiu. The package allows me to use system and web fonts directly in R plots, reminding me a little of the approach taken by XeLaTeX. But “unlike other methods to embed fonts into graphics, showtext converts text into raster images or polygons, and then adds them to the plot canvas. This method produces platform-independent image files that do not rely on the fonts that create them.
I had a great time at the R/Finance conference in Chicago last Friday/Saturday. Some brief takeaways for me were:
From Emanuel Derman’s talk: It is important to distinguish between theories and models. Theories live in an abstract world and for a given set of axioms they can be proven right. However, models live in the real world, are built on simplifying assumptions and are only useful until experiments/data prove them wrong.
I will be speaking at the Bay Area User Group meeting tonight about Communicating Risk. Anthony Goldbloom from Kaggle and Karim Chine from ElasticR will be there as well. The meeting will be at Microsoft in Mountain View.
Later this week I will give a similar presentation at the R in Finance conference in Chicago. Please get in touch if you are around and would like to share a coffee with me.
I continue my Stan experiments with another insurance example. Here I am particularly interested in the posterior predictive distribution from only three data points. Or, to put it differently, I have a customer of three years and I’d like to predict the expected claims cost for the next year to set or adjust the premium.
The example is taken from section 16.17 in Loss Models: From Data to Decisions [1].
In my previous post I discussed how Longley-Cook, an actuary at an insurance company in the 1950s, used Bayesian reasoning to estimate the probability for a mid-air collision of two planes.
Here I will use the same model to get started with Stan/RStan, a probabilistic programming language for Bayesian inference.
Last week my prior was given as a Beta distribution with parameters \(\alpha=1, \beta=1\) and the likelihood was assumed to be a Bernoulli distribution with parameter \(\theta\): \[\begin{aligned} \theta & \sim \mbox{Beta}(1, 1)\\ y_i & \sim \mbox{Bernoulli}(\theta), \;\forall i \in N \end{aligned}\]
For the previous five years no mid-air collisions were observed, \(y=\{0, 0, 0, 0, 0\}\).
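Because the Beta prior is conjugate to the Bernoulli likelihood, the posterior for this model is available in closed form, which gives a handy check for any sampler output. The sketch below uses base R only; the variable names are my own.

```r
# Beta(1, 1) prior and five years without a collision
alpha_prior <- 1
beta_prior <- 1
obs <- c(0, 0, 0, 0, 0)

# Conjugate update: add successes to alpha and failures to beta
alpha_post <- alpha_prior + sum(obs)
beta_post <- beta_prior + length(obs) - sum(obs)

# Posterior is Beta(1, 6), so the posterior mean of theta is 1/7
post_mean <- alpha_post / (alpha_post + beta_post)

# Upper bound of a one-sided 95% credible interval for theta
upper_95 <- qbeta(0.95, alpha_post, beta_post)
```

A posterior mean of 1/7 after five collision-free years shows how strongly the flat prior still influences the estimate when there are so few observations.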
Suppose you have to predict the probabilities of events which haven’t happened yet. How do you do this?
Here is an example from the 1950s when Longley-Cook, an actuary at an insurance company, was asked to price the risk for a mid-air collision of two planes, an event which as far as he knew hadn’t happened before. The civilian airline industry was still very young, but rapidly growing and all Longley-Cook knew was that there were no collisions in the previous 5 years [1].
The programme for the 3rd R in Insurance conference is on-line. The event will take place on 29 June 2015 at the University of Amsterdam. Time to register now.
Special thanks to our sponsors, without whom the conference wouldn’t be possible: CYBAEA, RStudio, APPLIED AI, Milliman, QBE Re, AEGON, Delta Lloyd Amsterdam, Deloitte.
You can find impressions from the previous events on www.rininsurance.com.
We hope to see you in Amsterdam!
Last week I mentioned the grid.arrange function of the gridExtra package that allows me to combine graphical grid objects onto one page. The latticeExtra package provides another elegant solution for trellis (lattice) plots: the function c.trellis() or just c() combines the panels of multiple trellis objects into one.
Here is a minimal example from the help file of c.trellis:
library(latticeExtra)
## Combine different types of plots.
c(wireframe(volcano), contourplot(volcano))
In my next example I am using data from Eurostat, the statistical office of the European Union, showing the use of public transport in four countries.
Occasionally I’d like to plot a table alongside a chart in R, e.g. to present summary statistics of the graph itself. Thanks to the gridExtra package this is quite straightforward. The function tableGrob creates a table-like plot of a data frame, while arrangeGrob allows me to arrange ggplot2, lattice and grid graphical objects (short ‘grobs’, such as tableGrob) on a page.
Here is a little example:
Session Info
R version 3.
I mused over Test Driven Analysis on this blog before, but it was Richard Pugh’s talk on SAS to R Migration at LondonR last week that brought the topic back into my mind and clarified a few things.
Rich’s presentation focused on the challenge of how to ensure that the new system (R) would provide the same answers as the legacy system (SAS).
This is when it clicked with me: My brain is just another system as well.
I love interactive pivot tables. That is the number one reason why I keep using spreadsheet software. The ability to look at data quickly in lots of different ways, without a single line of code helps me to get an understanding of the data really fast.
Perhaps I can do the same now in R as well. At yesterday’s LondonR meeting Enzo Martoglio briefly presented his rpivotTable package. Enzo builds on Nicolas Kruchten’s PivotTable.
ChainLadder is an R package that provides statistical methods and models for claims reserving in general insurance.
With version 0.2.0 we added new functions to estimate the claims development result (CDR) as required under Solvency II. Special thanks to Alessandro Carrato, Giuseppe Crupi and Mario Wüthrich who have contributed code and documentation.
New Features
New generic function CDR to estimate the one year claims development result.
Hurry! The abstract submission deadline for the 3rd R in Insurance conference in Amsterdam, 29 June 2015, is approaching.
You have until the 28th of March to submit a one-page abstract for consideration. Both academic and practitioner proposals related to R are encouraged. Please email your abstract of no more than 300 words (in text or pdf format) to
[email protected].
The intended audience of the conference includes both academics and practitioners who are active or interested in the applications of R in insurance.
At last Friday’s Cologne R user group meeting we welcomed two Northerners from the left and right (or ‘right’ and ‘wrong’) side of the Rhine.
Using R in Excel via R.NET
Günter Faes and Matthias Spix
Günter and Matthias presented examples of a new R Excel plugin ‘Calidris’ they developed using R.NET. The plugin itself is written in C# and adds an R ribbon to Excel with pre-built functions.
The next Cologne R user group meeting is scheduled for this Friday, 6 March 2015 and we have an exciting agenda with two talks, followed by networking drinks:
Using R in Excel via R.NET
Günter Faes and Matthias Spix
MS Office and Excel are the ‘de-facto’ standards in many industries. Using R with Excel offers an opportunity to combine the statistical power of R with a familiar user interface.
The other day I got stuck working with a huge data set using data.table in R. It took me a little while to realise that I had to produce a minimal reproducible example to actually understand why I got stuck in the first place. I know, this is the mantra I should follow before I reach out to R-help, Stack Overflow or indeed the package authors. Of course, more often than not, by following this advice, the problem becomes clear and with that the solution obvious.
I have experimented with reading an Arduino signal into R in the past, using Rserve and Processing. Actually, it is much easier. I can read the output of my Arduino directly into R with the scan function.
Here is my temperature sensor example again:
And all it needs to read the signal into the R console with my computer is:
> f <- file("/dev/cu.usbmodem3a21", open="r")
> scan(f, n=1)
Read 1 item
[1] 20.
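The same idea can be wrapped in a small helper. This is only a sketch: `read_sensor` is a hypothetical name, and a temporary file with made-up readings stands in for the serial device, so the code runs without an Arduino attached. With real hardware the path would be a device such as the one shown above.

```r
# Hypothetical helper: read n numeric values from a file-like path.
read_sensor <- function(path, n = 1) {
  con <- file(path, open = "r")
  on.exit(close(con))            # always release the connection
  scan(con, n = n, quiet = TRUE)
}

# Stand-in for the serial device: a temporary file with made-up readings.
fake_device <- tempfile()
writeLines(c("20.5", "20.7", "21.0"), fake_device)

first_reading <- read_sensor(fake_device, n = 1)
all_readings  <- read_sensor(fake_device, n = 3)
```

Closing the connection via `on.exit` matters with real devices, as an open connection can block the port for other programs.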
The registration for the third conference on R in Insurance on Monday 29 June 2015 at the University of Amsterdam has opened.
This one-day conference will focus again on applications in insurance and actuarial science that use R, the lingua franca for statistical computation.
The intended audience of the conference includes both academics and practitioners who are active or interested in the applications of R in insurance.
We released googleVis version 0.5.8 on CRAN last week. The update is a maintenance release for the forthcoming release of R 3.2.0.
Screen shot of some of the Google Charts
New to googleVis? The package provides an interface between R and the Google Charts Tools, allowing you to create interactive web charts from R without uploading your data to Google.
Last week’s post about the Kalman filter focused on the derivation of the algorithm. Today I will continue with the extended Kalman filter (EKF) that can also deal with nonlinearities. According to Wikipedia the EKF has been considered the de facto standard in the theory of nonlinear state estimation, navigation systems and GPS.
Kalman filter
I had the following dynamic linear model for the Kalman filter last week:
At the last Cologne R user meeting Holger Zien gave a great introduction to dynamic linear models (dlm). One special case of a dlm is the Kalman filter, which I will discuss in this post in more detail. I kind of used it earlier when I measured the temperature with my Arduino at home.
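To make the idea concrete, here is a minimal sketch of a scalar Kalman filter for a roughly constant temperature. It is not the model from the post: the function name `kalman_1d`, the random-walk state equation, the noise variances and the simulated readings are all assumptions chosen for illustration.

```r
# Scalar Kalman filter, assuming a random-walk state:
#   state:       x_t = x_{t-1} + w_t,  w_t ~ N(0, Q)
#   observation: z_t = x_t + v_t,      v_t ~ N(0, R)
kalman_1d <- function(z, x0, P0, Q, R) {
  n <- length(z)
  x <- numeric(n)   # filtered state estimates
  P <- numeric(n)   # filtered state variances
  x_prev <- x0
  P_prev <- P0
  for (t in seq_len(n)) {
    # Prediction step
    x_pred <- x_prev
    P_pred <- P_prev + Q
    # Update step
    K <- P_pred / (P_pred + R)             # Kalman gain
    x[t] <- x_pred + K * (z[t] - x_pred)
    P[t] <- (1 - K) * P_pred
    x_prev <- x[t]
    P_prev <- P[t]
  }
  list(x = x, P = P)
}

# Simulated (made-up) noisy temperature readings around 20 degrees
set.seed(1)
z <- 20 + rnorm(100, sd = 0.5)
fit <- kalman_1d(z, x0 = 18, P0 = 1, Q = 0, R = 0.25)
```

With `Q = 0` the filter reduces to a recursive weighted average of the prior guess and the observations, so the estimate settles near the true level as readings accumulate.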
Over the last week I came across the wonderful quantitative economic modelling site quant-econ.net, designed and written by Thomas J.
Last week’s Cologne R user group meeting was the best attended so far, and it was a remarkable event - I believe not a single line of R code was shown. Still, it was an R user group meeting with two excellent talks, and you will understand shortly why not much R code needed to be displayed. Introduction to Julia for R Users Download slides Hans Werner Borchers joined us from Mannheim to give an introduction to Julia for R users.
The next Cologne R user group meeting is scheduled for this Friday, 12 December 2014.
We have an exciting agenda with two talks on Julia and Dynamic Linear Models: Introduction to Julia for R Users Hans Werner Borchers
Julia is a high-performance dynamic programming language for scientific computing, with a syntax that is familiar to users of other technical computing environments (Matlab, Python, R, etc.). It provides a sophisticated compiler, high performance with numerical accuracy, and extensive mathematical function libraries.
It is getting colder in London, yet it is still quite mild considering that it is late November. Well, indoors it still feels like 20°C (68°F) to me, but I was told last week that I should switch on the heating.
Luckily I found an old thermometer to check. The thermometer showed 18°C. Is it really below 20°C?
The thermometer is quite old and I’m not sure that it works properly anymore.
Taking the first step is often the hardest: getting data from Excel into R.
Suppose you would like to use the ChainLadder package to forecast future claims payments for a run-off triangle that you have stored in Excel.
How do you get the triangle into R and execute a reserving function, such as MackChainLadder?
Well, there are many ways to do this and the ChainLadder package vignette, as well as the R manual on Data Import/Export, have all of the details, but here is a quick and dirty solution using a CSV file.
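A rough sketch of the CSV route could look like this. The triangle values, column names and file are made up for illustration, and the age-to-age factors are computed by hand in base R rather than with ChainLadder, so that the example runs without the package installed.

```r
# Made-up cumulative claims triangle, written to a temporary CSV file
# to stand in for a sheet saved from Excel as CSV.
csv_file <- tempfile(fileext = ".csv")
writeLines(c(
  "origin,dev1,dev2,dev3",
  "2010,100,150,175",
  "2011,110,168,",
  "2012,120,,"
), csv_file)

# Read the file; blank cells in numeric columns become NA
raw <- read.csv(csv_file)

# Turn the data frame into a matrix with origin years as row names
tri <- as.matrix(raw[, -1])
dimnames(tri) <- list(origin = as.character(raw$origin),
                      dev = as.character(seq_len(ncol(tri))))

# Volume-weighted age-to-age factors (basic chain-ladder), by hand
f <- sapply(seq_len(ncol(tri) - 1), function(k) {
  ok <- !is.na(tri[, k + 1])
  sum(tri[ok, k + 1]) / sum(tri[ok, k])
})

# With the ChainLadder package installed, a matrix like this could be
# passed on, e.g. ChainLadder::MackChainLadder(tri)
```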
Have I missed unknown pleasures in Python by focusing on R?
A comment on my blog post of last week suggested just that. Reason enough to explore Python a little. Learning another computer language is like learning another human language - it takes time. Often it is helpful to start by translating from the new language back into the old one.
I found a Python script by Ludwig Schwardt that creates a plot like this:
The forthcoming R Journal has an interesting article about phaseR: An R Package for Phase Plane Analysis of Autonomous ODE Systems by Michael J. Grayling. The package has some nice functions to analyse one- and two-dimensional dynamical systems. As an example I use here the FitzHugh-Nagumo system introduced earlier:
\[ \begin{aligned} \dot{v}=&2 (w + v - \frac{1}{3}v^3) + I_0 \\ \dot{w}=&\frac{1}{2}(1 - v - w) \end{aligned} \]
The FitzHugh-Nagumo system is a simplification of the Hodgkin-Huxley model of spike generation in squid giant axon.
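The system above can be written down as an R function and stepped forward in time. The sketch below uses a plain Euler scheme in base R, not phaseR's own functions; the helper names `fhn` and `euler`, the starting point, the step size and \(I_0 = 0\) are assumptions for illustration only.

```r
# Right-hand side of the FitzHugh-Nagumo system as stated above
fhn <- function(state, I0 = 0) {
  v <- state[[1]]
  w <- state[[2]]
  c(v = 2 * (w + v - v^3 / 3) + I0,
    w = 0.5 * (1 - v - w))
}

# Simple explicit Euler integrator (a crude swap-in for a proper ODE solver)
euler <- function(f, state, dt, n, ...) {
  out <- matrix(NA_real_, nrow = n + 1, ncol = length(state),
                dimnames = list(NULL, names(state)))
  out[1, ] <- state
  for (i in seq_len(n)) {
    state <- state + dt * f(state, ...)
    out[i + 1, ] <- state
  }
  out
}

# Trajectory from an arbitrary starting point
traj <- euler(fhn, c(v = 1, w = 0), dt = 0.01, n = 2000)
```

Plotting `traj[, "v"]` against `traj[, "w"]` gives a rough view of the path in the phase plane; for serious work a dedicated solver is the better choice.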
Version 0.5.6 of googleVis was released on CRAN over the weekend. This version fixes a bug in gvisMotionChart. Its arguments xvar, yvar, sizevar and colorvar were not always picked up correctly.
Thanks to Juuso Parkkinen for reporting this issue.
Example: Love, or to love A few years ago Martin Hilpert posted an interesting case study for motion charts. Martin is a linguist and he researched how the usage of words in American English changed over time, e.
Last week Arthur Charpentier sketched out a Markov spatial process to generate hurricane trajectories. Here, I would like to take another look at the data Arthur used, but focus on its time component.
According to the Insurance Information Institute, a normal season, based on averages from 1980 to 2010, has 12 named storms, six hurricanes and three major hurricanes. The usual peak months of August and September passed without any major catastrophes this year, but the Atlantic hurricane season is not over yet.
Deploying applications via Docker containers is currently the talk of the town. I have heard about Docker and played around with it a little, but when Dirk Eddelbuettel posted his R and Docker talk last Friday I got really excited and had to have a go myself.
My aim was to rent some resources in the cloud, pull an RStudio Server container and run RStudio in a browser. It was actually surprisingly simple to get started.
One of my takeaways from last week’s EARL conference was that R is more and more growing out of its academic roots into the enterprise. And with that come some challenges, e.g. how do I ensure consistent and systematic access to a set of R packages in an organisation, in particular when one team is providing packages to others?
Two packages can help here: roxyPackage and miniCRAN.
I wrote about roxyPackage earlier on this blog.
Last Friday we had guests from Belgium and the Netherlands joining us in Cologne. Maarten-Jan Kallen from BeDataDriven came from The Hague to introduce us to Renjin, and the guys from DataCamp in Leuven, namely Jonathan, Martijn and Dieter, gave an overview of their new online interactive training platform. Renjin Maarten-Jan gave a fascinating introduction to Renjin, an R interpreter in the Java virtual machine (JVM). Why?
The next Cologne R user group meeting is scheduled for this Friday, 12 September 2014.
We have a great agenda with international speakers: Maarten-Jan Kallen: Introduction to Renjin, the R interpreter for the JVM Jonathan Cornelissen, Martijn Theuwissen: DataCamp - An online interactive learning platform for R The event will be followed by drinks and schnitzel at the Lux. For further details visit our KölnRUG Meetup site.
The Google Charts API is quite powerful and via googleVis you can access it from R. Here is an example that demonstrates how you can zoom into your chart. In the example below I set the maximum zoom level to 5% of the chart. Drag and pan with a left mouse button to zoom in; use a right mouse click to zoom out again. The functionality is available in other core charts as well, such as line, column and bar charts. For more configuration options of the explorer settings visit the Google documentation.
Over the weekend we released version 0.1.8 of the ChainLadder package for claims reserving on CRAN.
What is claims reserving? The insurance industry, unlike other industries, does not sell products as such but promises. An insurance policy is a promise by the insurer to the policyholder to pay for future claims for an upfront received premium.
As a result insurers don’t know the upfront cost for their service, but rely on historical data analysis and judgement to predict a sustainable price for their offering.
Earlier this week we released googleVis 0.5.5 on CRAN. The package provides an interface between R and Google Charts, allowing you to create interactive web charts from R. This is mainly a maintenance release, updating documentation and minor issues.
Screen shot of some of the Google Charts
New to googleVis? Review the examples of all googleVis charts on CRAN.
How did I miss the GrapheR package?
The author, Maxime Hervé, published an article about the package [1] in the same issue of the R Journal as we did on googleVis. Yet, it took me a package update notification on CRANbeeries to look into GrapheR in more detail - 3 years later! And what a wonderful gem GrapheR is.
The package provides a graphical user interface for creating base charts in R.
In many cases Word is still the preferred file format for collaboration in the office. Yet, it is often a challenge to work with it, not so much because of the software, but how it is used and abused. Thanks to Markdown it is no longer painful to include mathematical notations and R output into Word. I have been using R Markdown for a while now and have grown very fond of it.
At the R in Insurance conference Arthur Charpentier gave a great keynote talk on Bayesian modelling in R. Bayes’ theorem on conditional probabilities is strikingly simple, yet incredibly thought provoking. Here is an example from Daniel Kahneman to test your intuition. But first I have to start with Bayes’ theorem.
Bayes’ theorem
Bayes’ theorem states that given two events \(D\) and \(H\), the probability of \(D\) and \(H\) happening at the same time is the same as the probability of \(D\) occurring, given \(H\), weighted by the probability that \(H\) occurs; or the other way round.
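The statement can be checked numerically. The probabilities below are illustrative numbers, not taken from the post; the point is only that both routes to the joint probability agree: \(P(D \cap H) = P(D \mid H)\,P(H) = P(H \mid D)\,P(D)\).

```r
# Illustrative (made-up) probabilities
p_h             <- 0.15   # P(H)
p_d_given_h     <- 0.80   # P(D | H)
p_d_given_not_h <- 0.20   # P(D | not H)

# Law of total probability gives P(D)
p_d <- p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Bayes' theorem gives P(H | D)
p_h_given_d <- p_d_given_h * p_h / p_d

# The joint probability computed both ways
joint_via_h <- p_d_given_h * p_h
joint_via_d <- p_h_given_d * p_d
```

Even with \(P(D \mid H) = 0.8\), the posterior `p_h_given_d` stays below one half here, because \(H\) is rare to begin with.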
The 2nd R in Insurance conference took place last Monday, 14 July, at Cass Business School London.
This one-day conference focused once more on applications in insurance and actuarial science that use R. Topics covered included reserving, pricing, loss modelling, the use of R in a production environment and more.
In the first plenary session, Montserrat Guillen (Riskcenter, University of Barcelona) and Leo Guelman (Royal Bank of Canada, RBC Insurance) spoke about the rise of uplift models.
Occasionally I have to connect to services from R that ask for login details, such as databases. I don’t like to store my login details in the R source code file; instead I would prefer to enter my login details when I execute the code. Fortunately, I found some old code in a post by Barry Rowlingson that does just that. It uses the tcltk package in R to create a little window in which the user can enter her details, without showing the password.
Recently we released googleVis 0.5.3 on CRAN. The package provides an interface between R and Google Charts, allowing you to create interactive web charts from R.
Screen shot of some of the Google Charts
Although this is mainly a maintenance release, I’d like to point out two changes: Default chart width is set to ‘automatic’ instead of 500 pixels.
The registration for the 2nd R in Insurance conference at Cass Business School London will close this Friday, 4 July.
The programme includes talks from international practitioners and leading academics, see below. For more details and registration visit: http://www.rininsurance.com. Still unsure? Review some impressions and presentations from last year’s conference.
On behalf of the committee and sponsors, Mango Solutions, Cybaea, RStudio and PwC, we look forward to seeing you in London on 14 July!
This post will present the wonderful pairs.panels function of the psych package [1] that I discovered recently to visualise multivariate random numbers.
Here is a little example with a Gaussian copula and normal and log-normal marginal distributions. I use pairs.panels to illustrate the steps along the way.
I start with standardised multivariate normal random numbers:
library(psych)
library(MASS)
Sig <- matrix(c(1, -0.7, -0.5, -0.7, 1, 0.6, -0.5, 0.6, 1), nrow=3)
X <- mvrnorm(1000, mu=rep(0,3), Sigma = Sig, empirical = TRUE)
pairs.
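The remaining steps of such a Gaussian copula construction might look roughly as follows. This is a sketch under assumptions: the marginal parameters (means, standard deviations, log-scale parameters) and the column names are invented for illustration, and only MASS and base R are used.

```r
library(MASS)

set.seed(1)
Sig <- matrix(c(1, -0.7, -0.5,
                -0.7, 1, 0.6,
                -0.5, 0.6, 1), nrow = 3)

# Step 1: standardised multivariate normal random numbers
X <- mvrnorm(1000, mu = rep(0, 3), Sigma = Sig, empirical = TRUE)

# Step 2: transform to uniform margins; the dependence structure
# that remains is the Gaussian copula
U <- pnorm(X)

# Step 3: apply the quantile functions of the desired marginals
# (parameters are assumptions for illustration)
Z <- cbind(normal     = qnorm(U[, 1], mean = 10, sd = 2),
           lognormal1 = qlnorm(U[, 2], meanlog = 0, sdlog = 0.5),
           lognormal2 = qlnorm(U[, 3], meanlog = 1, sdlog = 0.8))
```

Because each step is a monotone transformation of a single column, the rank correlations of `Z` match those of `X`. A scatter plot matrix such as `pairs(Z)` shows the result.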
The World Cup finally kicked off last Thursday and I have seen some fantastic games already. The Netherlands appear to be perhaps the strongest side so far, following their 5-1 victory over Spain.
To me the question is not only which country will win the World Cup, but also which prediction model will come closest to the actual results. Here I present three teams: FiveThirtyEight, a polling aggregation website; Groll & Schauberger, two academics from Munich; and finally Lloyd’s of London, the insurance market.
The example I present here is a little silly, yet it illustrates how to join tables with data.table in R.
Mapping old data to new data
Categories in general are never fixed; they always change at some point. And then the trouble starts with the data. For example, not that long ago we didn’t distinguish between smartphones and dumbphones, or video on demand and video rental shops.
The early bird registration offer for the 2nd R in Insurance conference, 14 July 2014, at Cass Business School closes tomorrow.
This one-day conference will focus once more on applications in insurance and actuarial science that use R, the lingua franca for statistical computation. Topics covered include reserving, pricing, loss modelling, the use of R in a production environment, and more. All topics are to be discussed within the context of using R as a primary tool for insurance risk management, analysis and modelling.
The 10th Kölner R user meeting took place last Friday at the Institute of Sociology and to celebrate the anniversary we invited Andrie de Vries to join us from Revolution Analytics. Andrie is well known in the R community; he is the co-author of the R for Dummies book and an active contributor on stackoverflow.
Taking R to the Enterprise Andrie de Vries
Andrie de Vries: Taking R to the Enterprise.
The next Cologne R user group meeting is scheduled for this Friday, 23 May 2014.
To celebrate our 10th meeting we welcome: Andrie de Vries (Revolution Analytics and Co-author of R for Dummies): Taking R to the Enterprise Markus Gesmann: googleVis overview and recent developments Followed by drinks and schnitzel at the Lux.
Further details available on our KölnRUG Meetup site. Please sign up if you would like to come along.
Saturday’s Eurovision Song Contest (ESC) from Copenhagen was hilarious as usual with acts from all over Europe and some more or less sensible gimmicks: a circular piano, a giant hamster wheel, a see-saw, or indeed a beard and fancy dress.
The results of the ESC were only a little different to what the bookmakers in the UK had predicted before the event started. Sweden was seen as the favourite, followed by Austria, Netherlands, Armenia and the UK.
At the end of March Google released a new version of the Chart Tools API with new options for point shapes and line brushes. The arguments are called pointShape and lineDashStyle and can be set directly via googleVis.
We published googleVis 0.5.2 on CRAN yesterday with added examples for those new options in gvisLineChart and gvisScatterChart. Note, these options can be used with most chart types as well, also in combination.
I am delighted to announce that the programme and abstracts for the second R in Insurance conference at Cass Business School in London, 14 July 2014, have been finalised.
Register by the end of May to get the early bird booking fee.
The organisers gratefully acknowledge the sponsorship of Mango Solutions, CYBAEA, RStudio and PwC without whom the event wouldn’t be possible.
R in Insurance Cass Business School, London, 14 July 2014 9:00 - 10:00 Opening keynote:
Last Thursday I had the pleasure to attend the Tokyo R user group meeting. And what a fun meeting it was! Over 40 R users had come together in central Tokyo. Yohei Sato, who organises the meetings, allowed me to talk a little about the recent developments of the googleVis package.
Thankfully all talks were given in English: Takashi J. Ozaki presented on Visualisation of Supervised Learning with arules and arulesViz.
googleVis 0.5.1 was released on CRAN yesterday.
New Features
New functions gvisSankey, gvisAnnotationChart, gvisHistogram, gvisCalendar and gvisTimeline to support the new Google charts of the same names (without ‘gvis’).
New demo Trendlines showing how trend-lines can be added to Scatter-, Bar-, Column-, and Line Charts.
New demo Roles showing how different column roles can be used in core charts to highlight data.
After my posts on timeline, Sankey and calendar charts, this will be the last to introduce new chart types of the developer version of googleVis. Today I will give examples for the new annotation charts and histograms.
Annotation charts
Annotation charts have been part of the Google Chart tools for a long time and googleVis as well. However, in the past only a Flash-based version was available (gvisAnnotatedTimeLine in googleVis).
My little series of posts about the new googleVis charts continues with calendar charts.
Google’s calendar charts are still in beta, but they already provide a nice heat map visualisation of calendar year data. The current development version of googleVis supports this new function via gvisCalendar. Here is an example displaying daily stock price data.
For the code below to run you will require the developer version (≥ 0.
Sankey diagrams are great for visualising flows from one set of data values to another. Although named after Irish Captain Matthew Henry Phineas Riall Sankey, who used this type of diagram in 1898 to show the energy efficiency of a steam engine, the best known Sankey diagram is probably Charles Minard’s Map of Napoleon’s Russian Campaign of 1812, which he actually produced in 1869.
Thomas Rahlf: Datendesign mit R The above example from Thomas Rahlf’s book Datendesign mit R shows that Minard’s plot can be reproduced with base graphics in R.
Don’t forget, this is the final week you can submit an abstract for the second R in Insurance conference.
For more details see http://www.rininsurance.com and perhaps for inspiration review last year’s programme.
Last year at the Google I/O conference Mitchell Foley presented new developments of the Google Chart Tools API and one of the new features he mentioned was timeline charts (about 6 min into the talk).
Timeline charts are a great way of visualising different dates/events over time and are now also supported by googleVis from version 0.5.0 onwards (currently only available from GitHub). Here is an example, showing classroom allocation in the afternoon.
After nearly 4 years of developing googleVis on Google Code with SVN we decided to move to GitHub. The main reason was that Google stopped the facility of hosting pre-CRAN builds of the package for user testing. The devtools package on the other hand makes it really easy to install packages from source hosted on GitHub. Additionally, we hope that GitHub will make collaboration with others more effective. Thus, bookmark http://github.
Last week’s Cologne R user group meeting was all about R and databases. We had three talks from a generic overview on how to connect R to databases, to a specific example with kdb+ and perhaps the future with ArangoDB, a NoSQL database.
Connecting R with databases Diego de Castillo’s talk focused on the use of relational databases, such as PostgreSQL, SQLite and Oracle. For all these databases dedicated R drivers exist on CRAN that can be used in a generic way via the DBI package.
The next Cologne R user group meeting is scheduled for tomorrow, 26 February 2014. We are delighted to welcome:
Diego de Castillo: R and databases
Kim Kuen Tang: Hands on using R and kdb+ together
Frank Celler: ArangoDB (Lightning Talk)
Further details and the agenda are available on our KölnRUG Meetup site.
Please sign up if you would like to come along.
Here is the poster for the 2nd R in Insurance conference on Monday 14 July 2014 at Cass Business School in London:
R in Insurance 2014 conference poster. Download PDF version
Important deadlines to keep in mind:
Abstract submissions: 28 March 2014
Early bird booking: 30 May 2014
R in Insurance Conference: 14 July 2014
For all further information see: www.
The other day I had data that showed the development of many products over time. I grouped the products into categories and visualised the data as line graphs in lattice. But instead of adding an extensive legend to the plot I wanted to add labels to each line’s latest point. How do you do that? It turns out that panel.groups is there to help again.
Here is my solution: R code Session Info R version 3.
The registration for the second conference on R in Insurance on Monday 14 July 2014 at Cass Business School in London has opened.
This one-day conference will focus again on applications in insurance and actuarial science that use R, the lingua franca for statistical computation. Topics covered may include actuarial statistics, capital modelling, pricing, reserving, reinsurance and extreme events, portfolio allocation, advanced risk tools, high-performance computing, econometrics and more.
Recently the Guardian’s Data Blog reported on the results from the third National Survey of Sexual Attitudes and Lifestyles in the UK. One of the questions asked in the survey was if the participants had sex in the last four weeks. The results - a summary is available in this info graphic - show that the British have their most sexually active period when they are in their 20s - 40s.
Rasmus’ post of last week on binomial testing made me think about p-values and testing again. In my head I was tossing coins, thinking about gender diversity and toast. The toast and tossing a buttered toast in particular was the most helpful thought experiment, as I didn’t have a fixed opinion on the probabilities for a toast to land on either side. I have yet to carry out some real experiments.
Since Christmas I have been playing around with a Raspberry Pi. It is certainly not the fastest computer, but what a great little toy! Here are a few experiences and online resources that I found helpful.
Setup
Initially I connected the Raspberry Pi via HDMI to a TV, together with keyboard, mouse and an old USB Wifi adapter. Everything worked out of the box and I could install Raspbian and set up the network.
I noticed that the monthly number of posts on R-bloggers stopped increasing over the last year. Indeed, the last couple of months saw a decline in posts compared to the previous year. Thus, has most been said and written about R already? Who knows? Well, I took a stab at looking into the future. However, I can tell you already that I am not convinced by my predictions. But maybe someone else will be inspired to take this work forward.
The Christmas and New Year’s break is over, yet there is still time to return unwanted presents. Return to Santa was the title of an article in the Economist that highlighted the impact on online retailers, as return rates can be alarmingly high.
The article quotes a study by Christian Schulze of the Frankfurt School of Finance and Management, which analysed the return habits of customers who bought at least five items over a five year period from a large European online retailer.
Last week’s Cologne R user group meeting was the best attended so far. Well, we had a great line up indeed. Matt Dowle came over from London to give an introduction to the data.table package. He was joined by his collaborator Arun Srinivasan, who is based in Cologne. Their talk was followed by Thomas Rahlf on Datendesign mit R (Data design with R).
data.table Download slides Matt’s goal with the data.
Quick reminder: The next Cologne R user group meeting is scheduled for this Friday, 13 December 2013. We are delighted to welcome: Matt Dowle and Arun Srinivasan: Introduction to data.table Thomas Rahlf: Book presentation - Datendesign mit R Further details and the agenda are available on our KölnRUG Meetup site.
Please sign up if you would like to come along. Notes from past meetings are available here.
Following the very positive feedback that Andreas and I have received from delegates of the first R in Insurance conference in July of this year, we are planning to repeat the event next year. We have already reserved a bigger auditorium.
The second conference on R in Insurance will be held on Monday 14 July 2014 at Cass Business School in London, UK.
This one-day conference will focus again on applications in insurance and actuarial science that use R, the lingua franca for statistical computation.
Following on from last week, where I presented a simple example of a Bayesian network with discrete probabilities to predict the number of claims for a motor insurance customer, I will look at continuous probability distributions today. Here I follow example 16.17 in Loss Models: From Data to Decisions [1].
Suppose there is a class of risks that incurs random losses following an exponential distribution (density \(f(x) = \Theta {e}^{- \Theta x}\)) with mean \(1/\Theta\).
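A quick numerical check of the stated density and mean is possible in base R. The value of the rate parameter below is an arbitrary assumption for illustration.

```r
theta <- 0.5   # assumed rate parameter, for illustration only

# Exponential density f(x) = theta * exp(-theta * x)
dens <- function(x) theta * exp(-theta * x)

# The density integrates to one ...
total_prob <- integrate(dens, lower = 0, upper = Inf)$value

# ... and the expected loss equals 1 / theta
mean_loss <- integrate(function(x) x * dens(x), lower = 0, upper = Inf)$value
```

Here `dens(x)` agrees with R's built-in `dexp(x, rate = theta)`.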
Here is a little Bayesian Network to predict the claims for two different types of drivers over the next year, see also example 16.15 in [1].
Let’s assume there are good and bad drivers. The probabilities that a good driver will have 0, 1 or 2 claims in any given year are set to 70%, 20% and 10%, while for bad drivers the probabilities are 50%, 30% and 20% respectively.
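These claim probabilities can be put into a small table and combined with a prior view on the mix of drivers. The prior share of good drivers (75%) is an assumption made here for illustration, as it is not given in the text above; the variable names are likewise hypothetical.

```r
# Claim count probabilities per driver type, as stated above
p_claims <- rbind(good = c(0.7, 0.2, 0.1),
                  bad  = c(0.5, 0.3, 0.2))
colnames(p_claims) <- c("0", "1", "2")

# Assumed prior mix of drivers (not from the text)
prior <- c(good = 0.75, bad = 0.25)

# Predictive distribution of the number of claims for a random driver
pred <- drop(prior %*% p_claims)

# Posterior probability of each driver type after observing one claim
lik  <- p_claims[, "1"]
post <- prior * lik / sum(prior * lik)
```

Observing a claim shifts weight towards the bad driver type, which is the updating step a Bayesian network automates.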
In my previous post, I presented a preview version of googleVis that provided an integration with RStudio’s Viewer pane (introduced with version 0.98.441).
Over 80% in my little survey favoured the new default output mechanism of googleVis within RStudio. Hence, I uploaded googleVis 0.4.7 on CRAN over the weekend.
However, there were also some thoughtful comments, which suggested that the RStudio Viewer pane is not always the best option. Indeed, Flash charts and gvisMerge output will still be displayed in your default browser. Also, if you work on larger charts or with a smaller screen, the browser might still be the better option compared to the Viewer pane. Of course, you can launch the browser from the Viewer pane as well.
The preview version 0.98.441 of RStudio introduced a new viewer pane to render local web content and with that it allows me to display googleVis charts within RStudio rather than in a separate browser window.
I think this is a rather nice feature and hence I have updated the plot method in googleVis to use the RStudio viewer pane as the default output. If you use another editor, or if the plot is using one of the Flash based charts, then the browser is still the default display.
For most purposes PDF or other vector graphic formats such as windows metafile and SVG work just fine. However, if I plot lots of points, say 100k, then those files can get quite large and bitmap formats like PNG can be the better option. I just have to be mindful of the resolution.
As an example I create the following plot:
x <- rnorm(100000)
plot(x, main="100,000 points", col=adjustcolor("black", alpha=0.2))
Saving the plot as a PDF creates a 5.
The Cologne R user group met last Friday for two talks on split apply combine in R and XLConnect by Bernd Weiß and Günter Faes respectively, before the usual Schnitzel and Kölsch at the Lux.
Split apply combine in R
The apply family of functions in R is incredibly powerful, yet for newcomers often somewhat mysterious. Thus, Bernd gave an overview of the different apply functions and their cousins.
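As a small illustration of the split-apply-combine pattern in base R (not taken from Bernd's talk; the data are made up):

```r
# Made-up claims data
claims <- data.frame(
  region = c("north", "south", "north", "south", "east"),
  amount = c(100, 250, 300, 50, 400)
)

# Split the amounts by region, apply sum to each piece, combine the results
by_region <- split(claims$amount, claims$region)
totals    <- sapply(by_region, sum)

# tapply does the three steps in one call
totals_tapply <- tapply(claims$amount, claims$region, sum)
```

`split` orders the groups alphabetically, so both results are ordered east, north, south.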
Quick reminder: The next Cologne R user group meeting is scheduled for this Friday, 18 October 2013. We will discuss and hear about the apply family of functions and the XLConnect package. Further details and the agenda are available on our KölnRUG Meetup site. Please sign up if you would like to come along. Notes from past meetings are available here. Thanks to Revolution Analytics, who sponsors the Cologne R user group as part of their vector programme.
There can never be too many examples for transforming data with R. So, here is another example of reshaping a data.frame into a matrix.
Here I have a data frame that shows incremental claim payments over time for different loss occurrence (origin) years.
The format of the data frame above is how this kind of data is usually stored in a database. However, I would like to see the payments of the different origin years in rows of a matrix.
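One possible way to do this in base R is `tapply`, sketched below with made-up data. The column names `origin`, `dev` and `value` are assumptions, not necessarily those of the data frame in the post.

```r
# Made-up incremental payments in long (database-like) format
claims <- data.frame(
  origin = rep(2010:2012, times = 3:1),
  dev    = c(1:3, 1:2, 1),
  value  = c(100, 50, 25, 110, 58, 120)
)

# Origin years in rows, development periods in columns;
# combinations without data become NA
inc_tri <- with(claims, tapply(value, list(origin = origin, dev = dev), sum))

# Cumulative payments along each origin year
cum_tri <- t(apply(inc_tri, 1, cumsum))
```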
Changing the plotting width in bar-, column- and combo-charts of googleVis works identically and is defined by the bar.groupWidth argument. The dot in the argument means that it has to be split in R into bar="{groupWidth:'10%'}".
Example
library(googleVis)
cc <- gvisColumnChart(head(Population,10), xvar="Country", yvar="Population",
  options=list(seriesType="bars", legend="top", bar="{groupWidth:'10%'}", width=500, height=450),
  chartid="thincolumns")
plot(cc)
Session Info
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale: [1] en_GB.
Last Tuesday I attended the LondonR user group meeting, where Rich and Andy from Mango argued about the better package for multivariate graphics with R: lattice vs. ggplot2.
As part of their talk they had a little competition in visualising London Underground performance data, see their slides. Both made heavy use of the respective panelling / faceting capabilities. Additionally Rich used the panel.groups argument of xyplot for fine control of the content of each panel.
The ave function in R is one of those little helper functions I feel I should be using more. Investigating its source code showed me another twist about R and the “[” function. But first let’s look at ave.
The top of ave’s help page reads:
Group Averages Over Level Combinations of Factors
Subsets of x[] are averaged, where each subset consist of those observations with the same factor levels.
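A short illustration with made-up numbers shows what this means in practice: `ave` returns a vector of the same length as `x`, with each element replaced by the summary of its group.

```r
# Made-up values and a grouping factor
payments <- c(1, 2, 3, 4, 10, 20)
group    <- factor(c("a", "a", "b", "b", "c", "c"))

# Default: group means, aligned with the original observations
group_means <- ave(payments, group)

# Any other function can be supplied via FUN, e.g. a running total per group
group_cumsum <- ave(payments, group, FUN = cumsum)
```

Because the result lines up with the input, it can be added directly as a new column to a data frame.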
The guys at Google continue to update and enhance the Chart Tools API. One new recent feature is a pie chart with a hole, or as some call them: donut charts.
Thankfully the new functionality is being achieved through new options for the existing pie chart, which means that those new features are available in R via googleVis as well, without the need of writing new code.
Doughnut chart example With the German election coming up soon, here is the composition of the current parliament.
Over the weekend googleVis 0.4.4 found its way to CRAN. The function gvisTable gained a new argument formats that allows users to define the format of numbers displayed in tables. Thanks to J. Buros, who contributed the code.
Example
Session Info
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] googleVis_0.
Version 0.1.6 of the ChainLadder package has been released and is already available from CRAN.
The new version adds the function CLFMdelta. CLFMdelta finds consistent weighting parameters delta for a vector of selected age-to-age chain-ladder factors for a given run-off triangle.
The added functionality was implemented by Dan Murphy, who is the co-author of the paper A Family of Chain-Ladder Factor Models for Selected Link Ratios by Bardis, Majidi, Murphy.
I posted about the various googleVis axis options for base charts, such as line, bar and area charts earlier, but I somehow forgot to mention how to set the axes limits.
Unfortunately, there are no arguments such as ylim and xlim. Instead, the Google Charts axes options are set via hAxes and vAxes, with h and v indicating the horizontal and vertical axis. More precisely, I have to set viewWindowMode : ‘explicit’ and set the viewWindow to the desired min and max values.
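As a rough sketch of how such an option string might be put together: the data below is made up, and the example uses the singular vAxis option from the Google Charts documentation for a chart with one vertical axis, which is an assumption rather than the exact code of the post. The chart itself is only built if googleVis is installed.

```r
# Hypothetical data for illustration
dat <- data.frame(x = c("A", "B", "C", "D", "E"),
                  y = c(12, 15, 11, 18, 16))

# Google Charts options are passed as JavaScript-style strings;
# here the vertical axis is fixed to the range 0 to 20
opts <- list(vAxis = "{viewWindowMode:'explicit', viewWindow:{min:0, max:20}}")

# Only build the chart if googleVis is available
if (requireNamespace("googleVis", quietly = TRUE)) {
  chart <- googleVis::gvisLineChart(dat, xvar = "x", yvar = "y", options = opts)
  # plot(chart) would open the chart in a browser
}
```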
The programme and the presentation files of the first R in Insurance conference have been published on GitHub.
Front slides of the conference presentations In addition to the slides, many presenters have made their R code available as well: Alexander McNeil shared the examples of the CreditRisk+ model he presented. Lola Miranda made a Windows version of the double chain-ladder package DCL available via the Cass knowledge web site.
Despite the hot weather and the beginning of the school holiday season in North Rhine Westphalia the Cologne R user group met yet again for two fascinating talks and beer and schnitzel afterwards.
Analysing Twitter data to evaluate the US Dollar / Euro exchange rates Dietmar Janetzko presented ideas to forecast US Dollar / Euro exchange rate movements for the following day. To forecast exchange rate movements, Dietmar distinguishes two schools of thought.
Yesterday the first R in Insurance conference took place at Cass Business School in London.
I think the event went really well, but as a member of the organising committee my view is probably skewed. Still, we had a variety of talks, a full house, a great conference dinner and to top it all, the Tower Bridge opened while we had our drinks at the end of the evening.
Today Diego and I will give our googleVis tutorial at useR!2013 in Albacete, Spain.
googleVis Tutorial at useR! 2013 We will cover: introduction and motivation, Google Chart Tools, the R package googleVis, concepts of googleVis, case studies, and googleVis on shiny.
The useR!2013 conference in Albacete, Spain, will commence next Wednesday, 10 July, and on the day before Diego and I will give a googleVis tutorial.
The following Monday, 15 July, the first R in Insurance event will take place at Cass Business School and I am absolutely delighted with the programme and the fact that we are sold out.
On Tuesday, 16 July, the LondonR user group meets in the City, awaiting presentations by Andrie de Vries (Revolution Analytics), Rich Pugh (Mango Solutions) and Hadley Wickham (RStudio).
The Google Charts Tools provide two kinds of heat map charts for geographical data, the Flash based Geomap and the HTML5/SVG based Geochart.
I prefer the Geochart as it doesn’t require Flash, but so far there have been two shortcomings with it: I couldn’t add additional tooltip information and the default Mercator projection shows Greenland the size of Africa. Both of those issues seem to have been resolved by Google.
Building R packages is not particularly hard, but it can be a bit of a daunting endeavour at the beginning, particularly if you are more of a statistician than a computer scientist or programmer.
Some concepts may appear foreign or like red tape, yet many of them evolved over time for a reason. They help you to stay organised, collaborate more effectively with others and write better code.
So, here are my slides of the R package development workshop at Lancaster University.
Following on from last week’s post, here are my slides on using googleVis on shiny from the Advanced R workshop at Lancaster University, 21 May 2013.
googleVis on shiny Again, I wrote my slides in RMarkdown and I used slidify to create the HTML5 presentation. Unfortunately you may have to reload the slides that use googleVis on shiny as the JavaScript code in the background is potentially not ideal.
Last week I was invited to give an introduction to googleVis at Lancaster University. This time I decided to use the R package slidify for my talk. Slidify, like knitr, is built on Markdown and makes it very easy to create beautiful HTML5 presentations.
Introduction to googleVis Separating content from layout is always a good idea. Markup languages such as TeX/LaTeX or HTML are built on this principle.
Over the last year I worked with two colleagues of mine on the subject of inflation and claims inflation in particular. I didn’t expect it to be such a challenging topic, but we ended up with more questions than answers. The key question and biggest challenge is to define what inflation, or indeed claims inflation actually is and how to measure it. We published a summary of our thoughts and findings in this month’s issue of The Actuary.
I am delighted to announce that the programme and abstracts for the first R in Insurance conference at Cass Business School in London, 15 July 2013, have been published.
The conference committee received strong abstracts from academia and the industry, covering pricing, reserving, data mining, capital modelling, automated reporting, catastrophe modelling, high-performance computing and software development management. Register by the end of May to get the early bird booking fee.
Often I like to reduce the alpha value (level of transparency) of colours to identify patterns of over-plotting when displaying lots of data points with R. So, here is a tiny function that allows me to add an alpha value to a given vector of colours, e.g. a RColorBrewer palette, using col2rgb and rgb, which has an argument for alpha, in combination with the wonderful apply and sapply functions.
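A minimal sketch of such a function, built from the pieces named above (col2rgb, rgb, sapply and apply); the function name add_alpha is an assumption for illustration:

```r
# Add an alpha value (0 = fully transparent, 1 = opaque) to a vector of colours
add_alpha <- function(col, alpha = 1) {
  if (missing(col)) stop("Please provide a vector of colours.")
  # col2rgb gives red, green and blue on a 0-255 scale; rescale to 0-1
  rgb_values <- sapply(col, col2rgb) / 255
  # One column per colour: rebuild each colour with the new alpha value
  apply(rgb_values, 2, function(x) rgb(x[1], x[2], x[3], alpha = alpha))
}

add_alpha(c("red", "blue"), alpha = 0.5)
```

The same call should work on a palette such as the output of RColorBrewer's brewer.pal, since col2rgb accepts hexadecimal colour strings as well as colour names.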
Our 5th Cologne R user group meeting was the best attended meeting so far, with 20 members finding their way to the Institute of Sociology for two talks by Diego de Castillo on shiny and Stephan Holtmeier on cluster analysis, followed by beer and schnitzel at the Lux, a gastropub nearby.
Shiny Diego gave an overview of the design principles behind shiny, which provides a powerful API to build web apps in pure R.
At the last LondonR meeting Francine Bennett from Mastodon C shared some of her experience and findings from an analysis of a large prescriptions data set of the UK’s national health service (NHS). However, it was her last slide, which I found the most thought provoking. It asked for the definition of the following term: Test-driven analysis? Francine explained that test driven development (TDD) is a concept often used in software development for quality assurance and she wondered if a similar approach could be also used for data analysis.
Setting axis options in googleVis charts can be a bit tricky. Here I present two examples where I set several options to customise the layout of a line and combo chart with two axes. The parameters have to be set in line with the Google Chart Tools API, which uses a JavaScript syntax. In googleVis chart options are set via a list in the options argument. Some of the list items can be a bit more complex, often wrapped in {} brackets, e.
Quick reminder: The next Cologne R user group meeting is scheduled for this Friday, 12 April 2013. We will discuss cluster analysis and shiny. Further details and the agenda are available on our KölnRUG Meetup site. Please sign up if you would like to come along. Notes from the last Cologne R user group meeting are available here.
Thanks also to Revolution Analytics, who sponsors the Cologne R user group as part of their vector programme.
Be motivated. R has a steep learning curve. Find a problem you can’t solve otherwise. E.g. plotting multivariate data, a statistical analysis for which an R function exists already. Download and install R. Get to know the R console. Learn how to install additional packages, how to access the history, how to use auto completion and open the help system. Review the R Installation and Administration manual and check out the free books section on CRAN.
Last week we released version 0.1.5-6 of the ChainLadder package on CRAN. The ChainLadder package provides statistical models, which are typically used for the estimation of outstanding claims reserves in general insurance. The package vignette gives an overview of the package functionality.
Output of plot(MackChainLadder(GenIns)) Since the last CRAN release Dan Murphy added new features to the MackChainLadder function and we fixed a bug in BootChainLadder.
The registration for the first R in Insurance is open and there is still time to submit a talk / lightning talk.
The conference will take place at Cass Business School in London on Monday, 15 July 2013. This is the Monday following the useR! 2013 conference in Spain. Thus, if you come from overseas to Spain, why not stop in London on your way back?
The new version of googleVis 0.4.2 is now available via CRAN. Many thanks to all who provided feedback on version 0.4.0 and particularly to Sebastian Campbell, John Maindonald and Aonan Zhang. As usual, if you find any issues or bugs, please send us an email or add a line to our online issues log.
With version 0.4.0 we introduced support for googleVis on shiny. See my previous post for more details and examples.
A friend of mine asked me the other day how she could use the function optim in R to fit data. Of course, there are built-in functions for fitting data in R and I wrote about this earlier. However, she wanted to understand how to do this from scratch using optim.
The function optim provides algorithms for general-purpose optimisations and the documentation is perfectly reasonable, but I remember that it took me a little while to get my head around how to pass data and parameters to optim.
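As a minimal sketch of the idea, here is a straight line fitted by minimising the residual sum of squares. The data and the function name rss are made up for illustration; the point is that optim varies the first argument of the objective function, while any other named argument (here data) is passed through unchanged.

```r
# Hypothetical data lying roughly on the line y = 2 + 3 * x
dat <- data.frame(x = 1:10,
                  y = c(5.1, 7.9, 11.2, 13.8, 17.1, 20.0, 22.9, 26.2, 28.8, 32.1))

# Objective function: residual sum of squares of a straight line.
# par[1] is the intercept, par[2] the slope
rss <- function(par, data) {
  with(data, sum((par[1] + par[2] * x - y)^2))
}

# par holds the starting values; data is handed on to rss via '...'
fit <- optim(par = c(0, 1), fn = rss, data = dat, method = "BFGS")
fit$par               # estimated intercept and slope
coef(lm(y ~ x, dat))  # the built-in fit, for comparison
```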
Documenting code can be a bit of a pain. Yet, the older (and wiser?) I get, the more I realise how important it is. When I was younger I said ‘documentation is for people without talent’. Well, I am clearly losing my talent, as I sometimes struggle to understand what I programmed years ago. Thus, anything that soothes the pain of writing and maintaining documentation must be good and should help me to better understand my ‘old me’ in the future.
The guys at RStudio have done a fantastic job with shiny. It is really easy to build web apps with R using shiny. With the help of Joe Cheng from RStudio we figured out a way to make googleVis work on shiny as well. This allows you to make use of the Google Charts Tools in your shiny app directly from R. What I present here are three initial examples which seem to work in most browsers.
The registration for the first conference on R in Insurance on Monday 15 July 2013 at Cass Business School in London has opened.
The intended audience of the conference includes both academics and practitioners who are active or interested in the applications of R in insurance.
The 2013 R in Insurance conference builds upon the success of the R in Finance and R/Rmetrics events. We expect invited keynote lectures by: Professor Alexander McNeil, Department of Actuarial Science & Statistics Heriot-Watt University: Implementing CreditRisk+ in R with the Faster Fourier Transform Trevor Maynard, Head of Exposure Management and Reinsurance, Lloyd’s: There is an R in Lloyd’s We invite you to submit a one-page abstract for consideration.
Lloyd’s of London is looking for a Data Scientist as part of the Analysis team. See Lloyd’s career web site for more details.
The fourth Cologne R user meeting took place last Wednesday at the Institute of Sociology. Thanks to Bernd Weiß for hosting the event and Revolution Analytics for their sponsorship.
We had two fantastic talks by Klaus Jacobi and M.eik Michalke. Klaus talked about Eliminating cloud pixels in satellite images via chronological interpolation and Meik presented his new roxyPackage package, which makes it even easier to maintain R packages with roxygen2.
Quick reminder: The next Cologne R user group meeting is scheduled for tomorrow, 6 February 2013. All details and the agenda are available on the KölnRUG Meetup site. Please sign up if you would like to come along. Notes from the last Cologne R user group meeting are available here. Thanks also to Revolution Analytics, who are sponsoring the Cologne R user group as part of their vector programme.
A friend of mine told me the secret of making money at the stock market. “It’s easy”, he said.
All I would have to do is to buy a big jar of ants. Then I should observe the ants’ movement on my kitchen table, while following the stock market.
I shall keep the ants which walk in line with the stock market and remove those who don’t. Eventually I would have one ant left that walked all the way in line with the stock market.
This is the third post about Christofides’ paper on Regression models based on log-incremental payments [1]. The first post covered the fundamentals of Christofides’ reserving model in sections A - F, the second focused on a more realistic example and model reduction of sections G - K. Today’s post will wrap up the paper with sections L - M and discuss data normalisation and claims inflation.
I will use the same triangle of incremental claims data as introduced in my previous post.
Following on from last week’s post I will continue to go through the paper Regression models based on log-incremental payments by Stavros Christofides [1]. In the previous post I introduced the model from the first 15 pages up to section F. Today I will progress with sections G to K which illustrate the model with a more realistic incremental claims payments triangle from a UK Motor Non-Comprehensive account:
# Page D5.
A recent post on the PirateGrunt blog on claims reserving inspired me to look into the paper Regression models based on log-incremental payments by Stavros Christofides [1], published as part of the Claims Reserving Manual (Version 2) of the Institute of Actuaries.
The paper is available together with a spreadsheet model, illustrating the calculations. It is very much based on ideas by Barnett and Zehnwirth, see [2] for a reference.
I really like gists as a quick way to include more lengthy code snippets into my blog posts. However, I am not a git user as such, and so I was quite concerned when I noticed that all my gists on this blog had vanished after Christmas. I suppose this was a result of Github’s downtime on December 22nd.
Thankfully an email to the support guys at Github resolved the issue within a few hours.
The first conference on R in Insurance will be held on Monday 15 July 2013 at Cass Business School in London, UK.
The intended audience of the conference includes both academics and practitioners who are active or interested in the applications of R in insurance.
This one-day conference will focus on applications in insurance and actuarial science that use R, the lingua franca for statistical computation. Topics covered may include actuarial statistics, capital modelling, pricing, reserving, reinsurance and extreme events, portfolio allocation, advanced risk tools, high-performance computing, econometrics and more.
Of course, a picture on a computer monitor is a coloured plot of x and y coordinates or pixels. Still, I was smitten by David Sparks’ posts on is.r(), where he shows how easy it is to read images into R to analyse them. In two posts [1], [2] he replicates functionality of image manipulation programmes like GIMP.
I can’t resist writing about this here as well. David’s first post is about k-means cluster analysis.
Last week I attended a seminar where a talk was given about the economic opportunities in the SAAAME (South-America, Asia, Africa and Middle East) regions. Of course a map was shown with those regions highlighted. The map was not that dissimilar to the one below.
library(RColorBrewer) library(rworldmap) data(countryExData) par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i") mapByRegion( countryExData, nameDataColumn="GDP_capita.MRYA", joinCode="ISO3", nameJoinColumn="ISO3V10", regionType="Stern", mapTitle=" ", addLegend=FALSE, FUN="mean", colourPalette=brewer.pal(6, "Blues")) It is a map that most of us in the Northern hemisphere see often.
Lattice plots are a great way of displaying multivariate data in R. Deepayan Sarkar, the author of lattice, has written a fantastic book about Multivariate Data Visualization with R [1]. However, I often have to refer back to the help pages to remind myself how to set and change the legend and how to ensure that the legend will use the same colours as my plot. Thus, I thought I would write down an example for future reference.
I really should make it a habit of using data.table. The speed and simplicity of this R package are astonishing.
Here is a simple example: I have a data frame showing incremental claims development by line of business and origin year. Now I would like to add a column with the cumulative claims position for each line of business and each origin year along the development years.
It’s one line with data.
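The excerpt breaks off before the data.table call, so the following is only a base R analogue of the same grouped calculation, using ave and cumsum on made-up data; it is not the one-liner from the post. In data.table the same grouping would typically be expressed with `:=` and `by`.

```r
# Hypothetical incremental claims by line of business, origin and development year
claims <- data.frame(
  lob         = rep(c("Motor", "Property"), each = 4),
  origin      = rep(c(2010, 2010, 2011, 2011), times = 2),
  dev         = rep(1:2, times = 4),
  incremental = c(100, 50, 120, 60, 200, 80, 210, 90)
)

# Sort by development year within each group, then accumulate
# the incremental claims per line of business and origin year
claims <- claims[order(claims$lob, claims$origin, claims$dev), ]
claims$cumulative <- ave(claims$incremental, claims$lob, claims$origin,
                         FUN = cumsum)
claims
```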
Last week we released version 0.1.5-4 of the ChainLadder package on CRAN. The R package provides methods which are typically used in insurance claims reserving. If you are new to R or insurance check out my recent talk on Using R in Insurance.
The chain-ladder method, which is a popular method in the insurance industry to forecast future claims payments, gave the package its name. However, the ChainLadder package has many other reserving methods and models implemented as well, such as the bootstrap model demonstrated below.
I discussed earlier how the action potential of a neuron can be modelled via the Hodgkin-Huxley equations. Here I will present a simple model that describes how action potentials can be generated and propagated across neurons. The tricky bit here is that I use delay differential equations (DDE) to take into account the propagation time of the signal across the network.
My model is based on the paper: Epileptiform activity in a neocortical network: a mathematical model by F.
I am very grateful to all who provided feedback over the last two weeks and tested the previous versions 0.3.1 and 0.3.2, which were not released on CRAN.
So, what changed since version 0.3.2? Not much, but plot.gvis didn’t open a browser window when options(gvis.plot.tag) were not set to NULL, but the user explicitly called plot.gvis with tag NULL. Thanks to Sebastian Kranz for reporting this bug. Additionally the vignette has been updated and includes an extended section on knitr.
After last week’s kerfuffle I hope the roll out of googleVis version 0.3.2 will be smooth. To test the water I release this version into the wild here and if it doesn’t get shot down in the next days, then I shall try to upload it to CRAN. I am mindful of the CRAN policy, so please get in touch or add comments below if you find any show stoppers.
Version 0.3.0 of the googleVis package for R has been released on CRAN on 20 October 2012. With this version we have been able to speed up the code considerably. The transformation of R data frames into JSON works significantly faster. The execution of the gvisMotionChart function in the World Bank demo is over 35 times faster. Thanks to ideas by Wei Luo and in particular to Sebastian Kranz for providing the code.
If connecting data to the real world is the next sexy job, then how do I do this? And how do I connect the real world to R?
It can be done, as Matt Shotwell showed with his home-made ECG and a patched version of R at useR! 2011. However, there are other options as well and here I will use an Arduino. The Arduino is an open-source electronics prototyping platform.
The next Cologne R user group meeting is scheduled for 5 October 2012. All details and the agenda are available on the KölnRUG Meetup site. Please sign up if you would like to come along. Notes from the last Cologne R user group meeting are available here. Thanks also to Revolution Analytics, who are sponsoring the Cologne R user group as part of their vector programme.
Every year the UK’s general insurance actuarial community organises a big conference, which they call GIRO, short for General Insurance Research Organising committee.
This year’s conference is in Brussels from 18 - 21 September 2012. Despite the fact that Brussels is actually in Belgium the UK actuaries will travel all the way to enjoy good beer and great talks.
On Wednesday morning I will run a session on Using R in insurance.
Today I feel very lucky, as I have been invited to the Royal Statistical Society conference to give a tutorial on interactive web graphs with R and googleVis. I prepared my slides with RStudio, knitr, pandoc and slidy, similar to my Cambridge R talk. You can access the RSS slides online here and you find the original R-Markdown file on github. You will notice some HTML code in the file, which I had to use to overcome my knowledge gaps of Markdown or its limitations.
Michael Bach, who is a professor and vision scientist at the University of Freiburg, maintains a fascinating site about visual illusions. One visual illusion really surprised me: the sigma motion. The sigma motion displays a flickering figure of black and white columns. Actually it is just a chart, as displayed below, with the columns changing backwards and forwards from black to white at a rate of about 30 transitions per second.
The next version of the googleVis package has been released on the project site and CRAN. This version provides updates to the package vignette and a new example for the gvisMerge function. The new sections of the vignette have been featured on this blog in more detail earlier: Using googleVis with knitr (Link to post) Using Rook with googleVis (Link to post) Using Reduce with gvisMerge to display several charts on a page (Link to post) Additionally two little bugs were fixed: Data frames with one row only were not displayed in a chart.
The 100m men’s sprint final of the 2012 London Olympics is over and Usain Bolt won the gold medal again with a winning time of 9.63s. Time to compare the result with my forecast of 9.68s, posted on 22 July.
My simple log-linear model predicted a winning time of 9.68s with a prediction interval from 9.39s to 9.97s. Well, that is of course a big interval of more than half a second, or ±3%.
What is Rook? Rook is a web server interface for R, written by Jeffrey Horner, the author of rApache and brew. But unlike other web frameworks for R, such as brew, R.rsp (which I have used in the past), Rserve, gWidgetsWWW or sumo (which I haven’t used yet), Rook appears incredibly lightweight. Rook doesn’t need any configuration. It is an R package, which works out of the box with the R HTTP server (R ≥ 2.
It is less than a week before the 2012 Olympic games will start in London. No surprise therefore that the papers are all over it, including a lot of data and statistics around the games.
The Economist investigated the potential financial impact on sponsors (some benefits), tax payers (no benefits) and the athletes (if they are lucky) in its recent issue.
The Guardian has a whole series around the Olympics, including the data of all Summer Olympic Medallists since 1896.
The other day I saw a fantastic exhibition of work by Bridget Riley. Karsten Schubert, who is Riley’s main agent, has some of her most famous and influential artwork from 1960 - 1966 on display, including the seminal Moving Squares from 1961. Photo of Moving Squares by Bridget Riley, 1961 Emulsion on board, 123.2 x 121.3cm In the 1960s Bridget Riley created some great black and white artwork, which at a first glance may look simple and deterministic or sometimes random, but has fascinated me since I saw some of her work for the first time about 9 years ago at the Tate Modern.
The second Cologne R user meeting took place last Friday, 6 July 2012, at the Institute of Sociology. Thanks to Bernd Weiß, who provided the meeting room, we didn’t have to worry about the infrastructure, like we did at our first gathering. Again, we had an interesting mix of people turning up, with a very diverse background from chemistry to geo-science, energy, finance, sociology, pharma, physics, psychology, mathematics, statistics, computer science, telco, etc.
At the R in Finance conference Paul Teetor gave a fantastic talk about Fast(er) R Code. Paul mentioned the common higher-order function Reduce, which I hadn’t used before. Reduce allows me to apply a function successively over a vector. What does that mean? Well, if I would like to add up the figures 1 to 5, I could say:
add <- function(x,y) x+y add(add(add(add(1,2),3),4),5) or
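The excerpt stops here; a minimal sketch of how the same sum might be written with Reduce, using the add function defined above:

```r
add <- function(x, y) x + y

# Reduce folds the function over the vector from left to right,
# i.e. it evaluates add(add(add(add(1, 2), 3), 4), 5)
Reduce(add, 1:5)

# accumulate = TRUE keeps the intermediate results as well
Reduce(add, 1:5, accumulate = TRUE)
```

For plain addition sum(1:5) would of course do; Reduce becomes useful when the function combines more complex objects, such as merging a list of data frames one after another.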
This post is a quick reminder that the next Cologne R user group meeting is only one week away. We will meet on 6 July 2012. The meeting will kick off at 18:00 with three short talks at the Institute of Sociology and will continue, even more informally, from 20:00 in a pub (LUX) nearby. All details are available on the KölnRUG Meetup site. Please sign up if you would like to come along.
One of the great research papers of the 20th century celebrates its 60th anniversary in a few weeks’ time: A quantitative description of membrane current and its application to conduction and excitation in nerve by Alan Hodgkin and Andrew Huxley. The anniversary falls only shortly after Andrew Huxley died on 30 May 2012, aged 94.
In 1952 Hodgkin and Huxley published a series of papers, describing the basic processes underlying the nervous mechanisms of control and the communication between nerve cells, for which they received the Nobel prize in physiology and medicine, together with John Eccles in 1963.
This evening I will talk about Dynamical systems in R with simecol at the LondonR meeting.
Thanks to the work by Thomas Petzoldt, Karsten Rinke, Karline Soetaert and R. Woodrow Setzer it is really straightforward to model and analyse dynamical systems in R with their deSolve and simecol packages.
I will give a brief overview of the functionality using a predator-prey model as an example.
This is of course a repeat of my presentation given at the Köln R user group meeting in March.
Transforming data sets with R is usually the starting point of my data analysis work. Here is a scenario which comes up from time to time: transform subsets of a data frame, based on context given in one or a combination of columns.
As an example I use a data set which shows sales figures by product for a number of years:
df <- data.frame(Product=gl(3,10,labels=c("A","B", "C")), Year=factor(rep(2002:2011,3)), Sales=1:30) head(df) ## Product Year Sales ## 1 A 2002 1 ## 2 A 2003 2 ## 3 A 2004 3 ## 4 A 2005 4 ## 5 A 2006 5 ## 6 A 2007 6 I am interested in absolute and relative sales developments by product over time.
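One possible base R sketch of such a transformation: split the data by product, compute the changes within each piece and bind the pieces together again. The column names AbsChange and RelChange are assumptions for illustration, and the changes are measured against the first year of each product, which may differ from the definition used in the full post.

```r
df <- data.frame(Product = gl(3, 10, labels = c("A", "B", "C")),
                 Year = factor(rep(2002:2011, 3)),
                 Sales = 1:30)

# Split by product, add the new columns per subset, then recombine.
# This relies on the rows being sorted by year within each product
by_product <- lapply(split(df, df$Product), function(d) {
  transform(d,
            AbsChange = Sales - Sales[1],   # change since the first year
            RelChange = Sales / Sales[1])   # sales relative to the first year
})
out <- do.call(rbind, by_product)
head(out)
```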
A new version of googleVis has been released on CRAN and the project site. Version 0.2.16 adds the functionality to plot quarterly and monthly data as a motion chart.
To illustrate the new feature I looked for a quarterly data set and stumbled across the quarterly UK house price data published by Nationwide, a building society. The data is available in spreadsheet format and presents average house prices, together with an index set to 100 in Q1 1993, by region in the UK from Q4 1973 to Q1 2012.
Tonight I will give a talk at the Cambridge R user group about googleVis. Following my good experience with knitr and RStudio to create interactive reports, I thought that I should try to create the slides in the same way as well.
Christopher Gandrud’s recent post reminded me of deck.js, a JavaScript library for interactive html slides, which I have used in the past, but as Christopher experienced, it is currently not that straightforward to use with R and knitr.
John D. Cook gave a great talk about ‘Why and how people use R’. The talk resonated with me and highlighted why R is such a great tool for end user computing. A topic which has become increasingly important in the European insurance industry. John’s main point on why people use R is that R gets the job done and I think he is spot on.
Last Saturday I met the guys from RStudio at the R in Finance conference in Chicago. I was curious to find out what RStudio could offer. In the past I have used mostly Emacs + ESS for editing R files. Well, and what a surprise it was. JJ, Joe and Josh showed me a preview of version 0.96 of their software, which adds a close integration of Sweave and knitr to RStudio, helping to create dynamic web reports with the new R Markdown and R HTML formats more easily.
Waterfall charts are sometimes quite helpful to illustrate the various moving parts in financial data, particularly when I have positive and negative values like a profit and loss statement (P&L). However, they can be a bit of a pain to produce in Excel. Not so in R, thanks to the waterfall package by James Howard. In combination with the latticeExtra package it is nearly a one-liner to produce a good looking waterfall chart that mimics the look of The Economist:
The next Cologne R user group meeting is scheduled for 6 July 2012. All details are available on the new KölnRUG Meetup site. Please sign up if you would like to come along, and notice that there is also a pub poll for the after “work” drinks. Notes from the first Cologne R user group meeting are available here.
Office building in Brussels. Photo Markus Gesmann
It is not unusual that you will not have admin rights in an IT controlled office environment. But then again the limitations set by the IT department can spark off some creativity. And I have to admit that I enjoy this kind of troubleshooting.
The other day I ended up in front of a Windows PC with R installed, but a locked down “C:Files” folder.
How do you apply one particular row of your data to all other rows? Today I came across a data set which showed the revenue split by product and location. The data was formatted to show only the split by product for each location and the overall split by location, similar to the example in the table below.

Revenue by product and continent

         Africa  America  Asia  Australia  Europe
A        40%     30%      50%   40%        40%
B        20%     40%      20%   30%        40%
C        40%     30%      30%   30%        20%
Total    10%     40%      20%   10%        20%

I wanted to understand the revenue split by product and location.
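A minimal sketch of one way to do this in base R, using the figures from the table above: the product shares within each continent are multiplied by the continent's share of total revenue. The object names are assumptions, and sweep is just one of several functions that could be used here.

```r
continents <- c("Africa", "America", "Asia", "Australia", "Europe")

# Product split within each continent (each column sums to 1)
by_product <- matrix(c(0.4, 0.3, 0.5, 0.4, 0.4,
                       0.2, 0.4, 0.2, 0.3, 0.4,
                       0.4, 0.3, 0.3, 0.3, 0.2),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("A", "B", "C"), continents))

# Overall split by continent (the "Total" row of the table)
total <- c(0.1, 0.4, 0.2, 0.1, 0.2)

# Multiply every product row by the total row:
# sweep() applies 'total' along the columns (margin 2)
overall <- sweep(by_product, 2, total, "*")
overall
sum(overall)  # the shares of all product/continent combinations add up to 1
```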
The first Kölner R user meeting was great fun. About 20 useRs had turned up to exchange their ideas, questions and experience with R. Three talks about R & Excel, ggplot2 & XeLaTeX and Dynamical systems with R & simecol had kicked off the evening, with Kölsch (beer) loosening our tongues further. Thankfully a lot of people had brought along their laptops, as unfortunately we lacked a cable to connect any of the computers to the installed projector.
Venue: Sion em Keldenich, Weyertal 47, 50937 Cologne, Germany, 6 p.m., 30 March 2012
For more details and registration see the Kölner R User Group page.
How can I embed a small data set into my R code? That was the question I came across today, when I prepared my talk about Dynamical Systems in R with simecol for the forthcoming Cologne R user group meeting. I wanted to add all the R code of the talk to the last slide. That’s easy, but the presentation makes use of a small data set of 3 columns and 21 rows.
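Two common options, sketched here with a made-up three-column data set (the column names and values are assumptions, not the data from the talk): read.table can parse a string directly via its text argument, and dput prints R code that recreates an existing object.

```r
# Hypothetical data set embedded directly in the script as text
dat <- read.table(header = TRUE, text = "
time prey predator
0    10   5
1    12   6
2    15   8
")
dat

# dput() writes out R code that rebuilds the object,
# which can then be pasted into a script or slide
dput(dat)
```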
The other day I found some old BASIC code I had written about 15 years ago on a Mac Classic II to plot the Feigenbaum diagram for the logistic map. I remember, it took the little computer the whole night to produce the bifurcation chart. With today’s computers even a for-loop in a scripting language like R takes only a few seconds.
logistic.map <- function(r, x, N, M){ ## r: bifurcation parameter ## x: initial value ## N: number of iteration ## M: number of iteration points to be returned z <- 1:N z[1] <- x for(i in c(1:(N-1))){ z[i+1] <- r *z[i] * (1 - z[i]) } ## Return the last M iterations z[c((N-M):N)] } ## Set scanning range for bifurcation parameter r my.
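The listing is cut off at this point. As a rough sketch of how the function might be applied over a range of r to produce the bifurcation diagram: the function is restated so the example runs on its own, and the scanning range, start value and iteration counts below are assumptions, not the values from the post.

```r
logistic.map <- function(r, x, N, M) {
  z <- numeric(N)
  z[1] <- x
  for (i in 1:(N - 1)) {
    z[i + 1] <- r * z[i] * (1 - z[i])
  }
  z[(N - M):N]  # indexing as in the post; this returns M + 1 values
}

# Assumed scanning range for the bifurcation parameter r
r.values <- seq(2.5, 4, by = 0.01)

# One column of late iterates per value of r
orbit <- sapply(r.values, logistic.map, x = 0.1, N = 500, M = 100)

# Repeat each r so that it lines up with the rows of 'orbit'
r.rep <- rep(r.values, each = nrow(orbit))
if (interactive()) {
  plot(r.rep, as.vector(orbit), pch = ".", xlab = "r", ylab = "x")
}
```

For small r the late iterates collapse onto a single fixed point, 1 - 1/r, which is why the left-hand side of the diagram is a single curve before it starts to split.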
The data of the World Bank is absolutely amazing. I had said this before, but their updated iPhone App gives me a reason to return to this topic. Version 3 of the DataFinder App allows you to visualise the data on your phone, including motion maps, see the screen shot below. Screen shot of DataFinder 3.0 I was intrigued by the changes in life expectancy over time around the world.
The guys behind the Google Visualisation API don’t seem to rest. On 22 February 2012 they released an update of their API. Google added options for a gradient colour axis to bubble chart and a magnifying glass to geo chart, which opens when the user hovers over cluttered markers (excluding IE<=8). Those updates have been incorporated into version 0.2.15 of the googleVis package for R. Examples of new features Here are two examples demonstrating the new features.
I would like to organise the first R user group meeting in Cologne, Germany, on 30 March 2012. In the past few years I have participated in the London R user group meetings and I hope to find like-minded people in Cologne too, who would like to catch up over a Kölsch on R and life in general.
David Chan from City University is organising an interdisciplinary symposium on tackling the ‘Big Data’ challenge on 1 March 2012. It is an open seminar trying to bring together academics and practitioners from across industry to tackle the challenges posed by “big data” - the growing amount of information that needs to be stored, searched, analysed and visualised in the digital age. The event will take place in the Oliver Thompson Lecture Theatre, Northampton Square, London EC1V 0HB.
During my university time I worked on the IT help desk for a while. One day I received a call from a professor, who said that his printer had stopped working. So I asked him if there was a message on the display and if he could read it to me. “Oh yes”, he said, “it says: ‘Load A4 paper.’” Rachel King quotes a study by Cisco on ZDnet, which claims to have found that college students and young employees under the age of 30 would rather take a lower salary than have no social media freedom, device flexibility and work mobility.
The other day I wrote about the R functions by, apply and friends, which allow me to operate on subsets of data. All those functions work nicely, if the data is given in the right format. More often than not it isn’t and I have to reshape the data beforehand. Thus, time to discuss the reshape function. I will focus on the reshape function in base R, and not the package of the same name.
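As a small illustration of base R's reshape(), here is a sketch with a made-up data frame (the column names and values are assumptions for the example). Column names of the form `x.2010` let reshape() split off the variable name and the time value at the dot.

```r
# Hypothetical wide data: one row per id, one column per year
wide <- data.frame(
  id     = 1:3,
  x.2010 = c(5, 6, 7),
  x.2011 = c(8, 9, 10)
)

# Wide to long: one row per id and year
long <- reshape(wide,
                direction = "long",
                varying   = c("x.2010", "x.2011"),
                idvar     = "id",
                timevar   = "year")
long

# Long back to wide
wide_again <- reshape(long,
                      direction = "wide",
                      idvar     = "id",
                      timevar   = "year",
                      v.names   = "x")
wide_again
```

The arguments `idvar`, `timevar` and `v.names` name the identifier, the time column and the measured variable, which is usually where most of the confusion with reshape() comes from.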
Version 0.2.14 of the googleVis package was released on CRAN today. Changes The help files have been checked against changes of the Google Visualisation API, typos in the vignette have been ironed out (thanks to Pat Burns for pointing them out), a new section on dealing with apostrophes in column names has been added and the example in the section “Setting options” has been reviewed. For more details and demos check out the project site.
I am amazed by the number of comments I received on my recent blog entry about “by”, “apply” and friends. I had started my post by pointing out that R is a language. Well indeed, I have come to the conclusion, that it is a language with lots of irregular expressions and dialects. It feels a bit like German or French where you have to learn and memorise the different articles.
R is a language, as Luis Apiolaza pointed out in his recent post. This is absolutely true, and learning a programming language is not much different from learning a foreign language. It takes time and a lot of practice to be proficient in it. I started using R when I moved to the UK and I wonder, if I have a better understanding of English or R by now.
The financial crisis has put a lot of pressure on countries’ long-term foreign currency credit ratings, with France recently being downgraded by S&P. Wikipedia provides a list of countries by credit rating as reported by the US rating agencies S&P, Fitch and Moody’s, and by Dagong, a Chinese rating agency.
So, what does the world look like today through the eyes of those rating agencies?
I use the R packages XML and googleVis to read and display the data from Wikipedia with just a few lines.
Why the old and the new need to share time together
It takes time to appreciate the new. Even if the new is much better than the old. It is easy to forget when you yourself created the exciting new.
At the end of August 2011 Google announced a new Blogger interface. The new interface offered about the same functionality, but had a different look and feel. At first I was reluctant to use it.
Many thanks to all who participated in the survey about writing R package vignettes. Following my post last Thursday the responses came in quickly in the evening and all day on Friday. Since Saturday the response rate has been decreasing constantly and I think it is time for a summary based on the 56 responses received. Summary - How to write a good vignette Length: Trust yourself, but aim for about 20 pages.
I am currently co-writing the vignette for the ChainLadder package and wonder what I should be focusing on. I have co-written the vignette of the googleVis package in the past and based it purely on what I thought would work. So, this is an experiment to find out if user feedback will help me to write a better vignette. Let’s see how it develops. I will make the data available once I have at least 10 submissions.
Over the years I convinced my colleagues and IT guys that LaTeX/XeLaTeX is the way forward to produce lots of customer reports with individual data, charts, analysis and text. Success! But of course the operating system in the office is still MS Windows. With my background in Solaris/Linux/Mac OSX I am still a little bit lost in the Windows world, when I have to do such simple tasks as finding and replacing a string in lots of files.
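Since R is installed anyway, one operating-system-independent option is to do the find and replace from R itself. The following is only a sketch: the function name `replace_in_files`, the file pattern and the demo file are hypothetical, and the original post may well describe a different tool.

```r
# Replace a fixed string in all files of a directory that match file_pattern
replace_in_files <- function(dir, pattern, replacement,
                             file_pattern = "\\.tex$") {
  files <- list.files(dir, pattern = file_pattern,
                      full.names = TRUE, recursive = TRUE)
  for (f in files) {
    txt <- readLines(f, warn = FALSE)
    new_txt <- gsub(pattern, replacement, txt, fixed = TRUE)
    # Only touch files that actually change
    if (!identical(txt, new_txt)) writeLines(new_txt, f)
  }
  invisible(files)
}

# Demo on a temporary directory with a made-up report file
demo_dir <- file.path(tempdir(), "reports")
dir.create(demo_dir, showWarnings = FALSE)
demo_file <- file.path(demo_dir, "a.tex")
writeLines(c("Dear CUSTOMER,", "please find your report attached."), demo_file)

replace_in_files(demo_dir, "CUSTOMER", "Ms Smith")
readLines(demo_file)
```

Using `fixed = TRUE` treats the search string literally, which avoids surprises with characters such as backslashes that are common in LaTeX files.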
On 7th December Google published a new version of their Visualisation API. The new version adds a new chart type: Stepped Area Chart and provides improvements to Geo Chart. Now Geo Chart has similar functionality to Geo Map, but while Geo Map requires Flash, Geo Chart doesn’t, as it renders SVG/VML graphics. So it also works on your iOS devices. These new features have been added to the googleVis R package in version 0.
We need more data journalism. How else will we find the nuggets of data and information worth reading?
Life should become easier for data journalists, as the Guardian, one of the data journalism pioneers, points out in this article about the new open data initiative of the European Union (EU). The aims of the EU’s open data strategy are bold. Data is seen as the new gold of the digital age.
The London R user group met again last Wednesday at the Shooting Star pub. And it was busy. More than 80 people had turned up. Was it the free beer and food, sponsored by Mango, which attracted the folks or the speakers? Or the venue? James Long, who organises the Chicago R user group meetings and who gave the first talk that night, noted that to his knowledge only the London and Chicago R users would meet in a pub.
Fitting distributions with R is something I have to do once in a while, but where do I start?
A good starting point to learn more about distribution fitting with R is Vito Ricci’s tutorial on CRAN. I also find the vignettes of the actuar and fitdistrplus packages a good read. I haven’t looked into the recently published Handbook of fitting statistical distributions with R, by Z. Karian and E.
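To make the idea concrete, here is a minimal maximum likelihood sketch in base R, fitted to simulated data. The gamma distribution, the sample and the starting values are assumptions for illustration; packages such as fitdistrplus wrap this kind of fit in a more convenient interface.

```r
set.seed(1)
# Simulated sample; in practice x would be the observed data
x <- rgamma(1000, shape = 2, rate = 0.5)

# Negative log-likelihood of a gamma distribution.
# Parameters are optimised on the log scale to keep them positive.
negloglik <- function(par) {
  -sum(dgamma(x, shape = exp(par[1]), rate = exp(par[2]), log = TRUE))
}

fit <- optim(par = c(0, 0), fn = negloglik)
est <- exp(fit$par)
names(est) <- c("shape", "rate")
est

# Visual check: histogram with the fitted density on top
hist(x, freq = FALSE, breaks = 30, main = "Gamma fit")
curve(dgamma(x, shape = est["shape"], rate = est["rate"]),
      add = TRUE, lwd = 2)
```

The estimates should land close to the parameters used in the simulation, and the overlay gives a quick sanity check before moving on to formal goodness-of-fit tests.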
Data analysis is often an iterative and interactive process. However, when I present on this subject, I often feel limited by the presentation software I use. It doesn’t matter if I use LaTeX/PDF, PowerPoint or Keynote. In all cases it is either very difficult or impossible to include interactive charts, such as Flash or SVG charts. As a result I have to switch between various applications during the talk. This can be fun, but quite often it is not.
Today we published version 0.1.5-1 of the ChainLadder package for R. It provides methods which are typically used in insurance claims reserving to forecast future claims payments.
Claims development and chain-ladder forecast of the RAA data set using the Mack method
The package started out of presentations given at the Stochastic Reserving Seminar at the Institute of Actuaries in 2007, 2008 and 2010, followed by talks at CAS meetings in 2008 and 2010.
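For readers new to the topic, the basic chain-ladder idea can be sketched in a few lines of base R. The triangle below is made up for illustration and the code does not use the ChainLadder package or the Mack method; it only shows the deterministic volume-weighted development factors.

```r
# Hypothetical cumulative claims triangle: rows are origin years,
# columns are development years, NA marks the unknown future
tri <- matrix(c(100, 150, 175,
                110, 168,  NA,
                120,  NA,  NA),
              nrow = 3, byrow = TRUE,
              dimnames = list(origin = 2009:2011, dev = 1:3))
n <- ncol(tri)

# Volume-weighted age-to-age factors from the observed part of the triangle
f <- sapply(1:(n - 1), function(k) {
  obs <- !is.na(tri[, k + 1])
  sum(tri[obs, k + 1]) / sum(tri[obs, k])
})

# Fill the lower right of the triangle by rolling the factors forward
full <- tri
for (k in 1:(n - 1)) {
  todo <- is.na(full[, k + 1])
  full[todo, k + 1] <- full[todo, k] * f[k]
}

f
full
```

The last column of `full` holds the projected ultimate claims per origin year; the stochastic methods in the package add measures of uncertainty around such point forecasts.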
My 12" iBook G4 is celebrating its 8th birthday today! Time for a little present. How about R 2.14.0?
The iBook is still in daily use, mostly for browsing the web, writing e-mails and this blog; and I still use it for R as well. For a long time it ran R 2.10.1, the last PowerPC binary version available on CRAN for Mac OS 10.4.11 (Tiger).
But, R 2.
Using R with LaTeX via Sweave is a great way to create reproducible output. However, using specific fonts, e.g. your corporate fonts, can be painful with pdflatex. Over the last few weeks I have fallen in love with the TeX format XeLaTeX and its XeTeX engine.
With XeLaTeX I had to overcome some hurdles, which I would like to share here:
attaching files, trimming and clipping images, learning how to use the tikzDevice package.
How many R related books have been published so far? Who is the most popular publisher? How many other manuals, tutorials and books have been published online? Let’s find out.
A few years ago I used the publication list on r-project.org as an argument with the IT department that R is an established statistical programming language and that they should allow me to install it on my PC. I believe at the time there were about 20 R related books available.
Following on from my article about accessing and plotting World Bank data with R I want to talk about how to change the initial view of a motion chart.
Over the last couple of weeks I have been asked a few times how to do this. For instance Stephen O’Grady wanted to create a motion chart which initially shows a line chart, rather than a bubble chart.
Changing the initial settings of a motion chart is actually quite easy, once you know how.
Over the past couple of days I played around with the data sets of the World Bank, and I have to admit that I am blown away by them. It is amazing to see what is available on their web site and it is worth visiting their Data Visualisation Tools page. It is fantastic that they provide an API to their data. They have used it to build an iPhone App which is pretty cool.
Today we published googleVis 0.2.9 on CRAN. The new version updates the package for the new features of the Google Visualisation API and brings a new in-page editor option.
Here is a simple example, displaying the participants of the R user Conference 2011 in Warwick by country. Notice the ‘Edit me’ button in the top left corner of the chart, which allows you to change and customise the graph.