After six years on Google’s Blogger platform I migrated my blog to Hugo. Blogger was a great platform to start blogging: it was, and still is, very easy to set up, and perhaps most importantly I didn’t have to invest time or money to test whether I enjoyed writing a blog.
However, over the last year or so, a couple of things started to annoy me so much that I stopped enjoying writing posts on Blogger.
I have mused over Test Driven Analysis on this blog before, but it was Richard Pugh’s talk on SAS to R Migration at LondonR last week that brought the topic back into my mind and clarified a few things.
Rich’s presentation focused on the challenge of how to ensure that the new system (R) would provide the same answers as the legacy system (SAS).
This is when it clicked with me: My brain is just another system as well.
The other day I got stuck working with a huge data set using data.table in R. It took me a little while to realise that I had to produce a minimal reproducible example to actually understand why I got stuck in the first place. I know, this is the mantra I should follow before I reach out to R-help, Stack Overflow or indeed the package authors. Of course, more often than not, by following this advice, the problem becomes clear and with that the solution obvious.
They all try to do something new and risk being seen as a fool.
Over the last few days I stumbled over three videos by a physicist, an entrepreneur and an actor, who at first seem to have little in common, but they do. They all need to know when they are wrong in order to progress. If you are not wrong, then you are likely to be right, but that is often difficult to prove, and often impossible.
David Spiegelhalter gave a fascinating talk on Communicating Risk and Uncertainty to the Public & Policymakers at the Grantham Institute of the Imperial College in London last Tuesday.
In a very engaging way David gave many examples and anecdotes from his career in academia and advisory. I believe his talk will be published on the Grantham Institute’s YouTube channel, so I will only share a few highlights and thoughts that stuck in my mind here.
Everyone is talking about Big Data, but it is the small data that is holding everything together. The small slowly changing reference tables are the linchpins. Unfortunately, too often politics gets in the way as those small tables, maintained by humans, don’t get the attention they deserve; or in other words their owners, if they exist (many of these little tables are orphans), make changes without understanding the potential consequences on downstream systems.
The other day someone mentioned to me a rule of thumb that he was using to estimate the number of years \(n\) it would take for inflation to destroy half of the purchasing power of today’s money: \[ n = \frac{70}{p}\] Here \(p\) is the inflation in percent, e.g. if the inflation rate is \(2\%\) then today’s money would buy only half of today’s goods and services in 35 years.
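The rule of thumb follows from compound growth. A short derivation, assuming inflation compounds once a year at a constant rate of \(p\) percent, so that prices double (and purchasing power halves) after \(n\) years: \[ \left(1 + \frac{p}{100}\right)^n = 2 \quad\Rightarrow\quad n = \frac{\ln 2}{\ln\left(1 + \frac{p}{100}\right)} \approx \frac{100 \ln 2}{p} \approx \frac{69.3}{p} \approx \frac{70}{p} \] The approximation uses \(\ln(1 + x) \approx x\) for small \(x\). For \(p = 2\) the exact expression gives about 35.0 years, so the rule of thumb is very close at low inflation rates and becomes less accurate as \(p\) grows.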
Deploying applications via Docker containers is the current talk of the town. I have heard about Docker and played around with it a little, but when Dirk Eddelbuettel posted his R and Docker talk last Friday I got really excited and had to have a go myself.
My aim was to rent some resources in the cloud, pull an RStudio Server container and run RStudio in a browser. It was actually surprisingly simple to get started.
The World Cup finally kicked off last Thursday and I have seen some fantastic games already. The Netherlands perhaps appear to be the strongest side so far, following their 5-1 victory over Spain.
To me the question is not only which country will win the World Cup, but also which prediction model will come closest to the actual results. Here I present three teams: FiveThirtyEight, a polling aggregation website; Groll & Schauberger, two academics from Munich; and finally Lloyd’s of London, the insurance market.
Saturday’s Eurovision Song Contest (ESC) from Copenhagen was hilarious as usual with acts from all over Europe and some more or less sensible gimmicks: a circular piano, a giant hamster wheel, a see-saw, or indeed a beard and fancy dress.
The results of the ESC were only a little different to what the bookmakers in the UK had predicted before the event started. Sweden was seen as the favourite, followed by Austria, Netherlands, Armenia and the UK.
Recently the Guardian’s Data Blog reported on the results from the third National Survey of Sexual Attitudes and Lifestyles in the UK. One of the questions asked in the survey was whether the participants had had sex in the last four weeks. The results (a summary is available in this infographic) show that the British have their most sexually active period when they are in their 20s to 40s.
I noticed that the monthly number of posts on R-bloggers stopped increasing over the last year. Indeed, the last couple of months saw a decline in posts compared to the previous year. Thus, has most been said and written about R already? Who knows? Well, I took a stab at looking into the future. However, I can tell you already that I am not convinced by my predictions. But maybe someone else will be inspired to take this work forward.
The Christmas and New Year’s break is over, yet there is still time to return unwanted presents. Return to Santa was the title of an article in the Economist that highlighted the impact on online retailers, as return rates can be alarmingly high.
The article quotes a study by Christian Schulze of the Frankfurt School of Finance and Management, which analysed the return habits of customers who bought at least five items over a five year period from a large European online retailer.
About half a year ago Ian Branagan, Chief Risk Officer of Renaissance Re, a Bermudian reinsurance company with a focus on property catastrophe insurance, gave a talk about the usage of models in risk management and how they evolved over the last twenty years. Ian’s presentation, titled with the famous quote of George E.P. Box: “All models are wrong, but some are useful”, was part of the lunchtime lecture series at Lloyd’s, organised by the Insurance Institute of London.
I was trained as a mathematician and it was only last year, when I attended the Royal Statistical Society conference and met many statisticians, that I understood how different the two groups are.
In mathematics you often start with some axioms, things you assume to be true, and these axioms are then the basis from which new theory is derived. In statistics, or more generally in science, you start with a theory, or better a hypothesis, and try to disprove it.
Over the last year I worked with two colleagues of mine on the subject of inflation and claims inflation in particular. I didn’t expect it to be such a challenging topic, but we ended up with more questions than answers. The key question and biggest challenge is to define what inflation, or indeed claims inflation actually is and how to measure it. We published a summary of our thoughts and findings in this month’s issue of The Actuary.
At the last LondonR meeting Francine Bennett from Mastodon C shared some of her experience and findings from an analysis of a large prescriptions data set of the UK’s National Health Service (NHS). However, it was her last slide that I found the most thought provoking. It asked for the definition of the following term: Test-driven analysis? Francine explained that test driven development (TDD) is a concept often used in software development for quality assurance and she wondered if a similar approach could also be used for data analysis.
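One way to picture the idea is to write the check before trusting the analysis. The following is only a minimal sketch in Python, not anything shown in the talk: the function `total_cost`, the field names `quantity` and `unit_cost`, and the tiny data set are all hypothetical and chosen so that the expected answer can be verified by hand.

```python
def total_cost(records):
    """Sum quantity times unit cost over all (hypothetical) prescription records."""
    return sum(r["quantity"] * r["unit_cost"] for r in records)


def test_total_cost():
    # Hand-checked toy data: 2 * 1.50 + 3 * 2.00 = 9.00
    records = [
        {"quantity": 2, "unit_cost": 1.50},
        {"quantity": 3, "unit_cost": 2.00},
    ]
    assert total_cost(records) == 9.00
    # An empty data set should cost nothing
    assert total_cost([]) == 0


# Run the check first; only then apply total_cost to the real data
test_total_cost()
print("all checks passed")
```

The point of the sketch is the order of work: the expected result on a small, hand-checked case is stated first, and the analysis code only counts as done once it reproduces it.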
Be motivated. R has a steep learning curve. Find a problem you can’t solve otherwise. E.g. plotting multivariate data, a statistical analysis for which an R function exists already. Download and install R. Get to know the R console. Learn how to install additional packages, how to access the history, how to use auto completion and open the help system. Review the R Installation and Administration manual and check out the free books section on CRAN.
Lloyd’s of London is looking for a Data Scientist as part of the Analysis team. See Lloyd’s career web site for more details.
A friend of mine told me the secret of making money at the stock market. “It’s easy”, he said.
All I would have to do is to buy a big jar of ants. Then I should observe the ants’ movements on my kitchen table, while following the stock market.
I shall keep the ants which walk in line with the stock market and remove those who don’t. Eventually I would have one ant left that walked all the way in line with the stock market.
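The selection effect in this story can be sketched with a small simulation. This is an illustration under stated assumptions, not part of the original tale: the market goes up or down each day with equal probability, every ant walks "up" or "down" at random, and the function name `surviving_ants` is made up for this example.

```python
import random


def surviving_ants(n_ants, n_days, seed=0):
    """Return the ants that moved in line with the market on every day."""
    rng = random.Random(seed)
    ants = list(range(n_ants))
    for _ in range(n_days):
        market_up = rng.random() < 0.5
        # Keep only the ants whose random move matches the market's move
        ants = [a for a in ants if (rng.random() < 0.5) == market_up]
    return ants


for days in (0, 5, 10):
    print(days, len(surviving_ants(1024, days)))
```

On average half of the remaining ants drop out each day, so starting with 1024 ants one would expect roughly one survivor after ten days, even though no ant knows anything about the market.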
I really like gists as a quick way to include more lengthy code snippets into my blog posts. However, I am not a git user as such, and so I was quite concerned when I noticed that all my gists on this blog had vanished after Christmas. I suppose this was a result of Github’s downtime on December 22nd.
Thankfully an email to the support guys at Github resolved the issue within a few hours.
I discovered an old classic game of mine again: Moon-buggy by Jochen Voss, based on the even older Moon Patrol, which celebrates its 30th birthday this year. I remember installing the command line game on my Sun SPARCstation 1 computer at university many moons ago. Hours of fun! Well, wasted time actually. Never mind, I am delighted to have found it again. You can’t beat command line games. One day I shall try to control the moon buggy with my Arduino.
At last week’s Royal Statistical Society (RSS) conference Hal Varian, Chief Economist at Google, gave a panel talk about ‘Statistics at Google’. Could he get a better audience than the RSS? Hal talked about his career in academia and at Google. He reminded us of the days when Google was still a small start up with no real idea about how they could actually generate revenue. At that time Eric Schmidt asked him to ‘take a look’ at advertising because ‘it might make us a little money’.
The German news magazine Der Spiegel published a series of articles [1, 2] around career developments. The stories suggest that career aspirations of young professionals today are somewhat different to those of previous generations in Germany. Apparently money and people management responsibility are less desirable for new starters compared to being able to participate in interesting projects and to maintain a healthy work-life balance. Hierarchies are seen as a means to an end, and should be more flexible, depending on requirements and skill sets.
The men’s 100m sprint final of the 2012 London Olympics is over and Usain Bolt won the gold medal again with a winning time of 9.63s. Time to compare the result with my forecast of 9.68s, posted on 22 July.
My simple log-linear model predicted a winning time of 9.68s with a prediction interval from 9.39s to 9.97s. Well, that is of course a big interval of more than half a second, or ±3%.
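A quick check of the arithmetic, using only the figures quoted above. This is a sketch of the comparison in Python, not the original log-linear model, and the variable names are my own.

```python
# Figures quoted in the post
forecast = 9.68              # predicted winning time in seconds
lower, upper = 9.39, 9.97    # prediction interval in seconds
actual = 9.63                # Usain Bolt's winning time in seconds

width = upper - lower                           # full width of the interval
half_width_pct = 100 * (width / 2) / forecast   # half-width relative to the forecast
error = actual - forecast                       # negative means faster than predicted
inside_interval = lower <= actual <= upper

print(f"interval width: {width:.2f}s")
print(f"half-width: {half_width_pct:.1f}% of the forecast")
print(f"forecast error: {error:+.2f}s")
print(f"actual time inside interval: {inside_interval}")
```

The half-width of 0.29s relative to 9.68s is where the figure of roughly ±3% comes from.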
It is less than a week before the 2012 Olympic games will start in London. No surprise therefore that the papers are all over it, including a lot of data and statistics around the games.
The Economist investigated the potential financial impact on sponsors (some benefits), tax payers (no benefits) and the athletes (if they are lucky) in its recent issue.
The Guardian has a whole series around the Olympics, including the data of all Summer Olympic Medallists since 1896.
A new version of googleVis has been released on CRAN and the project site. Version 0.2.16 adds the functionality to plot quarterly and monthly data as a motion chart.
To illustrate the new feature I looked for a quarterly data set and stumbled across the quarterly UK house price data published by Nationwide, a building society. The data is available in a spreadsheet format and presents the average house prices, both in absolute terms and indexed to 100 in Q1 1993, by region in the UK from Q4 1973 to Q1 2012.
John D. Cook gave a great talk about ‘Why and how people use R’. The talk resonated with me and highlighted why R is such a great tool for end user computing, a topic which has become increasingly important in the European insurance industry. John’s main point on why people use R is that R gets the job done and I think he is spot on.
Last Saturday I met the guys from RStudio at the R in Finance conference in Chicago. I was curious to find out what RStudio could offer. In the past I have used mostly Emacs + ESS for editing R files. Well, and what a surprise it was. JJ, Joe and Josh showed me a preview of version 0.96 of their software, which adds a close integration of Sweave and knitr to RStudio, helping to create dynamic web reports with the new R Markdown and R HTML formats more easily.
The Guardian published a nice summary and link collection of an interdisciplinary visualisation workshop hosted by Microsoft dedicated to visualising probability and risk. Check it out here.
The links I found most interesting were those to the pages of Gregor Aisch and Moritz Stefaner. You may have come across their work in the past, as Moritz worked on the OECD better life index and Gregor contributed to the Where does my money go site.
During my university time I worked on the IT help desk for a while. One day I received a call from a professor, who said that his printer had stopped working. So I asked him if there was a message on the display and if he could read it to me. “Oh yes”, he said, “it says: ‘Load A4 paper.’” Rachel King quotes a study by Cisco on ZDnet, which claims to have found that college students and young employees under the age of 30 would rather take a lower salary than have no social media freedom, device flexibility and work mobility.
I am amazed by the number of comments I received on my recent blog entry about “by”, “apply” and friends. I had started my post by pointing out that R is a language. Well indeed, I have come to the conclusion that it is a language with lots of irregular expressions and dialects. It feels a bit like German or French where you have to learn and memorise the different articles.
Why the old and the new need to share time together. It takes time to appreciate the new. Even if the new is much better than the old. It is easy to forget when you yourself created the exciting new.
At the end of August 2011 Google announced a new Blogger interface. The new interface offered about the same functionality, but had a different look and feel. At first I was reluctant to use it.
Many thanks to all who participated in the survey about writing R package vignettes. Following my post last Thursday the responses came in quickly in the evening and all day on Friday. Since Saturday the response rate has been decreasing constantly and I think it is time for a summary based on the 56 responses received. Summary: How to write a good vignette. Length: Trust yourself, but aim for about 20 pages.
Over the years I convinced my colleagues and IT guys that LaTeX/XeLaTeX is the way forward to produce lots of customer reports with individual data, charts, analysis and text. Success! But of course the operating system in the office is still MS Windows. With my background in Solaris/Linux/Mac OSX I am still a little bit lost in the Windows world, when I have to do such simple tasks as finding and replacing a string in lots of files.
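One platform-independent way to find and replace a string in lots of files is a short script. The following is a minimal sketch in Python using only the standard library, not the approach described in the post: the function name `replace_in_files`, the `*.tex` pattern and the file names are assumptions made for illustration, and the demo runs in a throwaway directory so that no real files are touched.

```python
from pathlib import Path
import tempfile


def replace_in_files(root, glob_pattern, old, new, encoding="utf-8"):
    """Replace `old` with `new` in every file under `root` matching `glob_pattern`.

    Returns the list of paths that were changed.
    """
    changed = []
    for path in sorted(Path(root).rglob(glob_pattern)):
        if not path.is_file():
            continue
        text = path.read_text(encoding=encoding)
        if old in text:
            path.write_text(text.replace(old, new), encoding=encoding)
            changed.append(path)
    return changed


# Demo on hypothetical report files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "report1.tex").write_text("Client: ACME\n", encoding="utf-8")
    (Path(tmp) / "report2.tex").write_text("No match here\n", encoding="utf-8")
    changed = replace_in_files(tmp, "*.tex", "ACME", "Example Ltd")
    print([p.name for p in changed])
```

The function searches sub-directories as well (`rglob`) and only rewrites files that actually contain the search string, so untouched files keep their modification time.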
We need more data journalism. How else will we find the nuggets of data and information worth reading?
Life should become easier for data journalists, as the Guardian, one of the data journalism pioneers, points out in this article about the new open data initiative of the European Union (EU). The aims of the EU’s open data strategy are bold. Data is seen as the new gold of the digital age.
My 12" iBook G4 is celebrating its 8th birthday today! Time for a little present. How about R 2.14.0?
The iBook is still in daily use, mostly for browsing the web, writing e-mails and this blog; and I still use it for R as well. For a long time it ran R 2.10.1, the last PowerPC binary version available on CRAN for Mac OS 10.4.11 (Tiger).
But, R 2.
How many R related books have been published so far? Who is the most popular publisher? How many other manuals, tutorials and books have been published online? Let’s find out.
A few years ago I used the publication list on r-project.org as an argument with the IT department that R is an established statistical programming language and that they should allow me to install it on my PC. I believe at the time there were about 20 R related books available.