Tuesday, July 28, 2015

Charting Number of R Packages over time (Part 3)

This is the third and final blog post on analysing the number of new R packages that have been submitted over time.

Check out the previous blog posts:

In this blog post I will show you how you can perform Forecasting on our data to make some predictions on the possible number of new packages over the next 12 months.

There are 2 important points to note here:

  1. Only time will tell if these predictions are correct or nearly connect. Just like with any other prediction techniques.
  2. You cannot use just one of the Forecasting techniques in isolation to make a prediction. You need to use a number of functions/algorithms to see which one suits your data best.

The second point above is very important with all prediction techniques. Sometimes you see people/articles talking about them only using algorithm X. They have not considered any of the other techniques/algorithms. It is their favourite or preferred method. But that does not mean it works or is suitable for all data sets and all scenarios.

In this blog post I'm going to use 3 different forecasting functions, the in-build Forecast function in R, using HoltWinters and finally using ARIMA. Yes there are many more (it is R after all) and I'll leave these for you to explore.

1. Convert data set to Time Series data format

The first thing I need to do is to convert the data I want analyze into TimeSeries format (ts). This looks to have one record or instance for each data point.

So you cannot not have any missing data, or in my case any missing dates. Yes (for my data set) we could have some months where we do not have any submissions. What I could do is to work out mean values (or things like that) and fill for the missing months. But I'm feeling a bit lazy and after examining the data I see that we have a continuous set of data from September 2009 onwards. This is fine as most of the growth up to that point is flat.

So I need to subset the data to only include cases greater than or equal to September 2009 and less than or equal to June 2015. I wanted to explore July 2015 as the number for this month is incomplete.

The following code builds on the work we did in the second blog post in the series

library(forecast)
library(ggplot2)

# Subset the data
sub_data <- subset(data.sum, Group.date >= as.Date("2009-08-01", "%Y-%m-%d"))
sub_data <- subset(sub_data, Group.date <= as.Date("2015-06-01", "%Y-%m-%d"))

# Subset again to only take out the data we want to use in the time series
sub_data2 <- sub_data[,c("R_NUM")]

# Create the time series data, stating that it is monthly (12) and giving the start and end dates
ts_data <- ts(sub_data2, frequency=12, start=c(2009, 8), end=c(2015, 6))

# View the time series data
ts_data
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2009                               2   3   4   3   4
2010   5   5   1  11   5   2   5   4   4   5   1   3
2011  11   4   3   6   6   5   9  15   5   8  23  18
2012  33  17  51  28  37  33  50  71  41 231  51  60
2013  75  67  76  81  76  74  77  89 111  96 111 200
2014 155 129 175 140 145 133 155 207 232 162 229 310
2015 308 343 332 378 418 558    

We now have the data prepared for input to our Forecasting functions in R.

2. Using Forecast in R

For the Forecast function all you need to do is pass in the Time Series dataset and tell the function how my steps into the future you want it to predict. In all my examples I'll ask the functions to predict for the next 12 months.

ts_forecast <- forecast(ts_data, h=12)
ts_forecast

         Point Forecast     Lo 80     Hi 80 Lo 95     Hi 95
Jul 2015       447.1965  95.67066  784.3163     0  958.6669
Aug 2015       499.4344 115.94329  873.1822     0 1069.8689
Sep 2015       551.7875 123.88426  952.0773     0 1230.4707
Oct 2015       603.7212 156.89486 1078.0395     0 1370.2069
Nov 2015       654.7647 143.29718 1179.4903     0 1603.8335
Dec 2015       704.5162 135.76829 1352.8925     0 1844.7230
Jan 2016       752.6447 151.09936 1502.6088     0 2100.9708
Feb 2016       798.8877 156.37383 1652.0847     0 2575.3715
Mar 2016       843.0474 159.67095 1848.1703     0 2888.1738
Apr 2016       884.9849 154.59456 2061.7990     0 3281.6062
May 2016       924.6136 148.04651 2325.9060     0 3891.5064
Jun 2016       961.8922 138.67935 2531.7578     0 4395.6033

plot(ts_forecast)

NewImage

For this we get a very large range of values and very wide predictive intervals. If we limit the y axis we can get a better picture of the actual predictions.

plot(ts_forecast, ylim=range(0:1000))
NewImage

3. Using HoltWinters

For HoltWinters we can use the in-built R function for this. All we need to do is to pass in the Time Series data set. The first part we can plot the HoltWinters for the existing data set

?HoltWinters
hw <- HoltWinters(ts_data)
plot(hw)
NewImage

Now we want to predict for the next 12 months

forecast <- predict(hw, n.ahead = 12, prediction.interval = T, level = 0.95)
forecast

              fit       upr      lwr
Jul 2015 519.9304  599.8097 440.0512
Aug 2015 560.1083  648.4183 471.7983
Sep 2015 601.4528  701.0163 501.8892
Oct 2015 643.9639  757.3750 530.5529
Nov 2015 681.5168  811.0727 551.9608
Dec 2015 724.7363  872.4508 577.0218
Jan 2016 773.8308  941.4768 606.1848
Feb 2016 809.8836  999.0401 620.7272
Mar 2016 847.1448 1059.2371 635.0525
Apr 2016 898.4476 1134.7795 662.1158
May 2016 933.8755 1195.6532 672.0977
Jun 2016 972.3866 1260.7376 684.0356

plot(hw, forecast)
NewImage

4. Using ARIMA

For ARIMA we need to perform a simple conversion of the Time Series data into ARIMA format and then perform the forecase

fc_arima <- auto.arima(ts_data)
fc_fc_arima  <- forecast(fc_arima, h=12)
fc_fc_arima

         Point Forecast    Lo 80     Hi 80    Lo 95     Hi 95
Jul 2015       524.4758 476.2203  572.7314 450.6753  598.2764
Aug 2015       567.1156 513.2301  621.0012 484.7048  649.5265
Sep 2015       609.7554 548.3239  671.1869 515.8041  703.7068
Oct 2015       652.3952 581.6843  723.1062 544.2522  760.5383
Nov 2015       695.0350 613.5293  776.5408 570.3828  819.6873
Dec 2015       737.6748 644.0577  831.2920 594.4998  880.8499
Jan 2016       780.3147 673.4319  887.1974 616.8516  943.7777
Feb 2016       822.9545 701.7797  944.1292 637.6337 1008.2752
Mar 2016       865.5943 729.2004 1001.9881 656.9978 1074.1907
Apr 2016       908.2341 755.7718 1060.6963 675.0631 1141.4050
May 2016       950.8739 781.5559 1120.1918 691.9244 1209.8233
Jun 2016       993.5137 806.6027 1180.4246 707.6581 1279.3693

plot(fc_fc_arima, ylim=range(0:800))
NewImage

As you can see there are very different results from each of these forecasting techniques. If this was a real life project on real data then we would go about exploring a lot more of the Forecasting function available in R. The reason for this is to identify which R function and Forecasting algorithm works best for our data.

Which Forecasting technique would you choose from the selection above?

But will this function and algorithm always work with our data? The answer is NO. As our data evolves so may the algorithm that works best for our data. This is why the data science/analytics world is iterative. We need to recheck/revalidate the functions/algorithms to see if we need to start using something else or not. When we do need to use another function/algorithm you need to ask yourself why this has happened, what has changed in the data, what has changed in the business, etc.

No comments:

Post a Comment