################################################################################
# REQUIRED LIBRARIES
################################################################################
library(astsa)
library(dplyr)
library(ggplot2)
library(tidyr)
library(knitr)
################################################################################
# DATA IMPORT
################################################################################
# Import file
pass_ts <- read.csv("C:/Users/Juan Manuel/Documents/STAT 498/Project/EANA/passengers_ts.csv", 
                    header = TRUE)

# Remove year and month
pass_ts <- pass_ts %>% 
  select(-c(year, month))

# Create the ts
pass_ts <- pass_ts %>% 
  ts(start = c(2014, 1), end = c(2018, 6), deltat = 1/12)

1. Introduction

Argentina is the eighth largest country in the world; however, the country’s airline connectivity is not enough. While its neighbours like Brazil or Chile multiplied by three the number of domestic passengers in the period 2000 - 2015, such quantity remained almost unaltered in Argentina.

In the last three years the Ministry of Transportation started a new policy aimed to change this situation. For example, since 2015 fifty nine new domestic air routes were added and thirty airports have been under remodeling to enlarge facilities or upgrade navegation and security equipment.

The objective of this study is to explore how the number of domestic air passengers in Argentina has evolved in the last years.

2. Data set

The data for this projet was taken from “Tablas de Movimientos y Pasajeros EANA 2014 - 2018”, a report from Empresa de Navegaci?n A?rea Argentina (EANA). It can be accessed via the following link: https://www.eana.com.ar/estadisticas#informes

The data set contains a lot of information about airline passengers and flights in Argentina for the period 2014 - 2018. I concentrated in the number of domestic airline passengers for the 54 months available: from January, 2014 to June, 2018. The data is presented as an xlsx file and displayed in way that it is hard to import to R. Hence, before importing it, I pre-cleaned the data using Microsoft Excel.

3. Analysis

The first step in the analysis was to make a plot of the original data, where the number of domestic passengers is taken as response variable and the time as explanatory variable.

################################################################################
# 3. ANALYSIS
################################################################################
# Original data time plot
plot.ts(pass_ts[ , 1],
        ylab = "thousand passengers",
        main = "Domestic air passengers in Argentina 2014 - 2018")

As seen in the plot above, the number of air passengers in Argentina for the period under study presents an upward trend and the variability is not constant over time. Because of this two facts, I can claim that the weak stationarity conditions are not met. Additionally, it seems that there is a seasonal behaviour of the underlying process.
The steps before fitting a model include i) variance stabilization and ii) trend removal.

3.1. Variance stabilization transformation

The problem of heteroscedasticity is that, as time grows, small percent changes becomes large absolute changes. In order to stabilize the variance, I took the natural logarithm of the original data.

# Log transformation
log_dom_pass_ts <- log(pass_ts[ , 1])
plot.ts(log_dom_pass_ts,
        ylab = "log thousand passengers",
        main = "Domestic air passengers in Argentina 2014 - 2018")

This plot is similar to the first, but the range of the dependent variable is considerably smaller.

3.2. Trend removal

From the time plot we observed an upward trend that can be removed by differencing the data. Besides the time plot, one can use the ACF of a time series to determine if differencing is needed before fitting a model.

Note: In the x axis for the ACF and PACF plots, the integer numbers represent years. In between two integers there are twelve marks in the x axis, one for each month of the year.

# Correlation structure
log_acf <- acf2(log_dom_pass_ts, 
                max.lag = 30, 
                main = "log thousand passengers")

As we see in the ACF plot above, it presents a slow decay as the lag increases. This supports the initial hypothesis that difference is needed, so I differenced the log passengers data and repeated the ACF and PACF analysis.

3.2.1. Non seasonal differentiation

The ACF and PACF for the non seasonal differenced log thousand passengers follows:

# Differentiation
d_log_dom_pass_ts <- diff(log_dom_pass_ts)
d_log_acf <- acf2(d_log_dom_pass_ts,
                  max.lag = 45,
                  main = "(Non seasonal) differenced log thousand passengers")

Non seasonal differencing removes only one part of the correlation. In the ACF plot above we still observe sesonal correlation for \(12 * k\) for \(k = 1, 2, \ldots\). This is telling us that the number of air passengers of a certain month is related to the number of passengers for the same month in previous years. Since the ACF tails off for \(12 * k\) for \(k = 1, 2, \ldots\) and PACF cuts off after lag 1, a sesonal autoregressive component of order one is needed.
Additionally, the following time plot of the non seasonal differenced log data also helps in the analysis: we observe how the peaks and valleys repeat every twelve months.

# Time plot of non seasonal differenced data
plot.ts(d_log_dom_pass_ts,
        ylab = "log thousand passengers",
        main = "(Non seasonal differenced) log thousand passengers")
abline(h = 0, lwd = 1.5)

3.2.2. Seasonal differentiation

Because of the seasonal correlation that still remained, seasonal differentiation was performed. Once I took seasonal differentiation, the ACF values lie between the blue dashed lines; meaning that the autocorelation function is not significantly different from zero as seen in the plot below:

# Seasonal and non seasonal differenced log data - ACF/PACF
dd_log_dom_pass_ts <- diff(d_log_dom_pass_ts, 12)
dd_log_acf <- acf2(dd_log_dom_pass_ts,
                   max.lag = 35,
                   main = "(Seasonal and non seasonal) differenced log thousand passengers")

Finally, the time plot shows that now the process seems to be stationary:

# Seasonal and non seasonal differenced log data - Time plot
plot.ts(dd_log_dom_pass_ts,
        type = "l",
        ylab = "log thousand passengers",
        main = "(Seasonal and non seasonal) differenced log thousand passengers")
abline(h = 0, lwd = 1.5)

3.3. Model Fitting

Using the analysis of the previous sections, in this section I fitted two models to the log thousand passengers data using the function sarima() in package astsa. The objective of fitting two models is to identify whether or not a moving average component is needed. This is not really clear in the ACF/PACF and time plots analysis made before, so I fitted one model with q = 1 and one with q = 0 and then compared AICc and BIC values.
The models I fitted were:

3.3.1. Model 1

model_1: SARIMA(p = 1, d = 1, q = 1, P = 1, D = 1, Q = 0, S = 12)

# model_1 fitting
model_1 <- sarima(log_dom_pass_ts[-1], 
                  p = 1, d = 1, q = 1, 
                  P = 1, D = 1, Q = 0, S = 12, 
                  details = FALSE)
# model_1 table
kable(model_1$ttable,
      align = "l",
      digits = 2)
Estimate SE t.value p.value
ar1 0.71 0.12 5.89 0.00
ma1 -1.00 0.09 -10.61 0.00
sar1 -0.39 0.15 -2.63 0.01
kable(model_1$AICc, 
      align = "l",
      col.names = "AICc", 
      digits = 2)
AICc
-5.39
kable(model_1$BIC, 
      align = "l",
      col.names = "BIC", 
      digits = 2)
BIC
-6.33

3.3.2. Model 2

model_2: SARIMA(p = 1, d = 1, q = 0, P = 1, D = 1, Q = 0, S = 12)

# model_2 fitting
model_2 <- sarima(log_dom_pass_ts[-1], 
                  p = 1, d = 1, q = 0, 
                  P = 1, D = 1, Q = 0, S = 12,
                  details = FALSE)
# model_2 table
kable(model_2$ttable,
      align = "l",
      digits = 2)
Estimate SE t.value p.value
ar1 0.09 0.16 0.54 0.59
sar1 -0.42 0.16 -2.67 0.01
kable(model_2$AICc, 
      align = "l",
      col.names = "AICc", 
      digits = 2)
AICc
-5.29
kable(model_2$BIC, 
      align = "l",
      col.names = "BIC", 
      digits = 2)
BIC
-6.26

3.3.3. Model selection

From the tables above, we can see that the estimates of the coefficients for model 1 are all significantly different from zero; while this is not the case for model 2. In the latter, the p-value for the non seasonal autoregressive estimate is 0.59 meaning that we fail to reject the null the hypothesis that the coefficient’s estimate is equal to zero.
Additionally, AICc and BIC values are smaller for model 1, which leads me to prefer model 1 over model 2. There is one step left to conclude if the model is adequate or not: residual analysis. For this analysis I used the following plots:

# Residual plot for model_1
plot.ts(model_1$fit$residuals,
        ylab = "log thousand passengers",
        main = "Model 1 residuals")
abline(h = 0, lwd = 1.5)

# ACF/PACF for model_1 residuals
res_analysis_acf <- acf2(model_1$fit$residuals,
                         max.lag = 30,
                         main = "Model 1 residuals")

In the time plot of the residuals for model 1, they seem to be independent. Furthermore, the ACF/PACF values are not significantly different from zero; except for the PACF values corresponding to lag three and lag twelve. This result suggests the possibility that there is still some slight seasonal regularity in the residuals, but the addition of another seasonal autoregressive component does not improve the results.

3.4. Forecast

The objective of this section is to forecast the number of passengers for the next twelve months, that is, for the period July, 2018 to June, 2019. The forecasts were computed using the function sarima.for() in package astsa and the model used was model_1 from section 3.3.1., namely, SARIMA(p = 1, d = 1, q = 1, P = 1, D = 1, Q = 0, S = 12)

Note: To include the label for the y axis, I made a minor modification to the function sarima.for() and added an extra argument called “y_label”. Such modified function can be found in the Appendix section and is called sarima.for_modified()

################################################################################
# Forecasts
################################################################################
# Forecast model_1
model_1_fore <- sarima.for(log_dom_pass_ts[-1], n.ahead = 12, 
                                    p = 1, d = 1, q = 1, 
                                    P = 1, D = 1, Q = 0, S = 12,
                                    plot.all = TRUE) 

In the plot, forecasted values are presented in red and forecasted values plus one or two standard errors are shown by the grey areas. Dark grey corresponds to one standard error, while light grey is for two standard errors. Forecasted values follow the behaviour observed in previous years.

The first forecasted value corresponds to July 2018, where the number of passengers increases significantly from June. This pattern repeats year after year, and might be due to the fact that the winter holidays in Argentina start in July. Ski resorts are located in provinces far away from the most populated cities, so it is likely that tourists decide to travel by plane instead of by car.

From August, 2018 to December, 2018 the forecasted number of passengers is similar but slightly below the value forecasted for July, 2018; following the same pattern seen in previous years.

Next, there is another increase for the forecast corresponding to January 2019, followed by a decreasing trend for the months February, 2019 to June, 2019 where a new minimum is reached.

Since the data taken to run the forecast was the log transformed data, it is required to untransform the data to present the results. Such values are displayed in the next table:

Note: The unit of passengers in the original data set is thousands, so the number 1000 in the table corresponds to 1000000 passengers.

################################################################################
# Forecast untransformed
################################################################################
f_values <- data.frame("Year" = c(rep(2018, 6), rep(2019, 6)), 
                       "Month" = c("Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
                                   "Jan", "Feb", "Mar", "Apr", "May", "Jun"),
                       "log_Air_Passengers" = round(model_1_fore$pred, 2),
                       "Air_Passengers" = round(exp(model_1_fore$pred), 2))

kable(f_values, align = "l", digits = 2)
Year Month log_Air_Passengers Air_Passengers
2018 Jul 7.18 1317.09
2018 Aug 7.16 1292.15
2018 Sep 7.14 1266.21
2018 Oct 7.18 1316.32
2018 Nov 7.18 1313.80
2018 Dec 7.16 1289.76
2019 Jan 7.23 1376.79
2019 Feb 7.12 1239.12
2019 Mar 7.18 1310.43
2019 Apr 7.06 1169.59
2019 May 7.04 1136.88
2019 Jun 6.99 1084.76

4. Conclusions and future work

The objetive of this project was to fit a model to the number of domestic air passengers in Argentina for the period 2014 - 2018, and then make a forecast for the next twelve months. The model fitted was a \(SARIMA(1, 1, 1, 1, 1, 0)_{S = 12}\) and, in order to fit it, variable transformation and differentiation was done to the original data. Furthermore, time, ACF, and PACF plots were used to identify model orders. Residual analysis and AICc and BIC criterions were used to asses model adequacy.

With the mentioned SARIMA model the number of domestic passengers for the next twelve months, namely July 2018 to June 2019, were forecasted. The forecasts show a continuation of the upward trend and their seasonal behaviour is as seen in previous years.

A path for future work could be to study the effect caused by the removal of certain regulations regarding air transportation. For example, until July 31st, 2018 the Ministry of Transportation imposed a regulation on the minimum fare for domestic air tickets. Such regulation was removed for round trips within Argentina booked at least thirty days in advance. On August 1st, the day this regulation was removed, Aerol?neas Argentinas (a public company and the largest operator in the domestic market with approximately 70% market share) launched a promotion for qualifying flights with price reduction of 50%. This promotion was followed by other market players, like LATAM and Flybondi, and as a result an important increase in the number of passengers is expected from September, 2018. (Note that the regulation of minimum fare still works for tickets booked with less than thirty days, so the regulation change will not impact in the month of August, 2018).

To conclude, and according to the analysis, is expected that the number of domestic air passengers in Argentina continue growing the next year. Although the changes in minimum fare regulations seem to be beneficial to the market, is still too soon to measure and quantify its real impact.

5. References

Empresa Argentina de Navegaci?n A?rea
https://www.eana.com.ar/estadisticas#informes

Ministry of Transportation of Argentina
https://www.argentina.gob.ar/noticias/el-gobierno-nacional-impulsa-vuelos-de-cabotaje-mas-baratos-0

Organismo Regulador del Sistema Nacional de Aeropuertos (ORSNA)
https://www.orsna.gob.ar/obras-en-ejecucion-a-ejecutarse/