The coffee-sales dataset on Kaggle includes two files, index_1.csv and index_2.csv. In this notebook, I use index_1.csv for the time series analysis.
Variable name | Description
---|---
date | Sales date
datetime | Sales timestamp
cash_type | Payment method
card | Card code if paying by card
money | Payment amount
coffee_name | Coffee name
Below are the packages required for the analysis.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(lubridate, dplyr, ggplot2, plotly, readxl, TSstudio, xts, tseries, forecast)
cofe <- read.csv("index_1.csv", header = TRUE)
head(cofe)
date datetime cash_type card
1 2024-03-01 2024-03-01 10:15:50.520 card ANON-0000-0000-0001
2 2024-03-01 2024-03-01 12:19:22.539 card ANON-0000-0000-0002
3 2024-03-01 2024-03-01 12:20:18.089 card ANON-0000-0000-0002
4 2024-03-01 2024-03-01 13:46:33.006 card ANON-0000-0000-0003
5 2024-03-01 2024-03-01 13:48:14.626 card ANON-0000-0000-0004
6 2024-03-01 2024-03-01 15:39:47.726 card ANON-0000-0000-0005
money coffee_name
1 38.7 Latte
2 38.7 Hot Chocolate
3 38.7 Hot Chocolate
4 28.9 Americano
5 38.7 Latte
6 33.8 Americano with Milk
options(repr.plot.width = 16, repr.plot.height = 8)
# ------------------ #
cofe %>% group_by(cash_type) %>% count() %>% summarise(percent = round(n/nrow(cofe)*100,2))
# A tibble: 2 × 2
cash_type percent
<chr> <dbl>
1 card 97.6
2 cash 2.45
ggplot(cofe, mapping = aes(x = coffee_name, fill = coffee_name)) +
  geom_bar() + facet_wrap( ~ cash_type) +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = "center") +
  labs(x = "Coffee Name", y = "Count", title = "Number of each type of Coffee") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1), legend.position = "none")
Since months have different numbers of days, each date is mapped to the first day of its month so that total monthly revenue can be aggregated.
fig <- cofe %>% mutate(time = floor_date(as.Date(date), unit = "month")) %>%
  group_by(time) %>%
  summarise(sales = sum(money)) %>%
  ggplot(mapping = aes(x = time, y = sales)) +
  geom_line(linewidth = 0.8, color = "blue") +
  geom_point(color = "darkblue") +
  labs(x = "Month", y = "Total", title = "Monthly Revenue")
ggplotly(fig)
# A tibble: 6 × 2
date sales
<date> <dbl>
1 2024-03-01 396.
2 2024-03-02 228.
3 2024-03-03 349.
4 2024-03-04 135.
5 2024-03-05 338.
6 2024-03-06 170.
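The daily table cfSales printed above is used for the xts series below, but its construction code does not appear in the notebook. A minimal reconstruction sketch (the pipeline and the `daily_sales` helper are my assumptions, not the author's original code), demonstrated on hypothetical rows mimicking index_1.csv:

```r
library(dplyr)

# Reconstruction sketch: collapse transaction-level rows to daily revenue.
daily_sales <- function(df) {
  df %>%
    mutate(date = as.Date(date)) %>%
    group_by(date) %>%
    summarise(sales = sum(money), .groups = "drop") %>%
    arrange(date)
}

# Toy demonstration (hypothetical values, not the real data):
toy <- data.frame(date  = c("2024-03-01", "2024-03-01", "2024-03-02"),
                  money = c(38.7, 28.9, 33.8))
daily_sales(toy)

# In the notebook this would presumably be:
# cfSales <- daily_sales(cofe)
```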
Using the auto.arima() function.
ts <- xts(cfSales$sales, order.by = cfSales$date)
ts_info(ts)
The ts series is a xts object with 1 variable and 381 observations
Frequency: daily
Start time: 2024-03-01
End time: 2025-03-23
ts_plot(ts)
Create training and test sets
h1 <- 40 # Size of the test data
h2 <- 30 # Number of forecast periods
# -------------- #
# Create data train and data test
ts_train <- ts[1:(nrow(cfSales) - h1)]
ts_test  <- ts[(nrow(cfSales) - h1 + 1):nrow(cfSales)]
# -------------- #
# Using function auto.arima()
arima_md <- auto.arima(ts_train)
fc_arima_md <- forecast(arima_md, h = h1)
# -------------- #
# Result
test_forecast(ts,
forecast.obj = fc_arima_md,
test = ts_test)
As you can see from the graph, it’s terrible, right? So it’s time to do it manually!
diff1 <- diff(cfSales$sales)
suppressWarnings(adf.test(diff1))
Augmented Dickey-Fuller Test
data: diff1
Dickey-Fuller = -10.241, Lag order = 7, p-value = 0.01
alternative hypothesis: stationary
The p-value = 0.01 < 0.05, so we reject the null hypothesis of a unit root and conclude that the differenced series is stationary. We therefore choose d = 1 and, for the seasonal part, D = 1.
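Since D = 1 is also chosen, it is worth confirming that the seasonally differenced series is stationary too. A quick sketch, assuming cfSales and the weekly period of 7 used later in the notebook:

```r
library(tseries)

# Apply a seasonal difference (lag = 7) on top of the first difference,
# then re-run the ADF test; a small p-value again indicates stationarity.
diff1 <- diff(cfSales$sales)
diff7 <- diff(diff1, lag = 7)
suppressWarnings(adf.test(diff7))
```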
The autocorrelations decrease gradually, but lag = 7 and lag = 14 are noticeably higher in both the ACF and PACF plots. This suggests a seasonal frequency of 7 (weekly) for the time series.
From the ACF plot (ignoring lag = 0): the first three lags decrease and then lag = 4 increases, so we choose p = 3. After lag = 7 and lag = 14 it is difficult to determine which lag is suitable; to avoid choosing too large a lag, we choose P = 1 (because lag/frequency = 7/7).

From the PACF plot (which starts at lag = 1): the first three lags decrease and then lag = 4 increases, but lag = 3 is quite small, so I do not use it and choose q = 2. At lag = 7 and lag = 14 the partial autocorrelation is close to 0, so we choose Q = 0.
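The ACF and PACF plots discussed above are not generated anywhere in the visible code; they can be produced from the differenced series with base R (a sketch assuming diff1 from the ADF step):

```r
# Side-by-side ACF and PACF of the differenced daily sales,
# with enough lags to expose the weekly pattern at lags 7, 14, 21.
par(mfrow = c(1, 2))
acf(diff1,  lag.max = 28, main = "ACF of diff(sales)")
pacf(diff1, lag.max = 28, main = "PACF of diff(sales)")
par(mfrow = c(1, 1))
```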
md_hand <- arima(ts_train, order = c(3,1,2), seasonal = list(order = c(1,1,0), period = 7))
fc_md_hand <- forecast(md_hand, h = h1)
print("ARIMA(3,1,2)(1,1,0)[7]"); round(accuracy(fc_md_hand, ts_test),2)
[1] "ARIMA(3,1,2)(1,1,0)[7]"
ME RMSE MAE MPE MAPE MASE ACF1
Training set 6.77 159.07 123.26 -34.45 68.37 0.86 -0.01
Test set -6.94 115.57 87.88 -10.80 24.65 0.62 NA
For the test set, the evaluation criteria (the smaller the better) include:
MAE: Mean Absolute Error.
MAPE: Mean Absolute Percentage Error.
Both are noticeably lower on the test set than on the training set, so the model is quite good.
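For concreteness, these two criteria can be computed directly. A self-contained toy example (invented numbers, not the notebook's data):

```r
# MAE: mean absolute error; MAPE: mean absolute percentage error.
actual <- c(100, 200, 300)
fcst   <- c( 90, 210, 330)

mae  <- mean(abs(actual - fcst))                   # (10 + 10 + 30) / 3 = 16.67
mape <- mean(abs((actual - fcst) / actual)) * 100  # (0.10 + 0.05 + 0.10) / 3 * 100 = 8.33
```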
test_forecast(ts,
forecast.obj = fc_md_hand,
test = ts_test)
final_md <- arima(cfSales$sales, order = c(3,1,2), seasonal = list(order = c(1,1,0), period = 7))
checkresiduals(final_md)
Ljung-Box test
data: Residuals from ARIMA(3,1,2)(1,1,0)[7]
Q* = 14.756, df = 4, p-value = 0.005234
Model df: 6. Total lags used: 10
fc_final_md <- forecast(final_md, h = h2)
plot_forecast(fc_final_md)
The Ljung-Box p-value = 0.005234 < 0.05 shows that the residuals are still autocorrelated, i.e. the model does not fully capture the structure of the series.
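One common follow-up when the Ljung-Box test rejects is to compare nearby candidate orders by AIC. A hedged sketch, assuming cfSales and the seasonal period of 7 (the candidate set here is my own choice, not the notebook's):

```r
# Compare a few (p, d, q) candidates around ARIMA(3,1,2)(1,1,0)[7] by AIC;
# lower AIC suggests a better trade-off between fit and complexity.
candidates <- list(c(3, 1, 2), c(2, 1, 2), c(3, 1, 3), c(2, 1, 3))
for (ord in candidates) {
  md <- arima(cfSales$sales, order = ord,
              seasonal = list(order = c(1, 1, 0), period = 7))
  cat(sprintf("ARIMA(%d,%d,%d)(1,1,0)[7]  AIC = %.2f\n",
              ord[1], ord[2], ord[3], AIC(md)))
}
```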