The coffee-sales dataset on Kaggle includes two files, index_1.csv and index_2.csv. In this notebook, I use index_1.csv for the time series analysis.
Variable name | Description
---|---
date | Sales date
datetime | Sales timestamp
cash_type | Payment method
card | Card code if paying by card
money | Payment amount
coffee_name | Coffee name
Below are the packages required for the analysis.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(lubridate, dplyr, ggplot2, plotly, readxl, TSstudio, xts, tseries, forecast)
cofe <- read.csv("index_1.csv", header = TRUE)
head(cofe)
date datetime cash_type card
1 2024-03-01 2024-03-01 10:15:50.520 card ANON-0000-0000-0001
2 2024-03-01 2024-03-01 12:19:22.539 card ANON-0000-0000-0002
3 2024-03-01 2024-03-01 12:20:18.089 card ANON-0000-0000-0002
4 2024-03-01 2024-03-01 13:46:33.006 card ANON-0000-0000-0003
5 2024-03-01 2024-03-01 13:48:14.626 card ANON-0000-0000-0004
6 2024-03-01 2024-03-01 15:39:47.726 card ANON-0000-0000-0005
money coffee_name
1 38.7 Latte
2 38.7 Hot Chocolate
3 38.7 Hot Chocolate
4 28.9 Americano
5 38.7 Latte
6 33.8 Americano with Milk
options(repr.plot.width = 16, repr.plot.height = 8)
# ------------------ #
cofe %>% group_by(cash_type) %>% count() %>% summarise(percent = round(n/nrow(cofe)*100,2))
# A tibble: 2 × 2
cash_type percent
<chr> <dbl>
1 card 97.6
2 cash 2.45
ggplot(cofe, mapping = aes(x = coffee_name, fill = coffee_name)) +
  geom_bar() + facet_wrap( ~ cash_type) +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = "center") +
  labs(x = "Coffee Name", y = "Count", title = "Number of each type of Coffee") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1), legend.position = "none")
Since months have different numbers of days, each date is mapped to the first day of its month so that total monthly revenue can be aggregated.
fig <- cofe %>% mutate(time = floor_date(as.Date(date), unit = "month")) %>%
  group_by(time) %>%
  summarise(sales = sum(money)) %>%
  ggplot(mapping = aes(x = time, y = sales)) +
  geom_line(linewidth = 0.8, color = "blue") +
  geom_point(color = "darkblue") +
  labs(x = "Month", y = "Total", title = "Monthly Revenue")
ggplotly(fig)
# A tibble: 6 × 2
date sales
<date> <dbl>
1 2024-03-01 396.
2 2024-03-02 228.
3 2024-03-03 349.
4 2024-03-04 135.
5 2024-03-05 338.
6 2024-03-06 170.
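The daily table cfSales printed above is used for the xts series below, but its construction code does not appear in the notebook. A minimal reconstruction sketch (the pipeline and the `daily_sales` helper are my assumptions, not the author's original code), demonstrated on hypothetical rows mimicking index_1.csv:

```r
library(dplyr)

# Reconstruction sketch: collapse transaction-level rows to daily revenue.
daily_sales <- function(df) {
  df %>%
    mutate(date = as.Date(date)) %>%
    group_by(date) %>%
    summarise(sales = sum(money), .groups = "drop") %>%
    arrange(date)
}

# Toy demonstration (hypothetical values, not the real data):
toy <- data.frame(date  = c("2024-03-01", "2024-03-01", "2024-03-02"),
                  money = c(38.7, 28.9, 33.8))
daily_sales(toy)

# In the notebook this would presumably be:
# cfSales <- daily_sales(cofe)
```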
Using the auto.arima() function.
ts <- xts(cfSales$sales, order.by = cfSales$date)
ts_info(ts)
The ts series is a xts object with 1 variable and 381 observations
Frequency: daily
Start time: 2024-03-01
End time: 2025-03-23
ts_plot(ts)
Create training and test sets
h1 <- 40 # Size of the test data
h2 <- 30 # Number of forecast periods
# -------------- #
# Create data train and data test
ts_train <- ts[1:(nrow(cfSales) - h1)]
ts_test  <- ts[(nrow(cfSales) - h1 + 1):nrow(cfSales)]
# -------------- #
# Using function auto.arima()
arima_md <- auto.arima(ts_train)
fc_arima_md <- forecast(arima_md, h = h1)
# -------------- #
# Result
test_forecast(ts,
forecast.obj = fc_arima_md,
test = ts_test)
As you can see from the graph, it’s terrible, right? So it’s time to do it manually!
diff1 <- diff(cfSales$sales)
suppressWarnings(adf.test(diff1))
Augmented Dickey-Fuller Test
data: diff1
Dickey-Fuller = -10.241, Lag order = 7, p-value = 0.01
alternative hypothesis: stationary
The p-value = 0.01 < 0.05, so we reject the null hypothesis of a unit root and conclude that the differenced series is stationary. We therefore choose d = 1 and, for the seasonal part, D = 1.
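Since D = 1 is also chosen, it is worth confirming that the seasonally differenced series is stationary too. A quick sketch, assuming cfSales and the weekly period of 7 used later in the notebook:

```r
library(tseries)

# Apply a seasonal difference (lag = 7) on top of the first difference,
# then re-run the ADF test; a small p-value again indicates stationarity.
diff1 <- diff(cfSales$sales)
diff7 <- diff(diff1, lag = 7)
suppressWarnings(adf.test(diff7))
```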
The autocorrelations decrease gradually, but lag = 7 and lag = 14 are noticeably higher in both the ACF and PACF plots. This suggests a seasonal frequency of 7 (weekly) for the time series.
From the ACF plot (ignoring lag = 0): the first three lags decrease and then lag = 4 increases, so we choose p = 3. After lag = 7 and lag = 14 it is difficult to determine which lag is suitable; to avoid choosing too large a lag, we choose P = 1 (because lag/frequency = 7/7).

From the PACF plot (which starts at lag = 1): the first three lags decrease and then lag = 4 increases, but lag = 3 is quite small, so I do not use it and choose q = 2. At lag = 7 and lag = 14 the partial autocorrelation is close to 0, so we choose Q = 0.
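The ACF and PACF plots discussed above are not generated anywhere in the visible code; they can be produced from the differenced series with base R (a sketch assuming diff1 from the ADF step):

```r
# Side-by-side ACF and PACF of the differenced daily sales,
# with enough lags to expose the weekly pattern at lags 7, 14, 21.
par(mfrow = c(1, 2))
acf(diff1,  lag.max = 28, main = "ACF of diff(sales)")
pacf(diff1, lag.max = 28, main = "PACF of diff(sales)")
par(mfrow = c(1, 1))
```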
md_hand <- arima(ts_train, order = c(3,1,2), seasonal = list(order = c(1,1,0), period = 7))
fc_md_hand <- forecast(md_hand, h = h1)
print("ARIMA(3,1,2)(1,1,0)[7]"); round(accuracy(fc_md_hand, ts_test),2)
[1] "ARIMA(3,1,2)(1,1,0)[7]"
ME RMSE MAE MPE MAPE MASE ACF1
Training set 6.77 159.07 123.26 -34.45 68.37 0.86 -0.01
Test set -6.94 115.57 87.88 -10.80 24.65 0.62 NA
For the test set, the evaluation criteria (the smaller the better) include:
MAE: Mean Absolute Error.
MAPE: Mean Absolute Percentage Error.
Both are noticeably lower on the test set than on the training set, so the model is quite good.
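For concreteness, these two criteria can be computed directly. A self-contained toy example (invented numbers, not the notebook's data):

```r
# MAE: mean absolute error; MAPE: mean absolute percentage error.
actual <- c(100, 200, 300)
fcst   <- c( 90, 210, 330)

mae  <- mean(abs(actual - fcst))                   # (10 + 10 + 30) / 3 = 16.67
mape <- mean(abs((actual - fcst) / actual)) * 100  # (0.10 + 0.05 + 0.10) / 3 * 100 = 8.33
```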
test_forecast(ts,
forecast.obj = fc_md_hand,
test = ts_test)
final_md <- arima(cfSales$sales, order = c(3,1,2), seasonal = list(order = c(1,1,0), period = 7))
checkresiduals(final_md)
Ljung-Box test
data: Residuals from ARIMA(3,1,2)(1,1,0)[7]
Q* = 14.756, df = 4, p-value = 0.005234
Model df: 6. Total lags used: 10
fc_final_md <- forecast(final_md, h = h2)
plot_forecast(fc_final_md)
The Ljung-Box p-value = 0.005234 < 0.05 shows that the residuals are still autocorrelated, i.e. the model does not fully capture the structure of the series.
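One common follow-up when the Ljung-Box test rejects is to compare nearby candidate orders by AIC. A hedged sketch, assuming cfSales and the seasonal period of 7 (the candidate set here is my own choice, not the notebook's):

```r
# Compare a few (p, d, q) candidates around ARIMA(3,1,2)(1,1,0)[7] by AIC;
# lower AIC suggests a better trade-off between fit and complexity.
candidates <- list(c(3, 1, 2), c(2, 1, 2), c(3, 1, 3), c(2, 1, 3))
for (ord in candidates) {
  md <- arima(cfSales$sales, order = ord,
              seasonal = list(order = c(1, 1, 0), period = 7))
  cat(sprintf("ARIMA(%d,%d,%d)(1,1,0)[7]  AIC = %.2f\n",
              ord[1], ord[2], ord[3], AIC(md)))
}
```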