Other Utilities
Generating the Full Span of Multiple Time Series
[1]:
import pandas as pd
import numpy as np
from orbit.utils.general import expand_grid, regenerate_base_df
Define the series keys and datetime array.
[2]:
dt = pd.date_range('2020-01-31', '2022-12-31', freq='M')
keys = ['x' + str(x) for x in range(10)]
print(keys)
print(dt)
['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9']
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
'2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
'2020-09-30', '2020-10-31', '2020-11-30', '2020-12-31',
'2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30',
'2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31',
'2021-09-30', '2021-10-31', '2021-11-30', '2021-12-31',
'2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30',
'2022-05-31', '2022-06-30', '2022-07-31', '2022-08-31',
'2022-09-30', '2022-10-31', '2022-11-30', '2022-12-31'],
dtype='datetime64[ns]', freq='M')
Users can use expand_grid to generate a dataframe with one row for every combination of the key and dt levels.
[3]:
df_base = expand_grid({
'key': keys,
'dt': dt,
})
x = np.random.normal(0, 1, 10 * 36)
df_base['x'] = x
print(df_base.shape)
df_base.head(5)
(360, 3)
[3]:
| | key | dt | x |
|---|---|---|---|
| 0 | x0 | 2020-01-31 | -0.588512 |
| 1 | x0 | 2020-02-29 | 1.547821 |
| 2 | x0 | 2020-03-31 | 1.114709 |
| 3 | x0 | 2020-04-30 | 1.410516 |
| 4 | x0 | 2020-05-31 | -0.699229 |
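To make the shape of this result concrete, the same kind of key-by-date grid can be sketched in plain pandas. This is only an illustration of the idea, not orbit's implementation of expand_grid; the variable names and the shortened `keys` and `dt` below are stand-ins chosen for the example.

```python
import pandas as pd

# Small stand-ins for the notebook's `keys` and `dt` (assumed values).
keys = ['x0', 'x1']
dt = pd.to_datetime(['2020-01-31', '2020-02-29', '2020-03-31'])

# Cartesian product of the two levels. The first level varies slowest,
# so rows are grouped by key and ordered by date within each key.
grid = (
    pd.MultiIndex.from_product([keys, dt], names=['key', 'dt'])
    .to_frame(index=False)
)
print(grid.shape)
```

With 2 keys and 3 dates the grid has 6 rows, just as 10 keys and 36 month-ends give the 360 rows shown above.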
Regenerating Multiple Time Series with Missing Rows
Drop five rows at random to simulate missing observations.
[4]:
np.random.seed(2022)
drop_idx = np.random.choice(df_base.index, 5, replace=False)
df_missing = df_base.drop(drop_idx).reset_index(drop=True)
print(df_missing.shape)
df_missing.head(5)
(355, 3)
[4]:
| | key | dt | x |
|---|---|---|---|
| 0 | x0 | 2020-01-31 | -0.588512 |
| 1 | x0 | 2020-02-29 | 1.547821 |
| 2 | x0 | 2020-03-31 | 1.114709 |
| 3 | x0 | 2020-04-30 | 1.410516 |
| 4 | x0 | 2020-05-31 | -0.699229 |
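The head of the frame looks unchanged, so it can help to list which (key, dt) pairs are actually absent. A minimal sketch in plain pandas, using a toy frame rather than the notebook's data; `df_toy`, `df_toy_missing`, and `missing_pairs` are names made up for this example.

```python
import pandas as pd

keys = ['x0', 'x1']
dt = pd.to_datetime(['2020-01-31', '2020-02-29', '2020-03-31'])
full = pd.MultiIndex.from_product([keys, dt], names=['key', 'dt'])

# Toy frame with one row removed (key x1 on 2020-02-29).
df_toy = full.to_frame(index=False)
df_toy['x'] = range(len(df_toy))
df_toy_missing = df_toy.drop(index=4).reset_index(drop=True)

# Pairs present in the full grid but absent from the observed frame.
observed = pd.MultiIndex.from_frame(df_toy_missing[['key', 'dt']])
missing_pairs = full.difference(observed)
print(missing_pairs.to_list())
```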
Use regenerate_base_df to regenerate the base dataframe, restoring the rows that were dropped.
[5]:
time_col = "dt"
key_col = "key"
new_df_base = regenerate_base_df(df_missing, time_col, key_col, val_cols=['x'])
By default, the regenerated rows have a null value (NaN) in the value columns.
[6]:
new_df_base.iloc[drop_idx]
[6]:
| | dt | key | x |
|---|---|---|---|
| 286 | 2022-11-30 | x7 | NaN |
| 274 | 2021-11-30 | x7 | NaN |
| 75 | 2020-04-30 | x2 | NaN |
| 135 | 2022-04-30 | x3 | NaN |
| 43 | 2020-08-31 | x1 | NaN |
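For intuition, a similar result can be obtained in plain pandas by reindexing against the full key-by-date grid. This is a sketch under the assumption that every key should cover every observed date; it is not orbit's implementation of regenerate_base_df, and it does not reproduce that function's column order.

```python
import numpy as np
import pandas as pd

keys = ['x0', 'x1']
dt = pd.to_datetime(['2020-01-31', '2020-02-29', '2020-03-31'])
full = pd.MultiIndex.from_product([keys, dt], names=['key', 'dt'])

# Toy frame with one row removed (key x1 on 2020-02-29).
df_toy = full.to_frame(index=False)
df_toy['x'] = np.arange(len(df_toy), dtype=float)
df_toy_missing = df_toy.drop(index=4).reset_index(drop=True)

# Reindex against the full grid: absent (key, dt) pairs come back as NaN.
rebuilt = (
    df_toy_missing.set_index(['key', 'dt'])
    .reindex(full)
    .reset_index()
)
print(rebuilt.iloc[[4]])
```

Because the rebuilt frame is ordered by key and then by date, the dropped row reappears at its original position, which is why positional lookup with iloc works in the cell above.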
Users can also pass the fill_na option to fill the regenerated values, for example with 0.
[7]:
new_df_base = regenerate_base_df(df_missing, time_col, key_col, val_cols=['x'], fill_na=0)
[8]:
new_df_base.iloc[drop_idx]
[8]:
| | dt | key | x |
|---|---|---|---|
| 286 | 2022-11-30 | x7 | 0.0 |
| 274 | 2021-11-30 | x7 | 0.0 |
| 75 | 2020-04-30 | x2 | 0.0 |
| 135 | 2022-04-30 | x3 | 0.0 |
| 43 | 2020-08-31 | x1 | 0.0 |
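After regenerating, it may be worth confirming that the frame really covers every key and timestamp combination. The helper below is hypothetical (it is not part of orbit) and is only a rough check written for this example.

```python
import pandas as pd

def is_full_span(df, time_col, key_col):
    """Hypothetical helper: True if every key has one row per timestamp.

    Note: a timestamp missing from *all* keys cannot be detected this way,
    because the expected grid is built from the values present in `df`.
    """
    n_keys = df[key_col].nunique()
    n_times = df[time_col].nunique()
    no_dupes = not df.duplicated([key_col, time_col]).any()
    return no_dupes and len(df) == n_keys * n_times

# Toy data: a complete 2 x 3 grid, and the same grid with one row removed.
toy_full = pd.MultiIndex.from_product(
    [['x0', 'x1'], pd.to_datetime(['2020-01-31', '2020-02-29', '2020-03-31'])],
    names=['key', 'dt'],
).to_frame(index=False)
toy_missing = toy_full.drop(index=4).reset_index(drop=True)

print(is_full_span(toy_full, 'dt', 'key'))
print(is_full_span(toy_missing, 'dt', 'key'))
```

In the notebook, the same check could be applied as is_full_span(new_df_base, time_col, key_col).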