Using the VitalsDataFrame Class in Jupyter
The VitalsDataFrame class provides a direct interface between
the SQLite files generated by the gfdlvitals package and Python.
The class is an extension of a Pandas DataFrame with additonal methods
for smoothing and detrending.
The following imports are needed to start:
In [1]: import gfdlvitals
In [2]: import matplotlib.pyplot as plt
In [3]: from matplotlib.figure import figaspect
Loading SQLite datasets
The package ships with two demonstration datasets of 2 meter near surface air temperature from a preindustrial control simulation and a historical similation. Any path-like string pointing to a SQLite file can be passed to open_db:
# Load demonstration datasets
In [4]: df_ctrl = gfdlvitals.open_db(gfdlvitals.sample.picontrol)
In [5]: df_hist = gfdlvitals.open_db(gfdlvitals.sample.historical)
The details of a dataset are display by calling the object directly:
In [6]: df_hist
Out[6]:
area t_ref
1850-07-01 12:00:00 5.100640e+14 286.584833
1851-07-01 12:00:00 5.100640e+14 286.550367
1852-07-01 12:00:00 5.100640e+14 286.545789
1853-07-01 12:00:00 5.100640e+14 286.591339
1854-07-01 12:00:00 5.100640e+14 286.287207
... ... ...
2010-07-01 12:00:00 5.100640e+14 287.799508
2011-07-01 12:00:00 5.100640e+14 287.801954
2012-07-01 12:00:00 5.100640e+14 287.859331
2013-07-01 12:00:00 5.100640e+14 287.992293
2014-07-01 12:00:00 5.100640e+14 287.970241
[165 rows x 2 columns]
In this DataFrame we see that the index is the time coordinate using
the cf-time package. The global area and t_ref fields are also shown.
Plotting a field
Like a Pandas object, the .plot() is available for plotting a
variable directly:
In [7]: plotargs = {"color":"gray",
...: "linewidth":"0.8",
...: "ylabel":"DegK"};
...:
In [8]: df_ctrl.t_ref.plot(title="Control",**plotargs);
In [9]: df_hist.t_ref.plot(title="Historical",**plotargs);
Smoothing the timeseries
In [10]: df_hist.t_ref.plot(title="Historical",**plotargs);
In [11]: df_hist.smooth(20).t_ref.plot(color="red");
Identifying trends
In [12]: plotargs = {"color":"gray",
....: "linewidth":"0.8",
....: "ylabel":"DegK"};
....:
In [13]: df_ctrl.t_ref.plot(title="Control",**plotargs);
In [14]: df_ctrl.trend(order=1).t_ref.plot(color="purple");
In [15]: df_hist.t_ref.plot(title="Historical",**plotargs);
In [16]: df_hist.trend(order=3).t_ref.plot(color="purple");
Detrending
In [17]: plotargs = {"color":"gray",
....: "linewidth":"0.8",
....: "ylabel":"DegK"};
....:
In [18]: df_ctrl.detrend(order=1).t_ref.plot(title="Control",**plotargs);
In [19]: df_hist.detrend(order=3).t_ref.plot(title="Historical",**plotargs);
Removing drift
In [20]: df_hist_detrended = df_hist.detrend(order=1,
....: reference=df_ctrl,
....: anomaly=False);
....:
In [21]: df_hist.t_ref.plot(title="Historical",**plotargs);
In [22]: df_hist_detrended.t_ref.plot(color="green",linewidth=0.5);
Statistical comparison of two datasets
An instance of the VitalsDataFrame can be passed to a second instance and a t-test can be performed to identify differences between variables that are common to the two instances. The t-test adjusts the degrees of freedom based on the autocorrelation of the timeseries providing a more stringent threshold for assessing differences. For more details and examples of this method, see Santer et al. 2000 and Krasting et al. 2013.
In the example below, the test historical dataset is artifically split into two 20-year epochs for comparison.
In [23]: df_hist_t0 = df_hist[-40:-20]
In [24]: df_hist_t1 = df_hist[-20::]
In [25]: pvals = df_hist_t0.ttest(df_hist_t1)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[25], line 1
----> 1 pvals = df_hist_t0.ttest(df_hist_t1)
File ~/checkouts/readthedocs.org/user_builds/gfdlvitals/envs/latest/lib/python3.11/site-packages/gfdlvitals/extensions.py:326, in VitalsDataFrame.ttest(self, df2)
323 varlist = [x for x in varlist if x not in ignore_list]
325 # perform t-test
--> 326 result = {k: ttest_ind_auto(self[k], df2[k])[1] for k in varlist}
328 return pd.DataFrame(result, index=["pval"]).transpose()
File ~/checkouts/readthedocs.org/user_builds/gfdlvitals/envs/latest/lib/python3.11/site-packages/gfdlvitals/extensions.py:326, in <dictcomp>(.0)
323 varlist = [x for x in varlist if x not in ignore_list]
325 # perform t-test
--> 326 result = {k: ttest_ind_auto(self[k], df2[k])[1] for k in varlist}
328 return pd.DataFrame(result, index=["pval"]).transpose()
File ~/checkouts/readthedocs.org/user_builds/gfdlvitals/envs/latest/lib/python3.11/site-packages/gfdlvitals/extensions.py:78, in ttest_ind_auto(arr1, arr2, axis)
76 arrlen1 = arr1.shape[axis]
77 arrlen2 = arr2.shape[axis]
---> 78 lag1r1 = _autocorr(np.split(arr1, arr1.shape[axis], axis))
79 lag1r2 = _autocorr(np.split(arr2, arr2.shape[axis], axis))
80 n1eff = arrlen1 * ((1 - lag1r1) / (1 + lag1r1))
File ~/checkouts/readthedocs.org/user_builds/gfdlvitals/envs/latest/lib/python3.11/site-packages/gfdlvitals/extensions.py:44, in _autocorr(arr, lag)
27 """Computes the sample autocorrelation function coeffficient (rho)
28 for given lag
29
(...) 40 lagged auto-correlation
41 """
43 flen = float(len(arr))
---> 44 ybar = float(sum(arr)) / flen
45 denom = sum([(y - ybar) ** 2 for y in arr])
46 numer = sum(
47 [(y - ybar) * (ytpk - ybar) for (y, ytpk) in zip(arr[:-lag], arr[lag:])]
48 )
TypeError: only 0-dimensional arrays can be converted to Python scalars
In [26]: pvals
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[26], line 1
----> 1 pvals
NameError: name 'pvals' is not defined