5. Reproducibly saving data
Once you’ve gone through preprocessing and manually editing your data, you’ll
likely want to save your work. peakdet provides two ways to save your
outputs, depending on your data storage needs.
5.1. Duplicating data
If you don’t mind storing multiple copies of your data, you can simply save the
Physio object directly using peakdet.save_physio():
>>> from peakdet import save_physio
>>> path = save_physio('out.phys', data)
If later on you want to reload the processed data you can do so:
>>> from peakdet import load_physio
>>> data = load_physio('out.phys', allow_pickle=True)
Warning
peakdet.save_physio() uses numpy.savez() to save the data
objects, meaning the generated files can actually be quite a bit larger
than the original data! Moreover, you NEED to pass allow_pickle=True
when reloading the file with peakdet.load_physio(); this is turned off
by default for safety reasons, as you should never load pickle files that
were not generated by a trusted source.
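The pickling caveat comes from NumPy itself and can be reproduced without peakdet. The sketch below is only an illustration: it assumes a saved file that holds an object array next to a numeric signal, and the array contents and the file name example.npz are made up for the example rather than taken from peakdet's actual on-disk layout.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins: a numeric signal plus an object array of metadata
signal = np.arange(10, dtype=float)
history = np.array([("load_physio", {"fs": 250.0})], dtype=object)

with tempfile.TemporaryDirectory() as tmpdir:
    fname = os.path.join(tmpdir, "example.npz")
    np.savez(fname, data=signal, history=history)

    # Default np.load: numeric arrays are readable, object arrays are refused
    with np.load(fname) as npz:
        loaded_signal = npz["data"]
        try:
            npz["history"]
            needs_pickle = False
        except ValueError:
            needs_pickle = True

    # Opting in to pickle makes the object array readable again
    with np.load(fname, allow_pickle=True) as npz:
        loaded_history = npz["history"]

print(needs_pickle)
print(loaded_history[0, 0])
```

Only opt in to allow_pickle=True for files you created yourself or received from someone you trust.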
5.2. Saving history
If you loaded all your data using the IO functions contained in
peakdet then your Physio objects should have a
complete history. If that’s the case, you can avoid saving
a duplicate copy of your entire data structure and just save the history! To do
this we can use peakdet.save_history():
>>> from peakdet import save_history
>>> print(data)
Physio(size=240000, fs=250.0)
>>> path = save_history('out.json', data)
The history is saved as a JSON file. If you’re unfamiliar with the format, JSON files are plain text files that can store lists and dictionaries, which is exactly what the history is!
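To make the "lists and dictionaries" point concrete, here is a small sketch using Python's built-in json module. The history entries are hypothetical and only illustrate the general idea of a processing step recorded as a function name plus its arguments; the exact layout that peakdet writes may differ.

```python
import json

# Hypothetical history: each step is a function name plus keyword arguments
history = [
    ["load_physio", {"data": "data/sub-001/PPG.csv", "fs": 250.0}],
    ["filter_physio", {"cutoffs": 1.0, "method": "lowpass"}],
]

# Serialise to plain text, then parse it back into lists and dictionaries
text = json.dumps(history, indent=2)
restored = json.loads(text)

print(text)
print(restored == history)
```

Because the result is plain text, a history file can be opened in any editor and tracked in version control alongside your code.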
We can then load in the history (and recreate the Physio object
it describes) with peakdet.load_history():
>>> from peakdet import load_history
>>> reloaded_data = load_history('out.json')
>>> print(reloaded_data)
Physio(size=240000, fs=250.0)
The reloaded_data object reflects all the processing steps (including manual
edits!) that were performed on the original physiological data.
5.2.1. Relative paths in history
While the saved history file (in the above example, out.json) can be stored
anywhere (next to the raw data file typically makes sense!), extra care must be
taken when loading it back in. Because the history file contains a path to the
raw data file, you must call load_history() from the same working directory
from which the raw data were originally loaded; otherwise a relative path
recorded in the history will not resolve to the raw data file.
Let’s say that we have a directory tree that looks like the following:
./experiment
├── code/
│   └── preprocess.py
└── data/
    └── sub-001/
        └── PPG.csv
We navigate to this directory (cd experiment) and run python
code/preprocess.py, which generates a history file:
./experiment
├── code/
│   └── preprocess.py
└── data/
    └── sub-001/
        ├── PPG.csv
        └── PPG_history.json
Now, say we zip the entire experiment directory to send to a collaborator
who wants to run some analyses on our processed data. If they want to
regenerate the Physio objects we created from the saved history
files, they must call load_history() with the experiment directory as
their working directory. Calling it from anywhere else in the directory tree
will result in a FileNotFoundError.
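The underlying behaviour is ordinary relative-path resolution, which can be demonstrated with the standard library alone. The sketch below recreates the example layout in a temporary directory (the file contents are placeholders) and shows that the same relative path is found from experiment but not from experiment/code. It does not call peakdet; it only mimics the lookup of a recorded relative path.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Recreate the example layout with a placeholder raw data file
    experiment = os.path.join(root, "experiment")
    os.makedirs(os.path.join(experiment, "code"))
    os.makedirs(os.path.join(experiment, "data", "sub-001"))
    with open(os.path.join(experiment, "data", "sub-001", "PPG.csv"), "w") as f:
        f.write("0.0\n")

    # The kind of relative path a history file might record
    raw_path = os.path.join("data", "sub-001", "PPG.csv")

    original_cwd = os.getcwd()
    try:
        # From the experiment directory the relative path resolves
        os.chdir(experiment)
        found_from_experiment = os.path.exists(raw_path)

        # From a subdirectory the same relative path no longer resolves
        os.chdir("code")
        try:
            open(raw_path).close()
            error_from_code = None
        except FileNotFoundError as err:
            error_from_code = type(err).__name__
    finally:
        # Restore the working directory before the temporary tree is removed
        os.chdir(original_cwd)

print(found_from_experiment)
print(error_from_code)
```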
Note
In order to be able to reproducibly regenerate data using history files, you need to ensure that you load your data using relative paths from the get-go!
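One way to follow this advice, sketched here with pathlib and a hypothetical project location (/home/user/experiment), is to convert an absolute path into one relative to the project root before handing it to a loading function. This is only an illustration of the idea, not part of peakdet.

```python
from pathlib import PurePosixPath

# Hypothetical absolute locations on the original author's machine
project_root = PurePosixPath("/home/user/experiment")
absolute = project_root / "data" / "sub-001" / "PPG.csv"

# Express the raw data file relative to the project root
relative = absolute.relative_to(project_root)

print(relative)
print(relative.is_absolute())
```

A path in this form stays valid when the whole experiment directory is moved or shared, as long as loading happens from the project root.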