Hi,
My goal with this project is to unify the treatment of BLS spectra, so that in the mid-to-long term we can address fundamental biophysical questions with BLS as a community. My principal concern for this format is therefore simplicity: measures are datasets called “Raw_data”; if one or more abscissas are needed to interpret them, these are datasets called “Abscissa_i”; and everything is placed in groups called “Data_i”, whose attributes store the associated metadata. From there, I can nest groups to store an experiment with different dimensions (e.g.: samples, time points, positions, techniques, wavelengths, patients, …). This structure is trivial but lacks usability on its own, so I added an attribute called “Name” to every group, which the user can set freely without impacting the structure.
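To make this concrete, here is a minimal h5py sketch of the layout described above; everything beyond the “Data_i”/“Raw_data”/“Abscissa_i”/“Name” convention itself (file name, array sizes, label text) is made up for illustration:

```python
# Minimal sketch of the proposed layout, written with h5py.
# File name, sizes and values are illustrative, not part of the convention.
import numpy as np
import h5py

with h5py.File("example_bls.h5", "w") as f:
    # One "Data_i" group per measure; groups can be nested freely
    # (samples, time points, positions, ...).
    grp = f.create_group("Data_0")
    grp.attrs["Name"] = "Sample A, position 1"  # free-form label, no structural role

    # The measured spectrum itself.
    grp.create_dataset("Raw_data", data=np.random.rand(512))

    # Optional abscissa needed to interpret Raw_data (here a frequency axis in GHz).
    grp.create_dataset("Abscissa_0", data=np.linspace(-15.0, 15.0, 512))
```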
Now here are a few criticisms of Carlo’s approach. I didn’t want to share them because I hate criticizing anyone’s work, and I do think that what Carlo presented is great. I don’t want you to think that I simply trashed what he did; on the contrary, it is because I see limitations in his approach that I tried to develop another, simpler one. So here are the points I have problems with in Carlo’s approach:
- The preferred type for the information on an experiment should be text, as we’ll likely see new devices (like your super nice FT approach) appear in the next months/years that will require new parameters to describe them. I think we should really try not to use anything other than strings in the attributes.
- The experiment information should not be defined by the HDF5 file structure but rather by a spreadsheet because everyone knows how to use Excel, and it’s way easier to edit an Excel file than an HDF5 file.
- Experiment information should apply to measures rather than to files, because it might vary from experiment to experiment; its natural place is therefore in the attributes of the groups storing individual measures (in my approach).
- The characterization of the instrument might also be experiment-dependent (if for some reason you change the tilt of a VIPA during the experiment, for example), so having a dedicated group for it might not work for some experiments.
- The groups are named “tn” and the structure does not specify the nomenclature of sub-groups. This is a big problem if we want to store, in the same format, say, different samples measured at different times. (The logical patch would be to have sub-groups follow the same structure, so tn/tm/tl/…, but this should be stated in the definition of the structure.)
- The dataset “index” in Analyzed_data is difficult to understand: what is it used for? I don’t think it’s useful; I would delete it.
- The "spatial position" in Analyzed_data supposes that we are doing mappings. This is too restrictive for no reason, it’s better to generalize the abscissa and allow the user to specify a non-limiting parameter (the “Name” attribute for example) with whatever value he wants (position, temperature, concentration of whatever, …). A concrete example of a limit here: angle measurements, I want my data to be dependent on the angle, not the position.
- Having different datasets in the same “Analyzed_data” group, corresponding to the results of different treatments, raises the question of where the process used to treat the data is stored. A better approach would be to create n groups “Analyzed_data_n”, each with a single dataset of treated values, so that the process can be stored in the attributes of the group Analyzed_data_n.
- I don’t understand why we store an amplitude array in “Analyzed_data” — is it for the SNR? If so, maybe we could name this array “SNR”?
- The array “Fit_error_n” is super important but ill-defined. I’d rather choose a statistical quantity like the variance, the standard deviation (which I think is best), or the least-squares error, and have it apply to both the Shift and Linewidth arrays, as in “Shift_std” and “Linewidth_std”.
- I don’t understand “Calibration_index”: where are the calibration curves? Are they in “Experiment_info”? If so, do we expect people to process their calibration curves into an array before adding them to the HDF5 file? I’m not very familiar with all devices, so I might be missing a limitation here, but do we ever have more than one calibration curve per measure? Could we not add it to the group as a “Calibration” dataset, or just have one group with the calibration curve?
- Timestamp is typically an attribute; it shouldn’t be present inside the group as a dataset.
- In tn/Spectra_n, “Amplitude” is the PSD, so I would call it “PSD”: there are other “Amplitude” datasets, so the current name is confusing. If it’s not the PSD, I would call it “Raw_data”.
- I don’t understand what “Parameters” is meant to do in tn/Spectra_n; besides, it’s not a dataset, so I would put it in the attributes — or, most likely, not use it at all, since I don’t understand what it does.
- Frequency is a dataset, not a float (I think). We also need a place to store the process used to obtain it; in VIPA spectrometers, for instance, this is likely a big (even the main, I think) source of error. I would put this process in the attributes, as text.
- I don’t think “Unit” is useful: if we have a frequency axis, we should define it directly in GHz; and if it’s a pixel axis, then for one thing we should not call it “Frequency”, and it’s better not to store anything, since by default (in the absence of “Frequency”) we will consider the abscissa to be a simple range the size of the “Amplitude” dataset.
- /tn/Images: it’s a good idea, but I believe it is redundant with “Analyzed_data” (?). If it’s not, then I don’t understand how the datasets inside it are obtained.
- “Calibration_spectra” is a separate group; wouldn’t it be better to have it in “Experiment_info” in the presented structure? It might also scare off people who don’t want to store a calibration file every time they create an HDF5 file (I might or might not be one of them).
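To make the “Analyzed_data_n” suggestion above concrete, here is a hypothetical h5py sketch: one group per treatment, the processing chain stored as a text attribute, and “Shift_std” / “Linewidth_std” alongside the fitted values. All names beyond those discussed above, and all values, are made up:

```python
# Hypothetical sketch of the "Analyzed_data_n" proposal: one group per
# treatment, a single set of treated values per group, and the processing
# chain stored as plain text in the group's attributes. Values are made up.
import numpy as np
import h5py

with h5py.File("example_analysis.h5", "w") as f:
    ana = f.create_group("Analyzed_data_0")
    # The full treatment, described as free text (the "Process" attribute
    # name is an assumption, not part of any agreed convention).
    ana.attrs["Process"] = "Lorentzian fit of each spectrum; no background removal"

    ana.create_dataset("Shift", data=np.array([5.10, 5.20, 5.15]))          # GHz
    ana.create_dataset("Shift_std", data=np.array([0.02, 0.03, 0.02]))      # GHz
    ana.create_dataset("Linewidth", data=np.array([0.80, 0.82, 0.79]))      # GHz
    ana.create_dataset("Linewidth_std", data=np.array([0.01, 0.01, 0.02]))  # GHz
```

A second treatment of the same measure would simply become “Analyzed_data_1” with its own “Process” attribute, so the two results never share a group.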
Like I said before, I don’t want this to be taken as more than what led me to defining a new approach to the format. I want to state once again that Carlo’s structure is complete; it is just tailored to specific instruments and applications, and difficult to apply to other techniques and scenarios (and to lazier people like myself).
Following Carlo’s mail, I’ve also completed the PDF describing the structure I use in the HDF5_BLS library. I’m attaching it to this email and pushing its code to
GitHub; feel free to edit it and criticize it at length (I feel I deserve it after this email, which I really tried not to write).
Best,
Pierre
Pierre Bouvet, PhD
Post-doctoral Fellow
Medical University Vienna
Department of Anatomy and Cell Biology
Währinger Straße 13, 1090 Wien, Austria