Linear regression - response variable as percent improvement or m/s?
I'm trying to analyse a data set that contains 8 different run distances, each with a finish time before and after a training regime, and to run a linear regression of improvement on distance (all finish times decreased). I'm unsure whether to convert to percent improvement or to a variable such as m/s, taking the difference between run 1 and run 2, so that I can compare across distance groups. Obviously the absolute time difference is not great, as longer distances naturally show larger improvements. But I've read that using percent improvement as the response variable in linear regression is not advised. How could I proceed?
Frank Harrell presents some arguments against using percents and percent differences in this post. For example, the inherent lack of symmetry:
When a quantity doubles, it gets back to its original value by halving. When it increases by 100% it gets back to its original value by decreasing 50%... an increase of 33.33% is balanced by a decrease of 25%, an increase by a factor of 4/3 is balanced by a decrease to a factor of 3/4.
Or what a "percent change" really means: percent change from some baseline, or a difference in percentage points:
Percent change has even more problems than percent. I have often witnessed confusion from statements such as "the chance of stroke increased by 50%". If the base stroke probability was 0.02 does the speaker mean that it is now 0.52? Not very likely, but you can't be sure.
It would be less ambiguous to evaluate ratios of post- to pre-training times, if all that you have is those two time points. With the small relative differences in times involved, I suspect that you will still have close enough to normally distributed errors around your model predictions. Working in a logarithmic scale of times is a related choice (the log of a ratio is the difference between the individual logarithms), which might be useful if training-related differences are larger or models are more complex.
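As a small illustration of the asymmetry quoted above and of why the log scale avoids it, here is a minimal sketch. The finish times and the helper names `percent_change` and `log_ratio` are made up for the example; they are not from the question.

```python
import math

def percent_change(before, after):
    """Percent change from `before` to `after`, relative to `before`."""
    return 100.0 * (after - before) / before

def log_ratio(before, after):
    """Natural log of the after/before ratio; equals log(after) - log(before)."""
    return math.log(after / before)

# Hypothetical finish times in seconds (illustrative only).
t_pre, t_post = 240.0, 180.0

# Percent changes are asymmetric: the drop and the return trip differ in size.
down = percent_change(t_pre, t_post)
up = percent_change(t_post, t_pre)

# Log ratios are symmetric: same magnitude, opposite sign.
lr_down = log_ratio(t_pre, t_post)
lr_up = log_ratio(t_post, t_pre)

print(f"percent change: {down:.2f} then {up:.2f}")
print(f"log ratio:      {lr_down:.4f} then {lr_up:.4f}")
```

Going from 240 s to 180 s is a ratio of 3/4, and going back is a ratio of 4/3, which matches the pair of factors in the quote.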
One warning: if the same individuals were evaluated on multiple run distances, then you need to take those intra-individual correlations into account. Robust standard errors, generalized least squares, or mixed models are possibilities.
The fundamental outcome variable you measured is time ($t$), so the first thought would be to use $t$ itself, or rather $\Delta t$ (the time difference between before and after "treatment").
You also mention converting this time to a speed (m/s). However, for a given distance, this is just the inverse of the time ($\frac 1 t$), up to a scaling constant (the distance is the same for all subjects running that distance).
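To spell this out, write $d$ for the distance, $t_1$ and $t_2$ for the times before and after training, and $s_1$, $s_2$ for the corresponding speeds (the symbol $d$ is my notation, not the question's). Then $s = d/t$, and the speed difference is

$$s_2 - s_1 = \frac{d}{t_2} - \frac{d}{t_1} = d\,\frac{t_1 - t_2}{t_1 t_2}$$

So the speed difference is not simply a rescaled time difference: it is the time difference divided by the product $t_1 t_2$, which varies between subjects even at a fixed distance.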
You also mention looking at % improvement. You could do this for time $(\frac {t_1-t_2} {t_1})$, or you could do this for speed $(\frac {s_2-s_1} {s_1})$. It is not clear from your question which of these you are considering.
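To make the candidate measures concrete, here is a minimal sketch that computes the time difference, the speed difference, and the two percent improvements for a single runner. The function name `improvement_measures`, the dictionary keys, and the numbers are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def improvement_measures(distance_m, t1_s, t2_s):
    """Return four improvement measures for one runner at one distance.

    distance_m: race distance in metres
    t1_s, t2_s: finish times in seconds before and after training
    """
    s1 = distance_m / t1_s  # speed before, m/s
    s2 = distance_m / t2_s  # speed after, m/s
    return {
        "time_diff_s": t1_s - t2_s,                # seconds saved
        "speed_diff_ms": s2 - s1,                  # m/s gained
        "pct_time": 100.0 * (t1_s - t2_s) / t1_s,  # % reduction in time
        "pct_speed": 100.0 * (s2 - s1) / s1,       # % increase in speed
    }

# Hypothetical runner: 1000 m in 250 s before training, 200 s after.
m = improvement_measures(1000.0, 250.0, 200.0)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

For this runner the percent reduction in time and the percent increase in speed are different numbers, so "percent improvement" needs to say which one is meant.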
A previous answer expressed reservations about using percent changes, and I share them. I would add one more reason: a percent change is relative to each subject's initial time. That time differs between subjects, so a single percentage point means something different for each subject (or, said differently, a 1 s change is a different percentage for each subject). You no longer have a constant scale for all your measurements.
Having said this, you say you want to perform a regression "on improvements based on distance". I assume you want to do a linear regression of the improvement (expressed in one of the 4 possible ways outlined above) as a function of distance?
In this case, the proper choice of variable may depend on which one gives you the "more linear" relationship. And in this respect, the 4 possible choices will not behave identically.
If your various distances cover a large range (e.g. from 100 m to 10 km), it should be clear that the absolute time improvements will get larger with distance, while the absolute speed improvements will get smaller. Subjects may be able to save only 1 s on a 100 m dash, but should be able to save 1 min on a long distance; similarly, subjects may go 1 m/s faster in a dash, but cannot keep this up over a long distance. It may also be that subjects who were slow initially (maybe because they were out of shape?) gained more from the treatment than subjects who were already fast (in shape?). In this case, percent improvement may be a better measure. But then again, all your subjects may already be fit athletes?
What would I do in your shoes? As always, the first rule of statistics is to plot your data (the so-called IOTT, or inter-ocular trauma test: plot it, and see if the answer hits you between the eyes). Plot the 4 possible measures of improvement as a function of distance, and see which measure appears to give you "the best" fit. That is the one I would use for the regression, although the final choice would depend on the specifics of the experiment, which were not shared.
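As a rough numerical companion to those plots (not a replacement for them), the sketch below fits a straight line of each measure against distance and reports $R^2$. Using $R^2$ as the yardstick for "more linear" is my assumption, and every number is invented for illustration; none of it comes from the question.

```python
def r_squared(x, y):
    """R^2 of an ordinary least-squares straight-line fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

# Hypothetical mean finish times (seconds) at 8 distances (metres).
distances = [100, 200, 400, 800, 1500, 3000, 5000, 10000]
t_before = [14.0, 30.0, 68.0, 150.0, 310.0, 680.0, 1200.0, 2600.0]
t_after = [13.6, 29.0, 65.0, 143.0, 295.0, 645.0, 1135.0, 2450.0]

# Build the four candidate response variables.
measures = {"time_diff": [], "speed_diff": [], "pct_time": [], "pct_speed": []}
for d, t1, t2 in zip(distances, t_before, t_after):
    s1, s2 = d / t1, d / t2
    measures["time_diff"].append(t1 - t2)
    measures["speed_diff"].append(s2 - s1)
    measures["pct_time"].append(100.0 * (t1 - t2) / t1)
    measures["pct_speed"].append(100.0 * (s2 - s1) / s1)

# Compare how well a straight line in distance describes each measure.
fits = {name: r_squared(distances, vals) for name, vals in measures.items()}
for name, r2 in fits.items():
    print(f"{name}: R^2 = {r2:.3f}")
```

A high $R^2$ can still hide curvature or a single influential distance, so the plots remain the primary check.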