Any use of total_days, or of a simple transformation like its logarithm, makes a particular assumption about the form of the association of disease_severity with total_days. That's going to be tricky in your situation, as many of your disease_severity values are already close to the maximum value of 1. Simple assumptions are unlikely to work.
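To make the "particular assumption" concrete, here is a minimal sketch in Python. It assumes a logit link for a severity bounded by 1, and the intercepts and slopes are made-up numbers chosen only for illustration; none of this comes from your data or model.

```python
import math

def inv_logit(z):
    """Map a value on the logit scale back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients, for illustration only.
def severity_linear(days, intercept=-2.0, slope=0.05):
    # Assumption: each extra day multiplies the odds by exp(slope).
    return inv_logit(intercept + slope * days)

def severity_log(days, intercept=-6.0, slope=2.0):
    # Assumption: doubling the days multiplies the odds by 2 ** slope.
    return inv_logit(intercept + slope * math.log(days))

for days in (10, 30, 60, 120):
    print(days, round(severity_linear(days), 3), round(severity_log(days), 3))

# The shape is fixed by the choice of transformation, not by the data:
print(odds(severity_log(40)) / odds(severity_log(20)))        # 2 ** slope
print(odds(severity_linear(21)) / odds(severity_linear(20)))  # exp(slope)
```

Either choice forces one specific curve shape onto the data, and both curves flatten as they approach 1, which is where many of your values already sit.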
The approach most likely to account for differences in the number of total_days is to model disease_severity as a flexible function of total_days, for example with regression splines, so that the data can tell you the form of the association. The difficulty is that you have a very restricted data set of 37 data points, while you are already trying to fit 4 fixed-effect predictors and a random effect for year. Thus you are in danger of overfitting your data.
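As a rough illustration of what a regression spline costs in coefficients, the sketch below builds an unscaled restricted (natural) cubic spline basis using the truncated power form. The knot locations are hypothetical, and in practice you would let a modeling package build the basis rather than coding it by hand.

```python
def pos_cube(u):
    """Truncated cubic: u**3 when u is positive, otherwise 0."""
    return u ** 3 if u > 0 else 0.0

def rcs_basis(x, knots):
    """Restricted cubic spline basis at x: one linear column plus
    len(knots) - 2 nonlinear columns, linear beyond the outer knots."""
    if len(knots) < 3:
        raise ValueError("need at least 3 knots")
    t_last, t_prev = knots[-1], knots[-2]
    span = t_last - t_prev
    cols = [float(x)]
    for t_j in knots[:-2]:
        cols.append(
            pos_cube(x - t_j)
            - pos_cube(x - t_prev) * (t_last - t_j) / span
            + pos_cube(x - t_last) * (t_prev - t_j) / span
        )
    return cols

# Hypothetical knot locations for total_days, for illustration only.
knots = [20.0, 45.0, 70.0, 110.0]

# 4 knots give 3 columns, so total_days would need 3 coefficients
# instead of 1, all estimated from the same 37 observations.
print(len(rcs_basis(50.0, knots)))
```

Each added knot is one more coefficient to estimate, which is why the flexibility has to be rationed carefully with so few data points.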
Frank Harrell's online notes and book provide guidance on how to use your knowledge of the subject matter to match your model to the available data. You might also consider a generalized additive model. See, for example, the R mgcv package. That can allow flexible modeling of all of your continuous predictors while penalizing coefficients to minimize overfitting.