It is in fact even difficult to compare the same person on the same fMRI machine (and especially in developmental contexts).
Herting, M. M., Gautam, P., Chen, Z., Mezher, A., & Vetter, N. C. (2018). Test-retest reliability of longitudinal task-based fMRI: Implications for developmental studies. Developmental Cognitive Neuroscience, 33, 17–26. https://doi.org/10.1016/j.dcn.2017.07.001
I read that paper as suggesting that development, behavior, and fMRI are all hard.
It's not at all clear to me that teenagers' brains OR behaviours should be stable across years, especially when it involves decision-making or emotions. Their Figure 3 shows that sensory experiments are a lot more consistent, which seems reasonable.
The technical challenges (registration, motion, etc.) seem like things that will improve, and there are some practical suggestions as well (counterbalancing items, etc.).
While I agree I wouldn't expect too much stability in developing brains, unfortunately there are pretty serious stability issues even in non-developing adult brains (quote below from the paper, for anyone who doesn't want to click through).
I agree it makes a lot of sense, though, that the sensory experiments are more consistent; somatosensory and sensorimotor localization results generally seem to be the most consistent fMRI findings. I am not sure registration or motion correction is really going to help much here. I suspect the reality is just that the BOLD response is a lot less longitudinally stable than we thought (the brain is changing more often and more quickly than we expected).
Or, if we do get better at this, it will be through more sophisticated "correction" methods (e.g. deep learners that can predict typical longitudinal BOLD changes and so allow such changes to be "subtracted out", or something like that). But I am skeptical about progress here, given the amount of data needed to develop any kind of corrective improvement when longitudinal reliabilities are this low.
===
> Using ICCs [intraclass correlation coefficients], recent efforts have examined test-retest reliability of task-based fMRI BOLD signal in adults. Bennett and Miller performed a meta-analysis of 13 fMRI studies between 2001 and 2009 that reported ICCs. ICC values ranged from 0.16 to 0.88, with the average reliability being 0.50 across all studies. Others have also suggested a minimal acceptable threshold of task-based fMRI ICC values of 0.4–0.5 to be considered reliable [...] Moreover, Bennett and Miller, as well as a more recent review, highlight that reliability can change on a study-by-study basis depending on several methodical considerations.
Herting, M. M., Gautam, P., Chen, Z., Mezher, A., & Vetter, N. C. (2018). Test-retest reliability of longitudinal task-based fMRI: Implications for developmental studies. Developmental Cognitive Neuroscience, 33, 17–26. https://doi.org/10.1016/j.dcn.2017.07.001
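For anyone who wants a feel for what those ICC numbers measure, here is a minimal sketch of a single-measure ICC computed from a subjects-by-sessions table. It is only an illustration: the quote does not say which ICC variant each study used, so the two Shrout and Fleiss forms below (ICC(2,1), absolute agreement, and ICC(3,1), consistency) are my assumption, and the function name `icc_single` and the toy numbers are made up, not fMRI data.

```python
def icc_single(scores):
    """Return (ICC(2,1), ICC(3,1)) for a table of n subjects x k sessions.

    scores: list of rows, one row per subject, one value per session.
    Formulas follow the two-way ANOVA decomposition of Shrout & Fleiss.
    Choosing these two variants is an assumption for illustration.
    """
    n = len(scores)       # number of subjects
    k = len(scores[0])    # number of sessions (e.g. test and retest)

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares: total, between subjects, between sessions, residual
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    # ICC(2,1): penalizes systematic shifts between sessions
    agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    # ICC(3,1): ignores systematic shifts between sessions
    consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    return agreement, consistency


# Made-up example: session 2 is session 1 plus a constant offset of 1
toy = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
print(icc_single(toy))
```

In the toy table every subject shifts by the same amount between sessions, so the consistency form comes out at 1 while the absolute-agreement form is lower (2/3 here). That difference is one reason the variant matters when comparing ICCs across studies.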