That depends immensely on the type of effect you're looking for.
Within-subject effects (this happens when one does A, but not when doing B) can be fine with small sample sizes, especially if you can repeat variations on A and B many times. This is pretty common in task-based fMRI. Indeed, I'm not sure why you'd need more than two participants except to show that the principle is reasonably generalizable.
Between-subject comparisons (type A people have this feature, type B people don't) are the problem because people differ in lots of ways and each contributes one measurement, so you have no real way to control for all that extra variation.
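To make that concrete, here's a toy simulation (all numbers invented, nothing from real data) of why a handful of subjects doing hundreds of trials can beat a much larger one-shot between-subject comparison: the per-subject baseline cancels in the paired contrast, but sits in the error term of the group contrast.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up numbers: a small task effect buried under large
# between-person differences in baseline signal.
EFFECT = 0.3       # true A-vs-B difference (arbitrary units)
BETWEEN_SD = 2.0   # spread of per-person baselines
TRIAL_SD = 1.0     # trial-to-trial noise within a person

def within_subject_t(n_subj=5, n_trials=200):
    # Each subject does A and B many times; the person's baseline is
    # added to both conditions, so it cancels in the A-B difference.
    diffs = []
    for _ in range(n_subj):
        baseline = rng.normal(0, BETWEEN_SD)
        a = baseline + EFFECT + rng.normal(0, TRIAL_SD, n_trials)
        b = baseline + rng.normal(0, TRIAL_SD, n_trials)
        diffs.append(a.mean() - b.mean())
    d = np.asarray(diffs)
    return d.mean() / (d.std(ddof=1) / np.sqrt(n_subj))

def between_subject_t(n_per_group=50):
    # One measurement per person: the baselines land in the error term.
    a = (rng.normal(0, BETWEEN_SD, n_per_group) + EFFECT
         + rng.normal(0, TRIAL_SD, n_per_group))
    b = (rng.normal(0, BETWEEN_SD, n_per_group)
         + rng.normal(0, TRIAL_SD, n_per_group))
    se = np.sqrt(a.var(ddof=1) / n_per_group + b.var(ddof=1) / n_per_group)
    return (a.mean() - b.mean()) / se

print("within-subject t, 5 subjects x 200 trials:", within_subject_t())
print("between-subject t, 50 per group:          ", between_subject_t())
```

With these made-up numbers the paired design typically yields a large t from just five subjects, while the one-shot comparison of a hundred people hovers near noise, despite the identical true effect.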
Precisely, and agreed 100%. We need far more within-subject designs.
You would still, in general, need many subjects to show the same basic within-subject pattern if you want to claim it is "generalizable" in the sense of "may generalize to most people". But depending on exactly what you are looking at, and on the strength of the effect, you may not need nearly as many participants as in strictly between-subject designs.
Given the low test-retest reliability of task fMRI in general, even in adults, this also means that strictly one-off within-subject designs are not enough for certain claims. One has to demonstrate that the within-subject effect itself is stable over time. That may or may not be feasible for a given question, but it really needs to be considered more regularly and explicitly.
Between-subject heterogeneity is a major challenge in neuroimaging. As a developmental researcher, I've found that in structural volumetrics, even after controlling for total brain size, individual variance remains so large that age-brain associations are often difficult to detect and frequently differ between moderately sized cohorts (n=150-300). However, with longitudinal data where each subject serves as their own control, the power to detect change increases substantially—all that between-subject variance disappears with random intercept/slope mixed models. It's striking.
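To show the flavor of it, here's a minimal sketch with simulated volumetrics (all numbers invented; statsmodels' MixedLM is just one way to fit a random intercept/slope model):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulated longitudinal volumetrics (invented numbers): big, stable
# differences between kids, small shared age slope.
n_subj, n_waves = 200, 3
rows = []
for s in range(n_subj):
    intercept = rng.normal(100.0, 10.0)  # person-specific baseline volume
    slope = rng.normal(-0.5, 0.2)        # person-specific change per year
    age0 = rng.uniform(8, 12)
    for w in range(n_waves):
        age = age0 + 2 * w               # a scan roughly every 2 years
        vol = intercept + slope * age + rng.normal(0, 1.0)
        rows.append({"subject": s, "age": age, "volume": vol})
df = pd.DataFrame(rows)
df["age_c"] = df["age"] - df["age"].mean()  # center age for stability

# Cross-sectional view: one scan per subject, so the age effect has to
# fight the full between-subject intercept variance.
first = df.groupby("subject").first().reset_index()
cross = smf.ols("volume ~ age_c", data=first).fit()
print("cross-sectional slope:", cross.params["age_c"], "+/-", cross.bse["age_c"])

# Longitudinal mixed model: random intercept + slope per subject soak
# up the between-subject variance, leaving a tight age estimate.
mix = smf.mixedlm("volume ~ age_c", df, groups=df["subject"],
                  re_formula="~age_c").fit()
print("longitudinal slope:   ", mix.params["age_c"], "+/-", mix.bse["age_c"])
```

Both models estimate roughly the same slope, but the standard error shrinks dramatically once each subject serves as their own control.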
Task-based fMRI has similar individual variability, but with an added complication: adaptive cognition. Once you've performed a task, your brain responds differently the second time. This happens when studies reuse test questions—which is why psychological research develops parallel forms. But adaptation occurs even with parallel forms (commonly used in fMRI for counterbalancing and repeated assessment) because people learn the task type itself. Adaptation even happens within a single scanning session, where BOLD signal amplitude for the same condition typically decreases over time.
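One common way to handle within-session habituation, sketched below with made-up per-trial response estimates rather than real BOLD data, is to give the fade its own regressor (a mean-centered trial-index parametric modulator), so it is modeled explicitly instead of leaking into the residuals:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy within-session habituation (invented numbers): the response to
# the same condition shrinks linearly over 40 presentations.
n_trials = 40
trial = np.arange(n_trials)
true_amp = 2.0 - 0.02 * trial                 # starts at 2.0, fades
y = true_amp + rng.normal(0, 0.3, n_trials)   # noisy per-trial estimates

# Naive model: a single mean regressor. The habituation ends up in the
# residuals (and in any session-mean comparison downstream).
X_naive = np.ones((n_trials, 1))
beta_naive, *_ = np.linalg.lstsq(X_naive, y, rcond=None)

# Adaptation-aware model: add a mean-centered trial-index modulator,
# so the linear fade is absorbed by its own regressor.
mod = trial - trial.mean()
X_adapt = np.column_stack([np.ones(n_trials), mod])
beta_adapt, *_ = np.linalg.lstsq(X_adapt, y, rcond=None)

res_naive = y - X_naive @ beta_naive
res_adapt = y - X_adapt @ beta_adapt
print("habituation slope estimate:", beta_adapt[1])  # ~ -0.02
print("residual var, naive vs adapt-aware:",
      res_naive.var(), res_adapt.var())
```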
These adaptation effects contaminate ICC test-retest reliability estimates when applied naively, as if the brain weren't an organ designed to dynamically respond to its environment. Therefore, some of the apparent "unreliability" may not reflect the measurement instrument (fMRI) at all, but rather failures in how we analyze and conceptualize task responses over time.
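A toy example of how this plays out numerically (invented numbers; the ICC formulas are the standard Shrout & Fleiss two-way ones): give every subject a perfectly stable individual response, damp session 2 uniformly to mimic adaptation, and the "absolute agreement" ICC tanks while the "consistency" ICC stays high.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy test-retest data: each subject has a stable "true" response,
# but session 2 is uniformly damped by adaptation.
n = 30
true = rng.normal(1.0, 0.5, n)
s1 = true + rng.normal(0, 0.2, n)
s2 = true - 0.6 + rng.normal(0, 0.2, n)   # everyone adapts by ~0.6
x = np.column_stack([s1, s2])             # subjects x sessions

k = x.shape[1]
grand = x.mean()
row = x.mean(axis=1)   # per-subject means
col = x.mean(axis=0)   # per-session means
msr = k * ((row - grand) ** 2).sum() / (n - 1)
msc = n * ((col - grand) ** 2).sum() / (k - 1)
mse = (((x - row[:, None] - col[None, :] + grand) ** 2).sum()
       / ((n - 1) * (k - 1)))

# Shrout & Fleiss two-way ICCs:
icc_agree = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)  # ICC(2,1)
icc_consist = (msr - mse) / (msr + (k - 1) * mse)                      # ICC(3,1)
print(f"ICC agreement   (penalizes the session shift): {icc_agree:.2f}")
print(f"ICC consistency (ignores the session shift):   {icc_consist:.2f}")
```

The rank ordering of subjects is essentially identical across sessions, so nothing is wrong with the measurement per se; naively reading the agreement ICC as instrument noise would be the mistake.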
Yeah, when you start getting into this stuff and see your first dataset with over a hundred MRIs, and actually start manually inspecting things like skull-stripping, it is shocking how dramatically and obviously different people's brains are from each other. The nice clean little textbook drawings you see in a lot of educational materials really hide just how crazy the variation is.
And yeah, part of why we need more within-subject and longitudinal designs is to get at precisely the things you mention. Right now there is no way to know whether the low ICCs reflect adaptation to the task (or to task generalities), learning that isn't task-relevant adaptation (e.g. the subject is in a different mood at a later test, which leads to a different strategy), a brain that simply changes far more than we might expect, or any number of other possibilities. I suspect that if we ever want fMRI to yield practical, or even just really useful theoretical, insights, we need to suss out within-subject effects with high test-retest reliability, whatever these confounds turn out to be. Finding such effects will likely involve more than changes to analysis; it will also require far more rigorous experimental designs (multi-modal data, tighter protocols, etc.).
FWIW, we've also noticed a lot of magic can happen when you suddenly have proper longitudinal data that lets you control things at the individual level.