Navigating Black Boxes in STEM
The demands of coursework, research, and my personal projects have finally caught up with me. In all honesty, I don't really have the time to sit down and write a blog post, but I wanted to write about unknowns and knowledge gaps in science, and in learning more generally.
I was recently thinking about my path through higher education, and how there has always been some topic where I know what something does but not how it works: essentially a black box. When I started out as an undergrad planning to do an MD/PhD, I focused almost entirely on the biology. I understood why we were doing RNA-seq in my lab, but I didn't really understand how the information went from my pipette tip to figures on my laptop. So I picked up a minor in computer science, which eventually led me into a master's in bioinformatics. By the end of my first semester I felt more than comfortable processing DNA or RNA reads, but because my ML background was limited, I started struggling to interpret the data at a higher level. So this winter I made it a point to spend a lot of time learning different ML techniques and how to actually use them in my scripts.
I still wouldn’t call myself proficient in ML, but I’m starting to notice a new black box in my understanding, and this time it’s math, specifically linear algebra. And when I think about it, black boxes show up no matter what specialty you go into, whether that’s differential-equation PK/PD models for biochemists or hierarchical Bayesian models for business statisticians. But how do you actually approach these concepts? I was clustering single-cell data months before I knew how an eigenvector is calculated. So how do you decide what you should understand intuitively versus what you just learn to use well enough to get by?
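To make the eigenvector example concrete, here is a minimal sketch of the linear algebra hiding inside that clustering workflow: PCA, the usual first step before clustering single-cell data, is just an eigendecomposition of the gene-gene covariance matrix. The toy "expression matrix" below is invented for illustration; only NumPy is assumed.

```python
import numpy as np

# Hypothetical toy expression matrix: 6 cells x 3 genes,
# with two obvious groups of cells baked in.
X = np.array([
    [2.0, 1.9, 0.1],
    [2.1, 2.0, 0.2],
    [1.9, 2.1, 0.0],
    [0.2, 0.1, 2.0],
    [0.1, 0.3, 2.1],
    [0.0, 0.2, 1.9],
])

# Center each gene (column) so covariance reflects variation, not mean expression.
Xc = X - X.mean(axis=0)

# Covariance matrix across genes, then its eigendecomposition.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric input, eigenvalues ascending

# The eigenvector with the largest eigenvalue is the first principal component.
pc1 = eigvecs[:, -1]

# Projecting the centered cells onto PC1 already separates the two groups,
# which is what the clustering step downstream actually operates on.
scores = Xc @ pc1
print(scores)
```

The sign of an eigenvector is arbitrary, so the two groups may come out positive/negative or the reverse; either way, the separation along PC1 is what a clustering algorithm later picks up.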
Currently, with the power of LLMs, it's completely possible to build a complex, functioning machine learning model without really knowing what's happening under the hood. On one hand, that's exciting: it lowers the barrier to experimentation and lets you move faster than you could a few years ago. But it also introduces a new kind of risk. It becomes easy to mistake a pipeline that runs for a result you can trust, and to slip into confirmation bias, where you see what you want to see and don't question it as hard as you should. If you don't understand the assumptions you're inheriting, you can end up overconfident in outputs that are statistically insignificant, biased, or completely wrong. In a field like bioinformatics, where the pipeline is already a stack of tools and assumptions, that's the part that worries me. The model might run, the plots might look clean, and you can still be telling yourself a story the data never actually supported.
Going back to the learning dilemma: despite what some academics would have you believe, it's basically impossible to know everything related to your field, especially in one as transdisciplinary as bioinformatics. At some point you're always standing on abstractions you didn't build yourself, and you're choosing where to go deep versus where to move forward. As of now, I don't exactly know where I stand on this paradox, but I'm starting to think you can't fully eliminate black boxes; instead, it's more important to be honest about which ones you're relying on and to keep shrinking the ones that matter most.