My Learning Experience with Computational Infrastructure in Biotech
Here we go again: after 5 years working in early-stage companies, I decided to share what I have learned about building an infrastructure that works for the needs of a small company. Note that there won’t be a single solution to this, and the strategy will change depending on the data input throughput. A good introduction to DataOmics and why it is important can be found at Hammerspace, in an interview with Eleanor Howe.
Before I start: many other great experts have already talked about this, and I have learned from them and applied much of what they have said, in particular from Michele Busby and, more recently, from Jack Lindamood.
Now some context, because context matters. I built everything from scratch at NextRNA, a company focused on targeting lncRNA-protein interactions with small molecules to expand the therapy landscape in oncology and neurodegeneration. The data we use is mainly RNAseq, since we focus on detecting lncRNA drivers. We have to internalize raw data from large studies such as TCGA and re-analyze it with our own pipeline and annotation.
AWS cloud services are what works for us, and they help us stay stable:
- Network: we have two VPCs, each with two subnets (private and public). We only use the private subnets, and we connect to them through the AWS Client VPN service and through VPN tunnels to our office. All machines have specific security group configurations that allow only the machine-to-machine communication we know will happen (see the sketch after this list). This can be painful, but it keeps you aware of what is accessible. One VPC is dedicated to shared servers (I will talk about these later), and the other to machines created on demand through AWS Batch; you could call it our HPC space.
- Storage: we mainly use S3, with different buckets depending on data usage policies. We store controlled-access data that has to follow specific data usage agreements with the NIH. Permanent processed data and raw data live in different storage classes to keep costs down (see the lifecycle sketch after this list). We also have an EFS volume mounted on our shared servers, so any temporary data we need while running downstream analyses can be accessed from different applications. Some of this data is not temporary, but it is small and needs fast access from web apps, so we keep it there instead of on S3.
- Shared Servers: we decided to go with what is now called the Posit bundle: Workbench, Connect, and Package Manager. This, together with our shared file system, has made collaboration within my team very efficient. On the software side, we all use the same R version with the same R packages, which ensures reproducibility. As a team, we decide when to update to the next R version and which migration steps are needed. This also lets us publish visualization tools very easily, and it has fostered the relationship with the other groups in the company. Nothing gives me more joy than seeing my team talking every day with colleagues from other groups.
- Pipelines: the biggest contribution to our efficiency has been using the Seqera Platform together with Nextflow pipelines. I will always go with this combination: it makes it so easy to know what is going on, to debug, and to relaunch pipelines (see the launcher sketch after this list). And there are features we are not using yet that will make many other processes easier, such as datasets ready to be analyzed at any point, report visualization, and spinning up on-demand machines.
- GitHub: this is not new, but I want to share how we use it. We decided on naming conventions for repositories, we have template repositories, and we keep documentation inside each repository. We also have an internal company package that collects the common code we use again and again, so repetitive analyses are done the same way and anybody can work on them. On top of that, we have our own figure theme, so we are consistent when we create reports. We use it a lot for creating BD material.
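
To make the security group idea a bit more concrete, here is a minimal sketch using boto3, Python's AWS SDK. The group IDs, port, and region are hypothetical placeholders, not our actual setup; the point is that ingress is granted from one known security group to another rather than from a whole subnet or CIDR range.

```python
import boto3

# Hypothetical security group IDs; replace with your own.
POSIT_SG = "sg-0aaa1111bbbb22223"   # shared-server (Posit) security group
BATCH_SG = "sg-0ccc3333dddd44445"   # Batch compute environment security group

ec2 = boto3.client("ec2", region_name="us-east-1")

# Allow only the shared servers to reach the Batch machines on port 22,
# instead of opening the port to everything in the private subnet.
ec2.authorize_security_group_ingress(
    GroupId=BATCH_SG,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "UserIdGroupPairs": [
                {"GroupId": POSIT_SG, "Description": "SSH from shared servers only"}
            ],
        }
    ],
)
```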
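
For the storage classes, most of the savings come from lifecycle rules that move data to colder tiers once it has been processed. A sketch, again with boto3 and a hypothetical bucket name and prefix:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket; in practice there is one bucket per data usage policy.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-raw-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-fastq",
                "Filter": {"Prefix": "fastq/"},
                "Status": "Enabled",
                # Raw reads are rarely touched again once processed, so push
                # them to cheaper storage classes over time.
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```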
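
And for the pipelines, launches usually happen from the Seqera Platform itself, but the sketch below shows the kind of thin wrapper you could script around a Nextflow run that reports back to the platform. Everything in it is illustrative: nf-core/rnaseq stands in for our internal pipeline, and the profile, parameters, and token handling are assumptions.

```python
import os
import subprocess


def launch_rnaseq(samplesheet: str, outdir: str) -> None:
    """Launch a Nextflow run and report it to Seqera Platform (Tower).

    nf-core/rnaseq is used here only as a stand-in for an internal pipeline.
    """
    env = os.environ.copy()
    # Nextflow picks up the Seqera Platform token from this variable.
    env.setdefault("TOWER_ACCESS_TOKEN", "<your-token>")

    cmd = [
        "nextflow", "run", "nf-core/rnaseq",
        "-profile", "docker",
        "-resume",        # reuse cached tasks when relaunching
        "-with-tower",    # send run monitoring to Seqera Platform
        "--input", samplesheet,
        "--outdir", outdir,
    ]
    subprocess.run(cmd, check=True, env=env)


if __name__ == "__main__":
    launch_rnaseq("samplesheet.csv", "s3://example-results/rnaseq")
```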
Finally, even if it is not part of the computing infrastructure, I would say that managing projects efficiently is important. We tried different things, but what worked best was Trello. We agreed on a protocol for what a card means, how to annotate it, and how to read the team’s workload just by looking at the board… and it worked like magic. We have a weekly meeting to align on short-term and long-term goals and plan accordingly.
There are other solutions that I think would be good but that I could not explore further because of time constraints and budget limitations. For instance, Code Ocean. I also know about Deep Origin; it is more of an advanced software-as-a-service offering, but it could be part of the overall infrastructure to shorten the time between data and decision. Similar companies, like BigOmics, help analyze data quickly without thinking about the backend. BioBox is working on new products as well, this one focused on graph data representation.
More focused on data, I would evaluate TileDB and Databricks when the time is right and the amount of data and its integration become a limitation for your process. If you need more help with this, I recently met Jesse Johnson, an expert in the data space who works with biotechs.
This is a summary, but I hope it is enough to give a broad idea. Happy to talk with anybody who wants to know more. This strategy was highly influenced by Judit Flo.