Data access, quantity, and quality drive innovation in clinical research, both for studies across whole populations and for personalized medicine that focuses on an individual’s disease, diagnosis, and response to treatment. New technologies such as Next Generation Sequencing (NGS) for whole genomes and high-resolution slide imaging are generating terabytes of data of a kind that, just a few years ago, was considered too scarce or too expensive to use in clinical diagnostics and research. To develop new diagnostic techniques and therapeutics, geographically distributed teams in biomedical research, pharmacology, academia, and national laboratories need to exchange and analyze these large datasets quickly and efficiently. Traditionally, such teams have had to resort to workarounds such as shipping physical hard drives to avoid the long wait times and unreliability of transferring large datasets over the internet.
The internet’s underlying data transfer protocol, TCP, becomes highly inefficient when transferring data over long distances or over unpredictable commodity networks, because its achievable throughput falls as round-trip latency and packet loss increase. Large transfers then run at a fraction of the available bandwidth, and long-running transfers can stall or fail partway through. These limitations disproportionately affect fields with large datasets, such as the raw genomics files and diagnostic imagery common in life sciences research and clinical studies. They can be overcome with next-generation file transfer technologies developed to deliver large files and datasets at high speed over commodity networks. With these technologies, many large files can be sent reliably over long distances on existing networks, enabling new analytical and diagnostic workflows.
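The effect of distance on TCP can be illustrated with the widely cited Mathis et al. approximation, which bounds steady-state throughput at roughly (MSS / RTT) × (C / √p), where p is the packet loss rate and C ≈ 1.22. The Python sketch below is a simplified model of classic TCP congestion avoidance, not a measurement of any network described in this article; the segment size, round-trip times, and loss rate are illustrative assumptions, and real TCP stacks may behave differently.

```python
import math


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Approximate ceiling on steady-state TCP throughput, in bits per second.

    Uses the Mathis et al. approximation: (MSS / RTT) * (C / sqrt(p)).
    """
    if mss_bytes <= 0 or rtt_s <= 0 or not 0 < loss_rate < 1:
        raise ValueError("mss_bytes and rtt_s must be positive; loss_rate must be in (0, 1)")
    c = math.sqrt(3.0 / 2.0)  # constant from the model, about 1.22
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    MSS_BYTES = 1460   # assumption: typical payload for a 1500-byte Ethernet MTU
    LOSS_RATE = 0.001  # assumption: 0.1% packet loss, for illustration only
    # Assumed round-trip times for three hypothetical paths.
    paths = [("same city", 0.010), ("cross-country", 0.100), ("intercontinental", 0.200)]
    for label, rtt_s in paths:
        mbps = mathis_throughput_bps(MSS_BYTES, rtt_s, LOSS_RATE) / 1e6
        print(f"{label:>16} (RTT {rtt_s * 1000:.0f} ms): about {mbps:.1f} Mbps")
```

Under this model the ceiling is inversely proportional to round-trip time, so a path with ten times the latency supports about one tenth of the throughput at the same loss rate, regardless of how much bandwidth the link offers.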
Improving access to data with speed and reliability
Today, life sciences organizations use reliable, secure, high-speed transfer technologies to send huge datasets over inherently unreliable long-distance, international links that may span oceans. The Institute for Genome Sciences (IGS) at the University of Maryland School of Medicine generates and curates genomics data for the international scientific community, managing the collection, storage and distribution of terabytes of data globally. With average file sizes increasing from 5GB to 20GB in recent years, IGS implemented next-generation file transfer technology at its Data Analysis and Coordination Center (DACC). Since deploying the new technology to replace 40-50Mbps FTP file transfers, IGS has increased scientists’ access to data with file transfer speeds of 300Mbps while eliminating shipping costs and achieving reliability that they did not have with conventional transfers.
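To put the IGS figures in perspective, the following Python sketch estimates how long a single 20GB file would take at the old and new rates. It assumes decimal units (1 GB = 10^9 bytes), a sustained rate with no protocol overhead, and 45Mbps as a midpoint of the quoted 40-50Mbps FTP range; the function name is hypothetical.

```python
def transfer_time_seconds(size_bytes: float, rate_bps: float) -> float:
    """Time to move size_bytes at a sustained rate of rate_bps (bits per second)."""
    if rate_bps <= 0:
        raise ValueError("rate_bps must be positive")
    return size_bytes * 8 / rate_bps


if __name__ == "__main__":
    FILE_BYTES = 20e9  # one 20GB file, decimal gigabytes (assumption)
    for label, rate_bps in [("FTP at 45Mbps", 45e6), ("high-speed at 300Mbps", 300e6)]:
        minutes = transfer_time_seconds(FILE_BYTES, rate_bps) / 60
        print(f"{label}: about {minutes:.0f} minutes")
```

Under these assumptions the same file drops from roughly an hour to under ten minutes, before accounting for retries after failed transfers.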
Life sciences IT solutions provider ESAC was contracted by the National Cancer Institute to manage the launch of a new Clinical Proteomic Tumor Analysis Consortium (CPTAC) Data Coordinating Center. Mass spectrometry is used to profile the proteome, or the set of proteins, in tumors, which can inform research into tailoring therapeutics to an individual patient’s cancer. Mass spectrometry datasets can become exceedingly large, and the data accumulates rapidly. A typical analysis may produce many gigabytes of output files containing descriptive information for thousands of proteins in a sample. To enable researchers and clinicians to rapidly share large proteomics datasets for analysis and research, ESAC integrated high-speed file transfer software directly into the CPTAC web portal, so that data can be moved to and from the globally distributed data repositories that support the portal.
Speeding the transition to the cloud
As scientists grapple with increasingly large data sets that are essential for researching new diagnostic methods and analyses, the promise of unlimited compute and data storage in the cloud presents an opportunity for enabling discoveries and workflows that were not previously possible. For many academic research institutions, use of the cloud has enabled them to reduce data center costs while benefiting from available cloud compute resources that are often donated by larger IT organizations.
A barrier to cloud adoption for life sciences organizations has been the difficulty of moving terabytes of data in and out of the cloud over the internet. Beyond the limitations of conventional network transfer technologies, few transfer tools are able to directly store data in native cloud data storage architectures. Fortunately, recent advances have solved these issues, resulting in reliable high-speed transport directly into cloud storage on multiple platforms.
This technological breakthrough is powering cloud-based bioinformatics solutions such as BGI’s EasyGenomics service. BGI integrated high-speed transfer technology into the EasyGenomics workflow so that users can rapidly upload sequencing data to the cloud for processing and download completed projects. BGI’s researchers can now transfer genomic data at nearly 10Gbps over a new link connecting US and China research and education networks, greatly reducing the time between sequencing and analysis. During a recent live demo, BGI transferred 24GB of genomic data from Beijing to UC Davis in California in under 30 seconds, while a file of the same size sent over the public internet took more than 26 hours. The demo showed that BGI’s analytical platform is a viable option for researchers who may be sequencing samples continents away.
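The demo figures imply effective data rates that are easy to check. The Python sketch below derives them from the quoted file size and durations, assuming decimal gigabytes (1 GB = 10^9 bytes). Because the article gives the times as "under 30 seconds" and "more than 26 hours", the results are bounds rather than exact rates. The function name is hypothetical.

```python
def effective_rate_bps(size_bytes: float, seconds: float) -> float:
    """Average data rate, in bits per second, implied by moving size_bytes in seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_bytes * 8 / seconds


if __name__ == "__main__":
    FILE_BYTES = 24e9          # 24GB, decimal gigabytes (assumption)
    FAST_SECONDS = 30          # "under 30 seconds", so the true rate is at least this
    SLOW_SECONDS = 26 * 3600   # "more than 26 hours", so the true rate is at most this

    fast_gbps = effective_rate_bps(FILE_BYTES, FAST_SECONDS) / 1e9
    slow_mbps = effective_rate_bps(FILE_BYTES, SLOW_SECONDS) / 1e6
    print(f"High-speed transfer: at least {fast_gbps:.1f} Gbps")
    print(f"Public internet: at most {slow_mbps:.2f} Mbps")
    print(f"Speedup: more than {SLOW_SECONDS / FAST_SECONDS:.0f}x")
```

The high-speed figure works out to at least 6.4 Gbps, which is consistent with the nearly 10Gbps link, while the public-internet transfer averaged only about 2 Mbps.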
GenoSpace, LLC recently launched a cloud-based system for connecting individual patients’ genomic analyses with clinical laboratory results and case history to identify patterns and relationships between genetic signatures and therapeutic options. GenoSpace has integrated high-speed cloud transfer technology to move genomics data to the GenoSpace system for analysis and reporting, and to disseminate information back to researchers, physicians, clinical labs and patients while maintaining patient privacy and speeding potential access to individualized treatment.
Advancements in life sciences data collection are opening up opportunities for research into disease genetics, personalized medicine, and remote diagnosis that were not possible in clinical diagnostics a few years ago. Next-generation high-speed file transfer technologies are enabling the secure, reliable transfer of massive life sciences datasets within globally distributed networks of researchers, clinicians, and diagnosticians, fueling a new age of data-driven innovation in life sciences.