PROVIDENCE, R.I. [Brown University] — Computer scientists from Brown University have been awarded $1.5 million to develop new computer algorithms and statistical methods to analyze large, complex datasets. Funding for the project comes from a joint initiative of the National Science Foundation and the National Institutes of Health aimed at supporting fundamental research on Big Data.
Eli Upfal, professor of computer science, will lead the research with fellow computer science professors Ben Raphael and Fabio Vandin. Brown’s funding allotment is the second largest of the eight grants awarded under the program this year, according to the official NSF/NIH announcement.
Upfal and his colleagues will test their new methods on genomics data. Nowhere are the challenges of Big Data more evident than in genomics. As techniques for sequencing genes have become faster and cheaper, researchers have compiled mountains of new data. The trick now is making sense of it all: picking out significant trends and ignoring the unimportant “noise” that inevitably accumulates in large datasets.
“These datasets have all the good and bad properties of Big Data,” Upfal said. “They’re big, noisy, and require very complicated statistical analysis to obtain useful information.”
One of the aims of this project is to develop better computational tools to isolate the genetic mutations that drive cancer by comparing gene sequences of healthy tissue to those of cancerous tissue. The problem is that not every mutation found in cancerous cells is important. There could be thousands of mutations in each cell that don't actually contribute to cancer growth; they're simply insignificant, random mutations. An effective computer algorithm will be able to identify, with statistical confidence, the mutations that actually matter, keeping doctors from chasing millions of red herrings.
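To make that idea concrete, the sketch below shows one standard kind of statistical test for separating meaningful mutations from random ones: asking whether a gene is mutated in more tumor samples than a background mutation rate could plausibly explain. The gene names, counts, and background rate are invented for illustration, and this is not the team's actual method.

```python
# Minimal sketch: flag genes whose mutation counts exceed what a background
# (random) mutation rate would explain. Gene names, counts, and the
# background rate below are hypothetical, for illustration only.
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing k or more
    mutations in n sequenced samples purely by background noise."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def flag_candidate_drivers(mutation_counts, n_samples, background_rate, alpha=0.01):
    """Return genes mutated significantly more often than background.
    A Bonferroni correction accounts for testing many genes at once."""
    threshold = alpha / len(mutation_counts)
    hits = []
    for gene, k in mutation_counts.items():
        p_value = binomial_tail(k, n_samples, background_rate)
        if p_value < threshold:
            hits.append((gene, k, p_value))
    return sorted(hits, key=lambda hit: hit[2])

# Hypothetical data: mutation counts per gene across 500 tumor samples,
# with an assumed 1 percent chance that any given gene is mutated by chance.
counts = {"GENE_A": 42, "GENE_B": 7, "GENE_C": 9, "GENE_D": 3}
for gene, k, p in flag_candidate_drivers(counts, n_samples=500, background_rate=0.01):
    print(f"{gene}: mutated in {k}/500 samples, p = {p:.2e}")
```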
But that’s not the only problem Upfal and his team will try to address. There’s also the fact that the lab tools used to sequence genes sometimes record information inaccurately. The error rate varies from one sequencing technique to another, but it is significant, and analytical tools need to account for it as well.
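Some rough arithmetic, using illustrative rather than measured numbers, shows why even a small per-base error rate matters at genome scale:

```python
# Back-of-the-envelope sketch (illustrative numbers, not measured error rates):
# even a small per-base error rate produces an enormous number of raw
# miscalled bases across a whole genome.
genome_size = 3_000_000_000        # roughly 3 billion bases in a human genome
coverage = 30                      # assume each base is read about 30 times

for error_rate in (0.001, 0.005, 0.01):   # 0.1% to 1% per-base error, hypothetical
    erroneous_base_calls = genome_size * coverage * error_rate
    print(f"per-base error rate {error_rate:.1%}: "
          f"~{erroneous_base_calls:,.0f} miscalled bases per sequenced genome")
```

Filtering that many raw miscalls without also discarding real mutations is part of what makes the statistical analysis so complicated.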
One of the thrusts of the Brown project is finding algorithms that address these problems in a way that can be verified statistically. The output of traditional machine learning algorithms, Upfal said, is generally not confirmed in an objective way. Take search engines as an example. If the search algorithm consistently returns the kinds of results users are looking for, they’ll keep using it and the algorithm will be deemed successful. But that evaluation is subjective and largely unquantifiable.
“In scientific applications, you need something that can be analyzed rigorously,” Upfal said. “We need to know the confidence level of the outcome.” So a key aspect of this project will be combining traditional machine learning algorithms with the most rigorous of statistical methods.
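A permutation test is one textbook way to attach such a confidence level to a pattern-finding procedure: rerun the analysis on data whose labels have been shuffled and see how often chance alone produces a result as strong as the real one. The sketch below illustrates the idea on toy data; it is not necessarily the approach the Brown team will take.

```python
# Sketch of a permutation test: how often does shuffled (label-randomized)
# data score as well as the real data? All data below are toy values.
import random

def permutation_p_value(score_fn, samples, labels, n_permutations=1000, seed=0):
    """Fraction of label-shuffled datasets scoring at least as high as the
    real data; a small value means the result is hard to get by chance."""
    rng = random.Random(seed)
    observed = score_fn(samples, labels)
    hits = 0
    for _ in range(n_permutations):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if score_fn(samples, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)   # add-one keeps the p-value above zero

# Toy example: are measurements in the "case" group (1) systematically
# higher than in the "control" group (0)?
samples = [1.2, 0.8, 1.5, -0.3, -1.1, 0.2, -0.9, -0.4]
labels  = [1,   1,   1,    0,    0,   1,    0,    0]

def separation_score(xs, ys):
    # difference between the mean value of cases and of controls
    cases = [x for x, y in zip(xs, ys) if y == 1]
    controls = [x for x, y in zip(xs, ys) if y == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)

print("permutation p-value:", permutation_p_value(separation_score, samples, labels))
```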
Daunting as the obstacles may be, Upfal and his colleagues have already had success in addressing them. Last year they developed an algorithm called HotNet that helps to isolate clusters of mutated genes that can cause cancer. They’re hoping to build on that success with this new grant.
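HotNet's central idea can be pictured as heat diffusion: each gene receives an amount of "heat" reflecting how often it is mutated, the heat spreads across a network of interacting genes, and connected groups of genes that stay hot become candidate cancer pathways. The sketch below is a loose illustration of that idea with an invented network and invented scores; the published algorithm is considerably more involved, using an influence measure derived from the diffusion process and a statistical test on the subnetworks it finds.

```python
# Loose sketch of the heat-diffusion idea behind HotNet. The network, scores,
# and parameters are made up for illustration; this is not the published algorithm.
from collections import defaultdict

def diffuse(graph, heat, retain=0.6, steps=25):
    """Repeatedly let each gene keep part of its 'heat' (mutation score) and
    spread the rest evenly to its neighbors in the interaction network."""
    for _ in range(steps):
        nxt = defaultdict(float)
        for gene, h in heat.items():
            neighbors = graph.get(gene, [])
            if not neighbors:              # an isolated gene keeps all its heat
                nxt[gene] += h
                continue
            nxt[gene] += retain * h
            for nb in neighbors:
                nxt[nb] += (1 - retain) * h / len(neighbors)
        heat = dict(nxt)
    return heat

def hot_subnetworks(graph, heat, threshold):
    """Connected components of the subgraph induced by genes whose diffused
    heat exceeds the threshold: candidate mutated pathways."""
    hot = {g for g, h in heat.items() if h >= threshold}
    seen, components = set(), []
    for gene in hot:
        if gene in seen:
            continue
        stack, component = [gene], set()
        while stack:
            g = stack.pop()
            if g in seen:
                continue
            seen.add(g)
            component.add(g)
            stack.extend(nb for nb in graph.get(g, []) if nb in hot)
        components.append(component)
    return components

# Invented interaction network and per-gene mutation scores.
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"],
         "D": ["C"], "E": ["F"], "F": ["E"]}
heat = {"A": 1.0, "B": 0.2, "C": 0.8, "D": 0.0, "E": 0.1, "F": 0.0}
print(hot_subnetworks(graph, diffuse(graph, heat), threshold=0.15))
```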
Ultimately, Upfal said, the team hopes to develop new tools that can be broadly applied not only to genomics data but also to other Big Data problems like the analysis of large-scale social networks.