Unraveling the mystery of multiple sclerosis with big data

Confronted by a varied, 250TB data set, researchers opted for a huge hardware upgrade and analytic techniques based on the R statistical language

The State University of New York (SUNY) at Buffalo is home to one of the leading multiple sclerosis (MS) research centers in the world. It's one spot where big-data-powered analysis is helping researchers understand potential causes and treatments of the disease, accelerating the race to a cure.

What causes MS is not precisely known. Currently it is believed to originate from some complex combination of a virus and gene defect(s), perhaps in association with such environmental factors as sunlight and cigarette smoke.

Dr. Murali Ramanathan is co-director of the Data Intensive Discovery Initiative at the SUNY research center. A technique developed there, called AMBIENCE, enables researchers to efficiently search for interactions among multiple genetic variations -- called single nucleotide polymorphisms (SNPs, pronounced "snips") -- and environmental factors that raise the risk of developing multiple sclerosis.

The data sets used in this multivariable research total more than 250TB -- and the analysis is very demanding computationally because the researchers are looking for significant interactions between thousands of genetic and environmental factors.
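To give a rough sense of why this kind of search is so demanding, the short Python sketch below counts how many two-way and three-way combinations exist for a given number of variables. The variable counts are hypothetical (the researchers are quoted only as working with "thousands" of factors), and the code illustrates the arithmetic rather than anything the SUNY team actually ran.

```python
from math import comb


def count_combinations(n_variables, order):
    """Number of distinct `order`-way combinations among n_variables."""
    return comb(n_variables, order)


# Hypothetical variable counts; the article says only "thousands".
for n_variables in (1_000, 5_000, 10_000):
    pairs = count_combinations(n_variables, 2)
    triples = count_combinations(n_variables, 3)
    print(f"{n_variables:>6} variables: {pairs:,} pairs, {triples:,} triples")
```

Even at 1,000 variables there are 499,500 pairs and more than 166 million triples to evaluate, and the count grows steeply as variables are added or higher-order interactions are considered.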

In this research, there were two main issues to overcome: crunching through the immense data set and achieving sophisticated and easily customizable analytic models across a wide range of data sets. The researchers wanted to see not only which individual variables were significant, but also which combinations of variables stood out.
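The difference between a variable that matters on its own and a combination that matters only jointly can be shown with a small Python sketch. It scores variables, singly and in pairs, by their mutual information with an outcome. This is a generic information-theoretic measure chosen for illustration, not the AMBIENCE algorithm itself, and the variable names and toy data are made up.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired observations."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def score_combinations(data, outcome, order):
    """Score every `order`-way combination of variables against the outcome.

    data: dict mapping variable name -> list of discrete values
          (a hypothetical layout chosen for this sketch)
    outcome: list of discrete outcome labels, one per observation
    Returns (names, score) pairs sorted from highest to lowest score.
    """
    scores = {}
    for names in combinations(sorted(data), order):
        # Treat the combination as one joint variable
        joint = list(zip(*(data[name] for name in names)))
        scores[names] = mutual_information(joint, outcome)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up toy data: the outcome is the XOR of snp_a and smoker
toy = {
    "snp_a":  [0, 0, 1, 1, 0, 0, 1, 1],
    "smoker": [0, 1, 0, 1, 0, 1, 0, 1],
    "snp_b":  [0, 0, 0, 0, 1, 1, 1, 1],
}
disease = [0, 1, 1, 0, 0, 1, 1, 0]

singles = score_combinations(toy, disease, 1)
pairs = score_combinations(toy, disease, 2)
print(singles)
print(pairs)
```

In this toy data no single variable carries any information about the outcome, yet the pair of smoker and snp_a predicts it completely. That is the kind of effect a one-variable-at-a-time analysis would miss.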

Running the required algorithms on sample data with commodity hardware took almost a week. The researchers quickly realized that running them on all the data would take many weeks -- and that the results would lead to additional questions, algorithm adjustments, data changes, and so on.

To meet these challenges, the researchers settled on creating an analytic framework that combined the IBM Netezza analytic database appliance with Revolution Analytics' R Enterprise.

Netezza sped up processing by more than 100 times, cutting the time required to conduct an analysis from 27.2 hours to 11.7 minutes. Parallel processing was one key, but it was just the start. In addition, some analysis is performed as the data is moving off the disks, rather than leaving all processing to the main processors. As a result, the work can be done faster and more efficiently.
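For readers who want to check the arithmetic, the two timings quoted above imply a speedup somewhat higher than the round figure of 100. A minimal Python calculation, using only the numbers in the article:

```python
# Timings as quoted in the article
before_minutes = 27.2 * 60  # 27.2 hours expressed in minutes
after_minutes = 11.7

speedup = before_minutes / after_minutes
print(f"Before: {before_minutes:.0f} minutes, after: {after_minutes} minutes")
print(f"Speedup: about {speedup:.0f}x")
```

The ratio works out to roughly 139 to 1, which is consistent with the "100 times" claim read as a conservative round number.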

Revolution Analytics' software, built on the R statistical language, allowed researchers to add and remove variables from the model quickly and easily, without having to write hundreds of lines of code. Together, the two products enabled the team to use a variety of data sets -- medical records, lab results, MRI scans, and patient surveys -- and to include a wide range of dependent variables so that interactions among the variables could be studied.

In the past, the SUNY team would have had to rewrite the entire algorithm. Now, thanks to the new system, the scientists can simply change the algorithm without assistance. The SUNY researchers can also use new algorithms and bring additional variables and data sets into their analyses, which was impractical before.

Thanks to these advancements, the researchers are now moving on to more complex research -- and inching ever closer to decoding the mysterious mechanisms behind multiple sclerosis.

This article, "Unraveling the mystery of multiple sclerosis with big data," was originally published at InfoWorld.com. Read more of Andrew Lampitt's Think Big Data blog, and keep up on the latest developments in big data at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.

Copyright © 2012 IDG Communications, Inc.