A variety of initiatives aim to map the universe, revealing its nature, scale, immensity, and majesty. Leaders of one such initiative decided that big data machine learning was a critical component of its ongoing project to map the Virgo Cluster, a collection of up to 2,000 galaxies.
The nearest large cluster to our own Milky Way galaxy, the Virgo Cluster is of special interest to astronomers as a sort of laboratory for the study of galaxy formation and evolution. As part of this effort, the Next Generation Virgo Cluster Survey (NGVS) was formed, composed of more than 40 scientists at 23 institutions in Canada, France, the United States, the United Kingdom, Italy, China, and Chile.
Primary data collection is accomplished via the Canada-France-Hawaii Telescope (CFHT), an optical/infrared telescope atop the 13,700-foot summit of Mauna Kea, an inactive volcano on the island of Hawaii. Time with the telescope is a precious resource; NGVS was allocated about 140 nights from 2009 through 2012.
To the human eye, the survey area is equivalent to about 200 by 200 Earth moons in the night sky. Each raw image is 1.6GB, and the data collection process adds terabytes per week, yielding hundreds of terabytes to analyze.
To make the most of the NGVS data, the project leaders employed the Canadian Advanced Network for Astronomical Research (CANFAR), the first dedicated cloud computing platform for astronomy, used to store, share, and analyze the data for astronomers worldwide. A primary objective was to use the platform to positively identify which celestial objects in the images, particularly dimmer ones, were actually part of the Virgo Cluster.
Researchers realized this task would be challenging. They determined that machine learning, a type of advanced analytics with origins in artificial intelligence, would provide the most productive approach for accurately identifying galaxies and generating the full Virgo Cluster map.
But machine learning for big data presents its own difficulties. Many essential machine learning algorithms involve computations that vastly expand the amount of data during processing, in this case to an impractical degree. The Hadoop machine learning project Mahout might have been a candidate for the job, but it did not yet have the algorithms required for the research.
Instead, Skytree was selected as the machine learning engine due to its state-of-the-art algorithms and massive scalability. Skytree can be deployed across a cluster either on premises or in the cloud. Input can come from nearly any structured or unstructured data source: relational databases, Hadoop's file system, flat files, and so on. Output can be massaged and visualized in a variety of ways, such as using the R language and environment (intended for statistical computing) or the open source Weka machine learning software, which includes tools for visualization and statistical modeling.
For NGVS, Skytree uses a machine learning algorithm to ingest reference data sets with known characteristics from more than 20 million galaxies. It then evaluates the measurements of the galaxy in question to see whether it falls within the distance range associated with the Virgo Cluster.
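The general idea can be illustrated with a small sketch: compare a candidate object's measured features against a reference catalog of galaxies with known distances, estimate its distance from its nearest neighbors in feature space, and test whether that estimate falls in the cluster's range. Note this is a simplified illustration, not NGVS's actual pipeline -- the feature set, the k-nearest-neighbors approach, the synthetic catalog, and the membership distance window are all assumptions made for the example.

```python
import math
import random

# Assumed distance window (in megaparsecs) for Virgo Cluster membership;
# this range is an illustrative guess, not an NGVS parameter.
VIRGO_RANGE_MPC = (15.0, 22.0)

random.seed(42)

# Synthetic reference catalog: ((magnitude, color), known distance in Mpc).
# Magnitude grows with distance (fainter when farther), plus noise.
reference = []
for _ in range(1000):
    dist = random.uniform(5.0, 40.0)
    mag = 10.0 + 5.0 * math.log10(dist) + random.gauss(0, 0.3)
    color = random.gauss(0.9, 0.1)
    reference.append(((mag, color), dist))

def estimate_distance(features, catalog, k=5):
    """Average the known distances of the k nearest reference objects
    in (magnitude, color) feature space."""
    def feature_gap(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    nearest = sorted(catalog, key=lambda rec: feature_gap(rec[0], features))[:k]
    return sum(d for _, d in nearest) / k

def is_virgo_member(features, catalog):
    """True if the estimated distance lies in the assumed cluster range."""
    lo, hi = VIRGO_RANGE_MPC
    return lo <= estimate_distance(features, catalog) <= hi

# A bright foreground object (~6 Mpc) and a faint background one (~35 Mpc)
# should both be rejected as cluster members.
print(is_virgo_member((13.9, 0.9), reference))
print(is_virgo_member((17.7, 0.9), reference))
```

The naive nearest-neighbor search above is O(N) per query; at NGVS scale that brute-force comparison is exactly what becomes impractical, which is why production systems rely on tree-based indexing to prune the search.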
With this powerful automation, astronomers can focus on mapping and studying the Virgo Cluster, rather than wasting thousands of hours manually sorting members of the cluster from other celestial objects. It's an important advance in astronomy that provides a taste of how big data will accelerate our exploration of the universe.
This article, "Astronomers crunch big data to map the galaxies," was originally published at InfoWorld.com.