How to put the R programming language to work

We're at the beginning of a new wave of numeric data -- and the R language provides a powerful array of tools to make sense of it all

We tend to think of programming languages as general purpose, able to deliver any kind of application given enough time and enough code. But sometimes you want a language focused on solving one class of problem as efficiently as possible -- think SQL for database programming.

Numeric analysis is ripe for domain-specific languages, mainly because we’re about to encounter a major disconnect in the way we get and use data.

Soon, more data will be generated by machines than by people, as the Internet of things begins filling our databases with an ocean of data. That ocean is going to need trawlers if we’re going to fish anything useful from it -- tools that can work with large amounts of data quickly and deliver the analysis we need to make sense of it all.

Power tools for numeric data

That’s where R comes in. More than a language, it’s a complete numeric analysis toolkit, with tools for working with matrices and for analyzing and displaying data.

Based on the Bell Labs S language, R does a lot more than handle statistics. While you’re likely to build libraries of R functions, you’re not likely to keep much code around. R programs tend to focus on the current piece of analysis, itself often part of an interactive exploration of a data set.

There’s also the option of embedding R in other code where someone has written a bridge -- for example, using R in conjunction with the F# functional programming language. Blending functional programming and R lets you take advantage of both languages’ key features, and by embedding R functions in your code you can quickly visualize or extract meaning from information. You can treat R as a library of functions, much like working with a numeric analysis library in Fortran.

At the heart of R is the concept of the data structure, usually a vector or a matrix. You can assign data to a structure, then use R commands to work with the resulting structure -- to quickly extract basic statistical information, for example, or sort data. The tools in R then enable you to solve equations quickly and efficiently, and if there’s a lot of parallel computation involved, you can take advantage of GPGPU programming to speed things up using technologies like Nvidia’s CUDA.
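As a minimal sketch of that workflow in base R, the snippet below assigns data to a vector, pulls out some basic statistics, sorts the values, and solves a small system of linear equations held in a matrix. The temperature readings and the two equations are made-up values for illustration only.

```r
# Hypothetical sensor readings stored in a numeric vector
temps <- c(612, 598, 640, 605, 631)

# Basic statistical information in a few calls
mean_temp <- mean(temps)   # arithmetic mean
sd_temp   <- sd(temps)     # sample standard deviation
sorted    <- sort(temps)   # ascending order

# A 2x2 matrix of coefficients and a right-hand side vector for:
#   2x + 1y = 5
#   1x + 3y = 10
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
b <- c(5, 10)

# solve() returns the vector x that satisfies A %*% x = b
x <- solve(A, b)

print(summary(temps))
print(x)
```

Typed line by line at the R prompt, each step shows its result immediately, which suits the interactive style of exploration described above.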

Gearing up for R analysis

Large amounts of data can be grouped in lists or in data frames, letting you bring all the data associated with a problem into one place, reading it in from external sources. You’ll need to format the data appropriately, adding names for variables and labels for rows, though this can usually be handled by an external program. By defining variables and labels at this stage, it’s easier to write code using R’s interactive tools to test out ideas and explore large amounts of data.
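Here is a rough sketch of what that preparation can look like, assuming the external data arrives as CSV with a header row. The column names (sensor, position_cm, temp_c) and the values are hypothetical, and the inline text stands in for a real file.

```r
# Inline CSV text stands in for an external file; with a real file you would
# pass its path to read.csv() instead of using the text argument.
csv_text <- "sensor,position_cm,temp_c
s1,10,612
s2,20,598
s3,30,640"

# The header row supplies the variable names; row.names turns the
# sensor column into row labels
readings <- read.csv(text = csv_text, row.names = "sensor")

# Columns are addressed by name, rows by label
hot <- readings["s3", "temp_c"]
str(readings)

# A list can group the data frame with related values of other types
engine <- list(id = "engine-A", readings = readings)
```

With names and labels in place, later code can refer to readings["s3", "temp_c"] rather than to row and column numbers, which is easier to follow during interactive work.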

Once you have your data in a format R can use, you’re able to start using its statistical tools to identify information that’s going to be interesting -- for example, looking for outliers in a field of sensor information. If you’re logging temperature along a jet engine turbine, where is it getting hottest? If there is a hot spot, is it hot enough to cause fatigue? R also lets you compare the results from different sample sets. With our jet engine data we can start to ask questions -- say, what’s different today from yesterday?
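The jet engine questions can be sketched in a few lines of base R. Everything here is illustrative: the readings are invented, the position labels are placeholders, and the 680-degree limit is an assumed figure, not an engineering threshold.

```r
# Hypothetical temperatures (degrees C) at five positions along the turbine
positions <- c("p1", "p2", "p3", "p4", "p5")
yesterday <- c(601, 615, 640, 622, 608)
today     <- c(603, 618, 702, 625, 610)
names(yesterday) <- positions
names(today)     <- positions

# Where is it getting hottest?
hottest <- names(today)[which.max(today)]

# Is it hot enough to matter? The limit is an assumption for this sketch
limit <- 680
over_limit <- names(today)[today > limit]

# boxplot.stats() flags points lying more than 1.5 times the
# interquartile spread beyond the hinges
outliers <- boxplot.stats(today)$out

# What's different today from yesterday? Per-position change, largest first
change <- sort(today - yesterday, decreasing = TRUE)
print(change)
```

In practice you would decide the limit and the outlier rule from knowledge of the engine, which is where the statistical grounding comes in.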

While R includes many key analytical functions, the language can be extended with your own functions. Once you’ve defined a function, it can be called by any other function -- and saved for future use. It’s an almost Forth-like approach, as many of the default R functions are themselves written in R.
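A small sketch of that build-on-your-own-functions style follows. The function names z_score and flag_unusual are invented for this example, the cutoff of 2 is an assumed default, and dump() with source() is only one of several ways to save definitions for later.

```r
# A small helper: how far each value sits from the mean,
# measured in standard deviations
z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

# A second function that calls the first
flag_unusual <- function(x, cutoff = 2) {
  x[abs(z_score(x)) > cutoff]
}

# Save both definitions as R source, then load them into a fresh
# environment, much as a later session might
path <- tempfile(fileext = ".R")
dump(c("z_score", "flag_unusual"), file = path)

restored <- new.env()
source(path, local = restored)

flagged <- restored$flag_unusual(c(10, 11, 9, 10, 12, 10, 11, 9, 10, 50))
print(flagged)
```

Typing a function’s name without parentheses at the R prompt prints its definition, which is a quick way to see how a function you are building on is put together.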

While you don’t need to be a data scientist to use R, some statistical and analytical knowledge is needed to get the most from it, if only to understand what a significant result looks like. Statistical and numeric analyses are extremely powerful tools and should be treated as such. While R may make it easier to explore data, you do need to know what you’re doing before you make working with R a key part of your day-to-day work.

Best uses for R

R is becoming increasingly important to machine learning systems. You can embed R code in Microsoft’s Azure ML, extending its models and adding analytic steps to a machine learning process. As our applications and services start drowning under a flood of data, languages and tools like R help us find the significant results hidden in that flood, results that can mean the difference between failure and success.

There’s a lot to be said for learning R if you’re developing a high-volume sensor network. R code can help identify trends or spot outliers. That’s vitally important if you’re looking for indicators of possible problems on an oil pipeline, or trying to rapidly explore a data dump from a jet engine after a transatlantic flight. In both cases there’s a need for rapid response before a problem actually occurs, with a significant financial penalty if you get things wrong.
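One simple way to do both at once, sketched here under stated assumptions, is to fit a straight-line trend and then look for readings that sit far from it. The pipeline pressure values are invented, and the cutoff of two residual standard deviations is an arbitrary starting point rather than a recommended setting.

```r
# Hypothetical hourly pressure readings from one pipeline sensor:
# a steady downward drift with one sudden drop at hour 7
hour     <- 1:10
pressure <- c(100.0, 99.6, 99.1, 98.4, 98.1, 97.5, 90.0, 96.4, 96.1, 95.5)

# Fit a straight-line trend with ordinary least squares
fit   <- lm(pressure ~ hour)
slope <- unname(coef(fit)["hour"])   # change in pressure per hour

# Readings that sit far from the trend line are candidate outliers;
# two residual standard deviations is an assumed, adjustable cutoff
res     <- residuals(fit)
suspect <- hour[abs(res) > 2 * sd(res)]

print(slope)
print(suspect)
```

A negative slope points to a gradual loss of pressure, while the flagged hour marks a reading worth a closer look. Real sensor data would likely call for a more robust model than a single straight line.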

Using R to quickly extract trends from high volumes of data like this makes a lot of sense. With R code in a machine learning system you can deliver data that’s been identified as significant to your ML engine without flooding it or degrading performance.

R isn’t for beginners. You’re going to need to have a fair grounding in its underlying mathematical concepts if you’re going to make good use of it. That’s not a bad thing; you need to be able to ask the right questions to produce results that have any value. The beauty of R is that it enables many more people to ask those questions and to use those results.

Copyright © 2015 IDG Communications, Inc.