In an ideal world, computational cost metrics would be calculated automatically within data scientists' development workbenches. In other words, analytics tools, libraries, and sandboxing platforms would present these metrics as a key decision-support feature, so that data scientists can weigh the likely downstream performance characteristics of whatever models they happen to be building.
From a devops standpoint, the ideal data scientist toolset would assess models' potential downstream performance impacts as manifested in any or all of the following types of latency:
- Data latency: When deployed for data acquisition, integration, and cleansing, will the model significantly slow data transmission from sources to downstream consuming applications?
- Execution latency: When executed on an in-database analytics platform, will the model take an inordinate amount of time to deliver results? If implemented on a mixed-workload platform, will it significantly slow execution of other workloads running on that platform?
- Modeling latency: When built with existing statistical analysis and data preparation tools, will the model take an excessive amount of time to be developed, populated, scored, iterated, and deployed?
Considering that there are usually alternative algorithms that can be used to address the same modeling domain, the ideal tool would present the performance advantages and disadvantages of each, so that the developer may choose wisely.
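To make that concrete, here is a minimal sketch of what such decision support might look like: timing two alternative algorithms that address the same modeling domain (here, simple linear regression fit via the closed-form least-squares solution versus iterative gradient descent) and reporting each one's modeling latency side by side. The function names and the timing harness are illustrative assumptions, not part of any particular tool.

```python
import time

# Illustrative alternatives for the same modeling domain: both fit a
# one-variable linear regression, but with very different cost profiles.

def fit_closed_form(xs, ys):
    """Least-squares fit via the closed-form solution (one pass over the data)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def fit_gradient_descent(xs, ys, lr=0.5, epochs=2000):
    """The same model fit iteratively -- many passes over the data."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_s = sum((slope * x + intercept - y) * x for x, y in zip(xs, ys)) / n
        grad_i = sum((slope * x + intercept - y) for x, y in zip(xs, ys)) / n
        slope -= lr * grad_s
        intercept -= lr * grad_i
    return slope, intercept

def timed_fit(fit, xs, ys):
    """Return the fitted model plus its wall-clock modeling latency."""
    start = time.perf_counter()
    model = fit(xs, ys)
    return model, time.perf_counter() - start

# Synthetic data with a known relationship: y = 2x + 1, x scaled to [0, 1).
xs = [i / 200 for i in range(200)]
ys = [2.0 * x + 1.0 for x in xs]

for fit in (fit_closed_form, fit_gradient_descent):
    (slope, intercept), elapsed = timed_fit(fit, xs, ys)
    print(f"{fit.__name__}: slope={slope:.3f} intercept={intercept:.3f} "
          f"latency={elapsed * 1000:.2f} ms")
```

A real workbench would of course gather these numbers across representative data volumes and platforms, but even a crude harness like this makes the latency trade-off between candidate algorithms visible before deployment.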
Anyway, that's my wish list. I'm sure it resonates with many other data science and big data professionals. After all, the world's most sophisticated analytics are utterly useless if they can't execute in a reasonable amount of time on our big data platforms.
This story, "Devops can take data science to the next level," was originally published at InfoWorld.com.