Dataiku review: Data science fit for the enterprise

Dataiku’s end-to-end machine learning platform combines visual tools, notebooks, and code to address the needs of data scientists, data engineers, business analysts, and AI consumers.


Dataiku Data Science Studio (DSS) is a platform that tries to span the needs of data scientists, data engineers, business analysts, and AI consumers. It mostly succeeds. Dataiku DSS also tries to cover the machine learning process from end to end, that is, from data preparation through MLOps and application support. Again, it mostly succeeds.

The Dataiku DSS user interface is a combination of graphical elements, notebooks, and code, as we’ll see later on in the review. As a user, you often have a choice of how you’d like to proceed, and you’re usually not locked into your initial choice, given that graphical choices can generate editable notebooks and scripts.

During my initial discussion with Dataiku, their senior product marketing manager asked me point blank whether I preferred a GUI or writing code for data science. I said, “I usually wind up writing code, but I’ll use a GUI whenever it’s faster and easier.” This met with approval: Many of their customers have the same pragmatic attitude.

Dataiku competes with pretty much every data science and machine learning platform, but also partners with several of them, including Microsoft Azure, Databricks, AWS, and Google Cloud. I consider KNIME similar to DSS in its use of flow diagrams, and at least half a dozen platforms similar to DSS in their use of Jupyter notebooks, including the four partners I mentioned. DSS is similar to DataRobot, H2O.ai, and others in its implementation of AutoML.

Dataiku DSS features

Dataiku says that its key capabilities are data preparation, visualization, machine learning, DataOps, MLOps, analytic apps, collaboration, governance, explainability, and architecture. It supports additional capabilities through plug-ins.

Dataiku data preparation features a visual flow where users can build data pipelines with datasets, recipes to join and transform datasets, plus code and reusable plug-in elements.
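To make the idea of a flow concrete, here is a minimal sketch in plain Python of two chained "recipes": a join followed by a transform. This is not Dataiku's API. The function names (`join_recipe`, `prepare_recipe`), the column names, and the sample rows are all hypothetical and exist only to illustrate how datasets feed recipes that produce new datasets.

```python
def join_recipe(left, right, key):
    """Inner-join two datasets (lists of dict rows) on a shared key column."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    # Rows from `left` with no match in `right` are dropped (inner join).
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def prepare_recipe(rows):
    """Transform step: add a derived `total` column to every row."""
    return [{**row, "total": row["quantity"] * row["unit_price"]} for row in rows]


# Hypothetical input datasets.
orders = [
    {"order_id": 1, "customer_id": "a", "quantity": 2, "unit_price": 5.0},
    {"order_id": 2, "customer_id": "b", "quantity": 1, "unit_price": 20.0},
    {"order_id": 3, "customer_id": "z", "quantity": 4, "unit_price": 1.0},
]
customers = [
    {"customer_id": "a", "region": "EMEA"},
    {"customer_id": "b", "region": "APAC"},
]

# The "flow": orders + customers -> joined -> prepared.
joined = join_recipe(orders, customers, "customer_id")
prepared = prepare_recipe(joined)

for row in prepared:
    print(row["order_id"], row["region"], row["total"])
```

Each intermediate result (`joined`, `prepared`) plays the role of a dataset node in the flow, which is what makes individual steps easy to inspect and rerun.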

Dataiku performs quick visual analysis of columns, including the distribution of values, top values, outliers, invalids, and overall statistics. For categorical data, the visual analysis includes the distribution by value, with the count and percentage of rows for each value. The visualization capabilities let you perform exploratory data analysis without resorting to Tableau, although Dataiku and Tableau are partners.
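For readers who want to see what such a column profile amounts to, the following is a rough sketch using only the Python standard library. It is my own illustration, not how DSS computes its statistics: the function names are hypothetical, "invalid" here simply means "not a number," and outliers are flagged with the common 1.5 × IQR rule, which is an assumption on my part.

```python
from collections import Counter
from statistics import mean, median, quantiles


def profile_categorical(values):
    """Return (value, count, percent) tuples, most frequent value first."""
    counts = Counter(values)
    total = len(values)
    return [(v, c, round(100 * c / total, 1)) for v, c in counts.most_common()]


def profile_numeric(values):
    """Summarize a numeric column; non-numeric entries are counted as invalid."""
    valid = [v for v in values
             if isinstance(v, (int, float)) and not isinstance(v, bool)]
    q1, _, q3 = quantiles(valid, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(valid),
        "invalid": len(values) - len(valid),
        "mean": mean(valid),
        "median": median(valid),
        "outliers": [v for v in valid if v < low or v > high],
    }


# Hypothetical sample columns.
plan = ["free", "team", "free", "enterprise", "free", "team"]
latency_ms = [10, 12, 11, 13, 12, 95, "n/a", None]

plan_profile = profile_categorical(plan)
latency_profile = profile_numeric(latency_ms)

print(plan_profile)
print(latency_profile)
```

The point of the GUI is that you get this kind of summary for every column without writing any of it.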

Dataiku machine learning includes AutoML and feature engineering, as shown in the figure below. Each Dataiku project has a DataOps visual flow, including the pipeline of datasets and recipes associated with the project.

Figure (IDG): Dataiku DSS offers three kinds of AutoML models and three kinds of expert models.

For MLOps, the Dataiku unified deployer manages the movement of project files between Dataiku design nodes and production nodes for batch and real-time scoring. Project bundles package everything a project needs from the design environment so that it can run in the production environment.

Dataiku makes it easy to create project dashboards and share them with business users. The Dataiku visual flow is the canvas where teams collaborate on data projects; it also represents the project’s DataOps pipeline and provides an easy way to access the details of individual steps. Dataiku permissions control who on the team can access, read, and change a project.

Dataiku provides critical capabilities for explainable AI, including reports on feature importance, partial dependence plots, subpopulation analysis, and individual prediction explanations. These are in addition to providing interpretable models.
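As a rough illustration of what a feature importance report measures, here is a small sketch of permutation importance, one common model-agnostic technique. I am not claiming this is the method DSS uses internally; the toy model, the data, and the function names are all hypothetical. The idea is to shuffle one feature at a time and see how much the model's error grows.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of `model` over the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_repeats=5, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild each row with only the chosen column replaced.
            permuted = [r[:col] + (v,) + r[col + 1:]
                        for r, v in zip(rows, shuffled)]
            increases.append(mse(model, permuted, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical model that depends only on the first feature."""
    return 3.0 * row[0]


rows = [(float(i), float((i * 7) % 5)) for i in range(10)]
targets = [3.0 * r[0] for r in rows]

scores = permutation_importance(toy_model, rows, targets)
print(scores)
```

Shuffling the first feature degrades the predictions, so it gets a positive score, while the second feature, which the toy model ignores, scores zero. Partial dependence plots and individual prediction explanations answer different questions and are not covered by this sketch.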

DSS has a large collection of plug-ins and connectors. For example, time series prediction models come as a plug-in; so do interfaces to the AI and machine learning services of AWS and Google Cloud, such as Amazon Rekognition APIs for Computer Vision, Amazon SageMaker machine learning, Google Cloud Translation, and Google Cloud Vision. Not all plug-ins and connectors are available in all plans.

At a Glance
  • Dataiku DSS is a very good, end-to-end platform for data analysis, data engineering, data science, MLOps, and AI browsing. It tries hard to support citizen data scientists with visual machine learning tools, but it helps if users can at least modify Python notebooks.

    Pros

    • Supports end-to-end data science, from data engineering and analysis to MLOps and AI browsing
    • Has capable AutoML facilities with a choice of algorithms
    • Offers simple exploratory data analysis notebooks and dashboards
    • Has some explainable machine learning capabilities

    Cons

    • Trains in-memory by default; for bigger datasets, you need to pay for Spark integration
    • Doesn’t support GPUs in its lower-end cloud plans

Copyright © 2021 IDG Communications, Inc.