How to use KNIME for data science

Free, open-source KNIME allows you to visually assemble data processing “nodes” into machine learning, deep learning, and other analytics workflows.

KNIME (the K is silent, so it’s pronounced nīm) is a highly rated data analytics platform with wide applicability and many integrations with other products, such as databases, languages, machine learning frameworks, and deep learning frameworks. The philosophy of KNIME is to be inclusive and “blend” whatever software and data sources you want to use.

The exploration, model building, visualization, reporting, and development portions of the platform are open source, as are the community extensions. KNIME Server, which provides collaboration, automation, management, and deployment capabilities, is commercial, as are the partner extensions. KNIME Analytics Platform and KNIME Server are available for on-prem installation and for the AWS and Azure clouds.

In this tutorial I’ll concentrate on the open source KNIME Analytics Platform and selected open source extensions. My goal is to bring you to the point where you can find an existing KNIME workflow that you can use as a starting point for your own data science work, and where you understand the KNIME workflow well enough to customize it. To accomplish that in limited space, I’ll refer you to some of KNIME’s own materials to fill in the details.

Why use KNIME?

Choose KNIME for your analytics needs if you like building models by assembling processing pipelines (called workflows) graphically from processing elements (called nodes), as exemplified by the simple classifier workflow shown below. Choose another tool if you prefer to write code or to run your models in spreadsheets.

KNIME Analytics Platform showing a very simple, well-annotated workflow example.

If you like to mix and match languages and tools, KNIME is a good framework for blending them together. If your organization has data scientists who construct models and workflows for analysts to apply, KNIME is also a good fit, especially if you purchase a KNIME Server subscription.

Having a graphical workflow designer makes KNIME easier to learn and use than a programming language with modules and frameworks, such as Python with Scikit-learn and a deep learning framework. What I said earlier about personal preference still applies, however. Easier isn’t necessarily better, especially for trained programmers and data scientists.

With more than 2,000 nodes available, KNIME has considerable functionality, certainly more than you would want to learn all at once. These nodes cover many areas, such as IO, views, analytics, database connectors, structured data, scripting, tools and services, workflow, social media, reporting, and chemistry, and that’s only with the basic nodes and a few of the available extensions. The reporting extension uses the open source BIRT package.

Like R and IBM Modeler, KNIME generally uses best-of-breed algorithms with high reliability and accuracy. That isn’t always the case for other packages, as discussed in a recent academic paper.

Although KNIME itself is a Java application, many of its extensions use other languages. For example, the best built-in visualizations use JavaScript graphics libraries, and the scripting extensions include R and Python categories. Several deep learning extensions are still classified as previews from KNIME Labs as of KNIME 3.6.1.

KNIME Analytics Platform overview

The KNIME Analytics Platform is built on Eclipse. As you can see in the screen image below, going clockwise from the top left, there are panes for exploring your local and remote server workflows, for displaying and editing workflows, for displaying a description of the currently selected node, for displaying console output, for displaying an outline of the current workflow, and for exploring your installed nodes.

The KNIME welcome screen, showing the panes for displaying and editing workflows and exploring nodes.

Some of the usual Eclipse chrome has been suppressed, so you can’t easily stumble into a different plug-in, but the help is still mostly about Eclipse. While there is a KNIME node at the bottom of the help, the content is older than what you can find on KNIME’s website and in the actual platform. Assuming that you’re connected to the Internet, I recommend going to the KNIME Learning Hub in a browser for reference rather than opening local help. While you are at it, download the beginner cheat sheet.

KNIME workflows tie nodes together by connecting their output and input ports to model data flow. You can create them by dragging nodes from the repository onto the workflow pane and drawing the connections between ports. Workflows are essentially self-documenting, but you can improve on that by adding comments to the workflow pane, as was done in the first screenshot we saw.

Nodes perform tasks on data, and usually need to be configured (double-click on the node to display the property sheet) before they are run. Nodes display traffic lights below the action block to signal their state: red for not configured, yellow for configured, and green after they have run successfully.

Ports are where data flows. Typically, double-clicking on an output port once the node is green will display the data. In the case of a graphics view output port, double-clicking on the port will display a graph window.
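
If it helps to think about this in code, here is a toy Python model of the ideas above: nodes that must be configured before they run, traffic-light states, and an output port that holds data once the node is green. This is only a sketch for intuition, not KNIME’s API. The `Node` and `State` names, the sample values, and the way executing a node first executes its upstream nodes are my own assumptions for illustration.

```python
from enum import Enum


class State(Enum):
    """Traffic-light states shown below a node."""
    NOT_CONFIGURED = "red"
    CONFIGURED = "yellow"
    EXECUTED = "green"


class Node:
    """Hypothetical stand-in for a workflow node with a single output port."""

    def __init__(self, name, func, upstream=()):
        self.name = name
        self.func = func                # the task this node performs on its inputs
        self.upstream = list(upstream)  # nodes connected to this node's input ports
        self.settings = None
        self.output = None              # data at the output port once executed
        self.state = State.NOT_CONFIGURED

    def configure(self, **settings):
        # Configuring (or reconfiguring) discards any earlier result.
        self.settings = settings
        self.output = None
        self.state = State.CONFIGURED

    def execute(self):
        if self.state is State.NOT_CONFIGURED:
            raise RuntimeError(f"{self.name} is not configured")
        if self.state is State.EXECUTED:
            return self.output
        # Run upstream nodes first, then apply this node's task to their outputs.
        inputs = [node.execute() for node in self.upstream]
        self.output = self.func(*inputs, **self.settings)
        self.state = State.EXECUTED
        return self.output


# A two-node workflow: a reader feeding a row filter.
reader = Node("Reader", lambda rows: list(rows))
row_filter = Node(
    "Row Filter",
    lambda table, minimum: [value for value in table if value >= minimum],
    upstream=[reader],
)

reader.configure(rows=[3, 8, 1, 9])
row_filter.configure(minimum=5)
states_before = [reader.state.value, row_filter.state.value]

result = row_filter.execute()
states_after = [reader.state.value, row_filter.state.value]

print(states_before, states_after, result)
```

Executing the last node pulls data through the whole chain, which is roughly what happens when you execute a downstream node in a workflow whose earlier nodes are still yellow.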

Chapter 1 of the KNIME introductory course, which I recommend, includes a video demonstrating the basic workflow operations.

KNIME applications

What can you do with KNIME? Where is it applicable?

KNIME is used in many areas, including customer intelligence, social media, finance, manufacturing, pharmaceuticals, retail, cross-industry, and government. That’s not a complete list, but KNIME has documented sample workflows for each of these, as shown below. You’ll find additional example workflows on the KNIME Example Server, which you can access from within the KNIME Analytics Platform by double-clicking under Examples within the KNIME Explorer pane.

KNIME application areas.

Install KNIME and extensions

At this point, I suggest installing KNIME on your own machine. It’s fairly simple. Browse to the preliminary download page, fill out the form on the first page to register for help and updates, then move to the actual download page to grab the installer for Windows, Linux, or macOS. For Windows you have several options; for Linux and Mac you have one option each.

I recommend that you also download the KNIME Quick Start Guide PDF, so that you can view it in a separate window rather than relying on the copy you can view within the workbench. Some of what’s discussed or displayed in the Quick Start Guide is obsolete, but not enough to confuse you. For example, the installation section talks about unzipping the download into a directory, but several of the possible downloads are installers that you need to run, such as the macOS installer.

When you first run KNIME, you will see a workspace picker. Use the default for now. You will then see a welcome screen similar to the screenshot in the overview section of this tutorial. There’s an option to get additional nodes in the “Where to go from here” section. There’s a case to be made for downloading all additional nodes, even ones that don’t sound useful, on the grounds that the capabilities and examples provided are often of value even outside the purported purpose of the node.

If you don’t want to do that right now, you can add nodes at any time either by using the link in the welcome workflow or by using the “File | Install KNIME Extensions…” menu item. Both methods bring up the Eclipse “Available Software” installer.

KNIME node installation.

I recommend that you take some time to browse through the KNIME nodes installed in your platform instance, as well as to read through the KNIME Node Guide, so that you get a rough idea of what’s available to you. This is also a good time to read the KNIME Quick Start Guide and the Seven Things to Do page and go through the steps.

With KNIME, you will create workflows that import and clean up your data, transform the data into new variables that are appropriate for the models you want to fit, then perform model fitting and evaluation, and finally generate a report. KNIME has most or possibly all of what you need for this. If you need to extend KNIME with other packages or with your own scripts to accomplish your goals, you should be able to find nodes that help you tie those into your KNIME workflow.
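
For readers who think in code, the stages just described can be sketched as a chain of small Python functions, each playing the role of one node. Everything here is hypothetical and chosen only to keep the example short: the inline CSV, the column names, the derived feature, and the deliberately naive “model” (a threshold halfway between the two class means). It also evaluates on the training data, which you would not do in a real analysis.

```python
import csv
import io
from statistics import mean

# Hypothetical toy data; the third row has a missing age.
RAW = """age,hours_per_week,high_income
25,40,0
47,50,1
,35,0
52,60,1
31,38,0
44,55,1
"""


def import_data(text):
    """Import: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: drop rows that contain any missing value."""
    return [row for row in rows if all(value != "" for value in row.values())]


def transform(rows):
    """Transform: derive one numeric feature and an integer label per row."""
    return [
        (int(row["age"]) * int(row["hours_per_week"]) / 100, int(row["high_income"]))
        for row in rows
    ]


def fit(samples):
    """Fit: place a threshold halfway between the two class means."""
    positive = mean(x for x, y in samples if y == 1)
    negative = mean(x for x, y in samples if y == 0)
    return (positive + negative) / 2


def evaluate(threshold, samples):
    """Evaluate: fraction of samples the threshold classifies correctly."""
    correct = sum((x > threshold) == bool(y) for x, y in samples)
    return correct / len(samples)


def report(threshold, accuracy):
    """Report: summarize the result as one line of text."""
    return f"threshold={threshold:.2f} accuracy={accuracy:.0%}"


rows = clean(import_data(RAW))
samples = transform(rows)
threshold = fit(samples)
accuracy = evaluate(threshold, samples)
print(report(threshold, accuracy))
```

In a KNIME workflow each of these functions would be a configured node, and the function calls at the bottom would be the connections you draw between ports.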

KNIME example workflows

The Seven Things to Do page suggests that you work through the “Building a Simple Classifier” sample found installed under “Example Workflows | Basic Examples.” It performs decision tree classification on a standard data set. It formerly used Iris morphology data; now it uses demographic data to predict income.
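
If you’re curious what a decision tree learner does at each step, the sketch below shows the core idea in plain Python: try every candidate threshold on one numeric column and keep the split with the lowest weighted Gini impurity. This is a generic illustration, not the algorithm or settings used by KNIME’s node, and the `ages` and `high_income` values are made up for the example.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best binary split on one feature.

    Returns (None, impurity_of_all_labels) if no split improves on not splitting.
    """
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # identical values cannot be separated
        threshold = (low + high) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: age versus a high-income flag.
ages = [22, 25, 31, 38, 45, 52]
high_income = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(ages, high_income)
print(threshold, impurity)  # prints 34.5 0.0
```

A full tree repeats this search over every column and then recurses into each branch until the leaves are pure enough or a stopping rule applies.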

That sample is an excellent start. The only thing I would add to the official discussion is to point to the double-arrow icon in the workflow toolbar, which executes all nodes. You may also want to hover your mouse over each icon on the toolbar to see what it does and its keyboard shortcut.

The KNIME workflow toolbar.

The shortcuts tend to be Windows-oriented function keys, but you can make them work on a Mac by pressing the fn key at the same time as, say, Shift-F7 (execute all available nodes). If you’d like to use key combinations that are more convenient on a Mac, use the “System Preferences | Keyboard | Shortcuts | App Shortcuts” window, add the KNIME app, and map your preferred keys to the Node menu items.

The Seven Things to Do page also suggests that you download a workflow from the Example Server. It makes several suggestions, and explicates one of them, Sentiment Classification, a model that predicts whether IMDB movie reviews are positive or negative by analyzing the text. That’s an excellent second step.

With both of these workflows, I want you to click on every node and read the description, which will appear at the right. I also want you to try and examine the Data Blending and Simple Reporting examples, to get a feel for how to do ETL and generate reports with KNIME.

KNIME next steps

At this point, I’d recommend spending some quality time with the KNIME Example Workflows. You can browse through all the topics and view the meta information for those that might be of interest; you can also search for specific areas of interest. As you did with Sentiment Classification, copy any workflows that you want to run and customize, placing them into your local workspace. This would be a good time to add some workflow groups to organize your analyses into projects.

The KNIME Learning Hub is the next good place to browse, as you probably haven’t yet learned everything you’ll need to know to customize a workflow for different data and different (or more) algorithms. Depending on your background, interests, and skill level, you may want to look at various usage and application areas in the Learning Hub. If you are planning to develop your own nodes, the SDK information is now on GitHub.

There are several book and course recommendations under the various Learning Hub applications tabs. I have gone through a few of the books. The content is good, although the graphical nature of KNIME’s UI means that how-to instructions require lots of screenshots and long descriptions about where to click, which means that it can be easy to become lost in the weeds. I have also gone through half a dozen of the recommended videos. As long as you understand the speakers’ accents, you’ll find the presentations useful.

Copyright © 2018 IDG Communications, Inc.
