How vectorization improves database performance

Although a serious engineering challenge, database vectorization delivers orders-of-magnitude performance boosts for a real-time analytics engine such as StarRocks. Here’s how we did it.

Page 2 of 2

Parting thoughts and reflections

Now that we’ve taken this journey down the road to StarRocks database vectorization, let’s look back at what we’ve learned.

The underlying principles of different systems are similar. When we started looking into the microarchitecture of the CPU, we realized how closely it resembles database architecture. A CPU has a front end that fetches and decodes instructions and a back end that executes them. In the case of StarRocks, the front end manages SQL parsing and query planning, and the back end takes care of SQL execution and interaction with the storage layer. The more systems and architectures you study, the more deeply you’ll understand the similarities at the system level.

To build a high-performance database, we need not only a well-designed architecture but also close attention to the engineering details. Although the need for both good design and good engineering seems obvious, one or the other often goes missing in database products. If you truly believe in both, you don’t design a database with only a bottom-up approach, starting from algorithms and unique components, without implementing a high-level architecture that ensures all of those components work well together. Nor do you select programming languages like Java or Go to implement your query execution engine and storage engine, when more performant languages such as C++ are available.


Mixing vectorization and compilation. Vectorization and compilation are the two major query execution styles, but they are not mutually exclusive. Even though most open source databases have chosen to use vectorization, we can leverage query compilation to generate more efficient vector code through information acquired during query execution. In the meantime, query compilation is constantly improving.
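To make the idea of combining the two styles concrete, here is a minimal sketch, not StarRocks code. It assumes a hypothetical `Int32Column` batch type and a hypothetical `filter_greater_than` operator, and it uses C++ template specialization as a stand-in for real query compilation: both kernel variants are generated ahead of time, and a fact observed at run time (whether the batch contains any nulls) decides which one runs. A query compiler could generate the specialized loop on demand instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical column batch: values plus one null flag per row (1 = null).
// Assumption: when is_null is non-empty it has the same length as values.
struct Int32Column {
    std::vector<int32_t> values;
    std::vector<uint8_t> is_null;
};

// One kernel body, specialized at compile time. When HasNulls is false the
// null test is a compile-time constant and drops out, leaving a simple loop
// over contiguous data that compilers can usually auto-vectorize.
template <bool HasNulls>
void greater_than_kernel(const Int32Column& col, int32_t threshold,
                         std::vector<uint8_t>& selected) {
    const std::size_t n = col.values.size();
    selected.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        bool keep = col.values[i] > threshold;
        if (HasNulls) {
            keep = keep && (col.is_null[i] == 0);  // null rows never match
        }
        selected[i] = keep ? 1 : 0;
    }
}

// Information acquired during execution: does this batch contain a null?
bool batch_has_nulls(const Int32Column& col) {
    for (uint8_t flag : col.is_null) {
        if (flag != 0) return true;
    }
    return false;
}

// Processes a whole batch per call (the vectorized style) and picks the
// specialized kernel from what was observed about the batch.
void filter_greater_than(const Int32Column& col, int32_t threshold,
                         std::vector<uint8_t>& selected) {
    if (batch_has_nulls(col)) {
        greater_than_kernel<true>(col, threshold, selected);
    } else {
        greater_than_kernel<false>(col, threshold, selected);
    }
}
```

Calling `filter_greater_than` on a batch fills `selected` with one byte per row, 1 for rows that pass the predicate and 0 otherwise. The null check here costs a pass over the flags; a real engine would more likely keep a per-batch "has nulls" bit so the dispatch is free.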

Try new hardware such as GPUs and FPGAs. After extensive optimizations, we probably are getting close to the point of diminishing returns on CPU optimization. We are also starting to look at other new hardware to further improve StarRocks’ performance.

Challenge the impossible. Most of what you’ve read about today reflects the tremendously hard work of StarRocks’ community of contributors. In just the last few years we have built a vectorized query engine, cost-based optimizer, pipeline execution engine, and more. All of these breakthrough components came about because of the community’s strong culture of pushing the limits of what’s possible and questioning the status quo.

The role of data engineers will only become more critical in the coming years as data volumes grow, data sources expand, and user expectations rise. With projects like StarRocks, and innovations like database vectorization, you’ll be ready to meet any performance demands you come up against.

To learn more about StarRocks’ vectorization implementation, visit our GitHub page. For customer case studies and information about StarRocks use cases, visit us at starrocks.io.

James Li is cofounder and CEO of CelerData. Both chief executive and chief visionary, Li works directly with enterprise leaders and their engineers to develop the next generation of real-time analytics solutions. With more than a decade of experience heading digital transformation efforts at Microsoft, Baidu, and Xiaomi, Li has made it his life’s work to shape the way technology helps power human decisions and accelerate progress around the world.

Kaisen Kang is a founding engineer and software engineering manager of CelerData. Kang is also a StarRocks Community PMC, an Apache Kylin PMC and committer, and an Apache Doris committer. At CelerData, Kaisen leads the team that built the StarRocks vectorized execution engine, the CBO query optimizer, and the pipeline parallel query execution engine. Before CelerData, he worked at Meituan where he built the Doris OLAP platform.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2022 IDG Communications, Inc.
