Etsy goes cloud native to better scale to seasonal demands

The popular marketplace moved everything to Google Cloud, after Google brought in engineers, not sales reps, during the proposal process

Popular online marketplace Etsy recently completed a two-year migration from 2,000 on-premises servers to Google Cloud. The Etsy website and mobile app provide an online shop window to makers of hand-crafted and niche goods, but with more than two million sellers on the platform by 2018, the in-house bare metal infrastructure was starting to creak.

Etsy was founded in 2005, well into the internet era but long before the explosion of public cloud services. The firm went public in 2015 and soon afterward made cloud migration a priority, both to scale its services without big, expensive up-front hardware purchases and to make better use of machine learning techniques.

In 2016 the company started to explore the public cloud market. Google Cloud stood out during the selection process thanks to Google’s desire to be more hands-on than its rivals. “They came in and took the opportunity to understand our business and our challenges and paired us up with actual engineers, not just sales reps,” Etsy’s chief technology officer, Mike Fisher, told InfoWorld.

A two-year migration

The two companies signed a five-year deal in December 2017 and started work on the migration strategy at the start of 2018, with the aim of moving everything in two years.

Google Cloud had engineers embedded at Etsy’s Brooklyn office and in its Slack channels. The first stage of the project was the marketplace website application itself, which consists of the website, mobile APIs, web servers, API servers, and hundreds of MySQL databases in a monolithic architecture built on a LAMP stack.

The broad principles for the migration were: no large architecture changes, migrate as few systems as possible, and stay compliant.

“We felt that doing a rewrite would add a tremendous amount of risk to the project [...] the code base has a lot of kinks in it, it is legacy,” Keyur Govande, chief architect at Etsy, explained on stage at the Google Cloud Next conference last year.

August would normally trigger a multi-million-dollar hardware buying binge for Etsy, an annual bolstering of capacity for the busy holiday season. The team wanted to cut over the core marketplace to Google Cloud in August so that it would be running mostly in the cloud ahead of this busy period and could avoid making the holiday hardware investment again.

The engineers eventually got over the line by the skin of their teeth at the second attempt on August 19, 2018. The first attempt was rolled back after engineers observed some key offline processing systems were running out of memory during the migration.

One core workload that Etsy did rearchitect to a cloud native model was search, a Java and Scala application running on highly customized versions of Apache Solr and Lucene.

“We migrated search on Kubernetes on our own data centers first,” Fisher said. “The challenge is figuring out not the move to containers and Kubernetes – we can bring in people to teach us that – the challenge is how to operationalize that and run your software on that.”

The search application was migrated in March 2019, followed by the big data store, which was completed soon after in April. The remaining supporting systems, such as monitoring, were migrated last. When that work was completed in February this year, the company had hit its two-year migration goal.

To carry out the project, Etsy put together a cross-functional migration “squad,” which focused exclusively on migrating Etsy service by service, with as little disruption to the rest of the technology function as possible.

Etsy did suffer through some months of lower availability than it ideally wanted while the engineers decided what they needed to monitor and observe and how and when to react. “That is the tough part — no one from the outside can teach you how to run this,” Fisher explained.

Etsy measures availability according to the level at which its systems are functioning, calculated as a percentage. This dropped by a few tenths of a point as the Etsy engineers learned how to manage their search application on Kubernetes instead of a virtual machine.
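To put "a few tenths of a point" in perspective, here is a minimal Python sketch of the arithmetic. It assumes availability is simply the share of minutes in a period during which systems were functioning; the article does not spell out Etsy's exact formula. The 99.9 and 99.6 percent figures, and the function names, are hypothetical values chosen for illustration, not numbers Etsy reported.

```python
MINUTES_IN_30_DAYS = 30 * 24 * 60  # length of an assumed 30-day reporting period


def availability_pct(good_minutes: float, total_minutes: float) -> float:
    """Share of the period in which systems were functioning, as a percentage."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100.0 * good_minutes / total_minutes


def downtime_minutes(availability: float, total_minutes: float) -> float:
    """Minutes of degraded service implied by an availability percentage."""
    return total_minutes * (100.0 - availability) / 100.0


# Hypothetical figures, for illustration only.
before = 99.9
after = 99.6  # three tenths of a point lower

for label, pct in (("before", before), ("after", after)):
    minutes = downtime_minutes(pct, MINUTES_IN_30_DAYS)
    print(f"{label}: {pct}% availability is about {minutes:.1f} minutes down per 30 days")
```

Under these assumed numbers, a drop of three tenths of a point takes degraded time from roughly 43 minutes to roughly 173 minutes over 30 days, which shows why even a small percentage change is noticeable to an operations team.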

Finding the right crocheted octopus

Fisher describes Etsy as an iceberg, with most customers seeing the marketplace and not the 5.5 petabytes of data it sits on. Because Etsy relies on user-generated tagging, the company needs to build smarter algorithms and search capabilities if it is to convert customers and keep them engaged. Currently 80 percent of purchases are driven by the first page of search results, so getting that right is critical to the e-commerce site.

There are more than 65 million unique items in the Etsy marketplace, and the company’s data scientists are constantly on the lookout for inventive ways to serve up results to customers. These include leveraging image recognition technology to do things like categorize items by style.

“Usually style is restricted to a category but to be able to detect the style of a dress and apply that to a rug, for example, is difficult,” Fisher said. “We are able to do that using image recognition.”

“The real value of Google is those value-added services like big data and machine learning that we really need,” Fisher added. “If we invest in the infrastructure I want to partner with someone that does that really, really well.”

TensorFlow, Google’s popular machine learning platform, is a good example. As Danny Rosen, technical program manager at Google, said on stage at Cloud Next last year: “Finding the right crocheted octopus on Etsy? Kind of hard. How are you going to do it? Machine learning.”

Shutting down the data center

The numbers since the shift are proving out the strategy, especially when it comes to the speed at which the company can scale to meet demand around busy periods like Christmas.

“In the cloud, we can spin up hundreds of servers in minutes, whereas it would take months of budgeting, planning, and installing servers to get the same amount of compute power in data centers before,” Fisher said.

The company has already been able to shut down two of its three data centers as a result of the migration, and it’s in the process of consolidating the last one down to a couple of racks. For the time being, though, the company is still running the servers it has under maintenance for its development environments.

As a result of the lower operational overhead of running in the cloud, Etsy says it has been able to move 15 percent of its 500-plus engineering team “up the stack” to focus on improving the user experience for customers.

Following the migration, the company is now looking to shift toward more cloud native models of working, as can be seen with the search application and the use of containers and Kubernetes there.

And if they could do it all again?

Having come through this migration on time, Fisher believes they would do only one thing differently if they got the chance again. “I think understanding what we should shift up front, instead of working that out as we go along, was a big learning. We could have been more prescriptive and been less experimental as we went along with things like Kubernetes,” he said.

He does acknowledge the Catch-22 element of this advice: You can understand what you should and should not have moved only after you have gone through the process. So what can another company learn from their experience?

“That only works for us,” he admits, “as we know what our capabilities are and how to operationalize that. So is your engineering culture able to adapt that fast? There might not be any shortcuts. You have to learn as you go.”

Copyright © 2020 IDG Communications, Inc.