If you build big data systems in the public cloud, you know that it costs money to move data in and out of the cloud -- that is, you pay for network traffic. Public cloud providers price this traffic differently, and the prices change all the time.
Regardless of the fluctuating prices, if you're running huge databases in a public cloud, you need to consider the data traffic you're consuming.
Here are two tips that I have learned after building a few of these systems:
1. Place your data processing as close to your data as you can
In many instances, IT organizations like to store data on-premises and place data processing in the cloud. That's not a great approach if you're paying for data movement into and out of the cloud. Alternatively, IT organizations place data in the cloud and process it locally -- that's also not good.
The network cost climbs even higher if you consider the overhead of encrypting data in flight.
The cloud is a distributed system, and widely distributing data and data processes aids scaling and resiliency. But when you consider the cost of the cloud data traffic, such distribution may no longer be cost effective or provide the performance you're seeking.
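To put rough numbers on that tradeoff, here is a back-of-the-envelope sketch in Python. The per-gigabyte price and the daily volumes are made-up figures for illustration only, not any provider's actual rates, and `monthly_transfer_cost` is a hypothetical helper; plug in the numbers from your own provider's price list.

```python
def monthly_transfer_cost(gb_per_day: float, price_per_gb: float, days: int = 30) -> float:
    """Estimate the monthly charge for data crossing the cloud boundary."""
    return gb_per_day * price_per_gb * days


# Assumed price, for illustration only. Check your provider's current rates.
PRICE_PER_GB = 0.09

# Scenario A (hypothetical): data lives in the cloud, processing runs
# on-premises, so 500 GB of raw data leaves the cloud every day.
remote_processing = monthly_transfer_cost(500, PRICE_PER_GB)

# Scenario B (hypothetical): processing runs next to the data, so only
# about 2 GB of results leave the cloud every day.
colocated_processing = monthly_transfer_cost(2, PRICE_PER_GB)

print(f"Processing far from the data: ${remote_processing:,.2f} per month")
print(f"Processing next to the data:  ${colocated_processing:,.2f} per month")
```

The point is not the exact dollar amounts but the ratio: the bill scales with how much data crosses the boundary, so moving the processing to the data shrinks it by roughly the same factor.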
2. Minimize data sent back to the user
In many cases, IT organizations send entire result sets back to the user interface instead of sending only the relevant data. This waste is typical of data visualization tools: they request a 20 MB result set but display only five rows to the user. Why? Because the processing is done locally in the client. It should be done on the cloud back end so that the data traffic is minimal.
Make sure the data analysis tools minimize that data traffic as much as possible -- even if performance is affected slightly.
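Here is a minimal sketch of the difference, using Python's built-in `sqlite3` module as a stand-in for the back-end database. The `sales` table, its contents, and the helper names are invented for the example, and the JSON payload size is only a rough proxy for what would actually travel over the network.

```python
import json
import sqlite3


def build_demo_db(rows: int = 10_000) -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        ((f"region-{i % 50}", float(i)) for i in range(rows)),
    )
    return conn


def payload_bytes(rows) -> int:
    """Approximate the size of a result set if it were sent as JSON."""
    return len(json.dumps(rows).encode("utf-8"))


conn = build_demo_db()

# Wasteful: ship every row to the client and let it do the summarizing.
full_result = conn.execute("SELECT region, amount FROM sales").fetchall()

# Frugal: aggregate and trim on the back end, send only what gets displayed.
top_five = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY total DESC LIMIT 5"
).fetchall()

full_size = payload_bytes(full_result)
small_size = payload_bytes(top_five)
print(f"Full result set: {full_size:,} bytes for {len(full_result):,} rows")
print(f"Top five only:   {small_size:,} bytes for {len(top_five)} rows")
```

The user sees the same five rows either way. The only thing that changes is how many bytes you paid to move.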
If you don't consider network traffic for data movement when designing and building cloud systems, you'll see a very high network bill at the end of the month. Don't say you weren't warned.