Big Data Efficiency: War Stories and Tips

When your data cloud gradually becomes a data swamp, it can be tough to dredge your way out. And while you’re struggling to get your footing, your company is shedding valuable dollars and resources.

Bluesky's founders share their insight into big data efficiency and cost optimization.

As an engineer at Uber, for example, Shao saw big data balloon to account for almost 40% of the company’s whole infrastructure. After leading the charge for big data efficiency, he reduced the company’s big data processing costs by one-third, saving millions per year.

“This shows how much opportunity there is to reduce the cost of big data,” Shao said. “And if one company can save that much money, then there's probably a lot more money we can save for the whole industry.”

In a recent virtual chat, Big Data Efficiency: War Stories and Tips, Hong and Shao shared the biggest challenges of using data clouds — and useful hacks for reducing Snowflake costs and improving ROI.

Big Data War Stories from the Front Lines

One of the biggest challenges of big data management is, of course, the cost. Many companies are actually wasting valuable resources without even realizing it.

Here’s how.

1) Creating duplicate tables

With so many sources of big data ingested into your warehouse, it’s easy to lose track of how many tables you create — so you’re left with a large number of duplicate and wasteful ones.

“I’ve seen people get so excited to start using big data that they generate so many tables and, after a while, they forget why or how those tables were even generated,” said Shao. “It's very hard to manage a data warehouse if you’re faced with 10,000 tables. It will take so much time to understand what exactly has been put in already, and it will be hard to refactor the warehouse once it's already in bad shape.”
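One lightweight way to start hunting for such duplicates is to group tables by a cheap "signature." Here is a minimal Python sketch over a hypothetical table inventory — the table names, columns, and signature choice are all illustrative, and a real audit would also compare data checksums or lineage:

```python
from collections import defaultdict

# Hypothetical sketch: flag tables that look like duplicates by grouping
# on a cheap signature (column tuple + row count). A real audit would also
# compare checksums or lineage; the table names here are made up.
inventory = [
    {"name": "events_v1",      "columns": ("id", "ts", "kind"), "rows": 1000},
    {"name": "events_v1_copy", "columns": ("id", "ts", "kind"), "rows": 1000},
    {"name": "users",          "columns": ("id", "email"),      "rows": 50},
]

groups = defaultdict(list)
for t in inventory:
    groups[(t["columns"], t["rows"])].append(t["name"])

# Each group with more than one table is a set of likely duplicates.
duplicates = [names for names in groups.values() if len(names) > 1]
print(duplicates)
```

Even a crude pass like this can shrink a 10,000-table warehouse into a short list of candidates worth reviewing by hand.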

2) Using multiple data warehouses without a clear policy

Many companies are starting to introduce not just one cloud data warehouse or query engine but a combination of them, with the good intention of using “the right tool for the job” for their diverse workloads.

“For analytics and data engineering teams, this transformation further increases not only the complexity of managing these different products like Snowflake, but also the complexity of using them,” Hong said. “As a user, you now need to focus on not only how to write your query in a fast and cheap way, but also which warehouse or which data cloud to send the query to.”

Unfortunately, the increased infrastructure complexity tends to make end users more prone to misusing it, which makes their workloads unnecessarily slow, expensive, and complex to manage.

3) Running unoptimized queries

In the old days, data storage platforms like Oracle charged companies based on the data volume used. Today’s cloud data platforms like Snowflake use a different model: they charge based on compute seconds used.

“New users are still bringing their old mindset and practices to this new generation of products,” said Hong. “So they don’t really think too much about throwing a very expensive query onto the cluster.”

Hong recently saw a Bluesky trial user run a query pattern that kept timing out after two hours of execution, costing $96 per run. That query was retried more than 100 times.

“That’s $10,000 in damage that a single careless data cloud user could bring to their workflow and cost,” Hong said. “And that company has hundreds of data cloud users.”
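The arithmetic behind that incident is worth spelling out: the $96-per-run cost and the "more than 100" retries come from the anecdote above, while the helper function is purely illustrative.

```python
# Illustrative only: the $96 per run and 100+ retries come from the
# anecdote in the text; the helper function is a made-up sketch.

def retry_cost(cost_per_run: float, retries: int) -> float:
    """Total spend when a failing query is blindly retried."""
    return cost_per_run * retries

# 100 retries at $96 each already approaches the $10,000 figure.
print(retry_cost(96.0, 100))  # 9600.0
```

A query that times out and is retried automatically never gets cheaper — it just bills the same failure over and over.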

Are you using your data cloud efficiently, or wasting money and valuable resources?

Useful Data Cloud Hacks for Faster, Smarter Performance

So, what can companies do to reduce the costs of big data technologies like Snowflake while improving their data cloud performance?

Let’s take a look.

1) Keep a clean house

“Keep your house clean from day one and run a tight ship,” said Hong. “Attribute Snowflake costs to individuals and teams, to give that visibility and accountability.”

“Also, make sure you have good documentation of the tables you generate and data pipelines you run,” said Shao. “Share across teams to reduce duplicates and anything that can potentially incur additional cost or reduce the data quality.”
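The cost-attribution tip above can be sketched in a few lines. The records here are made up — in practice they might come from an export of your warehouse's query log — and the user, team, and cost values are purely illustrative:

```python
from collections import defaultdict

# A minimal sketch of per-team cost attribution. The records would come
# from your warehouse's query log; users, teams, and costs are made up.
queries = [
    {"user": "alice", "team": "growth",    "cost_usd": 4.20},
    {"user": "bob",   "team": "growth",    "cost_usd": 1.10},
    {"user": "carol", "team": "analytics", "cost_usd": 9.60},
]

def cost_by(records, key):
    """Sum query cost per value of `key` (e.g. "user" or "team")."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r["cost_usd"]
    return dict(totals)

for team, total in cost_by(queries, "team").items():
    print(f"{team}: ${total:.2f}")
```

Once every dollar maps to a person or a team, the visibility and accountability Hong describes tend to follow on their own.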

2) Audit your query history

Check your query history to pinpoint expensive query patterns and repeat failures.

“You might have thousands of queries running on your Snowflake account already,” Shao said. “Find what’s most costly before you start optimizing anything. Once you know what you need to optimize for, then it should be a much easier job.”

By auditing one company’s Snowflake query history, Bluesky was able to optimize away 20% of their Snowflake costs in just two weeks, with the user spending only two hours on the process.

3) Clouds getting you down? Blue skies are ahead

Want to optimize your big data workloads for a faster and cheaper data cloud? That’s where Bluesky Mission Control comes in.

Bluesky Mission Control helps you find the root causes of problems, gather insights, and automatically run improvements — so your team can focus on building your product, not managing complex software.

As Shao said, “There's so much room for automation in this field. If you don't control your cloud data management today, when the company grows bigger, the cost can easily be 100X greater.”

Ready to reduce costs and improve data cloud performance? Contact us today!