Hugo Bowne-Anderson, the Head of Data Science Evangelism and Marketing at Coiled, joined us last month for a webinar on a topic of increasing relevance: making scaled data science work for people, and not the other way around. He discussed the challenges of distributed computing, the challenges of data culture in an organization, and how many problems aren’t just technical but also cultural.
In this post we cover:
Let’s think about where big data stands in the Gartner Hype Cycle, shown below:
Gartner Hype Cycle
Hugo doesn’t answer this question directly, but gives us a way to navigate it using the Google Trends results for “big data” over the years. Search volume can be a reasonable proxy for societal expectations around big data:
Google searches for "big data" over the years
The curve starts rising around the year 2010, then there is a peak, followed by a gradual decline. This suggests we might even be in the trough of disillusionment!
The origin of this big data hype can be traced back to a 2008 article published in Wired, called “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” by Chris Anderson. It hinges on the idea that if you have sufficient data, there is no need for scientific theory. The author made these statements based on the success of technologies like Google Search, which seemed to work this way. The convergence of three revolutions brought us here: algorithmic advancement, expansion in generation and storage of data, and increase in computational power.
Hugo takes a step back to remind us about the power of small, well-curated, and highly precise data. Kepler’s laws of planetary motion were informed by a dataset on the order of thousands of data points. This dataset is quite small, yet it paved the way for Newton’s laws of gravitation. Election polling is done on a random sample of the population, again on the order of thousands of people. It can still represent a state or country of millions of people with a fairly low margin of error.
Source: Harvard Business Review
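The polling claim above can be checked with a little arithmetic. The sketch below uses the standard margin-of-error formula for a proportion from a simple random sample at 95% confidence. The function name `margin_of_error` and the sample sizes are illustrative choices, not something shown in the webinar.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion estimated
    from a simple random sample (z=1.96 is the 95% normal quantile).

    proportion=0.5 is the worst case, so it gives the widest margin.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# Illustrative sample sizes (assumed, not from the talk).
for n in (100, 1_000, 10_000):
    print(f"n={n:>6}: +/-{margin_of_error(n):.1%}")
```

A sample of 1,000 people gives a margin of roughly 3 percentage points. The formula does not include the population size, which is why a few thousand respondents can stand in for millions as long as the sample is random.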
Thick data, or qualitative data, is another important topic to discuss here. Hugo shares an example of a credit card company that was developing models to predict credit card fraud. The company worked with a consulting group called ReD Associates, who talked to credit card fraudsters to understand their methods. The team learned that fraudsters sent items to houses that were abandoned or on sale. Including this feature in the training model increased the model performance dramatically! This shows how some improvements can only be achieved through “thick” data and significant domain knowledge.
That said, the wins of big data cannot be ignored.
The biggest success stories of big data are at technology giants like Google, Netflix, Stitch Fix, and more. Big data and big compute have helped us make strides in almost every domain. Some notable examples include:
Most of the teams mentioned above use Dask for working with big data, accelerating machine learning pipelines, and distributed computing. Dask makes it easy to move from a single machine to multiple machines. The webinar slides compare traditional data science tools with their Dask-powered alternatives, and the code is almost identical.
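As a rough illustration of that similarity (not a reproduction of the slides), here is the same computation written with NumPy and with `dask.array`. It assumes both `numpy` and `dask` are installed. The array size and chunk size are arbitrary values chosen for the example.

```python
import numpy as np
import dask.array as da

# Plain NumPy: the whole array lives in memory on one machine.
x_np = np.arange(1_000_000, dtype="float64")
np_result = (x_np ** 2).mean()

# Dask: the same expression, but the array is split into chunks
# that can be processed in parallel (chunk size is an arbitrary choice).
x_da = da.from_array(x_np, chunks=100_000)
lazy_result = (x_da ** 2).mean()   # builds a task graph, nothing runs yet
da_result = lazy_result.compute()  # triggers the actual computation

print(np_result, da_result)
```

The only visible differences are how the array is created and the explicit `.compute()` call, since Dask evaluates lazily.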
To learn more about Dask, check out: What is Dask?
Given the benefits of big data and scalable compute, the next logical question is: when should you use it, and when shouldn’t you? As Tom Augspurger explains in a PyData NYC talk, we don’t always need distributed solutions.
In the above diagram, the bottom-left quadrant denotes data size and model size that fit in RAM. You do not need scalable data science in this case.
The bottom-right quadrant denotes RAM-bound situations, where you’re working with larger-than-memory data. Here too, there are cases where you don’t need distributed compute. Hugo suggests starting with a subset of data and checking the model performance. If the performance is good, you do not need to go through the challenges of distributed compute! But, if you really need to use the large dataset, you can choose distributed.
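One way to act on the "start with a subset" advice is to train on progressively larger samples and watch the held-out error. The sketch below is a toy illustration using only the standard library. The synthetic data (`y = 3x + 2` plus noise), the subset sizes, and the helper names `fit_line` and `mse` are all assumptions made for the example, not part of the webinar.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    """Mean squared error of a (slope, intercept) model."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic stand-in for a "large" dataset (assumed for illustration).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(20_000)]
ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]

# Rows are generated independently, so a prefix is a random subset.
train_x, train_y = xs[:15_000], ys[:15_000]
test_x, test_y = xs[15_000:], ys[15_000:]

scores = {}
for size in (100, 1_000, 15_000):
    model = fit_line(train_x[:size], train_y[:size])
    scores[size] = mse(model, test_x, test_y)
    print(f"{size:>6} training rows -> held-out MSE {scores[size]:.3f}")
```

If the held-out error stops improving well before the full dataset is used, as it does for this simple model, the extra data (and the distributed machinery needed to process it) is unlikely to pay off.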
CPU-bound situations come next, denoted in the top-left quadrant. Distributed computing is a good option here, but first also consider using a simpler model, or getting a bigger machine. Hugo reiterates:
“If you do not need distributed compute, do not use it.”
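The quadrant advice above can be summarized as a small decision helper. This is only a sketch. The function name `suggest_approach` and its two boolean inputs are hypothetical, and the case where a workload is both larger than memory and CPU-bound is not covered in the text above, so that branch is an assumption that simply combines the other suggestions.

```python
def suggest_approach(fits_in_ram: bool, cpu_bound: bool) -> str:
    """Map the two quadrant axes to the advice given for that quadrant."""
    if fits_in_ram and not cpu_bound:
        # Bottom-left: data and model fit in RAM.
        return "stay on a single machine"
    if not fits_in_ram and not cpu_bound:
        # Bottom-right: larger-than-memory data.
        return ("try a subset of the data first; "
                "go distributed only if the full dataset is needed")
    if fits_in_ram and cpu_bound:
        # Top-left: CPU-bound.
        return ("try a simpler model or a bigger machine first; "
                "then consider distributed compute")
    # Both constraints at once: not covered in the text, assumed here.
    return ("try a subset and a simpler model first; "
            "then consider distributed compute")


print(suggest_approach(fits_in_ram=True, cpu_bound=False))
print(suggest_approach(fits_in_ram=False, cpu_bound=False))
```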
The discussion now moves towards the impact of data science on decisions. Hugo asked his Twitter audience how much of their data work is actually used in decision making. The results are below:
We can see that more than half of the respondents say less than a quarter, and almost 80% say less than half is used. This is quite revealing, and Hugo calls it “Data Science Leakage”.
Data science started gaining popularity in the last decade, and in 2012 HBR called it “The Sexiest Job of the 21st Century”. There was a huge push in the industry to hire more data scientists. Bootcamps and online courses were created to serve this need, and universities started offering specialized data science programs. This raises the question: where is data science in the hype cycle?
Considering Google Trends as a proxy again, the following slide shows the searches for “data science” in red (and “big data” in blue). We can see data science hasn’t peaked yet!
Google searches for "data science" in red and "big data" in blue
The biggest wins for data science are also in technology, and automated decision making is just the tip of the iceberg. Wins in other spaces are unclear, but we clearly haven’t seen a win in informing the decision function. The cause seems to be cultural: in many organizations, data science work is difficult to translate into concrete actions.
Hugo discusses the interfaces that we need to figure out, and the key moving parts of the data culture to solve this problem. Watch the webinar replay to learn more!
Scalable computing brings new challenges for data scientists, management teams, and IT teams:
Do my machines all have the same software installed?
Can I share these same machines with my team?
What stops a novice leaving 100 GPUs idling?
Where are we spending money and how can we reduce this?
Is our sensitive data appropriately protected from external or internal threats?
We have written about these challenges extensively in The Unbearable Challenges of Data Science At Scale. At Coiled, we’re developing a product that takes care of these challenges for you, so that you can focus on data science! Try it for free at cloud.coiled.io.