The Coiled team held an internal Stack Overflow Sprint earlier this month. It was a week-long sprint during which we answered Dask-related questions on Stack Overflow and observed patterns in these questions to guide the production of community resources. After answering over fifty questions and engaging with over a hundred more, we’d like to share some lessons we learned. These lessons are applicable to any open source project aiming to understand its users’ pain points and explore themes to guide future documentation and content creation.
We’ll also discuss how we organized this sprint in a remote-first company and our ideas for future sprints. Would you be interested in participating in the next sprint? Let us know on Twitter or at firstname.lastname@example.org!
The Stack Overflow community has built a robust system for technologists to get accurate and quick answers to their questions. Many Dask users also share their problems there. In fact, there are currently over 3.5k questions with the ‘Dask’ tag.
Coiled has a lot of Dask expertise, and our team interacts with user questions on other platforms like the Dask issue tracker, Slack workspace, and the Gitter chat. So, we figured we could help users on Stack Overflow too. :)
The sprint had two primary goals: answering Dask-related questions on Stack Overflow, and identifying themes in those questions to guide future documentation and community resources.
Since this was Coiled’s first-ever Stack Overflow sprint, a secondary goal was also to test this format and gather lessons to help plan future events.
Our team spans five time zones, from India to the Pacific Coast. Organizing a sprint in a fully remote company posed a challenge…but also a fun opportunity! Here are some lessons we learned from weeks of planning and the impromptu adjustments during the sprint.
We found that defining clear goals and day-to-day tasks was instrumental in getting everyone across the globe aligned on the expectations. The Data Science Evangelists at Coiled led this initiative and worked closely with the Open Source Dask Engineers.
Our pre-sprint plan included rough estimates for the number of questions to answer and triage (e.g., identify themes, gauge complexity) each day. We decided to start with the “most upvoted questions,” thinking these would be the questions with the most ‘watchers’ and, therefore, the most helpful to answer for our users (but we quickly changed direction, keep reading!). Our planned workflow was for the Evangelists to triage and answer questions first, then work through any tricky questions with the Dask Engineers.
Planning is valuable; however, there’s nothing like actual on-the-ground experience to inform what you should be paying attention to and what is (or isn’t!) working.
As an example, we quickly realized that many of the “most upvoted questions” lacked a minimal, reproducible example or were nuanced questions requiring deep expertise. Moreover, many of these were actually ‘stale’ questions that had been unanswered for years. Hence, after Day 1, we changed course to the newest questions. Even though many questions still lacked a verifiable example, this gave us an opportunity to engage with the community on a current issue and work towards getting such an example.
Continuous context switching not only reduces overall productivity but can also impact motivation. Switching between different topics started to weigh on us, so from Day 3 onwards, we decided to answer questions:
Coordinating across time zones can be tricky, but we can also use it to our advantage by splitting the effort across groups from the beginning.
To make the most of the different time zones, our team divided the sprint into four “sessions”:
On the final day, we reflected on the sprint and created an action plan for how we can do better next time. This was incredibly helpful not only for understanding our own workflow better but also for communicating the results and learnings to the broader team and the community. This very blog post is an outcome of the retrospective!
A major takeaway for us was the themes and patterns that we identified. More than half of all Dask questions on Stack Overflow involved Dask DataFrame. This was not a surprise because Dask DataFrame provides a gateway into Dask for many pandas-using data scientists. It was still nice to see this trend backed up by concrete data. Users had questions about DataFrame operations like groupby and set_index, and wanted more clarity around reading and writing different types of data.
Dask Array was the next most popular Dask collection, and users were interested in how it differs from the NumPy API. This was followed by questions around Dask’s distributed scheduler, memory usage, and diagnostic tools. Interestingly, xarray stood out among the projects that use Dask internally.
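To show the flavor of those Dask Array questions, here is a minimal sketch (assuming dask and numpy are installed) of how a Dask Array mirrors much of the NumPy API while staying lazy until `.compute()`:

```python
import numpy as np
import dask.array as da

# Wrap a NumPy array, splitting the columns into two chunks.
x = da.from_array(np.arange(12).reshape(3, 4), chunks=(3, 2))

# Familiar NumPy-style operations build a lazy task graph;
# .compute() materializes the result as a NumPy array.
result = (x + 1).sum(axis=0).compute()
```

The key difference users asked about is that results are chunked and lazy, so some NumPy patterns, such as in-place mutation, do not carry over directly.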
We also noticed that many users could benefit from the pointers in the Dask best practices guide, and we recommend checking it out!
For the next iteration, we’re considering an open event that involves the entire Dask community. We had a lot of fun during the sprint, and we’d love to share it with more people. We’re also very excited to create more resources based on the themes mentioned earlier!
Overall, this was an enjoyable and rewarding experience for us, and we can’t wait for the next one!
Thanks for reading!