The 8 deadly sins of a data science project

Divij Shah

March 30

At Tunica, when we started executing data analytics projects, our focus was always on speed and on insights that help the business. As a result, we initially did not run retrospectives on this work the way we did for our other projects. So, while executing our latest data analytics project for an edtech company, we took a step back to identify the errors we had commonly made in the past. This article highlights those findings, and we hope it helps you avoid the same mistakes in your own data analytics projects.

Not defining the problem

Step 0 of any data analytics project should be defining the problem, because a poorly defined problem is the root of most issues that crop up later. If the problem is not defined well enough, any solution you arrive at is just an illusion. Before the project starts, examine the problem, evaluate all of its components and involve all the stakeholders to create an action plan.

Improper and insufficient documentation

The projects that always stood out for us were the ones that were well documented from the start. When we are deep in analytics work, a few things might seem obvious to us, such as the sources of the data, the time periods covered and the transformations applied to clean the data. These may not seem worth documenting at that point, but later in the project the omission can snowball into a misunderstanding of the data, wrong assumptions and wrong recommendations, and ultimately into losing clients. Whether the project is big or small, it always pays off to document procedures accurately. Documentation also comes in handy when you want to present trends or statistics. Context helps the user interpret the data. For example, a peak might represent the start of a new campaign, while a drop might reflect the seasonal nature of a business. Remember, there is a story behind the data, and presenting it in an understandable way is the analyst's job.

Focusing on the wrong metric

At the very start of a project, small wins can be very tempting. Your focus might shift to irrelevant metrics that boost morale but distract you from the goal you initially set and from the metrics that really matter. Think of sailors using the North Star to find their way home. Find your North Star: the few metrics that drive the Key Performance Indicators you have agreed on, and aim to measure them accurately.

Relying on the summary

Stakeholders outside the analytics and data engineering teams tend to be too busy for detailed discussions, which prompts analysts to offer summary metrics on which decisions are then based. But averages hide variance, and relying on them can mean missing it. A very simple example: your paid ads might get, say, 3000 clicks a week on average, but that does not mean those weeks had similar user behaviour or similar results in terms of conversions. In the same example, in the age of cross-platform marketing, it is also important to analyse behaviour per device rather than merging all the data into a single pool and evaluating it as a single source.
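To make the point about averages concrete, here is a minimal sketch in Python. The weekly figures are made up for illustration and the `summarize` helper is our own hypothetical name, not part of any library. It reports the spread alongside the mean, so weeks with similar click counts but very different conversion rates stand out.

```python
from statistics import mean, pstdev

# Hypothetical weekly figures for one paid-ads campaign (illustrative only).
weekly_clicks = [3000, 3100, 2900, 3050, 2950]
weekly_conversions = [150, 40, 160, 35, 145]


def summarize(values):
    """Return the mean together with the spread measures an average hides."""
    return {
        "mean": mean(values),
        "std_dev": pstdev(values),
        "min": min(values),
        "max": max(values),
    }


clicks_summary = summarize(weekly_clicks)

# Conversion rate per week: conversions divided by clicks.
conversion_rates = [c / k for c, k in zip(weekly_conversions, weekly_clicks)]
rates_summary = summarize(conversion_rates)

print("clicks:", clicks_summary)
print("conversion rate:", rates_summary)
```

In this made-up data the clicks average exactly 3000 a week, yet the best week converts several times better than the worst one. The same helper could be run per device to avoid pooling everything into a single source.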

Equating correlation and causation

This is perhaps the easiest trap to fall into. Several strategies are usually being executed at the same time, and staying in a silo of data while ignoring the business side can make us believe things that are completely false. Let’s say a Facebook Ads campaign is running and, over the same period, organic traffic to your website increases. It’s easy to believe that one led to the other, but you also need to check whether something else (say, a shoutout on social media by someone famous) could explain the increase.
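One simple sanity check, sketched below in Python, is to recompute the correlation after setting aside the days affected by a known outside event. All numbers are invented for illustration, and the `pearson` helper and the `shoutout_day` flags are our own hypothetical names. This check does not prove or disprove causation; it only shows how much of the apparent relationship depends on those days.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical daily data (illustrative only).
ad_spend = [100, 120, 110, 130, 300, 320, 310, 125, 115, 105]
organic_visits = [500, 510, 490, 500, 900, 950, 920, 495, 505, 500]
# Days on which a celebrity shoutout was driving traffic (assumed known).
shoutout_day = [False, False, False, False, True, True, True, False, False, False]

overall = pearson(ad_spend, organic_visits)

# Keep only the days without the outside event and recompute.
normal_days = [
    (spend, visits)
    for spend, visits, flagged in zip(ad_spend, organic_visits, shoutout_day)
    if not flagged
]
without_shoutout = pearson(
    [spend for spend, _ in normal_days],
    [visits for _, visits in normal_days],
)

print(f"all days: {overall:.2f}")
print(f"excluding shoutout days: {without_shoutout:.2f}")
```

In this invented data the correlation over all days is very strong, but it mostly disappears once the shoutout days are excluded, which is a hint that the ad spend alone does not explain the organic traffic.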

Not verifying the integrity of the data

In a lot of cases, analysts are not able to check the integrity of the data, so missing values, rounding errors, duplicates and similar problems go unnoticed. This is where Exploratory Data Analysis plays a crucial role. In a few cases, reports are also built on the results of other reports, which can mean indirectly relying on old data or invalid assumptions and makes the results unreliable. A simple mantra to remember is garbage in, garbage out.
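As a rough sketch of the kind of checks we mean, the Python snippet below counts missing values per field and finds exact duplicate rows. The records and field names are invented, and `missing_counts` and `duplicate_rows` are hypothetical helpers written for this example. A real project would likely use a dataframe library and many more checks.

```python
from collections import Counter

# Hypothetical records, as they might arrive from an export (illustrative only).
records = [
    {"user_id": 1, "signup_date": "2023-01-05", "plan": "basic"},
    {"user_id": 2, "signup_date": None, "plan": "premium"},
    {"user_id": 3, "signup_date": "2023-01-09", "plan": ""},
    {"user_id": 1, "signup_date": "2023-01-05", "plan": "basic"},  # exact duplicate
]


def missing_counts(rows):
    """Count, per field, how many rows have None or an empty string."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None or value == "":
                counts[field] += 1
    return dict(counts)


def duplicate_rows(rows):
    """Return one copy of every row that appears more than once."""
    # Sorting the items makes the key independent of field order.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    return [dict(key) for key, count in seen.items() if count > 1]


print("missing:", missing_counts(records))
print("duplicates:", duplicate_rows(records))
```

Running checks like these before any analysis, and again whenever the source data is refreshed, makes it harder for stale or broken inputs to slip into a report.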

Not talking to domain experts

This point is about the way an analyst interacts with the outside world. As an analyst, it is easy to believe that our tools and algorithms can solve any business problem in the world. But that cannot be done from within our silos and the comfort of our chairs. Interacting with domain experts is an unavoidable part of the job, because they provide insights that are not directly visible in the data. This is, in fact, a two-way street. The domain experts need to talk to you too: you are building a solution for them, and they need to learn from you how to use it. It’s one thing to build a solution and another to communicate it to a non-technical user base.

Thinking deep learning and big data are the key to everything

We have often heard data analysts say things like "maybe we need more data" or "we need to implement the solution in TensorFlow". While these can be valid solutions, they may be impractical for most projects. It is very important to understand the power of simple solutions, which can achieve 80% to 100% of the work in most cases. Being able to build solutions without being drawn to fancy models is a skill that all analysts need to master. Don’t get us wrong, we are not against complicated solutions, but you need to start with baselines and judge whether the gains from the complex ones justify the extra cost and effort.
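Here is a minimal sketch of what starting with a baseline can look like, in plain Python. The data is invented, and the helper names (`mae`, `fit_line`) are our own. The trivial baseline always predicts the training mean; the least-squares line stands in for whatever candidate model you are considering, and the same comparison applies when the candidate is a much heavier model.

```python
from statistics import mean


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return mean([abs(a - p) for a, p in zip(actual, predicted)])


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mean_x, mean_y = mean(xs), mean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - slope * mean_x, slope


# Hypothetical training and hold-out data (illustrative only).
train_x, train_y = [1, 2, 3, 4, 5], [12, 14, 15, 18, 21]
test_x, test_y = [6, 7], [22, 25]

# Baseline: always predict the average of the training targets.
baseline_prediction = mean(train_y)
baseline_mae = mae(test_y, [baseline_prediction] * len(test_y))

# Candidate: a straight line fitted to the training data.
intercept, slope = fit_line(train_x, train_y)
model_mae = mae(test_y, [intercept + slope * x for x in test_x])

# Share of the baseline error that the candidate removes.
improvement = 1 - model_mae / baseline_mae

print(f"baseline MAE: {baseline_mae:.2f}")
print(f"candidate MAE: {model_mae:.2f}")
print(f"relative improvement: {improvement:.0%}")
```

The point of the comparison is the last number: if a more complex model only improves on a simple one by a small margin, that margin has to be weighed against the extra cost and effort of building and maintaining it.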

How we try to avoid these problems at Tunica

Reliable processes and due diligence matter most when it comes to combating the above mistakes. You need a process that focuses on getting things right. While executing a project, there need to be checks and balances that make you think about where the data is coming from, whether there are existing biases, how the hypotheses are being tested, and so on. Tackling a data science project requires advance planning, and you also need to consider ways to keep refining your work and improving your processes over time. It may take time and commitment, but with the right people and culture, these mistakes are avoidable.