Defining correlation
In statistics, correlation refers to the relationship between two variables and how they move in relation to each other. Specifically, correlation looks at how changes in one variable correspond with changes in another variable. It gives a quantitative measure of the degree to which two variables move in tandem.
Let's look at user logins and software product updates. Often, these two share a potent positive correlation. When a new update or version of the product is released, it can fuel a surge in user logins as users are keen to explore new features and improvements. Therefore, strategically timing your product updates can effectively drive user engagement.
On the flip side, think about the relationship between application bugs and user sessions. Typically as the frequency of application bugs increases, your user sessions tend to show a decrease. Users may avoid using the application due to bugs, thereby lowering your product's usage statistics.
Understanding these correlations in your product analytics allows for proactive strategy planning and decision making - equipping you to maximize user engagement while simultaneously minimizing user churn.
Correlation is usually measured with a correlation coefficient, such as Pearson's r, on a scale from -1 to +1. A correlation of -1 indicates a perfect negative correlation, meaning the variables move in completely opposite directions. A correlation of +1 indicates a perfect positive correlation, meaning the variables move in exactly the same direction. A correlation near 0 means the variables have little or no linear relationship, although they may still be related in a nonlinear way.
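To make the -1 to +1 scale concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The weekly bug and session counts are invented for illustration and do not come from any real product.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical weekly data, invented for illustration only.
bugs_reported = [2, 4, 5, 7, 9, 12]
user_sessions = [980, 940, 930, 870, 820, 760]

r = pearson_r(bugs_reported, user_sessions)
print(f"r = {r:.2f}")  # strongly negative: sessions fall as bugs rise
```

A value this close to -1 only says the two series move in opposite directions in this sample. It says nothing about why.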
It's important to note that correlation does not imply any causal relationship between the variables. Just because two variables move together does not mean that one is causing the changes in the other. There may be other hidden factors driving the correlation. However, studying correlations can point to potential relationships worth exploring further through more rigorous statistical analysis.
Defining causation
In product analytics, it's all about understanding the relationships between variables. Take causation, for example: when one variable shifts, it causes a change in another. It's a step beyond correlation, into genuine cause and effect.
Imagine you introduce a personalized onboarding process for your product. As a result, user retention rates tick up. That's causation in action: the new onboarding process (the cause) led to the improved retention (the effect). For product managers and data analysts, that's crucial intel for allocating resources and shaping strategy.
But here's the thing - while it's easy to get caught up in the causation game, it's equally important not to overdo it. Remember, not all correlations imply causation, and it's important to distinguish between the two. Doing the investigative work to understand the real relationship between variables keeps you from reading too much into the data, roots your decisions in the right evidence, and makes your approach to product analytics more balanced.
Proving causation requires controlled experimentation and testing. Simply observing that two variables change together is not enough to prove causation. Experiments must be conducted while controlling for all other possible influencing factors. This isolates the potential causal relationship between the two variables of interest.
In summary, causation means Variable A directly causes a change in Variable B. One variable influences the other through direct cause and effect. Causation enables us to make concrete predictions and test interventions based on established relationships.
Correlation does not imply causation
In the product analytics world, it's often hard to tell whether a relationship between two metrics reflects correlation or causation. Sure, when variables move together, it can indicate something interesting is happening. But remember, it doesn't tell you if one variable is causing changes in the other - there could be hidden influences causing these shifts.
Think about social media ad spend versus application signups. You might notice when you spend more on ads, you see more signups. But, wait a minute! That doesn't mean the increased ad spend is directly responsible for the signups. Some other factor - like a seasonal trend or a viral shoutout about your product - may be the real driver behind both your ad spend and new users.
Sometimes, correlations simply pop up by chance. Other times, underlying factors cause two variables to move together. Dodgy correlations can also surface when you're deep diving into big datasets. The big reveal here is that correlation, by itself, can't confirm if one variable is really influencing another.
Correlation can give you a rough idea of where to look, but it's key to dig a bit deeper to unearth the true drivers of your product's performance and growth.
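The point about chance correlations in big datasets can be illustrated with a small simulation. This is a sketch under stated assumptions: the 200 "metrics" are pure random noise generated here, and the sizes (12 weeks, 200 metrics) are arbitrary choices, not figures from any real analysis.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    return covariance / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

rng = random.Random(42)  # fixed seed so the run is repeatable
n_weeks = 12             # assumed: a short observation window
n_metrics = 200          # assumed: many unrelated metrics to sift through

# Every series is independent random noise, so no real relationship exists.
target = [rng.gauss(0, 1) for _ in range(n_weeks)]
metrics = [[rng.gauss(0, 1) for _ in range(n_weeks)] for _ in range(n_metrics)]

# Search all metrics for the one that "best explains" the target.
strongest = max(abs(pearson_r(metric, target)) for metric in metrics)
print(f"strongest |r| among {n_metrics} random metrics: {strongest:.2f}")
```

With only a dozen data points per series, scanning hundreds of unrelated metrics reliably turns up at least one that looks convincingly correlated, even though nothing connects them.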
Ways correlation can mislead
Just because two variables are correlated does not mean one causes the other. There are several ways seemingly correlated data can mislead.
Reverse causation - The presumed cause and effect may actually be reversed. For example, one might observe that people who sleep less tend to weigh more and assume that sleeping less causes weight gain. But the reverse may be true - being overweight might be causing people to lose sleep.
Lurking variable - A third factor may be creating a spurious correlation between two variables. For example, spending on science education and space shuttle launches are correlated in the US, but do not directly cause each other. Rather, the federal budget impacts both variables.
Coincidence - Some correlations are simply due to chance and disappear with more data. For example, US spending on science, space, and technology correlated with suicides by hanging, strangulation, and suffocation from 1999 to 2001, but this was clearly coincidental.
Cherry-picking time ranges or data subsets - Apparent correlations can be created by selectively choosing a narrow data range or limited dataset. Time series correlations may not hold over longer periods. And subgroups may show correlations that are not present when looking at the entire population. True correlations persist across larger datasets.
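The lurking-variable case above can be sketched as a simulation. Everything here is hypothetical: a made-up "seasonal demand" factor drives both ad spend and signups, and neither affects the other. One common way to check for this, assuming the lurking variable has actually been measured, is to remove its linear effect from both series and correlate what is left (a partial correlation).

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    return covariance / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def residuals(ys, zs):
    """What remains of ys after subtracting its least-squares linear fit on zs."""
    mean_y = sum(ys) / len(ys)
    mean_z = sum(zs) / len(zs)
    slope = (sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
             / sum((z - mean_z) ** 2 for z in zs))
    return [(y - mean_y) - slope * (z - mean_z) for y, z in zip(ys, zs)]

rng = random.Random(7)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical lurking variable: seasonal demand drives BOTH series.
season = [rng.gauss(0, 1) for _ in range(n)]
ad_spend = [s + rng.gauss(0, 0.5) for s in season]
signups = [s + rng.gauss(0, 0.5) for s in season]  # no dependence on ad_spend

raw_r = pearson_r(ad_spend, signups)
partial_r = pearson_r(residuals(ad_spend, season), residuals(signups, season))

print(f"raw correlation:         {raw_r:.2f}")
print(f"after removing season:   {partial_r:.2f}")
```

The raw correlation is strong, but it collapses toward zero once the seasonal factor is accounted for. In practice the hard part is that lurking variables are often unmeasured, which is why experiments are needed.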
Determining causal relationships
Establishing a definitive causal relationship between two variables requires going beyond correlation and conducting careful experiments. There are several key methods for determining causation:
Controlled Experiments in Product Analytics
Controlled experiments play a crucial role in product analytics, offering an approach to create data-driven product improvements. With a controlled experiment, one variable is intentionally adjusted while all others are kept consistent to isolate its effects.
As an example, think of a product development team wanting to test if a new user interface boosts engagement. They would create two user groups experiencing identical conditions and interactions with the product, but expose only one group to the new interface. Monitoring the user engagement within each group can then ascertain if the new interface drives higher engagement levels, showcasing evidence of causation.
Randomized Trials in Product Analytics
Randomized trials offer another effective tool in the product analytics arsenal, providing a way to eliminate potential biases in product development decisions. In a randomized trial, subjects are randomly divided into different groups. One group interacts with a specific feature or receives a treatment, whereas the other (control) group doesn’t.
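Random assignment itself can be very simple. The sketch below assumes you have a list of user IDs and want two groups of near-equal size; `assign_groups` and the `user_N` IDs are hypothetical names used only for illustration.

```python
import random

def assign_groups(user_ids, seed=0):
    """Randomly split users into control and treatment groups of near-equal size."""
    shuffled = list(user_ids)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split can be reproduced
    half = len(shuffled) // 2
    return {"control": shuffled[:half], "treatment": shuffled[half:]}

# Hypothetical user IDs, invented for illustration.
users = [f"user_{i}" for i in range(10)]
groups = assign_groups(users, seed=123)

print("control:  ", groups["control"])
print("treatment:", groups["treatment"])
```

Because chance alone decides who lands in each group, user traits such as tenure or plan type should balance out across the groups on average, and that balance is what removes selection bias.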
Consider a product team testing a new notification feature. Users are randomly assigned to either continue with the current notification system or try the new one. By comparing user engagement or response times between groups, product analysts can determine the real impact of the new feature. This method aids in identifying true causal relationships and supports unbiased, data-informed enhancements in your product development strategy.
Hypothesis Testing
Researchers start with a hypothesis about a potential causal relationship between variables. Experiments are designed to test the hypothesis. Statistical analyses determine if the results are consistent with the original hypothesis or if it should be rejected. Repeated hypothesis testing builds evidence for or against causation.
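One simple way to run such a test is a permutation test, shown here as a sketch rather than the only valid method. The hypothesis being tested is "the group labels make no difference": if that were true, shuffling the labels should produce differences as large as the observed one fairly often. The session counts are invented for illustration.

```python
import random

def permutation_test(control, treatment, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns (observed_difference, p_value).
    """
    rng = random.Random(seed)
    observed = sum(treatment) / len(treatment) - sum(control) / len(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    n_treatment = len(treatment)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = (sum(pooled[n_control:]) / n_treatment
                - sum(pooled[:n_control]) / n_control)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero.
    return observed, (extreme + 1) / (n_permutations + 1)

# Hypothetical weekly sessions per user, invented for illustration.
control = [3, 4, 2, 5, 4, 3, 4, 2, 3, 4]
treatment = [5, 6, 4, 7, 5, 6, 5, 4, 6, 5]

observed, p_value = permutation_test(control, treatment)
print(f"observed difference in means: {observed:.2f}")
print(f"p-value: {p_value:.4f}")
```

A small p-value means the observed difference would be rare if the labels were meaningless, so the "no difference" hypothesis is rejected. It does not by itself say how large or how valuable the effect is.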
A/B Testing
A/B tests compare two versions of something to see which performs better. Common in marketing and web development, two versions (A and B) are shown randomly to users to test differences in click-through rate, conversion rate, or other metrics. Consistent differences point to a causal relationship between the page variation and the observed results.
Carefully designed experiments reduce external influences and isolate relationships between variables. Valid statistical analysis of the results can provide strong evidence for making causal conclusions.
Using correlation in product analytics
Correlations in your product data can point to interesting areas for further research and potential growth opportunities. For example, you may notice a correlation between increased feature usage and lower customer churn. This suggests that customers who use that feature more end up sticking around longer.
While such correlations don't prove causation, they can help prioritize areas to investigate further. Focus first on high-value correlations - ones that involve metrics critical to your business. A usage-churn correlation may warrant more research since reducing churn is likely a top priority.
Before acting on any correlation though, you need to confirm causation. Look to run experiments like A/B tests to determine if changing one variable directly impacts another. Without establishing causation, you risk wasting development resources by acting on meaningless correlations.
The key is using correlations to find promising areas to research while avoiding the assumption they represent causal relationships. Approach all correlations cautiously, but pay special attention to high-value ones since those point to the biggest potential gains if you can uncover an underlying causal mechanism. Just be sure to validate causation through rigorous testing before launching any major initiatives.
A/B testing for causation
A/B testing, also known as split testing, is a scientific approach to determine a causal relationship between two variables. It allows you to isolate specific variables and test their impact on a desired outcome.
To run an effective A/B test:
Randomly assign a sample of users to a control group or one or more variation groups. This eliminates selection bias in the experiment.
Only test one variable at a time. Changing multiple variables makes it impossible to tell which one impacted the outcome.
Make the variation and control experiences as identical as possible except for the one variable being tested. This isolates the variable and clarifies causation.
Control for external factors that could influence the results, like seasonality and major events. Run the test long enough to account for natural fluctuations.
Analyze the statistical significance of the results. A difference between the control and variation that is too large to be plausibly explained by chance, given your sample size, is strong evidence that changing the variable caused the difference in outcome.
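For the last step, a common choice when the metric is a conversion rate is a two-proportion z-test. The sketch below uses only the standard library and assumes samples large enough for the normal approximation to hold; the function name and the conversion counts are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    Returns (z, p_value). Relies on the normal approximation, so it
    assumes reasonably large samples in both groups.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("test is undefined when the pooled rate is 0 or 1")
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal beyond |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical results: 2,000 users per group, invented for illustration.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000,   # control: 10%
                                   conv_b=260, n_b=2000)   # variation: 13%
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Decide on the significance threshold (0.05 is a common convention) and the sample size before the test starts. Checking repeatedly and stopping as soon as the result looks significant inflates the false-positive rate.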
A/B testing is a powerful way to identify causal relationships and make data-driven decisions. By narrowing the test to one variable and controlling other factors, you eliminate misleading correlations and reveal true causation. This enables confident product optimization based on causal impacts validated through experimentation.
Key takeaways
The key takeaway is that correlation does not imply causation. Just because two variables are correlated does not mean that one causes the other. Correlation simply means that two variables move in tandem, but the underlying cause of their relationship may not be known. The important thing is not to assume any correlation you observe is evidence of causation. Doing so can lead you down the wrong path.
That said, correlations can provide useful signals about relationships worth investigating further through rigorous testing. You may find meaningful associations in your product analytics that point to promising opportunities. But the next step is designing experiments to confirm or reject true causation before acting.
Leverage correlation to generate hypotheses and ideas. But rely on controlled tests to prove causation before making business decisions. Understanding this distinction is crucial to avoiding pitfalls and drawing accurate conclusions from your data.
Published on
Jan 19, 2024
in
Data
Chase Wilson
CEO of Flywheel