Interpreting Data with Simpson’s Paradox

Updated on Mar 21, 2023 13:05 IST

In this article, we cover how to interpret data with Simpson’s Paradox. It also includes topics such as the drawback of Simpson’s Paradox and how to avoid it.

Table of Contents

  1. Introduction
  2. Simpson’s Paradox: Meaning, Definition with Example
  3. Lurking Variable
  4. The drawback of Simpson’s Paradox
  5. How to Avoid Simpson’s Paradox
  6. History of Simpson’s Paradox

Introduction

Simpson’s Paradox is a situation in which the same data leads to a different, sometimes opposite, conclusion depending on how it is grouped. The growing volume of digital data and the increasing reliance on data-driven decision-making have raised the importance of Data Mining.

Data Mining organizes extensive data into a structured form for further processing and identifying hidden patterns. It uses various tools and techniques to organize unstructured data. 

In this article, we will learn about Simpson’s Paradox and how to resolve it. The article uses several examples to illustrate what Simpson’s Paradox means for data interpretation.

Simpson’s Paradox: Meaning, Definition with Example 

Simpson’s Paradox is a situation in which a relationship that appears within each subgroup of the data changes, or even reverses, when all the subgroups are combined. The reversal is caused by a hidden variable that affects how the data is distributed across the subgroups. Simpson’s Paradox matters in real-world data because it shows how the same data can support different conclusions depending on how it is viewed.

The definition can be confusing at first, so an example helps to show how it affects data interpretation.

Consider the most famous example of Simpson’s Paradox: the UC Berkeley gender bias case in graduate admissions.

In 1973, the university’s admission figures showed that about 44% of male applicants were admitted, compared with about 35% of female applicants. People started saying that the university was favoring men over women, i.e., gender discrimination. With lawsuits a possibility, Peter Bickel, a statistician, examined the admissions data and got surprising results.

Looking at the six largest departments, he saw that four of them admitted women at a higher rate than men. In the remaining two departments, the difference was small and did not point to discrimination against women.

He concluded that women tended to apply to the more competitive departments, which admitted a smaller share of their applicants, while men tended to apply to departments with higher admission rates. Once Peter broke the data down by department, it showed no favoring of either sex. Admission decisions were made department by department, so the gender discrimination suggested by the combined figures was not the real problem.

This is a Simpson’s Paradox situation, where interpreting the data in small groups gives completely different results than combining all the data into one large group. Recognizing Simpson’s Paradox helps reveal the real insight in the data.
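
The reversal can be reproduced with a few lines of Python. The sketch below uses made-up applicant counts for two hypothetical departments (these numbers are assumptions for illustration, not the real Berkeley figures) and compares the combined admission rate with the rate inside each department.

```python
# Hypothetical admissions data (NOT the real Berkeley figures).
# Each record: (department, sex, applicants, admitted)
records = [
    ("Dept 1", "men", 800, 480),
    ("Dept 1", "women", 100, 70),
    ("Dept 2", "men", 200, 40),
    ("Dept 2", "women", 900, 225),
]


def admission_rate(rows):
    """Return total admitted divided by total applicants for the given rows."""
    applicants = sum(row[2] for row in rows)
    admitted = sum(row[3] for row in rows)
    return admitted / applicants


# Combined view: ignore the department and look at each sex as one group.
overall = {
    sex: admission_rate([r for r in records if r[1] == sex])
    for sex in ("men", "women")
}

# Subgroup view: compute the rate separately inside each department.
by_department = {
    (dept, sex): admission_rate(
        [r for r in records if r[0] == dept and r[1] == sex]
    )
    for dept in ("Dept 1", "Dept 2")
    for sex in ("men", "women")
}

for sex, rate in overall.items():
    print(f"Overall {sex}: {rate:.1%}")
for (dept, sex), rate in by_department.items():
    print(f"{dept} {sex}: {rate:.1%}")
```

With these assumed numbers, women have the higher admission rate in both departments, yet men have the higher rate overall, because most women applied to the department that admits fewer applicants.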

Lurking Variable 

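A lurking variable is a hidden variable that is not part of the original comparison but influences the result. Accounting for it can reveal what is really happening in the data, while ignoring it can lead to a misleading conclusion. Lurking variables are often challenging to identify, but data scientists who examine the data closely can find them and use them to reach sounder conclusions.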

Consider an example to understand the meaning of the Lurking variable in Simpson’s Paradox.

A soft drink company wants to launch 2 new soft drink flavors, one Mango and the other Pineapple. The audience tasted both flavors and voted. It was found that 75% of people liked the mango flavor, and 60% of people liked the pineapple flavor. From the results, the mango-flavored soft drink is the winner. But the company wants to look deeper into the results.

The data mining team rechecks the data and identifies that sex can be a deciding variable in finding the users interested in each soft drink.

In the above example, sex (male or female) is the lurking variable that gives a new result from the same data. It was easy to identify the lurking variable in the above situation, but in practice it can be challenging.
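
As a rough illustration of what the data mining team might see, the sketch below splits the taste-test votes by sex. Only the 75% and 60% totals come from the example above; the per-sex counts are hypothetical values chosen to reproduce those totals.

```python
# Hypothetical taste-test counts. Only the overall 75% (mango) and
# 60% (pineapple) figures come from the example; the split by sex is assumed.
# votes[flavor][sex] = (tasters, likes)
votes = {
    "mango": {"women": (500, 450), "men": (500, 300)},
    "pineapple": {"women": (500, 200), "men": (500, 400)},
}


def like_rate(groups):
    """Return total likes divided by total tasters over (tasters, likes) pairs."""
    tasters = sum(t for t, _ in groups)
    likes = sum(l for _, l in groups)
    return likes / tasters


# Combined view: one number per flavor.
overall_likes = {
    flavor: like_rate(by_sex.values()) for flavor, by_sex in votes.items()
}

# View split by the lurking variable: which flavor does each sex prefer?
preferred_by = {
    sex: max(votes, key=lambda flavor: like_rate([votes[flavor][sex]]))
    for sex in ("women", "men")
}

print(overall_likes)
print(preferred_by)
```

Under these assumptions mango still wins overall, but the split shows that men prefer pineapple, which is information the combined percentages hide.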

The drawback of Simpson’s Paradox

Simpson’s Paradox does not lead to one solid conclusion. The same data can be interpreted in many ways by grouping it on different variables and viewpoints, so it becomes difficult to settle on what the data as a whole means.

When one data set supports different conclusions, preparing a strategy from it becomes challenging.

However, one can use Simpson’s Paradox positively by examining which grouping actually answers the question at hand, and using that conclusion to reveal the truth.

How to Avoid Simpson’s Paradox

One can avoid Simpson’s Paradox by closely examining the experiment and all related variables. Simpson’s Paradox arises when the conclusion drawn from the combined data differs from the conclusion drawn from the same data split into small groups. Looking at the subgroups can show the data in a completely new light.

For example:

You and your partner are going out for dinner but cannot agree on where to go. Your choice is “The Heaven,” while your partner is interested in “Foodie.” To resolve the situation, both of you decide to consult the modern oracle: customer reviews. “The Heaven” is rated a good place to dine by 40% of men and 60% of women. In contrast, “Foodie” is rated well by 70% of men and 30% of women. Still confused, your partner combines the same reviews and finds that 65% of the combined population favors Foodie.

Here, the same data gives two different results, depending on whether the reviews are split by sex or combined. Which result to use depends on the question being asked.

Restaurant    Male   Female
Foodie        70%    30%
The Heaven    40%    60%
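
A combined figure is a weighted average of the per-group percentages, so it depends on how many men and women left reviews. The sketch below uses the percentages from the table; the reviewer counts are hypothetical, since the example does not state them, and the "mostly men" mix is chosen only because it happens to reproduce a 65% combined figure for Foodie.

```python
# Share of positive votes per restaurant and sex, taken from the table.
positive_share = {
    "Foodie": {"men": 0.70, "women": 0.30},
    "The Heaven": {"men": 0.40, "women": 0.60},
}


def combined_share(shares, reviewers):
    """Weighted average of per-group shares, weighted by reviewer counts."""
    total_reviewers = sum(reviewers.values())
    positive = sum(shares[group] * reviewers[group] for group in shares)
    return positive / total_reviewers


# Hypothetical reviewer mixes (assumptions, not from the article).
mostly_men = {"men": 700, "women": 100}
balanced = {"men": 400, "women": 400}

for name, shares in positive_share.items():
    print(
        f"{name}: mostly men {combined_share(shares, mostly_men):.1%}, "
        f"balanced {combined_share(shares, balanced):.1%}"
    )
```

With a balanced set of reviewers the two restaurants come out equal, while a set dominated by men pushes Foodie ahead, so the combined number says as much about who reviewed as about the restaurants.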

The restaurant selection situation shows how Simpson’s Paradox can arise. It cannot always be avoided, but one can handle it smartly to get results that reflect the real world.

Simpson’s Paradox can be resolved by being clear about what the result is needed for and by closely examining the frequency tables. Understanding it helps data engineers and statisticians reveal the truth and the hidden patterns in the data.
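
Examining the frequency tables can be partly automated. The function below is a minimal sketch, not a standard library routine: `find_reversals` and its input layout are names assumed here for illustration, and the counts are hypothetical. It compares the winner in the combined data with the winner inside each subgroup and reports the subgroups that disagree.

```python
def success_rate(trials, successes):
    """Return successes divided by trials."""
    return successes / trials


def find_reversals(strata, groups=("A", "B")):
    """Compare the combined winner with the winner in each subgroup.

    strata maps a subgroup name to {group: (trials, successes)}.
    Returns (overall_winner, list of subgroup names where the other group wins).
    """
    # Add up trials and successes per group across all subgroups.
    totals = {g: [0, 0] for g in groups}
    for counts in strata.values():
        for g in groups:
            trials, successes = counts[g]
            totals[g][0] += trials
            totals[g][1] += successes

    overall_winner = max(groups, key=lambda g: success_rate(*totals[g]))

    # Subgroups whose local winner differs from the combined winner.
    disagreeing = [
        name
        for name, counts in strata.items()
        if max(groups, key=lambda g: success_rate(*counts[g])) != overall_winner
    ]
    return overall_winner, disagreeing


# Hypothetical frequency table: (trials, successes) per group and subgroup.
example = {
    "hard": {"A": (400, 120), "B": (100, 25)},
    "easy": {"A": (100, 90), "B": (400, 340)},
}

winner, reversed_in = find_reversals(example)
print(f"Combined winner: {winner}; subgroups that disagree: {reversed_in}")
```

If every subgroup appears in the returned list, as it does for this made-up table, the combined result is fully reversed inside the subgroups, which is the signature of Simpson’s Paradox and a signal to look for a lurking variable.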

History of Simpson’s Paradox

The effect behind Simpson’s Paradox was noticed early on by several statisticians. First, in 1899, Karl Pearson, a great statistician, described a similar effect in his research.

In 1903, Udny Yule also gave proof of similar effects with the association of different variables.

In 1951, Edward H. Simpson described the effect in a technical paper, and Simpson’s Paradox got its name from this great statistician.

Simpson’s Paradox is also known as the Yule-Simpson effect, Simpson’s reversal, the reversal paradox, or the amalgamation paradox.

Simpson’s Paradox is a phenomenon in probability and statistics in which a result that appears in each small group of disaggregated data disappears or reverses when the groups are aggregated.

Conclusion

Understanding Simpson’s Paradox helps in finding the true meaning of the data. It also exposes the limitations of the data and points to the hidden variables that can change the whole meaning of the result.

A Simpson’s Paradox situation does not always arise, and it misleads only if we do not interpret the data from all relevant angles. As businesses depend more on results drawn from digital data, awareness of Simpson’s Paradox helps in choosing the view of the data that genuinely fits the business question.

We have reached the end of this article, and I hope it serves your purpose of understanding Simpson’s Paradox.

Contributed by Sonal Meenu singh

FAQs

When can Simpson's paradox be used?

Simpson’s Paradox is relevant whenever there is some association between variables. Statisticians encounter it in situations where regrouping the data reverses the result, giving a new viewpoint on the same data. Examining the hidden variables, called lurking variables, lets the situation be used in a positive sense to arrive at results that reflect what is really happening.

What is the primary reason for Simpson's paradox?

The main reason for Simpson’s Paradox is the association between different variables. A Simpson’s Paradox situation arises when a result seen in the small groups of disaggregated data disappears or reverses when the groups are aggregated. The effect was described by Edward H. Simpson in his technical paper in 1951. Statisticians and data scientists use Simpson’s Paradox situations to gain insight into the data and to closely observe its lurking variables. Because the same data can be grouped on different variables to give different results, it is sometimes difficult to reach a single conclusion.

What is an example of Simpson's paradox?

There are various examples of Simpson’s Paradox, and the most famous is the admission process at UC Berkeley in 1973. The admission figures showed that about 44% of male applicants were admitted, compared with about 35% of female applicants. The result was shocking, and people started saying that admission was biased by gender discrimination. Understanding the seriousness of the situation, the statistician Peter Bickel examined the data and uncovered some hidden facts. He found that, among the 6 largest departments at Berkeley, 4 admitted women at a higher rate than men, and in the remaining 2 departments there was no clear male dominance. Women had tended to apply to the more competitive departments. Peter reversed the result by showing that admission was decided department by department. In this example, dividing the data into subgroups on the basis of departments resolved the apparent discrimination.

How does Simpson's paradox affect the outcome of a study?

Simpson’s Paradox can reverse the outcome of a study when the data is viewed from a different perspective, such as combined instead of split into groups. Sometimes, incorrect results attributed to Simpson’s Paradox are due to misinterpreting the data variables. The effect of this paradox on study outcomes can be seen in real life as well.

How to resolve Simpson's paradox?

It is possible to overcome Simpson’s Paradox by interpreting the data with its frequency table and the correlation between different variables. We can control a Simpson’s Paradox situation to some extent by answering questions like why we need particular data and how the results will be used, and then choosing the view of the data that fits the business need.

What is another name for Simpson's paradox?

Simpson’s Paradox got its name from the great statistician Edward H. Simpson. In 1951, he explained this paradox in his technical paper. Simpson’s Paradox is also known as the Yule-Simpson effect. The name also honors the statistician Udny Yule, who described a similar association among variables in his paper in 1903. Simpson’s effect arises when there is a difference between the marginal association and the partial association of two or more variables for the same data.

How likely is Simpson's paradox?

The likelihood of Simpson’s Paradox occurring is low, and it is found rarely in contingency tables. It is most likely to mislead if you do not observe the data very carefully or miss some association among variables.

What can you do with Simpson's paradox?

Knowing about Simpson’s Paradox lets you check your results by interpreting the data through different variables. The effect can arise unexpectedly. For example, you compare the runs of two batsmen, A and B, and find that A has more runs than B. Looking at the same situation divided into smaller groups, such as comparing the runs of A and B against each opposing team, can give different results.

Is Simpson's paradox an ecological fallacy?

Simpson’s Paradox is closely related to the ecological fallacy and is often discussed alongside other fallacies like the confusion between individual correlation and ecological correlation, the confusion between total average and group average, and the confusion between higher likelihood and average likelihood. Simpson’s Paradox combines statistics and probability in comparing the correlation within small groups of data with the correlation in the combined data.

What does Simpson's paradox teach us?

Simpson’s Paradox is a situation or effect that shows how results can be reversed when a population is divided into small groups on the basis of underlying variables.
