Simpson's Paradox in the context of market share calculation

Note: Please use this page with a modern web browser such as Google Chrome (the HTML5 canvas element is used).

## Abstract

In this article, a statistical phenomenon called Simpson's paradox, or Simpson's reversal, is discussed using an example based on a competitor's market share. The marketing perspective is complemented by a look at vector algebra, which makes it possible to visualize the paradox.

An interactive graph is intended to make the phenomenon easier to understand. The market share (i.e., the slope of a vector) can be changed by drag and drop while observing the outcome.

## Definition

Simpson's paradox is defined as the effect "in which a trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregated data".

From the article's point of view, we consider the development of market share in two different countries and for the corresponding aggregation of these two countries. The effect discussed may certainly appear for more than two countries in a region, but we want to keep things simple at first.

We assume activity in two countries, which we name country 1 and country 2. The market player is denoted as the competitor, and we interpret the numbers given for this competitor as sales volume. The two countries are managed on a regional (aggregated) level, and we simply denote this aggregation as the region.

Furthermore, the total in each country is called the market; it represents the total sales volume of the country. The market share is defined as the competitor's sales volume divided by the total sales volume of the country, i.e., the market. (The market share represents the percentage of a market held by a specific competitor (or a specific product) in terms of revenue or volume.) Even though it would be mathematically possible, a real-world market share cannot be higher than 1 (100%). Finally, we call the change of market share from one period to another the gain/loss or the development of the market share.

In simple terms, the effect of Simpson's paradox for the scenario at hand is that:

The market share gain/loss for a region might be negative, although the market share gain/loss of all countries in this region is positive (or vice versa). This effect is considered a paradox.

Vice versa: The market share gain/loss for a region might be positive, although the market share gain/loss of all countries in this region is negative.

## Explanation

A competitor's market share can be defined as the volume for this competitor ($v_c$) divided by the volume of the total market ($v_m$), or $v_c / v_m$. This market share can be expressed as a vector $(v_m, v_c)$, i.e., with the market volume on the horizontal axis and the competitor's volume on the vertical axis, so that the slope of the vector is $v_c / v_m$.

We should further note that the aggregated market share is the competitor's volume in country 1 plus the competitor's volume in country 2, divided by the total market volume in country 1 plus the total market volume in country 2. This differs from the arithmetic of fractions (cf.: Adding unlike quantities) but equals the calculation of a vector sum (cf.: Vectors: Addition and subtraction).
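As a minimal sketch of this difference, the following Python snippet compares the vector sum with the plain addition of fractions. The 2013 volumes are taken from this article's calculation example; the helper names `vector_sum` and `slope` are illustrative assumptions, and vectors are stored as (market, competitor) so that the slope equals the market share.

```python
from fractions import Fraction

def vector_sum(vectors):
    """Add (market, competitor) vectors component by component."""
    return (sum(v[0] for v in vectors), sum(v[1] for v in vectors))

def slope(vector):
    """Market share = competitor volume / market volume."""
    market, competitor = vector
    return Fraction(competitor, market)

# (market volume, competitor volume) per country, 2013 values
country_1 = (10, 6)
country_2 = (10, 2)

region = vector_sum([country_1, country_2])
aggregated_share = slope(region)                     # 8 / 20
fraction_sum = slope(country_1) + slope(country_2)   # 6/10 + 2/10

print(f"aggregated share: {float(aggregated_share):.0%}")
print(f"sum of fractions: {float(fraction_sum):.0%}")
```

The aggregated share is the slope of the summed vector, whereas adding the two fractions gives a value that has no meaning as a regional market share.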

## Calculation Example

Please note: Even though the market share gain/loss in country 1 and in country 2 is positive (see the table), the market share gain/loss for the region, i.e., for the combined market, is negative.

## Graph

In this section you can find some examples where Simpson's paradox becomes obvious. Of course, the paradox can also occur when adding up more than two vectors, i.e., when aggregating more than two countries to a regional level.

Please note: Even though each blue vector has a higher slope than the corresponding orange vector, the slope of the vector sum of the orange vectors exceeds the slope of the vector sum of the blue vectors.
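The condition described in this note can be checked for any number of vectors. The following is a minimal sketch, not the code behind the graph: the function `is_reversal` and the three-country volumes are hypothetical and chosen only to illustrate the case of more than two vectors.

```python
from fractions import Fraction

def slope(vector):
    """Slope of a (market, competitor) vector, i.e., the market share."""
    market, competitor = vector
    return Fraction(competitor, market)

def vector_sum(vectors):
    """Component-wise sum of (market, competitor) vectors."""
    return (sum(m for m, _ in vectors), sum(c for _, c in vectors))

def is_reversal(blue, orange):
    """True if every blue vector is steeper than its orange counterpart
    while the orange vector sum is steeper than the blue vector sum."""
    each_blue_steeper = all(slope(b) > slope(o) for b, o in zip(blue, orange))
    orange_sum_steeper = slope(vector_sum(orange)) > slope(vector_sum(blue))
    return each_blue_steeper and orange_sum_steeper

# Hypothetical (market, competitor) volumes for three countries
blue = [(20, 17), (20, 9), (100, 15)]     # slopes 0.85, 0.45, 0.15
orange = [(100, 80), (10, 4), (10, 1)]    # slopes 0.80, 0.40, 0.10

print(is_reversal(blue, orange))
```

Exact fractions are used instead of floating-point numbers so that equal slopes are never misjudged because of rounding.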

## Examples

### Example 2

In this example, two business domains are compared whose volume developments are not opposed. All volumes are growing; nonetheless, the development of the total market share is slightly negative.

## Interactive Graph

Again: Even though each blue vector has a higher slope than the corresponding orange vector, the slope of the vector sum of the orange vectors (the combination of both) exceeds the slope of the vector sum of the blue vectors.

You can drag the vectors' bullet points to explore the limits of the paradox:

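| | | 2012 | 2013 | gain/loss | growth |
|---|---|---|---|---|---|
| country 1 | competitor | 6.5 | 6.0 | -0.5 | -7.7% |
| | market | 11.0 | 10.0 | -1.0 | -9.2% |
| | share | 59.0% | 60.0% | 1.0% | |
| country 2 | competitor | 1.7 | 2.0 | 0.3 | 17.0% |
| | market | 9.0 | 10.0 | 1.0 | 11.1% |
| | share | 19.0% | 20.0% | 1.0% | |
| region | competitor | 8.2 | 8.0 | -0.2 | -2.6% |
| | market | 20.0 | 20.0 | 0.0 | -0.1% |
| | share | 41.0% | 40.0% | -1.0% | |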
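The share rows of the calculation example can be recomputed from the volume rows. The following is a minimal sketch under the assumption that the displayed volumes are rounded, so the recomputed country gains may differ slightly from the printed 1.0%; the dictionary layout and function names are illustrative.

```python
# Volumes from the calculation example: (competitor, market) per country and year
volumes = {
    "country1": {2012: (6.5, 11.0), 2013: (6.0, 10.0)},
    "country2": {2012: (1.7, 9.0), 2013: (2.0, 10.0)},
}

def share(competitor, market):
    """Market share as a fraction of the market volume."""
    return competitor / market

def region_volumes(year):
    """Sum competitor and market volumes over all countries (vector sum)."""
    competitor = sum(volumes[c][year][0] for c in volumes)
    market = sum(volumes[c][year][1] for c in volumes)
    return competitor, market

# Gain/loss of market share from 2012 to 2013
gains = {}
for country, by_year in volumes.items():
    gains[country] = share(*by_year[2013]) - share(*by_year[2012])
gains["region"] = share(*region_volumes(2013)) - share(*region_volumes(2012))

for name, gain in gains.items():
    print(f"{name}: {gain * 100:+.1f} %pts")
```

Both country gains come out positive while the regional gain is negative, which is the reversal discussed in this article.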

## Findings on observation

• The competitor has a high market share (i.e., a high vector slope) in country 1 and a low market share in country 2. Therefore, a large difference in market share seems to be a prerequisite for the paradox.
• Simpson's paradox can be found for positive and for negative numbers.

## Sensitivity

### Preliminary Considerations

1. We have a function with 8 (eight) independent variables.
2. We can find results by randomly filling the independent variables and checking whether they produce a Simpson's paradox (sample-based approach).
3. We can only plot a three-dimensional graph, not an eight-dimensional one.
4. We could use scatter plots while changing only one parameter at a time (OAT/OFAT).
5. The objective is to find a sentence that starts with: A Simpson's paradox can be found if ...

### Questions

1. Are the variables really independent or can we find correlations?
2. Can we find common attributes when studying the randomly generated results?
3. Can regression analysis help to interpret the results?

### Randomly created results

With a sample-based approach, we can find matching results by filling the parameters randomly. For the data at hand, this was done for a closed interval from 10 to 100 with a loop step of 10. This setup results in an algorithm running 10 to the power of 8 (10^8) = 100,000,000 times.
In an extended approach, the loop step could be reduced, resulting in a higher number of loops and in more precise results. A different approach could be to work with the data already found in the first run and to go into detail for these results. This could reduce the runtime of the algorithm. E.g., for the result:

Here are the results found in the first approach (rough results):

tbl...
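The exhaustive loop over all eight variables described in this section could be sketched as follows. This is not the original algorithm: the function names are hypothetical, the filter that skips market shares above 100% is an added assumption, and a coarser grid than the one in the text is used so that the sketch runs quickly.

```python
from itertools import product

def is_paradox(c1_a, m1_a, c1_b, m1_b, c2_a, m2_a, c2_b, m2_b):
    """Check for a reversal between period a and period b.

    c = competitor volume, m = market volume, 1/2 = country.
    """
    gain_1 = c1_b / m1_b - c1_a / m1_a
    gain_2 = c2_b / m2_b - c2_a / m2_a
    gain_region = (c1_b + c2_b) / (m1_b + m2_b) - (c1_a + c2_a) / (m1_a + m2_a)
    both_up = gain_1 > 0 and gain_2 > 0 and gain_region < 0
    both_down = gain_1 < 0 and gain_2 < 0 and gain_region > 0
    return both_up or both_down

def grid_search(values):
    """Yield every combination of the eight variables that shows the paradox."""
    for combo in product(values, repeat=8):
        c1_a, m1_a, c1_b, m1_b, c2_a, m2_a, c2_b, m2_b = combo
        # Assumption: skip combinations with a market share above 100%
        if c1_a > m1_a or c1_b > m1_b or c2_a > m2_a or c2_b > m2_b:
            continue
        if is_paradox(*combo):
            yield combo

# range(10, 101, 10) would reproduce the 10^8 combinations from the text;
# a coarser grid keeps this sketch fast.
coarse_values = range(10, 101, 30)   # 10, 40, 70, 100 -> 4^8 = 65,536 combinations
hits = list(grid_search(coarse_values))
print(len(hits), "paradox cases found on the coarse grid")
```

Passing `range(10, 101, 10)` instead of `coarse_values` corresponds to the interval and loop step named above, at the cost of a much longer runtime in plain Python.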

### One factor at a time (OFAT)

The following charts show the effect of changing one input parameter at a time (+/- 100% of the original value, over an interval of 15 steps). All other parameters remain unchanged.
Chart points marked with an x do not represent the paradox, while points marked with a bullet point do.
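A one-factor-at-a-time run of this kind could be sketched as follows. The base values are the volumes from this article's calculation example; reading the "15 steps" as 15 evenly spaced points between 0% and 200% of the original value is an assumption, as are the parameter names and the `ofat` function.

```python
# Base case: volumes from the article's calculation example
BASE = {
    "c1_2012": 6.5, "m1_2012": 11.0, "c1_2013": 6.0, "m1_2013": 10.0,
    "c2_2012": 1.7, "m2_2012": 9.0, "c2_2013": 2.0, "m2_2013": 10.0,
}

def is_paradox(p):
    """True if both country gains share one sign and the region gain has the other."""
    gain_1 = p["c1_2013"] / p["m1_2013"] - p["c1_2012"] / p["m1_2012"]
    gain_2 = p["c2_2013"] / p["m2_2013"] - p["c2_2012"] / p["m2_2012"]
    region_2012 = (p["c1_2012"] + p["c2_2012"]) / (p["m1_2012"] + p["m2_2012"])
    region_2013 = (p["c1_2013"] + p["c2_2013"]) / (p["m1_2013"] + p["m2_2013"])
    gain_region = region_2013 - region_2012
    both_up = gain_1 > 0 and gain_2 > 0 and gain_region < 0
    both_down = gain_1 < 0 and gain_2 < 0 and gain_region > 0
    return both_up or both_down

def ofat(name, points=15):
    """Vary one parameter from 0% to 200% of its base value, all others fixed."""
    results = []
    for i in range(points):
        factor = 2.0 * i / (points - 1)
        params = dict(BASE, **{name: BASE[name] * factor})
        markets = [params[k] for k in params if k.startswith("m")]
        if min(markets) <= 0:
            continue  # a market volume of zero has no defined share
        results.append((factor, is_paradox(params)))
    return results

# 'o' marks a paradox point, 'x' marks a point without the paradox
for factor, paradox in ofat("c1_2012"):
    print(f"{factor:5.2f} {'o' if paradox else 'x'}")
```

Calling `ofat` with any of the other seven parameter names gives the corresponding series for that parameter.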

#### Country 1 in 2012

#### Country 1 in 2013

#### Country 2 in 2012

#### Country 2 in 2013

### Findings

The order of the countries does not play a role, i.e., the following data is treated as one paradox case, not two:

either:

high difference in market share (75%pts) and low change in volume development

or

low difference in market share (10%pts) and high change in volume development

contrary volume development competitor / market and country 1 country 2

tbd...