Wiki Highlights Experiment

T355224

Summary

Wiki Highlights is a concise overview generated from the lead and other sections of a Wikipedia article and combined with a relevant image. Its purpose is to highlight the relevant facts from lengthy text.

The experiment ran from January 4th to January 16th in six countries: Brazil, Germany, India, Indonesia, Nigeria, and the United States. Participants in each country were randomly assigned to one of two versions of the content hosted on microsites: the highlight version or the article version. The content is sourced from English Wikipedia and Commons, as featured in this list. Participants could read their assigned version and then choose whether to continue reading or exit the microsite.

Purpose

We measure the following metrics to understand whether Wiki Highlights is a viable reading experience for global youth audiences on third-party platforms.

Primary metric

  • Time on site (session length)
      • Total time = Time on homepage + Time on content page
      • Time on homepage
      • Time on content page

Secondary metrics

  • Summaries completion rate
  • Number of summaries consumed per session
  • Popular topics

Data Preparation

import matplotlib as mpl
import math
import pandas as pd
import numpy as np
import scipy
from pandasql import sqldf

import wmfdata
from wmfdata import hive, mariadb, spark
 
import matplotlib.pyplot as plt
import seaborn as sns
spark_session = wmfdata.spark.create_session(app_name='pyspark regular; wiki-highlights',
                                  type='yarn-regular', # local, yarn-regular, yarn-large
                                         )  
country_list = ('Brazil', 'Germany', 'India', 'Indonesia', 'Nigeria', 'United States')
## Adding function for percentile

def percentile(n):
    def percentile_(x):
        return x.quantile(n)
    percentile_.__name__ = 'percentile_{:02.0f}'.format(n*100)
    return percentile_
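
As a quick illustration of how this helper is used in the sections that follow, the sketch below applies it to a small made-up DataFrame (the values in `toy` are invented for illustration and are not experiment data). The `__name__` assignment is what gives each aggregated column its label.

```python
import pandas as pd


def percentile(n):
    """Return an aggregation function computing the n-th quantile (0 <= n <= 1)."""
    def percentile_(x):
        return x.quantile(n)
    # pandas uses the function's __name__ as the output column label
    percentile_.__name__ = 'percentile_{:02.0f}'.format(n * 100)
    return percentile_


# Toy data, invented for illustration only
toy = pd.DataFrame({
    'experiment_group': ['control'] * 4 + ['experiment'] * 4,
    'total_length': [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

summary = toy.groupby('experiment_group')['total_length'].agg(
    [percentile(0.5), percentile(0.95)]
)
print(summary)
```

Without distinct names, passing several closures to `agg` would fail because pandas cannot tell the resulting columns apart.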

Wiki highlights Event Data

Collect event data from the wiki_highlights_experiment schema for the test period, January 4th to January 16th.

event_data_query = """

SELECT
  meta.dt as server_dt,
  experiment_group,
  geocoded_data['country'] as user_country,
  md5(concat(http.client_ip, '+{salt}')) as ip_hash,
  session_id, event_type,
  page_name, 
  CASE WHEN page_name IN ('categories_highlights', 'categories_articles') THEN 'homepage' ELSE topic END AS topic, -- hard code homepage
  CASE WHEN page_name IN ('categories_highlights', 'categories_articles') THEN 'homepage' ELSE category_name END AS category_name, -- hard code homepage
  page_bottom_was_visible, time_length_ms
FROM event.inuka_wiki_highlights_experiment e
LEFT JOIN cchen.wiki_highlights_article_list l ON e.page_name = l.article_title
WHERE
   (year = 2024 AND month = 1 AND day >=4 AND day <= 16)

"""
event_data = spark.run(event_data_query)
# store data in GlobalTempView
event_sdf = spark_session.createDataFrame(event_data)
event_sdf.createGlobalTempView("event_data_view")

Metrics

Time on Site (Session Length)

The metric indicates users’ willingness to consume articles and highlights. All the times we calculate are in seconds.

time_on_site_query = """ 
 
 SELECT 
        experiment_group,
        session_id,
        SUM(time_length_ms)/1000 AS total_length,
        SUM(CASE WHEN topic = 'homepage' THEN time_length_ms END)/1000 AS home_length,
        SUM(CASE WHEN topic != 'homepage' THEN time_length_ms END)/1000 AS content_length
    FROM global_temp.event_data_view
    WHERE event_type = 'pageUnloaded'
    GROUP BY experiment_group, session_id
    
"""
time_on_site = spark.run(time_on_site_query)
                                                                                
## Check % of sessions with only homepage visits, no content page visits
sqldf("""
    
    SELECT 
        experiment_group,
        SUM(CASE WHEN content_length IS NULL THEN 1 END)*100 /  COUNT(1) AS hp_only_pct
    FROM time_on_site
    GROUP BY experiment_group
    
""")
   experiment_group  hp_only_pct
0           control           51
1        experiment           52

About 51% of sessions in the control group and 52% of sessions in the experiment group consisted of homepage visits only.
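
The same share can be computed directly in pandas. The sketch below is a minimal illustration on a made-up table shaped like `time_on_site`; the values in `toy_sessions` are invented, and `content_length` is missing whenever a session never opened a content page.

```python
import numpy as np
import pandas as pd

# Toy per-session table shaped like `time_on_site`; values are invented.
# content_length is NaN when a session never opened a content page.
toy_sessions = pd.DataFrame({
    'experiment_group': ['control', 'control', 'experiment', 'experiment'],
    'home_length': [12.0, 8.5, 15.0, 9.0],
    'content_length': [np.nan, 30.0, np.nan, np.nan],
})

# Share of sessions per group whose content_length is missing, in percent
hp_only_pct = (
    toy_sessions['content_length'].isna()
    .groupby(toy_sessions['experiment_group'])
    .mean()
    .mul(100)
)
print(hp_only_pct)
```

Unlike the integer arithmetic in the SQL version, which appears to truncate the result to a whole number, this returns a float percentage.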

Total Time

time_grouped = time_on_site.groupby('experiment_group')
total_time_column = time_grouped['total_length']
total_time_column.agg([percentile(0.5), percentile(0.75), percentile(0.90), percentile(0.95)])
                  percentile_50  percentile_75  percentile_90  percentile_95
experiment_group
control                  18.942         41.227        95.2176       188.6705
experiment               20.524         46.332       109.9998       197.8194
sns.set_style("white")
fig, ax = plt.subplots(figsize=(10,5))

# kdeplot returns the Axes it draws on; `fill` replaces the deprecated `shade`
ax = sns.kdeplot(data=time_on_site, x="total_length", hue="experiment_group", fill=True, log_scale=True, clip=(-1,3.5), ax=ax)
ax.set(yticklabels=[])
ax.set(ylabel=None)
ax.xaxis.set_major_formatter(mpl.ticker.ScalarFormatter())  # plain-number labels on the log-scaled axis

plt.xlabel("Seconds")
plt.title('Total Time Spent')

sns.set(rc={'figure.figsize':(15,5)})
sns.set_style("white")

sns.boxplot(data=time_on_site, x="total_length", hue="experiment_group",showfliers=False, gap=.5)

plt.xlabel("Seconds")
plt.title('Total Time Spent')

In the control group, 50% of sessions had a total time of 19 seconds or less; in the experiment group, 50% of sessions had a total time of 21 seconds or less.

In the control group, 95% of sessions had a total time of 189 seconds or less; in the experiment group, 95% of sessions had a total time of 198 seconds or less.

Sessions in the experiment group spent more total time (homepage plus content pages) than sessions in the control group at every reported percentile.
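
To put the gap between the groups in relative terms, the sketch below recomputes the difference at each percentile from the total-time table reported earlier in this section. It is a descriptive comparison only, not a significance test; the name `relative_diff_pct` is introduced here for illustration.

```python
import pandas as pd

# Percentiles of total session time (seconds), copied from the Total Time table
total_time = pd.DataFrame(
    {
        'percentile_50': [18.942, 20.524],
        'percentile_75': [41.227, 46.332],
        'percentile_90': [95.2176, 109.9998],
        'percentile_95': [188.6705, 197.8194],
    },
    index=['control', 'experiment'],
)

# Relative difference of experiment vs. control, in percent
relative_diff_pct = (
    (total_time.loc['experiment'] - total_time.loc['control'])
    / total_time.loc['control'] * 100
)
print(relative_diff_pct.round(1))
```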

Time on Homepage

home_time_column = time_grouped['home_length']
home_time_column.agg([percentile(0.5), percentile(0.75), percentile(0.90), percentile(0.95)])
                  percentile_50  percentile_75  percentile_90  percentile_95
experiment_group
control                  14.348       27.75075        51.5783       76.13905
experiment               14.877       27.59550        52.5068       78.76800
sns.set_style("white")
fig, ax = plt.subplots(figsize=(10,5))

# kdeplot returns the Axes it draws on; `fill` replaces the deprecated `shade`
ax = sns.kdeplot(data=time_on_site, x="home_length", hue="experiment_group", fill=True, log_scale=True, clip=(-1,3.5), ax=ax)
ax.set(yticklabels=[])
ax.set(ylabel=None)
ax.xaxis.set_major_formatter(mpl.ticker.ScalarFormatter())  # plain-number labels on the log-scaled axis

plt.xlabel("Seconds")
plt.title("Homepage Time Spent")

ax = sns.boxplot(data=time_on_site, x="home_length", hue="experiment_group", showfliers=False, gap=.5)

plt.xlabel("Seconds")
plt.title('Time Spent on Homepage')

# move the legend of the boxplot's own Axes, not the Axes of an earlier figure
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

In the control group, 50% of sessions spent 14 seconds or less on the homepage; in the experiment group, 50% of sessions spent 15 seconds or less.

In the control group, 95% of sessions spent 76 seconds or less on the homepage; in the experiment group, 95% of sessions spent 79 seconds or less.

Users in the experiment group spent about the same amount of time on the homepage as users in the control group.

Time on Content Page

content_time_column = time_grouped['content_length']
content_time_column.agg([percentile(0.5), percentile(0.75), percentile(0.90), percentile(0.95)])
                  percentile_50  percentile_75  percentile_90  percentile_95
experiment_group
control                   9.698       29.08675       116.2183       214.5830
experiment               11.725       46.27050       129.4230       260.2247
sns.set_style("white")
fig, ax = plt.subplots(figsize=(10,5))

# kdeplot returns the Axes it draws on; `fill` replaces the deprecated `shade`
ax = sns.kdeplot(data=time_on_site, x="content_length", hue="experiment_group", fill=True, log_scale=True, clip=(-1,3.5), ax=ax)
ax.set(yticklabels=[])
ax.set(ylabel=None)
ax.xaxis.set_major_formatter(mpl.ticker.ScalarFormatter())  # plain-number labels on the log-scaled axis

plt.xlabel("Seconds")
plt.title("Content Time Spent")

ax = sns.boxplot(data=time_on_site, x="content_length", hue="experiment_group", showfliers=False, gap=.5)

plt.xlabel("Seconds")
plt.title('Time Spent on Content pages')

# move the legend of the boxplot's own Axes, not the Axes of an earlier figure
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

Among sessions that viewed at least one content page, 50% of control-group sessions spent 10 seconds or less on content pages; in the experiment group, 50% of sessions spent 12 seconds or less.

In the control group, 95% of these sessions spent 215 seconds or less on content pages; in the experiment group, 95% spent 260 seconds or less.

Among sessions that viewed content pages, the experiment group spent more time on content pages than the control group at every reported percentile.

Note: in the control group, the articles are collapsed. Some users may therefore not have expanded every section to read the entire article, which could have affected the reading time in the control group.

Content Read Completion Rate

The metric indicates users’ willingness to complete reading the content. Content is considered complete when users reach the bottom of an article or the last page of a highlight.

When calculating the completion rate, we are excluding homepage visits.
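
The definition above can be sketched in pandas as follows. This is a minimal illustration on invented rows (`toy_events` and its topic labels are hypothetical), assuming one row per `pageUnloaded` event with a boolean `page_bottom_was_visible` column.

```python
import pandas as pd

# Toy pageUnloaded events, invented for illustration
toy_events = pd.DataFrame({
    'experiment_group': ['control', 'control', 'control', 'experiment', 'experiment'],
    'topic': ['homepage', 'PLACES', 'LIFESTYLE', 'PLACES', 'homepage'],
    'page_bottom_was_visible': [True, True, False, True, False],
})

# Exclude homepage visits, then take the share of views that reached the bottom
content_views = toy_events[toy_events['topic'] != 'homepage']
completion_rate = (
    content_views.groupby('experiment_group')['page_bottom_was_visible'].mean()
)
print(completion_rate)
```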

content_completion_query = """

SELECT 
    experiment_group,
    COUNT(1) AS pageview,
    SUM(CASE WHEN page_bottom_was_visible THEN 1 END)/ COUNT(1) AS completion_rate
FROM global_temp.event_data_view
WHERE event_type = 'pageUnloaded'
      AND topic != 'homepage'
GROUP BY experiment_group
    
"""
content_completion = spark.run(content_completion_query)
                                                                                
content_completion
   experiment_group  pageview  completion_rate
0           control      1112         0.781475
1        experiment      1658         0.721954

In the control group, 1,112 articles were opened, with a 78.1% completion rate.

The experiment group opened more content but had a lower completion rate: 1,658 highlights were opened, with a 72.2% completion rate.

Number of Content Viewed per Session

The metric reflects users’ willingness to view subsequent highlights and articles.

We also exclude homepage views here. If a session had only homepage views, it counts as 0 content views.
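
A minimal pandas sketch of this counting rule, on invented rows (`toy_events` and the session ids are hypothetical): homepage events contribute 0 and every other event contributes 1, so a homepage-only session still appears with a count of 0 rather than being dropped.

```python
import pandas as pd

# Toy pageUnloaded events, invented for illustration
toy_events = pd.DataFrame({
    'experiment_group': ['control', 'control', 'control',
                         'experiment', 'experiment', 'experiment'],
    'session_id': ['s1', 's2', 's2', 's3', 's3', 's3'],
    'topic': ['homepage', 'homepage', 'SPORT', 'homepage', 'SPORT', 'NATURE'],
})

# Flag content-page events, then sum the flags per session
num_pages = (
    toy_events.assign(is_content=toy_events['topic'] != 'homepage')
    .groupby(['experiment_group', 'session_id'])['is_content']
    .sum()
    .rename('num_pages')
    .reset_index()
)
print(num_pages)
```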

content_per_session_query = """

SELECT
    experiment_group,
    session_id, 
    SUM(CASE WHEN topic = 'homepage' THEN 0 ELSE 1 END) AS num_pages
FROM global_temp.event_data_view
WHERE event_type = 'pageUnloaded'
GROUP BY experiment_group,session_id

"""
content_per_session = spark.run(content_per_session_query)
                                                                                
content_per_session_grouped = content_per_session.groupby('experiment_group')
content_per_session_column = content_per_session_grouped['num_pages']
content_per_session_column.agg([percentile(0.5), percentile(0.75), percentile(0.90), percentile(0.95)])
                  percentile_50  percentile_75  percentile_90  percentile_95
experiment_group
control                     0.0            1.0            2.0            3.0
experiment                  0.0            1.0            3.0            4.0
sns.set_style("white")
fig, ax = plt.subplots(figsize=(10,5))

# kdeplot returns the Axes it draws on; `fill` replaces the deprecated `shade`
ax = sns.kdeplot(data=content_per_session, x="num_pages", hue="experiment_group", fill=True, cut=0, clip=(0,15),
                 palette={'control':'b', 'experiment':'r'}, ax=ax)
ax.set(yticklabels=[])
ax.set(ylabel=None)

plt.title("Number of Content Pages per Session")

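ax = sns.boxplot(data=content_per_session, x="num_pages", hue="experiment_group", showfliers=False, gap=.5,
                 palette={'control':'b', 'experiment':'r'})

plt.xlabel("Content pages viewed")  # the x-axis is a page count, not seconds
plt.title("Content Read per Session")

# move the legend of the boxplot's own Axes, not the Axes of an earlier figure
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))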

In both the control and experiment groups, 50% of sessions consumed 0 summaries/articles.

In both groups, 75% of sessions consumed 0 or 1 summaries/articles.

In the control group, 95% of sessions viewed 0 to 3 articles. In the experiment group, 95% of sessions viewed 0 to 4 summaries.

Sessions in the experiment group viewed slightly more content than sessions in the control group; the difference shows at the 90th and 95th percentiles.

Top Viewed Content

This section shows which topics and categories received the most views (article pageviews in the control group, highlight views in the experiment group). The content completion rate is included for reference.

For the list of featured articles and their topics, please refer to this sheet.

top_page_query = """

    SELECT 
        experiment_group,
        page_name, 
        COUNT(1) AS pv,
        SUM(CASE WHEN page_bottom_was_visible THEN 1 ELSE 0 END)/ COUNT(1) AS completion_rate
    FROM global_temp.event_data_view
    WHERE event_type = 'pageUnloaded'
      AND topic != 'homepage'
    GROUP BY experiment_group, page_name

"""
top_page = spark.run(top_page_query)
                                                                                

The top 10 viewed articles in the control group are:

top_page.loc[(top_page['experiment_group'] == 'control')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group page_name pv completion_rate
49 control Lionel Messi 76 0.881579
7 control Friends 62 0.806452
16 control Japan 60 0.783333
34 control Ancient Egypt 53 0.735849
55 control Body piercing 47 0.808511
40 control Baseball 46 0.695652
21 control Comics 46 0.847826
17 control Feminism 43 0.813953
31 control Obesity 42 0.833333
37 control Statue of Liberty 41 0.804878

The top 10 viewed highlights in the experiment group are:

top_page.loc[(top_page['experiment_group'] == 'experiment')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group page_name pv completion_rate
14 experiment Lionel Messi 99 0.595960
45 experiment Climate change 86 0.767442
52 experiment Elephant 80 0.700000
11 experiment Japan 79 0.721519
26 experiment Friends 74 0.756757
38 experiment Obesity 71 0.760563
39 experiment Comics 69 0.710145
28 experiment Sustainable energy 69 0.695652
29 experiment Statue of Liberty 64 0.593750
44 experiment Yoga 62 0.693548

Top Viewed Topics

top_topic_query = """

    SELECT 
        experiment_group,
        category_name AS topic, 
        COUNT(1) AS pv,
        SUM(CASE WHEN page_bottom_was_visible THEN 1 ELSE 0 END)/ COUNT(1) AS completion_rate
    FROM global_temp.event_data_view
    WHERE event_type = 'pageUnloaded'
      AND topic != 'homepage'
    GROUP BY experiment_group, category_name

"""
top_topic = spark.run(top_topic_query)
                                                                                

The top viewed topics in the control group are:

top_topic.loc[(top_topic['experiment_group'] == 'control')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group topic pv completion_rate
8 control LIFESTYLE 190 0.800000
0 control PERSONALITIES 184 0.836957
6 control HISTORY 160 0.787500
4 control TOPICAL 150 0.820000
5 control SPORT 146 0.705479
10 control NATURE 145 0.710345
2 control PLACES 137 0.788321

Top viewed topics in experiment group are:

top_topic.loc[(top_topic['experiment_group'] == 'experiment')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group topic pv completion_rate
12 experiment TOPICAL 286 0.762238
3 experiment PERSONALITIES 257 0.680934
11 experiment NATURE 254 0.728346
1 experiment LIFESTYLE 245 0.767347
7 experiment PLACES 216 0.726852
13 experiment SPORT 207 0.719807
9 experiment HISTORY 193 0.647668

Note: Nature didn’t show up at the top for any country contrary to user feedback from the survey.

Metrics Breakdown by Countries

This section breaks each metric down by country to facilitate comparisons.

Time on Site (Session Length)

time_on_site_c_query = """ 
 
 SELECT 
        user_country,
        experiment_group,
        session_id,
        SUM(time_length_ms)/1000 AS total_length,
        SUM(CASE WHEN topic = 'homepage' THEN time_length_ms END)/1000 AS home_length,
        SUM(CASE WHEN topic != 'homepage' THEN time_length_ms END)/1000 AS content_length
    FROM global_temp.event_data_view
    WHERE event_type = 'pageUnloaded'
      AND user_country IN {country_list}   
    GROUP BY user_country,experiment_group, session_id
    
"""
time_on_site_c = spark.run(
       time_on_site_c_query.format(
          country_list = country_list
        ))
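
The country filter in these queries relies on how Python renders a tuple as text. The sketch below shows what `.format` substitutes into the `IN` clause; the one-element case is an added caveat, not something the notebook runs into.

```python
country_list = ('Brazil', 'Germany', 'India', 'Indonesia', 'Nigeria', 'United States')

# The tuple's text form happens to match SQL's IN-list syntax
clause = "AND user_country IN {country_list}".format(country_list=country_list)
print(clause)

# Caveat: a one-element tuple renders with a trailing comma,
# which most SQL dialects reject
single = "AND user_country IN {country_list}".format(country_list=('Brazil',))
print(single)
```

This works here because the list is fixed and contains only plain names; values containing quotes, or lists built from user input, would need proper escaping or parameter binding instead.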
                                                                                

Total Time

totla_time_c_query= """

    WITH total_time AS (
        SELECT 
            user_country,
            experiment_group,
            session_id,
            SUM(time_length_ms)/1000 AS total_length
        FROM global_temp.event_data_view
        WHERE event_type = 'pageUnloaded'
          AND user_country IN {country_list}   
        GROUP BY user_country,experiment_group, session_id
    )
    
    SELECT
       user_country,
       experiment_group,
       PERCENTILE_APPROX(total_length,0.50) AS 50_percentile,
       PERCENTILE_APPROX(total_length,0.75) AS 75_percentile,
       PERCENTILE_APPROX(total_length,0.90) AS 90_percentile,
       PERCENTILE_APPROX(total_length,0.95) AS 95_percentile
    FROM total_time
    GROUP BY user_country,experiment_group
    ORDER BY user_country,experiment_group

"""
spark.run( 
         totla_time_c_query.format(
          country_list = country_list
        )
    )
                                                                                
user_country experiment_group 50_percentile 75_percentile 90_percentile 95_percentile
0 Brazil control 20.637 39.948 96.972 164.477
1 Brazil experiment 25.193 52.807 123.295 200.400
2 Germany control 15.257 26.127 49.417 73.394
3 Germany experiment 15.560 24.174 48.615 68.727
4 India control 18.850 36.875 66.654 106.661
5 India experiment 22.300 45.418 92.987 129.510
6 Indonesia control 17.296 28.746 58.731 76.430
7 Indonesia experiment 15.608 25.449 48.345 102.256
8 Nigeria control 42.690 114.206 297.555 521.008
9 Nigeria experiment 60.662 122.915 319.056 606.709
10 United States control 17.650 39.494 78.332 146.681
11 United States experiment 19.837 42.082 88.832 123.535
#sns.set_theme(style="white")
#g = sns.FacetGrid(time_on_site_c, row="user_country",aspect=7, height=3.5)

#g.map_dataframe(sns.kdeplot, x="total_length",hue="experiment_group",shade=True, log_scale=True, clip =(-1,3.5))
#fig.set(yticklabels=[]) 
#fig.set(ylabel=None)
#fig.set(xticklabels=[0,0,0,1,10,100,1000]) 

#plt.xlabel("Seconds")
#plt.title('Total Time Spent')
sns.set(rc={'figure.figsize':(15,8)})
sns.set_style("white")

ax = sns.boxplot(data=time_on_site_c, x="total_length", y="user_country", hue="experiment_group", showfliers=False, gap=.3)

plt.xlabel("Seconds")
plt.title('Total Time Spent')
plt.ylabel("")  # hide the y-axis label

# move the legend of the boxplot's own Axes, not the Axes of an earlier figure
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

From the data above, in Brazil, India, the United States, and Nigeria, sessions in the experiment group generally had a longer total time (homepage plus content pages) than sessions in the control group.

In Indonesia and Germany, sessions in the control group generally had a longer total time.

Time on Homepage

homepage_c_query= """

    WITH total_time AS (
        SELECT 
            user_country,
            experiment_group,
            session_id,
            SUM(CASE WHEN topic = 'homepage' THEN time_length_ms END)/1000 AS home_length
        FROM global_temp.event_data_view
        WHERE event_type = 'pageUnloaded'
          AND user_country IN {country_list}   
        GROUP BY user_country,experiment_group, session_id
    )
    
    SELECT
       user_country,
       experiment_group,
       PERCENTILE_APPROX(home_length,0.50) AS 50_percentile,
       PERCENTILE_APPROX(home_length,0.75) AS 75_percentile,
       PERCENTILE_APPROX(home_length,0.90) AS 90_percentile,
       PERCENTILE_APPROX(home_length,0.95) AS 95_percentile
    FROM total_time
    GROUP BY user_country,experiment_group
    ORDER BY user_country,experiment_group

"""
spark.run( 
         homepage_c_query.format(
          country_list = country_list
        )
    )
                                                                                
user_country experiment_group 50_percentile 75_percentile 90_percentile 95_percentile
0 Brazil control 16.276 32.965 62.115 89.921
1 Brazil experiment 17.559 37.829 77.639 109.329
2 Germany control 12.864 18.631 35.523 42.511
3 Germany experiment 12.830 18.216 31.342 41.777
4 India control 14.632 26.862 43.337 61.269
5 India experiment 15.706 30.874 48.493 81.931
6 Indonesia control 13.864 22.128 39.994 57.322
7 Indonesia experiment 13.366 19.215 37.140 46.788
8 Nigeria control 20.399 50.511 96.485 178.650
9 Nigeria experiment 21.995 53.847 76.100 109.041
10 United States control 13.311 25.069 40.706 62.318
11 United States experiment 13.295 22.200 35.865 48.751
ax = sns.boxplot(data=time_on_site_c, x="home_length", y="user_country", hue="experiment_group", showfliers=False, gap=.3)
sns.set_style("white")

plt.xlabel("Seconds")
plt.title('Time Spent on Homepage')
plt.ylabel("")  # hide the y-axis label

# move the legend of the boxplot's own Axes, not the Axes of an earlier figure
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

For time spent on the homepage, in Brazil, Nigeria, and India, sessions in the experiment group spent more time at the median than sessions in the control group.

In Germany, sessions in the two groups spent a similar amount of time on the homepage.

In Indonesia and the United States, sessions in the control group spent more time on the homepage, most visibly at the higher percentiles.

Time on Content Page

content_time_c_query= """

    WITH total_time AS (
        SELECT 
            user_country,
            experiment_group,
            session_id,
            SUM(CASE WHEN topic != 'homepage' THEN time_length_ms END)/1000 AS content_length
        FROM global_temp.event_data_view
        WHERE event_type = 'pageUnloaded'
          AND user_country IN {country_list}   
        GROUP BY user_country,experiment_group, session_id
    )
    
    SELECT
       user_country,
       experiment_group,
       PERCENTILE_APPROX(content_length,0.50) AS 50_percentile,
       PERCENTILE_APPROX(content_length,0.75) AS 75_percentile,
       PERCENTILE_APPROX(content_length,0.90) AS 90_percentile,
       PERCENTILE_APPROX(content_length,0.95) AS 95_percentile
    FROM total_time
    GROUP BY user_country,experiment_group
    ORDER BY user_country,experiment_group

"""
spark.run( 
         content_time_c_query.format(
          country_list = country_list
        )
    )
                                                                                
user_country experiment_group 50_percentile 75_percentile 90_percentile 95_percentile
0 Brazil control 8.350 21.339 84.189 203.409
1 Brazil experiment 14.708 41.528 138.720 176.157
2 Germany control 6.515 13.817 37.952 50.856
3 Germany experiment 7.157 11.233 26.987 57.811
4 India control 9.637 20.231 59.604 121.616
5 India experiment 11.766 35.500 92.876 127.866
6 Indonesia control 8.790 17.016 38.026 70.835
7 Indonesia experiment 8.918 19.751 104.614 187.773
8 Nigeria control 43.949 176.100 369.269 526.803
9 Nigeria experiment 79.387 225.900 572.007 969.253
10 United States control 8.436 22.734 85.557 136.388
11 United States experiment 12.308 41.031 85.045 117.422
ax = sns.boxplot(data=time_on_site_c, x="content_length", y="user_country", hue="experiment_group", showfliers=False, gap=.3)
sns.set_style("white")

plt.xlabel("Seconds")
plt.title('Time Spent on Content pages')
plt.ylabel("")  # hide the y-axis label

# move the legend of the boxplot's own Axes, not the Axes of an earlier figure
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

For time spent on content pages, in Brazil, India, Indonesia, Nigeria, and the United States, sessions in the experiment group spent more time on content pages at the median and 75th percentile than sessions in the control group.

In Germany, sessions in the control group spent more time on content pages at the 75th and 90th percentiles.

Sessions in Nigeria spent much longer on content pages than sessions in the other countries.

Number of Content Viewed per Session

content_per_session_query_c = """

WITH content_view AS (
SELECT
    experiment_group,
    user_country,
    session_id, 
    SUM(CASE WHEN topic = 'homepage' THEN 0 ELSE 1 END) AS num_pages
FROM global_temp.event_data_view
WHERE event_type = 'pageUnloaded'
  AND user_country IN {country_list}   
GROUP BY experiment_group,user_country,session_id
)

SELECT
       user_country,
       experiment_group,
       PERCENTILE_APPROX(num_pages,0.50) AS 50_percentile,
       PERCENTILE_APPROX(num_pages,0.75) AS 75_percentile,
       PERCENTILE_APPROX(num_pages,0.90) AS 90_percentile,
       PERCENTILE_APPROX(num_pages,0.95) AS 95_percentile
    FROM content_view
    GROUP BY user_country,experiment_group
    ORDER BY user_country,experiment_group

"""
content_per_session_c = spark.run( 
        content_per_session_query_c.format(
          country_list = country_list
        )
    )
                                                                                
content_per_session_c 
user_country experiment_group 50_percentile 75_percentile 90_percentile 95_percentile
0 Brazil control 0 1 1 2
1 Brazil experiment 0 1 2 4
2 Germany control 1 1 2 3
3 Germany experiment 1 1 2 2
4 India control 0 1 2 2
5 India experiment 0 1 2 4
6 Indonesia control 0 1 1 2
7 Indonesia experiment 0 1 1 2
8 Nigeria control 1 1 4 5
9 Nigeria experiment 1 2 5 8
10 United States control 1 1 2 3
11 United States experiment 1 1 3 4

From the data above, in Brazil, India, Nigeria, and the United States, sessions in the experiment group viewed more content than sessions in the control group.

In Indonesia, sessions in both groups viewed a similar amount of content.

In Germany, sessions in the experiment group viewed less content than sessions in the control group.

Content Read Completion Rate

content_completion_query_c = """

SELECT 
    experiment_group,
    user_country,
    COUNT(1) AS pageview,
    SUM(CASE WHEN page_bottom_was_visible THEN 1 END)/ COUNT(1)*100 AS completion_rate 
FROM global_temp.event_data_view
WHERE event_type = 'pageUnloaded'
  AND topic != 'homepage'
  AND user_country IN {country_list}   
GROUP BY experiment_group, user_country
ORDER BY user_country,experiment_group
    
"""
content_completion_c = spark.run( 
        content_completion_query_c.format(
          country_list = country_list
        )
    )
                                                                                
content_completion_c
experiment_group user_country pageview completion_rate
0 control Brazil 117 83.760684
1 experiment Brazil 252 80.555556
2 control Germany 201 75.124378
3 experiment Germany 225 59.111111
4 control India 154 69.480519
5 experiment India 320 70.000000
6 control Indonesia 115 77.391304
7 experiment Indonesia 138 77.536232
8 control Nigeria 279 81.362007
9 experiment Nigeria 409 73.838631
10 control United States 241 80.082988
11 experiment United States 303 71.947195
sns.set(rc={'figure.figsize':(15,8)})
sns.set_style("white")

ax= sns.barplot(content_completion_c, x="completion_rate", y="user_country", hue="experiment_group", orient="y")
for container in ax.containers:
    ax.bar_label(container, fmt='%.1f%%')

plt.xlabel("Completion Rate %")
plt.ylabel("")  # hide the y-axis label
plt.title('Content Completion Rate by Country')

sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

The content completion rate in the experiment group is lower than in the control group in every country except India and Indonesia, where the two groups are nearly equal.
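
To make the size of each gap explicit, the sketch below computes the percentage-point difference per country from the completion rates in the table above (rounded to one decimal). The column name `diff_pp` is introduced here for illustration.

```python
import pandas as pd

# Completion rates (%) copied from the country table, rounded to one decimal
rates = pd.DataFrame(
    {
        'control': [83.8, 75.1, 69.5, 77.4, 81.4, 80.1],
        'experiment': [80.6, 59.1, 70.0, 77.5, 73.8, 71.9],
    },
    index=['Brazil', 'Germany', 'India', 'Indonesia', 'Nigeria', 'United States'],
)

# Percentage-point difference: experiment minus control
rates['diff_pp'] = rates['experiment'] - rates['control']
print(rates.sort_values('diff_pp'))
```

Sorting by `diff_pp` puts the country with the largest drop first, which makes it easy to see where highlights were completed least often relative to articles.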

Top Viewed Content

top_page_query_c = """

    SELECT 
        experiment_group,
        user_country,
        page_name, 
        COUNT(1) AS pv,
        SUM(CASE WHEN page_bottom_was_visible THEN 1 ELSE 0 END)/ COUNT(1) AS completion_rate
    FROM global_temp.event_data_view
    WHERE event_type = 'pageUnloaded'
      AND topic != 'homepage'
    GROUP BY experiment_group, user_country, page_name

"""
top_page_c = spark.run(top_page_query_c)
                                                                                

In control group

Top 10 viewed articles in Brazil are:

top_page_c.loc[(top_page_c['experiment_group'] == 'control')&(top_page_c['user_country'] == 'Brazil')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country page_name pv completion_rate
30 control Brazil Amazon parrot 9 1.000000
135 control Brazil Baseball 7 0.857143
59 control Brazil Ancient Egypt 7 0.714286
79 control Brazil Friends 7 1.000000
337 control Brazil Japan 7 0.714286
1 control Brazil Lionel Messi 6 1.000000
114 control Brazil Obesity 6 1.000000
293 control Brazil Australian Magpie 5 0.400000
10 control Brazil Statue of Liberty 5 0.800000
151 control Brazil Comics 5 0.800000

Top 10 viewed articles in Germany are:

top_page_c.loc[(top_page_c['experiment_group'] == 'control')&(top_page_c['user_country'] == 'Germany')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country page_name pv completion_rate
300 control Germany Statue of Liberty 14 0.714286
308 control Germany Lionel Messi 13 0.923077
31 control Germany Body piercing 12 0.750000
44 control Germany Japan 12 0.750000
315 control Germany Michael Jackson 10 0.700000
157 control Germany Friends 9 0.777778
89 control Germany Elephant 9 0.666667
224 control Germany Ice dance 9 0.666667
298 control Germany Masrur Temples 9 1.000000
13 control Germany Ancient Egypt 8 0.625000

Top 10 viewed articles in India are:

top_page_c.loc[(top_page_c['experiment_group'] == 'control')&(top_page_c['user_country'] == 'India')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country page_name pv completion_rate
343 control India Friends 12 0.833333
232 control India Climate change 10 0.700000
14 control India Lionel Messi 10 0.900000
216 control India Japan 9 0.666667
144 control India Yoga 9 0.555556
15 control India Hyderabad 9 0.777778
290 control India Ancient Egypt 8 0.750000
112 control India Maya civilization 7 1.000000
240 control India Baseball 7 0.428571
225 control India Comics 7 0.857143

Top 10 viewed articles in Indonesia are:

top_page_c.loc[(top_page_c['experiment_group'] == 'control')&(top_page_c['user_country'] == 'Indonesia')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country page_name pv completion_rate
289 control Indonesia Japan 12 0.916667
120 control Indonesia Lionel Messi 10 1.000000
296 control Indonesia Comics 8 0.625000
126 control Indonesia Baseball 7 0.428571
86 control Indonesia Elephant 6 0.833333
323 control Indonesia Friends 5 1.000000
91 control Indonesia Maya civilization 5 0.600000
318 control Indonesia Statue of Liberty 5 0.800000
177 control Indonesia Feminism 5 0.800000
291 control Indonesia Winnipeg 5 0.600000

Top 10 viewed articles in Nigeria are:

top_page_c.loc[(top_page_c['experiment_group'] == 'control')&(top_page_c['user_country'] == 'Nigeria')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country page_name pv completion_rate
178 control Nigeria Lionel Messi 22 0.863636
2 control Nigeria Friends 18 0.722222
110 control Nigeria Feminism 16 0.812500
338 control Nigeria Nelson Mandela 14 0.785714
329 control Nigeria Body piercing 14 0.928571
204 control Nigeria Maraba Coffee 13 0.846154
39 control Nigeria Michael Jackson 11 0.909091
312 control Nigeria Obesity 11 0.818182
310 control Nigeria Maya Angelou 11 0.818182
249 control Nigeria Youth Olympic Games 11 0.818182

Top 10 viewed articles in United States are:

top_page_c.loc[(top_page_c['experiment_group'] == 'control')&(top_page_c['user_country'] == 'United States')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country page_name pv completion_rate
194 control United States Ancient Egypt 15 0.666667
34 control United States Australian Magpie 14 0.571429
124 control United States Lionel Messi 14 0.714286
317 control United States Baseball 13 0.846154
275 control United States Body piercing 12 0.833333
175 control United States Nelson Mandela 12 0.833333
53 control United States Elephant 11 0.636364
257 control United States Friends 11 0.727273
217 control United States Maraba Coffee 10 0.700000
321 control United States Maya Angelou 10 0.900000

In experiment group:

Top 10 viewed highlights in Brazil are:

top_page_c.loc[(top_page_c['experiment_group'] == 'experiment')&(top_page_c['user_country'] == 'Brazil')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country page_name pv completion_rate
294 experiment Brazil Lionel Messi 16 0.562500
334 experiment Brazil Hyderabad 15 0.933333
3 experiment Brazil Japan 13 0.769231
191 experiment Brazil Friends 12 0.916667
101 experiment Brazil Yoga 11 0.909091
119 experiment Brazil Maya Angelou 11 0.727273
115 experiment Brazil Comics 10 0.900000
52 experiment Brazil Ice dance 9 0.777778
72 experiment Brazil Sustainable energy 9 0.555556
80 experiment Brazil Statue of Liberty 9 0.666667

Top 10 viewed highlights in Germany are:

top_page_c.loc[(top_page_c['experiment_group'] == 'experiment')&(top_page_c['user_country'] == 'Germany')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country page_name pv completion_rate
279 experiment Germany Obesity 17 0.588235
121 experiment Germany Elephant 16 0.437500
180 experiment Germany Lionel Messi 15 0.333333
158 experiment Germany Giraffe 14 0.642857
313 experiment Germany Climate change 14 0.857143
190 experiment Germany Japan 12 0.500000
319 experiment Germany Feminism 11 0.636364
297 experiment Germany Ancient Egypt 11 0.545455
278 experiment Germany Statue of Liberty 11 0.545455
324 experiment Germany Body piercing 10 0.800000

Top 10 viewed highlights in India are:

top_page_c.loc[(top_page_c['experiment_group'] == 'experiment')&(top_page_c['user_country'] == 'India')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country page_name pv completion_rate
295 experiment India Elephant 25 0.600000
130 experiment India Sustainable energy 17 0.647059
234 experiment India Climate change 17 0.823529
5 experiment India Hyderabad 16 0.750000
286 experiment India Amazon parrot 15 0.800000
11 experiment India Japan 15 0.733333
133 experiment India Australian Magpie 15 0.533333
192 experiment India Yoga 15 0.733333
116 experiment India Comics 15 0.533333
134 experiment India Michael Jackson 13 0.692308

Top 10 viewed highlights in Indonesia are:

top_page_c.loc[(top_page_c['experiment_group'] == 'experiment')&(top_page_c['user_country'] == 'Indonesia')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country page_name pv completion_rate
299 experiment Indonesia Japan 15 0.733333
75 experiment Indonesia Lionel Messi 10 0.600000
163 experiment Indonesia Climate change 7 0.857143
62 experiment Indonesia Rwanda 7 0.571429
212 experiment Indonesia Statue of Liberty 7 0.428571
241 experiment Indonesia Hyderabad 7 1.000000
273 experiment Indonesia Comics 6 0.666667
82 experiment Indonesia Winnipeg 6 0.500000
106 experiment Indonesia Amazon parrot 6 1.000000
42 experiment Indonesia Giraffe 6 0.833333

Top 10 viewed highlights in Nigeria are:

top_page_c.loc[(top_page_c['experiment_group'] == 'experiment')&(top_page_c['user_country'] == 'Nigeria')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country page_name pv completion_rate
223 experiment Nigeria Lionel Messi 28 0.714286
188 experiment Nigeria Climate change 25 0.680000
162 experiment Nigeria Friends 21 0.714286
282 experiment Nigeria Yoga 20 0.650000
264 experiment Nigeria Sustainable energy 20 0.700000
85 experiment Nigeria Statue of Liberty 19 0.684211
263 experiment Nigeria Youth Olympic Games 18 0.722222
246 experiment Nigeria Obesity 18 0.833333
12 experiment Nigeria Comics 17 0.764706
29 experiment Nigeria Ancient Egypt 17 0.529412

Top 10 viewed highlights in United States are:

top_page_c.loc[(top_page_c['experiment_group'] == 'experiment')&(top_page_c['user_country'] == 'United States')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country page_name pv completion_rate
262 experiment United States Lionel Messi 18 0.666667
25 experiment United States Body piercing 17 0.764706
102 experiment United States Elephant 17 0.823529
267 experiment United States Friends 16 0.687500
108 experiment United States Michael Jackson 15 0.800000
269 experiment United States Climate change 15 0.666667
202 experiment United States Obesity 14 0.928571
179 experiment United States Feminism 13 0.769231
196 experiment United States Japan 13 0.846154
322 experiment United States Comics 12 0.750000

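From the lists, we can see that some content appears in the top-viewed lists of most countries: Lionel Messi is in the top 10 in every country in the control group and in five of the six countries in the experiment group, and Climate change is in the top 10 in five of the six countries in the experiment group. The rest of the top-viewed content differs from country to country.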
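The overlap between the per-country lists can also be counted directly rather than read off by eye. The following is a minimal sketch, assuming a DataFrame with the same columns as `top_page_c` (`experiment_group`, `user_country`, `page_name`, `pv`). The helper name `top_n_coverage` and the toy data are hypothetical and are not part of the original analysis.

```python
import pandas as pd


def top_n_coverage(df, group, n=10):
    """Count in how many countries each page ranks in the top n by page views."""
    subset = df[df["experiment_group"] == group]
    # Sort once by page views, then keep the first n rows of every country.
    top = (
        subset.sort_values("pv", ascending=False)
        .groupby("user_country")
        .head(n)
    )
    # Number of distinct countries in which each page made the top n.
    return (
        top.groupby("page_name")["user_country"]
        .nunique()
        .sort_values(ascending=False)
    )


# Toy data for illustration only (not the experiment's results).
toy = pd.DataFrame({
    "experiment_group": ["experiment"] * 6,
    "user_country": ["A", "A", "A", "B", "B", "B"],
    "page_name": ["X", "Y", "Z", "X", "Y", "W"],
    "pv": [9, 5, 1, 8, 2, 7],
})
coverage = top_n_coverage(toy, "experiment", n=2)
print(coverage)
```

On the real data, a call such as `top_n_coverage(top_page_c, 'control')` would list the pages shared by the most countries first. As with `head(10)` in the cells above, ties in `pv` at the cut-off are broken arbitrarily.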

Top Viewed Topics

# Page views and completion rate per topic, by experiment group and country
top_topic_query_c = """
    SELECT
        experiment_group,
        user_country,
        category_name AS topic,
        COUNT(1) AS pv,
        -- share of page views where the bottom of the page became visible
        SUM(CASE WHEN page_bottom_was_visible THEN 1 ELSE 0 END) / COUNT(1) AS completion_rate
    FROM global_temp.event_data_view
    WHERE event_type = 'pageUnloaded'
      AND topic != 'homepage'
    GROUP BY experiment_group, user_country, category_name
"""
top_topic_c = spark.run(top_topic_query_c)
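For readers less familiar with the SQL, the aggregation can be sketched in pandas on a few toy rows. This is an illustration under assumptions, not the pipeline used in the analysis: the column names mirror the query, the event type `pageLoaded` and all row values are made up, and the homepage filter is left out for brevity.

```python
import pandas as pd

# Toy events for illustration only; "pageLoaded" is a hypothetical event type.
events = pd.DataFrame({
    "event_type": ["pageUnloaded", "pageUnloaded", "pageUnloaded", "pageLoaded"],
    "experiment_group": ["control"] * 4,
    "user_country": ["Brazil"] * 4,
    "category_name": ["NATURE", "NATURE", "SPORT", "NATURE"],
    "page_bottom_was_visible": [True, False, True, False],
})

# Each pageUnloaded event counts as one page view.
unloads = events[events["event_type"] == "pageUnloaded"]

topic_stats = (
    unloads.groupby(["experiment_group", "user_country", "category_name"])
    .agg(
        pv=("page_bottom_was_visible", "size"),
        # The mean of a boolean column is the share of True values, i.e. the
        # fraction of page views in which the page bottom became visible.
        completion_rate=("page_bottom_was_visible", "mean"),
    )
    .reset_index()
    .rename(columns={"category_name": "topic"})
)
print(topic_stats)
```

In other words, `completion_rate` is the fraction of unloaded pages in which the reader reached the bottom of the page, computed separately for each group, country and topic.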
                                                                                

In control group:

Top viewed topics in Brazil are:

top_topic_c.loc[(top_topic_c['experiment_group'] == 'control')&(top_topic_c['user_country'] == 'Brazil')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country topic pv completion_rate
88 control Brazil NATURE 21 0.857143
13 control Brazil HISTORY 20 0.800000
18 control Brazil SPORT 18 0.777778
93 control Brazil LIFESTYLE 16 0.812500
17 control Brazil PLACES 15 0.800000
76 control Brazil PERSONALITIES 14 1.000000
87 control Brazil TOPICAL 13 0.846154

Top viewed topics in Germany are:

top_topic_c.loc[(top_topic_c['experiment_group'] == 'control')&(top_topic_c['user_country'] == 'Germany')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country topic pv completion_rate
20 control Germany HISTORY 39 0.769231
26 control Germany LIFESTYLE 31 0.806452
7 control Germany PERSONALITIES 29 0.793103
40 control Germany SPORT 28 0.678571
84 control Germany PLACES 27 0.814815
67 control Germany NATURE 24 0.583333
29 control Germany TOPICAL 23 0.782609

Top viewed topics in India are:

top_topic_c.loc[(top_topic_c['experiment_group'] == 'control')&(top_topic_c['user_country'] == 'India')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country topic pv completion_rate
23 control India LIFESTYLE 29 0.758621
55 control India TOPICAL 25 0.680000
37 control India HISTORY 24 0.791667
57 control India SPORT 23 0.434783
21 control India PLACES 18 0.722222
54 control India PERSONALITIES 18 0.833333
77 control India NATURE 17 0.647059

Top viewed topics in Indonesia are:

top_topic_c.loc[(top_topic_c['experiment_group'] == 'control')&(top_topic_c['user_country'] == 'Indonesia')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country topic pv completion_rate
80 control Indonesia PLACES 20 0.800000
65 control Indonesia PERSONALITIES 19 0.789474
60 control Indonesia LIFESTYLE 18 0.777778
2 control Indonesia NATURE 16 0.812500
15 control Indonesia HISTORY 15 0.733333
85 control Indonesia SPORT 14 0.571429
36 control Indonesia TOPICAL 13 0.923077

Top viewed topics in Nigeria are:

top_topic_c.loc[(top_topic_c['experiment_group'] == 'control')&(top_topic_c['user_country'] == 'Nigeria')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country topic pv completion_rate
47 control Nigeria PERSONALITIES 58 0.844828
64 control Nigeria LIFESTYLE 53 0.849057
90 control Nigeria TOPICAL 46 0.782609
89 control Nigeria SPORT 34 0.794118
73 control Nigeria PLACES 32 0.812500
11 control Nigeria HISTORY 29 0.827586
92 control Nigeria NATURE 27 0.740741

Top viewed topics in United States are:

top_topic_c.loc[(top_topic_c['experiment_group'] == 'control')&(top_topic_c['user_country'] == 'United States')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country topic pv completion_rate
66 control United States LIFESTYLE 43 0.767442
69 control United States PERSONALITIES 43 0.813953
72 control United States NATURE 40 0.675000
42 control United States HISTORY 33 0.787879
32 control United States TOPICAL 29 0.965517
82 control United States SPORT 28 0.892857
61 control United States PLACES 25 0.760000

In experiment group:

Top viewed topics in Brazil are:

top_topic_c.loc[(top_topic_c['experiment_group'] == 'experiment')&(top_topic_c['user_country'] == 'Brazil')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country topic pv completion_rate
71 experiment Brazil PLACES 43 0.860465
25 experiment Brazil PERSONALITIES 40 0.700000
52 experiment Brazil LIFESTYLE 39 0.923077
30 experiment Brazil TOPICAL 34 0.735294
19 experiment Brazil SPORT 33 0.878788
27 experiment Brazil NATURE 32 0.750000
48 experiment Brazil HISTORY 31 0.774194

Top viewed topics in Germany are:

top_topic_c.loc[(top_topic_c['experiment_group'] == 'experiment')&(top_topic_c['user_country'] == 'Germany')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country topic pv completion_rate
45 experiment Germany TOPICAL 50 0.660000
81 experiment Germany NATURE 45 0.600000
95 experiment Germany HISTORY 33 0.606061
68 experiment Germany PERSONALITIES 30 0.466667
74 experiment Germany LIFESTYLE 28 0.714286
12 experiment Germany PLACES 23 0.478261
75 experiment Germany SPORT 16 0.500000

Top viewed topics in India are:

top_topic_c.loc[(top_topic_c['experiment_group'] == 'experiment')&(top_topic_c['user_country'] == 'India')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country topic pv completion_rate
16 experiment India NATURE 67 0.656716
1 experiment India TOPICAL 53 0.754717
41 experiment India PLACES 46 0.782609
86 experiment India PERSONALITIES 44 0.681818
34 experiment India LIFESTYLE 43 0.674419
70 experiment India SPORT 42 0.690476
53 experiment India HISTORY 25 0.640000

Top viewed topics in Indonesia are:

top_topic_c.loc[(top_topic_c['experiment_group'] == 'experiment')&(top_topic_c['user_country'] == 'Indonesia')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country topic pv completion_rate
22 experiment Indonesia PLACES 35 0.714286
44 experiment Indonesia PERSONALITIES 19 0.684211
58 experiment Indonesia LIFESTYLE 19 0.736842
59 experiment Indonesia NATURE 19 0.842105
6 experiment Indonesia TOPICAL 17 0.941176
51 experiment Indonesia SPORT 15 0.933333
33 experiment Indonesia HISTORY 14 0.642857

Top viewed topics in Nigeria are:

top_topic_c.loc[(top_topic_c['experiment_group'] == 'experiment')&(top_topic_c['user_country'] == 'Nigeria')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country topic pv completion_rate
35 experiment Nigeria TOPICAL 78 0.769231
43 experiment Nigeria PERSONALITIES 71 0.690141
79 experiment Nigeria SPORT 65 0.723077
63 experiment Nigeria LIFESTYLE 58 0.775862
39 experiment Nigeria HISTORY 53 0.641509
91 experiment Nigeria NATURE 45 0.888889
3 experiment Nigeria PLACES 39 0.692308

Top viewed topics in United States are:

top_topic_c.loc[(top_topic_c['experiment_group'] == 'experiment')&(top_topic_c['user_country'] == 'United States')].sort_values(by=['pv'], ascending=False).head(10)
experiment_group user_country topic pv completion_rate
4 experiment United States LIFESTYLE 54 0.740741
24 experiment United States TOPICAL 54 0.814815
31 experiment United States PERSONALITIES 52 0.769231
28 experiment United States NATURE 45 0.733333
94 experiment United States SPORT 35 0.600000
50 experiment United States HISTORY 33 0.575758
78 experiment United States PLACES 30 0.700000

From the lists, we can see that the most popular topics differ from country to country.
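One way to make the country-by-country comparison easier to scan is to put each topic's rank per country side by side. The sketch below assumes a DataFrame shaped like `top_topic_c` (`experiment_group`, `user_country`, `topic`, `pv`). The helper name `topic_rank_table` and the toy countries are hypothetical.

```python
import pandas as pd


def topic_rank_table(df, group):
    """Return a topic x country table of ranks by page views (1 = most viewed)."""
    subset = df[df["experiment_group"] == group].copy()
    # Rank topics within each country; tied topics share the lowest rank.
    subset["rank"] = (
        subset.groupby("user_country")["pv"]
        .rank(method="min", ascending=False)
        .astype(int)
    )
    return subset.pivot(index="topic", columns="user_country", values="rank")


# Toy data for illustration only (not the experiment's results).
toy = pd.DataFrame({
    "experiment_group": ["control"] * 4,
    "user_country": ["A", "A", "B", "B"],
    "topic": ["NATURE", "SPORT", "NATURE", "SPORT"],
    "pv": [21, 18, 24, 28],
})
ranks = topic_rank_table(toy, "control")
print(ranks)
```

A row whose ranks vary widely across columns marks a topic whose popularity depends strongly on the country.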
