Wiki Highlights Experiment

Summary
Purpose
Data Preparation
Metrics
Metrics Breakdown by Countries

Summary

A Wiki Highlight is a concise overview generated from the lead and other sections of a Wikipedia article and combined with a relevant image. Its purpose is to highlight relevant facts from lengthy text.

The experiment ran from January 4th to January 16th, 2024, in six countries: Brazil, Germany, India, Indonesia, Nigeria, and the United States. Participants in each country were randomly assigned to one of two versions of the content hosted on microsites: the highlight version or the article version. The content is sourced from English Wikipedia and Commons, as featured in this list. Participants could read their assigned version and choose whether to continue reading or exit the microsite.

Purpose

We measure the following metrics to understand whether Wiki Highlights is a viable reading experience for global youth audiences on third-party platforms.

Primary metric

Time on site (session length)
Total time = Time on homepage + Time on content page
Time on homepage
Time on content page

Secondary metrics

Summaries completion rate
Number of summaries consumed per session
Popular topics
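
The relationship between the three time measures can be sketched on a toy event table. This is a minimal illustration, not the analysis itself: the rows are made up, and only the column names (session_id, topic, time_length_ms) follow the event data used in this notebook.

```python
import pandas as pd

# Hypothetical page-unload events: one row per page view, times in milliseconds.
events = pd.DataFrame({
    "session_id": ["a", "a", "a", "b"],
    "topic": ["homepage", "SPORT", "homepage", "homepage"],
    "time_length_ms": [4000, 12000, 3000, 9000],
})

is_home = events["topic"] == "homepage"

# Per-session times in seconds. Session "b" never opened a content page,
# so its content_length is NaN (the analogue of NULL in the SQL queries).
per_session = pd.DataFrame({
    "home_length": events[is_home].groupby("session_id")["time_length_ms"].sum() / 1000,
    "content_length": events[~is_home].groupby("session_id")["time_length_ms"].sum() / 1000,
    "total_length": events.groupby("session_id")["time_length_ms"].sum() / 1000,
})
print(per_session)
```

For a session with both kinds of page views, total_length equals home_length plus content_length; for a homepage-only session, total_length equals home_length.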

Data Preparation

import matplotlib as mpl
import math
import pandas as pd
import numpy as np
import scipy
from pandasql import sqldf

import wmfdata
from wmfdata import hive, mariadb, spark
 
import matplotlib.pyplot as plt
import seaborn as sns

spark_session = wmfdata.spark.create_session(app_name='pyspark regular; wiki-highlights',
                                  type='yarn-regular', # local, yarn-regular, yarn-large
                                         )

country_list = ('Brazil', 'Germany', 'India', 'Indonesia', 'Nigeria', 'United States')

## Adding function for percentile

def percentile(n):
    def percentile_(x):
        return x.quantile(n)
    percentile_.__name__ = 'percentile_{:02.0f}'.format(n*100)
    return percentile_
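
As a quick check of how this helper behaves with groupby aggregation, here is a small sketch on made-up data. The helper is repeated so the snippet runs on its own; the toy frame and its column names are assumptions for illustration.

```python
import pandas as pd

def percentile(n):
    # Same helper as above: returns a named quantile function for use with .agg()
    def percentile_(x):
        return x.quantile(n)
    percentile_.__name__ = 'percentile_{:02.0f}'.format(n * 100)
    return percentile_

# Hypothetical data: two groups with three observations each.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "seconds": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Each helper call becomes one output column, named after its __name__.
summary = toy.groupby("group")["seconds"].agg([percentile(0.5), percentile(0.9)])
print(summary)
```

Setting __name__ matters because pandas uses it to label the output columns; without it, both columns would be called percentile_ and the aggregation would fail on the duplicate name.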

Wiki highlights Event Data

Collect event data from the wiki_highlights_experiment schema for the test period, January 4th to January 16th.

event_data_query = """

SELECT
  meta.dt as server_dt,
  experiment_group,
  geocoded_data['country'] as user_country,
  md5(concat(http.client_ip, '+{salt}')) as ip_hash,
  session_id, event_type,
  page_name, 
  CASE WHEN page_name IN ('categories_highlights', 'categories_articles') THEN 'homepage' ELSE topic END AS topic, -- hard code homepage
  CASE WHEN page_name IN ('categories_highlights', 'categories_articles') THEN 'homepage' ELSE category_name END AS category_name, -- hard code homepage
  page_bottom_was_visible, time_length_ms
FROM event.inuka_wiki_highlights_experiment e
LEFT JOIN cchen.wiki_highlights_article_list l ON e.page_name = l.article_title
WHERE
   (year = 2024 AND month = 1 AND day >=4 AND day <= 16)

"""

event_data = spark.run(event_data_query)
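
The ip_hash column is an MD5 digest of the client IP joined to a salt with a "+". The query text contains a {salt} placeholder, which suggests the string is meant to be filled in with str.format before it runs; that step is not shown in this notebook. As a rough illustration of what the expression computes, the sketch below reproduces it in Python, assuming the SQL md5 hashes the UTF-8 bytes of the string and returns lowercase hex. The helper name, the IP address, and the salt are all made up.

```python
import hashlib

def hash_ip(client_ip: str, salt: str) -> str:
    """Hypothetical helper mirroring md5(concat(http.client_ip, '+<salt>'))."""
    return hashlib.md5(f"{client_ip}+{salt}".encode("utf-8")).hexdigest()

# Documentation-range IP address and a made-up salt, for illustration only.
example_hash = hash_ip("203.0.113.7", "example-salt")
print(example_hash)
```

The same IP and salt always give the same 32-character digest, so the hash can stand in for the IP when grouping events, while a different salt gives an unrelated digest.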

# store data in GlobalTempView
event_sdf = spark_session.createDataFrame(event_data)
event_sdf.createGlobalTempView("event_data_view")

Metrics

Time on Site (Session Length)

The metric indicates users’ willingness to consume articles and highlights. All the times we calculate are in seconds.

time_on_site_query = """ 
 
 SELECT 
        experiment_group,
        session_id,
        SUM(time_length_ms)/1000 AS total_length,
        SUM(CASE WHEN topic = 'homepage' THEN time_length_ms END)/1000 AS home_length,
        SUM(CASE WHEN topic != 'homepage' THEN time_length_ms END)/1000 AS content_length
    FROM global_temp.event_data_view
    WHERE event_type = 'pageUnloaded'
    GROUP BY experiment_group, session_id
    
"""

time_on_site = spark.run(time_on_site_query)

## Check % of sessions with only homepage visits, no content page visits
sqldf("""
    
    SELECT 
        experiment_group,
        SUM(CASE WHEN content_length IS NULL THEN 1 END)*100 /  COUNT(1) AS hp_only_pct
    FROM time_on_site
    GROUP BY experiment_group
    
""")

	experiment_group	hp_only_pct
0	control	51
1	experiment	52

Sessions with only homepage visits made up 51% of sessions in the control group and 52% in the experiment group.
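
The same share can also be computed directly in pandas, since a session without content page views has a missing content_length. The sketch below uses a made-up frame with the same two column names; it is an alternative formulation, not the query that produced the figures above.

```python
import pandas as pd

# Hypothetical per-session table: None marks sessions with no content page views.
sessions = pd.DataFrame({
    "experiment_group": ["control", "control",
                         "experiment", "experiment", "experiment", "experiment"],
    "content_length": [None, 12.5, None, 3.0, 40.0, 7.0],
})

# The mean of a boolean mask is the share of True values within each group.
hp_only_pct = (
    sessions["content_length"].isna()
    .groupby(sessions["experiment_group"])
    .mean() * 100
)
print(hp_only_pct)
```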

Total Time

time_grouped = time_on_site.groupby('experiment_group')
total_time_column = time_grouped['total_length']

total_time_column.agg([percentile(0.5), percentile(0.75), percentile(0.90), percentile(0.95)])

	percentile_50	percentile_75	percentile_90	percentile_95
experiment_group
control	18.942	41.227	95.2176	188.6705
experiment	20.524	46.332	109.9998	197.8194

sns.set_style("white")
fig, ax = plt.subplots(figsize=(10,5))

# kdeplot returns the Axes it drew on; `fill` replaces the deprecated `shade` argument
fig = sns.kdeplot(data=time_on_site, x="total_length", hue="experiment_group", fill=True, log_scale=True, clip=(-1, 3.5))
fig.set(yticklabels=[])
fig.set(ylabel=None)
fig.set(xticklabels=[0,0,0,1,10,100,1000])

plt.xlabel("Seconds")
plt.title('Total Time Spent')

sns.set(rc={'figure.figsize':(15,5)})
sns.set_style("white")

sns.boxplot(data=time_on_site, x="total_length", hue="experiment_group",showfliers=False, gap=.5)

plt.xlabel("Seconds")
plt.title('Total Time Spent')

In the control group, 50% of sessions had a total time on site of 19 seconds or less; in the experiment group, 50% of sessions had a total time of 21 seconds or less.

In the control group, 95% of sessions had a total time on site of 189 seconds or less; in the experiment group, 95% of sessions had a total time of 198 seconds or less.

At every reported percentile, sessions in the experiment group spent more total time on homepages and content pages than sessions in the control group.
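
To put the gap in relative terms, the percentile table above can be turned into a percentage difference per percentile. The values are copied from that table; the short keys (p50, p75, ...) are just labels chosen for this sketch. This is a descriptive comparison only and says nothing about statistical significance.

```python
# Total time percentiles in seconds, copied from the table above.
control = {"p50": 18.942, "p75": 41.227, "p90": 95.2176, "p95": 188.6705}
experiment = {"p50": 20.524, "p75": 46.332, "p90": 109.9998, "p95": 197.8194}

# Relative difference of experiment over control, in percent.
lift_pct = {
    key: round((experiment[key] - control[key]) / control[key] * 100, 1)
    for key in control
}

for key, value in lift_pct.items():
    print(f"{key}: {value:+.1f}%")
```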

Time on Homepage

home_time_column = time_grouped['home_length']
home_time_column.agg([percentile(0.5), percentile(0.75), percentile(0.90), percentile(0.95)])

	percentile_50	percentile_75	percentile_90	percentile_95
experiment_group
control	14.348	27.75075	51.5783	76.13905
experiment	14.877	27.59550	52.5068	78.76800

sns.set_style("white")
fig, ax = plt.subplots(figsize=(10,5))

fig = sns.kdeplot(data=time_on_site, x="home_length", hue="experiment_group", fill=True, log_scale=True, clip=(-1, 3.5))
fig.set(yticklabels=[]) 
fig.set(ylabel=None)
fig.set(xticklabels=[0,0,0,1,10,100,1000]) 

plt.xlabel("Seconds")
plt.title("Homepage Time Spent")

# Keep the Axes returned by boxplot so the legend is moved on this figure
ax = sns.boxplot(data=time_on_site, x="home_length", hue="experiment_group", showfliers=False, gap=.5)

plt.xlabel("Seconds")
plt.title('Time Spent on Homepage')

sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

In the control group, 50% of sessions spent 14 seconds or less on the homepage; in the experiment group, 50% of sessions spent 15 seconds or less.

In the control group, 95% of sessions spent 76 seconds or less on the homepage; in the experiment group, 95% of sessions spent 79 seconds or less.

Users in the experiment group spent about as much time on the homepage as users in the control group.

Time on Content Page

content_time_column = time_grouped['content_length']
content_time_column.agg([percentile(0.5), percentile(0.75), percentile(0.90), percentile(0.95)])

	percentile_50	percentile_75	percentile_90	percentile_95
experiment_group
control	9.698	29.08675	116.2183	214.5830
experiment	11.725	46.27050	129.4230	260.2247

sns.set_style("white")
fig, ax = plt.subplots(figsize=(10,5))

fig = sns.kdeplot(data=time_on_site, x="content_length", hue="experiment_group", fill=True, log_scale=True, clip=(-1, 3.5))
fig.set(yticklabels=[])
fig.set(ylabel=None)
fig.set(xticklabels=[0,0,0,1,10,100,1000])

plt.xlabel("Seconds")
plt.title("Content Time Spent")

# Keep the Axes returned by boxplot so the legend is moved on this figure
ax = sns.boxplot(data=time_on_site, x="content_length", hue="experiment_group", showfliers=False, gap=.5)

plt.xlabel("Seconds")
plt.title('Time Spent on Content pages')

sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

In the control group, 50% of sessions that viewed content spent 10 seconds or less on content pages; in the experiment group, 50% spent 12 seconds or less.

In the control group, 95% of sessions that viewed content spent 215 seconds or less on content pages; in the experiment group, 95% spent 260 seconds or less.

Among sessions that viewed content pages, the experiment group spent more time on content pages than the control group at every reported percentile.

Note: in the control group, the articles are collapsed. Some users may therefore not have expanded each section to read the entire article, which could have affected the reading time in the control group.

Content Read Completion Rate

The metric indicates users’ willingness to complete reading the content. Content is considered complete when users reach the bottom of an article or the last page of a highlight.

When calculating the completion rate, we are excluding homepage visits.

content_completion_query = """

SELECT 
    experiment_group,
    COUNT(1) AS pageview,
    SUM(CASE WHEN page_bottom_was_visible THEN 1 END)/ COUNT(1) AS completion_rate
FROM global_temp.event_data_view
WHERE event_type = 'pageUnloaded'
      AND topic != 'homepage'
GROUP BY experiment_group
    
"""

content_completion = spark.run(content_completion_query)

content_completion

	experiment_group	pageview	completion_rate
0	control	1112	0.781475
1	experiment	1658	0.721954

The control group opened 1,112 articles, with a 78.1% completion rate.

The experiment group opened more highlights (1,658) but completed a smaller share of them (72.2%).
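
The page view counts and rates in the table above can be combined to recover the approximate number of completed reads and the gap between groups in percentage points. The inputs are copied from the table; the variable names are chosen for this sketch.

```python
# (page views, completion rate) per group, copied from the table above.
results = {
    "control": (1112, 0.781475),
    "experiment": (1658, 0.721954),
}

# Completed reads = page views * completion rate, rounded to whole pages.
completed = {group: round(pv * rate) for group, (pv, rate) in results.items()}

# Difference in completion rate, experiment minus control, in percentage points.
gap_pp = (results["experiment"][1] - results["control"][1]) * 100

print(completed)
print(f"{gap_pp:.1f} percentage points")
```

So although the completion rate is lower in the experiment group, the larger number of highlights opened means more pages were read to the end in absolute terms.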

Number of Content Viewed per Session

The metric reflects users’ willingness to view subsequent highlights and articles.

We also exclude homepage views here. If a session had only homepage views, we count it as 0 content views.
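
The counting rule can be sketched in plain Python: every session gets an entry, homepage views add 0, and any other page adds 1. The events below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (session_id, topic) pairs, one per page-unload event.
events = [
    ("s1", "homepage"),
    ("s1", "SPORT"),
    ("s1", "NATURE"),
    ("s2", "homepage"),  # homepage-only session
]

num_pages = defaultdict(int)
for session_id, topic in events:
    # Adding 0 still creates the key, so homepage-only sessions are kept with a count of 0.
    num_pages[session_id] += 0 if topic == "homepage" else 1

print(dict(num_pages))
```

Keeping the zero-count sessions is what makes the median 0 when more than half of sessions never leave the homepage.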

content_per_session_query = """

SELECT
    experiment_group,
    session_id, 
    SUM(CASE WHEN topic = 'homepage' THEN 0 ELSE 1 END) AS num_pages
FROM global_temp.event_data_view
WHERE event_type = 'pageUnloaded'
GROUP BY experiment_group,session_id

"""

content_per_session = spark.run(content_per_session_query)

content_per_session_grouped = content_per_session.groupby('experiment_group')
content_per_session_column = content_per_session_grouped['num_pages']

content_per_session_column.agg([percentile(0.5), percentile(0.75), percentile(0.90), percentile(0.95)])

	percentile_50	percentile_75	percentile_90	percentile_95
experiment_group
control	0.0	1.0	2.0	3.0
experiment	0.0	1.0	3.0	4.0

sns.set_style("white")
fig, ax = plt.subplots(figsize=(10,5))

fig = sns.kdeplot(data=content_per_session, x="num_pages", hue="experiment_group", fill=True, cut=0, clip=(0, 15),
                 palette={'control':'b', 'experiment':'r'})
fig.set(yticklabels=[]) 
fig.set(ylabel=None)

plt.title("Number of Content per Session")

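# Keep the Axes returned by boxplot so the legend is moved on this figure
ax = sns.boxplot(data=content_per_session, x="num_pages", hue="experiment_group", showfliers=False, gap=.5,
                 palette={'control':'b', 'experiment':'r'})

plt.xlabel("Number of content pages")  # the x-axis is a page count, not seconds
plt.title("Content Read per Session")

sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))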

In both the control and experiment groups, 50% of sessions consumed no summaries or articles.

In both groups, 75% of sessions consumed at most 1 summary or article.

In the control group, 95% of sessions consumed at most 3 articles. In the experiment group, 95% of sessions consumed at most 4 summaries.

At the 90th and 95th percentiles, sessions in the experiment group viewed slightly more content than sessions in the control group.

Top Viewed Content

In control group

	experiment_group	page_name	pv	completion_rate
49	control	Lionel Messi	76	0.881579
7	control	Friends	62	0.806452
16	control	Japan	60	0.783333
34	control	Ancient Egypt	53	0.735849
55	control	Body piercing	47	0.808511
40	control	Baseball	46	0.695652
21	control	Comics	46	0.847826
17	control	Feminism	43	0.813953
31	control	Obesity	42	0.833333
37	control	Statue of Liberty	41	0.804878

In experiment group

	experiment_group	page_name	pv	completion_rate
14	experiment	Lionel Messi	99	0.595960
45	experiment	Climate change	86	0.767442
52	experiment	Elephant	80	0.700000
11	experiment	Japan	79	0.721519
26	experiment	Friends	74	0.756757
38	experiment	Obesity	71	0.760563
39	experiment	Comics	69	0.710145
28	experiment	Sustainable energy	69	0.695652
29	experiment	Statue of Liberty	64	0.593750
44	experiment	Yoga	62	0.693548

Top Viewed Topics

In control group

	experiment_group	topic	pv	completion_rate
8	control	LIFESTYLE	190	0.800000
0	control	PERSONALITIES	184	0.836957
6	control	HISTORY	160	0.787500
4	control	TOPICAL	150	0.820000
5	control	SPORT	146	0.705479
10	control	NATURE	145	0.710345
2	control	PLACES	137	0.788321

In experiment group

	experiment_group	topic	pv	completion_rate
12	experiment	TOPICAL	286	0.762238
3	experiment	PERSONALITIES	257	0.680934
11	experiment	NATURE	254	0.728346
1	experiment	LIFESTYLE	245	0.767347
7	experiment	PLACES	216	0.726852
13	experiment	SPORT	207	0.719807
9	experiment	HISTORY	193	0.647668

Metrics Breakdown by Countries

This section breaks down each metric by country to facilitate comparisons.

Time on Site (Session Length)

time_on_site_c_query = """ 
 
 SELECT 
        user_country,
        experiment_group,
        session_id,
        SUM(time_length_ms)/1000 AS total_length,
        SUM(CASE WHEN topic = 'homepage' THEN time_length_ms END)/1000 AS home_length,
        SUM(CASE WHEN topic != 'homepage' THEN time_length_ms END)/1000 AS content_length
    FROM global_temp.event_data_view
    WHERE event_type = 'pageUnloaded'
      AND user_country IN {country_list}   
    GROUP BY user_country,experiment_group, session_id
    
"""

time_on_site_c = spark.run(
       time_on_site_c_query.format(
          country_list = country_list
        ))
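
The {country_list} placeholder works because the string form of a Python tuple of strings happens to look like a SQL IN list. The sketch below shows the rendered clause and one edge case; sql_in_list is a hypothetical helper, not something used elsewhere in this notebook.

```python
country_list = ('Brazil', 'Germany', 'India', 'Indonesia', 'Nigeria', 'United States')

# str.format inserts the tuple's repr, which reads as a valid IN list here.
clause = "user_country IN {country_list}".format(country_list=country_list)
print(clause)

# Edge case: a one-element tuple renders with a trailing comma, which is not valid SQL.
single = "user_country IN {country_list}".format(country_list=('Brazil',))
print(single)

def sql_in_list(values):
    """Hypothetical helper: build the IN list explicitly instead of relying on repr."""
    for value in values:
        if "'" in value:
            # Quoting rules differ between SQL engines, so refuse rather than guess.
            raise ValueError(f"unsupported quote character in {value!r}")
    return "(" + ", ".join(f"'{value}'" for value in values) + ")"

print("user_country IN " + sql_in_list(('Brazil',)))
```

With the six countries used here the repr-based approach and the explicit helper give the same text, so the helper only matters if the list shrinks to one country or a name contains a quote.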

Total Time

total_time_c_query = """

    WITH total_time AS (
        SELECT 
            user_country,
            experiment_group,
            session_id,
            SUM(time_length_ms)/1000 AS total_length
        FROM global_temp.event_data_view
        WHERE event_type = 'pageUnloaded'
          AND user_country IN {country_list}
        GROUP BY user_country,experiment_group, session_id
    )
    
    SELECT
       user_country,
       experiment_group,
       PERCENTILE_APPROX(total_length,0.50) AS 50_percentile,
       PERCENTILE_APPROX(total_length,0.75) AS 75_percentile,
       PERCENTILE_APPROX(total_length,0.90) AS 90_percentile,
       PERCENTILE_APPROX(total_length,0.95) AS 95_percentile
    FROM total_time
    GROUP BY user_country,experiment_group
    ORDER BY user_country,experiment_group

"""

spark.run( 
         total_time_c_query.format(
          country_list = country_list
        )
    )

	user_country	experiment_group	50_percentile	75_percentile	90_percentile	95_percentile
0	Brazil	control	20.637	39.948	96.972	164.477
1	Brazil	experiment	25.193	52.807	123.295	200.400
2	Germany	control	15.257	26.127	49.417	73.394
3	Germany	experiment	15.560	24.174	48.615	68.727
4	India	control	18.850	36.875	66.654	106.661
5	India	experiment	22.300	45.418	92.987	129.510
6	Indonesia	control	17.296	28.746	58.731	76.430
7	Indonesia	experiment	15.608	25.449	48.345	102.256
8	Nigeria	control	42.690	114.206	297.555	521.008
9	Nigeria	experiment	60.662	122.915	319.056	606.709
10	United States	control	17.650	39.494	78.332	146.681
11	United States	experiment	19.837	42.082	88.832	123.535

#sns.set_theme(style="white")
#g = sns.FacetGrid(time_on_site_c, row="user_country",aspect=7, height=3.5)

#g.map_dataframe(sns.kdeplot, x="total_length",hue="experiment_group",shade=True, log_scale=True, clip =(-1,3.5))
#fig.set(yticklabels=[]) 
#fig.set(ylabel=None)
#fig.set(xticklabels=[0,0,0,1,10,100,1000]) 

#plt.xlabel("Seconds")
#plt.title('Total Time Spent')

sns.set(rc={'figure.figsize':(15,8)})
sns.set_style("white")

# Keep the Axes returned by boxplot so the legend is moved on this figure
ax = sns.boxplot(data=time_on_site_c, x="total_length", y="user_country", hue="experiment_group", showfliers=False, gap=.3)

plt.xlabel("Seconds")
plt.title('Total Time Spent')
plt.ylabel("")  # hide the y-axis label

sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

From the data above, in Brazil, India, Nigeria, and the United States, sessions in the experiment group spent more total time on homepages and content pages than sessions in the control group at the median and at most higher percentiles.

In Indonesia and Germany, sessions in the control group spent more total time at most percentiles.

Time on homepage

homepage_c_query= """

    WITH total_time AS (
        SELECT 
            user_country,
            experiment_group,
            session_id,
            SUM(CASE WHEN topic = 'homepage' THEN time_length_ms END)/1000 AS home_length
        FROM global_temp.event_data_view
        WHERE event_type = 'pageUnloaded'
          AND user_country IN {country_list}   
        GROUP BY user_country,experiment_group, session_id
    )
    
    SELECT
       user_country,
       experiment_group,
       PERCENTILE_APPROX(home_length,0.50) AS 50_percentile,
       PERCENTILE_APPROX(home_length,0.75) AS 75_percentile,
       PERCENTILE_APPROX(home_length,0.90) AS 90_percentile,
       PERCENTILE_APPROX(home_length,0.95) AS 95_percentile
    FROM total_time
    GROUP BY user_country,experiment_group
    ORDER BY user_country,experiment_group

"""

spark.run( 
         homepage_c_query.format(
          country_list = country_list
        )
    )

	user_country	experiment_group	50_percentile	75_percentile	90_percentile	95_percentile
0	Brazil	control	16.276	32.965	62.115	89.921
1	Brazil	experiment	17.559	37.829	77.639	109.329
2	Germany	control	12.864	18.631	35.523	42.511
3	Germany	experiment	12.830	18.216	31.342	41.777
4	India	control	14.632	26.862	43.337	61.269
5	India	experiment	15.706	30.874	48.493	81.931
6	Indonesia	control	13.864	22.128	39.994	57.322
7	Indonesia	experiment	13.366	19.215	37.140	46.788
8	Nigeria	control	20.399	50.511	96.485	178.650
9	Nigeria	experiment	21.995	53.847	76.100	109.041
10	United States	control	13.311	25.069	40.706	62.318
11	United States	experiment	13.295	22.200	35.865	48.751

sns.set_style("white")
# Keep the Axes returned by boxplot so the legend is moved on this figure
ax = sns.boxplot(data=time_on_site_c, x="home_length", y="user_country", hue="experiment_group", showfliers=False, gap=.3)

plt.xlabel("Seconds")
plt.title('Time Spent on Homepage')
plt.ylabel("")  # hide the y-axis label

sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

For time spent on the homepage, the experiment group spent more time at every reported percentile in Brazil and India, and at the 50th and 75th percentiles in Nigeria.

In Germany, users in the experiment group and the control group spent similar time on the homepage.

In Indonesia and the United States, the control group spent more time on the homepage.

Time on content page

content_time_c_query= """

    WITH total_time AS (
        SELECT 
            user_country,
            experiment_group,
            session_id,
            SUM(CASE WHEN topic != 'homepage' THEN time_length_ms END)/1000 AS content_length
        FROM global_temp.event_data_view
        WHERE event_type = 'pageUnloaded'
          AND user_country IN {country_list}   
        GROUP BY user_country,experiment_group, session_id
    )
    
    SELECT
       user_country,
       experiment_group,
       PERCENTILE_APPROX(content_length,0.50) AS 50_percentile,
       PERCENTILE_APPROX(content_length,0.75) AS 75_percentile,
       PERCENTILE_APPROX(content_length,0.90) AS 90_percentile,
       PERCENTILE_APPROX(content_length,0.95) AS 95_percentile
    FROM total_time
    GROUP BY user_country,experiment_group
    ORDER BY user_country,experiment_group

"""

spark.run( 
         content_time_c_query.format(
          country_list = country_list
        )
    )

	user_country	experiment_group	50_percentile	75_percentile	90_percentile	95_percentile
0	Brazil	control	8.350	21.339	84.189	203.409
1	Brazil	experiment	14.708	41.528	138.720	176.157
2	Germany	control	6.515	13.817	37.952	50.856
3	Germany	experiment	7.157	11.233	26.987	57.811
4	India	control	9.637	20.231	59.604	121.616
5	India	experiment	11.766	35.500	92.876	127.866
6	Indonesia	control	8.790	17.016	38.026	70.835
7	Indonesia	experiment	8.918	19.751	104.614	187.773
8	Nigeria	control	43.949	176.100	369.269	526.803
9	Nigeria	experiment	79.387	225.900	572.007	969.253
10	United States	control	8.436	22.734	85.557	136.388
11	United States	experiment	12.308	41.031	85.045	117.422

sns.set_style("white")
# Keep the Axes returned by boxplot so the legend is moved on this figure
ax = sns.boxplot(data=time_on_site_c, x="content_length", y="user_country", hue="experiment_group", showfliers=False, gap=.3)

plt.xlabel("Seconds")
plt.title('Time Spent on Content pages')
plt.ylabel("")  # hide the y-axis label

sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

For time spent on content pages, in Brazil, India, Indonesia, Nigeria, and the United States, the experiment group spent more time at the median and the 75th percentile.

In Germany, the control group spent more time on content pages at the 75th and 90th percentiles.

Sessions in Nigeria spent much longer on content pages than sessions in any other country.

Number of Content Viewed per Session

content_per_session_query_c = """

WITH content_view AS (
SELECT
    experiment_group,
    user_country,
    session_id, 
    SUM(CASE WHEN topic = 'homepage' THEN 0 ELSE 1 END) AS num_pages
FROM global_temp.event_data_view
WHERE event_type = 'pageUnloaded'
  AND user_country IN {country_list}   
GROUP BY experiment_group,user_country,session_id
)

SELECT
       user_country,
       experiment_group,
       PERCENTILE_APPROX(num_pages,0.50) AS 50_percentile,
       PERCENTILE_APPROX(num_pages,0.75) AS 75_percentile,
       PERCENTILE_APPROX(num_pages,0.90) AS 90_percentile,
       PERCENTILE_APPROX(num_pages,0.95) AS 95_percentile
    FROM content_view
    GROUP BY user_country,experiment_group
    ORDER BY user_country,experiment_group

"""

content_per_session_c = spark.run( 
        content_per_session_query_c.format(
          country_list = country_list
        )
    )

content_per_session_c

	user_country	experiment_group	50_percentile	75_percentile	90_percentile	95_percentile
0	Brazil	control	0	1	1	2
1	Brazil	experiment	0	1	2	4
2	Germany	control	1	1	2	3
3	Germany	experiment	1	1	2	2
4	India	control	0	1	2	2
5	India	experiment	0	1	2	4
6	Indonesia	control	0	1	1	2
7	Indonesia	experiment	0	1	1	2
8	Nigeria	control	1	1	4	5
9	Nigeria	experiment	1	2	5	8
10	United States	control	1	1	2	3
11	United States	experiment	1	1	3	4

From the data above, in Brazil, India, Nigeria, and the United States, users in the experiment group viewed more content per session than users in the control group.

In Indonesia, users viewed a similar amount of content per session in both groups.

In Germany, users in the experiment group viewed less content per session than users in the control group (2 vs. 3 pages at the 95th percentile).

Content Read Completion Rate

content_completion_query_c = """

SELECT 
    experiment_group,
    user_country,
    COUNT(1) AS pageview,
    SUM(CASE WHEN page_bottom_was_visible THEN 1 END)/ COUNT(1)*100 AS completion_rate 
FROM global_temp.event_data_view
WHERE event_type = 'pageUnloaded'
  AND topic != 'homepage'
  AND user_country IN {country_list}   
GROUP BY experiment_group, user_country
ORDER BY user_country,experiment_group
    
"""

content_completion_c = spark.run( 
        content_completion_query_c.format(
          country_list = country_list
        )
    )

content_completion_c

	experiment_group	user_country	pageview	completion_rate
0	control	Brazil	117	83.760684
1	experiment	Brazil	252	80.555556
2	control	Germany	201	75.124378
3	experiment	Germany	225	59.111111
4	control	India	154	69.480519
5	experiment	India	320	70.000000
6	control	Indonesia	115	77.391304
7	experiment	Indonesia	138	77.536232
8	control	Nigeria	279	81.362007
9	experiment	Nigeria	409	73.838631
10	control	United States	241	80.082988
11	experiment	United States	303	71.947195

sns.set(rc={'figure.figsize':(15,8)})
sns.set_style("white")

ax= sns.barplot(content_completion_c, x="completion_rate", y="user_country", hue="experiment_group", orient="y")
for container in ax.containers:
    ax.bar_label(container, fmt='%.1f%%')

plt.xlabel("Completion Rate %")
plt.ylabel("")  # hide the y-axis label
plt.title('Content Completion Rate by Country')

sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

The content completion rate is lower in the experiment group than in the control group in every country except India and Indonesia.
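
This can be checked directly against the table above by subtracting the control rate from the experiment rate in each country. The rates are copied from the table; the variable names are chosen for this sketch.

```python
# Completion rate in percent as (control, experiment), copied from the table above.
rates = {
    "Brazil": (83.760684, 80.555556),
    "Germany": (75.124378, 59.111111),
    "India": (69.480519, 70.000000),
    "Indonesia": (77.391304, 77.536232),
    "Nigeria": (81.362007, 73.838631),
    "United States": (80.082988, 71.947195),
}

# Experiment minus control, in percentage points.
diff_pp = {
    country: round(experiment - control, 1)
    for country, (control, experiment) in rates.items()
}

# Countries where the experiment group is not lower than the control group.
not_lower = sorted(
    country for country, (control, experiment) in rates.items()
    if experiment >= control
)

print(diff_pp)
print(not_lower)
```

The differences in India and Indonesia are well under one percentage point, so in those two countries the groups are effectively level rather than clearly higher in the experiment group.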

Top Viewed Content

In control group

	experiment_group	user_country	page_name	pv	completion_rate
30	control	Brazil	Amazon parrot	9	1.000000
135	control	Brazil	Baseball	7	0.857143
59	control	Brazil	Ancient Egypt	7	0.714286
79	control	Brazil	Friends	7	1.000000
337	control	Brazil	Japan	7	0.714286
1	control	Brazil	Lionel Messi	6	1.000000
114	control	Brazil	Obesity	6	1.000000
293	control	Brazil	Australian Magpie	5	0.400000
10	control	Brazil	Statue of Liberty	5	0.800000
151	control	Brazil	Comics	5	0.800000