To further understand retention, it is helpful to look at retention by cohorts. A cohort is a group of customers with some shared characteristic. The characteristic can be when they signed up, or what plan they subscribed to, or how large their company is. In this case, we will group customers by the month they signed up.
Suppose we have a table with four columns, ['date', 'customer_id', 'order_num', 'order_amount_dollars']
.
date | customer_id | order_num | order_amount_dollars | |
---|---|---|---|---|
1 | 2019-08-01 | 1 | 1 | 30 |
2 | 2019-08-05 | 2 | 2 | 40 |
3 | 2019-09-01 | 1 | 3 | 30 |
4 | 2019-09-07 | 3 | 4 | 50 |
We are interested in calculating the monthly customer retention between the most recent month and the of the cohort for each month cohort. Rows should represent the year and month a customer group signed up and columns should be the time series of monthly retention. The resulting DataFrame could look like
201901 | 201902 | 201903 | 201904 | |
---|---|---|---|---|
201901 | 1.0 | 0.5 | 0.4 | 0.2 |
201902 | 0.0 | 1.0 | 0.6 | 0.5 |
201903 | 0.0 | 0.0 | 1.0 | 0.7 |
Task: Write a function customer_retention_month_cohort(df)
which takes a dataframe as above and returns a pandas DataFrame where the index is the year and month of the cohort and the columns are the retention from the original cohort year and month.
Note, this definition of retention is different from previous exercises, where retention was measured between the previous time period and the current time period. This retention is measured between the first time period the customer was active to the current month.
When doing this calculation, you need to be careful to consider only new customers for each cohort. The retention for times before the start of the cohort can be either 0 or nan.
df = pd.DataFrame({'date': pd.to_datetime(['2019-01-01', '2019-01-05', '2019-02-01', '2019-02-07', '2019-02-09', '2019-03-01']), 'customer_id': [1, 2, 1, 3, 4, 1], 'order_number': [709871, 709872, 709874, 709875, 709877, 709876], 'order_amount_dollars': [30, 20, 30, 80, 50, 30]})
date | customer_id | order_number | order_amount_dollars | |
---|---|---|---|---|
0 | 2019-01-01 00:00:00 | 1 | 709871 | 30 |
1 | 2019-01-05 00:00:00 | 2 | 709872 | 20 |
2 | 2019-02-01 00:00:00 | 1 | 709874 | 30 |
3 | 2019-02-07 00:00:00 | 3 | 709875 | 80 |
4 | 2019-02-09 00:00:00 | 4 | 709877 | 50 |
5 | 2019-03-01 00:00:00 | 1 | 709876 | 30 |
signup_yearmonth | 201901 | 201902 | 201903 |
---|---|---|---|
201901 | 1 | 0.5 | 0.5 |
201902 | 0 | 1 | 0 |