Customer Retention By Month Cohort

To further understand retention, it is helpful to look at retention by cohorts. A cohort is a group of customers with some shared characteristic. The characteristic can be when they signed up, or what plan they subscribed to, or how large their company is. In this case, we will group customers by the month they signed up.

Suppose we have a table with four columns, ['date', 'customer_id', 'order_num', 'order_amount_dollars'].

date customer_id order_num order_amount_dollars
1 2019-08-01 1 1 30
2 2019-08-05 2 2 40
3 2019-09-01 1 3 30
4 2019-09-07 3 4 50

We are interested in calculating the monthly customer retention between the most recent month and the of the cohort for each month cohort. Rows should represent the year and month a customer group signed up and columns should be the time series of monthly retention. The resulting DataFrame could look like

201901 201902 201903 201904
201901 1.0 0.5 0.4 0.2
201902 0.0 1.0 0.6 0.5
201903 0.0 0.0 1.0 0.7

Task: Write a function customer_retention_month_cohort(df) which takes a dataframe as above and returns a pandas DataFrame where the index is the year and month of the cohort and the columns are the retention from the original cohort year and month.

Note, this definition of retention is different from previous exercises, where retention was measured between the previous time period and the current time period. This retention is measured between the first time period the customer was active to the current month.

When doing this calculation, you need to be careful to consider only new customers for each cohort. The retention for times before the start of the cohort can be either 0 or nan.

Example Input

Code to generate input

df = pd.DataFrame({'date': pd.to_datetime(['2019-01-01', '2019-01-05', '2019-02-01', '2019-02-07', '2019-02-09', '2019-03-01']),
                   'customer_id': [1, 2, 1, 3, 4, 1],
                   'order_number': [709871, 709872, 709874, 709875, 709877, 709876],
                   'order_amount_dollars': [30, 20, 30, 80, 50, 30]})


Table generated

date customer_id order_number order_amount_dollars
0 2019-01-01 00:00:00 1 709871 30
1 2019-01-05 00:00:00 2 709872 20
2 2019-02-01 00:00:00 1 709874 30
3 2019-02-07 00:00:00 3 709875 80
4 2019-02-09 00:00:00 4 709877 50
5 2019-03-01 00:00:00 1 709876 30

Example Output

signup_yearmonth 201901 201902 201903
201901 1 0.5 0.5
201902 0 1 0