Suppose we have the following dataset of the price of a bottle of wine in a store each month. Unfortunately, some months we are missing the price data.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'time': pd.to_datetime(['2011-01-01', '2011-02-01', '2011-03-01', '2011-04-01']),
        'price': [30, np.nan, 35, 32],
    })

| | time | price |
|---|---|---|
| 0 | 2011-01-01 | 30 |
| 1 | 2011-02-01 | nan |
| 2 | 2011-03-01 | 35 |
| 3 | 2011-04-01 | 32 |

Simply imputing with the mean over the whole dataset may not be what you want: it gives rows early in the DataFrame information from the future! Depending on your data and the application, this lookahead may be harmless or critically problematic.
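To make the lookahead concrete, here is a minimal sketch using the four-month dataset above. The variable names `overall_mean` and `filled` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'time': pd.to_datetime(['2011-01-01', '2011-02-01', '2011-03-01', '2011-04-01']),
    'price': [30, np.nan, 35, 32],
})

# Series.mean() skips NaN by default, so this is (30 + 35 + 32) / 3.
overall_mean = df['price'].mean()

# February is filled with a value that depends on March and April.
filled = df['price'].fillna(overall_mean)
print(filled)
```

The February row ends up at about 32.33, a number that could not have been known in February because it uses the March and April prices.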

For this exercise, let's assume it is problematic. Instead, we can fill missing price rows with the mean of all previous rows. Filling with the mean of all previous rows ensures the imputed value doesn't look into the future.

**Task:** Write a function, `fillna_with_past_mean(df)`, which takes in the DataFrame and updates the column `price` so that `nan` rows are set to the mean price of all *previous* rows.

*Note:* One important detail is how to compute the mean over all previous rows when some of them are missing. You could skip missing rows when computing the mean, fill them with some constant value, or fill them with the mean that was computed for that row. For this exercise, simply skip missing rows when computing the mean.
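The choice matters as soon as there is more than one gap. The sketch below compares two of these options on an illustrative five-month price list with two missing values. The helper `past_means` and its `reuse_filled` flag are hypothetical names used only for this comparison.

```python
import numpy as np

prices = [30, np.nan, 35, np.nan, 32]

def past_means(prices, reuse_filled):
    """Return, for each row, the mean of the rows before it.

    reuse_filled=False: missing rows are skipped entirely.
    reuse_filled=True:  a missing row counts as the mean computed for it.
    """
    history, means = [], []
    for p in prices:
        # Mean of everything seen so far (NaN when there is no history yet).
        m = float(np.mean(history)) if history else np.nan
        means.append(m)
        if not np.isnan(p):
            history.append(p)
        elif reuse_filled and not np.isnan(m):
            history.append(m)
    return means

skip = past_means(prices, reuse_filled=False)
reuse = past_means(prices, reuse_filled=True)
print(skip[3])   # 32.5, the mean of 30 and 35
print(reuse[3])  # about 31.67, the mean of 30, 30 and 35
```

Skipping missing rows gives 32.5 for the second gap, while reusing the earlier fill pulls it toward the first price. This exercise uses the skipping rule.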

For example, given the following DataFrame:

    df = pd.DataFrame({
        'time': pd.to_datetime(['2011-01-01', '2011-02-01', '2011-03-01', '2011-04-01', '2011-05-01']),
        'price': [30, np.nan, 35, np.nan, 32],
    })

| | time | price |
|---|---|---|
| 0 | 2011-01-01 00:00:00 | 30 |
| 1 | 2011-02-01 00:00:00 | nan |
| 2 | 2011-03-01 00:00:00 | 35 |
| 3 | 2011-04-01 00:00:00 | nan |
| 4 | 2011-05-01 00:00:00 | 32 |

The expected `price` column after calling `fillna_with_past_mean(df)` is:

| | price |
|---|---|
| 0 | 30 |
| 1 | 30 |
| 2 | 35 |
| 3 | 32.5 |
| 4 | 32 |
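One possible approach, offered as a sketch rather than the reference solution, combines pandas' expanding window with a one-row shift. It assumes the rows are already sorted by `time` and that the function may modify `df` in place. The names `example` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

def fillna_with_past_mean(df):
    # expanding().mean() is the running mean of all non-missing prices up to
    # and including each row; shift(1) moves it down so that each row only
    # sees strictly earlier rows.
    past_mean = df['price'].expanding().mean().shift(1)
    # Only the missing rows are replaced; observed prices are left untouched.
    df['price'] = df['price'].fillna(past_mean)
    return df

example = pd.DataFrame({
    'time': pd.to_datetime(['2011-01-01', '2011-02-01', '2011-03-01',
                            '2011-04-01', '2011-05-01']),
    'price': [30, np.nan, 35, np.nan, 32],
})
result = fillna_with_past_mean(example)
print(result['price'].tolist())  # [30.0, 30.0, 35.0, 32.5, 32.0]
```

With this sketch, a missing value in the very first row stays `nan`, since there are no previous rows to average.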
