Commit b0baaf20 authored by Matthew Krafczyk

Revise README

parent e6a6e1cf
A utility library to easily build sequenced pandas DataFrames.
Suppose we have a DataFrame with columns `["Date", "feature 1", "feature 2", ...]`, so the DataFrame has each feature column defined for each Date. In some cases, Date might even be the DataFrame index.
To train machine learning algorithms which depend on history, we might need values lagged from the current day or position in the index, easily accessed by the original index date or value. This would mean a DataFrame looking like this:
`["Date", "feature 1 (-1)", "feature 2 (-1)", ..., "feature 1 (0)", "feature 2 (0)", ...]`
Where "feature 1 (-1)" means the value of feature 1 on the previous day (a lag of -1).
This library generalizes this procedure to work with arbitrary sequences and DataFrames.
Additionally, we frequently encounter data with missing dates, meaning we can't simply use pandas' `DataFrame.shift`: shifting by one position within the index causes values to jump across date gaps, defeating the purpose of lagged values.
For example, consider the following dataset:
| *`Date`* | `Feature A` | `Feature B` |
|:-------- | -----------:| -----------:|
| `2020-01-01` | `0.1` | `2` |
| `2020-01-02` | `0.2` | `3` |
| `2020-01-03` | `-0.1` | `4` |
| `2020-01-04` | `1.5` | `6` |
| `2020-01-05` | `1.0` | `1` |
| `2020-01-06` | `1.4` | `5` |
| `2020-01-08` | `1.8` | `10` |
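For concreteness, this sample dataset can be built with plain pandas like so (a minimal sketch; note that `2020-01-07` is intentionally missing):

```python
import pandas as pd

# Build the example dataset; 2020-01-07 is absent from the index.
df = pd.DataFrame(
    {
        "Feature A": [0.1, 0.2, -0.1, 1.5, 1.0, 1.4, 1.8],
        "Feature B": [2, 3, 4, 6, 1, 5, 10],
    },
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03",
         "2020-01-04", "2020-01-05", "2020-01-06", "2020-01-08"]
    ),
)
df.index.name = "Date"
```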
We might consider building a lag-1 column for `Feature A` using the `.shift(1)` function, like so:
| *`Date`* | `(Feature A, -1)` | `(Feature A, 0)` |
|:-------- | -----------:| -----------:|
| `2020-01-01` | `NaN` | `0.1` |
| `2020-01-02` | `0.1` | `0.2` |
| `2020-01-03` | `0.2` | `-0.1` |
| `2020-01-04` | `-0.1` | `1.5` |
| `2020-01-05` | `1.5` | `1.0` |
| `2020-01-06` | `1.0` | `1.4` |
| `2020-01-08` | `1.4` | `1.8` |
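You can reproduce this naive result with plain pandas (assuming the sample data above is in a DataFrame indexed by `Date`):

```python
import pandas as pd

df = pd.DataFrame(
    {"Feature A": [0.1, 0.2, -0.1, 1.5, 1.0, 1.4, 1.8]},
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03",
         "2020-01-04", "2020-01-05", "2020-01-06", "2020-01-08"]
    ),
)
# .shift(1) shifts by position, ignoring the missing 2020-01-07:
lagged = df["Feature A"].shift(1)
# lagged["2020-01-08"] is 1.4 -- the value from 2020-01-06, two days earlier!
```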
However, there's a problem: the value on the day before `2020-01-08` isn't `1.4`! The data for `2020-01-07` is missing!
We can solve this problem by noticing that the dates can be ordered by integers: we know that `2020-01-01` is one day before `2020-01-02` and that `2020-01-04` is two days after `2020-01-02`.
`sequence_df` takes the DataFrame, the set of lags we want, and a grouping specification (defined below) that determines the groups for a pandas `groupby` operation. Within each group `.shift` works as expected, and `sequence_df` then builds a DataFrame containing the lagged values for you.
```
def days_diff(d, ref_val):
    return (d - ref_val).days

DF = sequence_df(df, [-1, 0], [('sequence', 'level', 'Date', days_diff)])
```
`DF` looks like this:
| *`Date`* | `(Feature A, -1)` | `(Feature B, -1)` | `(Feature A, 0)` | `(Feature B, 0)` |
|:-------- | -----------:| -----------:| --------:| ----:|
| `2020-01-02` | `0.1` | `2` | `0.2` | `3` |
| `2020-01-03` | `0.2` | `3` | `-0.1` | `4` |
| `2020-01-04` | `-0.1` | `4` | `1.5` | `6` |
| `2020-01-05` | `1.5` | `6` | `1.0` | `1` |
| `2020-01-06` | `1.0` | `1` | `1.4` | `5` |
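The procedure can be sketched with plain pandas: label runs of consecutive days, `groupby` that label, `shift` within each group, and drop rows with `NA` values. This is a hand-rolled sketch of the idea, not the library's actual implementation:

```python
import pandas as pd

df = pd.DataFrame(
    {"Feature A": [0.1, 0.2, -0.1, 1.5, 1.0, 1.4, 1.8],
     "Feature B": [2, 3, 4, 6, 1, 5, 10]},
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03",
         "2020-01-04", "2020-01-05", "2020-01-06", "2020-01-08"]
    ),
)
# A new group starts wherever the gap to the previous date isn't one day.
group = df.index.to_series().diff().dt.days.ne(1).cumsum()
# Shift within each run of consecutive days, then drop incomplete rows.
out = pd.concat(
    {-1: df.groupby(group.values).shift(1), 0: df},
    axis=1,
).dropna()
```

The resulting `out` has MultiIndex columns keyed by lag, and contains exactly the five complete rows shown in the table above: `2020-01-01` is dropped because it has no previous day, and `2020-01-08` is dropped because `2020-01-07` is missing.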
## Usage
Given a DataFrame `df` which contains features and a column or index which can be ordered, we can call `sequence_df` to build a sequenced version of `df`. `sequence_df` determines groups of values which are in sequence using either a 'sequencable' column or an explicit group labelling. One or more of these grouping specifications can be passed to `sequence_df`, which combines them and uses a single `groupby` call to produce the shifted values. `NA` values are then dropped from the resulting table before it is returned.
Let's take, for example, a DataFrame with a 'Date' column which has data points only once per quarter. The group spec we want to pass is `('sequence', 'column', 'Date', quarter)`: `'sequence'` indicates the spec is a sequence group spec, `'column'` indicates we want to use a column of the DataFrame (though you can pass a full pandas Series here and `sequence_df` will use that), and `'Date'` names the column to use. Finally, `quarter` is a function defining the sequence ordering, which is necessary here since pandas DataFrames don't by themselves know whether two dates are one quarter apart.
```
def quarter(date, ref_date):
    # Map each date to a global quarter number, then take the difference.
    # (month - 1) // 3 maps Jan-Mar to 0, Apr-Jun to 1, ..., Oct-Dec to 3,
    # so December does not collide with January of the following year.
    return ((date.year * 4) + ((date.month - 1) // 3)) \
        - ((ref_date.year * 4) + ((ref_date.month - 1) // 3))
```
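As a quick sanity check of this ordering (the definition is repeated so the snippet runs standalone, and the sample dates are arbitrary):

```python
import pandas as pd

def quarter(date, ref_date):
    # Map each date to a global quarter number, then take the difference.
    return ((date.year * 4) + ((date.month - 1) // 3)) \
        - ((ref_date.year * 4) + ((ref_date.month - 1) // 3))

# 2020-04-15 (Q2 2020) is one quarter after 2020-01-01 (Q1 2020):
print(quarter(pd.Timestamp("2020-04-15"), pd.Timestamp("2020-01-01")))  # 1
# Q1 2021 is one quarter after Q4 2020, even across the year boundary:
print(quarter(pd.Timestamp("2021-02-01"), pd.Timestamp("2020-11-15")))  # 1
```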
Then we call `sequence_df` with the lags we need, like `[-2, -1, 0]` (indicating we want values from two quarters ago and one quarter ago, as well as the current quarter's value), as follows:
```
sequenced_df = sequence_df(df, [-2, -1, 0], [('sequence', 'column', 'Date', quarter)])
```
Here, we're producing a DataFrame where, for each quarter, the previous two quarters' values are listed.