
Commit e1956cb4 authored by Matthew Krafczyk

Initial commit

parent fc02c996
# Pandas Sequence
A utility library for easily building sequenced pandas DataFrames.

Suppose we have a DataFrame with columns `["Date", "feature 1", "feature 2", ...]`, so that each feature column is defined for each Date. In some cases, Date might even be the DataFrame index.

To train Machine Learning algorithms that depend on history, we might want the DataFrame to look like this:

`["Date", "feature 1 (-1)", "feature 2 (-1)", ..., "feature 1 (0)", "feature 2 (0)", ...]`

Here, "feature 1 (-1)" means the value of feature 1 on the previous day.

This library generalizes this procedure to sequences that advance by units other than days, or even to sequences based on datetime indexes.
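
For daily data with no gaps, the target layout can be produced with plain pandas, which may help make the goal concrete. The sketch below is only an illustration with made-up values. It does not use this library, and the flat column names follow the notation above rather than the library's actual output.

```python
import pandas as pd

# Made-up daily data: four consecutive days, two features.
df = pd.DataFrame(
    {"feature 1": [1.0, 2.0, 3.0, 4.0], "feature 2": [10.0, 20.0, 30.0, 40.0]},
    index=pd.date_range("2021-01-01", periods=4, freq="D", name="Date"),
)

# Shift by one row so each date also carries the previous day's values.
previous = df.shift(1).add_suffix(" (-1)")
current = df.add_suffix(" (0)")

# The first date has no previous day, so dropna() removes it.
sequenced = pd.concat([previous, current], axis=1).dropna()
print(sequenced)
```

This simple approach only holds when every row is exactly one step after the row before it, which is the assumption the library is meant to relax.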
## Usage
Given a DataFrame `df` that contains features and a column or index that can be ordered, call `sequence_df` to build a sequenced version of `df`. `sequence_df` determines order from a sequenceable column, `sequence_col`, and a sequencing function, `sequence_function`.

`sequence_col` can be the name of a column, a pandas Series, or `None`, in which case the index is used. `sequence_function` should accept two arguments and return an integer giving how many 'units' apart they are. The result is positive when the first argument comes later in the sequence than the second, and negative when it comes earlier.

Take, for example, a DataFrame with a 'Date' column that has only one data point per quarter. Then 'Date' is the `sequence_col`, and we can define a function to pass as `sequence_function`:
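```
def quarter(date, ref_date):
    # Signed number of quarters from ref_date to date.
    # (month - 1) // 3 maps Jan-Mar to 0, Apr-Jun to 1, Jul-Sep to 2, Oct-Dec to 3.
    return ((date.year * 4) + ((date.month - 1) // 3)) - ((ref_date.year * 4) + ((ref_date.month - 1) // 3))
```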
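
A quick check of the sign convention may be useful before passing such a function to `sequence_df`. The sketch below uses plain `datetime.date` values and a hypothetical helper named `quarter_index`, which maps each month to its calendar quarter with `(month - 1) // 3`. Both are assumptions made for illustration and are not part of the library.

```python
import datetime


def quarter_index(d):
    # Hypothetical helper: count quarters since year 0 (Jan-Mar is quarter 0 of its year).
    return d.year * 4 + (d.month - 1) // 3


def quarter(d, ref):
    # Signed number of quarters from ref to d.
    return quarter_index(d) - quarter_index(ref)


ref = datetime.date(2021, 3, 31)
one_after = quarter(datetime.date(2021, 6, 30), ref)      # next quarter
one_before = quarter(datetime.date(2020, 12, 31), ref)    # previous quarter, across a year boundary
same_quarter = quarter(datetime.date(2021, 1, 15), ref)   # same quarter as ref
print(one_after, one_before, same_quarter)
```

The function returns an integer in every case, which matters because `sequence_df` raises a `TypeError` when the computed sequence values are not of an integer type.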
Then we call `sequence_df` as follows:
```
sequenced_df = sequence_df(df, 2, 0, inc_val=False, sequence_col='Date', sequence_function=quarter)
```
Here we produce a DataFrame in which each row lists the values from the previous two quarters. Because `inc_val=False`, the current quarter's own values are left out.
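
Real quarterly data often has gaps. Judging from the implementation in this commit, `sequence_df` splits the rows into runs of consecutive sequence values and only sequences runs that are long enough. The sketch below reproduces just that detection step on made-up integer offsets; the variable names are illustrative and the values are not taken from any real dataset.

```python
import pandas as pd

# Made-up integer offsets from the first row; 3, 6, 7 and 8 are missing.
offsets = pd.Series([0, 1, 2, 4, 5, 9])

# Step from each row to the previous one; the first row has no predecessor.
steps = (offsets - offsets.shift(1)).fillna(0).astype(int)

# A new group starts wherever the step is not exactly 1.
group_ids = (steps != 1).cumsum()
group_sizes = group_ids.value_counts().sort_index()

print(group_ids.tolist())
print(group_sizes.to_dict())
```

With `num_before=2` and `num_after=0`, only the first run here has the three consecutive rows needed to fill a complete window, so the other two runs would contribute no rows.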
__version__ = "0.1.0"
import pandas as pd
def sequence_plain_df(df, num_before, num_after, inc_val=True):
    """
    Sequence feature data into multi-component rows.

    This function takes a dataframe containing various features over a sequence. The dataframe is
    assumed to be 'in order', that is, each row is 1 'unit' away from its neighboring rows.

    The input dataframe should have the following structure:

    Sequence | 'Feat 1' | 'Feat 2' |
    s1       | f1(s1)   | f2(s1)   |
    s2       | f1(s2)   | f2(s2)   |
    ...

    The function then returns for num_before=2, num_after=0, inc_val=True:

    Sequence | ('Feat 1', -2) | ('Feat 2', -2) | ('Feat 1', -1) | ('Feat 2', -1) | ('Feat 1', 0) | ('Feat 2', 0) |
    s3       | f1(s1)         | f2(s1)         | f1(s2)         | f2(s2)         | f1(s3)        | f2(s3)        |
    s4       | f1(s2)         | f2(s2)         | f1(s3)         | f2(s3)         | f1(s4)        | f2(s4)        |
    ...
    """
    # Build one segment per offset, from the oldest (num_before rows back) to the newest.
    segments = []
    for i in range(num_before, -num_after - 1, -1):
        segment = None
        if i == 0:
            if inc_val:
                # Copy so that relabeling the columns below does not modify the caller's df.
                segment = df.copy()
        else:
            segment = df.shift(i)
        if segment is not None:
            # Label each column with its offset relative to the row's own sequence value.
            segment.columns = pd.MultiIndex.from_product([df.columns, [-i]])
            segments.append(segment)
    # Join segments into the full dataframe and drop rows whose window is incomplete.
    DF = pd.concat(segments, axis=1, join='outer').dropna()
    return DF
def sequence_df(df, num_before, num_after, beg_val=None, end_val=None, inc_val=True, sequence_col=None, sequence_function=None):
    """
    Sequence feature data into multi-component rows.

    This function takes a dataframe containing various features over a sequence. These
    are then stacked so neighboring values can be easily accessed by a specific sequence value.

    The input dataframe should have the following structure:

    Sequence | 'Feat 1' | 'Feat 2' |
    s1       | f1(s1)   | f2(s1)   |
    s2       | f1(s2)   | f2(s2)   |
    ...

    The function then returns for num_before=2, num_after=0, inc_val=True:

    Sequence | ('Feat 1', -2) | ('Feat 2', -2) | ('Feat 1', -1) | ('Feat 2', -1) | ('Feat 1', 0) | ('Feat 2', 0) |
    s3       | f1(s1)         | f2(s1)         | f1(s2)         | f2(s2)         | f1(s3)        | f2(s3)        |
    s4       | f1(s2)         | f2(s2)         | f1(s3)         | f2(s3)         | f1(s4)        | f2(s4)        |
    ...

    If res_df is the result dataframe, then for many models, the X matrix is simply:

    res_df.values[:,:num_days]

    Named Arguments

    df: A Pandas dataframe containing a set of features for each sequence value (for example, each day)
    num_before: The number of sequence steps before the first predicted day needed for a prediction
    num_after: The number of sequence steps after the first predicted day
    beg_val: The first sequence value for which sequences are needed
    end_val: The last sequence value for which sequences are needed
    sequence_col: The column to use: a column name, a pandas Series, or None to use the index.
    sequence_function: A function to use to compute sequence differences. If None, the plain
        difference is taken. It is called as sequence_function(value, ref_value) and should return
        the signed integer number of 'units' between value and ref_value.

    Keyword Arguments

    inc_val: Whether to include data from the first predicted day (offset 0)

    returns

    A pandas dataframe containing rows of prediction and/or label data.
    """
    # Fetch sequence series
    if sequence_col is None:
        # Use the index as the sequence
        sequence_series = df.index.to_series()
    elif isinstance(sequence_col, pd.Series):
        sequence_series = sequence_col
    else:
        # Use the named column as the sequence
        sequence_series = df[sequence_col]

    # Differences are computed against a 'reference' value: the first sequence value
    ref_val = sequence_series.iloc[0]

    # If the sequence function is None, set it as the simple difference formula
    if sequence_function is None:
        sequence_function_ = lambda s: s - ref_val
    else:
        sequence_function_ = lambda s: sequence_function(s, ref_val)

    try:
        sequence_values = sequence_series.apply(sequence_function_)
    except Exception:
        print("Tried to compute sequence differences but ran into an error!")
        raise

    # Check that sequence is an integer type
    if not pd.api.types.is_integer_dtype(sequence_values.dtype):
        raise TypeError(f"Sequence value type: {sequence_values.dtype} is not an integer type!")

    # Restrict the DF, and the matching sequence values, if necessary
    restricted_df = df
    if beg_val is not None:
        keep = sequence_values >= sequence_function_(beg_val)
        restricted_df = restricted_df[keep]
        sequence_values = sequence_values[keep]
    if end_val is not None:
        keep = sequence_values <= sequence_function_(end_val)
        restricted_df = restricted_df[keep]
        sequence_values = sequence_values[keep]

    # Detect sequential groups: a new group starts wherever the step is not exactly 1
    S = (sequence_values - sequence_values.shift(1)).fillna(0).astype(int)
    # Group ids
    G_ids = (S != 1).cumsum()

    # dropna() in sequence_plain_df keeps only rows whose whole window is present,
    # so a group needs this many rows to contribute anything, even when inc_val is False.
    window = num_before + num_after + 1

    dfs = []
    num_skipped = 0
    for g_id in G_ids.unique():
        # For each group, first check how many rows there are.
        in_group = G_ids == g_id
        if in_group.sum() >= window:
            # This group has enough data.
            dfs.append(sequence_plain_df(restricted_df[in_group], num_before, num_after, inc_val))
        else:
            num_skipped += 1
    print(f"Skipped {num_skipped} groups when building dataframe")
    if len(dfs) == 0:
        raise RuntimeError("There was no data to combine!")
    print(f"Combined {len(dfs)} groups")
    return pd.concat(dfs, axis=0)
from setuptools import setup

setup(
    name="pandas_sequence",
    version="0.1.0",
    author="Matthew Krafczyk",
    author_email="krafczyk.matthew@gmail.com",
    description="A sequence building utility library",
    #license="MIT",
    #url=,
    packages=['pandas_sequence'],
    install_requires=[
        'pandas',
    ],
)