jmspack.internal_utils

Submodule internal_utils.py includes the following functions:

  • postgresql_data_extraction(): importing data from a table of a postgresql database.

  • postgresql_table_names_list(): extract the table names from a specified postgresql database.

  • create_postgresql_table_based_on_df(): create a new table in a specified postgresql database based on the columns of a pandas data frame.

  • add_data_to_postgresql_table(): add new data to an existing table in a specified postgresql database.

  • delete_postgresql_table(): delete a table from a postgresql database.

jmspack.internal_utils.add_data_to_postgresql_table(df: DataFrame, database_name: str, user: str, table_name: str)[source]

Add new data to an existing table in a specified postgresql database.

Parameters:
df: pd.DataFrame

The pandas dataframe containing the data you wish to add to the existing table in postgresql.

database_name: str

The name of the postgresql database.

user: str

The name of the user.

table_name: str

A string specifying the name of the existing table to which the data will be added.

Returns:
str

Examples

>>> import seaborn as sns
>>> from dotenv import load_dotenv, find_dotenv
>>> from jmspack.internal_utils import create_postgresql_table_based_on_df, add_data_to_postgresql_table
>>> # Make sure you have a .env file somewhere with your postgresql credentials
>>> # labelled as postgresql_host="BLA", and postgresql_password="BLA2"
>>> load_dotenv(find_dotenv())
>>> iris_df = sns.load_dataset("iris")
>>> _ = create_postgresql_table_based_on_df(df=iris_df,
...                                         database_name="tracker",
...                                         user="tracker",
...                                         table_name="iris_test",
...                                         )
>>> _ = add_data_to_postgresql_table(df=iris_df,
...                                 database_name="tracker",
...                                 user="tracker",
...                                 table_name="iris_test",
...                                 )
jmspack.internal_utils.create_postgresql_table_based_on_df(df: DataFrame, database_name: str, user: str, table_name: str)[source]

Create a new table in a specified postgresql database based on the columns of a pandas data frame.

Parameters:
df: pd.DataFrame

The pandas dataframe object you wish to use the columns and data types to create the table from in postgresql

database_name: str

The name of the postgresql database.

user: str

The name of the user.

table_name: str

A string specifying the name of the newly created table.

Returns:
str

Examples

>>> import seaborn as sns
>>> from dotenv import load_dotenv, find_dotenv
>>> from jmspack.internal_utils import create_postgresql_table_based_on_df
>>> # Make sure you have a .env file somewhere with your postgresql credentials
>>> # labelled as postgresql_host="BLA", and postgresql_password="BLA2"
>>> load_dotenv(find_dotenv())
>>> iris_df = sns.load_dataset("iris")
>>> _ = create_postgresql_table_based_on_df(df=iris_df,
...                                         database_name="tracker",
...                                         user="tracker",
...                                         table_name="iris_test",
...                                         )
jmspack.internal_utils.delete_postgresql_table(database_name: str, user: str, table_name: str)[source]

Delete a table from a postgresql database.

Parameters:
database_name: str

The name of the postgresql database.

user: str

The name of the user.

table_name: str

A string specifying the name of the table to delete.

Returns:
str

Examples

>>> import seaborn as sns
>>> from dotenv import load_dotenv, find_dotenv
>>> from jmspack.internal_utils import (create_postgresql_table_based_on_df,
...                                     add_data_to_postgresql_table,
...                                     delete_postgresql_table)
>>> # Make sure you have a .env file somewhere with your postgresql credentials
>>> # labelled as postgresql_host="BLA", and postgresql_password="BLA2"
>>> load_dotenv(find_dotenv())
>>> iris_df = sns.load_dataset("iris")
>>> _ = create_postgresql_table_based_on_df(df=iris_df,
...                                         database_name="tracker",
...                                         user="tracker",
...                                         table_name="iris_test",
...                                         )
>>> _ = add_data_to_postgresql_table(df=iris_df,
...                                 database_name="tracker",
...                                 user="tracker",
...                                 table_name="iris_test",
...                                 )
>>> _ = delete_postgresql_table(database_name="tracker",
...                             user="tracker",
...                             table_name="iris_test"
...                             )
jmspack.internal_utils.postgresql_data_extraction(table_name: str = 'suggested_energy_intake', database_name: str = 'tracker', user: str = 'tracker')[source]

Load data from a specified postgresql database.

Parameters:
table_name: str

The name of the table to extract from the postgresql database.

database_name: str

The name of the postgresql database.

user: str

The name of the user.

Returns:
df: pd.DataFrame

pandas dataframe object containing the data from the specified table.

Examples

>>> from dotenv import load_dotenv, find_dotenv
>>> from jmspack.internal_utils import postgresql_data_extraction
>>> # Make sure you have a .env file somewhere with your postgresql credentials
>>> # labelled as postgresql_host="BLA", and postgresql_password="BLA2"
>>> load_dotenv(find_dotenv())
>>> df = postgresql_data_extraction(table_name = 'iris_test',
...                            database_name = 'tracker',
...                            user='tracker')
jmspack.internal_utils.postgresql_table_names_list(database_name: str = 'tracker', user='tracker')[source]

Extract the table names from a specified postgresql database.

Parameters:
database_name: str

The name of the postgresql database.

user: str

The name of the user.

Returns:
list

Examples

>>> from dotenv import load_dotenv, find_dotenv
>>> from jmspack.internal_utils import postgresql_table_names_list
>>> # Make sure you have a .env file somewhere with your postgresql credentials
>>> # labelled as postgresql_host="BLA", and postgresql_password="BLA2"
>>> load_dotenv(find_dotenv())
>>> table_names = postgresql_table_names_list()

jmspack.ml_utils

Submodule ml_utils.py includes the following functions:

  • plot_decision_boundary(): Generate a simple plot of the decision boundary of a classifier.

  • plot_cv_indices(): Visualise the train/test indices of a chosen cross validation method, fold by fold.

  • plot_learning_curve(): Plot the learning curve of an estimator as samples increase to evaluate overfitting.

  • dict_of_models: A dictionary of useful models.

  • multi_roc_auc_plot(): A utility to plot the ROC curves of multiple classifiers (suggested to use in conjunction with the dict_of_models).

  • optimize_model(): A utility to run gridsearch and Recursive Feature Elimination on a classifier to return a model with the best parameters.

  • plot_confusion_matrix(): Visualise a confusion matrix.

  • summary_performance_metrics_classification(): A utility to return a selection of regularly used classification performance metrics.

  • RMSE(): Root Mean Squared Error.

jmspack.ml_utils.RMSE(true, pred)[source]

Root Mean Squared Error.

Parameters:
true: pd.Series

The actual values.

pred: pd.Series

The predicted values.

Returns:
float

Examples

>>> import pandas as pd
>>> from jmspack.ml_utils import RMSE
>>> true = pd.Series([1, 2, 5, 4, 5])
>>> pred = pd.Series([1, 2, 3, 4, 5])
>>> RMSE(true, pred)
jmspack.ml_utils.multi_roc_auc_plot(X: DataFrame, y: Series, models: list = [{'label': 'Logistic Regression', 'model': LogisticRegression()}, {'label': 'Gradient Boosting', 'model': GradientBoostingClassifier()}, {'label': 'K_Neighbors Classifier', 'model': KNeighborsClassifier(n_neighbors=3)}, {'label': 'SVM Classifier (linear)', 'model': SVC(C=0.025, kernel='linear', probability=True)}, {'label': 'SVM Classifier (Radial Basis Function; RBF)', 'model': SVC(C=1, gamma=2, probability=True)}, {'label': 'Gaussian Process Classifier', 'model': GaussianProcessClassifier(kernel=1**2 * RBF(length_scale=1))}, {'label': 'Decision Tree (depth=5)', 'model': DecisionTreeClassifier(max_depth=5)}, {'label': 'Random Forest Classifier(depth=5)', 'model': RandomForestClassifier(max_depth=5, max_features=1, n_estimators=10)}, {'label': 'Multilayer Perceptron (MLP) Classifier', 'model': MLPClassifier(alpha=1, max_iter=1000)}, {'label': 'AdaBoost Classifier', 'model': AdaBoostClassifier()}, {'label': 'Naive Bayes (Gaussian) Classifier', 'model': GaussianNB()}, {'label': 'Quadratic Discriminant Analysis Classifier', 'model': QuadraticDiscriminantAnalysis()}], figsize: tuple = (7, 7))[source]

Plot the ROC curves of multiple classifiers.

Parameters:
X: array-like, shape (n_samples, n_features)

Training vector, where n_samples is the number of samples and n_features is the number of features.

y: array-like, shape (n_samples)

Target relative to X for classification. Datatype should be integers.

models: list

A list of dictionaries containing the model and the label to be used in the plot.

figsize: tuple (default: (7, 7))

Width and height of the figure in inches.

Returns:
fig: matplotlib.figure.Figure

Properties of the figure can be changed later, e.g. use fig.axes[0].set_ylim(0,100) to change ylim

ax: matplotlib.axes._subplots.AxesSubplot

The axes associated with the fig Figure.

Examples

>>> import seaborn as sns
>>> from jmspack.ml_utils import multi_roc_auc_plot, dict_of_models
>>> data = (
...     sns.load_dataset("iris")
...     .loc[lambda df: df["species"].isin(["setosa", "virginica"])]
...     .replace({"virginica": 0, "setosa": 1})
... )
>>> y = data["species"]
>>> X = data[["sepal_length", "sepal_width"]]
>>> _ = multi_roc_auc_plot(X=X, y=y, models=dict_of_models, figsize=(7, 7))
jmspack.ml_utils.optimize_model(X: DataFrame, y: Series, estimator: BaseEstimator = RandomForestClassifier(), grid_params_dict: dict = {'criterion': ['gini', 'entropy'], 'max_depth': [1, 2, 3, 4, 5, 10], 'max_features': ['log2', 'sqrt'], 'n_estimators': [10, 20, 30, 40, 50]}, gridsearch_kwargs: dict = {'cv': 3, 'n_jobs': -2, 'scoring': 'roc_auc'}, rfe_kwargs: dict = {'n_features_to_select': 2, 'verbose': 1})[source]

A utility to run gridsearch and Recursive Feature Elimination on a classifier to return a model with the best parameters.

Parameters:
X: array-like, shape (n_samples, n_features)

Training vector, where n_samples is the number of samples and n_features is the number of features.

y: array-like, shape (n_samples)

Target relative to X for classification. Datatype should be integers.

estimator: object type that implements the “fit” and “predict” methods

An object of that type which is cloned for each validation.

grid_params_dict: dict

A dictionary of parameters defining the grid to be searched over.

gridsearch_kwargs: dict

A dictionary of keyword arguments to be passed to the grid search (e.g. cv, scoring, n_jobs).

rfe_kwargs: dict

A dictionary of parameters to be used in the Recursive Feature Elimination.

Returns:
optimized_estimator: sklearn estimator

The optimized estimator.

feature_ranking: pandas DataFrame

A dataframe with features ranking (high = dropped early on).

feature_selected: list

A list of features selected.

feature_importance: pandas DataFrame

A dataframe with importances per feature.

optimal_parameters: pandas DataFrame

A dataframe with the optimal parameters.

Examples

>>> import seaborn as sns
>>> from sklearn.ensemble import RandomForestClassifier
>>> from jmspack.ml_utils import optimize_model
>>> data = (
...     sns.load_dataset("iris")
...     .loc[lambda df: df["species"].isin(["setosa", "virginica"])]
...     .replace({"virginica": 0, "setosa": 1})
... )
>>> y = data["species"]
>>> X = data[["sepal_length", "sepal_width"]]
>>> model = RandomForestClassifier()
>>> (
...    optimized_estimator,
...    feature_ranking,
...    feature_selected,
...    feature_importance,
...    optimal_parameters,
... ) = optimize_model(X=X, y=y, estimator=model)
jmspack.ml_utils.plot_confusion_matrix(cf, group_names=None, categories='auto', count=True, percent=True, cbar=True, xyticks=True, xyplotlabels=True, sum_stats=True, figsize: tuple = (7, 5), cmap='Blues', title=None)[source]

This function makes a pretty plot of an sklearn confusion matrix cf using a seaborn heatmap visualization.

Parameters:
cf:

confusion matrix to be passed in

group_names:

List of strings that represent the labels row by row to be shown in each square.

categories:

List of strings containing the categories to be displayed on the x,y axis. Default is ‘auto’

count:

If True, show the raw number in the confusion matrix. Default is True.

percent:

If True, show the proportions for each category. Default is True.

cbar:

If True, show the color bar. The cbar values are based off the values in the confusion matrix. Default is True.

xyticks:

If True, show x and y ticks. Default is True.

xyplotlabels:

If True, show ‘True Label’ and ‘Predicted Label’ on the figure. Default is True.

sum_stats:

If True, display summary statistics below the figure. Default is True.

figsize:

Tuple representing the figure size. Default is (7, 5).

cmap:

Colormap of the values displayed from matplotlib.pyplot.cm. Default is ‘Blues’ See http://matplotlib.org/examples/color/colormaps_reference.html

title:

Title for the heatmap. Default is None.

Returns:
fig: matplotlib.figure.Figure

Properties of the figure can be changed later, e.g. use fig.axes[0].set_ylim(0,100) to change ylim

ax: matplotlib.axes._subplots.AxesSubplot

The axes associated with the fig Figure.

Examples

>>> import seaborn as sns
>>> from sklearn.metrics import confusion_matrix
>>> from jmspack.ml_utils import plot_confusion_matrix
>>> y_true = ["cat", "dog", "cat", "cat", "dog", "bird"]
>>> y_pred = ["cat", "cat", "cat", "dog", "bird", "bird"]
>>> cf = confusion_matrix(y_true, y_pred, labels=["cat", "dog", "bird"])
>>> _ = plot_confusion_matrix(cf, figsize=(7, 5))
jmspack.ml_utils.plot_cv_indices(cv, X, y, group, n_splits, lw=10, figsize=(6, 3))[source]

Create an example plot for indices of a cross-validation object.

Parameters:
cv: cross-validation generator

A scikit-learn cross-validation object with a split method.

X: array-like

Training vector, where n_samples is the number of samples and n_features is the number of features.

y: array-like

Target relative to X for classification or regression.

group: array-like

Group relative to X for classification or regression.

n_splits: int

Number of splits in the cross-validation object.

lw: int

Line width for the plots.

figsize: tuple

Width and height of the figure in inches.

Returns:
fig: matplotlib.figure.Figure

Properties of the figure can be changed later, e.g. use fig.axes[0].set_ylim(0,100) to change ylim

ax: matplotlib.axes._subplots.AxesSubplot

The axes associated with the fig Figure.

Examples

>>> import numpy as np
>>> from sklearn.model_selection import GroupKFold
>>> import matplotlib.pyplot as plt
>>> from jmspack.ml_utils import plot_cv_indices
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
>>> y = np.array([1, 2, 1, 2])
>>> groups = np.array([0, 0, 2, 2])
>>> group_kfold = GroupKFold(n_splits=2)
>>> _ = plot_cv_indices(cv=group_kfold, X=X, y=y, group=groups, n_splits=2, lw=10, figsize=(6, 3))
>>> _ = plt.show()
jmspack.ml_utils.plot_decision_boundary(X: DataFrame, y: Series, clf: ClassifierMixin = LogisticRegression(), title: str = 'Decision Boundary Logistic Regression', legend_title: str = 'Legend', h: float = 0.05, figsize: tuple = (11.7, 8.27))[source]

Generate a simple plot of the decision boundary of a classifier.

Parameters:
X: array-like, shape (n_samples, n_features)

Training vector, where n_samples is the number of samples and n_features is the number of features.

y: array-like, shape (n_samples)

Target relative to X for classification. Datatype should be integers.

clf: scikit-learn algorithm

An object that has the predict and predict_proba methods.

h: float (default: 0.05)

Step size in the mesh.

title: string

Title for the plot.

legend_title: string

Legend title for the plot.

figsize: tuple (default: (11.7, 8.27))

Width and height of the figure in inches.

Returns:
boundaries: Figure

Properties of the figure can be changed later, e.g. use boundaries.axes[0].set_ylim(0,100) to change ylim

ax: Axes

The axes associated with the boundaries Figure.

Examples

>>> import seaborn as sns
>>> from sklearn.svm import SVC
>>> from jmspack.ml_utils import plot_decision_boundary
>>> data = sns.load_dataset("iris")
>>> # convert the target from string to category to numeric as sklearn cannot handle strings as target
>>> y = data["species"].astype("category").cat.codes
>>> X = data[["sepal_length", "sepal_width"]]
>>> clf = SVC(kernel="rbf", gamma=2, C=1, probability=True)
>>> _ = plot_decision_boundary(X=X, y=y, clf=clf, title = 'Decision Boundary', legend_title = "Species")
jmspack.ml_utils.plot_learning_curve(X: DataFrame, y: Series, estimator: BaseEstimator = LogisticRegression(), title: str = 'Learning Curve Logistic Regression', groups: None | array = None, cross_color: str = '#8f0fd4', test_color: str = '#fcdd14', scoring: str = 'accuracy', ylim: None | tuple = None, cv: None | int = None, n_jobs: int = -1, train_sizes: array = array([0.1, 0.12307692, 0.14615385, 0.16923077, 0.19230769, 0.21538462, 0.23846154, 0.26153846, 0.28461538, 0.30769231, 0.33076923, 0.35384615, 0.37692308, 0.4, 0.42307692, 0.44615385, 0.46923077, 0.49230769, 0.51538462, 0.53846154, 0.56153846, 0.58461538, 0.60769231, 0.63076923, 0.65384615, 0.67692308, 0.7, 0.72307692, 0.74615385, 0.76923077, 0.79230769, 0.81538462, 0.83846154, 0.86153846, 0.88461538, 0.90769231, 0.93076923, 0.95384615, 0.97692308, 1.]), figsize: tuple = (10, 5))[source]

Generate a simple plot of the test and training learning curve.

Parameters:
estimator: object type that implements the “fit” and “predict” methods

An object of that type which is cloned for each validation.

title: string

Title for the chart.

X: array-like, shape (n_samples, n_features)

Training vector, where n_samples is the number of samples and n_features is the number of features.

y: array-like, shape (n_samples) or (n_samples, n_features), optional

Target relative to X for classification or regression; None for unsupervised learning.

cross_color: string

Signifies the color of the cross validation in the plot.

test_color: string

Signifies the color of the test set in the plot.

scoring: string

Signifies a scoring to evaluate the cross validation.

ylim: tuple, shape (ymin, ymax), optional

Defines minimum and maximum y-values plotted.

cv: int, cross-validation generator or an iterable, optional

Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 3-fold cross-validation,

  • integer, to specify the number of folds.

  • CV splitter,

  • An iterable yielding (train, test) splits as arrays of indices.

For integer/None inputs, if y is binary or multiclass, StratifiedKFold is used. If the estimator is not a classifier or if y is neither binary nor multiclass, KFold is used. Refer to the scikit-learn User Guide for the various cross-validators that can be used here.

n_jobs: int or None, optional (default=-1)

Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

train_sizes: array-like, shape (n_ticks,), dtype float or int

Relative or absolute numbers of training examples that will be used to generate the learning curve. If the dtype is float, it is regarded as a fraction of the maximum size of the training set (that is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class. (default: np.linspace(0.1, 1.0, 40))

figsize: tuple (default: (10, 5))

Width and height of the figure in inches.
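
Examples

No example is given in the source docstring, so the following is a minimal usage sketch. It assumes the same iris-based setup used by the other ml_utils examples and that plot_learning_curve is called in the same style as the other plotting helpers in this module.

>>> import seaborn as sns
>>> from sklearn.linear_model import LogisticRegression
>>> from jmspack.ml_utils import plot_learning_curve
>>> data = (
...     sns.load_dataset("iris")
...     .loc[lambda df: df["species"].isin(["setosa", "virginica"])]
...     .replace({"virginica": 0, "setosa": 1})
... )
>>> y = data["species"]
>>> X = data[["sepal_length", "sepal_width"]]
>>> _ = plot_learning_curve(X=X, y=y, estimator=LogisticRegression(),
...                         title="Learning Curve Logistic Regression", cv=5)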

jmspack.ml_utils.summary_performance_metrics_classification(model, X_test, y_true, bootstraps=100, fold_size=1000, random_state=69420)[source]

Summary of different evaluation metrics specific to a binary classification learning problem.

Parameters:
model: sklearn.model

A fitted sklearn model with predict() and predict_proba() methods.

X_test: pd.DataFrame

A data frame used to predict the target values (y_pred).

y_true: pd.Series or np.arrays

Binary true values.

bootstraps: int

The number of bootstrap samples used when estimating the uncertainty of the AUC.

fold_size: int

The number of observations drawn (with replacement) in each bootstrap sample.

Returns:
summary_df: pd.DataFrame

A dataframe with the summary of the metrics.

Notes

The function returns the following metrics:
  • true positive (TP): The model classifies the example as positive, and the actual label is also positive.

  • false positive (FP): The model classifies the example as positive, but the actual label is negative.

  • true negative (TN): The model classifies the example as negative, and the actual label is also negative.

  • false negative (FN): The model classifies the example as negative, but the label is actually positive.

  • accuracy: The fraction of predictions the model got right.

  • prevalence: The proportion of positive examples (where y=1).

  • sensitivity: The probability that our test outputs positive given that the case is actually positive.

  • specificity: The probability that the test outputs negative given that the case is actually negative.

  • positive predictive value: The proportion of positive predictions that are true positives.

  • negative predictive value: The proportion of negative predictions that are true negatives.

  • auc: A measure of goodness of fit.

  • bootstrapped auc: The bootstrap estimates the uncertainty by resampling the dataset with replacement.

  • F1: The harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.
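
As an illustrative sketch only (these are the standard textbook definitions, not necessarily the function's internal implementation), the ratio metrics above relate to the confusion-matrix counts as follows:

>>> # hypothetical counts, for illustration
>>> TP, FP, TN, FN = 40, 5, 45, 10
>>> accuracy = (TP + TN) / (TP + TN + FP + FN)
>>> prevalence = (TP + FN) / (TP + TN + FP + FN)
>>> sensitivity = TP / (TP + FN)   # true positive rate / recall
>>> specificity = TN / (TN + FP)   # true negative rate
>>> ppv = TP / (TP + FP)           # positive predictive value / precision
>>> npv = TN / (TN + FN)           # negative predictive value
>>> f1 = 2 * (ppv * sensitivity) / (ppv + sensitivity)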

Examples

>>> import seaborn as sns
>>> from sklearn.ensemble import RandomForestClassifier
>>> from jmspack.ml_utils import summary_performance_metrics_classification
>>> data = (
...     sns.load_dataset("iris")
...    .loc[lambda df: df["species"].isin(["setosa", "virginica"])]
...    .replace({"virginica": 0, "setosa": 1})
... )
>>> y = data["species"]
>>> X = data[["sepal_length", "sepal_width"]]
>>> model = RandomForestClassifier()
>>> _ = model.fit(X=X, y=y)
>>> summary_df = summary_performance_metrics_classification(model=model, X_test=X, y_true=y)

jmspack.NLTSA

Submodule NLTSA.py includes the following functions:

  • fluctuation_intensity(): run fluctuation intensity on a time series to detect non-linear change.

  • distribution_uniformity(): run distribution uniformity on a time series to detect non-linear change.

  • complexity_resonance(): the product of fluctuation_intensity and distribution_uniformity.

  • complexity_resonance_diagram(): plots a heatmap of the complexity_resonance.

  • ts_levels(): defines distinct levels in a time series based on a decision tree regressor.

  • cmaps_options: a list of possible colour maps that may be used when plotting.

  • cumulative_complexity_peaks(): a function which will calculate the significant peaks in the dynamic complexity of a set of time series (these peaks are known as cumulative complexity peaks; CCPs).

  • cumulative_complexity_peaks_plot(): plots a heatmap of the cumulative_complexity_peaks.

jmspack.NLTSA.complexity_resonance(distribution_uniformity_df, fluctuation_intensity_df)[source]

Create a complexity resonance data frame based on the product of the distribution uniformity and the fluctuation intensity

Parameters:
distribution_uniformity_df: pandas DataFrame

A dataframe containing distribution uniformity values from multivariate time series data from 1 person. Rows should indicate time, columns should indicate the distribution uniformity.

fluctuation_intensity_df: pandas DataFrame

A dataframe containing fluctuation intensity values from multivariate time series data from 1 person. Rows should indicate time, columns should indicate the fluctuation intensity.

Returns:
complexity_resonance_df: pandas DataFrame

A dataframe containing the complexity resonance values from multivariate time series data from 1 person.

Examples

>>> import pandas as pd
>>> from sklearn.preprocessing import MinMaxScaler
>>> from jmspack.NLTSA import distribution_uniformity, fluctuation_intensity, complexity_resonance
>>> ts_df = pd.read_csv("time_series_dataset.csv", index_col=0)
>>> scaler = MinMaxScaler()
>>> scaled_ts_df = pd.DataFrame(scaler.fit_transform(ts_df), columns=ts_df.columns.tolist())
>>> distribution_uniformity_df = pd.DataFrame(distribution_uniformity(scaled_ts_df, win=7, xmin=0, xmax=1, col_first=1, col_last=7))
>>> distribution_uniformity_df.columns=scaled_ts_df.columns.tolist()
>>> fluctuation_intensity_df = pd.DataFrame(fluctuation_intensity(scaled_ts_df, win=7, xmin=0, xmax=1, col_first=1, col_last=7))
>>> fluctuation_intensity_df.columns=scaled_ts_df.columns.tolist()
>>> complexity_resonance_df = complexity_resonance(distribution_uniformity_df, fluctuation_intensity_df)
jmspack.NLTSA.complexity_resonance_diagram(df, cmap_n: int = 12, plot_title='Complexity Resonance Diagram', labels_n=10, figsize=(20, 7))[source]

Plot a complexity resonance diagram (a heatmap of the complexity resonance values over time).

Parameters:
df: pandas DataFrame

A dataframe containing complexity resonance values from multivariate time series data from 1 person. Rows should indicate time, columns should indicate the complexity resonance.

cmap_n: int (Default=12)

An integer indicating which colour map to use when plotting the heatmap. These values correspond to the index of the cmaps_options list ([‘flag’, ‘prism’, ‘ocean’, ‘gist_earth’, ‘terrain’, ‘gist_stern’, ‘gnuplot’, ‘gnuplot2’, ‘CMRmap’, ‘cubehelix’, ‘brg’, ‘gist_rainbow’, ‘rainbow’, ‘jet’, ‘nipy_spectral’, ‘gist_ncar’]). Index=12 corresponds to ‘rainbow’.

plot_title: str

A string indicating the title to be used at the top of the plot

labels_n: int (Default=10)

An integer indicating that every nth value should be used as an x-axis label. For example, if the x-axis consists of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 and labels_n is set to 2, then 2, 4, 6, 8, 10 will be shown on the x-axis of the plot.

Returns:
fig: matplotlib.figure.Figure

The matplotlib figure object.

ax: matplotlib.axes.Axes

The matplotlib axes object.

Examples

>>> import pandas as pd
>>> from sklearn.preprocessing import MinMaxScaler
>>> from jmspack.NLTSA import (distribution_uniformity, fluctuation_intensity,
...                            complexity_resonance, complexity_resonance_diagram)
>>> ts_df = pd.read_csv("time_series_dataset.csv", index_col=0)
>>> scaler = MinMaxScaler()
>>> scaled_ts_df = pd.DataFrame(scaler.fit_transform(ts_df), columns=ts_df.columns.tolist())
>>> distribution_uniformity_df = pd.DataFrame(distribution_uniformity(scaled_ts_df, win=7, xmin=0, xmax=1, col_first=1, col_last=7))
>>> distribution_uniformity_df.columns=scaled_ts_df.columns.tolist()
>>> fluctuation_intensity_df = pd.DataFrame(fluctuation_intensity(scaled_ts_df, win=7, xmin=0, xmax=1, col_first=1, col_last=7))
>>> fluctuation_intensity_df.columns=scaled_ts_df.columns.tolist()
>>> complexity_resonance_df = complexity_resonance(distribution_uniformity_df, fluctuation_intensity_df)
>>> _ = complexity_resonance_diagram(complexity_resonance_df, cmap_n=12, labels_n=30)
jmspack.NLTSA.cumulative_complexity_peaks(df: DataFrame, significant_level_item: float = 0.05, significant_level_time: float = 0.05)[source]

Calculate the cumulative complexity peaks (CCPs), i.e. the significant peaks in the dynamic complexity of a set of time series.

Parameters:
df: pd.DataFrame

A dataframe containing complexity resonance values from multivariate time series data from 1 person. Rows should indicate time, columns should indicate the complexity resonance.

significant_level_item: float (Default=0.05)

A float indicating the cutoff of when a point in time is significantly different than the rest on an individual item level (i.e. is this time point different than all the other time points for this item/ feature).

significant_level_time: float (Default=0.05)

A float indicating the cutoff of when a point in time is significantly different than the rest on a timepoint level (i.e. is this day different than all the other days).

Returns:
ccp_df: pd.DataFrame

A dataframe containing the cumulative complexity peaks values from multivariate time series data from 1 person. Rows indicate time, columns indicate the cumulative complexity peaks.

sig_peaks_df: pd.DataFrame

A dataframe containing one column of significant complexity peaks values from multivariate time series data from 1 person. Rows indicate time, columns indicate the significant cumulative complexity peaks.

Examples

>>> import pandas as pd
>>> from sklearn.preprocessing import MinMaxScaler
>>> from jmspack.NLTSA import (distribution_uniformity, fluctuation_intensity,
...                            complexity_resonance, cumulative_complexity_peaks)
>>> ts_df = pd.read_csv("time_series_dataset.csv", index_col=0)
>>> scaler = MinMaxScaler()
>>> scaled_ts_df = pd.DataFrame(scaler.fit_transform(ts_df), columns=ts_df.columns.tolist())
>>> distribution_uniformity_df = pd.DataFrame(distribution_uniformity(scaled_ts_df, win=7, xmin=0, xmax=1, col_first=1, col_last=7))
>>> distribution_uniformity_df.columns=scaled_ts_df.columns.tolist()
>>> fluctuation_intensity_df = pd.DataFrame(fluctuation_intensity(scaled_ts_df, win=7, xmin=0, xmax=1, col_first=1, col_last=7))
>>> fluctuation_intensity_df.columns=scaled_ts_df.columns.tolist()
>>> complexity_resonance_df = complexity_resonance(distribution_uniformity_df, fluctuation_intensity_df)
>>> cumulative_complexity_peaks_df, significant_peaks_df = cumulative_complexity_peaks(df=complexity_resonance_df)
jmspack.NLTSA.cumulative_complexity_peaks_plot(cumulative_complexity_peaks_df: DataFrame, significant_peaks_df: DataFrame, plot_title: str = 'Cumulative Complexity Peaks Plot', figsize: tuple = (20, 5), height_ratios: list = [1, 3], labels_n: int = 10)[source]

Create a cumulative complexity peaks plot based on the cumulative_complexity_peaks_df and the significant_peaks_df

Parameters:
cumulative_complexity_peaks_df: pd.DataFrame

A dataframe containing cumulative complexity peaks values from multivariate time series data from 1 person. Rows should indicate time, columns should indicate the cumulative complexity peaks.

significant_peaks_df: pd.DataFrame

A dataframe containing one column of significant complexity peaks values from multivariate time series data from 1 person. Rows should indicate time, columns should indicate the significant cumulative complexity peaks.

plot_title: str

A string indicating the title to be used at the top of the plot

figsize: tuple (Default=(20,5))

The tuple used to specify the size of the plot.

height_ratios: list (Default=[1,3])

A list specifying the relative heights of the subplots in the figure.

labels_n: int (Default=10)

An integer indicating that every nth value should be used as an x-axis label. For example, if the x-axis consists of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 and labels_n is set to 2, then 2, 4, 6, 8, 10 will be shown on the x-axis of the plot.

Returns:
fig: matplotlib.figure.Figure

The matplotlib figure object.

ax: matplotlib.axes.Axes

The matplotlib axes object.

Examples

>>> ts_df = pd.read_csv("datasets/time_series_dataset.csv", index_col=0)
>>> scaler = MinMaxScaler()
>>> scaled_ts_df = pd.DataFrame(scaler.fit_transform(ts_df), columns=ts_df.columns.tolist())
>>> distribution_uniformity_df = pd.DataFrame(distribution_uniformity(scaled_ts_df, win=7, xmin=0, xmax=1, col_first=1, col_last=7))
>>> distribution_uniformity_df.columns=scaled_ts_df.columns.tolist()
>>> fluctuation_intensity_df = pd.DataFrame(fluctuation_intensity(scaled_ts_df, win=7, xmin=0, xmax=1, col_first=1, col_last=7))
>>> fluctuation_intensity_df.columns=scaled_ts_df.columns.tolist()
>>> complexity_resonance_df = complexity_resonance(distribution_uniformity_df, fluctuation_intensity_df)
>>> cumulative_complexity_peaks_df, significant_peaks_df = cumulative_complexity_peaks(df=complexity_resonance_df)
>>> _ = cumulative_complexity_peaks_plot(cumulative_complexity_peaks_df=cumulative_complexity_peaks_df, significant_peaks_df=significant_peaks_df)
jmspack.NLTSA.distribution_uniformity(df, win, xmin, xmax, col_first, col_last)[source]

Run distribution uniformity on a time series to detect non-linear change.

Parameters:
df: pd.DataFrame

A dataframe containing multivariate time series data from 1 person. Rows should indicate time, columns should indicate the time series variables. All time series in df should be on the same scale. Otherwise the comparisons across time series will make no sense.

win: int

Size of sliding window in which to calculate distribution uniformity (amount of data considered in each evaluation of change).

xmin: int

The theoretical minimum that the values in the time series can take (e.g. 0 if the data have been min-max scaled).

xmax: int

The theoretical maximum that the values in the time series can take (e.g. 1 if the data have been min-max scaled).

col_first: int

The first column index you wish to be included in the calculation (index starts at 1!)

col_last: int

The last column index you wish to be included in the calculation (index starts at 1!)

Returns:
distribution_uniformity_df: pd.DataFrame

A dataframe containing the distribution uniformity values from multivariate time series data from 1 person.

Examples

>>> import pandas as pd
>>> from sklearn.preprocessing import MinMaxScaler
>>> from jmspack.NLTSA import distribution_uniformity
>>> ts_df = pd.read_csv("time_series_dataset.csv", index_col=0)
>>> scaler = MinMaxScaler()
>>> scaled_ts_df = pd.DataFrame(scaler.fit_transform(ts_df), columns=ts_df.columns.tolist())
>>> distribution_uniformity_df = pd.DataFrame(distribution_uniformity(scaled_ts_df, win=7, xmin=0, xmax=1, col_first=1, col_last=7))
>>> distribution_uniformity_df.columns=scaled_ts_df.columns.tolist()
jmspack.NLTSA.fluctuation_intensity(df, win, xmin, xmax, col_first, col_last)[source]

Run fluctuation intensity on a time series to detect non-linear change.

Parameters:
df: pd.DataFrame

A dataframe containing multivariate time series data from 1 person. Rows should indicate time, columns should indicate the time series variables. All time series in df should be on the same scale. Otherwise the comparisons across time series will make no sense.

win: int

Size of sliding window in which to calculate fluctuation intensity (amount of data considered in each evaluation of change).

xmin: int

The theoretical minimum that the values in the time series can take (e.g. 0 if the data have been min-max scaled).

xmax: int

The theoretical maximum that the values in the time series can take (e.g. 1 if the data have been min-max scaled).

col_first: int

The first column index you wish to be included in the calculation (index starts at 1!)

col_last: int

The last column index you wish to be included in the calculation (index starts at 1!)

Returns:
fluctuation_intensity_df: pd.DataFrame

A dataframe containing the fluctuation intensity values from multivariate time series data from 1 person.

Examples

>>> import pandas as pd
>>> from sklearn.preprocessing import MinMaxScaler
>>> from jmspack.NLTSA import fluctuation_intensity
>>> ts_df = pd.read_csv("time_series_dataset.csv", index_col=0)
>>> scaler = MinMaxScaler()
>>> scaled_ts_df = pd.DataFrame(scaler.fit_transform(ts_df), columns=ts_df.columns.tolist())
>>> fluctuation_intensity_df = pd.DataFrame(fluctuation_intensity(scaled_ts_df, win=7, xmin=0, xmax=1, col_first=1, col_last=7))
>>> fluctuation_intensity_df.columns=scaled_ts_df.columns.tolist()
jmspack.NLTSA.ts_levels(ts, ts_x=None, criterion='squared_error', max_depth=2, min_samples_leaf=1, min_samples_split=2, max_leaf_nodes=30, plot=True, equal_spaced=True, n_x_ticks=10, figsize=(20, 5))[source]

Use recursive partitioning (DecisionTreeRegressor) to perform a ‘classification’ of relatively stable levels in a time series.

Parameters:
ts: pd.DataFrame (column)

A dataframe column containing a univariate time series from 1 person. Rows should indicate time, column should indicate the time series variable.

ts_x: pd.DataFrame (column; Default=None)

A dataframe column containing the corresponding timestamps to the aforementioned time series. If None is passed, the index of the time series will be used (Default = None).

criterion: str (Default=”squared_error”)

The function to measure the quality of a split. Supported criteria are “squared_error” for the mean squared error, which is equal to variance reduction as feature selection criterion and minimizes the L2 loss using the mean of each terminal node, “friedman_mse”, which uses mean squared error with Friedman’s improvement score for potential splits, and “mae” for the mean absolute error, which minimizes the L1 loss using the median of each terminal node.

max_depth: int or None, optional (default=2)

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_leaf: int, float, optional (default=1)

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. If int, then consider min_samples_leaf as the minimum number. If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

min_samples_split: int, float, optional (default=2)

The minimum number of samples required to split an internal node. If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

max_leaf_nodes: int or None, optional (default=30)

Identify at most max_leaf_nodes time series levels in the time series, grown in best-first fashion. Best splits are defined by the relative reduction in impurity. If None, the number of splits is unlimited.

plot: boolean (Default=True)

A boolean to define whether to plot the time series and its time series levels.

equal_spaced: boolean (Default=True)

A boolean to define whether or not the time series is measured at equally spaced time points. If False, this will be taken into account when plotting the x-axis of the plot.

n_x_ticks: int (Default=10)

The number of x-ticks you wish to show when plotting.

figsize: tuple (Default=(20,5))

The tuple used to specify the size of the plot if plot = True.

Returns:
df_result: pd.DataFrame

A dataframe containing the time steps, original time series and time series levels.

fig: matplotlib.figure.Figure

The matplotlib figure object.

ax: matplotlib.axes.Axes

The matplotlib axes object.

Examples

>>> import pandas as pd
>>> from jmspack.NLTSA import ts_levels
>>> ts_df = pd.read_csv("time_series_dataset.csv", index_col=0)
>>> ts = ts_df["lorenz"]
>>> ts_levels_df, fig, ax = ts_levels(ts, ts_x=None, criterion="squared_error", max_depth=10, min_samples_leaf=1,
...                          min_samples_split=2, max_leaf_nodes=30, plot=True, equal_spaced=True, n_x_ticks=10)

jmspack.utils

Submodule utils.py includes the following functions and classes:

  • silence_stdout(): A utility function used to stop other functions from printing to console (use within a with statement).

  • JmsColors: a class containing useful colours according to Jms and functions to show these colors in various forms.

  • apply_scaling(): a utility function to be used in conjunction with pandas pipe() to scale columns of a data frame separately.

  • flatten(): a utility function used to flatten a list of lists to a single list.

class jmspack.utils.JmsColors[source]

Utility class for James Twose’s color codes.

Examples

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from jmspack.utils import JmsColors
>>> x = np.linspace(0, 10, 100)
>>> fig = plt.figure()
>>> _ = plt.plot(x, np.sin(x), color=JmsColors.YELLOW)
>>> _ = plt.plot(x, np.cos(x), color=JmsColors.DARKBLUE)
BLUEGREEN = '#009cdc'
DARKBLUE = '#0072e8'
DARKGREY = '#282d32'
GREENBLUE = '#00c7b1'
GREENYELLOW = '#71db5c'
LIGHTGREY = '#b1b1b1'
MEDIUMGREY = '#808080'
OFFWHITE = '#d5d5d5'
PURPLE = '#8f0fd4'
YELLOW = '#fcdd14'
static get_names()[source]

Returns a list of the color names e.g. [PURPLE, DARKBLUE, etc.]

static plot_colors()[source]

Returns a lineplot of all the available colours (like a color swatch)

static to_dict()[source]

Returns a dictionary of format {color name: hexcode}

static to_list()[source]

Returns a list of hexcodes
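
A minimal sketch of how these static helpers might be used, following the class example above (the exact return values depend on the colours listed above):

>>> from jmspack.utils import JmsColors
>>> names = JmsColors.get_names()    # e.g. ['BLUEGREEN', 'DARKBLUE', ...]
>>> hex_codes = JmsColors.to_list()  # list of hex code strings
>>> lookup = JmsColors.to_dict()     # {color name: hexcode}
>>> _ = JmsColors.plot_colors()      # draw a swatch of all available colours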

jmspack.utils.apply_scaling(df: DataFrame, method: str | Callable | None = 'MinMax', kwargs: Dict = {})[source]

Utility function to be used in conjunction with pandas pipe() to scale columns of a data frame separately.

Parameters:
df: pd.DataFrame

The data frame you want to scale.

method: str, Callable or None (default: “MinMax”)

The name of the method you wish to use [method options: “MinMax”, “Standard”], or a scikit-learn transformer, see: https://scikit-learn.org/stable/modules/preprocessing.html

kwargs: Dict

Dictionary containing additional keywords to be added to the Scaler.

Returns:
scal_df: pd.DataFrame

The scaled data frame.

Examples

>>> import seaborn as sns
>>> import pandas as pd
>>> from jmspack.utils import apply_scaling
>>> df = sns.load_dataset("iris")
>>> scaled_df = (df
...             .select_dtypes("number")
...             .pipe(apply_scaling)
...             )
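
The method argument can also be set explicitly through pipe(); a minimal sketch assuming the “Standard” option maps to standard (z-score) scaling as described above:

>>> standard_scaled_df = (df
...                      .select_dtypes("number")
...                      .pipe(apply_scaling, method="Standard")
...                      )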
jmspack.utils.flatten(list_of_lists)[source]

Utility function used to flatten a list of lists into a single list.

Parameters:
list_of_lists: list

A list of lists.

Returns:
list

The flattened list.

Examples

>>> from jmspack.utils import flatten
>>> list_of_lists = [[f"p_{x}" for x in range(10)],
...                 [f"p_{x}" for x in range(10, 20)],
...                 [f"p_{x}" for x in range(20, 30)]]
>>> flatten(list_of_lists)
jmspack.utils.silence_stdout()[source]

A utility function used to stop other functions from printing to console (use within a with statement).

Parameters:
None
Returns:
None

Examples

>>> from jmspack.utils import silence_stdout
>>> with silence_stdout():
...    print("This will not print to console")
>>> print("This will print to console")