LipidSig: a web-based tool for lipidomic data analysis
Machine learning analysis
In this section, users can combine lipid species and lipid characteristics data to predict a binary outcome with various machine learning methods, and select the best feature combination for further exploration.
For cross-validation, Monte Carlo cross-validation (CV) is executed to evaluate model performance and assess the stability of the results. Additionally, we provide eight feature-ranking methods (p-value, p-value*FC, ROC, Random Forest,
Linear SVM, Lasso, Ridge, ElasticNet) and six classification methods (Random Forest, Linear SVM, Lasso, Ridge, ElasticNet, XGBoost) for users to train and select the best model.
Feature-ranking methods can be divided into two categories: univariate and multivariate analyses.
A series of subsequent analyses helps users evaluate the methods and visualise the machine learning results, including the ROC/PR curve, Model predictivity, Sample probability, Feature importance, and Network.
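The Monte Carlo CV scheme mentioned above repeatedly draws an independent random train/test split, so the number of runs is not tied to the test-set size as it is in k-fold CV. A minimal stdlib-only Python sketch (the function name, run count, and test fraction are illustrative, not LipidSig's actual settings):

```python
import random

def monte_carlo_cv_splits(n_samples, n_runs=30, test_frac=0.3, seed=0):
    """Yield (train_idx, test_idx) pairs for Monte Carlo cross-validation.

    Each run draws an independent random test set, so model performance
    can be averaged over as many runs as desired.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    n_test = max(1, int(round(n_samples * test_frac)))
    for _ in range(n_runs):
        rng.shuffle(indices)
        yield sorted(indices[n_test:]), sorted(indices[:n_test])

splits = list(monte_carlo_cv_splits(10, n_runs=3, test_frac=0.3))
# each run holds out 3 of the 10 samples for testing
```

Because the test sets are drawn independently, a sample may appear in several test sets (or none), which is why downstream results are reported as averages over all CV runs.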
Data Source
Demo dataset
The landscape of cancer cell line metabolism (Nat Med. 2018)
We modified the data from this paper and divided the cancer cell lines evenly into groups sensitive or resistant to SCD gene knockout based on their gene dependency scores (CERES).
In the machine learning analysis, 89 lipid species were used by a binary classifier to label 228 cancer cell lines as 0 (sensitive) or 1 (resistant).
For the data sources, users can either upload their own datasets or use our demo datasets. Two required tables, ‘Lipid expression data’ and ‘Group assignment’,
and one optional ‘Lipid characteristics’ table are passed to the machine learning algorithm. All datasets must be uploaded in CSV or TSV format.
The receiver operating characteristic (ROC) and precision-recall (PR) curves are common methods for evaluating the diagnostic ability of a binary classifier.
The mean AUC and 95% confidence interval for the ROC and PR curves are calculated from all CV runs for each feature number.
Theoretically, the higher the AUC, the better the model performs. The PR curve is more sensitive to highly skewed datasets and offers a more informative view of an algorithm's performance in that setting.
For both curves, an AUC of 1 represents perfect performance.
We provide an overall ROC/PR curve summarising the CV runs across different feature numbers, as well as an averaged ROC/PR curve for a user-selected feature number.
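The ROC AUC has a simple probabilistic reading that can be computed without drawing the curve at all: it equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. A minimal Python sketch of that identity (illustrative only; LipidSig itself plots full curves across CV runs):

```python
def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs where the positive sample scores
    higher (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect classifier reaches AUC = 1; random scoring hovers around 0.5.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```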
Model performance
In this part, several useful indicators are provided for users to evaluate model performance. For each feature number, we calculate and plot the average value and 95% confidence interval of the accuracy,
sensitivity (recall), specificity, positive predictive value (precision), negative predictive value, F1 score, prevalence, detection rate, detection prevalence, and balanced accuracy across all CV runs, using the confusionMatrix function in the caret package.
All these indicators can be expressed in terms of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
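Those definitions can be collected into a small helper. A stdlib-only Python sketch mirroring the indicators caret's confusionMatrix reports (the function name and example counts are illustrative):

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive the listed performance indicators from the four
    confusion-matrix counts of a binary classifier."""
    total = tp + fp + fn + tn
    sens = tp / (tp + fn)   # sensitivity / recall
    spec = tn / (tn + fp)   # specificity
    ppv = tp / (tp + fp)    # positive predictive value / precision
    npv = tn / (tn + fn)    # negative predictive value
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sens,
        "specificity": spec,
        "ppv": ppv,
        "npv": npv,
        "f1": 2 * ppv * sens / (ppv + sens),
        "prevalence": (tp + fn) / total,
        "detection_rate": tp / total,
        "detection_prevalence": (tp + fp) / total,
        "balanced_accuracy": (sens + spec) / 2,
    }

m = binary_metrics(tp=40, fp=10, fn=5, tn=45)
# accuracy 0.85, sensitivity ≈ 0.889, specificity ≈ 0.818
```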
Predicted probability
This page shows the average predicted probabilities of each sample in testing data from all CV runs and allows users to explore those incorrect or uncertain labels.
We show the distribution of predicted probabilities in two reference labels on the left panel while a confusion matrix composed of sample number and proportion is laid out on the right.
Results for different feature numbers can be selected manually by users.
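Because Monte Carlo CV places each sample in several different test sets, a per-sample probability is obtained by averaging over the runs in which that sample was held out. A stdlib-only sketch of that averaging (function name and sample IDs are illustrative):

```python
from collections import defaultdict

def average_probabilities(cv_predictions):
    """Average each sample's predicted probability over the CV runs
    in which it appeared in the test set.

    cv_predictions: list of {sample_id: probability} dicts, one per run.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for run in cv_predictions:
        for sample, prob in run.items():
            sums[sample] += prob
            counts[sample] += 1
    return {s: sums[s] / counts[s] for s in sums}

runs = [{"A": 0.9, "B": 0.2}, {"A": 0.7, "C": 0.55}]
avg = average_probabilities(runs)
# samples with averages near 0.5 (here C) are the uncertain
# labels worth inspecting
```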
After building a high-accuracy model, users are encouraged to explore the contribution of each feature on this page.
Two methods, namely ‘Algorithm-based’ and ‘SHAP analysis’, can rank and visualise feature importance.
In the ‘Algorithm-based’ part, when users choose a certain feature number, the selected frequency and the average feature importance of the top 10 features from all CV runs are displayed.
For a Linear SVM, Lasso, Ridge, or ElasticNet model, the importance of each feature is the absolute value of its coefficient,
while Random Forest and XGBoost use their built-in feature-importance measures.
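For the linear models, the coefficient-magnitude ranking amounts to sorting features by |coefficient|. A minimal sketch (the function name and the lipid names in the example are hypothetical, not from the demo dataset):

```python
def rank_by_coefficient(feature_names, coefficients, top=10):
    """Rank features of a linear model (SVM, Lasso, Ridge, ElasticNet)
    by the absolute value of their coefficients."""
    ranked = sorted(zip(feature_names, coefficients),
                    key=lambda fc: abs(fc[1]), reverse=True)
    return ranked[:top]

# Hypothetical coefficients for three lipid-species features.
coefs = {"PC(34:1)": -1.8, "TG(52:2)": 0.4, "SM(d36:1)": 0.9}
top = rank_by_coefficient(list(coefs), list(coefs.values()), top=2)
# [('PC(34:1)', -1.8), ('SM(d36:1)', 0.9)]
```

Note that this ranking is only meaningful when the features are on comparable scales, which is why expression data are typically standardised before fitting.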
SHAP analysis
The SHapley Additive exPlanations (SHAP) approach, based on Shapley values from game theory, has recently been introduced to explain the individual predictions of any machine learning model.
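A Shapley value is a feature's marginal contribution to the prediction, averaged over all orderings in which features can be added. Exact enumeration is tractable only for a handful of features, which is why SHAP relies on approximations; a stdlib-only sketch of the exact computation on a toy additive model (the function and feature names are illustrative):

```python
from itertools import permutations

def shapley_values(features, value):
    """Exact Shapley values: average each feature's marginal
    contribution to value(coalition) over all feature orderings.
    Exponential in the number of features; SHAP approximates this."""
    phi = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = []
        prev = value(frozenset())
        for f in order:
            coalition.append(f)
            cur = value(frozenset(coalition))
            phi[f] += cur - prev
            prev = cur
    return {f: v / len(orderings) for f, v in phi.items()}

# Toy additive "model": the prediction is a sum of fixed per-feature
# effects, so each Shapley value recovers that feature's effect exactly.
effects = {"x1": 2.0, "x2": -1.0, "x3": 0.5}
phi = shapley_values(list(effects), lambda s: sum(effects[f] for f in s))
# {'x1': 2.0, 'x2': -1.0, 'x3': 0.5}
```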
‘Show top N features’ can be chosen for the SHAP force plot.
The samples are clustered into multiple groups (‘Number of groups’) based on their Shapley values using the ward.D method.
The x-axis represents the value of a certain feature while the y-axis is the corresponding Shapley value.
The colour parameter can be assigned to check whether a second feature has an interaction effect with the plotted feature.
Network
The correlation network helps users interrogate the interactions among features in a machine learning model. In this section,
users can choose an appropriate feature number according to the previous cross-validation results, and the features in the best model
(based on ROC-AUC + PR-AUC) will be picked to compute the pairwise correlation coefficients.
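At its core, building such a network means computing a Pearson correlation for every feature pair and keeping the pairs that pass a cutoff as edges. A stdlib-only sketch (the function names, the 0.7 threshold, and the example values are illustrative assumptions, not LipidSig's defaults):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_edges(features, threshold=0.7):
    """Network edges: feature pairs whose |correlation| passes a cutoff.

    features: {name: list of values across samples}.
    """
    names = list(features)
    return [(a, b, pearson(features[a], features[b]))
            for i, a in enumerate(names) for b in names[i + 1:]
            if abs(pearson(features[a], features[b])) >= threshold]

feats = {"f1": [1, 2, 3, 4], "f2": [2, 4, 6, 8], "f3": [4, 1, 3, 2]}
edges = correlation_edges(feats)
# only f1 and f2 are strongly correlated (r = 1.0), so the
# network has a single edge
```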