Feature Importance


Feature Importance (or Variable Importance or Feature Attribution) refers to a set of techniques and mathematical frameworks used in machine learning and statistics to quantify the contribution of input variables (features) to a model's output or to the underlying data-generating process. It is a fundamental component of Explainable Artificial Intelligence (XAI) and Interpretable Machine Learning (IML) and is used for model improvement, scientific inference, and feature selection.[1][2]

The fundamental challenge of feature importance lies in the complexity of models and the dependence structures inherent in high-dimensional data. Model complexity means there are many different ways to summarize model behavior in human-interpretable terms. In the presence of correlated features, assigning credit becomes a multi-way trade-off: researchers must choose between sets of axioms that favor different principles (for example, fidelity to the internal logic of the model versus fidelity to the underlying data-generating process). As a result, there is no single ground-truth value of feature importance; rather, the score is context-dependent, varying with the scope of the explanation, the model (or data) of interest, and the ultimate purpose of the analysis.[1]

In an online survey (n=266) conducted by Christoph Molnar, author of a widely used textbook on interpretable machine learning,[2] 35.3% of respondents said they primarily use feature importance to gain insights about data, 18.4% cited justifying the model as its primary use, and another 18.4% used it for debugging and improving the model (the remaining 27.8% answered only that they wanted to see the survey results).

History and Theoretical Foundations


The history of feature importance is a century-long progression from path analysis in genetics to gradient-based methods made specifically for modern deep learning architectures. The progression of the field and the changes in focus reflect the broader shift in the culture of data science from parametric modeling and inference toward algorithmic modeling and prediction.[3] Feature importance continues to be an extremely active area of research. For a more complete historical overview, see the review by Ulrike Grömping.[4]

Early Development and Correlation (1920s–1970s)


The earliest forms of feature importance assessed the strength of relationships between pairs of variables in fields such as animal biology and human psychology, using methods like Francis Galton's correlation coefficient.[5] The formal quest to determine variable importance continued with Sewall Wright's development of path analysis in 1921.[6] Wright sought to understand causal influences in complex systems, and path analysis quantified the correlative influences along direct paths by decomposing correlation and partial correlation coefficients into path-based components. For decades afterwards, the dominant approach to variable importance was the inspection of standardized regression coefficients (regression weights). These weights, however, were notoriously unstable in the presence of multicollinearity (high correlation between predictor variables), so the importance assigned to a variable could depend on the order in which it was entered into a sequential regression model.[4]

In 1960, Hoffman proposed a method (relative weights) to handle these correlations,[7] which was later critiqued and refined by Darlington (1968) and Green et al. (1978).[8][9] During this period, researchers primarily focused on partitioning the coefficient of determination (R²) among the predictors.

The Averaging Movement (1980s–2000s)


To address the arbitrary nature of sequential entry, William Kruskal proposed a seminal solution in 1987: averaging relative importance over all possible orderings of the independent variables.[10] This ensured that no single variable was unfairly penalized or elevated by its position in the model. The averaging approach is commonly referred to as the LMG method, after Lindeman, Merenda, and Gold.[11] In 2005, Feldman introduced the Proportional Marginal Value Decomposition (PMVD), which added an "Exclusion" property ensuring that a regressor with a true coefficient of zero asymptotically receives a zero share of importance.[12] During this period, variable importance began to be linked with Shapley values,[13] an important development for the next era.

The Random Forest and XAI Era (2001–Present)


The year 2001 marked a paradigm shift with Leo Breiman's introduction of Random Forests. Breiman moved away from model parameters (coefficients) and linear notions of importance by introducing "permutation importance" (mean decrease in accuracy), which assessed nonlinear importance by measuring the drop in model performance when a feature's values were randomly shuffled, alongside the impurity-based Gini importance (mean decrease in impurity).[14] Simultaneously, Lipovetsky and Conklin (2001) applied the Shapley value from cooperative game theory to regression, providing a consistent method for variance attribution in the presence of multicollinearity.[15]

Gradient-based feature importance methods emerged from early sensitivity analysis in neural networks, where partial derivatives of a model's output with respect to its inputs were used to quantify how small input changes affect predictions. With the rise of deep learning, gradient-based feature attribution has grown in popularity. Simple input gradients and gradient-times-input methods were among the earliest and most widely used because of their computational efficiency, but they were later criticized for instability and noise, especially in deep, non-linear models. To address these issues, more principled approaches were introduced, most notably Integrated Gradients,[16] which averages gradients along a path from a baseline input to the actual input and satisfies desirable axioms such as sensitivity and implementation invariance. Closely related methods include DeepLIFT, which propagates contribution scores relative to a reference activation, and Layer-wise Relevance Propagation (LRP), which redistributes prediction scores backward through the network. Variants such as SmoothGrad further improve robustness by averaging gradients over noisy perturbations. Together, these methods form the core of modern gradient-based feature attribution and remain central to explainable AI for differentiable models.
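
As an illustration of the basic idea, the following sketch (a minimal example using PyTorch; the small network and random input are hypothetical) computes a vanilla input-gradient saliency score and a gradient-times-input attribution for a single prediction:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small hypothetical regression network and a random input.
model = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 1))
x = torch.randn(1, 4, requires_grad=True)

# Forward pass and gradient of the scalar output with respect to the input.
output = model(x).squeeze()
grad, = torch.autograd.grad(output, x)

saliency = grad.abs()          # "vanilla" input-gradient saliency
grad_times_input = grad * x    # gradient-times-input attribution

print("saliency:", saliency.detach().numpy())
print("gradient x input:", grad_times_input.detach().numpy())
```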

A major milestone occurred in 2017 when Scott Lundberg and Su-In Lee introduced SHAP (SHapley Additive exPlanations). SHAP unified various local attribution methods (such as LIME and DeepLIFT) under the umbrella of Shapley values, providing a mathematically consistent framework for local feature importance applicable to any machine learning model.[17] The paper has since become one of the most cited works in machine learning.

Christoph Molnar later synthesized many of these methods in the widely read textbook Interpretable Machine Learning.[2]

Taxonomy and Classification of Methods


The variable importance literature has produced many methods, but the community has not reached a consensus on how to classify them. The list below contains some of the most popular classifications, though it is not exhaustive.

Classification by Scope


Some researchers divide feature importance methods into four distinct settings based on two axes: Global vs. Local and Data vs. Model.[18][19][20]

  • Global-Model Importance: Explains how a trained model behaves across the entire dataset. It identifies which features the model generally relies on for its predictions.[18]
  • Global-Data Importance: Explains the true relationships in the underlying phenomenon. It seeks to identify the intrinsic predictive power of features within the population, regardless of a specific model's choices.[1]
  • Local-Model Importance: Explains why a specific prediction was made for a single instance. It quantifies the influence of each feature on that particular model output.[18]
  • Local-Data Importance: Explains the role of a feature for a specific individual in the real world (e.g., why a particular patient developed a disease), focusing on the causal or statistical dependencies for that data point.[18]

Methods such as SHAP and LIME are local-model methods. Global-model importance metrics include permutation importance[14] and SAGE.[20] Global-data methods include MCI[1] and UMFI.[21]

Classification by Correlation Treatment: The Marginal to Conditional Continuum


Many feature importance methods differ in how they treat correlation between features; indeed, if all features were mutually independent, many of these methods would give identical results. Some researchers therefore classify methods based on how they assign credit to correlated features.[22][4][23]

To distinguish marginal from conditional methods, suppose that the response variable Y is fully determined by two of the predictor variables, X1 and X3. Further suppose that another predictor, X2, is fully determined by X1 (for example, a duplicate of X1), and that X3 is independent of both X1 and X2.

  • Conditional Feature Importance: Evaluates a feature by conditioning on the values of all other features. This approach respects the dependence structure of the data and measures the unique information a feature provides that is not already captured by other variables. In the example above, purely conditional methods assign all the importance to X3 while giving zero importance to X1 and X2, since each of those two is redundant given the other. Methods in this category include conditional permutation importance,[22] Leave-One-Covariate-Out (LOCO), and partial correlation.
  • Marginal Feature Importance: Relies on associations between the response and the predictors, regardless of multicollinearity. In the example above, purely marginal methods assign similarly high importance to all three predictors. Examples include correlation, marginal contribution feature importance,[1] and ultra-marginal feature importance.[21]

Methods such as SHAP and permutation importance lie somewhere between these two extremes, since importance is shared among correlated features; the sketch below contrasts a purely marginal score with a purely conditional score on simulated data following the example above.
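
The following sketch makes the example concrete (the variable names X1, X2, X3 and the use of scikit-learn are illustrative assumptions): a marginal score (squared correlation with Y) rewards all three predictors, while a conditional, LOCO-style score (drop in R² when a feature is removed) assigns zero to X1 and X2 because each is redundant given the other.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x3 = rng.normal(size=n)          # independent of x1
x2 = x1.copy()                   # fully determined by (a duplicate of) x1
y = x1 + x3                      # response determined by x1 and x3
X = np.column_stack([x1, x2, x3])

# Marginal importance: squared correlation of each feature with y.
marginal = [np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(3)]

# Conditional (LOCO-style) importance: drop in R^2 when the feature is left out.
full_r2 = LinearRegression().fit(X, y).score(X, y)
conditional = []
for j in range(3):
    X_minus = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(X_minus, y).score(X_minus, y)
    conditional.append(full_r2 - r2)

print("marginal   :", np.round(marginal, 3))     # all three clearly > 0
print("conditional:", np.round(conditional, 3))  # ~0 for x1 and x2, > 0 for x3
```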

Classification by Mechanism: Gradient vs. Non-Gradient


The technical implementation of the importance measure often dictates its applicability to different model architectures.

  • Gradient-Based: Utilizes the derivatives (gradients) of the model's output with respect to its input features. These methods are typically model-specific and are used for differentiable models such as neural networks. Examples: saliency maps, Integrated Gradients, Grad-CAM, DeepLIFT.[24]
  • Non-Gradient-Based: Treats the model as a "black box" and relies on perturbations, shuffling, or training submodels. These methods are typically model-agnostic and applicable to any algorithm.[24] Examples: permutation importance, KernelSHAP, LOCO, MCI.[1]

Classification by Purpose


The choice of method is frequently driven by the end-user's objective.

  • Model Explanation: The goal is to understand the "logic" of a black-box model to ensure safety, fairness, and reliability. Methods like SHAP, LIME, and Accumulated local effects are standard here.[17]
  • Data Explanation (Scientific Inference): The goal is to learn about the real world. Researchers prioritize methods that handle redundancy and correlation in a way that reflects the true underlying relationships (e.g., MCI, UMFI).[1][21]
  • Model Optimization (Feature Selection): The goal is to improve the model's performance by removing irrelevant or redundant features. Techniques like Recursive Feature Elimination (RFE) use importance scores as a selection criterion.[25]

Other Classifications


Achen (1982) introduced a classification of linear regression-based feature importance methods: "dispersion importance" (explained variance), "level importance" (impact on the mean), and "theoretical importance" (the change in response for a given change in regressor).[26]

Axiomatic Foundations of Feature Importance


To move beyond heuristic rankings, researchers use axioms to define what a "fair" or "valid" importance score should look like. These axioms provide the mathematical justification for selecting one method over another.

The Shapley Axioms (S1–S4)


Shapley values are the unique solution that satisfies four core game-theoretic axioms, which describe how to fairly distribute the "total gain" of a model's prediction among the participating features.[27]

  1. Efficiency (Local Accuracy): The sum of the importance scores for all features must equal the difference between the model's prediction for an instance and the expected prediction: Σ_i φ_i = f(x) − E[f(X)].[17]
  2. Symmetry: If two features i and j contribute exactly the same value to every possible subset of other features, they must receive the same importance score: φ_i = φ_j.[17]
  3. Dummy (Null Player): If a feature contributes nothing to the value function for any subset of features, its importance score must be zero: φ_i = 0. This is crucial for identifying irrelevant features.[17]
  4. Additivity (Linearity): If the value function is the sum of two value functions, v = v_1 + v_2, then the importance scores must be the sum of the scores calculated for each function: φ_i(v_1 + v_2) = φ_i(v_1) + φ_i(v_2).[17]
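
To make the axioms concrete, the following sketch computes Shapley values by brute force, using the R² of a linear model on each feature subset as the value function. This is a global, Shapley-regression-style illustration rather than the SHAP algorithm itself; the simulated data and choice of value function are illustrative assumptions, and the enumeration is feasible only for a handful of features.

```python
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.linear_model import LinearRegression

def value(X, y, subset):
    """Value of a coalition: R^2 of a linear model using only that subset."""
    if not subset:
        return 0.0
    Xs = X[:, list(subset)]
    return LinearRegression().fit(Xs, y).score(Xs, y)

def shapley_values(X, y):
    d = X.shape[1]
    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                # Shapley weight |S|! (d - |S| - 1)! / d!
                weight = factorial(len(S)) * factorial(d - len(S) - 1) / factorial(d)
                phi[i] += weight * (value(X, y, S + (i,)) - value(X, y, S))
    return phi

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=2000)

phi = shapley_values(X, y)
print("Shapley shares:", np.round(phi, 3))
# Efficiency: the shares sum to the full-model value (minus the empty-set value of 0).
print("sum of shares :", round(phi.sum(), 3),
      "vs full-model R^2:", round(value(X, y, (0, 1, 2)), 3))
```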

Data-Driven Axioms for MCI and UMFI


For scientific discovery, the Shapley axioms are sometimes criticized because they average contributions, which can dilute the importance assigned to correlated features.[1] Some researchers have therefore proposed alternative axioms for true-to-the-data methods.

  • Marginal Contribution: The importance of a feature must be at least as high as the gain it provides when added to the set of all other features.[1]
  • Elimination: Removing other features from the feature set can only decrease (or leave unchanged) the importance of a remaining feature. It cannot increase it.[1]
  • IRI & SD (Invariance under Redundant Information and Symmetry under Duplication): Adding a redundant feature should not change the importance of preexisting features, and identical features should receive equal importance.[21][1]
  • Blood Relation: A feature should have non-zero importance if and only if the feature is blood related (associated) with the response in the ground-truth causal graph.[21]

The Inconsistency Theorem


It is mathematically impossible for a single feature importance score to satisfy certain intuitive properties simultaneously, such as being consistent between local and global settings while also being robust to all types of feature dependencies (such as colliders).[18] This suggests that users must prioritize specific axioms based on their task; for example, if one values local accuracy (efficiency), one may have to sacrifice robustness to perfect correlation.[27]

Major Methods and Algorithms


Shapley Values and SHAP Variants


SHAP (SHapley Additive exPlanations) has become the dominant framework for local attribution. It interprets the model prediction as a "game" where feature values are the "players".[2]

  • KernelSHAP: A model-agnostic approximation that uses a weighted linear regression (the "Shapley kernel") to estimate Shapley values. Its main limitation is computational speed, as it requires many model evaluations.[2]
  • TreeSHAP: An algorithm specifically designed for tree ensembles (XGBoost, LightGBM, Random Forest). It computes exact Shapley values in polynomial time by traversing the tree structure. However, "path-dependent" TreeSHAP can sometimes produce unintuitive results because it changes the value function to rely on conditional expectations.[2]
  • DeepSHAP: Combines SHAP values with the DeepLIFT algorithm to provide fast attributions for neural networks.[24]
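
In practice, SHAP values are typically computed with the shap Python package. A minimal usage sketch, assuming shap and scikit-learn are installed (the toy data and random-forest model are arbitrary choices):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Toy data: 5 features, only the first two carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeSHAP: exact Shapley values for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)          # shape (n_samples, n_features)

# Local explanation for one instance and a simple global summary.
print("local attribution for row 0:", np.round(shap_values[0], 3))
print("mean |SHAP| per feature    :", np.round(np.abs(shap_values).mean(axis=0), 3))
```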

Permutation Importance (PI)


Permutation importance is the standard approach for global, model-agnostic explanation. It defines the importance of a feature X_j as the increase in model error after the values of X_j are shuffled in the test set.[14]

  • Intuition: If the model relies on a feature, shuffling its values destroys the relationship with the response, causing the error to spike.[2]
  • Pros: Easy to understand; does not require model retraining; captures both main effects and interactions.[2]
  • Cons: Vulnerable to correlated features; permuting a feature can push inputs off the training distribution.[22]
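
A from-scratch sketch of the procedure on held-out data (the model, data, and error metric are placeholder choices; scikit-learn also ships a comparable permutation_importance utility):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def permutation_importance(model, X_test, y_test, n_repeats=10, seed=0):
    """Mean increase in test error after shuffling each feature."""
    rng = np.random.default_rng(seed)
    baseline = mean_squared_error(y_test, model.predict(X_test))
    importances = np.zeros(X_test.shape[1])
    for j in range(X_test.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X_test.copy()
            rng.shuffle(X_perm[:, j])            # destroy the feature-target link
            increases.append(
                mean_squared_error(y_test, model.predict(X_perm)) - baseline)
        importances[j] = np.mean(increases)
    return importances

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
print(np.round(permutation_importance(model, X_te, y_te), 3))
```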

Conditional Permutation Importance (CPI)


Strobl et al. (2008) introduced CPI to address the bias of PI toward correlated features. CPI permutes a feature X_j only within "blocks" defined by the values of the other features associated with X_j. This ensures that the shuffled values stay "local" to the original data distribution.[22]

  • Algorithm Details: The party package implementation in R uses p-values from independence tests to select which features to condition on. If the p-value is below a specified threshold, the feature is included in the conditioning set.[22]
  • Limitations: Large sample sizes can lead to "greedy" conditioning, in which almost all features are selected for the blocks, making the permutation less effective. Newer implementations such as permimp aim to be less sensitive to these sample-size effects.[28]
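
A simplified sketch of the conditioning idea, permuting a feature only within quantile bins of a single correlated covariate; this approximates conditioning and is not the party or permimp implementation, and the data and bin count are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
x2 = x1 + 0.1 * rng.normal(size=2000)      # strongly correlated with x1
y = x1 + rng.normal(scale=0.1, size=2000)  # only x1 affects the response
X = np.column_stack([x1, x2])

model = RandomForestRegressor(random_state=0).fit(X, y)
baseline = mean_squared_error(y, model.predict(X))

# Unconditional permutation of x2: shuffle its values globally.
X_marg = X.copy()
rng.shuffle(X_marg[:, 1])
marginal_pi = mean_squared_error(y, model.predict(X_marg)) - baseline

# Conditional permutation of x2: shuffle only within quantile bins of x1,
# so the permuted values stay consistent with the observed x1.
X_cond = X.copy()
bin_ids = np.digitize(x1, np.quantile(x1, np.linspace(0, 1, 11)[1:-1]))
for b in np.unique(bin_ids):
    idx = np.where(bin_ids == b)[0]
    X_cond[idx, 1] = X_cond[rng.permutation(idx), 1]
conditional_pi = mean_squared_error(y, model.predict(X_cond)) - baseline

print("unconditional PI of x2:", round(marginal_pi, 3))    # inflated by correlation
print("conditional PI of x2  :", round(conditional_pi, 3))  # much smaller
```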

Leave-One-Covariate-Out (LOCO)


LOCO is a rigorous frequentist approach. To find the importance of a feature X_j, a researcher trains two models: one with all features and one without X_j. The importance is the difference in their predictive risk (expected loss).[27]

  • Comparison with Shapley: While Shapley values average the marginal contributions across all submodels, LOCO looks only at the "top" of the lattice (the full model versus the model with the feature removed). Research by Verdinelli and Wasserman (2023) suggests that, for many statistical purposes, a normalized version of LOCO is more reliable and easier to interpret than Shapley values.[27]
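
A minimal LOCO sketch with a held-out test set (the gradient-boosting model and simulated data are placeholder choices):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def loco_importance(X_tr, y_tr, X_te, y_te, j, make_model):
    """Difference in test risk between the full model and the model without feature j."""
    full = make_model().fit(X_tr, y_tr)
    reduced = make_model().fit(np.delete(X_tr, j, axis=1), y_tr)
    risk_full = mean_squared_error(y_te, full.predict(X_te))
    risk_reduced = mean_squared_error(y_te, reduced.predict(np.delete(X_te, j, axis=1)))
    return risk_reduced - risk_full

rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 3))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.2, size=1500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

make_model = lambda: GradientBoostingRegressor(random_state=0)
for j in range(3):
    print(f"LOCO importance of feature {j}: "
          f"{loco_importance(X_tr, y_tr, X_te, y_te, j, make_model):.3f}")
```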

Marginal Contribution Feature Importance (MCI)


MCI is specifically designed for "data explanation". Unlike Shapley values, which average a feature's contributions over subsets, MCI reports the maximum contribution the feature can make to any possible subset; formally, the importance of feature i is the maximum over subsets S of ν(S ∪ {i}) − ν(S) for an evaluation function ν (see the brute-force sketch after the list below).[1]

  • Why use MCI?: In systems with high redundancy (e.g., measuring multiple similar metabolites in a biological pathway), Shapley values for each metabolite will approach zero as the number of redundant features increases. MCI remains robust, assigning high importance to any feature that could provide high predictive power in some context.[1]
  • Limitations: Computationally expensive and can miss correlated interactions.[21]
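
A brute-force sketch of the definition, using the R² of a linear model on each feature subset as the evaluation function; the simulated data include a near-duplicate feature to illustrate why MCI stays high under redundancy (the data and evaluation function are illustrative assumptions, and the enumeration is feasible only for a few features):

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def nu(X, y, subset):
    """Evaluation function: R^2 of a linear model restricted to `subset`."""
    if not subset:
        return 0.0
    Xs = X[:, list(subset)]
    return LinearRegression().fit(Xs, y).score(Xs, y)

def mci(X, y, i):
    """MCI: the largest gain feature i adds to any subset of the other features."""
    others = [j for j in range(X.shape[1]) if j != i]
    best = 0.0
    for k in range(len(others) + 1):
        for S in combinations(others, k):
            best = max(best, nu(X, y, S + (i,)) - nu(X, y, S))
    return best

rng = np.random.default_rng(0)
x1 = rng.normal(size=3000)
x2 = x1 + 0.01 * rng.normal(size=3000)     # near-duplicate of x1
x3 = rng.normal(size=3000)
y = x1 + x3 + rng.normal(scale=0.1, size=3000)
X = np.column_stack([x1, x2, x3])

print([round(mci(X, y, i), 3) for i in range(3)])  # duplicates keep high importance
```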

Integrated Gradients (IG)


Integrated Gradients is a leading gradient-based method for deep networks. It addresses the "saturation" problem of simple saliency maps (where gradients can become zero even for important features) by integrating the gradients along a path from a "baseline" (e.g., an all-black image) to the actual input.[29]
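
A minimal Riemann-sum approximation of Integrated Gradients in PyTorch (the small network, zero baseline, and step count are illustrative assumptions):

```python
import torch
import torch.nn as nn

def integrated_gradients(model, x, baseline, steps=50):
    """Approximate IG_i = (x_i - x'_i) * integral of dF/dx_i along the straight path."""
    alphas = torch.linspace(0.0, 1.0, steps + 1).view(-1, 1)
    path = baseline + alphas * (x - baseline)          # interpolated inputs
    path.requires_grad_(True)
    outputs = model(path).sum()
    grads, = torch.autograd.grad(outputs, path)
    avg_grads = grads.mean(dim=0)                      # average gradient along the path
    return (x - baseline).squeeze(0) * avg_grads

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 1))
x = torch.randn(1, 4)
baseline = torch.zeros(1, 4)                           # all-zero reference input

attributions = integrated_gradients(model, x, baseline)
print("IG attributions:", attributions.detach().numpy())
# Completeness check: the attributions should approximately sum to f(x) - f(baseline).
print("sum of attributions:", attributions.sum().item())
print("f(x) - f(baseline):", (model(x) - model(baseline)).item())
```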

Recursive Feature Elimination


RFE is a "wrapper" method that iteratively prunes the feature set.[25]

  1. Initialization: Train the model (e.g., SVM, Random Forest) on the full set of features.[30]
  2. Ranking: Use the model's internal importance measure (e.g., coefficient weights for SVM, MDI for Random Forest) to rank the features.[30]
  3. Elimination: Remove the least important feature (or a fraction of the least important features).[30]
  4. Iteration: Repeat the process on the remaining subset until the desired number of features is reached or performance begins to drop.[30]

RFE is particularly effective because it accounts for feature interactions that might be missed by simple filter methods (like Pearson correlation). However, it is computationally expensive as it requires retraining the model in each iteration.[31]
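
scikit-learn provides an implementation of this procedure; a minimal usage sketch (the linear-SVM estimator and toy data are arbitrary choices):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

# Toy data: 8 features, only the first three carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=300)

# Linear SVM as the base estimator; its coefficients provide the ranking,
# and one feature is dropped per iteration until three remain.
selector = RFE(estimator=SVR(kernel="linear"), n_features_to_select=3, step=1)
selector.fit(X, y)

print("selected features:", np.where(selector.support_)[0])
print("ranking          :", selector.ranking_)  # 1 = kept, higher = eliminated earlier
```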


References

  1. ^ a b c d e f g h i j k l m n Catav, Amnon; Fu, Boyang; Zoabi, Yazeed; Meilik, Ahuva Libi Weiss; Shomron, Noam; Ernst, Jason; Sankararaman, Sriram; Gilad-Bachrach, Ran (2021). Marginal contribution feature importance: an axiomatic approach for explaining data. International Conference on Machine Learning. Proceedings of Machine Learning Research. pp. 1324–1335.
  2. ^ a b c d e f g h Molnar, C. (2020). Interpretable Machine Learning. Lulu.com.
  3. ^ Breiman, Leo (2001). "Statistical modeling: The two cultures (with comments and a rejoinder by the author)". Statistical Science. 16 (3): 199–231. doi:10.1214/ss/1009213726.
  4. ^ a b c Grömping, U. (2015). "Variable importance in regression models". Wiley Interdisciplinary Reviews: Computational Statistics. 7 (2): 137–152. doi:10.1002/wics.1346.
  5. ^ Galton, Francis (1889). "Co-relations and their measurement, chiefly from anthropometric data". Proceedings of the Royal Society of London. 45 (273–279): 135–145. doi:10.1098/rspl.1888.0082.
  6. ^ Wright, S. (1921). "Correlation and Causation". Journal of Agricultural Research. 20: 557–585.
  7. ^ Hoffman, P. J. (1960). "The paramorphic representation of clinical judgment". Psychological Bulletin. 57 (2): 116–131. Bibcode:1960PsycB..57..116H. doi:10.1037/h0047807. PMID 14402414.
  8. ^ Darlington, R. B. (1968). "Multiple regression in psychological research and practice". Psychological Bulletin. 69 (3): 161–182. Bibcode:1968PsycB..69..161D. doi:10.1037/h0025471. PMID 4868134.
  9. ^ Green, P. E.; Carroll, J. D.; DeSarbo, W. S. (1978). "A new measure of regressor importance in multiple regression". Journal of Marketing Research. 15: 356–360. doi:10.1177/002224377801500305.
  10. ^ Kruskal, W. (1987). "Relative importance by averaging over orderings". The American Statistician. 41: 6–10. doi:10.1080/00031305.1987.10475432.
  11. ^ Lindeman, R. H.; Merenda, P. F.; Gold, R. Z. (1980). Introduction to Bivariate and Multivariate Analysis. Glenview, IL: Scott, Foresman.
  12. ^ Feldman, B. (1999). The proportional value of a cooperative game. Econometric Society World Congress 2000.
  13. ^ Shapley, L. S. (1953). "A value for n-person games". Contributions to the Theory of Games. 2: 307–317.
  14. ^ a b c Breiman, L. (2001). "Random forests". Machine Learning. 45 (1): 5–32. Bibcode:2001MachL..45....5B. doi:10.1023/A:1010933404324.
  15. ^ Lipovetsky, S.; Conklin, M. (2001). "Analysis of regression in game theory approach". Applied Stochastic Models in Business and Industry. 17 (4): 319–330. doi:10.1002/asmb.446.
  16. ^ Sundararajan, M.; Taly, A.; Yan, Q. (2017). Axiomatic attribution for deep networks. International Conference on Machine Learning. Proceedings of Machine Learning Research. pp. 3319–3328.
  17. ^ a b c d e f Lundberg, S. M.; Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems. Vol. 30.
  18. ^ a b c d e Harel, N.; Obolski, U.; Gilad-Bachrach, R. Inherent Inconsistencies of Feature Importance. XAI in Action: Past, Present, and Future Applications.
  19. ^ Chen, H.; Janizek, J. D.; Lundberg, S.; Lee, S. I. (2020). "True to the model or true to the data?". arXiv:2006.16234 [cs.LG].
  20. ^ a b Covert, I.; Lundberg, S. M.; Lee, S. I. (2020). Understanding global feature contributions with additive importance measures. Advances in Neural Information Processing Systems. Vol. 33. pp. 17212–17223.
  21. ^ a b c d e Janssen, J.; Guan, V.; Robeva, E. (2023). Ultra-marginal feature importance: Learning from data with causal guarantees. International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research. pp. 10782–10814.
  22. ^ a b c d e Strobl, C.; Boulesteix, A. L.; Kneib, T.; Augustin, T.; Zeileis, A. (2008). "Conditional variable importance for random forests". BMC Bioinformatics. 9 (1): 307. doi:10.1186/1471-2105-9-307.
  23. ^ Molnar, C.; König, G.; Bischl, B.; Casalicchio, G. (2024). "Model-agnostic feature importance and effects with dependent features: a conditional subgroup approach". Data Mining and Knowledge Discovery. 38 (5): 2903–2941. doi:10.1007/s10618-022-00901-9.
  24. ^ a b c d "MASE: Interpretable NLP Models via Model-Agnostic Saliency Estimation". arXiv.
  25. ^ a b "A Systematic Literature Review: Recursive Feature Elimination Algorithms".
  26. ^ Achen, C. H. (1982). Interpreting and Using Regression. Newbury Park, CA: Sage.
  27. ^ a b c d "Feature Importance: A Closer Look at Shapley Values". arXiv. 2023.
  28. ^ Debeer, D.; Strobl, C. (2020). "Conditional permutation importance revisited". BMC Bioinformatics. 21 (1): 307.
  29. ^ "Agree to Disagree: Exploring Consensus of XAI Methods". IFIP.
  30. ^ a b c d "Recursive Feature Elimination on the Sonar Data Set".
  31. ^ "An automatically recursive feature elimination method". Taylor & Francis.