Feature Importance


Feature Importance (or Variable Importance or Feature Attribution) refers to a set of techniques and mathematical frameworks used in machine learning and statistics to quantify the contribution of input variables (features) to a model's output or to the underlying data-generating process. It is a fundamental component of Explainable Artificial Intelligence (XAI) and Interpretable Machine Learning (IML) and is used for model improvement, scientific inference, and feature selection.[1][2]

The fundamental challenge of feature importance lies in the complexity of models and the dependence structures inherent in high-dimensional data. Model complexity means there are many different ways to summarize model behavior in human-interpretable terms. In the presence of correlated features, assigning credit becomes a multi-way trade-off: researchers must choose between sets of axioms that favor different principles (for example, fidelity to the internal logic of the model versus fidelity to the underlying data-generating process). As a result, there is no single ground-truth value of feature importance; rather, the score is context-dependent, varying with the scope of the explanation, the model (or data) of interest, and the ultimate purpose of the analysis.[1]

In an online survey (n=266) conducted by Christoph Molnar, author of a widely used textbook on interpretable machine learning,[2] 35.3% of respondents said they primarily use feature importance to gain insights about data, 18.4% cited justifying the model as its primary use, and another 18.4% used it for debugging and improving the model (the remaining 27.8% answered only that they wanted to see the survey results).

History and Theoretical Foundations


The history of feature importance is a century-long progression from path analysis in genetics to gradient-based methods made specifically for modern deep learning architectures. The progression of the field and the changes in focus reflect the broader shift in the culture of data science from parametric modeling and inference toward algorithmic modeling and prediction.[3] Feature importance continues to be an extremely active area of research. For a more complete historical overview, see the review by Ulrike Grömping.[4]

Early Development and Correlation (1920s–1970s)


The earliest forms of feature importance assessed the strength of relationships between pairs of variables in fields such as animal biology and human psychology, using methods like Francis Galton's correlation coefficient.[5] The formal quest to determine variable importance continued with Sewall Wright's development of path analysis in 1921.[6] Wright sought to understand causal influences in complex systems, and path analysis quantified the correlative influences along direct paths by decomposing correlation and partial correlation coefficients into path-based components. For decades afterwards, the dominant approach to variable importance was the inspection of standardized regression coefficients (regression weights). These weights, however, were notoriously unstable in the presence of multicollinearity (high correlation between predictor variables), so the importance assigned to a variable could depend on the order in which it was entered into a sequential regression model.[4]

In 1960, Hoffman proposed a method (relative weights) to handle these correlations,[7] which was later critiqued and refined by Darlington (1968) and Green et al. (1978).[8][9] During this period, researchers primarily focused on partitioning the coefficient of determination (R²) among the predictors.

The Averaging Movement (1980s–2000s)


To address the arbitrary nature of sequential entry, William Kruskal proposed a seminal solution in 1987: averaging relative importance over all possible orderings of the independent variables.[10] This ensured that no single variable was unfairly penalized or elevated by its position in the model. The averaging approach is commonly referred to as the LMG method, after Lindeman, Merenda, and Gold.[11] In 2005, Feldman introduced the Proportional Marginal Value Decomposition (PMVD), which added an "Exclusion" property ensuring that a regressor with a true coefficient of zero asymptotically receives a zero share of importance.[12] During this period, variable importance began to be linked with Shapley values,[13] an important development for the next era.

The Random Forest and XAI Era (2001–Present)


The year 2001 marked a paradigm shift with Leo Breiman's introduction of Random Forests. Breiman moved away from model parameters (coefficients) and linear notions of importance by introducing "permutation importance" (mean decrease in accuracy), which assessed nonlinear importance by measuring the drop in model performance when a feature's values were randomly shuffled, alongside the impurity-based Gini importance (mean decrease in impurity).[14] Simultaneously, Lipovetsky and Conklin (2001) applied the Shapley value from cooperative game theory to regression, providing a consistent method for variance attribution in the presence of multicollinearity.[15]

Gradient-based feature importance methods emerged from early sensitivity analysis in neural networks, where partial derivatives of a model's output with respect to its inputs were used to quantify how small input changes affect predictions. With the rise of deep learning, gradient-based feature attribution has grown in popularity. Simple input gradients and gradient-times-input methods were among the earliest and most widely used because of their computational efficiency, but they were later criticized for instability and noise, especially in deep, non-linear models. To address these issues, more principled approaches were introduced, most notably Integrated Gradients,[16] which averages gradients along a path from a baseline input to the actual input and satisfies desirable axioms such as sensitivity and implementation invariance. Closely related methods include DeepLIFT, which propagates contribution scores relative to a reference activation, and Layer-wise Relevance Propagation (LRP), which redistributes prediction scores backward through the network. Variants such as SmoothGrad further improve robustness by averaging gradients over noisy perturbations. Together, these methods form the core of modern gradient-based feature attribution and remain central to explainable AI for differentiable models.
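
As an illustration of the basic idea, the following sketch (a minimal example using PyTorch; the small network and random input are hypothetical) computes a vanilla input-gradient saliency score and a gradient-times-input attribution for a single prediction:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small hypothetical regression network and a random input.
model = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 1))
x = torch.randn(1, 4, requires_grad=True)

# Forward pass and gradient of the scalar output with respect to the input.
output = model(x).squeeze()
grad, = torch.autograd.grad(output, x)

saliency = grad.abs()          # "vanilla" input-gradient saliency
grad_times_input = grad * x    # gradient-times-input attribution

print("saliency:", saliency.detach().numpy())
print("gradient x input:", grad_times_input.detach().numpy())
```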

A major milestone occurred in 2017 when Scott Lundberg and Su-In Lee introduced SHAP (SHapley Additive exPlanations). SHAP unified various local attribution methods (such as LIME and DeepLIFT) under the umbrella of Shapley values, providing a mathematically consistent framework for local feature importance applicable to any machine learning model.[17] The paper has since become one of the most cited works in machine learning.

Christoph Molnar later synthesized many of these methods in the widely read textbook Interpretable Machine Learning.[2]

Taxonomy and Classification of Methods


The variable importance literature has produced many methods, but the community has not reached a consensus on how to classify them. The list below contains some of the most popular classifications, though it is not exhaustive.

Classification by Scope


Some researchers divide feature importance methods into four distinct settings based on two axes: Global vs. Local and Data vs. Model.[18][19][20]

  • Global-Model Importance: Explains how a trained model behaves across the entire dataset. It identifies which features the model generally relies on for its predictions.[18]
  • Global-Data Importance: Explains the true relationships in the underlying phenomenon. It seeks to identify the intrinsic predictive power of features within the population, regardless of a specific model's choices.[1]
  • Local-Model Importance: Explains why a specific prediction was made for a single instance. It quantifies the influence of each feature on that particular model output.[18]
  • Local-Data Importance: Explains the role of a feature for a specific individual in the real world (e.g., why a particular patient developed a disease), focusing on the causal or statistical dependencies for that data point.[18]

Methods such as SHAP and LIME are local-model methods. Global-model importance metrics include permutation importance[14] and SAGE.[20] Global-data methods include MCI[1] and UMFI.[21]

Classification by Correlation Treatment: The Marginal to Conditional Continuum


Many feature importance methods differ in how they treat correlation between features; indeed, if all features were mutually independent, many of these methods would give identical results. Some researchers therefore classify methods based on how they assign credit to correlated features.[22][4][23]

To distinguish marginal from conditional methods, suppose that the response variable Y is fully determined by two of the predictor variables, X1 and X3. Further suppose that another predictor, X2, is fully determined by X1 (for example, a duplicate of X1), and that X3 is independent of both X1 and X2.

  • Conditional Feature Importance: Evaluates a feature by conditioning on the values of all other features. This approach respects the dependence structure of the data and measures the unique information a feature provides that is not already captured by other variables. In the example above, purely conditional methods assign all the importance to X3 while giving zero importance to X1 and X2, since each of those two is redundant given the other. Methods in this category include conditional permutation importance,[22] Leave-One-Covariate-Out (LOCO), and partial correlation.
  • Marginal Feature Importance: Relies on associations between the response and the predictors, regardless of multicollinearity. In the example above, purely marginal methods assign similarly high importance to all three predictors. Examples include correlation, marginal contribution feature importance,[1] and ultra-marginal feature importance.[21]

Methods such as SHAP and permutation importance lie somewhere between these two extremes, since importance is shared among correlated features; the sketch below contrasts a purely marginal score with a purely conditional score on simulated data following the example above.
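
The following sketch makes the example concrete (the variable names X1, X2, X3 and the use of scikit-learn are illustrative assumptions): a marginal score (squared correlation with Y) rewards all three predictors, while a conditional, LOCO-style score (drop in R² when a feature is removed) assigns zero to X1 and X2 because each is redundant given the other.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x3 = rng.normal(size=n)          # independent of x1
x2 = x1.copy()                   # fully determined by (a duplicate of) x1
y = x1 + x3                      # response determined by x1 and x3
X = np.column_stack([x1, x2, x3])

# Marginal importance: squared correlation of each feature with y.
marginal = [np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(3)]

# Conditional (LOCO-style) importance: drop in R^2 when the feature is left out.
full_r2 = LinearRegression().fit(X, y).score(X, y)
conditional = []
for j in range(3):
    X_minus = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(X_minus, y).score(X_minus, y)
    conditional.append(full_r2 - r2)

print("marginal   :", np.round(marginal, 3))     # all three clearly > 0
print("conditional:", np.round(conditional, 3))  # ~0 for x1 and x2, > 0 for x3
```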

Classification by Mechanism: Gradient vs. Non-Gradient


The technical implementation of the importance measure often dictates its applicability to different model architectures.

  • Gradient-Based: Utilizes the derivatives (gradients) of the model's output with respect to its input features. These methods are typically model-specific and are used for differentiable models such as neural networks. Examples: saliency maps, Integrated Gradients, Grad-CAM, DeepLIFT.[24]
  • Non-Gradient-Based: Treats the model as a "black box" and relies on perturbations, shuffling, or training submodels. These methods are typically model-agnostic and applicable to any algorithm.[24] Examples: permutation importance, KernelSHAP, LOCO, MCI.[1]

Classification by Purpose


The choice of method is frequently driven by the end-user's objective.

  • Model Explanation: The goal is to understand the "logic" of a black-box model to ensure safety, fairness, and reliability. Methods like SHAP, LIME, and Accumulated local effects are standard here.[17]
  • Data Explanation (Scientific Inference): The goal is to learn about the real world. Researchers prioritize methods that handle redundancy and correlation in a way that reflects the true underlying relationships (e.g., MCI, UMFI).[1][21]
  • Model Optimization (Feature Selection): The goal is to improve the model's performance by removing irrelevant or redundant features. Techniques like Recursive Feature Elimination (RFE) use importance scores as a selection criterion.[25]

Other Classifications


Achen (1982) introduced a classification of linear regression-based feature importance methods: "dispersion importance" (explained variance), "level importance" (impact on the mean), and "theoretical importance" (the change in response for a given change in regressor).[26]

Axiomatic Foundations of Feature Importance


To move beyond heuristic rankings, researchers use axioms to define what a "fair" or "valid" importance score should look like. These axioms provide the mathematical justification for selecting one method over another.

The Shapley Axioms (S1–S4)


Shapley values are the unique solution that satisfies four core game-theoretic axioms, which describe how to fairly distribute the "total gain" of a model's prediction among the participating features.[27]

  1. Efficiency (Local Accuracy): The sum of the importance scores for all features must equal the difference between the model's prediction for an instance and the expected prediction: Σ_i φ_i = f(x) − E[f(X)].[17]
  2. Symmetry: If two features i and j contribute exactly the same value to every possible subset of other features, they must receive the same importance score: φ_i = φ_j.[17]
  3. Dummy (Null Player): If a feature contributes nothing to the value function for any subset of features, its importance score must be zero: φ_i = 0. This is crucial for identifying irrelevant features.[17]
  4. Additivity (Linearity): If the value function is the sum of two value functions, v = v_1 + v_2, then the importance scores must be the sum of the scores calculated for each function: φ_i(v_1 + v_2) = φ_i(v_1) + φ_i(v_2).[17]
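
To make the axioms concrete, the following sketch computes Shapley values by brute force, using the R² of a linear model on each feature subset as the value function. This is a global, Shapley-regression-style illustration rather than the SHAP algorithm itself; the simulated data and choice of value function are illustrative assumptions, and the enumeration is feasible only for a handful of features.

```python
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.linear_model import LinearRegression

def value(X, y, subset):
    """Value of a coalition: R^2 of a linear model using only that subset."""
    if not subset:
        return 0.0
    Xs = X[:, list(subset)]
    return LinearRegression().fit(Xs, y).score(Xs, y)

def shapley_values(X, y):
    d = X.shape[1]
    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                # Shapley weight |S|! (d - |S| - 1)! / d!
                weight = factorial(len(S)) * factorial(d - len(S) - 1) / factorial(d)
                phi[i] += weight * (value(X, y, S + (i,)) - value(X, y, S))
    return phi

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=2000)

phi = shapley_values(X, y)
print("Shapley shares:", np.round(phi, 3))
# Efficiency: the shares sum to the full-model value (minus the empty-set value of 0).
print("sum of shares :", round(phi.sum(), 3),
      "vs full-model R^2:", round(value(X, y, (0, 1, 2)), 3))
```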

Data-Driven Axioms for MCI and UMFI


For scientific discovery, the Shapley axioms are sometimes criticized because they average contributions, which can dilute the importance assigned to correlated features.[1] Some researchers have therefore proposed alternative axioms for true-to-the-data methods.

  • Marginal Contribution: The importance of a feature must be at least as high as the gain it provides when added to the set of all other features.[1]
  • Elimination: Removing other features from the feature set can only decrease (or leave unchanged) the importance of a remaining feature. It cannot increase it.[1]
  • IRI & SD (Invariance under Redundant Information and Symmetry under Duplication): Adding a redundant feature should not change the importance of preexisting features, and identical features should receive equal importance.[21][1]
  • Blood Relation: A feature should have non-zero importance if and only if the feature is blood related (associated) with the response in the ground-truth causal graph.[21]

The Inconsistency Theorem


It is mathematically impossible for a single feature importance score to satisfy certain intuitive properties simultaneously, such as being consistent between local and global settings while also being robust to all types of feature dependencies (such as colliders).[18] This suggests that users must prioritize specific axioms based on their task; for example, if one values local accuracy (efficiency), one may have to sacrifice robustness to perfect correlation.[27]

Major Methods and Algorithms


Shapley Values and SHAP Variants


SHAP (SHapley Additive exPlanations) has become the dominant framework for local attribution. It interprets the model prediction as a "game" where feature values are the "players".[2]

  • KernelSHAP: A model-agnostic approximation that uses a weighted linear regression (the "Shapley kernel") to estimate Shapley values. Its main limitation is computational speed, as it requires many model evaluations.[2]
  • TreeSHAP: An algorithm specifically designed for tree ensembles (XGBoost, LightGBM, Random Forest). It computes exact Shapley values in polynomial time by traversing the tree structure. However, "path-dependent" TreeSHAP can sometimes produce unintuitive results because it changes the value function to rely on conditional expectations.[2]
  • DeepSHAP: Combines SHAP values with the DeepLIFT algorithm to provide fast attributions for neural networks.[24]
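
In practice, SHAP values are typically computed with the shap Python package. A minimal usage sketch, assuming shap and scikit-learn are installed (the toy data and random-forest model are arbitrary choices):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Toy data: 5 features, only the first two carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeSHAP: exact Shapley values for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)          # shape (n_samples, n_features)

# Local explanation for one instance and a simple global summary.
print("local attribution for row 0:", np.round(shap_values[0], 3))
print("mean |SHAP| per feature    :", np.round(np.abs(shap_values).mean(axis=0), 3))
```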

Permutation Importance (PI)


Permutation importance is the standard approach for global, model-agnostic explanation. It defines the importance of a feature X_j as the increase in model error after the values of X_j are shuffled in the test set.[14]

  • Intuition: If the model relies on a feature, shuffling its values destroys the relationship with the response, causing the error to spike.[2]
  • Pros: Easy to understand; does not require model retraining; captures both main effects and interactions.[2]
  • Cons: Vulnerable to correlated features; permuting a feature can push inputs off the training distribution.[22]
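
A from-scratch sketch of the procedure on held-out data (the model, data, and error metric are placeholder choices; scikit-learn also ships a comparable permutation_importance utility):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def permutation_importance(model, X_test, y_test, n_repeats=10, seed=0):
    """Mean increase in test error after shuffling each feature."""
    rng = np.random.default_rng(seed)
    baseline = mean_squared_error(y_test, model.predict(X_test))
    importances = np.zeros(X_test.shape[1])
    for j in range(X_test.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X_test.copy()
            rng.shuffle(X_perm[:, j])            # destroy the feature-target link
            increases.append(
                mean_squared_error(y_test, model.predict(X_perm)) - baseline)
        importances[j] = np.mean(increases)
    return importances

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
print(np.round(permutation_importance(model, X_te, y_te), 3))
```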

Conditional Permutation Importance (CPI)


Strobl et al. (2008) introduced CPI to address the bias of PI toward correlated features. CPI permutes a feature X_j only within "blocks" defined by the values of the other features associated with X_j. This ensures that the shuffled values stay "local" to the original data distribution.[22]

  • Algorithm Details: The party package implementation in R uses p-values from independence tests to select which features to condition on. If the p-value is below a specified threshold, the feature is included in the conditioning set.[22]
  • Limitations: Large sample sizes can lead to "greedy" conditioning, in which almost all features are selected for the blocks, making the permutation less effective. Newer implementations such as permimp aim to be less sensitive to these sample-size effects.[28]
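
A simplified sketch of the conditioning idea, permuting a feature only within quantile bins of a single correlated covariate; this approximates conditioning and is not the party or permimp implementation, and the data and bin count are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
x2 = x1 + 0.1 * rng.normal(size=2000)      # strongly correlated with x1
y = x1 + rng.normal(scale=0.1, size=2000)  # only x1 affects the response
X = np.column_stack([x1, x2])

model = RandomForestRegressor(random_state=0).fit(X, y)
baseline = mean_squared_error(y, model.predict(X))

# Unconditional permutation of x2: shuffle its values globally.
X_marg = X.copy()
rng.shuffle(X_marg[:, 1])
marginal_pi = mean_squared_error(y, model.predict(X_marg)) - baseline

# Conditional permutation of x2: shuffle only within quantile bins of x1,
# so the permuted values stay consistent with the observed x1.
X_cond = X.copy()
bin_ids = np.digitize(x1, np.quantile(x1, np.linspace(0, 1, 11)[1:-1]))
for b in np.unique(bin_ids):
    idx = np.where(bin_ids == b)[0]
    X_cond[idx, 1] = X_cond[rng.permutation(idx), 1]
conditional_pi = mean_squared_error(y, model.predict(X_cond)) - baseline

print("unconditional PI of x2:", round(marginal_pi, 3))    # inflated by correlation
print("conditional PI of x2  :", round(conditional_pi, 3))  # much smaller
```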

Leave-One-Covariate-Out (LOCO)


LOCO is a rigorous frequentist approach. To find the importance of a feature X_j, a researcher trains two models: one with all features and one without X_j. The importance is the difference in their predictive risk (expected loss).[27]

  • Comparison with Shapley: While Shapley values average the marginal contributions across all submodels, LOCO looks only at the "top" of the lattice (the full model versus the model with the feature removed). Research by Verdinelli and Wasserman (2023) suggests that, for many statistical purposes, a normalized version of LOCO is more reliable and easier to interpret than Shapley values.[27]
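
A minimal LOCO sketch with a held-out test set (the gradient-boosting model and simulated data are placeholder choices):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def loco_importance(X_tr, y_tr, X_te, y_te, j, make_model):
    """Difference in test risk between the full model and the model without feature j."""
    full = make_model().fit(X_tr, y_tr)
    reduced = make_model().fit(np.delete(X_tr, j, axis=1), y_tr)
    risk_full = mean_squared_error(y_te, full.predict(X_te))
    risk_reduced = mean_squared_error(y_te, reduced.predict(np.delete(X_te, j, axis=1)))
    return risk_reduced - risk_full

rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 3))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.2, size=1500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

make_model = lambda: GradientBoostingRegressor(random_state=0)
for j in range(3):
    print(f"LOCO importance of feature {j}: "
          f"{loco_importance(X_tr, y_tr, X_te, y_te, j, make_model):.3f}")
```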

Marginal Contribution Feature Importance (MCI)


MCI is specifically designed for "data explanation". Unlike Shapley values, which average a feature's contributions over subsets, MCI reports the maximum contribution the feature can make to any possible subset; formally, the importance of feature i is the maximum over subsets S of ν(S ∪ {i}) − ν(S) for an evaluation function ν (see the brute-force sketch after the list below).[1]

  • Why use MCI?: In systems with high redundancy (e.g., measuring multiple similar metabolites in a biological pathway), Shapley values for each metabolite will approach zero as the number of redundant features increases. MCI remains robust, assigning high importance to any feature that could provide high predictive power in some context.[1]
  • Limitations: Computationally expensive and can miss correlated interactions.[21]
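
A brute-force sketch of the definition, using the R² of a linear model on each feature subset as the evaluation function; the simulated data include a near-duplicate feature to illustrate why MCI stays high under redundancy (the data and evaluation function are illustrative assumptions, and the enumeration is feasible only for a few features):

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def nu(X, y, subset):
    """Evaluation function: R^2 of a linear model restricted to `subset`."""
    if not subset:
        return 0.0
    Xs = X[:, list(subset)]
    return LinearRegression().fit(Xs, y).score(Xs, y)

def mci(X, y, i):
    """MCI: the largest gain feature i adds to any subset of the other features."""
    others = [j for j in range(X.shape[1]) if j != i]
    best = 0.0
    for k in range(len(others) + 1):
        for S in combinations(others, k):
            best = max(best, nu(X, y, S + (i,)) - nu(X, y, S))
    return best

rng = np.random.default_rng(0)
x1 = rng.normal(size=3000)
x2 = x1 + 0.01 * rng.normal(size=3000)     # near-duplicate of x1
x3 = rng.normal(size=3000)
y = x1 + x3 + rng.normal(scale=0.1, size=3000)
X = np.column_stack([x1, x2, x3])

print([round(mci(X, y, i), 3) for i in range(3)])  # duplicates keep high importance
```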

Integrated Gradients (IG)


Integrated Gradients is a leading gradient-based method for deep networks. It addresses the "saturation" problem of simple saliency maps (where gradients can become zero even for important features) by integrating the gradients along a path from a "baseline" (e.g., an all-black image) to the actual input.[29]
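
A minimal Riemann-sum approximation of Integrated Gradients in PyTorch (the small network, zero baseline, and step count are illustrative assumptions):

```python
import torch
import torch.nn as nn

def integrated_gradients(model, x, baseline, steps=50):
    """Approximate IG_i = (x_i - x'_i) * integral of dF/dx_i along the straight path."""
    alphas = torch.linspace(0.0, 1.0, steps + 1).view(-1, 1)
    path = baseline + alphas * (x - baseline)          # interpolated inputs
    path.requires_grad_(True)
    outputs = model(path).sum()
    grads, = torch.autograd.grad(outputs, path)
    avg_grads = grads.mean(dim=0)                      # average gradient along the path
    return (x - baseline).squeeze(0) * avg_grads

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 1))
x = torch.randn(1, 4)
baseline = torch.zeros(1, 4)                           # all-zero reference input

attributions = integrated_gradients(model, x, baseline)
print("IG attributions:", attributions.detach().numpy())
# Completeness check: the attributions should approximately sum to f(x) - f(baseline).
print("sum of attributions:", attributions.sum().item())
print("f(x) - f(baseline):", (model(x) - model(baseline)).item())
```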

Recursive Feature Elimination


RFE is a "wrapper" method that iteratively prunes the feature set.[25]

  1. Initialization: Train the model (e.g., SVM, Random Forest) on the full set of features.[30]
  2. Ranking: Use the model's internal importance measure (e.g., coefficient weights for SVM, MDI for Random Forest) to rank the features.[30]
  3. Elimination: Remove the least important feature (or a fraction of the least important features).[30]
  4. Iteration: Repeat the process on the remaining subset until the desired number of features is reached or performance begins to drop.[30]

RFE is particularly effective because it accounts for feature interactions that might be missed by simple filter methods (like Pearson correlation). However, it is computationally expensive as it requires retraining the model in each iteration.[31]
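
scikit-learn provides an implementation of this procedure; a minimal usage sketch (the linear-SVM estimator and toy data are arbitrary choices):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

# Toy data: 8 features, only the first three carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=300)

# Linear SVM as the base estimator; its coefficients provide the ranking,
# and one feature is dropped per iteration until three remain.
selector = RFE(estimator=SVR(kernel="linear"), n_features_to_select=3, step=1)
selector.fit(X, y)

print("selected features:", np.where(selector.support_)[0])
print("ranking          :", selector.ranking_)  # 1 = kept, higher = eliminated earlier
```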


References

  1. ^ a b c d e f g h i j k l m n Catav, Amnon; Fu, Boyang; Zoabi, Yazeed; Meilik, Ahuva Libi Weiss; Shomron, Noam; Ernst, Jason; Sankararaman, Sriram; Gilad-Bachrach, Ran (2021). Marginal contribution feature importance: an axiomatic approach for explaining data. International Conference on Machine Learning. Proceedings of Machine Learning Research. pp. 1324–1335.
  2. ^ a b c d e f g h Molnar, C. (2020). Interpretable Machine Learning. Lulu.com.
  3. ^ Breiman, Leo (2001). "Statistical modeling: The two cultures (with comments and a rejoinder by the author)". Statistical Science. 16 (3): 199–231. doi:10.1214/ss/1009213726.
  4. ^ a b c Grömping, U. (2015). "Variable importance in regression models". Wiley Interdisciplinary Reviews: Computational Statistics. 7 (2): 137–152. doi:10.1002/wics.1346.
  5. ^ Galton, Francis (1889). "Co-relations and their measurement, chiefly from anthropometric data". Proceedings of the Royal Society of London. 45 (273–279): 135–145. doi:10.1098/rspl.1888.0082.
  6. ^ Wright, S. (1921). "Correlation and Causation". Journal of Agricultural Research. 20: 557–585.
  7. ^ Hoffman, P. J. (1960). "The paramorphic representation of clinical judgment". Psychological Bulletin. 57 (2): 116–131. Bibcode:1960PsycB..57..116H. doi:10.1037/h0047807. PMID 14402414.
  8. ^ Darlington, R. B. (1968). "Multiple regression in psychological research and practice". Psychological Bulletin. 69 (3): 161–182. Bibcode:1968PsycB..69..161D. doi:10.1037/h0025471. PMID 4868134.
  9. ^ Green, P. E.; Carroll, J. D.; DeSarbo, W. S. (1978). "A new measure of regressor importance in multiple regression". Journal of Marketing Research. 15: 356–360. doi:10.1177/002224377801500305.
  10. ^ Kruskal, W. (1987). "Relative importance by averaging over orderings". The American Statistician. 41: 6–10. doi:10.1080/00031305.1987.10475432.
  11. ^ Lindeman, R. H.; Merenda, P. F.; Gold, R. Z. (1980). Introduction to Bivariate and Multivariate Analysis. Glenview, IL: Scott, Foresman.
  12. ^ Feldman, B. (1999). The proportional value of a cooperative game. Econometric Society World Congress 2000.
  13. ^ Shapley, L. S. (1953). "A value for n-person games". Contributions to the Theory of Games. 2: 307–317.
  14. ^ a b c Breiman, L. (2001). "Random forests". Machine Learning. 45 (1): 5–32. Bibcode:2001MachL..45....5B. doi:10.1023/A:1010933404324.
  15. ^ Lipovetsky, S.; Conklin, M. (2001). "Analysis of regression in game theory approach". Applied Stochastic Models in Business and Industry. 17 (4): 319–330. doi:10.1002/asmb.446.
  16. ^ Sundararajan, M.; Taly, A.; Yan, Q. (2017). Axiomatic attribution for deep networks. International Conference on Machine Learning. Proceedings of Machine Learning Research. pp. 3319–3328.
  17. ^ a b c d e f Lundberg, S. M.; Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems. Vol. 30.
  18. ^ a b c d e Harel, N.; Obolski, U.; Gilad-Bachrach, R. Inherent Inconsistencies of Feature Importance. XAI in Action: Past, Present, and Future Applications.
  19. ^ Chen, H.; Janizek, J. D.; Lundberg, S.; Lee, S. I. (2020). "True to the model or true to the data?". arXiv:2006.16234 [cs.LG].
  20. ^ a b Covert, I.; Lundberg, S. M.; Lee, S. I. (2020). Understanding global feature contributions with additive importance measures. Advances in Neural Information Processing Systems. Vol. 33. pp. 17212–17223.
  21. ^ a b c d e Janssen, J.; Guan, V.; Robeva, E. (2023). Ultra-marginal feature importance: Learning from data with causal guarantees. International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research. pp. 10782–10814.
  22. ^ a b c d e Strobl, C.; Boulesteix, A. L.; Kneib, T.; Augustin, T.; Zeileis, A. (2008). "Conditional variable importance for random forests". BMC Bioinformatics. 9 (1): 307. doi:10.1186/1471-2105-9-307.
  23. ^ Molnar, C.; König, G.; Bischl, B.; Casalicchio, G. (2024). "Model-agnostic feature importance and effects with dependent features: a conditional subgroup approach". Data Mining and Knowledge Discovery. 38 (5): 2903–2941. doi:10.1007/s10618-022-00901-9.
  24. ^ a b c d "MASE: Interpretable NLP Models via Model-Agnostic Saliency Estimation". arXiv.
  25. ^ a b "A Systematic Literature Review: Recursive Feature Elimination Algorithms".
  26. ^ Achen, C. H. (1982). Interpreting and Using Regression. Newbury Park, CA: Sage.
  27. ^ a b c d "Feature Importance: A Closer Look at Shapley Values". arXiv. 2023.
  28. ^ Debeer, D.; Strobl, C. (2020). "Conditional permutation importance revisited". BMC Bioinformatics. 21 (1): 307.
  29. ^ "Agree to Disagree: Exploring Consensus of XAI Methods". IFIP.
  30. ^ a b c d "Recursive Feature Elimination on the Sonar Data Set".
  31. ^ "An automatically recursive feature elimination method". Taylor & Francis.