AI-Enabled Intrusion Detection in Enterprise Networks: A Systematic Review of Methods, Datasets, and Evaluation Metrics (2018–2026)
DOI:
https://doi.org/10.63125/k4t9f683
Keywords:
AI-Enabled Intrusion Detection, Enterprise Networks, Explainability and Trust, Evaluation Rigor, False-Positive Burden Index
Abstract
This study addresses a recurring problem: AI-enabled intrusion detection systems (IDS) often report strong benchmark performance yet struggle to deliver consistently actionable, trusted alerts in real enterprise and hybrid cloud environments, where traffic is heterogeneous, class imbalance is extreme, and false-positive workload can overwhelm security operations centers. The purpose was to quantify which enterprise-relevant factors most strongly predict perceived AI-IDS effectiveness and operational suitability, and to link evaluation evidence to workload impact. A quantitative, cross-sectional, case-based design was used, with data collected from enterprise and cloud-facing security stakeholders involved in IDS monitoring and triage (N = 162 valid responses). Key variables, measured on 5-point Likert scales, included Dataset Representativeness (DREP), Evaluation Rigor (ERIG), Model Robustness (MROB), Deployment Readiness (DREADY), Explainability and Trust (TRUST), and the dependent outcome, Perceived AI-IDS Effectiveness (EFFECT); construct reliability was acceptable to strong (Cronbach’s α = 0.78–0.88). The analysis applied descriptive statistics, Pearson correlations, and multiple regression with collinearity diagnostics (VIF = 1.28–2.05). Perceived effectiveness (EFFECT M = 3.73, SD = 0.66) and trust (TRUST M = 3.69, SD = 0.68) were moderately high, and EFFECT correlated positively with TRUST (r = .52, p < .001), ERIG (r = .46, p < .001), DREP (r = .42, p < .001), MROB (r = .39, p < .001), and DREADY (r = .34, p < .001). The regression model explained substantial variance in effectiveness (R² = .51; adjusted R² = .49; F(5,156) = 32.45, p < .001), with TRUST the strongest predictor (β = .33, p < .001), followed by ERIG (β = .22, p = .002), DREP (β = .18, p = .008), MROB (β = .15, p = .019), and DREADY (β = .11, p = .041).
Operational implications were quantified using a False-Positive Burden Index: at 420 alerts/day, FPR = 0.07, and a triage time of 6.5 minutes per alert, false positives consumed 191.1 minutes/day (3.19 hours/day); reducing FPR to 0.04 lowered the burden to 109.2 minutes/day (1.82 hours/day), a 42.9% reduction. Overall, the results imply that enterprises gain the most adoption-ready value when an AI-IDS is explainable, evaluated rigorously on enterprise-representative data, and tuned to reduce workload through threshold governance and imbalance-aware metrics.
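The False-Positive Burden Index arithmetic reported above can be checked with a minimal sketch. The abstract does not state the formula; the product form used here (alerts per day × FPR × triage minutes per alert) is an assumption inferred from the reported figures, with FPR treated as the share of daily alerts that are false positives. The function name `false_positive_burden` and the variable names are hypothetical.

```python
def false_positive_burden(alerts_per_day: float, fpr: float, triage_minutes: float) -> float:
    """Minutes per day spent triaging false positives.

    Assumed form (not stated in the source): alerts/day * FPR * minutes per alert.
    """
    return alerts_per_day * fpr * triage_minutes


# Inputs taken from the abstract.
ALERTS_PER_DAY = 420
TRIAGE_MINUTES = 6.5

baseline = false_positive_burden(ALERTS_PER_DAY, 0.07, TRIAGE_MINUTES)
improved = false_positive_burden(ALERTS_PER_DAY, 0.04, TRIAGE_MINUTES)

# Relative reduction in daily burden when FPR drops from 0.07 to 0.04.
reduction_pct = 100 * (baseline - improved) / baseline

print(f"baseline burden: {baseline:.1f} min/day")
print(f"improved burden: {improved:.1f} min/day")
print(f"relative reduction: {reduction_pct:.1f}%")
```

Under this assumed form, the sketch reproduces the 191.1 and 109.2 minutes/day figures. Because alert volume and triage time are held constant, the relative reduction depends only on the ratio of the two FPR values (0.03/0.07, roughly 43%).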
