Machine learning for security / 2025 · Four-person team · Politecnico di Torino

230,000 attacks.
Seven adversary intents.

We analyzed 230,000 real attack sessions collected by decoy servers and compared several machine-learning methods for recognizing what each attacker was trying to do.

Role: Research & development

Dataset: 230K sessions

Stack: Python, sklearn, PyTorch

Framework: MITRE ATT&CK

01 / The problem

Turn unstructured commands into attacker intent.

An SSH honeypot is a decoy server that records the commands sent by attackers. Reading 230,000 sessions by hand is unrealistic, so we tested whether machine learning could summarize the attackers' goals.

We used seven categories based on MITRE ATT&CK, including system discovery, persistence, execution, and impact. One session can contain several goals, so a model may need to assign more than one label.
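As a minimal sketch of what a multi-label target can look like, the snippet below encodes each session's set of intents as a seven-element 0/1 vector. Discovery, Persistence, Execution, Impact, and Harmless are named in this write-up; the remaining two names and the example sessions are placeholders invented for illustration.

```python
# Five names come from the write-up; "Defense Evasion" and "Other" are placeholders.
INTENTS = ["Discovery", "Persistence", "Execution", "Impact",
           "Harmless", "Defense Evasion", "Other"]


def to_multi_hot(labels, classes=INTENTS):
    """Return a 0/1 vector with one position per intent."""
    unknown = set(labels) - set(classes)
    if unknown:
        raise ValueError(f"unknown intents: {sorted(unknown)}")
    return [1 if name in labels else 0 for name in classes]


# Hypothetical sessions: the second one carries two intents at once.
sessions = {
    "session-a": {"Discovery"},
    "session-b": {"Discovery", "Persistence"},
}
targets = {sid: to_multi_hot(labels) for sid, labels in sessions.items()}
for sid, vector in targets.items():
    print(sid, vector)
```

Each column can then be treated as its own yes/no decision, which is what allows a single session to receive more than one label.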

02 / Data

Prepare the attack data before training a model.

We decoded 90,026 hidden shell scripts, normalized timestamps, split commands into tokens, and removed noisy variables and symbols. A TF-IDF text representation then reduced roughly 300,000 candidate terms to 90 useful features.

System discovery and persistence dominated the dataset, while the Impact category appeared only 27 times. This imbalance became the main weakness of every model: common behavior was easy to learn, but rare behavior was not.
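To put the imbalance in perspective, 27 occurrences in roughly 230,000 sessions is about 0.01% of the data. A short sketch of how per-intent support could be tallied before training is shown below; the function name and the toy label sets are illustrative and do not come from the study.

```python
from collections import Counter


def label_support(label_sets):
    """Count how many sessions carry each intent.

    A session with several intents counts once toward each of them.
    """
    counts = Counter()
    for labels in label_sets:
        counts.update(set(labels))
    return counts


# Toy label sets, not the study's data.
toy = [
    {"Discovery"},
    {"Discovery", "Persistence"},
    {"Discovery"},
    {"Persistence"},
    {"Discovery", "Impact"},
]
support = label_support(toy)
total = len(toy)
for intent, n in support.most_common():
    print(f"{intent:<12} {n:>3}  {n / total:.1%} of sessions")
```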

Raw sessions → Decode → Tokenize → Clean → TF-IDF → Multi-label targets

03 / Supervised

High scores can hide weak categories.

We first tested Random Forest, SVM, and Logistic Regression. Random Forest and SVM performed very well on the common attack categories, and additional parameter tuning produced only a small improvement.

Random Forest · baseline: 0.996 F1

Weighted precision 0.999 and recall 0.994 on the test set.

SVM · tuned: 0.9966 F1

RBF kernel, C=100, gamma=scale; approximately 1.48% above baseline.

Neither result erased the imbalance problem. Impact remained difficult because its support was vanishingly small. Weighted averages describe performance on the dominant classes well, but they should not be mistaken for uniform reliability across every intent.
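The gap between a weighted average and per-class reliability can be shown with a small worked example. The sketch below computes per-class F1 from multi-hot rows, then compares the support-weighted mean with the unweighted (macro) mean. The data is a toy construction in which a model never predicts the rare class; none of the numbers are results from the study.

```python
def f1_per_class(y_true, y_pred, n_classes):
    """Return a list of (f1, support) pairs, one per class."""
    results = []
    for k in range(n_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t[k] and p[k])
        fp = sum(1 for t, p in pairs if not t[k] and p[k])
        fn = sum(1 for t, p in pairs if t[k] and not p[k])
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        results.append((f1, tp + fn))
    return results


def macro_f1(per_class):
    """Unweighted mean: every class counts the same."""
    return sum(f1 for f1, _ in per_class) / len(per_class)


def weighted_f1(per_class):
    """Mean weighted by support: common classes dominate."""
    total = sum(support for _, support in per_class)
    return sum(f1 * support for f1, support in per_class) / total


# Toy data: class 0 is present in all 100 sessions, class 1 in only 2.
y_true = [[1, 0]] * 98 + [[1, 1]] * 2
y_pred = [[1, 0]] * 100          # the rare class is never predicted

per_class = f1_per_class(y_true, y_pred, n_classes=2)
print("per class:", per_class)
print("weighted :", round(weighted_f1(per_class), 4))
print("macro    :", round(macro_f1(per_class), 4))
```

Here the weighted score stays above 0.98 even though the rare class is missed entirely, which is why per-class figures are worth reporting alongside weighted ones.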

04 / Unsupervised

A good metric did not guarantee useful groups.

We also asked unsupervised models to group similar sessions without using labels. Their mathematical scores looked strong, but visual inspection showed one dominant group and heavy overlap between attacker behaviors.

Some communities were interpretable: BusyBox and mount commands suggested IoT-oriented activity, while chmod, wget, and SSH indicated file modification and persistence. Even so, the overall segmentation was not clean enough to treat clusters as distinct tactics.

Metric literacy

The most valuable result was the disagreement between a high silhouette score and weak semantic separation. Validation metrics, visual structure, and domain interpretation all had to be considered together.
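A toy illustration of that disagreement is sketched below. It implements the standard silhouette coefficient directly and compares it with cluster purity against known labels. Purity is a stand-in chosen here for "semantic separation"; the study itself relied on visual inspection, and the points and labels are invented.

```python
import math
from collections import Counter


def silhouette(points, clusters):
    """Mean silhouette coefficient with Euclidean distance."""
    ids = sorted(set(clusters))
    scores = []
    for i, p in enumerate(points):
        dists = {c: [] for c in ids}
        for j, q in enumerate(points):
            if i != j:
                dists[clusters[j]].append(math.dist(p, q))
        own = dists[clusters[i]]
        if not own:                      # singleton cluster scores 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)          # mean distance inside own cluster
        b = min(sum(d) / len(d)          # mean distance to nearest other cluster
                for c, d in dists.items() if c != clusters[i] and d)
        scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)


def purity(clusters, labels):
    """Share of points whose label is the majority label of their cluster."""
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(labels)


# One dominant, tight cluster that mixes two intents, plus a small distant one.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8),
          (10, 10), (10, 11)]
clusters = [0, 0, 0, 0, 0, 0, 1, 1]
labels = ["Discovery", "Persistence", "Discovery", "Persistence",
          "Discovery", "Persistence", "Discovery", "Discovery"]

print("silhouette:", round(silhouette(points, clusters), 3))
print("purity    :", purity(clusters, labels))
```

The clusters are geometrically well separated, so the silhouette is high, yet the dominant cluster is an even mix of two intents. Geometry and meaning are measured by different things.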

05 / Language model

A language model added context, but not missing data.

Finally, we adapted BERT, a pretrained language model, to assign several attack labels to one session. We kept its existing language knowledge fixed and trained a small classification layer on top.
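The idea of a frozen encoder with a small trainable head can be sketched in PyTorch as below. This is an assumption-laden illustration rather than the project's model: `ToyEncoder` is a stand-in that averages token embeddings and is not BERT, the class names and sizes are invented, and the write-up does not say which checkpoint or loading library was used.

```python
import torch
from torch import nn


class FrozenEncoderClassifier(nn.Module):
    """Frozen encoder plus a trainable linear head, one logit per intent."""

    def __init__(self, encoder, hidden_size, num_labels=7):
        super().__init__()
        self.encoder = encoder
        for param in self.encoder.parameters():
            param.requires_grad = False       # keep pretrained weights fixed
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, inputs):
        with torch.no_grad():
            pooled = self.encoder(inputs)     # assumed shape: (batch, hidden_size)
        return self.head(pooled)


class ToyEncoder(nn.Module):
    """Stand-in for a pretrained model: mean of token embeddings."""

    def __init__(self, vocab_size=100, hidden_size=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, token_ids):
        return self.embed(token_ids).mean(dim=1)


torch.manual_seed(0)
model = FrozenEncoderClassifier(ToyEncoder(), hidden_size=16)
optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()              # independent yes/no per label

token_ids = torch.randint(0, 100, (4, 12))    # 4 fake sessions, 12 tokens each
targets = torch.randint(0, 2, (4, 7)).float()

logits = model(token_ids)
loss = loss_fn(logits, targets)
loss.backward()                               # gradients reach only the head
optimizer.step()

predicted = torch.sigmoid(logits) > 0.5       # several labels can be true at once
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

Using a sigmoid per label, instead of a softmax over all labels, is what lets one session receive several intents. With a real pretrained encoder it is also common to put the encoder in evaluation mode so that dropout stays off.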

BERT · weighted test metrics: 0.9937 F1

Precision 0.9974, recall 0.9927, and ROC-AUC 0.9971.

Weakest class: 0.78 AUC

Harmless activity remained harder to distinguish because of limited examples and imbalance.

The experiment showed the value of semantic representations for command sequences, but it also reinforced a broader lesson: a sophisticated model does not compensate for missing class support.

06 / Takeaway

The comparison mattered more than the best score.

Classical supervised models were already highly effective on well-represented intents. Unsupervised methods surfaced broad command communities but struggled to produce semantically distinct attack categories. BERT captured multi-label semantics effectively, while its weakest class exposed the same imbalance found across the study.

The practical next steps are not simply “use a larger model”: introduce domain-aware features and network context, rebalance rare intents, test on broader sources, and evaluate whether results generalize beyond the original June 2019–March 2020 honeypot window.
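One common way to rebalance rare intents in a multi-label setting is to up-weight positive examples by the ratio of negatives to positives for each label. The sketch below computes that ratio. The function name and counts are hypothetical, and this is a possible starting point, not something the study reports having done.

```python
def positive_weights(label_counts, n_sessions):
    """Return negatives / positives for each label.

    Rare labels get large weights; labels present in half of the
    sessions get a weight of 1.0.
    """
    weights = {}
    for label, positives in label_counts.items():
        if positives == 0:
            raise ValueError(f"no positive examples for {label!r}")
        weights[label] = (n_sessions - positives) / positives
    return weights


# Toy counts, not the study's label distribution.
counts = {"Discovery": 800, "Persistence": 500, "Impact": 2}
weights = positive_weights(counts, n_sessions=1000)
for label, w in weights.items():
    print(f"{label:<12} {w:g}")
```

Ratios like these can be passed as the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`. Very large weights on a handful of examples can overfit, so collecting more examples of the rare intent remains the more reliable fix.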

