In-Draft
Problem: At a glance, all domains look similar. Malicious domains are obfuscated to look normal to humans, but are obvious when scrutinized closely. Rule-based detection doesn't scale because small changes to a URL let criminals bypass the controls.
Question: Can we leverage ML/AI to automatically detect malicious URLs on our behalf?
Potential Application of Tech: Email blockers, perimeter control for ingress and egress, data loss prevention (DLP), end-user browser plug-ins, hijacked partner connections, etc.
Previous Work: This is not a new topic or concept. My research turned up multiple earlier efforts that trained on 'malicious URL' data sets and reported varying but ultimately high success in detecting suspicious URLs.
Detection Malicious URL Using ML Models | Kaggle - This is very similar to my proposed research
Initial Data Sets:
Marco Ramilli malware data set: Malware Training Sets: A machine learning dataset for everyone – Marco Ramilli Web Corner
2019 UCSD dataset: Detecting Malicious URLs (ucsd.edu)
URLHaus.abuse.ch 90 day database: URLhaus | API (abuse.ch)
Malicious URL Dataset from Kaggle: https://www.kaggle.com/sid321axn/malicious-urls-dataset
Canadian Institute for Cybersecurity URL dataset: ISCX-URL2016
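Before any of the data sets above can be used to train a tabular model, each raw URL string has to be turned into numeric columns. The following is a minimal sketch of that step using only the Python standard library. The function names (`lexical_features`, `shannon_entropy`, `is_ip_literal`) and the particular features are illustrative assumptions, not a feature set taken from any of the listed sources.

```python
import ipaddress
import math
from collections import Counter
from urllib.parse import urlsplit


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_ip_literal(host: str) -> bool:
    """True when the host is a bare IPv4/IPv6 address rather than a name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url: str) -> dict:
    """Hypothetical feature set: simple counts derived from the URL text only."""
    # urlsplit only recognises the host when a scheme is present,
    # so add a placeholder scheme for bare entries such as "example.com/a".
    normalized = url if "://" in url else "http://" + url
    parts = urlsplit(normalized)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        "host_is_ip": int(is_ip_literal(host)),
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        "host_entropy": shannon_entropy(host),
    }


if __name__ == "__main__":
    for sample in ("http://192.168.0.1/a/b.exe", "example.com"):
        print(sample, lexical_features(sample))
```

Each dictionary can become one row of a training table, with the label column (malicious or benign) coming from whichever data set the URL was drawn from. Whether these particular features separate the classes well is something the experiments would need to confirm.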
Tools:
AutoGluon GitHub: GitHub - awslabs/autogluon: AutoGluon: AutoML for Text, Image, and Tabular Data
AutoGluon how-to blog: Machine learning with AutoGluon, an open source AutoML library | AWS Open Source Blog (amazon.com)
SHAP with AutoGluon: autogluon/SHAP with AutoGluon-Tabular Census income classification.ipynb at master · awslabs/autogluon · GitHub
DataBrew: AWS Glue DataBrew | Visual Data Preparation | Amazon Web Services
URLscan.io: URL and website scanner - urlscan.io
Resources:
Malpedia - Mozi (Malware Family) (fraunhofer.de)
Malicious URL Detection Based on Machine Learning