A Malicious JavaScript Code Detection using Deep Learning Approach on Imbalanced Dataset
DOI:
https://doi.org/10.51153/kjcis.v8i1.245
Keywords:
Class imbalance, GPT-2, Deep Learning
Abstract
A web application is a software program that runs in a web browser. Classifying JavaScript-based attacks has become increasingly important, driven primarily by the substantial growth in the number of Internet users. JavaScript is a crucial component of web development, and it also serves as a major instrument for attackers to launch attacks. Identifying malicious JavaScript within web pages raises the detection rate while lowering the rates of false negatives and false positives. Most existing work validates its approach on balanced datasets, so the reported results cannot be generalized to real-world scenarios. In this paper, we employ GPT-2 as a data augmentation technique to address the class imbalance problem. The resulting balanced dataset is then vectorized with the Doc2Vec algorithm, and the resulting features are fed into an attention-based Bi-LSTM model. The results demonstrate that the proposed model achieves superior performance in terms of recall and F1-score, with values of 0.74% and 0.50% respectively.
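To illustrate why recall and F1-score, rather than accuracy, are the metrics reported for an imbalanced dataset, the following is a minimal sketch and not the paper's evaluation code. The function name `evaluate` and all confusion-matrix counts are hypothetical, chosen only to show how a classifier that never flags malicious code can still look accurate.

```python
def evaluate(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    The malicious class is treated as the positive class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical test set (assumption): 1000 benign and 10 malicious scripts.
# A classifier that labels every script as benign detects nothing.
always_benign = evaluate(tp=0, fp=0, fn=10, tn=1000)

# A classifier that catches 7 of the 10 malicious scripts,
# at the cost of 20 false alarms on benign scripts.
detector = evaluate(tp=7, fp=20, fn=3, tn=980)

for name, scores in [("always_benign", always_benign), ("detector", detector)]:
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

In this made-up setting the always-benign classifier has the higher accuracy but a recall and F1-score of zero, while the second classifier has lower accuracy but non-zero recall and F1-score. That gap is the reason minority-class metrics are the more informative ones when malicious samples are rare.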