2nd International Conference on NLP, Data Mining and Machine Learning (NLDML 2023)

Accepted Papers

Size and Fit Recommendations for Cold Start Customers in Fashion E-commerce

Jaidam Ram Tej, Jyotirmoy Banerjee and Narendra Varma Dasararaju, Flipkart Internet Private Limited, Bangalore, India

ABSTRACT

Fashion e-commerce is expected to grow rapidly over the next few years. One of its main hurdles is recommending the right size to customers, which gives them a better online shopping experience. Size and fit recommendation is therefore an important problem: it improves a customer's confidence in making a purchase on an e-commerce platform and also reduces returns. In this work we propose a novel Bayesian probabilistic approach for non-personalised product size recommendation. We use maximum likelihood estimation to estimate the parameters of our model, and we use customer purchase and returns history to infer the true product size. Given a product, we provide a size recommendation to the customer, i.e. we suggest buying a size smaller, a size larger, or the same size. In experiments with Flipkart shoes datasets our model leads to an improvement of 3-4% AUC over the existing baseline. In online A/B testing for Flipkart shoes categories our approach improves returns by 12-24 bps.

KEYWORDS

Fashion E-commerce, Recommendation, Size and Fit, Maximum Likelihood, A/B Testing, Catalog.

Mapping and Tracking Sentiment Arcs in Social Media Streams

Maryam ElOraby and Mervat Abu-ElKheir, Faculty of Media Engineering and Technology, German University in Cairo, Cairo, Egypt

ABSTRACT

Social media has emerged as an effective platform for investigating how people express their opinions and sentiments towards different issues through text. BERT is a prominent deep neural architecture for language modeling that can be leveraged to obtain rich insights into people's opinions and sentiments. In this paper, we fine-tune BERT for sentiment classification and apply the resulting model to tweets concerning the COVID-19 pandemic. We use sentiment arcs to track how sentiment about COVID-19 changes over time. To confirm that a sentiment arc validly represents the trend of sentiment over time, we compare the COVID-19 sentiment arcs to sentiment arcs we compute for the vaccination campaign and to the reported death rates. Furthermore, we use the GSDMM topic modeling algorithm to explore and map other events that might have influenced the resulting COVID-19 sentiment arc. To validate the approach's generalizability to other domains, we apply the model to tweets concerning Elon Musk's Twitter buyout deal.

KEYWORDS

Sentiment Analysis, Opinion Mining, Topic Modeling, Machine Learning, Social Media.

Automated Identification of Disaster News for Crisis Management using Machine Learning

Lord Christian Carl H. Regacho and Ai Matsushita, Department of Computer, Information Sciences and Mathematics, University of San Carlos, Cebu City, Philippines, 6000

ABSTRACT

Many news sources, along with fake news outlets, picked up on Typhoon Rai (known locally as Typhoon Odette). This study homes in on that issue and creates a model that can distinguish between legitimate and illegitimate news articles. With this in mind, we chose the following machine learning algorithms for our development: Logistic Regression, Random Forest and Multinomial Naive Bayes. Bag of Words (BOW), TF-IDF and lemmatization were implemented in the model. The models were trained and tested on 160 datasets gathered from legitimate and illegitimate sources. By combining all the machine learning techniques, the Combined BOW model reached an accuracy of 91.07%, precision of 88.33%, recall of 94.64%, and F1 score of 91.38%, and the Combined TF-IDF model reached an accuracy of 91.18%, precision of 86.89%, recall of 94.64%, and F1 score of 90.60%.
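
As a quick consistency check on the figures above, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall values; the function name `f1_score` is chosen here for illustration and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, converted to fractions.
bow_f1 = f1_score(0.8833, 0.9464)
tfidf_f1 = f1_score(0.8689, 0.9464)

print(f"Combined BOW F1:    {bow_f1:.2%}")
print(f"Combined TF-IDF F1: {tfidf_f1:.2%}")
```

Rounded to two decimals, both values agree with the F1 scores reported in the abstract (91.38% and 90.60%).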

KEYWORDS

Machine Learning, Natural Language Processing, Disaster, Scrapy, Model.

Stock Market Prediction using Reinforcement Learning with Sentiment Analysis

Xuemei Li and Hua Ming, Department of Computer Science and Engineering, Oakland University, Rochester Hills, USA

ABSTRACT

This work creates a new Deep Q-learning model augmented with sentiment analysis and stock trend labelling (the DQS model). It incorporates a stock market trend label and a sentiment analysis score label as input features to improve model accuracy and performance. The first part of this work shows that machine learning models can predict stock price trends rather than only exact stock prices. It studies the performance difference between neural networks and other machine learning algorithms for stock price trend prediction, and shows that neural networks can accurately predict stock trends when price data are pre-processed and transformed into categorical data. Subsequently, this work uses the Valence Aware Dictionary for Sentiment Reasoning (VADER) to predict the sentiment score of news titles. A correlation study shows a strong correlation between stock price and daily market sentiment. Lastly, a new neural network customized for this application is used in the DQS model to map state features to actions for trading decisions.
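
To make the idea of transforming price data into categories concrete, here is a minimal sketch that labels each day-over-day move as a trend class. The three classes, the `threshold` parameter, and the function name `label_trends` are assumptions for illustration; the abstract does not specify how the authors define their trend labels.

```python
from typing import List


def label_trends(prices: List[float], threshold: float = 0.01) -> List[str]:
    """Label each day-over-day move as 'up', 'down', or 'flat'.

    `threshold` is a hypothetical cutoff on the relative price change;
    it is not a value taken from the paper.
    """
    labels = []
    for prev, curr in zip(prices, prices[1:]):
        change = (curr - prev) / prev  # relative change versus the previous day
        if change > threshold:
            labels.append("up")
        elif change < -threshold:
            labels.append("down")
        else:
            labels.append("flat")
    return labels


# Made-up closing prices, used only to exercise the function.
closing_prices = [100.0, 102.0, 101.5, 99.0, 99.2]
print(label_trends(closing_prices))
```

Labels of this kind could then serve as classification targets or as an extra input feature alongside a sentiment score, which is the role the abstract describes for the trend label.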

KEYWORDS

Machine Learning, Deep Q-learning, Sentiment Analysis, Stock Market Prediction.

Proposal of Framework Based on Human Intelligence for Information Collection in Online Environment

Alcides Macedo, Laerte Peotta and Flavio Gomes, Department of Electric Engineering, University of Brasilia, Brasilia

ABSTRACT

The traditional and doctrinal concepts of cybersecurity involve physical, logical and social layers, with the prevailing consensus that a system will be secure if these three layers have appropriate levels of compliance. However, the difficulty of conceptualizing "security" for human and social interfaces has compelled industry and academic research to treat cybersecurity with a focus on the physical and logical layers. Within the social layer, threats take shape in the interactions between information and technology. This shifts the center of the discussion to the information society, where human relations depend on the use of social networks and the volume of available information keeps growing. This research aims to broaden the underlying debate about malicious user behavior by proposing a framework that classifies this type of user according to predefined profiles. The objective is to develop a framework based on OSINT (Open Source Intelligence) and HUMINT (Human Intelligence) concepts that can help cybersecurity professionals select potential collaborators using objective criteria that make it possible to measure the reliability of the information obtained. In addition, the research aims to evaluate how an approach based on HUMINT techniques, applied in a virtual environment, can enhance cybersecurity.

KEYWORDS

Open Source Intelligence, Human Intelligence, Human Sources, Information Gathering, Social Engineering.

Comparative Study of Vehicle Detection With Different YOLOV5 Algorithms

Md. Milon Rana, Md. Dulal Haque and Md. Mahabub Hossain, Department of Electronics and Communication Engineering, Hajee Mohammad Danesh Science and Technology University, Dinajpur-5200, Bangladesh

ABSTRACT

Vehicle detection is essential for implementing artificial intelligence assisted monitoring and driving systems. In recent years, the number of vehicles on the road has increased rapidly, making traffic systems difficult to manage. For this purpose, computer vision technology has been widely adopted in the field of traffic surveillance. Because vehicles come in different sizes, their detection remains a challenge that directly affects detection accuracy. To deal with this issue, this paper proposes a vision-based vehicle detection system. In this research, a multi-object real-time vehicle detection system based on the You Only Look Once (YOLOv5) algorithm has been designed. Vehicles are classified into six categories: bicycles, motorcycles, cars, buses, ambulances and trucks. In this study, the YOLOv5n (nano), YOLOv5s (small), YOLOv5m (medium), YOLOv5l (large) and YOLOv5x (the largest of the five) models have been employed to analyze the accuracy of vehicle detection. The experimental results verify that the YOLOv5x model provides higher detection accuracy than the other YOLOv5 models, especially for small vehicle objects. The main accuracy indicators are precision, recall, and mAP (0.5), and the losses of all models have been calculated. The measured accuracy on our dataset is 62.4%, 64.2%, 62.9%, 68.7% and 69.7% for YOLOv5s, YOLOv5m, YOLOv5n, YOLOv5l and YOLOv5x, respectively. The analysis indicates that YOLOv5x is the most effective for vehicle detection and can be implemented for real-time traffic control in transportation systems.
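
The mAP (0.5) indicator mentioned above counts a predicted box as a correct detection only when its intersection over union (IoU) with a ground-truth box is at least 0.5. The following is a minimal sketch of that IoU test, assuming boxes in `(x_min, y_min, x_max, y_max)` form; the `Box` alias, the function name, and the sample boxes are illustrative and not taken from the paper.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap, clamped at zero when boxes do not intersect.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0.0, 0.0, 10.0, 10.0)
close_prediction = (2.0, 0.0, 12.0, 10.0)  # overlaps most of the ground truth
loose_prediction = (5.0, 0.0, 15.0, 10.0)  # overlaps only half of it

for name, box in [("close", close_prediction), ("loose", loose_prediction)]:
    score = iou(ground_truth, box)
    print(f"{name}: IoU={score:.3f}, counts as a match at 0.5: {score >= 0.5}")
```

Here the close prediction has an IoU of 80/120 and passes the 0.5 threshold, while the loose prediction has an IoU of 50/150 and does not.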

KEYWORDS

Vehicle detection, OpenCV, YOLOv5, Accuracy, Loss, Computer Vision, CNN.

Authorship Attribution for Assamese Language Documents: Initial Results

Smriti Priya Medhi¹ and Shikhar Kumar Sarma², ¹Department of Computer Science and Engineering, Assam Don Bosco University, Guwahati and ²Department of Information Technology, Gauhati University, Guwahati

ABSTRACT

The impact of Digital India on the creation of electronic content on the web is widely acknowledged. On critical observation, however, we also recognize problems, especially in identifying the creator of a piece of content. There can also be issues of false annotation or even plagiarism. Every person has a unique style of writing. This characteristic can be used to train a system to identify the true author of a piece of content, a field popularly known as authorship attribution. This paper presents the initial experimental results of author identification on a manually collected and annotated Assamese literary corpus. Assamese being a low-resourced language, NLP applications such as authorship identification have not been explored to date. This work can therefore be regarded as a first attempt to open up the research scope of authorship attribution in the Assamese language.

KEYWORDS

NLP, Authorship Attribution, Stylometry, Linguistics, Plagiarism.

Character-Level Generative Network for Indonesian Non-Word Error Correction

Shengyi Jiang and Jinyi Chen, School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, Guangdong, China

ABSTRACT

Spelling error correction (SEC) is an important but challenging task that requires human-level language understanding ability. As for the Indonesian language, although it is the official language of Indonesia and spoken by almost 200 million people, it is under-represented in SEC research. Moreover, with the rapid development of language pre-training technology, most state-of-the-art research is based on BERT and treats SEC as a classification task. This may ignore the correction of out-of-vocabulary (OOV) words. To that end, in this paper, we explore how to correct Indonesian non-word errors at the character level. Specifically, we manually annotate an Indonesian non-word dataset to tackle the data-hungry problem. Furthermore, we propose a character-level generative error correction model (CGEC). CGEC uses a pre-trained language model (PLM) to obtain the context information of the to-be-corrected word and a character-level generation strategy to correct the word. We demonstrate the effectiveness of our model with extensive experiments.

KEYWORDS

Non-word Error Correction, Indonesian, Character Level, Dataset.

Using Artificial Intelligence to Filter Out Barking, Typing, And Other Noise From Video Calls in Microsoft Teams

Pawankumar Sharma and Bibhu Dash, Department of Computer and Information Systems, University of the Cumberlands, KY, USA

ABSTRACT

The usual method for analyzing technology is to formulate many search queries to extract patent datasets and then filter the data manually. The purpose of filtering the collected data is to remove noise and so guarantee accurate information analysis. With advances in technology and machine learning, the manual analysis of patents can be automated so that the system removes noise based on results from previous data. Microsoft Teams has generated a new artificial intelligence model that provides solutions based on how individuals respond to speakers. Microsoft Teams, Workplace, Facebook, and Google collected data from many active users and thereby developed artificial intelligence to minimize distracting background noise, such as barking and typing, during calls.

KEYWORDS

Artificial intelligence, Microsoft Teams, speech identification, video call, video signal data, machine learning, data sets.

Summeet - ML Based Minutes of Meeting Generation Tool

Yash Pratapwar, Abhinav Gajakosh, Mithun Kuthully, Mikhael Uzagare and Chetana Badgujar, Dept. of Information Technology, Fr. C. Rodrigues Institute of Technology, Navi Mumbai, India

ABSTRACT

Given the fast pace of meetings and seminars and the variety of their content, it is neither feasible nor efficient to summarize an entire meeting manually in a short time while covering every important point for the minutes of the meeting. Moreover, a person who is unable to attend will miss all the important things that happened in the meeting. To tackle this problem we propose Summeet, a solution based on artificial intelligence and natural language processing. Summeet is an intelligent tool that generates minutes of a meeting covering all the important points discussed and provides them in a document that can be easily referred to and shared. Summeet extracts the speakers' speech from a meeting recording and transcribes it into a complete meeting text. This text is then summarized into minutes of the meeting by extracting meaningful key phrases and abstract sentences from the complete meeting text.
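
The summarization step described above can be illustrated with a simple frequency-based extractive summarizer. This is a stand-in sketch, not the authors' method: the scoring rule, the `STOPWORDS` set, the function name `summarize`, and the sample transcript are all assumptions made for illustration, and the speech-to-text stage is assumed to have already produced the transcript.

```python
import re
from collections import Counter
from typing import List

# A tiny, hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "we",
             "will", "for", "on", "by", "it", "that", "was"}


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(transcript: str, max_sentences: int = 2) -> List[str]:
    """Return the highest-scoring sentences, kept in their original order."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", transcript) if s.strip()]
    freq = Counter(tokenize(transcript))

    def score(sentence: str) -> float:
        tokens = tokenize(sentence)
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore transcript order
    return [sentences[i] for i in keep]


# Made-up transcript, used only to exercise the function.
transcript = (
    "The budget review was moved to Friday. "
    "Thanks everyone for joining. "
    "The budget for the release needs approval by Friday. "
    "The release date depends on the budget approval."
)
for line in summarize(transcript):
    print("-", line)
```

Selected sentences are returned in transcript order so that the resulting minutes read chronologically.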

KEYWORDS

Speech to text, text summarizing, PDF generation.
