Abstract:
Micro-blogging platforms have become vital communication channels on the internet. Individuals use them to keep in touch with friends and family, whereas corporate users rely on them to introduce new products and services to their clients. Spammers also exploit the global reach of micro-blogs to spread irrelevant and offensive content such as malware and pornography. By polluting these platforms with unrelated messages, spammers waste resources and users' time and annoy legitimate users. Identifying an irrelevant message on such platforms is a challenging task. A user who sends legitimate messages most of the time and only occasionally sends junk replies cannot be declared a spammer. Similarly, public messages such as advertisements can be considered irrelevant by one reader but relevant by another, because readers' personal interests differ. These messages contain named entities, URLs, events, facts and figures, and the named entities are related to one another in different ways. With current state-of-the-art semantic information extraction and analysis techniques, it has become possible to extract these named entities and the relationships among them. In this research we have implemented an algorithm to detect irrelevant messages on Twitter, one of the most popular micro-blogging platforms. Our algorithm uses semantic information extraction and analysis techniques to compute the relevance among the different parts of a message and compares it with a user-set threshold. Messages whose components are highly similar to one another are most likely relevant, and messages whose components show low similarity are most likely irrelevant. We have validated the algorithm on a dataset collected from Twitter. It achieved a precision of up to 97%, with recall of up to 100% and F-Measure of up to 97%.
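
The thresholding idea described above can be illustrated with a minimal sketch. The abstract does not specify how components are extracted or how their relevance is measured, so everything here is an assumption made for illustration: each message is taken to be already split into components represented as term sets, Jaccard overlap stands in for the semantic relatedness measure, and the function names, sample messages and the threshold value 0.3 are hypothetical.

```python
from itertools import combinations
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Term overlap of two sets; a crude stand-in for semantic relatedness."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def relevance_score(components: List[Set[str]]) -> float:
    """Mean pairwise similarity among the components of one message."""
    pairs = list(combinations(components, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def is_relevant(components: List[Set[str]], threshold: float) -> bool:
    """Treat a message as relevant when its score reaches the user-set threshold."""
    return relevance_score(components) >= threshold


# Hypothetical messages, each split into (text terms, hashtag terms, link terms).
coherent = [
    {"phone", "launch", "camera"},
    {"phone", "launch"},
    {"phone", "camera", "review"},
]
unrelated = [
    {"phone", "launch", "camera"},
    {"diet", "pills"},
    {"casino", "bonus"},
]

print(is_relevant(coherent, 0.3))   # True
print(is_relevant(unrelated, 0.3))  # False
```

Raising the threshold makes the filter stricter: more messages are flagged as irrelevant, which in general trades recall on relevant messages for fewer irrelevant messages slipping through.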