Real or not? Predict which tweets are about real disasters and which ones are not.
This repository contains my solution for Kaggle's NLP disaster tweets classification competition. It includes the several solutions I've come up with as well as an exploratory data analysis notebook.
Embeddings used:
Twitter has become an important communication channel in times of emergency. The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.
But it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:
The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.
The goal is to build a machine learning model that predicts which tweets are about real disasters and which ones aren’t. The dataset is composed of 10,000 tweets that were hand-classified.
Ekphrasis offers a quick and convenient way to preprocess the tweets. Coupled with some additional regex work, it made it possible to get a satisfactory cleaned dataset.
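To give an idea of what the additional regex work can look like, here is a minimal sketch using only Python's standard `re` module. It does not use Ekphrasis and is not the exact cleaning code from this repository: the patterns, the `<url>` and `<user>` placeholder tokens, and the `clean_tweet` name are all assumptions chosen for illustration.

```python
import re

# Hypothetical patterns for illustration; the repository's own regexes may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Normalise a raw tweet into a simpler string for a downstream model."""
    text = URL_RE.sub(" <url> ", text)        # replace links with a placeholder token
    text = MENTION_RE.sub(" <user> ", text)   # replace @mentions with a placeholder token
    text = HASHTAG_RE.sub(r" \1 ", text)      # keep the hashtag word, drop the '#'
    text = WHITESPACE_RE.sub(" ", text)       # collapse runs of whitespace
    return text.strip().lower()


if __name__ == "__main__":
    print(clean_tweet("@bob Forest #fire near La Ronge http://t.co/abc"))
```

Replacing URLs and mentions with placeholder tokens, rather than deleting them, keeps the information that a link or a user was present, which may itself be a useful signal for the classifier.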
Among these solutions, BERT gives very good scores. It managed to provide an embedding for each tweet and to separate the ones that deal with real disasters from the ones that do not. The images below show the separation of these tweets. The point cloud is obtained by applying PCA to the input of the last network layer.
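The projection step can be sketched as follows. This is an illustrative example under stated assumptions, not the code used to produce the images: PCA is computed with a plain NumPy SVD, the `pca_project` helper is a hypothetical name, and random vectors of size 768 stand in for the real features fed to the last layer.

```python
import numpy as np


def pca_project(features: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project row vectors onto their top principal components."""
    # PCA works on mean-centred data.
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Stand-in data (assumption): random vectors in place of real BERT features.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 768))

points = pca_project(features)  # one 2D point per tweet, ready for a scatter plot
print(points.shape)
```

With real features, colouring each 2D point by its label (disaster or not) gives the kind of point cloud described above.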