Bidirectional Encoder Representations from Transformers (BERT)

We humans easily understand the meaning of a word such as "bank" in the different contexts in which it is used (for example, "bank account" versus "bank of the river"). But have you ever wondered how machines understand this? Beyond this one word, there are many others whose meaning changes with context. This kind of understanding can be achieved with an architecture called BERT, which we will discuss in detail in this article. BERT also has a wide range of applications, from similarity retrieval, text summarization, and question answering (as seen in the Google search engine) to response selection in Gmail, and many more.
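To make the idea concrete, here is a minimal sketch of how BERT assigns different vectors to the same word in different contexts. It assumes the Hugging Face transformers and torch packages are installed; the model name bert-base-uncased is just one commonly used checkpoint, and the two sentences are illustrative.

```python
# Minimal sketch: contextual embeddings for the word "bank" in two sentences.
# Assumes the Hugging Face `transformers` and `torch` packages are installed.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_embedding(sentence):
    # Tokenize and run the sentence through BERT.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Locate the token "bank" and return its contextual hidden state.
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    idx = tokens.index("bank")
    return outputs.last_hidden_state[0, idx]

emb_money = bank_embedding("I deposited cash at the bank.")
emb_river = bank_embedding("We sat on the bank of the river.")

# The two vectors differ because BERT encodes the surrounding context.
similarity = torch.cosine_similarity(emb_money, emb_river, dim=0)
print(f"Cosine similarity between the two 'bank' embeddings: {similarity.item():.3f}")
```

A static word embedding (such as word2vec) would return the same vector for "bank" in both sentences, whereas the two BERT vectors above are noticeably different.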
BERT (Bidirectional Encoder Representations from Transformers) is a transformer model pre-trained on a large corpus of English data in a self-supervised fashion. This means it was pre-trained on raw text only, with no human labeling, which is why it can make use of large amounts of publicly available data.
BERT works in two steps (both are sketched in code below):
Pre-training: In the pre-training phase, BERT uses a large amount of unlabeled text to learn a language representation in a self-supervised fashion. The model is trained with a combination of Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) objectives on a large corpus.
Fine-tuning: In the fine-tuning phase, we reuse the weights learned during pre-training and train the model further in a supervised fashion on a small labeled text dataset for the downstream task.
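The following is a minimal sketch of both steps, again assuming the Hugging Face transformers and torch packages are available. The fill-mask pipeline queries an already pre-trained checkpoint to illustrate the MLM objective (rather than pre-training from scratch), and the classification example shows how a small task-specific head on top of the pre-trained weights is trained on labeled data; the sentence and label are toy inputs.

```python
# Minimal sketch of BERT's two steps, assuming the Hugging Face `transformers`
# and `torch` packages are installed (model name and inputs are illustrative).
import torch
from transformers import pipeline, BertTokenizer, BertForSequenceClassification

# 1) Pre-training objective (Masked Language Modelling): BERT learns to
#    predict a masked token from its surrounding context. Here we query an
#    already pre-trained model to see that objective in action.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("I withdrew money from the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))

# 2) Fine-tuning: reuse the pre-trained weights, add a small task-specific
#    classification head, and train on a labeled dataset (e.g. sentiment).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer(["This movie was great!"], return_tensors="pt")
labels = torch.tensor([1])  # toy label: 1 = positive
outputs = model(**inputs, labels=labels)
outputs.loss.backward()     # one supervised gradient step would follow
```

In practice, fine-tuning loops over a labeled dataset with an optimizer such as AdamW, but the single backward pass above shows how the pre-trained weights and the newly added head are updated together.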
