As you learnt in the previous section, NLP has a wide array of applications: it finds use in many fields, such as social media, banking, insurance and many more.
However, one question still remains. The data you get while performing analytics on text will, very often, be just a sequence of words, something like the text shown in the image below:
Now, think about it: if the data you get is of this form, and your task is to create an algorithm that translates this paragraph into a different language, say, Hindi, then how exactly will you do it?
To do so, your system should be able to break the raw, unprocessed data down into smaller sequential problems (a pipeline) and solve each of those problems individually. The individual problems could range from something as simple as breaking the data into sentences, words, etc., to something as complex as understanding what a word means based on the words in its “neighbourhood”.
In this course on “Text Analytics”, you’ll learn about all the different “steps” generally undertaken on the journey from data to meaning. This journey can be divided roughly into three parts, which correspond to the three modules that you’ll study one by one in this course.
Now, let’s listen to the professor as he talks about the process of text analytics, and about how the modules are structured in this particular course.
Now that you have looked at the areas of text analytics, let’s take a look at what it means to understand text, i.e., how to approach a problem that deals with text.
Let’s go back to the Wikipedia example. Recall what the textual data looked like: it was simply a collection of characters that machines can’t make any sense of. Starting with this data, you will move through the following steps:
- Lexical processing: First, you will just convert the raw text into words and, depending on your application’s needs, into sentences or paragraphs as well.
- For example, if an email contains words such as lottery, prize and luck, then the email is represented by these words, and it is likely to be a spam email.
- Hence, in general, the group of words contained in a sentence gives us a pretty good idea of what that sentence means. Many more processing steps are usually undertaken to make this group more representative of the sentence. For example, cat and cats are considered to be the same word; in general, we can consider all plural words to be equivalent to their singular form.
- For a simple application like spam detection, lexical processing works just fine, but it is usually not enough in more complex applications such as machine translation. For example, the sentences “My cat ate its third meal” and “My third cat ate its meal” have very different meanings. However, lexical processing will treat the two sentences as equal, as the “group of words” in both sentences is the same. Hence, we clearly need a more advanced system of analysis.
- Semantic processing: Lexical and syntactic processing don’t suffice when it comes to building advanced NLP applications such as language translation, chatbots, etc. After these two kinds of processing, the machine will still be incapable of actually understanding the meaning of the text. This can be a problem for, say, a question answering system, as it may be unable to understand that PM and Prime Minister mean the same thing. Hence, when somebody asks it the question, “Who is the PM of India?”, it may not even be able to give an answer unless it has a separate database for PMs. You could store the answer separately for both variants (PM and Prime Minister), but how many such variants are you going to store manually? At some point, your machine should be able to identify synonyms, antonyms, etc. on its own.
- This is typically done by inferring a word’s meaning from the collection of words that usually occur around it. So, if the words PM and Prime Minister occur very frequently around similar words, then you can assume that the meanings of the two words are similar as well.
- In fact, this way, the machine should also be able to understand other semantic relations. For example, it should be able to understand that the words “King” and “Queen” are related to each other and that the word “Queen” is simply the female version of the word “King”. Also, both of these words can be clubbed under the word “Monarch”. You can probably save these relations manually, but it will help you a lot more if you can train your machine to look for the relations on its own and learn them. Exactly how that training can be done is something we’ll explore in the third module.
- Once you have the meaning of the words, obtained via semantic analysis, you can use it for a variety of applications. Machine translation, chatbots and many other applications require a complete understanding of the text, right from the lexical level to the understanding of syntax to that of meaning. Hence, in most of these applications, lexical and semantic processing simply form the “pre-processing” layer of the overall process. In some simpler applications, lexical processing alone is enough as the pre-processing part.
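Two of the ideas in the steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the method taught in the later modules. The helper names (`tokenize`, `bag_of_words`, `context_counts`, `cosine`) and the tiny corpus are made up for this example, and “Prime Minister” is assumed to have been pre-joined into the single token `prime_minister`. The first part shows why a pure “group of words” view cannot tell the two cat sentences apart; the second shows how words that occur around similar neighbours end up with similar context counts.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and underscores)."""
    return re.findall(r"[a-z_]+", text.lower())

def bag_of_words(text):
    """Represent a text by its word counts, ignoring word order."""
    return Counter(tokenize(text))

# Lexical view: both sentences reduce to the same group of words.
s1 = "My cat ate its third meal"
s2 = "My third cat ate its meal"
same_bag = bag_of_words(s1) == bag_of_words(s2)
print("Same bag of words:", same_bag)

def context_counts(sentences, target, window=2):
    """Count the words that occur within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy corpus (invented for illustration only).
corpus = [
    "the pm of india addressed the nation",
    "the prime_minister of india addressed the nation",
    "the pm met the president today",
    "the prime_minister met the president today",
    "my cat ate its third meal",
]

pm = context_counts(corpus, "pm")
prime_minister = context_counts(corpus, "prime_minister")
cat = context_counts(corpus, "cat")

sim_pm = cosine(pm, prime_minister)
sim_cat = cosine(pm, cat)
print("pm vs prime_minister:", sim_pm)
print("pm vs cat:", sim_cat)
```

In this toy corpus, `pm` and `prime_minister` appear next to exactly the same words, so their context vectors point in the same direction, while `pm` and `cat` share no neighbours at all. Real text is far noisier, so in practice such counts are gathered over a much larger corpus.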
This gives you a basic idea of the process of analysing text and understanding the meaning behind it. Now, in the next segment, you’ll learn how text is stored on machines.