The Coming Revolution of Natural Language Processing
Over the last decade, most sectors have tried to cut their running costs by automating their processes, and have searched for new revenue streams by exploiting the ever-growing amount of data that each company collects on a daily basis.
R&D spending across most industries shows that machine learning is taking up more and more space in the economy, especially since the recent boom in neural networks and their successful implementation across industries. However, even though companies now exploit more and more of their data, most successful applications rely on numerical or image data. There is still a lot of progress to make in processing and extracting value from raw text, which is why Natural Language Processing (NLP) is one of the hottest current topics in ML, especially since the recent finding that transfer learning can be used in NLP much as it already is in computer vision.
Before diving into the opportunities that NLP could offer across industries in the near future, let’s first recap what it is and how it works.
First, what is NLP?
NLP is the ability of computers to understand human language in its naturally spoken or written form. More specifically, it is the science of analyzing and comprehending free, unstructured text, breaking down sentences and words to accomplish tasks such as sentiment analysis, relationship extraction, stemming, and text or sentence summarization, all through the use of ML.
However, as is often the case in machine learning, complex problems like the ones listed above cannot be solved with a single ML model and require building a pipeline.
But what is an NLP pipeline?
In ML, the term pipeline describes the process of breaking a complex problem into small sub-problems, each easily solvable by its own ML model, and then chaining those models together so that each one feeds into the next and the whole chain solves the initial problem. In the case of NLP, the usual structure of a pipeline is the following:
Step 1: Sentence segmentation
In this first step, we break the text apart into separate sentences. This can be as simple as splitting whenever we encounter a sentence-ending punctuation mark, or it can become quite complex when we need to process documents that aren’t formatted cleanly.
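To make the "simple" end of that spectrum concrete, here is a minimal sketch of punctuation-based splitting. The function name split_sentences is made up for this illustration, and real segmenters are considerably smarter than this.

```python
import re

def split_sentences(text):
    """Naive splitter: cut after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

clean = split_sentences("NLP is growing fast. Is it ready? Not quite!")
# clean == ["NLP is growing fast.", "Is it ready?", "Not quite!"]

# The same rule breaks on abbreviations, which is why real
# segmenters need more than a punctuation check.
messy = split_sentences("Dr. Smith arrived. He sat down.")
# messy == ["Dr.", "Smith arrived.", "He sat down."]
```

The second call shows why this step can get complex: the abbreviation "Dr." is wrongly treated as the end of a sentence.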
Step 2: Word tokenization
In this second step, called tokenization, we break our sentences into separate words, also called tokens.
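As a rough illustration, tokenization can be sketched with a single regular expression that keeps words (including simple contractions) together and treats each punctuation mark as its own token. The tokenize helper below is an assumption made for this example, not a reference implementation.

```python
import re

def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    # \w+(?:'\w+)? keeps contractions like "isn't" in one piece;
    # [^\w\s] captures any single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

tokens = tokenize("London is the capital of England.")
# tokens == ["London", "is", "the", "capital", "of", "England", "."]
```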
Step 3: Token prediction
In this third step, we look at each token and try to guess its part of speech (noun, adverb, verb, ...), usually by feeding each token into a pre-trained part-of-speech classification model.
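The sketch below only shows the shape of this step: tokens go in, (token, tag) pairs come out. It replaces the pre-trained classifier with a tiny hand-written lookup table, so TOY_LEXICON, tag_tokens, and the tag names are all assumptions made for illustration. A real tagger also uses the surrounding words, which a lookup table cannot do.

```python
# Hypothetical hand-written lexicon standing in for a trained model.
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "is": "VERB", "runs": "VERB",
    "quickly": "ADV",
    "dog": "NOUN", "capital": "NOUN",
}

def tag_tokens(tokens, default="NOUN"):
    """Pair each token with a tag; unknown words fall back to `default`."""
    return [(tok, TOY_LEXICON.get(tok.lower(), default)) for tok in tokens]

tagged = tag_tokens(["The", "dog", "runs", "quickly"])
# tagged == [("The", "DET"), ("dog", "NOUN"), ("runs", "VERB"), ("quickly", "ADV")]
```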
Step 4: Text lemmatization
In this fourth step, lemmatization, we figure out the most basic form, or lemma, of each token in each sentence.
Example of a lemma: run (for run/runs/running)
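A minimal sketch of the idea, assuming a small hand-written lookup table (LEMMA_TABLE and lemmatize are names invented for this example). Real lemmatizers typically combine much larger dictionaries with the part of speech found in the previous step.

```python
# Toy table mapping inflected forms to their lemma (illustrative only).
LEMMA_TABLE = {
    "runs": "run",
    "running": "run",
    "ran": "run",
}

def lemmatize(token):
    """Return the lemma if the word is in the table, else the lowercased word."""
    word = token.lower()
    return LEMMA_TABLE.get(word, word)

lemmas = [lemmatize(t) for t in ["run", "runs", "Running"]]
# lemmas == ["run", "run", "run"]
```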
Step 5: Identifying and removing stop words
Human languages have a lot of filler words that appear frequently, like “and”, “the”, etc. When doing statistics on text, these words introduce a lot of noise, which is why it is important to leave them out of the model in order to preserve its accuracy.
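The effect on word statistics is easy to see on a toy example. The stop-word list below is a short, assumed one; in practice the list depends on the language and on the application.

```python
from collections import Counter

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"and", "the", "a", "of", "is", "to", "in"}

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["The", "cat", "and", "the", "dog", "chase", "the", "cat"]

raw_counts = Counter(t.lower() for t in tokens)
kept_counts = Counter(t.lower() for t in remove_stop_words(tokens))

# Before filtering, the most frequent word is the filler "the" (3 times);
# after filtering, it is the content word "cat" (2 times).
```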
Step 6: Dependency parsing
In this step, we try to figure out how the words in each sentence relate to each other by building a tree that assigns a parent to each token and labels the relationship between the two. The tree also lets us group together the words that express a single idea, such as the words making up a noun phrase.
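The output of this step can be pictured as a list of tokens that each point to a parent. The sketch below does not parse anything: the tree for "The dog chased the cat" is written by hand, and the Token class, the relation labels, and the helper functions are assumptions chosen for illustration. It only shows how such a tree can be stored and queried.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    text: str
    head: Optional[int]  # index of the parent token, None for the root
    relation: str        # label of the link to the parent

# Hand-annotated parse of "The dog chased the cat".
parse = [
    Token("The", 1, "det"),
    Token("dog", 2, "nsubj"),
    Token("chased", None, "root"),
    Token("the", 4, "det"),
    Token("cat", 2, "obj"),
]

def children(parse, index):
    """Texts of the tokens whose parent is the token at `index`."""
    return [t.text for t in parse if t.head == index]

def subtree(parse, index):
    """Words of the subtree rooted at `index`, in sentence order."""
    keep = {index}
    changed = True
    while changed:
        changed = False
        for i, tok in enumerate(parse):
            if tok.head in keep and i not in keep:
                keep.add(i)
                changed = True
    return [parse[i].text for i in sorted(keep)]

verb_children = children(parse, 2)   # ["dog", "cat"]
subject_phrase = subtree(parse, 1)   # ["The", "dog"]
```

Collecting the subtree under a noun is one simple way to group the words that belong to the same noun phrase.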
Step 7: Named Entity Recognition (NER)
Here, the goal is to detect nouns and label them with the real-world concepts they represent, using the context surrounding each noun to discriminate between the possible types of entity.
Example: Paris (the city) / Paris (the name)
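To illustrate how context can separate the two readings of "Paris", here is a deliberately crude rule-based sketch. The cue-word lists, the label names, and the label_paris function are all assumptions made for this example; real NER systems learn these contextual cues from annotated data instead of hard-coding them.

```python
# Hypothetical context cues, chosen only for this illustration.
LOCATION_CUES = {"in", "to", "from", "near"}
PERSON_CUES = {"mr", "mrs", "ms", "miss"}

def label_paris(tokens):
    """Label each occurrence of 'Paris' using the neighbouring tokens."""
    labels = []
    for i, tok in enumerate(tokens):
        if tok != "Paris":
            continue
        prev_tok = tokens[i - 1].lower().rstrip(".") if i > 0 else ""
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
        # A title before, or a capitalized word after, suggests a person's name.
        if prev_tok in PERSON_CUES or next_tok[:1].isupper():
            labels.append((i, "PERSON"))
        elif prev_tok in LOCATION_CUES:
            labels.append((i, "LOCATION"))
        else:
            labels.append((i, "UNKNOWN"))
    return labels

city = label_paris(["I", "flew", "to", "Paris", "yesterday"])
# city == [(3, "LOCATION")]
person = label_paris(["Paris", "Hilton", "arrived"])
# person == [(0, "PERSON")]
```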
Step 8: Meaning mapping
In this final step, the goal is to map meaning across our text corpus by tracking pronouns across sentences, a task known as coreference resolution.
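The simplest possible version of pronoun tracking is to replace each pronoun with the most recently mentioned entity. The sketch below does exactly that and nothing more; the pronoun list and the resolve_pronouns function are assumptions for illustration, and this heuristic fails as soon as a text mentions more than one candidate entity.

```python
PRONOUNS = {"it", "he", "she", "they"}

def resolve_pronouns(sentences, entities):
    """Replace pronouns with the last entity seen so far.

    sentences: list of token lists, in document order
    entities:  set of known entity names (e.g. from the NER step)
    """
    last_entity = None
    resolved = []
    for tokens in sentences:
        out = []
        for tok in tokens:
            if tok in entities:
                last_entity = tok
                out.append(tok)
            elif tok.lower() in PRONOUNS and last_entity is not None:
                out.append(last_entity)
            else:
                out.append(tok)
        resolved.append(out)
    return resolved

text = [["London", "is", "big"], ["It", "is", "old"]]
resolved = resolve_pronouns(text, {"London"})
# resolved == [["London", "is", "big"], ["London", "is", "old"]]
```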
Now that we’ve seen what NLP is and how it’s usually implemented, let’s look at the limitations and problems that this growing field still encounters.
NLP challenges
Training a neural network and getting the best results out of it usually requires vast datasets, which is why information overload remains one of the main problems in this field to this day. It makes it hard to find specific, important pieces of information in big datasets, and then to identify the context of interaction among entities and objects in high-dimensional, heterogeneous, and complex data. For now, this problem is holding back, to some extent, the progress of NLP in understanding human language.
Extracting relevant and correct information from unstructured or semi-structured data through Information Extraction (IE) techniques is also a big issue, given the relative infancy of this field and the work that remains before we have robust summarization methods enabling informative chatbots and other technologies.
Finally, despite the big leap forward made in these two domains over the last 3 to 5 years, text classification and text translation are still under development. Researchers are still looking for robust ways to solve multi-level text classification that can deal efficiently with major problems such as high-dimensional label spaces, label dependency, drift, etc.
NLP applications
From their first use in spam filtering, and with the tremendous progress in accuracy observed over the last 10 years, NLP techniques have come a long way and have spread to a broad range of areas, especially in services. If you are familiar with Slack or Apple, there is a good chance that you have encountered NLP algorithms and interacted with them through a chatbot to get information on various subjects. The same can be said of the recent boom in personal assistants such as Siri and the home assistants developed by Google and Amazon. It is pretty clear that Alexa or Google Home would never have seen the light of day if they were not able to process human language, handle queries, and provide intelligible answers through NLP techniques such as lemmatization, summarization, etc.
However, these current uses are far from the full potential of NLP for the years to come. With the recent breakthrough of transfer learning in the NLP field and the development of new state-of-the-art models such as BERT and RoBERTa, it is clear that we are only at the ignition phase of the NLP revolution. The development and use of these models will disrupt many sectors, such as the legal and medical sectors, through new applications able to process, understand, extract, and summarize gigabytes of data much faster and more efficiently than any paralegal or medical secretary.
To conclude, even though this sector is still in its first phase of development, we can see that, like computer vision, it will surely be part of the near future. Most of the data produced globally every day is raw text and, given the ever-growing demand from end users for more integration between software, applications, and smart devices, big tech companies will badly need to process that data through NLP.
N.B. For the moment, most companies in the NLP sector are still private and do not trade publicly, except for LAIX LingoChamp. Nonetheless, for information purposes, here is a non-exhaustive list of the top ten companies: