New Tech: NLP (Natural Language Processing)

Computers are getting smarter. The question is: how? They are trained on an array of different data and linguistic models to recognize patterns in data and in the human voice.

On the latter, we take a look at NLP (Natural Language Processing) and how computers are trained to turn speech into code.

While we could say that ‘Artificial Intelligence’ is very much a study of technology in relation to the future and ‘Linguistics’ is a study of language in relation to the past, the field of Natural Language Processing shows us that there is a very important overlap between the two.

This overlap is linked to the fact that Linguistics is really the breakdown of language into smaller components that the human brain assigns meaning to, and responds to in the form of speech, writing, or some kind of gesture.

Natural Language Processing (NLP) Algorithms and Models have been developed over decades to enable computers to recognize meaning and patterns in text and develop progressively more advanced responses.

Apple’s Siri and other voice-to-text assistants are the most visible consumer applications at this time, but the Computer Science itself goes much deeper than building automated assistants, which is why we dig into NLP in this New Tech post.

What Is NLP?

Natural Language Processing sits at the intersection of Computer Science, Artificial Intelligence and Linguistics.

NLP is a subset of Artificial Intelligence.


Often overlooked in relation to most applications of advanced Computer Science (Artificial Intelligence, Machine Learning, etc.) is the field of Linguistics. As we will see in The History section below, the origins of ‘Natural Language Processing’ stem from Linguistics, which can be defined simply as “the study of language and its structure.”

The key element here is “structure”: essentially, NLP breaks language down into its basic structural form to produce meaning that computers can interpret.

The core concept here is called ‘NLP Tokenization,’ where “tokens” are the words, characters, etc. that give language structure and meaning. You can tokenize sentences into words, and words into characters.

Tokenization example – tokenization algorithms in NLP

Not only is language itself effectively unlimited in its words (i.e. multiple different languages with multiple different character sets), but in the modern era we see digital language evolving through the use of emojis and slang to convey new meaning.

That’s why the core of NLP is tokenization (breaking language down into its core components) and then training various computational/mathematical models to recognize patterns and assign meaning to language in an automated fashion.
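To make this concrete, here is a minimal tokenization sketch in plain Python (no NLP library assumed); real-world tokenizers handle contractions, emojis, and subwords far more carefully:

```python
import re

def word_tokenize(text):
    # Toy tokenizer: split into word tokens, keeping punctuation as separate tokens
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(word):
    # A word token can itself be tokenized into individual characters
    return list(word)

sentence = "NLP breaks language down!"
words = word_tokenize(sentence)
print(words)                    # ['NLP', 'breaks', 'language', 'down', '!']
print(char_tokenize(words[0]))  # ['N', 'L', 'P']
```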

Here are some specific examples of NLP applications:

  • Topic Discovery and Modelling – looks for themes/meanings across different sets of texts and applies a model that can analyze them in real time for discovery purposes
  • Contextual Extraction – automatically pulls structured data from various text-based sources
  • Sentiment Analysis – looks for mood/mindset/emotion and other subjective signals in text (see the toy sketch after this list)
  • Text-to-Speech/Speech-to-Text Conversion – as we see with many modern-day ‘voice search’ applications, NLP is used to convert voice to text and vice versa
  • Language Translation – translates one language into another
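As a toy illustration of the Sentiment Analysis item above, here is a deliberately simple lexicon-based sketch; the word lists and scoring rule are hypothetical assumptions for the example, whereas production systems learn these weights from labeled data:

```python
# Toy lexicon: real systems learn these associations from labeled data
POSITIVE = {"great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "hate", "terrible", "sad"}

def sentiment(text):
    # Count positive vs. negative tokens and return a rough polarity
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this great product"))  # positive
print(sentiment("terrible service and sad"))   # negative
```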

Taken in isolation, each one of these may not seem that impressive. But at scale, NLP is the bedrock of a transformation of not just communication, but business as a whole. In fact, as we enter this Age of Change, the concept of language is itself changing.

NLP – The History

The earliest seeds of NLP were planted in a Swiss university course dating back to the early 1900s that introduced the concept of thinking about ‘language as systems.’

From 1906 to 1911, Professor Saussure offered three courses at the University of Geneva, where he developed an approach describing languages as “systems.”

Dataversity

Conceptually, the arguments around thinking of ‘language-as-a-system’ paved the way for it to be effectively coded into computers in future years.

He argued that meaning is created inside language, in the relations and differences between its parts. A shared language system makes communication possible. Saussure viewed society as a system of “shared” social norms that provides conditions for reasonable, “extended” thinking, resulting in decisions and actions by individuals. (The same view can be applied to modern computer languages).

Dataversity

His work – with the help of his colleagues – became the backbone of the structuralist approach to learning languages, an approach that can be used for both teaching languages to students and teaching computers how to structure language into applications.

In 1950, Alan Turing wrote a paper describing a test for a “thinking” machine. He stated that if a machine could be part of a conversation through the use of a teleprinter, and it imitated a human so completely there were no noticeable differences, then the machine could be considered capable of thinking.

Dataversity

Many people are familiar with the ‘Turing Machine’ from the Second World War, but fewer know how Mr. Turing’s work became the foundation for modern NLP (among other fields) as a subset of AI. Without the ability to break down language systematically and teach computers how to “read,” there would be no comprehension (i.e. applications), which is the phase we are seeing now. Nevertheless, these applications were essentially written off as dead in the mid ’60s.

In 1966, Artificial Intelligence and Natural Language Processing (NLP) research was considered a dead end by many (though not all).

Dataversity

In the ’80s and ’90s, the field of Statistics moved to the forefront of AI and NLP research through the creation of what were essentially ‘decision trees,’ which in turn gave way to neural nets (networks).

These neural nets provided an early view into the ‘automation’ of decision making based on strings of probabilities, which is why Statistics became so important to NLP and associated fields.

Through this type of n-gram NLP model, computers were able to start learning how to create sentence structure by ‘predicting’ what words come next; by progressively developing more precise probabilities based on an analysis of multiple bodies of text, these n-grams became capable of ‘writing’ by creating sequences of words.

the term n-gram is used to mean either the word sequence itself or the predictive model that assigns it a probability.

Depends On The Definition
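As a hedged sketch of that predictive idea, here is a minimal bigram (n = 2) model in plain Python; the tiny ‘corpus’ is made up for the example, while real models train on massive bodies of text:

```python
from collections import defaultdict, Counter

# Made-up toy corpus; real n-gram models are trained on huge text collections
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows which (bigram counts)
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    # Return the most probable next word with its estimated probability
    counts = following[word]
    best, n = counts.most_common(1)[0]
    return best, n / sum(counts.values())

print(predict_next("the"))  # ('cat', 0.5) - "cat" follows "the" in 2 of 4 cases
```

Chaining such predictions word by word is exactly how these models ‘write’ sequences of text.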

The development of these neural nets, n-grams, and other models related to NLP in the ’80s and ’90s led to the boom of NLP and associated technologies in the 2000s and beyond.

NLP – The Innovation

If the ‘tokenization’ of language was a concept that started in a University in Switzerland in the early 1900s, then what we are looking at now – at a time when we can start seeing NLP at the heart of multiple different applications – is tokenization-as-an-innovation.

With the amount of content (sentences, words, emotions) being produced online each day – a large percentage of it user-generated – computers have exponentially increasing probabilities of being able to tokenize and assign meaning to language accurately; larger data sets mean more data to analyze, test, and experiment with.

This is where – conceptually – we can see the field of ‘Artificial Intelligence’ arising to produce real ‘intelligence outcomes’ that would – theoretically – compete with the power of the human brain. But the brain’s enduring advantage is its ability to assign meaning, context, and signal to what are sometimes simple expressions such as ‘aha!’ or ‘; )’.

Many will recognize Apple’s Siri as a specific, in-market example of NLP – The Innovation:

In the year 2011, Apple’s Siri became known as one of the world’s first successful NLP/AI assistants to be used by general consumers. Within Siri, the Automated Speech Recognition module translates the owner’s words into digitally interpreted concepts. The Voice-Command system then matches those concepts to predefined commands, initiating specific actions.

Dataversity
Medium – ‘Hey Siri’ by Apple, 2018

Without the ability to properly tokenize language, categorize it, and develop some kind of actionable response, Siri would have been no better than most other pipe dream techno-fantasies that came up in the 2000s.

The algorithm behind Siri and other voice-to-text assistants is thus able to recognize patterns in speech (i.e. queries related to a certain contextual request, like “where is the nearest gas station?”) and create a response through some kind of machine-learning application.
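Conceptually, that pattern-to-action step might look something like the sketch below; the command names and keyword rules here are hypothetical simplifications for illustration, not Apple’s actual implementation:

```python
# Hypothetical keyword rules mapping a transcribed query to a predefined command
COMMANDS = {
    "find_nearby": {"where", "nearest", "closest"},
    "set_timer": {"timer", "remind", "alarm"},
}

def match_command(transcribed_text):
    # Pick the command whose keywords overlap most with the query's tokens
    tokens = set(transcribed_text.lower().split())
    best = max(COMMANDS, key=lambda cmd: len(COMMANDS[cmd] & tokens))
    return best if COMMANDS[best] & tokens else None

print(match_command("where is the nearest gas station?"))  # find_nearby
```

In practice, this matching is done by trained statistical models rather than hand-written keyword sets, but the tokenize-match-act flow is the same.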

Medium – History of NLP (Natural Language Processing)

The Innovation is the recognition of different patterns in language to design algorithms that drive some form of actionable response, which will differ greatly across the many potential applications we discussed above.

Topic Discovery and Modelling, Contextual Extraction, Sentiment Analysis, and Language Translation are other areas that are exponentially improved with accurate NLP tokenization.

Tokenization is the first step in any NLP pipeline. It has an important effect on the rest of your pipeline. A tokenizer breaks unstructured data and natural language text into chunks of information that can be considered as discrete elements.

Neptune.ai
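To show tokenization as the first step feeding the rest of a pipeline, here is a minimal bag-of-words classification sketch using scikit-learn (assuming it is installed; the four-example training set is obviously a toy):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 1 is tokenization: CountVectorizer tokenizes the text, then counts tokens
pipeline = make_pipeline(CountVectorizer(), LogisticRegression())

# Toy labeled data; real pipelines train on thousands of examples
texts = ["I love this", "great product", "I hate this", "terrible product"]
labels = ["pos", "pos", "neg", "neg"]

pipeline.fit(texts, labels)
print(pipeline.predict(["what a great experience"]))  # likely ['pos']
```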

From this point, we can talk about Neural Nets, Deep Learning and other very advanced applications of NLP.

NLP – The Technology

As one can imagine, The Technology behind NLP is very, very advanced.

The Technology itself consists of software programs rooted in advanced Mathematics, Statistics, and Computer Science. This tech forms the backbone of the world’s most valuable Tech companies:

Today, NLP techniques form the backbone for a wide variety of available products including virtual assistants, such as Apple’s Siri and Amazon’s Alexa. For example, NLP technologies decode the user’s speech, which instructs the virtual assistant to respond. If the user requests an answer to a query, then NLP enables the virtual assistant to generate a spoken answer.

Patents for NLP Software

As discussed above, NLP encompasses many different subfields and adjacent fields – Word Vectors, Neural Networks, Recurrent Neural Networks (RNNs), etc. – any of which could be applied to areas such as Deep Learning, Machine Learning, and other subsets of Artificial Intelligence.

Within and across these areas, development of algorithms occurs in many private corporations, but a large amount of it is also open-source in order to improve the speed of development in the industry.

Naturally, the number of patents for NLP began increasing exponentially over the last decade. Economics drive software development, and so NLP has been somewhat of an ‘arms race.’

Nevertheless, the technology itself is not ‘settled,’ and many of the most successful mathematical models in NLP date back decades or more. An NLP algorithm may be designed, a patent filed, and billions invested, and it may still flounder when it hits the market.

Alternatively, a small, unfunded team could build an application on an existing open-source NLP library and create something that is a massive success in the market.

The challenge with The Technology is that while the speed of development continues to increase exponentially, there are social limits that have yet to be explored. What types of ‘artificial intelligence’ will work at scale is not yet known, and will naturally differ by cultural and social norms.

‘Voice’ is one area that is hitting critical mass through a combination of applications like Siri, Alexa, and a host of others. By extension, ‘Voice Search’ is emerging quickly in relation to SEO and search behaviors. These applications rely on increasingly advanced neural networks.

Medium – Neural Network by Apple, 2018

These types of algorithms and neural networks will continue to advance to reduce friction and increase accuracy in translation, response, and lag time.

Overall, NLP is somewhat of a revolutionary layer in the advancement of Artificial Intelligence; yet it is paradoxically rooted in one of humanity’s oldest fields of study: Linguistics.

There is still much to be determined as to how NLP rolls into mainstream applications beyond what we have seen to date, but a change in the business ecosystem as a result of NLP specifically seems both imminent and profound.

More NLP Posts

How Valuable is Sentiment Analysis?

Search Engine Business Model

New Tech – Community Networks