Neural nets: How Regular Expressions brought about Deep Learning

2016 is the year of Deep Learning, whose popularity has been on a steep incline ever since Google bought DeepMind at the end of 2014. Last year’s technical breakthroughs,  acquisitions, funding deals and, open source releases have all helped to cement Deep Learning as the hip artificial intelligence. 

Our CTO, Matt Painter, explains how Deep Learning got its start.

 

It all started with Neuroscience

For millennia, people have asked: Why? 

What is mind? What is consciousness? How does my brain work?

Enter: Neuroscience. Starting with Guillaume-Benjamin-Amand Duchenne, scientists began exploring how neurons work in the 1890s by pressing on different parts of the brain to see what would happen.

Then in 1943, McCulloch and Pitts published “A logical calculus of the ideas immanent in nervous activity” which tried to understand the brain by modeling neurons. They didn’t know it at the time but, their paper would have an enormous impact on computer science.

Shortly afterward in 1951, Stephen Kleene set out to prove what McCulloch and Pitts had started, that neural nets could compute in “Representation of Events in Nerve Nets and Finite Automata”. In it, Kleene proposed a simple algebra (well, simple for algebra) that defined a Regular Language – something we use in computer programming all the time.

WTF’s a Regular Language?

It’s easiest to explain Regular Languages with an example. You start with an alphabet {0,1,2,3,4,5,6,7,8,9}, some words {123 012334 23848484}. To make it a language, you have to add some rules.

For example,  all words in the Double-0 Language must start with two 00.

004 005 007 0012 089

If you can decide if a “word” is in a “language” by inspecting its sequential “letters” in isolation like this flow chart then it’s a Regular Language.

Screen Shot 2016-01-20 at 3.42.55 PM

Isn’t everything regular? Nope. For example, Palindromes are not Regular because you can’t inspect the letters in isolation.

aa abba abccba abcddcba abcdeedcba

The chart above (that defines a Regular Language) is what is called a Finite State Machine or Deterministic Finite Automaton.

In 1956, Kleene proved that Finite State Machines are equivalent to this…

Screen Shot 2016-01-20 at 3.47.04 PM

 

He also showed that a Regular Language can be defined using a grammar with 3 operators:

Concatenation {“ab”, “c”}{“d”, “ef”} = {“abd”, “abef”, “cd”, “cef”}

Alternation {“ab”, “c”}|{“ab”, “d”, “ef”} = {“ab”, “c”, “d”, “ef”}

Kleene Star {“ab”, “c”}* = {“”, “ab”, “c”, “abab”, “abc”, “cab”, “cc”, “ababab”, “abcab”, … }

Hint: these should look familiar!

A Regular Expression defines a Regular Language

A Regular Expression for my Double 0 language would look like this

00(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*

Which can also be written as…

00[1-9][0-9]*

…or simply

00[1-9]d*

What about computers?

So far nothing I’ve talked about actually mentions computers – we’re only up to the 1950s. And yet, these ideas are central to the field of Computer Science.

Let’s fast-forward to 1968, when Ken Thomson published his “Regular Expression Search Algorithm” and introduced Regular Expressions in his port of the editor QED for CTSS. The next year he wrote “ed” which is still part of *nux today.

Fun fact: “ed” has the command g/{some command}/p or g/re/p, which is where grep comes from!

Then in the 80s, Larry Wall created Perl, which has a further extended syntax with features like capturing subgroups (d{5})(?:-(d{4}))?, backrefs (.*)1, lookarounds (?<=a)bcd and recursion (w)(?:(?R)|w?)1.

All of which allows it to be more expressive and flexible. Hance why it is the default in Perl, Java, JS, Ruby, PHP, etc.

These RegEx aren’t always equivalent to a Finite State Machine. In fact, Noam Chomsky defined a hierarchy that splits up the languages…

Screen Shot 2016-01-20 at 4.06.45 PM

Perl “RegEx” covers Context-Free Languages (and some other ones too).

 

Regular Expressions today

RegEx is the workhorse of data extraction, from Log collation to Web Scraping.

In fact, here at Import.io, we sometimes generate a RegEx when you train Extractors. To make this more accessible, so we put it up as a labs API (regexpgen.import.io) which helps you to generate a Regular Expression based on some examples.

Screen Shot 2016-01-20 at 4.23.43 PM

From RegEx comes Deep Learning

Earlier I talked about the paper by McCulloch and Pitts. That paper not only gave rise to RegEx, but also, every neural network since, including Deep Learning.

Deep Learning is making massive strides in things like object recognition, facial recognition image description, voice recognition and a whole host of others.

At Import.io, we use neural nets for complex data extraction tasks where a rule based system (or even ML) is simply not practical. Tasks such as working out the name or address of a company from their website.

 

One final thought…

When you are training an Extractor, and it generates a RegEx, that’s a Regular Expression. It’s defining a Regular Language which can be detected by a Finite State Machine which is equivalent to a McCulloch-Pitts neural net.

And that, blows my mind!

Comments

Comments are closed.

Extract data from almost any website


INSTANT ACCESS