Sentiment Analysis

Basic Sentiment Analysis implementation to try Node and learn a bit about NLP

Made by Félix Dorn

A word about the process

Step one

When input arrives through /api/nlp/analyze, I expand contractions like "won't" into "will not" based on a dictionary. Then I remove every non-word string.
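A minimal sketch of this step could look like the following. The contraction dictionary here is an assumption (a tiny sample, not the project's real list):

```javascript
// Hypothetical sketch of step one: expand contractions from a small
// dictionary, then strip every non-word character.
const contractions = {
  "won't": "will not",
  "can't": "can not",
  "don't": "do not",
};

function normalize(text) {
  let out = text.toLowerCase();
  for (const [short, long] of Object.entries(contractions)) {
    out = out.split(short).join(long);
  }
  // Keep only word characters and spaces, collapse repeated whitespace.
  return out.replace(/[^\w\s]/g, "").replace(/\s+/g, " ").trim();
}

console.log(normalize("I won't do it!")); // "i will not do it"
```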

Step two

I use the Damerau–Levenshtein distance and an English word dictionary to remove most misspelled words. This is not 100% effective, as I tried to avoid false positives as much as possible. Thanks to Grafikart's video about string manipulation basics and my well-loved Wikipedia.
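For reference, the distance itself can be computed like this (a sketch of the restricted Damerau–Levenshtein, a.k.a. optimal string alignment, variant; the project's exact implementation may differ):

```javascript
// Restricted Damerau–Levenshtein distance: edits are insertion, deletion,
// substitution, and transposition of two adjacent characters.
function damerauLevenshtein(a, b) {
  const d = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution
      );
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost); // transposition
      }
    }
  }
  return d[a.length][b.length];
}

console.log(damerauLevenshtein("hte", "the")); // 1 (a single transposition)
```

A token would then be kept if its distance to some dictionary word is below a small threshold.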

Step three

This is my favorite part, where it does not differ from building a programming language: lexical analysis. Oh yeah! We turn the text into tokens so we can analyse it. (The tokenizer is a one-liner function, but I still love it.)
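Since the tokenizer is described as a one-liner, it could plausibly be something as simple as (an assumption, not the actual code):

```javascript
// Split on whitespace and drop empty strings.
const tokenize = (text) => text.split(/\s+/).filter(Boolean);

console.log(tokenize("your turn here")); // ["your", "turn", "here"]
```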

Step four

I try to remove every stop word like "here", "for", "and", "or"... This step is crucial: if stop words slip through and are not caught later, the analysis returns invalid data more often.
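A sketch of this filter, with a deliberately tiny stop list (the real project presumably uses a much larger one):

```javascript
// Drop tokens that appear in the stop-word set.
const stopWords = new Set(["here", "for", "and", "or", "the", "a", "your"]);

const removeStopWords = (tokens) => tokens.filter((t) => !stopWords.has(t));

console.log(removeStopWords(["your", "turn", "here"])); // ["turn"]
```

This matches the API sample below, where `tokenizedText` is `["your", "turn", "here"]` but `filteredText` is `["turn"]`.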

Last step

I will let this bit of code speak for itself, as only God knows what I did here. It seems so simple and beautiful. WE DID IT.
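Since the actual code stays private, here is one common dictionary-based way to produce a level, purely as an assumption about how such a step could look (the word weights and the averaging are hypothetical, not the project's method):

```javascript
// Hypothetical AFINN-style scoring: each known word carries a weight,
// and the level is the mean weight over the filtered tokens.
const weights = { good: 3, bad: -3, love: 3, hate: -3 };

function score(tokens) {
  if (tokens.length === 0) return 0;
  const total = tokens.reduce((sum, t) => sum + (weights[t] ?? 0), 0);
  return total / tokens.length;
}

console.log(score(["good", "turn"])); // 1.5
```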

Diving into stemming and lemmatization

The goal of stemming and lemmatization is to reduce words and verb forms to their roots.
e.g. am, are, is => be
e.g. car, cars, car's, cars' => car
I used the Porter stemmer algorithm. Check the resources section for a full explanation.
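To give a flavor of suffix stripping, here is a heavily simplified sketch. This is NOT the Porter stemmer (which has many more rules and a measure-based condition system), just an illustration of the idea on the examples above:

```javascript
// Naive suffix stripper, for illustration only. The real Porter
// algorithm applies ordered rule sets with conditions on word structure.
function naiveStem(word) {
  return word
    .replace(/'s?$/, "")        // car's, cars' -> car, cars
    .replace(/sses$/, "ss")     // caresses -> caress
    .replace(/ies$/, "i")       // ponies -> poni
    .replace(/([^s])s$/, "$1"); // cars -> car, but caress stays caress
}

console.log(naiveStem("cars")); // "car"
```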

Resources

Level meaning

The level range is ∈ ]-3.5, 3.5[, measured over 10,000 random English word combinations.
level===0 // neutral :|
level < 0 // bad :(
level > 0 // good :)
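The three cases above can be folded into a tiny helper (a sketch, not project code):

```javascript
// Map a numeric level to the labels used above.
const mood = (level) => (level === 0 ? "neutral" : level < 0 ? "bad" : "good");

console.log(mood(-1.2)); // "bad"
```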

Caveats

Try it with "good as fuck" or "i like you". This illustrates the limits of a dictionary-based approach and of my mathematics knowledge. At this point, I think something AI-related could be interesting.
The reality is much more complicated than what I did, but diving into text classification and other NLP topics was super interesting.

Cool! And the source code?

Well yes, but actually no. I don't want to open-source it because there are so many better implementations out there. This one is so crap, I swear. At least the average response time of a request is not too long.

Disclaimer

I'm only human, so be careful about what I said; there are probably lots of mistakes.

API

Do not use that in production.
Submit a POST request to /api/nlp/analyze with a JSON body, e.g. {"text": "your turn here"}. You will get something like the sample below.

{
  "text": "Your turn here",
  "lexedText": "Your turn here",
  "casedText": "your turn here",
  "alphaOnlyText": "your turn here",
  "tokenizedText": [
    "your",
    "turn",
    "here"
  ],
  "filteredText": [
    "turn"
  ],
  "level": 0
}
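For example, calling the endpoint with Node's built-in fetch (Node 18+); the host and port are assumptions, adjust them to wherever the API runs:

```javascript
// Hypothetical client for the /api/nlp/analyze endpoint.
async function analyze(text) {
  const res = await fetch("http://localhost:3000/api/nlp/analyze", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  return res.json();
}

// analyze("your turn here").then((r) => console.log(r.level));
```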