| 1 | -# NLP_CODE |
| 1 | +#NLP_CODE |
| 2 | + |
| 3 | +This repository contains a single directory, sentence_autocomplete, which is an assignment submission intended purely for academic purposes. |
| 4 | +Code description: |
| 5 | + The code file contains a class named "MarkovChain", which provides the methods needed to build a Markov chain from trigrams. The trigrams are read from an output file, which is in turn generated from an original input file containing raw tweets. |
| 6 | + |
| 7 | + Method wise description: |
| 8 | + Class MarkovChain |
| 9 | + Method : initialize(input_file) |
| 10 | + Takes the name of the file containing raw tweets and calls "clean_data_and_save_to_file", which is defined |
| 11 | + in a module named DataPreprocessing in the same code file. That method removes unwanted material from the tweets (URLs and special characters other than apostrophes) and writes |
| 12 | + the processed tweets to an output file. The initialize method then loads this file, reads it line by line to |
| 13 | + produce trigrams from each line, and calls the "add" method to add each trigram to the trigram hash (@word). |
| 14 | + |
| 15 | + Method : add(word, word1, word2) |
| 16 | + Takes three words and adds the first two as a key in the dictionary (hash); the third word becomes a key in a nested hash under |
| 17 | + that key, with the frequency of occurrence of the trigram as its value. |
| 18 | + |
| 19 | + Method : get_possible_word(bigram) |
| 20 | + Takes a two-word combination, looks it up in the dictionary built by the "add" method, and finds |
| 21 | + all the keys stored under that bigram. It then calculates the weight of each of these keys and returns |
| 22 | + the key with the highest probability (frequency of occurrence). |
| 23 | + Note that this method returns only one word. |
| 24 | + |
| 25 | + Method : print_dict |
| 26 | + This method simply prints out the trigram dictionary @word. |
| 27 | + |
| 28 | + |
| 29 | +USAGE: |
| 30 | + Create an object of the class "MarkovChain", passing it the raw input text file. |
| 31 | + markov_obj = MarkovChain.new(<in_file>) |
| 32 | + Pass the last two words of an incomplete sentence to the method named "get_possible_word". |
| 33 | + str = "American sniper is directed by" |
| 34 | + str_list = str.chomp.strip.split |
| 35 | + wrd1 = str_list[-2] |
| 36 | + wrd2 = str_list[-1] |
| 37 | + search_string = "#{wrd1} #{wrd2}" |
| 38 | + next_word = markov_obj.get_possible_word(search_string) |
| 39 | + |
| 40 | + To get a second possible word, combine the last word of the original incomplete sentence with the next_word generated above and pass that bigram |
| 41 | + to "get_possible_word" again. |
| 42 | + |
| 43 | + |
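The trigram dictionary and the lookup described in the README can be sketched roughly as follows. This is a minimal illustration, not the repository's actual MarkovChain class: the class name TrigramSketch, the helper add_line, and the sample sentences are all hypothetical, and the data cleaning step is assumed to have already happened. Only the method names "add" and "get_possible_word" and the @word hash are taken from the description above.

```ruby
# Hypothetical sketch of the structure the README describes; the real
# MarkovChain class may differ in details.
class TrigramSketch
  def initialize
    # "w1 w2" => { "w3" => count }. A bigram seen for the first time
    # gets a fresh counter hash whose counts start at 0.
    @word = Hash.new { |hash, key| hash[key] = Hash.new(0) }
  end

  # Record one occurrence of the trigram (word, word1, word2).
  def add(word, word1, word2)
    @word["#{word} #{word1}"][word2] += 1
  end

  # Assumed helper: slide a three-word window over one cleaned line.
  def add_line(line)
    line.split.each_cons(3) { |w, w1, w2| add(w, w1, w2) }
  end

  # Return the most frequent follower of the bigram, or nil if the
  # bigram was never seen. Hash#fetch does not invoke the default
  # block, so a lookup never creates an empty entry.
  def get_possible_word(bigram)
    followers = @word.fetch(bigram, nil)
    return nil if followers.nil? || followers.empty?
    followers.max_by { |_follower, count| count }.first
  end
end

# Made-up, already-cleaned sample lines.
sketch = TrigramSketch.new
[
  "american sniper is directed by clint eastwood",
  "the film is directed by clint eastwood",
  "it was directed by someone else"
].each { |line| sketch.add_line(line) }

# First suggestion for a sentence ending in "directed by".
first = sketch.get_possible_word("directed by")

# Second suggestion: last word of the sentence plus the first suggestion.
second = first && sketch.get_possible_word("by #{first}")
```

Chaining the lookup this way mirrors the README's note on getting a second word: each call returns a single word, so a longer completion is built one bigram at a time, stopping when a lookup returns nil.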