The dataset I am using is available on Kaggle; it needs to be downloaded first.
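Once downloaded, the comments can be pulled straight out of the SQLite file with the `sqlite3` module. Here is a minimal sketch — the table name `May2015` and the `body` column match the layout of the Kaggle dump, but the snippet builds a tiny in-memory stand-in so it is self-contained:

```python
import sqlite3

# In practice you would connect to the Kaggle dump, e.g.
#   conn = sqlite3.connect("database.sqlite")
# Here we build an in-memory stand-in with the same (assumed) schema:
# a "May2015" table whose "body" column holds the raw comment text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE May2015 (body TEXT)")
conn.executemany(
    "INSERT INTO May2015 (body) VALUES (?)",
    [("The quick brown fox jumps over the lazy dog.",),
     ("Word2Vec learns vectors from context.",)],
)

# The real query is identical: pull comment bodies to use as training text.
comments = [row[0] for row in conn.execute("SELECT body FROM May2015 LIMIT 1000")]
conn.close()
```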
For this project, I will be using the Reddit May 2015 dataset, available from Kaggle. Although I will not implement the neural network behind the Word2Vec model in a deep learning library (such as TensorFlow), it would not be difficult to do, since the network diagram is easily outlined. We will first start with the necessary imports: we will need NLTK (for NLP), Gensim (for Word2Vec), scikit-learn (for the clustering algorithm), and Pandas and NumPy (for data structures and processing). From NLTK, we also need to download the "punkt" package, which contains a module for splitting a text into sentences.

```python
%matplotlib inline
import os
import re
import sys
import time
import logging
import sqlite3
import multiprocessing
from itertools import cycle

import nltk.data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from gensim.models import word2vec
from sklearn.cluster import KMeans
from sklearn.neighbors import KDTree
from wordcloud import WordCloud, ImageColorGenerator

# "punkt" provides the pre-trained sentence tokenizer
nltk.download('punkt')
```
In this case, the window size (denoted C) is 4, so there are two words on each side of the target, except for words at the edges of the text. The input words are then processed into one-hot vectors. One-hot vectors are a vectorial representation of data in which we use a large vector of zeros, with one position for each word in the vocabulary, and set the position corresponding to the target word to 1. In this case we have a total of V words, so each vector will be of length V, with the index corresponding to the word's position set to 1. In the example above, the vector corresponding to "fox" will be 0 0 0 1 0 0 0 0, because "fox" is the 4th word in the vocabulary, and there are 8 words in total (counting "the" once).

Having obtained the input, the next step is to design the neural network. Word2Vec is a shallow network, which means that it has a single hidden layer. A traditional neural-network diagram summarizes the Word2Vec model well: we take as input the one-hot vectorized representations of the words, apply W1 (and b1) to obtain the hidden-layer representations, feed those through W2 and b2, and apply a softmax activation to obtain a probability for each candidate target word. However, what we need from Word2Vec is vectorial representations of the input words. By training the model on the target words and their surrounding labels, we update the values of the hidden layer iteratively until the cost (the difference between the prediction and the actual label) is at a minimum. Once the model has been trained, we extract this hidden representation as the word vector of each word, since there is a one-to-one correspondence between the two. The size of the hidden layer h is also an important consideration here, since it decides the length of the vector for each word. I will now show how to train a Word2Vec model.
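The forward pass described above can be sketched in a few lines of NumPy. This is an illustrative toy, not a trained model: the weights W1, b1, W2, b2 are randomly initialised stand-ins for the parameters that training would learn.

```python
import numpy as np

# Toy vocabulary from the example sentence ("the" counted once): V = 8 words.
vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
V = len(vocab)
h = 3  # size of the hidden layer = length of each word vector

def one_hot(word):
    """V-length vector of zeros with a 1 at the word's index."""
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

x = one_hot("fox")  # fourth position set to 1

# Randomly initialised weights stand in for the trained parameters.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(V, h)), np.zeros(h)
W2, b2 = rng.normal(size=(h, V)), np.zeros(V)

hidden = x @ W1 + b1                           # hidden-layer representation (the word vector)
scores = hidden @ W2 + b2
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over candidate context words
```

Because `x` is one-hot, `x @ W1` simply selects one row of W1 — which is why, after training, the rows of W1 can be read off directly as the word vectors.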
#Generating training data from a source text
We use a sliding window over the text, and for each "target word" in the window, we pair it with an adjacent word to obtain an x-y training pair. Let's look at the following diagram (from Chris McCormick's blog):
![Generating training pairs with a sliding window](https://64.media.tumblr.com/eb7ad81f96b2e6a3042244b8309a9c95/tumblr_mvj2kndE1u1ras9hto1_540.jpg)
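The sliding-window pairing is easy to write down directly. A minimal sketch (the helper name `training_pairs` is my own; `window=2` corresponds to C = 4, two words on each side):

```python
def training_pairs(tokens, window=2):
    """Pair each target word with every word within `window` positions of it."""
    pairs = []
    for i, target in enumerate(tokens):
        # Clamp the context range at the edges of the sentence.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = training_pairs(sentence, window=2)
```

Edge words simply get fewer pairs: "the" at position 0 is only paired with the two words to its right.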
To understand Word2Vec, let's walk through it step by step. The input of the algorithm is text data: for a string of sentences, the training data will consist of individual words, each paired with the contextual words we are trying to predict.
Word2Vec is an algorithm that generates vectorized representations of words based on their linguistic context. These word vectors are obtained by training a shallow neural network (a single hidden layer) on the individual words in a text, with the surrounding words given as the labels to predict. By training the neural network on this data, we update the weights of its hidden layer, which we extract at the end as the word vectors.
In this post, I am going to go over Word2Vec. I will talk about some background on what the algorithm is and how it can be used to generate word vectors. Then, I will walk through the code for training a Word2Vec model on the Reddit comments dataset and exploring the results.