Package smile.nlp.embedding
Class Word2Vec
java.lang.Object
smile.nlp.embedding.Word2Vec
Word2vec is a group of related models that are used to produce word
embeddings. These models are shallow, two-layer neural networks that
are trained to reconstruct linguistic contexts of words. Word2vec
takes as its input a large corpus of text and produces a vector space,
typically of several hundred dimensions, with each unique word in the
corpus being assigned a corresponding vector in the space. Word vectors
are positioned in the vector space such that words that share common
contexts in the corpus are located close to one another in the space.
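"Close to one another" is usually measured with cosine similarity. The sketch below is illustrative only: `CosineDemo` and the three-dimensional toy vectors are invented for this example and are not part of this package. Real embeddings would come from a trained model.

```java
final class CosineDemo {
    /**
     * Cosine similarity of two vectors of equal length.
     * Returns NaN if either vector is all zeros.
     */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy vectors, made up for illustration.
        float[] king = {0.9f, 0.1f, 0.4f};
        float[] queen = {0.8f, 0.2f, 0.5f};
        float[] banana = {-0.3f, 0.9f, -0.2f};
        System.out.println("king vs queen:  " + cosine(king, queen));
        System.out.println("king vs banana: " + cosine(king, banana));
    }
}
```

A value near 1 means the two vectors point in almost the same direction; a value near 0 or below means the words rarely share contexts.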
Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words. According to the authors' note, CBOW is faster while skip-gram is slower but does a better job for infrequent words.
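The difference between the two architectures is easiest to see in the training examples each one derives from the same sentence. The sketch below is a minimal illustration under assumptions, not the training code of word2vec or of this package: `ContextWindows`, `cbowExamples`, and `skipGramPairs` are hypothetical names, the window is a fixed symmetric size, and the distance-based weighting used by skip-gram is not modelled.

```java
import java.util.ArrayList;
import java.util.List;

final class ContextWindows {
    /** CBOW: one example per position, written as "context words -> current word". */
    static List<String> cbowExamples(String[] tokens, int window) {
        List<String> examples = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            List<String> context = new ArrayList<>();
            int from = Math.max(0, i - window);
            int to = Math.min(tokens.length - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (j != i) context.add(tokens[j]);
            }
            if (!context.isEmpty()) {
                examples.add(context + " -> " + tokens[i]);
            }
        }
        return examples;
    }

    /** Skip-gram: one example per (current word, context word) pair. */
    static List<String> skipGramPairs(String[] tokens, int window) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(tokens.length - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (j != i) pairs.add(tokens[i] + " -> " + tokens[j]);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String[] tokens = {"the", "quick", "brown", "fox"};
        System.out.println("CBOW:");
        cbowExamples(tokens, 1).forEach(System.out::println);
        System.out.println("Skip-gram:");
        skipGramPairs(tokens, 1).forEach(System.out::println);
    }
}
```

CBOW folds the whole window into a single prediction of the current word, while skip-gram emits a separate prediction for each context word. This is one way to see why skip-gram does more work per token.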
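Field Summary
Field      Description
words      The vocabulary.
vectors    The vector space.
Constructor Summary
Constructor                 Description
Word2Vec(words, vectors)    Constructor.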
Method Summary
Modifier and Type    Method             Description
float[]              apply(word)        Returns the embedding vector of a word.
int                  dimension()        Returns the dimension of the embedding vector space.
float[]              get(word)          Returns the embedding vector of a word.
static Word2Vec      of(file)           Loads a pre-trained word2vec model from a binary file in ByteOrder.LITTLE_ENDIAN byte order.
static Word2Vec      of(file, order)    Loads a pre-trained word2vec model from a binary file.
Field Details
words
The vocabulary.
vectors
The vector space.
Constructor Details
Word2Vec
Constructor.
Parameters:
words - the vocabulary.
vectors - the d x n matrix of vectors, where d is the dimension and n is the size of the vocabulary.
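Because the matrix passed to the constructor is d x n, each word's embedding is a column rather than a row. The sketch below shows one way to read a column out of such a layout. `ColumnLookup` and its members are hypothetical and are not meant to reflect how this class stores or indexes its data internally; returning `null` for an unknown word is a choice made for this sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class ColumnLookup {
    private final Map<String, Integer> index = new HashMap<>();
    private final float[][] vectors; // vectors[i][j]: component i of word j

    ColumnLookup(String[] words, float[][] vectors) {
        for (float[] row : vectors) {
            if (row.length != words.length) {
                throw new IllegalArgumentException("each row needs one entry per word");
            }
        }
        for (int j = 0; j < words.length; j++) {
            index.put(words[j], j);
        }
        this.vectors = vectors;
    }

    /** The number of rows d, i.e. the embedding dimension. */
    int dimension() {
        return vectors.length;
    }

    /** Copies column j into a new array; null if the word is unknown. */
    float[] get(String word) {
        Integer j = index.get(word);
        if (j == null) return null;
        float[] v = new float[vectors.length];
        for (int i = 0; i < v.length; i++) {
            v[i] = vectors[i][j];
        }
        return v;
    }

    public static void main(String[] args) {
        String[] words = {"cat", "dog"};          // n = 2
        float[][] m = {{1f, 2f}, {3f, 4f}, {5f, 6f}}; // d = 3
        ColumnLookup lookup = new ColumnLookup(words, m);
        System.out.println(Arrays.toString(lookup.get("dog"))); // prints [2.0, 4.0, 6.0]
    }
}
```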
Method Details
dimension
public int dimension()
Returns the dimension of the embedding vector space.
Returns:
the dimension of the embedding vector space.
get
Returns the embedding vector of a word.
Parameters:
word - the word.
Returns:
the embedding vector.
apply
Returns the embedding vector of a word. For Scala convenience.
Parameters:
word - the word.
Returns:
the embedding vector.
of
Loads a pre-trained word2vec model from a binary file in ByteOrder.LITTLE_ENDIAN byte order.
Parameters:
file - the path to the model file.
Returns:
the word2vec model.
Throws:
IOException - if the file cannot be read.
of
Loads a pre-trained word2vec model from a binary file.
Parameters:
file - the path to the model file.
order - the byte order of the model file.
Returns:
the word2vec model.
Throws:
IOException - if the file cannot be read.
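The byte order matters because binary model files typically store each vector component as a raw 4-byte float. The sketch below assumes the layout written by the original word2vec C tool: an ASCII header `vocabSize dimension`, then each word followed by a space and its raw floats. That layout is an assumption here, not something this page specifies, and `BinaryEmbeddingReader` is a hypothetical stand-in rather than this class's implementation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

final class BinaryEmbeddingReader {
    /**
     * Parses an in-memory model. Throws BufferUnderflowException if the
     * data is shorter than the header promises.
     */
    static Map<String, float[]> read(byte[] data, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(order);
        int size = Integer.parseInt(token(buf));
        int dim = Integer.parseInt(token(buf));
        Map<String, float[]> model = new LinkedHashMap<>();
        for (int w = 0; w < size; w++) {
            String word = token(buf);
            float[] v = new float[dim];
            for (int i = 0; i < dim; i++) {
                v[i] = buf.getFloat(); // decoded using the requested byte order
            }
            model.put(word, v);
        }
        return model;
    }

    /** Reads bytes up to the next space or newline, skipping leading separators. */
    private static String token(ByteBuffer buf) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (buf.hasRemaining()) {
            byte b = buf.get();
            if (b == ' ' || b == '\n') {
                if (out.size() > 0) break; // separator ends the token
            } else {
                out.write(b);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Builds a one-word, two-dimensional sample model in the given byte order. */
    static byte[] sample(ByteOrder order) {
        ByteBuffer out = ByteBuffer.allocate(64).order(order);
        out.put("1 2\n".getBytes(StandardCharsets.US_ASCII));
        out.put("hi ".getBytes(StandardCharsets.UTF_8));
        out.putFloat(1.5f).putFloat(-2.0f);
        out.put((byte) '\n');
        return Arrays.copyOf(out.array(), out.position());
    }

    public static void main(String[] args) {
        byte[] data = sample(ByteOrder.LITTLE_ENDIAN);
        System.out.println(Arrays.toString(read(data, ByteOrder.LITTLE_ENDIAN).get("hi")));
        // Same bytes decoded with the wrong order give meaningless values.
        System.out.println(Arrays.toString(read(data, ByteOrder.BIG_ENDIAN).get("hi")));
    }
}
```

Decoding with the wrong order raises no error; it silently produces nonsense vectors, so the order passed when loading should match the machine that wrote the file.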