class dataset.DataSet(directory, verbose, categories)

Holds the dataset and the methods associated with it.

static _make_categories(categories)

Makes a list of categories to extract from a raw document.

:return: category list, or None (if extracting all categories)

static iter_documents()

Generator that iterates over all relevant documents.

:return: yields one document (a list of UTF-8 tokens) at a time
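The generator pattern described here can be sketched as follows. This is only an illustration, not the module's implementation: the function name iter_token_lists, the "*.txt" pattern, and the assumption that each preprocessed document is a UTF-8 text file of whitespace-separated tokens are all hypothetical.

```python
from pathlib import Path
from typing import Iterator, List


def iter_token_lists(directory: str, pattern: str = "*.txt") -> Iterator[List[str]]:
    """Yield one document at a time as a list of tokens (hypothetical helper)."""
    # Sort the paths so the iteration order is deterministic.
    for path in sorted(Path(directory).glob(pattern)):
        # Assumption: one file is one document, tokens separated by whitespace.
        yield path.read_text(encoding="utf-8").split()
```

Because documents are produced lazily, only one document needs to be held in memory at a time.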

preprocess()

Calls the pre-processing methods and prints progress (if verbose).

:return: None

class embedding.Embedding(verbose)

Methods to generate and evaluate word embedding vectors.

generate(model_type, dim, workers)

Trains a word embedding model and saves it to file.

:param model_type: 'word2vec' or 'fasttext'
:param dim: number of dimensions of the word embedding vectors
:param workers: number of workers used to parallelise training of the word embedding model
:return: None
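A minimal sketch of how these three parameters might map onto a training call, assuming the models come from gensim 4.x (where the dimension argument is named vector_size). The function name train_embedding, the sentences argument, and min_count=1 are illustrative assumptions; the source does not state which library or settings the method uses.

```python
from typing import Iterable, List


def train_embedding(sentences: Iterable[List[str]], model_type: str, dim: int, workers: int):
    """Train a word embedding model (hypothetical sketch, assumes gensim >= 4.0)."""
    # Validate first so a bad argument fails before any heavy import or training.
    if model_type not in ("word2vec", "fasttext"):
        raise ValueError(f"model_type must be 'word2vec' or 'fasttext', got {model_type!r}")

    # Imported lazily; gensim is an assumed dependency of this sketch.
    from gensim.models import FastText, Word2Vec

    model_cls = Word2Vec if model_type == "word2vec" else FastText
    # `sentences` must be re-iterable (e.g. a list), since training makes several passes.
    return model_cls(sentences=sentences, vector_size=dim, workers=workers, min_count=1)
```

The returned model could then be written to disk with its save method, for example model.save("embedding.model"), where the file name is a placeholder.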

tSNE(model_file=None)

Creates a t-SNE model, plots it, and saves it.

:return: None

class transformer.Transformer(categories, apikey=None)

Methods that process the dataset before the word embedding model is generated.

static _find_frequent(threshold_bigram)

Reads the preprocessed files and counts bigrams.

:param threshold_bigram: minimum frequency of a bigram
:return: all bigrams that occur more than threshold_bigram times
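The counting step can be illustrated with a short standard-library sketch. It takes already tokenised documents rather than reading files, and the name find_frequent_bigrams is hypothetical; the actual method may count or store bigrams differently.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def find_frequent_bigrams(
    documents: Iterable[List[str]], threshold_bigram: int
) -> Set[Tuple[str, str]]:
    """Return bigrams occurring more than threshold_bigram times (illustrative)."""
    counts: Counter = Counter()
    for tokens in documents:
        # Pair each token with its successor; bigrams never span two documents.
        counts.update(zip(tokens, tokens[1:]))
    # Strictly greater than the threshold, matching "more than threshold_bigram times".
    return {bigram for bigram, n in counts.items() if n > threshold_bigram}
```

For example, with the documents [["new", "york", "city"], ["new", "york", "times"]] and a threshold of 1, only ("new", "york") is returned, since it is the only bigram that occurs twice.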

make_clean_sample(f, stops, stemmer, ftype='xml')

Converts raw text to clean text and generates the word_frequency dictionary.

:param f: raw text
:param stops: set of stopwords
:param stemmer: nltk stemmer
:param ftype: type of files to process
:return: processed text
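A rough sketch of the cleaning step, under several assumptions: tokenisation by lowercase alphabetic runs, no handling of ftype or XML markup, and a word_frequency counter passed in explicitly. The names clean_sample and SuffixStemmer are hypothetical; SuffixStemmer only stands in for an NLTK stemmer, which exposes the same stem(word) method.

```python
import re
from collections import Counter
from typing import Set


class SuffixStemmer:
    """Toy stand-in for an NLTK stemmer: any object with .stem(word) works."""

    def stem(self, word: str) -> str:
        # Strip a trailing "s" from words longer than three characters.
        return word[:-1] if word.endswith("s") and len(word) > 3 else word


def clean_sample(raw: str, stops: Set[str], stemmer, word_frequency: Counter) -> str:
    """Lowercase, drop stopwords, stem, and update word counts (illustrative)."""
    # Assumed tokenisation rule: keep runs of ASCII letters only.
    tokens = re.findall(r"[a-z]+", raw.lower())
    kept = [stemmer.stem(t) for t in tokens if t not in stops]
    # Frequencies are counted on the stemmed, stopword-free tokens.
    word_frequency.update(kept)
    return " ".join(kept)
```

Passing the counter in as an argument keeps the function free of hidden state, which makes it easier to test in isolation than updating a shared dictionary.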