MedEmbed 0.2 documentation

class dataset.DataSet(directory, verbose, categories)

Holds the dataset and the methods that operate on it.

__weakref__: list of weak references to the object (if defined)

static _make_categories(categories): Builds the list of categories to extract from a raw document.
    :return: the category list, or None if all categories are extracted
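The source does not show how the category list is built. As a minimal sketch, assuming `categories` arrives as None, a comma-separated string, or an iterable of names (all assumptions), the "None means extract everything" convention could look like this. `make_categories_sketch` is a hypothetical name, not the library's function.

```python
from typing import Iterable, List, Optional, Union


def make_categories_sketch(
    categories: Union[None, str, Iterable[str]]
) -> Optional[List[str]]:
    """Hypothetical stand-in for DataSet._make_categories."""
    # None means "extract every category".
    if categories is None:
        return None
    # Assumption: a string holds comma-separated category names.
    if isinstance(categories, str):
        categories = categories.split(",")
    # Drop surrounding whitespace and empty entries.
    cleaned = [c.strip() for c in categories if c.strip()]
    # An empty selection is treated like None (also an assumption).
    return cleaned or None
```

Returning None rather than an empty list lets callers tell "no filter" apart from "filter that matches nothing" with a single `is None` check.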

static iter_documents(): Generator that iterates over all relevant documents.
    :return: yields one document (a list of UTF-8 tokens) at a time
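To illustrate the one-document-at-a-time pattern, here is a minimal sketch of such a generator. It assumes that documents are UTF-8 `.txt` files in a single directory and that tokens are whitespace-separated. Neither assumption comes from the source, and `iter_documents_sketch` and its `directory` argument are hypothetical.

```python
import os
import tempfile
from typing import Iterator, List


def iter_documents_sketch(directory: str) -> Iterator[List[str]]:
    """Yield one document at a time as a list of tokens (hypothetical)."""
    for name in sorted(os.listdir(directory)):
        # Assumption: each relevant document is a UTF-8 ".txt" file.
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(directory, name), encoding="utf-8") as handle:
            # Whitespace tokenisation keeps the example simple.
            yield handle.read().split()


# Usage: build a throwaway corpus and stream it lazily.
with tempfile.TemporaryDirectory() as tmp:
    samples = [("a.txt", "chest pain"), ("b.txt", "no fever"), ("skip.log", "x")]
    for fname, text in samples:
        with open(os.path.join(tmp, fname), "w", encoding="utf-8") as handle:
            handle.write(text)
    docs = list(iter_documents_sketch(tmp))
```

Because the function yields instead of returning a list, only one document is held in memory at a time, which matters when the corpus is larger than RAM.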

preprocess(): Calls the pre-processing methods and prints progress if verbose is set.
    :return: None

class embedding.Embedding(verbose)

Methods to generate and evaluate word embedding vectors.

__weakref__: list of weak references to the object (if defined)

generate(model_type, dim, workers): Trains a word embedding model and saves it to file.
    :param model_type: ‘word2vec’ or ‘fasttext’
    :param dim: number of dimensions of the word embedding vectors
    :param workers: number of workers used to parallelise training of the word embedding model
    :return: None
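The source does not say which library trains the model. The sketch below assumes gensim 4.x, where the embedding dimension is passed as `vector_size`, and shows how the three documented parameters might map onto it. `generate_sketch`, `sentences`, and `out_path` are hypothetical names introduced for illustration.

```python
from typing import Iterable, List


def generate_sketch(
    sentences: Iterable[List[str]],
    model_type: str,
    dim: int,
    workers: int,
    out_path: str,
) -> None:
    """Train an embedding model and save it (hypothetical; assumes gensim 4.x)."""
    if model_type not in ("word2vec", "fasttext"):
        raise ValueError(f"unknown model_type: {model_type!r}")
    # Imported lazily so argument validation works without gensim installed.
    from gensim.models import FastText, Word2Vec

    model_cls = Word2Vec if model_type == "word2vec" else FastText
    # vector_size is the gensim 4.x name for the embedding dimension.
    model = model_cls(
        sentences=list(sentences), vector_size=dim, workers=workers, min_count=1
    )
    model.save(out_path)
```

Validating `model_type` before any training starts means a typo fails immediately instead of after the corpus has been read.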

tSNE(model_file=None): Creates a t-SNE model, plots it, and saves it.
    :return: None

class transformer.Transformer(categories, apikey=None)

Methods that process the dataset before the word embedding model is generated.

__weakref__: list of weak references to the object (if defined)

static _find_frequent(threshold_bigram): Reads the preprocessed files and counts bigrams.
    :param threshold_bigram: minimum frequency of a bigram
    :return: all bigrams that occur more than threshold_bigram times
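The counting-and-thresholding step can be sketched with `collections.Counter`. This version takes already tokenised documents as an argument instead of reading preprocessed files, which is a simplification. `find_frequent_sketch` and `documents` are hypothetical names.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def find_frequent_sketch(
    documents: Iterable[List[str]], threshold_bigram: int
) -> Set[Tuple[str, str]]:
    """Return bigrams occurring more than threshold_bigram times (hypothetical)."""
    counts: Counter = Counter()
    for tokens in documents:
        # zip pairs each token with its successor, giving adjacent bigrams.
        counts.update(zip(tokens, tokens[1:]))
    # Strictly greater than the threshold, matching the documented return value.
    return {bigram for bigram, n in counts.items() if n > threshold_bigram}


docs = [
    ["blood", "pressure", "high"],
    ["blood", "pressure", "normal"],
    ["high", "blood", "sugar"],
]
frequent = find_frequent_sketch(docs, threshold_bigram=1)
```

Here only `("blood", "pressure")` appears more than once, so it is the only bigram returned.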

make_clean_sample(f, stops, stemmer, ftype='xml'): Converts raw text to clean text and generates the word_frequency dictionary.
    :param f: raw text
    :param stops: set of stopwords
    :param stemmer: nltk stemmer
    :param ftype: type of files to process
    :return: processed text
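A cleaning pass of this kind might look like the sketch below. The actual cleaning rules are not given in the source, so tag stripping, lowercasing, and letters-only tokenisation are all assumptions. The toy `SuffixStemmer` stands in for an nltk stemmer, sharing only its `.stem(word)` method, and `make_clean_sample_sketch` is a hypothetical name.

```python
import re
from collections import Counter
from typing import Set


class SuffixStemmer:
    """Toy stand-in exposing the .stem(word) method that nltk stemmers provide."""

    def stem(self, word: str) -> str:
        return word[:-1] if word.endswith("s") and len(word) > 3 else word


# Running word counts, updated on every call (assumed to be module-level state).
word_frequency: Counter = Counter()


def make_clean_sample_sketch(
    f: str, stops: Set[str], stemmer, ftype: str = "xml"
) -> str:
    """Clean raw text and update word_frequency (hypothetical behaviour)."""
    if ftype == "xml":
        # Assumption: XML input only needs its tags stripped.
        f = re.sub(r"<[^>]+>", " ", f)
    # Lowercase, keep alphabetic tokens, drop stopwords, then stem.
    tokens = re.findall(r"[a-z]+", f.lower())
    cleaned = [stemmer.stem(t) for t in tokens if t not in stops]
    word_frequency.update(cleaned)
    return " ".join(cleaned)


clean = make_clean_sample_sketch(
    "<p>The patients have fevers</p>", {"the", "have"}, SuffixStemmer()
)
```

Any object with a `.stem(word)` method can be passed as `stemmer`, so a real nltk stemmer could replace the toy class without changing the function.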