Title: Prototype of Multiple Latent Dirichlet Allocation Runs
Description: Determine a Prototype from a number of runs of Latent Dirichlet Allocation (LDA), measuring their similarities with S-CLOP: a procedure to select the LDA run with the highest mean pairwise similarity, measured by S-CLOP (Similarity of multiple sets by Clustering with Local Pruning), to all other runs. LDA runs are specified by their assignments, which lead to estimators for the distribution parameters. Repeated runs lead to different results, which we address by choosing the most representative LDA run as Prototype.
Authors: Jonas Rieger [aut, cre]
Maintainer: Jonas Rieger <[email protected]>
License: GPL (>= 3)
Version: 0.3.1
Built: 2025-02-19 03:59:50 UTC
Source: https://github.com/jonasrieger/ldaprototype
For bug reports and feature requests please use the issue tracker:
https://github.com/JonasRieger/ldaPrototype/issues. Also have a look at
the (detailed) example at https://github.com/JonasRieger/ldaPrototype.
reuters | Example Dataset (91 articles from Reuters) for testing.
LDA | LDA objects used in this package.
as.LDARep | LDARep objects.
as.LDABatch | LDABatch objects.
getTopics | Getter for LDA objects.
getJob | Getter for LDARep and LDABatch objects.
getSimilarity | Getter for TopicSimilarity objects.
getSCLOP | Getter for PrototypeLDA objects.
getPrototype | Determine the Prototype LDA.
LDARep | Performing multiple LDAs locally (using parallelization).
LDABatch | Performing multiple LDAs on Batch Systems.
mergeTopics | Merge topic matrices from multiple LDAs.
jaccardTopics | Calculate topic similarities using the Jaccard coefficient (see Similarity Measures for other possible measures).
dendTopics | Create a dendrogram from topic similarities.
SCLOP | Determine various S-CLOP values.
pruneSCLOP | Prune TopicDendrogram objects.
cosineTopics | Cosine Similarity.
jaccardTopics | Jaccard Coefficient.
jsTopics | Jensen-Shannon Divergence.
rboTopics | Rank-biased Overlap.
getPrototype | Shortcut which includes all calculation steps.
LDAPrototype | Shortcut which performs multiple LDAs and determines their Prototype.
Maintainer: Jonas Rieger [email protected] (ORCID)
Rieger, Jonas (2020). "ldaPrototype: A method in R to get a Prototype of multiple Latent Dirichlet Allocations". Journal of Open Source Software, 5(51), 2181, doi:10.21105/joss.02181.
Rieger, Jonas, Jörg Rahnenführer and Carsten Jentsch (2020). "Improving Latent Dirichlet Allocation: On Reliability of the Novel Method LDAPrototype". In: Natural Language Processing and Information Systems, NLDB 2020. LNCS 12089, pp. 118–125, doi:10.1007/978-3-030-51310-8_11.
Rieger, Jonas, Carsten Jentsch and Jörg Rahnenführer (2022). "LDAPrototype: A Model Selection Algorithm to Improve Reliability of Latent Dirichlet Allocation". Preprint on Research Square, doi:10.21203/rs.3.rs-1486359/v1.
Useful links:
Report bugs at https://github.com/JonasRieger/ldaPrototype/issues
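A minimal end-to-end sketch of the intended workflow, pieced together from the examples in this manual and using the bundled reuters example data (the small values for n, K and num.iterations only keep the run short):

library(ldaPrototype)

# step-by-step: replicate LDA runs, merge topics, compute pairwise topic
# similarities, aggregate them to S-CLOP scores and select the Prototype run
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
jacc = jaccardTopics(topics, atLeast = 2)
sclop = SCLOP.pairwise(jacc)
proto = getPrototype(lda = getLDA(res), sclop = sclop)

# or, equivalently, the one-call shortcut
proto2 = LDAPrototype(docs = reuters_docs, vocabLDA = reuters_vocab, n = 4, K = 10, num.iterations = 30)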
Constructs an LDABatch object for the given elements reg, job and id.
as.LDABatch(reg, job, id)
is.LDABatch(obj, verbose = FALSE)
Arguments: reg, job, id, obj, verbose.
Given a Registry the function returns an LDABatch object, which can be handled using the getter functions at getJob.

[named list] with entries id for the registry's folder name, jobs for the submitted jobs' ids and their parameter settings, and reg for the registry itself.
Other constructor functions: LDA(), as.LDARep()

Other batch functions: LDABatch(), getJob(), mergeBatchTopics()
## Not run:
batch = LDABatch(docs = reuters_docs, vocab = reuters_vocab, K = 15, chunk.size = 20)
batch

batch2 = as.LDABatch(reg = getRegistry(batch))
batch2
head(getJob(batch2))

batch3 = as.LDABatch()
batch3

### one way of loading an existing registry ###
batchtools::loadRegistry("LDABatch")
batch = as.LDABatch()
## End(Not run)
Constructs an LDARep object for the given elements lda, job and id.
as.LDARep(...)

## Default S3 method:
as.LDARep(lda, job, id, ...)

## S3 method for class 'LDARep'
as.LDARep(x, ...)

is.LDARep(obj, verbose = FALSE)
Arguments: ... (additional arguments), lda, job, id, x, obj, verbose.
Given a list of LDA objects the function returns an LDARep object, which can be handled using the getter functions at getJob.

[named list] with entries id for the computation's name, jobs for the parameter settings and lda for the results themselves.
Other constructor functions: LDA(), as.LDABatch()

Other replication functions: LDAPrototype(), LDARep(), getJob(), mergeRepTopics()
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 7, num.iterations = 20)
lda = getLDA(res)

res2 = as.LDARep(lda, id = "newName")
res2
getJob(res2)
getJob(res)

## Not run:
batch = LDABatch(docs = reuters_docs, vocab = reuters_vocab, n = 4, id = "TEMP", K = 30)
res3 = as.LDARep(batch)
res3
getJob(res3)
## End(Not run)
Calculates the similarity of all pairwise topic combinations using the Cosine Similarity.
cosineTopics(topics, progress = TRUE, pm.backend, ncpus)
Arguments: topics, progress, pm.backend, ncpus.
The Cosine Similarity for two topics \(z_i\) and \(z_j\) is calculated by

\[ \cos\left(\theta \mid z_i, z_j\right) = \frac{\sum_{v=1}^{V} n_i^{(v)} n_j^{(v)}}{\sqrt{\sum_{v=1}^{V} \big(n_i^{(v)}\big)^2} \, \sqrt{\sum_{v=1}^{V} \big(n_j^{(v)}\big)^2}} \]

with \(\theta\) determining the angle between the corresponding count vectors \(z_i\) and \(z_j\), \(V\) the vocabulary size and \(n_k^{(v)}\) the count of assignments of the \(v\)-th word to the \(k\)-th topic.
[named list] with entries

sims [lower triangular named matrix] with all pairwise similarities of the given topics.

wordslimit [integer] = vocabulary size. See jaccardTopics for original purpose.

wordsconsidered [integer] = vocabulary size. See jaccardTopics for original purpose.

param [named list] with parameter type [character(1)] = "Cosine Similarity".
Other TopicSimilarity functions: dendTopics(), getSimilarity(), jaccardTopics(), jsTopics(), rboTopics()
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
cosine = cosineTopics(topics)
cosine

sim = getSimilarity(cosine)
dim(sim)
Builds a dendrogram for topics based on their pairwise similarities using the cluster algorithm hclust.
dendTopics(sims, ind, method = "complete")

## S3 method for class 'TopicDendrogram'
plot(x, pruning, pruning.par, ...)
Arguments: sims, ind, method, x (an R object), pruning, pruning.par, ... (additional arguments).
The labels' colors are determined based on the run they belong to, using rainbow_hcl by default. Colors can be manipulated using labels_colors. Analogously, the labels themselves can be manipulated using labels. For both, the function order.dendrogram is useful.

The resulting dendrogram can be plotted. In addition, it is possible to mark a pruning state in the plot, either by color or by separator lines (or both), by setting pruning.par. For the default values of pruning.par call the corresponding function on any PruningSCLOP object.
[dendrogram] TopicDendrogram object (and dendrogram object) of all considered topics.
Other plot functions: pruneSCLOP()

Other TopicSimilarity functions: cosineTopics(), getSimilarity(), jaccardTopics(), jsTopics(), rboTopics()

Other workflow functions: LDARep(), SCLOP(), getPrototype(), jaccardTopics(), mergeTopics()
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
jacc = jaccardTopics(topics, atLeast = 2)
sim = getSimilarity(jacc)

dend = dendTopics(jacc)
dend2 = dendTopics(sim)

plot(dend)
plot(dendTopics(jacc, ind = c("Rep2", "Rep3")))

pruned = pruneSCLOP(dend)
plot(dend, pruning = pruned)
plot(dend, pruning = pruned, pruning.par = list(type = "color"))
plot(dend, pruning = pruned, pruning.par = list(type = "both", lty = 1, lwd = 2, col = "red"))

dend2 = dendTopics(jacc, ind = c("Rep2", "Rep3"))
plot(dend2, pruning = pruneSCLOP(dend2), pruning.par = list(lwd = 2, col = "darkgrey"))
Returns the job ids and their parameter set (getJob) or the (registry's) id (getID) for an LDABatch or LDARep object. getRegistry returns the registry itself for an LDABatch object. getLDA returns the list of LDA objects for an LDABatch or LDARep object; in addition, you can specify one or more LDAs by their id(s). setFileDir sets the registry's file directory for an LDABatch object. This is useful if you move the registry's folder, e.g. if you do your calculations on a batch system, but want to do your evaluation on your desktop computer.
getJob(x)
getID(x)
getRegistry(x)
getLDA(x, job, reduce, all)
setFileDir(x, file.dir)
Arguments: x, job, reduce, all, file.dir.
Other getter functions: getSCLOP(), getSimilarity(), getTopics()

Other replication functions: LDAPrototype(), LDARep(), as.LDARep(), mergeRepTopics()

Other batch functions: LDABatch(), as.LDABatch(), mergeBatchTopics()
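This help topic ships without an examples section; the following is a minimal sketch of the getters applied to objects created as in the other examples of this manual (the folder name passed to setFileDir is hypothetical):

res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 20)
getJob(res) # job ids and their parameter set
getID(res) # the id given at creation (default "LDARep")
getLDA(res, 2) # a single LDA object, selected by its job id

## Not run:
# for an LDABatch object the registry itself can be accessed and relocated:
batch = LDABatch(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10)
getRegistry(batch)
setFileDir(batch, file.dir = "myMovedRegistryFolder") # hypothetical folder name
## End(Not run)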
Returns the Prototype LDA of a set of LDAs. This set is given as an LDABatch object, an LDARep object, or as a list of LDAs. If the matrix of S-CLOP scores sclop is passed, no calculation is needed or done.
getPrototype(...)

## S3 method for class 'LDARep'
getPrototype(x, vocab, limit.rel, limit.abs, atLeast, progress = TRUE,
  pm.backend, ncpus, keepTopics = FALSE, keepSims = FALSE, keepLDAs = FALSE,
  sclop, ...)

## S3 method for class 'LDABatch'
getPrototype(x, vocab, limit.rel, limit.abs, atLeast, progress = TRUE,
  pm.backend, ncpus, keepTopics = FALSE, keepSims = FALSE, keepLDAs = FALSE,
  sclop, ...)

## Default S3 method:
getPrototype(lda, vocab, id, job, limit.rel, limit.abs, atLeast, progress = TRUE,
  pm.backend, ncpus, keepTopics = FALSE, keepSims = FALSE, keepLDAs = FALSE,
  sclop, ...)
Arguments: ... (additional arguments), x, vocab, limit.rel, limit.abs, atLeast, progress, pm.backend, ncpus, keepTopics, keepSims, keepLDAs, sclop, lda, id, job.
While LDAPrototype marks the overall shortcut for performing multiple LDA runs and choosing the Prototype of them, getPrototype just hooks in at determining the Prototype. The generation of multiple LDAs has to be done before using this function. The function is flexible enough to be used at (at least) two steps of the analysis: after generating the LDAs (no matter whether as LDABatch or LDARep object) or after determining the pairwise S-CLOP values.

To save memory, a lot of interim calculations are discarded by default.

If you use parallel computation, no progress bar is shown.

For details see the details sections of the workflow functions.
[named list] with entries

id [character(1)] See above.

protoid [character(1)] Name (ID) of the determined Prototype LDA.

lda List of LDA objects of the determined Prototype LDA and - if keepLDAs is TRUE - all considered LDAs.

jobs [data.table] with parameter specifications for the LDAs.

param [named list] with parameter specifications for limit.rel [0,1], limit.abs [integer(1)] and atLeast [integer(1)]. See above for explanation.

topics [named matrix] with the count of vocabularies (row wise) in topics (column wise).

sims [lower triangular named matrix] with all pairwise Jaccard similarities of the given topics.

wordslimit [integer] with counts of words determined as relevant based on limit.rel and limit.abs.

wordsconsidered [integer] with counts of considered words for similarity calculation. Could differ from wordslimit, if atLeast is greater than zero.

sclop [symmetrical named matrix] with all pairwise S-CLOP scores of the given LDA runs.
Other shortcut functions: LDAPrototype()

Other PrototypeLDA functions: LDAPrototype(), getSCLOP()

Other workflow functions: LDARep(), SCLOP(), dendTopics(), jaccardTopics(), mergeTopics()
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
jacc = jaccardTopics(topics, atLeast = 2)
dend = dendTopics(jacc)
sclop = SCLOP.pairwise(jacc)
getPrototype(lda = getLDA(res), sclop = sclop)

proto = getPrototype(res, vocab = reuters_vocab, keepSims = TRUE, limit.abs = 20, atLeast = 10)
proto
getPrototype(proto) # = getLDA(proto)
getConsideredWords(proto) # can exceed 10 if several words tie with the 10-th most frequent word
getRelevantWords(proto)
getSCLOP(proto)
Returns the corresponding element of a PrototypeLDA object.
getSCLOP(x)

## S3 method for class 'PrototypeLDA'
getSimilarity(x)

## S3 method for class 'PrototypeLDA'
getRelevantWords(x)

## S3 method for class 'PrototypeLDA'
getConsideredWords(x)

getMergedTopics(x)

getPrototypeID(x)

## S3 method for class 'PrototypeLDA'
getLDA(x, job, reduce = TRUE, all = FALSE)

## S3 method for class 'PrototypeLDA'
getID(x)

## S3 method for class 'PrototypeLDA'
getParam(x)

## S3 method for class 'PrototypeLDA'
getJob(x)
Arguments: x, job, reduce, all.
Other getter functions: getJob(), getSimilarity(), getTopics()

Other PrototypeLDA functions: LDAPrototype(), getPrototype()
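This help topic ships without an examples section; the following is a minimal sketch of the getters applied to a PrototypeLDA object created as in the LDAPrototype examples of this manual (the keep* arguments ensure that the interim results queried below are retained):

res = LDAPrototype(docs = reuters_docs, vocabLDA = reuters_vocab, n = 4, K = 10,
  num.iterations = 30, keepTopics = TRUE, keepSims = TRUE, keepLDAs = TRUE)
getSCLOP(res) # symmetrical matrix of pairwise S-CLOP scores
getSimilarity(res) # pairwise topic similarities (kept because keepSims = TRUE)
getMergedTopics(res) # merged topic matrix (kept because keepTopics = TRUE)
getPrototypeID(res) # id of the selected Prototype run
getLDA(res) # the Prototype LDA object itself
getParam(res) # limit.rel, limit.abs and atLeast used for the similarity calculation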
Returns the corresponding element of a TopicSimilarity object.
getSimilarity(x)
getRelevantWords(x)
getConsideredWords(x)

## S3 method for class 'TopicSimilarity'
getParam(x)
Arguments: x.
Other getter functions: getJob(), getSCLOP(), getTopics()

Other TopicSimilarity functions: cosineTopics(), dendTopics(), jaccardTopics(), jsTopics(), rboTopics()
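This help topic ships without an examples section; the following is a minimal sketch of the getters applied to a TopicSimilarity object created as in the jaccardTopics examples of this manual:

res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
jacc = jaccardTopics(topics, atLeast = 2)
getSimilarity(jacc) # lower triangular matrix of pairwise topic similarities
getRelevantWords(jacc) # counts of words per topic determined as relevant (limit.rel, limit.abs)
getConsideredWords(jacc) # counts of words per topic actually used (at least atLeast)
getParam(jacc) # type, limit.rel, limit.abs and atLeast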
Returns the corresponding element of an LDA object. getEstimators computes the estimators for phi and theta.
getTopics(x)
getAssignments(x)
getDocument_sums(x)
getDocument_expects(x)
getLog.likelihoods(x)
getParam(x)
getK(x)
getAlpha(x)
getEta(x)
getNum.iterations(x)
getEstimators(x)
Arguments: x.
The estimators for phi and theta in the underlying LDA model are calculated referring to Griffiths and Steyvers (2004) by

\[ \hat{\phi}_{k,v} = \frac{n_k^{(v)} + \eta}{n_k + V\eta}, \qquad \hat{\theta}_{m,k} = \frac{n_m^{(k)} + \alpha}{n_m + K\alpha}, \]

where \(V\) is the vocabulary size, \(K\) is the number of modeled topics, \(n_k^{(v)}\) is the count of assignments of the \(v\)-th word to the \(k\)-th topic and, analogously, \(n_m^{(k)}\) is the count of assignments of the \(m\)-th text to the \(k\)-th topic; \(n_m\) is the total number of assigned tokens in text \(m\) and \(n_k\) the total number of assigned tokens to topic \(k\).
Griffiths, Thomas L. and Mark Steyvers (2004). "Finding scientific topics". In: Proceedings of the National Academy of Sciences 101 (suppl 1), pp.5228–5235, doi:10.1073/pnas.0307752101.
Other getter functions: getJob(), getSCLOP(), getSimilarity()

Other LDA functions: LDABatch(), LDARep(), LDA()
Calculates the similarity of all pairwise topic combinations using a modified Jaccard Coefficient.
jaccardTopics(topics, limit.rel, limit.abs, atLeast, progress = TRUE,
  pm.backend, ncpus)
Arguments: topics, limit.rel, limit.abs, atLeast, progress, pm.backend, ncpus.
The modified Jaccard Coefficient for two topics \(z_i\) and \(z_j\) is calculated by

\[ J_m(z_i, z_j \mid c) = \frac{\sum_{v=1}^{V} \mathbf{1}\{n_i^{(v)} > c_i \;\wedge\; n_j^{(v)} > c_j\}}{\sum_{v=1}^{V} \mathbf{1}\{n_i^{(v)} > c_i \;\vee\; n_j^{(v)} > c_j\}}, \]

where \(V\) is the vocabulary size and \(n_k^{(v)}\) is the count of assignments of the \(v\)-th word to the \(k\)-th topic. The threshold vector \(c\) is determined by the maximum threshold of the user-given lower bounds limit.rel and limit.abs. In addition, at least atLeast words per topic are considered for calculation. According to this, if fewer than atLeast words are considered as relevant after applying limit.rel and limit.abs, the atLeast most common words per topic are taken to determine topic similarities.

The procedure of determining relevant words is executed for each topic individually. The values wordslimit and wordsconsidered describe the number of relevant words per topic.
[named list] with entries

sims [lower triangular named matrix] with all pairwise Jaccard similarities of the given topics.

wordslimit [integer] with counts of words determined as relevant based on limit.rel and limit.abs.

wordsconsidered [integer] with counts of considered words for similarity calculation. Could differ from wordslimit, if atLeast is greater than zero.

param [named list] with parameter specifications for type [character(1)] = "Jaccard Coefficient", limit.rel [0,1], limit.abs [integer(1)] and atLeast [integer(1)]. See above for explanation.
Other TopicSimilarity functions: cosineTopics(), dendTopics(), getSimilarity(), jsTopics(), rboTopics()

Other workflow functions: LDARep(), SCLOP(), dendTopics(), getPrototype(), mergeTopics()
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
jacc = jaccardTopics(topics, atLeast = 2)
jacc

n1 = getConsideredWords(jacc)
n2 = getRelevantWords(jacc)
(n1 - n2)[n1 - n2 != 0]

sim = getSimilarity(jacc)
dim(sim)

# Comparison to Cosine and Jensen-Shannon (more interesting on large datasets)
cosine = cosineTopics(topics)
js = jsTopics(topics)
sims = list(jaccard = sim, cosine = getSimilarity(cosine), js = getSimilarity(js))
pairs(do.call(cbind, lapply(sims, as.vector)))
Calculates the similarity of all pairwise topic combinations using the Jensen-Shannon Divergence.
jsTopics(topics, epsilon = 1e-06, progress = TRUE, pm.backend, ncpus)
Arguments: topics, epsilon, progress, pm.backend, ncpus.
The Jensen-Shannon Similarity for two topics \(z_i\) and \(z_j\) is calculated by

\[ JS(z_i, z_j) = 1 - \frac{1}{2}\,\mathrm{KLD}\!\left(p_i, \tfrac{p_i + p_j}{2}\right) - \frac{1}{2}\,\mathrm{KLD}\!\left(p_j, \tfrac{p_i + p_j}{2}\right) = 1 - \mathrm{JSD}(p_i, p_j), \]

with \(V\) the vocabulary size, \(p_k = \big(p_k^{(1)}, \ldots, p_k^{(V)}\big)\), and \(p_k^{(v)}\) the proportion of assignments of the \(v\)-th word to the \(k\)-th topic. KLD defines the Kullback-Leibler Divergence calculated by

\[ \mathrm{KLD}(p, q) = \sum_{v=1}^{V} p_v \log \frac{p_v}{q_v}. \]

There is an epsilon added to every \(n_k^{(v)}\), the count (not proportion) of assignments, to ensure computability with respect to zeros.
[named list] with entries

sims [lower triangular named matrix] with all pairwise similarities of the given topics.

wordslimit [integer] = vocabulary size. See jaccardTopics for original purpose.

wordsconsidered [integer] = vocabulary size. See jaccardTopics for original purpose.

param [named list] with parameter specifications for type [character(1)] = "Jensen-Shannon Divergence" and epsilon [numeric(1)]. See above for explanation.
Other TopicSimilarity functions: cosineTopics(), dendTopics(), getSimilarity(), jaccardTopics(), rboTopics()
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
js = jsTopics(topics)
js

sim = getSimilarity(js)
dim(sim)

js1 = jsTopics(topics, epsilon = 1)
sim1 = getSimilarity(js1)
summary((sim1 - sim)[lower.tri(sim)])
plot(sim, sim1, xlab = "epsilon = 1e-6", ylab = "epsilon = 1")
Constructor for LDA objects used in this package.
LDA(x, param, assignments, topics, document_sums, document_expects,
  log.likelihoods)

as.LDA(x, param, assignments, topics, document_sums, document_expects,
  log.likelihoods)

is.LDA(obj, verbose = FALSE)
Arguments: x, param, obj, verbose, and the individual LDA elements assignments, topics, document_sums, document_expects and log.likelihoods.
The functions LDA and as.LDA do exactly the same. If you call LDA on an object x which already has the structure of an LDA object (in particular an LDA object itself), the additional arguments param, assignments, ... may be used to override the specific elements.
[named list] LDA object.
Other constructor functions: as.LDABatch(), as.LDARep()

Other LDA functions: LDABatch(), LDARep(), getTopics()
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 1, K = 10)
lda = getLDA(res)

LDA(lda) # does not change anything
LDA(lda, assignments = NULL) # creates a new LDA object without the assignments element
LDA(param = getParam(lda), topics = getTopics(lda)) # creates a new LDA object with elements param and topics
Performs multiple runs of Latent Dirichlet Allocation on a batch system using the batchtools package.
LDABatch(docs, vocab, n = 100, seeds, id = "LDABatch", load = FALSE,
  chunk.size = 1, resources, ...)
Arguments: docs, vocab, n, seeds, id, load, chunk.size, resources, ... (additional arguments).
The function generates multiple LDA runs with the possibility of using a batch system. The integration is done by the batchtools package. After all jobs of the corresponding registry are terminated, the whole registry can be ported to your local computer for further analysis.

The function returns an LDABatch object. You can retrieve results and all other elements of this object with getter functions (see getJob).
[named list] with entries id for the registry's folder name, jobs for the submitted jobs' ids and their parameter settings, and reg for the registry itself.
Other batch functions: as.LDABatch(), getJob(), mergeBatchTopics()

Other LDA functions: LDARep(), LDA(), getTopics()
## Not run:
batch = LDABatch(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 15)
batch
getRegistry(batch)
getJob(batch)
getLDA(batch, 2)

batch2 = LDABatch(docs = reuters_docs, vocab = reuters_vocab, K = 15, chunk.size = 20)
batch2
head(getJob(batch2))
## End(Not run)
Performs multiple runs of LDA and computes the Prototype LDA of this set of LDAs.
LDAPrototype(docs, vocabLDA, vocabMerge = vocabLDA, n = 100, seeds,
  id = "LDARep", pm.backend, ncpus, limit.rel, limit.abs, atLeast,
  progress = TRUE, keepTopics = FALSE, keepSims = FALSE, keepLDAs = FALSE, ...)
Arguments: docs, vocabLDA, vocabMerge, n, seeds, id, pm.backend, ncpus, limit.rel, limit.abs, atLeast, progress, keepTopics, keepSims, keepLDAs, ... (additional arguments).
While LDAPrototype marks the overall shortcut for performing multiple LDA runs and choosing the Prototype of them, getPrototype just hooks in at determining the Prototype. The generation of multiple LDAs has to be done before using getPrototype.

To save memory, a lot of interim calculations are discarded by default.

If you use parallel computation, no progress bar is shown.

For details see the details sections of the workflow functions at getPrototype.
[named list] with entries

id [character(1)] See above.

protoid [character(1)] Name (ID) of the determined Prototype LDA.

lda List of LDA objects of the determined Prototype LDA and - if keepLDAs is TRUE - all considered LDAs.

jobs [data.table] with parameter specifications for the LDAs.

param [named list] with parameter specifications for limit.rel [0,1], limit.abs [integer(1)] and atLeast [integer(1)]. See above for explanation.

topics [named matrix] with the count of vocabularies (row wise) in topics (column wise).

sims [lower triangular named matrix] with all pairwise Jaccard similarities of the given topics.

wordslimit [integer] with counts of words determined as relevant based on limit.rel and limit.abs.

wordsconsidered [integer] with counts of considered words for similarity calculation. Could differ from wordslimit, if atLeast is greater than zero.

sclop [symmetrical named matrix] with all pairwise S-CLOP scores of the given LDA runs.
Other shortcut functions: getPrototype()

Other PrototypeLDA functions: getPrototype(), getSCLOP()

Other replication functions: LDARep(), as.LDARep(), getJob(), mergeRepTopics()
res = LDAPrototype(docs = reuters_docs, vocabLDA = reuters_vocab, n = 4, K = 10, num.iterations = 30)
res
getPrototype(res) # = getLDA(res)
getSCLOP(res)

res = LDAPrototype(docs = reuters_docs, vocabLDA = reuters_vocab, n = 4, K = 10,
  num.iterations = 30, keepLDAs = TRUE)
res
getLDA(res, all = TRUE)
getPrototypeID(res)
getParam(res)
Performs multiple runs of Latent Dirichlet Allocation.
LDARep(docs, vocab, n = 100, seeds, id = "LDARep", pm.backend, ncpus, ...)
Arguments: docs, vocab, n, seeds, id, pm.backend, ncpus, ... (additional arguments).
The function generates multiple LDA runs with the possibility of using parallelization. The integration is done by the parallelMap package.

The function returns an LDARep object. You can retrieve results and all other elements of this object with getter functions (see getJob).
[named list] with entries id for the computation's name, jobs for the parameter settings and lda for the results themselves.
Other replication functions: LDAPrototype(), as.LDARep(), getJob(), mergeRepTopics()

Other LDA functions: LDABatch(), LDA(), getTopics()

Other workflow functions: SCLOP(), dendTopics(), getPrototype(), jaccardTopics(), mergeTopics()
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, seeds = 1:4,
  id = "myComputation", K = 7:10, alpha = 1, eta = 0.01, num.iterations = 20)
res
getJob(res)
getID(res)
getLDA(res, 4)

LDARep(docs = reuters_docs, vocab = reuters_vocab, K = 10, num.iterations = 100,
  pm.backend = "socket")
Collects LDA results from a given registry and merges their topic matrices for a given set of vocabularies.
mergeBatchTopics(...)

## S3 method for class 'LDABatch'
mergeBatchTopics(x, vocab, progress = TRUE, ...)

## Default S3 method:
mergeBatchTopics(vocab, reg, job, id, progress = TRUE, ...)
Arguments: ... (additional arguments), x, vocab, progress, reg, job, id.
For details and examples see mergeTopics.

[named matrix] with the count of vocabularies (row wise) in topics (column wise).
Other merge functions: mergeRepTopics(), mergeTopics()

Other batch functions: LDABatch(), as.LDABatch(), getJob()
Collects LDA results from a list of replicated runs and merges their topic matrices for a given set of vocabularies.
mergeRepTopics(...)

## S3 method for class 'LDARep'
mergeRepTopics(x, vocab, progress = TRUE, ...)

## Default S3 method:
mergeRepTopics(lda, vocab, id, progress = TRUE, ...)
Arguments: ... (additional arguments), x, vocab, progress, lda, id.
For details and examples see mergeTopics.

[named matrix] with the count of vocabularies (row wise) in topics (column wise).
Other merge functions: mergeBatchTopics(), mergeTopics()

Other replication functions: LDAPrototype(), LDARep(), as.LDARep(), getJob()
Generic function, which collects LDA results and merges their topic matrices for a given set of vocabularies.
mergeTopics(x, vocab, progress = TRUE)
Arguments: x, vocab, progress.
This function uses the function mergeRepTopics or mergeBatchTopics. The topic matrices are transposed and cbinded, so that the resulting matrix contains the counts of vocabularies/words (row wise) in topics (column wise).

[named matrix] with the count of vocabularies (row wise) in topics (column wise).
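Conceptually (this is not the package implementation, only an illustration), the merging corresponds roughly to the following sketch, assuming every word in the given vocab appears as a column name of each run's topic matrix:

res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
# rough equivalent of mergeTopics(res, vocab = reuters_vocab):
merged = do.call(cbind, lapply(getLDA(res), function(lda) {
  counts = getTopics(lda) # K x V matrix of word-topic assignment counts
  t(counts[, reuters_vocab, drop = FALSE]) # keep the requested vocabulary, transpose to V x K
}))
dim(merged) # vocabulary size x (number of runs * K)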
Other merge functions: mergeBatchTopics(), mergeRepTopics()

Other workflow functions: LDARep(), SCLOP(), dendTopics(), getPrototype(), jaccardTopics()
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
dim(topics)
length(reuters_vocab)

## Not run:
res = LDABatch(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
dim(topics)
length(reuters_vocab)
## End(Not run)
The function SCLOP calculates the S-CLOP value for the best possible local pruning state of a dendrogram from dendTopics. The function pruneSCLOP supplies the corresponding pruning state itself.
pruneSCLOP(dend)

## S3 method for class 'PruningSCLOP'
plot(x, dend, pruning.par, ...)

pruning.par(pruning)
Arguments: dend, x (an R object), pruning.par, ... (additional arguments), pruning.
For details of computing the S-CLOP values see SCLOP. For details and examples of plotting the pruning state see dendTopics.

[list of dendrograms] PruningSCLOP object specifying the best possible local pruning state.
Other plot functions: dendTopics()

Other SCLOP functions: SCLOP()
Calculates the similarity of all pairwise topic combinations using the rank-biased overlap (RBO) Similarity.
rboTopics(topics, k, p, progress = TRUE, pm.backend, ncpus)
Arguments: topics, k, p, progress, pm.backend, ncpus.
The RBO Similarity for two topics \(z_i\) and \(z_j\) is calculated by

\[ RBO(z_i, z_j \mid k, p) = \frac{\sum_{d=1}^{k} p^{d-1} \, \frac{2\,\left|Z_i^{(d)} \cap Z_j^{(d)}\right|}{\left|Z_i^{(d)}\right| + \left|Z_j^{(d)}\right|}}{\sum_{d=1}^{k} p^{d-1}}, \]

where \(Z_i^{(d)}\) is the vocabulary set of topic \(z_i\) down to rank \(d\). Ties in ranks are resolved by taking the minimum.

The value wordsconsidered describes the number of words per topic ranked at rank \(k\) or above.
[named list] with entries

sims [lower triangular named matrix] with all pairwise similarities of the given topics.

wordslimit [integer] = vocabulary size. See jaccardTopics for original purpose.

wordsconsidered [integer] = vocabulary size. See jaccardTopics for original purpose.

param [named list] with parameter type [character(1)] = "RBO Similarity", k [integer(1)] and p [0,1]. See above for explanation.
Webber, William, Alistair Moffat and Justin Zobel (2010). "A similarity measure for indefinite rankings". In: ACM Transactions on Information Systems 28(4), pp. 20:1-20:38, doi:10.1145/1852102.1852106.
Other TopicSimilarity functions: cosineTopics(), dendTopics(), getSimilarity(), jaccardTopics(), jsTopics()
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
rbo = rboTopics(topics, k = 12, p = 0.9)
rbo

sim = getSimilarity(rbo)
dim(sim)
Example Dataset from Reuters consisting of 91 articles. It can be used to familiarize with the bunch of functions offered by this package.
data(reuters_docs)
data(reuters_vocab)
reuters_docs is a list of documents of length 91 prepared by LDAprep.

reuters_vocab is an object of class character of length 2141.
Lewis, David (1997). Reuters-21578 Text Categorization Collection Distribution 1.0. http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
Luz, Saturnino. XML-encoded version of Reuters-21578. http://ronaldo.cs.tcd.ie/esslli07/data/reuters21578-xml/ (temporarily unavailable)
The function SCLOP calculates the S-CLOP value for the best possible local pruning state of a dendrogram from dendTopics. The function pruneSCLOP supplies the corresponding pruning state itself.

To get the S-CLOP scores of all pairs of LDA runs, the function SCLOP.pairwise can be used. It returns a matrix of the pairwise S-CLOP scores.

All three functions use the function disparitySum to calculate the least possible sum of disparities (on the best possible local pruning state) on a given dendrogram.
SCLOP(dend)
disparitySum(dend)
SCLOP.pairwise(sims)
Arguments: dend, sims.
For one specific cluster \(g\) and \(R\) LDA runs the disparity is calculated by

\[ U(g) := \frac{1}{R} \sum_{r=1}^{R} \left| t_r^{(g)} - 1 \right| \cdot \sum_{r=1}^{R} t_r^{(g)}, \]

where \(t^{(g)} = \big(t_1^{(g)}, \ldots, t_R^{(g)}\big)\) contains the number of topics that belong to the different LDA runs and that occur in cluster \(g\). For example, a cluster containing exactly one topic from each of \(R = 3\) runs yields \(U(g) = 0\), while a cluster containing two topics from one run and none from the other two yields \(U(g) = \frac{1}{3}(1+1+1) \cdot 2 = 2\).

The function disparitySum returns the least possible sum of disparities \(U_{\Sigma}(G^*)\) for the best possible pruning state \(G^*\) with \(U_{\Sigma}(G) = \sum_{g \in G} U(g) \to \min\). The highest possible value for \(U_{\Sigma}\) is limited by

\[ U_{\Sigma,\max} := \sum_{g \in G_{\max}} U(g) = N \cdot \frac{R-1}{R}, \]

where \(N\) is the total number of topics and \(G_{\max}\) denotes the corresponding worst case pruning state. This worst case scenario is useful for normalizing the S-CLOP scores.

The function SCLOP then calculates the value

\[ \text{S-CLOP}(G^*) := 1 - \frac{1}{U_{\Sigma,\max}} \cdot \sum_{g \in G^*} U(g) \in [0,1], \]

where \(\sum_{g \in G^*} U(g) = U_{\Sigma}(G^*)\).
SCLOP: [0,1] value specifying the S-CLOP for the best possible local pruning state of the given dendrogram.

disparitySum: [numeric(1)] value specifying the least possible sum of disparities on the given dendrogram.

SCLOP.pairwise: [symmetrical named matrix] with all pairwise S-CLOP scores of the given LDA runs.
Other SCLOP functions: pruneSCLOP()

Other workflow functions: LDARep(), dendTopics(), getPrototype(), jaccardTopics(), mergeTopics()
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
jacc = jaccardTopics(topics, atLeast = 2)
dend = dendTopics(jacc)

SCLOP(dend)
disparitySum(dend)

SCLOP.pairwise(jacc)
SCLOP.pairwise(getSimilarity(jacc))