Package 'rollinglda' reference manual

Title:	Construct Consistent Time Series from Textual Data
Description:	A rolling version of the Latent Dirichlet Allocation, see Rieger et al. (2021) <doi:10.18653/v1/2021.findings-emnlp.201>. By a sequential approach, it enables the construction of LDA-based time series of topics that are consistent with previous states of LDA models. After an initial modeling, updates can be computed efficiently, allowing for real-time monitoring and detection of events or structural breaks.
Authors:	Jonas Rieger [aut, cre]
Maintainer:	Jonas Rieger <[email protected]>
License:	GPL (>= 3)
Version:	0.1.3
Built:	2025-03-22 05:45:31 UTC
Source:	https://github.com/jonasrieger/rollinglda

rollinglda: Construct Consistent Time Series from Textual Data

Description

RollingLDA is a rolling version of the Latent Dirichlet Allocation (LDA). By a sequential approach, it enables the construction of LDA-based time series of topics that are consistent with previous states of LDA models. After an initial modeling, updates can be computed efficiently, allowing for real-time monitoring and detection of events or structural breaks.
For bug reports and feature requests please use the issue tracker: https://github.com/JonasRieger/rollinglda/issues. Also have a look at the (detailed) example at https://github.com/JonasRieger/rollinglda.

Data

economy Example Dataset (576 articles from Wikinews) for testing.

Constructor

as.RollingLDA RollingLDA objects used in this package.

Getter

getChunks Getter for RollingLDA objects.

Modeling

RollingLDA Performing the method from scratch.
updateRollingLDA Performing updates on RollingLDA objects.

Author(s)

Maintainer: Jonas Rieger [email protected] (ORCID)

References

Rieger, Jonas, Carsten Jentsch and Jörg Rahnenführer (2021). "RollingLDA: An Update Algorithm of Latent Dirichlet Allocation to Construct Consistent Time Series from Textual Data". EMNLP Findings 2021. URL doi:10.18653/v1/2021.findings-emnlp.201.

RollingLDA Object

Description

Constructor for RollingLDA objects used in this package. The function may be useful to create a RollingLDA object out of a standard LDA object to use it as initial model and update it using updateRollingLDA.

Usage

as.RollingLDA(x, id, lda, docs, dates, vocab, chunks, param)

is.RollingLDA(obj, verbose = FALSE)
as.RollingLDA(x, id, lda, docs, dates, vocab, chunks, param)

is.RollingLDA(obj, verbose = FALSE)

Arguments

`x`	[`named list`] `RollingLDA` object. Alternatively each element can be passed for individual results. Individually set elements overwrite elements from `x`.
`id`	[`character(1)`] Name for the computation/model.
`lda`	[`named list`] `LDA` object.
`docs`	[`named list`] Texts in a preprocessed format. See `LDAprep`.
`dates`	[`(un)named Date`] Dates of the texts. If unnamed, it must match the order of docs.
`vocab`	[`character`] Vocabularies.
`chunks`	[`data.table`] with specifications for each model chunk `chunk.id` [`integer`] Index counting up starting with `0`. `start.date` [`Date`] Minimum of each chunk's dates. `end.date` [`Date`] Maximum of each chunk's dates. `memory` [`Date`] Date from which texts are considered as memory. `n` [`integer`] Number of fitted texts. `n.dicsarded` [`integer`] Number of lost texts through preprocessing. `n.memory` [`integer`] Number of texts considered as memory. `n.vocab` [`integer`] Number of vocabularies (monotonously increasing). If not passed, `lda` is interpreted as initialization chunk.
`param`	[`named list(4)`] Parameters of the object, i.e. parameters for future updates fitted on the to be created model. List always should contain names "vocab.abs", "vocab.rel", "vocab.fallback" and "doc.abs".
`obj`	[`R` object] Object to test.
`verbose`	[`logical(1)`] Should test information be given in the console?

Details

If you call as.RollingLDA on an object x which already is of the structure of an RollingLDA object (in particular a RollingLDA object itself), the additional arguments id, param, ... may be used to override the specific elements.

Value

[named list] RollingLDA object.

Examples

roll_lda = RollingLDA(texts = economy_texts,
                      dates = economy_dates,
                      chunks = "quarter",
                      memory = "3 quarter",
                      init = "2008-07-03",
                      K = 10,
                      type = "lda")

is.RollingLDA(roll_lda, verbose = TRUE)
getID(roll_lda)
roll_lda = as.RollingLDA(roll_lda, id = "newID")
getID(roll_lda)

roll_lda = RollingLDA(texts = economy_texts,
                      dates = economy_dates,
                      chunks = "quarter",
                      memory = "3 quarter",
                      init = "2008-07-03",
                      K = 10,
                      type = "lda")

is.RollingLDA(roll_lda, verbose = TRUE)
getID(roll_lda)
roll_lda = as.RollingLDA(roll_lda, id = "newID")
getID(roll_lda)

A Snippet of the Economy Dataset from toscaData

Description

Example Dataset from Wikinews consisting of 576 articles. It can be used to familiarize with the functions offered by this package.

Usage

data(economy_texts)

data(economy_dates)
data(economy_texts)

data(economy_dates)

Format

economy_texts is a named list of tokenized texts of length 576.

economy_dates is

An object of class Date of length 576.

Source

https://github.com/Docma-TU/toscaData

Getter for RollingLDA

Description

Returns the corresponding element of a RollingLDA object.

Usage

getChunks(x)

getNames(x)

getDates(x, names, inverse)

getDocs(x, names, inverse)

getVocab(x)

## S3 method for class 'RollingLDA'
getLDA(x, job, reduce, all)

## S3 method for class 'RollingLDA'
getID(x)

## S3 method for class 'RollingLDA'
getParam(x)
getChunks(x)

getNames(x)

getDates(x, names, inverse)

getDocs(x, names, inverse)

getVocab(x)

## S3 method for class 'RollingLDA'
getLDA(x, job, reduce, all)

## S3 method for class 'RollingLDA'
getID(x)

## S3 method for class 'RollingLDA'
getParam(x)

Arguments

`x`	[`named list`] `RollingLDA` object.
`names`	[`character`] Names of the requested items (dates or docs). Default are all names.
`inverse`	[`logical(1)`] Should all items except those with the given names be returned? Default is `FALSE`.
`job`	not implemented for `RollingLDA` object. See `getLDA`
`reduce`	not implemented for `RollingLDA` object. See `getLDA`
`all`	not implemented for `RollingLDA` object. See `getLDA`

Value

The requested element of a RollingLDA object.

RollingLDA

Description

Performs a rolling version of Latent Dirichlet Allocation.

Usage

RollingLDA(...)

## Default S3 method:
RollingLDA(
  texts,
  dates,
  chunks,
  memory,
  vocab.abs = 5L,
  vocab.rel = 0,
  vocab.fallback = 100L,
  doc.abs = 0L,
  memory.fallback = 0L,
  init,
  type = c("ldaprototype", "lda"),
  id,
  ...
)
RollingLDA(...)

## Default S3 method:
RollingLDA(
  texts,
  dates,
  chunks,
  memory,
  vocab.abs = 5L,
  vocab.rel = 0,
  vocab.fallback = 100L,
  doc.abs = 0L,
  memory.fallback = 0L,
  init,
  type = c("ldaprototype", "lda"),
  id,
  ...
)

Arguments

`...`	additional arguments passed to `LDARep` or `LDAPrototype`, respectively. Default parameters are `alpha = eta = 1/K` and `num.iterations = 200`. There is no default for `K`.
`texts`	[`named list`] Tokenized texts.
`dates`	[`(un)named Date`] Dates of the tokenized texts. If unnamed, it must match the order of texts.
`chunks`	[`Date` or `character(1)`] Sorted dates of the beginnings of each chunk to be modeled after the initial model. If passed as `character`, dates are determined by passing `init` plus one day as `from` argument, `max(dates)` as `to` argument and `chunks` as `by` argument in `seq.Date`.
`memory`	[`Date`, `character(1)` or `integer(1)`] Sorted dates of the beginnings of each chunk's memory. If passed as `character`, dates are determined by using the dates of the beginnings of each chunk and substracting the given time interval in `memory` passing it as `by` argument in `seq.Date`. If passed as `integer/numeric`, the dates are determined by going backwards the modeled texts chronologically and taking the date of the text at position `memory`.
`vocab.abs`	[`integer(1)`] An absolute lower bound limit for which words are taken into account. All words are considered in the vocabularies that have a count higher than `vocab.abs` over all texts and at the same time a higher relative frequency than `vocab.rel`. Default is 5.
`vocab.rel`	[0,1] A relative lower bound limit for which words are taken into account. See also `vocab.abs`. Default is 0.
`vocab.fallback`	[`integer(1)`] An absolute lower bound limit for which words are taken into account. All words are considered in the vocabularies that have a count higher than `vocab.fallback` over all texts even if they might not have a higher relative frequency than `vocab.rel`. Default is 100.
`doc.abs`	[`integer(1)`] An absolute lower bound limit for which texts are taken into account. All texts are considered for modeling that have more words (subsetted to words occurring in the vocabularies) than `doc.abs`. Default is 0.
`memory.fallback`	[`integer(1)`] If there are no texts as memory in a certain chunk, `memory` is determined by going backwards the modeled texts chronologically and taking the date of the text at position `memory.fallback`. Default is 0, which means "end the fitting".
`init`	[`Date(1)` or `integer(1)`] Date up to which the initial model should be computed. This parameter is needed/used only if `chunks` is passed as `character`. Otherwise the initial model is computed up to the first date in `chunks` minus one day. If `init` is passed as `integer/numeric`, the `init` lowest date from `dates` is selected.
`type`	[`character(1)`] One of "ldaPrototype" or "lda" specifying whether a LDAProtoype or standard LDA should be modeled as initial model. Default is "ldaprototype".
`id`	[`character(1)`] Name for the computation/model.

Details

The function first computes a initial LDA model (using LDARep or LDAPrototype). Afterwards it models temporal chunks of texts with a specified memory for initialization of each model chunk.

The function returns a RollingLDA object. You can receive results and all other elements of this object with getter functions (see getChunks).

Value

[named list] with entries

id: [character(1)] See above.
lda: LDA object of the fitted RollingLDA.
docs: [named list] with modeled texts in a preprocessed format. See LDAprep.
dates: [named Date] with dates of the modeled texts.
vocab: [character] with the vocabularies considered for modeling.
chunks: [data.table] with specifications for each model chunk.
param: [named list] with parameter specifications for vocab.abs [integer(1)], vocab.rel [0,1], vocab.fallback [integer(1)] and doc.abs [integer(1)]. See above for explanation.

Examples

roll_lda = RollingLDA(texts = economy_texts,
                      dates = economy_dates,
                      chunks = "quarter",
                      memory = "3 quarter",
                      init = "2008-07-03",
                      K = 10,
                      type = "lda")

roll_lda
getChunks(roll_lda)
getLDA(roll_lda)


roll_proto = RollingLDA(texts = economy_texts,
                        dates = economy_dates,
                        chunks = "quarter",
                        memory = "3 quarter",
                        init = "2007-07-03",
                        K = 10,
                        n = 12,
                        pm.backend = "socket",
                        ncpus = 2)

roll_proto
getChunks(roll_proto)
getLDA(roll_proto)


roll_lda = RollingLDA(texts = economy_texts,
                      dates = economy_dates,
                      chunks = "quarter",
                      memory = "3 quarter",
                      init = "2008-07-03",
                      K = 10,
                      type = "lda")

roll_lda
getChunks(roll_lda)
getLDA(roll_lda)


roll_proto = RollingLDA(texts = economy_texts,
                        dates = economy_dates,
                        chunks = "quarter",
                        memory = "3 quarter",
                        init = "2007-07-03",
                        K = 10,
                        n = 12,
                        pm.backend = "socket",
                        ncpus = 2)

roll_proto
getChunks(roll_proto)
getLDA(roll_proto)

Updating an existing RollingLDA object

Description

Performs an update of an existing object consisting of a rolling version of Latent Dirichlet Allocation.

Usage

updateRollingLDA(
  x,
  texts,
  dates,
  chunks,
  memory,
  param = getParam(x),
  compute.topics = TRUE,
  memory.fallback = 0L,
  ...
)

## S3 method for class 'RollingLDA'
RollingLDA(
  x,
  texts,
  dates,
  chunks,
  memory,
  param = getParam(x),
  compute.topics = TRUE,
  memory.fallback = 0L,
  ...
)
updateRollingLDA(
  x,
  texts,
  dates,
  chunks,
  memory,
  param = getParam(x),
  compute.topics = TRUE,
  memory.fallback = 0L,
  ...
)

## S3 method for class 'RollingLDA'
RollingLDA(
  x,
  texts,
  dates,
  chunks,
  memory,
  param = getParam(x),
  compute.topics = TRUE,
  memory.fallback = 0L,
  ...
)

Arguments

`x`	[`named list`] `RollingLDA` object.
`texts`	[`named list`] Tokenized texts.
`dates`	[`(un)named Date`] Sorted dates of the tokenized texts. If unnamed, it must match the order of texts.
`chunks`	[`Date` or `character(1)`] Sorted dates of the beginnings of each chunk to be modeled as updates. If passed as `character`, dates are determined by passing the minimum of `dates` as `from` argument, `max(dates)` as `to` argument and `chunks` as `by` argument in `seq.Date`. If not passed, all texts are interpreted as one chunk.
`memory`	[`Date`, `character(1)` or `integer(1)`] Dates of the beginnings of each chunk's memory. If passed as `character`, dates are determined by using the dates of the beginnings of each chunk and substracting the given time interval in `memory` passing it as `by` argument in `seq.Date`. If passed as `integer/numeric`, the dates are determined by going backwards the modeled texts chronologically and taking the date of the text at position `memory`.
`param`	[`named list`] with entries (Default is `getParam(x)`) `vocab.abs` [`integer(1)`] An absolute lower bound limit for which words are taken into account. All words are considered in the vocabularies that have a count higher than `vocab.abs` over all texts and at the same time a higher relative frequency than `vocab.rel`. `vocab.rel` [0,1] A relative lower bound limit for which words are taken into account. See also `vocab.abs`. `vocab.fallback` [`integer(1)`] An absolute lower bound limit for which words are taken into account. All words are considered in the vocabularies that have a count higher than `vocab.fallback` over all texts even if they might not have a higher relative frequency than `vocab.rel`. `doc.abs` [`integer(1)`] An absolute lower bound limit for which texts are taken into account. All texts are considered for modeling that have more words (subsetted to words occurring in the vocabularies) than `doc.abs`.
`compute.topics`	[`logical(1)`] Should the topic matrix of the LDA model be computed? Default is `TRUE`.
`memory.fallback`	[`integer(1)`] If there are no texts as memory in a certain chunk, `memory` is determined by going backwards the modeled texts chronologically and taking the date of the text at position `memory.fallback`. Default is 0, which means "end the fitting".
`...`	not implemented

Details

The function uses an existing RollingLDA object and models new texts with a specified memory as initialization of the new LDA chunk.

The function returns a RollingLDA object. You can receive results and all other elements of this object with getter functions (see getChunks).

Value

[named list] with entries

id: [character(1)] See above.
lda: LDA object of the fitted RollingLDA.
docs: [named list] with modeled texts in a preprocessed format. See LDAprep
dates: [named Date] with dates of the modeled texts.
vocab: [character] with the vocabularies considered for modeling.
chunks: [data.table] with specifications for each model chunk.
param: [named list] with parameter specifications for vocab.abs [integer(1)], vocab.rel [0,1], vocab.fallback [integer(1)] and doc.abs [integer(1)]. See above for explanation.

Examples


roll_lda = RollingLDA(texts = economy_texts[economy_dates < "2008-05-01"],
                      dates = economy_dates[economy_dates < "2008-05-01"],
                      chunks = "month",
                      memory = "month",
                      init = 100,
                      K = 10,
                      type = "lda")

# updateRollingLDA = RollingLDA, if first argument is a RollingLDA object
roll_update = RollingLDA(roll_lda,
                         texts = economy_texts[economy_dates >= "2008-05-01"],
                         dates = economy_dates[economy_dates >= "2008-05-01"],
                         chunks = "month",
                         memory = "month")

roll_update
getChunks(roll_update)


roll_lda = RollingLDA(texts = economy_texts[economy_dates < "2008-05-01"],
                      dates = economy_dates[economy_dates < "2008-05-01"],
                      chunks = "month",
                      memory = "month",
                      init = 100,
                      K = 10,
                      type = "lda")

# updateRollingLDA = RollingLDA, if first argument is a RollingLDA object
roll_update = RollingLDA(roll_lda,
                         texts = economy_texts[economy_dates >= "2008-05-01"],
                         dates = economy_dates[economy_dates >= "2008-05-01"],
                         chunks = "month",
                         memory = "month")

roll_update
getChunks(roll_update)

Package 'rollinglda'

Help Index

rollinglda: Construct Consistent Time Series from Textual Data

Description

Data

Constructor

Getter

Modeling

Author(s)

References

See Also

RollingLDA Object

Description

Usage

Arguments

Details

Value

See Also

Examples

A Snippet of the Economy Dataset from toscaData

Description

Usage

Format

Source

Getter for RollingLDA

Description

Usage

Arguments

Value

See Also

RollingLDA

Description

Usage

Arguments

Details

Value

See Also

Examples

Updating an existing RollingLDA object

Description

Usage

Arguments

Details

Value

See Also

Examples