35 changes: 22 additions & 13 deletions gensim/models/atmodel.py
@@ -23,6 +23,9 @@
<https://arxiv.org/abs/1207.4169>`_. The model correlates the authorship information with the topics to give a better
insight on the subject knowledge of an author.

.. _'Online Learning for LDA' by Hoffman et al.: online-lda_
.. _online-lda: https://papers.neurips.cc/paper/2010/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf

Example
-------
.. sourcecode:: pycon
@@ -185,9 +188,12 @@ def __init__(self, corpus=None, num_topics=100, id2word=None, author2doc=None, d
iterations : int, optional
Maximum number of times the model loops over each document.
decay : float, optional
Controls how old documents are forgotten.
A number between (0.5, 1] to weight what percentage of the previous lambda value is forgotten
when each new document is examined. Corresponds to :math:`\\kappa` from
`'Online Learning for LDA' by Hoffman et al.`_
offset : float, optional
Controls down-weighting of iterations.
Hyper-parameter that controls how much we will slow down the first steps of the first few iterations.
Corresponds to :math:`\\tau_0` from `'Online Learning for LDA' by Hoffman et al.`_
alpha : {float, numpy.ndarray of float, list of float, str}, optional
A-priori belief on document-topic distribution, this can be:
* scalar for a symmetric prior over document-topic distribution,
@@ -207,7 +213,8 @@ def __init__(self, corpus=None, num_topics=100, id2word=None, author2doc=None, d
* 'symmetric': (default) Uses a fixed symmetric prior of `1.0 / num_topics`,
* 'auto': Learns an asymmetric prior from the corpus.
update_every : int, optional
Make updates in topic probability for latest mini-batch.
Number of chunks to be iterated through before each M step of EM.
Set to 0 for batch learning, > 1 for online iterative learning.
eval_every : int, optional
Calculate and estimate log perplexity for latest mini-batch.
gamma_threshold : float, optional
@@ -618,15 +625,14 @@ def update(self, corpus=None, author2doc=None, doc2author=None, chunksize=None,

Notes
-----
This update also supports updating an already trained model (self)
with new documents from `corpus`: the two models are then merged in proportion to the number of old vs. new
documents. This feature is still experimental for non-stationary input streams.
This update also supports updating an already trained model (`self`) with new documents from `corpus`;
the two models are then merged in proportion to the number of old vs. new documents.
This feature is still experimental for non-stationary input streams.

For stationary input (no topic drift in new documents), on the other hand, this equals the online update of
`Hoffman et al. Stochastic Variational Inference
<http://www.jmlr.org/papers/volume14/hoffman13a/hoffman13a.pdf>`_ and is guaranteed to converge for any `decay`
in (0.5, 1.0>. Additionally, for smaller `corpus` sizes, an increasing `offset` may be beneficial (see
Table 1 in Hoffman et al.)
For stationary input (no topic drift in new documents), on the other hand, this equals the
online update of `'Online Learning for LDA' by Hoffman et al.`_
and is guaranteed to converge for any `decay` in (0.5, 1]. Additionally, for smaller corpus sizes, an
increasing `offset` may be beneficial (see Table 1 in the same paper).

If update is called with authors that already exist in the model, it will resume training on not only new
documents for that author, but also the previously seen documents. This is necessary for those authors' topic
@@ -653,9 +659,12 @@ def update(self, corpus=None, author2doc=None, doc2author=None, chunksize=None,
chunksize : int, optional
Controls the size of the mini-batches.
decay : float, optional
Controls how old documents are forgotten.
A number between (0.5, 1] to weight what percentage of the previous lambda value is forgotten
when each new document is examined. Corresponds to :math:`\\kappa` from
`'Online Learning for LDA' by Hoffman et al.`_
offset : float, optional
Controls down-weighting of iterations.
Hyper-parameter that controls how much we will slow down the first steps of the first few iterations.
Corresponds to :math:`\\tau_0` from `'Online Learning for LDA' by Hoffman et al.`_
passes : int, optional
Number of times the model makes a pass over the entire training data.
update_every : int, optional
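Reviewer note, not part of the patch: the `decay` and `offset` hyper-parameters documented above enter the online algorithm only through the per-update step size. Paraphrasing the referenced Hoffman et al. paper (gensim's actual exponent argument additionally counts passes and previously processed chunks):

.. math::

    \rho_t = (\tau_0 + t)^{-\kappa}

where :math:`\kappa` is `decay`, :math:`\tau_0` is `offset` and :math:`t` counts updates. A larger `offset` shrinks :math:`\rho_t` during the first few updates (slowing the early steps down), while `decay` controls how quickly the weight given to each new mini-batch falls off over time.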
54 changes: 27 additions & 27 deletions gensim/models/ldamodel.py
@@ -13,9 +13,13 @@
for online training.

The core estimation code is based on the `onlineldavb.py script
<https://github.com/blei-lab/onlineldavb/blob/master/onlineldavb.py>`_, by `Hoffman, Blei, Bach:
Online Learning for Latent Dirichlet Allocation, NIPS 2010
<https://scholar.google.com/citations?hl=en&user=IeHKeGYAAAAJ&view_op=list_works>`_.
<https://github.com/blei-lab/onlineldavb/blob/master/onlineldavb.py>`_, by
Matthew D. Hoffman, David M. Blei, Francis Bach:
`'Online Learning for Latent Dirichlet Allocation', NIPS 2010`_.

.. _'Online Learning for Latent Dirichlet Allocation', NIPS 2010: online-lda_
.. _'Online Learning for LDA' by Hoffman et al.: online-lda_
.. _online-lda: https://papers.neurips.cc/paper/2010/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf

The algorithm:

@@ -198,8 +202,7 @@ def blend(self, rhot, other, targetsize=None):

The number of documents is stretched in both state objects, so that they are of comparable magnitude.
This procedure corresponds to the stochastic gradient update from
`Hoffman et al. :"Online Learning for Latent Dirichlet Allocation"
<https://www.di.ens.fr/~fbach/mdhnips2010.pdf>`_, see equations (5) and (9).
`'Online Learning for LDA' by Hoffman et al.`_, see equations (5) and (9).
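Reviewer note: in sufficient-statistics form, the merge `blend` performs is roughly the following (a paraphrase of equations (5) and (9), not text from the patch):

.. math::

    s \leftarrow (1 - \rho_t)\,\frac{N^{*}}{N_{\text{self}}}\, s
        + \rho_t\,\frac{N^{*}}{N_{\text{other}}}\, s_{\text{other}}

where :math:`s` are the topic-word sufficient statistics, :math:`\rho_t` is the step size `rhot`, :math:`N^{*}` is `targetsize`, and the two ratios are the stretching factors mentioned above (they reduce to 1 when the document counts already match).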

Parameters
----------
@@ -311,8 +314,7 @@ def load(cls, fname, *args, **kwargs):


class LdaModel(interfaces.TransformationABC, basemodel.BaseTopicModel):
"""Train and use Online Latent Dirichlet Allocation (OLDA) models as presented in
`Hoffman et al. :"Online Learning for Latent Dirichlet Allocation" <https://www.di.ens.fr/~fbach/mdhnips2010.pdf>`_.
"""Train and use Online Latent Dirichlet Allocation model as presented in `'Online Learning for LDA' by Hoffman et al.`_

Examples
--------
@@ -372,7 +374,7 @@ def __init__(self, corpus=None, num_topics=100, id2word=None,
passes : int, optional
Number of passes through the corpus during training.
update_every : int, optional
Number of documents to be iterated through for each update.
Number of chunks to be iterated through before each M step of EM.
Set to 0 for batch learning, > 1 for online iterative learning.
alpha : {float, numpy.ndarray of float, list of float, str}, optional
A-priori belief on document-topic distribution, this can be:
@@ -394,13 +396,11 @@ def __init__(self, corpus=None, num_topics=100, id2word=None,
* 'auto': Learns an asymmetric prior from the corpus.
decay : float, optional
A number between (0.5, 1] to weight what percentage of the previous lambda value is forgotten
when each new document is examined. Corresponds to Kappa from
`Matthew D. Hoffman, David M. Blei, Francis Bach:
"Online Learning for Latent Dirichlet Allocation NIPS'10" <https://www.di.ens.fr/~fbach/mdhnips2010.pdf>`_.
when each new document is examined.
Corresponds to :math:`\\kappa` from `'Online Learning for LDA' by Hoffman et al.`_
offset : float, optional
Hyper-parameter that controls how much we will slow down the first steps of the first few iterations.
Corresponds to Tau_0 from `Matthew D. Hoffman, David M. Blei, Francis Bach:
"Online Learning for Latent Dirichlet Allocation NIPS'10" <https://www.di.ens.fr/~fbach/mdhnips2010.pdf>`_.
Corresponds to :math:`\\tau_0` from `'Online Learning for LDA' by Hoffman et al.`_
eval_every : int, optional
Log perplexity is estimated every that many updates. Setting this to one slows down training by ~2x.
iterations : int, optional
@@ -643,7 +643,7 @@ def inference(self, chunk, collect_sstats=False):
"""Given a chunk of sparse document vectors, estimate gamma (parameters controlling the topic weights)
for each document in the chunk.

This function does not modify the model The whole input chunk of document is assumed to fit in RAM;
This function does not modify the model. The whole input chunk of documents is assumed to fit in RAM;
chunking of a large corpus must be done earlier in the pipeline. Avoids computing the `phi` variational
parameter directly using the optimization presented in
`Lee, Seung: Algorithms for non-negative matrix factorization"
@@ -860,13 +860,15 @@ def update(self, corpus, chunksize=None, decay=None, offset=None,

Notes
-----
This update also supports updating an already trained model with new documents; the two models are then merged
in proportion to the number of old vs. new documents. This feature is still experimental for non-stationary
input streams. For stationary input (no topic drift in new documents), on the other hand, this equals the
online update of `Matthew D. Hoffman, David M. Blei, Francis Bach:
"Online Learning for Latent Dirichlet Allocation NIPS'10" <https://www.di.ens.fr/~fbach/mdhnips2010.pdf>`_.
and is guaranteed to converge for any `decay` in (0.5, 1.0). Additionally, for smaller corpus sizes, an
increasing `offset` may be beneficial (see Table 1 in the same paper).
This update also supports updating an already trained model (`self`) with new documents from `corpus`;
the two models are then merged in proportion to the number of old vs. new documents.
This feature is still experimental for non-stationary input streams.

For stationary input (no topic drift in new documents), on the other hand,
this equals the online update of `'Online Learning for LDA' by Hoffman et al.`_
and is guaranteed to converge for any `decay` in (0.5, 1].
Additionally, for smaller corpus sizes,
an increasing `offset` may be beneficial (see Table 1 in the same paper).

Parameters
----------
@@ -877,13 +879,11 @@
Number of documents to be used in each training chunk.
decay : float, optional
A number between (0.5, 1] to weight what percentage of the previous lambda value is forgotten
when each new document is examined. Corresponds to Kappa from
`Matthew D. Hoffman, David M. Blei, Francis Bach:
"Online Learning for Latent Dirichlet Allocation NIPS'10" <https://www.di.ens.fr/~fbach/mdhnips2010.pdf>`_.
when each new document is examined. Corresponds to :math:`\\kappa` from
`'Online Learning for LDA' by Hoffman et al.`_
offset : float, optional
Hyper-parameter that controls how much we will slow down the first steps of the first few iterations.
Corresponds to Tau_0 from `Matthew D. Hoffman, David M. Blei, Francis Bach:
"Online Learning for Latent Dirichlet Allocation NIPS'10" <https://www.di.ens.fr/~fbach/mdhnips2010.pdf>`_.
Corresponds to :math:`\\tau_0` from `'Online Learning for LDA' by Hoffman et al.`_
passes : int, optional
Number of passes through the corpus during training.
update_every : int, optional
@@ -1053,7 +1053,7 @@ def do_mstep(self, rho, other, extra_pass=False):
----------
rho : float
Learning rate.
other : :class:`~gensim.models.ldamodel.LdaModel`
other : :class:`~gensim.models.ldamodel.LdaState`
The model whose sufficient statistics will be used to update the topics.
extra_pass : bool, optional
Whether this step required an additional pass over the corpus.
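Reviewer note: to make the update semantics described above concrete, here is a minimal sketch of training an `LdaModel` online and then folding in new documents. The toy corpus, dictionary and hyper-parameter values below are illustrative only.

.. sourcecode:: python

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    old_texts = [["human", "interface", "computer"], ["survey", "user", "computer", "system"]]
    new_texts = [["graph", "trees", "minors"], ["graph", "minors", "survey"]]

    dictionary = Dictionary(old_texts + new_texts)  # shared vocabulary for both batches
    old_corpus = [dictionary.doc2bow(text) for text in old_texts]
    new_corpus = [dictionary.doc2bow(text) for text in new_texts]

    # decay ~ kappa and offset ~ tau_0 from Hoffman et al.; update_every=1 means online EM
    lda = LdaModel(
        corpus=old_corpus, id2word=dictionary, num_topics=2,
        update_every=1, chunksize=2000, passes=2, decay=0.7, offset=64.0,
    )

    # Fold in new documents later: old and new statistics are merged
    # in proportion to the number of old vs. new documents.
    lda.update(new_corpus)
    print(lda.print_topics())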
61 changes: 34 additions & 27 deletions gensim/models/ldamulticore.py
@@ -38,8 +38,13 @@
unseen documents. The model can also be updated with new documents for online training.

The core estimation code is based on the `onlineldavb.py script
<https://github.com/blei-lab/onlineldavb/blob/master/onlineldavb.py>`_, by `Hoffman, Blei, Bach:
Online Learning for Latent Dirichlet Allocation, NIPS 2010 <http://www.cs.princeton.edu/~mdhoffma>`_.
<https://github.com/blei-lab/onlineldavb/blob/master/onlineldavb.py>`_, by
Matthew D. Hoffman, David M. Blei, Francis Bach:
`'Online Learning for Latent Dirichlet Allocation', NIPS 2010`_.

.. _'Online Learning for Latent Dirichlet Allocation', NIPS 2010: online-lda_
.. _'Online Learning for LDA' by Hoffman et al.: online-lda_
.. _online-lda: https://papers.neurips.cc/paper/2010/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf

Usage examples
--------------
@@ -102,7 +107,7 @@ class LdaMulticore(LdaModel):

"""
def __init__(self, corpus=None, num_topics=100, id2word=None, workers=None,
chunksize=2000, passes=1, batch=False, alpha='symmetric',
chunksize=2000, passes=1, update_every=1, alpha='symmetric',
eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50,
gamma_threshold=0.001, random_state=None, minimum_probability=0.01,
minimum_phi_value=0.01, per_word_topics=False, dtype=np.float32):
@@ -128,6 +133,9 @@ def __init__(self, corpus=None, num_topics=100, id2word=None, workers=None,
Number of documents to be used in each training chunk.
passes : int, optional
Number of passes through the corpus during training.
update_every : int, optional
Number of chunks to be iterated through before each M step of EM.
Set to 0 for batch learning, > 1 for online iterative learning.
alpha : {float, numpy.ndarray of float, list of float, str}, optional
A-priori belief on document-topic distribution, this can be:
* scalar for a symmetric prior over document-topic distribution,
@@ -147,13 +155,11 @@ def __init__(self, corpus=None, num_topics=100, id2word=None, workers=None,
* 'auto': Learns an asymmetric prior from the corpus.
decay : float, optional
A number between (0.5, 1] to weight what percentage of the previous lambda value is forgotten
when each new document is examined. Corresponds to Kappa from
`Matthew D. Hoffman, David M. Blei, Francis Bach:
"Online Learning for Latent Dirichlet Allocation NIPS'10" <https://www.di.ens.fr/~fbach/mdhnips2010.pdf>`_.
when each new document is examined. Corresponds to :math:`\\kappa` from
`'Online Learning for LDA' by Hoffman et al.`_
offset : float, optional
Hyper-parameter that controls how much we will slow down the first steps of the first few iterations.
Corresponds to Tau_0 from `Matthew D. Hoffman, David M. Blei, Francis Bach:
"Online Learning for Latent Dirichlet Allocation NIPS'10" <https://www.di.ens.fr/~fbach/mdhnips2010.pdf>`_.
Corresponds to :math:`\\tau_0` from `'Online Learning for LDA' by Hoffman et al.`_
eval_every : int, optional
Log perplexity is estimated every that many updates. Setting this to one slows down training by ~2x.
iterations : int, optional
@@ -175,37 +181,32 @@ def __init__(self, corpus=None, num_topics=100, id2word=None, workers=None,

"""
self.workers = max(1, cpu_count() - 1) if workers is None else workers
self.batch = batch

if isinstance(alpha, str) and alpha == 'auto':
raise NotImplementedError("auto-tuning alpha not implemented in LdaMulticore; use plain LdaModel.")

super(LdaMulticore, self).__init__(
corpus=corpus, num_topics=num_topics,
id2word=id2word, chunksize=chunksize, passes=passes, alpha=alpha, eta=eta,
corpus=corpus, num_topics=num_topics, id2word=id2word, distributed=False, # not distributed across machines
chunksize=chunksize, passes=passes, update_every=update_every, alpha=alpha, eta=eta,
decay=decay, offset=offset, eval_every=eval_every, iterations=iterations,
gamma_threshold=gamma_threshold, random_state=random_state, minimum_probability=minimum_probability,
gamma_threshold=gamma_threshold, minimum_probability=minimum_probability, random_state=random_state,
minimum_phi_value=minimum_phi_value, per_word_topics=per_word_topics, dtype=dtype,
)

def update(self, corpus, chunks_as_numpy=False):
"""Train the model with new documents, by EM-iterating over `corpus` until the topics converge
(or until the maximum number of allowed iterations is reached).

Train the model with new documents, by EM-iterating over the corpus until the topics converge, or until
"""Train the model with new documents, by EM-iterating over the corpus until the topics converge, or until
the maximum number of allowed iterations is reached. `corpus` must be an iterable. The E step is distributed
across the worker processes.

Notes
-----
This update also supports updating an already trained model (`self`)
with new documents from `corpus`; the two models are then merged in
proportion to the number of old vs. new documents. This feature is still
experimental for non-stationary input streams.
This update also supports updating an already trained model (`self`) with new documents from `corpus`;
the two models are then merged in proportion to the number of old vs. new documents.
This feature is still experimental for non-stationary input streams.

For stationary input (no topic drift in new documents), on the other hand,
this equals the online update of Hoffman et al. and is guaranteed to
converge for any `decay` in (0.5, 1.0>.
this equals the online update of `'Online Learning for LDA' by Hoffman et al.`_
and is guaranteed to converge for any `decay` in (0.5, 1].

Parameters
----------
@@ -229,14 +230,20 @@ def update(self, corpus, chunks_as_numpy=False):

self.state.numdocs += lencorpus

if self.batch:
# Same as in LdaModel but self.workers (processes) is used instead of self.numworkers (machines)
if self.update_every:
updatetype = "online"
if self.passes == 1:
updatetype += " (single-pass)"
else:
updatetype += " (multi-pass)"
updateafter = min(lencorpus, self.update_every * self.workers * self.chunksize)
else:
updatetype = "batch"
updateafter = lencorpus
else:
updatetype = "online"
updateafter = self.chunksize * self.workers

eval_every = self.eval_every or 0
evalafter = min(lencorpus, eval_every * updateafter)
evalafter = min(lencorpus, eval_every * self.workers * self.chunksize)

updates_per_pass = max(1, lencorpus / updateafter)
logger.info(
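Reviewer note: a sketch of how the `update_every` parameter added to `LdaMulticore` by this patch would be used, assuming the new signature in which `update_every` replaces the old `batch` flag. The corpus and worker count are illustrative only.

.. sourcecode:: python

    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore

    texts = [["human", "interface", "computer"], ["graph", "trees", "minors"], ["survey", "user", "system"]]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # update_every=1: online EM; an M step runs after roughly
    # update_every * workers * chunksize documents (capped at the corpus size).
    online_lda = LdaMulticore(
        corpus=corpus, id2word=dictionary, num_topics=2,
        workers=3, chunksize=2000, update_every=1, passes=1,
    )

    # update_every=0: batch EM; one M step per pass over the full corpus.
    batch_lda = LdaMulticore(
        corpus=corpus, id2word=dictionary, num_topics=2,
        workers=3, chunksize=2000, update_every=0, passes=10,
    )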