Yeah, 12 years or so ago I trained an LSTM to generate fake clinical reports from clinical report abstracts on PubMed. It was clear even then that starting from an empty state was a poor way to sample: the process that generates those documents doesn't start with 'pick the first letter' -- it starts with the condition of a body, which is revealed in the clinical encounter. All of that is in the "state vector" of the real world, so of course it should be in the state vector of the model.
The sponsor wasn't interested (people weren't interested enough in optimizing text generation back then), so it never happened, but it's nice to see that it works.
-----------------------------
Reply: we would have learned the initial state for all the training vectors and probably randomly generated initial states for generation
Makes sense. Random initial states for generation are interesting because they add diversity at the source. We tried something related with the alpha parameter (which scales the magnitude of the learned state) and found the optimal value differs by roughly 10x between architectures: 0.07 for GatedDeltaNet vs. 0.65 for Mamba-2. Too large and generation degrades; too small and the state washes out before it affects anything.
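A minimal sketch of how an alpha-scaled learned initial state might plug into a recurrent model, plus the random-perturbation idea from the reply. The function name, the state dimension, and the perturbation scheme are all assumptions for illustration, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM = 8
# Stand-in for a learned initial-state parameter (trained alongside the model).
learned_h0 = rng.standard_normal(STATE_DIM)

def initial_state(alpha, learned_h0, noise_rng=None, noise_scale=0.1):
    """Return the state the recurrence starts from.

    alpha scales the learned state's magnitude; alpha=0 recovers the
    usual zero-state start. If noise_rng is given, add a small random
    perturbation for generation-time diversity (a hypothetical scheme,
    not the one from the thread).
    """
    h0 = alpha * learned_h0
    if noise_rng is not None:
        h0 = h0 + noise_scale * alpha * noise_rng.standard_normal(h0.shape)
    return h0

zero_start  = initial_state(0.0, learned_h0)           # empty-state baseline
gdn_start   = initial_state(0.07, learned_h0)          # alpha quoted for GatedDeltaNet
mamba_start = initial_state(0.65, learned_h0)          # alpha quoted for Mamba-2
```

The point of the single scalar is that it gives one knob for how strongly the learned prior is injected into the recurrence, which is why its sweet spot can sit an order of magnitude apart across architectures with very different state dynamics.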