Faster Convergence
In the multiplicative representation, updating a single parameter changes many more entries of the resulting weight matrix than it does under an additive transformation, as can be seen in the figure below. Consequently, fewer updates may be required to transform the weight matrix into another matrix, which can lead to faster convergence. We observe this empirically in our experiments.
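To make the counting argument concrete, the following sketch (an illustration with small random matrices, not the actual model weights) perturbs one entry of an additive update matrix and one entry of a multiplicative transformation, then counts how many entries of the effective weight matrix change.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W0 = rng.standard_normal((d, d))           # frozen pre-trained weight (illustrative size)

# Additive adaptation: W' = W0 + Delta. Perturb a single entry of Delta.
Delta = np.zeros((d, d))
Delta[2, 5] = 0.1
W_additive = W0 + Delta

# Multiplicative adaptation: W' = M @ W0. Perturb a single entry of M (starting from identity).
M = np.eye(d)
M[2, 5] += 0.1
W_multiplicative = M @ W0

# Count how many entries of the effective weight matrix actually moved.
changed_add = np.count_nonzero(~np.isclose(W_additive, W0))
changed_mul = np.count_nonzero(~np.isclose(W_multiplicative, W0))
print(f"additive:       {changed_add} entries changed")   # a single entry
print(f"multiplicative: {changed_mul} entries changed")   # an entire row (~d entries)
```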
Convergence time reflects how quickly a model reaches a stable or desirable level of performance during training. To complement the evaluation metrics presented in Table 1, we demonstrate in this section that our proposed techniques converge faster than LoRA. We quantify convergence speed using the Area Under the Curve (AUC) of the training-loss curve, where a lower AUC indicates faster convergence. The figure illustrates the training-loss curves of both LoRMA variants compared to LoRA on the CoLA task with the RoBERTa model; LoRMA shows a steeper decline in training loss. The percentage reduction in AUC relative to LoRA is summarized in the table below for several tasks, and similar trends were observed for the remaining tasks as well.
Table: % AUC decrease in comparison with LoRA.
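As a concrete illustration of the metric, the sketch below computes the AUC of a training-loss curve with the trapezoidal rule and reports the percentage reduction relative to a baseline. The loss values are hypothetical placeholders, not the curves from our experiments.

```python
import numpy as np

def loss_auc(losses, steps=None):
    """Area under the training-loss curve (trapezoidal rule).

    A lower value means the loss drops earlier, i.e. faster convergence.
    """
    losses = np.asarray(losses, dtype=float)
    x = np.arange(len(losses)) if steps is None else np.asarray(steps, dtype=float)
    return float(np.sum((losses[1:] + losses[:-1]) / 2.0 * np.diff(x)))

# Hypothetical loss curves for two runs over the same training steps.
lora_losses  = [0.70, 0.62, 0.55, 0.50, 0.46, 0.44]
lorma_losses = [0.70, 0.55, 0.46, 0.42, 0.40, 0.39]

auc_lora, auc_lorma = loss_auc(lora_losses), loss_auc(lorma_losses)
print(f"% AUC decrease vs. LoRA: {100 * (auc_lora - auc_lorma) / auc_lora:.1f}%")
```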
Presence vs. Absence of Rank Inflation
As explained earlier, a naive low-rank multiplicative adaptation of the weight matrix has limitations. We present here the empirical verification of this claim; the results are shown in the table below. The experiments were run with RoBERTa-large on a subset of the GLUE tasks, with all hyperparameters and training conditions kept identical apart from the presence or absence of the rank-inflation strategies. Further, we evaluate the effectiveness of the proposed rank-inflation strategies by monitoring the rank of the matrices throughout training. We observe that these operations successfully break the rank bottleneck, and the matrices remain almost full rank throughout. The absence of rank inflation severely limits the model's capabilities.
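The rank bottleneck and its removal can be checked directly. The sketch below uses random low-rank factors and, as an illustrative stand-in for the inflation strategies discussed earlier, adds the identity to the low-rank product before multiplying with the frozen weight.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4
W0 = rng.standard_normal((d, d))
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))

naive    = (B @ A) @ W0                  # naive multiplicative update: rank capped at r
inflated = (np.eye(d) + B @ A) @ W0      # identity-based rank inflation (illustrative)

print("rank of naive update:   ", np.linalg.matrix_rank(naive))      # <= r
print("rank of inflated update:", np.linalg.matrix_rank(inflated))   # ~ d (almost full rank)
```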
Comparison with $\Delta W$
For any technique, let $\Delta W$ denote the difference between the final adapted weight matrix and the initial weight matrix (the frozen weights). We investigate how the $\Delta W$ learned by our multiplicative variants relates to the $\Delta W$ learned by LoRA, as compared to a random matrix. To assess the correlation, we employ a variety of metrics, the results of which are summarized in the table below. We use the Frobenius norm to measure the deviation between the matrices. The cosine similarity of the flattened matrices and the principal subspace angle between their column spaces are used to measure their alignment. We compute the sum of squared differences between the top-$k$ singular values and eigenvalues of the two matrices to assess their similarity.
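A sketch of how such metrics can be computed with NumPy/SciPy is given below. The function name and the choice to report only singular values (the eigenvalue variant is analogous) are ours, for illustration; the matrices in the usage example are synthetic stand-ins.

```python
import numpy as np
from scipy.linalg import subspace_angles

def compare_updates(dW_a, dW_b, k=10):
    """Similarity metrics between two weight-update matrices (illustrative)."""
    # Frobenius norm of the difference between the matrices.
    frob = float(np.linalg.norm(dW_a - dW_b))

    # Cosine similarity of the flattened matrices.
    a, b = dW_a.ravel(), dW_b.ravel()
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Smallest principal angle between the column spaces.
    angle = float(subspace_angles(dW_a, dW_b).min())

    # Sum of squared differences of the top-k singular values.
    sa = np.linalg.svd(dW_a, compute_uv=False)[:k]
    sb = np.linalg.svd(dW_b, compute_uv=False)[:k]
    sv_ssd = float(np.sum((sa - sb) ** 2))

    return {"frobenius": frob, "cosine": cos, "principal_angle": angle, "sv_ssd": sv_ssd}

# Usage with synthetic matrices standing in for the learned updates.
rng = np.random.default_rng(0)
dW_lora = rng.standard_normal((32, 32))
dW_lorma = dW_lora + 0.1 * rng.standard_normal((32, 32))
print(compare_updates(dW_lora, dW_lorma, k=5))
```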
Table: Correlation between $\Delta W_{\text{LoRA}}$ and $\Delta W_{\text{LoRMA}}$ for RoBERTa-large. ↑/↓ indicates whether a higher/lower value is more similar.
As can be seen in the table above, the main trend points towards a high correlation between $\Delta W_{\text{LoRA}}$ and the $\Delta W$ of our multiplicative variants, which shows that our multiplicative techniques can capture updates similar to those learned by additive LoRA. Additionally, to assess the expressibility of the transformations, we compare the rank of $\Delta W$. For LoRA, $\operatorname{rank}(\Delta W) = \operatorname{rank}(BA) \leq r$; hence, it is restricted to be a low-rank update, while for LoRMA there is no such limitation. We empirically observe the LoRMA updates to be almost full-rank matrices.
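The LoRA side of this rank argument is easy to verify numerically. The snippet below also contrasts it with a generic near-full-rank multiplicative adapter, used here purely as an illustrative stand-in for an inflated adapter rather than as a trained LoRMA module.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8
W0 = rng.standard_normal((d, d))
B, A = rng.standard_normal((d, r)), rng.standard_normal((r, d))

# LoRA: the update is the low-rank product itself, so its rank is capped at r.
delta_lora = B @ A
print("rank of LoRA update:", np.linalg.matrix_rank(delta_lora))       # <= r

# Multiplicative side (illustrative): with a near-full-rank adapter M,
# the induced update M @ W0 - W0 carries no such rank cap.
M = np.eye(d) + 0.01 * rng.standard_normal((d, d))   # stand-in for an inflated adapter
delta_mult = M @ W0 - W0
print("rank of multiplicative update:", np.linalg.matrix_rank(delta_mult))  # ~ d
```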