Faster Convergence
In the multiplicative representation, updating a single parameter changes many more entries of the resulting weight matrix than it does under an additive transformation, as can be seen in the figure below. Consequently, fewer updates may be required to transform the weight matrix into another matrix, which can lead to faster convergence. We observe this empirically in our experiments.
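To make the counting argument concrete, the following sketch (an illustration with small random matrices, not the actual model weights) perturbs one entry of an additive update matrix and one entry of a multiplicative transformation, then counts how many entries of the effective weight matrix change.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W0 = rng.standard_normal((d, d))           # frozen pre-trained weight (illustrative size)

# Additive adaptation: W' = W0 + Delta. Perturb a single entry of Delta.
Delta = np.zeros((d, d))
Delta[2, 5] = 0.1
W_additive = W0 + Delta

# Multiplicative adaptation: W' = M @ W0. Perturb a single entry of M (starting from identity).
M = np.eye(d)
M[2, 5] += 0.1
W_multiplicative = M @ W0

# Count how many entries of the effective weight matrix actually moved.
changed_add = np.count_nonzero(~np.isclose(W_additive, W0))
changed_mul = np.count_nonzero(~np.isclose(W_multiplicative, W0))
print(f"additive:       {changed_add} entries changed")   # a single entry
print(f"multiplicative: {changed_mul} entries changed")   # an entire row (~d entries)
```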
Convergence time reflects how quickly a model reaches a stable or desirable level of performance during training. To complement the evaluation metrics presented in Table 1, we demonstrate in this section that our proposed techniques converge faster than LoRA. We quantify convergence speed using the Area Under the Curve (AUC) of the training-loss curve, where a lower AUC indicates faster convergence. The figure illustrates the training-loss curves of both LoRMA variants compared to LoRA on the CoLA task with the RoBERTa model; LoRMA shows a steeper decline in training loss. The percentage reduction in AUC relative to LoRA is summarized in the table below for several tasks, and similar trends were observed for the remaining tasks as well.
Table: % AUC decrease in comparison with LoRA.
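As a concrete illustration of the metric, the sketch below computes the AUC of a training-loss curve with the trapezoidal rule and reports the percentage reduction relative to a baseline. The loss values are hypothetical placeholders, not the curves from our experiments.

```python
import numpy as np

def loss_auc(losses, steps=None):
    """Area under the training-loss curve (trapezoidal rule).

    A lower value means the loss drops earlier, i.e. faster convergence.
    """
    losses = np.asarray(losses, dtype=float)
    x = np.arange(len(losses)) if steps is None else np.asarray(steps, dtype=float)
    return float(np.sum((losses[1:] + losses[:-1]) / 2.0 * np.diff(x)))

# Hypothetical loss curves for two runs over the same training steps.
lora_losses  = [0.70, 0.62, 0.55, 0.50, 0.46, 0.44]
lorma_losses = [0.70, 0.55, 0.46, 0.42, 0.40, 0.39]

auc_lora, auc_lorma = loss_auc(lora_losses), loss_auc(lorma_losses)
print(f"% AUC decrease vs. LoRA: {100 * (auc_lora - auc_lorma) / auc_lora:.1f}%")
```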
Presence vs. Absence of Rank Inflation
As explained earlier, a naive low-rank multiplicative adaptation of the weight matrix has limitations. We present here the empirical verification of this claim; the results are shown in the table below. The experiments were run with RoBERTa-large on a subset of the GLUE tasks, with all hyperparameters and training conditions kept identical apart from the presence or absence of the rank-inflation strategies. Further, we evaluate the effectiveness of the proposed rank-inflation strategies by monitoring the rank of the matrices throughout training. We observe that these operations successfully break the rank bottleneck, and the matrices remain almost full rank throughout. The absence of rank inflation severely limits the model's capabilities.
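The rank bottleneck and its removal can be checked directly. The sketch below uses random low-rank factors and, as an illustrative stand-in for the inflation strategies discussed earlier, adds the identity to the low-rank product before multiplying with the frozen weight.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4
W0 = rng.standard_normal((d, d))
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))

naive    = (B @ A) @ W0                  # naive multiplicative update: rank capped at r
inflated = (np.eye(d) + B @ A) @ W0      # identity-based rank inflation (illustrative)

print("rank of naive update:   ", np.linalg.matrix_rank(naive))      # <= r
print("rank of inflated update:", np.linalg.matrix_rank(inflated))   # ~ d (almost full rank)
```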
Comparison with $\Delta W$
For any technique, let $\Delta W$ denote the difference between the final adapted weight matrix and the initial weight matrix (the frozen weights). We investigate how the $\Delta W$ learned by our multiplicative variants relates to the $\Delta W$ learned by LoRA, as compared to a random matrix. To assess the correlation, we employ a variety of metrics, the results of which are summarized in the table below. We use the Frobenius norm to measure the deviation between the matrices. The cosine similarity of the flattened matrices and the principal subspace angle between their column spaces are used to measure their alignment. We compute the sum of squared differences between the top-$k$ singular values and eigenvalues of the two matrices to assess their similarity.
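A sketch of how such metrics can be computed with NumPy/SciPy is given below. The function name and the choice to report only singular values (the eigenvalue variant is analogous) are ours, for illustration; the matrices in the usage example are synthetic stand-ins.

```python
import numpy as np
from scipy.linalg import subspace_angles

def compare_updates(dW_a, dW_b, k=10):
    """Similarity metrics between two weight-update matrices (illustrative)."""
    # Frobenius norm of the difference between the matrices.
    frob = float(np.linalg.norm(dW_a - dW_b))

    # Cosine similarity of the flattened matrices.
    a, b = dW_a.ravel(), dW_b.ravel()
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Smallest principal angle between the column spaces.
    angle = float(subspace_angles(dW_a, dW_b).min())

    # Sum of squared differences of the top-k singular values.
    sa = np.linalg.svd(dW_a, compute_uv=False)[:k]
    sb = np.linalg.svd(dW_b, compute_uv=False)[:k]
    sv_ssd = float(np.sum((sa - sb) ** 2))

    return {"frobenius": frob, "cosine": cos, "principal_angle": angle, "sv_ssd": sv_ssd}

# Usage with synthetic matrices standing in for the learned updates.
rng = np.random.default_rng(0)
dW_lora = rng.standard_normal((32, 32))
dW_lorma = dW_lora + 0.1 * rng.standard_normal((32, 32))
print(compare_updates(dW_lora, dW_lorma, k=5))
```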
Table: Correlation between $\Delta W_{\text{LoRA}}$ and $\Delta W_{\text{LoRMA}}$ for RoBERTa-large. ↑/↓ indicates whether a higher/lower value is more similar.
As can be seen in the table above, the main trend points towards a high correlation between $\Delta W_{\text{LoRA}}$ and the $\Delta W$ of our multiplicative variants, which shows that our multiplicative techniques can capture updates similar to those learned by additive LoRA. Additionally, to assess the expressibility of the transformations, we compare the rank of $\Delta W$. For LoRA, $\operatorname{rank}(\Delta W) = \operatorname{rank}(BA) \leq r$; hence, it is restricted to be a low-rank update, while for LoRMA there is no such limitation. We empirically observe the LoRMA updates to be almost full-rank matrices.
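The LoRA side of this rank argument is easy to verify numerically. The snippet below also contrasts it with a generic near-full-rank multiplicative adapter, used here purely as an illustrative stand-in for an inflated adapter rather than as a trained LoRMA module.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8
W0 = rng.standard_normal((d, d))
B, A = rng.standard_normal((d, r)), rng.standard_normal((r, d))

# LoRA: the update is the low-rank product itself, so its rank is capped at r.
delta_lora = B @ A
print("rank of LoRA update:", np.linalg.matrix_rank(delta_lora))       # <= r

# Multiplicative side (illustrative): with a near-full-rank adapter M,
# the induced update M @ W0 - W0 carries no such rank cap.
M = np.eye(d) + 0.01 * rng.standard_normal((d, d))   # stand-in for an inflated adapter
delta_mult = M @ W0 - W0
print("rank of multiplicative update:", np.linalg.matrix_rank(delta_mult))  # ~ d
```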