
Language Models are Symbolic Learners in Arithmetic

Chunyuan Deng1  Zhiqi Li2  Roy Xie3  Ruidi Chang1  Hanjie Chen1
1Rice University  2Georgia Tech  3Duke University
{chunyuan.deng,hanjie}@rice.edu
Abstract

Large Language Models (LLMs) are thought to struggle with arithmetic learning due to inherent differences between language modeling and numerical computation, but concrete evidence has been lacking. This work responds to this claim through a two-sided experiment. We first investigate whether LLMs leverage partial products during arithmetic learning. We find that although LLMs can identify some partial products after learning, they fail to leverage them for arithmetic tasks; conversely, training on partial products does not improve arithmetic performance. We then explore how LLMs approach arithmetic symbolically by breaking tasks into subgroups, hypothesizing that difficulties arise from subgroup complexity and selection. Our results show that when subgroup complexity is fixed, LLMs treat a collection of different arithmetic operations similarly. By analyzing position-level accuracy across different training sizes, we further observe that it follows a U-shaped pattern: LLMs quickly learn the easiest patterns at the first and last positions, while progressively learning the more difficult patterns in the middle positions. This suggests that LLMs select subgroups following an easy-to-hard paradigm during learning. Our work confirms that LLMs are pure symbolic learners in arithmetic tasks and underscores the importance of understanding them deeply through subgroup-level quantification.


1 Introduction

Modern math benchmarks like MATH (Hendrycks et al., 2021) and GSM8K (Cobbe et al., 2021) have been rapidly saturated due to the advancement of frontier language models like GPT-4o (OpenAI et al., 2024) and Claude (Anthropic, 2024). This trend is driving the assessment of these models toward more challenging tasks, such as Olympiad-level math. However, it has been observed that even the most advanced language models struggle with basic arithmetic, such as 5-digit multiplication (Yang et al., 2023). This notable gap raises questions about the underlying mechanisms behind arithmetic in language models. Some hypothesize (Boguraev et al., 2024) that this difficulty stems from the fact that mathematical calculation differs fundamentally from autoregressive language modeling.

Previous research on this topic has primarily focused on causal abstraction to identify model components such as key attention layers (Stolfo et al., 2023) or attention heads (Zhang et al., 2024) responsible for arithmetic learning. While these studies provide valuable insights into the components involved in mathematical operations, they fall short of explaining why frontier models continue to struggle with certain arithmetic tasks. For instance, causal attribution reveals that the 14th attention head in the 19th layer is responsible for the operation 37 + 14 = 51. However, it cannot explain why the model handles 37 + 14 successfully but fails on 37 × 14. This observation suggests that there is still room to explore arithmetic learning from alternative perspectives.

Figure 1: Fundamental structure of the paper. We begin by investigating partial products and proceed to a detailed examination at the subgroup level to understand the mechanism in a symbolic manner.

To achieve this, we approach the problem from two sides (shown in Figure 1). First, we examine whether LLMs leverage partial products in arithmetic learning tasks. Then we explore whether, and more importantly, how LLMs handle arithmetic in a purely symbolic manner. Specifically, we decompose the task to the subgroup level, hypothesizing that the task can be understood through two aspects: subgroup complexity and subgroup selection.

For partial products, we considered four distinct methods for multiplication calculation: standard multiplication, repetitive addition, the lattice method, and Egyptian multiplication (detailed in §5.1), each generating distinct paths for partial product computation. We first fine-tune LLMs on multiplication tasks. After fine-tuning, models improved in identifying nearly all partial products, but explicit training on partial products did not, in turn, enhance their ability to perform multiplication. These findings suggest that large language models may not leverage partial products to perform arithmetic calculations. The increased ability to identify partial products seems to be a by-product of subgroup-level symbol learning rather than a mechanism the models inherently use for computation.

We then delve into a more fine-grained level by examining subgroups to understand their basic complexity. Subgroup complexity refers to the intrinsic complexity of the subgroup itself. We propose that this is related to the domain space cardinality |\mathcal{D}|, the label space entropy H(\mathcal{L}), and the subgroup quality Q(s). Here, |\mathcal{D}| represents the maximum size of the training data, H(\mathcal{L}) represents the variability in the label space, and Q(s) refers to how deterministically one subgroup can map from the domain to the label space. While |\mathcal{D}|, which is strongly correlated with training size, self-evidently influences learning, we focus on label space entropy H(\mathcal{L}) first and discuss Q(s) in the subgroup selection section. We observe the influence of H(\mathcal{L}) by perturbing addition and multiplication through translation (\Delta c) and scaling (\times\lambda), while keeping the total entropy across all output digits the same. Our findings show that LLMs have nearly the same accuracy across different perturbation magnitudes. Furthermore, when we intentionally reduce the entropy of the label space via modular operations, we observe an increase in accuracy as entropy decreases. These experiments confirm that label space entropy is an effective measure for quantifying task complexity and validate the hypothesis that LLMs operate as symbol-level pattern finders.

Subgroup selection refers to the process by which LLMs identify the correct subgroup during training, specifically by selecting the appropriate mapping between a token subset in the input domain and a corresponding token subset in the output label space. To investigate this, we curated four distinct training sets to analyze position-level accuracy throughout the learning process. Consistent patterns emerged: position-level accuracy exhibited a U-shaped curve, achieving near-perfect accuracy (close to 100%) at the first and last digits, which have high Q(s), but dropping significantly (to below 10%) across the intermediate digits, which have low Q(s). These results suggest that LLMs apply a selection mechanism driven by high-to-low subgroup quality, providing further evidence of their symbolic learning behavior.

Our work confirms that LLMs do not truly perform calculations in arithmetic learning. Instead, they approach it in a purely symbolic manner. We then provide a systematic framework to examine this mechanism through subgroup complexity and subgroup selection. Our research emphasizes the pivotal role of label space entropy in the convergence stage and the impact of subgroup quality on learning dynamics, highlighting the importance of understanding arithmetic deeply and symbolically through subgroup-level quantification.

2 Related Work

Mathematical Learning in LLMs

Mathematical reasoning has been a longstanding area of research in natural language processing (NLP) (Kushman et al., 2014; Huang et al., 2016; Wang et al., 2017; Thawani et al., 2021; Sundaram et al., 2022; Guo et al., 2024). With recent advances in LLMs, numerous studies have sought to improve their mathematical reasoning abilities through various techniques, including data annealing (Dubey et al., 2024), continued pretraining (Lewkowycz et al., 2022), fine-tuning (Yue et al., 2023; Liu et al., 2023), prompting (Wei et al., 2023; Wang et al., 2023), and inference-time computation (Zhou et al., 2023a; Wu et al., 2024a). However, LLMs still face challenges with basic calculations and remain vulnerable to adversarial examples or perturbations, where minor changes in problems can result in incorrect answers (Zhou et al., 2023b; Xie et al., 2024). Most research on LLMs' mathematical reasoning focuses on math word problems, where natural language is involved in the questions (Hendrycks et al., 2021; Cobbe et al., 2021; Arora et al., 2023; Zhao et al., 2024a, b).

Arithmetic Learning in Transformer

Several previous efforts have aimed to improve arithmetic learning in LLMs. Lee et al. (2023) trained a 10.6M-parameter NanoGPT (Karpathy, 2022) model to learn arithmetic by carefully curating the data format, explicitly expanding each step using a method termed Scratchpad, which achieved remarkable performance compared to GPT-2 XL (Radford et al., 2019). Yang et al. (2023) fine-tuned MathGLM with a sufficient training dataset, demonstrating its capability to solve 5-digit multiplication. Deng et al. (2023, 2024) further advanced this field by internalizing the CoT process, hiding detailed steps in a scheduled manner and enabling GPT-2 small to solve 9-digit multiplication after multiple training runs.

Research on understanding arithmetic primarily stems from the interpretability community. The core idea is to identify causal correlations between model components and outputs. Stolfo et al. (2023) identify key attention layers responsible for arithmetic learning using causal mediation analysis (CMA), a weight perturbation method that observes changes in output. Similarly, Hanna et al. (2023) and Wu et al. (2024b) explore causal abstraction concepts at different model scales, specifically 0.1B and 7B parameters, respectively. More recently, Zhang et al. (2024) employed attention attribution to isolate a small subset of attention heads and fine-tuned them for improved performance at a lower cost. While these studies have made progress in understanding how LMs perform arithmetic at a component level, there remains a gap in understanding the learning mechanisms from a purely symbolic perspective. Our research aims to fill this gap in a systematic manner.

3 Preliminaries

In this section, we present the preliminaries of basic autoregressive language modeling, along with algebraic structure and arithmetic learning.

Autoregressive Language Modeling

An autoregressive (AR) language model predicts the next token in a sequence based on the previously observed tokens. Formally, given a sequence of tokens \mathbf{x}=\{x_{1},x_{2},\dots,x_{T}\}, the probability of the sequence is decomposed using the chain rule of probability as:

P(\mathbf{x})=\prod_{t=1}^{T}P(x_{t}\mid x_{1},x_{2},\dots,x_{t-1}), (1)

where P(x_{t}\mid x_{1},x_{2},\dots,x_{t-1}) represents the conditional probability of token x_{t} given all previous tokens. In autoregressive modeling, the next token is sampled from this conditional distribution until the end of the sequence is reached.
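As a minimal numeric illustration of Equation 1 (the per-token probabilities below are made up for the example, not taken from any real model), the sequence probability is simply the product of the conditional probabilities:

import math

# Hypothetical conditionals P(x_t | x_1, ..., x_{t-1}) for a 4-token sequence.
conditionals = [0.9, 0.5, 0.8, 0.7]

# Chain rule: P(x) = prod_t P(x_t | x_<t); summing log-probabilities avoids underflow for long sequences.
sequence_prob = math.prod(conditionals)
log_prob = sum(math.log(p) for p in conditionals)
print(f"P(x) = {sequence_prob:.4f}, log P(x) = {log_prob:.4f}")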

Algebraic Structure

In our setting, we employ the algebraic structure known as a ring, which provides a formal framework for the arithmetic operations of addition and multiplication. A ring (R,+,\cdot) is defined by:

  • A set R (domain) of elements. Specifically, the domain R in our task is the set of integers \mathbb{Z}.

  • Two binary operations, addition and multiplication, f(a,b)=c:R\times R\to R. Specifically, we use A_{1}A_{2}+B_{1}B_{2}=C_{1}C_{2}C_{3} to represent 2-digit addition and A_{1}A_{2}\times B_{1}B_{2}=C_{1}C_{2}C_{3}C_{4} to represent 2-digit multiplication.

In our case, for all a,b\in R, there exists a unique element a+b\in R, representing the sum of a and b. Similarly, there exists a unique element a\cdot b\in R, representing the product of a and b.

Arithmetic Learning

Let \mathcal{M} denote a pretrained autoregressive language model with weights \mathbf{w}. We define an arithmetic task \mathcal{T} as a function learning problem where the goal is to predict numerical outputs based on arithmetic expressions. The training dataset for this task is given by:

\mathbb{D}_{\text{train}}=\{(a^{(k)},b^{(k)},c^{(k)})\}_{k=1}^{N}

where c^{(k)}=f(a^{(k)},b^{(k)}), f(\cdot) represents a binary operator, and N denotes the number of data points. In this context, a^{(k)} and b^{(k)} are input operands, and c^{(k)} is the corresponding output, which is computed using the operator f (e.g., addition, multiplication, etc.).
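As a minimal sketch (our own helper, not the authors' code), such a training set can be generated by sampling operand pairs and applying the binary operator f; operands are drawn from 10 to 99 so that each example is a genuine 2-digit problem:

import random

def build_dataset(f, n_digits=2, size=1000, seed=0):
    # Sample (a, b, c) triples with c = f(a, b) for uniformly drawn n-digit operands.
    rng = random.Random(seed)
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    return [(a, b, f(a, b))
            for a, b in ((rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(size))]

train = build_dataset(lambda a, b: a * b, n_digits=2, size=5)
print(train)  # five (a, b, a*b) triples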

4 Experiment Setup

In this section, we will detail our experiment setup. Unless otherwise specified, the same setup will be used in the following section.

Domain

We select addition and multiplication as the fundamental operations for our experiments following previous work (Lee et al., 2023; Deng et al., 2023, 2024).

Model

To investigate arithmetic learning at scale, we selected two open-source LLMs, Gemma-2-2B (Team et al., 2024) and Llama-3.1-8B (Dubey et al., 2024). Both models are top performers in their respective categories and excel at language-related tasks. We did not choose GPT-4o (OpenAI et al., 2024) or other proprietary LLMs due to concerns that they may internally integrate function calling (e.g., invoking APIs or executing Python programs), which could affect the experimental setup. Training details are included in Appendix A.1.

Conventional Data Format

We directly train the model to predict the output (e.g., 130) given the input operands and the operator (e.g., 13 × 10). We add one space between each digit to ensure tokens are split into individual digits. We do not consider chain-of-thought (CoT) (Wei et al., 2023) or other prompting strategies, in order to keep the model focused on arithmetic learning. We include ablations with respect to data format in Appendix A.2.
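The digit-spacing step can be made concrete with a short sketch (the exact template string is our illustration; the paper only specifies that digits are space-separated):

def format_example(a: int, b: int, op: str, c: int) -> str:
    # Insert a space between every digit so the tokenizer splits each digit into its own token.
    def spaced(x: int) -> str:
        return " ".join(str(x))
    return f"{spaced(a)} {op} {spaced(b)} = {spaced(c)}"

print(format_example(13, 10, "*", 130))  # 1 3 * 1 0 = 1 3 0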

5 Are Large Language Models Implicit Calculators?

In this section, we explore whether LLMs utilize partial products to enhance their arithmetic calculation capabilities, particularly in the context of multiplication. It is important to note that while multiplication is well defined mathematically, the process of computing a product is not limited to the traditional method defined in textbooks. Thus, examining only one calculation method presents a flawed experimental design that is vulnerable to exploitation. We selected four representative calculation methods to cover the major approaches to multiplication.

5.1 Historical and Modern Multiplication

In terms of multiplication, four calculation methods are most representative, from historical to modern: Standard Multiplication, Repetitive Addition, the Lattice Method, and Egyptian Multiplication.

M1: Standard Multiplication

In standard multiplication, we multiply each digit of one number by each digit of the other number, and then sum the results appropriately:

12 × 34 = 12 × (30 + 4) = 12 × 30 + 12 × 4
        = 360 + 48 = 408
M2: Repetitive Addition

Multiplication can be interpreted as repeated addition. For 12 × 34, we add 12 thirty-four times:

12 × 34 = 12 + 12 + 12 + … + 12 (34 times)
        = 408
M3: Lattice Method

In the lattice method (or grid method), we place the numbers along the top and side of a grid, perform single-digit multiplications, and then sum along the diagonals:

12 × 34: 10 × 30 = 300
         10 × 4  = 40
         2 × 30  = 60
         2 × 4   = 8
Summing the results: 300 + 40 + 60 + 8 = 408
M4: Egyptian Multiplication

Egyptian multiplication computes the product by doubling the multiplicand and adding the results corresponding to the powers of two that sum to the multiplier. For 12 × 34:

12 × 1  = 12
12 × 2  = 24   (selected)
12 × 4  = 48
12 × 8  = 96
12 × 16 = 192
12 × 32 = 384  (selected)
Summing the selected results: 24 + 384 = 408

Since 34 = 2 + 32, we select the results for 12 × 2 and 12 × 32, and summing these gives the final product.
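The four computation paths can be made concrete with a short sketch that returns the partial products each method produces for two 2-digit operands (our own simplified rendering, not the paper's code):

def standard_partials(a, b):
    # Each digit of one operand times the full other operand.
    a1, a2 = divmod(a, 10)
    b1, b2 = divmod(b, 10)
    return [a1 * b, a2 * b, b1 * a, b2 * a]

def lattice_partials(a, b):
    # Single-digit products weighted by their place values; their sum is the product.
    a1, a2 = divmod(a, 10)
    b1, b2 = divmod(b, 10)
    return [10 * a1 * 10 * b1, 10 * a1 * b2, a2 * 10 * b1, a2 * b2]

def egyptian_partials(a, b):
    # Doublings of a, selected according to the binary representation of b.
    doublings, k = [], 1
    while k <= b:
        doublings.append((k, k * a))
        k *= 2
    selected = [p for k, p in doublings if b & k]
    return doublings, selected

def repetitive_addition(a, b):
    # a added to itself b times.
    total = 0
    for _ in range(b):
        total += a
    return total

print(standard_partials(12, 34))     # [34, 68, 36, 48]
print(lattice_partials(12, 34))      # [300, 40, 60, 8] -> sums to 408
print(egyptian_partials(12, 34)[1])  # [24, 384] -> sums to 408
print(repetitive_addition(12, 34))   # 408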

                      Gemma-2-2B                                 Llama-3.1-8B
                      Standard  Lattice  Repetitive  Egyptian    Standard  Lattice  Repetitive  Egyptian
Task → Partial P.     +4.1%     +6.8%    -29.0%      +3.6%       +40.6%    +40.8%   -59.0%      +29.6%
Partial P. → Task     -6.1%     -10.7%   -20.3%      -9.6%       -3.7%     -0.2%    -0.9%       -2.7%
Table 1: Inductive and deductive accuracy difference Δ.

5.2 Examining Partial Product in Arithmetic Learning

To investigate whether LLMs generate partial products during arithmetic learning, we employ a set of diagnostic tasks as a tracing approach. We fine-tune Gemma-2-2B and Llama-3.1-8B on two-digit multiplication, observing changes in accuracy on diagnostic sets before and after fine-tuning (Task → Partial Products). Subsequently, we fine-tune the LLMs on these diagnostic sets and examine how their accuracy on the multiplication task changes (Partial Products → Task).

Method                     Diagnostic Sets
Standard Multiplication    \mathcal{P}_{\text{std}} = \{A_{1}\times B_{1}B_{2},\ A_{2}\times B_{1}B_{2},\ B_{1}\times A_{1}A_{2},\ B_{2}\times A_{1}A_{2}\}
Repetitive Addition        \mathcal{P}_{\text{ra}} = \{\sum_{i=1}^{B_{1}B_{2}}A_{1}A_{2},\ \sum_{i=1}^{A_{1}A_{2}}B_{1}B_{2}\}
Lattice Method             \mathcal{P}_{\text{lattice}} = \{A_{1}0\times B_{1}0,\ A_{1}0\times B_{2},\ A_{2}\times B_{1}0,\ A_{2}\times B_{2}\}
Egyptian Multiplication    \mathcal{P}_{\text{egyptian}} = \{2^{k}\times A_{1}A_{2}\mid k\in 0,1,\ldots,\lfloor\log_{2}(B_{1}B_{2})\rfloor\}
Table 2: Diagnostic sets with four calculation methods.

We probe language models' partial products along four different paths. As shown in Table 2, for a task formatted as A_{1}A_{2}\times B_{1}B_{2}=C_{1}C_{2}C_{3}C_{4}, we generate a diagnostic test set for each algorithm (detailed derivation in Appendix A.3).
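The diagnostic sets of Table 2 can be instantiated for a concrete operand pair as probe (expression, answer) items; the sketch below uses our own illustrative formatting, not the exact probe strings from the paper:

def diagnostic_probes(a, b):
    # Build probe items per Table 2 for a single multiplication a x b (2-digit operands).
    a1, a2 = divmod(a, 10)
    b1, b2 = divmod(b, 10)
    return {
        "standard": [(f"{d} * {o}", d * o) for d, o in [(a1, b), (a2, b), (b1, a), (b2, a)]],
        "lattice": [(f"{x} * {y}", x * y)
                    for x, y in [(a1 * 10, b1 * 10), (a1 * 10, b2), (a2, b1 * 10), (a2, b2)]],
        "egyptian": [(f"{2 ** k} * {a}", 2 ** k * a)
                     for k in range(b.bit_length())],  # k = 0, ..., floor(log2(b))
    }

for name, items in diagnostic_probes(12, 34).items():
    print(name, items)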

Accuracy on Identifying Partial Products

According to the results in Figure 2, we found that standard multiplication, the lattice method, and the Egyptian method significantly improved in identifying partial products after fine-tuning, with gains of +17.45%, +18.35%, and +10.45%, respectively. However, for repetitive addition tasks, LLMs failed to identify partial products, achieving an accuracy of only about 5% after fine-tuning.

Figure 2: Partial products identification accuracy before and after fine-tuning on tasks. Scores are averaged over Gemma-2-2B and Llama-3.1-8B.
A Deeper Look into Calculations

Do the results showing increased accuracy across three paths really imply that partial products are used in arithmetic learning? We have two arguments against this interpretation. First, if LLMs genuinely use partial products to learn arithmetic, it is likely that they would use only one calculation path at a time. Thus, the simultaneous improvement across three paths (standard, lattice, and Egyptian) is unusual. Second, if LLMs employ a specific path to compute partial products, this process should be bidirectional. Specifically, LLMs fine-tuned on a task should be able to identify partial products (inductive), and conversely, mastering partial products should enhance task learning (deductive). However, we currently have results for only one direction, lacking evidence for the other. Therefore, we extend our experiments to the other direction.

Accuracy on Identifying Tasks

We fine-tune two LLMs on the diagnostic sets and present the results of identifying tasks before and after fine-tuning in Table 1. Our findings reveal that fine-tuning specifically on partial products does not enhance task learning. Instead, it results in a performance drop across all four calculation paths for both models. This indicates that pre-learning partial products does not aid in arithmetic learning. The improved ability to recognize partial products appears to stem from the symbol learning process (note that the standard partial product A_{1}\times B_{1}B_{2} is a sub-portion of A_{1}A_{2}\times B_{1}B_{2}, similar to the lattice and Egyptian methods) rather than being an intrinsic computational method used by the models.

6 Are Language Models Symbolic Observers?

An intuitive alternative for explaining the increased performance on inductive tasks (Task → Partial Products) is to treat LLMs as subgroup-level symbol observers, which aligns with the intrinsic properties of language modeling. Notably, standard multiplication, the lattice method, and the Egyptian method share a similar structure, where their partial product sets form token-level subgroups within the tasks. This observation naturally leads us to explore this idea further.

6.1 Define Subgroup in Token Level

We first define subgroups in this section. Arithmetic learning involves a training set \mathbb{D}_{\text{train}}=\{((a^{(k)},b^{(k)}),c^{(k)})\}_{k=1}^{N}, where c^{(k)}=f(a^{(k)},b^{(k)}), f(\cdot) represents a binary operator, and N is the number of data points. In n-digit arithmetic learning, a^{(k)} and b^{(k)} can be regarded as different values taken by the random variable sequences \{A_{i}\}_{i=1}^{n} and \{B_{i}\}_{i=1}^{n}, respectively. The random variables A_{i} and B_{i} all follow a discrete uniform distribution P(X=x)=\frac{1}{10}, x=0,1,\ldots,9. c^{(k)} can be regarded as different values taken by the random variable sequence \{C_{i}\}_{i=1}^{m}, where the random variables C_{i}, i=1,\ldots,m, have a joint distribution given by:

I_{\{f(a,b)=c\}}\,P(\{A_{i}\}_{i=1}^{n}=a)\,P(\{B_{i}\}_{i=1}^{n}=b)

where I_{\{f(a,b)=c\}} is an indicator function that equals 1 if the condition f(a,b)=c holds and 0 otherwise.

Definition 1 (Subgroup):

For n-digit arithmetic, a subgroup s is defined as any element of the set \mathbb{S}_{n}:

s\in\mathbb{S}_{n}=\{((\mathbb{A},\mathbb{B}),\mathbb{C})\mid\mathbb{A}\subseteq\{A_{i}\}_{i=1}^{n},\ \mathbb{B}\subseteq\{B_{i}\}_{i=1}^{n},\ \mathbb{C}\subseteq\{C_{i}\}_{i=1}^{m}\}

where \mathbb{A}, \mathbb{B} and \mathbb{C} can be any sub-portion of the random variable sequences \{A_{i}\}_{i=1}^{n}, \{B_{i}\}_{i=1}^{n} and \{C_{i}\}_{i=1}^{m}, respectively. Specifically, we use s_{i}\in\mathbb{S}_{n} to denote the subgroup used in the i-th prediction, for C_{i}.

Definition 2 (Subgroup Space):

For any subgroup s=((\mathbb{A},\mathbb{B}),\mathbb{C})\in\mathbb{S}_{n}, we have:

  • Domain Space: \mathcal{D}_{s}=\{(\{a\}_{k=1}^{|\mathbb{A}|},\{b\}_{k=1}^{|\mathbb{B}|})\mid P(\mathbb{A}=\{a\}_{k=1}^{|\mathbb{A}|})>0,\ P(\mathbb{B}=\{b\}_{k=1}^{|\mathbb{B}|})>0\}. The size of the domain space, or cardinality, is denoted |\mathcal{D}_{s}|.

  • Label Space: \mathcal{L}_{s}=\{\{c\}_{k=1}^{|\mathbb{C}|}\mid P(\mathbb{C}=\{c\}_{k=1}^{|\mathbb{C}|})>0\}. The size of the label space is denoted |\mathcal{L}_{s}|.

The entropy of the label space is given by:

H(\mathcal{L}_{s})=-\sum_{y\in\mathcal{L}_{s}}P(\mathbb{C}=y)\log_{2}P(\mathbb{C}=y)
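Label space entropy can be estimated by direct enumeration. The sketch below computes the per-position entropies for plain 2-digit addition; assuming operands range over 10-99, the resulting values line up with the first row of Table 3:

import math
from collections import Counter

def label_entropy(labels):
    # Shannon entropy (in bits) of an empirical label distribution.
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Enumerate all 2-digit additions and zero-pad the sum to three output digits C1 C2 C3.
outputs = [f"{a + b:03d}" for a in range(10, 100) for b in range(10, 100)]
for i in range(3):
    print(f"H(C{i + 1}) = {label_entropy(o[i] for o in outputs):.4f}")
print(f"|L| = {len(set(outputs))}, H(L) = {label_entropy(outputs):.4f}")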
Task                      Format                            H(C1)    H(C2)    H(C3)    H(C4)    H(C5)    |L|     H(L)
f(a,b)=a+b                A1A2+B1B2=C1C2C3                  0.9710   3.3215   3.3219   -        -        179     7.2130
f(a,b)=a+b+1              A1A2+B1B2=C1C2C3                  0.9649   3.3215   3.3219   -        -        179     7.2130
f(a,b)=a+b+15             A1A2+B1B2=C1C2C3                  0.9280   3.3214   3.3219   -        -        179     7.2130
f(a,b)=a+b+115            A1A2+B1B2=C1C2C3                  0.9280   3.3214   3.3219   -        -        179     7.2130
f(a,b)=(a+b) mod 100      A1A2+B1B2=C1C2                    3.3214   3.3219   -        -        -        100     6.6432
f(a,b)=(a+b) mod 50       A1A2+B1B2=C1C2                    2.3217   3.3219   -        -        -        50      5.6436
f(a,b)=(a+b) mod 10       A1A2+B1B2=C1                      3.3219   -        -        -        -        10      3.3219
f(a,b)=a×b                A1A2×B1B2=C1C2C3C4                2.8979   3.3215   3.3160   3.0340   -        2621    11.1172
f(a,b)=a×b×2              A1A2×B1B2=C1C2C3C4C5              0.6873   3.2173   3.3215   3.2964   2.2227   2621    11.1172
f(a,b)=a×b×4              A1A2×B1B2=C1C2C3C4C5              1.6030   3.3020   3.3204   3.2234   2.2227   2621    11.1172
f(a,b)=a×b×8              A1A2×B1B2=C1C2C3C4C5              2.5811   3.3202   3.3151   3.2235   2.2227   2621    11.1172
f(a,b)=(a×b) mod 100      A1A2×B1B2=C1C2                    3.3160   3.0340   -        -        -        100     6.2912
f(a,b)=(a×b) mod 50       A1A2×B1B2=C1C2                    2.3210   3.0340   -        -        -        50      5.3494
f(a,b)=(a×b) mod 10       A1A2×B1B2=C1                      3.0340   -        -        -        -        10      3.0340
Table 3: Label space statistics with different rule perturbations. H(C_i) is the entropy of the label space at output position C_i; |L| is the size and H(L) the entropy of the full label space over all output digits.

6.2 Difficulty in Arithmetic Learning: A Well-educated Hypothesis

We propose the following hypothesis: For an n-digit arithmetic task requiring m predictions, the overall difficulty \zeta is related to:

\zeta\propto\hat{h}^{m} (2)

where \hat{h}=\prod_{i=1}^{m}h(s_{i})^{\frac{1}{m}} represents the geometric average difficulty of an individual prediction, with s_{i}\in\mathbb{S}_{n} denoting the subgroup selection for the i-th prediction, and h(\cdot) representing the subgroup complexity evaluation function.

Subgroup Complexity (h(\cdot)): This represents the most basic difficulty in arithmetic learning. It captures the inherent difficulty in the structure of the arithmetic learning task. We believe that the complexity of a subgroup is strongly correlated with the properties of its subgroup space:

  • Domain Space Cardinality |\mathcal{D}|: The size of the domain space |\mathcal{D}| determines how many data points are available for learning a pattern. If the label space is fixed, a larger domain space generally leads to improved learning outcomes.

  • Label Space Entropy H(\mathcal{L}): Label space entropy H(\mathcal{L}) is also a critical factor in learning, as a low-entropy label space often leads to higher predictability.

  • Subgroup Quality Q: For any subgroup s=((\mathbb{A},\mathbb{B}),\mathbb{C})\in\mathbb{S}_{n},

    Q(s)=\max_{g\in\Omega_{s}}\sum_{a^{\prime}=\{a\}_{k=1}^{|\mathbb{A}|}}\sum_{b^{\prime}=\{b\}_{k=1}^{|\mathbb{B}|}}\sum_{c^{\prime}=\{c\}_{k=1}^{|\mathbb{C}|}}P_{s}^{f}(a^{\prime},b^{\prime},c^{\prime})\,P_{s}^{p}(g,a^{\prime},b^{\prime},c^{\prime}) (3)

    where \Omega_{s}:\mathcal{D}_{s}\to\Theta(\mathcal{L}_{s}) represents a function space mapping from \mathcal{D}_{s} to \Theta(\mathcal{L}_{s}), and \Theta(\mathcal{L}_{s}) denotes the space of random variables taking values in \mathcal{L}_{s}. Here, P_{s}^{p}(g,a^{\prime},b^{\prime},c^{\prime})=P(g(a^{\prime},b^{\prime})=c^{\prime}) represents the probability that the function g maps (a^{\prime},b^{\prime}) to c^{\prime}, and P_{s}^{f}(a^{\prime},b^{\prime},c^{\prime}) represents the probability that \mathbb{A}=a^{\prime}, \mathbb{B}=b^{\prime} and \mathbb{C}=c^{\prime} hold simultaneously in arithmetic tasks. Thus, Q(s) measures the maximum possible probability of predicting a value of \mathbb{C} that is consistent with the values in the dataset from the values of (\mathbb{A},\mathbb{B}); a small enumeration sketch is given right after this list.
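Under this reading of Equation 3, Q(s) for a concrete subgroup can be approximated by enumeration: for each observed value of the domain tokens, the best any deterministic map g can do is predict the most frequent label. The sketch below (our interpretation, with illustrative subgroups) shows that the last-digit subgroup of 2-digit multiplication is fully determined while a middle digit is not:

from collections import Counter, defaultdict

def subgroup_quality(examples, domain_fn, label_fn):
    # Empirical Q(s): probability mass explained by the best deterministic map
    # from the chosen domain sub-tokens to the chosen label sub-token.
    buckets = defaultdict(Counter)
    for a, b, c in examples:
        buckets[domain_fn(a, b)][label_fn(c)] += 1
    total = sum(sum(counter.values()) for counter in buckets.values())
    best = sum(counter.most_common(1)[0][1] for counter in buckets.values())
    return best / total

# All 2-digit multiplications, products zero-padded to four output digits C1 C2 C3 C4.
data = [(a, b, f"{a * b:04d}") for a in range(10, 100) for b in range(10, 100)]

# ({A2}, {B2}) fully determines the last digit C4 ...
print(subgroup_quality(data, lambda a, b: (a % 10, b % 10), lambda c: c[3]))  # 1.0
# ... but barely constrains the middle digit C2.
print(subgroup_quality(data, lambda a, b: (a % 10, b % 10), lambda c: c[1]))  # well below 1.0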

Subgroup Selection (s_{i}): s_{i}\in\mathbb{S}_{n} represents the subgroup selection for the i-th prediction. When LLMs predict the token in the i-th position, they must select subgroups that include C_{i} to align with the underlying pattern. This reflects the learning dynamics of language models in arithmetic tasks, abstractly linked to their decision-making and selection processes. As discussed in §6.4, LLMs seem to initially select the subgroup s_{i} with high quality Q_{\text{high}}(s_{i}), progressing to lower quality Q_{\text{low}}(s_{i}) (easy-to-hard) during learning.

6.3 Subgroup Complexity: Label Space Matters in the Final Stage

In this section, we discuss subgroup complexity in arithmetic learning. The domain space cardinality |\mathcal{D}| represents the amount of training data available, which is an obvious factor influencing learning. Subgroup quality Q(s) will be detailed in §6.4. Thus, we primarily focus on label space entropy H(\mathcal{L}) in this section.

Rule Perturbation

We first deliberately perturb the rules to observe whether these changes affect task difficulty for LLMs. We consider addition f(a,b)=a+b and multiplication f(a,b)=a\times b as our baselines. For addition, the perturbation is defined as f(a,b)=a+b+\Delta c, where \Delta c=1,15,115 corresponds to perturbations at different positions with different magnitudes. For multiplication, the perturbation is defined as f(a,b)=a\times b\times\lambda, where \lambda=2,4,8, following similar reasoning. Additionally, we incorporate modular addition and multiplication as further perturbations. Table 3 shows how the label space entropy changes after applying these perturbations. We then fine-tune Gemma-2-2B and Llama-3.1-8B under the different perturbation rules to observe how strongly these perturbations influence learning.
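The effect of these perturbations on the label space can be checked by enumeration; the sketch below (our own verification script, not the authors' code) confirms that translation and scaling leave |L| and H(L) unchanged while modular reduction shrinks both, consistent with Table 3:

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

rules = {
    "a+b": lambda a, b: a + b,
    "a+b+115": lambda a, b: a + b + 115,
    "(a+b) mod 10": lambda a, b: (a + b) % 10,
    "a*b": lambda a, b: a * b,
    "a*b*8": lambda a, b: a * b * 8,
    "(a*b) mod 10": lambda a, b: (a * b) % 10,
}

pairs = [(a, b) for a in range(10, 100) for b in range(10, 100)]
for name, f in rules.items():
    labels = [f(a, b) for a, b in pairs]
    print(f"{name:>13}: |L| = {len(set(labels)):4d}, H(L) = {entropy(labels):.4f}")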

Results

The results in Table 4 demonstrate that across two different rule perturbation methods and three distinct setups, both Gemma-2-2B and Llama-3.1-8B yield consistent outcomes. While calculating 13 × 27 = 2808 may seem counterintuitive, LLMs handle this arithmetic the same as 13 × 27 = 351 when the label space entropy H(\mathcal{L}) is fixed.

                             Gemma-2-2B   Llama-3.1-8B
f(a,b)=a+b                   -            -
f(a,b)=a+b+1                 -0.1%        -0.1%
f(a,b)=a+b+15                -0.9%        +0.1%
f(a,b)=a+b+115               -1.4%        +0.7%
f(a,b)=(a+b) mod 100         +10.1%       +3.7%
f(a,b)=(a+b) mod 50          +13.1%       +6.7%
f(a,b)=(a+b) mod 10          +26.1%       +13.7%
f(a,b)=a×b                   -            -
f(a,b)=a×b×2                 -1.1%        -2.7%
f(a,b)=a×b×4                 -1.7%        +0.7%
f(a,b)=a×b×8                 +0.2%        -3.7%
f(a,b)=(a×b) mod 100         +7.1%        +3.8%
f(a,b)=(a×b) mod 50          +12.1%       +5.3%
f(a,b)=(a×b) mod 10          +18.9%       +10.7%
Table 4: Test accuracy difference Δ on perturbed addition and multiplication.

Regarding modular addition and multiplication with different moduli, we find that decreasing the entropy leads to performance improvements in both cases. These results highlight that an arithmetic task with low variability in the label space is more learnable. Together, these two observations reinforce the notion that LLMs are not performing traditional calculations but are instead functioning as sophisticated symbolic observers within the token space.

Figure 3: Position-level accuracy from Gemma-2-2B and Llama-3.1-8B.

6.4 Subgroup Selection: Revealing Learning Dynamics in Arithmetic Learning

In the previous section, we established that the subgroup space provides a basis for quantifying complexity in arithmetic tasks. However, the results mainly highlight insights at the end of learning (test accuracy after 12 epochs), leaving the learning dynamics less explored. Here, we investigate these dynamics by analyzing digit-level accuracy in model outputs, observing how LLMs select subgroups from \mathbb{S}_{n} based on their performance across different positions.

Settings

We maintain the same basic experimental settings as in the previous section to ensure the discussion remains within the same scope. We train Gemma-2-2B and Llama-3.1-8B on four different dataset sizes (6.48K, 12.96K, 32.4K, and 64.8K). Our experiments cover multiplication tasks ranging from 3-digit to 5-digit numbers, with output lengths from 6 to 10 digits.

Position-level Accuracy Follows a U-curve

Figure 3 reveals a phenomenon overlooked in previous studies. Contrary to the common assumption that position-level accuracy decreases from right to left due to carryover effects and least-to-most significant digit calculations (Lee et al., 2023; Zhang-Li et al., 2024), our results show a U-shaped accuracy curve in both Gemma-2-2B and Llama-3.1-8B. Accuracy peaks at the beginning and end positions, exceeding 95%, with lower accuracy (around 10%) in the middle positions, especially in higher-digit multiplication (e.g., the 4th/5th digits for 4-digit and the 5th/6th digits for 5-digit multiplication). These results provide valuable insights, suggesting that the difficulty in learning multiplication is concentrated in the middle positions rather than at the beginning, which conceptually corresponds to the final steps of calculation.
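Position-level accuracy as used here reduces to a digit-wise comparison between predictions and references; the small sketch below uses toy outputs (the real evaluation is of course run on model generations):

def position_accuracy(preds, golds):
    # Fraction of correct digits at each output position; strings are assumed equal length.
    n_pos = len(golds[0])
    hits = [0] * n_pos
    for pred, gold in zip(preds, golds):
        for i in range(n_pos):
            hits[i] += int(i < len(pred) and pred[i] == gold[i])
    return [h / len(golds) for h in hits]

# Toy 6-digit outputs for 3-digit multiplication: edge digits right, middle digits wrong.
golds = ["123456", "654321", "111111"]
preds = ["120056", "650021", "110011"]
print(position_accuracy(preds, golds))  # [1.0, 1.0, 0.0, 0.0, 1.0, 1.0] -- a U-shape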

Subgroup Selection via Quality

We think that the U-curve occurs because the subgroup for middle digits has a lower subgroup quality Q(s) compared to the beginning and end digits. Given a subgroup s=((\mathbb{A},\mathbb{B}),\mathbb{C}), where \mathbb{C}_{\text{mid}} represents the middle digits, more digits from the operands are required to determine the value of \mathbb{C}_{\text{mid}} compared to digits at the beginning or end positions. This leads to a larger domain size |\mathcal{D}| when summing over all candidates. Consequently, determining the value of \mathbb{C} becomes less certain, resulting in a lower Q(s). Specifically, when \mathbb{C} represents the last digit, its value can be fully determined by the least significant digits of the operands, which is not the case when \mathbb{C} represents a middle digit. For instance, in the case of 3-digit multiplication, it is relatively easy to identify the subgroup s=((\{A_{1}\},\{B_{1}\}),\{C_{1}\}) for learning the first digit or the subgroup s=((\{A_{3}\},\{B_{3}\}),\{C_{6}\}) for learning the last digit. This observation explains why the digits at the beginning or the end are easier to learn. It also reveals that LLMs fit arithmetic learning patterns following an easy-to-hard mechanism (from high Q(s) to low Q(s)), which amounts to gradient descent in the "fastest" symbolic direction.

7 Conclusions

In this work, we investigate whether LLMs solve arithmetic tasks using partial products or operate as symbolic pattern matchers. Our findings confirm that LLMs do not rely on partial products but instead approach arithmetic purely symbolically. By breaking tasks into subgroups, we demonstrate that the difficulty in arithmetic learning can be attributed to subgroup complexity and selection. Our results emphasize the crucial role of label space entropy in understanding the convergence stage and the quality of subgroups for learning dynamics. Overall, at least in our setting, LLMs function as purely symbolic learners in arithmetic tasks. We encourage further research to explore more complex tasks from this perspective.

8 Limitations

In terms of limitations, one area our work does not currently address is the application of our framework to different chain-of-thought (CoT) methods (Wei et al., 2023; Deng et al., 2024). While CoT has proven to be highly effective in arithmetic learning, particularly by decomposing overall difficulty into smaller, more manageable operations (thereby reducing subgroup complexity from exponential to linear), this aspect has not been explored in our study. Additionally, we have not applied our framework in a fully natural-language setting like GSM8K or MATH. Exploring how LLMs leverage their symbolic capabilities in such a context could provide deeper insights into their reasoning abilities, particularly in tasks that require structured, multi-step problem solving. These unexplored areas present significant opportunities for future research.

9 Ethics Statement

Our research primarily focuses on the symbolic learning capabilities of large language models in arithmetic tasks. As such, it does not involve the collection or use of any human data. No personal or sensitive information is handled or analyzed in this study. We acknowledge the potential biases inherent in the datasets used for model training and the limitations of relying on symbolic learning without fully understanding the underlying numerical or logical processes. The societal impact of increased reliance on LLMs for arithmetic tasks, including overconfidence in symbolic learning without full comprehension, warrants careful consideration. We advocate for transparent model evaluations and awareness of the limitations in deploying such models for critical decision-making.

References

Appendix A Appendix

A.1 Training Detail

We carefully tuned the hyperparameters in our experiments. The learning rate for Gemma-2-2B was set to 1e-4, while for Llama-3.1-8B it was 2e-4. Both models used a warm-up of 5 steps and a weight decay of 0.01. We trained for 12 epochs, splitting the dataset into 80% for training, 10% for validation, and 10% for testing. Evaluation was conducted at the end of each epoch, with checkpoints saved based on the best performance on the validation set. LoRA fine-tuning was used for both models with the same settings: lora_rank = 64, lora_alpha = 16, lora_dropout = 0, with rank-stabilized LoRA (Kalajdzievski, 2023) enabled during training. We use Unsloth (https://unsloth.ai/) to speed up training and vLLM (Kwon et al., 2023) to speed up inference.
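A minimal configuration sketch of this setup against the Hugging Face peft API is shown below; the model identifiers and the use of get_peft_model are our approximation of the Unsloth-based pipeline, not its exact code:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b"  # or "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hyperparameters as reported above; use_rslora enables rank-stabilized LoRA.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.0,
    use_rslora=True,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()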

Task           Format                                         H(C1)    H(C2)    H(C3)    H(C4)    H(C5)    |L|     H(L)
f(a,b)=a+b     A1A2+B1B2=C1C2C3                               0.9710   3.3215   3.3219   -        -        179     7.2130
f(a,b)=a+b     What is A1A2 add B1B2? Answer: C1C2C3          0.9649   3.3215   3.3219   -        -        179     7.2130
f(a,b)=a+b     fafr if A1A2 hfk B1B2? Ffhjar: C1C2C3          3.3214   3.3219   -        -        -        179     7.2130
f(a,b)=a+b     3.123 34 A1A2 461 B1B2? 952414: C1C2C3         0.9280   3.3214   3.3219   -        -        179     7.2130
f(a,b)=a×b     A1A2×B1B2=C1C2C3C4C5                           2.5811   3.3202   3.3151   3.2235   2.2227   2621    11.1172
f(a,b)=a×b     What is A1A2 multiply B1B2? Answer: C1C2C3C4   2.8979   3.3215   3.3160   3.0340   -        2621    11.1172
f(a,b)=a×b     fafr if A1A2 hfk B1B2? Ffhjar: C1C2C3C4        0.6873   3.2173   3.3215   3.2964   2.2227   2621    11.1172
f(a,b)=a×b     3.123 34 A1A2 461 B1B2? 952414: C1C2C3C4       1.6030   3.3020   3.3204   3.2234   2.2227   2621    11.1172
Table 5: Label space statistics with different format perturbations. H(C_i) is the entropy at output position C_i; |L| is the size and H(L) the entropy of the full label space over all output digits.
              Format             Gemma-2-2B   Llama-3.1-8B
f(a,b)=a+b    Natural Language   -            -
f(a,b)=a+b    Random String      +0.1%        -0.2%
f(a,b)=a+b    Disturbed Digits   -3.9%        -2.1%
f(a,b)=a×b    Natural Language   -            -
f(a,b)=a×b    Random String      +0.3%        -0.5%
f(a,b)=a×b    Disturbed Digits   -1.9%        -3.1%
Table 6: Test accuracy difference Δ on perturbed addition and multiplication.

A.2 Format Perturbations in Arithmetic Tasks

In this section, we apply three types of format perturbations to basic addition and multiplication tasks to evaluate the symbolic reasoning capabilities of large language models (LLMs). Our experiments utilize Gemma-2-2B and Llama-3.1-8B models. The primary objective of varying the input format is to investigate whether LLMs function as purely symbolic learners in arithmetic tasks. We consider the following three types of format perturbation in Table 5:

  1. Natural Language (NL): Arithmetic expressions are converted into natural language statements. For example, the equation "3+5" becomes "What is 3 add 5?"

  2. Random String (RS): Arithmetic expressions are first converted into natural language, and the surrounding words are then replaced with meaningless strings. Using the previous example, "3+5" might be transformed into "flad kf 3 lfd 5?"

  3. Disturbed Digits (DD): Arithmetic expressions are initially converted into natural language, and the surrounding words are subsequently replaced with random digits. For instance, "3+5" could become "65.1 44 3 4 5?" This approach creates a counterfactual context for arithmetic tasks, increasing the difficulty for the models.

By implementing these perturbations, we aim to assess the robustness of LLMs in handling arithmetic operations under varied and challenging input formats.
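A sketch of the three templates follows; the RS and DD strings are taken from Table 5, while the wrapper function itself is our own illustration:

templates = {
    "NL": "What is {a} add {b}? Answer: {c}",
    "RS": "fafr if {a} hfk {b}? Ffhjar: {c}",
    "DD": "3.123 34 {a} 461 {b}? 952414: {c}",
}

def render(a, b, c):
    # Fill each perturbed template with the same operands and answer.
    return {name: t.format(a=a, b=b, c=c) for name, t in templates.items()}

for name, text in render(3, 5, 8).items():
    print(f"{name}: {text}")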

Results

We examined the impact of three types of input format perturbations (Natural Language (NL), Random String (RS), and Disturbed Digits (DD)) on arithmetic reasoning tasks for the Gemma-2-2B and Llama-3.1-8B models. Table 6 shows that across both addition and multiplication tasks, the performance of the models remains largely unaffected by the perturbations when the label space is fixed. Specifically, there is only a marginal change in accuracy under the NL and RS formats, while the DD format causes minor fluctuations but does not significantly degrade performance. This demonstrates that LLMs can effectively handle various input perturbations as long as the output space remains consistent, suggesting their robustness in symbolic reasoning tasks despite superficial input variations.

A.3 Mathematical Explanation of Diagnostic Sets for Multiplication Algorithms

For the multiplication task involving two two-digit numbers formatted as A_{1}A_{2}\times B_{1}B_{2}=C_{1}C_{2}C_{3}C_{4}, we generate diagnostic test sets \mathcal{P} for each algorithm to analyze and understand the partial computations involved. Below, we provide a mathematical explanation for the formulation of these diagnostic sets for each multiplication algorithm.

A.3.1 Standard Multiplication

In the standard multiplication algorithm, we multiply each digit of one number by each digit of the other number and sum the appropriately weighted results.

Formulation: Let the two-digit numbers be expressed as:

a = A_{1}A_{2} = 10A_{1} + A_{2},
b = B_{1}B_{2} = 10B_{1} + B_{2}.

The product is:

ab = (10A_{1}+A_{2})(10B_{1}+B_{2}).

Expanding, we get four partial products:

ab = 100A_{1}B_{1} + 10A_{1}B_{2} + 10A_{2}B_{1} + A_{2}B_{2}.

Diagnostic Set:

\mathcal{P}_{\text{std}} = \{A_{1}\times b,\ A_{2}\times b,\ B_{1}\times a,\ B_{2}\times a\}.

Explanation:

  • A_{1}\times b: Multiplying the tens digit of a by the entire number b:

    A_{1}\times b = A_{1}\times(10B_{1}+B_{2}) = 10A_{1}B_{1} + A_{1}B_{2}.

  • A_{2}\times b: Multiplying the units digit of a by b:

    A_{2}\times b = A_{2}\times(10B_{1}+B_{2}) = 10A_{2}B_{1} + A_{2}B_{2}.

  • B_{1}\times a: Multiplying the tens digit of b by a:

    B_{1}\times a = B_{1}\times(10A_{1}+A_{2}) = 10A_{1}B_{1} + A_{2}B_{1}.

  • B_{2}\times a: Multiplying the units digit of b by a:

    B_{2}\times a = B_{2}\times(10A_{1}+A_{2}) = 10A_{1}B_{2} + A_{2}B_{2}.

Including these partial products in \mathcal{P}_{\text{std}} captures all intermediary computations in the standard algorithm, facilitating a comprehensive diagnostic analysis.

A.3.2 Repetitive Addition

Repetitive addition interprets multiplication as adding one number to itself repeatedly.

Diagnostic Set:

\mathcal{P}_{\text{ra}} = \left\{\sum_{i=1}^{b}a,\ \sum_{j=1}^{a}b\right\}.

Explanation:

  • \sum_{i=1}^{b}a: Adding a to itself b times:

    \sum_{i=1}^{b}a = a + a + \dots + a \quad (b \text{ times}) = ab.

  • \sum_{j=1}^{a}b: Adding b to itself a times:

    \sum_{j=1}^{a}b = b + b + \dots + b \quad (a \text{ times}) = ab.

Both summations lead to the same product ab, and including them in \mathcal{P}_{\text{ra}} allows for analyzing both repetitive addition paths in the algorithm.

A.3.3 Lattice Method

The lattice method (or grid method) organizes the multiplication of each digit pair in a grid and sums along diagonals.

Diagnostic Set:

\mathcal{P}_{\text{lattice}} = \{A_{1}\times B_{1},\ A_{1}\times B_{2},\ A_{2}\times B_{1},\ A_{2}\times B_{2}\}.

Explanation:

  • A_{1}\times B_{1}: Tens digit of a times tens digit of b.

  • A_{1}\times B_{2}: Tens digit of a times units digit of b.

  • A_{2}\times B_{1}: Units digit of a times tens digit of b.

  • A_{2}\times B_{2}: Units digit of a times units digit of b.

These products fill the cells of the lattice grid:

        B_{1}        B_{2}
A_{1}   A_{1}B_{1}   A_{1}B_{2}
A_{2}   A_{2}B_{1}   A_{2}B_{2}

Summing along the diagonals yields the final product. Including these partial products in \mathcal{P}_{\text{lattice}} covers all the necessary computations in the lattice method.

A.3.4 Egyptian Multiplication

Egyptian multiplication involves doubling the multiplicand and adding specific results based on the binary representation of the multiplier.

Diagnostic Set:

\mathcal{P}_{\text{egyptian}} = \{2^{k}\times a \mid k=0,1,\dots,\lfloor\log_{2}b\rfloor\}.

Explanation:

  • Binary Representation of b: Express b as a sum of powers of two:

    b = \sum_{k=0}^{n}C_{k}2^{k}, \quad C_{k}\in\{0,1\},\ n=\lfloor\log_{2}b\rfloor.

  • Doubling a: Compute successive doublings of a:

    2^{0}\times a,\ 2^{1}\times a,\ \dots,\ 2^{n}\times a.

  • Selection and Summation: Identify which 2^{k}\times a correspond to C_{k}=1 in b's binary representation and sum them:

    ab = \sum_{k=0}^{n}C_{k}(2^{k}\times a).

Including all 2^{k}\times a up to n in \mathcal{P}_{\text{egyptian}} ensures that we have the necessary partial products for any b, allowing us to reconstruct ab by selecting and summing the appropriate terms.

Example:

If b=13, its binary representation is 1101, so b=2^{3}+2^{2}+2^{0}. The partial products are:

2^{0}\times a = a,
2^{1}\times a = 2a,
2^{2}\times a = 4a,
2^{3}\times a = 8a.

Select 2^{0}\times a, 2^{2}\times a, and 2^{3}\times a (since C_{0}=C_{2}=C_{3}=1) and sum:

ab = a + 4a + 8a = 13a.

By formulating the diagnostic sets \mathcal{P} as above for each multiplication algorithm, we encapsulate all intermediary computational steps inherent to each method.