Revert changes to mod_p521 flow

It is not necessary to save the middle limb upfront, as overwriting it
is the desired result: in the first step we are reducing modulo
2^{512+biL}.

Arguably, the original flow is more intuitive and makes the idea behind
it easier to see.
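
For illustration only, the two-step flow can be sketched on a toy
Mersenne prime. Everything here is an assumption made for the sketch and
not part of the library: p = 2^13 - 1, 8-bit limbs, a width of two limbs,
and the names toy_fold and TOY_P. The constant 2^3 plays the role of
2^(biL - 9), and 2^16 plays the role of 2^(512 + biL).

```c
#include <assert.h>
#include <stdint.h>

/* Toy parameters (hypothetical): p = 2^13 - 1, two 8-bit limbs = 16 bits. */
#define TOY_P ((1u << 13) - 1u)

/* Returns a value congruent to n modulo TOY_P. The result is only
 * partially reduced: it is smaller than 2^13 + 16, not necessarily < p. */
static uint32_t toy_fold(uint32_t n)
{
    assert(n < (1u << 26)); /* at most the product of two 13-bit values */

    /* Step 1: split n as a0 + 2^16 * a1 and compute a0 + 2^3 * a1,
     * since 2^16 = 2^3 * 2^13 is congruent to 2^3 modulo p. The top
     * limb of a0 is overwritten on purpose. */
    uint32_t a0 = n & 0xFFFFu;
    uint32_t a1 = n >> 16;
    uint32_t t = a0 + (a1 << 3);
    uint32_t carry = t >> 16; /* overflow out of the two-limb window */
    t &= 0xFFFFu;

    /* Step 2: fold the carry and the bits above bit 13 back in. */
    uint32_t remainder = carry << 3; /* carry * 2^16 is congruent to carry * 2^3 */
    remainder += t >> 13;            /* bits 13..15 of the shared limb */
    t &= TOY_P;                      /* keep only the low 13 bits */

    return t + remainder;
}
```

For example, toy_fold(65536) yields 8, because 2^16 is congruent to 2^3
modulo 2^13 - 1.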

Signed-off-by: Janos Follath <janos.follath@arm.com>
Signed-off-by: Gabor Mezei <gabor.mezei@arm.com>
diff --git a/library/ecp_curves.c b/library/ecp_curves.c
index 0064265..7d029de 100644
--- a/library/ecp_curves.c
+++ b/library/ecp_curves.c
@@ -5222,12 +5222,6 @@
         return 0;
     }
 
-    /* Save and clear the A1 content of the shared limb to prevent it
-       from overwrite. */
-    mbedtls_mpi_uint remainder[P521_WIDTH] = { 0 };
-    remainder[0] = N_p[P521_WIDTH - 1] >> 9;
-    N_p[P521_WIDTH - 1] &= P521_MASK;
-
     if (N_n > P521_WIDTH) {
         /* Helper references for top part of N */
         mbedtls_mpi_uint *NT_p = N_p + P521_WIDTH;
@@ -5236,14 +5230,17 @@
         /* Split N as A0 + 2^(512 + biL) A1 and compute A0 + 2^(biL - 9) * A1.
          * This can be done in place. */
         mbedtls_mpi_uint shift = ((mbedtls_mpi_uint) 1u) << (biL - 9);
-        carry = mbedtls_mpi_core_mla(N_p, P521_WIDTH - 1, NT_p, NT_n, shift);
+        carry = mbedtls_mpi_core_mla(N_p, P521_WIDTH, NT_p, NT_n, shift);
 
         /* Clear top part */
         memset(NT_p, 0, sizeof(mbedtls_mpi_uint) * NT_n);
     }
 
+    mbedtls_mpi_uint remainder[P521_WIDTH] = { 0 };
+    remainder[0] = carry << (biL - 9);
+    remainder[0] += (N_p[P521_WIDTH - 1] >> 9);
+    N_p[P521_WIDTH - 1] &= P521_MASK;
     (void) mbedtls_mpi_core_add(N_p, N_p, remainder, P521_WIDTH);
-    N_p[P521_WIDTH - 1] += carry;
 
     return 0;
 }