Purpose Of Pre-Padding While Implementing LSTM
All you need to know about why pre-padding is required
While learning the architecture of an LSTM, we all focus on the Forget gate, Input gate, and Output gate.
But when it comes to the implementation part, before applying the Embedding layer we apply pre-padding to the input data (sequences of words).
What we think:
Input data ====> Embedding layer ======>LSTM layers =====> Softmax/Sigmoid =====> Output
Reality:
Input data =====> Pre-Padding on Input data ====> Embedding layer ======>LSTM layers =====> Softmax/Sigmoid =====> Output
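To see what the Embedding layer actually does with the padded input, here is a minimal NumPy sketch. The sizes (3 reviews, length 7, a 10-word vocabulary, 4-dimensional embeddings) are hypothetical numbers chosen for illustration, and the random matrix stands in for a trained Embedding layer:

```python
import numpy as np

# Hypothetical sizes for illustration: 3 reviews, padded to length 7,
# a 10-word vocabulary, and 4-dimensional word embeddings.
batch_size, max_len, vocab_size, embed_dim = 3, 7, 10, 4

# Padded input: one row of word indices per review (0 = padding token).
padded_input = np.zeros((batch_size, max_len), dtype=int)

# A random matrix standing in for a trained Embedding layer's weights.
embedding_matrix = np.random.rand(vocab_size, embed_dim)

# The Embedding layer is just a lookup: each word index becomes a vector.
embedded = embedding_matrix[padded_input]
print(embedded.shape)  # (3, 7, 4) -- the (batch, timesteps, features) shape the LSTM consumes
```

The key point is the output shape: the LSTM layer expects a rectangular (batch, timesteps, features) tensor, which is only possible if every review has the same number of timesteps.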
Why pre-padding?
Example: suppose we have 3 reviews (input data).
Review1: This(w1) movie(w2) is(w3) very(w4) interesting(w5)
w1 = word 1, w2 = word 2, w3 = word 3, w4 = word 4, w5 = word 5
Review1 has 5 words.
Review2: I(w1) do(w2) not(w3) like(w4) this(w5) movie(w6)
Similarly, Review2 has 6 words.
Review3: I(w1) hate(w2) this(w3) type(w4) of(w5) genre(w6) movies(w7)
And Review3 has 7 words.
So the reviews have different lengths.
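The three example reviews above can be written down directly as tokenized lists, and a quick check confirms the mismatched lengths:

```python
# The three example reviews, tokenized into words.
review1 = ["This", "movie", "is", "very", "interesting"]
review2 = ["I", "do", "not", "like", "this", "movie"]
review3 = ["I", "hate", "this", "type", "of", "genre", "movies"]

# Each review has a different number of words.
print([len(r) for r in (review1, review2, review3)])  # [5, 6, 7]
```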
Because of the different lengths, we would first pass the Review1 words through the LSTM layer, and only after Review1 is finished pass the Review2 words through it. This process is nothing but SGD (stochastic gradient descent) with batch size = 1.
In a real-world problem we have millions of reviews (input data); if we follow SGD with batch size = 1, training would take days. That's why we use SGD with batch size = k (k can be any number).
For example, if k = 32, that means we pass 32 reviews through the network at the same time.
With SGD at batch size = k we can solve the problem in a reasonable amount of time, but to perform mini-batch SGD all input reviews need to be the same length:
len(Review1)=len(Review2)=len(Review3)
where len = length.
But we have len(Review1)=5, len(Review2)=6, len(Review3)=7
So, to make the reviews the same length, we apply pre-padding with zeros to the input data:
Review1 = 0(w1) 0(w2) This(w3) movie(w4) is(w5) very(w6) interesting(w7)
Review2 = 0(w1) I(w2) do(w3) not(w4) like(w5) this(w6) movie(w7)
Review3 = I(w1) hate(w2) this(w3) type(w4) of(w5) genre(w6) movies(w7)
Now all the reviews have the same length:
len(Review1) = len(Review2) = len(Review3) = 7.
Now we can apply SGD with batch size = k and solve the problem in a reasonable amount of time.
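In Keras this step is usually done with `pad_sequences(..., padding='pre')`, which left-pads integer sequences with zeros. As a minimal sketch of what that utility does (the function name `pre_pad` and the integer word indices below are my own, chosen for illustration):

```python
def pre_pad(sequences, max_len=None, pad_value=0):
    """Left-pad every sequence with pad_value so all have equal length."""
    if max_len is None:
        # Default to the length of the longest sequence, as in the example.
        max_len = max(len(s) for s in sequences)
    return [[pad_value] * (max_len - len(s)) + list(s) for s in sequences]

# Hypothetical integer-encoded reviews with lengths 5, 6, and 7.
reviews = [
    [3, 7, 2, 9, 5],
    [1, 4, 6, 8, 3, 7],
    [1, 10, 3, 11, 12, 13, 14],
]

padded = pre_pad(reviews)
print(padded[0])  # [0, 0, 3, 7, 2, 9, 5] -- two zeros prepended to the shortest review
```

After padding, every row has length 7, so the batch can be stacked into one rectangular array and fed to the Embedding layer.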
Conclusion:
To apply mini-batch SGD we need all reviews (sequences of words) to be the same length, and to get the same length we apply pre-padding (zero padding) to the input data.