Purpose Of Pre-Padding While Implementing LSTM
All you need to know about why pre-padding is required
While learning the architecture of an LSTM, we all focus on the Forget gate, Input gate, and Output gate.
But when it comes to the implementation part, before applying the Embedding layer we apply pre-padding to the input data (sequences of words).
What we think:
Input data ====> Embedding layer ======>LSTM layers =====> Softmax/Sigmoid =====> Output
Reality:
Input data =====> Pre-Padding on Input data ====> Embedding layer ======>LSTM layers =====> Softmax/Sigmoid =====> Output
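To see what the Embedding layer actually does with the padded input, here is a minimal NumPy sketch. The sizes (3 reviews, length 7, a 10-word vocabulary, 4-dimensional embeddings) are hypothetical numbers chosen for illustration, and the random matrix stands in for a trained Embedding layer:

```python
import numpy as np

# Hypothetical sizes for illustration: 3 reviews, padded to length 7,
# a 10-word vocabulary, and 4-dimensional word embeddings.
batch_size, max_len, vocab_size, embed_dim = 3, 7, 10, 4

# Padded input: one row of word indices per review (0 = padding token).
padded_input = np.zeros((batch_size, max_len), dtype=int)

# A random matrix standing in for a trained Embedding layer's weights.
embedding_matrix = np.random.rand(vocab_size, embed_dim)

# The Embedding layer is just a lookup: each word index becomes a vector.
embedded = embedding_matrix[padded_input]
print(embedded.shape)  # (3, 7, 4) -- the (batch, timesteps, features) shape the LSTM consumes
```

The key point is the output shape: the LSTM layer expects a rectangular (batch, timesteps, features) tensor, which is only possible if every review has the same number of timesteps.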
Why pre-padding?
Example: suppose we have 3 reviews (input data).
Review1: This(w1) movie(w2) is(w3) very(w4) interesting(w5)
w1 = word 1, w2 = word 2, w3 = word 3, w4 = word 4, w5 = word 5
Review1 has 5 words.
Review2: I(w1) do(w2) not(w3) like(w4) this(w5) movie(w6)
Similarly, Review2 has 6 words.
Review3: I(w1) hate(w2) this(w3) type(w4) of(w5) genre(w6) movies(w7)
And Review3 has 7 words.
So the reviews have different lengths.
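The three example reviews above can be written down directly as tokenized lists, and a quick check confirms the mismatched lengths:

```python
# The three example reviews, tokenized into words.
review1 = ["This", "movie", "is", "very", "interesting"]
review2 = ["I", "do", "not", "like", "this", "movie"]
review3 = ["I", "hate", "this", "type", "of", "genre", "movies"]

# Each review has a different number of words.
print([len(r) for r in (review1, review2, review3)])  # [5, 6, 7]
```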
Because of the different lengths, we would first pass the Review1 words through the LSTM layer, and only after Review1 is finished pass the Review2 words through it. This process is nothing but SGD (stochastic gradient descent) with batch size = 1.
In a real-world problem we have millions of reviews (input data); if we follow SGD with batch size = 1, training would take days. That's why we use SGD with batch size = k (k can be any number).
For example, if k = 32, that means we pass 32 reviews through the network at the same time.
With SGD at batch size = k we can solve the problem in a reasonable amount of time, but to perform mini-batch SGD all input reviews need to be the same length:
len(Review1)=len(Review2)=len(Review3)
where len = length.
But we have len(Review1)=5, len(Review2)=6, len(Review3)=7
So, to make the reviews the same length, we apply pre-padding with zeros to the input data:
Review1 = 0(w1) 0(w2) This(w3) movie(w4) is(w5) very(w6) interesting(w7)
Review2 = 0(w1) I(w2) do(w3) not(w4) like(w5) this(w6) movie(w7)
Review3 = I(w1) hate(w2) this(w3) type(w4) of(w5) genre(w6) movies(w7)
Now all the reviews have the same length:
len(Review1) = len(Review2) = len(Review3) = 7.
Now we can apply SGD with batch size = k and solve the problem in a reasonable amount of time.
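In Keras this step is usually done with `pad_sequences(..., padding='pre')`, which left-pads integer sequences with zeros. As a minimal sketch of what that utility does (the function name `pre_pad` and the integer word indices below are my own, chosen for illustration):

```python
def pre_pad(sequences, max_len=None, pad_value=0):
    """Left-pad every sequence with pad_value so all have equal length."""
    if max_len is None:
        # Default to the length of the longest sequence, as in the example.
        max_len = max(len(s) for s in sequences)
    return [[pad_value] * (max_len - len(s)) + list(s) for s in sequences]

# Hypothetical integer-encoded reviews with lengths 5, 6, and 7.
reviews = [
    [3, 7, 2, 9, 5],
    [1, 4, 6, 8, 3, 7],
    [1, 10, 3, 11, 12, 13, 14],
]

padded = pre_pad(reviews)
print(padded[0])  # [0, 0, 3, 7, 2, 9, 5] -- two zeros prepended to the shortest review
```

After padding, every row has length 7, so the batch can be stacked into one rectangular array and fed to the Embedding layer.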
Conclusion:
To apply mini-batch SGD we need all reviews (sequences of words) to be the same length, and to get the same length we apply pre-padding (zero padding) to the input data.