Streaming Transformer

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

$$\mathrm{MHA}(\hat{Q}, \hat{K}, \hat{V}) = \mathrm{Concat}(\mathrm{Head}_1, \dots, \mathrm{Head}_{d_h})\, W^{H}$$

and

$$\mathrm{Head}_i = \mathrm{Attention}(\hat{Q} W_i^{Q}, \hat{K} W_i^{K}, \hat{V} W_i^{V})$$
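As a rough illustration of the attention and multi-head attention definitions above, the following NumPy sketch computes both for a single unbatched sequence. The function names, the representation of the per-head projections as Python lists, and the toy dimensions (`d_model`, `d_h`, `d_k`, `n`) are assumptions made for this example only. Masking, batching, and any streaming-specific restrictions on the attention context are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_q, n_k)
    return softmax(scores, axis=-1) @ V   # (n_q, d_v)

def mha(Q_hat, K_hat, V_hat, W_Q, W_K, W_V, W_H):
    # W_Q, W_K, W_V: lists holding one projection matrix per head (d_h entries).
    # W_H: output projection applied to the concatenated heads.
    heads = [
        attention(Q_hat @ wq, K_hat @ wk, V_hat @ wv)
        for wq, wk, wv in zip(W_Q, W_K, W_V)
    ]
    return np.concatenate(heads, axis=-1) @ W_H

# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_h, d_k, n = 8, 2, 4, 5
X = rng.standard_normal((n, d_model))
W_Q = [rng.standard_normal((d_model, d_k)) for _ in range(d_h)]
W_K = [rng.standard_normal((d_model, d_k)) for _ in range(d_h)]
W_V = [rng.standard_normal((d_model, d_k)) for _ in range(d_h)]
W_H = rng.standard_normal((d_h * d_k, d_model))

# Self-attention: queries, keys, and values all come from the same input.
out = mha(X, X, X, W_Q, W_K, W_V, W_H)
print(out.shape)  # (5, 8)
```

Passing the same matrix for all three arguments gives self-attention. Passing a different matrix for the queries than for the keys and values gives attention across two sequences, using the same functions.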

$$X_0 = \mathrm{EncCNN}(X), \quad X_E = \mathrm{EncSA}(X_0)$$

