To encode the video, you still have to get each frame's data into an internal format, from which it can be fed to the encoder.
AVFrame is used for representing frames internally, and it can hold any pixel format you want – you just have to allocate it correctly.
And this is why decoding takes CPU time even if your input and output pixel formats are the same: you need to allocate memory for each frame, read its data (even if it is raw), and then pass the frame on to the encoder. An example of how encoding and decoding are handled can be seen here.