Part of our perception of depth and the relative distance of two objects comes from the changes to how their images overlap as our point of observation moves.
As you move closer to something, the field of view occupied by a foreground object increases more rapidly than that of a distant object. We sense this and automatically perceive relative depth.
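The difference in growth rate can be put into numbers with a small sketch. It assumes a simple pinhole-camera model, in which the apparent size of an object is inversely proportional to its distance; the function name `apparent_scale` and the example distances are illustrative only.

```python
def apparent_scale(distance, advance):
    """Factor by which an object's image grows when the viewer moves
    `advance` units closer to an object that started `distance` units away.

    Assumes a pinhole model: apparent size is proportional to 1 / distance.
    """
    if advance >= distance:
        raise ValueError("viewer would reach or pass the object")
    return distance / (distance - advance)


# The viewer steps 1 unit forward.
near = apparent_scale(2.0, 1.0)    # object that started 2 units away
far = apparent_scale(20.0, 1.0)    # object that started 20 units away

print(f"near object grows by x{near:.3f}")
print(f"far object grows by  x{far:.3f}")
```

Under these assumptions the near object doubles in size while the far object grows by only about five percent, which is the cue the eye reads as depth.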
To achieve the 3D zoom effect, the image of the foreground object is separated onto its own layer, which is zoomed more rapidly than objects further away. Objects are also displaced as a function of their intended distance to match what would occur in real life. The relationships are quite simple, and animators have long utilised the effect.
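As a rough illustration of the layer approach, the sketch below computes a zoom factor and a horizontal displacement for each layer on each frame. It is not a description of any particular tool: it assumes a pinhole model in which both the size of a layer and its offset from the image centre scale by the same distance-dependent factor, and the `Layer` class, the `transform` function and all the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    distance: float  # intended distance from the viewer at frame 0 (arbitrary units)
    offset_x: float  # offset of the layer centre from the image centre, in pixels


def transform(layer, advance):
    """Return (scale, offset_x) for `layer` after the viewer has moved
    `advance` units forward. Pinhole assumption: size and offset from the
    image centre both scale by distance / (distance - advance)."""
    if advance >= layer.distance:
        raise ValueError("viewer would reach or pass the layer")
    scale = layer.distance / (layer.distance - advance)
    return scale, layer.offset_x * scale


foreground = Layer("foreground", distance=2.0, offset_x=40.0)
background = Layer("background", distance=20.0, offset_x=40.0)

# Four frames, with the viewer advancing 0.25 units per frame.
for frame in range(4):
    advance = 0.25 * frame
    for layer in (foreground, background):
        scale, offset = transform(layer, advance)
        print(f"frame {frame} {layer.name:10s} scale={scale:.3f} offset={offset:.1f}")
```

Both layers start at the same offset, but the foreground drifts away from the centre and grows much faster than the background, which produces the parallax.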
In an ordinary photograph we have no information about what lies directly behind the subject, so we cannot pan the view: panning would require image data for areas obscured in the photo.
However, in the zoom effect, part of the background is covered by the foreground image as zoom is increased. This redundant information can be used to create a panning effect by moving the foreground object across the available background in subsequent frames.
This works really well for simultaneous zooming and panning, because the zooming keeps making extra background available as it is needed to continue panning.
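One way to estimate how much panning the zoom allows is sketched below. It is a simplified, one-dimensional model with assumptions that are not from the text above: the foreground is treated as a strip of a given width, the "hole" in the background (the area for which the photo holds no data) is taken to be exactly the foreground's original footprint, and both layers are scaled about the same centre. The function name `max_pan` is hypothetical.

```python
def max_pan(fg_width, fg_scale, bg_scale):
    """Largest sideways shift (in pixels, either direction) of the foreground
    layer that still hides the hole left in the background.

    fg_width: width of the foreground in the original photo, in pixels
    fg_scale: current zoom factor of the foreground layer
    bg_scale: current zoom factor of the background layer
    """
    # The hole scales with the background; the foreground scales faster.
    slack = fg_width * (fg_scale - bg_scale)
    return max(slack, 0.0) / 2.0


# No zoom: the foreground exactly covers the hole, so no pan is possible.
print(max_pan(200, 1.0, 1.0))

# Later frames: the foreground has been zoomed more than the background.
print(max_pan(200, 1.5, 1.125))
print(max_pan(200, 2.0, 1.25))
```

In this model the pan budget is zero for the unzoomed photo and grows as the gap between the two zoom factors widens, so a pan that proceeds together with the zoom stays within the background that is actually available.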