If you had an extremely low-latency light field display and camera that let the eye do the focusing, then you could skip the light-blocking part (not that that makes the optics challenge any easier).
For anything close to the resolution the tech demos have people expecting, a true light field display, if we could even make one that fit in a svelte headset, would require hundreds of gigabytes per second of bandwidth. There is obviously a lot of redundancy in that signal (many "overlapping" views of the same scene), so there is plenty of opportunity for compression, but then you just spend more of the incredibly constrained battery budget on the CPU/GPU.
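To see where "hundreds of gigabytes per second" comes from, here is a rough back-of-envelope sketch. All the numbers (8x8 angular views per eye, 4K per view, 90 Hz, 24-bit color) are illustrative assumptions, not specs from any actual device:

```python
# Back-of-envelope uncompressed light field bandwidth.
# Every constant below is an assumed, illustrative value.
ANGULAR_VIEWS = 8 * 8          # angular samples (views) per eye
WIDTH, HEIGHT = 3840, 2160     # pixels per view ("4K")
REFRESH_HZ = 90
BYTES_PER_PIXEL = 3            # 24-bit RGB, uncompressed

# bytes/sec for one view, then scale up to the full array
bytes_per_view = WIDTH * HEIGHT * BYTES_PER_PIXEL * REFRESH_HZ
per_eye = bytes_per_view * ANGULAR_VIEWS
both_eyes = per_eye * 2

print(f"per eye:   {per_eye / 1e9:.0f} GB/s")    # -> per eye:   143 GB/s
print(f"both eyes: {both_eyes / 1e9:.0f} GB/s")  # -> both eyes: 287 GB/s
```

Even this modest angular resolution lands in the hundreds of GB/s, and a denser view grid scales it up quadratically.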
Add SLAM, CPU/GPU, wireless radios... and do all of that all day on a <5Wh battery (the limit for any form factor close to a regular pair of glasses).
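The battery arithmetic is what makes this brutal. Assuming (illustratively) a 5Wh cell and a 12-hour day of use:

```python
# Average power budget for all-day glasses.
# Both constants are assumptions chosen for scale, not measurements.
BATTERY_WH = 5.0       # total battery capacity, watt-hours
HOURS_OF_USE = 12.0    # "all day"

avg_power_w = BATTERY_WH / HOURS_OF_USE
print(f"average power budget: {avg_power_w * 1000:.0f} mW")
# -> average power budget: 417 mW
```

That ~400mW has to cover the display, cameras, SLAM, compute, and radios combined; a single phone SoC under load can burn an order of magnitude more.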