Human motion prediction is key to understanding social environments, with direct applications in robotics, surveillance, and related fields. We present a simple yet effective model for predicting pedestrians' positions in urban-like environments, conditioned on the environment: the map and the surrounding agents. Our model is a neural architecture that can run several layers of attention blocks and transformers in an iterative, sequential fashion, which allows it to capture the environmental features that improve prediction. We show that, without explicitly introducing social masks, dynamical models, social pooling layers, or complicated graph-like structures, it is possible to produce results on par with state-of-the-art (SoTA) models, which makes our approach easily extendable and configurable depending on the data available. We report results similar to those of SoTA models on publicly available and widely used datasets, using the uni-modal prediction metrics Average Displacement Error (ADE) and Final Displacement Error (FDE).
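To make the two evaluation metrics concrete, the sketch below computes them under their common definitions. ADE is the mean Euclidean distance between predicted and ground-truth positions over all predicted time steps. FDE is the Euclidean distance at the last predicted time step. The function names (`ade`, `fde`, `mean_over_agents`) are hypothetical, and averaging per pedestrian first and then over pedestrians is an assumption here; the exact aggregation used in the reported experiments may differ.

```python
import math
from typing import Callable, Sequence, Tuple

# A 2D position (x, y); a trajectory is a sequence of positions over time.
Point = Tuple[float, float]
Trajectory = Sequence[Point]


def _check(pred: Trajectory, gt: Trajectory) -> None:
    # Both trajectories must cover the same (non-empty) prediction horizon.
    if not pred or len(pred) != len(gt):
        raise ValueError("pred and gt must be non-empty and of equal length")


def ade(pred: Trajectory, gt: Trajectory) -> float:
    """Average Displacement Error: mean L2 distance over all predicted steps."""
    _check(pred, gt)
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def fde(pred: Trajectory, gt: Trajectory) -> float:
    """Final Displacement Error: L2 distance at the final predicted step."""
    _check(pred, gt)
    return math.dist(pred[-1], gt[-1])


def mean_over_agents(
    metric: Callable[[Trajectory, Trajectory], float],
    preds: Sequence[Trajectory],
    gts: Sequence[Trajectory],
) -> float:
    """Assumed aggregation: apply a per-pedestrian metric, then average."""
    if not preds or len(preds) != len(gts):
        raise ValueError("need the same non-zero number of predictions and targets")
    return sum(metric(p, g) for p, g in zip(preds, gts)) / len(preds)


if __name__ == "__main__":
    pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    gt = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
    print(ade(pred, gt), fde(pred, gt))  # per-step errors are 0, 1, 2
```

In this toy example the per-step errors are 0, 1 and 2, so ADE is 1.0 and FDE is 2.0. Because the prediction is uni-modal (a single trajectory per pedestrian), no "best-of-K" selection is applied before computing either metric.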