Accurate interpretation of images of complex joints, such as the temporomandibular joint (TMJ), has become essential in a range of clinical practices, from basic assessment of wear and tear (e.g., osteoarthritis) to intricate surgical interventions (e.g., arthroplasty). Today, this examination remains subjective and time-consuming, requiring a comprehensive understanding of the joint's properties. Ultrasound (US) is the main medical imaging modality that, among its many advantages, allows the condition of the joint to be assessed during its physiological movement. There is therefore a demand for an automatic and efficient method for tracking joint landmarks and their movements in ultrasound videos, which promises to ease the tedious routine work of US operators and to provide a more objective measure of joint health. To address the landmark detection problem, we propose a method that combines a 3D U-Net, which extracts spatial patterns as abstract features, with a Long Short-Term Memory (LSTM) module that processes the video frames as a temporal 2D sequence. The method is evaluated on a dataset of 13 TMJ motion recordings captured during opening and closing movements of the lower jaw. The approach proved functional and could be readily integrated into current clinical practice.
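To make the described combination concrete, the following is a minimal sketch of the architecture family named in the abstract, assuming PyTorch. A small 3D convolutional encoder stands in for the full 3D U-Net, and all layer sizes, the number of landmarks, and the per-frame coordinate head are illustrative assumptions, not details from the paper.

```python
# Hedged sketch: 3D convolutional feature extractor + LSTM over frame
# features, approximating the "3D U-Net + LSTM" design described above.
# Layer widths and the landmark head are assumptions for illustration.
import torch
import torch.nn as nn

class Conv3dLSTMLandmarkNet(nn.Module):
    def __init__(self, n_landmarks=4, feat=32, hidden=64):
        super().__init__()
        # 3D convolutions extract spatio-temporal features from the US clip
        self.encoder = nn.Sequential(
            nn.Conv3d(1, feat, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),             # pool space only, keep time
            nn.Conv3d(feat, feat, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 8, 8)),  # fix spatial size, keep T
        )
        # LSTM models the frame sequence (jaw opening/closing dynamics)
        self.lstm = nn.LSTM(feat * 8 * 8, hidden, batch_first=True)
        # Per-frame regression of (x, y) for each landmark
        self.head = nn.Linear(hidden, n_landmarks * 2)

    def forward(self, clip):                     # clip: (B, 1, T, H, W)
        f = self.encoder(clip)                   # (B, C, T, 8, 8)
        B, C, T, H, W = f.shape
        seq = f.permute(0, 2, 1, 3, 4).reshape(B, T, C * H * W)
        out, _ = self.lstm(seq)                  # (B, T, hidden)
        return self.head(out).reshape(B, T, -1, 2)  # (B, T, landmarks, 2)

model = Conv3dLSTMLandmarkNet()
video = torch.randn(2, 1, 16, 64, 64)            # 2 clips of 16 frames each
coords = model(video)
print(coords.shape)                              # torch.Size([2, 16, 4, 2])
```

The key design point carried over from the abstract is the division of labor: the 3D convolutions capture spatial structure within and across neighboring frames, while the recurrent module tracks landmark motion over the whole opening/closing cycle.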