The present study sought to determine the format in which visual, auditory and auditory-visual (bimodal) durations ranging from 400 to 600 ms are encoded and maintained in short-term memory, using suppression conditions. Participants compared two stimulus durations separated by a retention interval of 8 s. During this interval, they performed either an articulatory suppression task, a visuospatial tracking task or no specific task at all (control condition). The results showed that the articulatory suppression task decreased recognition performance for auditory durations but not for visual or bimodal ones, whereas the visuospatial tracking task decreased recognition performance for visual durations but not for auditory or bimodal ones. These findings support the modality-specific account of short-term memory for durations.