Anybody have eyedrops?

I’ve recently started trying my hand at captioning, and it’s brought a new understanding of why certain captions are the way they are.
Captioners follow conventions meant to keep captions consistent from video to video and to make their work easier. They take into account the vagaries of human speech, the limited amount of space available, and the viewer’s potential reading speed, among other things.

Foreign films aren’t unique, of course, in their use of captions: that little stream of words can appear on anything from a commercial to a YouTube vlog. The show may have one speaker or many. The speakers may have mechanical difficulties with their speech (no, I’m not thinking of Colin Firth… any more than usual) or have English as a second language. There may be music or special effects to take into account.

And if, reader, you’ve ever had a chance to use captions, whether you need them for daily life or just want to watch a movie in class without the professor noticing, you’ll know that the space allotted to them isn’t very large. In order to get the meat of the message into the captions, something’s gotta go.

The projects I’ve done so far limit me to two lines of text, one above the other, with a certain allotment of characters and spaces. It’s a reasonable amount of room for a single thread of dialogue, but when dealing with multiple speakers, that allotment is halved for each line. Elsewhere, I’ve also seen captions that put one speaker’s words on, say, the bottom left side of the screen and the other’s on the bottom right, perhaps to correspond with each speaker’s position in the scene. (Different software, different capabilities, it seems.) However, as far as I’ve seen, those instances have always been for very brief sentences, or broken-off ones.

These instances of multiple speakers talking across each other, referred to as cross-talk, can only be shown together on screen if they’re brief. Not only is there a lack of room, there’s also a lack of time to use that room.

Take the example of two people in a heated argument about whether or not a dress really is a certain color. (Eyeroll, but that’s beside the point.) One person argues, “It’s blue.” The other shoots back, “It’s black.” Heightened emotions are likely to mean a much faster rate of speech, so both statements might be said in far less than a second. If you, as a viewer, had to base your understanding solely on the captions for that portion of the argument, would you be able to tell which speaker holds which view if the words only blinked on screen for such a brief moment? You might, but showing these instances of cross-talk side by side, or one on top of the other, lets the captioner extend the amount of time those blips of text stay on screen before the next bit of dialogue, and its corresponding caption, comes up.

And no, professor, I absolutely didn’t watch that movie while I was in your class, because you’re just that riveting. Minimized window? What minimized window?
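P.S. For anyone curious what that cross-talk fix looks like inside the caption file itself, here’s a rough sketch in SRT format. (The cue numbers and timings are invented for illustration; the leading dashes are a common convention for marking different speakers.) First, the two lines as separate cues, each gone in under half a second:

    42
    00:01:12,400 --> 00:01:12,850
    It's blue.

    43
    00:01:12,850 --> 00:01:13,300
    It's black.

And here they are merged into a single cue, which can linger on screen a beat longer and give the viewer a fighting chance to read both:

    42
    00:01:12,400 --> 00:01:13,900
    - It's blue.
    - It's black.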