a task to generate an unambiguous text description that applies to exactly one designated object or region in the image. A good expression should be distinctive enough that the listener can identify the unique target among the other objects in the same image.
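The uniqueness requirement can be illustrated with a toy sketch. It assumes, purely for illustration, that each object in an image is summarized by a set of attribute words and that an expression "applies to" an object when all of its words appear in that set; the names `matches`, `is_unambiguous`, and the example scene are hypothetical and not part of any dataset or method described here.

```python
from typing import Dict, List, Set

# Hypothetical toy scene: object id -> set of attribute words (an assumption,
# standing in for whatever visual grounding a real listener would perform).
Scene = Dict[str, Set[str]]


def matches(expression: Set[str], scene: Scene) -> List[str]:
    """Return the ids of all objects whose attributes contain every word of the expression."""
    return [obj_id for obj_id, attrs in scene.items() if expression <= attrs]


def is_unambiguous(expression: Set[str], target: str, scene: Scene) -> bool:
    """True only if the expression applies to the target and to no other object."""
    return matches(expression, scene) == [target]


scene: Scene = {
    "obj1": {"dog", "brown", "left"},
    "obj2": {"dog", "white", "right"},
    "obj3": {"cat", "white", "left"},
}

# "dog" alone applies to two objects, so it does not single out obj1.
ambiguous = is_unambiguous({"dog"}, "obj1", scene)
# Adding "brown" leaves exactly one matching object.
unique = is_unambiguous({"dog", "brown"}, "obj1", scene)
```

Under these assumptions, `{"dog"}` is rejected as a description of `obj1` because it also applies to `obj2`, while `{"dog", "brown"}` is accepted because it applies to exactly one object in the scene.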