【多模态】28、LLaVA 第一版 | Visual Instruction Tuning 多模态模型的指令微调-四、数据格式

数据可以在 github 的代码中找到：https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/tree/main

pre-train 的数据使用的是 LAION-CC-SBU dataset，对话的格式如下，prompt 主要是让模型简单的描述图片的大概内容，粗糙的内容：

  {
    "id": "004539375",
    "image": "00453/004539375.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "Render a clear and concise summary of the photo.\n<image>"
      },
      {
        "from": "gpt",
        "value": "select luxury furniture 3 - inch gel memory foam mattress topper"
      }
    ]
  },
  {
    "id": "002239345",
    "image": "00223/002239345.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "Write a terse but informative summary of the picture.\n<image>"
      },
      {
        "from": "gpt",
        "value": "a grey watch with an army style strap"
      }
    ]
  },
  {
    "id": "005947502",
    "image": "00594/005947502.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat is this?"
      },
      {
        "from": "gpt",
        "value": "a dragon kite flying in the blue sky stock images"
      }
    ]
  },
  {
    "id": "005116462",
    "image": "00511/005116462.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nRender a clear and concise summary of the photo."
      },
      {
        "from": "gpt",
        "value": "$ 10 - cute cheap printed mini dress - khaki multicolor striped floral print peasant short sleeve tunic"
      }
    ]
  },
  {
    "id": "002017886",
    "image": "00201/002017886.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "What is in the photo?\n<image>"
      },
      {
        "from": "gpt",
        "value": "augmented reality using aruco markers in opencv"
      }
    ]
  },
  {
    "id": "001819539",
    "image": "00181/001819539.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "Describe the image concisely.\n<image>"
      },
      {
        "from": "gpt",
        "value": "a pair of childrens boots in the shape of knitted shoes"
      }
    ]
  },
  {
    "id": "002838027",
    "image": "00283/002838027.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "Share a concise interpretation of the image provided.\n<image>"
      },
      {
        "from": "gpt",
        "value": "floral watercolor clip art rose flower border with pink"
      }
    ]
  },
  {
    "id": "000406392",
    "image": "00040/000406392.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "Give a brief description of the image.\n<image>"
      },
      {
        "from": "gpt",
        "value": "the divine queen in her elaborate masks canvas print featuring the face and hands of a woman with red hair"
      }
    ]
  }

visual instruct tuning 使用的数据如下：https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/tree/main

第二阶段微调时，prompt 基本都是让描述更细致的描述图片的内容，或者带有推理性质的

在这里插入图片描述

detail_23k.json 如下：

  {
    "id": "000000442786",
    "image": "000000442786.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "What do you see happening in this image?\n<image>"
      },
      {
        "from": "gpt",
        "value": "The scene depicts a lively plaza area with several people walking and enjoying their time. A man is standing in the plaza with his legs crossed, holding a kite in his hand. The kite has multiple sections attached to it, spread out in various directions as if ready for flight.\n\nNumerous people are scattered throughout the plaza, walking and interacting with others. Some of these individuals are carrying handbags, and others have backpacks. The image captures the casual, social atmosphere of a bustling plaza on a nice day."
      }
    ]
  },
  {
    "id": "000000539056",
    "image": "000000539056.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "What do you think is going on in this snapshot?\n<image>"
      },
      {
        "from": "gpt",
        "value": "The image displays a tidy home office desk setting with a laptop placed in the center. On the desk, there is a cordless keyboard in front of the laptop and a wireless mouse situated slightly to the right of it. A cell phone can be seen on the left side of the desk, close to the laptop.\n\nNear the top left corner, there is a coke glass adding a personal touch to the workspace. A collection of books is scattered around the scene in various positions and stacks, possibly serving as reference material or leisure reading for the desk's occupant."
      }
    ]
  },
  {
    "id": "000000131697",
    "image": "000000131697.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "Can you elaborate on the elements of the picture provided?\n<image>"
      },
      {
        "from": "gpt",
        "value": "The image depicts a classroom full of children working together on laptops. There are several kids in the room, with some of them sharing a laptop in pairs. The students are focused on their tasks with laptops placed on desks or tables.\n\nAside from the laptops, there are multiple chairs in the room, accommodating the students as they work. Other objects in the classroom include a bottle, a cell phone, a book, and a keyboard. Some children can also be seen using additional electronic devices such as tablets or cell phones. The overall atmosphere indicates a modern, technology-filled learning environment."
      }
    ]
  },
  {
    "id": "000000281721",
    "image": "000000281721.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "Describe the following image.\n<image>"
      },
      {
        "from": "gpt",
        "value": "The image presents a bustling city street with cars driving down the road and pedestrians walking along the sidewalks. Multiple vehicles are passing through a green light at the intersection, traveling in both directions. Some cars are closer to the camera, while others can be seen further down the street.\n\nThere are a couple of pedestrians visible near the left side of the image, possibly navigating their way through the city. Various traffic lights are positioned throughout the scene, ensuring a safe flow of traffic for both vehicles and pedestrians alike. The atmosphere suggests a busy and lively city environment."
      }
    ]
  },
  {
    "id": "000000314298",
    "image": "000000314298.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "What do you think is going on in this snapshot?\n<image>"
      },
      {
        "from": "gpt",
        "value": "The image shows a woman with glasses sitting at a dining table, biting into a delicious donut with a smile on her face. Another man is sitting next to her, celebrating or enjoying the moment. They are surrounded by several chairs and cups placed on the table.\n\nIn the background, there's another dining table with a few more cups on it as well as another person who appears to be engaged in a separate activity. The scene is lively, capturing a pleasant social gathering with people enjoying food and each other's company."
      }
    ]
  },
  {
    "id": "000000543263",
    "image": "000000543263.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "What's happening in the scene?\n<image>"
      },
      {
        "from": "gpt",
        "value": "The image captures a scene with several people in the vicinity of uniquely designed benches. One of the benches, which is the focal point of the image, is shaped like a butterfly. A woman in a white shirt is sitting on this butterfly-shaped bench with a handbag placed beside her. \n\nSeveral other people can be spotted around the area, including a lady with glasses and a backpack walking behind the woman sitting on the bench. One more bench is visible in the distance, and there seem to be a few people sitting or standing by the other benches as well. Overall, it appears to be a casual outdoor setting where people are taking a break, engaging in conversations or passing by."
      }
    ]
  },

llava_instruct_80k.json 如下：

  {
    "id": "000000379143",
    "image": "000000379143.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "What is the man doing in the back of the pickup truck?\n<image>"
      },
      {
        "from": "gpt",
        "value": "The man is sitting on top of piled objects or belongings loaded into the back of a pickup truck."
      },
      {
        "from": "human",
        "value": "Is the man holding anything in his hands?"
      },
      {
        "from": "gpt",
        "value": "Yes, the man is holding a beer in his hand while sitting on top of the objects in the back of the pickup truck."
      },
      {
        "from": "human",
        "value": "What color is the pickup truck?"
      },
      {
        "from": "gpt",
        "value": "The pickup truck is white."
      },
      {
        "from": "human",
        "value": "Is the man sitting or standing?"
      },
      {
        "from": "gpt",
        "value": "The man is sitting on top of the piled objects in the back of the pickup truck."
      },
      {
        "from": "human",
        "value": "What could be the possible reasons for the man sitting on top of the possessions in the back of the pickup truck?"
      },
      {
        "from": "gpt",
        "value": "There could be several reasons for the man sitting on top of his possessions in the back of the pickup truck:\n\n1. Moving: The man might be moving to a new location and needed to transport his items in a pickup truck, utilizing available space efficiently. By sitting on top of the belongings, he could be helping to stabilize and secure the items during the move, preventing them from falling or shifting during transportation.\n\n2. Lack of seating: If the cab of the pickup truck is already at full capacity or there isn't enough space for him to sit inside, the man may have chosen to sit on his possessions as an alternative seating arrangement.\n\n3. Road trip or outing: The man might be on a road trip or a casual outing with friends or family, where he is using the back of the pickup truck as an open-air seating area. By sitting on top of the loaded items, he may be enjoying the journey while savoring his beer.\n\n4. Keeping an eye on belongings: The man could be safeguarding his possessions by staying close to them, ensuring that no items are lost, stolen or damaged during the journey.\n\nRegardless of the specific reason, the image shows a person making the most of their situation, adding a touch of lightheartedness or adventure to an otherwise mundane scene."
      }
    ]
  },
  {
    "id": "000000083996",
    "image": "000000083996.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nHow many people are seen surfing in the image?"
      },
      {
        "from": "gpt",
        "value": "The image shows several people or a group of people surfing the waves in the ocean."
      },
      {
        "from": "human",
        "value": "What is the weather like in the image?"
      },
      {
        "from": "gpt",
        "value": "The weather in the image is sunny."
      },
      {
        "from": "human",
        "value": "What are the people doing in the water?"
      },
      {
        "from": "gpt",
        "value": "The people in the water are surfing the waves on surfboards."
      },
      {
        "from": "human",
        "value": "Can you explain the basics of surfing and the necessary equipment for beginners?"
      },
      {
        "from": "gpt",
        "value": "Surfing is a popular water sport that involves riding waves on a surfboard while standing or lying in various positions. The main goal is to balance on the surfboard while being propelled forward by the wave's momentum. The sport originated from Polynesian culture and has evolved into a global phenomenon, attracting participants of all skill levels and backgrounds.\n\nFor beginners, there are a few essentials to get started with surfing:\n\n1. Surfboard: A beginner's surfboard should be wide, thick, and long to provide better stability and buoyancy. This makes it easier for beginners to learn balance and catch waves. Soft-top boards or foam boards are recommended for beginners, as they are more forgiving and less likely to cause injury during a wipeout.\n\n2. Wetsuit: A wetsuit is essential for most surfers, depending on the water temperature. Wetsuits are made of neoprene, providing insulation and flexibility to allow surfers to stay warm and move comfortably in the water. Thickness and type of wetsuit will vary based on the water temperature and local climate.\n\n3. Leash: A leash is a cord that attaches the surfboard to the surfer's ankle, preventing the board from drifting away after a wipeout. This is an essential safety tool and ensures that the surfboard stays within reach at all times.\n\n4. Surf Wax or Traction Pad: Surf wax or a traction pad is applied to the surfboard deck to provide grip and prevent surfers from slipping off their boards. Wax is applied in a circular pattern, while traction pads are adhesive pads placed on the deck.\n\n5. Sun Protection: Surfers should wear sunscreen, sunglasses, and, in some cases, a hat or rash guard to protect their skin from the sun's harmful rays.\n\nBefore hitting the waves, beginners should take a few lessons with an experienced instructor to learn essential techniques, safety tips, and surfing etiquette. With the right equipment and proper training, anyone can experience the thrill of catching and riding waves in the ocean."
      }
    ]
  },
  {
    "id": "000000091544",
    "image": "000000091544.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nHow many people are in the image?"
      },
      {
        "from": "gpt",
        "value": "There are two people in the image."
      },
      {
        "from": "human",
        "value": "What type of clothing are the people wearing in the image?"
      },
      {
        "from": "gpt",
        "value": "The people in the image are wearing jackets or coats, possibly leather or black in color."
      },
      {
        "from": "human",
        "value": "What activity are the two people engaged in?"
      },
      {
        "from": "gpt",
        "value": "The two people are walking next to each other on a street, likely during the day."
      },
      {
        "from": "human",
        "value": "Is there a traffic light visible in the image?"
      },
      {
        "from": "gpt",
        "value": "Yes, there is a yellow traffic light visible in the image."
      },
      {
        "from": "human",
        "value": "Can you tell if the two people are walking close together or maintaining some distance?"
      },
      {
        "from": "gpt",
        "value": "The image shows the two people walking close together, as seen in the close-up view of them."
      }
    ]
  },
  {
    "id": "000000148540",
    "image": "000000148540.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nAre people getting off a train or a bus in the image?"
      },
      {
        "from": "gpt",
        "value": "People are getting off a train in the image."
      },
      {
        "from": "human",
        "value": "Can you describe the colors of the train?"
      },
      {
        "from": "gpt",
        "value": "The train is blue and pink in color."
      },
      {
        "from": "human",
        "value": "What is happening at the platform when the train stops?"
      },
      {
        "from": "gpt",
        "value": "When the train stops at the platform, the door opens, and people start to exit the train. Among them, there is a couple carrying their luggage."
      },
      {
        "from": "human",
        "value": "Is there any luggage seen on the platform?"
      },
      {
        "from": "gpt",
        "value": "Yes, there is a suitcase on the ground near a pole on the platform."
      },
      {
        "from": "human",
        "value": "How many suitcases does the couple carry?"
      },
      {
        "from": "gpt",
        "value": "Unfortunately, the image does not explicitly show the exact number of suitcases the couple is carrying."
      }
    ]
  },
  {
    "id": "000000257752",
    "image": "000000257752.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nHow is the teddy bear positioned in the image?"
      },
      {
        "from": "gpt",
        "value": "The teddy bear is hanging by a string or thread in the image."
      },
      {
        "from": "human",
        "value": "What is happening to the teddy bear in terms of the surrounding elements?"
      },
      {
        "from": "gpt",
        "value": "Water droplets are falling on the teddy bear as it's hanging by the string."
      },
      {
        "from": "human",
        "value": "What is the background like in the image?"
      },
      {
        "from": "gpt",
        "value": "The background of the image features a dirt wall, which could indicate an outdoor or cave-like setting."
      },
      {
        "from": "human",
        "value": "Could you describe a possible scenario or story behind this image?"
      },
      {
        "from": "gpt",
        "value": "The image of the teddy bear hanging by a string and having water droplets fall on it in front of a dirt wall creates an intriguing and somewhat mysterious scene. One possible interpretation of this image could be an art installation or a creative photography concept, focusing on the juxtaposition of a soft, innocent teddy bear with the harsh, natural elements of water and earth.\n\nAnother possible scenario could be that the teddy bear represents an abandoned or lost toy, left to the elements and the passage of time. The hanging position of the teddy bear and the falling water droplets might symbolize feelings of loneliness, vulnerability, or helplessness. \n\nIn either case, the image's composition and subject evoke a sense of curiosity and may lead the viewer to wonder about the intentions, emotions, or stories behind the scene. The image thus becomes a visual metaphor or prompt for creating and exploring narratives within the context of the photograph."
      }
    ]
  },
  {
    "id": "000000426016",
    "image": "000000426016.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "How many orange parking meters are there in the image?\n<image>"
      },
      {
        "from": "gpt",
        "value": "There are two orange parking meters in the image."
      },
      {
        "from": "human",
        "value": "What is the color of the parking meters?"
      },
      {
        "from": "gpt",
        "value": "The parking meters are orange."
      },
      {
        "from": "human",
        "value": "Where are the parking meters in relation to trees and bushes?"
      },
      {
        "from": "gpt",
        "value": "The parking meters are positioned next to a tree and near some bushes."
      },
      {
        "from": "human",
        "value": "Is there any caution tape on the parking meter?"
      },
      {
        "from": "gpt",
        "value": "Yes, there is caution tape wrapped around the parking meter, possibly indicating that it is out of order or needs attention."
      }
    ]
  },
  {
    "id": "000000242544",
    "image": "000000242544.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "What color is the horse in the image?\n<image>"
      },
      {
        "from": "gpt",
        "value": "The horse in the image is brown."
      },
      {
        "from": "human",
        "value": "Is the horse stationary or in motion?"
      },
      {
        "from": "gpt",
        "value": "The horse is in motion, walking through the field."
      },
      {
        "from": "human",
        "value": "What is the woman doing in the image?"
      },
      {
        "from": "gpt",
        "value": "The woman is standing in front of the brown horse and looking at it."
      },
      {
        "from": "human",
        "value": "What is the setting of the image?"
      },
      {
        "from": "gpt",
        "value": "The setting of the image is a field, where the brown horse is walking, and the woman is standing in front of it observing the horse."
      }
    ]
  },

在处理视觉指令调整（visual instruct tuning）的标注数据时，使用多轮对话可以更好地模拟真实的人机交互场景。在这种情况下，多轮对话有以下好处：

上下文理解：通过连续的对话，系统可以更好地理解和跟踪上下文信息，从而提供更加准确和连贯的回答。
逐步澄清：用户可能会逐步提出问题，以澄清或获取更多的细节信息。多轮对话允许系统累积这些信息并相应地调整其回答。
动态反馈：用户可以对系统的回答进行反馈，系统可以根据反馈进行调整和改进。
灵活性和适应性：多轮对话展示了系统处理不同类型问题和用户需求的能力，包括对不明确或开放式问题的响应。
用户体验：模拟真实对话有助于提高用户体验，使得交互更加自然和人性化。

在上述标注数据中，多轮对话的使用确保了对话流畅且信息丰富，同时也帮助标注者理解如何在实际情况下设计和优化智能助手的回答。通过这种方式，可以确保智能助手在处理视觉信息和相关问题时能够提供有用的反馈和解释。

秒客网

【多模态】28、LLaVA 第一版 | Visual Instruction Tuning 多模态模型的指令微调-四、数据格式

相关文章