• luciferofastora@feddit.org
    2 days ago

    According to CometAPI:

    Text prompts are first tokenized into word embeddings, while image inputs—if provided—are converted into patch embeddings […] These embeddings are then concatenated and processed through shared self‑attention layers.

    I haven’t found any other sources to back that up, because most platforms seem more concerned with how to access it than how it works under the hood.
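
For illustration, here is a toy sketch of what the quoted description amounts to: text tokens and image patches are each mapped to vectors of the same width, joined into one sequence, and passed through a single self-attention step. This is not the actual model's code. The vocabulary, dimensions, random weights, and every function name are made up for the example, and a real model would add positional information, multiple heads, and many stacked layers.

```python
import math
import random

random.seed(0)
D = 8      # embedding width (arbitrary, chosen for illustration)
PATCH = 2  # patch side length in pixels (also arbitrary)

def rand_matrix(rows, cols):
    # Stand-in for learned weights: small random values.
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    # Plain matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Text side: each word is looked up in an embedding table (hypothetical vocabulary).
VOCAB = {"a": 0, "red": 1, "cat": 2}
text_table = rand_matrix(len(VOCAB), D)

def embed_text(prompt):
    return [text_table[VOCAB[word]] for word in prompt.split()]

# Image side: cut into non-overlapping patches, flatten each, project to width D.
patch_proj = rand_matrix(PATCH * PATCH, D)

def embed_image(image):
    patches = []
    for r in range(0, len(image), PATCH):
        for c in range(0, len(image[0]), PATCH):
            patches.append([image[r + i][c + j]
                            for i in range(PATCH) for j in range(PATCH)])
    return matmul(patches, patch_proj)

# Shared self-attention: one set of weights applied to the joint sequence,
# so text positions can attend to image patches and vice versa.
Wq, Wk, Wv = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)

def self_attention(seq):
    q, k, v = matmul(seq, Wq), matmul(seq, Wk), matmul(seq, Wv)
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(D) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

image = [[0.1 * (r + c) for c in range(4)] for r in range(4)]  # 4x4 grayscale toy image
tokens = embed_text("a red cat") + embed_image(image)          # concatenation step
out, weights = self_attention(tokens)
print(len(tokens), "positions in,", len(out), "positions out, width", len(out[0]))
```

The point of the sketch is only that, once both inputs are vectors of the same width, the attention layer no longer distinguishes text from image: every position gets a weight over every other position in the combined sequence.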