
CLIP

1. Overview

This demo shows how to run CLIP (Contrastive Language-Image Pre-Training) image-text matching with AMLNNLite. The CLIP model has two parts, a vision encoder and a text encoder, which together compute the similarity between an image and a set of text descriptions.
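As a rough illustration of the matching step, the sketch below computes the cosine similarity between one image embedding and several text embeddings. It uses plain Python and toy 3-dimensional vectors so that it runs without dependencies. The function names are hypothetical and are not part of AMLNNLite or the demo sources; in the demo, the embeddings come from the two encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarities(image_emb, text_embs):
    """Cosine similarity between one image embedding and each text embedding."""
    img = l2_normalize(image_emb)
    sims = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        sims.append(sum(a * b for a, b in zip(img, txt)))
    return sims

# Toy embeddings standing in for encoder outputs (illustrative values only).
image_emb = [1.0, 0.0, 0.0]
text_embs = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]]
sims = cosine_similarities(image_emb, text_embs)
print(sims)
```

The text whose embedding points in the same direction as the image embedding gets the highest score, regardless of vector length.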

2. Model Download

TO DO

3. Model Conversion

TO DO

4. Demo Run

CPP

1. Compile

Prerequisites:

  • Android NDK (r25e recommended)
  • ANDROID_NDK_PATH environment variable set

Build:

# Build for arm64-v8a
cd examples/clip/cpp
AMLNN_HOME=/path/to/amlnn-toolkit ./build-android.sh -a arm64-v8a

The executable will be generated at build/android_arm64-v8a/clip_demo.

2. Run

# Push executable and resources to device
adb push build/android_arm64-v8a/clip_demo /data/local/tmp/
adb push model/vision_model_int8_S905X5.adla /data/local/tmp/
adb push model/text_model_int8_S905X5.adla /data/local/tmp/
adb push tokenizer_path/ /data/local/tmp/

# Run on device
adb shell
cd /data/local/tmp
chmod +x clip_demo
export LD_LIBRARY_PATH=/vendor/lib64   # or /vendor/lib, depending on the device

# Usage: ./clip_demo <vision_model> <text_model> <tokenizer_path> [--profiling]
./clip_demo vision_model_int8_S905X5.adla text_model_int8_S905X5.adla ./tokenizer_path/

The program prompts for image paths and text descriptions interactively. Enter the path to an image file, then enter comma-separated text descriptions (or type skip to use the defaults). Type exit to quit.

Argument Descriptions:

Argument          Description
vision_model      Path to vision encoder .adla model (required)
text_model        Path to text encoder .adla model (required)
tokenizer_path    Path to directory containing vocab.json and merges.txt (required)
--profiling       Enable performance profiling output (optional)

Note: The tokenizer_path should contain vocab.json and merges.txt files from the CLIP tokenizer (e.g., from openai/clip-vit-base-patch32).
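Before pushing the tokenizer directory to the device, it can help to confirm that both files are present. The following is a minimal sketch of such a check; the helper name missing_tokenizer_files is hypothetical and is not part of the demo.

```python
import tempfile
from pathlib import Path

# Files the note above says the tokenizer directory should contain.
REQUIRED_FILES = ("vocab.json", "merges.txt")

def missing_tokenizer_files(tokenizer_dir):
    """Return the required tokenizer files that are absent from tokenizer_dir."""
    root = Path(tokenizer_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

# Demonstrate with a temporary directory that only has vocab.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "vocab.json").write_text("{}")
    missing = missing_tokenizer_files(tmp)
    print(missing)
```

An empty list means the directory has both files.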

Python

Prerequisites:

  • Python 3.10
  • Required packages: numpy, Pillow, transformers, amlnnlite

Install dependencies:

pip install numpy Pillow transformers amlnnlite-1.0.0-cp310-cp310-linux_aarch64.whl

Run on device:

python clip.py \
    --vision-model ./vision_model_int8_S905X5.adla \
    --text-model ./text_model_int8_S905X5.adla \
    --tokenizer-dir ./tokenizer_path \
    --image-path ./000000004505.jpg \
    --texts "a red handbag" "a blue jacket" "a red bus"

Interactive Mode (Recommended):

If you don't provide --image-path, the program will run in interactive mode:

python clip.py \
    --vision-model ./vision_model_int8_S905X5.adla \
    --text-model ./text_model_int8_S905X5.adla \
    --tokenizer-dir ./tokenizer_path

The program will prompt for image paths and text descriptions. Enter an image path to process, then enter comma-separated texts to compare. Type exit to quit.
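Comma-separated input of this kind can be split into individual descriptions roughly as follows. This is only a sketch of one reasonable approach; parse_texts is a hypothetical helper, and the actual clip.py may handle whitespace and empty entries differently.

```python
def parse_texts(line):
    """Split a comma-separated line into non-empty, whitespace-trimmed descriptions."""
    return [part.strip() for part in line.split(",") if part.strip()]

texts = parse_texts("a red handbag, a blue jacket, a red bus")
print(texts)
```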

Argument Descriptions:

Argument          Description
--vision-model    Path to vision encoder .adla model (required)
--text-model      Path to text encoder .adla model (required)
--tokenizer-dir   Path to CLIPTokenizer directory (required)
--image-path      Path to input image (.jpg, .png); optional, the program prompts if it is not provided
--texts           List of text descriptions to compare (space-separated)
--max-len         Maximum token sequence length, default is 64
--logit-scale     Logit scale factor, default is 100.0

Note: The --tokenizer-dir should point to the directory containing the CLIPTokenizer files. You can use a Hugging Face model ID (e.g., openai/clip-vit-base-patch32) or a local directory.
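For readers who want to wrap or extend the script, the arguments above map naturally onto argparse. The sketch below shows how such a parser might look; it is an assumption based only on the argument table and is not the parser actually used by clip.py.

```python
import argparse

def build_parser():
    """Build a parser mirroring the documented arguments (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="CLIP image-text matching (sketch)")
    parser.add_argument("--vision-model", required=True, help="vision encoder .adla model")
    parser.add_argument("--text-model", required=True, help="text encoder .adla model")
    parser.add_argument("--tokenizer-dir", required=True, help="CLIPTokenizer directory")
    parser.add_argument("--image-path", default=None, help="input image; prompt if omitted")
    parser.add_argument("--texts", nargs="*", default=None, help="text descriptions")
    parser.add_argument("--max-len", type=int, default=64, help="max token sequence length")
    parser.add_argument("--logit-scale", type=float, default=100.0, help="logit scale factor")
    return parser

# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args([
    "--vision-model", "./vision_model_int8_S905X5.adla",
    "--text-model", "./text_model_int8_S905X5.adla",
    "--tokenizer-dir", "./tokenizer_path",
    "--texts", "a red handbag", "a red bus",
])
print(args.image_path is None, args.texts, args.max_len, args.logit_scale)
```

Leaving --image-path unset corresponds to the interactive mode described above.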

5. Results

Performance Feedback

With the --profiling flag (C++) or the log level set to INFO, the program prints performance metrics when it finishes. The console log shows hardware and execution details, including:

  • Hardware Information: System and ADLA library versions.
  • Model Overview: Basic input/output configurations.
  • NPU Metrics: Total inference time (latency) and total DRAM bandwidth consumption.

Interactive Mode Example:

$ ./clip_demo vision_model_int8_S905X5.adla text_model_int8_S905X5.adla ./tokenizer_path

[Info] Models initialized successfully.

============================================================
[Info] Image Path (or 'exit' to quit):
000000004505.jpg
[Info] Enter text descriptions (comma-separated, or 'skip' for defaults):
a red handbag, a blue jacket, a red bus

[Info] Processing image: 000000004505.jpg
[Info] Image embedding size: 512
[Info] Processing 3 text(s)...
[Info] Text embeddings size: 3 x 512

============================================================
CLIP Image-Text Matching Results
============================================================
Image: 000000004505.jpg
logit_scale: 100.000000
------------------------------------------------------------
[1] prob=0.999975  sim=0.327895  text='a red bus'
[2] prob=0.000016  sim=0.217690  text='a red handbag'
[3] prob=0.000008  sim=0.211029  text='a blue jacket'
============================================================

============================================================
[Info] Image Path (or 'exit' to quit):
exit
[Info] Exiting...
Free vision model memory.
Free text model memory.
[Info] Done.
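The prob values in the example are consistent with a softmax over the similarities multiplied by logit_scale. The sketch below recomputes them from the sim values printed above, assuming that is how the demo derives them; the softmax helper is illustrative and is not taken from the demo sources.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logit_scale = 100.0
# Similarities copied from the example output above.
sims = [0.327895, 0.217690, 0.211029]
probs = softmax([logit_scale * s for s in sims])
for p, s in zip(probs, sims):
    print(f"prob={p:.6f}  sim={s:.6f}")
```

Because logit_scale is large, a similarity gap of about 0.11 between the best and second-best text is enough to push almost all of the probability mass onto the top match.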