feat:update demo code of CLIP

dian.yuan 2026-02-12 11:19:52 +08:00
parent 4bf4aafc73
commit 5478a8618b
12 changed files with 50385 additions and 694 deletions

BIN
examples/clip/000000004505.jpg Executable file

Binary file not shown. Size: 210 KiB

@ -1,95 +1,159 @@
-## Demo Run
-
-### CPP
-
-#### 1. Compile
-
-**Prerequisites:**
-- Android NDK (r25e recommended)
-- `ANDROID_NDK_PATH` environment variable set
-
-**Build:**
-```bash
-# Build for arm64-v8a
-cd examples/clip/cpp
-./build-android.sh -a arm64-v8a
-```
-
-The executable will be generated at `build/android_arm64-v8a/clip_demo` (Note: executable name may vary, verify in build folder).
-
-#### 2. Run
-
-```bash
-# Push executable to device
-adb push build/android_arm64-v8a/clip_demo /data/local/tmp/
-adb push model/vision_model_int8_A311D2.adla /data/local/tmp/
-adb push clip_datasets/ /data/local/tmp/
-adb push test_hat_0.jpg /data/local/tmp/
-
-# Run on device
-adb shell
-cd /data/local/tmp
-chmod +x clip_demo
-export LD_LIBRARY_PATH=/vendor/lib64 or (/vendor/lib)
-
-# Usage: ./clip_demo <model_path> [base_dir] [json_filename]
-./clip_demo vision_model_int8_A311D2.adla ./clip_datasets/ clip_text_res.json
-```
-
-**Note:**
-- Replace `vision_model_int8_A311D2.adla` with your actual model file path.
-- The `base_dir` and `json_filename` parameters are optional. You can also use environment variables `CLIP_BASE_DIR` and `CLIP_JSON_FILENAME`.
-- The program will prompt you to enter image paths interactively. Enter "exit" to quit.
-
-### Python
-
-**Prerequisites:**
-- Python 3.10
-- Required packages: `numpy`, `Pillow`, `amlnnlite`
-
-**Install dependencies:**
-```bash
-pip install numpy Pillow amlnnlite-1.0.0-cp310-cp310-linux_aarch64.whl
-```
-
-**Run on device:**
-```bash
-# Basic usage (process current directory)
-python clip.py --model-path ./vision_model_int8_A311D2.adla
-
-# Specify image directory or file
-python clip.py --model-path ./vision_model_int8_A311D2.adla --image-dir ./
-
-# Specify base directory and JSON filename
-python clip.py --model-path ./vision_model_int8_A311D2.adla --base-dir ./clip_datasets/ --json-filename clip_text_res.json
-```
-
-The script will automatically process all image files (`.jpg`, `.jpeg`, `.png`, `.bmp`) in the specified directory or process a single image file, and display the best matching dataset for each image.
-
-5. Results
-
-The program will print the best matching dataset path for each processed image. The program searches through all dataset folders in the base directory and finds the text feature with the highest similarity to the input image.
-
-**Example output:**
-```
-# python demo result
-Model initialized successfully.
-
-Found 2 image file(s) to process
-Searching in base directory: ./clip_datasets/
-
-Processing image: test_jacket_0.jpg
-Best matching dataset: ./clip_datasets/shirt10_jacket7
-Searching in base directory: ./clip_datasets/
-
-Processing image: test_hat_0.jpg
-Best matching dataset: ./clip_datasets/hat1_jd
-
-Total results: 2
-Index[0]: ./clip_datasets/shirt10_jacket7
-Index[1]: ./clip_datasets/hat1_jd
-
-Done.
-```
-
-The program returns the dataset folder path that contains the text feature with the highest similarity to the input image. Each result represents the best matching dataset for the corresponding input image.
+# CLIP
+
+## 1. Overview
+
+This demo shows how to run CLIP (Contrastive Language-Image Pre-Training) image-text matching using AMLNNLite. The CLIP model consists of two parts: a vision encoder and a text encoder, which work together to compute similarity between images and text descriptions.
+
+## 2. Model Download
+
+TO DO
+
+## 3. Model Conversion
+
+TO DO
+
+## 4. Demo Run
+
+### CPP
+
+#### 1. Compile
+
+**Prerequisites:**
+- Android NDK (r25e recommended)
+- `ANDROID_NDK_PATH` environment variable set
+
+**Build:**
+```bash
+# Build for arm64-v8a
+cd examples/clip/cpp
+./build-android.sh -a arm64-v8a
+```
+
+The executable will be generated at `build/android_arm64-v8a/clip_demo`.
+
+#### 2. Run
+
+```bash
+# Push executable and resources to device
+adb push build/android_arm64-v8a/clip_demo /data/local/tmp/
+adb push model/vision_model_int8_S905X5.adla /data/local/tmp/
+adb push model/text_model_int8_S905X5.adla /data/local/tmp/
+adb push tokenizer_path/ /data/local/tmp/
+
+# Run on device
+adb shell
+cd /data/local/tmp
+chmod +x clip_demo
+export LD_LIBRARY_PATH=/vendor/lib64  # or /vendor/lib
+
+# Usage: ./clip_demo <vision_model> <text_model> <tokenizer_path> [--profiling]
+./clip_demo vision_model_int8_S905X5.adla text_model_int8_S905X5.adla ./tokenizer_path/
+```
+
+The program will prompt for image paths and text descriptions interactively. Enter the path to an image file, then enter comma-separated text descriptions (or `skip` to use defaults). Type `exit` to quit.
+
+**Argument Descriptions:**
+
+| Argument | Description |
+| -------------- | ------------------------------------------------------------ |
+| vision_model | Path to vision encoder .adla model (required) |
+| text_model | Path to text encoder .adla model (required) |
+| tokenizer_path | Path to directory containing `vocab.json` and `merges.txt` (required) |
+| --profiling | Enable performance profiling output (optional) |
+
+**Note:** The `tokenizer_path` should contain `vocab.json` and `merges.txt` files from the CLIP tokenizer (e.g., from `openai/clip-vit-base-patch32`).
+
+### Python
+
+**Prerequisites:**
+- Python 3.10
+- Required packages: `numpy`, `Pillow`, `transformers`, `amlnnlite`
+
+**Install dependencies:**
+```bash
+pip install numpy Pillow transformers amlnnlite-1.0.0-cp310-cp310-linux_aarch64.whl
+```
+
+**Run on device:**
+```bash
+python clip.py \
+    --vision-model ./vision_model_int8_S905X5.adla \
+    --text-model ./text_model_int8_S905X5.adla \
+    --tokenizer-dir ./tokenizer_path \
+    --image-path ./000000004505.jpg \
+    --texts "a red handbag" "a blue jacket" "a red bus"
+```
+
+**Interactive Mode (Recommended):**
+
+If you don't provide `--image-path`, the program will run in interactive mode:
+
+```bash
+python clip.py \
+    --vision-model ./vision_model_int8_S905X5.adla \
+    --text-model ./text_model_int8_S905X5.adla \
+    --tokenizer-dir ./tokenizer_path
+```
+
+The program will prompt for image paths and text descriptions. Enter an image path to process, then enter comma-separated texts to compare. Type `exit` to quit.
+
+**Argument Descriptions:**
+
+| Argument | Description |
+| ---------------- | ------------------------------------------------------------ |
+| --vision-model | Path to vision encoder .adla model (required) |
+| --text-model | Path to text encoder .adla model (required) |
+| --tokenizer-dir | Path to CLIPTokenizer directory (required) |
+| --image-path | Path to input image (.jpg, .png) - optional, will prompt if not provided |
+| --texts | List of text descriptions to compare (space-separated) |
+| --max-len | Maximum token sequence length, default is 64 |
+| --logit-scale | Logit scale factor, default is 100.0 |
+
+**Note:** The `--tokenizer-dir` should point to the directory containing the CLIPTokenizer files. You can use a Hugging Face model ID (e.g., `openai/clip-vit-base-patch32`) or a local directory.
+
+## 5. Results
+
+**Performance Feedback**
+
+With the `--profiling` flag (C++) or the log level set to INFO, the program prints performance metrics to the console, including:
+
+- Hardware Information: System and ADLA library versions.
+- Model Overview: Basic input/output configurations.
+- NPU Metrics: Total inference time (latency) and total DRAM bandwidth consumption.
+
+**Interactive Mode Example:**
+
+```bash
+$ ./clip_demo vision_model_int8_S905X5.adla text_model_int8_S905X5.adla ./tokenizer_path
+[Info] Models initialized successfully.
+
+============================================================
+[Info] Image Path (or 'exit' to quit):
+000000004505.jpg
+[Info] Enter text descriptions (comma-separated, or 'skip' for defaults):
+a red handbag, a blue jacket, a red bus
+
+[Info] Processing image: 000000004505.jpg
+[Info] Image embedding size: 512
+[Info] Processing 3 text(s)...
+[Info] Text embeddings size: 3 x 512
+============================================================
+CLIP Image-Text Matching Results
+============================================================
+Image: 000000004505.jpg
+logit_scale: 100.000000
+------------------------------------------------------------
+[1] prob=0.999975 sim=0.327895 text='a red bus'
+[2] prob=0.000016 sim=0.217690 text='a red handbag'
+[3] prob=0.000008 sim=0.211029 text='a blue jacket'
+============================================================
+============================================================
+[Info] Image Path (or 'exit' to quit):
+exit
+[Info] Exiting...
+Free vision model memory.
+Free text model memory.
+[Info] Done.
+```
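The `prob` column in the README's example output is consistent with a softmax over `logit_scale * sim`. The sketch below is a minimal illustration under that assumption; `similarities_to_probs` is a hypothetical helper name and is not taken from the demo sources, which may compute the probabilities differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the demo sources): scales cosine similarities
// by logit_scale and applies a numerically stable softmax.
std::vector<double> similarities_to_probs(const std::vector<double>& sims,
                                          double logit_scale = 100.0)
{
    assert(!sims.empty());

    // Scale the similarities into logits.
    std::vector<double> logits(sims.size());
    for (std::size_t i = 0; i < sims.size(); ++i) {
        logits[i] = logit_scale * sims[i];
    }

    // Subtract the maximum logit before exponentiating to avoid overflow.
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs(sims.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }

    // Normalize so the probabilities sum to 1.
    for (double& p : probs) {
        p /= sum;
    }
    return probs;
}
```

Calling `similarities_to_probs({0.327895, 0.217690, 0.211029})` gives values that round to the `prob` figures printed in the example output. With a scale of 100, a similarity gap of about 0.11 is enough to push nearly all of the probability onto the top match.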


@ -1,42 +1,43 @@
 cmake_minimum_required(VERSION 3.5)
 project(clip_demo)
 set(CMAKE_CXX_STANDARD 17)
 # Set NNSDK path
 set(NNSDK_ROOT "${CMAKE_SOURCE_DIR}/../../../../dependency/nnsdk")
 include_directories(${NNSDK_ROOT}/include)
 include_directories(${CMAKE_SOURCE_DIR}/../../../../common)
 # Set 3rdparty path
 set(3RDPARTY_DIR "${CMAKE_SOURCE_DIR}/../../../../dependency")
 # Include directories for stb_image and json
 # Note: code uses #include "stb_image.h" and #include "json.hpp"
 include_directories(${3RDPARTY_DIR}/stb_image)
 include_directories(${3RDPARTY_DIR}/json)
 if(CMAKE_SYSTEM_NAME STREQUAL "Android")
     if (ANDROID_ABI STREQUAL "arm64-v8a")
         link_directories(${NNSDK_ROOT}/lib/android/arm64-v8a)
     else()
         link_directories(${NNSDK_ROOT}/lib/android/armeabi-v7a)
     endif()
     # Android needs log
     link_libraries(log)
 elseif(CMAKE_SYSTEM_NAME STREQUAL "Linux")
     link_directories(${NNSDK_ROOT}/lib/linux/lib64_yocto)
 endif()
 add_executable(${PROJECT_NAME}
     main.cpp
     model_invoke.cpp
     pre_postprocess.cpp
+    clip_tokenizer.cpp
 )
 target_link_libraries(${PROJECT_NAME}
     nnsdk
     dl
     m
 )


@ -0,0 +1,53 @@
/*
* Copyright (C) 2024-2025 Amlogic, Inc. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#ifndef CLIP_PROCESS_H
#define CLIP_PROCESS_H
#include <string>
#include <vector>
#include <cstdint>
// ==================== Model Invoke ====================
// Initialize network from file
void* init_network_file(const char *model_path);
// Run vision model inference
std::vector<float> run_vision_model(void* context, const std::vector<float>& input_data);
// Run text model inference
std::vector<float> run_text_model(void* context, const std::vector<int64_t>& input_ids);
// Destroy network
int destroy_network(void *qcontext);
// ==================== Pre/Post Processing ====================
// Image preprocessing
std::vector<float> preprocess_image(const std::string& image_path);
// L2 normalize
std::vector<float> l2_normalize(const std::vector<float>& vec);
// Softmax
std::vector<float> softmax(const std::vector<float>& logits);
// Compute cosine similarity
float compute_similarity(const std::vector<float>& a, const std::vector<float>& b, float scale = 100.0f);
#endif // CLIP_PROCESS_H
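
The header declares `l2_normalize` and `compute_similarity`, but their definitions are not shown here. As a rough sketch of what such helpers typically do, the code below normalizes a vector to unit length and computes a scaled dot product, which equals the scaled cosine similarity for unit vectors. The `_sketch` names are hypothetical, and whether the real `compute_similarity` applies `scale` this way is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for l2_normalize(): divides by the Euclidean norm.
// A zero vector is returned unchanged to avoid dividing by zero.
std::vector<float> l2_normalize_sketch(const std::vector<float>& vec)
{
    double sum_sq = 0.0;
    for (float v : vec) {
        sum_sq += static_cast<double>(v) * v;
    }
    const double norm = std::sqrt(sum_sq);

    std::vector<float> out(vec.size(), 0.0f);
    if (norm == 0.0) {
        return out;
    }
    for (std::size_t i = 0; i < vec.size(); ++i) {
        out[i] = static_cast<float>(vec[i] / norm);
    }
    return out;
}

// Hypothetical stand-in for compute_similarity(): for unit-length inputs the
// dot product is the cosine similarity; the result is multiplied by scale.
float scaled_similarity_sketch(const std::vector<float>& a,
                               const std::vector<float>& b,
                               float scale = 100.0f)
{
    assert(a.size() == b.size());
    double dot = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
    }
    return static_cast<float>(scale * dot);
}
```

Normalizing both embeddings once up front means each image-text comparison reduces to a single dot product.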


@ -0,0 +1,395 @@
/*
* Copyright (C) 2024-2025 Amlogic, Inc. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "clip_tokenizer.h"
#include "json.hpp"
#include <fstream>
#include <sstream>
#include <iostream>
#include <algorithm>
#include <regex>
#include <set>
#include <cassert>
#include <climits>  // INT_MAX, used in bpe()
#include <cstdio>   // printf, used in load()
#include <codecvt>
#include <locale>
using json = nlohmann::ordered_json;
// Reference: https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py
void CLIPTokenizer::init_byte_to_unicode()
{
byte_to_unicode_.clear();
unicode_to_byte_.clear();
// Printable ASCII ranges that map to themselves
// '!' (33) to '~' (126), '¡' (161) to '¬' (172), '®' (174) to 'ÿ' (255)
std::vector<int> bs;
for (int i = 33; i <= 126; ++i) bs.push_back(i); // '!' to '~'
for (int i = 161; i <= 172; ++i) bs.push_back(i); // '¡' to '¬'
for (int i = 174; i <= 255; ++i) bs.push_back(i); // '®' to 'ÿ'
std::vector<int> cs(bs.begin(), bs.end());
// Map remaining bytes (0-32, 127-160, 173) to 256+
int n = 0;
for (int b = 0; b < 256; ++b) {
if (std::find(bs.begin(), bs.end(), b) == bs.end()) {
bs.push_back(b);
cs.push_back(256 + n);
n++;
}
}
for (size_t i = 0; i < bs.size(); ++i) {
byte_to_unicode_[static_cast<uint8_t>(bs[i])] = static_cast<char32_t>(cs[i]);
unicode_to_byte_[static_cast<char32_t>(cs[i])] = static_cast<uint8_t>(bs[i]);
}
}
// ========== UTF-8 Helpers ==========
std::vector<char32_t> CLIPTokenizer::utf8_to_codepoints(const std::string& str)
{
std::vector<char32_t> result;
size_t i = 0;
while (i < str.size()) {
char32_t cp = 0;
unsigned char c = str[i];
int len = 0;
if (c < 0x80) {
cp = c;
len = 1;
} else if ((c & 0xE0) == 0xC0) {
cp = c & 0x1F;
len = 2;
} else if ((c & 0xF0) == 0xE0) {
cp = c & 0x0F;
len = 3;
} else if ((c & 0xF8) == 0xF0) {
cp = c & 0x07;
len = 4;
} else {
++i;
continue;
}
for (int j = 1; j < len && (i + j) < str.size(); ++j) {
cp = (cp << 6) | (str[i + j] & 0x3F);
}
result.push_back(cp);
i += len;
}
return result;
}
std::string CLIPTokenizer::codepoints_to_utf8(const std::vector<char32_t>& cps)
{
std::string result;
for (char32_t cp : cps) {
if (cp < 0x80) {
result += static_cast<char>(cp);
} else if (cp < 0x800) {
result += static_cast<char>(0xC0 | (cp >> 6));
result += static_cast<char>(0x80 | (cp & 0x3F));
} else if (cp < 0x10000) {
result += static_cast<char>(0xE0 | (cp >> 12));
result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
result += static_cast<char>(0x80 | (cp & 0x3F));
} else {
result += static_cast<char>(0xF0 | (cp >> 18));
result += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
result += static_cast<char>(0x80 | (cp & 0x3F));
}
}
return result;
}
// ========== Load Functions ==========
bool CLIPTokenizer::load(const std::string& vocab_path, const std::string& merges_path)
{
init_byte_to_unicode();
// Load vocab.json
{
std::ifstream file(vocab_path);
if (!file.is_open()) {
std::cerr << "Failed to open vocab file: " << vocab_path << std::endl;
return false;
}
try {
json j;
file >> j;
for (auto it = j.begin(); it != j.end(); ++it) {
std::string token = it.key();
int id = it.value().get<int>();
token_to_id_[token] = id;
id_to_token_[id] = token;
}
} catch (const std::exception& e) {
std::cerr << "Error parsing vocab.json: " << e.what() << std::endl;
return false;
}
}
// Find special token IDs
if (token_to_id_.count("<|startoftext|>")) {
sot_token_id_ = token_to_id_["<|startoftext|>"];
}
if (token_to_id_.count("<|endoftext|>")) {
eot_token_id_ = token_to_id_["<|endoftext|>"];
}
// Load merges.txt
{
std::ifstream file(merges_path);
if (!file.is_open()) {
std::cerr << "Failed to open merges file: " << merges_path << std::endl;
return false;
}
std::string line;
int rank = 0;
// Skip header line "#version: ..." if present
if (std::getline(file, line)) {
if (line.find("#version") == std::string::npos) {
// First line is not a header, process it
std::istringstream iss(line);
std::string a, b;
if (iss >> a >> b) {
bpe_ranks_[{a, b}] = rank++;
}
}
}
while (std::getline(file, line)) {
if (line.empty()) continue;
std::istringstream iss(line);
std::string a, b;
if (iss >> a >> b) {
bpe_ranks_[{a, b}] = rank++;
}
}
}
loaded_ = true;
printf("[Info] CLIPTokenizer loaded: vocab_size=%zu, merges=%zu\n",
token_to_id_.size(), bpe_ranks_.size());
return true;
}
bool CLIPTokenizer::load_from_dir(const std::string& tokenizer_dir)
{
std::string dir = tokenizer_dir;
// Ensure trailing slash
if (!dir.empty() && dir.back() != '/' && dir.back() != '\\') {
dir += "/";
}
return load(dir + "vocab.json", dir + "merges.txt");
}
// ========== BPE Implementation ==========
std::string CLIPTokenizer::bytes_to_unicode_str(const std::string& raw) const
{
std::vector<char32_t> result;
for (unsigned char c : raw) {
auto it = byte_to_unicode_.find(c);
if (it != byte_to_unicode_.end()) {
result.push_back(it->second);
}
}
return codepoints_to_utf8(result);
}
std::vector<std::string> CLIPTokenizer::bpe(const std::string& token) const
{
// Convert token to individual unicode characters as strings
auto codepoints = utf8_to_codepoints(token);
if (codepoints.empty()) return {};
// Each character becomes a separate piece
std::vector<std::string> word;
for (size_t i = 0; i < codepoints.size(); ++i) {
std::string piece = codepoints_to_utf8({codepoints[i]});
// CLIP adds </w> to the last character
if (i == codepoints.size() - 1) {
piece += "</w>";
}
word.push_back(piece);
}
if (word.size() == 1) return word;
// Iteratively merge the adjacent pair with the lowest rank (highest merge priority)
while (true) {
if (word.size() < 2) break;
// Find the pair with the lowest rank
int best_rank = INT_MAX;
int best_idx = -1;
for (size_t i = 0; i < word.size() - 1; ++i) {
auto it = bpe_ranks_.find({word[i], word[i + 1]});
if (it != bpe_ranks_.end() && it->second < best_rank) {
best_rank = it->second;
best_idx = static_cast<int>(i);
}
}
if (best_idx == -1) break; // No more merges possible
// Merge the pair at best_idx
std::string merged = word[best_idx] + word[best_idx + 1];
std::vector<std::string> new_word;
for (size_t i = 0; i < word.size(); ++i) {
if (static_cast<int>(i) == best_idx) {
new_word.push_back(merged);
++i; // Skip next element
} else {
new_word.push_back(word[i]);
}
}
word = new_word;
}
return word;
}
std::vector<std::string> CLIPTokenizer::pre_tokenize(const std::string& text) const
{
// CLIP tokenizer: lowercase + basic clean + split by pattern
// Pattern from CLIP: <\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+
// Simplified version for ASCII-dominant text:
std::string cleaned;
// Lowercase and basic whitespace normalization
for (char c : text) {
if (c >= 'A' && c <= 'Z') {
cleaned += (c - 'A' + 'a');
} else {
cleaned += c;
}
}
// Simple tokenization: split by whitespace and punctuation
std::vector<std::string> words;
std::string current;
for (size_t i = 0; i < cleaned.size(); ++i) {
char c = cleaned[i];
if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
if (!current.empty()) {
words.push_back(current);
current.clear();
}
// Whitespace only separates words. CLIP marks word ends with "</w>"
// (appended in bpe()), so no space prefix is added here.
} else {
// Check if punctuation should be separate token
bool is_alpha_or_digit = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
bool cur_is_alpha = !current.empty() &&
((current.back() >= 'a' && current.back() <= 'z') ||
(current.back() >= '0' && current.back() <= '9'));
if (!current.empty() && !is_alpha_or_digit && cur_is_alpha) {
// Start new token for punctuation
words.push_back(current);
current.clear();
} else if (!current.empty() && is_alpha_or_digit && !cur_is_alpha) {
words.push_back(current);
current.clear();
}
current += c;
}
}
if (!current.empty()) {
words.push_back(current);
}
return words;
}
// ========== Encode ==========
std::vector<int64_t> CLIPTokenizer::encode(const std::string& text, int max_len) const
{
if (!loaded_) {
std::cerr << "Tokenizer not loaded!" << std::endl;
return std::vector<int64_t>(max_len, 0);
}
std::vector<int64_t> tokens;
// Add start-of-text token
tokens.push_back(sot_token_id_);
// Pre-tokenize
std::vector<std::string> words = pre_tokenize(text);
// Process each word
for (const auto& word : words) {
// Convert raw bytes to unicode representation
std::string unicode_word = bytes_to_unicode_str(word);
// Apply BPE
std::vector<std::string> bpe_tokens = bpe(unicode_word);
// Look up token IDs
for (const auto& bt : bpe_tokens) {
auto it = token_to_id_.find(bt);
if (it != token_to_id_.end()) {
tokens.push_back(it->second);
} else {
// Unknown token, try without </w>
std::string no_ew = bt;
if (no_ew.size() >= 4 && no_ew.substr(no_ew.size() - 4) == "</w>") {
no_ew = no_ew.substr(0, no_ew.size() - 4);
}
auto it2 = token_to_id_.find(no_ew);
if (it2 != token_to_id_.end()) {
tokens.push_back(it2->second);
}
// else: skip unknown token
}
}
}
// Add end-of-text token
tokens.push_back(eot_token_id_);
// Truncate if necessary
if (static_cast<int>(tokens.size()) > max_len) {
tokens.resize(max_len);
// Ensure EOT is at the end
tokens.back() = eot_token_id_;
}
// Pad to max_len with EOT token (consistent with HuggingFace CLIPTokenizer)
while (static_cast<int>(tokens.size()) < max_len) {
tokens.push_back(eot_token_id_);
}
return tokens;
}
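
The merge loop in `bpe()` can be tried in isolation on a toy merge table. The sketch below follows the same lowest-rank-first rule but is a standalone illustration: `bpe_merge_sketch`, `MergeRanks`, and the two merge rules used in the usage note are made up for the example and do not come from the real `merges.txt`.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Maps an adjacent pair of pieces to its merge rank (lower merges first).
using MergeRanks = std::map<std::pair<std::string, std::string>, int>;

// Hypothetical standalone version of the merge loop: repeatedly merges the
// adjacent pair with the lowest rank until no listed pair remains.
std::vector<std::string> bpe_merge_sketch(std::vector<std::string> word,
                                          const MergeRanks& ranks)
{
    while (word.size() >= 2) {
        int best_rank = INT_MAX;
        std::size_t best_idx = word.size();  // sentinel: no mergeable pair found
        for (std::size_t i = 0; i + 1 < word.size(); ++i) {
            auto it = ranks.find({word[i], word[i + 1]});
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_idx = i;
            }
        }
        if (best_idx == word.size()) {
            break;  // no more merges possible
        }
        // Join the pair in place and drop the right-hand piece.
        word[best_idx] += word[best_idx + 1];
        word.erase(word.begin() + static_cast<std::ptrdiff_t>(best_idx) + 1);
    }
    return word;
}
```

With the toy rules `("l", "o")` at rank 0 and `("lo", "w</w>")` at rank 1, the pieces `{"l", "o", "w</w>"}` collapse to the single token `low</w>`, while a word with no listed pairs is returned unchanged.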


@ -0,0 +1,105 @@
/*
* Copyright (C) 2024-2025 Amlogic, Inc. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#ifndef CLIP_TOKENIZER_H
#define CLIP_TOKENIZER_H
#include <cstdint>  // uint8_t, int64_t
#include <string>
#include <vector>
#include <map>
#include <unordered_map>
#include <utility>  // std::pair
class CLIPTokenizer {
public:
CLIPTokenizer() = default;
/**
* Load tokenizer from vocab.json and merges.txt
* @param vocab_path Path to vocab.json
* @param merges_path Path to merges.txt
* @return true on success
*/
bool load(const std::string& vocab_path, const std::string& merges_path);
/**
* Load tokenizer from a directory containing vocab.json and merges.txt
* @param tokenizer_dir Path to directory
* @return true on success
*/
bool load_from_dir(const std::string& tokenizer_dir);
/**
* Tokenize text to token IDs with padding/truncation.
* Adds <|startoftext|> and <|endoftext|> automatically.
*
* @param text Input text string
* @param max_len Maximum sequence length (default: 64)
* @return Vector of int64_t token IDs with shape [max_len]
*/
std::vector<int64_t> encode(const std::string& text, int max_len = 64) const;
/**
* Check if tokenizer is loaded
*/
bool is_loaded() const { return loaded_; }
/**
* Get vocabulary size
*/
size_t vocab_size() const { return token_to_id_.size(); }
private:
// BPE pair
using BPEPair = std::pair<std::string, std::string>;
// Byte-to-unicode mapping (GPT-2 style)
std::unordered_map<uint8_t, char32_t> byte_to_unicode_;
std::unordered_map<char32_t, uint8_t> unicode_to_byte_;
// Vocabulary
std::unordered_map<std::string, int> token_to_id_;
std::unordered_map<int, std::string> id_to_token_;
// BPE merge rules (pair -> priority rank)
std::map<BPEPair, int> bpe_ranks_;
// Special token IDs
int sot_token_id_ = 49406; // <|startoftext|>
int eot_token_id_ = 49407; // <|endoftext|>
bool loaded_ = false;
// Initialize byte-to-unicode mapping
void init_byte_to_unicode();
// Convert UTF-8 string to vector of unicode codepoints
static std::vector<char32_t> utf8_to_codepoints(const std::string& str);
// Convert unicode codepoints to UTF-8 string
static std::string codepoints_to_utf8(const std::vector<char32_t>& cps);
// Apply BPE to a single word (already converted to unicode representation)
std::vector<std::string> bpe(const std::string& token) const;
// Lowercase and split text using a simplified, ASCII-oriented approximation of CLIP's regex pattern
std::vector<std::string> pre_tokenize(const std::string& text) const;
// Convert raw bytes to unicode string using byte_to_unicode mapping
std::string bytes_to_unicode_str(const std::string& raw) const;
};
#endif // CLIP_TOKENIZER_H
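
The padding and truncation behaviour documented for `encode()` can be illustrated without the tokenizer itself. The sketch below frames an already-tokenized sequence with the start and end tokens, truncates it if needed, and pads it with the end-of-text ID. `frame_token_ids` is a hypothetical helper; its defaults mirror the special token IDs in the header, and the body IDs in the usage note are arbitrary placeholders.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: wraps body token IDs in <|startoftext|> / <|endoftext|>,
// then truncates or pads the result to exactly max_len entries.
std::vector<std::int64_t> frame_token_ids(const std::vector<std::int64_t>& body,
                                          std::size_t max_len,
                                          std::int64_t sot = 49406,
                                          std::int64_t eot = 49407)
{
    assert(max_len >= 2);  // room for at least the start and end tokens

    std::vector<std::int64_t> ids;
    ids.reserve(max_len);
    ids.push_back(sot);
    ids.insert(ids.end(), body.begin(), body.end());
    ids.push_back(eot);

    // Truncate, keeping the end-of-text token in the final slot.
    if (ids.size() > max_len) {
        ids.resize(max_len);
        ids.back() = eot;
    }

    // Pad any remaining slots with the end-of-text token.
    ids.resize(max_len, eot);
    return ids;
}
```

For example, `frame_token_ids({320, 736, 2840}, 8)` yields the start token, the three body IDs, and then four end-of-text IDs, so the text model always receives a fixed-length input.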


@ -15,22 +15,26 @@
 */
 #include <iostream>
+#include <fstream>
+#include <sstream>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
+#include <vector>
+#include <string>
+#include <algorithm>
-#include "model_invoke.h"
+#include "clip_process.h"
+#include "clip_tokenizer.h"
 #define BILLION 1000000000
-struct Get_Times
+struct ProfilingTimer
 {
-    uint64_t init_start_time, init_end_time, init_total_time;
-    uint64_t preProcess_start_time, preProcess_end_time, preProcess_total_time;
-    uint64_t invoke_start_time, invoke_end_time, invoke_total_time;
-    uint64_t postProcess_start_time, postProcess_end_time, postProcess_total_time;
-    uint64_t total_time;
-    std::vector<uint64_t> total_time_group;
+    uint64_t init_start, init_end;
+    uint64_t preprocess_start, preprocess_end;
+    uint64_t vision_infer_start, vision_infer_end;
+    uint64_t text_infer_start, text_infer_end;
 };
 static uint64_t get_time_count()
@ -40,70 +44,288 @@ static uint64_t get_time_count()
return (uint64_t)((uint64_t)ts.tv_nsec + (uint64_t)ts.tv_sec * BILLION); return (uint64_t)((uint64_t)ts.tv_nsec + (uint64_t)ts.tv_sec * BILLION);
} }
// Default text prompts for demo
static std::vector<std::string> default_texts = {
"a red handbag",
"a blue jacket",
"a red bus"
};
// Parse comma-separated texts
std::vector<std::string> parse_texts(const std::string& input)
{
std::vector<std::string> result;
std::stringstream ss(input);
std::string item;
while (std::getline(ss, item, ',')) {
// Trim whitespace
size_t start = item.find_first_not_of(" \t");
size_t end = item.find_last_not_of(" \t");
if (start != std::string::npos && end != std::string::npos) {
result.push_back(item.substr(start, end - start + 1));
}
}
return result;
}
void print_usage(const char* prog_name)
{
printf("Usage: %s <vision_model> <text_model> <tokenizer_dir> [--profiling]\n", prog_name);
printf("\n");
printf("Arguments:\n");
printf(" vision_model: Path to vision model (.adla)\n");
printf(" text_model: Path to text model (.adla)\n");
printf(" tokenizer_dir: Path to directory containing vocab.json and merges.txt\n");
printf(" --profiling: Enable performance profiling output (optional)\n");
printf("\n");
printf("Interactive mode:\n");
printf(" - Enter image path to process\n");
printf(" - Enter comma-separated texts to compare (or 'skip' for defaults)\n");
printf(" - Enter 'exit' to quit\n");
}
int main(int argc, char ** argv) int main(int argc, char ** argv)
{ {
Get_Times model_time; ProfilingTimer timer = {};
std::vector<float> input_data_fir;
float* model_output_data;
int ret = 0; int ret = 0;
int max_index = 0; bool profiling = false;
if (argc < 2) {
printf("Usage: %s <model_path> [base_dir] [json_filename]\n", argv[0]);
printf(" model_path: Path to the model file\n");
printf(" base_dir: Base directory for clip datasets (optional, can also use CLIP_BASE_DIR env var)\n");
printf(" json_filename: JSON filename in each dataset folder (optional, can also use CLIP_JSON_FILENAME env var, default: clip_text_res.json)\n");
return -1;
}
char* model_path_encoder = argv[1];
std::string base_dir = (argc >= 3) ? argv[2] : "";
std::string json_filename = (argc >= 4) ? argv[3] : "";
void *context_model = NULL;
model_time.init_start_time = get_time_count(); if (argc < 4) {
context_model = init_network_file(model_path_encoder); print_usage(argv[0]);
model_time.init_end_time = get_time_count();
if (context_model == NULL)
{
printf("init_network [context_model] fail.\n");
return -1; return -1;
} }
if (getenv("GET_TIME")) const char* vision_model_path = argv[1];
{ const char* text_model_path = argv[2];
model_time.init_total_time = (model_time.init_end_time - model_time.init_start_time) / 1000000; const char* tokenizer_dir = argv[3];
std::cout << "init_model_total time : " << model_time.init_total_time << "ms" << std::endl;
// Check for --profiling flag
for (int i = 4; i < argc; ++i) {
if (std::string(argv[i]) == "--profiling") {
profiling = true;
}
} }
while (true) const float logit_scale = 100.0f;
{ const int max_seq_len = 64;
std::string json_path;
printf("\nPlease enter the JPG image path (enter exit to quit):\n"); // Load tokenizer
std::getline(std::cin, json_path); printf("[Info] Loading tokenizer from: %s\n", tokenizer_dir);
if (json_path == "exit") break; CLIPTokenizer tokenizer;
if (json_path.empty()) { if (!tokenizer.load_from_dir(tokenizer_dir)) {
printf("The path cannot be empty.\n"); printf("[Error] Failed to load tokenizer.\n");
return -1;
}
// Initialize models
printf("[Info] Initializing vision model: %s\n", vision_model_path);
timer.init_start = get_time_count();
void* vision_context = init_network_file(vision_model_path);
if (vision_context == NULL) {
printf("[Error] Failed to initialize vision model.\n");
return -1;
}
printf("[Info] Initializing text model: %s\n", text_model_path);
void* text_context = init_network_file(text_model_path);
if (text_context == NULL) {
printf("[Error] Failed to initialize text model.\n");
destroy_network(vision_context);
return -1;
}
timer.init_end = get_time_count();
if (profiling) {
uint64_t init_time = (timer.init_end - timer.init_start) / 1000000;
printf("[Profiling] Model initialization: %lums\n", init_time);
}
printf("[Info] Models initialized successfully.\n\n");
// Interactive loop
while (true) {
std::string image_path;
printf("============================================================\n");
printf("[Info] Image Path (or 'exit' to quit):\n");
std::getline(std::cin, image_path);
// Trim whitespace
size_t start = image_path.find_first_not_of(" \t\r\n");
size_t end = image_path.find_last_not_of(" \t\r\n");
if (start != std::string::npos && end != std::string::npos) {
image_path = image_path.substr(start, end - start + 1);
} else {
image_path.clear();
}
if (image_path == "exit") {
printf("[Info] Exiting...\n");
break;
}
if (image_path.empty()) {
printf("[Warning] Please enter an image path.\n");
continue; continue;
} }
std::vector<std::string> out_str_path = process_image_dir(context_model, json_path, base_dir, json_filename);
for (int i = 0; i < out_str_path.size(); i++) // Check if file exists
{ {
std::cout << "Index[" << i << "] : " << out_str_path[i] << std::endl; std::ifstream img_file(image_path);
if (!img_file.good()) {
printf("[Error] Image not found: %s\n", image_path.c_str());
continue;
}
} }
// Get texts to compare
std::vector<std::string> texts;
printf("[Info] Enter text descriptions (comma-separated, or 'skip' for defaults):\n");
std::string text_input;
std::getline(std::cin, text_input);
// Trim
start = text_input.find_first_not_of(" \t\r\n");
end = text_input.find_last_not_of(" \t\r\n");
if (start != std::string::npos && end != std::string::npos) {
text_input = text_input.substr(start, end - start + 1);
} else {
text_input.clear();
}
if (text_input.empty() || text_input == "skip") {
texts = default_texts;
printf("[Info] Using default texts\n");
} else {
texts = parse_texts(text_input);
}
if (texts.empty()) {
printf("[Warning] No texts provided.\n");
continue;
}
// ==================== Process Image ====================
printf("\n[Info] Processing image: %s\n", image_path.c_str());
timer.preprocess_start = get_time_count();
std::vector<float> image_input = preprocess_image(image_path);
if (image_input.empty()) {
printf("[Error] Failed to preprocess image.\n");
continue;
}
timer.preprocess_end = get_time_count();
// Run vision model
timer.vision_infer_start = get_time_count();
std::vector<float> image_embedding = run_vision_model(vision_context, image_input);
if (image_embedding.empty()) {
printf("[Error] Vision model inference failed.\n");
continue;
}
timer.vision_infer_end = get_time_count();
// L2 normalize image embedding
image_embedding = l2_normalize(image_embedding);
printf("[Info] Image embedding size: %zu\n", image_embedding.size());
// ==================== Process Texts ====================
printf("[Info] Processing %zu text(s)...\n", texts.size());
std::vector<std::vector<float>> text_embeddings;
std::vector<uint64_t> text_infer_times;
timer.text_infer_start = get_time_count();
for (size_t i = 0; i < texts.size(); ++i) {
// Tokenize text
std::vector<int64_t> token_ids = tokenizer.encode(texts[i], max_seq_len);
// Run text model
uint64_t t_start = get_time_count();
std::vector<float> text_emb = run_text_model(text_context, token_ids);
uint64_t t_end = get_time_count();
text_infer_times.push_back((t_end - t_start) / 1000000);
if (text_emb.empty()) {
printf("[Error] Text model inference failed for: %s\n", texts[i].c_str());
continue;
}
// L2 normalize
text_emb = l2_normalize(text_emb);
text_embeddings.push_back(text_emb);
}
timer.text_infer_end = get_time_count();
if (text_embeddings.size() != texts.size()) {
printf("[Error] Some text embeddings failed.\n");
continue;
}
printf("[Info] Text embeddings size: %zu x %zu\n", text_embeddings.size(),
text_embeddings.empty() ? 0 : text_embeddings[0].size());
// ==================== Compute Similarity ====================
std::vector<float> similarities(texts.size());
std::vector<float> logits(texts.size());
for (size_t i = 0; i < texts.size(); ++i) {
similarities[i] = compute_similarity(image_embedding, text_embeddings[i], 1.0f); // cosine sim
logits[i] = similarities[i] * logit_scale;
}
// Compute probabilities
std::vector<float> probs = softmax(logits);
// Sort by probability (descending)
std::vector<size_t> indices(texts.size());
for (size_t i = 0; i < texts.size(); ++i) indices[i] = i;
std::sort(indices.begin(), indices.end(),
[&probs](size_t a, size_t b) { return probs[a] > probs[b]; });
// ==================== Print Results ====================
printf("\n============================================================\n");
printf("CLIP Image-Text Matching Results\n");
printf("============================================================\n");
printf("Image: %s\n", image_path.c_str());
printf("logit_scale: %.6f\n", logit_scale);
printf("------------------------------------------------------------\n");
for (size_t rank = 0; rank < indices.size(); ++rank) {
size_t i = indices[rank];
printf("[%zu] prob=%.6f sim=%.6f text='%s'\n",
rank + 1, probs[i], similarities[i], texts[i].c_str());
}
printf("============================================================\n");
if (profiling) {
uint64_t preprocess_time = (timer.preprocess_end - timer.preprocess_start) / 1000000;
uint64_t vision_time = (timer.vision_infer_end - timer.vision_infer_start) / 1000000;
uint64_t text_total_time = (timer.text_infer_end - timer.text_infer_start) / 1000000;
printf("\n[Profiling]\n");
printf(" Image preprocess: %llums\n", static_cast<unsigned long long>(preprocess_time));
printf(" Vision inference: %llums\n", static_cast<unsigned long long>(vision_time));
for (size_t i = 0; i < texts.size() && i < text_infer_times.size(); ++i) {
printf(" Text inference[%zu]: %llums '%s'\n", i, static_cast<unsigned long long>(text_infer_times[i]), texts[i].c_str());
}
printf(" Text total: %llums (%zu texts)\n", static_cast<unsigned long long>(text_total_time), texts.size());
}
printf("\n");
} }
ret = destroy_network(context_model); // Cleanup
if (ret != 0) ret = destroy_network(vision_context);
{ if (ret != 0) {
printf("destroy_network [context_model] fail.\n"); printf("[Error] Failed to destroy vision model.\n");
return -1;
} }
return ret; ret = destroy_network(text_context);
} if (ret != 0) {
printf("[Error] Failed to destroy text model.\n");
}
printf("[Info] Done.\n");
return 0;
}
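
The interactive loop in `main` trims the image path and the text line with the same `find_first_not_of` / `find_last_not_of` pattern, then passes the text line to `parse_texts`, whose body is not shown in this diff. A minimal sketch of how both steps could be factored into helpers follows. `trim_copy` and `split_texts` are hypothetical names introduced here for illustration; they are not functions from the demo, and the real `parse_texts` may behave differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns a copy of s without leading/trailing
// spaces, tabs, carriage returns, or newlines.
static std::string trim_copy(const std::string& s)
{
    const char* ws = " \t\r\n";
    const std::size_t start = s.find_first_not_of(ws);
    if (start == std::string::npos) {
        return "";  // empty or whitespace-only input
    }
    const std::size_t end = s.find_last_not_of(ws);
    assert(start <= end);  // holds whenever a non-whitespace character exists
    return s.substr(start, end - start + 1);
}

// Hypothetical stand-in for parse_texts(): splits on commas, trims each
// piece, and drops pieces that are empty after trimming.
static std::vector<std::string> split_texts(const std::string& input)
{
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (pos <= input.size()) {
        std::size_t comma = input.find(',', pos);
        if (comma == std::string::npos) {
            comma = input.size();
        }
        std::string item = trim_copy(input.substr(pos, comma - pos));
        if (!item.empty()) {
            out.push_back(item);
        }
        pos = comma + 1;
    }
    return out;
}
```

With helpers like these, each trim block in the loop would reduce to a single call such as `image_path = trim_copy(image_path);`.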


@ -20,31 +20,20 @@
#include <fstream> #include <fstream>
#include <algorithm> #include <algorithm>
#include <vector> #include <vector>
#include <cmath>
#include <cstdlib> #include <cstdlib>
#include "model_invoke.h" #include "clip_process.h"
#include "nn_sdk.h" #include "nn_sdk.h"
#include "json.hpp"
#include <filesystem>
#include <regex>
using json = nlohmann::ordered_json; // Global DMA config for models
namespace fs = std::__fs::filesystem; static aml_memory_config_t vision_mem_config;
static aml_memory_data_t vision_mem_data;
static void* vision_context_flag = nullptr;
struct DMAConfig { static aml_memory_config_t text_mem_config;
bool use_dma = true; static aml_memory_data_t text_mem_data;
bool malloc_buffer_once = true; static void* text_context_flag = nullptr;
};
DMAConfig context_model;
///////////////////////////////////////////////////////////
aml_memory_config_t mem_config_context_model;
aml_memory_data_t mem_data_context_model;
std::vector<float> preprocess_image(const std::string& image_path);
float post_process(const float* a, const std::vector<float>& b);
void* init_network_file(const char *model_path) void* init_network_file(const char *model_path)
{ {
@ -95,202 +84,119 @@ void* init_network_file(const char *model_path)
return qcontext; return qcontext;
} }
float* run_network(void *qcontext, std::vector<float> input_ids, const std::string image_type) std::vector<float> run_vision_model(void* qcontext, const std::vector<float>& input_data)
{ {
int ret = 0; int ret = 0;
nn_input inData; nn_input inData;
nn_output *outdata = NULL; nn_output *outdata = NULL;
aml_output_config_t outconfig; aml_output_config_t outconfig;
inData.input_index = 0; inData.input_index = 0;
inData.info.input_format = AML_INPUT_DEFAULT; inData.info.input_format = AML_INPUT_DEFAULT;
inData.size = input_ids.size() * sizeof(float); inData.size = input_data.size() * sizeof(float);
if (context_model.use_dma) { // Use DMA
if (context_model.malloc_buffer_once) { if (!vision_context_flag) {
mem_config_context_model.cache_type = AML_WITH_CACHE; vision_mem_config.cache_type = AML_WITH_CACHE;
mem_config_context_model.memory_type = AML_VIRTUAL_ADDR; vision_mem_config.memory_type = AML_VIRTUAL_ADDR;
mem_config_context_model.direction = AML_MEM_DIRECTION_READ_WRITE; vision_mem_config.direction = AML_MEM_DIRECTION_READ_WRITE;
mem_config_context_model.index = 0; vision_mem_config.index = 0;
mem_config_context_model.mem_size = inData.size; vision_mem_config.mem_size = inData.size;
aml_util_mallocBuffer(qcontext, &mem_config_context_model, &mem_data_context_model); aml_util_mallocBuffer(qcontext, &vision_mem_config, &vision_mem_data);
aml_util_swapExternalInputBuffer(qcontext, &mem_config_context_model, &mem_data_context_model); aml_util_swapExternalInputBuffer(qcontext, &vision_mem_config, &vision_mem_data);
} vision_context_flag = qcontext;
inData.input_type = INPUT_DMA_DATA;
memcpy(mem_data_context_model.viraddr, input_ids.data(), mem_config_context_model.mem_size);
inData.input = NULL;
} else {
inData.input = reinterpret_cast<unsigned char*>(input_ids.data());
inData.input_type = BINARY_RAW_DATA;
ret = aml_module_input_set(qcontext, &inData);
if (ret)
{
printf("aml_module_input_set fail.\n");
}
} }
context_model.malloc_buffer_once = false;
inData.input_type = INPUT_DMA_DATA;
memcpy(vision_mem_data.viraddr, input_data.data(), vision_mem_config.mem_size);
inData.input = NULL;
memset(&outconfig, 0, sizeof(aml_output_config_t)); memset(&outconfig, 0, sizeof(aml_output_config_t));
outconfig.format = AML_OUTDATA_DMA;
if (context_model.use_dma) {
outconfig.format = AML_OUTDATA_DMA;
} else {
outconfig.format = AML_OUTDATA_RAW;
}
outconfig.typeSize = sizeof(aml_output_config_t); outconfig.typeSize = sizeof(aml_output_config_t);
outdata = (nn_output*)aml_module_output_get(qcontext, outconfig); outdata = (nn_output*)aml_module_output_get(qcontext, outconfig);
return reinterpret_cast<float*>(outdata->out[0].buf); if (outdata == NULL || outdata->out[0].buf == NULL) {
} printf("Vision model inference failed.\n");
return {};
int extract_index(const std::string& filename) {
std::regex pattern(R"(test_\w+_(\d+)\.jpg)");
std::smatch match;
if (std::regex_match(filename, match, pattern)) {
return std::stoi(match[1]);
} }
return -1;
// Copy output to vector
size_t output_size = outdata->out[0].size / sizeof(float);
float* output_ptr = reinterpret_cast<float*>(outdata->out[0].buf);
std::vector<float> result(output_ptr, output_ptr + output_size);
return result;
} }
std::vector<std::string> process_image_dir( std::vector<float> run_text_model(void* qcontext, const std::vector<int64_t>& input_ids)
void* context_model,
const std::string& image_dir_path,
const std::string& base_dir,
const std::string& json_filename)
{ {
std::vector<std::string> results; int ret = 0;
std::regex file_pattern(R"(test_(\w+)_\d+\.jpg)"); nn_input inData;
nn_output *outdata = NULL;
// Get base_dir from parameter, environment variable, or use default aml_output_config_t outconfig;
std::string actual_base_dir = base_dir;
if (actual_base_dir.empty()) { inData.input_index = 0;
const char* env_base_dir = std::getenv("CLIP_BASE_DIR"); inData.info.input_format = AML_INPUT_DEFAULT;
if (env_base_dir != nullptr) { inData.size = input_ids.size() * sizeof(int64_t);
actual_base_dir = env_base_dir;
} else { // Use DMA
actual_base_dir = "./demo_data/clip_datasets/"; if (!text_context_flag) {
} text_mem_config.cache_type = AML_WITH_CACHE;
} text_mem_config.memory_type = AML_VIRTUAL_ADDR;
text_mem_config.direction = AML_MEM_DIRECTION_READ_WRITE;
// Ensure base_dir ends with '/' text_mem_config.index = 0;
if (!actual_base_dir.empty() && actual_base_dir.back() != '/') { text_mem_config.mem_size = inData.size;
actual_base_dir += "/"; aml_util_mallocBuffer(qcontext, &text_mem_config, &text_mem_data);
} aml_util_swapExternalInputBuffer(qcontext, &text_mem_config, &text_mem_data);
text_context_flag = qcontext;
// Get json_filename from parameter, environment variable, or use default
std::string actual_json_filename = json_filename;
if (actual_json_filename.empty()) {
const char* env_json_filename = std::getenv("CLIP_JSON_FILENAME");
if (env_json_filename != nullptr) {
actual_json_filename = env_json_filename;
} else {
actual_json_filename = "clip_text_res.json";
}
} }
// storing qualified paths inData.input_type = INPUT_DMA_DATA;
std::vector<fs::directory_entry> matched_files; memcpy(text_mem_data.viraddr, input_ids.data(), text_mem_config.mem_size);
inData.input = NULL;
// collect all relevant img. memset(&outconfig, 0, sizeof(aml_output_config_t));
for (const auto& entry : fs::directory_iterator(image_dir_path)) { outconfig.format = AML_OUTDATA_DMA;
if (!entry.is_regular_file()) continue; outconfig.typeSize = sizeof(aml_output_config_t);
outdata = (nn_output*)aml_module_output_get(qcontext, outconfig);
std::string filename = entry.path().filename().string(); if (outdata == NULL || outdata->out[0].buf == NULL) {
if (std::regex_match(filename, file_pattern)) { printf("Text model inference failed.\n");
matched_files.push_back(entry); return {};
}
} }
// use index sort, test_type_index.jpg // Copy output to vector
std::sort(matched_files.begin(), matched_files.end(), size_t output_size = outdata->out[0].size / sizeof(float);
[](const fs::directory_entry& a, const fs::directory_entry& b) { float* output_ptr = reinterpret_cast<float*>(outdata->out[0].buf);
return extract_index(a.path().filename().string()) < std::vector<float> result(output_ptr, output_ptr + output_size);
extract_index(b.path().filename().string());
});
for (const auto& entry : matched_files) { return result;
if (!entry.is_regular_file()) continue;
std::string filename = entry.path().filename().string();
std::smatch match;
if (!std::regex_match(filename, match, file_pattern)) continue;
std::string name = match[1];
std::vector<float> input_data = preprocess_image(entry.path().string());
float* model_output = run_network(context_model, input_data, name);
float max_sim = -std::numeric_limits<float>::infinity();
std::string best_key, best_id;
// Iterate through all directories to find the directory containing the name
for (const auto& dir_entry : fs::directory_iterator(actual_base_dir)) {
if (!dir_entry.is_directory()) continue;
std::string folder_name = dir_entry.path().filename().string();
if (folder_name.find(name) == std::string::npos) continue;
std::string vit_res_path = actual_base_dir + folder_name + "/" + actual_json_filename;
std::ifstream vit_in(vit_res_path);
if (!vit_in.is_open()) {
printf("unopen: %s\n", vit_res_path.c_str());
continue;
}
json vit_json;
vit_in >> vit_json;
for (auto it = vit_json.begin(); it != vit_json.end(); ++it) {
const std::string& key = it.key();
const std::vector<float> vec = it.value().get<std::vector<float>>();
float sim = post_process(model_output, vec);
// printf("sim: %.4f\n", sim);
if (sim > max_sim) {
max_sim = sim;
best_key = key;
best_id = folder_name;
}
}
}
if (!best_key.empty() && !best_id.empty()) {
std::string best_path = actual_base_dir + best_id + "/";
results.push_back(best_path);
printf("\nProcessing images: %s, datasets img path: %s\n", filename.c_str(), best_path.c_str());
// printf("Most similar image: %s similarity: %.4f\n", best_path.c_str(), max_sim); // for debug
}
}
return results;
} }
int destroy_network(void *qcontext) int destroy_network(void *qcontext)
{ {
int ret = 0; int ret = 0;
/* free model if (vision_context_flag == qcontext) {
model.use_dma = true printf("Free vision model memory.\n");
model.malloc_buffer_once = false aml_util_freeBuffer(qcontext, &vision_mem_config, &vision_mem_data);
*/ vision_context_flag = nullptr;
if (context_model.use_dma && mem_config_context_model.mem_size != 0) { } else if (text_context_flag == qcontext) {
ret = aml_util_freeBuffer(qcontext, &mem_config_context_model, &mem_data_context_model); printf("Free text model memory.\n");
if (ret) aml_util_freeBuffer(qcontext, &text_mem_config, &text_mem_data);
{ text_context_flag = nullptr;
std::cout << "aml_util_freeBuffer fail." << std::endl; } else {
} printf("Free network failed: context not found.\n");
return -1;
} }
context_model.use_dma = false;
ret = aml_module_destroy(qcontext); ret = aml_module_destroy(qcontext);
if (ret) if (ret)
{ {
printf("aml_module_destroy fail.\n"); printf("Free network failed: destroy failed.\n");
return -1; return -1;
} }
return ret; return ret;
} }
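
Both `run_vision_model` and `run_text_model` allocate the DMA buffer once, sized from the first input, and on every later call copy `mem_size` bytes from the caller's vector. If a later input were shorter than the first, that `memcpy` would read past the end of the vector. One possible guard, sketched below under the assumption that input sizes are meant to stay fixed per model, is to compare sizes before copying. `copy_exact` is a hypothetical helper and not part of the AMLNN SDK.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical guard: copies src into dst only when the payload size
// matches the size the destination buffer was allocated with.
// Null pointers are treated as programming errors (assert); a size
// mismatch is treated as a runtime condition (returns false, no copy).
static bool copy_exact(void* dst, std::size_t dst_size,
                       const void* src, std::size_t src_size)
{
    assert(dst != nullptr && src != nullptr);
    if (dst_size != src_size) {
        return false;
    }
    std::memcpy(dst, src, src_size);
    return true;
}
```

In `run_vision_model` this might be called as `copy_exact(vision_mem_data.viraddr, vision_mem_config.mem_size, input_data.data(), inData.size)`, with the function returning an empty vector when the result is `false`.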


@ -19,13 +19,13 @@
#include <algorithm> #include <algorithm>
#include <string> #include <string>
#include <iostream> #include <iostream>
#include "model_invoke.h" #include "clip_process.h"
#define STB_IMAGE_IMPLEMENTATION #define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h" #include "stb_image.h"
// bilinear interpolation scaling // bilinear interpolation scaling
std::vector<float> resize_bilinear( static std::vector<float> resize_bilinear(
const unsigned char* src, int src_w, int src_h, int channels, const unsigned char* src, int src_w, int src_h, int channels,
int dst_w, int dst_h) int dst_w, int dst_h)
{ {
@ -102,29 +102,29 @@ std::vector<float> preprocess_image(const std::string& image_path) {
} }
} }
// get NHWC // Return NHWC format (batch dimension will be added in caller)
return cropped; return cropped;
} }
float post_process(const float* a, const std::vector<float>& b) { // ==================== Post Processing ====================
float dot = 0.0f, scale = 100.00000762939453f;
for (size_t i = 0; i < b.size(); ++i) { std::vector<float> l2_normalize(const std::vector<float>& vec)
dot += a[i] * b[i]; {
float norm = 0.0f;
for (float v : vec) {
norm += v * v;
} }
dot *= scale; norm = std::sqrt(norm) + 1e-12f;
return dot;
std::vector<float> result(vec.size());
for (size_t i = 0; i < vec.size(); ++i) {
result[i] = vec[i] / norm;
}
return result;
} }
float post_process(const int8_t* a, const std::vector<float>& b) { std::vector<float> softmax(const std::vector<float>& logits)
float dot = 0.0f, scale = 100.00000762939453f; {
for (size_t i = 0; i < b.size(); ++i) {
dot += (a[i] - 66) * b[i];
}
dot *= scale;
return dot;
}
std::vector<float> softmax(const std::vector<float>& logits) {
std::vector<float> result(logits.size()); std::vector<float> result(logits.size());
// numerical stability: subtract the maximum value first. // numerical stability: subtract the maximum value first.
@ -142,3 +142,17 @@ std::vector<float> softmax(const std::vector<float>& logits) {
return result; return result;
} }
float compute_similarity(const std::vector<float>& a, const std::vector<float>& b, float scale)
{
if (a.size() != b.size()) {
printf("Feature dimension mismatch: %zu vs %zu\n", a.size(), b.size());
return 0.0f;
}
float dot = 0.0f;
for (size_t i = 0; i < a.size(); ++i) {
dot += a[i] * b[i];
}
return dot * scale;
}
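
`l2_normalize`, `compute_similarity`, and `softmax` are chained in `main` as follows: normalize both embeddings, take the dot product (which is then the cosine similarity), multiply by `logit_scale`, and apply softmax across the texts. The self-contained sketch below reproduces that chain on toy 2-D vectors. The helper names (`unit`, `dot`, `stable_softmax`, `rank_desc`) and the toy data are assumptions for illustration and do not come from the demo.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Scale v to unit L2 length (small epsilon avoids division by zero).
static std::vector<float> unit(const std::vector<float>& v)
{
    float norm = 0.0f;
    for (float x : v) norm += x * x;
    norm = std::sqrt(norm) + 1e-12f;
    std::vector<float> out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) out[i] = v[i] / norm;
    return out;
}

// Dot product; equals cosine similarity when both inputs are unit length.
static float dot(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Softmax with the maximum subtracted first for numerical stability.
static std::vector<float> stable_softmax(const std::vector<float>& logits)
{
    std::vector<float> p(logits.size());
    if (logits.empty()) return p;
    const float mx = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp(logits[i] - mx);
        sum += p[i];
    }
    for (float& x : p) x /= sum;
    return p;
}

// Indices of probs ordered from highest to lowest probability.
static std::vector<std::size_t> rank_desc(const std::vector<float>& probs)
{
    std::vector<std::size_t> idx(probs.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::sort(idx.begin(), idx.end(),
              [&probs](std::size_t a, std::size_t b) { return probs[a] > probs[b]; });
    return idx;
}
```

With a scale of 100, a cosine gap of 0.2 between two texts becomes a logit gap of 20, so the softmax output is usually dominated by the single best match.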


@ -1,304 +1,339 @@
import numpy as np # -*- coding: utf-8 -*-
import os """
import argparse Copyright (C) 2024-2025 Amlogic, Inc. All rights reserved.
import json
import re Licensed under the Apache License, Version 2.0 (the "License");
from PIL import Image you may not use this file except in compliance with the License.
from amlnnlite.api import AMLNNLite You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
def preprocess_image(image_path: str, target_size: int = 224) -> np.ndarray:
""" Unless required by applicable law or agreed to in writing, software
Preprocess image for CLIP model. distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
Steps: See the License for the specific language governing permissions and
1. Load image and convert to RGB limitations under the License.
2. Scale the shorter side to target_size """
3. Center crop to target_size x target_size
4. Normalize with CLIP mean and std # This inference script is designed for CLIP model using AMLNNLite.
Args: import os
image_path (str): Path to input image import argparse
target_size (int): Target image size (default: 224) import numpy as np
from PIL import Image
Returns: from transformers import CLIPTokenizer
np.ndarray: Preprocessed image data with shape (target_size, target_size, 3) from amlnnlite.api import AMLNNLite
"""
# Load image # ==================== Utility Functions ====================
img = Image.open(image_path).convert("RGB")
width, height = img.size def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
"""Compute softmax values for array x."""
# Scale the shorter side x = x - np.max(x, axis=axis, keepdims=True)
scale = target_size / min(width, height) e = np.exp(x)
new_w = int(round(width * scale)) return e / np.sum(e, axis=axis, keepdims=True)
new_h = int(round(height * scale))
# Resize def l2_normalize(x: np.ndarray, axis: int = -1, eps: float = 1e-12) -> np.ndarray:
img = img.resize((new_w, new_h), Image.BILINEAR) """L2 normalize array x along specified axis."""
return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)
# Center crop
left = (new_w - target_size) // 2 # ==================== Vision Preprocessing ====================
top = (new_h - target_size) // 2
img = img.crop((left, top, left + target_size, top + target_size)) def preprocess_image(image_path: str, target_size: int = 224) -> np.ndarray:
"""
# Convert to numpy array and normalize to [0, 1] Preprocess image for CLIP model.
img_array = np.array(img, dtype=np.float32) / 255.0
Args:
# CLIP normalization image_path (str): Path to input image
mean = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32) target_size (int): Target image size (default: 224)
std = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
Returns:
# Normalize: (x - mean) / std np.ndarray: Preprocessed image data with shape (1, target_size, target_size, 3) in NHWC format
img_array = (img_array - mean) / std """
image = Image.open(image_path).convert("RGB")
# Return in NHWC format width, height = image.size
return img_array
# Scale the shorter side
scale = target_size / min(width, height)
def post_process( new_width = int(width * scale)
image_features: np.ndarray, new_height = int(height * scale)
text_features: np.ndarray, image_resized = image.resize((new_width, new_height), resample=Image.BICUBIC)
scale: float = 100.00000762939453,
use_cosine: bool = True, # Center crop
apply_scale: bool = True, left = (new_width - target_size) // 2
) -> float: top = (new_height - target_size) // 2
""" right = left + target_size
Calculate similarity between image and text features. bottom = top + target_size
image_cropped = image_resized.crop((left, top, right, bottom))
Args:
image_features (np.ndarray): Image feature vector # Convert to numpy array and normalize to [0, 1]
text_features (np.ndarray): Text feature vector image_np = np.array(image_cropped).astype(np.float32) / 255.0
scale (float): Scale factor for similarity calculation
use_cosine (bool): If True, L2-normalize both vectors before dot product (cosine similarity) # CLIP normalization
apply_scale (bool): If True, multiply by scale after dot product mean = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
std = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
Returns: image_np = (image_np - mean) / std
float: Similarity score
""" # Add batch dimension: HWC -> NHWC
img_vec = image_features.flatten().astype(np.float32) image_np = np.expand_dims(image_np, axis=0)
txt_vec = np.array(text_features, dtype=np.float32).flatten()
return image_np.astype(np.float32) # [1, 224, 224, 3]
if len(img_vec) != len(txt_vec):
raise ValueError(f"Feature dimension mismatch: image={len(img_vec)}, text={len(txt_vec)}") # ==================== Text Preprocessing ====================
if use_cosine: def preprocess_text(tokenizer: CLIPTokenizer, text: str, max_len: int = 64) -> np.ndarray:
img_norm = np.linalg.norm(img_vec) + 1e-8 """
txt_norm = np.linalg.norm(txt_vec) + 1e-8 Preprocess text for CLIP model using CLIPTokenizer.
img_vec = img_vec / img_norm
txt_vec = txt_vec / txt_norm Args:
tokenizer: CLIPTokenizer instance
dot_product = np.dot(img_vec, txt_vec) text (str): Input text string
max_len (int): Maximum sequence length (default: 64)
similarity = dot_product * scale if apply_scale else dot_product
Returns:
return float(similarity) np.ndarray: Tokenized text with shape (1, max_len) as int64
"""
enc = tokenizer(
def extract_index(filename: str) -> int: text,
""" padding="max_length",
Extract index from filename pattern: test_xxx_index.jpg truncation=True,
max_length=max_len,
Args: return_tensors="np",
filename (str): Filename to extract index from )
# text model input: int64[1, max_len]
Returns: input_ids = enc["input_ids"].astype(np.int64)
int: Extracted index, or -1 if pattern doesn't match return input_ids
"""
pattern = r"test_\w+_(\d+)\.jpg" # ==================== Model Inference ====================
match = re.match(pattern, filename)
if match: def compute_image_embedding(vision_amlnn: AMLNNLite, image_path: str) -> np.ndarray:
return int(match.group(1)) """
return -1 Compute image embedding using vision model.
Args:
def process_image_dir( vision_amlnn: AMLNNLite instance for vision model
amlnn: AMLNNLite, image_path (str): Path to input image
image_dir_path: str,
base_dir: str = "", Returns:
json_filename: str = "" np.ndarray: L2-normalized image embedding with shape (1, embed_dim)
) -> list: """
""" input_data = preprocess_image(image_path) # [1, 224, 224, 3]
Process image directory and find best matching text dataset.
outputs = vision_amlnn.inference(
Args: inputs=[input_data],
amlnn: AMLNNLite instance inputs_data_format='NHWC',
image_dir_path (str): Path to directory containing test images outputs_data_format='NHWC'
base_dir (str): Base directory for clip datasets (optional, can use CLIP_BASE_DIR env var) )
json_filename (str): JSON filename in each dataset folder (optional, can use CLIP_JSON_FILENAME env var)
feats = outputs[0].astype(np.float32)
Returns: feats = feats.reshape(1, -1) # Squeeze to [1, embed_dim]
list: List of best matching dataset paths return l2_normalize(feats, axis=1)
"""
results = [] def compute_text_embedding(text_amlnn: AMLNNLite, tokenizer: CLIPTokenizer, text: str, max_len: int = 64) -> np.ndarray:
file_pattern = re.compile(r"test_(\w+)_\d+\.jpg") """
image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.JPG', '.JPEG', '.PNG', '.BMP'} Compute text embedding using text model.
if not base_dir: Args:
base_dir = os.getenv("CLIP_BASE_DIR", "./clip_datasets/") text_amlnn: AMLNNLite instance for text model
tokenizer: CLIPTokenizer instance
if not json_filename: text (str): Input text string
json_filename = os.getenv("CLIP_JSON_FILENAME", "clip_text_res.json") max_len (int): Maximum sequence length
matched_files = [] Returns:
if os.path.isdir(image_dir_path): np.ndarray: L2-normalized text embedding with shape (1, embed_dim)
for filename in os.listdir(image_dir_path): """
filepath = os.path.join(image_dir_path, filename) input_ids = preprocess_text(tokenizer, text, max_len) # [1, max_len]
if os.path.isfile(filepath): print(f"input_ids: {input_ids}")
if file_pattern.match(filename):
matched_files.append((filename, filepath, True)) # AMLNNLite requires 4D input, reshape to (1, 1, 1, max_len)
elif any(filename.lower().endswith(ext) for ext in image_extensions): input_ids_4d = input_ids[:, None, None, :] # [1, 1, 1, max_len]
matched_files.append((filename, filepath, False))
elif os.path.isfile(image_dir_path): outputs = text_amlnn.inference(
filename = os.path.basename(image_dir_path) inputs=[input_ids_4d],
if any(filename.lower().endswith(ext) for ext in image_extensions): inputs_data_format='NHWC',
has_pattern = bool(file_pattern.match(filename)) outputs_data_format='NHWC'
matched_files.append((filename, image_dir_path, has_pattern)) )
else:
print(f"Error: {image_dir_path} is not a valid image file") feats = outputs[0].astype(np.float32)
return results feats = feats.reshape(1, -1) # Squeeze to [1, embed_dim]
else: return l2_normalize(feats, axis=1)
print(f"Error: {image_dir_path} is not a valid directory or file")
return results def compute_text_embeddings_batch(text_amlnn: AMLNNLite, tokenizer: CLIPTokenizer, texts: list, max_len: int = 64) -> np.ndarray:
"""
if not matched_files: Compute text embeddings for multiple texts.
print(f"Warning: No image files found in {image_dir_path}")
return results Args:
text_amlnn: AMLNNLite instance for text model
print(f"Found {len(matched_files)} image file(s) to process") tokenizer: CLIPTokenizer instance
texts (list): List of input text strings
matched_files.sort(key=lambda x: extract_index(x[0]) if x[2] else 999999) max_len (int): Maximum sequence length
# Process each image Returns:
for filename, filepath, has_pattern in matched_files: np.ndarray: L2-normalized text embeddings with shape (num_texts, embed_dim)
if has_pattern: """
match = file_pattern.match(filename) embeddings = []
if match: for text in texts:
name = match.group(1) emb = compute_text_embedding(text_amlnn, tokenizer, text, max_len)
else: embeddings.append(emb[0]) # Remove batch dimension
name = "" return np.stack(embeddings, axis=0) # [num_texts, embed_dim]
else:
name = "" # ==================== Similarity Calculation ====================
# Preprocess image def compute_similarity(image_embedding: np.ndarray, text_embeddings: np.ndarray, logit_scale: float = 100.0) -> tuple:
try: """
input_data = preprocess_image(filepath) Compute similarity between image and text embeddings.
input_data = np.expand_dims(input_data, axis=0)
except Exception as e: Args:
print(f"Error preprocessing image {filename}: {e}") image_embedding (np.ndarray): Image embedding with shape (1, embed_dim)
continue text_embeddings (np.ndarray): Text embeddings with shape (num_texts, embed_dim)
logit_scale (float): Scale factor for logits
# Run inference
try: Returns:
outputs = amlnn.inference(inputs=[input_data]) tuple: (similarities, logits, probabilities)
model_output = outputs[0] """
if isinstance(model_output, np.ndarray): # Cosine similarity (embeddings are already L2-normalized)
model_output = model_output.astype(np.float32) sims = text_embeddings @ image_embedding[0] # [num_texts]
else: logits = sims * logit_scale # [num_texts]
model_output = np.array(model_output, dtype=np.float32) probs = softmax(logits, axis=0) # [num_texts]
model_output = model_output.flatten()
except Exception as e: return sims, logits, probs
print(f"Error running inference on {filename}: {e}")
continue # ==================== Main Function ====================
max_sim = float('-inf') def main():
best_key = "" parser = argparse.ArgumentParser(description='CLIP Image-Text Matching Demo using AMLNNLite')
best_id = "" parser.add_argument('--vision-model', required=True, help='Path to vision model (.adla)')
parser.add_argument('--text-model', required=True, help='Path to text model (.adla)')
if not os.path.isdir(base_dir): parser.add_argument('--tokenizer-dir', required=True, help='Path to CLIPTokenizer directory')
print(f"Error: Base directory does not exist: {base_dir}") parser.add_argument('--image-path', default=None, help='Path to input image (optional, will prompt if not provided)')
continue parser.add_argument('--texts', nargs='+', default=None, help='List of text descriptions to compare')
parser.add_argument('--max-len', type=int, default=64, help='Maximum token sequence length (default: 64)')
print(f"Searching in base directory: {base_dir}") parser.add_argument('--logit-scale', type=float, default=100.0, help='Logit scale factor (default: 100.0)')
-        folder_count = 0
-        for folder_name in os.listdir(base_dir):
-            folder_path = os.path.join(base_dir, folder_name)
-            if not os.path.isdir(folder_path):
-                continue
-
-            if has_pattern and name and name not in folder_name:
-                continue
-
-            folder_count += 1
-
-            vit_res_path = os.path.join(folder_path, json_filename)
-            if not os.path.isfile(vit_res_path):
-                print(f"Warning: JSON file not found: {vit_res_path}")
-                continue
-            try:
-                with open(vit_res_path, 'r', encoding='utf-8') as f:
-                    vit_json = json.load(f)
-
-                for key, text_vec in vit_json.items():
-                    if isinstance(text_vec, list):
-                        text_features = np.array(text_vec, dtype=np.float32)
-                        sim_scaled = post_process(
-                            model_output,
-                            text_features,
-                            use_cosine=True,
-                            apply_scale=True,
-                        )
-                        if sim_scaled > max_sim:
-                            max_sim = sim_scaled
-                            best_key = key
-                            best_id = folder_name
-            except Exception as e:
-                print(f"Error loading JSON file {vit_res_path}: {e}")
-                continue
-
-        if best_key and best_id:
-            best_path = os.path.join(base_dir, best_id)
-            results.append(best_path)
-            print(f"\nProcessing image: {filename}")
-            print(f" Best matching dataset: {best_path}")
-        else:
-            print(f"\nProcessing image: {filename}")
-            print(f" No matching dataset found (searched {folder_count} folder(s))")
-    return results
-
-
-def main():
-    parser = argparse.ArgumentParser(description='CLIP Image-Text Matching Demo')
-    parser.add_argument('--model-path', required=True, help='Path to the CLIP model file')
-    parser.add_argument('--base-dir', default='./clip_datasets/', help='Base directory for clip datasets (can also use CLIP_BASE_DIR env var)')
-    parser.add_argument('--json-filename', default='clip_text_res.json', help='JSON filename in each dataset folder (can also use CLIP_JSON_FILENAME env var, default: clip_text_res.json)')
-    parser.add_argument('--image-dir', default='./', help='Image directory or single image file to process (optional, will prompt if not provided)')
-    args = parser.parse_args()
-
-    # Initialize AMLNNLite
-    print("Initializing model...")
-    amlnn = AMLNNLite()
-    amlnn.config(model_path=args.model_path)
-    amlnn.init()
-    print("Model initialized successfully.\n")
-
-    # Process images
-    if args.image_dir:
-        results = process_image_dir(amlnn, args.image_dir, args.base_dir, args.json_filename)
-        print(f"\nTotal results: {len(results)}")
-        for i, result in enumerate(results):
-            print(f"Index[{i}]: {result}")
-    else:
-        while True:
-            image_path = input("\nPlease enter the JPG image path or directory (enter 'exit' to quit):\n").strip()
-            if image_path.lower() == 'exit':
-                break
-
-            if not image_path:
-                print("The path cannot be empty.")
-                continue
-
-            results = process_image_dir(amlnn, image_path, args.base_dir, args.json_filename)
-
-            for i, result in enumerate(results):
-                print(f"Index[{i}]: {result}")
-
-    amlnn.uninit()
-    print("\nDone.")
-
-if __name__ == "__main__":
-    main()
+    args = parser.parse_args()
+
+    # Validate model paths
+    if not os.path.exists(args.vision_model):
+        print(f"[Error] Vision model not found: {args.vision_model}")
+        return -1
+
+    if not os.path.exists(args.text_model):
+        print(f"[Error] Text model not found: {args.text_model}")
+        return -1
+
+    # Load tokenizer
+    print(f"[Info] Loading CLIPTokenizer from: {args.tokenizer_dir}")
+    tokenizer = CLIPTokenizer.from_pretrained(args.tokenizer_dir)
+
+    # Initialize vision model
+    print(f"[Info] Initializing vision model: {args.vision_model}")
+    vision_amlnn = AMLNNLite()
+    vision_amlnn.config(model_path=args.vision_model, run_cycles=1)
+    vision_amlnn.init()
+
+    # Initialize text model
+    print(f"[Info] Initializing text model: {args.text_model}")
+    text_amlnn = AMLNNLite()
+    text_amlnn.config(model_path=args.text_model, run_cycles=1)
+    text_amlnn.init()
+
+    print("[Info] Models initialized successfully.\n")
+
+    try:
+        # Interactive loop
+        while True:
+            # Get image path
+            if args.image_path:
+                image_path = args.image_path
+                args.image_path = None  # Clear for next iteration
+            else:
+                print("=" * 60)
+                print("[Info] Image Path (or 'exit' to quit):")
+                image_path = input().strip()
+
+            # Check for exit
+            if image_path.lower() == 'exit':
+                print("[Info] Exiting...")
+                break
+
+            # Validate image path
+            if not image_path:
+                print("[Warning] Please enter an image path.")
+                continue
+
+            if not os.path.exists(image_path):
+                print(f"[Error] Image not found: {image_path}")
+                continue
+
+            # Get texts to compare
+            if args.texts:
+                texts = args.texts
+                args.texts = None  # Clear for next iteration
+            else:
+                print("[Info] Enter text descriptions (comma-separated, or 'skip' to use defaults):")
+                text_input = input().strip()
+
+                if text_input.lower() == 'skip' or not text_input:
+                    # Default texts for demo
+                    texts = [
+                        "a red handbag",
+                        "a blue jacket",
+                        "a red bus",
+                    ]
+                    print(f"[Info] Using default texts: {texts}")
+                else:
+                    texts = [t.strip() for t in text_input.split(',') if t.strip()]
+                    if not texts:
+                        print("[Warning] No texts provided.")
+                        continue
+
+            try:
+                # Compute image embedding
+                print(f"\n[Info] Processing image: {image_path}")
+                image_embedding = compute_image_embedding(vision_amlnn, image_path)
+                print(f"[Info] Image embedding shape: {image_embedding.shape}")
+
+                # Compute text embeddings
+                print(f"[Info] Processing {len(texts)} text(s)...")
+                text_embeddings = compute_text_embeddings_batch(text_amlnn, tokenizer, texts, args.max_len)
+                print(f"[Info] Text embeddings shape: {text_embeddings.shape}")
+
+                # Compute similarity
+                sims, logits, probs = compute_similarity(image_embedding, text_embeddings, args.logit_scale)
+
+                # Print results
+                print("\n" + "=" * 60)
+                print("CLIP Image-Text Matching Results")
+                print("=" * 60)
+                print(f"Image: {image_path}")
+                print(f"logit_scale: {args.logit_scale:.6f}")
+                print("-" * 60)
+                # Sort by probability (descending)
+                sorted_indices = np.argsort(probs)[::-1]
+                for rank, i in enumerate(sorted_indices):
+                    print(f"[{rank + 1}] prob={probs[i]:.6f} sim={float(sims[i]):.6f} text='{texts[i]}'")
+                print("=" * 60 + "\n")
+            except Exception as e:
+                print(f"[Error] Processing failed: {e}")
+                import traceback
+                traceback.print_exc()
+                continue
+    except KeyboardInterrupt:
+        print("\n\n[Info] Interrupted by user. Exiting...")
+    finally:
+        # Cleanup
+        vision_amlnn.uninit()
+        text_amlnn.uninit()
+        print("[Info] Done.")
+    return 0
+
+if __name__ == "__main__":
+    import sys
+    sys.exit(main())
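
The new `main()` unpacks `sims, logits, probs` from `compute_similarity(image_embedding, text_embeddings, args.logit_scale)`, but the body of that function is not visible in this hunk. The following is a minimal sketch of what such a function usually does for CLIP, assuming cosine similarity on L2-normalized embeddings followed by a softmax over `logit_scale * sims`. The name `compute_similarity_sketch`, the `(D,)` and `(N, D)` array shapes, and the small epsilon are assumptions for illustration, not the demo's actual code.

```python
import numpy as np


def compute_similarity_sketch(image_embedding, text_embeddings, logit_scale):
    """Hypothetical stand-in for compute_similarity(); not the demo's real code."""
    # Flatten the image embedding to (D,) and the text embeddings to (N, D).
    img = np.asarray(image_embedding, dtype=np.float32).reshape(-1)
    txt = np.asarray(text_embeddings, dtype=np.float32).reshape(-1, img.shape[0])

    # L2-normalize so that the dot product equals cosine similarity.
    # The epsilon (an assumption) guards against division by zero.
    img = img / (np.linalg.norm(img) + 1e-12)
    txt = txt / (np.linalg.norm(txt, axis=1, keepdims=True) + 1e-12)

    sims = txt @ img                 # (N,) cosine similarities
    logits = logit_scale * sims      # temperature-scaled scores
    shifted = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return sims, logits, probs
```

Under these assumptions, `probs` sums to 1 across the candidate texts, so the ranking printed by the demo (`np.argsort(probs)[::-1]`) orders texts by cosine similarity; `logit_scale` only controls how sharply the probabilities concentrate on the best match.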
