feat:update demo code of CLIP

dian.yuan 2026-02-12 11:19:52 +08:00
parent 4bf4aafc73
commit 5478a8618b
12 changed files with 50385 additions and 694 deletions

BIN
examples/clip/000000004505.jpg Executable file

Binary file not shown.

Size: 210 KiB

View file

@ -1,95 +1,159 @@
## Demo Run
### CPP
#### 1. Compile
**Prerequisites:**
- Android NDK (r25e recommended)
- `ANDROID_NDK_PATH` environment variable set
**Build:**
```bash
# Build for arm64-v8a
cd examples/clip/cpp
./build-android.sh -a arm64-v8a
```
The executable will be generated at `build/android_arm64-v8a/clip_demo` (Note: executable name may vary, verify in build folder).
#### 2. Run
```bash
# Push executable to device
adb push build/android_arm64-v8a/clip_demo /data/local/tmp/
adb push model/vision_model_int8_A311D2.adla /data/local/tmp/
adb push clip_datasets/ /data/local/tmp/
adb push test_hat_0.jpg /data/local/tmp/
# Run on device
adb shell
cd /data/local/tmp
chmod +x clip_demo
export LD_LIBRARY_PATH=/vendor/lib64  # or /vendor/lib, depending on the device
# Usage: ./clip_demo <model_path> [base_dir] [json_filename]
./clip_demo vision_model_int8_A311D2.adla ./clip_datasets/ clip_text_res.json
```
**Note:**
- Replace `vision_model_int8_A311D2.adla` with your actual model file path.
- The `base_dir` and `json_filename` parameters are optional. You can also use environment variables `CLIP_BASE_DIR` and `CLIP_JSON_FILENAME`.
- The program will prompt you to enter image paths interactively. Enter "exit" to quit.
### Python
**Prerequisites:**
- Python 3.10
- Required packages: `numpy`, `Pillow`, `amlnnlite`
**Install dependencies:**
```bash
pip install numpy Pillow amlnnlite-1.0.0-cp310-cp310-linux_aarch64.whl
```
**Run on device:**
```bash
# Basic usage (process current directory)
python clip.py --model-path ./vision_model_int8_A311D2.adla
# Specify image directory or file
python clip.py --model-path ./vision_model_int8_A311D2.adla --image-dir ./
# Specify base directory and JSON filename
python clip.py --model-path ./vision_model_int8_A311D2.adla --base-dir ./clip_datasets/ --json-filename clip_text_res.json
```
The script will automatically process all image files (`.jpg`, `.jpeg`, `.png`, `.bmp`) in the specified directory or process a single image file, and display the best matching dataset for each image.
5. Results
The program will print the best matching dataset path for each processed image. The program searches through all dataset folders in the base directory and finds the text feature with the highest similarity to the input image.
**Example output:**
```
# python demo result
Model initialized successfully.
Found 2 image file(s) to process
Searching in base directory: ./clip_datasets/
Processing image: test_jacket_0.jpg
Best matching dataset: ./clip_datasets/shirt10_jacket7
Searching in base directory: ./clip_datasets/
Processing image: test_hat_0.jpg
Best matching dataset: ./clip_datasets/hat1_jd
Total results: 2
Index[0]: ./clip_datasets/shirt10_jacket7
Index[1]: ./clip_datasets/hat1_jd
Done.
```
The program returns the dataset folder path that contains the text feature with the highest similarity to the input image. Each result represents the best matching dataset for the corresponding input image.
# CLIP
## 1. Overview
This demo demonstrates how to run CLIP (Contrastive Language-Image Pre-Training) image-text matching using AMLNNLite. The CLIP model consists of two parts: a vision encoder and a text encoder, which work together to compute similarity between images and text descriptions.
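As a rough illustration of what "computing similarity" means for two embeddings, the sketch below L2-normalizes both vectors and takes their dot product, which gives the cosine similarity. The names `l2_normalize_sketch` and `cosine_similarity_sketch` are hypothetical and chosen for this example; they are not the demo's own functions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: scale a vector to unit L2 norm.
// A zero vector is returned unchanged to avoid dividing by zero.
std::vector<float> l2_normalize_sketch(const std::vector<float>& v) {
    double sum_sq = 0.0;
    for (float x : v) sum_sq += static_cast<double>(x) * x;
    const double norm = std::sqrt(sum_sq);
    if (norm == 0.0) return v;
    std::vector<float> out;
    out.reserve(v.size());
    for (float x : v) out.push_back(static_cast<float>(x / norm));
    return out;
}

// Hypothetical helper: cosine similarity of two equally sized vectors.
// Returns 0 when the sizes differ.
float cosine_similarity_sketch(const std::vector<float>& a,
                               const std::vector<float>& b) {
    if (a.size() != b.size()) return 0.0f;
    const std::vector<float> na = l2_normalize_sketch(a);
    const std::vector<float> nb = l2_normalize_sketch(b);
    double dot = 0.0;
    for (std::size_t i = 0; i < na.size(); ++i) {
        dot += static_cast<double>(na[i]) * nb[i];
    }
    return static_cast<float>(dot);
}
```

Because both vectors are normalized first, the result depends only on their direction, not on their magnitude.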
## 2. Model Download
TO DO
## 3. Model Conversion
TO DO
## 4. Demo Run
### CPP
#### 1. Compile
**Prerequisites:**
- Android NDK (r25e recommended)
- `ANDROID_NDK_PATH` environment variable set
**Build:**
```bash
# Build for arm64-v8a
cd examples/clip/cpp
./build-android.sh -a arm64-v8a
```
The executable will be generated at `build/android_arm64-v8a/clip_demo`.
#### 2. Run
```bash
# Push executable and resources to device
adb push build/android_arm64-v8a/clip_demo /data/local/tmp/
adb push model/vision_model_int8_S905X5.adla /data/local/tmp/
adb push model/text_model_int8_S905X5.adla /data/local/tmp/
adb push tokenizer_path/ /data/local/tmp/
# Run on device
adb shell
cd /data/local/tmp
chmod +x clip_demo
export LD_LIBRARY_PATH=/vendor/lib64  # or /vendor/lib, depending on the device
# Usage: ./clip_demo <vision_model> <text_model> <tokenizer_path> [--profiling]
./clip_demo vision_model_int8_S905X5.adla text_model_int8_S905X5.adla ./tokenizer_path/
```
The program will prompt for image paths and text descriptions interactively. Enter the path to an image file, then enter comma-separated text descriptions (or `skip` to use defaults). Type `exit` to quit.
**Argument Descriptions:**
| Argument | Description |
| -------------- | ------------------------------------------------------------ |
| vision_model | Path to vision encoder .adla model (required) |
| text_model | Path to text encoder .adla model (required) |
| tokenizer_path | Path to directory containing `vocab.json` and `merges.txt` (required) |
| --profiling | Enable performance profiling output (optional) |
**Note:** The `tokenizer_path` should contain `vocab.json` and `merges.txt` files from the CLIP tokenizer (e.g., from `openai/clip-vit-base-patch32`).
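Before starting the demo, it can help to confirm that both tokenizer files are present. The following is a minimal sketch of such a check; `missing_tokenizer_files_sketch` is a hypothetical helper written for this example and is not part of the demo sources.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: report which of the two expected tokenizer files
// cannot be opened inside `dir`. An empty result means both were found.
std::vector<std::string> missing_tokenizer_files_sketch(std::string dir) {
    // Make sure the directory ends with a separator before appending names.
    if (!dir.empty() && dir.back() != '/') dir += '/';
    std::vector<std::string> missing;
    const char* const names[] = {"vocab.json", "merges.txt"};
    for (const char* name : names) {
        std::ifstream file(dir + name);
        if (!file.is_open()) missing.push_back(name);
    }
    return missing;
}
```

A caller could print the returned names and exit early instead of failing later during tokenizer loading.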
### Python
**Prerequisites:**
- Python 3.10
- Required packages: `numpy`, `Pillow`, `transformers`, `amlnnlite`
**Install dependencies:**
```bash
pip install numpy Pillow transformers amlnnlite-1.0.0-cp310-cp310-linux_aarch64.whl
```
**Run on device:**
```bash
python clip.py \
--vision-model ./vision_model_int8_S905X5.adla \
--text-model ./text_model_int8_S905X5.adla \
--tokenizer-dir ./tokenizer_path \
--image-path ./000000004505.jpg \
--texts "a red handbag" "a blue jacket" "a red bus"
```
**Interactive Mode (Recommended):**
If you don't provide `--image-path`, the program will run in interactive mode:
```bash
python clip.py \
--vision-model ./vision_model_int8_S905X5.adla \
--text-model ./text_model_int8_S905X5.adla \
--tokenizer-dir ./tokenizer_path
```
The program will prompt for image paths and text descriptions. Enter an image path to process, then enter comma-separated texts to compare. Type `exit` to quit.
**Argument Descriptions:**
| Argument | Description |
| ---------------- | ------------------------------------------------------------ |
| --vision-model | Path to vision encoder .adla model (required) |
| --text-model | Path to text encoder .adla model (required) |
| --tokenizer-dir | Path to CLIPTokenizer directory (required) |
| --image-path | Path to input image (.jpg, .png) - optional, will prompt if not provided |
| --texts | List of text descriptions to compare (space-separated) |
| --max-len | Maximum token sequence length, default is 64 |
| --logit-scale | Logit scale factor, default is 100.0 |
**Note:** The `--tokenizer-dir` should point to the directory containing the CLIPTokenizer files. You can use a Hugging Face model ID (e.g., `openai/clip-vit-base-patch32`) or a local directory.
## 5. Results
**Performance Feedback**
When run with the `--profiling` flag (C++) or with the log level set to INFO, the program prints performance metrics to the console, including:
- Hardware Information: System and ADLA library versions.
- Model Overview: Basic input/output configurations.
- NPU Metrics: Total inference time (latency) and total DRAM bandwidth consumption.
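The NPU metrics listed above are reported by the runtime. For host-side stage timing (initialization, preprocessing, inference), a wall-clock measurement is usually enough. The sketch below shows one way to do that with a monotonic clock; `time_ms_sketch` is a hypothetical helper for illustration, not the timer used in the demo.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>

// Hypothetical helper: run `work` once and return the elapsed wall-clock
// time in whole milliseconds, measured with a monotonic clock.
std::int64_t time_ms_sketch(const std::function<void()>& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}
```

A monotonic clock is used so that system time adjustments do not distort the measurement.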
**Interactive Mode Example:**
```bash
$ ./clip_demo vision_model_int8_S905X5.adla text_model_int8_S905X5.adla ./tokenizer_path
[Info] Models initialized successfully.
============================================================
[Info] Image Path (or 'exit' to quit):
000000004505.jpg
[Info] Enter text descriptions (comma-separated, or 'skip' for defaults):
a red handbag, a blue jacket, a red bus
[Info] Processing image: 000000004505.jpg
[Info] Image embedding size: 512
[Info] Processing 3 text(s)...
[Info] Text embeddings size: 3 x 512
============================================================
CLIP Image-Text Matching Results
============================================================
Image: 000000004505.jpg
logit_scale: 100.000000
------------------------------------------------------------
[1] prob=0.999975 sim=0.327895 text='a red bus'
[2] prob=0.000016 sim=0.217690 text='a red handbag'
[3] prob=0.000008 sim=0.211029 text='a blue jacket'
============================================================
============================================================
[Info] Image Path (or 'exit' to quit):
exit
[Info] Exiting...
Free vision model memory.
Free text model memory.
[Info] Done.
```
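The `prob` values in the output above appear consistent with a softmax over `logit_scale * sim`. The sketch below reproduces that calculation under this assumption; `scaled_softmax_sketch` is a hypothetical name and not the demo's own `softmax` function.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: softmax over similarities multiplied by a logit scale.
std::vector<float> scaled_softmax_sketch(const std::vector<float>& sims,
                                         float logit_scale) {
    std::vector<float> probs(sims.size());
    if (sims.empty()) return probs;
    const float max_sim = *std::max_element(sims.begin(), sims.end());
    double sum = 0.0;
    for (std::size_t i = 0; i < sims.size(); ++i) {
        // Subtracting the maximum keeps exp() from overflowing.
        const double e =
            std::exp(static_cast<double>(logit_scale) * (sims[i] - max_sim));
        probs[i] = static_cast<float>(e);
        sum += e;
    }
    for (float& p : probs) p = static_cast<float>(p / sum);
    return probs;
}
```

With the similarities `0.327895`, `0.217690`, `0.211029` and a scale of `100`, a gap of about `0.11` in similarity becomes a gap of about `11` in the logits, which is why nearly all of the probability mass lands on the best match.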

View file

@ -1,42 +1,43 @@
cmake_minimum_required(VERSION 3.5)
project(clip_demo)
set(CMAKE_CXX_STANDARD 17)
# Set NNSDK path
set(NNSDK_ROOT "${CMAKE_SOURCE_DIR}/../../../../dependency/nnsdk")
include_directories(${NNSDK_ROOT}/include)
include_directories(${CMAKE_SOURCE_DIR}/../../../../common)
# Set 3rdparty path
set(3RDPARTY_DIR "${CMAKE_SOURCE_DIR}/../../../../dependency")
# Include directories for stb_image and json
# Note: code uses #include "stb_image.h" and #include "json.hpp"
include_directories(${3RDPARTY_DIR}/stb_image)
include_directories(${3RDPARTY_DIR}/json)
if(CMAKE_SYSTEM_NAME STREQUAL "Android")
if (ANDROID_ABI STREQUAL "arm64-v8a")
link_directories(${NNSDK_ROOT}/lib/android/arm64-v8a)
else()
link_directories(${NNSDK_ROOT}/lib/android/armeabi-v7a)
endif()
# Android needs log
link_libraries(log)
elseif(CMAKE_SYSTEM_NAME STREQUAL "Linux")
link_directories(${NNSDK_ROOT}/lib/linux/lib64_yocto)
endif()
add_executable(${PROJECT_NAME}
main.cpp
model_invoke.cpp
pre_postprocess.cpp
)
target_link_libraries(${PROJECT_NAME}
nnsdk
dl
m
)
cmake_minimum_required(VERSION 3.5)
project(clip_demo)
set(CMAKE_CXX_STANDARD 17)
# Set NNSDK path
set(NNSDK_ROOT "${CMAKE_SOURCE_DIR}/../../../../dependency/nnsdk")
include_directories(${NNSDK_ROOT}/include)
include_directories(${CMAKE_SOURCE_DIR}/../../../../common)
# Set 3rdparty path
set(3RDPARTY_DIR "${CMAKE_SOURCE_DIR}/../../../../dependency")
# Include directories for stb_image and json
# Note: code uses #include "stb_image.h" and #include "json.hpp"
include_directories(${3RDPARTY_DIR}/stb_image)
include_directories(${3RDPARTY_DIR}/json)
if(CMAKE_SYSTEM_NAME STREQUAL "Android")
if (ANDROID_ABI STREQUAL "arm64-v8a")
link_directories(${NNSDK_ROOT}/lib/android/arm64-v8a)
else()
link_directories(${NNSDK_ROOT}/lib/android/armeabi-v7a)
endif()
# Android needs log
link_libraries(log)
elseif(CMAKE_SYSTEM_NAME STREQUAL "Linux")
link_directories(${NNSDK_ROOT}/lib/linux/lib64_yocto)
endif()
add_executable(${PROJECT_NAME}
main.cpp
model_invoke.cpp
pre_postprocess.cpp
clip_tokenizer.cpp
)
target_link_libraries(${PROJECT_NAME}
nnsdk
dl
m
)

View file

@ -0,0 +1,53 @@
/*
* Copyright (C) 2024-2025 Amlogic, Inc. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#ifndef CLIP_PROCESS_H
#define CLIP_PROCESS_H
#include <string>
#include <vector>
#include <cstdint>
// ==================== Model Invoke ====================
// Initialize network from file
void* init_network_file(const char *model_path);
// Run vision model inference
std::vector<float> run_vision_model(void* context, const std::vector<float>& input_data);
// Run text model inference
std::vector<float> run_text_model(void* context, const std::vector<int64_t>& input_ids);
// Destroy network
int destroy_network(void *qcontext);
// ==================== Pre/Post Processing ====================
// Image preprocessing
std::vector<float> preprocess_image(const std::string& image_path);
// L2 normalize
std::vector<float> l2_normalize(const std::vector<float>& vec);
// Softmax
std::vector<float> softmax(const std::vector<float>& logits);
// Compute cosine similarity
float compute_similarity(const std::vector<float>& a, const std::vector<float>& b, float scale = 100.0f);
#endif // CLIP_PROCESS_H

View file

@ -0,0 +1,395 @@
/*
* Copyright (C) 2024-2025 Amlogic, Inc. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "clip_tokenizer.h"
#include "json.hpp"
#include <fstream>
#include <sstream>
#include <iostream>
#include <algorithm>
#include <regex>
#include <set>
#include <cassert>
#include <climits>  // INT_MAX, used in bpe()
#include <codecvt>
#include <locale>
using json = nlohmann::ordered_json;
// Reference: https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py
void CLIPTokenizer::init_byte_to_unicode()
{
byte_to_unicode_.clear();
unicode_to_byte_.clear();
// Printable ASCII ranges that map to themselves
// '!' (33) to '~' (126), '¡' (161) to '¬' (172), '®' (174) to 'ÿ' (255)
std::vector<int> bs;
for (int i = 33; i <= 126; ++i) bs.push_back(i); // '!' to '~'
for (int i = 161; i <= 172; ++i) bs.push_back(i); // '¡' to '¬'
for (int i = 174; i <= 255; ++i) bs.push_back(i); // '®' to 'ÿ'
std::vector<int> cs(bs.begin(), bs.end());
// Map remaining bytes (0-32, 127-160, 173) to 256+
int n = 0;
for (int b = 0; b < 256; ++b) {
if (std::find(bs.begin(), bs.end(), b) == bs.end()) {
bs.push_back(b);
cs.push_back(256 + n);
n++;
}
}
for (size_t i = 0; i < bs.size(); ++i) {
byte_to_unicode_[static_cast<uint8_t>(bs[i])] = static_cast<char32_t>(cs[i]);
unicode_to_byte_[static_cast<char32_t>(cs[i])] = static_cast<uint8_t>(bs[i]);
}
}
// ========== UTF-8 Helpers ==========
std::vector<char32_t> CLIPTokenizer::utf8_to_codepoints(const std::string& str)
{
std::vector<char32_t> result;
size_t i = 0;
while (i < str.size()) {
char32_t cp = 0;
unsigned char c = str[i];
int len = 0;
if (c < 0x80) {
cp = c;
len = 1;
} else if ((c & 0xE0) == 0xC0) {
cp = c & 0x1F;
len = 2;
} else if ((c & 0xF0) == 0xE0) {
cp = c & 0x0F;
len = 3;
} else if ((c & 0xF8) == 0xF0) {
cp = c & 0x07;
len = 4;
} else {
++i;
continue;
}
for (int j = 1; j < len && (i + j) < str.size(); ++j) {
cp = (cp << 6) | (str[i + j] & 0x3F);
}
result.push_back(cp);
i += len;
}
return result;
}
std::string CLIPTokenizer::codepoints_to_utf8(const std::vector<char32_t>& cps)
{
std::string result;
for (char32_t cp : cps) {
if (cp < 0x80) {
result += static_cast<char>(cp);
} else if (cp < 0x800) {
result += static_cast<char>(0xC0 | (cp >> 6));
result += static_cast<char>(0x80 | (cp & 0x3F));
} else if (cp < 0x10000) {
result += static_cast<char>(0xE0 | (cp >> 12));
result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
result += static_cast<char>(0x80 | (cp & 0x3F));
} else {
result += static_cast<char>(0xF0 | (cp >> 18));
result += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
result += static_cast<char>(0x80 | (cp & 0x3F));
}
}
return result;
}
// ========== Load Functions ==========
bool CLIPTokenizer::load(const std::string& vocab_path, const std::string& merges_path)
{
init_byte_to_unicode();
// Load vocab.json
{
std::ifstream file(vocab_path);
if (!file.is_open()) {
std::cerr << "Failed to open vocab file: " << vocab_path << std::endl;
return false;
}
try {
json j;
file >> j;
for (auto it = j.begin(); it != j.end(); ++it) {
std::string token = it.key();
int id = it.value().get<int>();
token_to_id_[token] = id;
id_to_token_[id] = token;
}
} catch (const std::exception& e) {
std::cerr << "Error parsing vocab.json: " << e.what() << std::endl;
return false;
}
}
// Find special token IDs
if (token_to_id_.count("<|startoftext|>")) {
sot_token_id_ = token_to_id_["<|startoftext|>"];
}
if (token_to_id_.count("<|endoftext|>")) {
eot_token_id_ = token_to_id_["<|endoftext|>"];
}
// Load merges.txt
{
std::ifstream file(merges_path);
if (!file.is_open()) {
std::cerr << "Failed to open merges file: " << merges_path << std::endl;
return false;
}
std::string line;
int rank = 0;
// Skip header line "#version: ..." if present
if (std::getline(file, line)) {
if (line.find("#version") == std::string::npos) {
// First line is not a header, process it
std::istringstream iss(line);
std::string a, b;
if (iss >> a >> b) {
bpe_ranks_[{a, b}] = rank++;
}
}
}
while (std::getline(file, line)) {
if (line.empty()) continue;
std::istringstream iss(line);
std::string a, b;
if (iss >> a >> b) {
bpe_ranks_[{a, b}] = rank++;
}
}
}
loaded_ = true;
printf("[Info] CLIPTokenizer loaded: vocab_size=%zu, merges=%zu\n",
token_to_id_.size(), bpe_ranks_.size());
return true;
}
bool CLIPTokenizer::load_from_dir(const std::string& tokenizer_dir)
{
std::string dir = tokenizer_dir;
// Ensure trailing slash
if (!dir.empty() && dir.back() != '/' && dir.back() != '\\') {
dir += "/";
}
return load(dir + "vocab.json", dir + "merges.txt");
}
// ========== BPE Implementation ==========
std::string CLIPTokenizer::bytes_to_unicode_str(const std::string& raw) const
{
std::vector<char32_t> result;
for (unsigned char c : raw) {
auto it = byte_to_unicode_.find(c);
if (it != byte_to_unicode_.end()) {
result.push_back(it->second);
}
}
return codepoints_to_utf8(result);
}
std::vector<std::string> CLIPTokenizer::bpe(const std::string& token) const
{
// Convert token to individual unicode characters as strings
auto codepoints = utf8_to_codepoints(token);
if (codepoints.empty()) return {};
// Each character becomes a separate piece
std::vector<std::string> word;
for (size_t i = 0; i < codepoints.size(); ++i) {
std::string piece = codepoints_to_utf8({codepoints[i]});
// CLIP adds </w> to the last character
if (i == codepoints.size() - 1) {
piece += "</w>";
}
word.push_back(piece);
}
if (word.size() == 1) return word;
// Iteratively merge the adjacent pair with the highest priority (lowest merge rank)
while (true) {
if (word.size() < 2) break;
// Find the pair with the lowest rank
int best_rank = INT_MAX;
int best_idx = -1;
for (size_t i = 0; i < word.size() - 1; ++i) {
auto it = bpe_ranks_.find({word[i], word[i + 1]});
if (it != bpe_ranks_.end() && it->second < best_rank) {
best_rank = it->second;
best_idx = static_cast<int>(i);
}
}
if (best_idx == -1) break; // No more merges possible
// Merge the pair at best_idx
std::string merged = word[best_idx] + word[best_idx + 1];
std::vector<std::string> new_word;
for (size_t i = 0; i < word.size(); ++i) {
if (static_cast<int>(i) == best_idx) {
new_word.push_back(merged);
++i; // Skip next element
} else {
new_word.push_back(word[i]);
}
}
word = new_word;
}
return word;
}
std::vector<std::string> CLIPTokenizer::pre_tokenize(const std::string& text) const
{
// CLIP tokenizer: lowercase + basic clean + split by pattern
// Pattern from CLIP: <\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+
// Simplified version for ASCII-dominant text:
std::string cleaned;
// Lowercase and basic whitespace normalization
for (char c : text) {
if (c >= 'A' && c <= 'Z') {
cleaned += (c - 'A' + 'a');
} else {
cleaned += c;
}
}
// Simple tokenization: split by whitespace and punctuation
std::vector<std::string> words;
std::string current;
for (size_t i = 0; i < cleaned.size(); ++i) {
char c = cleaned[i];
if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
if (!current.empty()) {
words.push_back(current);
current.clear();
}
// Whitespace only separates words here; the end-of-word marker "</w>"
// is appended later in bpe().
} else {
// Check if punctuation should be separate token
bool is_alpha_or_digit = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
bool cur_is_alpha = !current.empty() &&
((current.back() >= 'a' && current.back() <= 'z') ||
(current.back() >= '0' && current.back() <= '9'));
if (!current.empty() && !is_alpha_or_digit && cur_is_alpha) {
// Start new token for punctuation
words.push_back(current);
current.clear();
} else if (!current.empty() && is_alpha_or_digit && !cur_is_alpha) {
words.push_back(current);
current.clear();
}
current += c;
}
}
if (!current.empty()) {
words.push_back(current);
}
return words;
}
// ========== Encode ==========
std::vector<int64_t> CLIPTokenizer::encode(const std::string& text, int max_len) const
{
if (!loaded_) {
std::cerr << "Tokenizer not loaded!" << std::endl;
return std::vector<int64_t>(max_len, 0);
}
std::vector<int64_t> tokens;
// Add start-of-text token
tokens.push_back(sot_token_id_);
// Pre-tokenize
std::vector<std::string> words = pre_tokenize(text);
// Process each word
for (const auto& word : words) {
// Convert raw bytes to unicode representation
std::string unicode_word = bytes_to_unicode_str(word);
// Apply BPE
std::vector<std::string> bpe_tokens = bpe(unicode_word);
// Look up token IDs
for (const auto& bt : bpe_tokens) {
auto it = token_to_id_.find(bt);
if (it != token_to_id_.end()) {
tokens.push_back(it->second);
} else {
// Unknown token, try without </w>
std::string no_ew = bt;
if (no_ew.size() >= 4 && no_ew.substr(no_ew.size() - 4) == "</w>") {
no_ew = no_ew.substr(0, no_ew.size() - 4);
}
auto it2 = token_to_id_.find(no_ew);
if (it2 != token_to_id_.end()) {
tokens.push_back(it2->second);
}
// else: skip unknown token
}
}
}
// Add end-of-text token
tokens.push_back(eot_token_id_);
// Truncate if necessary
if (static_cast<int>(tokens.size()) > max_len) {
tokens.resize(max_len);
// Ensure EOT is at the end
tokens.back() = eot_token_id_;
}
// Pad to max_len with EOT token (consistent with HuggingFace CLIPTokenizer)
while (static_cast<int>(tokens.size()) < max_len) {
tokens.push_back(eot_token_id_);
}
return tokens;
}

View file

@ -0,0 +1,105 @@
/*
* Copyright (C) 2024-2025 Amlogic, Inc. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#ifndef CLIP_TOKENIZER_H
#define CLIP_TOKENIZER_H
#include <cstdint>  // uint8_t, int64_t
#include <string>
#include <vector>
#include <map>
#include <unordered_map>
class CLIPTokenizer {
public:
CLIPTokenizer() = default;
/**
* Load tokenizer from vocab.json and merges.txt
* @param vocab_path Path to vocab.json
* @param merges_path Path to merges.txt
* @return true on success
*/
bool load(const std::string& vocab_path, const std::string& merges_path);
/**
* Load tokenizer from a directory containing vocab.json and merges.txt
* @param tokenizer_dir Path to directory
* @return true on success
*/
bool load_from_dir(const std::string& tokenizer_dir);
/**
* Tokenize text to token IDs with padding/truncation.
* Adds <|startoftext|> and <|endoftext|> automatically.
*
* @param text Input text string
* @param max_len Maximum sequence length (default: 64)
* @return Vector of int64_t token IDs with shape [max_len]
*/
std::vector<int64_t> encode(const std::string& text, int max_len = 64) const;
/**
* Check if tokenizer is loaded
*/
bool is_loaded() const { return loaded_; }
/**
* Get vocabulary size
*/
size_t vocab_size() const { return token_to_id_.size(); }
private:
// BPE pair
using BPEPair = std::pair<std::string, std::string>;
// Byte-to-unicode mapping (GPT-2 style)
std::unordered_map<uint8_t, char32_t> byte_to_unicode_;
std::unordered_map<char32_t, uint8_t> unicode_to_byte_;
// Vocabulary
std::unordered_map<std::string, int> token_to_id_;
std::unordered_map<int, std::string> id_to_token_;
// BPE merge rules (pair -> priority rank)
std::map<BPEPair, int> bpe_ranks_;
// Special token IDs
int sot_token_id_ = 49406; // <|startoftext|>
int eot_token_id_ = 49407; // <|endoftext|>
bool loaded_ = false;
// Initialize byte-to-unicode mapping
void init_byte_to_unicode();
// Convert UTF-8 string to vector of unicode codepoints
static std::vector<char32_t> utf8_to_codepoints(const std::string& str);
// Convert unicode codepoints to UTF-8 string
static std::string codepoints_to_utf8(const std::vector<char32_t>& cps);
// Apply BPE to a single word (already converted to unicode representation)
std::vector<std::string> bpe(const std::string& token) const;
// Lowercase and split text (simplified approximation of CLIP's regex pattern)
std::vector<std::string> pre_tokenize(const std::string& text) const;
// Convert raw bytes to unicode string using byte_to_unicode mapping
std::string bytes_to_unicode_str(const std::string& raw) const;
};
#endif // CLIP_TOKENIZER_H

View file

@ -15,22 +15,26 @@
*/
#include <iostream>
#include <fstream>
#include <sstream>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <vector>
#include <string>
#include <algorithm>
#include "model_invoke.h"
#include "clip_process.h"
#include "clip_tokenizer.h"
#define BILLION 1000000000
struct Get_Times
struct ProfilingTimer
{
uint64_t init_start_time, init_end_time, init_total_time;
uint64_t preProcess_start_time, preProcess_end_time, preProcess_total_time;
uint64_t invoke_start_time, invoke_end_time, invoke_total_time;
uint64_t postProcess_start_time, postProcess_end_time, postProcess_total_time;
uint64_t total_time;
std::vector<uint64_t> total_time_group;
uint64_t init_start, init_end;
uint64_t preprocess_start, preprocess_end;
uint64_t vision_infer_start, vision_infer_end;
uint64_t text_infer_start, text_infer_end;
};
static uint64_t get_time_count()
@ -40,70 +44,288 @@ static uint64_t get_time_count()
return (uint64_t)((uint64_t)ts.tv_nsec + (uint64_t)ts.tv_sec * BILLION);
}
// Default text prompts for demo
static std::vector<std::string> default_texts = {
"a red handbag",
"a blue jacket",
"a red bus"
};
// Parse comma-separated texts
std::vector<std::string> parse_texts(const std::string& input)
{
std::vector<std::string> result;
std::stringstream ss(input);
std::string item;
while (std::getline(ss, item, ',')) {
// Trim whitespace
size_t start = item.find_first_not_of(" \t");
size_t end = item.find_last_not_of(" \t");
if (start != std::string::npos && end != std::string::npos) {
result.push_back(item.substr(start, end - start + 1));
}
}
return result;
}
void print_usage(const char* prog_name)
{
printf("Usage: %s <vision_model> <text_model> <tokenizer_dir> [--profiling]\n", prog_name);
printf("\n");
printf("Arguments:\n");
printf(" vision_model: Path to vision model (.adla)\n");
printf(" text_model: Path to text model (.adla)\n");
printf(" tokenizer_dir: Path to directory containing vocab.json and merges.txt\n");
printf(" --profiling: Enable performance profiling output (optional)\n");
printf("\n");
printf("Interactive mode:\n");
printf(" - Enter image path to process\n");
printf(" - Enter comma-separated texts to compare (or 'skip' for defaults)\n");
printf(" - Enter 'exit' to quit\n");
}
int main(int argc, char ** argv)
{
Get_Times model_time;
std::vector<float> input_data_fir;
float* model_output_data;
ProfilingTimer timer = {};
int ret = 0;
int max_index = 0;
if (argc < 2) {
printf("Usage: %s <model_path> [base_dir] [json_filename]\n", argv[0]);
printf(" model_path: Path to the model file\n");
printf(" base_dir: Base directory for clip datasets (optional, can also use CLIP_BASE_DIR env var)\n");
printf(" json_filename: JSON filename in each dataset folder (optional, can also use CLIP_JSON_FILENAME env var, default: clip_text_res.json)\n");
return -1;
}
char* model_path_encoder = argv[1];
std::string base_dir = (argc >= 3) ? argv[2] : "";
std::string json_filename = (argc >= 4) ? argv[3] : "";
void *context_model = NULL;
bool profiling = false;
model_time.init_start_time = get_time_count();
context_model = init_network_file(model_path_encoder);
model_time.init_end_time = get_time_count();
if (context_model == NULL)
{
printf("init_network [context_model] fail.\n");
if (argc < 4) {
print_usage(argv[0]);
return -1;
}
if (getenv("GET_TIME"))
{
model_time.init_total_time = (model_time.init_end_time - model_time.init_start_time) / 1000000;
std::cout << "init_model_total time : " << model_time.init_total_time << "ms" << std::endl;
const char* vision_model_path = argv[1];
const char* text_model_path = argv[2];
const char* tokenizer_dir = argv[3];
// Check for --profiling flag
for (int i = 4; i < argc; ++i) {
if (std::string(argv[i]) == "--profiling") {
profiling = true;
}
}
while (true)
{
std::string json_path;
const float logit_scale = 100.0f;
const int max_seq_len = 64;
printf("\nPlease enter the JPG image path (enter exit to quit):\n");
std::getline(std::cin, json_path);
if (json_path == "exit") break;
if (json_path.empty()) {
printf("The path cannot be empty.\n");
// Load tokenizer
printf("[Info] Loading tokenizer from: %s\n", tokenizer_dir);
CLIPTokenizer tokenizer;
if (!tokenizer.load_from_dir(tokenizer_dir)) {
printf("[Error] Failed to load tokenizer.\n");
return -1;
}
// Initialize models
printf("[Info] Initializing vision model: %s\n", vision_model_path);
timer.init_start = get_time_count();
void* vision_context = init_network_file(vision_model_path);
if (vision_context == NULL) {
printf("[Error] Failed to initialize vision model.\n");
return -1;
}
printf("[Info] Initializing text model: %s\n", text_model_path);
void* text_context = init_network_file(text_model_path);
if (text_context == NULL) {
printf("[Error] Failed to initialize text model.\n");
destroy_network(vision_context);
return -1;
}
timer.init_end = get_time_count();
if (profiling) {
uint64_t init_time = (timer.init_end - timer.init_start) / 1000000;
printf("[Profiling] Model initialization: %llums\n", (unsigned long long)init_time);
}
printf("[Info] Models initialized successfully.\n\n");
// Interactive loop
while (true) {
std::string image_path;
printf("============================================================\n");
printf("[Info] Image Path (or 'exit' to quit):\n");
std::getline(std::cin, image_path);
// Trim whitespace
size_t start = image_path.find_first_not_of(" \t\r\n");
size_t end = image_path.find_last_not_of(" \t\r\n");
if (start != std::string::npos && end != std::string::npos) {
image_path = image_path.substr(start, end - start + 1);
} else {
image_path.clear();
}
if (image_path == "exit") {
printf("[Info] Exiting...\n");
break;
}
if (image_path.empty()) {
printf("[Warning] Please enter an image path.\n");
continue;
}
std::vector<std::string> out_str_path = process_image_dir(context_model, json_path, base_dir, json_filename);
for (int i = 0; i < out_str_path.size(); i++)
// Check if file exists
{
std::cout << "Index[" << i << "] : " << out_str_path[i] << std::endl;
std::ifstream img_file(image_path);
if (!img_file.good()) {
printf("[Error] Image not found: %s\n", image_path.c_str());
continue;
}
}
// Get texts to compare
std::vector<std::string> texts;
printf("[Info] Enter text descriptions (comma-separated, or 'skip' for defaults):\n");
std::string text_input;
std::getline(std::cin, text_input);
// Trim
start = text_input.find_first_not_of(" \t\r\n");
end = text_input.find_last_not_of(" \t\r\n");
if (start != std::string::npos && end != std::string::npos) {
text_input = text_input.substr(start, end - start + 1);
} else {
text_input.clear();
}
if (text_input.empty() || text_input == "skip") {
texts = default_texts;
printf("[Info] Using default texts\n");
} else {
texts = parse_texts(text_input);
}
if (texts.empty()) {
printf("[Warning] No texts provided.\n");
continue;
}
// ==================== Process Image ====================
printf("\n[Info] Processing image: %s\n", image_path.c_str());
timer.preprocess_start = get_time_count();
std::vector<float> image_input = preprocess_image(image_path);
if (image_input.empty()) {
printf("[Error] Failed to preprocess image.\n");
continue;
}
timer.preprocess_end = get_time_count();
// Run vision model
timer.vision_infer_start = get_time_count();
std::vector<float> image_embedding = run_vision_model(vision_context, image_input);
if (image_embedding.empty()) {
printf("[Error] Vision model inference failed.\n");
continue;
}
timer.vision_infer_end = get_time_count();
// L2 normalize image embedding
image_embedding = l2_normalize(image_embedding);
printf("[Info] Image embedding size: %zu\n", image_embedding.size());
// ==================== Process Texts ====================
printf("[Info] Processing %zu text(s)...\n", texts.size());
std::vector<std::vector<float>> text_embeddings;
std::vector<uint64_t> text_infer_times;
timer.text_infer_start = get_time_count();
for (size_t i = 0; i < texts.size(); ++i) {
// Tokenize text
std::vector<int64_t> token_ids = tokenizer.encode(texts[i], max_seq_len);
// Run text model
uint64_t t_start = get_time_count();
std::vector<float> text_emb = run_text_model(text_context, token_ids);
uint64_t t_end = get_time_count();
text_infer_times.push_back((t_end - t_start) / 1000000);
if (text_emb.empty()) {
printf("[Error] Text model inference failed for: %s\n", texts[i].c_str());
continue;
}
// L2 normalize
text_emb = l2_normalize(text_emb);
text_embeddings.push_back(text_emb);
}
timer.text_infer_end = get_time_count();
if (text_embeddings.size() != texts.size()) {
printf("[Error] Some text embeddings failed.\n");
continue;
}
printf("[Info] Text embeddings size: %zu x %zu\n", text_embeddings.size(),
text_embeddings.empty() ? 0 : text_embeddings[0].size());
// ==================== Compute Similarity ====================
std::vector<float> similarities(texts.size());
std::vector<float> logits(texts.size());
for (size_t i = 0; i < texts.size(); ++i) {
similarities[i] = compute_similarity(image_embedding, text_embeddings[i], 1.0f); // cosine sim
logits[i] = similarities[i] * logit_scale;
}
// Compute probabilities
std::vector<float> probs = softmax(logits);
// Sort by probability (descending)
std::vector<size_t> indices(texts.size());
for (size_t i = 0; i < texts.size(); ++i) indices[i] = i;
std::sort(indices.begin(), indices.end(),
[&probs](size_t a, size_t b) { return probs[a] > probs[b]; });
// ==================== Print Results ====================
printf("\n============================================================\n");
printf("CLIP Image-Text Matching Results\n");
printf("============================================================\n");
printf("Image: %s\n", image_path.c_str());
printf("logit_scale: %.6f\n", logit_scale);
printf("------------------------------------------------------------\n");
for (size_t rank = 0; rank < indices.size(); ++rank) {
size_t i = indices[rank];
printf("[%zu] prob=%.6f sim=%.6f text='%s'\n",
rank + 1, probs[i], similarities[i], texts[i].c_str());
}
printf("============================================================\n");
if (profiling) {
uint64_t preprocess_time = (timer.preprocess_end - timer.preprocess_start) / 1000000;
uint64_t vision_time = (timer.vision_infer_end - timer.vision_infer_start) / 1000000;
uint64_t text_total_time = (timer.text_infer_end - timer.text_infer_start) / 1000000;
// Cast to unsigned long long so the format specifier is correct on 32-bit and 64-bit builds.
printf("\n[Profiling]\n");
printf(" Image preprocess: %llums\n", static_cast<unsigned long long>(preprocess_time));
printf(" Vision inference: %llums\n", static_cast<unsigned long long>(vision_time));
for (size_t i = 0; i < texts.size() && i < text_infer_times.size(); ++i) {
printf(" Text inference[%zu]: %llums '%s'\n", i,
static_cast<unsigned long long>(text_infer_times[i]), texts[i].c_str());
}
printf(" Text total: %llums (%zu texts)\n",
static_cast<unsigned long long>(text_total_time), texts.size());
}
printf("\n");
}
ret = destroy_network(context_model);
if (ret != 0)
{
printf("destroy_network [context_model] fail.\n");
return -1;
// Cleanup
ret = destroy_network(vision_context);
if (ret != 0) {
printf("[Error] Failed to destroy vision model.\n");
}
return ret;
}
ret = destroy_network(text_context);
if (ret != 0) {
printf("[Error] Failed to destroy text model.\n");
}
printf("[Info] Done.\n");
return 0;
}
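
The interactive loop above trims each input line inline and passes the text line to `parse_texts`, whose body is not shown in this diff. As a rough illustration only, the trimming and comma splitting could be factored as below. The names `trim_copy` and `split_and_trim` are hypothetical and are not taken from the commit; the real `parse_texts` may behave differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: strip leading/trailing whitespace, mirroring the
// find_first_not_of / find_last_not_of logic used inline in the loop.
std::string trim_copy(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t start = s.find_first_not_of(ws);
    if (start == std::string::npos) return "";
    const std::size_t end = s.find_last_not_of(ws);
    return s.substr(start, end - start + 1);
}

// Hypothetical stand-in for parse_texts(): split on commas, trim each item,
// and drop items that are empty after trimming.
std::vector<std::string> split_and_trim(const std::string& input) {
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (pos <= input.size()) {
        std::size_t comma = input.find(',', pos);
        if (comma == std::string::npos) comma = input.size();
        std::string item = trim_copy(input.substr(pos, comma - pos));
        if (!item.empty()) out.push_back(item);
        pos = comma + 1;
    }
    return out;
}

// Usage check (not called automatically).
void split_and_trim_example() {
    const std::vector<std::string> texts = split_and_trim(" a red bus , ,a blue jacket\n");
    assert(texts.size() == 2 && texts[0] == "a red bus" && texts[1] == "a blue jacket");
}
```

Sharing one trim helper between the image-path prompt and the text prompt would remove the duplicated trimming block in the loop, at the cost of one extra string copy per input line.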


@ -20,31 +20,20 @@
#include <fstream>
#include <algorithm>
#include <vector>
#include <cmath>
#include <cstdlib>
#include "model_invoke.h"
#include "clip_process.h"
#include "nn_sdk.h"
#include "json.hpp"
#include <filesystem>
#include <regex>
using json = nlohmann::ordered_json;
namespace fs = std::__fs::filesystem;
// Global DMA config for models
static aml_memory_config_t vision_mem_config;
static aml_memory_data_t vision_mem_data;
static void* vision_context_flag = nullptr;
struct DMAConfig {
bool use_dma = true;
bool malloc_buffer_once = true;
};
DMAConfig context_model;
///////////////////////////////////////////////////////////
aml_memory_config_t mem_config_context_model;
aml_memory_data_t mem_data_context_model;
std::vector<float> preprocess_image(const std::string& image_path);
float post_process(const float* a, const std::vector<float>& b);
static aml_memory_config_t text_mem_config;
static aml_memory_data_t text_mem_data;
static void* text_context_flag = nullptr;
void* init_network_file(const char *model_path)
{
@ -95,202 +84,119 @@ void* init_network_file(const char *model_path)
return qcontext;
}
float* run_network(void *qcontext, std::vector<float> input_ids, const std::string image_type)
std::vector<float> run_vision_model(void* qcontext, const std::vector<float>& input_data)
{
int ret = 0;
nn_input inData;
nn_output *outdata = NULL;
aml_output_config_t outconfig;
inData.input_index = 0;
inData.info.input_format = AML_INPUT_DEFAULT;
inData.size = input_ids.size() * sizeof(float);
inData.size = input_data.size() * sizeof(float);
if (context_model.use_dma) {
if (context_model.malloc_buffer_once) {
mem_config_context_model.cache_type = AML_WITH_CACHE;
mem_config_context_model.memory_type = AML_VIRTUAL_ADDR;
mem_config_context_model.direction = AML_MEM_DIRECTION_READ_WRITE;
mem_config_context_model.index = 0;
mem_config_context_model.mem_size = inData.size;
aml_util_mallocBuffer(qcontext, &mem_config_context_model, &mem_data_context_model);
aml_util_swapExternalInputBuffer(qcontext, &mem_config_context_model, &mem_data_context_model);
}
inData.input_type = INPUT_DMA_DATA;
memcpy(mem_data_context_model.viraddr, input_ids.data(), mem_config_context_model.mem_size);
inData.input = NULL;
} else {
inData.input = reinterpret_cast<unsigned char*>(input_ids.data());
inData.input_type = BINARY_RAW_DATA;
ret = aml_module_input_set(qcontext, &inData);
if (ret)
{
printf("aml_module_input_set fail.\n");
}
// Use DMA
if (!vision_context_flag) {
vision_mem_config.cache_type = AML_WITH_CACHE;
vision_mem_config.memory_type = AML_VIRTUAL_ADDR;
vision_mem_config.direction = AML_MEM_DIRECTION_READ_WRITE;
vision_mem_config.index = 0;
vision_mem_config.mem_size = inData.size;
aml_util_mallocBuffer(qcontext, &vision_mem_config, &vision_mem_data);
aml_util_swapExternalInputBuffer(qcontext, &vision_mem_config, &vision_mem_data);
vision_context_flag = qcontext;
}
context_model.malloc_buffer_once = false;
inData.input_type = INPUT_DMA_DATA;
memcpy(vision_mem_data.viraddr, input_data.data(), vision_mem_config.mem_size);
inData.input = NULL;
memset(&outconfig, 0, sizeof(aml_output_config_t));
if (context_model.use_dma) {
outconfig.format = AML_OUTDATA_DMA;
} else {
outconfig.format = AML_OUTDATA_RAW;
}
outconfig.format = AML_OUTDATA_DMA;
outconfig.typeSize = sizeof(aml_output_config_t);
outdata = (nn_output*)aml_module_output_get(qcontext, outconfig);
return reinterpret_cast<float*>(outdata->out[0].buf);
}
int extract_index(const std::string& filename) {
std::regex pattern(R"(test_\w+_(\d+)\.jpg)");
std::smatch match;
if (std::regex_match(filename, match, pattern)) {
return std::stoi(match[1]);
if (outdata == NULL || outdata->out[0].buf == NULL) {
printf("Vision model inference failed.\n");
return {};
}
return -1;
// Copy output to vector
size_t output_size = outdata->out[0].size / sizeof(float);
float* output_ptr = reinterpret_cast<float*>(outdata->out[0].buf);
std::vector<float> result(output_ptr, output_ptr + output_size);
return result;
}
std::vector<std::string> process_image_dir(
void* context_model,
const std::string& image_dir_path,
const std::string& base_dir,
const std::string& json_filename)
std::vector<float> run_text_model(void* qcontext, const std::vector<int64_t>& input_ids)
{
std::vector<std::string> results;
std::regex file_pattern(R"(test_(\w+)_\d+\.jpg)");
// Get base_dir from parameter, environment variable, or use default
std::string actual_base_dir = base_dir;
if (actual_base_dir.empty()) {
const char* env_base_dir = std::getenv("CLIP_BASE_DIR");
if (env_base_dir != nullptr) {
actual_base_dir = env_base_dir;
} else {
actual_base_dir = "./demo_data/clip_datasets/";
}
}
// Ensure base_dir ends with '/'
if (!actual_base_dir.empty() && actual_base_dir.back() != '/') {
actual_base_dir += "/";
}
// Get json_filename from parameter, environment variable, or use default
std::string actual_json_filename = json_filename;
if (actual_json_filename.empty()) {
const char* env_json_filename = std::getenv("CLIP_JSON_FILENAME");
if (env_json_filename != nullptr) {
actual_json_filename = env_json_filename;
} else {
actual_json_filename = "clip_text_res.json";
}
int ret = 0;
nn_input inData;
nn_output *outdata = NULL;
aml_output_config_t outconfig;
inData.input_index = 0;
inData.info.input_format = AML_INPUT_DEFAULT;
inData.size = input_ids.size() * sizeof(int64_t);
// Use DMA
if (!text_context_flag) {
text_mem_config.cache_type = AML_WITH_CACHE;
text_mem_config.memory_type = AML_VIRTUAL_ADDR;
text_mem_config.direction = AML_MEM_DIRECTION_READ_WRITE;
text_mem_config.index = 0;
text_mem_config.mem_size = inData.size;
aml_util_mallocBuffer(qcontext, &text_mem_config, &text_mem_data);
aml_util_swapExternalInputBuffer(qcontext, &text_mem_config, &text_mem_data);
text_context_flag = qcontext;
}
// storing qualified paths
std::vector<fs::directory_entry> matched_files;
inData.input_type = INPUT_DMA_DATA;
memcpy(text_mem_data.viraddr, input_ids.data(), text_mem_config.mem_size);
inData.input = NULL;
// collect all relevant img.
for (const auto& entry : fs::directory_iterator(image_dir_path)) {
if (!entry.is_regular_file()) continue;
memset(&outconfig, 0, sizeof(aml_output_config_t));
outconfig.format = AML_OUTDATA_DMA;
outconfig.typeSize = sizeof(aml_output_config_t);
outdata = (nn_output*)aml_module_output_get(qcontext, outconfig);
std::string filename = entry.path().filename().string();
if (std::regex_match(filename, file_pattern)) {
matched_files.push_back(entry);
}
if (outdata == NULL || outdata->out[0].buf == NULL) {
printf("Text model inference failed.\n");
return {};
}
// use index sort, test_type_index.jpg
std::sort(matched_files.begin(), matched_files.end(),
[](const fs::directory_entry& a, const fs::directory_entry& b) {
return extract_index(a.path().filename().string()) <
extract_index(b.path().filename().string());
});
// Copy output to vector
size_t output_size = outdata->out[0].size / sizeof(float);
float* output_ptr = reinterpret_cast<float*>(outdata->out[0].buf);
std::vector<float> result(output_ptr, output_ptr + output_size);
for (const auto& entry : matched_files) {
if (!entry.is_regular_file()) continue;
std::string filename = entry.path().filename().string();
std::smatch match;
if (!std::regex_match(filename, match, file_pattern)) continue;
std::string name = match[1];
std::vector<float> input_data = preprocess_image(entry.path().string());
float* model_output = run_network(context_model, input_data, name);
float max_sim = -std::numeric_limits<float>::infinity();
std::string best_key, best_id;
// Iterate through all directories to find the directory containing the name
for (const auto& dir_entry : fs::directory_iterator(actual_base_dir)) {
if (!dir_entry.is_directory()) continue;
std::string folder_name = dir_entry.path().filename().string();
if (folder_name.find(name) == std::string::npos) continue;
std::string vit_res_path = actual_base_dir + folder_name + "/" + actual_json_filename;
std::ifstream vit_in(vit_res_path);
if (!vit_in.is_open()) {
printf("unopen: %s\n", vit_res_path.c_str());
continue;
}
json vit_json;
vit_in >> vit_json;
for (auto it = vit_json.begin(); it != vit_json.end(); ++it) {
const std::string& key = it.key();
const std::vector<float> vec = it.value().get<std::vector<float>>();
float sim = post_process(model_output, vec);
// printf("sim: %.4f\n", sim);
if (sim > max_sim) {
max_sim = sim;
best_key = key;
best_id = folder_name;
}
}
}
if (!best_key.empty() && !best_id.empty()) {
std::string best_path = actual_base_dir + best_id + "/";
results.push_back(best_path);
printf("\nProcessing images: %s, datasets img path: %s\n", filename.c_str(), best_path.c_str());
// printf("Most similar image: %s, similarity: %.4f\n", best_path.c_str(), max_sim); // for debug
}
}
return results;
return result;
}
int destroy_network(void *qcontext)
{
int ret = 0;
/* free model
model.use_dma = true
model.malloc_buffer_once = false
*/
if (context_model.use_dma && mem_config_context_model.mem_size != 0) {
ret = aml_util_freeBuffer(qcontext, &mem_config_context_model, &mem_data_context_model);
if (ret)
{
std::cout << "aml_util_freeBuffer fail." << std::endl;
}
if (vision_context_flag == qcontext) {
printf("Free vision model memory.\n");
aml_util_freeBuffer(qcontext, &vision_mem_config, &vision_mem_data);
vision_context_flag = nullptr;
} else if (text_context_flag == qcontext) {
printf("Free text model memory.\n");
aml_util_freeBuffer(qcontext, &text_mem_config, &text_mem_data);
text_context_flag = nullptr;
} else {
printf("Free network failed: context not found.\n");
return -1;
}
context_model.use_dma = false;
ret = aml_module_destroy(qcontext);
if (ret)
{
printf("aml_module_destroy fail.\n");
printf("Free network failed: destroy failed.\n");
return -1;
}
return ret;
}
}
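
Both `run_vision_model` and `run_text_model` copy each input into a DMA buffer that is sized on the first call. A size check before the copy would keep a later, differently sized input from reading or writing out of bounds. The sketch below shows one way to express that check. `copy_exact` is a hypothetical helper, not part of the Amlogic SDK or of this commit, and it uses a plain array in place of the SDK buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper, not part of the SDK: copy `src` into a buffer that was
// allocated once with `capacity` bytes, rejecting any size mismatch.
template <typename T>
bool copy_exact(void* dst, std::size_t capacity, const std::vector<T>& src) {
    assert(dst != nullptr);  // the buffer must have been allocated already
    const std::size_t bytes = src.size() * sizeof(T);
    if (bytes == 0 || bytes != capacity) {
        return false;  // caller should report the mismatch and skip inference
    }
    std::memcpy(dst, src.data(), bytes);
    return true;
}
```

In the functions above, the call would presumably take `vision_mem_data.viraddr` (or `text_mem_data.viraddr`) as `dst` and the matching `mem_size` as `capacity`, returning an empty vector when the helper reports a mismatch.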


@ -19,13 +19,13 @@
#include <algorithm>
#include <string>
#include <iostream>
#include "model_invoke.h"
#include "clip_process.h"
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
// bilinear interpolation scaling
std::vector<float> resize_bilinear(
static std::vector<float> resize_bilinear(
const unsigned char* src, int src_w, int src_h, int channels,
int dst_w, int dst_h)
{
@ -102,29 +102,29 @@ std::vector<float> preprocess_image(const std::string& image_path) {
}
}
// get NHWC
// Return NHWC format (batch dimension will be added in caller)
return cropped;
}
float post_process(const float* a, const std::vector<float>& b) {
float dot = 0.0f, scale = 100.00000762939453f;
for (size_t i = 0; i < b.size(); ++i) {
dot += a[i] * b[i];
// ==================== Post Processing ====================
std::vector<float> l2_normalize(const std::vector<float>& vec)
{
float norm = 0.0f;
for (float v : vec) {
norm += v * v;
}
dot *= scale;
return dot;
norm = std::sqrt(norm) + 1e-12f;
std::vector<float> result(vec.size());
for (size_t i = 0; i < vec.size(); ++i) {
result[i] = vec[i] / norm;
}
return result;
}
float post_process(const int8_t* a, const std::vector<float>& b) {
float dot = 0.0f, scale = 100.00000762939453f;
for (size_t i = 0; i < b.size(); ++i) {
dot += (a[i] - 66) * b[i];
}
dot *= scale;
return dot;
}
std::vector<float> softmax(const std::vector<float>& logits) {
std::vector<float> softmax(const std::vector<float>& logits)
{
std::vector<float> result(logits.size());
// numerical stability: subtract the maximum value first.
@ -142,3 +142,17 @@ std::vector<float> softmax(const std::vector<float>& logits) {
return result;
}
float compute_similarity(const std::vector<float>& a, const std::vector<float>& b, float scale)
{
if (a.size() != b.size()) {
printf("Feature dimension mismatch: %zu vs %zu\n", a.size(), b.size());
return 0.0f;
}
float dot = 0.0f;
for (size_t i = 0; i < a.size(); ++i) {
dot += a[i] * b[i];
}
return dot * scale;
}
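
The post-processing helpers in this file are combined in the main loop as: L2-normalize both embeddings, take the dot product (cosine similarity), multiply by `logit_scale`, then apply softmax over all texts. The sketch below restates that chain in one self-contained unit. The `_sketch` names and `match_probabilities` are illustrative and are not functions from the commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// L2-normalize with the same small epsilon the commit adds to the norm.
std::vector<float> normalize_sketch(const std::vector<float>& v) {
    float norm = 0.0f;
    for (float x : v) norm += x * x;
    norm = std::sqrt(norm) + 1e-12f;
    std::vector<float> out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) out[i] = v[i] / norm;
    return out;
}

// Numerically stable softmax: subtract the maximum logit before exponentiating.
std::vector<float> softmax_sketch(const std::vector<float>& logits) {
    if (logits.empty()) return {};
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> out(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        out[i] = std::exp(logits[i] - max_logit);
        sum += out[i];
    }
    for (float& p : out) p /= sum;
    return out;
}

// Cosine similarity per text, scaled by logit_scale, then softmax over texts.
std::vector<float> match_probabilities(const std::vector<float>& image,
                                       const std::vector<std::vector<float>>& texts,
                                       float logit_scale) {
    const std::vector<float> img = normalize_sketch(image);
    std::vector<float> logits;
    logits.reserve(texts.size());
    for (const auto& t : texts) {
        assert(t.size() == image.size());  // embeddings must share one dimension
        const std::vector<float> txt = normalize_sketch(t);
        const float cosine = std::inner_product(img.begin(), img.end(), txt.begin(), 0.0f);
        logits.push_back(cosine * logit_scale);
    }
    return softmax_sketch(logits);
}
```

Because cosine similarity lies in [-1, 1], a scale near 100 makes the softmax sharply peaked: small differences in similarity become large differences in probability.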


@ -1,304 +1,339 @@
import numpy as np
import os
import argparse
import json
import re
from PIL import Image
from amlnnlite.api import AMLNNLite
def preprocess_image(image_path: str, target_size: int = 224) -> np.ndarray:
"""
Preprocess image for CLIP model.
Steps:
1. Load image and convert to RGB
2. Scale the shorter side to target_size
3. Center crop to target_size x target_size
4. Normalize with CLIP mean and std
Args:
image_path (str): Path to input image
target_size (int): Target image size (default: 224)
Returns:
np.ndarray: Preprocessed image data with shape (target_size, target_size, 3)
"""
# Load image
img = Image.open(image_path).convert("RGB")
width, height = img.size
# Scale the shorter side
scale = target_size / min(width, height)
new_w = int(round(width * scale))
new_h = int(round(height * scale))
# Resize
img = img.resize((new_w, new_h), Image.BILINEAR)
# Center crop
left = (new_w - target_size) // 2
top = (new_h - target_size) // 2
img = img.crop((left, top, left + target_size, top + target_size))
# Convert to numpy array and normalize to [0, 1]
img_array = np.array(img, dtype=np.float32) / 255.0
# CLIP normalization
mean = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
std = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
# Normalize: (x - mean) / std
img_array = (img_array - mean) / std
# Return in NHWC format
return img_array
def post_process(
image_features: np.ndarray,
text_features: np.ndarray,
scale: float = 100.00000762939453,
use_cosine: bool = True,
apply_scale: bool = True,
) -> float:
"""
Calculate similarity between image and text features.
Args:
image_features (np.ndarray): Image feature vector
text_features (np.ndarray): Text feature vector
scale (float): Scale factor for similarity calculation
use_cosine (bool): If True, L2-normalize both vectors before dot product (cosine similarity)
apply_scale (bool): If True, multiply by scale after dot product
Returns:
float: Similarity score
"""
img_vec = image_features.flatten().astype(np.float32)
txt_vec = np.array(text_features, dtype=np.float32).flatten()
if len(img_vec) != len(txt_vec):
raise ValueError(f"Feature dimension mismatch: image={len(img_vec)}, text={len(txt_vec)}")
if use_cosine:
img_norm = np.linalg.norm(img_vec) + 1e-8
txt_norm = np.linalg.norm(txt_vec) + 1e-8
img_vec = img_vec / img_norm
txt_vec = txt_vec / txt_norm
dot_product = np.dot(img_vec, txt_vec)
similarity = dot_product * scale if apply_scale else dot_product
return float(similarity)
def extract_index(filename: str) -> int:
"""
Extract index from filename pattern: test_xxx_index.jpg
Args:
filename (str): Filename to extract index from
Returns:
int: Extracted index, or -1 if pattern doesn't match
"""
pattern = r"test_\w+_(\d+)\.jpg"
match = re.match(pattern, filename)
if match:
return int(match.group(1))
return -1
def process_image_dir(
amlnn: AMLNNLite,
image_dir_path: str,
base_dir: str = "",
json_filename: str = ""
) -> list:
"""
Process image directory and find best matching text dataset.
Args:
amlnn: AMLNNLite instance
image_dir_path (str): Path to directory containing test images
base_dir (str): Base directory for clip datasets (optional, can use CLIP_BASE_DIR env var)
json_filename (str): JSON filename in each dataset folder (optional, can use CLIP_JSON_FILENAME env var)
Returns:
list: List of best matching dataset paths
"""
results = []
file_pattern = re.compile(r"test_(\w+)_\d+\.jpg")
image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.JPG', '.JPEG', '.PNG', '.BMP'}
if not base_dir:
base_dir = os.getenv("CLIP_BASE_DIR", "./clip_datasets/")
if not json_filename:
json_filename = os.getenv("CLIP_JSON_FILENAME", "clip_text_res.json")
matched_files = []
if os.path.isdir(image_dir_path):
for filename in os.listdir(image_dir_path):
filepath = os.path.join(image_dir_path, filename)
if os.path.isfile(filepath):
if file_pattern.match(filename):
matched_files.append((filename, filepath, True))
elif any(filename.lower().endswith(ext) for ext in image_extensions):
matched_files.append((filename, filepath, False))
elif os.path.isfile(image_dir_path):
filename = os.path.basename(image_dir_path)
if any(filename.lower().endswith(ext) for ext in image_extensions):
has_pattern = bool(file_pattern.match(filename))
matched_files.append((filename, image_dir_path, has_pattern))
else:
print(f"Error: {image_dir_path} is not a valid image file")
return results
else:
print(f"Error: {image_dir_path} is not a valid directory or file")
return results
if not matched_files:
print(f"Warning: No image files found in {image_dir_path}")
return results
print(f"Found {len(matched_files)} image file(s) to process")
matched_files.sort(key=lambda x: extract_index(x[0]) if x[2] else 999999)
# Process each image
for filename, filepath, has_pattern in matched_files:
if has_pattern:
match = file_pattern.match(filename)
if match:
name = match.group(1)
else:
name = ""
else:
name = ""
# Preprocess image
try:
input_data = preprocess_image(filepath)
input_data = np.expand_dims(input_data, axis=0)
except Exception as e:
print(f"Error preprocessing image {filename}: {e}")
continue
# Run inference
try:
outputs = amlnn.inference(inputs=[input_data])
model_output = outputs[0]
if isinstance(model_output, np.ndarray):
model_output = model_output.astype(np.float32)
else:
model_output = np.array(model_output, dtype=np.float32)
model_output = model_output.flatten()
except Exception as e:
print(f"Error running inference on {filename}: {e}")
continue
max_sim = float('-inf')
best_key = ""
best_id = ""
if not os.path.isdir(base_dir):
print(f"Error: Base directory does not exist: {base_dir}")
continue
print(f"Searching in base directory: {base_dir}")
folder_count = 0
for folder_name in os.listdir(base_dir):
folder_path = os.path.join(base_dir, folder_name)
if not os.path.isdir(folder_path):
continue
if has_pattern and name and name not in folder_name:
continue
folder_count += 1
vit_res_path = os.path.join(folder_path, json_filename)
if not os.path.isfile(vit_res_path):
print(f"Warning: JSON file not found: {vit_res_path}")
continue
try:
with open(vit_res_path, 'r', encoding='utf-8') as f:
vit_json = json.load(f)
for key, text_vec in vit_json.items():
if isinstance(text_vec, list):
text_features = np.array(text_vec, dtype=np.float32)
sim_scaled = post_process(
model_output,
text_features,
use_cosine=True,
apply_scale=True,
)
if sim_scaled > max_sim:
max_sim = sim_scaled
best_key = key
best_id = folder_name
except Exception as e:
print(f"Error loading JSON file {vit_res_path}: {e}")
continue
if best_key and best_id:
best_path = os.path.join(base_dir, best_id)
results.append(best_path)
print(f"\nProcessing image: {filename}")
print(f" Best matching dataset: {best_path}")
else:
print(f"\nProcessing image: {filename}")
print(f" No matching dataset found (searched {folder_count} folder(s))")
return results
def main():
parser = argparse.ArgumentParser(description='CLIP Image-Text Matching Demo')
parser.add_argument('--model-path', required=True, help='Path to the CLIP model file')
parser.add_argument('--base-dir', default='./clip_datasets/', help='Base directory for clip datasets (can also use CLIP_BASE_DIR env var)')
parser.add_argument('--json-filename', default='clip_text_res.json', help='JSON filename in each dataset folder (can also use CLIP_JSON_FILENAME env var, default: clip_text_res.json)')
parser.add_argument('--image-dir', default='./', help='Image directory or single image file to process (optional, will prompt if not provided)')
args = parser.parse_args()
# Initialize AMLNNLite
print("Initializing model...")
amlnn = AMLNNLite()
amlnn.config(model_path=args.model_path)
amlnn.init()
print("Model initialized successfully.\n")
# Process images
if args.image_dir:
results = process_image_dir(amlnn, args.image_dir, args.base_dir, args.json_filename)
print(f"\nTotal results: {len(results)}")
for i, result in enumerate(results):
print(f"Index[{i}]: {result}")
else:
while True:
image_path = input("\nPlease enter the JPG image path or directory (enter 'exit' to quit):\n").strip()
if image_path.lower() == 'exit':
break
if not image_path:
print("The path cannot be empty.")
continue
results = process_image_dir(amlnn, image_path, args.base_dir, args.json_filename)
for i, result in enumerate(results):
print(f"Index[{i}]: {result}")
amlnn.uninit()
print("\nDone.")
if __name__ == "__main__":
main()
# -*- coding: utf-8 -*-
"""
Copyright (C) 2024-2025 Amlogic, Inc. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
# This inference script is designed for CLIP model using AMLNNLite.
import os
import argparse
import numpy as np
from PIL import Image
from transformers import CLIPTokenizer
from amlnnlite.api import AMLNNLite
# ==================== Utility Functions ====================
def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
"""Compute softmax values for array x."""
x = x - np.max(x, axis=axis, keepdims=True)
e = np.exp(x)
return e / np.sum(e, axis=axis, keepdims=True)
def l2_normalize(x: np.ndarray, axis: int = -1, eps: float = 1e-12) -> np.ndarray:
"""L2 normalize array x along specified axis."""
return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)
# ==================== Vision Preprocessing ====================
def preprocess_image(image_path: str, target_size: int = 224) -> np.ndarray:
"""
Preprocess image for CLIP model.
Args:
image_path (str): Path to input image
target_size (int): Target image size (default: 224)
Returns:
np.ndarray: Preprocessed image data with shape (1, target_size, target_size, 3) in NHWC format
"""
image = Image.open(image_path).convert("RGB")
width, height = image.size
# Scale the shorter side
scale = target_size / min(width, height)
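# int() truncates, so clamp to target_size in case float error leaves the short side one pixel small
new_width = max(target_size, int(width * scale))
new_height = max(target_size, int(height * scale))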
image_resized = image.resize((new_width, new_height), resample=Image.BICUBIC)
# Center crop
left = (new_width - target_size) // 2
top = (new_height - target_size) // 2
right = left + target_size
bottom = top + target_size
image_cropped = image_resized.crop((left, top, right, bottom))
# Convert to numpy array and normalize to [0, 1]
image_np = np.array(image_cropped).astype(np.float32) / 255.0
# CLIP normalization
mean = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
std = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
image_np = (image_np - mean) / std
# Add batch dimension: HWC -> NHWC
image_np = np.expand_dims(image_np, axis=0)
return image_np.astype(np.float32) # [1, 224, 224, 3]
# ==================== Text Preprocessing ====================
def preprocess_text(tokenizer: CLIPTokenizer, text: str, max_len: int = 64) -> np.ndarray:
"""
Preprocess text for CLIP model using CLIPTokenizer.
Args:
tokenizer: CLIPTokenizer instance
text (str): Input text string
max_len (int): Maximum sequence length (default: 64)
Returns:
np.ndarray: Tokenized text with shape (1, max_len) as int64
"""
enc = tokenizer(
text,
padding="max_length",
truncation=True,
max_length=max_len,
return_tensors="np",
)
# text model input: int64[1, max_len]
input_ids = enc["input_ids"].astype(np.int64)
return input_ids
# ==================== Model Inference ====================
def compute_image_embedding(vision_amlnn: AMLNNLite, image_path: str) -> np.ndarray:
"""
Compute image embedding using vision model.
Args:
vision_amlnn: AMLNNLite instance for vision model
image_path (str): Path to input image
Returns:
np.ndarray: L2-normalized image embedding with shape (1, embed_dim)
"""
input_data = preprocess_image(image_path) # [1, 224, 224, 3]
outputs = vision_amlnn.inference(
inputs=[input_data],
inputs_data_format='NHWC',
outputs_data_format='NHWC'
)
feats = outputs[0].astype(np.float32)
feats = feats.reshape(1, -1) # Flatten to [1, embed_dim]
return l2_normalize(feats, axis=1)
def compute_text_embedding(text_amlnn: AMLNNLite, tokenizer: CLIPTokenizer, text: str, max_len: int = 64) -> np.ndarray:
"""
Compute text embedding using text model.
Args:
text_amlnn: AMLNNLite instance for text model
tokenizer: CLIPTokenizer instance
text (str): Input text string
max_len (int): Maximum sequence length
Returns:
np.ndarray: L2-normalized text embedding with shape (1, embed_dim)
"""
input_ids = preprocess_text(tokenizer, text, max_len) # [1, max_len]
print(f"input_ids: {input_ids}")
# AMLNNLite requires 4D input, reshape to (1, 1, 1, max_len)
input_ids_4d = input_ids[:, None, None, :] # [1, 1, 1, max_len]
outputs = text_amlnn.inference(
inputs=[input_ids_4d],
inputs_data_format='NHWC',
outputs_data_format='NHWC'
)
feats = outputs[0].astype(np.float32)
feats = feats.reshape(1, -1) # Flatten to [1, embed_dim]
return l2_normalize(feats, axis=1)
def compute_text_embeddings_batch(text_amlnn: AMLNNLite, tokenizer: CLIPTokenizer, texts: list, max_len: int = 64) -> np.ndarray:
"""
Compute text embeddings for multiple texts.
Args:
text_amlnn: AMLNNLite instance for text model
tokenizer: CLIPTokenizer instance
texts (list): List of input text strings
max_len (int): Maximum sequence length
Returns:
np.ndarray: L2-normalized text embeddings with shape (num_texts, embed_dim)
"""
embeddings = []
for text in texts:
emb = compute_text_embedding(text_amlnn, tokenizer, text, max_len)
embeddings.append(emb[0]) # Remove batch dimension
return np.stack(embeddings, axis=0) # [num_texts, embed_dim]
# ==================== Similarity Calculation ====================
def compute_similarity(image_embedding: np.ndarray, text_embeddings: np.ndarray, logit_scale: float = 100.0) -> tuple:
"""
Compute similarity between image and text embeddings.
Args:
image_embedding (np.ndarray): Image embedding with shape (1, embed_dim)
text_embeddings (np.ndarray): Text embeddings with shape (num_texts, embed_dim)
logit_scale (float): Scale factor for logits
Returns:
tuple: (similarities, logits, probabilities)
"""
# Cosine similarity (embeddings are already L2-normalized)
sims = text_embeddings @ image_embedding[0] # [num_texts]
logits = sims * logit_scale # [num_texts]
probs = softmax(logits, axis=0) # [num_texts]
return sims, logits, probs


# ==================== Main Function ====================

def main():
    parser = argparse.ArgumentParser(description='CLIP Image-Text Matching Demo using AMLNNLite')
    parser.add_argument('--vision-model', required=True, help='Path to vision model (.adla)')
    parser.add_argument('--text-model', required=True, help='Path to text model (.adla)')
    parser.add_argument('--tokenizer-dir', required=True, help='Path to CLIPTokenizer directory')
    parser.add_argument('--image-path', default=None, help='Path to input image (optional, will prompt if not provided)')
    parser.add_argument('--texts', nargs='+', default=None, help='List of text descriptions to compare')
    parser.add_argument('--max-len', type=int, default=64, help='Maximum token sequence length (default: 64)')
    parser.add_argument('--logit-scale', type=float, default=100.0, help='Logit scale factor (default: 100.0)')
    args = parser.parse_args()

    # Validate model paths
    if not os.path.exists(args.vision_model):
        print(f"[Error] Vision model not found: {args.vision_model}")
        return -1
    if not os.path.exists(args.text_model):
        print(f"[Error] Text model not found: {args.text_model}")
        return -1

    # Load tokenizer
    print(f"[Info] Loading CLIPTokenizer from: {args.tokenizer_dir}")
    tokenizer = CLIPTokenizer.from_pretrained(args.tokenizer_dir)

    # Initialize vision model
    print(f"[Info] Initializing vision model: {args.vision_model}")
    vision_amlnn = AMLNNLite()
    vision_amlnn.config(model_path=args.vision_model, run_cycles=1)
    vision_amlnn.init()

    # Initialize text model
    print(f"[Info] Initializing text model: {args.text_model}")
    text_amlnn = AMLNNLite()
    text_amlnn.config(model_path=args.text_model, run_cycles=1)
    text_amlnn.init()
    print("[Info] Models initialized successfully.\n")

    try:
        # Interactive loop
        while True:
            # Get image path
            if args.image_path:
                image_path = args.image_path
                args.image_path = None  # Clear for next iteration
            else:
                print("=" * 60)
                print("[Info] Image Path (or 'exit' to quit):")
                image_path = input().strip()

            # Check for exit
            if image_path.lower() == 'exit':
                print("[Info] Exiting...")
                break

            # Validate image path
            if not image_path:
                print("[Warning] Please enter an image path.")
                continue
            if not os.path.exists(image_path):
                print(f"[Error] Image not found: {image_path}")
                continue

            # Get texts to compare
            if args.texts:
                texts = args.texts
                args.texts = None  # Clear for next iteration
            else:
                print("[Info] Enter text descriptions (comma-separated, or 'skip' to use defaults):")
                text_input = input().strip()
                if text_input.lower() == 'skip' or not text_input:
                    # Default texts for demo
                    texts = [
                        "a red handbag",
                        "a blue jacket",
                        "a red bus",
                    ]
                    print(f"[Info] Using default texts: {texts}")
                else:
                    texts = [t.strip() for t in text_input.split(',') if t.strip()]

            if not texts:
                print("[Warning] No texts provided.")
                continue

            try:
                # Compute image embedding
                print(f"\n[Info] Processing image: {image_path}")
                image_embedding = compute_image_embedding(vision_amlnn, image_path)
                print(f"[Info] Image embedding shape: {image_embedding.shape}")

                # Compute text embeddings
                print(f"[Info] Processing {len(texts)} text(s)...")
                text_embeddings = compute_text_embeddings_batch(text_amlnn, tokenizer, texts, args.max_len)
                print(f"[Info] Text embeddings shape: {text_embeddings.shape}")

                # Compute similarity
                sims, logits, probs = compute_similarity(image_embedding, text_embeddings, args.logit_scale)

                # Print results
                print("\n" + "=" * 60)
                print("CLIP Image-Text Matching Results")
                print("=" * 60)
                print(f"Image: {image_path}")
                print(f"logit_scale: {args.logit_scale:.6f}")
                print("-" * 60)

                # Sort by probability (descending)
                sorted_indices = np.argsort(probs)[::-1]
                for rank, i in enumerate(sorted_indices):
                    print(f"[{rank + 1}] prob={probs[i]:.6f} sim={float(sims[i]):.6f} text='{texts[i]}'")
                print("=" * 60 + "\n")
            except Exception as e:
                print(f"[Error] Processing failed: {e}")
                import traceback
                traceback.print_exc()
                continue
    except KeyboardInterrupt:
        print("\n\n[Info] Interrupted by user. Exiting...")
    finally:
        # Cleanup
        vision_amlnn.uninit()
        text_amlnn.uninit()
        print("[Info] Done.")

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
