feat: update demo code of CLIP
This commit is contained in:
parent
4bf4aafc73
commit
5478a8618b
12 changed files with 50385 additions and 694 deletions
BIN
examples/clip/000000004505.jpg
Executable file
Binary file not shown.
Size: 210 KiB
@ -1,95 +1,159 @@
|
||||||
## Demo Run
|
# CLIP
|
||||||
|
|
||||||
### CPP
|
## 1. Overview
|
||||||
|
|
||||||
#### 1. Compile
|
This demo shows how to run CLIP (Contrastive Language-Image Pre-Training) image-text matching with AMLNNLite. The CLIP model has two parts, a vision encoder and a text encoder, whose embeddings are compared to score how well each text description matches an image.
|
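To make the matching step concrete, here is a minimal Python sketch; it is an illustration, not the demo's code. It assumes each encoder returns a plain list of floats. The 3-dimensional vectors are toy values (the example log in this README reports 512-dimensional embeddings), and the helper names `l2_normalize` and `cosine_similarity` are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]

def cosine_similarity(a, b):
    """Dot product of the two L2-normalized vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

# Toy embeddings standing in for encoder outputs (illustrative values only).
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a red bus": [0.8, 0.2, 0.1],
    "a blue jacket": [0.1, 0.9, 0.3],
}

# Score every text against the image and keep the closest one.
scores = {text: cosine_similarity(image_embedding, emb)
          for text, emb in text_embeddings.items()}
best_text = max(scores, key=scores.get)
print(best_text, round(scores[best_text], 3))
```

Normalizing both vectors first means the score depends only on direction, so embeddings with different magnitudes remain comparable.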
||||||
|
|
||||||
**Prerequisites:**
|
## 2. Model Download
|
||||||
- Android NDK (r25e recommended)
|
|
||||||
- `ANDROID_NDK_PATH` environment variable set
|
TO DO
|
||||||
|
|
||||||
**Build:**
|
## 3. Model Conversion
|
||||||
```bash
|
|
||||||
# Build for arm64-v8a
|
TO DO
|
||||||
cd examples/clip/cpp
|
|
||||||
./build-android.sh -a arm64-v8a
|
## 4. Demo Run
|
||||||
```
|
|
||||||
|
### CPP
|
||||||
The executable will be generated at `build/android_arm64-v8a/clip_demo` (Note: executable name may vary, verify in build folder).
|
|
||||||
|
#### 1. Compile
|
||||||
#### 2. Run
|
|
||||||
|
**Prerequisites:**
|
||||||
```bash
|
- Android NDK (r25e recommended)
|
||||||
# Push executable to device
|
- `ANDROID_NDK_PATH` environment variable set
|
||||||
adb push build/android_arm64-v8a/clip_demo /data/local/tmp/
|
|
||||||
adb push model/vision_model_int8_A311D2.adla /data/local/tmp/
|
**Build:**
|
||||||
adb push clip_datasets/ /data/local/tmp/
|
```bash
|
||||||
adb push test_hat_0.jpg /data/local/tmp/
|
# Build for arm64-v8a
|
||||||
|
cd examples/clip/cpp
|
||||||
# Run on device
|
./build-android.sh -a arm64-v8a
|
||||||
adb shell
|
```
|
||||||
cd /data/local/tmp
|
|
||||||
chmod +x clip_demo
|
The executable will be generated at `build/android_arm64-v8a/clip_demo`.
|
||||||
export LD_LIBRARY_PATH=/vendor/lib64  # use /vendor/lib on 32-bit systems
|
|
||||||
|
#### 2. Run
|
||||||
# Usage: ./clip_demo <model_path> [base_dir] [json_filename]
|
|
||||||
./clip_demo vision_model_int8_A311D2.adla ./clip_datasets/ clip_text_res.json
|
```bash
|
||||||
```
|
# Push executable and resources to device
|
||||||
|
adb push build/android_arm64-v8a/clip_demo /data/local/tmp/
|
||||||
**Note:**
|
adb push model/vision_model_int8_S905X5.adla /data/local/tmp/
|
||||||
- Replace `vision_model_int8_A311D2.adla` with your actual model file path.
|
adb push model/text_model_int8_S905X5.adla /data/local/tmp/
|
||||||
- The `base_dir` and `json_filename` parameters are optional. You can also use environment variables `CLIP_BASE_DIR` and `CLIP_JSON_FILENAME`.
|
adb push tokenizer_path/ /data/local/tmp/
|
||||||
- The program will prompt you to enter image paths interactively. Enter "exit" to quit.
|
|
||||||
|
# Run on device
|
||||||
### Python
|
adb shell
|
||||||
|
cd /data/local/tmp
|
||||||
**Prerequisites:**
|
chmod +x clip_demo
|
||||||
- Python 3.10
|
export LD_LIBRARY_PATH=/vendor/lib64  # use /vendor/lib on 32-bit systems
|
||||||
- Required packages: `numpy`, `Pillow`, `amlnnlite`
|
|
||||||
|
# Usage: ./clip_demo <vision_model> <text_model> <tokenizer_path> [--profiling]
|
||||||
**Install dependencies:**
|
./clip_demo vision_model_int8_S905X5.adla text_model_int8_S905X5.adla ./tokenizer_path/
|
||||||
```bash
|
```
|
||||||
pip install numpy Pillow amlnnlite-1.0.0-cp310-cp310-linux_aarch64.whl
|
|
||||||
```
|
The program will prompt for image paths and text descriptions interactively. Enter the path to an image file, then enter comma-separated text descriptions (or `skip` to use defaults). Type `exit` to quit.
|
||||||
|
|
||||||
**Run on device:**
|
**Argument Descriptions:**
|
||||||
```bash
|
|
||||||
# Basic usage (process current directory)
|
| Argument | Description |
|
||||||
python clip.py --model-path ./vision_model_int8_A311D2.adla
|
| -------------- | ------------------------------------------------------------ |
|
||||||
|
| vision_model | Path to vision encoder .adla model (required) |
|
||||||
# Specify image directory or file
|
| text_model | Path to text encoder .adla model (required) |
|
||||||
python clip.py --model-path ./vision_model_int8_A311D2.adla --image-dir ./
|
| tokenizer_path | Path to directory containing `vocab.json` and `merges.txt` (required) |
|
||||||
|
| --profiling | Enable performance profiling output (optional) |
|
||||||
# Specify base directory and JSON filename
|
|
||||||
python clip.py --model-path ./vision_model_int8_A311D2.adla --base-dir ./clip_datasets/ --json-filename clip_text_res.json
|
**Note:** The `tokenizer_path` should contain `vocab.json` and `merges.txt` files from the CLIP tokenizer (e.g., from `openai/clip-vit-base-patch32`).
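As a rough illustration of what the tokenizer directory is expected to hold, the Python sketch below builds a tiny stand-in directory and reads it back. The file contents are toy values, not the real CLIP vocabulary, and `load_tokenizer_files` is a hypothetical helper. It assumes `vocab.json` maps token strings to integer IDs and `merges.txt` lists one space-separated merge pair per line after an optional `#version` header, which is how `clip_tokenizer.cpp` in this commit reads them.

```python
import json
import tempfile
from pathlib import Path

def load_tokenizer_files(tokenizer_dir):
    """Return (token_to_id, bpe_ranks) read from vocab.json and merges.txt."""
    root = Path(tokenizer_dir)
    token_to_id = json.loads((root / "vocab.json").read_text(encoding="utf-8"))

    bpe_ranks = {}
    lines = (root / "merges.txt").read_text(encoding="utf-8").splitlines()
    for index, line in enumerate(lines):
        if index == 0 and "#version" in line:
            continue  # optional header on the first line
        parts = line.split()
        if len(parts) >= 2:
            # Earlier lines get lower ranks, i.e. higher merge priority.
            bpe_ranks[(parts[0], parts[1])] = len(bpe_ranks)
    return token_to_id, bpe_ranks

# Build a toy tokenizer directory (illustrative contents only).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "vocab.json").write_text(
        json.dumps({"b": 0, "u": 1, "s</w>": 2, "bu": 3, "bus</w>": 4}),
        encoding="utf-8")
    Path(tmp, "merges.txt").write_text(
        "#version: 0.2\nb u\nbu s</w>\n", encoding="utf-8")
    token_to_id, bpe_ranks = load_tokenizer_files(tmp)

print(token_to_id["bus</w>"], bpe_ranks)
```

If either file is missing, `read_text` raises `FileNotFoundError`, which is a convenient early check before pushing the directory to the device.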
|
||||||
```
|
|
||||||
|
### Python
|
||||||
The script will automatically process all image files (`.jpg`, `.jpeg`, `.png`, `.bmp`) in the specified directory or process a single image file, and display the best matching dataset for each image.
|
|
||||||
|
**Prerequisites:**
|
||||||
5. Results
|
- Python 3.10
|
||||||
|
- Required packages: `numpy`, `Pillow`, `transformers`, `amlnnlite`
|
||||||
The program will print the best matching dataset path for each processed image. The program searches through all dataset folders in the base directory and finds the text feature with the highest similarity to the input image.
|
|
||||||
|
**Install dependencies:**
|
||||||
**Example output:**
|
```bash
|
||||||
```
|
pip install numpy Pillow transformers amlnnlite-1.0.0-cp310-cp310-linux_aarch64.whl
|
||||||
# python demo result
|
```
|
||||||
Model initialized successfully.
|
|
||||||
|
**Run on device:**
|
||||||
Found 2 image file(s) to process
|
```bash
|
||||||
Searching in base directory: ./clip_datasets/
|
python clip.py \
|
||||||
|
--vision-model ./vision_model_int8_S905X5.adla \
|
||||||
Processing image: test_jacket_0.jpg
|
--text-model ./text_model_int8_S905X5.adla \
|
||||||
Best matching dataset: ./clip_datasets/shirt10_jacket7
|
--tokenizer-dir ./tokenizer_path \
|
||||||
Searching in base directory: ./clip_datasets/
|
--image-path ./000000004505.jpg \
|
||||||
|
--texts "a red handbag" "a blue jacket" "a red bus"
|
||||||
Processing image: test_hat_0.jpg
|
```
|
||||||
Best matching dataset: ./clip_datasets/hat1_jd
|
|
||||||
|
**Interactive Mode (Recommended):**
|
||||||
Total results: 2
|
|
||||||
Index[0]: ./clip_datasets/shirt10_jacket7
|
If you don't provide `--image-path`, the program will run in interactive mode:
|
||||||
Index[1]: ./clip_datasets/hat1_jd
|
|
||||||
|
```bash
|
||||||
Done.
|
python clip.py \
|
||||||
```
|
--vision-model ./vision_model_int8_S905X5.adla \
|
||||||
|
--text-model ./text_model_int8_S905X5.adla \
|
||||||
The program returns the dataset folder path that contains the text feature with the highest similarity to the input image. Each result represents the best matching dataset for the corresponding input image.
|
--tokenizer-dir ./tokenizer_path
|
||||||
|
```
|
||||||
|
|
||||||
|
The program will prompt for image paths and text descriptions. Enter an image path to process, then enter comma-separated texts to compare. Type `exit` to quit.
|
||||||
|
|
||||||
|
**Argument Descriptions:**
|
||||||
|
|
||||||
|
| Argument | Description |
|
||||||
|
| ---------------- | ------------------------------------------------------------ |
|
||||||
|
| --vision-model | Path to vision encoder .adla model (required) |
|
||||||
|
| --text-model | Path to text encoder .adla model (required) |
|
||||||
|
| --tokenizer-dir | Path to CLIPTokenizer directory (required) |
|
||||||
|
| --image-path | Path to input image (.jpg, .png) - optional, will prompt if not provided |
|
||||||
|
| --texts | List of text descriptions to compare (space-separated) |
|
||||||
|
| --max-len | Maximum token sequence length, default is 64 |
|
||||||
|
| --logit-scale | Logit scale factor, default is 100.0 |
|
||||||
|
|
||||||
|
**Note:** The `--tokenizer-dir` should point to the directory containing the CLIPTokenizer files. You can use a Hugging Face model ID (e.g., `openai/clip-vit-base-patch32`) or a local directory.
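The argument table can be read as an `argparse` definition. The sketch below is an assumption about how such a parser might look, built only from the flags and defaults listed in the table; the actual `clip.py` may define them differently, and `build_parser` is a hypothetical name.

```python
import argparse

def build_parser():
    """Parser mirroring the documented flags (a sketch, not the real clip.py)."""
    parser = argparse.ArgumentParser(description="CLIP image-text matching (sketch)")
    parser.add_argument("--vision-model", required=True,
                        help="Path to vision encoder .adla model")
    parser.add_argument("--text-model", required=True,
                        help="Path to text encoder .adla model")
    parser.add_argument("--tokenizer-dir", required=True,
                        help="Path to CLIPTokenizer directory")
    parser.add_argument("--image-path", default=None,
                        help="Input image; omitted means interactive mode")
    parser.add_argument("--texts", nargs="*", default=[],
                        help="Text descriptions to compare")
    parser.add_argument("--max-len", type=int, default=64,
                        help="Maximum token sequence length")
    parser.add_argument("--logit-scale", type=float, default=100.0,
                        help="Logit scale factor")
    return parser

args = build_parser().parse_args([
    "--vision-model", "vision_model_int8_S905X5.adla",
    "--text-model", "text_model_int8_S905X5.adla",
    "--tokenizer-dir", "./tokenizer_path",
    "--texts", "a red handbag", "a blue jacket", "a red bus",
])
interactive = args.image_path is None
print(interactive, args.texts, args.max_len, args.logit_scale)
```

With `nargs="*"`, each quoted description becomes one list element, which matches the space-separated `--texts` usage shown in the run command.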
|
||||||
|
|
||||||
|
## 5. Results
|
||||||
|
|
||||||
|
**Performance Feedback**
|
||||||
|
|
||||||
|
With the `--profiling` flag (C++) or the log level set to INFO, the program prints performance metrics when it finishes. The console log includes:
|
||||||
|
- Hardware Information: System and ADLA library versions.
|
||||||
|
- Model Overview: Basic input/output configurations.
|
||||||
|
- NPU Metrics: Total inference time (latency) and total DRAM bandwidth consumption.
|
||||||
|
|
||||||
|
**Interactive Mode Example:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ ./clip_demo vision_model_int8_S905X5.adla text_model_int8_S905X5.adla ./tokenizer_path
|
||||||
|
|
||||||
|
[Info] Models initialized successfully.
|
||||||
|
|
||||||
|
============================================================
|
||||||
|
[Info] Image Path (or 'exit' to quit):
|
||||||
|
000000004505.jpg
|
||||||
|
[Info] Enter text descriptions (comma-separated, or 'skip' for defaults):
|
||||||
|
a red handbag, a blue jacket, a red bus
|
||||||
|
|
||||||
|
[Info] Processing image: 000000004505.jpg
|
||||||
|
[Info] Image embedding size: 512
|
||||||
|
[Info] Processing 3 text(s)...
|
||||||
|
[Info] Text embeddings size: 3 x 512
|
||||||
|
|
||||||
|
============================================================
|
||||||
|
CLIP Image-Text Matching Results
|
||||||
|
============================================================
|
||||||
|
Image: 000000004505.jpg
|
||||||
|
logit_scale: 100.000000
|
||||||
|
------------------------------------------------------------
|
||||||
|
[1] prob=0.999975 sim=0.327895 text='a red bus'
|
||||||
|
[2] prob=0.000016 sim=0.217690 text='a red handbag'
|
||||||
|
[3] prob=0.000008 sim=0.211029 text='a blue jacket'
|
||||||
|
============================================================
|
||||||
|
|
||||||
|
============================================================
|
||||||
|
[Info] Image Path (or 'exit' to quit):
|
||||||
|
exit
|
||||||
|
[Info] Exiting...
|
||||||
|
Free vision model memory.
|
||||||
|
Free text model memory.
|
||||||
|
[Info] Done.
|
||||||
|
```
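The `prob` column in the log above is consistent with a softmax over `logit_scale * sim`. The sketch below is an illustration rather than the demo's code: it takes the three `sim` values and the `logit_scale` printed in the log and recomputes the probabilities. The name `softmax` mirrors the declaration in `clip_process.h`, but this Python body is an assumption about how that function behaves.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logit_scale = 100.0
sims = {  # values copied from the example log
    "a red bus": 0.327895,
    "a red handbag": 0.217690,
    "a blue jacket": 0.211029,
}

probs = softmax([logit_scale * s for s in sims.values()])
for text, p in zip(sims, probs):
    print(f"prob={p:.6f} text={text!r}")
```

Because the similarities are multiplied by 100 before the softmax, a gap of about 0.11 in cosine similarity becomes a gap of about 11 in the logits, which is why the top match takes almost all of the probability mass.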
|
||||||
|
|
|
||||||
|
|
@ -1,42 +1,43 @@
|
||||||
cmake_minimum_required(VERSION 3.5)
|
cmake_minimum_required(VERSION 3.5)
|
||||||
project(clip_demo)
|
project(clip_demo)
|
||||||
|
|
||||||
set(CMAKE_CXX_STANDARD 17)
|
set(CMAKE_CXX_STANDARD 17)
|
||||||
|
|
||||||
# Set NNSDK path
|
# Set NNSDK path
|
||||||
set(NNSDK_ROOT "${CMAKE_SOURCE_DIR}/../../../../dependency/nnsdk")
|
set(NNSDK_ROOT "${CMAKE_SOURCE_DIR}/../../../../dependency/nnsdk")
|
||||||
include_directories(${NNSDK_ROOT}/include)
|
include_directories(${NNSDK_ROOT}/include)
|
||||||
include_directories(${CMAKE_SOURCE_DIR}/../../../../common)
|
include_directories(${CMAKE_SOURCE_DIR}/../../../../common)
|
||||||
|
|
||||||
# Set 3rdparty path
|
# Set 3rdparty path
|
||||||
set(3RDPARTY_DIR "${CMAKE_SOURCE_DIR}/../../../../dependency")
|
set(3RDPARTY_DIR "${CMAKE_SOURCE_DIR}/../../../../dependency")
|
||||||
|
|
||||||
# Include directories for stb_image and json
|
# Include directories for stb_image and json
|
||||||
# Note: code uses #include "stb_image.h" and #include "json.hpp"
|
# Note: code uses #include "stb_image.h" and #include "json.hpp"
|
||||||
include_directories(${3RDPARTY_DIR}/stb_image)
|
include_directories(${3RDPARTY_DIR}/stb_image)
|
||||||
include_directories(${3RDPARTY_DIR}/json)
|
include_directories(${3RDPARTY_DIR}/json)
|
||||||
|
|
||||||
if(CMAKE_SYSTEM_NAME STREQUAL "Android")
|
if(CMAKE_SYSTEM_NAME STREQUAL "Android")
|
||||||
if (ANDROID_ABI STREQUAL "arm64-v8a")
|
if (ANDROID_ABI STREQUAL "arm64-v8a")
|
||||||
link_directories(${NNSDK_ROOT}/lib/android/arm64-v8a)
|
link_directories(${NNSDK_ROOT}/lib/android/arm64-v8a)
|
||||||
else()
|
else()
|
||||||
link_directories(${NNSDK_ROOT}/lib/android/armeabi-v7a)
|
link_directories(${NNSDK_ROOT}/lib/android/armeabi-v7a)
|
||||||
endif()
|
endif()
|
||||||
# Android needs log
|
# Android needs log
|
||||||
link_libraries(log)
|
link_libraries(log)
|
||||||
elseif(CMAKE_SYSTEM_NAME STREQUAL "Linux")
|
elseif(CMAKE_SYSTEM_NAME STREQUAL "Linux")
|
||||||
link_directories(${NNSDK_ROOT}/lib/linux/lib64_yocto)
|
link_directories(${NNSDK_ROOT}/lib/linux/lib64_yocto)
|
||||||
endif()
|
endif()
|
||||||
|
|
||||||
add_executable(${PROJECT_NAME}
|
add_executable(${PROJECT_NAME}
|
||||||
main.cpp
|
main.cpp
|
||||||
model_invoke.cpp
|
model_invoke.cpp
|
||||||
pre_postprocess.cpp
|
pre_postprocess.cpp
|
||||||
)
|
clip_tokenizer.cpp
|
||||||
|
)
|
||||||
target_link_libraries(${PROJECT_NAME}
|
|
||||||
nnsdk
|
target_link_libraries(${PROJECT_NAME}
|
||||||
dl
|
nnsdk
|
||||||
m
|
dl
|
||||||
)
|
m
|
||||||
|
)
|
||||||
|
|
||||||
|
|
|
||||||
53
examples/clip/cpp/src/clip_process.h
Executable file
|
|
@ -0,0 +1,53 @@
|
||||||
|
/*
 * Copyright (C) 2024–2025 Amlogic, Inc. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#ifndef CLIP_PROCESS_H
#define CLIP_PROCESS_H

#include <string>
#include <vector>
#include <cstdint>

// ==================== Model Invoke ====================

// Initialize network from file
void* init_network_file(const char *model_path);

// Run vision model inference
std::vector<float> run_vision_model(void* context, const std::vector<float>& input_data);

// Run text model inference
std::vector<float> run_text_model(void* context, const std::vector<int64_t>& input_ids);

// Destroy network
int destroy_network(void *qcontext);

// ==================== Pre/Post Processing ====================

// Image preprocessing
std::vector<float> preprocess_image(const std::string& image_path);

// L2 normalize
std::vector<float> l2_normalize(const std::vector<float>& vec);

// Softmax
std::vector<float> softmax(const std::vector<float>& logits);

// Compute cosine similarity
float compute_similarity(const std::vector<float>& a, const std::vector<float>& b, float scale = 100.0f);

#endif // CLIP_PROCESS_H
|
||||||
|
|
||||||
395
examples/clip/cpp/src/clip_tokenizer.cpp
Executable file
|
|
@ -0,0 +1,395 @@
|
||||||
|
/*
 * Copyright (C) 2024–2025 Amlogic, Inc. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include "clip_tokenizer.h"
#include "json.hpp"

#include <fstream>
#include <sstream>
#include <iostream>
#include <algorithm>
#include <regex>
#include <set>
#include <cassert>
#include <climits>  // INT_MAX, used in bpe()
#include <cstdio>   // printf, used in load()
#include <codecvt>
#include <locale>

using json = nlohmann::ordered_json;

// Reference: https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py
|
||||||
|
|
||||||
|
void CLIPTokenizer::init_byte_to_unicode()
{
    byte_to_unicode_.clear();
    unicode_to_byte_.clear();

    // Printable ASCII ranges that map to themselves
    // '!' (33) to '~' (126), '¡' (161) to '¬' (172), '®' (174) to 'ÿ' (255)
    std::vector<int> bs;
    for (int i = 33; i <= 126; ++i) bs.push_back(i);   // '!' to '~'
    for (int i = 161; i <= 172; ++i) bs.push_back(i);  // '¡' to '¬'
    for (int i = 174; i <= 255; ++i) bs.push_back(i);  // '®' to 'ÿ'

    std::vector<int> cs(bs.begin(), bs.end());

    // Map remaining bytes (0-32, 127-160, 173) to 256+
    int n = 0;
    for (int b = 0; b < 256; ++b) {
        if (std::find(bs.begin(), bs.end(), b) == bs.end()) {
            bs.push_back(b);
            cs.push_back(256 + n);
            n++;
        }
    }

    for (size_t i = 0; i < bs.size(); ++i) {
        byte_to_unicode_[static_cast<uint8_t>(bs[i])] = static_cast<char32_t>(cs[i]);
        unicode_to_byte_[static_cast<char32_t>(cs[i])] = static_cast<uint8_t>(bs[i]);
    }
}
|
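Review note: the byte-to-unicode table built above can be cross-checked with a few lines of Python. This is a sketch that mirrors the same three printable ranges; it is not part of the commit, and `bytes_to_unicode` is an illustrative name borrowed from the referenced `simple_tokenizer.py`.

```python
def bytes_to_unicode():
    """Map every byte 0..255 to a distinct printable Unicode character."""
    keep = (list(range(33, 127))      # '!' to '~'
            + list(range(161, 173))   # '¡' to '¬'
            + list(range(174, 256)))  # '®' to 'ÿ'
    mapping = {b: chr(b) for b in keep}
    shifted = 0
    for b in range(256):
        if b not in mapping:
            # Remaining bytes (0-32, 127-160, 173) move to code points 256+.
            mapping[b] = chr(256 + shifted)
            shifted += 1
    return mapping

table = bytes_to_unicode()
print(len(table), repr(table[ord("a")]), hex(ord(table[ord(" ")])))
```

The mapping is a bijection over all 256 byte values, so any input byte sequence can be represented with printable characters and mapped back without loss.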
||||||
|
|
||||||
|
// ========== UTF-8 Helpers ==========

std::vector<char32_t> CLIPTokenizer::utf8_to_codepoints(const std::string& str)
{
    std::vector<char32_t> result;
    size_t i = 0;
    while (i < str.size()) {
        char32_t cp = 0;
        unsigned char c = str[i];
        int len = 0;
        if (c < 0x80) {
            cp = c;
            len = 1;
        } else if ((c & 0xE0) == 0xC0) {
            cp = c & 0x1F;
            len = 2;
        } else if ((c & 0xF0) == 0xE0) {
            cp = c & 0x0F;
            len = 3;
        } else if ((c & 0xF8) == 0xF0) {
            cp = c & 0x07;
            len = 4;
        } else {
            ++i;
            continue;
        }
        for (int j = 1; j < len && (i + j) < str.size(); ++j) {
            cp = (cp << 6) | (str[i + j] & 0x3F);
        }
        result.push_back(cp);
        i += len;
    }
    return result;
}

std::string CLIPTokenizer::codepoints_to_utf8(const std::vector<char32_t>& cps)
{
    std::string result;
    for (char32_t cp : cps) {
        if (cp < 0x80) {
            result += static_cast<char>(cp);
        } else if (cp < 0x800) {
            result += static_cast<char>(0xC0 | (cp >> 6));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            result += static_cast<char>(0xE0 | (cp >> 12));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            result += static_cast<char>(0xF0 | (cp >> 18));
            result += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return result;
}
|
||||||
|
|
||||||
|
// ========== Load Functions ==========

bool CLIPTokenizer::load(const std::string& vocab_path, const std::string& merges_path)
{
    init_byte_to_unicode();

    // Load vocab.json
    {
        std::ifstream file(vocab_path);
        if (!file.is_open()) {
            std::cerr << "Failed to open vocab file: " << vocab_path << std::endl;
            return false;
        }

        try {
            json j;
            file >> j;
            for (auto it = j.begin(); it != j.end(); ++it) {
                std::string token = it.key();
                int id = it.value().get<int>();
                token_to_id_[token] = id;
                id_to_token_[id] = token;
            }
        } catch (const std::exception& e) {
            std::cerr << "Error parsing vocab.json: " << e.what() << std::endl;
            return false;
        }
    }

    // Find special token IDs
    if (token_to_id_.count("<|startoftext|>")) {
        sot_token_id_ = token_to_id_["<|startoftext|>"];
    }
    if (token_to_id_.count("<|endoftext|>")) {
        eot_token_id_ = token_to_id_["<|endoftext|>"];
    }

    // Load merges.txt
    {
        std::ifstream file(merges_path);
        if (!file.is_open()) {
            std::cerr << "Failed to open merges file: " << merges_path << std::endl;
            return false;
        }

        std::string line;
        int rank = 0;

        // Skip header line "#version: ..." if present
        if (std::getline(file, line)) {
            if (line.find("#version") == std::string::npos) {
                // First line is not a header, process it
                std::istringstream iss(line);
                std::string a, b;
                if (iss >> a >> b) {
                    bpe_ranks_[{a, b}] = rank++;
                }
            }
        }

        while (std::getline(file, line)) {
            if (line.empty()) continue;
            std::istringstream iss(line);
            std::string a, b;
            if (iss >> a >> b) {
                bpe_ranks_[{a, b}] = rank++;
            }
        }
    }

    loaded_ = true;
    printf("[Info] CLIPTokenizer loaded: vocab_size=%zu, merges=%zu\n",
           token_to_id_.size(), bpe_ranks_.size());
    return true;
}

bool CLIPTokenizer::load_from_dir(const std::string& tokenizer_dir)
{
    std::string dir = tokenizer_dir;
    // Ensure trailing slash
    if (!dir.empty() && dir.back() != '/' && dir.back() != '\\') {
        dir += "/";
    }
    return load(dir + "vocab.json", dir + "merges.txt");
}
|
||||||
|
|
||||||
|
// ========== BPE Implementation ==========

std::string CLIPTokenizer::bytes_to_unicode_str(const std::string& raw) const
{
    std::vector<char32_t> result;
    for (unsigned char c : raw) {
        auto it = byte_to_unicode_.find(c);
        if (it != byte_to_unicode_.end()) {
            result.push_back(it->second);
        }
    }
    return codepoints_to_utf8(result);
}

std::vector<std::string> CLIPTokenizer::bpe(const std::string& token) const
{
    // Convert token to individual unicode characters as strings
    auto codepoints = utf8_to_codepoints(token);
    if (codepoints.empty()) return {};

    // Each character becomes a separate piece
    std::vector<std::string> word;
    for (size_t i = 0; i < codepoints.size(); ++i) {
        std::string piece = codepoints_to_utf8({codepoints[i]});
        // CLIP adds </w> to the last character
        if (i == codepoints.size() - 1) {
            piece += "</w>";
        }
        word.push_back(piece);
    }

    if (word.size() == 1) return word;

    // Iteratively merge the adjacent pair with the lowest rank (highest priority)
    while (true) {
        if (word.size() < 2) break;

        // Find the pair with the lowest rank
        int best_rank = INT_MAX;
        int best_idx = -1;

        for (size_t i = 0; i < word.size() - 1; ++i) {
            auto it = bpe_ranks_.find({word[i], word[i + 1]});
            if (it != bpe_ranks_.end() && it->second < best_rank) {
                best_rank = it->second;
                best_idx = static_cast<int>(i);
            }
        }

        if (best_idx == -1) break;  // No more merges possible

        // Merge the pair at best_idx
        std::string merged = word[best_idx] + word[best_idx + 1];
        std::vector<std::string> new_word;
        for (size_t i = 0; i < word.size(); ++i) {
            if (static_cast<int>(i) == best_idx) {
                new_word.push_back(merged);
                ++i;  // Skip next element
            } else {
                new_word.push_back(word[i]);
            }
        }
        word = new_word;
    }

    return word;
}
|
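Review note: the merge loop in `bpe()` can be illustrated with a short Python sketch. It assumes the same rule as the C++ code (repeatedly merge the adjacent pair with the lowest rank, one occurrence at a time) and uses a made-up two-entry merge table rather than the real `merges.txt`.

```python
def bpe(word, bpe_ranks):
    """Greedily merge adjacent pieces, lowest rank first, until no rule applies."""
    if not word:
        return []
    # One piece per character; the last one carries the end-of-word marker.
    pieces = list(word[:-1]) + [word[-1] + "</w>"]
    while len(pieces) > 1:
        ranked = [(bpe_ranks[pair], i)
                  for i, pair in enumerate(zip(pieces, pieces[1:]))
                  if pair in bpe_ranks]
        if not ranked:
            break  # no known pair left to merge
        _, i = min(ranked)  # lowest rank; ties resolved by leftmost position
        pieces[i:i + 2] = [pieces[i] + pieces[i + 1]]
    return pieces

toy_ranks = {("b", "u"): 0, ("bu", "s</w>"): 1}  # made-up merge table
print(bpe("bus", toy_ranks))
print(bpe("sub", toy_ranks))
```

With these toy ranks, "bus" collapses into a single piece, while "sub" stays split into characters because none of its adjacent pairs appear in the table.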
||||||
|
|
||||||
|
std::vector<std::string> CLIPTokenizer::pre_tokenize(const std::string& text) const
{
    // CLIP tokenizer: lowercase + basic clean + split by pattern
    // Pattern from CLIP: <\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+
    // Simplified version for ASCII-dominant text:

    std::string cleaned;
    // Lowercase ASCII letters; all other bytes pass through unchanged
    for (char c : text) {
        if (c >= 'A' && c <= 'Z') {
            cleaned += static_cast<char>(c - 'A' + 'a');
        } else {
            cleaned += c;
        }
    }

    // Simple tokenization: split by whitespace and punctuation
    std::vector<std::string> words;
    std::string current;

    for (size_t i = 0; i < cleaned.size(); ++i) {
        char c = cleaned[i];

        if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
            // Whitespace ends the current word and is itself dropped;
            // bpe() marks word boundaries with the </w> suffix.
            if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
        } else {
            // Check if punctuation should be separate token
            bool is_alpha_or_digit = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
            bool cur_is_alpha = !current.empty() &&
                ((current.back() >= 'a' && current.back() <= 'z') ||
                 (current.back() >= '0' && current.back() <= '9'));

            if (!current.empty() && !is_alpha_or_digit && cur_is_alpha) {
                // Start new token for punctuation
                words.push_back(current);
                current.clear();
            } else if (!current.empty() && is_alpha_or_digit && !cur_is_alpha) {
                words.push_back(current);
                current.clear();
            }
            current += c;
        }
    }
    if (!current.empty()) {
        words.push_back(current);
    }

    return words;
}
|
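Review note: the simplified splitting in `pre_tokenize()` behaves roughly like the regular expression in the Python sketch below. This is an approximation for ASCII input only: the C++ code works byte by byte, whereas Python works on characters, and neither follows the full CLIP pattern quoted in the comment (contractions and per-digit splitting are not handled).

```python
import re

def pre_tokenize(text):
    """Lowercase ASCII letters, then split into alphanumeric and punctuation runs."""
    lowered = "".join(c.lower() if "A" <= c <= "Z" else c for c in text)
    # Either a run of [a-z0-9], or a run of anything that is neither
    # alphanumeric nor one of the four whitespace characters.
    return re.findall(r"[a-z0-9]+|[^a-z0-9 \t\n\r]+", lowered)

print(pre_tokenize("A red bus, parked!"))
```

Whitespace only separates words and is never emitted, so the end-of-word marker added later in `bpe()` is the only trace of a word boundary.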
||||||
|
|
||||||
|
// ========== Encode ==========

std::vector<int64_t> CLIPTokenizer::encode(const std::string& text, int max_len) const
{
    if (!loaded_) {
        std::cerr << "Tokenizer not loaded!" << std::endl;
        return std::vector<int64_t>(max_len, 0);
    }

    std::vector<int64_t> tokens;

    // Add start-of-text token
    tokens.push_back(sot_token_id_);

    // Pre-tokenize
    std::vector<std::string> words = pre_tokenize(text);

    // Process each word
    for (const auto& word : words) {
        // Convert raw bytes to unicode representation
        std::string unicode_word = bytes_to_unicode_str(word);

        // Apply BPE
        std::vector<std::string> bpe_tokens = bpe(unicode_word);

        // Look up token IDs
        for (const auto& bt : bpe_tokens) {
            auto it = token_to_id_.find(bt);
            if (it != token_to_id_.end()) {
                tokens.push_back(it->second);
            } else {
                // Unknown token, try without </w>
                std::string no_ew = bt;
                if (no_ew.size() >= 4 && no_ew.substr(no_ew.size() - 4) == "</w>") {
                    no_ew = no_ew.substr(0, no_ew.size() - 4);
                }
                auto it2 = token_to_id_.find(no_ew);
                if (it2 != token_to_id_.end()) {
                    tokens.push_back(it2->second);
                }
                // else: skip unknown token
            }
        }
    }

    // Add end-of-text token
    tokens.push_back(eot_token_id_);

    // Truncate if necessary
    if (static_cast<int>(tokens.size()) > max_len) {
        tokens.resize(max_len);
        // Ensure EOT is at the end
        tokens.back() = eot_token_id_;
    }

    // Pad to max_len with EOT token (consistent with HuggingFace CLIPTokenizer)
    while (static_cast<int>(tokens.size()) < max_len) {
        tokens.push_back(eot_token_id_);
    }

    return tokens;
}
|
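Review note: the framing, truncation, and padding at the end of `encode()` can be summarized by the Python sketch below. The special-token IDs are the defaults declared in `clip_tokenizer.h`; the body IDs are made-up values, `frame_tokens` is a hypothetical helper, and `max_len` is assumed to be at least 1.

```python
SOT_ID = 49406  # <|startoftext|>, default from clip_tokenizer.h
EOT_ID = 49407  # <|endoftext|>, default from clip_tokenizer.h

def frame_tokens(body_ids, max_len=64):
    """Wrap token IDs with SOT/EOT, then truncate or pad to exactly max_len."""
    tokens = [SOT_ID, *body_ids, EOT_ID]
    if len(tokens) > max_len:
        tokens = tokens[:max_len]
        tokens[-1] = EOT_ID  # keep an end marker after truncation
    # Pad with EOT so every sequence has the fixed length the model expects.
    return tokens + [EOT_ID] * (max_len - len(tokens))

short = frame_tokens([320, 736, 2840], max_len=8)  # made-up body IDs
long = frame_tokens(list(range(100)), max_len=8)
print(short)
print(long)
```

Either way the result has exactly `max_len` entries, starts with the start-of-text ID, and ends with the end-of-text ID.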
||||||
|
|
||||||
105
examples/clip/cpp/src/clip_tokenizer.h
Executable file
|
|
@ -0,0 +1,105 @@
|
||||||
|
/*
|
||||||
|
* Copyright (C) 2024–2025 Amlogic, Inc. All rights reserved.
|
||||||
|
*
|
||||||
|
* Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
* you may not use this file except in compliance with the License.
|
||||||
|
* You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing, software
|
||||||
|
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
* See the License for the specific language governing permissions and
|
||||||
|
* limitations under the License.
|
||||||
|
*/
|
||||||
|
|
||||||
|
#ifndef CLIP_TOKENIZER_H
|
||||||
|
#define CLIP_TOKENIZER_H
|
||||||
|
|
||||||
|
#include <string>
|
||||||
|
#include <vector>
|
||||||
|
#include <map>
|
||||||
|
#include <unordered_map>
|
||||||
|
|
||||||
|
class CLIPTokenizer {
|
||||||
|
public:
|
||||||
|
CLIPTokenizer() = default;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Load tokenizer from vocab.json and merges.txt
|
||||||
|
* @param vocab_path Path to vocab.json
|
||||||
|
* @param merges_path Path to merges.txt
|
||||||
|
* @return true on success
|
||||||
|
*/
|
||||||
|
bool load(const std::string& vocab_path, const std::string& merges_path);
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Load tokenizer from a directory containing vocab.json and merges.txt
|
||||||
|
* @param tokenizer_dir Path to directory
|
||||||
|
* @return true on success
|
||||||
|
*/
|
||||||
|
bool load_from_dir(const std::string& tokenizer_dir);
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Tokenize text to token IDs with padding/truncation.
|
||||||
|
* Adds <|startoftext|> and <|endoftext|> automatically.
|
||||||
|
*
|
||||||
|
* @param text Input text string
|
||||||
|
* @param max_len Maximum sequence length (default: 64)
|
||||||
|
* @return Vector of int64_t token IDs with shape [max_len]
|
||||||
|
*/
|
||||||
|
std::vector<int64_t> encode(const std::string& text, int max_len = 64) const;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Check if tokenizer is loaded
|
||||||
|
*/
|
||||||
|
bool is_loaded() const { return loaded_; }
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Get vocabulary size
|
||||||
|
*/
|
||||||
|
size_t vocab_size() const { return token_to_id_.size(); }
|
||||||
|
|
||||||
|
private:
|
||||||
|
// BPE pair
|
||||||
|
using BPEPair = std::pair<std::string, std::string>;
|
||||||
|
|
||||||
|
// Byte-to-unicode mapping (GPT-2 style)
|
||||||
|
std::unordered_map<uint8_t, char32_t> byte_to_unicode_;
|
||||||
|
std::unordered_map<char32_t, uint8_t> unicode_to_byte_;
|
||||||
|
|
||||||
|
// Vocabulary
|
||||||
|
std::unordered_map<std::string, int> token_to_id_;
|
||||||
|
std::unordered_map<int, std::string> id_to_token_;
|
||||||
|
|
||||||
|
// BPE merge rules (pair -> priority rank)
|
||||||
|
std::map<BPEPair, int> bpe_ranks_;
|
||||||
|
|
||||||
|
// Special token IDs
|
||||||
|
int sot_token_id_ = 49406; // <|startoftext|>
|
||||||
|
int eot_token_id_ = 49407; // <|endoftext|>
|
||||||
|
|
||||||
|
bool loaded_ = false;
|
||||||
|
|
||||||
|
// Initialize byte-to-unicode mapping
|
||||||
|
void init_byte_to_unicode();
|
||||||
|
|
||||||
|
// Convert UTF-8 string to vector of unicode codepoints
|
||||||
|
static std::vector<char32_t> utf8_to_codepoints(const std::string& str);
|
||||||
|
|
||||||
|
// Convert unicode codepoints to UTF-8 string
|
||||||
|
static std::string codepoints_to_utf8(const std::vector<char32_t>& cps);
|
||||||
|
|
||||||
|
// Apply BPE to a single word (already converted to unicode representation)
|
||||||
|
std::vector<std::string> bpe(const std::string& token) const;
|
||||||
|
|
||||||
|
// Clean and split text using CLIP's regex pattern
|
||||||
|
std::vector<std::string> pre_tokenize(const std::string& text) const;
|
||||||
|
|
||||||
|
// Convert raw bytes to unicode string using byte_to_unicode mapping
|
||||||
|
std::string bytes_to_unicode_str(const std::string& raw) const;
|
||||||
|
};
|
||||||
|
|
||||||
|
#endif // CLIP_TOKENIZER_H
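The `encode()` doc comment above promises a fixed-length result with `<|startoftext|>` and `<|endoftext|>` added automatically. A minimal sketch of that wrap/pad/truncate step follows, under stated assumptions: `pad_or_truncate` is a hypothetical helper, not a member of `CLIPTokenizer`, and the padding ID (`pad_id`, defaulting to 0 here) is a guess because the header does not say which ID is used for padding.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the commit): wrap BPE token IDs with the
// start/end markers and force the result to exactly max_len entries.
// pad_id is an assumption; the real encode() may pad with a different ID.
std::vector<int64_t> pad_or_truncate(const std::vector<int64_t>& bpe_ids,
                                     std::size_t max_len,
                                     int64_t sot_id = 49406,
                                     int64_t eot_id = 49407,
                                     int64_t pad_id = 0)
{
    assert(max_len >= 2);  // room is needed for both special tokens

    std::vector<int64_t> out;
    out.reserve(max_len);
    out.push_back(sot_id);

    // Keep at most max_len - 2 content tokens so the end marker always fits.
    const std::size_t keep = std::min(bpe_ids.size(), max_len - 2);
    out.insert(out.end(), bpe_ids.begin(),
               bpe_ids.begin() + static_cast<std::ptrdiff_t>(keep));
    out.push_back(eot_id);

    // Fill the tail up to the fixed length.
    out.resize(max_len, pad_id);
    return out;
}
```

With `max_len = 6`, two content tokens become `[sot, t0, t1, eot, pad, pad]`; with more content than fits, the tail is dropped and the end marker takes the last slot.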
|
||||||
|
|
||||||
|
|
@ -15,22 +15,26 @@
|
||||||
*/
|
*/
|
||||||
|
|
||||||
#include <iostream>
|
#include <iostream>
|
||||||
|
#include <fstream>
|
||||||
|
#include <sstream>
|
||||||
#include <stdio.h>
|
#include <stdio.h>
|
||||||
#include <stdlib.h>
|
#include <stdlib.h>
|
||||||
#include <time.h>
|
#include <time.h>
|
||||||
|
#include <vector>
|
||||||
|
#include <string>
|
||||||
|
#include <algorithm>
|
||||||
|
|
||||||
#include "model_invoke.h"
|
#include "clip_process.h"
|
||||||
|
#include "clip_tokenizer.h"
|
||||||
|
|
||||||
#define BILLION 1000000000
|
#define BILLION 1000000000
|
||||||
|
|
||||||
struct Get_Times
|
struct ProfilingTimer
|
||||||
{
|
{
|
||||||
uint64_t init_start_time, init_end_time, init_total_time;
|
uint64_t init_start, init_end;
|
||||||
uint64_t preProcess_start_time, preProcess_end_time, preProcess_total_time;
|
uint64_t preprocess_start, preprocess_end;
|
||||||
uint64_t invoke_start_time, invoke_end_time, invoke_total_time;
|
uint64_t vision_infer_start, vision_infer_end;
|
||||||
uint64_t postProcess_start_time, postProcess_end_time, postProcess_total_time;
|
uint64_t text_infer_start, text_infer_end;
|
||||||
uint64_t total_time;
|
|
||||||
std::vector<uint64_t> total_time_group;
|
|
||||||
};
|
};
|
||||||
|
|
||||||
static uint64_t get_time_count()
|
static uint64_t get_time_count()
|
||||||
|
|
@ -40,70 +44,288 @@ static uint64_t get_time_count()
|
||||||
return (uint64_t)((uint64_t)ts.tv_nsec + (uint64_t)ts.tv_sec * BILLION);
|
return (uint64_t)((uint64_t)ts.tv_nsec + (uint64_t)ts.tv_sec * BILLION);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Default text prompts for demo
|
||||||
|
static std::vector<std::string> default_texts = {
|
||||||
|
"a red handbag",
|
||||||
|
"a blue jacket",
|
||||||
|
"a red bus"
|
||||||
|
};
|
||||||
|
|
||||||
|
// Parse comma-separated texts
|
||||||
|
std::vector<std::string> parse_texts(const std::string& input)
|
||||||
|
{
|
||||||
|
std::vector<std::string> result;
|
||||||
|
std::stringstream ss(input);
|
||||||
|
std::string item;
|
||||||
|
|
||||||
|
while (std::getline(ss, item, ',')) {
|
||||||
|
// Trim whitespace
|
||||||
|
size_t start = item.find_first_not_of(" \t");
|
||||||
|
size_t end = item.find_last_not_of(" \t");
|
||||||
|
if (start != std::string::npos && end != std::string::npos) {
|
||||||
|
result.push_back(item.substr(start, end - start + 1));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return result;
|
||||||
|
}
|
||||||
|
|
||||||
|
void print_usage(const char* prog_name)
|
||||||
|
{
|
||||||
|
printf("Usage: %s <vision_model> <text_model> <tokenizer_dir> [--profiling]\n", prog_name);
|
||||||
|
printf("\n");
|
||||||
|
printf("Arguments:\n");
|
||||||
|
printf(" vision_model: Path to vision model (.adla)\n");
|
||||||
|
printf(" text_model: Path to text model (.adla)\n");
|
||||||
|
printf(" tokenizer_dir: Path to directory containing vocab.json and merges.txt\n");
|
||||||
|
printf(" --profiling: Enable performance profiling output (optional)\n");
|
||||||
|
printf("\n");
|
||||||
|
printf("Interactive mode:\n");
|
||||||
|
printf(" - Enter image path to process\n");
|
||||||
|
printf(" - Enter comma-separated texts to compare (or 'skip' for defaults)\n");
|
||||||
|
printf(" - Enter 'exit' to quit\n");
|
||||||
|
}
|
||||||
|
|
||||||
int main(int argc, char ** argv)
|
int main(int argc, char ** argv)
|
||||||
{
|
{
|
||||||
Get_Times model_time;
|
ProfilingTimer timer = {};
|
||||||
|
|
||||||
std::vector<float> input_data_fir;
|
|
||||||
float* model_output_data;
|
|
||||||
|
|
||||||
int ret = 0;
|
int ret = 0;
|
||||||
int max_index = 0;
|
bool profiling = false;
|
||||||
|
|
||||||
if (argc < 2) {
|
|
||||||
printf("Usage: %s <model_path> [base_dir] [json_filename]\n", argv[0]);
|
|
||||||
printf(" model_path: Path to the model file\n");
|
|
||||||
printf(" base_dir: Base directory for clip datasets (optional, can also use CLIP_BASE_DIR env var)\n");
|
|
||||||
printf(" json_filename: JSON filename in each dataset folder (optional, can also use CLIP_JSON_FILENAME env var, default: clip_text_res.json)\n");
|
|
||||||
return -1;
|
|
||||||
}
|
|
||||||
|
|
||||||
char* model_path_encoder = argv[1];
|
|
||||||
std::string base_dir = (argc >= 3) ? argv[2] : "";
|
|
||||||
std::string json_filename = (argc >= 4) ? argv[3] : "";
|
|
||||||
void *context_model = NULL;
|
|
||||||
|
|
||||||
model_time.init_start_time = get_time_count();
|
if (argc < 4) {
|
||||||
context_model = init_network_file(model_path_encoder);
|
print_usage(argv[0]);
|
||||||
model_time.init_end_time = get_time_count();
|
|
||||||
|
|
||||||
if (context_model == NULL)
|
|
||||||
{
|
|
||||||
printf("init_network [context_model] fail.\n");
|
|
||||||
return -1;
|
return -1;
|
||||||
}
|
}
|
||||||
|
|
||||||
if (getenv("GET_TIME"))
|
const char* vision_model_path = argv[1];
|
||||||
{
|
const char* text_model_path = argv[2];
|
||||||
model_time.init_total_time = (model_time.init_end_time - model_time.init_start_time) / 1000000;
|
const char* tokenizer_dir = argv[3];
|
||||||
std::cout << "init_model_total time : " << model_time.init_total_time << "ms" << std::endl;
|
|
||||||
|
// Check for --profiling flag
|
||||||
|
for (int i = 4; i < argc; ++i) {
|
||||||
|
if (std::string(argv[i]) == "--profiling") {
|
||||||
|
profiling = true;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
while (true)
|
const float logit_scale = 100.0f;
|
||||||
{
|
const int max_seq_len = 64;
|
||||||
std::string json_path;
|
|
||||||
|
|
||||||
printf("\nPlease enter the JPG image path (enter exit to quit):\n");
|
// Load tokenizer
|
||||||
std::getline(std::cin, json_path);
|
printf("[Info] Loading tokenizer from: %s\n", tokenizer_dir);
|
||||||
if (json_path == "exit") break;
|
CLIPTokenizer tokenizer;
|
||||||
if (json_path.empty()) {
|
if (!tokenizer.load_from_dir(tokenizer_dir)) {
|
||||||
printf("The path cannot be empty.\n");
|
printf("[Error] Failed to load tokenizer.\n");
|
||||||
|
return -1;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Initialize models
|
||||||
|
printf("[Info] Initializing vision model: %s\n", vision_model_path);
|
||||||
|
timer.init_start = get_time_count();
|
||||||
|
void* vision_context = init_network_file(vision_model_path);
|
||||||
|
if (vision_context == NULL) {
|
||||||
|
printf("[Error] Failed to initialize vision model.\n");
|
||||||
|
return -1;
|
||||||
|
}
|
||||||
|
|
||||||
|
printf("[Info] Initializing text model: %s\n", text_model_path);
|
||||||
|
void* text_context = init_network_file(text_model_path);
|
||||||
|
if (text_context == NULL) {
|
||||||
|
printf("[Error] Failed to initialize text model.\n");
|
||||||
|
destroy_network(vision_context);
|
||||||
|
return -1;
|
||||||
|
}
|
||||||
|
timer.init_end = get_time_count();
|
||||||
|
|
||||||
|
if (profiling) {
|
||||||
|
uint64_t init_time = (timer.init_end - timer.init_start) / 1000000;
|
||||||
|
printf("[Profiling] Model initialization: %llums\n", (unsigned long long)init_time);  // %lu is not portable for uint64_t on 32-bit ABIs
|
||||||
|
}
|
||||||
|
|
||||||
|
printf("[Info] Models initialized successfully.\n\n");
|
||||||
|
|
||||||
|
// Interactive loop
|
||||||
|
while (true) {
|
||||||
|
std::string image_path;
|
||||||
|
|
||||||
|
printf("============================================================\n");
|
||||||
|
printf("[Info] Image Path (or 'exit' to quit):\n");
|
||||||
|
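if (!std::getline(std::cin, image_path)) {
// stdin closed (EOF or read error): leave the loop instead of spinning on empty input
printf("[Info] Input stream closed, exiting...\n");
break;
}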
|
||||||
|
|
||||||
|
// Trim whitespace
|
||||||
|
size_t start = image_path.find_first_not_of(" \t\r\n");
|
||||||
|
size_t end = image_path.find_last_not_of(" \t\r\n");
|
||||||
|
if (start != std::string::npos && end != std::string::npos) {
|
||||||
|
image_path = image_path.substr(start, end - start + 1);
|
||||||
|
} else {
|
||||||
|
image_path.clear();
|
||||||
|
}
|
||||||
|
|
||||||
|
if (image_path == "exit") {
|
||||||
|
printf("[Info] Exiting...\n");
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (image_path.empty()) {
|
||||||
|
printf("[Warning] Please enter an image path.\n");
|
||||||
continue;
|
continue;
|
||||||
}
|
}
|
||||||
std::vector<std::string> out_str_path = process_image_dir(context_model, json_path, base_dir, json_filename);
|
|
||||||
|
|
||||||
for (int i = 0; i < out_str_path.size(); i++)
|
// Check if file exists
|
||||||
{
|
{
|
||||||
std::cout << "Index[" << i << "] : " << out_str_path[i] << std::endl;
|
std::ifstream img_file(image_path);
|
||||||
|
if (!img_file.good()) {
|
||||||
|
printf("[Error] Image not found: %s\n", image_path.c_str());
|
||||||
|
continue;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Get texts to compare
|
||||||
|
std::vector<std::string> texts;
|
||||||
|
|
||||||
|
printf("[Info] Enter text descriptions (comma-separated, or 'skip' for defaults):\n");
|
||||||
|
std::string text_input;
|
||||||
|
std::getline(std::cin, text_input);
|
||||||
|
|
||||||
|
// Trim
|
||||||
|
start = text_input.find_first_not_of(" \t\r\n");
|
||||||
|
end = text_input.find_last_not_of(" \t\r\n");
|
||||||
|
if (start != std::string::npos && end != std::string::npos) {
|
||||||
|
text_input = text_input.substr(start, end - start + 1);
|
||||||
|
} else {
|
||||||
|
text_input.clear();
|
||||||
|
}
|
||||||
|
|
||||||
|
if (text_input.empty() || text_input == "skip") {
|
||||||
|
texts = default_texts;
|
||||||
|
printf("[Info] Using default texts\n");
|
||||||
|
} else {
|
||||||
|
texts = parse_texts(text_input);
|
||||||
|
}
|
||||||
|
|
||||||
|
if (texts.empty()) {
|
||||||
|
printf("[Warning] No texts provided.\n");
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
// ==================== Process Image ====================
|
||||||
|
printf("\n[Info] Processing image: %s\n", image_path.c_str());
|
||||||
|
|
||||||
|
timer.preprocess_start = get_time_count();
|
||||||
|
std::vector<float> image_input = preprocess_image(image_path);
|
||||||
|
if (image_input.empty()) {
|
||||||
|
printf("[Error] Failed to preprocess image.\n");
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
timer.preprocess_end = get_time_count();
|
||||||
|
|
||||||
|
// Run vision model
|
||||||
|
timer.vision_infer_start = get_time_count();
|
||||||
|
std::vector<float> image_embedding = run_vision_model(vision_context, image_input);
|
||||||
|
if (image_embedding.empty()) {
|
||||||
|
printf("[Error] Vision model inference failed.\n");
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
timer.vision_infer_end = get_time_count();
|
||||||
|
|
||||||
|
// L2 normalize image embedding
|
||||||
|
image_embedding = l2_normalize(image_embedding);
|
||||||
|
printf("[Info] Image embedding size: %zu\n", image_embedding.size());
|
||||||
|
|
||||||
|
// ==================== Process Texts ====================
|
||||||
|
printf("[Info] Processing %zu text(s)...\n", texts.size());
|
||||||
|
|
||||||
|
std::vector<std::vector<float>> text_embeddings;
|
||||||
|
std::vector<uint64_t> text_infer_times;
|
||||||
|
timer.text_infer_start = get_time_count();
|
||||||
|
|
||||||
|
for (size_t i = 0; i < texts.size(); ++i) {
|
||||||
|
// Tokenize text
|
||||||
|
std::vector<int64_t> token_ids = tokenizer.encode(texts[i], max_seq_len);
|
||||||
|
// Run text model
|
||||||
|
uint64_t t_start = get_time_count();
|
||||||
|
std::vector<float> text_emb = run_text_model(text_context, token_ids);
|
||||||
|
uint64_t t_end = get_time_count();
|
||||||
|
text_infer_times.push_back((t_end - t_start) / 1000000);
|
||||||
|
|
||||||
|
if (text_emb.empty()) {
|
||||||
|
printf("[Error] Text model inference failed for: %s\n", texts[i].c_str());
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
// L2 normalize
|
||||||
|
text_emb = l2_normalize(text_emb);
|
||||||
|
text_embeddings.push_back(text_emb);
|
||||||
|
}
|
||||||
|
|
||||||
|
timer.text_infer_end = get_time_count();
|
||||||
|
|
||||||
|
if (text_embeddings.size() != texts.size()) {
|
||||||
|
printf("[Error] Some text embeddings failed.\n");
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
printf("[Info] Text embeddings size: %zu x %zu\n", text_embeddings.size(),
|
||||||
|
text_embeddings.empty() ? 0 : text_embeddings[0].size());
|
||||||
|
|
||||||
|
// ==================== Compute Similarity ====================
|
||||||
|
std::vector<float> similarities(texts.size());
|
||||||
|
std::vector<float> logits(texts.size());
|
||||||
|
|
||||||
|
for (size_t i = 0; i < texts.size(); ++i) {
|
||||||
|
similarities[i] = compute_similarity(image_embedding, text_embeddings[i], 1.0f); // cosine sim
|
||||||
|
logits[i] = similarities[i] * logit_scale;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Compute probabilities
|
||||||
|
std::vector<float> probs = softmax(logits);
|
||||||
|
|
||||||
|
// Sort by probability (descending)
|
||||||
|
std::vector<size_t> indices(texts.size());
|
||||||
|
for (size_t i = 0; i < texts.size(); ++i) indices[i] = i;
|
||||||
|
std::sort(indices.begin(), indices.end(),
|
||||||
|
[&probs](size_t a, size_t b) { return probs[a] > probs[b]; });
|
||||||
|
|
||||||
|
// ==================== Print Results ====================
|
||||||
|
printf("\n============================================================\n");
|
||||||
|
printf("CLIP Image-Text Matching Results\n");
|
||||||
|
printf("============================================================\n");
|
||||||
|
printf("Image: %s\n", image_path.c_str());
|
||||||
|
printf("logit_scale: %.6f\n", logit_scale);
|
||||||
|
printf("------------------------------------------------------------\n");
|
||||||
|
|
||||||
|
for (size_t rank = 0; rank < indices.size(); ++rank) {
|
||||||
|
size_t i = indices[rank];
|
||||||
|
printf("[%zu] prob=%.6f sim=%.6f text='%s'\n",
|
||||||
|
rank + 1, probs[i], similarities[i], texts[i].c_str());
|
||||||
|
}
|
||||||
|
printf("============================================================\n");
|
||||||
|
|
||||||
|
if (profiling) {
|
||||||
|
uint64_t preprocess_time = (timer.preprocess_end - timer.preprocess_start) / 1000000;
|
||||||
|
uint64_t vision_time = (timer.vision_infer_end - timer.vision_infer_start) / 1000000;
|
||||||
|
uint64_t text_total_time = (timer.text_infer_end - timer.text_infer_start) / 1000000;
|
||||||
|
printf("\n[Profiling]\n");
|
||||||
|
printf(" Image preprocess: %llums\n", (unsigned long long)preprocess_time);
|
||||||
|
printf(" Vision inference: %llums\n", (unsigned long long)vision_time);
|
||||||
|
for (size_t i = 0; i < texts.size() && i < text_infer_times.size(); ++i) {
|
||||||
|
printf(" Text inference[%zu]: %llums '%s'\n", i, (unsigned long long)text_infer_times[i], texts[i].c_str());
|
||||||
|
}
|
||||||
|
printf(" Text total: %llums (%zu texts)\n", (unsigned long long)text_total_time, texts.size());
|
||||||
|
}
|
||||||
|
printf("\n");
|
||||||
}
|
}
|
||||||
|
|
||||||
ret = destroy_network(context_model);
|
// Cleanup
|
||||||
if (ret != 0)
|
ret = destroy_network(vision_context);
|
||||||
{
|
if (ret != 0) {
|
||||||
printf("destroy_network [context_model] fail.\n");
|
printf("[Error] Failed to destroy vision model.\n");
|
||||||
return -1;
|
|
||||||
}
|
}
|
||||||
|
|
||||||
return ret;
|
ret = destroy_network(text_context);
|
||||||
}
|
if (ret != 0) {
|
||||||
|
printf("[Error] Failed to destroy text model.\n");
|
||||||
|
}
|
||||||
|
|
||||||
|
printf("[Info] Done.\n");
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
|
|
||||||
|
|
@ -20,31 +20,20 @@
|
||||||
#include <fstream>
|
#include <fstream>
|
||||||
#include <algorithm>
|
#include <algorithm>
|
||||||
#include <vector>
|
#include <vector>
|
||||||
|
#include <cmath>
|
||||||
#include <cstdlib>
|
#include <cstdlib>
|
||||||
|
|
||||||
#include "model_invoke.h"
|
#include "clip_process.h"
|
||||||
#include "nn_sdk.h"
|
#include "nn_sdk.h"
|
||||||
#include "json.hpp"
|
|
||||||
#include <filesystem>
|
|
||||||
#include <regex>
|
|
||||||
|
|
||||||
using json = nlohmann::ordered_json;
|
// Global DMA config for models
|
||||||
namespace fs = std::__fs::filesystem;
|
static aml_memory_config_t vision_mem_config;
|
||||||
|
static aml_memory_data_t vision_mem_data;
|
||||||
|
static void* vision_context_flag = nullptr;
|
||||||
|
|
||||||
struct DMAConfig {
|
static aml_memory_config_t text_mem_config;
|
||||||
bool use_dma = true;
|
static aml_memory_data_t text_mem_data;
|
||||||
bool malloc_buffer_once = true;
|
static void* text_context_flag = nullptr;
|
||||||
};
|
|
||||||
|
|
||||||
DMAConfig context_model;
|
|
||||||
|
|
||||||
///////////////////////////////////////////////////////////
|
|
||||||
|
|
||||||
aml_memory_config_t mem_config_context_model;
|
|
||||||
aml_memory_data_t mem_data_context_model;
|
|
||||||
|
|
||||||
std::vector<float> preprocess_image(const std::string& image_path);
|
|
||||||
float post_process(const float* a, const std::vector<float>& b);
|
|
||||||
|
|
||||||
void* init_network_file(const char *model_path)
|
void* init_network_file(const char *model_path)
|
||||||
{
|
{
|
||||||
|
|
@ -95,202 +84,119 @@ void* init_network_file(const char *model_path)
|
||||||
return qcontext;
|
return qcontext;
|
||||||
}
|
}
|
||||||
|
|
||||||
float* run_network(void *qcontext, std::vector<float> input_ids, const std::string image_type)
|
std::vector<float> run_vision_model(void* qcontext, const std::vector<float>& input_data)
|
||||||
{
|
{
|
||||||
int ret = 0;
|
int ret = 0;
|
||||||
nn_input inData;
|
nn_input inData;
|
||||||
|
|
||||||
nn_output *outdata = NULL;
|
nn_output *outdata = NULL;
|
||||||
aml_output_config_t outconfig;
|
aml_output_config_t outconfig;
|
||||||
|
|
||||||
inData.input_index = 0;
|
inData.input_index = 0;
|
||||||
inData.info.input_format = AML_INPUT_DEFAULT;
|
inData.info.input_format = AML_INPUT_DEFAULT;
|
||||||
inData.size = input_ids.size() * sizeof(float);
|
inData.size = input_data.size() * sizeof(float);
|
||||||
|
|
||||||
if (context_model.use_dma) {
|
// Use DMA
|
||||||
if (context_model.malloc_buffer_once) {
|
if (!vision_context_flag) {
|
||||||
mem_config_context_model.cache_type = AML_WITH_CACHE;
|
vision_mem_config.cache_type = AML_WITH_CACHE;
|
||||||
mem_config_context_model.memory_type = AML_VIRTUAL_ADDR;
|
vision_mem_config.memory_type = AML_VIRTUAL_ADDR;
|
||||||
mem_config_context_model.direction = AML_MEM_DIRECTION_READ_WRITE;
|
vision_mem_config.direction = AML_MEM_DIRECTION_READ_WRITE;
|
||||||
mem_config_context_model.index = 0;
|
vision_mem_config.index = 0;
|
||||||
mem_config_context_model.mem_size = inData.size;
|
vision_mem_config.mem_size = inData.size;
|
||||||
aml_util_mallocBuffer(qcontext, &mem_config_context_model, &mem_data_context_model);
|
aml_util_mallocBuffer(qcontext, &vision_mem_config, &vision_mem_data);
|
||||||
aml_util_swapExternalInputBuffer(qcontext, &mem_config_context_model, &mem_data_context_model);
|
aml_util_swapExternalInputBuffer(qcontext, &vision_mem_config, &vision_mem_data);
|
||||||
}
|
vision_context_flag = qcontext;
|
||||||
|
|
||||||
inData.input_type = INPUT_DMA_DATA;
|
|
||||||
memcpy(mem_data_context_model.viraddr, input_ids.data(), mem_config_context_model.mem_size);
|
|
||||||
inData.input = NULL;
|
|
||||||
} else {
|
|
||||||
inData.input = reinterpret_cast<unsigned char*>(input_ids.data());
|
|
||||||
inData.input_type = BINARY_RAW_DATA;
|
|
||||||
|
|
||||||
ret = aml_module_input_set(qcontext, &inData);
|
|
||||||
if (ret)
|
|
||||||
{
|
|
||||||
printf("aml_module_input_set fail.\n");
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
context_model.malloc_buffer_once = false;
|
|
||||||
|
inData.input_type = INPUT_DMA_DATA;
|
||||||
|
memcpy(vision_mem_data.viraddr, input_data.data(), vision_mem_config.mem_size);
|
||||||
|
inData.input = NULL;
|
||||||
|
|
||||||
memset(&outconfig, 0, sizeof(aml_output_config_t));
|
memset(&outconfig, 0, sizeof(aml_output_config_t));
|
||||||
|
outconfig.format = AML_OUTDATA_DMA;
|
||||||
if (context_model.use_dma) {
|
|
||||||
outconfig.format = AML_OUTDATA_DMA;
|
|
||||||
} else {
|
|
||||||
outconfig.format = AML_OUTDATA_RAW;
|
|
||||||
}
|
|
||||||
outconfig.typeSize = sizeof(aml_output_config_t);
|
outconfig.typeSize = sizeof(aml_output_config_t);
|
||||||
outdata = (nn_output*)aml_module_output_get(qcontext, outconfig);
|
outdata = (nn_output*)aml_module_output_get(qcontext, outconfig);
|
||||||
|
|
||||||
return reinterpret_cast<float*>(outdata->out[0].buf);
|
if (outdata == NULL || outdata->out[0].buf == NULL) {
|
||||||
}
|
printf("Vision model inference failed.\n");
|
||||||
|
return {};
|
||||||
int extract_index(const std::string& filename) {
|
|
||||||
std::regex pattern(R"(test_\w+_(\d+)\.jpg)");
|
|
||||||
std::smatch match;
|
|
||||||
if (std::regex_match(filename, match, pattern)) {
|
|
||||||
return std::stoi(match[1]);
|
|
||||||
}
|
}
|
||||||
return -1;
|
|
||||||
|
// Copy output to vector
|
||||||
|
size_t output_size = outdata->out[0].size / sizeof(float);
|
||||||
|
float* output_ptr = reinterpret_cast<float*>(outdata->out[0].buf);
|
||||||
|
std::vector<float> result(output_ptr, output_ptr + output_size);
|
||||||
|
|
||||||
|
return result;
|
||||||
}
|
}
|
||||||
|
|
||||||
std::vector<std::string> process_image_dir(
|
std::vector<float> run_text_model(void* qcontext, const std::vector<int64_t>& input_ids)
|
||||||
void* context_model,
|
|
||||||
const std::string& image_dir_path,
|
|
||||||
const std::string& base_dir,
|
|
||||||
const std::string& json_filename)
|
|
||||||
{
|
{
|
||||||
std::vector<std::string> results;
|
int ret = 0;
|
||||||
std::regex file_pattern(R"(test_(\w+)_\d+\.jpg)");
|
nn_input inData;
|
||||||
|
nn_output *outdata = NULL;
|
||||||
// Get base_dir from parameter, environment variable, or use default
|
aml_output_config_t outconfig;
|
||||||
std::string actual_base_dir = base_dir;
|
|
||||||
if (actual_base_dir.empty()) {
|
inData.input_index = 0;
|
||||||
const char* env_base_dir = std::getenv("CLIP_BASE_DIR");
|
inData.info.input_format = AML_INPUT_DEFAULT;
|
||||||
if (env_base_dir != nullptr) {
|
inData.size = input_ids.size() * sizeof(int64_t);
|
||||||
actual_base_dir = env_base_dir;
|
|
||||||
} else {
|
// Use DMA
|
||||||
actual_base_dir = "./demo_data/clip_datasets/";
|
if (!text_context_flag) {
|
||||||
}
|
text_mem_config.cache_type = AML_WITH_CACHE;
|
||||||
}
|
text_mem_config.memory_type = AML_VIRTUAL_ADDR;
|
||||||
|
text_mem_config.direction = AML_MEM_DIRECTION_READ_WRITE;
|
||||||
// Ensure base_dir ends with '/'
|
text_mem_config.index = 0;
|
||||||
if (!actual_base_dir.empty() && actual_base_dir.back() != '/') {
|
text_mem_config.mem_size = inData.size;
|
||||||
actual_base_dir += "/";
|
aml_util_mallocBuffer(qcontext, &text_mem_config, &text_mem_data);
|
||||||
}
|
aml_util_swapExternalInputBuffer(qcontext, &text_mem_config, &text_mem_data);
|
||||||
|
text_context_flag = qcontext;
|
||||||
// Get json_filename from parameter, environment variable, or use default
|
|
||||||
std::string actual_json_filename = json_filename;
|
|
||||||
if (actual_json_filename.empty()) {
|
|
||||||
const char* env_json_filename = std::getenv("CLIP_JSON_FILENAME");
|
|
||||||
if (env_json_filename != nullptr) {
|
|
||||||
actual_json_filename = env_json_filename;
|
|
||||||
} else {
|
|
||||||
actual_json_filename = "clip_text_res.json";
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
|
|
||||||
// storing qualified paths
|
inData.input_type = INPUT_DMA_DATA;
|
||||||
std::vector<fs::directory_entry> matched_files;
|
memcpy(text_mem_data.viraddr, input_ids.data(), text_mem_config.mem_size);
|
||||||
|
inData.input = NULL;
|
||||||
|
|
||||||
// collect all relevant img.
|
memset(&outconfig, 0, sizeof(aml_output_config_t));
|
||||||
for (const auto& entry : fs::directory_iterator(image_dir_path)) {
|
outconfig.format = AML_OUTDATA_DMA;
|
||||||
if (!entry.is_regular_file()) continue;
|
outconfig.typeSize = sizeof(aml_output_config_t);
|
||||||
|
outdata = (nn_output*)aml_module_output_get(qcontext, outconfig);
|
||||||
|
|
||||||
std::string filename = entry.path().filename().string();
|
if (outdata == NULL || outdata->out[0].buf == NULL) {
|
||||||
if (std::regex_match(filename, file_pattern)) {
|
printf("Text model inference failed.\n");
|
||||||
matched_files.push_back(entry);
|
return {};
|
||||||
}
|
|
||||||
}
|
}
|
||||||
|
|
||||||
// use index sort, test_type_index.jpg
|
// Copy output to vector
|
||||||
std::sort(matched_files.begin(), matched_files.end(),
|
size_t output_size = outdata->out[0].size / sizeof(float);
|
||||||
[](const fs::directory_entry& a, const fs::directory_entry& b) {
|
float* output_ptr = reinterpret_cast<float*>(outdata->out[0].buf);
|
||||||
return extract_index(a.path().filename().string()) <
|
std::vector<float> result(output_ptr, output_ptr + output_size);
|
||||||
extract_index(b.path().filename().string());
|
|
||||||
});
|
|
||||||
|
|
||||||
for (const auto& entry : matched_files) {
|
return result;
|
||||||
if (!entry.is_regular_file()) continue;
|
|
||||||
|
|
||||||
std::string filename = entry.path().filename().string();
|
|
||||||
std::smatch match;
|
|
||||||
if (!std::regex_match(filename, match, file_pattern)) continue;
|
|
||||||
|
|
||||||
std::string name = match[1];
|
|
||||||
|
|
||||||
std::vector<float> input_data = preprocess_image(entry.path().string());
|
|
||||||
float* model_output = run_network(context_model, input_data, name);
|
|
||||||
|
|
||||||
float max_sim = -std::numeric_limits<float>::infinity();
|
|
||||||
std::string best_key, best_id;
|
|
||||||
|
|
||||||
// Iterate through all directories to find the directory containing the name
|
|
||||||
for (const auto& dir_entry : fs::directory_iterator(actual_base_dir)) {
|
|
||||||
if (!dir_entry.is_directory()) continue;
|
|
||||||
|
|
||||||
std::string folder_name = dir_entry.path().filename().string();
|
|
||||||
if (folder_name.find(name) == std::string::npos) continue;
|
|
||||||
|
|
||||||
std::string vit_res_path = actual_base_dir + folder_name + "/" + actual_json_filename;
|
|
||||||
std::ifstream vit_in(vit_res_path);
|
|
||||||
if (!vit_in.is_open()) {
|
|
||||||
printf("unopen: %s\n", vit_res_path.c_str());
|
|
||||||
continue;
|
|
||||||
}
|
|
||||||
|
|
||||||
json vit_json;
|
|
||||||
vit_in >> vit_json;
|
|
||||||
|
|
||||||
for (auto it = vit_json.begin(); it != vit_json.end(); ++it) {
|
|
||||||
const std::string& key = it.key();
|
|
||||||
const std::vector<float> vec = it.value().get<std::vector<float>>();
|
|
||||||
float sim = post_process(model_output, vec);
|
|
||||||
// printf("sim: %.4f\n", sim);
|
|
||||||
if (sim > max_sim) {
|
|
||||||
max_sim = sim;
|
|
||||||
best_key = key;
|
|
||||||
best_id = folder_name;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
if (!best_key.empty() && !best_id.empty()) {
|
|
||||||
std::string best_path = actual_base_dir + best_id + "/";
|
|
||||||
results.push_back(best_path);
|
|
||||||
printf("\nProcessing images: %s, datasets img path: %s\n", filename.c_str(), best_path.c_str());
|
|
||||||
// printf("Most similar image: %s similarity: %.4f\n", best_path.c_str(), max_sim); // for debug
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
return results;
|
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
int destroy_network(void *qcontext)
|
int destroy_network(void *qcontext)
|
||||||
{
|
{
|
||||||
int ret = 0;
|
int ret = 0;
|
||||||
|
|
||||||
/* free model
|
if (vision_context_flag == qcontext) {
|
||||||
model.use_dma = true
|
printf("Free vision model memory.\n");
|
||||||
model.malloc_buffer_once = false
|
aml_util_freeBuffer(qcontext, &vision_mem_config, &vision_mem_data);
|
||||||
*/
|
vision_context_flag = nullptr;
|
||||||
if (context_model.use_dma && mem_config_context_model.mem_size != 0) {
|
} else if (text_context_flag == qcontext) {
|
||||||
ret = aml_util_freeBuffer(qcontext, &mem_config_context_model, &mem_data_context_model);
|
printf("Free text model memory.\n");
|
||||||
if (ret)
|
aml_util_freeBuffer(qcontext, &text_mem_config, &text_mem_data);
|
||||||
{
|
text_context_flag = nullptr;
|
||||||
std::cout << "aml_util_freeBuffer fail." << std::endl;
|
} else {
|
||||||
}
|
printf("Free network failed: context not found.\n");
|
||||||
|
return -1;
|
||||||
}
|
}
|
||||||
context_model.use_dma = false;
|
|
||||||
|
|
||||||
ret = aml_module_destroy(qcontext);
|
ret = aml_module_destroy(qcontext);
|
||||||
if (ret)
|
if (ret)
|
||||||
{
|
{
|
||||||
printf("aml_module_destroy fail.\n");
|
printf("Free network failed: destroy failed.\n");
|
||||||
return -1;
|
return -1;
|
||||||
}
|
}
|
||||||
|
|
||||||
return ret;
|
return ret;
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
@ -19,13 +19,13 @@
|
||||||
#include <algorithm>
|
#include <algorithm>
|
||||||
#include <string>
|
#include <string>
|
||||||
#include <iostream>
|
#include <iostream>
|
||||||
#include "model_invoke.h"
|
#include "clip_process.h"
|
||||||
|
|
||||||
#define STB_IMAGE_IMPLEMENTATION
|
#define STB_IMAGE_IMPLEMENTATION
|
||||||
#include "stb_image.h"
|
#include "stb_image.h"
|
||||||
|
|
||||||
// bilinear interpolation scaling
|
// bilinear interpolation scaling
|
||||||
std::vector<float> resize_bilinear(
|
static std::vector<float> resize_bilinear(
|
||||||
const unsigned char* src, int src_w, int src_h, int channels,
|
const unsigned char* src, int src_w, int src_h, int channels,
|
||||||
int dst_w, int dst_h)
|
int dst_w, int dst_h)
|
||||||
{
|
{
|
||||||
|
|
@ -102,29 +102,29 @@ std::vector<float> preprocess_image(const std::string& image_path) {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// get NHWC
|
// Return NHWC format (batch dimension will be added in caller)
|
||||||
return cropped;
|
return cropped;
|
||||||
}
|
}
|
||||||
|
|
||||||
float post_process(const float* a, const std::vector<float>& b) {
|
// ==================== Post Processing ====================
|
||||||
float dot = 0.0f, scale = 100.00000762939453f;
|
|
||||||
for (size_t i = 0; i < b.size(); ++i) {
|
std::vector<float> l2_normalize(const std::vector<float>& vec)
|
||||||
dot += a[i] * b[i];
|
{
|
||||||
|
float norm = 0.0f;
|
||||||
|
for (float v : vec) {
|
||||||
|
norm += v * v;
|
||||||
}
|
}
|
||||||
dot *= scale;
|
norm = std::sqrt(norm) + 1e-12f;
|
||||||
return dot;
|
|
||||||
|
std::vector<float> result(vec.size());
|
||||||
|
for (size_t i = 0; i < vec.size(); ++i) {
|
||||||
|
result[i] = vec[i] / norm;
|
||||||
|
}
|
||||||
|
return result;
|
||||||
}
|
}
|
||||||
|
|
||||||
float post_process(const int8_t* a, const std::vector<float>& b) {
|
std::vector<float> softmax(const std::vector<float>& logits)
|
||||||
float dot = 0.0f, scale = 100.00000762939453f;
|
{
|
||||||
for (size_t i = 0; i < b.size(); ++i) {
|
|
||||||
dot += (a[i] - 66) * b[i];
|
|
||||||
}
|
|
||||||
dot *= scale;
|
|
||||||
return dot;
|
|
||||||
}
|
|
||||||
|
|
||||||
std::vector<float> softmax(const std::vector<float>& logits) {
|
|
||||||
std::vector<float> result(logits.size());
|
std::vector<float> result(logits.size());
|
||||||
|
|
||||||
// numerical stability: subtract the maximum value first.
|
// numerical stability: subtract the maximum value first.
|
||||||
|
|
@ -142,3 +142,17 @@ std::vector<float> softmax(const std::vector<float>& logits) {
|
||||||
|
|
||||||
return result;
|
return result;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
float compute_similarity(const std::vector<float>& a, const std::vector<float>& b, float scale)
|
||||||
|
{
|
||||||
|
if (a.size() != b.size()) {
|
||||||
|
printf("Feature dimension mismatch: %zu vs %zu\n", a.size(), b.size());
|
||||||
|
return 0.0f;
|
||||||
|
}
|
||||||
|
|
||||||
|
float dot = 0.0f;
|
||||||
|
for (size_t i = 0; i < a.size(); ++i) {
|
||||||
|
dot += a[i] * b[i];
|
||||||
|
}
|
||||||
|
return dot * scale;
|
||||||
|
}
|
||||||
|
|
|
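The post-processing added in this file (`l2_normalize`, `softmax`, `compute_similarity`) amounts to a cosine-similarity ranking followed by a softmax. As a rough illustration only, the same arithmetic can be sketched in plain Python. The helper `rank_texts` and the toy vectors are hypothetical and not part of the commit; the default scale of 100.0 mirrors the `--logit-scale` default in the Python demo.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(v * v for v in vec)) + eps
    return [v / norm for v in vec]


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_texts(image_emb, text_embs, logit_scale=100.0):
    """Hypothetical helper: cosine similarity per text, then softmax over scaled logits."""
    image_unit = l2_normalize(image_emb)
    sims = []
    for text_emb in text_embs:
        text_unit = l2_normalize(text_emb)
        # Dot product of two unit vectors is their cosine similarity.
        sims.append(sum(a * b for a, b in zip(image_unit, text_unit)))
    logits = [s * logit_scale for s in sims]
    return sims, softmax(logits)


# Toy 3-dimensional embeddings, made up for illustration.
image = [1.0, 0.0, 0.0]
texts = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
sims, probs = rank_texts(image, texts)
```

Normalizing both vectors first means the dot product is bounded to [-1, 1], so the scale factor alone controls how sharply the softmax separates the candidates.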
||||||
|
|
@ -1,304 +1,339 @@
|
||||||
import numpy as np
|
# -*- coding: utf-8 -*-
|
||||||
import os
|
"""
|
||||||
import argparse
|
Copyright (C) 2024–2025 Amlogic, Inc. All rights reserved.
|
||||||
import json
|
|
||||||
import re
|
Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
from PIL import Image
|
you may not use this file except in compliance with the License.
|
||||||
from amlnnlite.api import AMLNNLite
|
You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
def preprocess_image(image_path: str, target_size: int = 224) -> np.ndarray:
|
|
||||||
"""
|
Unless required by applicable law or agreed to in writing, software
|
||||||
Preprocess image for CLIP model.
|
distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
Steps:
|
See the License for the specific language governing permissions and
|
||||||
1. Load image and convert to RGB
|
limitations under the License.
|
||||||
2. Scale the shorter side to target_size
|
"""
|
||||||
3. Center crop to target_size x target_size
|
|
||||||
4. Normalize with CLIP mean and std
|
# This inference script is designed for CLIP model using AMLNNLite.
|
||||||
|
|
||||||
Args:
|
import os
|
||||||
image_path (str): Path to input image
|
import argparse
|
||||||
target_size (int): Target image size (default: 224)
|
import numpy as np
|
||||||
|
from PIL import Image
|
||||||
Returns:
|
from transformers import CLIPTokenizer
|
||||||
np.ndarray: Preprocessed image data with shape (target_size, target_size, 3)
|
from amlnnlite.api import AMLNNLite
|
||||||
"""
|
|
||||||
# Load image
|
# ==================== Utility Functions ====================
|
||||||
img = Image.open(image_path).convert("RGB")
|
|
||||||
width, height = img.size
|
def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
|
||||||
|
"""Compute softmax values for array x."""
|
||||||
# Scale the shorter side
|
x = x - np.max(x, axis=axis, keepdims=True)
|
||||||
scale = target_size / min(width, height)
|
e = np.exp(x)
|
||||||
new_w = int(round(width * scale))
|
return e / np.sum(e, axis=axis, keepdims=True)
|
||||||
new_h = int(round(height * scale))
|
|
||||||
|
|
||||||
# Resize
|
def l2_normalize(x: np.ndarray, axis: int = -1, eps: float = 1e-12) -> np.ndarray:
|
||||||
img = img.resize((new_w, new_h), Image.BILINEAR)
|
"""L2 normalize array x along specified axis."""
|
||||||
|
return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)
|
||||||
# Center crop
|
|
||||||
left = (new_w - target_size) // 2
|
# ==================== Vision Preprocessing ====================
|
||||||
top = (new_h - target_size) // 2
|
|
||||||
img = img.crop((left, top, left + target_size, top + target_size))
|
def preprocess_image(image_path: str, target_size: int = 224) -> np.ndarray:
|
||||||
|
"""
|
||||||
# Convert to numpy array and normalize to [0, 1]
|
Preprocess image for CLIP model.
|
||||||
img_array = np.array(img, dtype=np.float32) / 255.0
|
|
||||||
|
Args:
|
||||||
# CLIP normalization
|
image_path (str): Path to input image
|
||||||
mean = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
|
target_size (int): Target image size (default: 224)
|
||||||
std = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
|
|
||||||
|
Returns:
|
||||||
# Normalize: (x - mean) / std
|
np.ndarray: Preprocessed image data with shape (1, target_size, target_size, 3) in NHWC format
|
||||||
img_array = (img_array - mean) / std
|
"""
|
||||||
|
image = Image.open(image_path).convert("RGB")
|
||||||
# Return in NHWC format
|
width, height = image.size
|
||||||
return img_array
|
|
||||||
|
# Scale the shorter side
|
||||||
|
scale = target_size / min(width, height)
|
||||||
def post_process(
|
new_width = int(width * scale)
|
||||||
image_features: np.ndarray,
|
new_height = int(height * scale)
|
||||||
text_features: np.ndarray,
|
image_resized = image.resize((new_width, new_height), resample=Image.BICUBIC)
|
||||||
scale: float = 100.00000762939453,
|
|
||||||
use_cosine: bool = True,
|
# Center crop
|
||||||
apply_scale: bool = True,
|
left = (new_width - target_size) // 2
|
||||||
) -> float:
|
top = (new_height - target_size) // 2
|
||||||
"""
|
right = left + target_size
|
||||||
Calculate similarity between image and text features.
|
bottom = top + target_size
|
||||||
|
image_cropped = image_resized.crop((left, top, right, bottom))
|
||||||
Args:
|
|
||||||
image_features (np.ndarray): Image feature vector
|
# Convert to numpy array and normalize to [0, 1]
|
||||||
text_features (np.ndarray): Text feature vector
|
image_np = np.array(image_cropped).astype(np.float32) / 255.0
|
||||||
scale (float): Scale factor for similarity calculation
|
|
||||||
use_cosine (bool): If True, L2-normalize both vectors before dot product (cosine similarity)
|
# CLIP normalization
|
||||||
apply_scale (bool): If True, multiply by scale after dot product
|
mean = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
|
||||||
|
std = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
|
||||||
Returns:
|
image_np = (image_np - mean) / std
|
||||||
float: Similarity score
|
|
||||||
"""
|
# Add batch dimension: HWC -> NHWC
|
||||||
img_vec = image_features.flatten().astype(np.float32)
|
image_np = np.expand_dims(image_np, axis=0)
|
||||||
txt_vec = np.array(text_features, dtype=np.float32).flatten()
|
|
||||||
|
return image_np.astype(np.float32) # [1, 224, 224, 3]
|
||||||
if len(img_vec) != len(txt_vec):
|
|
||||||
raise ValueError(f"Feature dimension mismatch: image={len(img_vec)}, text={len(txt_vec)}")
|
# ==================== Text Preprocessing ====================
|
||||||
|
|
||||||
if use_cosine:
|
def preprocess_text(tokenizer: CLIPTokenizer, text: str, max_len: int = 64) -> np.ndarray:
|
||||||
img_norm = np.linalg.norm(img_vec) + 1e-8
|
"""
|
||||||
txt_norm = np.linalg.norm(txt_vec) + 1e-8
|
Preprocess text for CLIP model using CLIPTokenizer.
|
||||||
img_vec = img_vec / img_norm
|
|
||||||
txt_vec = txt_vec / txt_norm
|
Args:
|
||||||
|
tokenizer: CLIPTokenizer instance
|
||||||
dot_product = np.dot(img_vec, txt_vec)
|
text (str): Input text string
|
||||||
|
max_len (int): Maximum sequence length (default: 64)
|
||||||
similarity = dot_product * scale if apply_scale else dot_product
|
|
||||||
|
Returns:
|
||||||
return float(similarity)
|
np.ndarray: Tokenized text with shape (1, max_len) as int64
|
||||||
|
"""
|
||||||
|
enc = tokenizer(
|
||||||
def extract_index(filename: str) -> int:
|
text,
|
||||||
"""
|
padding="max_length",
|
||||||
Extract index from filename pattern: test_xxx_index.jpg
|
truncation=True,
|
||||||
|
max_length=max_len,
|
||||||
Args:
|
return_tensors="np",
|
||||||
filename (str): Filename to extract index from
|
)
|
||||||
|
# text model input: int64[1, max_len]
|
||||||
Returns:
|
input_ids = enc["input_ids"].astype(np.int64)
|
||||||
int: Extracted index, or -1 if pattern doesn't match
|
return input_ids
|
||||||
"""
|
|
||||||
pattern = r"test_\w+_(\d+)\.jpg"
|
# ==================== Model Inference ====================
|
||||||
match = re.match(pattern, filename)
|
|
||||||
if match:
|
def compute_image_embedding(vision_amlnn: AMLNNLite, image_path: str) -> np.ndarray:
|
||||||
return int(match.group(1))
|
"""
|
||||||
return -1
|
Compute image embedding using vision model.
|
||||||
|
|
||||||
|
Args:
|
||||||
def process_image_dir(
|
vision_amlnn: AMLNNLite instance for vision model
|
||||||
amlnn: AMLNNLite,
|
image_path (str): Path to input image
|
||||||
image_dir_path: str,
|
|
||||||
base_dir: str = "",
|
Returns:
|
||||||
json_filename: str = ""
|
np.ndarray: L2-normalized image embedding with shape (1, embed_dim)
|
||||||
) -> list:
|
"""
|
||||||
"""
|
input_data = preprocess_image(image_path) # [1, 224, 224, 3]
|
||||||
Process image directory and find best matching text dataset.
|
|
||||||
|
outputs = vision_amlnn.inference(
|
||||||
Args:
|
inputs=[input_data],
|
||||||
amlnn: AMLNNLite instance
|
inputs_data_format='NHWC',
|
||||||
image_dir_path (str): Path to directory containing test images
|
outputs_data_format='NHWC'
|
||||||
base_dir (str): Base directory for clip datasets (optional, can use CLIP_BASE_DIR env var)
|
)
|
||||||
json_filename (str): JSON filename in each dataset folder (optional, can use CLIP_JSON_FILENAME env var)
|
|
||||||
|
feats = outputs[0].astype(np.float32)
|
||||||
Returns:
|
feats = feats.reshape(1, -1) # Squeeze to [1, embed_dim]
|
||||||
list: List of best matching dataset paths
|
return l2_normalize(feats, axis=1)
|
||||||
"""
|
|
||||||
results = []
|
def compute_text_embedding(text_amlnn: AMLNNLite, tokenizer: CLIPTokenizer, text: str, max_len: int = 64) -> np.ndarray:
|
||||||
file_pattern = re.compile(r"test_(\w+)_\d+\.jpg")
|
"""
|
||||||
image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.JPG', '.JPEG', '.PNG', '.BMP'}
|
Compute text embedding using text model.
|
||||||
|
|
||||||
if not base_dir:
|
Args:
|
||||||
base_dir = os.getenv("CLIP_BASE_DIR", "./clip_datasets/")
|
text_amlnn: AMLNNLite instance for text model
|
||||||
|
tokenizer: CLIPTokenizer instance
|
||||||
if not json_filename:
|
text (str): Input text string
|
||||||
json_filename = os.getenv("CLIP_JSON_FILENAME", "clip_text_res.json")
|
max_len (int): Maximum sequence length
|
||||||
|
|
||||||
matched_files = []
|
Returns:
|
||||||
if os.path.isdir(image_dir_path):
|
np.ndarray: L2-normalized text embedding with shape (1, embed_dim)
|
||||||
for filename in os.listdir(image_dir_path):
|
"""
|
||||||
filepath = os.path.join(image_dir_path, filename)
|
input_ids = preprocess_text(tokenizer, text, max_len) # [1, max_len]
|
||||||
if os.path.isfile(filepath):
|
print(f"input_ids: {input_ids}")
|
||||||
if file_pattern.match(filename):
|
|
||||||
matched_files.append((filename, filepath, True))
|
# AMLNNLite requires 4D input, reshape to (1, 1, 1, max_len)
|
||||||
elif any(filename.lower().endswith(ext) for ext in image_extensions):
|
input_ids_4d = input_ids[:, None, None, :] # [1, 1, 1, max_len]
|
||||||
matched_files.append((filename, filepath, False))
|
|
||||||
elif os.path.isfile(image_dir_path):
|
outputs = text_amlnn.inference(
|
||||||
filename = os.path.basename(image_dir_path)
|
inputs=[input_ids_4d],
|
||||||
if any(filename.lower().endswith(ext) for ext in image_extensions):
|
inputs_data_format='NHWC',
|
||||||
has_pattern = bool(file_pattern.match(filename))
|
outputs_data_format='NHWC'
|
||||||
matched_files.append((filename, image_dir_path, has_pattern))
|
)
|
||||||
else:
|
|
||||||
print(f"Error: {image_dir_path} is not a valid image file")
|
feats = outputs[0].astype(np.float32)
|
||||||
return results
|
feats = feats.reshape(1, -1) # Squeeze to [1, embed_dim]
|
||||||
else:
|
return l2_normalize(feats, axis=1)
|
||||||
print(f"Error: {image_dir_path} is not a valid directory or file")
|
|
||||||
return results
|
def compute_text_embeddings_batch(text_amlnn: AMLNNLite, tokenizer: CLIPTokenizer, texts: list, max_len: int = 64) -> np.ndarray:
|
||||||
|
"""
|
||||||
if not matched_files:
|
Compute text embeddings for multiple texts.
|
||||||
print(f"Warning: No image files found in {image_dir_path}")
|
|
||||||
return results
|
Args:
|
||||||
|
text_amlnn: AMLNNLite instance for text model
|
||||||
print(f"Found {len(matched_files)} image file(s) to process")
|
tokenizer: CLIPTokenizer instance
|
||||||
|
texts (list): List of input text strings
|
||||||
matched_files.sort(key=lambda x: extract_index(x[0]) if x[2] else 999999)
|
max_len (int): Maximum sequence length
|
||||||
|
|
||||||
# Process each image
|
Returns:
|
||||||
for filename, filepath, has_pattern in matched_files:
|
np.ndarray: L2-normalized text embeddings with shape (num_texts, embed_dim)
|
||||||
if has_pattern:
|
"""
|
||||||
match = file_pattern.match(filename)
|
embeddings = []
|
||||||
if match:
|
for text in texts:
|
||||||
name = match.group(1)
|
emb = compute_text_embedding(text_amlnn, tokenizer, text, max_len)
|
||||||
else:
|
embeddings.append(emb[0]) # Remove batch dimension
|
||||||
name = ""
|
return np.stack(embeddings, axis=0) # [num_texts, embed_dim]
|
||||||
else:
|
|
||||||
name = ""
|
# ==================== Similarity Calculation ====================
|
||||||
|
|
||||||
# Preprocess image
|
def compute_similarity(image_embedding: np.ndarray, text_embeddings: np.ndarray, logit_scale: float = 100.0) -> tuple:
|
||||||
try:
|
"""
|
||||||
input_data = preprocess_image(filepath)
|
Compute similarity between image and text embeddings.
|
||||||
input_data = np.expand_dims(input_data, axis=0)
|
|
||||||
except Exception as e:
|
Args:
|
||||||
print(f"Error preprocessing image {filename}: {e}")
|
image_embedding (np.ndarray): Image embedding with shape (1, embed_dim)
|
||||||
continue
|
text_embeddings (np.ndarray): Text embeddings with shape (num_texts, embed_dim)
|
||||||
|
logit_scale (float): Scale factor for logits
|
||||||
# Run inference
|
|
||||||
try:
|
Returns:
|
||||||
outputs = amlnn.inference(inputs=[input_data])
|
tuple: (similarities, logits, probabilities)
|
||||||
model_output = outputs[0]
|
"""
|
||||||
if isinstance(model_output, np.ndarray):
|
# Cosine similarity (embeddings are already L2-normalized)
|
||||||
model_output = model_output.astype(np.float32)
|
sims = text_embeddings @ image_embedding[0] # [num_texts]
|
||||||
else:
|
logits = sims * logit_scale # [num_texts]
|
||||||
model_output = np.array(model_output, dtype=np.float32)
|
probs = softmax(logits, axis=0) # [num_texts]
|
||||||
model_output = model_output.flatten()
|
|
||||||
except Exception as e:
|
return sims, logits, probs
|
||||||
print(f"Error running inference on {filename}: {e}")
|
|
||||||
continue
|
# ==================== Main Function ====================
|
||||||
|
|
||||||
max_sim = float('-inf')
|
def main():
|
||||||
best_key = ""
|
parser = argparse.ArgumentParser(description='CLIP Image-Text Matching Demo using AMLNNLite')
|
||||||
best_id = ""
|
parser.add_argument('--vision-model', required=True, help='Path to vision model (.adla)')
|
||||||
|
parser.add_argument('--text-model', required=True, help='Path to text model (.adla)')
|
||||||
if not os.path.isdir(base_dir):
|
parser.add_argument('--tokenizer-dir', required=True, help='Path to CLIPTokenizer directory')
|
||||||
print(f"Error: Base directory does not exist: {base_dir}")
|
parser.add_argument('--image-path', default=None, help='Path to input image (optional, will prompt if not provided)')
|
||||||
continue
|
parser.add_argument('--texts', nargs='+', default=None, help='List of text descriptions to compare')
|
||||||
|
parser.add_argument('--max-len', type=int, default=64, help='Maximum token sequence length (default: 64)')
|
||||||
print(f"Searching in base directory: {base_dir}")
|
parser.add_argument('--logit-scale', type=float, default=100.0, help='Logit scale factor (default: 100.0)')
|
||||||
folder_count = 0
|
|
||||||
for folder_name in os.listdir(base_dir):
|
args = parser.parse_args()
|
||||||
folder_path = os.path.join(base_dir, folder_name)
|
|
||||||
if not os.path.isdir(folder_path):
|
# Validate model paths
|
||||||
continue
|
if not os.path.exists(args.vision_model):
|
||||||
|
print(f"[Error] Vision model not found: {args.vision_model}")
|
||||||
if has_pattern and name and name not in folder_name:
|
return -1
|
||||||
continue
|
|
||||||
|
if not os.path.exists(args.text_model):
|
||||||
folder_count += 1
|
print(f"[Error] Text model not found: {args.text_model}")
|
||||||
|
return -1
|
||||||
vit_res_path = os.path.join(folder_path, json_filename)
|
|
||||||
if not os.path.isfile(vit_res_path):
|
# Load tokenizer
|
||||||
print(f"Warning: JSON file not found: {vit_res_path}")
|
print(f"[Info] Loading CLIPTokenizer from: {args.tokenizer_dir}")
|
||||||
continue
|
tokenizer = CLIPTokenizer.from_pretrained(args.tokenizer_dir)
|
||||||
|
|
||||||
try:
|
# Initialize vision model
|
||||||
with open(vit_res_path, 'r', encoding='utf-8') as f:
|
print(f"[Info] Initializing vision model: {args.vision_model}")
|
||||||
vit_json = json.load(f)
|
vision_amlnn = AMLNNLite()
|
||||||
|
vision_amlnn.config(model_path=args.vision_model, run_cycles=1)
|
||||||
for key, text_vec in vit_json.items():
|
vision_amlnn.init()
|
||||||
if isinstance(text_vec, list):
|
|
||||||
text_features = np.array(text_vec, dtype=np.float32)
|
# Initialize text model
|
||||||
sim_scaled = post_process(
|
print(f"[Info] Initializing text model: {args.text_model}")
|
||||||
model_output,
|
text_amlnn = AMLNNLite()
|
||||||
text_features,
|
text_amlnn.config(model_path=args.text_model, run_cycles=1)
|
||||||
use_cosine=True,
|
text_amlnn.init()
|
||||||
apply_scale=True,
|
|
||||||
)
|
print("[Info] Models initialized successfully.\n")
|
||||||
|
|
||||||
if sim_scaled > max_sim:
|
try:
|
||||||
max_sim = sim_scaled
|
# Interactive loop
|
||||||
best_key = key
|
while True:
|
||||||
best_id = folder_name
|
# Get image path
|
||||||
except Exception as e:
|
if args.image_path:
|
||||||
print(f"Error loading JSON file {vit_res_path}: {e}")
|
image_path = args.image_path
|
||||||
continue
|
args.image_path = None # Clear for next iteration
|
||||||
|
else:
|
||||||
if best_key and best_id:
|
print("=" * 60)
|
||||||
best_path = os.path.join(base_dir, best_id)
|
print("[Info] Image Path (or 'exit' to quit):")
|
||||||
results.append(best_path)
|
image_path = input().strip()
|
||||||
print(f"\nProcessing image: {filename}")
|
|
||||||
print(f" Best matching dataset: {best_path}")
|
# Check for exit
|
||||||
else:
|
if image_path.lower() == 'exit':
|
||||||
print(f"\nProcessing image: {filename}")
|
print("[Info] Exiting...")
|
||||||
print(f" No matching dataset found (searched {folder_count} folder(s))")
|
break
|
||||||
|
|
||||||
return results
|
# Validate image path
|
||||||
|
if not image_path:
|
||||||
|
print("[Warning] Please enter an image path.")
|
||||||
def main():
|
continue
|
||||||
parser = argparse.ArgumentParser(description='CLIP Image-Text Matching Demo')
|
|
||||||
parser.add_argument('--model-path', required=True, help='Path to the CLIP model file')
|
if not os.path.exists(image_path):
|
||||||
parser.add_argument('--base-dir', default='./clip_datasets/', help='Base directory for clip datasets (can also use CLIP_BASE_DIR env var)')
|
print(f"[Error] Image not found: {image_path}")
|
||||||
parser.add_argument('--json-filename', default='clip_text_res.json', help='JSON filename in each dataset folder (can also use CLIP_JSON_FILENAME env var, default: clip_text_res.json)')
|
continue
|
||||||
parser.add_argument('--image-dir', default='./', help='Image directory or single image file to process (optional, will prompt if not provided)')
|
|
||||||
args = parser.parse_args()
|
# Get texts to compare
|
||||||
|
if args.texts:
|
||||||
# Initialize AMLNNLite
|
texts = args.texts
|
||||||
print("Initializing model...")
|
args.texts = None # Clear for next iteration
|
||||||
amlnn = AMLNNLite()
|
else:
|
||||||
amlnn.config(model_path=args.model_path)
|
print("[Info] Enter text descriptions (comma-separated, or 'skip' to use defaults):")
|
||||||
amlnn.init()
|
text_input = input().strip()
|
||||||
print("Model initialized successfully.\n")
|
|
||||||
|
if text_input.lower() == 'skip' or not text_input:
|
||||||
# Process images
|
# Default texts for demo
|
||||||
if args.image_dir:
|
texts = [
|
||||||
results = process_image_dir(amlnn, args.image_dir, args.base_dir, args.json_filename)
|
"a red handbag",
|
||||||
print(f"\nTotal results: {len(results)}")
|
"a blue jacket",
|
||||||
for i, result in enumerate(results):
|
"a red bus",
|
||||||
print(f"Index[{i}]: {result}")
|
]
|
||||||
else:
|
print(f"[Info] Using default texts: {texts}")
|
||||||
while True:
|
else:
|
||||||
image_path = input("\nPlease enter the JPG image path or directory (enter 'exit' to quit):\n").strip()
|
texts = [t.strip() for t in text_input.split(',') if t.strip()]
|
||||||
|
|
||||||
if image_path.lower() == 'exit':
|
if not texts:
|
||||||
break
|
print("[Warning] No texts provided.")
|
||||||
|
continue
|
||||||
if not image_path:
|
|
||||||
print("The path cannot be empty.")
|
try:
|
||||||
continue
|
# Compute image embedding
|
||||||
|
print(f"\n[Info] Processing image: {image_path}")
|
||||||
results = process_image_dir(amlnn, image_path, args.base_dir, args.json_filename)
|
image_embedding = compute_image_embedding(vision_amlnn, image_path)
|
||||||
|
print(f"[Info] Image embedding shape: {image_embedding.shape}")
|
||||||
for i, result in enumerate(results):
|
|
||||||
print(f"Index[{i}]: {result}")
|
# Compute text embeddings
|
||||||
|
print(f"[Info] Processing {len(texts)} text(s)...")
|
||||||
amlnn.uninit()
|
text_embeddings = compute_text_embeddings_batch(text_amlnn, tokenizer, texts, args.max_len)
|
||||||
print("\nDone.")
|
print(f"[Info] Text embeddings shape: {text_embeddings.shape}")
|
||||||
|
|
||||||
|
# Compute similarity
|
||||||
if __name__ == "__main__":
|
sims, logits, probs = compute_similarity(image_embedding, text_embeddings, args.logit_scale)
|
||||||
main()
|
|
||||||
|
# Print results
|
||||||
|
print("\n" + "=" * 60)
|
||||||
|
print("CLIP Image-Text Matching Results")
|
||||||
|
print("=" * 60)
|
||||||
|
print(f"Image: {image_path}")
|
||||||
|
print(f"logit_scale: {args.logit_scale:.6f}")
|
||||||
|
print("-" * 60)
|
||||||
|
|
||||||
|
# Sort by probability (descending)
|
||||||
|
sorted_indices = np.argsort(probs)[::-1]
|
||||||
|
for rank, i in enumerate(sorted_indices):
|
||||||
|
print(f"[{rank + 1}] prob={probs[i]:.6f} sim={float(sims[i]):.6f} text='{texts[i]}'")
|
||||||
|
|
||||||
|
print("=" * 60 + "\n")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"[Error] Processing failed: {e}")
|
||||||
|
import traceback
|
||||||
|
traceback.print_exc()
|
||||||
|
continue
|
||||||
|
|
||||||
|
except KeyboardInterrupt:
|
||||||
|
print("\n\n[Info] Interrupted by user. Exiting...")
|
||||||
|
|
||||||
|
finally:
|
||||||
|
# Cleanup
|
||||||
|
vision_amlnn.uninit()
|
||||||
|
text_amlnn.uninit()
|
||||||
|
|
||||||
|
print("[Info] Done.")
|
||||||
|
return 0
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
import sys
|
||||||
|
sys.exit(main())
|
||||||
|
|
|
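The new `preprocess_image` scales the shorter side with `int(width * scale)`, which truncates, whereas the previous version used `int(round(...))`. Because `target_size / min(width, height)` is a floating-point quotient, truncation may in principle leave the shorter side one pixel below `target_size`, in which case the center crop would extend past the image edge. A minimal sketch of a rounding variant with a lower bound, assuming that behavior is wanted; `resized_dims` and `center_crop_box` are hypothetical helpers, not part of the commit.

```python
def resized_dims(width, height, target_size=224):
    """Return (new_width, new_height) with the shorter side scaled to target_size."""
    scale = target_size / min(width, height)
    # round() avoids truncating values such as 223.9999...;
    # max() guarantees neither side ends up smaller than the crop window.
    new_width = max(target_size, int(round(width * scale)))
    new_height = max(target_size, int(round(height * scale)))
    return new_width, new_height


def center_crop_box(new_width, new_height, target_size=224):
    """Return the (left, top, right, bottom) box for a centered square crop."""
    left = (new_width - target_size) // 2
    top = (new_height - target_size) // 2
    return left, top, left + target_size, top + target_size


# Example: a 500x375 landscape image.
dims = resized_dims(500, 375)
box = center_crop_box(*dims)
```

The returned box has the same tuple layout that `Image.crop` in Pillow expects, so it could be passed through unchanged.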
||||||
48895
examples/clip/tokenizer_path/merges.txt
Executable file
File diff suppressed because it is too large
1
examples/clip/tokenizer_path/vocab.json
Executable file
File diff suppressed because one or more lines are too long