Roxana Virlan, Denisa Gal
Introduction
"Reasoning is the process of drawing logical conclusions from given
information. It is a key component of AI applications such as expert
systems, natural language processing and machine learning. It allows
computers to draw logical conclusions from data and knowledge, and
to make decisions based on those conclusions." [1]
ViperGPT [2] and VisProg [3] are two of the newest reasoning-based frameworks to emerge in the field of machine learning.
ViperGPT
ViperGPT [2] is a framework that leverages code-generation models to compose vision-and-language models into subroutines to produce a result for any query. It utilizes a provided API to access the available modules, and composes them by generating Python code that is later executed. This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks.
So how exactly does it work? ViperGPT creates a customized program for each query that takes images or videos as arguments and returns the result of the query for that input. Given a visual input x and a textual query q about its contents, the authors first synthesize a program z = π(q) with a program generator π given the query. The execution engine r = ϕ(x, z) then executes the program z on the input x and produces a result r. This framework is flexible, supporting images or videos as inputs x, questions or descriptions as queries q, and any type (e.g., text or image crops) as outputs r.
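The two-stage pipeline can be pictured with a short Python sketch. This is only an illustration of the idea under assumed names: llm_complete, API_PROMPT, generate_program and answer_query are hypothetical stand-ins, not ViperGPT's actual implementation.

    # Sketch of ViperGPT's generate-then-execute pipeline (illustrative only).
    def llm_complete(prompt: str) -> str:
        """Placeholder for a call to a code-generation LLM."""
        raise NotImplementedError("plug in a code LLM here")

    # Hypothetical prompt containing the API specification exposed to the model.
    API_PROMPT = "You may use the ImagePatch class. Write execute_command(image)."

    def generate_program(query: str) -> str:
        """Program generator pi: synthesize code z from the textual query q."""
        return llm_complete(API_PROMPT + "\n# Query: " + query)

    def execute_program(program: str, image):
        """Execution engine phi: run the generated code z on the visual input x."""
        namespace = {}
        exec(program, namespace)                 # defines execute_command(image)
        return namespace["execute_command"](image)

    def answer_query(image, query: str):
        z = generate_program(query)              # z = pi(q)
        return execute_program(z, image)         # r = phi(x, z)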
The model’s prior training on code enables it to reason about how to use these functions and implement the relevant logic. Benefits of this approach (an example generated program follows the list below):
interpretable, as all the steps are explicit code function calls with intermediate values that can be inspected;
logical, as it explicitly uses built-in Python logical and mathematical operators;
flexible, as it can easily incorporate any vision or language module, only requiring that the specification of the associated module be added to the API;
compositional, as it decomposes tasks into smaller sub-tasks performed step by step;
adaptable to advances in the field, as improvements in any of the used modules will directly improve the framework’s performance;
training-free, as it does not require re-training (or fine-tuning) a new model for every new task;
general, as it unifies all tasks into one system.
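To make the interpretability and compositionality concrete, here is the kind of program ViperGPT might generate for a query such as "How many black cats are in the image?". The ImagePatch abstraction follows the API described in the paper, but the query and the exact method signatures below are illustrative assumptions; ImagePatch itself is assumed to be provided by the framework's API.

    # Illustrative generated program; ImagePatch is provided by the framework's API.
    def execute_command(image) -> str:
        image_patch = ImagePatch(image)
        cat_patches = image_patch.find("cat")                  # detection sub-task
        black_cats = [cat for cat in cat_patches
                      if cat.verify_property("cat", "black")]  # attribute sub-task
        return str(len(black_cats))                            # counting via plain Python

Every step is an explicit function call whose intermediate values (cat_patches, black_cats) can be inspected, and the final count uses ordinary Python rather than another black-box model.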
The authors use several evaluation settings to showcase the model’s diverse capabilities in varied contexts without additional training: visual grounding, compositional image question answering, external knowledge-dependent image question answering, and video causal and temporal reasoning.
Visual grounding is the task of identifying the bounding box in an image that best corresponds to a given natural language query. Visual grounding tasks evaluate reasoning about spatial relationships and visual attributes. For evaluation, the RefCOCO and RefCOCO+ datasets were used (TABLE 1).
Compositional Image Question Answering: the authors also evaluate ViperGPT on image question answering, focusing on compositional question answering, which requires decomposing complex questions into simpler tasks. The GQA dataset was used, which was created to measure performance on complex compositional questions (TABLE 2).
External Knowledge-dependent Image Question Answering: by equipping ViperGPT with a module that queries external knowledge bases in natural language, it can combine knowledge with visual reasoning to handle such questions. The authors evaluate on the OK-VQA dataset, which is designed to measure models’ ability to answer questions about images that require knowledge that cannot be found in the image (TABLE 3).
Video Causal/Temporal Reasoning: the authors also evaluate how ViperGPT extends to videos and to queries that require causal and temporal reasoning. To explore this, they use the NExT-QA dataset, designed to evaluate video models’ ability to perform this type of reasoning.
In summary, ViperGPT is a framework for the programmatic composition of specialized vision, language, math, and logic functions to answer complex visual queries. By connecting individual advances in vision and language, ViperGPT achieves capabilities beyond what any individual model can do on its own.
VisProg
VISPROG [3] is a neuro-symbolic approach that takes natural language instructions and solves complex, compositional visual tasks. It avoids the need for task-specific training, using the in-context learning ability of LLMs (large language models) to generate Python-like modular programs, which are then executed to obtain both the solution and a comprehensive, interpretable rationale.
It takes as inputs visual data (a single image or a set of images) along with a natural language instruction, generates a sequence of steps (a visual program), and then executes them to produce the desired output.
Each step invokes one of the modules of the system. Each module is implemented as a Python class, with methods for parsing its step, executing it, and summarizing it (for the visual rationale). There are currently 20 modules, enabling capabilities such as:
image understanding
image manipulation (including generation)
knowledge retrieval
performing logical and arithmetic operations
VISPROG prompts GPT-3 with pairs of instructions and the corresponding desired high-level programs, making use of its in-context learning ability to produce visual programs for new natural language instructions. Each line of the program consists of the name of a module, the input argument names for the module and their values, and an output variable name (Figure 3). The resulting program can then be executed on the input image to obtain the desired effect. This execution is handled by the interpreter.
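For instance, a program in this format for the instruction "Hide the face of Daniel Craig with :p" might look roughly as follows. The module names (FACEDET, SELECT, EMOJI, RESULT) follow the paper's examples, but the exact arguments shown here are assumptions:

    OBJ0=FACEDET(image=IMAGE)
    OBJ1=SELECT(image=IMAGE,object=OBJ0,query='Daniel Craig',category=None)
    IMAGE0=EMOJI(image=IMAGE,object=OBJ1,emoji='face_with_tongue')
    FINAL_RESULT=RESULT(var=IMAGE0)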
The interpreter initializes the program state with the inputs and steps through the program line by line, invoking the correct module with the inputs specified in that line. After each step is executed, its outputs are added to the program state, where subsequent steps can use them as inputs.
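A minimal Python sketch of this execution loop is given below, assuming each module class exposes parse and execute methods as described above; the exact signatures are assumptions, and the real interpreter also records the HTML snippets for the rationale.

    # Simplified VISPROG-style interpreter loop (illustrative only).
    def run_program(program_lines, inputs, modules):
        state = dict(inputs)                      # program state starts with the inputs
        for line in program_lines:
            out_var, call = line.split("=", 1)    # e.g. "OBJ0=FACEDET(image=IMAGE)"
            module = modules[call.split("(")[0]]  # look up the module named in this step
            args = module.parse(call, state)      # resolve argument values from the state
            state[out_var.strip()] = module.execute(**args)  # store output for later steps
        return state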
Additionally, each module class produces an HTML snippet that visually summarizes the inputs and outputs of the module. The snippets of all steps are then stitched together by the interpreter into a visual rationale (Figure 4).
This rationale can be used to analyze the logical correctness of the program and inspect the intermediate outputs; it also allows users to understand where the program fails and how to minimally tweak the instructions to improve the results.
The framework is evaluated on a set of four tasks:
Compositional Visual Question Answering - “Is the small truck to the left or to the right of the people that are wearing helmets?”
Zero-Shot Reasoning on Image Pairs - “Which landmark did we visit the day after we saw the Eiffel Tower?”
Factual Knowledge Object Tagging - “List the main characters on the TV show Big Bang Theory separated by commas.”
Image Editing with Natural Language - “Hide the face of Daniel Craig with :p” (de-identification or privacy preservation), “Create a color pop of Daniel Craig and blur the background” (object highlighting), “Replace Barack Obama with Barack Obama wearing sunglasses” (object replacement)
VISPROG is a powerful framework that uses the in-context learning ability of LLMs to take natural language instructions and generate visual programs for complex, compositional visual tasks.
Conclusion
These simple reasoning-based approaches, which require no further training and make use of GPT-3, achieve state-of-the-art results across various complex visual tasks and show great promise for the future of visual programming in machine learning.
References
[1] https://www.autoblocks.ai/glossary/reasoning-system, last accessed 23.10.2023
[2] D. Surís, S. Menon and C. Vondrick, “ViperGPT: Visual Inference via Python Execution for Reasoning,” Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023
[3] T. Gupta and A. Kembhavi, “Visual Programming: Compositional Visual Reasoning Without Training,” IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 2023