Figure 2 and Algorithm 1 summarize the steps of the proposed technique. HybridFlow is the refined flow resulting from the interpolation of the combined initial flows, which are computed from sparse graph matching of superpixels and from feature matching of pixels in small clusters, as discussed below.

### Perceptual grouping and feature matching

Feature descriptors encode distinctive information about a pixel and form the basis for perceptual grouping and matching. We conduct experiments with three different feature descriptors: rootSIFT proposed in Ref.^{28}, DeepLab pre-trained on ImageNet, and pre-trained encoders with the same architecture as in Ref.^{25}. As discussed later in the experimental results and in the Implementation Details section, the last descriptor leads to the best performance. Next, we group pixels based on their feature descriptors to replace the rigid structure of the pixel grid, as in Fig. 1b. In particular, we classify each pixel by the argmax of its N-dimensional feature descriptor and aggregate pixels with the same index into clusters. Thus, each pixel *p* is assigned a cluster index \(i_{p}\) given by

$$\begin{aligned} i_{p} = {{\,\mathrm{arg\,max}\,}}(\mathrm{Softmax}(\mathrm{ReLU}(\mathscr{F}_{c}(p)))), \end{aligned}$$

(3)

where \(\mathscr{F}_{c}\) is the feature descriptor. This results in an arbitrary number of coarse-scale clusters in each image, which are matched according to their cluster indices. A cluster need not be contiguous. Since the index is computed as in Eq. (3), it indicates the class of the object and is used during graph matching to match clusters of the same class, as explained in the following section.
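The index assignment of Eq. (3) can be sketched in a few lines. This is a minimal illustration in plain Python rather than the implementation used in the experiments; the function names and the dictionary layout (pixel coordinate mapped to descriptor) are assumptions made for the example.

```python
import math

def relu(v):
    # Clamp negative descriptor entries to zero.
    return [max(0.0, x) for x in v]

def softmax(v):
    # Numerically stable softmax: subtract the maximum before exponentiating.
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def cluster_index(descriptor):
    # Eq. (3): index of the largest entry of Softmax(ReLU(descriptor)).
    # Ties are resolved in favour of the lowest index.
    probs = softmax(relu(descriptor))
    return max(range(len(probs)), key=probs.__getitem__)

def group_pixels(descriptors):
    # descriptors: dict mapping a pixel coordinate to its descriptor
    # (hypothetical layout). Returns dict: cluster index -> list of pixels.
    clusters = {}
    for pixel, d in descriptors.items():
        clusters.setdefault(cluster_index(d), []).append(pixel)
    return clusters
```

Because the grouping depends only on the descriptor, pixel positions play no role in which cluster a pixel joins.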

Pixels contained in clusters with an area of less than 10,000 pixels are matched according to the similarity of their feature descriptors using the sum of squared differences (SSD) ratio test. Outliers in the initial matches are removed from subsequent processing using RANSAC, which finds a localized fundamental matrix per cluster.
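As a rough illustration of the SSD ratio test, the sketch below accepts a match only when the best candidate is clearly better than the second best. The threshold `max_ratio=0.8` and the function names are assumptions for the example; the section does not state the value used.

```python
def ssd(a, b):
    # Sum of squared differences between two descriptors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ratio_test_matches(desc1, desc2, max_ratio=0.8):
    # desc1, desc2: lists of descriptors from the two clusters.
    # max_ratio is a hypothetical threshold, not taken from the paper.
    matches = []
    if len(desc2) < 2:
        return matches  # the test needs a best and a second-best candidate
    for i, d in enumerate(desc1):
        dists = sorted((ssd(d, e), j) for j, e in enumerate(desc2))
        (best, j), (second, _) = dists[0], dists[1]
        # Keep the match only if it is unambiguous.
        if best < max_ratio * second:
            matches.append((i, j))
    return matches
```

A descriptor that is equally close to two candidates is rejected, which removes ambiguous matches before the RANSAC step.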

The initial sparse flow resulting from this step consists of the flow calculated from each of the inlier features. Figure 1f shows the initial flow resulting from sparse feature matching of the pixels contained in all small clusters. Each pixel is enlarged to \(10 \times 10\) for clarity in the visualization.

Coarse-scale clusters larger than 10,000 pixels in area are further clustered using simple linear iterative clustering (SLIC), which adapts k-means clustering to group pixels into perceptually meaningful atomic regions^{29}. The parameter \(\kappa\) is calculated based on the image size and the desired superpixel size and is given by \(\kappa = \frac{|I|}{|s|}\), where \(|s| \approx 2223, s \in \mathscr{S}\) and |*I*| is the size of the image. This restricts the number of approximately equally sized superpixels \(\mathscr{S}\); in our experiments discussed in the Implementation Details section, the optimal value for \(\kappa\) is \(\approx 250\) to 300. For the finer-scale superpixels \(\mathscr{S}\), a graph is constructed in which each node corresponds to a superpixel centroid and the edges result from a Delaunay triangulation, as discussed in the Graph matching section below.
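The relation \(\kappa = |I| / |s|\) is simple enough to check by hand. The helper below is a hypothetical convenience function, and the 1000 × 600 image in the test is an illustrative size, not one of the benchmark resolutions.

```python
def superpixel_count(height, width, target_area=2223):
    # kappa = |I| / |s|, rounded to the nearest integer and at least 1.
    # target_area is the approximate superpixel size |s| quoted in the text.
    return max(1, round(height * width / target_area))
```

For an image of 600 × 1000 pixels this gives 600,000 / 2223 ≈ 269.9, i.e. about 270 superpixels.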

### Graph matching

The two sets of superpixels contained in the matched coarse-scale clusters of images \(I_{1}, I_{2}\) are represented by the graph model described in the Graph Model and Matching section. For every superpixel *S*, the nodes *P* are a subset of all pixels *p* in *S*, i.e. \(P \subseteq \{p : \forall p \in S \in I\}\). The edges *E* and topology *T* of each graph are derived from a Delaunay triangulation of the nodes *P*. The graph is undirected, and the edge weight function *w*(., .) is symmetric with respect to edges \(\vec{e_{a}}, \vec{e_{b}} \in E\), so that \(w(\vec{e_{a}}, \vec{e_{b}}) = w(\vec{e_{b}}, \vec{e_{a}})\). The similarity functions \(\lambda^{P}(.,.)\) and \(\lambda^{E}(.,.)\) are also symmetric; for nodes \(p_{i}, p_{j} \in P_{1}\), \(p_{k}, p_{l} \in P_{2}\) and edges \(e_{a} \in E_{1}\), \(e_{b} \in E_{2}\), the similarity functions are given by

$$\begin{aligned}&\lambda ^{P} (p_{i}, p_{k}) = e^{-\bigg |d^{P}(f(p_{i}), f(p_{k}))\bigg |}, \end{aligned}$$

(4)

$$\begin{aligned}&\lambda ^{E} (e_{a}, e_{b}) = e^{-\frac{1}{2}\left[ \Phi ^{\circ } + \bigg |d^{E}(\theta _{e_{a}}, \theta _{e_{b}})\bigg | + \bigg |d^{L}(e_{a}, e_{b})\bigg | \right] }, \end{aligned}$$

(5)

where \(\Phi^{\circ}\) is given by

$$\begin{aligned} \Phi ^{\circ }&= \Phi ^{1}_{gradient}(f(p_{i}), f(p_{j}), f(p_{k}), f(p_{l})) + \Phi ^{2}_{gradient}(f(p_{i}), f(p_{j}), f(p_{k}), f(p_{l})) \nonumber \\&\quad + \Phi ^{1}_{color}(\mathscr{C}(p_{i}), \mathscr{C}(p_{j}), \mathscr{C}(p_{k}), \mathscr{C}(p_{l})) + \Phi ^{2}_{color}(\mathscr{C}(p_{i}), \mathscr{C}(p_{j}), \mathscr{C}(p_{k}), \mathscr{C}(p_{l})), \end{aligned}$$

(6)

$$\begin{aligned} \Phi ^{1}_{gradient}&= \bigg | d^{P}(f(p_{i}), f(p_{k})) \bigg | + \bigg | d^{P}(f(p_{j}), f(p_{l})) \bigg |, \nonumber \\ \Phi ^{1}_{color} &= \bigg | d^{\mathscr{C}}(f(p_{i}), f(p_{k}))\bigg| + \bigg | d^{\mathscr {C}}(f(p_{j}), f(p_{l})) \bigg |, \end{aligned}$$

(7)

$$\begin{aligned} \Phi ^{2}_{gradient} &= \bigg | d^{P}(f(p_{i}), f(p_{j}))\bigg | - \bigg | d^{P}(f(p_{k}), f(p_{l})) \bigg |, \nonumber \\ \Phi ^{2}_{color} &= \bigg | d^{\mathscr{C}}(f(p_{i}), f(p_{j}))\bigg| - \bigg | d^{\mathscr{C}}(f(p_{k}), f(p_{l}))\bigg|. \end{aligned}$$

(8)

Here, \(f: P \longrightarrow S\) is a feature descriptor with cardinality *S* for a node \(p\in P\); \(\mathscr{C}:P \longrightarrow 6\) is a function that calculates the 6-vector \(<\mu _{r}, \mu _{g}, \mu _{b}, \sigma _{r}, \sigma _{g}, \sigma _{b}>\) containing the color distribution means and variances (\(\mu , \sigma \)) at *p*, modeled as a 1D Gaussian for each color channel; \(d^{P}: S \times S \longrightarrow \mathbb {R}\) is the \(\mathscr{L}^{1}\)-norm of the difference between the feature descriptors of two nodes in \(p_{i}, p_{j}, p_{k}, p_{l} \in P\); \(d^{E}:\mathbb{R}\times\mathbb{R}\longrightarrow\mathbb{R}\) is the difference between the angles \(\theta _{e_{a}}, \theta _{e_{b}}\) that the two edges \(e_{a}\in E_{1}, e_{b}\in E_{2}\) make with the horizontal axis; and \(d^{\mathscr{C}}: 6 \times 6 \longrightarrow \mathbb {R}\) is the \(\mathscr{L}^{1}\)-norm of the difference between the two 6-vectors containing color distribution information for two nodes in \(p_{i}, p_{j}, p_{k}, p_{l} \in P\).
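To make Eqs. (4) to (8) concrete, the following sketch evaluates them for one pair of edges. It is an illustration under stated assumptions, not the authors' code: all function names are hypothetical, the color terms are evaluated on the 6-vectors in line with the definition of \(d^{\mathscr{C}}\), the angle term is a plain difference without wrap-around, and \(d^{L}\) is passed in as a precomputed value because it is not defined in this section.

```python
import math

def l1(u, v):
    # L1 norm of the difference between two vectors (d^P and d^C).
    return sum(abs(a - b) for a, b in zip(u, v))

def node_similarity(f_i, f_k):
    # Eq. (4): similarity between node p_i in graph 1 and p_k in graph 2.
    return math.exp(-abs(l1(f_i, f_k)))

def phi_first(x_i, x_j, x_k, x_l):
    # Eq. (7): endpoints compared across the two graphs (i with k, j with l).
    return abs(l1(x_i, x_k)) + abs(l1(x_j, x_l))

def phi_second(x_i, x_j, x_k, x_l):
    # Eq. (8): difference between the within-edge differences,
    # implemented with the sign exactly as written in the equation.
    return abs(l1(x_i, x_j)) - abs(l1(x_k, x_l))

def phi_total(f, c):
    # Eq. (6). f and c are 4-tuples (i, j, k, l) holding the feature
    # descriptors and the color 6-vectors of the four endpoints.
    return phi_first(*f) + phi_second(*f) + phi_first(*c) + phi_second(*c)

def edge_similarity(f, c, theta_a, theta_b, d_len):
    # Eq. (5). d_len stands in for d^L(e_a, e_b), supplied by the caller.
    d_angle = abs(theta_a - theta_b)
    return math.exp(-0.5 * (phi_total(f, c) + d_angle + abs(d_len)))
```

Two identical edges give \(\Phi^{\circ} = 0\) and therefore a similarity of 1, and the similarity decays exponentially as the descriptors, colors, or angles diverge.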

\(\Phi^{1}_{*}\) denotes first-order similarities and measures similarities between the vertices and edges of the two graphs. In addition to the first-order similarities \(\Phi^{1}_{*}\), the functions in the above equations define second-order similarities \(\Phi^{2}_{*}\), which have been shown to improve matching performance^{30}. That is, rather than using only first-order similarity functions, which yield small differences for similar gradients/colors and large differences otherwise, we additionally include the second-order similarities defined above, which measure the similarity between two gradients and colors using the *distance between their differences*^{31}. For example, the first-order similarity \(\Phi ^{1}_{gradient}\) calculates the distance between the feature descriptors of corresponding nodes in the two graphs, i.e. \(\lambda^{P}(p_{i}, p_{k})\) in Eq. (4), while the second-order similarity calculates the *distance between the feature descriptor differences of the endpoints in each graph*, i.e. \(\Phi ^{2}_{gradient}\) and \(\Phi^{2}_{color}\) in Eq. (8). A descriptor \(f(s_{i})\), as in Eq. (6), is calculated for each centroid node representing a superpixel \(s_{i}\in \mathscr{S}\) as the average of the feature descriptors of all the pixels it contains, \(f(s_{i}) = \frac{1}{|s_{i}|} \sum _{\forall p\in s_{i} \subset I} \phi _{p}\), where \(|s_{i}|\) is the number of pixels in superpixel \(s_{i}\) and \(\phi_{p}\) is the feature descriptor of pixel \(p\in s_{i} \subset I\).
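The centroid descriptor \(f(s_{i})\) is a component-wise mean, which can be sketched as follows. The function name and the list-of-lists input are assumptions made for the example.

```python
def superpixel_descriptor(pixel_descriptors):
    # pixel_descriptors: non-empty list with one descriptor per pixel
    # of the superpixel. Returns their component-wise average f(s_i).
    n = len(pixel_descriptors)
    if n == 0:
        raise ValueError("a superpixel must contain at least one pixel")
    dim = len(pixel_descriptors[0])
    return [sum(d[k] for d in pixel_descriptors) / n for k in range(dim)]
```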

Given the function definitions above, graph matching is achieved by maximizing Eq. (1) using a path-following algorithm. \({\textbf{K}}\) is factorized into a Kronecker product of six smaller matrices, which keeps the computational complexity manageable on graphs with \(N, M \approx 300\) nodes^{32}. Furthermore, robustness to geometric transformations such as rotation and scale is increased by finding an optimal transformation simultaneously with the optimal correspondences, thus enforcing global rigid (e.g. similarity, affine) and non-rigid geometric constraints during optimization^{33}.

The result is a set of superpixel matches within the matched coarse-scale clusters. Assuming piecewise rigid motion, we use RANSAC to remove outliers from the superpixel matches. For every superpixel *s* with at least three matched neighbors, we fit an affine transformation. We check only whether the superpixel *s* itself is an outlier, in which case it is removed from further processing. This process is repeated for all small clusters and graph-matched superpixels. We proceed by matching the pixels contained in the matched superpixels based on their feature descriptors. As in the Perceptual grouping and feature matching section, we remove outlier pixel matches contained within the superpixels by using RANSAC to find a localized fundamental matrix.
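The per-superpixel check can be illustrated with a simplified stand-in for the RANSAC step: a single least-squares affine fit to the neighbors' centroid matches, followed by a residual test on the superpixel itself. The sketch assumes that the fit is estimated from the neighbors only; the tolerance `tol` and all function names are hypothetical and not taken from the paper.

```python
import math

def solve3(a, b):
    # Solve the 3x3 system a x = b by Gaussian elimination with pivoting.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate configuration (collinear points)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):
        s = sum(m[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (m[r][3] - s) / m[r][r]
    return x

def fit_affine(src, dst):
    # Least-squares affine map src -> dst from at least three point pairs.
    # Returns two rows [a, b, t], one per output coordinate.
    rows = [[x, y, 1.0] for x, y in src]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for axis in (0, 1):
        atb = [sum(r[i] * d[axis] for r, d in zip(rows, dst)) for i in range(3)]
        params.append(solve3(ata, atb))
    return params

def is_outlier(centroid, match, nbr_src, nbr_dst, tol=3.0):
    # Predict where the superpixel centroid should land under the affine
    # map fitted to its neighbors and compare with its actual match.
    (ax, bx, tx), (ay, by, ty) = fit_affine(nbr_src, nbr_dst)
    px = ax * centroid[0] + bx * centroid[1] + tx
    py = ay * centroid[0] + by * centroid[1] + ty
    return math.hypot(px - match[0], py - match[1]) > tol
```

A full RANSAC loop would repeat the fit on random subsets of the neighbors and keep the model with the most inliers; the single fit above only shows the residual test.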

The initial sparse flow resulting from graph matching consists of a flow computed from each pixel contained in the matched superpixels. Figure 1b shows the result of clustering the feature descriptors for the image in Fig. 1a. Large-area clusters are further subdivided into superpixels. The graph nodes correspond to the centroid of each superpixel, and the edges result from the Delaunay triangulation of the nodes, as explained above. Figure 1c,d shows the result of graph matching of superpixels within a matched coarse-scale cluster. Matches are color-coded, and unmatched nodes are shown as smaller yellow circles. Examples of unmatched nodes appear in the left part of the left image in Fig. 1c. The images shown are from the MPI-Sintel benchmark dataset^{13}.

### Interpolation and refinement

The combined initial sparse flows (Fig. 1e,f), computed from sparse feature matching and graph matching as described above in the Perceptual grouping and feature matching and Graph matching sections, respectively, are first interpolated and then refined. For the interpolation, we apply an edge-preserving technique^{10}. This creates a dense flow, as shown in Fig. 1g. In the final step, we refine the interpolated flow by variational optimization on the full scale of the initial flows, i.e. without a coarse-to-fine scheme, with the same data and smoothness terms as used in Ref.^{10}. The final result is shown in Fig. 1h.
