Shrimp counting is essential for farmers to estimate and manage hatching. However, counting shrimp from images is challenging for several reasons, including the small size of shrimp and their transparent bodies, which are hard to see. Counting also poses a challenge that detection alone does not: distinguishing multiple overlapping shrimps.

Deep learning is a natural choice for cluttered scenes, where conventional machine vision methods struggle with semantic segmentation.

Shrimp farmers who use ponds for production rely on cast netting shrimp and then relating the amount caught in the surface area of the cast net to the surface area of the entire pond.
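The cast-net estimate described above amounts to scaling the catch by the ratio of the two surface areas. A minimal sketch follows; the function name, parameter names, and the example figures are hypothetical and only illustrate the arithmetic.

```python
def estimate_pond_population(shrimp_in_net: int,
                             net_area_m2: float,
                             pond_area_m2: float) -> float:
    """Scale a cast-net catch up to the whole pond by the ratio of surface areas."""
    if net_area_m2 <= 0 or pond_area_m2 <= 0:
        raise ValueError("areas must be positive")
    density_per_m2 = shrimp_in_net / net_area_m2  # shrimp per square metre under the net
    return density_per_m2 * pond_area_m2


# Hypothetical figures, for illustration only.
estimate = estimate_pond_population(shrimp_in_net=30, net_area_m2=5.0, pond_area_m2=1000.0)
print(estimate)  # 6000.0
```

The estimate assumes shrimp are spread evenly over the pond, which is one reason a single cast can be misleading.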

Holding shrimp in water causes stress due to a lack of dissolved oxygen and the fact that they are highly concentrated when packed for weighing, which can take up to two minutes. This increase in stress and potential exoskeleton damage increases mortality and hastens the spread of black spot occurrences.

Related Work

The aspects related to the technological factors are hardware availability, cost, and software modularity. On the other hand, organizational factors such as knowledge sharing and management support are essential to maintain the shrimp counting system.

Finally, resource factors such as the dataset, deep learning skills, and fishery farm availability also contribute. Our study focused on the counting process using a deep learning-based algorithm for underwater fish detection and recognition.

Handcrafted Feature Engineering

Calculating geometrical features is a common inspection approach in the industry or agriculture sector. However, some features are discontinuous, such as an irregular edge or circularity. Another significant concern in handcrafted feature engineering is scale invariance.

Many notable feature engineering techniques address scale invariance by employing a non-linear function; however, such approaches are less robust when dealing with low-contrast and high-contrast images.

In addition, these procedures entail long processing times for model training.

Non-Machine Learning-Based

There are several counting methods that do not rely on machine learning, namely blob analysis, counting by detected pixel area, and shape analysis. The input image is segmented into blobs of moving objects, using background subtraction and shadow elimination.

Various features are extracted and normalized for each blob according to its approximate size in the actual scene. The number of objects in each blob can then be estimated.
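The blob-based idea can be sketched in a few lines. The example below is an illustration under stated assumptions, not the method used in the study: it takes an already-segmented binary mask, finds 4-connected blobs with a breadth-first search, and estimates the objects per blob by dividing the blob area by a hypothetical typical object area (`typical_area`).

```python
from collections import deque


def find_blobs(mask):
    """Return the pixel area of each 4-connected foreground blob in a binary mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    areas = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue = deque([(r, c)])
                area = 0
                while queue:
                    y, x = queue.popleft()
                    area += 1
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                areas.append(area)
    return areas


def count_objects(areas, typical_area):
    """Estimate objects per blob as area / typical area, with at least one per blob."""
    return sum(max(1, round(a / typical_area)) for a in areas)


# Toy mask: one blob of 4 pixels and one of 8 pixels (two touching objects).
mask = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
areas = find_blobs(mask)
print(areas)                                  # [4, 8]
print(count_objects(areas, typical_area=4))   # 3
```

Dividing by a typical area only works when objects are of similar size, which is why overlapping or differently sized shrimp are hard for this family of methods.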

Machine Learning-Based

Machine learning is a family of algorithms that allow software applications to become more accurate in predicting outcomes without being explicitly programmed. The basic premise of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an output while updating outputs as new data become available.

Deep Learning-Based

Deep learning is a subset of machine learning, within artificial intelligence, that mimics the way the human brain processes data and forms decision-making patterns. It involves a network that can learn without supervision from unstructured or unlabeled data.

Deep learning can also be referred to as deep neural learning or a deep neural network.

Mask R-CNN

Mask R-CNN aims to solve the instance segmentation problem, separating individual objects in an image or a video. It has two stages. Given the input image or video, the first stage generates proposals for regions that may contain an object.

The second stage predicts the object class label, refines the bounding box, and generates a pixel-level object mask for each region proposed in the first stage.

Faster R-CNN, on the other hand, is an algorithm used for object detection. Like Mask R-CNN, it consists of two stages. The first stage, known as the region proposal network (RPN), proposes candidate bounding boxes that are likely to contain objects.

In the second stage, after extracting features from each bounding box via Region of Interest Pooling (RoIPool), Faster R-CNN executes subsequent processes involving the classification and regression for each bounding box.

To save time in training the models and to shorten the time spent labeling the dataset, Mask R-CNN is used in this paper to find the best parameters and to detect the total number of shrimps in an image.
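Once an instance segmentation model has produced one detection per object, counting reduces to tallying the detections that pass a confidence threshold. The sketch below assumes a list of per-detection confidence scores; the scores and the 0.5 threshold are hypothetical and are not taken from the study.

```python
def count_detections(scores, min_score=0.5):
    """Count detections whose confidence score is at least `min_score`."""
    return sum(1 for s in scores if s >= min_score)


# Hypothetical confidence scores for five detected instances in one image.
scores = [0.98, 0.91, 0.42, 0.77, 0.12]
count = count_detections(scores)
print(count)  # 3
```

Raising the threshold trades missed shrimps for fewer false detections, so the chosen value affects both precision and recall.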

Building a Dataset

The dataset used for this paper consists of images of groups of shrimps. A total of 120 images were collected.

Underwater cameras obtained some of the data used in the experiment, for marine animals and others. The data are divided into seven categories: fish, shrimp, scallop, crab, lobster, abalone, and sea cucumber.

Each category contains 1,000 to 1,400 images, for a total of 8,455 images, of which 80% were used for training and 20% for testing.
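An 80/20 split of 8,455 items can be sketched as follows. The shuffling, the fixed seed, and the function name are assumptions for illustration; the article does not say how the split was drawn.

```python
import random


def split_dataset(items, train_percent=80, seed=0):
    """Shuffle the items and split them into training and testing subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)            # fixed seed keeps the split reproducible
    n_train = len(items) * train_percent // 100   # integer arithmetic avoids float rounding
    return items[:n_train], items[n_train:]


image_ids = range(8455)   # stand-ins for the 8,455 images
train, test = split_dataset(image_ids)
print(len(train), len(test))  # 6764 1691
```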

Training the Model

We chose ResNet101 combined with a feature pyramid network (FPN) as the Mask R-CNN backbone. The backbone network first extracts feature maps from the input image and then outputs those features to the later stages.

The model training procedure in this paper uses 100 training images with both the default and the improved hyperparameters of the Mask R-CNN model. In the last step, the optimal hyperparameters are selected based on their performance, and the resulting model is used in the next phase, which is the implementation or testing phase.

Evaluation Index

The performance of the model is evaluated based on precision, recall, mean average precision (mAP), accuracy by category, and the value of R².
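Precision and recall follow from the counts of true positives, false positives, and false negatives. The sketch below uses the standard definitions; the example counts are hypothetical and are not results from the study.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical: 90 correct detections, 10 false detections, 30 missed shrimps.
p, r = precision_recall(tp=90, fp=10, fn=30)
print(p, r)  # 0.9 0.75
```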

With 20 images as the validation set, the validation results of the improved method are compared with those of other methods.

Accuracy by category is calculated by comparing the actual number (ground truth) with the number predicted on the training dataset.

The density of the number of shrimps is divided into three categories: less dense, medium dense, and highly dense. The ground-truth count ranges from a minimum of four shrimps to a maximum of 256 shrimps.

For the less dense category, the ground truth is between 1 and 90 shrimps, and the category consists of 82 images. The value of R² compares the actual number of shrimps with the predicted number of shrimps, and it is obtained by fitting a linear regression between the two.
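The R² value of a simple linear regression between actual and predicted counts can be computed as below. This is a minimal sketch using ordinary least squares; the per-image counts are hypothetical and only illustrate the calculation.

```python
def r_squared(actual, predicted):
    """R^2 of the least-squares line that predicts `predicted` from `actual`."""
    n = len(actual)
    mean_x = sum(actual) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in actual)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(actual, predicted))
    syy = sum((y - mean_y) ** 2 for y in predicted)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(actual, predicted))
    return 1 - ss_res / syy


# Hypothetical per-image counts (ground truth vs. model prediction).
actual = [4, 50, 90, 150, 256]
predicted = [4, 49, 88, 140, 225]
r2 = r_squared(actual, predicted)
print(round(r2, 4))
```

An R² close to 1 means the predicted counts follow the actual counts along a straight line; it does not by itself show that the two are equal, so it is read alongside the accuracy figures.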

Experimental Results and Analysis

We studied the performance of the proposed improved Mask R-CNN model and compared it with the existing Mask R-CNN model using the shrimp datasets. The improved Mask R-CNN model shows a significant improvement in precision and recall compared with the default Mask R-CNN model. Notably, the accuracy drops as the density increases.

In the less dense category, with a ground truth of 2,682 shrimps, the proposed model predicts 2,671 shrimps, giving an accuracy rate of 99.59% and an error rate of 0.41%.

In the medium dense category, with a ground truth of 1,715 shrimps, the proposed model predicts 1,679 shrimps, giving 97.90% accuracy and a 2.10% error rate.

Meanwhile, in the highly dense category, with a ground truth of 644 shrimps, the proposed model predicts 564 shrimps, giving 87.58% accuracy and a 12.42% error rate. The analysis therefore indicates that the overall accuracy rate of the proposed model on the training dataset reached 97.48%, or 4,914 shrimps out of 5,041 shrimps.
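The reported rates can be reproduced from the counts given above. In this sketch the accuracy is taken as one minus the relative counting error, and the group with 644 ground-truth shrimps is assumed to be the highly dense category; both are readings of the text, not formulas quoted from the study.

```python
# (ground truth, predicted) totals per density category, as reported above.
counts = {
    "less dense": (2682, 2671),
    "medium dense": (1715, 1679),
    "highly dense": (644, 564),   # assumed label for the third group
}


def accuracy_percent(ground_truth, predicted):
    """Accuracy as 100 * (1 - relative counting error)."""
    return 100 * (1 - abs(ground_truth - predicted) / ground_truth)


per_category = {name: round(accuracy_percent(gt, pred), 2)
                for name, (gt, pred) in counts.items()}
total_gt = sum(gt for gt, _ in counts.values())
total_pred = sum(pred for _, pred in counts.values())
overall = round(accuracy_percent(total_gt, total_pred), 2)

print(per_category)                   # {'less dense': 99.59, 'medium dense': 97.9, 'highly dense': 87.58}
print(total_pred, total_gt, overall)  # 4914 5041 97.48
```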

One of the ideas in counting is to count objects indirectly by estimating a density map. Density maps are created by convolving the annotated object locations with a Gaussian kernel, normalized so that integrating the map yields the number of objects.

The main objective is to train a convolutional network to map an image to a density map, which can then be integrated to obtain the number of object occurrences.
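The ground-truth density map used in this kind of approach can be sketched by placing one normalized Gaussian at each annotated object location. In the example below the grid size, `sigma`, and the point coordinates are hypothetical; each Gaussian is normalized over the image so that it contributes exactly one to the total.

```python
import math


def density_map(points, height, width, sigma=1.5):
    """Build a density map as the sum of one unit-mass Gaussian per annotated point."""
    dmap = [[0.0] * width for _ in range(height)]
    for py, px in points:
        weights = [[math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma ** 2))
                    for x in range(width)] for y in range(height)]
        total = sum(map(sum, weights))   # normalize so each object integrates to 1
        for y in range(height):
            for x in range(width):
                dmap[y][x] += weights[y][x] / total
    return dmap


# Hypothetical annotations: three objects, two of them adjacent.
points = [(3, 4), (10, 12), (10, 13)]
dmap = density_map(points, height=16, width=20)
count = sum(map(sum, dmap))
print(round(count, 6))  # 3.0
```

Because the count comes from summing the map rather than separating instances, adjacent or overlapping objects still contribute one each.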

Linear regression for the improved Mask R-CNN model shows that the regression line fits the data closely, which means the predicted number of shrimps is close to the actual number of shrimps.

This work offers several significant contributions:

  • i. The shrimp images were recorded from the top view with the assumption of equal size due to similar shrimp age kept in the container.
  • ii. The method can automatically estimate the number of shrimps using computer vision and deep learning.
  • iii. The default Mask R-CNN can be tuned to effectively segment and count small shrimps or objects.
  • iv. The shrimp counting accuracy decreases as the shrimp density increases.
  • v. The shrimp estimation efficacy increases in linear proportion as hyperparameters such as maximum detection instance, learning rate, maximum ground truth instance, RPN threshold value, RPN train anchors per image, the number of steps per epoch, train region of interest per image, validation steps, and weight decay are increased.
  • vi. The linear regression shows that R² increases with better precision after performing hyperparameter tuning over the default Mask R-CNN.
  • vii. This application can reduce shrimp death risk compared with manual counting.


After testing and improvement, the proposed method improved the mAP, precision, and recall. The critical parameters that influence this advancement are maximum detection instance, maximum ground truth instance, number of thresholds, train anchors for each image, number of steps for each epoch, number of train regions of interest for each image, number of validation steps, number of epochs, regularization, optimizers, learning rate, batch size, learning momentum, and weight decay.

The training dataset and validation dataset results show that the improved Mask R-CNN model can detect and locate the shrimp with an accuracy of 97.48%, which is more accurate than existing methods.

The current study contributes to our underwater computer vision knowledge by addressing three critical issues:

  • reducing underwater animal death risk compared with manual counting,
  • Mask R-CNN configuration,
  • and highlighting the pitfalls and advantages in terms of efficacy when dealing with different densities of small animals.

This is a summarized version developed by the editorial team of Aquaculture Magazine based on the review article titled “UNDERWATER FISH DETECTION AND COUNTING USING MASK REGIONAL CONVOLUTIONAL NEURAL NETWORK” developed by: Teh Hong Khai – Universiti Kebangsaan Malaysia, Siti Norul Huda Sheikh Abdullah – Universiti Kebangsaan Malaysia, Mohammad Kamrul Hasan – Universiti Kebangsaan Malaysia, and Ahmad Tarmizi – Mahjung Aquabest Hatchery.
The original article was published in January 2022 through the Multidisciplinary Digital Publishing Institute under a Creative Commons open access license.
The full version can be accessed freely online through this link:

