Robotics AI Applications & Their Architectural Support


9th Dec 2019

Tags: Engineering, Artificial Intelligence



A major challenge in robotics is the integration of symbolic task goals with low-level continuous representations. In object grasping and manipulation, the problem becomes formidable: objects have many physical attributes that may constrain the planning of a grasp, and robots have limited sensorimotor capabilities due to their various embodiments. Multiple approaches to this problem take their inspiration from imitation studies in developmental psychology: infants are able to infer the intention of others, and to understand and reproduce the underlying task constraints through their own actions [1]. This goal-directed imitative ability is acquired over multiple stages of a developmental roadmap, both through the infant’s own motor exploration (trial and error) and through the observation of others interacting with the world (imitation learning) [2]. Roboticists follow a similar developmental approach to design architectures for artificial agents [2], [3], [4], [5]. Most of these works, however, focus on the exploratory stage, where robots obtain object affordances through their empirical interaction with the world. The affordances being modeled are measured as the salient changes in the agent’s sensory channels, which are interpreted as effects of specific actions applied to objects [4]. For example, an effect of poking a ball is making it roll. Though it is an important step for a robot to discover this motor ability, another necessary step towards goal-directed behavior is to link this immediate motor act and its effects (poking the ball and letting it roll) to the conceptual goal of an assigned task (providing the ball to a child). Trial-and-error exploration alone is inefficient for such goal learning problems, and human supervision can help.
This motivates an idea that departs from classical developmental studies by incorporating task-specific inputs from a human teacher, so that a system can learn natural, goal-oriented types of grasps more efficiently. We illustrate this idea with the hand-over task shown in Fig. 1. Such a task requires enough free area on the object for another person to grasp it, so the robot should learn that free area is an important constraint for this task. There are numerous similar examples: pouring water from a cup requires the opening of the cup to remain uncovered, and using a knife requires the robot to grasp the handle. We believe these links can be learned efficiently from the input of a human expert. In this work, we develop such a method for learning task goals and task-relevant representations. The learning is performed in a high-dimensional feature space that takes into account different object representations and robot embodiments together with input from a teacher.


Deriving quantified constraints from conceptual task goals presents a challenge similar to integrating high-level reasoning with low-level path planning and control systems in robotics. The main challenges originate from the representational differences between the two research fields. The work in [6] addresses this problem through statistical relational models for a high-level symbolic reasoner, which is integrated into a robot controller. The work in [7] proposes a coherent control, trajectory optimization, and action planning architecture by applying inference-based methods across all levels of representation. In our work, we directly approach the task-oriented grasping problem considering the characteristics of a real robot system. The concept of providing expertise about task semantics through human tutoring has been implemented in [14]. To realize this, we use a widely used probabilistic graphical model, the Bayesian network [15]. This model is used to encode the statistical dependencies between object attributes, grasp actions, and a set of task constraints, and thereby to link the symbolic tasks to quantified constraints. The main contributions of our work are: (i) introducing a semi-automated method for acquiring manually annotated, task-related grasps; (ii) learning probabilistic relationships between a multitude of task-, object- and action-related features with a Bayesian network; (iii) thus acquiring a hand-specific concept of affordance, which maps symbolic representations of task requirements to continuous constraints; (iv) additionally, by using a probabilistic framework, we can easily extend the object and action spaces, and allow flexible learning of novel tasks and adaptation in uncertain environments; (v) finally, our model can be applied to a goal-directed imitation framework, which allows a robot to learn from humans despite differences in their embodiments.


To introduce our approach, we first identify four subsets of features which play major roles in the consideration of a task-oriented grasp: task, object features, action features, and constraint features.


In our notation, a task T ∈ 𝒯 = {T_1, …, T_{n_T}} refers to a ‘basic task’ that involves grasping or manipulation of a single object. According to [17], such a basic task can be called a manipulation segment, which starts and ends with both hands free and the object in a stationary state. These manipulation segments are the building blocks for complex manipulation tasks.


An object feature set O = {O_1, …, O_{n_O}} specifies attribute (e.g. size) and/or categorical (e.g. type) information about an object. The features in O are not necessarily independent. The same attribute, such as shape, can be represented by different variables depending on the capabilities of the perceptual system and the current object knowledge.


An action feature set A = {A_1, …, A_{n_A}} describes the object-centered, static and kinematic grasp features, which may be the direct outputs of a grasp planner. A may include properties such as grasp position, hand approach vector, or grasp configuration.


Finally, a constraint feature set C = {C_1, …, C_{n_C}} specifies a set of constraint functions defined by human experts; these are variables representing functions of both object and action features. As an example in a grasp scenario (as in Fig. 1), one may define the enclosure of the object volume as a constraint feature, which depends on both object features (size and shape) and action features (grasp position and configuration). Thus, constraint features form the basic elements


Given a complementary set of variables X = {T, O, A, C}, our focus is to model the dependencies between their elements using a Bayesian network (BN) [15] (see an example network in Fig. 2). A BN encodes the relations between a set of random variables X = {X_1, X_2, …, X_n}. Each node in the network represents one variable, and the directed arcs represent direct dependencies between variables, so that the arcs that are absent encode conditional independence assumptions. The topology of the directed arcs is referred to as the structure of the network, and it is constrained to be directed and acyclic, meaning there are no cyclic connections between the nodes.
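As a minimal sketch of these two properties (a directed acyclic structure and a joint distribution that factorizes over each node's parents), the following Python example uses a hypothetical three-node network. The node names, states, and probability values are illustrative assumptions and are not the network of Fig. 2.

```python
from itertools import product

# Hypothetical toy network (NOT the network in Fig. 2): task T and object
# size O are root nodes; the constraint C ("free area") depends on both.
parents = {"T": [], "O": [], "C": ["T", "O"]}
states = {
    "T": ["hand-over", "pouring"],
    "O": ["small", "large"],
    "C": ["low", "high"],
}

# Conditional probability tables keyed by (parent values..., own value).
# All numbers are made up for illustration.
cpt = {
    "T": {("hand-over",): 0.5, ("pouring",): 0.5},
    "O": {("small",): 0.4, ("large",): 0.6},
    "C": {
        ("hand-over", "small", "low"): 0.3, ("hand-over", "small", "high"): 0.7,
        ("hand-over", "large", "low"): 0.1, ("hand-over", "large", "high"): 0.9,
        ("pouring", "small", "low"): 0.6, ("pouring", "small", "high"): 0.4,
        ("pouring", "large", "low"): 0.5, ("pouring", "large", "high"): 0.5,
    },
}

def is_acyclic(parents):
    """Return True if the parent map describes a directed acyclic graph."""
    remaining = set(parents)
    while remaining:
        # Nodes whose parents have all been removed already can be removed.
        free = {n for n in remaining if not (set(parents[n]) & remaining)}
        if not free:
            return False  # every remaining node waits on another: a cycle
        remaining -= free
    return True

def joint(assignment):
    """P(X_1, ..., X_n) as the product of P(X_i | parents(X_i))."""
    p = 1.0
    for node, pa in parents.items():
        key = tuple(assignment[q] for q in pa) + (assignment[node],)
        p *= cpt[node][key]
    return p

# The joint probabilities over all assignments sum to one.
total = sum(joint(dict(zip(states, combo))) for combo in product(*states.values()))
```

With these made-up tables, `joint({"T": "pouring", "O": "large", "C": "high"})` multiplies 0.5, 0.6 and 0.5.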

Fig. 2. Experimentally instantiated Bayesian network. The coarse structure of the BN specifies the subset dependencies between T,O,A,C. The fine policy specifies the dependencies between variables within each feature subset. The latent variables for GMM nodes are not shown here.

In this section, we will describe the application of the trained BN for three different experiments. While two of them will mainly provide a view on the evaluation of the technique, the third one will show a setup for robot imitation based on task-constraints. For each experiment, we formulate the corresponding semantic questions to the system.

A. “From where to grasp an object, given a task?”

Formulating this question as P(upos | task, size, conv), our goal is to observe how our three tasks influence the position of a grasp, upos. As representatives for the experimental results, we select a hammer, a bottle, and a mug out of the 25 object models as the test set, and train the Bayesian network using the Schunk hand data stored from the remaining 22 models. Analyzing the results, we make the following observations: (i) the BN is clearly affected by the BADGr planner, which provides many “from where to grasp” hypotheses from the four sides, top, and bottom of an object. (ii) Given a hand-over task, the results do not substantially differ, and all major directions are valid. (iii) Given a pouring task, the network clearly rejects grasping from the top in the cases of the bottle and the mug.

B. “Can you imitate this demonstrated grasping task?”

In the first step, the robot observes a human performing a grasp on an object, and estimates the intention (task) t_H of the human action. P_H(T | O, A, C) encodes the probability of the tasks for the demonstrated object-grasp combination, where P_H means that the BN is specific to the demonstrator’s embodiment. We denote the maximum-likelihood estimate of the task as t̂_H. In the second step, the robot finds the most compatible grasp on the object(s) it perceived, in order to achieve the same task t̂_H. This step can be formulated as a Bayesian decision problem, where a reward function r defines the degree.
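The first step can be illustrated with a small sketch that applies Bayes' rule to a single demonstrated object-grasp combination and picks the most probable task. The prior and likelihood values are hypothetical, and the task labels T1 to T3 are placeholders (only T2, pouring, is named in the text). Note that the sketch selects the task with the highest posterior probability; with a uniform prior this coincides with the maximum-likelihood estimate.

```python
# Hypothetical prior P(T) over three tasks; only "T2" (pouring) is named in
# the text, the other labels are placeholders.
prior = {"T1": 0.4, "T2": 0.3, "T3": 0.3}

# Hypothetical likelihood P(o, a, c | T) of the demonstrated object, grasp,
# and constraint features under each task.
likelihood = {"T1": 0.10, "T2": 0.30, "T3": 0.05}

def task_posterior(prior, likelihood):
    """Return P(T | o, a, c) via Bayes' rule, normalized over all tasks."""
    unnormalized = {t: prior[t] * likelihood[t] for t in prior}
    z = sum(unnormalized.values())
    return {t: v / z for t, v in unnormalized.items()}

posterior = task_posterior(prior, likelihood)

# Estimated intention of the demonstrator: the most probable task.
t_hat = max(posterior, key=posterior.get)
```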

  1. Matching of Tasks: The objective is to plan a grasp to match the same task while the robot is given six objects (see Tab. IV). In step 1, the robot estimates the most likely task of the demonstrated grasp to be pouring, t̂_H = T_2. In the second imitation step, the robot first follows step 2.1 to select the object o* ∈ O that best affords T_2, and then step 2.2 to select the grasp action a* ∈ A that best affords T_2. The results of the second step are illustrated in the two left bar plots in Tab. IV.
  2. Matching of Tasks and Features of Object and Action: In this scenario, the objective is not only to choose the object and action that afford the task, but also to select those that are similar to the object used by the human and the grasp the human applied, i.e. matching their features. This requires adding to the objective function a similarity measure between o and o_H, and between a and a_H. The results of the second step of imitation are illustrated in the two right bar plots in Tab. IV. Note that the feature vectors o and a are both concatenations of multiple variables, such as egpc and upos for a.
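The difference between the two scenarios can be sketched as two scoring rules over candidate objects. Everything in this example is an assumption made for illustration: the candidate names, the task probabilities, the two-dimensional feature vectors, and the choice of a Gaussian-style similarity multiplied into the score. The source does not specify the form of the reward function.

```python
import math

def similarity(x, y):
    """Hypothetical similarity: 1.0 for identical vectors, decaying with distance."""
    return math.exp(-sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical candidates: probability of affording the estimated task and a
# made-up two-dimensional object feature vector.
candidates = {
    "mug": {"p_task": 0.8, "features": (0.10, 0.9)},
    "bottle": {"p_task": 0.9, "features": (0.25, 0.4)},
    "hammer": {"p_task": 0.1, "features": (0.30, 0.2)},
}

# Hypothetical features o_H of the object used by the human demonstrator.
demo_features = (0.12, 0.8)

def best_object(candidates, demo=None):
    """Pick the best candidate; if demo features are given, also match them."""
    def score(name):
        candidate = candidates[name]
        s = candidate["p_task"]  # scenario 1: task affordance only
        if demo is not None:     # scenario 2: also reward feature similarity
            s *= similarity(candidate["features"], demo)
        return s
    return max(candidates, key=score)
```

With these numbers, matching the task alone selects the bottle, while adding the similarity term selects the mug, because its features are closest to the demonstrated object. The same pattern would apply to selecting the grasp action a against a_H.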


Our approach is semi-automated and embodiment-specific. A simulation-based grasp planner generates a set of hand-specific, stable grasp hypotheses on a range of objects. A teacher provides the knowledge of task requirements by labeling each hypothesis with the suitable manipulation task(s). The underlying relations between the conceptual task goals and the continuous object-action features are encoded by the probabilistic dependencies in a Bayesian network. Using this network as a knowledge base, the simulation experiments showed that the robot is able to infer the intended task of a human demonstration, choose the object that affords this task, and select the best grasp action to fulfill the task requirements. Though we implement and test the current framework based on the BADGr grasp planner [12], this task constraint framework can be integrated with any grasp planning system.


This work was supported by EU IST-FP7-IP GRASP, EU IST-FP6-IP-027657 PACO-PLUS, and Swedish Foundation for Strategic Research.


In 2015, about 90% of the robots in use did not use sensors, and such robots remain the standard in industry today [23]. They are placed inside a controlled area to fulfill their functions, as shown in figure 1. This approach has achieved approximately 10% penetration of the manufacturing industry [23]. Robotic systems need to be able to operate in a dynamically changing environment and must be able to obtain enough information on the positions of themselves, obstacles, objects, and interaction partners [26]. The number of sensors being added is increasing. Vision is increasingly incorporated into sensing, and artificial intelligence (AI), especially in the form of neural networks, is being used to let robotic systems learn and improve. Both AI and vision sensing require increased processing power.

Movements are becoming more complex because robotic systems not only need to complete preplanned actions related to their task, but also have to process kinematics planning at higher speed to work in a dynamic environment [27]. Faster kinematics planning in turn needs more computational capability. CPUs were the first choice for processing. GPUs are now the preferred choice for graphics and AI, though this may change as new chips for AI become available [25]. Field Programmable Gate Arrays (FPGAs) are also being used to process video and kinematics in order to speed up movement planning [27] and image processing [28], [29]. Increasingly autonomous robotic systems require a shift in design towards lighter and more energy-efficient platforms, in contrast to the stationary industrial systems more commonly in use. This opens opportunities for combining on-board processing with distributed or cloud computing to achieve high performance while circumventing the space and power consumption restrictions of smaller robotic platforms [26].

Figure 1 [24]


Processing video is a computationally intensive task. One of the more recent advancements has been the use of smart cameras, i.e. cameras with on-board processing power. Normal cameras provide raw data in the form of many images, which have to be processed by off-board computer systems in the robot’s local environment, and bandwidth restrictions prevent streaming at a speed high enough for fast movement [30]. Smart camera systems provide image information instead of being limited to raw image data. This gives the robotic system high-level results of image analysis that can be integrated directly into task planning software. Although the data is somewhat dated, we can get an idea of the possible performance gain from distributed computing from the work by Bistry and Zhang, “A cloud computing approach to complex robot vision tasks using smart camera systems” [31]. They test four different systems, and the results are shown in table 1. The systems use a feature extraction algorithm based on OpenCV and can be used to compare the time required to identify objects by CPU and/or smart cameras. They create a framework which uses the SIFT (Scale-Invariant Feature Transform) algorithm. SIFT discovers significant points in an image and describes them such that the description is invariant to rotation, translation, and scaling. Their framework is unique in that the image processing algorithms are not fixed; they can be replaced at runtime.

System 1 computes the SIFT vectors of the images on the intelligent camera system and distributes the image matching process over eight systems in the network, each a Core 2 Duo at 2.4 GHz. Each system generates feature vectors of the objects to be detected and attempts to match them to the current feature vector. The task has a goal of detecting 100 objects. WiFi usage is kept low by requiring one of the systems to act as the feature vector distributor for the eight-member cloud of computers.

System 2 uses additional hardware on the service robot TASER. Two laptops with Core 2 Duo 2.2 GHz processors are installed on the robot. The smart camera splits the image into two regions and sends one to each laptop, which splits its region again, one part for each of its CPU cores.

System 3 does all the computational processes over the network using the smart camera as a normal Ethernet camera and processing the image data using the laptops removed from system 2.

System 4 does the complete image processing on the control PC of the service robot which has a Pentium 4 2.4 GHz.  The PC also runs several real-time tasks for the robot and has approximately a 60% load.  Similar to system 3, the camera does no preprocessing of image data [31].

Table 1 Image Processing Comparison
System   Extraction (ms)   Matching (ms)   Data Transfer (ms)
1        6781              381             16
2        269               376             62
3        274               383             695
4        4110              10428           50

From the table we see that system 2, where the images are split by the smart camera, has the best extraction time (269 ms), and that its on-board laptops also do the matching the quickest (376 ms). System 1 has the lowest data transfer time (16 ms), and its matching time (381 ms) is nearly as fast. We might obtain the best results overall if we split the image with the smart camera as in system 2, and then did the matching and data transfer with the cloud-type setup of system 1.
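The comparison can be checked with a few lines of arithmetic on the values in table 1. The sketch below sums the three stages per system and evaluates the proposed combination under the simplifying assumption that stage times are independent and additive, which the source does not verify.

```python
# Timings from table 1, in milliseconds.
timings_ms = {
    1: {"extraction": 6781, "matching": 381, "transfer": 16},
    2: {"extraction": 269, "matching": 376, "transfer": 62},
    3: {"extraction": 274, "matching": 383, "transfer": 695},
    4: {"extraction": 4110, "matching": 10428, "transfer": 50},
}

# Total time per system, assuming the three stages run one after another.
totals = {system: sum(stages.values()) for system, stages in timings_ms.items()}

# System with the lowest time for each individual stage.
best_per_stage = {
    stage: min(timings_ms, key=lambda s: timings_ms[s][stage])
    for stage in ("extraction", "matching", "transfer")
}

# Hypothetical combination: extraction as in system 2, matching and data
# transfer as in system 1 (assumes the stage times simply add up).
combined_ms = (
    timings_ms[2]["extraction"]
    + timings_ms[1]["matching"]
    + timings_ms[1]["transfer"]
)
```

Under that additive assumption the combination would take 666 ms, compared with 707 ms for system 2, the fastest of the measured systems.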

On-board vision processing has improved substantially. FPGAs have the advantage of running image processing algorithms on reprogrammable hardware with parallel processing capabilities. Successful work has been done using histogram of oriented gradients (HOG) and average magnitude difference function (AMDF) on Altera’s Cyclone II [29]. Reducing the floating-point Matlab simulation to integer arithmetic lowered the object recognition success rate from 100% to 95.95%, 95.89%, and 95.76% with salt-and-pepper noise introduced at 5%, 10%, and 25% respectively, for an overall result of approximately 95% [29]. The system could classify 250 shapes in leather per second, and its behavior under noise showed the application to be viable for industrial settings. A smart camera has also been implemented using a digital camera and an FPGA-SoC device, with a remote client for observing the visual processing [28]. The architecture is able to extract real-time data and has been shown to work on robotic platforms, with object recognition and facial expression recognition described as future work [28].

Nvidia has also created a new full Linux system the size of a Raspberry Pi with a GPU based on their Pascal architecture, which will aid in computer vision and navigation [32]. The Nvidia Jetson TX2 supplies 1.5 teraflops of processing power and WiFi, and includes 64-bit Denver 2 and A57 CPUs with 8 GB of 128-bit LPDDR4 memory, along with the ability to encode and decode video at 2160p at up to 60 fps [32]. The TX2 represents the trend towards small systems which can be used for robots, drones, and other devices that are increasingly dependent on vision.


Motion planning is one of the most time-consuming tasks for robotic movement. For a robotic arm and gripper, the majority of the time required by the motion-planning algorithm is devoted to figuring out how to get the gripper to where it needs to go without the arm unintentionally running into something else. Collision detection takes 99% of the time required for motion planning [33]. GPUs can produce motion plans on the order of hundreds of milliseconds, but their power consumption is in the hundreds of watts, which is not feasible for untethered robots [33].

A probabilistic road map (PRM) is a graph of points in obstacle-free space, with lines referred to as edges connecting points where direct movement between them does not result in a collision [27]. Researchers at Duke combined aggressive precomputation, which happens when first setting up the robot, with massive parallelism. They generated a large PRM with approximately 150,000 edges representing possible robot motions that avoid self-collisions and collisions with objects which do not change position, such as the floor [27]. To trim down the PRM, they simulated 10,000 scenarios with randomly located obstacles of varying number and size, and then checked which edges in the PRM were used less frequently [27]. Once the PRM is trimmed to a size that allows one edge to be programmed per FPGA circuit, a limit of a few thousand edges, every circuit simultaneously accepts the 3D location of a single pixel in a depth image and outputs a single bit indicating whether its edge collides with that pixel’s location, which quickly yields a collision-free PRM [27]. The robot can then pick the shortest path in the PRM. An example of one complex computation took just over 0.6 ms on the FPGA compared to 2,738 ms on a quad-core Intel Xeon 3.5 GHz processor.
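The idea of removing colliding edges from a precomputed roadmap and then searching the remainder for the shortest path can be sketched in software. This is a minimal 2D illustration with a hypothetical four-node roadmap and circular obstacles, not the Duke design: the edges are checked one after another here, whereas the FPGA checks all of them in parallel against depth-image pixels.

```python
import heapq
import math

# Hypothetical precomputed roadmap: node positions and candidate edges.
nodes = {"start": (0.0, 0.0), "a": (1.0, 1.0), "b": (1.0, -2.0), "goal": (2.0, 0.0)}
edges = [("start", "a"), ("a", "goal"), ("start", "b"), ("b", "goal")]

def edge_hits_obstacle(p, q, center, radius):
    """True if the segment p-q passes within `radius` of `center`."""
    (px, py), (qx, qy), (cx, cy) = p, q, center
    dx, dy = qx - px, qy - py
    # Parameter of the closest point on the segment, clamped to [0, 1].
    t = ((cx - px) * dx + (cy - py) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(cx - (px + t * dx), cy - (py + t * dy)) <= radius

def shortest_path(nodes, edges, obstacles, start, goal):
    """Drop colliding edges, then run Dijkstra's algorithm on what is left."""
    graph = {n: [] for n in nodes}
    for u, v in edges:
        if any(edge_hits_obstacle(nodes[u], nodes[v], c, r) for c, r in obstacles):
            continue  # this edge is in collision for the current scene
        w = math.dist(nodes[u], nodes[v])
        graph[u].append((v, w))
        graph[v].append((u, w))
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt, w in graph[node]:
            if nxt not in visited:
                heapq.heappush(queue, (cost + w, nxt, path + [nxt]))
    return None  # no collision-free path in the roadmap
```

In an empty scene the shorter route through node `a` is returned; placing a circular obstacle `((1.0, 1.0), 0.3)` on that node removes both of its edges and the planner falls back to the route through `b`.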

Artificial Intelligence (AI)

Figure 2 Google TPU [25]

New processors are being created to meet the processing requirements of AI machine learning (ML), which uses deep neural networks (DNN) and deep learning (DL) for everything from voice recognition to self-driving cars [25]. AI will also be used by robotic systems for improving vision and kinematics. Google’s Tensor Processing Unit (TPU), Intel’s Lake Crest, and Knupath’s Hermosa are examples from a few of the vendors intending to provide platforms targeting neural networks. The TPU has an 8-bit matrix multiply unit which optimizes DNN number crunching at a lean 700 MHz and outperforms CPU and GPU processing for DNNs with moderate energy, consuming 40 W of power (Figure 2) [25]. CPUs are usually 64-bit platforms and GPUs have wider word widths, both optimized for larger data items, whereas smaller 8-bit integers have found wide application in many DNN implementations [25]. Lake Crest is the code name for an Intel platform intended to complement the Xeon Phi, which has been used for many AI tasks but is challenged by applications that the Google TPU or Lake Crest can perform more efficiently [25]. Lake Crest uses a multi-chip module (MCM) design and “Flexpoint” architecture with twelve specialized multicore processing nodes, somewhat like the Google TPU’s matrix multiply unit, and has 32 GB of High Bandwidth Memory 2 (HBM2) with an aggregate bandwidth of 8 Tbit/s [25]. The Knupath Hermosa has 256 DSP cores organized in eight clusters of eight cores connected by its Lambda Fabric, which is also designed to create a low-latency, high-throughput mesh to link thousands of Hermosa processors [25] and is shown in figure . The Hermosa includes an integrated L1 router with 32 ports for a bandwidth of 1 Tbit/s and links to the network through sixteen 10 Gbit/s bidirectional ports [25].
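To illustrate why 8-bit integers are sufficient for many DNN computations, the following sketch quantizes a small weight matrix and input vector to signed 8-bit values, multiplies them using integer arithmetic only, and rescales the result. The symmetric scaling scheme, the scale factors, and the example values are assumptions chosen for illustration; this is not a description of how the TPU or any other chip implements quantization.

```python
def quantize(values, scale):
    """Map floats to signed 8-bit integers: round to nearest, clamp to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def int8_matvec(weights, x, w_scale, x_scale):
    """Matrix-vector product with 8-bit operands and integer accumulation."""
    qx = quantize(x, x_scale)
    out = []
    for row in weights:
        qrow = quantize(row, w_scale)
        acc = sum(a * b for a, b in zip(qrow, qx))  # pure integer multiply-add
        out.append(acc * w_scale * x_scale)         # rescale back to float
    return out

# Hypothetical layer weights and input, all within [-1, 1].
weights = [[0.5, -0.25], [1.0, 0.75]]
x = [0.8, -0.4]

approx = int8_matvec(weights, x, w_scale=1 / 127, x_scale=1 / 127)
exact = [sum(w * v for w, v in zip(row, x)) for row in weights]
```

For these values the quantized result stays within about one percent of the floating-point result, while every multiplication operates on 8-bit operands.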

Although CPUs and GPUs do most of the AI work now, there is a lot of promise in the development of new AI chips.  The Nvidia Jetson TX2 board can also be used as an AI accelerator in the tiny Intel Curie model [25].

The new AI chips used as examples show that both on-board and cloud-type systems are moving forward. AI and ML, including DNNs, can help robotic systems in vision (including object recognition and manipulation), in kinematics for movement planning and optimization, and in many other tasks. The architecture of robotic systems which require AI processing needs not only to provide for hardware for on-board and off-board processing, but also to be able to decide when to process locally and when to outsource to the cloud, based on the most efficient combination of speed and power savings for the specific task(s) and platform.
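Such a local-versus-cloud decision could be sketched as a simple cost comparison. The cost model below (latency plus a weighted battery energy term), the option names, and all numbers are hypothetical assumptions; a real system would need measured values for its own task and platform.

```python
from dataclasses import dataclass

@dataclass
class Option:
    name: str
    compute_ms: float   # time spent computing the result
    transfer_ms: float  # time spent moving data over the network
    energy_mj: float    # energy drawn from the robot's battery

def choose(options, ms_per_mj=0.0):
    """Pick the option with the lowest cost: latency plus weighted energy."""
    def cost(option):
        return option.compute_ms + option.transfer_ms + ms_per_mj * option.energy_mj
    return min(options, key=cost)

# Hypothetical numbers: on-board processing is slower and drains the battery,
# cloud processing is fast but pays for the network transfer.
onboard = Option("onboard", compute_ms=400.0, transfer_ms=0.0, energy_mj=2000.0)
cloud_fast_link = Option("cloud", compute_ms=50.0, transfer_ms=300.0, energy_mj=500.0)
cloud_slow_link = Option("cloud", compute_ms=50.0, transfer_ms=700.0, energy_mj=500.0)
```

With a fast link, `choose([onboard, cloud_fast_link])` selects the cloud option; with the slow link it selects on-board processing unless the energy weight `ms_per_mj` is raised enough to favor saving battery.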


[1] A. N. Meltzoff, Elements of a Developmental Theory of Imitation. Cambridge, MA, USA: Cambridge University Press, 2002, pp. 19–41.

[2] R. Rao, A. Shon, and A. Meltzoff, “A Bayesian Model of Imitation in Infants and Robots,” in Imitation and Social Learning in Robots, Humans, and Animals, 2004, pp. 217–247.

[3] D. B. Grimes and R. P. N. Rao, “Learning Actions through Imitation and Exploration: Towards Humanoid Robots that Learn from Humans,” in Creating Brain-Like Intelligence, ser. Lecture Notes in Computer Science, vol. 5436. Springer, 2009, pp. 103–138.

[4] L. Montesano, M. Lopes, A. Bernardino, and J. Santos-Victor, “Learning Object Affordances: From Sensory–Motor Coordination to Imitation,” IEEE Transactions on Robotics, vol. 24, no. 1, pp. 15–26, 2008.

[5] C. Acosta-Calderon and H. Hu, “Robot Imitation: Body Schema and Body Percept,” Applied Bionics and Biomechanics, vol. 2, no. 3-4, pp. 131–148, 2005.

[6] D. Jain, L. Mösenlechner, and M. Beetz, “Equipping Robot Control Programs with First-order Probabilistic Reasoning Capabilities,” in IEEE Int. Conf. on Robotics and Automation, 2009, pp. 3130–3135.

[7] M. Toussaint, N. Plath, T. Lang, and N. Jetchev, “Integrated Motor Control, Planning, Grasping and High-level Reasoning in a Blocks World using Probabilistic Inference,” in IEEE International Conference on Robotics and Automation, 2010, to appear.

[8] C. L. Nehaniv and K. Dautenhahn, Eds., Imitation and Social Learning in Robots, Humans, and Animals: Behavioural, Social and Communicative Dimensions. Cambridge University Press, 2004.

[9] D. Wolpert and M. Kawato, “Multiple Paired Forward and Inverse Models for Motor Control,” Neural Networks, vol. 11, no. 7-8, pp. 1317–1329, October 1998.

[10] Y. Demiris and M. Johnson, “Distributed, Predictive Perception of Actions: A Biologically Inspired Robotics Architecture for Imitation and Learning,” Connection Science, vol. 15, no. 4, pp. 231–243, 2003.

[11] E. Oztop, D. Wolpert, and M. Kawato, “Mental State Inference using Visual Control Parameters,” Cognitive Brain Research, vol. 22, no. 2, pp. 129–151, February 2005.

[12] K. Huebner, S. Ruthotto, and D. Kragic, “Minimum Volume Bounding Box Decomposition for Shape Approximation in Robot Grasping,” in IEEE Int. Conf. on Robotics and Automation, 2008, pp. 1628–1633.

[13] A. Miller and P. Allen, “Graspit! A Versatile Simulator for Robotic Grasping,” Robotics and Automation, vol. 11 (4), pp. 110–122, 2004.

[14] Z. Xue, A. Kasper, M. J. Zoellner, and R. Dillmann, “An Automatic Grasp Planning System for Service Robots,” in 14th International Conference on Advanced Robotics, 2009.

[15] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, September 1988.

[16] A. Rao, B. A. Olshausen, and M. Lewicki, Eds., Probabilistic Models of the Brain: Perception and Neural Function. MA: MIT Press, 2002.

[17] R. Zöllner, M. Pardowitz, S. Knoop, and R. Dillmann, “Towards Cognitive Robots: Building Hierarchical Task Representations of Manipulations from Human Demonstration,” in IEEE International Conference on Robotics and Automation, 2005, pp. 1535–1540.

[18] M. Novotni and R. Klein, “Shape Retrieval using 3D Zernike Descriptors,” Computer-Aided Design, vol. 36 (11), pp. 1047–1062, 2004.

[19] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser, “The Princeton Shape Benchmark,” in International Conference on Shape Modeling and Applications, 2004, pp. 167–178.

[20] M. Ciocarlie, C. Goldfeder, and P. Allen, “Dexterous Grasping via Eigengrasps: A Low-dimensional Approach to a High-complexity Problem,” in RSS 2007 Manipulation Workshop, 2007.

[21] C. Ferrari and J. Canny, “Planning Optimal Grasps,” in IEEE Int. Conference on Robotics and Automation, vol. 3, 1992, pp. 2290–2295.

[22] K. Murphy, “BNT – Bayes Net Toolbox for Matlab,” [URL] http: //, 1997. Last visited July 15, 2010.

[23] (2018). Intelligent Robots: A Feast for the Senses. Available:

[24] “Global Robotics System Integration Market 2017 – Dynamic Automation, Geku Automation, RobotWorx, Midwest Engineered Systems, Van Hoecke Automation – Albanian Times,” Jun. 8, 2017.

[25] W. Wong. (2017) CPUs, GPUs, and now: AI chips. 22. Available:

[26] “A cloud computing approach to complex robot vision tasks using smart camera systems,” ed, 2010, p. 3195.

[27] E. Ackerman, “Motion-Planning Chip Speeds Robots,” IEEE Spectrum, 2018.

[28] “FPGA-based bio-inspired architecture for multi-scale attentional vision,” ed: ECSI, 2016, p. 231.

[29] M. Peker, H. Altun, and F. Karakaya, “Hardware implementation of a scale and rotation invariant object detection algorithm on FPGA for real-time applications,” Turkish Journal of Electrical Engineering & Computer Sciences, vol. 24, no. 5, pp. 4368–4382, Aug. 2016.

[30] E. Ackerman, “Dynamic Vision Sensors Enable High-Speed Maneuvers With Robots.” Available:

[31] H. Bistry and J. Zhang, “A cloud computing approach to complex robot vision tasks using smart camera systems,” ed, 2010, p. 3195.

[32] “Nvidia’s Pascal-powered Jetson TX2 computer blows away Raspberry Pi,” ed, 2018.

[33] “The microarchitecture of a real-time robot motion planning accelerator,” ed: IEEE, 2016, p. 1.
