Robotic Arm Control with Deep Reinforcement Learning
https://github.com/mcdermatt/RL/tree/master/torque_control
In this project I use an Actor-Critic network to control a simulation I built of my latest robotic arm prototype.
Motivation: PID controllers are by far the most common method of motor control for single input/single output (SISO) systems. Unfortunately, PID controllers tend to fail in situations with multiple inputs and outputs, such as a robotic arm where the dynamics of each joint are affected by the other joints in the kinematic chain. Many industrial robotic arms circumvent this issue with non-backdrivable joints, so that forces introduced at each joint are passed through to the base of the robot. Alternatively, PID control can be used at sufficiently slow speeds, where velocity-dependent terms can be neglected. Pictured below is a demo of a PID controller failing to accurately control my robotic arm, even though each joint was conservatively tuned individually.
Compare the desired movement of the arm shown on the laptop screen with the actual movement of the robot. Note how the side-to-side motion closely tracks the reference target, whereas movement in the vertical direction deviates significantly.
Even before the robot pulls itself out of the clamp holding the base to the cart, the end effector is thrown backwards well beyond the vertical position; this happens because the elbow controller cannot account for the forces exerted by the shoulder. The problem calls for a multiple input/multiple output (MIMO) control solution. While an optimal control policy could be derived from the equations of motion (EOMs), as in my previous project on canceling the dynamics of the robot, a far easier approach is to obtain a policy via a neural network.
I chose a Deep Deterministic Policy Gradient (DDPG) implementation, a reinforcement learning technique that uses Actor and Critic networks to solve problems with continuous state and action spaces. The Actor network takes the current and goal states as inputs and outputs torque commands for the motors. The Critic network takes in the resulting state-action pair and outputs a value estimate; the negative of this estimate is used as the loss to train the Actor, while rewards from the environment are used to compute the Critic's loss. At the beginning of training both networks behave essentially randomly, but gradually each network builds off the other until a working policy is formed.
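To make the Actor/Critic interplay concrete, here is a minimal PyTorch sketch of the two networks and a single DDPG update step. The layer sizes, state/action dimensions, and hyperparameters are illustrative assumptions, not the exact settings used in the repository.

```python
# Minimal DDPG sketch (PyTorch). Architecture and hyperparameters are
# illustrative assumptions, not the exact settings from the repo.
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps (current state, goal state) to torque commands."""
    def __init__(self, state_dim, action_dim, max_torque):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),   # bounded output in [-1, 1]
        )
        self.max_torque = max_torque

    def forward(self, state):
        return self.max_torque * self.net(state)     # scale to physical torque

class Critic(nn.Module):
    """Maps a state-action pair to a scalar value estimate."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One gradient step for the Critic and the Actor, plus soft target updates."""
    state, action, reward, next_state, done = batch   # tensors shaped [B, ...]

    # Critic loss: TD error against the slowly-updated target networks
    with torch.no_grad():
        target_q = reward + gamma * (1 - done) * target_critic(
            next_state, target_actor(next_state))
    critic_loss = nn.functional.mse_loss(critic(state, action), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor loss: negative of the Critic's value of the Actor's own action
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak (soft) update of the target networks
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```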
Like most RL methods, DDPG is sample-intensive. Even with an off-policy algorithm such as DDPG, training data must come from a policy reasonably close to the current one in order to produce useful gradients. For that reason, training data must be regenerated continuously while the networks train.
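One way to keep the data close to the current policy is a bounded replay buffer that is continually refilled with fresh rollouts. The sketch below is an assumed structure for illustration; the buffer size, exploration noise, and the environment interface (reset/step returning state, reward, done) are placeholders, not the repository's implementation.

```python
# Bounded replay buffer refreshed with rollouts from the current policy.
# Capacity, noise level, and the env interface are illustrative assumptions.
import random
from collections import deque

import numpy as np
import torch

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # Old transitions fall off the left end as fresh ones arrive,
        # keeping the buffer biased toward recent (near-on-policy) data.
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2, d = map(np.array, zip(*batch))
        to_t = lambda x: torch.as_tensor(x, dtype=torch.float32)
        return to_t(s), to_t(a), to_t(r).unsqueeze(1), to_t(s2), to_t(d).unsqueeze(1)

def collect_rollout(env, actor, buffer, steps=200, noise_std=0.1):
    """Run the current policy plus exploration noise and store the transitions."""
    state = env.reset()
    for _ in range(steps):
        with torch.no_grad():
            action = actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
        action = action + np.random.normal(0.0, noise_std, size=action.shape)
        next_state, reward, done = env.step(action)   # assumed simulator interface
        buffer.push(state, action, reward, next_state, float(done))
        state = env.reset() if done else next_state
```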
To save time, I first verified the network design on a single pendulum system.
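A rough outline of that sanity check, reusing the pieces from the sketches above, might look like the following; the dimensions, torque limit, episode counts, and the `pendulum_env` object are all placeholders standing in for the simulation used in the project.

```python
# Hypothetical training loop for the single-pendulum check.
# `pendulum_env` is a placeholder for the project's pendulum simulation.
import copy
import torch

state_dim, action_dim, max_torque = 3, 1, 2.0    # e.g. (cos q, sin q, dq) -> torque

actor = Actor(state_dim, action_dim, max_torque)
critic = Critic(state_dim, action_dim)
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
buffer = ReplayBuffer()

for episode in range(500):
    collect_rollout(pendulum_env, actor, buffer)   # fresh data from the current policy
    if len(buffer) >= 1024:
        for _ in range(50):
            ddpg_update(actor, critic, target_actor, target_critic,
                        actor_opt, critic_opt, buffer.sample(256))
```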
With a little bit more training the network can successfully track a moving target with minimal overshoot.
At this point I was ready to move on to the more complicated 3DOF system. I updated the EOM solver I wrote for my inertia cancellation project to include the inertia tensors of the latest hardware version of my robot. The friction model was improved to include both viscous (velocity-dependent) and stick-slip (constant, Heaviside-type) terms to better reflect each gearbox's resistance to the onset of motion from a static position. Conservative values for maximum motor torque were used so that torque losses due to overheating motors can be neglected; a sketch of this joint model follows below. With all these factors taken into account, the digital twin of my robot should closely model the dynamics of its real-world counterpart, meaning that a policy that succeeds in simulation should also work on the real robot.
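As a rough illustration of that joint model, the function below combines a viscous term with a constant stick-slip term switched by the sign of the joint velocity, and clamps the commanded torque to a conservative limit. The coefficients are placeholders, not values measured from the actual gearboxes or motors.

```python
import numpy as np

def joint_torque(tau_cmd, dq, b_viscous=0.05, tau_stick=0.3, tau_max=1.5):
    """Net torque at one joint after saturation and friction (placeholder values).

    tau_cmd   commanded motor torque [N*m]
    dq        joint angular velocity [rad/s]
    b_viscous viscous friction coefficient [N*m*s/rad]
    tau_stick constant stick-slip friction torque [N*m]
    tau_max   conservative motor torque limit [N*m]
    """
    tau = float(np.clip(tau_cmd, -tau_max, tau_max))   # conservative motor limit
    if dq == 0.0 and abs(tau) <= tau_stick:
        return 0.0                                     # stiction: joint does not break away
    # Friction opposes motion (or the onset of motion when the joint is at rest)
    friction = b_viscous * dq + tau_stick * np.sign(dq if dq != 0.0 else tau)
    return tau - friction
```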
Further hyperparameter optimization helped improve the performance of the model. The policy shown below was trained for 12 hours on my GeForce GTX 1060.
Next steps: transfer learning to bring this policy to my new hardware.