All participants whose models traveled at least 15 meters in 10 seconds of simulated time were invited to share their solutions in this manuscript. Nine teams agreed to contribute papers. The winning algorithm is published separately (to appear), while the remaining eight are collected here. On this page, we present the solutions of the teams that released videos describing their approaches.
We identified multiple strategies shared across teams.
Authors: {% avatar wjaskowski size=30 %}
{% avatar nnaisense size=30 %}
The PKU team used an Actor-Critic Ensemble (ACE) method to improve the performance of the Deep Deterministic Policy Gradient (DDPG) algorithm. At inference time, their method uses a critic ensemble to select the best action from the proposals of multiple actors running in parallel. With a larger candidate set, the method can avoid actions that have fatal consequences while remaining deterministic (see the sketch after the author list below).
Authors: {% avatar hzwer size=30 %}
{% avatar NewGod size=30 %}
{% avatar liu-jc size=30 %}
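A minimal sketch of the ACE selection step, assuming hypothetical `actors` (callables mapping a state to an action proposal) and `critics` (callables mapping a state-action pair to a Q estimate); it illustrates the idea rather than the PKU team's actual implementation:

```python
import numpy as np

def ace_select_action(state, actors, critics):
    """Pick the proposal whose average critic score is highest.

    `actors` and `critics` are hypothetical callables used for illustration:
    actor(state) -> action, critic(state, action) -> scalar Q estimate.
    """
    proposals = [actor(state) for actor in actors]
    scores = [
        np.mean([critic(state, action) for critic in critics])  # ensemble-averaged Q value
        for action in proposals
    ]
    # Deterministic choice, but the larger candidate set helps avoid
    # actions the critics judge to have fatal consequences.
    return proposals[int(np.argmax(scores))]
```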
The Reason8 team benchmarked state-of-the-art policy-gradient methods and concluded that Deep Deterministic Policy Gradient (DDPG) was the most efficient for this environment. They also applied several improvements to DDPG, such as layer normalization, parameter noise, and action and state reflection. All of these improvements helped stabilize training and improve its sample efficiency (see the sketch after the author list below).
The Reason8.ai team provides two implementations on GitHub: Theano and PyTorch.
Authors: {% avatar fgvbrt size=30 %}
{% avatar Scitator size=30 %}
{% avatar wassname size=30 %}
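As a rough illustration of two of these tweaks, the sketch below shows a PyTorch actor with layer normalization and a helper that applies parameter-space noise by acting with a perturbed copy of the actor. Layer widths and the noise scale are placeholder values, not the team's settings; state/action reflection is omitted because it depends on the observation layout of the environment.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """DDPG actor with layer normalization (illustrative sizes)."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # actions scaled to [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)

def perturbed_copy(actor, sigma=0.05):
    """Parameter-space noise: explore with a noisy clone of the current actor."""
    noisy = copy.deepcopy(actor)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
    return noisy
```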
To improve the training efficiency of DDPG in this computationally demanding physics-based simulation environment, the ICML team designed a parallel architecture with a deep residual network for asynchronous training of DDPG.
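The exact network is not specified here, so the following is only a sketch of the residual-block idea: several such blocks can be stacked in the actor and critic while separate worker processes collect experience asynchronously for the DDPG learner. The width is a placeholder.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One fully connected residual block; stacking several yields a deep residual actor/critic."""
    def __init__(self, width=256):
        super().__init__()
        self.fc1 = nn.Linear(width, width)
        self.fc2 = nn.Linear(width, width)
        self.act = nn.ReLU()

    def forward(self, x):
        h = self.act(self.fc1(x))
        h = self.fc2(h)
        return self.act(x + h)  # skip connection eases training of deeper networks
```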
The deepsense.ai solution was based on the distributed Proximal Policy Optimization (PPO) algorithm combined with a few efficiency-improving techniques. They used frameskip to increase exploration. They changed the rewards to encourage the agent to *bend its knees*, which significantly stabilized the gait and accelerated training. In the final stage, they found it beneficial to transfer skills from small networks (easier to train) to bigger ones (with more expressive power). For this, they developed policy blending, a general cloning/transferring technique.
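The details of policy blending are in the team's write-up; as a simplified, distillation-style sketch of the cloning/transfer step, the larger network can be regressed onto the smaller network's actions on previously visited states. The function name, interfaces, and hyperparameters below are assumptions for illustration, not deepsense.ai's code.

```python
import torch
import torch.nn as nn

def clone_policy(teacher, student, states, epochs=10, lr=1e-3):
    """Fit the bigger `student` actor to the smaller `teacher`'s actions
    on a batch of visited `states` (a float tensor). Illustrative only."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    with torch.no_grad():
        targets = teacher(states)  # teacher's actions become regression targets
    for _ in range(epochs):
        loss = nn.functional.mse_loss(student(states), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```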
Adam Melnik's team trained the final model with PPO on 80 cores in 5 days, using reward shaping together with a normalized observation vector (sketched below).
See also a Medium article by {% avatar AdamStelmaszczyk size=30 %} Adam Stelmaszczyk.
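A minimal sketch of observation normalization in the spirit of that approach, using Welford's running mean/variance; the class name and details are illustrative, not the team's implementation.

```python
import numpy as np

class RunningObsNormalizer:
    """Track running mean/variance of observations and return standardized vectors."""
    def __init__(self, obs_dim, eps=1e-8):
        self.count = eps
        self.mean = np.zeros(obs_dim)
        self.m2 = np.zeros(obs_dim)  # sum of squared deviations (Welford's algorithm)
        self.eps = eps

    def update(self, obs):
        obs = np.asarray(obs, dtype=np.float64)
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (obs - self.mean)

    def normalize(self, obs):
        std = np.sqrt(self.m2 / self.count) + self.eps
        return (np.asarray(obs) - self.mean) / std
```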