All participants whose models traveled at least 15 meters in 10 seconds of simulated time were invited to share their solutions in this manuscript. Nine teams agreed to contribute papers. The winning algorithm is published separately (to appear), while the remaining eight are collected here. On this page, we present the solutions of the teams that released videos describing their approaches.
We identified multiple strategies shared across teams.
Authors: {% avatar wjaskowski size=30 %}
{% avatar nnaisense size=30 %}
The PKU team used an Actor-Critic Ensemble (ACE) method to improve the performance of the Deep Deterministic Policy Gradient (DDPG) algorithm. At inference time, their method uses a critic ensemble to select the best action from the proposals of multiple actors running in parallel. With a larger candidate set, the method can avoid actions that have fatal consequences while remaining deterministic (see the sketch after the author list below).
Authors: {% avatar hzwer size=30 %}
{% avatar NewGod size=30 %}
{% avatar liu-jc size=30 %}
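A minimal sketch of the ACE selection step, assuming hypothetical `actors` (callables mapping a state to an action proposal) and `critics` (callables mapping a state-action pair to a Q estimate); it illustrates the idea rather than the PKU team's actual implementation:

```python
import numpy as np

def ace_select_action(state, actors, critics):
    """Pick the proposal whose average critic score is highest.

    `actors` and `critics` are hypothetical callables used for illustration:
    actor(state) -> action, critic(state, action) -> scalar Q estimate.
    """
    proposals = [actor(state) for actor in actors]
    scores = [
        np.mean([critic(state, action) for critic in critics])  # ensemble-averaged Q value
        for action in proposals
    ]
    # Deterministic choice, but the larger candidate set helps avoid
    # actions the critics judge to have fatal consequences.
    return proposals[int(np.argmax(scores))]
```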
The Reason8 team benchmarked state-of-the-art policy-gradient methods and concluded that Deep Deterministic Policy Gradient (DDPG) was the most efficient for this environment. They also applied several improvements to DDPG, such as layer normalization, parameter noise, and action and state reflection. All of these improvements helped stabilize training and improve its sample efficiency (see the sketch after the author list below).
The Reason8.ai team provides two implementations on GitHub: Theano and PyTorch.
Authors: {% avatar fgvbrt size=30 %}
{% avatar Scitator size=30 %}
{% avatar wassname size=30 %}
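As a rough illustration of two of these tweaks, the sketch below shows a PyTorch actor with layer normalization and a helper that applies parameter-space noise by acting with a perturbed copy of the actor. Layer widths and the noise scale are placeholder values, not the team's settings; state/action reflection is omitted because it depends on the observation layout of the environment.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """DDPG actor with layer normalization (illustrative sizes)."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # actions scaled to [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)

def perturbed_copy(actor, sigma=0.05):
    """Parameter-space noise: explore with a noisy clone of the current actor."""
    noisy = copy.deepcopy(actor)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
    return noisy
```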
To improve the training efficiency of DDPG in this computationally demanding physics-based simulation environment, the ICML team designed a parallel architecture with a deep residual network for asynchronous training of DDPG.
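The exact network is not specified here, so the following is only a sketch of the residual-block idea: several such blocks can be stacked in the actor and critic while separate worker processes collect experience asynchronously for the DDPG learner. The width is a placeholder.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One fully connected residual block; stacking several yields a deep residual actor/critic."""
    def __init__(self, width=256):
        super().__init__()
        self.fc1 = nn.Linear(width, width)
        self.fc2 = nn.Linear(width, width)
        self.act = nn.ReLU()

    def forward(self, x):
        h = self.act(self.fc1(x))
        h = self.fc2(h)
        return self.act(x + h)  # skip connection eases training of deeper networks
```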
The deepsense.ai solution was based on the distributed Proximal Policy Optimization (PPO) algorithm combined with a few efficiency-improving techniques. They used frameskip to increase exploration. They changed the rewards to encourage the agent to *bend its knees*, which significantly stabilized the gait and accelerated training. In the final stage, they found it beneficial to transfer skills from small networks (easier to train) to bigger ones (with more expressive power). For this, they developed policy blending, a general cloning/transferring technique.
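The details of policy blending are in the team's write-up; as a simplified, distillation-style sketch of the cloning/transfer step, the larger network can be regressed onto the smaller network's actions on previously visited states. The function name, interfaces, and hyperparameters below are assumptions for illustration, not deepsense.ai's code.

```python
import torch
import torch.nn as nn

def clone_policy(teacher, student, states, epochs=10, lr=1e-3):
    """Fit the bigger `student` actor to the smaller `teacher`'s actions
    on a batch of visited `states` (a float tensor). Illustrative only."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    with torch.no_grad():
        targets = teacher(states)  # teacher's actions become regression targets
    for _ in range(epochs):
        loss = nn.functional.mse_loss(student(states), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```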
Adam Melnik's team trained the final model with PPO on 80 cores in 5 days, using reward shaping together with a normalized observation vector (sketched below).
See also a Medium article by {% avatar AdamStelmaszczyk size=30 %} Adam Stelmaszczyk.
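A minimal sketch of observation normalization in the spirit of that approach, using Welford's running mean/variance; the class name and details are illustrative, not the team's implementation.

```python
import numpy as np

class RunningObsNormalizer:
    """Track running mean/variance of observations and return standardized vectors."""
    def __init__(self, obs_dim, eps=1e-8):
        self.count = eps
        self.mean = np.zeros(obs_dim)
        self.m2 = np.zeros(obs_dim)  # sum of squared deviations (Welford's algorithm)
        self.eps = eps

    def update(self, obs):
        obs = np.asarray(obs, dtype=np.float64)
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (obs - self.mean)

    def normalize(self, obs):
        std = np.sqrt(self.m2 / self.count) + self.eps
        return (np.asarray(obs) - self.mean) / std
```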