A Summary of Experience from Top Kaggle Competitors

What is your first plan of action when working on a new competition?

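Understand the competition, the data, and the evaluation metric.

Build a cross-validation set.

Make a plan and keep it updated.

Search for similar competitions and related papers.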

What does your iteration cycle look like?

Spend a couple of submissions at the beginning of the contest to gauge how well the different algorithms perform, and save energy for the last 100 meters.

Do the following process for multiple models

Select a model and do a recursive loop with the following steps:
- Transform data (scaling, log(x+1) values, treat missing values, PCA or none)
- Optimize the hyperparameters of the model
- Do feature engineering for that model (as in generating new features)
- Do feature selection for that model (as in reducing them)
- Redo the previous steps, as the optimum parameters are likely to have changed slightly
Save hold-out predictions for later use in meta-modelling
Check that CV scores are consistent with the leaderboard. If they are not, re-assess the cross-validation process and redo the steps
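The loop above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic, the model is a closed-form ridge regression written in NumPy, and the helper names (kfold_indices, fit_ridge, out_of_fold_predictions) are made up for this example. It shows a log(x+1) transform, a K-fold split, and hold-out (out-of-fold) predictions that could be saved for meta-modelling.

```python
import numpy as np


def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for a shuffled K-fold split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i, valid_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, valid_idx


def add_bias(X):
    """Append a column of ones so the model can learn an intercept."""
    return np.hstack([X, np.ones((len(X), 1))])


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; the intercept is not penalised."""
    Xb = add_bias(X)
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[-1, -1] = 0.0
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


def out_of_fold_predictions(X, y, alpha=1.0, n_splits=5, seed=0):
    """Predict every row with a model that never saw that row in training."""
    oof = np.full(len(y), np.nan)
    for train_idx, valid_idx in kfold_indices(len(y), n_splits, seed):
        w = fit_ridge(X[train_idx], y[train_idx], alpha)
        oof[valid_idx] = add_bias(X[valid_idx]) @ w
    return oof


# Synthetic, skewed, non-negative features (an assumption for this sketch).
rng = np.random.default_rng(42)
raw = rng.lognormal(size=(200, 3))
X = np.log1p(raw)  # the log(x+1) transform from the list above
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

oof = out_of_fold_predictions(X, y)
cv_rmse = float(np.sqrt(np.mean((oof - y) ** 2)))
print(f"CV RMSE: {cv_rmse:.3f}")
# The oof array could now be written to disk (for example with np.save)
# and reused later as an input feature for a meta-model.
```

Because every row is predicted by a model that did not train on it, the resulting CV score is the number to compare against the leaderboard.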

Create partnerships. Ideally, look for people who are likely to have taken different approaches from yours. Historically, in contrast, I looked for friends: people I could learn from and have fun with, not so much people to win with.

Find a good way to ensemble

What does your iteration cycle look like?

It depends on the competition and I usually go through a few stages.

At the beginning I focus on data exploration and try some basic approaches so I iterate pretty quickly.

Once the obvious ideas are exhausted I usually slow down and do some research into the domain -- reading papers, forum posts, etc. When I get an idea, I implement it and submit it to the public leaderboard (LB).

My iteration cycle usually is short -- I rarely work on feature engineering that requires more than a few hours of coding for a particular feature.

My personal experience is that very complicated features usually do not work well -- possibly because of my buggy code.

What does your iteration cycle look like?

Read the overview and data description of the competition carefully

Find similar Kaggle competitions. As a relative newcomer, I have collected all Kaggle competitions and done a basic analysis of them.

Read solutions of similar competitions.

Read papers to make sure I don’t miss any progress in the field.

Analyze the data and build a stable CV.

Data pre-processing, feature engineering, model training.

Result analysis such as prediction distribution, error analysis, hard examples.

Refine the models or design a new one based on the analysis.

Based on the data analysis and result analysis, design models that add diversity or handle the hard examples.

Ensemble.

Return to a former step if necessary.
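The result-analysis step in the list above (prediction distribution, error analysis, hard examples) might look like the following minimal sketch. The function name error_report and the synthetic data are assumptions made for illustration. In practice y_pred would be the out-of-fold predictions of a real model.

```python
import numpy as np


def error_report(y_true, y_pred, top_k=5):
    """Summarise residuals and list the rows with the largest absolute error."""
    residuals = y_pred - y_true
    abs_err = np.abs(residuals)
    hardest_idx = np.argsort(abs_err)[::-1][:top_k]  # largest errors first
    quantiles = [0.05, 0.5, 0.95]
    return {
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mean_bias": float(residuals.mean()),
        # Compare the spread of predictions with the spread of targets.
        "pred_quantiles": np.quantile(y_pred, quantiles),
        "true_quantiles": np.quantile(y_true, quantiles),
        "hardest_idx": hardest_idx,
    }


# Synthetic example: mostly small errors plus three deliberately bad rows.
rng = np.random.default_rng(0)
y_true = rng.normal(size=200)
y_pred = y_true + rng.normal(scale=0.3, size=200)
bad_rows = [3, 50, 120]
y_pred[bad_rows] = y_true[bad_rows] + np.array([10.0, -8.0, 6.0])

report = error_report(y_true, y_pred, top_k=3)
print("RMSE:", round(report["rmse"], 3))
print("Hardest rows:", report["hardest_idx"].tolist())
```

Looking at the rows behind hardest_idx is often where ideas for new features or a complementary model come from.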

What does your iteration cycle look like?

I always prepare the dataset and do as much feature engineering as I can; then I choose a training algorithm and optimize its hyperparameters based on a cross-validation score. If a model is good and stable, I save its training-set and test-set predictions. Then I start all over again using another training algorithm or model. When I have a handful of good model predictions, I start ensembling at the second level of training.
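A minimal sketch of second-level training (stacking) follows, under several assumptions: the data is synthetic, both first-level models are ridge regressions that differ only in their feature view (raw versus squared features), and the helper names are invented for this example. The saved training-set predictions are taken to be out-of-fold predictions, so the second-level model never sees predictions made on rows the base model trained on.

```python
import numpy as np


def ridge_fit_predict(X_train, y_train, X_pred, alpha=1.0):
    """Fit closed-form ridge regression on the training data, predict X_pred."""
    def add_bias(A):
        return np.hstack([A, np.ones((len(A), 1))])
    Xb = add_bias(X_train)
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[-1, -1] = 0.0  # leave the intercept unpenalised
    w = np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y_train)
    return add_bias(X_pred) @ w


def level_one_predictions(views_train, y_train, views_test, n_splits=5, seed=0):
    """Return out-of-fold train predictions and test predictions per model."""
    n = len(y_train)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_splits)
    oof = np.zeros((n, len(views_train)))
    test_preds = np.zeros((len(views_test[0]), len(views_train)))
    for m, (V_train, V_test) in enumerate(zip(views_train, views_test)):
        for valid_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), valid_idx)
            oof[valid_idx, m] = ridge_fit_predict(
                V_train[train_idx], y_train[train_idx], V_train[valid_idx])
        # Refit on the full training set to predict the test set.
        test_preds[:, m] = ridge_fit_predict(V_train, y_train, V_test)
    return oof, test_preds


def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))


# Synthetic target with one linear and one quadratic effect.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=400)
X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

oof, test_preds = level_one_predictions(
    [X_train, X_train ** 2], y_train, [X_test, X_test ** 2])

# Second level: a model trained on the first-level predictions.
stacked = ridge_fit_predict(oof, y_train, test_preds, alpha=1e-3)
base_rmse = [rmse(test_preds[:, m], y_test) for m in range(oof.shape[1])]
stacked_rmse = rmse(stacked, y_test)
print("base:", base_rmse, "stacked:", stacked_rmse)
```

Each base model captures only part of the signal here, which is why combining them helps. With real data the gain depends on how different the base models are.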

What does your iteration cycle look like?

Understand the dataset. At least enough to build a consistent validation set.

Build a consistent validation set and test its relationship with the leaderboard score.

Build a very simple model.

Look for approaches used in similar competitions in the past.

Start feature engineering, step by step to create a strong model.

Think about ensembling, be it by creating alternate versions of the feature set or by using different modeling techniques (xgb, rf, linear regression, neural nets, factorization machines, etc.).

What are your favorite machine learning algorithms?

Ridge regression, ResNet-50, GBT, XGB

What is your approach to hyper-tuning parameters?

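Use grid search.

Base it on the cross-validation set.

Look at the settings used for similar problems in similar competitions and related papers.

Rely on understanding of, and experience with, the data and the algorithms.

Compare the output distribution, the affected samples, and so on before and after tuning.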
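As a rough illustration of grid search driven by a cross-validation score, the sketch below tunes two hyperparameters of a polynomial ridge regression. The model, the grid values, the synthetic data, and the helper names poly_features and cv_rmse are all assumptions chosen to keep the example self-contained. The same folds are reused for every grid point so that the scores are comparable.

```python
import itertools
import numpy as np


def poly_features(x, degree):
    """Stack x, x^2, ..., x^degree as columns, plus a bias column."""
    cols = [x ** d for d in range(1, degree + 1)] + [np.ones_like(x)]
    return np.column_stack(cols)


def cv_rmse(x, y, alpha, degree, folds):
    """Mean validation RMSE of ridge regression over the given folds."""
    X = poly_features(x, degree)
    penalty = alpha * np.eye(X.shape[1])
    penalty[-1, -1] = 0.0  # leave the bias term unpenalised
    scores = []
    for valid_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), valid_idx)
        A = X[train_idx]
        w = np.linalg.solve(A.T @ A + penalty, A.T @ y[train_idx])
        err = X[valid_idx] @ w - y[valid_idx]
        scores.append(np.sqrt(np.mean(err ** 2)))
    return float(np.mean(scores))


# Synthetic one-dimensional regression problem with a cubic signal.
rng = np.random.default_rng(7)
x = rng.uniform(-2.0, 2.0, size=200)
y = x ** 3 - x + rng.normal(scale=0.2, size=200)

# Fix the folds once so every grid point is scored on the same splits.
folds = np.array_split(rng.permutation(len(y)), 5)

grid = {"alpha": [0.01, 1.0, 100.0], "degree": [1, 2, 3, 5]}
results = []
for alpha, degree in itertools.product(grid["alpha"], grid["degree"]):
    score = cv_rmse(x, y, alpha, degree, folds)
    results.append({"alpha": alpha, "degree": degree, "rmse": score})

best = min(results, key=lambda r: r["rmse"])
print("best setting:", best)
```

Keeping the full results list, rather than only the best entry, makes it easier to see how sensitive the score is to each hyperparameter.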

In a few words, what wins competitions?

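A good validation set, good models and features, model ensembling, learning from other competitions and papers, and sticking to the plan.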