Kaggle Titanic


The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

From the problem statement, we can tell this is a binary classification task.

My first thoughts were an SVM and the perceptron.

Given the data provided, a Decision Tree or Random Forest would probably be the more reasonable choice.

Still, I wanted to try Logistic Regression here (sigmoid + cross entropy).
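As a reminder of what "sigmoid + cross entropy" amounts to, here is a minimal NumPy sketch of logistic regression trained by gradient descent. It illustrates the idea only and is not the exact code behind this submission:

```python
import numpy as np

def sigmoid(z):
    """Squash raw scores into probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_pred):
    """Average negative log-likelihood for 0/1 labels."""
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def gradient_step(w, b, X, y, lr=0.1):
    """One gradient-descent update of the weights and bias."""
    p = sigmoid(X @ w + b)            # predicted survival probability per passenger
    grad_w = X.T @ (p - y) / len(y)   # gradient of the cross entropy w.r.t. w
    grad_b = np.mean(p - y)
    return w - lr * grad_w, b - lr * grad_b
```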

The idea is to convert every field of the training data into numbers between 0 and 1 and leave the rest to the NN.
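A rough sketch of that kind of preprocessing might look like the following. The columns kept, the way blanks are filled, and the scaling constants are illustrative assumptions, not necessarily the exact choices behind this submission:

```python
import pandas as pd

def to_unit_interval(df: pd.DataFrame) -> pd.DataFrame:
    """Map a handful of Titanic columns into the range [0, 1]."""
    out = pd.DataFrame()
    out["Sex"] = (df["Sex"] == "female").astype(float)   # female -> 1, male -> 0
    out["Pclass"] = (df["Pclass"] - 1) / 2.0             # classes 1, 2, 3 -> 0, 0.5, 1
    out["Age"] = df["Age"].fillna(0) / 80.0              # blanks as 0, scaled by a rough max age
    out["Fare"] = df["Fare"].fillna(0) / 513.0           # scaled by a rough max fare
    out["SibSp"] = df["SibSp"] / 8.0
    out["Parch"] = df["Parch"] / 9.0
    return out

train = pd.read_csv("train.csv")
X = to_unit_interval(train).values
y = train["Survived"].values
```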

Because the activation function of the last layer is a sigmoid, I clamp its output to a minimum of 0.00001 and a maximum of 0.99999 when computing the cross entropy, so that saturated outputs do not blow up the log term and training does not run into NaN.
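Concretely, the clamp is just a clip on the predicted probabilities before the log terms are evaluated (shown here in NumPy; the same trick works if the loss is written in a deep learning framework):

```python
import numpy as np

EPS = 1e-5  # the 0.00001 / 0.99999 bounds mentioned above

def safe_cross_entropy(y_true, y_pred):
    """Cross entropy with predictions clipped away from 0 and 1 so log() stays finite."""
    y_pred = np.clip(y_pred, EPS, 1.0 - EPS)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```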

Results

Kaggle : 0.76555

The score only reached this level.

There are a few points worth reviewing:

  1. Overfitting?? Training accuracy reaches about 90%, but this is the best the test score gets. Besides overfitting, the data itself is a concern: I deliberately dropped some fields for training, and the fields I kept may be exactly the ones that are missing in the test data.
  2. How to handle overfitting: dropout may not work better than regularization here; this needs tuning.
  3. How missing values were filled: many blanks were filled with zero or the mean, so some hidden correlations may have been ignored (a simple alternative is sketched after this list).
  4. Features: this is the most likely problem. With a similar setup, XGBoost did not do much better, so I should try other feature representations (a quick baseline along these lines is also sketched below).
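For points 3 and 4, a quick baseline with median/mode imputation, one simple engineered feature, and XGBoost might look like the sketch below. The feature set and hyperparameters are guesses for illustration, not tuned values:

```python
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")

# Median / mode imputation instead of zeros, plus a simple FamilySize feature.
X = pd.DataFrame({
    "Sex": (train["Sex"] == "female").astype(int),
    "Pclass": train["Pclass"],
    "Age": train["Age"].fillna(train["Age"].median()),
    "Fare": train["Fare"].fillna(train["Fare"].median()),
    "FamilySize": train["SibSp"] + train["Parch"] + 1,
    "Embarked": train["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2}),
})
y = train["Survived"]

model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```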

 

I originally planned to do a serious pass with Random Forest and XGBoost.

But on reflection the real problem is probably the features; people who also used deep learning have surely reached much higher scores.

For now I am moving on to the next challenge; hopefully I will come back with new ideas.

My GitHub


