Data Science Trends on Kaggle
    DeepLearning | 2018. 9. 17. 10:37

    A number of trends have changed over the years in the field of data science. Kaggle is the largest and most popular data science community across the globe. In this kernel, I use the Meta Kaggle data to explore data science trends over the years.
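
    All of the charts below are built by counting, per year, the Kaggle discussions (forum messages, kernel comments, and replies) that mention a given keyword. Here is a minimal sketch of that counting step, assuming the public Meta Kaggle files and the "Message" / "PostDate" column names (treat these as assumptions, since the dataset's schema changes over time):

    import pandas as pd

    # Load the forum messages from the Meta Kaggle dataset (assumed schema).
    messages = pd.read_csv("ForumMessages.csv", usecols=["Message", "PostDate"])
    messages["year"] = pd.to_datetime(messages["PostDate"]).dt.year

    def yearly_mentions(keyword):
        """Count messages per year whose text mentions the given keyword."""
        mask = messages["Message"].str.contains(keyword, case=False, na=False)
        return messages.loc[mask].groupby("year").size()

    # Build a per-year trend table for any set of keywords.
    trends = pd.DataFrame({
        "linear regression": yearly_mentions("linear regression"),
        "logistic regression": yearly_mentions("logistic regression"),
    }).fillna(0)
    print(trends)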

    1. Linear vs Logistic Regression

    Let's look at a comparison of linear regression and logistic regression discussions in forums, kernels, and replies on Kaggle.

    [Figure: Kaggle Discussions: Linear vs Logistic — number of Kaggle discussions per year (2010-2018) mentioning linear regression vs logistic regression]
    • From the above graph, we can observe that there have always been more discussions related to logistic regression than linear regression. The general trend is that the number of discussions is increasing every month.
    • One indication is that there are more classification problems than regression problems on Kaggle, including the most popular Titanic Survival Prediction competition. This competition has the most discussions and is one of the longest-running competitions on Kaggle. There is a regression competition as well, House Prices: Advanced Regression Techniques, but people usually attempt it only after Titanic.
    • The number of logistic regression discussions in forums, kernel comments, and replies spiked in October 2017 and March 2018. One reason is the Toxic Comment Classification Competition, in which a number of authors shared excellent information related to classification models, including logistic regression; a minimal example of such a baseline is sketched below.
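
    As a rough illustration of those shared classification baselines, here is a minimal TF-IDF plus logistic regression sketch in scikit-learn; the toy texts and labels are invented stand-ins for the competition's comment data:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy comments and labels (1 = toxic) standing in for the real data.
    texts = ["great kernel, thanks for sharing", "this is rubbish",
             "love this clean approach", "what a useless, terrible post"]
    labels = [0, 1, 0, 1]

    # Word/bigram TF-IDF features feeding a logistic regression classifier.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression())
    model.fit(texts, labels)
    print(model.predict(["thanks, this is great"]))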

    2. The dominance of xgboost

    [Figure: Kaggle Discussions: Tree based models — number of Kaggle discussions per year (2010-2018) mentioning Decision Tree, Random Forest, Xgboost, Lightgbm, and Catboost]
    • Before 2014, linear models, decision trees, and random forests were very popular. But when xgboost was open-sourced in 2014, it gained popularity quickly and came to dominate Kaggle competitions and kernels. Today, xgboost is still used extensively in competitions and is part of many winning models; one example is the Otto Group Product Classification Challenge, whose first-place solution made use of xgboost.
    • However, with the arrival of lightgbm in 2016, the usage of xgboost dipped to some extent and the popularity of lightgbm started rising very quickly. Based on lightgbm's recent upward trend, one can forecast that it will dominate the next few years as well, unless another, better model is open-sourced. For example, lightgbm was used in the winning solution of Porto Seguro's Safe Driver Prediction. One reason for lightgbm's popularity is its faster training and simpler interface compared to xgboost; the sketch after this list shows how similar the two APIs are in practice.
    • More recently, Catboost was released and is starting to gain popularity.
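
    A small sketch of how interchangeable the two libraries' scikit-learn-style interfaces are (synthetic data and default-ish parameters, purely for illustration; both xgboost and lightgbm must be installed):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier
    from lightgbm import LGBMClassifier

    # Synthetic binary classification data for the comparison.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Swapping one gradient-boosting library for the other is a one-line change.
    for Model in (XGBClassifier, LGBMClassifier):
        clf = Model(n_estimators=200, learning_rate=0.1)
        clf.fit(X_tr, y_tr)
        print(Model.__name__, clf.score(X_te, y_te))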

    3. Neural Networks vs Deep Learning

    [Figure: Kaggle Discussions: Neural Networks vs Deep Learning — number of Kaggle discussions per year (2010-2018) mentioning neural networks vs deep learning]
    • Neural networks have been present in the industry for decades, but in recent years the trend changed because of access to much larger data and computational power.
    • The era of deep learning took off with the arrival of libraries such as theano, tensorflow in 2015, and keras in 2016. The number of discussions related to deep learning is increasing steadily and is always higher than for neural networks. Also, many cloud providers such as Amazon AWS and Google Cloud showcase their capabilities for training very deep neural networks in the cloud.
    • Deep learning models also became popular because of a number of image classification competitions on Kaggle, such as the Data Science Bowl and competitions from Google, as well as text classification problems, for example Quora Duplicate Questions Classification.
    • Deep learning is also becoming more popular every month because different variants of models, such as RNNs and CNNs, have shown great improvements in kernels. Transfer learning and pre-trained models have also shown great results in competitions; a minimal transfer-learning sketch follows this list.
    • Kaggle could launch more competitions and playgrounds related to image classification modelling, as people want to learn a lot from them. Not to forget that Kaggle has added GPU support in kernels, which facilitates deep learning usage on Kaggle.
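
    Here is a minimal transfer-learning sketch of the kind popularised in these image competitions: a pre-trained VGG16 base with a small classification head (the two-class head and the input size are illustrative assumptions):

    from tensorflow.keras.applications import VGG16
    from tensorflow.keras import layers, models

    # Pre-trained convolutional base; freeze it and train only the new head.
    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False

    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation="relu"),
        layers.Dense(2, activation="softmax"),  # e.g. a two-class image problem
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()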

    4. ML Tools used on Kaggle

    [Figure: Kaggle Discussions: ML Tools — number of Kaggle discussions per year (2010-2018) mentioning Scikit, Tensorflow, Keras, and Pytorch]
    • Scikit-learn was once the only library used on Kaggle for machine learning tasks, but since 2015 tensorflow has gained popularity.
    • Among the ML tools, keras is the most popular because of its simple deep learning interface; the sketch below shows how few lines a working model takes.
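
    To make the "simple interface" point concrete, here is a complete keras model in a handful of lines; the random data is purely illustrative:

    import numpy as np
    from tensorflow.keras import layers, models

    # Random tabular data standing in for a real binary classification task.
    X = np.random.rand(500, 20)
    y = np.random.randint(0, 2, size=500)

    # A full deep model: define, compile, fit.
    model = models.Sequential([
        layers.Dense(64, activation="relu", input_shape=(20,)),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(X, y, epochs=3, batch_size=32, verbose=0)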

    5. XgBoost vs Keras

    [Figure: Kaggle Discussions: Xgboost vs Deep Learning — number of Kaggle discussions per year (2010-2018) mentioning xgboost vs keras]
    [Figure: Kaggle Discussions: CNN and LSTM — number of Kaggle discussions per year (2010-2018) mentioning CNN vs LSTM]
    • Between the two most popular techniques on Kaggle, xgboost and deep learning, xgboost has remained on top because it is faster and requires less computational infrastructure than very complex and deep neural networks; the timing sketch below illustrates the comparison.
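
    A rough, machine-dependent way to sanity-check the speed claim on tabular data; the model sizes and epoch counts here are arbitrary choices for illustration only:

    import time
    from sklearn.datasets import make_classification
    from xgboost import XGBClassifier
    from tensorflow.keras import layers, models

    X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

    # Time a default-ish xgboost model.
    start = time.time()
    XGBClassifier(n_estimators=100).fit(X, y)
    print("xgboost: %.1fs" % (time.time() - start))

    # Time a small fully connected network on the same data.
    start = time.time()
    net = models.Sequential([
        layers.Dense(128, activation="relu", input_shape=(30,)),
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    net.compile(optimizer="adam", loss="binary_crossentropy")
    net.fit(X, y, epochs=20, batch_size=64, verbose=0)
    print("keras:   %.1fs" % (time.time() - start))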

    6. What are Kagglers using for Data Visualization?

    [Figure: Kaggle Discussions: Python Data Visualization Libraries — number of Kaggle discussions per year (2010-2018) mentioning Matplotlib, Seaborn, and Plotly]
    [Figure: Kaggle Discussions: R Data Visualization Libraries — number of Kaggle discussions per year (2010-2018) mentioning Ggplot, Highchart, and Leaflet]
    • Plotly has gained a great deal of popularity since 2017 and is one of the most-used data visualization libraries in kernels; a short plotly sketch follows. The second best is seaborn, which is used extensively as well. Some of the high-quality visualization kernels by Kaggle grandmasters such as SRK and Anisotropic are created with plotly. Personally, I am a big fan of plotly as well. :P
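
    A sketch of how the trend charts in this post can be drawn with plotly; the counts below are made-up placeholders for the per-year keyword counts built in the first snippet:

    import pandas as pd
    import plotly.graph_objs as go
    from plotly.offline import plot

    # Illustrative counts only; real values come from the Meta Kaggle counting step.
    trends = pd.DataFrame(
        {"matplotlib": [30, 35, 42], "seaborn": [10, 22, 38], "plotly": [5, 18, 45]},
        index=[2016, 2017, 2018],
    )

    traces = [go.Scatter(x=trends.index, y=trends[col], mode="lines", name=col)
              for col in trends.columns]
    layout = go.Layout(title="Kaggle Discussions: Python Visualization Libraries",
                       yaxis=dict(title="Number of Kaggle Discussions"))
    plot(go.Figure(data=traces, layout=layout))  # writes and opens an HTML file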

    7. Important Data Science Techniques

    [Figure: Kaggle Discussions: Important Data Science Techniques — number of Kaggle discussions per year (2010-2018) mentioning Exploration, Feature Engineering, Parameter Tuning, and Ensembling]
    • Among the important data science steps, kagglers focus a lot on model ensembling, since many winning solutions in Kaggle competitions are ensemble models - blends and stacked models. In almost every regression or classification competition one can notice ensembling kernels. Just as an example, in the Toxic Comment Classification Competition a massive number of ensembling kernels were shared; a minimal blending sketch follows this list.
    • Data exploration is another important technique, and people have started stressing the importance of exploration in EDA kernels.
    • It is surprising that discussions related to feature engineering and model tuning are fewer than those about ensembling, since these two tasks matter most for building the best and most accurate models. People tend to forget that ensembling is only the last stage of any modelling process, and a considerable amount of time should be given to feature engineering and model tuning.
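
    The simplest form of the ensembles discussed above is a blend: average the predicted probabilities of two diverse models. A minimal sketch on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Two diverse base models trained on the same split.
    rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    # Blend = simple average of the two models' predicted probabilities.
    blend = (rf.predict_proba(X_te)[:, 1] + lr.predict_proba(X_te)[:, 1]) / 2
    print("rf   :", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
    print("lr   :", roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1]))
    print("blend:", roc_auc_score(y_te, blend))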

    8. Kaggle Components: What People Talk About the Most

    [Figure: What is hottest on Kaggle — number of Kaggle discussions per year (2010-2018) mentioning Datasets, Kernels, Competitions, and Learn]
    • The Kaggle community has shared a large number of competition-related discussions in forums, and their number is increasing in general.
    • With the launch of kernels in 2016, their usage increased to a great extent. At first, kagglers shared kernels only in competitions, but with more focus on Kaggle Datasets and kernel awards, the number of discussions related to kernels started rising and has surpassed the discussions related to competitions. Also, a number of Data Science for Good challenges and kernels-only competitions have been launched on Kaggle, which is another reason for the popularity of kernels.
    • Kaggle also launched the awesome Kaggle Learn section, which is becoming more and more popular but is still behind competitions, kernels, and discussions. This is because its primary audience is novices and beginners, but surely in the coming years, with the addition of more courses, the Kaggle Learn section will reach similar levels as competitions and kernels.

    [Source] https://www.kaggle.com/shivamb/data-science-trends-on-kaggle

