This time, I apply several sampling techniques to the same data set and look at how they differ.
INTRODUCTION
With class-imbalanced data, the accuracy paradox arises: a model can have a high correct-classification rate and still be meaningless. Several sampling techniques are used to address this. So which of them works well, and how do their characteristics differ? This post looks at both questions through a comparison.
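The accuracy paradox can be made concrete with a few lines of arithmetic. The Python sketch below uses made-up class counts (they are not taken from the data used in this post) and a "model" that always predicts the majority class.

```python
# Hypothetical counts chosen for illustration; they are not from the post's data.
n_negative = 9_980   # majority class (legitimate transactions)
n_positive = 20      # minority class (fraud)

# A "model" that always predicts the majority class.
true_negative = n_negative
false_negative = n_positive
true_positive = 0
false_positive = 0

accuracy = (true_positive + true_negative) / (n_negative + n_positive)
sensitivity = true_positive / (true_positive + false_negative)

print(f"accuracy    = {accuracy:.3f}")     # 0.998
print(f"sensitivity = {sensitivity:.3f}")  # 0.000
```

The correct-classification rate looks excellent, yet not a single minority case is detected.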
The six techniques are:
1. Under-Sampling
2. Over-Sampling (simple random sampling)
3. Under&Over-Sampling
4. Over-Sampling (sampling with replacement)
5. Bootstrapping
6. SMOTE - Macro
Earlier posts already introduced these techniques and walked through hands-on examples, so I will not go into detail here and will keep the introductions brief.
The data is the "Credit Card Fraud Detection" data set published openly on Kaggle, which was created for credit card fraud detection. As in the earlier posts, the data originally contained a variety of variables, but for privacy reasons they have been reduced in dimension through principal component analysis, and that reduced data is what I used. As a result, it is not possible to interpret what each variable means.
METHODOLOGY
1. Under-Sampling

Figure 1: Under-Sampling, Over-Sampling Node
[Figure 1] shows the Under-Sampling and Over-Sampling process. Both were run with the Sample node in EM, and before sampling the data was preprocessed with the Impute and Transform Variables nodes. Under-Sampling produced a data set of 686 observations and Over-Sampling one of 571 observations. I also used the Data Partition node to split the data into train and validation sets, but because it could not be connected directly to the preceding steps, the validation data set was brought in through a SAS Code node.
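For readers without EM at hand, random under-sampling can be sketched in plain Python. This is not the Sample node itself, only an illustration of the idea; the label column name `Class`, the helper name `under_sample`, and the toy rows are assumptions.

```python
import random

def under_sample(rows, label_key="Class", minority=1, seed=0):
    """Keep every minority row and an equal-sized random subset of majority rows."""
    rng = random.Random(seed)
    minority_rows = [r for r in rows if r[label_key] == minority]
    majority_rows = [r for r in rows if r[label_key] != minority]
    # Sampling without replacement: no majority row is picked twice.
    kept_majority = rng.sample(majority_rows, k=len(minority_rows))
    balanced = minority_rows + kept_majority
    rng.shuffle(balanced)
    return balanced

# Toy data: 10 minority rows among 1,000 (made up, not the Kaggle data).
toy = [{"id": i, "Class": 1 if i < 10 else 0} for i in range(1000)]
balanced = under_sample(toy)
print(len(balanced), sum(r["Class"] for r in balanced))  # 20 10
```

The price of the balanced result is that 980 of the 990 majority rows are discarded, which is the information loss that under-sampling is known for.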
2. Over-Sampling (sampling with replacement)

Figure 2: Over-Sampling (replace) node
[Figure 2] uses a different extraction method from the Over-Sampling above. The earlier sampling was simple random sampling, so it contains no duplicate observations. Here, sampling with replacement is used, so duplicates are allowed. The sample was drawn in a SAS Code node using the SURVEYSELECT procedure. The most important point is that the data partition must be carried out in a preceding node, before the sampling. Because duplicates exist, the same observation must not end up in both the train and the validation data set.
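The ordering matters enough to spell out. The Python sketch below is a stand-in for the SAS flow, not the SURVEYSELECT code used in this post: it splits first and only then over-samples the training part with replacement. The function name, the `Class` label column, and the toy rows are assumptions.

```python
import random

def split_then_over_sample(rows, label_key="Class", minority=1,
                           train_fraction=0.7, seed=0):
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    # 1) Partition first, so the validation rows are set aside untouched.
    cut = int(len(shuffled) * train_fraction)
    train, validation = shuffled[:cut], shuffled[cut:]

    # 2) Over-sample only inside the training partition.
    minority_train = [r for r in train if r[label_key] == minority]
    majority_train = [r for r in train if r[label_key] != minority]
    # Sampling WITH replacement: the same minority row can be drawn repeatedly.
    extra = rng.choices(minority_train,
                        k=len(majority_train) - len(minority_train))
    return train + extra, validation

# Toy data: 50 minority rows among 1,000 (made up).
toy = [{"id": i, "Class": 1 if i < 50 else 0} for i in range(1000)]
train, validation = split_then_over_sample(toy)
train_ids = {r["id"] for r in train}
validation_ids = {r["id"] for r in validation}
print(train_ids.isdisjoint(validation_ids))  # True
```

Had the duplication happened before the split, copies of one observation could land on both sides and the validation scores would look better than they should.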
3. Under&Over-Sampling

Figure 3: Under&Over-Sampling Node
[Figure 3] Under&Over-Sampling also contains duplicate observations, but unlike the Over-Sampling above it does not draw many duplicates, and it also samples the majority class. I therefore regard it as a sampling method that reduces information loss while also reducing bias. Under&Over-Sampling was likewise run in a SAS Code node.
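A rough Python illustration of the combined approach follows. It is only a sketch under assumptions: the function name, the `target_per_class` parameter, and the `Class` column are hypothetical, and the SAS code used in this post may pick its sizes differently.

```python
import random

def under_over_sample(rows, target_per_class, label_key="Class",
                      minority=1, seed=0):
    """Shrink the majority class and grow the minority class to one common size.

    Assumes len(minority rows) <= target_per_class <= len(majority rows).
    """
    rng = random.Random(seed)
    minority_rows = [r for r in rows if r[label_key] == minority]
    majority_rows = [r for r in rows if r[label_key] != minority]
    # Under-sample the majority class without replacement.
    majority_part = rng.sample(majority_rows, k=target_per_class)
    # Over-sample the minority class with replacement up to the same size.
    extra = rng.choices(minority_rows, k=target_per_class - len(minority_rows))
    return majority_part + minority_rows + extra

# Toy data: 20 minority rows among 1,000 (made up).
toy = [{"id": i, "Class": 1 if i < 20 else 0} for i in range(1000)]
mixed = under_over_sample(toy, target_per_class=100)
print(len(mixed), sum(r["Class"] for r in mixed))  # 200 100
```

Compared with pure over-sampling, each minority row is repeated far fewer times, and compared with pure under-sampling, more majority rows survive.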
4. Bootstrapping

Figure 4: Bootstrapping-Sampling Node
[Figure 4] shows sampling with Bootstrapping. EM also offers Bootstrapping through its own nodes, but in this post I used a SAS Code node. Bootstrapping raises the model's hit rate by combining, with weights, the results of several fitted models. That is why the diagram shows a chain of connected nodes, and a model ensemble was applied.
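The idea of fitting one model per bootstrap sample and combining them can be sketched as follows. This is a deliberately simplified stand-in: it uses a one-variable threshold rule instead of a decision tree and an equal-weight vote instead of whatever weighting the EM ensemble applies. All names and the toy data are hypothetical.

```python
import random
from statistics import mean

def fit_stump(sample):
    """Threshold halfway between the two class means of a single feature x."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    return (mean(pos) + mean(neg)) / 2

def bagged_stumps(data, n_models=25, seed=0):
    rng = random.Random(seed)
    thresholds = []
    while len(thresholds) < n_models:
        # Bootstrap sample: same size as the data, drawn with replacement.
        sample = rng.choices(data, k=len(data))
        if len({y for _, y in sample}) == 2:  # need both classes to fit
            thresholds.append(fit_stump(sample))
    return thresholds

def predict(thresholds, x):
    # Equal-weight vote across the bootstrap models.
    votes = sum(1 for t in thresholds if x > t)
    return 1 if votes > len(thresholds) / 2 else 0

# Toy data: 100 negatives near 0..1 and 10 positives near 3 (made up).
data = [(i * 0.01, 0) for i in range(100)] + [(3 + i * 0.01, 1) for i in range(10)]
thresholds = bagged_stumps(data)
print(predict(thresholds, 3.0), predict(thresholds, 0.5))  # 1 0
```

Each model sees a slightly different resample, so the combined vote is less sensitive to any single draw than one model fitted once.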
5. SMOTE - Macro

Figure 5: SMOTE-Sampling Node
[Figure 5] With SMOTE sampling, the most important point is to normalize the data before sampling. SMOTE generates new samples from the distances between observations, so without normalization a single variable measured on a different scale can bias the result. For that reason the Transform Variables node was placed as a preceding node, and the prior probabilities were set through the Decisions node in the same way as in the earlier post.
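To show why scaling comes first, here is a minimal Python sketch of the SMOTE idea: standardize, find nearest minority neighbours by distance, and interpolate. It is not the SAS macro used in this post; the function names, the choice of z-scores as the normalization, and the toy points are assumptions.

```python
import math
import random
from statistics import mean, pstdev

def standardize(points):
    """Z-score each column so that no single variable dominates the distances."""
    columns = list(zip(*points))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # guard against constant columns
    return [tuple((v - m) / s for v, m, s in zip(p, mus, sigmas)) for p in points]

def smote(minority, n_new, k=3, seed=0):
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Toy data: the second variable is on a much larger scale than the first (made up).
data = [((1.0, 1000.0), 1), ((1.2, 5000.0), 1), ((0.9, 3000.0), 1), ((1.1, 2000.0), 1),
        ((3.0, 4000.0), 0), ((3.2, 1500.0), 0), ((2.9, 2500.0), 0), ((3.1, 4500.0), 0)]
scaled = standardize([x for x, _ in data])
minority = [x for x, (_, y) in zip(scaled, data) if y == 1]
new_points = smote(minority, n_new=6)
print(len(new_points))  # 6
```

Without the `standardize` step, the neighbour search in this toy data would be decided almost entirely by the second variable, simply because its numbers are larger.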
PREDICTIVE MODELING

Figure 6: Diagram
[Figure 6] As a baseline, the data was preprocessed with missing-value handling and variable transformation (maximizing normality), and the data partition was set to 70:30. Except for Bootstrapping sampling, three models were applied to each sample: [Decision Tree], [Neural Network], and [Logistic regression]. All options were left at their default values so that the differences between the sampling techniques would show. For Bootstrapping, an ensemble of the most commonly used model, [Decision Tree], was built and included in the model comparison.
GOODNESS-OF-FIT
| Model | ROC index | RR | MR | SEN | SPEC |
| --- | --- | --- | --- | --- | --- |
| Regression (6) | 0.989 | 0.977342048 | 0.022658 | 0.946309 | 0.986842 |
| Decision Tree (9) | 0.948 | 0.9708061 | 0.029194 | 0.899329 | 0.996411 |
| Ensemble | 0.946 | 0.975163399 | 0.024837 | 0.892617 | 0.998804 |
| Neural Network (6) | 0.981 | 0.982570806 | 0.017429 | 0.926174 | 0.983254 |
| under Decision Tree | 0.939 | 0.906705539 | 0.093294 | 0.885906 | 0.992823 |
| under Neural Network | 0.981 | 0.986880466 | 0.01312 | 0.912752 | 0.980861 |
| under Logistic | 0.987 | 0.951895044 | 0.048105 | 0.939597 | 0.983254 |
| smote Decision Tree | 0.967 | 0.957183635 | 0.042816 | 0.947777 | 0.9689 |
| smote Neural Network | 0.968 | 0.937202664 | 0.062797 | 0.9012 | 0.961722 |
| smote Logistic | 0.928 | 0.872312084 | 0.127688 | 0.831334 | 0.922249 |
| replace-over Decision Tree | 0.957 | 0.907530738 | 0.092469 | 0.899329 | 0.996411 |
| replace-over Neural Network | 0.981 | 0.953637295 | 0.046363 | 0.90604 | 0.995215 |
| replace-over Logistic | 0.989 | 0.944672131 | 0.055328 | 0.946309 | 0.98445 |
| over&under Decision Tree | 0.944 | 0.921167247 | 0.078833 | 0.899329 | 0.990431 |
| over&under Neural Network | 0.986 | 0.965156794 | 0.034843 | 0.926174 | 0.994019 |
| over&under Logistic | 0.987 | 0.946428571 | 0.053571 | 0.946309 | 0.990431 |
| no-replace-over Decision Tree | 0.949 | 0.905429072 | 0.094571 | 0.912752 | 0.980861 |
| no-replace-over Neural Network | 0.976 | 0.989492119 | 0.010508 | 0.932886 | 0.964115 |
| no-replace-over Logistic | 0.985 | 0.949211909 | 0.050788 | 0.946309 | 0.9689 |
Table 1: Result values
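For reference, the four rate columns in the table above can be computed from a confusion matrix as in the Python sketch below. It assumes RR is the correct-classification rate and MR the misclassification rate (the two columns sum to 1 in every row), with SEN and SPEC read as sensitivity and specificity. The counts are made up and are not the ones behind the table.

```python
def classification_rates(tp, fn, fp, tn):
    """Rates from a 2x2 confusion matrix; the positive class is the minority (fraud)."""
    total = tp + fn + fp + tn
    return {
        "RR": (tp + tn) / total,    # correctly classified share
        "MR": (fp + fn) / total,    # misclassified share, equals 1 - RR
        "SEN": tp / (tp + fn),      # share of actual positives that were caught
        "SPEC": tn / (tn + fp),     # share of actual negatives that were cleared
    }

# Hypothetical validation counts: 100 positives, 900 negatives.
rates = classification_rates(tp=90, fn=10, fp=20, tn=880)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

Because the negatives outnumber the positives, RR is pulled toward SPEC, which is why SEN has to be read separately for imbalanced data.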

Figure 7: ROC graph
Figure 8: ROC index
[Figure 7] shows that the curves of most models lie close to the upper-left corner. The train and validate values also do not differ much. The largest gap belongs to Over-Sampling with simple random sampling, which goes from 0.999 on train to 0.976 on validate; that is not a large difference, so I judge that the models are not overfitted. The lowest ROC index, on both train and validate, belongs to SMOTE Logistic. It is lower than the other Logistic regression models, which I attribute to the fact that SMOTE sampling transforms the data by generating new observations. For Logistic regression, I therefore expect performance to improve if its many assumptions and multicollinearity are checked once more and the model is tuned in detail before fitting. Even so, every model has an ROC index above 0.9, so accuracy does not appear to be a major problem.
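As a reminder of what the ROC index measures, the Python sketch below computes it as the share of positive/negative pairs that the model ranks correctly, which equals the area under the ROC curve. The scores and labels are made up; only the 0.999 and 0.976 values in the last lines come from this post.

```python
def roc_index(scores, labels):
    """Probability that a random positive scores higher than a random negative
    (ties count one half), which equals the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up predicted probabilities and true labels.
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0]
auc = roc_index(scores, labels)
print(f"{auc:.3f}")  # 0.917

# Train/validate gap for the largest difference reported in the post.
gap = 0.999 - 0.976
print(f"{gap:.3f}")  # 0.023
```

A large drop from train to validate would point to overfitting; a gap of about 0.023 is small on this scale.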

Figure 9: Specificity and sensitivity
[Figure 9] shows the specificity and sensitivity of each classifier. Because the data is class-imbalanced, specificity is higher and sensitivity is lower. Among the three classifiers fitted without sampling, the Neural Network and the Decision Tree (but not Logistic regression) show a noticeable gap between specificity and sensitivity, and when sampling was applied to address this, sensitivity rose in some cases. In addition, SMOTE Decision Tree and Over-Sampling (simple random sampling) Logistic both have high specificity and sensitivity with almost no gap between them, and so they came out as highly accurate models.
CONCLUSION
This time I implemented several sampling techniques with SAS EM and SAS code, applied them to the same data, and compared them overall. Contrary to expectation, the models fitted well even without sampling, but if you want to raise model performance further for classifying the minority class, the results confirm that analysis with sampling is worthwhile. If you try this yourself, I recommend using data with a more severe imbalance.