 
[EM] Dealing with imbalanced data 2017.10.16
최세환

 


This time, we apply several sampling techniques to the same data set and compare how they differ.


 


INTRODUCTION

 


With class-imbalanced data we run into the so-called accuracy paradox: a model can report a very high overall classification rate and still be useless. For example, if only about 0.2% of transactions are fraudulent, a model that labels every transaction as non-fraud is about 99.8% accurate yet never catches a single fraud. Several sampling techniques are commonly used to work around this, so here we compare them to see which one performs well and how their characteristics differ.


 

The six techniques are:

1. Under-Sampling
2. Over-Sampling (simple random sampling, no replacement)
3. Under & Over-Sampling
4. Over-Sampling (with replacement)
5. Bootstrapping
6. SMOTE – Macro



Earlier postings already introduced each of these techniques and walked through hands-on examples, so here I will keep the introductions brief rather than repeat the details.


The data set is the "Credit Card Fraud Detection" data openly available on Kaggle, built for detecting credit card fraud. As in the earlier posting, the original data contained a variety of variables, but for privacy reasons they were reduced to principal components before release, so the individual variables cannot be given a meaningful interpretation.
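Before any sampling, it is worth checking just how imbalanced the target is with a one-way frequency table. A minimal sketch, assuming the raw table is WORK.CREDIT and the 0/1 target is named CLASS as in the Kaggle file:

proc freq data=work.credit;
   tables class / nocum;   /* shows how rare the fraud level (CLASS = 1) is */
run;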



METHODOLOGY



1. Under-Sampling



Figure 1: Under-Sampling and Over-Sampling nodes


[Figure 1] shows the under-sampling and over-sampling flow. Both were run through the Sample node in EM, and before sampling the data were preprocessed with the Impute and Transform Variables nodes.

Under-sampling produced 686 records and over-sampling 571. A Data Partition node would normally split the data into train and validation sets, but it could not be connected directly after the preceding node, so a SAS Code node was used to prepare the validation data set instead.
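For readers who prefer code over the Sample node, class-level under-sampling can also be written in a SAS Code node. A minimal sketch, assuming the preprocessed table is WORK.CREDIT, the target is CLASS (fraud = 1), and the majority sample size is chosen to roughly match the minority count:

proc surveyselect data=work.credit(where=(class = 0))   /* majority class only                */
                  method=srs                            /* simple random, without replacement */
                  sampsize=343                          /* assumed: about the minority count  */
                  seed=20171016
                  out=major_sample;
run;

data under_sample;         /* keep every minority record and append the sampled majority */
   set work.credit(where=(class = 1)) major_sample;
run;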


 

2. Over-Sampling (with replacement)



Figure 2: Over-Sampling (with replacement) node


[Figure 2] uses a different sampling scheme. The over-sampling above was simple random sampling, so it contains no duplicate records; here sampling with replacement is used, so duplicates are allowed. The sample was drawn in a SAS Code node with the SURVEYSELECT procedure. The most important point is that the data partition must be done in a preceding node: because duplicate records exist, the same observation must not appear in both the train and validation data sets.
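A minimal sketch of that SAS Code node step, assuming the training partition is named TRAIN, the target is CLASS, and the minority is grown to an assumed target count:

proc surveyselect data=train(where=(class = 1))   /* minority class only                        */
                  method=urs                      /* unrestricted sampling = with replacement   */
                  sampsize=2000                   /* assumed target size for the minority class */
                  outhits                         /* write every draw as its own row            */
                  seed=20171016
                  out=minor_over;
run;

data over_replace;                                /* recombine with the untouched majority      */
   set train(where=(class = 0)) minor_over;
run;

Because the input is the already-partitioned TRAIN table, the duplicates created here can never leak into the validation set.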


3. Under&Over-Sampling



Figure 3: Under & Over-Sampling nodes


[Figure 3] Under & over sampling also produces duplicate records, but unlike the pure over-sampling above it does not need to duplicate the minority as heavily, and because the majority class is only moderately sampled down at the same time, it reduces information loss while also reducing bias. It was likewise implemented with a SAS Code node.
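A minimal sketch of the combined step, again assuming a TRAIN table with target CLASS and purely illustrative sample sizes:

proc surveyselect data=train(where=(class = 0))   /* majority sampled down, without replacement */
                  method=srs sampsize=1500 seed=20171016 out=major_dn;
run;

proc surveyselect data=train(where=(class = 1))   /* minority sampled up, with replacement      */
                  method=urs sampsize=1500 outhits seed=20171016 out=minor_up;
run;

data under_over;                                  /* roughly balanced training table            */
   set major_dn minor_up;
run;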


4. Bootstrapping



Figure 4: Bootstrapping sampling nodes


 


[Figure 4] shows sampling based on bootstrapping. Bootstrapping can be done with EM nodes as well, but in this posting it was done with a SAS Code node. With bootstrapping, the results of several fitted models are combined with weights to raise the hit rate of the final model; that is why the node connections continue on afterwards and why a model ensemble was applied.
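A minimal sketch of drawing the bootstrap resamples in a SAS Code node, assuming a TRAIN table and 10 replicates. Each resample has the same number of rows as TRAIN and is tagged by the automatic Replicate column, so one model can be fit per replicate and the predictions averaged afterwards:

proc surveyselect data=train
                  method=urs samprate=1   /* draw n rows with replacement           */
                  reps=10                 /* number of bootstrap resamples          */
                  outhits                 /* repeat duplicated rows explicitly      */
                  seed=20171016
                  out=boot_samples;       /* includes a Replicate identifier column */
run;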



5. SMOTE – Macro



 

Figure 5: SMOTE sampling nodes


[Figure 5] For SMOTE sampling, the most important point is to normalize the data before sampling. SMOTE creates new synthetic records from the distances between observations, so without normalization a single variable on a different scale can dominate the distances and bias the result. A Transform Variables node was therefore placed ahead of the sampling step, and a Decisions node was used to set the prior probabilities the same way as in the earlier posting.
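The normalization can also be done in code. A minimal sketch using PROC STDIZE, assuming the interval inputs follow the Kaggle naming V1-V28 plus AMOUNT and the target CLASS is left untouched:

proc stdize data=train
            method=range               /* rescale each input to the [0, 1] range  */
            out=train_std;
   var v1-v28 amount;                  /* interval inputs only; CLASS is excluded */
run;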


PREDICTIVE MODELING



Figure 6: Full diagram


[Figure 6] The basic preprocessing consisted of imputing missing values and transforming variables (maximizing normality), and the data partition was set to 70:30. Except for the bootstrapping flow, three models were applied to every sampling variant: a decision tree, a neural network, and logistic regression. All options were left at their defaults so that the differences between the sampling techniques would show through. For bootstrapping, an ensemble of the most commonly used model, the decision tree, was built and included in the model comparison.
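Outside of EM, the logistic regression baseline on the 70:30 split could be reproduced roughly as follows; a sketch only, with the table and variable names (TRAIN, VALID, CLASS, V1-V28, AMOUNT) assumed:

proc logistic data=train;
   model class(event='1') = v1-v28 amount;    /* default settings, fraud = 1 as the event */
   score data=valid out=valid_scored fitstat; /* score the 30% hold-out partition         */
run;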


GOODNESS-OF-FIT


Model name                       ROC index   RR            MR         SEN        SPEC
Regression (6)                   0.989       0.977342048   0.022658   0.946309   0.986842
Decision Tree (9)                0.948       0.9708061     0.029194   0.899329   0.996411
Ensemble                         0.946       0.975163399   0.024837   0.892617   0.998804
Neural Network (6)               0.981       0.982570806   0.017429   0.926174   0.983254
under Decision Tree              0.939       0.906705539   0.093294   0.885906   0.992823
under Neural Network             0.981       0.986880466   0.01312    0.912752   0.980861
under Logistic                   0.987       0.951895044   0.048105   0.939597   0.983254
smote Decision Tree              0.967       0.957183635   0.042816   0.947777   0.9689
smote Neural Network             0.968       0.937202664   0.062797   0.9012     0.961722
smote Logistic                   0.928       0.872312084   0.127688   0.831334   0.922249
replace-over Decision Tree       0.957       0.907530738   0.092469   0.899329   0.996411
replace-over Neural Network      0.981       0.953637295   0.046363   0.90604    0.995215
replace-over Logistic            0.989       0.944672131   0.055328   0.946309   0.98445
over&under Decision Tree         0.944       0.921167247   0.078833   0.899329   0.990431
over&under Neural Network        0.986       0.965156794   0.034843   0.926174   0.994019
over&under Logistic              0.987       0.946428571   0.053571   0.946309   0.990431
no-replace-over Decision Tree    0.949       0.905429072   0.094571   0.912752   0.980861
no-replace-over Neural Network   0.976       0.989492119   0.010508   0.932886   0.964115
no-replace-over Logistic         0.985       0.949211909   0.050788   0.946309   0.9689

Table 1: Results (RR = correct classification rate, MR = misclassification rate, SEN = sensitivity, SPEC = specificity)
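For reference, the sensitivity and specificity above follow the usual confusion-matrix definitions, with fraud treated as the positive class (TP = frauds correctly flagged, TN = normal transactions correctly passed):

SEN  = TP / (TP + FN)        (share of actual frauds that are caught)
SPEC = TN / (TN + FP)        (share of normal transactions that are passed)
RR   = (TP + TN) / N,  MR = 1 - RR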


 



Figure 7: ROC graph



 


Figure 8: ROC index values


[Figure 7] shows that most of the models trace ROC curves close to the upper-left corner. There is also no large gap between the train and validate values; the biggest gap belongs to the over-sampling model built with simple random sampling, which drops from 0.999 on train to 0.976 on validate, and since that is not a large difference we can conclude that the models are not overfitting. The lowest ROC index on both train and validate belongs to the smote Logistic model. It scores lower than the other logistic regression models because SMOTE generates brand-new synthetic records; if the regression assumptions and multicollinearity were re-checked and the model fine-tuned afterwards, its performance would likely improve. Even so, every model has an ROC index above 0.9, so accuracy is not a serious concern.




Figure 9: Specificity and sensitivity


[Figure 9] shows the specificity and sensitivity of each classifier. Because the data are class-imbalanced, specificity is the higher of the two and sensitivity the lower. Among the three classifiers built without sampling, the neural network and the decision tree (though not the logistic regression) show a noticeable gap between specificity and sensitivity, and applying sampling to address this did raise sensitivity in several cases. In addition, the smote decision tree and the simple-random-sampling over-Logistic model keep specificity and sensitivity both high with almost no gap between them, which is why they come out as the most accurate models.


 


CONCLUSION


This time, several sampling techniques were implemented with SAS EM and SAS code, applied to the same data, and compared side by side. Somewhat unexpectedly, the models without sampling also fit the data well, but if you want to push model performance further for classifying the minority class, the comparison confirms that analysis with sampling is worthwhile. If you try this yourself, I recommend practicing on data with an even more severe imbalance.





 

