처음 시도 php 함수 mb_detect_encoding으로 검사

파라미터에 대해서 인코딩방식이 어떤 것인지 판별하여 Return해준다.

string mb_detect_encoding ( string $str [, mixed $encoding_list= mb_detect_order() [, bool $strict= false ]] )

예제


사용은 대충 이런식으로 하게 된다.

/* Detect character encoding with current detect_order */
echo mb_detect_encoding($str);

/* "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS" */
echo mb_detect_encoding($str, "auto");

/* Specify encoding_list character encoding by comma separated list */
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");

/* Use array to specify encoding_list  */
$ary[] = "ASCII";
$ary[] = "JIS";
$ary[] = "EUC-JP";
echo mb_detect_encoding($str, $ary);

하지만 한글에 대해선 완벽하게 지원해주지 않는 걸 확인했다. URL의 파라미터에 직접 한글을 입력하여 한 결과 처음 자음이 ㅈ~ㅎ으로 시작하면 UTF-8로 인식한다.

따라 함수를 따로 만들거나 해야 함.



1. detect_encoding함수
Another light way to detect character encoding:

function detect_encoding($string) {  
  static $list = array('utf-8', 'windows-1251');
  
  foreach ($list as $item) {
    $sample = iconv($item, $item, $string);
    if (md5($sample) == md5($string))
      return $item;
  }
  return null;
}

2. Function to detect UTF-8, when mb_detect_encoding is not available it may be useful.
Function to detect UTF-8, when mb_detect_encoding is not available it may be useful.

function is_utf8($str) {
    $c=0; $b=0;
    $bits=0;
    $len=strlen($str);
    for($i=0; $i<$len; $i++){
        $c=ord($str[$i]);
        if($c > 128){
            if(($c >= 254)) return false;
            elseif($c >= 252) $bits=6;
            elseif($c >= 248) $bits=5;
            elseif($c >= 240) $bits=4;
            elseif($c >= 224) $bits=3;
            elseif($c >= 192) $bits=2;
            else return false;
            if(($i+$bits) > $len) return false;
            while($bits > 1){
                $i++;
                $b=ord($str[$i]);
                if($b < 128 || $b > 191) return false;
                $bits--;
            }
        }
    }
    return true;
}

3. conver to Utf8 if $str is not equals to 'UTF-8'

/* 
*QQ: 290359552 
* conver to Utf8 if $str is not equals to 'UTF-8' 
*/ 
function convToUtf8($str){ 
     if( mb_detect_encoding($str,"UTF-8, ISO-8859-1, GBK")!="UTF-8" ){ 
         return  iconv("gbk","utf-8",$str); 

     }else{ 
         return $str; 
     }
} 

4. from PHPDIG

function isUTF8($str) {
        if ($str === mb_convert_encoding(mb_convert_encoding($str, "UTF-32", "UTF-8"), "UTF-8", "UTF-32")) {
            return true;
        } else {
            return false;
        }
    } 

5. Much simpler UTF-8-ness checker using a regular expression created by the W3C:

// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
    
    // From http://w3.org/International/questions/qa-forms-utf-8.html
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$%xs', $string);
    
} // function is_utf8



6. Sometimes mb_detect_string is not what you need. When using pdflib for example you want to VERIFY the correctness of utf-8. mb_detect_encoding reports some iso-8859-1 encoded text as utf-8. To verify utf 8 use the following:

//
//    utf8 encoding validation developed based on Wikipedia entry at:
//    http://en.wikipedia.org/wiki/UTF-8
//
//    Implemented as a recursive descent parser based on a simple state machine
//    copyright 2005 Maarten Meijer
//
//    This cries out for a C-implementation to be included in PHP core
//
    function valid_1byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0x80) == 0x00;
    }
    
    function valid_2byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xE0) == 0xC0;
    }

    function valid_3byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF0) == 0xE0;
    }

    function valid_4byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF8) == 0xF0;
    }
    
    function valid_nextbyte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xC0) == 0x80;
    }
    
    function valid_utf8($string) {
        $len = strlen($string);
        $i = 0;    
        while( $i < $len ) {
            $char = ord(substr($string, $i++, 1));
            if(valid_1byte($char)) {    // continue
                continue;
            } else if(valid_2byte($char)) { // check 1 byte
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } else if(valid_3byte($char)) { // check 2 bytes
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } else if(valid_4byte($char)) { // check 3 bytes
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } // goto next char
        }
        return true; // done
    }

for a drawing of the statemachine see: http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png 


Posted by Finebe
,