Package emissary.util

Class CharsetUtil


  • public class CharsetUtil
    extends Object
    A collection of utilities for dealing with different character sets in Java, mainly with the aim of getting to UTF-8. The j* routines generally take Java charset names, while the non-j* routines take derived charset names.

    This class contains an interpretation in Java of the GPL method isUTF8, available in C from http://billposer.org/Software/unidesc.html. The copied routine is called LegalUTF8P in Get_UTF32_From_UTF8i.c.

    Copyright (C) 2003-2006 William J. Poser (billposer@alum.mit.edu)

    This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

    This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA or go to the web page: http://www.gnu.org/licenses/gpl.txt.

    This class contains the Apache Licensed isUnicodeString, which is from Jakarta POI (http://jakarta.apache.org/poi).

    Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
    • Method Detail

      • getUtfCharArray

        public static char[] getUtfCharArray(byte[] byteArray,
                                             String charSet,
                                             int start,
                                             int end)
        Get an array of UTF-8 characters from the input bytes
        Parameters:
        byteArray - the input bytes
        charSet - derived charSet of the input array
        start - index into input array to start copying
        end - index into input array to stop copying
        Returns:
        array of UTF8 char
      • jGetUtfCharArray

        public static char[] jGetUtfCharArray(byte[] byteArray,
                                              @Nullable
                                              String charSet,
                                              int start,
                                              int end)
        Get an array of UTF-8 characters from the input bytes
        Parameters:
        byteArray - the input bytes
        charSet - JAVA charSet of the input array
        start - byte index into input array to start copying
        end - byte index into input array to stop copying
        Returns:
        array of UTF8 char
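        The range-based decoding described above can be approximated with the JDK alone. The following is a minimal sketch, not the CharsetUtil implementation: the class name RangeDecodeSketch and the method decodeRange are hypothetical, end is assumed to be exclusive (the documentation does not say), and a null charset name is assumed to fall back to the platform default.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

final class RangeDecodeSketch {

    /**
     * Decodes bytes[start, end) using the named Java charset.
     * Assumption: end is exclusive. Assumption: a null charset name
     * means "use the platform default".
     */
    static char[] decodeRange(byte[] bytes, String javaCharset, int start, int end) {
        Charset cs = (javaCharset == null)
                ? Charset.defaultCharset()
                : Charset.forName(javaCharset);
        // Charset.decode replaces malformed input instead of throwing.
        CharBuffer decoded = cs.decode(ByteBuffer.wrap(bytes, start, end - start));
        char[] out = new char[decoded.remaining()];
        decoded.get(out);
        return out;
    }
}
```

        Wrapping the range in a ByteBuffer avoids copying the input array before decoding. Note that start and end count bytes, so a range boundary that splits a multibyte sequence produces replacement characters in this sketch.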
      • getUtfString

        public static String getUtfString(String s,
                                          String charSet)
        Get a string in the specified encoding from the input String
      • getUtfString

        @Nullable
        public static String getUtfString(byte[] data,
                                          String charSet)
        Get a string from the input bytes, decoded using the specified encoding.
        Parameters:
        data - input bytes
        charSet - the Java charset
        Returns:
        Java string (JUCS2), or null on error
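        A "string or null on error" contract of this kind can be sketched with standard JDK calls. This is an illustration under assumptions, not the CharsetUtil code: the names DecodeOrNullSketch and decodeOrNull are hypothetical, and "error" is assumed to mean a null argument or an illegal or unsupported charset name.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

final class DecodeOrNullSketch {

    /**
     * Decodes data with the named Java charset.
     * Assumption: null input or an unknown charset name yields null
     * rather than an exception.
     */
    static String decodeOrNull(byte[] data, String javaCharset) {
        if (data == null || javaCharset == null) {
            return null;
        }
        try {
            return new String(data, Charset.forName(javaCharset));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return null;
        }
    }
}
```

        Callers of a method with this contract need a null check before using the result, which is what the @Nullable annotation on the signature signals.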
      • byteToCharArray

        public static char[] byteToCharArray(byte[] bArray)
        Convert bytes to chars using the platform default encoding.
        Parameters:
        bArray - the input data
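        Because the platform default encoding varies between machines, the same bytes can convert to different chars. The sketch below shows that effect with plain JDK calls; DefaultCharsetSketch and its methods are hypothetical names, not part of CharsetUtil.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetSketch {

    /** Converts using whatever charset the running JVM treats as default. */
    static char[] toCharsDefault(byte[] bytes) {
        return new String(bytes, Charset.defaultCharset()).toCharArray();
    }

    /** Converts using an explicitly chosen charset, for comparison. */
    static char[] toChars(byte[] bytes, Charset cs) {
        return new String(bytes, cs).toCharArray();
    }
}
```

        For the two bytes 0xC3 0xA9, toChars yields one char (U+00E9) under StandardCharsets.UTF_8 and two chars under StandardCharsets.ISO_8859_1, so the result of a default-encoding conversion depends on where it runs.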
      • isAscii

        public static boolean isAscii(String s)
        Test whether a string is ASCII.
        Parameters:
        s - string to test
        Returns:
        true if the string is ASCII
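        An ASCII test of this kind usually reduces to checking that every char is at most 0x7F. The following is a minimal sketch assuming a non-null input; AsciiCheckSketch is a hypothetical name and the code is not taken from CharsetUtil.

```java
final class AsciiCheckSketch {

    /**
     * Returns true when every char in s is in the 7-bit ASCII range.
     * Assumption: s is non-null; the empty string counts as ASCII.
     */
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }
}
```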
      • isUtf8

        public static boolean isUtf8(String s)
        Do the bytes behind this string represent valid UTF-8?
        Parameters:
        s - string to test
        Returns:
        true if the string is UTF-8
      • isUtf8

        public static boolean isUtf8(byte[] data)
        Do these bytes represent a valid UTF-8 string?
        Parameters:
        data - the bytes to check
        Returns:
        true if valid UTF-8
      • isUtf8

        public static boolean isUtf8(byte[] data,
                                     int offs,
                                     int dlen)
        Check for valid UTF-8 data. Borrowed from the unidesc package (GPL) by Bill Poser and converted from C to Java. The check runs from offs to dlen-1.
        Parameters:
        data - the bytes to check for validity
        offs - beginning offset to check
        dlen - ending offset of the range (exclusive; the last byte checked is at dlen-1)
        Returns:
        true if valid UTF-8
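        A range check like this can also be expressed with the JDK's strict UTF-8 decoder. The sketch below swaps in java.nio.charset.CharsetDecoder for the ported LegalUTF8P routine, so it may disagree with CharsetUtil on edge cases; Utf8RangeCheckSketch and isValidUtf8 are hypothetical names, and the end parameter is treated as exclusive, matching "from offs to dlen-1".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8RangeCheckSketch {

    /**
     * Returns true when data[offs, end) decodes as UTF-8 without errors.
     * Stand-in technique: JDK strict decoding, not the unidesc port.
     */
    static boolean isValidUtf8(byte[] data, int offs, int end) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // REPORT makes the decoder throw instead of substituting U+FFFD.
            decoder.decode(ByteBuffer.wrap(data, offs, end - offs));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

        A range that ends in the middle of a multibyte sequence is reported as invalid, so the offsets need to fall on character boundaries.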
      • hasMultibyte

        public static boolean hasMultibyte(@Nullable
                                           String value)
        See if a string has multibyte chars. (No longer based on org.apache.poi.util.StringUtil.) Calling this with a very large string is a bad idea.
        Parameters:
        value - string to test
        Returns:
        true if the string has at least one multibyte char
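        The documentation does not define "multibyte", so the sketch below fixes one interpretation as an assumption: a char is multibyte when it needs more than one byte in UTF-8, that is, when it is above 0x7F. MultibyteSketch is a hypothetical name, the null handling is assumed from the @Nullable annotation, and the actual CharsetUtil threshold may differ.

```java
final class MultibyteSketch {

    /**
     * Returns true when value contains at least one char above 0x7F.
     * Assumption: "multibyte" means "more than one byte in UTF-8".
     * Assumption: a null value is reported as false.
     */
    static boolean hasMultibyte(String value) {
        if (value == null) {
            return false;
        }
        for (int i = 0; i < value.length(); i++) {
            if (value.charAt(i) > 0x7F) {
                return true;
            }
        }
        return false;
    }
}
```

        This version scans in place and stops at the first match, so it allocates nothing regardless of input size.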