Data is being collected in many languages. However, in this course, you will be doing text analysis for the English language, and some of the techniques might not work for other languages.
Let’s have Prof. Srinath discuss this. He explains how characters of different languages are stored on computers.
Now, it is not necessary that when you work with text, you’ll always get to work with the English language. With so many languages in the world and the internet being accessed across many countries, there is a lot of text in non-English languages. To work with non-English text, you need to understand how all the other characters are stored.
Computers could handle numbers directly and store them in registers (the smallest unit of memory on a computer). But they couldn’t store non-numeric characters as is: the alphabets and special characters had to be converted to a numeric value before they could be stored.
Hence, the concept of encoding came into existence: all the non-numeric characters were encoded to a number using a code. The encoding techniques also had to be standardised so that different computer manufacturers wouldn’t use different encoding schemes.
The first encoding standard that came into existence was the ASCII (American Standard Code for Information Interchange) standard, in the 1960s. The ASCII standard assigned a unique code, known as the ASCII code, to each character on the keyboard. For example, the ASCII code of the alphabet ‘A’ is 65 and that of the digit zero is 48. Since then, there have been several revisions to the codes to incorporate new characters that came into existence after the initial encoding.
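You can check these codes yourself in Python: the built-in ord() function returns the numeric code of a character, and chr() does the reverse. A minimal sketch:
# ord() maps a character to its numeric code; chr() maps a code back to the character
print(ord('A'))   # 65, the ASCII code of 'A'
print(ord('0'))   # 48, the ASCII code of the digit zero
print(chr(65))    # 'A'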
When ASCII was built, the English alphabet was the only one present on keyboards. With time, new languages began to show up on keyboard sets, which brought new characters. ASCII became outdated and couldn’t incorporate so many languages. A new standard then came into existence – the Unicode standard. It supports all the languages in the world, both modern and older ones.
For someone working on text processing, knowing how to handle encodings is crucial. Before even beginning any text processing, you need to know what encoding the text uses and, if required, convert it to another encoding format.
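As a rough illustration of such a conversion, the sketch below reads a file’s raw bytes, decodes them using the encoding they were saved in, and writes the text back out as UTF-8. The file names and the Latin-1 encoding here are assumptions made purely for illustration:
# read the raw bytes of a file saved in some other encoding (assumed to be Latin-1 here)
with open('reviews_latin1.txt', 'rb') as f:
    raw_bytes = f.read()
# interpret the bytes as Latin-1 to get a regular Python string
text = raw_bytes.decode('latin-1')
# write the same text back out, this time encoded as UTF-8
with open('reviews_utf8.txt', 'w', encoding='utf-8') as f:
    f.write(text)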
In this segment, you’ll understand how encoding works in Python and the different types of encodings that you can use in Python.
Note: At 0:51, the professor mistakenly says that ASCII is a 256-bit encoding standard; it is actually an 8-bit encoding standard.
To get a more in-depth understanding of Unicode, there’s a guide on the official Python website. You can check it out here.
To summarise, the two most popular encoding standards are:
- American Standard Code for Information Interchange (ASCII)
- Unicode, which is most commonly stored using one of the following formats:
  - UTF-8
  - UTF-16
Let’s look at the relation between ASCII, UTF-8 and UTF-16 through an example. The table below shows the ASCII, UTF-8 and UTF-16 codes for two symbols – the dollar sign and the Indian rupee symbol.
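| Symbol | ASCII code | UTF-8 (hex) | UTF-16 BE (hex) |
|--------|------------|-------------|-----------------|
| $ (dollar sign) | 36 | 24 (1 byte / 8 bits) | 00 24 (2 bytes / 16 bits) |
| ₹ (rupee symbol) | not available | E2 82 B9 (3 bytes / 24 bits) | 20 B9 (2 bytes / 16 bits) |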
As you can see, UTF-8 offers a big advantage when the character is an English character or, more generally, a character from the ASCII character set. While UTF-8 uses only 8 bits to store such a character, UTF-16 (BE) uses 16 bits, which looks like a waste of memory.
However, the second symbol doesn’t appear in the ASCII character set. In this case, UTF-8 uses 24 bits, whereas UTF-16 (BE) uses only 16. Hence, the storage advantage offered by UTF-8 is reversed and actually becomes a disadvantage here. Also, the advantage UTF-8 offered previously, of being the same as the ASCII code, is of no use here, as an ASCII code doesn’t even exist for this symbol.
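You can verify these numbers yourself by encoding each symbol in Python and counting the bytes; a minimal sketch:
# compare how many bytes each symbol needs in UTF-8 and UTF-16 (big-endian)
for symbol in ['$', '₹']:
    utf8_bytes = symbol.encode('utf-8')
    utf16_bytes = symbol.encode('utf-16-be')
    print(symbol, '->', len(utf8_bytes), 'byte(s) in UTF-8,', len(utf16_bytes), 'byte(s) in UTF-16 (BE)')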
The default encoding for strings in Python is Unicode UTF-8. You can also use this UTF-8 encoder-decoder to see how a string is stored. Note that the online tool gives you the hexadecimal codes of a given string.
Try this code in your Jupyter notebook and look at its output. Feel free to tinker with the code.
# create a string
amount = u"₹50"
print('Default string: ', amount, '\n', 'Type of string', type(amount), '\n')
# encode to UTF-8 byte format
amount_encoded = amount.encode('utf-8')
print('Encoded to UTF-8: ', amount_encoded, '\n', 'Type of string', type(amount_encoded), '\n')
# sometime later, on another computer...
# decode from UTF-8 byte format
amount_decoded = amount_encoded.decode('utf-8')
print('Decoded from UTF-8: ', amount_decoded, '\n', 'Type of string', type(amount_decoded), '\n')
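To relate the encoded bytes to the hexadecimal codes shown by the online tool mentioned above, you can call the .hex() method on the bytes object. This builds on the amount_encoded variable from the snippet above:
# print the hexadecimal codes of the encoded bytes
print(amount_encoded.hex())   # 'e282b93530': e2 82 b9 encodes '₹', 35 and 30 encode '5' and '0'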
In the next segment, you’ll learn about regular expressions, which are a must-know tool for anyone working in the field of natural language processing and text analytics.