How to think like a programmer -- Data Types and Collections

Assumed audience: You are a TechLabs student pursuing the Data Science track. You require little hand-holding and prefer to refer to textbooks or denser materials if a concept doesn't make sense to you.

Algorithms

A set of rules or steps to solve a problem.

Data structures

A particular way of organising data.

Data types

The data we manipulate are of a certain type, each with their own quirks.

Data Types

I differentiate between five broad categories of data types: strings, numeric, Boolean, NoneType and Collections. Everything else is commentary to a Data Scientist.

I don't think it's useful to commit to memory every method we have available to manipulate data types and collections.

You can find out what methods you have available by typing a simple command.

text = type("string")  # type() returns the class of a value, here str
dir(text)
# this lists the names of all the attributes and methods available on a string
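The raw dir() output includes many names wrapped in double underscores, which are mostly noise at this stage. As a minimal sketch, the list can be filtered down to the everyday methods. The variable name public_methods is my own choice for illustration.

```python
# dir() returns a list of attribute names as strings.
# Names starting with an underscore are internal hooks, so filter them out.
public_methods = [name for name in dir(str) if not name.startswith('_')]
print(public_methods)

# Every method carries its own short documentation string.
print(str.replace.__doc__)
```

The same pattern works for any value or type, for example dir(list) or dir(dict).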

But if you must have a list of methods that will come in handy, here's one for strings.

s = ' test '  # example string; call the methods on a string value
s.endswith('p')  # does the string end with the letter 'p'? Returns True or False.
s.startswith('p')  # does the string begin with the letter 'p'? Returns True or False.
s.find('p')  # index of the first 'p', or -1 if it is not found
s.replace('old', 'new')  # returns a copy with every 'old' replaced by 'new'
s.lower()  # lowercase
s.upper()  # uppercase
s.rstrip()  # remove whitespace right of the string (e.g. 'test ')
s.lstrip()  # remove whitespace left of the string (e.g. ' test')
s.strip()  # remove whitespace on both sides (e.g. ' test ')
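To see what these methods actually return, here is a small sketch that runs them on a sample value. The string and the variable names are made up for illustration.

```python
s = '  Python  '  # hypothetical sample string with padding on both sides

trimmed = s.strip()
print(trimmed)                      # Python
print(trimmed.startswith('P'))      # True
print(trimmed.endswith('p'))        # False, the comparison is case-sensitive
print(trimmed.find('t'))            # 2, the index of the first 't'
print(trimmed.find('z'))            # -1, meaning "not found"
print(trimmed.replace('Py', 'Cy'))  # Cython
print(trimmed.lower(), trimmed.upper())

# Strings are immutable: every method returns a new string
# and leaves the original untouched.
print(repr(s))
```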

You'll need to learn how to slice strings in any way you want.

Let's say you're scanning a text but you're only interested in certain groups of text.

For instance, you're scanning a bunch of emails and you want to extract the sender's address from the text/corpus.

text = "From: [email protected]"

start = text.find(':')  # index of the colon
text[start + 2:]  # skip the colon and the space, then take everything up to the end

The text[start + 2:] bit slices the string. A slice takes a starting index and an ending index, written text[start:end], and the character at the ending index is not included. Leave the ending index empty to return everything from the starting index to the end of the string.
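A few more slices, as a sketch on a made-up line (the sample string and variable names are assumptions, not part of the exercise data):

```python
line = 'Subject: weekly report'  # hypothetical sample line

colon = line.find(':')     # 7
header = line[:colon]      # from the start up to, but not including, index 7
value = line[colon + 2:]   # from index 9 to the end
middle = line[9:15]        # both a start and an end index
last_word = line[-6:]      # negative indexes count from the end

print(header)     # Subject
print(value)      # weekly report
print(middle)     # weekly
print(last_word)  # report
```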

But real world texts are more difficult to parse. You'll have to learn how to navigate large texts and extract what you need from them.

NoneType

None is Python's version of null: a single value, of type NoneType, that stands for "no data". The question you need to ask yourself is, when is it appropriate to use None?

Example: Maybe you're consuming an API that provides you with data. Your function may need to check whether there's any data to parse; otherwise your code will crash. If the API returns nothing (None), then don't execute the function. Think about what that code might look like.
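One possible shape for that check, as a sketch. fetch_records is a made-up stand-in for a real API call, and summarise is a hypothetical function that needs data to work with.

```python
def fetch_records(available):
    """Made-up stand-in for an API call: returns a list, or None when there is no data."""
    return [{'id': 1}, {'id': 2}] if available else None


def summarise(data):
    """Return the number of records, or None when there is nothing to parse."""
    if data is None:  # compare against None with 'is', not '=='
        return None
    return len(data)


print(summarise(fetch_records(True)))   # 2
print(summarise(fetch_records(False)))  # None
```

Without the check, len(None) would raise a TypeError and stop the program.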

Boolean

The True or False data type. Useful when doing comparison checks.
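A short sketch of where Booleans come from and how they combine. The values are invented for illustration.

```python
age = 30               # hypothetical value
is_adult = age >= 18   # a comparison produces a bool
has_ticket = False

print(is_adult, type(is_adult))
print(is_adult and has_ticket)  # False: both sides must be True
print(is_adult or has_ticket)   # True: one side is enough
print(not has_ticket)           # True

# Empty collections, 0, '' and None all count as False in an if statement.
for value in [[], [1], 0, 'text', None]:
    print(repr(value), bool(value))
```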

Numeric

There are three numeric types: int, float, and complex. Use the dir function to learn more about them. They are straightforward data types.
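The quirks worth knowing are how division behaves and the fact that floats are approximations. A minimal sketch:

```python
import math

whole = 7       # int
ratio = 7 / 2   # / always returns a float
floor = 7 // 2  # // rounds down and keeps ints as ints
z = 2 + 3j      # complex

print(type(whole), type(ratio), type(floor), type(z))
print(ratio, floor)  # 3.5 3

# Floats are stored in binary, so some decimal sums are slightly off.
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True
```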

Collections

There are four data types in Python that will let you store collections of data.

Lists
Tuples
Sets
Dictionaries

The other data types don't let you store collections of data.

Why would you choose one over the other to store collections?

That's what we are here to learn. The way I remember the other data collection types is by comparing them to Lists and noticing the differences. This is a mental shortcut. I basically only remember what Lists are and then when I think about the other types I think "what makes them different from Lists?"

Lists are ordered, mutable, and allow duplicate values.

Tuples are like Lists except they are not mutable.

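Dictionaries are like Lists except they store key-value pairs, and you look items up by key rather than by position. Keys must be unique; values can repeat.

Sets are the furthest from Lists. They are unordered, unindexed, and don't allow duplicate values. Sets themselves are mutable (you can add and remove items), but the items they hold must be immutable, such as strings, numbers, or Tuples.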
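To make the comparison concrete, here is a sketch that builds each collection type from the same made-up items and shows where they differ.

```python
items = ['a', 'b', 'a']  # a List with one duplicate

as_list = list(items)
as_tuple = tuple(items)
as_set = set(items)
as_dict = dict.fromkeys(items)  # items become keys, every value is None

print(as_list)   # ['a', 'b', 'a']
print(as_tuple)  # ('a', 'b', 'a')
print(as_set)    # the duplicate is gone; print order is not guaranteed
print(as_dict)   # {'a': None, 'b': None}

as_list[0] = 'z'  # Lists can be changed in place
try:
    as_tuple[0] = 'z'
except TypeError as error:
    print('Tuples cannot be changed:', error)
```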

Lists

How do you remove, add, and generally manipulate lists?

Call dir() with the list as its argument and it'll return the methods you can apply to it. That's the essential thing you need to learn how to do.

The methods are all straightforward.

Use while and for loops to iterate through data structures of your choosing.

How do you access items in a list?

L1 = ['test', 'yada', 'ha']
L1[0]
# returns 'test'

The items within a List can be of any data type.
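A sketch of the most common List operations, followed by both kinds of loop. The variable names and values are invented for illustration.

```python
words = ['test', 'yada', 'ha']

words.append('new')       # add to the end
words.insert(0, 'first')  # add at a given index
words.remove('yada')      # remove the first matching value
last = words.pop()        # remove and return the last item
print(words, last)        # ['first', 'test', 'ha'] new

# A for loop visits each item in order.
for word in words:
    print(word)

# A while loop does the same with an explicit index.
i = 0
while i < len(words):
    print(i, words[i])
    i += 1

# Items do not need to share a type.
mixed = [1, 'two', 3.0, None, [4, 5]]
print(len(mixed))  # 5
```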

Practice tasks: Lists

Open the file data/romeo.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. When the program completes, sort and print the resulting words in alphabetical order.


list_of_words = []

with open('data/romeo.txt') as f:
    for line in f:  # read the file line by line
        for word in line.split():
            word = word.lower()  # treat 'But' and 'but' as the same word
            if word not in list_of_words:
                list_of_words.append(word)

list_of_words.sort()
print(list_of_words)

Open the file data/mbox-short.txt and read it line by line. When you find a line that starts with "From " like the following line:

From [email protected] Sat Jan 5 09:14:16 2008

You will parse the From line using split() and print out the second word in the line (i.e. the entire address of the person who sent the message). Then print out a count at the end. Hint: make sure not to include the lines that start with "From:".

addresses = []

with open('data/mbox-short.txt') as f:
    for line in f:
        # 'From ' (with the trailing space) skips the 'From:' header lines
        if not line.startswith('From '):
            continue
        address = line.split()[1]  # the second word is the sender's address
        if address not in addresses:  # print each address only once
            addresses.append(address)
            print(address)

print('Number of addresses:')
print(len(addresses))

Dictionaries

Instead of square brackets Dictionaries use curly brackets {}.

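Dictionaries store key-value pairs. Unlike Lists, you look up a value by its key rather than by its position. Keys must be unique, but the same value can appear under several keys.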
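A minimal sketch of the basic Dictionary operations. The names and ages are made-up data.

```python
ages = {'Mandy': 31, 'Jake': 28}  # made-up data

ages['Corey'] = 31  # new key; a repeated value is fine
ages['Jake'] = 29   # existing key: the old value is overwritten

print(ages['Mandy'])        # 31
print(ages.get('June'))     # None, instead of a KeyError
print(ages.get('June', 0))  # 0, a default of your choosing
print('Jake' in ages)       # True, 'in' checks the keys

# items() gives you each key together with its value.
for name, age in ages.items():
    print(name, age)
```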

Practice task: Dictionary

Write a program to read through the data/mbox-short.txt and figure out who has sent the greatest number of mail messages. The program looks for 'From ' lines and takes the second word of those lines as the person who sent the mail. The program creates a Python dictionary that maps the sender's mail address to a count of the number of times they appear in the file. After the dictionary is produced, the program reads through the dictionary using a maximum loop to find the most prolific committer.

email = dict()

# What we want:
#  {
#   'someone@example.com': 30
#  }

with open('data/mbox-short.txt') as file:
    for line in file:
        # 'From ' (with the trailing space) skips the 'From:' header lines
        if line.startswith('From '):
            address = line.split()[1]

            if address not in email:
                email[address] = 1
            else:
                email[address] += 1

# Maximum loop: remember the largest count seen so far
correspondent = {'address': '', 'count': 0}

for key in email:
    if email[key] > correspondent['count']:
        correspondent['address'] = key
        correspondent['count'] = email[key]

print('Highest:')
print(correspondent)

Tuples

Instead of square brackets Tuples use parentheses ().

Tuples are immutable. You can't change the items contained within them. There is a workaround: adding two or more tuples together creates a new tuple, so you can build a modified copy with values added or left out. The original tuple never changes.
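A short sketch of what immutability means in practice, using invented values:

```python
point = (3, 4)
x, y = point  # unpacking: one variable per item

try:
    point[0] = 10  # assignment is not allowed
except TypeError as error:
    print('Cannot assign:', error)

# Concatenation builds a new Tuple. Note the trailing comma,
# which is what makes (5,) a Tuple rather than a number in brackets.
longer = point + (5,)
print(point, longer)  # (3, 4) (3, 4, 5)

# Because they cannot change, Tuples can be used as Dictionary keys.
distances = {(0, 0): 0.0, (3, 4): 5.0}
print(distances[point])  # 5.0
```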

Practice task: Tuples

Write a program to read through the data/mbox-short.txt and figure out the distribution by hour of the day for each of the messages. You can pull the hour out from the 'From ' line by finding the time and then splitting the string a second time using a colon.

From [email protected] Sat Jan 5 09:14:16 2008

Once you have accumulated the counts for each hour, print out the counts, sorted by hour in the following format: Hour: Count.

timetable = dict()

with open('data/mbox-short.txt') as file:
    for line in file:
        # 'From ' (with the trailing space) skips the 'From:' header lines
        if line.startswith('From '):
            time = line.split()[5]      # e.g. '09:14:16'
            hours = time.split(':')[0]  # e.g. '09'

            if hours not in timetable:
                timetable[hours] = 1
            else:
                timetable[hours] += 1

# items() yields (hour, count) Tuples; sorted() orders them by hour
sortedList = sorted(timetable.items())

print('Hour : Count')
for hour, count in sortedList:
    print(hour, count)

Sets

Use the set() function to create an empty set. Sets use curly brackets like Dictionaries, but empty curly brackets {} create a Dictionary, not a set. Sets also behave differently and must be manipulated differently.

emails = set()
with open('data/mbox-short.txt') as fhand:
    for line in fhand:
        # 'From ' (with the trailing space) skips the 'From:' header lines
        if line.startswith('From '):
            words = line.split()
            emails.add(words[1])  # adding a value that is already there has no effect
print(emails)

Why choose a Set? You can pass a List to set() to create a set, which makes it easy to compare the values in two lists.

Sets remove duplicate values for you.

Pass in a List and all duplicate values will disappear from the set.
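As a sketch of both ideas, here two made-up sign-up lists are turned into sets and compared. The names are invented, and sorted() is only there to make the printed order predictable.

```python
signups_monday = ['ana', 'ben', 'ana', 'cy']  # made-up data with a duplicate
signups_tuesday = ['ben', 'dee', 'dee']

monday = set(signups_monday)
tuesday = set(signups_tuesday)

print(sorted(monday))            # ['ana', 'ben', 'cy']
print(sorted(monday & tuesday))  # ['ben'], present in both lists
print(sorted(monday | tuesday))  # ['ana', 'ben', 'cy', 'dee'], present in either
print(sorted(monday - tuesday))  # ['ana', 'cy'], only in the first list
```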

You can add values with the update method. It will accept Lists and other sets.

L1 = [1,2,3,4,5,5,6]
S1 = set([7,8,9])
S2 = set([10, 10, 11])
S1.update(L1, S2)
print(S1)

Sometimes you want to know what values are present in set 1 but not in set 2.

s1 = {1, 2, 3}
s2 = {2, 3, 4}
s1.difference(s2)
# returns {1}
s1.symmetric_difference(s2)
# returns {1, 4}

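If you need to look up values in a List many times, it's more performant to convert the list into a set first. The conversion itself has to visit every item once, so it only pays off when you search more than once.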

L1 = [ 'Mandy', 'Jake', 'Corey', 'Jessica']

developers = set(L1)
if 'Corey' in developers:
  print('Found!')

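# A membership check on a set takes about the same time however large the set is: O(1) on average
# A membership check on a list gets slower as the list grows: O(n), where n is the number of items

Sets are faster because they are built on hash tables: Python computes a hash of the value and jumps straight to where it would be stored. A List has to be checked item by item until a match turns up. All I need to remember is that sets will let me find values faster.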
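A rough way to see the difference is to time both lookups with the standard timeit module. This is only a sketch: the collection size and repeat count are arbitrary choices, and the exact timings will vary from machine to machine.

```python
import timeit

numbers_list = list(range(100_000))
numbers_set = set(numbers_list)
target = 99_999  # the last item, which is the slowest case for the list

# Run each membership check 100 times and measure the total time.
list_time = timeit.timeit(lambda: target in numbers_list, number=100)
set_time = timeit.timeit(lambda: target in numbers_set, number=100)

print(f'list: {list_time:.5f} s')
print(f'set:  {set_time:.5f} s')
```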

You can find values that are shared between sets. This is useful for instance when you are looking through a list of employees and you want to find out if they belong to other categories as well.

gym_members = {'Corey', 'Jake', 'Jessica'}
developers = {'Jake', 'Jessica'}
employees = {'Corey', 'Jake', 'Jessica', 'June' }

employees.intersection(developers, gym_members)
# returns { 'Jake', 'Jessica' }

Tips

Sequence functions. enumerate() pairs each item in a collection with its index so you can loop over both at once.

pairs = []
with open('data/mbox-short.txt') as fhand:
    # enumerate yields (index, line) Tuples,
    # which the for loop unpacks into i and line
    for i, line in enumerate(fhand):
        if line.startswith('From '):
            email = line.split()
            pairs.append((i, email[1]))

# now we use an anonymous function to sort the list of Tuples by address
# "pair" is the lambda's parameter: sort() calls the lambda once for each Tuple in pairs
sortKey = lambda pair: pair[1]
pairs.sort(key=sortKey)
print(pairs)

Here is a more involved example created with ChatGPT:

celestial_bodies = [
    {
        "name": "Sun",
        "type": "Star",
        "mass": 1.989e30,  # in kilograms
        "radius": 6.957e8,  # in meters
        "distance_from_earth": 1.496e11  # in meters
    },
    {
        "name": "Earth",
        "type": "Planet",
        "mass": 5.972e24,  # in kilograms
        "radius": 6.371e6,  # in meters
        "distance_from_sun": 1.496e11  # in meters
    },
    {
        "name": "Moon",
        "type": "Satellite",
        "mass": 7.342e22,  # in kilograms
        "radius": 1.737e6,  # in meters
        "distance_from_earth": 3.844e8  # in meters
    },
    {
        "name": "Mars",
        "type": "Planet",
        "mass": 6.417e23,  # in kilograms
        "radius": 3.3895e6,  # in meters
        "distance_from_sun": 2.279e11  # in meters
    },
    {
        "name": "Jupiter",
        "type": "Planet",
        "mass": 1.898e27,  # in kilograms
        "radius": 6.9911e7,  # in meters
        "distance_from_sun": 7.786e11  # in meters
    }
]

# Sorting the celestial_bodies list by mass using a lambda function
sorted_by_mass = sorted(
  celestial_bodies, 
  key=lambda body: body['mass']
  )

print("Sorted by mass:")
for body in sorted_by_mass:
    print(f"{body['name']}: Mass = {body['mass']} kg")

Useful trick for when you need to keep track of where you are in the list.

words = ['test', 'hi', 'hello']
track = range(len(words))  # 0, 1, 2: one index per item
for i in track:
  print(i, words[i])
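The same bookkeeping can be done with enumerate(), which hands you the index and the item together. A minimal sketch:

```python
words = ['test', 'hi', 'hello']

# enumerate yields (index, item) pairs, starting at 0.
for i, word in enumerate(words):
    print(i, word)

# The optional start argument changes the first number,
# which is handy for human-readable positions.
for position, word in enumerate(words, start=1):
    print(position, word)
```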