TDM 20100: Project 4 — 2023
Motivation: Becoming comfortable piping commands in a chain, and getting used to navigating files in a terminal, are important skills for every data scientist to learn. These skills will give you the ability to quickly understand and manipulate files in a way which is not possible using tools like Microsoft Office, Google Sheets, etc. You may find that these UNIX tools are really useful for analyzing data.
Context: We’ve been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called 'piping'.
Scope: grep, regular expression basics, UNIX utilities, redirection, piping
Dataset(s)
The questions below use the following dataset(s):
- /anvil/projects/tdm/data/stackoverflow/unprocessed/*
- /anvil/projects/tdm/data/stackoverflow/processed/*
- /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt
Questions
For this project, please submit a |
Question 1 (2 pts)
The following statement will check how many columns are found in this csv file:
BUT this file is a little bit strange, because it only has 1 large line. (In fact, there is no line ending at the end of the line, so
In the question below, we want to turn the commas in this file into newline characters, and then count the number of words in the file.
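As a minimal sketch of that idea, the commands below build a tiny throwaway file and run it through head, tr, and wc. The sample rows and the variable names (sample, word_count) are made up for illustration; they are not taken from the Stack Overflow data.

```bash
# Create a small temporary CSV file (hypothetical data, for illustration only).
sample=$(mktemp)
printf 'id,name,language\n1,Ada,R\n2,Grace,bash\n' > "$sample"

# Keep the first 2 lines, turn each comma into a newline,
# and count the words that remain.
word_count=$(head -n 2 "$sample" | tr ',' '\n' | wc -w)
echo "words in the first 2 lines: $word_count"

# Clean up the temporary file.
rm -f "$sample"
```

Each command reads the output of the previous one through the pipe, so no intermediate file is needed.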
- Please use the commands head, tr, and wc to find out how many words occur in the first 10 lines of the file /anvil/projects/tdm/data/stackoverflow/unprocessed/2011.csv
Question 2 (2 pts)
As you can see, csv files are not always so straightforward to parse. For this particular set of questions, we want to focus on using some other UNIX tools that are more useful on semi-clean datasets, e.g. The following statement outputs the number of columns in each of the first 10 lines of the file:
We are just starting to introduce |
- Let’s turn our attention to a different file. Use awk to find out how many columns appear in the fifth row of the file /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt
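One way to count columns with awk is to print its built-in NF variable (the number of fields on the current line), optionally restricted to one row with NR (the current line number). The sketch below uses a small made-up file; the semicolon delimiter and the column names are assumptions chosen for the example, so check the real file's delimiter before adapting it.

```bash
# Create a small temporary file (hypothetical data; the ';' delimiter is an assumption).
sample=$(mktemp)
printf 'invoice;store;city;bottles\nA1;101;Ames;12\nA2;102;Des Moines;6\n' > "$sample"

# Print the number of fields found on every line.
awk -F';' '{print NF}' "$sample"

# Print the number of fields on the second row only.
row2_fields=$(awk -F';' 'NR == 2 {print NF}' "$sample")
echo "fields in row 2: $row2_fields"

# Clean up the temporary file.
rm -f "$sample"
```

The -F option sets the field separator, so the same pattern works for other delimiters by changing that one argument.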
Question 3 (2 pts)
With appropriate commands, the following statement finds the 5 largest orders, in terms of the number of
- Use UNIX commands to find the 6 highest 'state bottle retail' prices in the file /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt. What are the corresponding item descriptions for these 6 items? (Some are repeated, and that is OK.)
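A common pattern for "largest N" questions is to sort numerically in reverse on one column and keep the first few lines. The sketch below shows that pattern on a made-up two-column file; the column layout, the ';' delimiter, and the variable names are assumptions for illustration and do not describe the Iowa liquor file.

```bash
# Create a small temporary file with a header row (hypothetical data).
sample=$(mktemp)
printf 'item;price\nRum;12.50\nVodka;9.99\nWhiskey;105.00\nGin;20.25\n' > "$sample"

# Skip the header, sort by the 2nd field as a number in descending order,
# and keep the top 2 rows. LC_ALL=C makes '.' the decimal point.
top2=$(tail -n +2 "$sample" | LC_ALL=C sort -t';' -k2,2nr | head -n 2)
echo "$top2"

# Clean up the temporary file.
rm -f "$sample"
```

Because the whole row is kept while sorting, the description stays next to its price in the output.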
Question 4 (2 pts)
Here is another example. We can pipeline
- Please find out how many times each bottle volume appears in the file
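Counting how often each value occurs is often done by extracting one column, sorting it so equal values are adjacent, and then letting uniq -c tally them. The sketch below applies that to a made-up file; the ';' delimiter, the position of the volume column, and the variable names are assumptions for illustration only.

```bash
# Create a small temporary file with a header row (hypothetical data).
sample=$(mktemp)
printf 'item;volume\nRum;750\nVodka;1000\nGin;750\nWhiskey;750\n' > "$sample"

# Skip the header, extract the 2nd field, sort so duplicates are adjacent,
# count each distinct value, order by count (largest first),
# and print each line as "value count".
counts=$(tail -n +2 "$sample" | cut -d';' -f2 | sort -n | uniq -c | sort -k1,1nr | awk '{print $2, $1}')
echo "$counts"

# Clean up the temporary file.
rm -f "$sample"
```

The first sort matters because uniq only merges lines that are next to each other.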
Project 04 Assignment Checklist
- Jupyter Lab notebook with your code and comments for the assignment
  - firstname-lastname-project04.ipynb
- A .sh text file with all of your bash code and comments written inside of it
  - bash code and comments used to solve questions 1 through 4
- Submit files through Gradescope
Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.