DATA SCIENCE ENVIRONMENT
Data science is a fast-growing field that requires a variety of tools and technologies. Setting up a data science development environment can be confusing, but it's important to have a reliable and productive workspace on your personal laptop. This blog post walks you through the steps of setting one up.
Hardware requirements
Before you start setting up your development environment, make sure that your laptop meets the basic hardware requirements for data science. Here are some general minimum guidelines:
- Processor: At least a quad-core Intel Core i5
- Memory: At least 4 GB of RAM
- Storage: At least 40 GB of free space on your hard drive
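If you already have a Python 3 interpreter available, a quick script can compare your machine against two of the guidelines above. This is only a rough sketch using the standard library: the function names and thresholds are illustrative, `os.cpu_count()` reports logical cores (which can be higher than the physical core count), and RAM is left out because the standard library has no portable way to read it.

```python
import os
import shutil

# Thresholds copied from the guidelines above; the constant names are illustrative.
MIN_CORES = 4
MIN_FREE_GB = 40


def meets_guidelines(cores, free_gb, min_cores=MIN_CORES, min_free_gb=MIN_FREE_GB):
    """Return a dict that maps each check to True (passes) or False (fails)."""
    return {
        "processor": cores >= min_cores,
        "storage": free_gb >= min_free_gb,
    }


def current_machine():
    """Read the logical core count and the free space on the drive holding your home folder."""
    cores = os.cpu_count() or 0  # cpu_count() can return None on some systems
    free_bytes = shutil.disk_usage(os.path.expanduser("~")).free
    return cores, free_bytes / 1024**3  # convert bytes to GB


if __name__ == "__main__":
    cores, free_gb = current_machine()
    print(f"Logical cores: {cores}, free disk space: {free_gb:.1f} GB")
    print(meets_guidelines(cores, free_gb))
```

To check memory, look at your operating system's system information screen instead.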
Software requirements
The following software is essential for a data science development environment:
- Python programming language
- Jupyter Notebook
- Anaconda
- Git and Git Bash
Here is what each of these tools does, along with GitHub, which you will use together with Git:
- Python - Many programming languages can be used for data science, but most data scientists write their code in Python.
- Jupyter Notebook - Most of those data scientists use Jupyter Notebook for writing their Python code. Jupyter Notebook is a tool that allows you to mix comments in between your code snippets, so you can document and share your thought process and make it easier for others to review, replicate, and expand on your work.
- Anaconda - Anaconda is one of the most popular ways for data scientists to install Python and Jupyter Notebook on their computers. It also provides package management and virtual environments, so you can run the latest data science tools, like NumPy, scikit-learn, and TensorFlow, and use different versions of Python and your packages for different projects without them conflicting with each other.
- Git - Git is a version control system. It’s a way of keeping track of all the changes made across your project. Think of it like “track changes” in Word - but with the ability to track changes across multiple documents.
- GitHub - GitHub is a website where data scientists (and programmers) can save their work in case their computer breaks, and share it with their team or the world.
Setting up a Professional Data Science Environment – Windows Installation
There are two major pieces we need to install in order to set you up for success as a professional data scientist! In this section, we will be installing Git and Anaconda for Windows. Once Anaconda is installed, you'll have access to Python and many other popular data science packages.
Installing Git
Download the install package from:
git-scm.com/download/win
Git installation step by step
Step 1: Git’s download page for Windows OS - choose 32-bit or 64-bit option
Step 2: Open the downloaded file - on the license prompt, click “Next” to accept
Step 3: Select the installation destination folder (default is recommended)
Step 4: Select components - keep the “Windows Explorer integration” options
Step 5: Choose the default editor - choose Nano or Visual Studio Code if you have not used any editor before
Step 6: Adjust the PATH environment - second option is recommended
Step 7: Choose the HTTPS transport backend - choose OpenSSL library
Step 8: Configure line ending conversions - select the default option
Step 9: Configure the terminal emulator - choose MinTTY
Step 10: Choose “Default” as the default behavior of git pull
Step 11: Configure extra options to enable file system caching
Step 12: Choose Git Credential Manager as the credential helper
Step 13: Install
Step 14: Installation Complete - click "Finish" to exit Setup (you do not need to view the release notes)
To confirm you have installed Git successfully:
1. Open a terminal window - When we ask you to use the terminal, we mean the Git Bash application we just installed through Git
2. Type `git --version`: It should return the version of git you are running
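Once Python is available, the same check can be scripted. The sketch below is one possible approach, not an official tool: the function names are made up for illustration, and the version strings in the comments are hypothetical examples of what `git --version` tends to print.

```python
import re
import shutil
import subprocess


def parse_git_version(output):
    """Extract (major, minor, patch) from text such as 'git version 2.43.0.windows.1'."""
    match = re.search(r"git version (\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_git_version():
    """Return the installed Git version as a tuple, or None if git is not on PATH."""
    git = shutil.which("git")
    if git is None:
        return None
    result = subprocess.run(
        [git, "--version"], capture_output=True, text=True, check=False
    )
    return parse_git_version(result.stdout)


if __name__ == "__main__":
    version = installed_git_version()
    if version is None:
        print("Git was not found on PATH")
    else:
        print("Git version:", ".".join(str(n) for n in version))
```

If the script reports that Git was not found, reopen your terminal first, since PATH changes made by the installer only apply to newly opened windows.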
Installing Anaconda
The easiest way to get set up with Python and Jupyter Notebook so you can start coding is to install the Anaconda distribution.
Step 1: Anaconda’s download page for Windows OS - choose 32-bit or 64-bit option. Download the latest version of Anaconda:
www.anaconda.com/download/
Same as with the Git installation - If you do not know whether you need the 32 or 64-bit install, check your system type.
Step 2: Open the downloaded file - on the license prompt, click “I Agree” to accept
Step 3: Select “Just Me” for Installation Type
Step 4: Select the installation destination folder (default is recommended)
Step 5: Make sure to choose both Advanced Installation Options!
Step 6: Installing Anaconda
Step 7: You can skip any add-ons, like the PyCharm installation
Step 8: Installation Complete - click “Finish”
To confirm you have installed Anaconda successfully:
1. Open a terminal window
2. Type `conda info`: It should return a table of details about your conda installation.
Hurray! If you've gotten this far and everything has worked, you have successfully installed Git and Anaconda on your Windows PC!
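If you would rather inspect the conda installation from Python, `conda info` also accepts a `--json` flag. The sketch below reads a few fields from that output. The helper names are illustrative, and the key names (`conda_version`, `python_version`, `envs`) are assumptions about the JSON layout, so the code uses `.get()` and tolerates missing keys.

```python
import json
import shutil
import subprocess


def summarize_conda_info(raw_json):
    """Pick a few fields out of the JSON text printed by `conda info --json`."""
    info = json.loads(raw_json)
    return {
        "conda_version": info.get("conda_version"),    # assumed key name
        "python_version": info.get("python_version"),  # assumed key name
        "environments": info.get("envs", []),          # assumed key name
    }


def conda_summary():
    """Return a summary dict, or None if conda is missing or the call fails."""
    conda = shutil.which("conda")
    if conda is None:
        return None
    result = subprocess.run(
        [conda, "info", "--json"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return None
    try:
        return summarize_conda_info(result.stdout)
    except json.JSONDecodeError:
        return None


if __name__ == "__main__":
    print(conda_summary() or "conda was not found on PATH")
```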
Setting up a Professional Data Science Environment - Configuring Git and Anaconda
Connecting Your Terminal to GitHub
Now that you have Git installed locally, you'll often be working back and forth between your local computer and GitHub, a service which hosts Git repositories online. You will first need to sign up for a GitHub account. Next, to better integrate with GitHub, you should set up your name and email address on your local machine.
1. In your terminal window, type `git config --global user.name`
   - If it returns your name, you’re set!
   - If it returns nothing or displays an error message, type `git config --global user.name "Your Name"`, replacing Your Name with your name inside the quotes (this should be your real first and last name, not your GitHub username). Use straight quotes, not curly ones.
2. In your terminal window, type `git config --global user.email`
   - If it returns your email address, you’re set!
   - If it returns nothing or displays an error message, type `git config --global user.email your@email.com`, replacing your@email.com with your email address.
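As a rough sketch, the two checks above can also be run together from Python. The function names here are hypothetical. The script only reads your settings through `git config --global --get` and never changes them.

```python
import shutil
import subprocess


def missing_identity(settings):
    """Return the keys in settings whose value is empty or None."""
    return [key for key, value in settings.items() if not value]


def read_git_identity():
    """Read user.name and user.email from the global Git configuration."""
    settings = {"user.name": None, "user.email": None}
    git = shutil.which("git")
    if git is None:
        return settings  # Git is not installed or not on PATH
    for key in list(settings):
        result = subprocess.run(
            [git, "config", "--global", "--get", key],
            capture_output=True, text=True, check=False,
        )
        # git exits with a non-zero status when the key is not set
        if result.returncode == 0:
            settings[key] = result.stdout.strip()
    return settings


if __name__ == "__main__":
    identity = read_git_identity()
    missing = missing_identity(identity)
    print("All set!" if not missing else f"Still need to configure: {missing}")
```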
Remember, when we say "terminal" we mean the Terminal app for Mac, and the Git Bash program for Windows. If you have not used the command line much or at all, follow the steps below:
1. Open a new terminal window
2. Type `pwd` - this should show your home directory, the most basic of paths on your computer
3. Type `cd Documents` - this will change your directory, and move you into your Documents folder
4. Type `mkdir cti` - this will create a new folder, called cti, to keep all of your repositories and files
5. Type `cd cti` - this will change your directory, moving you into the new cti folder you just created
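For comparison, here is roughly how the same folder setup looks in Python with `pathlib`. This is only a sketch: `make_workspace` is a made-up helper, and the demo runs inside a temporary directory so it does not touch your real Documents folder.

```python
import tempfile
from pathlib import Path


def make_workspace(parent, name="cti"):
    """Create parent/name if it does not exist yet and return its path.

    Unlike a plain `mkdir`, this does not fail when the folder already exists.
    """
    workspace = Path(parent) / name
    workspace.mkdir(parents=True, exist_ok=True)
    return workspace


if __name__ == "__main__":
    # Path.home() is the folder `pwd` prints in a freshly opened terminal.
    print("Home directory:", Path.home())
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for your Documents folder, removed again when the block ends.
        print("Created:", make_workspace(tmp))
```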
Setting Up Virtual Environments
As you work on data science projects, you will spend a lot of your time using pre-written libraries to speed up your development - like NumPy, pandas, or scikit-learn. As you work on different projects, you may also find that you use different versions of different libraries for different projects. The most common versioning issue is that some projects will run in Python 2 whereas others will run in Python 3, but you may also find that different projects depend on different versions of libraries like TensorFlow.
Occasionally, code that works in an old version of a library won’t work in a newer version. So if you open up a new project and install the dependencies, it’s possible that your old project won’t work anymore. To avoid that problem, a best practice is to use “virtual environments”. Virtual environments allow you to have different versions of Python and different versions of the various libraries you use, so you can install a new version of a library for one project but still use the old version for another project. It’s almost as if you have multiple computers that you can swap between, each having a different setup and configuration, just by running a couple of commands.
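To see which "computer" you are currently on, you can ask Python itself. The snippet below is a small sketch using only the `sys` module. The function name is illustrative, and treating the last folder of `sys.prefix` as the environment name is an assumption that holds for typical conda layouts (for example, a path ending in `envs/learn-env`).

```python
import sys
from pathlib import Path


def describe_interpreter():
    """Report which Python is running and where its environment lives."""
    return {
        "python_version": ".".join(str(n) for n in sys.version_info[:3]),
        "executable": sys.executable,
        "environment_root": sys.prefix,
        # For a conda environment this is usually the environment's name.
        "environment_name": Path(sys.prefix).name,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

Running this in two different environments should show two different `environment_root` paths, and possibly two different Python versions.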
Creating the Conda Virtual Environment
You need to start by navigating into the project folder, which should sit inside the cti folder you created above. Run `pwd` to print your working directory in your terminal. If the name of the current working directory is not "dsc-data-science-env-config", then you need to move into that folder with `cd`, following the steps above.
For Windows:
Run `conda env create -f win_environment.yml`. Depending on the speed of your computer and your internet connection, it may take up to twenty minutes for this to complete. While it runs, you should see progress output start to appear in your terminal.
Activating the Conda Virtual Environment
Next, run `conda init bash` to set up your shell for conda. This adds shell code to the startup scripts of your shell (e.g. ~/.bashrc), so close and reopen your terminal for the change to take effect. Now you are ready to try activating the environment. Type `conda activate learn-env`.
To confirm that it worked, type `conda info --envs` and confirm that the asterisk (*) is next to the learn-env environment.
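The asterisk check can also be automated. The sketch below parses text in the shape that `conda info --envs` prints and returns the name on the starred line. The sample output and its paths are hypothetical, and the function name is illustrative.

```python
def active_env_name(envs_output):
    """Return the name of the environment marked with '*', or None if none is marked."""
    for line in envs_output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comments
        parts = line.split()
        if "*" in parts:
            # The name is the first column; an unnamed environment shows only '*' and a path.
            return parts[0] if parts[0] != "*" else None
    return None


# Hypothetical sample of what `conda info --envs` might print.
SAMPLE_OUTPUT = """
# conda environments:
#
base                     /c/Users/you/anaconda3
learn-env             *  /c/Users/you/anaconda3/envs/learn-env
"""

if __name__ == "__main__":
    print("Active environment:", active_env_name(SAMPLE_OUTPUT))
```

To run it against your own machine, paste the real output of `conda info --envs` in place of the sample text.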
Note: For Windows 11, you may need to follow the steps here in order to run `conda activate learn-env`.
Troubleshooting
If you see a message that states “WARNING: A newer version of Conda exists”, run `conda update -n base conda` and then try again to create the environment. If you see a message that states "file not found", double check that you are running the command from the directory that contains the .yml file. If you type `ls`, you should see the win_environment.yml file. If you don't see it, you likely forgot to run `cd dsc-data-science-env-config` to change into the right directory.
Setting your Default Environment
You have successfully created your virtual environment! But, to be sure that you are using the learn-env, it's helpful to set it as your default environment so that you don't need to remember to manually switch to it every time you open the terminal. This step is **highly recommended** but not required.
Windows
To follow these instructions on a Windows machine, you must be using the Git Bash shell that you installed above.
1. Run `touch ~/.bash_profile` to create a new file.
2. Run `echo "conda activate learn-env" >> ~/.bash_profile` to add the configuration to your bash profile
3. Run `source ~/.bash_profile` to activate the changes you just made
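One thing to keep in mind is that `>>` appends every time it runs, so repeating step 2 would add the same line twice. The sketch below shows one way to make the step safe to repeat, written in Python for illustration. `ensure_line` is a hypothetical helper, and the demo writes to a temporary folder rather than your real `~/.bash_profile`.

```python
import tempfile
from pathlib import Path


def ensure_line(path, line):
    """Append line to the file at path unless an identical line is already there.

    Returns True if the file was changed, False if the line was already present.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    if line in text.splitlines():
        return False
    # Start on a fresh line if the file does not already end with a newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"{prefix}{line}\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        profile = Path(tmp) / ".bash_profile"  # stand-in for ~/.bash_profile
        print(ensure_line(profile, "conda activate learn-env"))  # first call adds the line
        print(ensure_line(profile, "conda activate learn-env"))  # second call changes nothing
```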
Updating your Virtual Environment
Python packages are constantly updating and changing. Switching between environments, updating or installing new packages, and troubleshooting environment issues are all necessary skills for a fully-fledged professional data scientist. In general, because we are using Anaconda as our package manager, it is preferable to update or install new packages using `conda` options instead of `pip`.
NOTE: If you are ever concerned about conflicting package versions, just remember that creating a new conda environment is as easy as `conda create --name new-env` - and it is very normal to have different environments with different packages for different purposes. Just remember that you've likely just set up learn-env to activate by default, so you'll need to either change that or activate other environments manually when needed.
Configuring your Kernel and Confirming your Configuration
Jupyter Notebooks run "kernels" - the computational engine used for executing your code. It's important to be running the right kernel within your notebook, otherwise you may get errors stating that you don't have a particular package or have the wrong version of it, or even complaints about the version of Python you're running (some packages don't support Python 3.8, for example).
Right now, let's check that everything is running properly. In your terminal, run `jupyter notebook`. This should prompt a new browser window to open, at an address that is something like "localhost:8888".
For now, we want to check not only that the terminal shortcut you just used to open a Jupyter Notebook worked, but also that you are running your learn-env kernel in your notebook. You should be able to see learn-env as an option in two places:
- When you create a new Jupyter Notebook, by clicking "New" on the right-hand side
- Once you're in a notebook or create a new one, by checking the options under "Kernel" in the top menu bar
If you don't see the learn-env option in those two places:
- Close the notebook in the browser
- Close down the notebook server from the terminal (press `ctrl` + `c` and then type `y` to confirm that you want to close down jupyter)
- In the terminal, enter `conda activate learn-env` and then `python -m ipykernel install --user --name learn-env --display-name "Python (learn-env)"`
That will add the learn-env to your list of kernels. When you restart the Jupyter Notebook server and try again, you'll be able to select the learn-env option in both places.
It will be essential to run `conda activate learn-env` every time you start a new terminal window if you do not set your terminal to activate that environment by default. If you don't do this you **will** get errors, so please check this first. You can always run `conda info --envs` to see which environment is selected - and, if you run the above steps to set the learn-env to open by default, you won't need to remember to activate every time you open your terminal.
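You can also confirm that the kernel is registered without opening the browser. The sketch below calls `jupyter kernelspec list --json` and looks for learn-env among the names. The helper names are illustrative, and the sample JSON in the comment is a hypothetical, trimmed-down version of what the command prints.

```python
import json
import shutil
import subprocess


def kernel_names(raw_json):
    """Return the sorted kernel names from the JSON printed by `jupyter kernelspec list --json`.

    Expected shape (hypothetical, trimmed): {"kernelspecs": {"python3": {...}, "learn-env": {...}}}
    """
    data = json.loads(raw_json)
    return sorted(data.get("kernelspecs", {}))


def installed_kernels():
    """Return the list of registered kernel names, or None if jupyter is unavailable."""
    jupyter = shutil.which("jupyter")
    if jupyter is None:
        return None
    result = subprocess.run(
        [jupyter, "kernelspec", "list", "--json"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return None
    try:
        return kernel_names(result.stdout)
    except json.JSONDecodeError:
        return None


if __name__ == "__main__":
    kernels = installed_kernels()
    if kernels is None:
        print("jupyter was not found on PATH")
    else:
        print("learn-env registered:", "learn-env" in kernels)
```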
Congratulations! That was a lot! If you've gotten this far and everything has worked, you have successfully set up your computer with some of the primary tools you need to work as a professional data scientist!