Searching in big DataFrames? Pandas to the rescue!

How to search for a specific row with pandas

Hello everybody! 👋

I work with bioinformatics data (RNA-sequencing 🧬, etc.), which is made up of thousands and thousands of text lines (resulting in files from 5 GB to around 30 GB). Yesterday I was working with differential gene expression (DGE) data (statistical testing of whether a gene's expression differs significantly between conditions), which comes as tabular files with IDs and a lot of numeric columns. Usually, you perform the DGE with some well-known packages in R (I don't know if it can be done in Python, but there's got to be a way), and then you can use user-friendly software like Excel or LibreOffice to analyze the results 📊. My problem was the size of the .tab files. All of them were over 17 MB, which is really hard to manage in Excel. I needed to search for specific genes (IDs), so I decided to try that with pandas 🐼.

Best choice ever

I'm not an expert with pandas (I actually used to do this kind of thing in R), but lately I've found it pretty useful and relatively simple to use. OK, let's code:

# First, we have to import pandas (I also import numpy because I use it frequently).

import pandas as pd
import numpy as np

DGE_df = pd.read_csv("DGE_results.tab")
DGE_df.head()

We can see that the .tab file has been correctly imported 💪:

1.png
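One thing worth checking if your import looks wrong: pd.read_csv splits columns on commas by default, so a tab-delimited file needs sep="\t". Here is a minimal sketch, assuming a tab-delimited layout. The sample values, the "log2FoldChange" column and the "hypothetical-gene-…" IDs are made up for illustration; only "Gene" and "padj" come from my real file.

```python
import io

import pandas as pd

# Made-up, tab-delimited sample standing in for DGE_results.tab.
# Column names other than "Gene" and "padj" are assumptions.
sample = io.StringIO(
    "Gene\tlog2FoldChange\tpadj\n"
    "maker-Fvb3-1-augustus-gene-210.44\t1.8\t0.003\n"
    "hypothetical-gene-A\t-0.2\t0.71\n"
    "hypothetical-gene-B\t2.4\t0.04\n"
)

# sep="\t" tells read_csv to split columns on tabs instead of commas.
tab_df = pd.read_csv(sample, sep="\t")

print(tab_df.shape)          # (number of rows, number of columns)
print(list(tab_df.columns))  # column names parsed from the header line
```

If every row ends up in a single wide column, the separator is the first thing I would check.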

If you want to see how many rows the file has, you can use: len(DGE_df)

Ok, what's next? All I needed was the row of a specific ID (from the column "Gene"). So, it's a fairly easy task:

DGE_df[DGE_df["Gene"] == "maker-Fvb3-1-augustus-gene-210.44"]

Note that you have to write the name of the DataFrame twice: the outer one indexes the DataFrame's rows, and the inner one selects the column being compared (in this case, "Gene"). Without the inner one, pandas has no column to compare against.

2.png
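To see why the DataFrame name shows up twice, it can help to split the one-liner into two steps. This is just a sketch on a tiny made-up table (toy_df and its values are hypothetical, not my real data):

```python
import pandas as pd

# Made-up values; only the column names "Gene" and "padj" come from the article.
toy_df = pd.DataFrame({
    "Gene": [
        "maker-Fvb3-1-augustus-gene-210.44",
        "hypothetical-gene-A",
        "hypothetical-gene-B",
    ],
    "padj": [0.003, 0.71, 0.04],
})

# Step 1: the comparison returns a boolean Series, one True/False per row.
mask = toy_df["Gene"] == "maker-Fvb3-1-augustus-gene-210.44"

# Step 2: indexing the DataFrame with that Series keeps only the True rows.
hit = toy_df[mask]

print(mask.tolist())
print(hit)
```

The one-liner does exactly the same thing, it just skips the intermediate mask variable.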

This did the trick for me. I repeated this for a couple of genes that I needed, and it only took me a few minutes. You can also use this approach to compare numbers. For example, to see the rows with padj <= 0.05:

DGE_df[DGE_df["padj"] <= 0.05]

The result is:

3.png
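If you need several genes at once, repeating the lookup gene by gene is not the only option: Series.isin can fetch a whole list in one go, and two conditions can be combined with &. A small sketch, again on made-up data (toy_df, wanted and the "hypothetical-gene-…" IDs are placeholders for illustration):

```python
import pandas as pd

# Made-up values; only the column names "Gene" and "padj" come from the article.
toy_df = pd.DataFrame({
    "Gene": [
        "maker-Fvb3-1-augustus-gene-210.44",
        "hypothetical-gene-A",
        "hypothetical-gene-B",
    ],
    "padj": [0.003, 0.71, 0.04],
})

# Hypothetical list of IDs to look up.
wanted = ["maker-Fvb3-1-augustus-gene-210.44", "hypothetical-gene-A"]

# isin() is True for every row whose "Gene" value appears in the list.
several = toy_df[toy_df["Gene"].isin(wanted)]

# Combine two conditions with &; each condition needs its own parentheses.
significant_wanted = toy_df[
    (toy_df["Gene"].isin(wanted)) & (toy_df["padj"] <= 0.05)
]

print(several)
print(significant_wanted)
```

The parentheses around each condition matter, because & binds more tightly than <= in Python.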

And that's it. I hope you find this article useful.

Until next time! 🙋‍♂️

(The images from the cover were downloaded for free from Flaticon).