Python - Why does the id of a pandas dataframe cell change with each execution?


I ran into this issue while trying to determine some properties of a dataframe view.

Let’s say I have a dataframe defined as df = pd.DataFrame(columns=list('abc'), data=np.arange(18).reshape(6, 3)) and a view of it defined as df1 = df.iloc[:3, :]. We now have two dataframes as follows:

print(df)
    a   b   c
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
5  15  16  17

print(df1)
   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8

Now I want to print the id of a specific cell in each of these two dataframes:

print(id(df.loc[0, 'a']))
print(id(df1.loc[0, 'a']))

My output is:

140114943491408
140114943491408

Oddly enough, if I run these two print lines again, the ids change:

140114943491480
140114943491480

I must emphasize that I did not re-run the code that defines df when I re-ran the two print lines, so df and df1 were not redefined. In my opinion, the memory address of each element in the dataframe should be fixed, so how can the output change?

Something even stranger happens when I keep running these two print lines. In rare cases, the two ids are not even equal:

140114943181088
140114943181112

But if I evaluate id(df.loc[0, 'a']) == id(df1.loc[0, 'a']), Python still returns True. I know that because df1 is a view of df, their cells should share memory, so why do their ids occasionally differ?

These behaviors have me completely confused. Can anyone explain them? Are they due to the nature of the dataframe or of the id function in Python? Thank you!

FYI, I’m using Python 3.5.2.

Solution

You are not getting the id of a “cell”; you are getting the id of the object returned by the .loc accessor, which is a boxed copy of the underlying data.

So,

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(columns=list('abc'), data=np.arange(18).reshape(6, 3))
>>> df1 = df.iloc[:3, :]
>>> df.dtypes
a    int64
b    int64
c    int64
dtype: object
>>> df1.dtypes
a    int64
b    int64
c    int64
dtype: object

But since everything in Python is an object, your loc method must return an object:

>>> x = df.loc[0, 'a']
>>> x
0
>>> type(x)
<class 'numpy.int64'>
>>> isinstance(x, object)
True

However, the actual underlying buffer is a primitive array of C fixed-size 64-bit signed integers. These are not Python objects; each value is “boxed” on access, a term borrowed from other languages that mix primitive types with objects.

This explains the phenomenon where all the objects appear to have the same id:
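The boxing step can be sketched without pandas at all. The snippet below is only an analogy: it uses the standard library's array module as a stand-in for the NumPy buffer behind the dataframe, on the assumption that the 'q' type code is a 64-bit signed integer on your platform.

```python
from array import array

# Stand-in for the dataframe's underlying buffer (an assumption for
# illustration): raw 64-bit signed integers, not Python objects.
buffer = array("q", [10**12, 2 * 10**12, 3 * 10**12])

# Each read wraps ("boxes") the raw value in a Python int object.
x = buffer[0]
y = buffer[0]

print(buffer.itemsize)  # bytes used by each raw element
print(type(x))          # the boxed type handed back to Python code
print(x == y)           # same value read twice
print(x is y)           # identity of the two boxes is implementation-dependent
```

The values compare equal because they come from the same slot in the buffer, but whether the two boxes are the same object is up to the interpreter, which is why an id taken from such a box says nothing about the buffer itself.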

>>> id(df.loc[0, 'a']), id(df.loc[0, 'a'])
(4539673432, 4539673432)
>>> id(df.loc[0, 'a']), id(df.loc[0, 'a']), id(df1.loc[0,'a'])
(4539673432, 4539673432, 4539673432)

This happens because Python is free to reuse the memory address of a recently reclaimed object. When you build the tuple of ids, the object returned by loc only lives long enough to be passed to the first call of id and processed by it. By the time you call loc a second time, that object has already been freed, and the new object simply reuses the same memory. You can see the same behavior with any Python object, such as a list:

>>> id([]), id([])
(4545276872, 4545276872)

Fundamentally, ids are only guaranteed to be unique for the lifetime of an object. Note, however, that the ids always differ in the following case:

>>> x = df.loc[0, 'a']
>>> x2 = df.loc[0, 'a']
>>> id(x), id(x2)
(4539673432, 4539673408)

Because you maintain references, objects are not recycled and new memory is required.
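The same lifetime rule can be checked with plain Python objects. This is a minimal sketch; the Box class and make_box function are hypothetical names standing in for whatever object loc hands back.

```python
class Box:
    """Hypothetical stand-in for a freshly boxed scalar."""

    def __init__(self, value):
        self.value = value


def make_box():
    # Returns a brand-new object on every call, as loc does for a scalar.
    return Box(0)


# Both objects are kept alive by names, so their ids must differ.
first = make_box()
second = make_box()
ids_differ_while_alive = id(first) != id(second)

# Here each temporary can be freed before the next one is created,
# so the interpreter may hand out the same address twice.
temp_ids = (id(make_box()), id(make_box()))

print(ids_differ_while_alive)
print(temp_ids[0] == temp_ids[1])  # often True on CPython, never guaranteed
```

Only the first comparison is guaranteed by the language. This is also why comparing two ids of temporaries with == can report True even though two distinct objects were created.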

Note that for many immutable objects, the interpreter is free to optimize and return the exact same object. In CPython, this is the case with “small integers”, the so-called small integer cache:

>>> x = 2
>>> y = 2
>>> id(x), id(y)
(4304820368, 4304820368)

But this is an implementation detail that you shouldn’t rely on.

If you want to prove to yourself that your dataframes share the same underlying buffer, mutate one of them and you will see the change reflected in the view:

>>> df
    a   b   c
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
5  15  16  17
>>> df1
   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8
>>> df.loc[0, 'a'] = 99
>>> df
    a   b   c
0  99   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
5  15  16  17
>>> df1
    a  b  c
0  99  1  2
1   3  4  5
2   6  7  8
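The idea of checking sharing by mutation rather than by id can also be tried with the standard library alone. The sketch below is an analogy, not pandas itself: it assumes an array of 64-bit integers as the buffer and uses memoryview slices in place of the dataframe and its view.

```python
from array import array

# Flat buffer holding the same 18 values as the example dataframe.
buf = array("q", range(18))

whole = memoryview(buf)  # plays the role of df
head = whole[:9]         # plays the role of df1; slicing does not copy

# Write through one view and read through the other.
whole[0] = 99

print(head[0])          # the write is visible through the slice
print(buf[0])           # and in the buffer both views are built on
print(head.obj is buf)  # both views point back at the same buffer
```

Mutating and reading back tests the thing you actually care about, namely shared storage, whereas id only ever describes the temporary Python object created when a value is read.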
