Python Files and Directories

Pathlib is a module in Python’s standard library that provides an object-oriented interface for working with file system paths. It offers a more intuitive and readable way to handle file and directory operations compared to traditional methods using the os and os.path modules.

Get total size of a directory
#

Take a folder containing lots of nested subfolders also containing too many files:

from pathlib import Path
from time import time

def get_directory_size(path: Path) -> float:
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

start_time = time()
size = get_directory_size(Path("my big messy fatty folder"))
print(f"size : {round(size / 1024**3, 2)} GB\ntook : {round(time() - start_time, 2)} seconds")

size : 55.09 GB
took : 33.11 seconds

Faster way to get total size of a directory
#

Take the same folder, add some concurrency:

import concurrent.futures
from pathlib import Path
from time import time

def calculate_size(path: Path) -> float:
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

def get_directory_size(path: Path) -> float:
    subpaths = [p for p in path.glob("*/*") if p.is_dir()]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        sizes = executor.map(calculate_size, subpaths)
    return sum(sizes)

start_time = time()
size = get_directory_size(Path("my big messy fatty folder"))
print(f"size : {round(size / 1024**3, 2)} GB\ntook : {round(time() - start_time, 2)} seconds")

size : 55.09 GB
took : 8.27 seconds

Performance Analysis
#

The concurrent approach provides a 4x speed improvement for large directories with multiple subdirectories. Here’s why:

Sequential: Processes files one by one
Concurrent: Utilizes multiple CPU cores for parallel processing
Best for: Large directories with many subdirectories

Best Practices
#

✅ Do’s
#

Use pathlib.Path for modern, readable path operations
Implement error handling for permission issues
Consider memory usage for very large directories
Use ProcessPoolExecutor for CPU-intensive tasks

❌ Don’ts
#

Don’t use string concatenation for paths
Avoid blocking the main thread for large operations
Don’t ignore file access permissions

Additional File Operations
#

Check if Path Exists
#

from pathlib import Path

file_path = Path("example.txt")
if file_path.exists():
    print("File exists!")

Create Directories
#

# Create directory with parents
Path("path/to/new/directory").mkdir(parents=True, exist_ok=True)

List Files with Filter
#

# Get all Python files
python_files = list(Path(".").glob("**/*.py"))
print(f"Found {len(python_files)} Python files")

Read and Write Files
#

# Write to a file
with open("example.txt", "w") as f:
    f.write("Hello, World!")
# Read from a file
with open("example.txt", "r") as f:
    content = f.read()
    print(content)

Get total size of a directory#

Faster way to get total size of a directory#

Performance Analysis#

Best Practices#

✅ Do’s#

❌ Don’ts#

Additional File Operations#

Check if Path Exists#

Create Directories#

List Files with Filter#

Read and Write Files#